2024-06-13 00:12:49

by Mateusz Guzik

Subject: [PATCH 0/2] stop lockref from degrading to locked-only ops

... and speed up parallel lookups of the same terminal inode

Hi Linus. It is rather unclear who to mail concerning lockref. Last time
I had issues with it (the cpu_relax removal) you were quite responsive
and the resulting commit is the last one made there, so I figured I'm
going to rope you in.

lockref has a corner case where it can degrade to always taking the spin
lock (triggerable while benchmarking). In the past I sometimes ran into
it and ignored the problem, but it started showing up a lot with my
dentry patch (outlined below). I think it is time to address it.

The dentry patch moves d_lockref out of an area used by RCU lookup.
This provides a significant speedup when doing parallel stat on the
same file (will-it-scale, see [1] for the bench).

Results (see patch 2):
> before: 5038006
> after: 7842883 (+55%)

> One minor remark: the 'after' result is unstable, fluctuating in the
> range ~7.8 mln to ~9 mln during different runs.

The speedup also means the vulnerable code executes more often per
second, making it more likely to observe a transient lock acquire by
something unrelated and fall back to locking as well, starting the cascade.
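
For reference, the existing fast path boils down to roughly the following
(a simplified sketch, not the actual lib/lockref.c source; helper names as
in current kernels, details trimmed): the moment a thread sees the lock
word non-zero it falls through to spin_lock(), which in turn makes every
other thread running this code see the lock as taken.

/* simplified sketch of lockref_get(), not the actual lib/lockref.c code */
void lockref_get(struct lockref *lockref)
{
	struct lockref old, new;
	int retry = 100;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		new = old;
		new.count++;
		if (try_cmpxchg64_relaxed(&lockref->lock_count,
					  &old.lock_count, new.lock_count))
			return;		/* atomic update succeeded */
		if (!--retry)
			break;
	}

	/*
	 * The lock looked held (or the cmpxchg kept failing): take the lock,
	 * which makes every other thread spinning above observe it as taken
	 * and fall back here as well.
	 */
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}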

If I leave it running it eventually degrades to locked-only operation,
stats look like this:
> min:417297 max:426249 total:8401766 <-- expected performance
> min:219024 max:221764 total:4398413 <-- some locking started
> min:62855 max:64949 total:1273712 <-- everyone degraded
> min:62472 max:64873 total:1268733
> min:62179 max:63843 total:1256375

Sometimes it takes literally a few seconds, other times a few minutes.

I don't know who transiently takes the d_lock and I don't think it is
particularly relevant. I did find that I can trivially trigger the
problem by merely issuing "ls /tmp" a few times. It does depend on
tmpfs; there is no problem with ext4 at least.

Bottom line though is that if the d_lockref reordering lands and this
issue is not addressed, the lkp folks (or whoever else is benchmarking) may
trigger the bug and report a bogus regression.

Even if the d_lockref patch gets rejected I would argue the problem
should be sorted out; it is eventually going to bite someone.

I wrote the easiest variant of the fix I could think of but I'm not
married to any specific way to solve it.

If the vfs change is accepted it needs to land after the lockref issue
is resolved, thus I'm mailing both in the same patchset.

[1] https://github.com/antonblanchard/will-it-scale/pull/35

Mateusz Guzik (2):
lockref: speculatively spin waiting for the lock to be released
vfs: move d_lockref out of the area used by RCU lookup

include/linux/dcache.h | 7 +++-
lib/lockref.c | 85 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+), 1 deletion(-)

--
2.43.0



2024-06-13 00:12:59

by Mateusz Guzik

Subject: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

The usual inc/dec scheme is to try to do it with atomics if the lock is
not taken and fall back to locked operation otherwise.

If the lock is only transiently held, routines could instead
speculatively wait and avoid the acquire altogether.

This has the minor benefit of slightly speeding up cases of this sort,
which do happen in real workloads.

More importantly this fixes a corner case where lockref consumers
degrade to locked operation and are unable to go back to atomics.

Consider a continuous stream of lockref ops from several threads and
another thread taking the lock for a brief moment. Any lockref user
which manages to observe the lock as taken is going to fall back to
locking it itself, but this increases the likelihood that more threads
executing the code will do the same. This can eventually culminate in
nobody being able to go back to atomics on the problematic lockref.

While this is not a state which can persist in a real workload for
anything but a few calls, it very much can become permanent when
benchmarking and in consequence grossly distort the results.

Here is an example of will-it-scale doing 20-way stat(2) on the same
file residing on tmpfs, reports once per second with number of ops
completed:
[snip several ok results]
min:417297 max:426249 total:8401766 <-- expected performance
min:219024 max:221764 total:4398413 <-- some locking started
min:62855 max:64949 total:1273712 <-- everyone degraded
min:62472 max:64873 total:1268733
min:62179 max:63843 total:1256375
[snip remaining bad results]

While I did not try to figure out who transiently took the lock (it was
something outside of the benchmark), I devised a trivial reproducer
which triggers the problem almost every time: merely issue "ls" of the
directory containing the tested file (in this case: "ls /tmp").

The problem does not persist with the fix below.

Signed-off-by: Mateusz Guzik <[email protected]>
---
lib/lockref.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..596b521bc1f1 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -26,10 +26,46 @@
} \
} while (0)

+/*
+ * Most routines only ever inc/dec the reference, but the lock may be
+ * transiently held forcing them to take it as well.
+ *
+ * Should the lock be taken for any reason (including outside of lockref),
+ * a steady stream of ref/unref requests may find itself unable to go back
+ * to lockless operation.
+ *
+ * Combat the problem by giving the routines a way to speculatively wait in
+ * hopes of avoiding having to take the lock.
+ *
+ * The spin count is limited to guarantee forward progress, although the
+ * value is arbitrarily chosen.
+ *
+ * Note this routine is only used if the lock was found to be taken.
+ */
+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	struct lockref old;
+	int retry = 100;
+
+	for (;;) {
+		cpu_relax();
+		old.lock_count = READ_ONCE(lockref->lock_count);
+		if (arch_spin_value_unlocked(old.lock.rlock.raw_lock))
+			return true;
+		if (!--retry)
+			return false;
+	}
+}
+
 #else

 #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)

+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	return false;
+}
+
 #endif

/**
@@ -47,6 +83,14 @@ void lockref_get(struct lockref *lockref)
 		return;
 	);

+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+		,
+			return;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	lockref->count++;
 	spin_unlock(&lockref->lock);
@@ -70,6 +114,16 @@ int lockref_get_not_zero(struct lockref *lockref)
 		return 1;
 	);

+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count <= 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 0) {
@@ -98,6 +152,16 @@ int lockref_put_not_zero(struct lockref *lockref)
 		return 1;
 	);

+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 1)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 1) {
@@ -125,6 +189,17 @@ int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 0)
+				return -1;
+		,
+			return new.count;
+		);
+	}
+
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -181,6 +256,16 @@ int lockref_get_not_dead(struct lockref *lockref)
 		return 1;
 	);

+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count < 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count >= 0) {
--
2.43.0


2024-06-13 00:13:11

by Mateusz Guzik

Subject: [PATCH 2/2] vfs: move d_lockref out of the area used by RCU lookup

Stock kernel scales worse than FreeBSD when doing a 20-way stat(2) on
the same tmpfs-backed file.

According to perf top:
38.09% lockref_put_return
26.08% lockref_get_not_dead
25.60% __d_lookup_rcu
0.89% clear_bhb_loop

__d_lookup_rcu is participating in cacheline ping pong due to the
embedded name sharing a cacheline with lockref.

Moving it out resolves the problem:
41.50% lockref_put_return
41.03% lockref_get_not_dead
1.54% clear_bhb_loop

benchmark (will-it-scale, Sapphire Rapids, tmpfs, ops/s):
FreeBSD:7219334
before: 5038006
after: 7842883 (+55%)

One minor remark: the 'after' result is unstable, fluctuating in the
range ~7.8 mln to ~9 mln during different runs.

Signed-off-by: Mateusz Guzik <[email protected]>
---
include/linux/dcache.h | 7 ++++++-
lib/lockref.c | 2 +-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf53e3894aae..326dbccc3736 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -89,13 +89,18 @@ struct dentry {
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
 	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */

 	/* Ref lookup also touches following */
-	struct lockref d_lockref;	/* per-dentry lock and refcount */
 	const struct dentry_operations *d_op;
 	struct super_block *d_sb;	/* The root of the dentry tree */
 	unsigned long d_time;		/* used by d_revalidate */
 	void *d_fsdata;			/* fs-specific data */
+	/* --- cacheline 2 boundary (128 bytes) --- */
+	struct lockref d_lockref;	/* per-dentry lock and refcount
+					 * keep separate from RCU lookup area if
+					 * possible!
+					 */

 	union {
 		struct list_head d_lru;		/* LRU list */
diff --git a/lib/lockref.c b/lib/lockref.c
index 596b521bc1f1..c1e2736a7bac 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -45,7 +45,7 @@
 static inline bool lockref_trywait_unlocked(struct lockref *lockref)
 {
 	struct lockref old;
-	int retry = 100;
+	int retry = 256;

 	for (;;) {
 		cpu_relax();
--
2.43.0


2024-06-13 01:23:55

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Wed, 12 Jun 2024 at 17:12, Mateusz Guzik <[email protected]> wrote:
>
> While I did not try to figure out who transiently took the lock (it was
> something outside of the benchmark), I devised a trivial reproducer
> which triggers the problem almost every time: merely issue "ls" of the
> directory containing the tested file (in this case: "ls /tmp").

So I have no problem with your patch 2/2 - moving the lockref data
structure away from everything else that can be shared read-only makes
a ton of sense independently of anything else.

Except you also randomly increased a retry count in there, which makes no sense.

But this patch 1/2 makes me go "Eww, hacky hacky".

We already *have* the retry loop, it's just that currently it only
covers the cmpxchg failures.

The natural thing to do is to just make the "wait for unlocked" be
part of the same loop.

In fact, I have this memory of trying this originally, and it not
mattering and just making the code uglier, but that may be me
confusing myself. It's a *loong* time ago.

With the attached patch, lockref_get() (to pick one random case) ends
up looking like this:

	mov    (%rdi),%rax
	mov    $0x64,%ecx
loop:
	test   %eax,%eax
	jne    locked
	mov    %rax,%rdx
	sar    $0x20,%rdx
	add    $0x1,%edx
	shl    $0x20,%rdx
	lock cmpxchg %rdx,(%rdi)
	jne    fail
	// SUCCESS
	ret
locked:
	pause
	mov    (%rdi),%rax
fail:
	sub    $0x1,%ecx
	jne    loop

(with the rest being the "take lock and go slow" case).

It seems much better to me to have *one* retry loop that handles both
the causes of failures.

Entirely untested, I only looked at the generated code and it looked
reasonable. The patch may be entirely broken for some random reason I
didn't think of.

And in case you wonder, that 'lockref_locked()' macro I introduce is
purely to make the code more readable. Without it, that one
conditional line ends up being insanely long, the macro is there just
to break things up into slightly more manageable chunks.
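
The attached patch.diff is not reproduced in this archive; the following is
only a rough reconstruction of the idea from the description and the
generated code above (the lockref_locked() name comes from the mail,
everything else is assumed):

/* sketch of the combined retry loop, not the actual attachment */
#define lockref_locked(old) \
	(!arch_spin_value_unlocked((old).lock.rlock.raw_lock))

void lockref_get(struct lockref *lockref)
{
	struct lockref old;
	int retry = 100;

	old.lock_count = READ_ONCE(lockref->lock_count);
	for (;;) {
		if (unlikely(lockref_locked(old))) {
			/* lock is held: wait for it instead of giving up */
			cpu_relax();
			old.lock_count = READ_ONCE(lockref->lock_count);
		} else {
			struct lockref new = old;
			new.count++;
			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
							 &old.lock_count,
							 new.lock_count)))
				return;
		}
		if (!--retry)
			break;
	}

	/* slow path: take the lock and do the update the locked way */
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}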

Mind testing this approach instead?

Linus


Attachments:
patch.diff (1.10 kB)

2024-06-13 01:49:45

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Wed, 12 Jun 2024 at 18:23, Linus Torvalds
<[email protected]> wrote:
>
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.

Oh, and while I like my approach a lot more than your patch, I do
think that the real issue here is likely that something takes the
d_lock way too much.

One of the ideas behind the lockref was that locking should be an
exceptional thing when something special happens. So things like final
dput() and friends.

What I *think* is going on - judging by your description of how you
triggered this - is that sadly our libfs 'readdir()' thing is pretty
nasty.

It does use d_lock a lot for the cursor handling, and things like
scan_positives() in particular.

I understand *why* it does that, and maybe it's practically unfixable,
but I do think the most likely deeper reason for that "go into slow
mode" is the cursor on the directory causing issues.

Put another way: while I think doing the retry loop will help
benchmarks, it would be lovely if you were to look at that arguably
deeper issue of the 'd_sib' list.

Linus

2024-06-13 06:10:42

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> So I have no problem with your patch 2/2 - moving the lockref data
> structure away from everything else that can be shared read-only makes
> a ton of sense independently of anything else.
>
> Except you also randomly increased a retry count in there, which makes no sense.
>

Cmon man, that's a change which unintentionally crept into the patch and
I failed to notice.

> But this patch 1/2 makes me go "Eww, hacky hacky".
>
> We already *have* the retry loop, it's just that currently it only
> covers the cmpxchg failures.
>
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.
>

I was playing with a bpftrace probe reporting how many spins were
performed waiting for the lock. For my intentional usage with ls it was
always < 30 or so. The random-ass intruder which messes with my bench on
occasion needed over 100.

The bump is something I must have fat-fingered into the wrong patch.

Ultimately even if *some* iterations will still take the lock, they
should be able to avoid it next time around, so the permanent
degradation problem is still solved.

> Mind testing this approach instead?
>

So I'm not going to argue about the fix.

I tested your code and confirm it fixes the problem, nothing blows up
either and FWIW I don't see any bugs in it.

When writing the commit message feel free to use mine in whatever capacity
(including none) you want.

On Wed, Jun 12, 2024 at 06:49:00PM -0700, Linus Torvalds wrote:

> One of the ideas behind the lockref was that locking should be an
> exceptional thing when something special happens. So things like final
> dput() and friends.
>
> What I *think* is going on - judging by your description of how you
> triggered this - is that sadly our libfs 'readdir()' thing is pretty
> nasty.
>
> It does use d_lock a lot for the cursor handling, and things like
> scan_positives() in particular.
>
> I understand *why* it does that, and maybe it's practically unfixable,
> but I do think the most likely deeper reason for that "go into slow
> mode" is the cursor on the directory causing issues.
>
> Put another way: while I think doing the retry loop will help
> benchmarks, it would be lovely if you were to look at that arguably
> deeper issue of the 'd_sib' list.
>

I think lockref claiming to be a general locking facility means it
should not be suffering the degradation problem to begin with, so that
would be the real problem as far as lockref goes.

For vfs as a whole I do agree that the d_lock usage in various spots is
probably avoidable and if so would be nice to get patched out, readdir
included. So Someone(tm) should certainly look into it.

As for me at the moment I have other stuff on my todo list, so I am not
going to do it for the time being.

Regardless, patching up lockref to dodge the lock is a step forward in
the right direction AFAICS.

=====

All that aside, you did not indicate how you want to move forward
regarding patch submission.

As indicated in my cover letter the vfs change (if it is to land) needs
to be placed *after* the lockref issue is addressed, otherwise it may
result in bogus regression reports. Thus I think it makes most sense to
just ship them together.

So maybe you can send me a patch and I'll send a v2 of the patchset with
that; alternatively perhaps you can edit out the unintentional retry
adjustment from my dentry patch and handle the rest yourself.

Or maybe you have some other idea.

The thing that matters to me is not landing in a state where d_lockref
is moved around but the lockref corner case is not patched (even if it is
patched *after*). I really don't want to investigate a regression report
only to find it's caused by the above.

2024-06-13 13:46:26

by Christian Brauner

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

I've picked Linus' patch and yours for testing into the vfs.inode.rcu branch.
It was trivial to fix your typo and to add Linus as author with your commit
message. Let's see what syzbot and that perf bot have to say.

2024-06-13 13:52:42

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 3:46 PM Christian Brauner <[email protected]> wrote:
>
> > All that aside, you did not indicate how you want to move forward
> > regarding patch submission.
>
> I've picked Linus' patch and yours for testing into the vfs.inode.rcu branch.
> It was trivial to fix your typo and to add Linus as author with your commit
> message. Let's see what syzbot and that perf bot have to say.

sounds good to me, thanks

--
Mateusz Guzik <mjguzik gmail.com>

2024-06-13 16:50:46

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Wed, 12 Jun 2024 at 23:10, Mateusz Guzik <[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> > So I have no problem with your patch 2/2 - moving the lockref data
> > structure away from everything else that can be shared read-only makes
> > a ton of sense independently of anything else.
> >
> > Except you also randomly increased a retry count in there, which makes no sense.
>
> Cmon man, that's a change which unintentionally crept into the patch and
> I failed to notice.

Heh. It wasn't entirely obvious that it was unintentional, since the
series very much was about that whole retry thing.

But good.

> I was playing with a bpftrace probe reporting how many spins were
> performed waiting for the lock. For my intentional usage with ls it was
> always < 30 or so. The random-ass intruder which messes with my bench on
> occasion needed over 100.

Ok. So keeping it at 100 is likely fine.

Of course, one option is to simply get rid of the retry count
entirely, and just make it all be entirely unconditional.

The fallback to locking is not technically required at all for the
USE_CMPXCHG_LOCKREF case.

That has some unfairness issues, though. But my point is mostly that
the count is probably not hugely important or meaningful.

> I tested your code and confirm it fixes the problem, nothing blows up
> either and FWIW I don't see any bugs in it.

I was more worried about some fat-fingering than any huge conceptual
bugs, and any minor testing with performance checks would find that.

Just as an example: my first attempt switched the while(likely(..)) to
the if (unlikely(..)) in the loop, but didn't add the "!" to negate
the test.

I caught it immediately and obviously never sent that broken thing out
(and it was one reason why I decided I needed to make the code more
readable with that lockref_locked() helper macro). But that's mainly
the kind of thing I was worried about.

> When writing the commit message feel free to use mine in whatever capacity
> (including none) you want.

Ack.
> I think lockref claiming to be a general locking facility means it
> should not be suffering the degradation problem to begin with, so that
> would be the real problem as far as lockref goes.

Well, it was certainly originally meant to be generic, but after more
than a decade, the number of users is three. And the two non-dcache
users are pretty darn random.

So it's effectively really just a dcache special case with two
filesystems that started using it just because the authors had clearly
seen the vfs use of it...

> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

lockref is fairly unusual, and is pretty much mine. The initial
concept was from Waiman Long:

https://lore.kernel.org/all/[email protected]/

but I ended up redoing it pretty much from scratch, and credited him
as author for the initial commit.

It's _technically_ a locking thing, but realistically it has never been in
the locking tree and it's effectively a dcache thing.

Linus

2024-06-13 17:00:59

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, 13 Jun 2024 at 06:46, Christian Brauner <[email protected]> wrote:
>
> I've picked Linus' patch and yours for testing into the vfs.inode.rcu branch.

Btw, if you added [patch 2/2] too, I think the exact byte counts in
the comments are a bit misleading.

The actual cacheline details will very much depend on 32-bit vs 64-bit
builds, but also on things like the slab allocator debug settings.

I think the important part is to keep the d_lockref - that is often
dirty and exclusive in the cache - away from the mostly RO parts of
the dentry that can be shared across CPU's in the cache.

So rather than talk about exact byte offsets, maybe just state that
overall rule?

Linus

2024-06-13 18:13:36

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 13 Jun 2024 at 06:46, Christian Brauner <[email protected]> wrote:
> >
> > I've picked Linus' patch and yours for testing into the vfs.inode.rcu branch.
>
> Btw, if you added [patch 2/2] too, I think the exact byte counts in
> the comments are a bit misleading.
>
> The actual cacheline details will very much depend on 32-bit vs 64-bit
> builds, but also on things like the slab allocator debug settings.
>

I indeed assumed "x86_64 production", with lines just copied from pahole.

However, to the best of my understanding the counts are what one
should expect on a 64-bit kernel.

That said:

> I think the important part is to keep the d_lockref - that is often
> dirty and exclusive in the cache - away from the mostly RO parts of
> the dentry that can be shared across CPU's in the cache.
>
> So rather than talk about exact byte offsets, maybe just state that
> overall rule?
>

I would assume the rule is pretty much well known and instead an
indicator where is what (as added in my comments) would be welcome.

But if going that route, then perhaps:
"Make sure d_lockref does not share a cacheline with fields used by
RCU lookup, otherwise it can result in a significant throughput
reduction. You can use pahole(1) to check the layout."
[maybe a link to perfbook or something goes here?]

Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
(only active without whatever runtime debug messing with the layout).

--
Mateusz Guzik <mjguzik gmail.com>

2024-06-13 18:41:52

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 8:13 PM Mateusz Guzik <[email protected]> wrote:
>
> On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, 13 Jun 2024 at 06:46, Christian Brauner <[email protected]> wrote:
> > >
> > > I've picked Linus' patch and yours for testing into the vfs.inode.rcu branch.
> >
> > Btw, if you added [patch 2/2] too, I think the exact byte counts in
> > the comments are a bit misleading.
> >
> > The actual cacheline details will very much depend on 32-bit vs 64-bit
> > builds, but also on things like the slab allocator debug settings.
> >
>
> I indeed assumed "x86_64 production", with lines just copied from pahole.
>
> However, to the best of my understanding the counts are what one
> should expect on a 64-bit kernel.
>
> That said:
>
> > I think the important part is to keep the d_lockref - that is often
> > dirty and exclusive in the cache - away from the mostly RO parts of
> > the dentry that can be shared across CPU's in the cache.
> >
> > So rather than talk about exact byte offsets, maybe just state that
> > overall rule?
> >
>
> I would assume the rule is pretty much well known and instead an
> indicator where is what (as added in my comments) would be welcome.
>
> But if going that route, then perhaps:
> "Make sure d_lockref does not share a cacheline with fields used by
> RCU lookup, otherwise it can result in a significant throughput
> reduction. You can use pahole(1) to check the layout."
> [maybe a link to perfbook or something goes here?]
>
> Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
> (only active without whatever runtime debug messing with the layout).
>

So I tried to add the BUILD_BUG_ONs but I got some compilation errors
immediately concerning the syntax; maybe I'm brainfarting here. I am
not pasting that in case you want to take a stab at it yourself from
scratch.
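
One hypothetical shape for such a check (an untested sketch; note that
static_assert() works at file scope while BUILD_BUG_ON() needs to sit
inside a function, and the check assumes a 64-bit build without
layout-changing debug options):

/*
 * Hypothetical layout check: require that d_lockref starts past the
 * cacheline(s) covering the RCU lookup fields (d_flags through d_iname).
 */
#if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_LOCK_ALLOC)
static_assert(offsetof(struct dentry, d_lockref) >=
	      ALIGN(offsetofend(struct dentry, d_iname), SMP_CACHE_BYTES),
	      "d_lockref shares a cacheline with the RCU lookup fields");
#endif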

I did however type up the following:
/*
* Note: dentries have a read-mostly area heavily used by RCU (denoted with
* "RCU lookup touched fields") which must not share cachelines with a
* frequently written-to field like d_lockref.
*
* If you are modifying the layout you can check with pahole(1) that there is
* no overlap.
*/

could land just above the struct.

That's my $0.03. I am not going to give further commentary on the
matter; you guys touch it up however you see fit.

--
Mateusz Guzik <mjguzik gmail.com>

2024-06-13 18:44:33

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <[email protected]> wrote:
>
> I would assume the rule is pretty much well known and instead an
> indicator where is what (as added in my comments) would be welcome.

Oh, the rule is well-known, but I think what is worth pointing out is
the different classes of fields, and the name[] field in particular.

This ordering was last really updated back in 2011, by commit
44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
it was actually somewhat intentional at the time. Quoting from that
commit:

We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.

but what has actually really changed - and that I didn't even realize
until I now did a 'pahole' on it, was that this was all COMPLETELY
broken by

seqcount_spinlock_t d_seq;

because seqcount_spinlock_t has been entirely broken and went from
being 4 bytes back when, to now being 64 bytes.

Crazy crazy.

How did that happen?

Linus

2024-06-13 18:48:16

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 8:43 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <[email protected]> wrote:
> >
> > I would assume the rule is pretty much well known and instead an
> > indicator where is what (as added in my comments) would be welcome.
>
> Oh, the rule is well-known, but I think what is worth pointing out is
> the different classes of fields, and the name[] field in particular.
>
> This ordering was last really updated back in 2011, by commit
> 44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
> it was actually somewhat intentional at the time. Quoting from that
> commit:
>
> We also fit in 8 bytes of inline name in the first 64 bytes, so for short
> names, only 64 bytes needs to be touched to perform the lookup. We should
> get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
> of name in there, which will take care of 81% rather than 32% of the kernel
> tree.
>

Right. Things degrading by accident is why I suggested BUILD_BUG_ON.

> but what has actually really changed - and that I didn't even realize
> until I now did a 'pahole' on it, was that this was all COMPLETELY
> broken by
>
> seqcount_spinlock_t d_seq;
>
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.
>
> Crazy crazy.
>
> How did that happen?
>

perhaps lockdep in your config?

this is the layout on my prod config:
struct dentry {
	unsigned int                    d_flags;              /*     0     4 */
	seqcount_spinlock_t             d_seq;                /*     4     4 */
	struct hlist_bl_node            d_hash;               /*     8    16 */
	struct dentry *                 d_parent;             /*    24     8 */
	struct qstr                     d_name;               /*    32    16 */
	struct inode *                  d_inode;              /*    48     8 */
	unsigned char                   d_iname[40];          /*    56    40 */
	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
	const struct dentry_operations * d_op;                /*    96     8 */
	struct super_block *            d_sb;                 /*   104     8 */
	long unsigned int               d_time;               /*   112     8 */
	void *                          d_fsdata;             /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct lockref                  d_lockref __attribute__((__aligned__(8))); /*   128     8 */
	union {
		struct list_head        d_lru;                /*   136    16 */
		wait_queue_head_t *     d_wait;               /*   136     8 */
	};                                                    /*   136    16 */
	struct hlist_node               d_sib;                /*   152    16 */
	struct hlist_head               d_children;           /*   168     8 */
	union {
		struct hlist_node       d_alias;              /*   176    16 */
		struct hlist_bl_node    d_in_lookup_hash;     /*   176    16 */
		struct callback_head    d_rcu __attribute__((__aligned__(8))); /*   176    16 */
	} d_u __attribute__((__aligned__(8)));                /*   176    16 */

	/* size: 192, cachelines: 3, members: 16 */
	/* forced alignments: 2 */
} __attribute__((__aligned__(8)));


--
Mateusz Guzik <mjguzik gmail.com>

2024-06-13 18:56:46

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <[email protected]> wrote:
>
> perhaps lockdep in your config?

Yes, happily it was just lockdep, and the fact that my regular tree
doesn't have debug info, so I looked at my allmodconfig tree.

I didn't *think* anything in the dentry struct should care about
debugging, but clearly that sequence number thing did.

But with that fixed, it's still the case that "we used to know about
this", but what you actually fixed is the case of larger names than 8
bytes.

You did mention the name clashing in your commit message, but I think
that should be the important part in the code comments too: make them
about "these fields are hot and pretty much read-only", "these fields
don't matter" and "this field is hot and written, and shouldn't be
near the read-only ones".

The exact byte counts may change, the basic notion doesn't.

(Of course, putting it at the *end* of the structure then possibly
causes cache conflicts with the next one - we don't force the dentries
to be cacheline aligned even though we've tried to make them generally work
that way)

Linus

2024-06-13 19:03:21

by Mateusz Guzik

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 8:56 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <[email protected]> wrote:
> >
> > perhaps lockdep in your config?
>
> Yes, happily it was just lockdep, and the fact that my regular tree
> doesn't have debug info, so I looked at my allmodconfig tree.
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.
>
> But with that fixed, it's still the case that "we used to know about
> this", but what you actually fixed is the case of larger names than 8
> bytes.
>
> You did mention the name clashing in your commit message, but I think
> that should be the important part in the code comments too: make them
> about "these fields are hot and pretty much read-only", "these fields
> don't matter" and "this field is hot and written, and shouldn't be
> near the read-only ones".
>
> The exact byte counts may change, the basic notion doesn't.
>
> (Of course, putting it at the *end* of the structure then possibly
> causes cache conflicts with the next one - we don't force the dentries
> to be cacheline aligned even though we've tried to make them generally work
> that way)
>

Look mate, I'm not going to go back and forth about this bit.

Nobody is getting a Turing award for moving a field elsewhere, so as
far as I am concerned you are free to write your own version of the
patch; I don't even need to be mentioned. You are also free to grab
the commit message in whatever capacity (I guess you would at least
want the bench results?).

As long as lockref (the facility) and d_lockref are out of the way I'm
a happy camper.
--
Mateusz Guzik <mjguzik gmail.com>

2024-06-13 19:35:04

by Linus Torvalds

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, 13 Jun 2024 at 11:56, Linus Torvalds
<[email protected]> wrote:
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.

Looking at the 32-bit build, it looks like our current 'struct dentry'
is 136 bytes in size, not 128.

Looks like DNAME_INLINE_LEN should be reduced to 36 on 32-bit.

And moving d_lockref to after d_fsdata works there too.

Not that anybody really cares, but let's make sure it's actually
properly done when this is changed. Christian?

Linus

2024-06-14 00:53:30

by Al Viro

Subject: Re: [PATCH 1/2] lockref: speculatively spin waiting for the lock to be released

On Thu, Jun 13, 2024 at 11:43:05AM -0700, Linus Torvalds wrote:

> seqcount_spinlock_t d_seq;
>
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.

1ca7d67cf5d5 "seqcount: Add lockdep functionality to seqcount/seqlock structures"
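
For context, roughly what that commit adds (paraphrased from
include/linux/seqlock.h rather than copied verbatim): with lockdep enabled
the seqcount carries an embedded lockdep_map plus a back-pointer to the
associated lock, which is what inflates seqcount_spinlock_t from 4 bytes
to a full cacheline on an allmodconfig build.

/* rough sketch of the debug layout, field names approximate */
typedef struct seqcount {
	unsigned sequence;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;	/* dominates the size with lockdep on */
#endif
} seqcount_t;

typedef struct seqcount_spinlock {
	seqcount_t seqcount;
#ifdef CONFIG_LOCKDEP
	spinlock_t *lock;		/* back-pointer for lockdep assertions */
#endif
} seqcount_spinlock_t;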