2010-06-24 03:15:44

by Nick Piggin

Subject: [patch 00/52] vfs scalability patches updated

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Update to vfs scalability patches:

- Lots of fixes, particularly RCU inode stuff
- Lots of cleanups and aesthetic improvements to the code, ifdef reduction, etc.
- Use bit locks for inode and dentry hashes
- Small improvements to single-threaded performance
- Split inode LRU and writeback list locking
- Per-bdi inode writeback list locking
- Per-zone mm shrinker
- Per-zone dentry and inode LRU lists
- Several fixes brought in from -rt tree testing
- No global locks remain in any fastpaths (with the arguable exception of rename)

I have not included the store-free path walk patches in this posting. They
require a bit more work and they will need to be reworked after
->d_revalidate/->follow_mount changes that Al wants to do. I prefer to
concentrate on these locking patches first.

Autofs4 is sadly missing. It's a bit tricky; the patches have to be reworked.

Performance:
Last time I was testing on a 32-node Altix which could be considered as not a
sweet-spot for Linux performance target (ie. improvements there may not justify
complexity). So recently I've been testing with a tightly interconnected
4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
system.

*** Single-thread microbenchmark (simple syscall loops, lower is better):
Test Difference at 95.0% confidence (50 runs)
open/close -6.07% +/- 1.075%
creat/unlink 27.83% +/- 0.522%
Open/close is a little faster, which should be due to one less atomic in the
dput common case. Creat/unlink is significantly slower, which is due to RCU
freeing inodes. We have made the same magnitude of performance regression
tradeoff when going to RCU-freed dentries and files as well. Inode RCU is
required for reducing inode hash lookup locking and improving lock ordering,
and also for store-free path walking.

*** Let's take a look at this creat/unlink regression more closely. If we call
rdtsc around the creat/unlink loop and run it just once (so as to avoid
most of the RCU-induced problems):
vanilla: 5328 cycles
vfs: 5960 cycles (+11.8%)
Not so bad when RCU is not being stressed.
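For reference, a minimal user-space sketch of this kind of measurement (the
file name and the rdtsc() helper are illustrative, not the original harness):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t t0, t1;
	int fd;

	t0 = rdtsc();
	fd = creat("./tmpfile", 0600);	/* create the file ... */
	close(fd);
	unlink("./tmpfile");		/* ... and immediately remove it */
	t1 = rdtsc();

	printf("creat/unlink: %llu cycles\n", (unsigned long long)(t1 - t0));
	return 0;
}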

*** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
vanilla vfs
real 0m4.911s 0m0.183s
user 0m1.920s 0m1.610s
sys 4m58.670s 0m5.770s
After the vfs patches, a 26x increase in throughput; however, parallelism is
limited by the test's spawning and exit phases. The sys time shows closer to a
50x improvement. Vanilla is bottlenecked on dcache_lock.

*** Google sockets (http://marc.info/?l=linux-kernel&m=123215942507568&w=2):
vanilla vfs
real 1m 7.774s 0m 3.245s
user 0m19.230s 0m36.750s
sys 71m41.310s 2m47.320s
do_exit path for the run took 24.755s 1.219s
After vfs patches, 20x increase in throughput for both the total duration and
the do_exit (teardown) time.

*** file-ops test (people.redhat.com/mingo/file-ops-test/file-ops-test.c)
Parallel open/close or creat/unlink in same or different cwds within the same
ramfs mount. Relative throughput percentages are given at each parallelism
point (higher is better):

open/close vanilla vfs
same cwd
1 100.0 119.1
2 74.2 187.4
4 38.4 40.9
8 18.7 27.0
16 9.0 24.9
32 5.9 24.2
64 6.0 27.7
different cwd
1 100.0 119.1
2 133.0 238.6
4 21.2 488.6
8 19.7 932.6
16 18.8 1784.1
32 18.7 3469.5
64 19.0 2858.0

creat/unlink vanilla vfs
same cwd
1 100.0 75.0
2 44.1 41.8
4 28.7 24.6
8 16.5 14.2
16 8.7 8.9
32 5.5 7.8
64 5.9 7.4
different cwd
1 100.0 75.0
2 89.8 137.2
4 20.1 267.4
8 17.2 513.0
16 16.2 901.9
32 15.8 1724.0
64 17.3 1161.8

Note that at 64, we start using sibling threads on the CPU, making results jump
around a bit. The drop at 64 in different-cwd cases seems to be hitting an RCU
or slab allocator issue (or maybe it's just the SMT).

The scalability regression I was seeing in same-cwd tests is no longer there
(it has even improved now). It may still be present in some workloads doing
common-element path lookups. This can be solved by making d_count atomic again,
at the cost of more atomic ops in some cases, but scalability is still limited.
So I prefer store-free path walking, which is much more scalable.

In the different-cwd open/close case, the cost of bouncing cachelines over the
interconnect puts an absolute upper limit of 162K open/closes per second over
the entire machine in the vanilla kernel. After the vfs patches, it is around
30M. On larger and less well connected machines, that vanilla limit will only
get lower, while the vfs case should continue to keep going up (assuming the
mm subsystem can keep up).

*** Reclaim
I have not done much reclaim testing yet. It should be more scalable and lower
latency due to significant reduction in lru locks interfering with other
critical sections in inode/dentry code, and because we have per-zone locks.
Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
kswapd will operate on lists of node-local memory objects.


2010-06-25 07:12:39

by Christoph Hellwig

Subject: Re: [patch 00/52] vfs scalability patches updated

If you actually want to get this work in, reposting a huge patchkit again and
again probably doesn't help. Start to prioritize areas and work on small
sets to get them ready.

files_lock and vfsmount_lock seem like rather easy targets to start
with. But for files_lock I really want to see something to generalize
the tty special case. If you touch that area in detail, that wart needs
to go. Al didn't seem to like my variant very much, so he might have
a better idea for it - otherwise it really makes the VFS locking simple
by removing any tty interaction with the superblock files list. The
other suggestion would be to only add regular (maybe even just
writeable) files to the list. In addition to reducing the number of
list operations required, it will also make the tty code a lot easier.
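For concreteness, a hedged sketch of that suggestion; the helper name
sb_add_file() and its call site are assumptions, and file_move() is the
existing helper that moves a file onto a list under the global files lock:

static void sb_add_file(struct super_block *sb, struct file *file)
{
	struct inode *inode = file->f_path.dentry->d_inode;

	/*
	 * Per the suggestion above: only regular files opened for write
	 * need to be visible on the per-superblock list (e.g. for
	 * remount-read-only handling), so skip everything else.
	 */
	if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
		return;

	file_move(file, &sb->s_files);
}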

As for the other patches: I don't think the massive fine-grained
locking in the hash tables is a good idea. I would recommend to defer
them for now, and then look into better data structures for these caches
instead of working around the inherent problems of global hash tables.

2010-06-25 08:05:16

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Fri, Jun 25, 2010 at 03:12:21AM -0400, Christoph Hellwig wrote:
> If you actually want to get this work in, reposting a huge patchkit again and
> again probably doesn't help. Start to prioritize areas and work on small
> sets to get them ready.

Sure, I haven't been posting the same thing (haven't posted it for a
long time). This simply had a lot of new stuff and improvements to all
existing patches.

I didn't cc anyone in particular because it's only for interested
people to take a look at. As you saw, last time I cc'ed Al I was just
trying to get exactly those easier targets merged.


> files_lock and vfsmount_lock seem like rather easy targets to start
> with. But for files_lock I really want to see something to generalize
> the tty special case. If you touch that area in detail, that wart needs
> to go. Al didn't seem to like my variant very much, so he might have
> a better idea for it - otherwise it really makes the VFS locking simple
> by removing any tty interaction with the superblock files list.

Actually I didn't like it because the error handling in the tty code
was broken and difficult to fix properly. The concept was OK though.

But the fact is that today the tty code already "knows" that the vfs doesn't
need its files on the superblock list, and so it may take them off and use that
list_head privately. Currently it is also using files_lock to protect
that private usage. These are two independent problems. My patch fixes
the second, and anything that fixes the first also needs to fix the
second in exactly the same way.


> The
> other suggestion would be to only add regular (maybe even just
> writeable) files to the list. In addition to reducing the number of
> list operations required, it will also make the tty code a lot easier.

This was my suggestion, yes. Either way is conceptually the same; this
one just avoids the memory allocation and error-handling problems that
yours had.

But again, the locking change is still required, and it would look exactly
the same as my patch really.


> As for the other patches: I don't think the massive fine-grained
> locking in the hash tables is a good idea. I would recommend to defer
> them for now, and then look into better data structures for these caches
> instead of working around the inherent problems of global hash tables.

I don't agree, actually. I don't think there is any downside to
fine-grained locking of the hash with bit spinlocks. Until I see one, I
will keep them.
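For illustration, a sketch of the per-bucket bit-lock idea: the lock is bit 0
of the bucket's head pointer, so the fine-grained locking costs no extra
memory. The hlist_bl_* names and the hlist_bl_node d_hash field follow the
patchset rather than current mainline, and D_HASH_BITS is arbitrary:

#include <linux/list_bl.h>
#include <linux/rculist_bl.h>
#include <linux/hash.h>

#define D_HASH_BITS	16	/* arbitrary for this sketch */

static struct hlist_bl_head dentry_hashtable[1 << D_HASH_BITS];

static void d_hash_insert(struct dentry *dentry, unsigned long hashval)
{
	struct hlist_bl_head *b =
		&dentry_hashtable[hash_long(hashval, D_HASH_BITS)];

	hlist_bl_lock(b);	/* bit_spin_lock(0, ...) on this bucket only */
	hlist_bl_add_head_rcu(&dentry->d_hash, b);
	hlist_bl_unlock(b);
}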

I agree that some other data structure may be better, but it should be
compared with the best possible hash implementation, which is a scalable
hash like this one.

Also, our big impending performance problem is SMP scalability, not hash
lookup, AFAIKS.

2010-06-30 11:31:24

by Dave Chinner

Subject: Re: [patch 00/52] vfs scalability patches updated

On Thu, Jun 24, 2010 at 01:02:12PM +1000, [email protected] wrote:
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Can you put a git tree up somewhere?

> Update to vfs scalability patches:

....

Now that I've had a look at the whole series, I'll make an overall
comment: I suspect that the locking is sufficiently complex that we
can count the number of people that will be able to debug it on one
hand. This patch set didn't just fall off the locking cliff, it
fell into a bottomless pit...

> Performance:
> Last time I was testing on a 32-node Altix which could be considered as not a
> sweet-spot for Linux performance target (ie. improvements there may not justify
> complexity). So recently I've been testing with a tightly interconnected
> 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> system.

Sure, but I have to question how much of this is actually necessary?
A lot of it looks like scalability for scalability's sake, not
because there is a demonstrated need...

> *** Single-thread microbenchmark (simple syscall loops, lower is better):
> Test Difference at 95.0% confidence (50 runs)
> open/close -6.07% +/- 1.075%
> creat/unlink 27.83% +/- 0.522%
> Open/close is a little faster, which should be due to one less atomic in the
> dput common case. Creat/unlink is significantly slower, which is due to RCU
> freeing inodes.

That's a pretty big ouch. Why does RCU freeing of inodes cause that
much regression? The RCU freeing is out of line, so where does the big
impact come from?

> *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> vanilla vfs
> real 0m4.911s 0m0.183s
> user 0m1.920s 0m1.610s
> sys 4m58.670s 0m5.770s
> After the vfs patches, a 26x increase in throughput; however, parallelism is
> limited by the test's spawning and exit phases. The sys time shows closer to a
> 50x improvement. Vanilla is bottlenecked on dcache_lock.

So if we cherry pick patches out of the series, what is the bare
minimum set needed to obtain a result in this ballpark? Same for the
other tests?

> *** Reclaim
> I have not done much reclaim testing yet. It should be more scalable and lower
> latency due to significant reduction in lru locks interfering with other
> critical sections in inode/dentry code, and because we have per-zone locks.
> Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> kswapd will operate on lists of node-local memory objects.

This means we no longer have any global LRUness to inode or dentry
reclaim, which is going to significantly change caching behaviour.
It's also got interesting corner cases like a workload running on a
single node with a dentry/icache working set larger than the VM
wants to hold on a single node.

We went through these sorts of problems with cpusets a few years
back, and the workaround for it was not to limit the slab cache to
the cpuset's nodes. Handling this sort of problem correctly seems
distinctly non-trivial, so I'm really very reluctant to move in this
direction without clear evidence that we have no other
alternative....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-30 14:33:23

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Wed, Jun 30, 2010 at 09:30:54PM +1000, Dave Chinner wrote:
> On Thu, Jun 24, 2010 at 01:02:12PM +1000, [email protected] wrote:
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/
>
> Can you put a git tree up somewhere?

I suppose I should. I'll try to set one up.


> > Update to vfs scalability patches:
>
> ....
>
> Now that I've had a look at the whole series, I'll make an overall
> comment: I suspect that the locking is sufficiently complex that we
> can count the number of people that will be able to debug it on one
> hand.

As opposed to everyone who understood it beforehand? :)


> This patch set didn't just fall off the locking cliff, it
> fell into a bottomless pit...

I actually think it's simpler in some ways. It has more locks, but a
lot of those protect small, well-defined data.

Filesystems have required surprisingly minimal changes (except
autofs4, but that's a fairly special case).


> > Performance:
> > Last time I was testing on a 32-node Altix which could be considered as not a
> > sweet-spot for Linux performance target (ie. improvements there may not justify
> > complexity). So recently I've been testing with a tightly interconnected
> > 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> > system.
>
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

People are complaining about vfs scalability already (at least Intel,
Google, IBM, and networking people). By the time people start shouting,
it's too late because it will take years to get the patches merged. I'm
not counting -rt people who have a bad time with global vfs locks.

You saw the "batched dput+iput" hacks that google posted a couple of
years ago. Those were in the days of 4 core Core2 CPUs, long before 16
thread Nehalems that will scale well to 4/8 sockets at low cost.

At the high end, vaguely extrapolating from my numbers, a big POWER7 may
do under 100 open/close operations per second per hw thread. A big UV
probably under 10 per core.

But actually it's not all for scalability. I have some follow-on patches
(that require RCU inodes, among other things) that actually improve
single-threaded performance significantly. The git diff workload IIRC was
several % improved by speeding up stat(2).


> > *** Single-thread microbenchmark (simple syscall loops, lower is better):
> > Test Difference at 95.0% confidence (50 runs)
> > open/close -6.07% +/- 1.075%
> > creat/unlink 27.83% +/- 0.522%
> > Open/close is a little faster, which should be due to one less atomic in the
> > dput common case. Creat/unlink is significantly slower, which is due to RCU
> > freeing inodes.
>
> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> much regression? The RCU freeing is out of line, so where does the big
> impact come from?

That comes mostly from inability to reuse the cache-hot inode structure,
and the cost to go over the deferred RCU list and free them after they
get cache cold.
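For reference, a rough sketch of what the RCU-deferred freeing looks like;
the i_rcu / i_callback() / inode_cachep names are assumptions for this sketch,
and the real destroy path does more work than this:

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>

static struct kmem_cache *inode_cachep;	/* the inode slab cache */

static void i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);

	/* Runs after a grace period, when the inode is likely cache-cold. */
	kmem_cache_free(inode_cachep, inode);
}

static void destroy_inode_rcu(struct inode *inode)
{
	/*
	 * A plain kmem_cache_free() here would let the very next creat()
	 * reuse the cache-hot object; deferring the free until after a
	 * grace period is what the creat/unlink numbers above are paying for.
	 */
	call_rcu(&inode->i_rcu, i_callback);
}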


> > *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> > vanilla vfs
> > real 0m4.911s 0m0.183s
> > user 0m1.920s 0m1.610s
> > sys 4m58.670s 0m5.770s
> > After the vfs patches, a 26x increase in throughput; however, parallelism is
> > limited by the test's spawning and exit phases. The sys time shows closer to a
> > 50x improvement. Vanilla is bottlenecked on dcache_lock.
>
> So if we cherry pick patches out of the series, what is the bare
> minimum set needed to obtain a result in this ballpark? Same for the
> other tests?

Well it's very hard to just scale up bits and pieces because the
dcache_lock is currently basically global (except for d_flags and
some cases of d_count manipulations).

Start chipping away at bits and pieces of it as people hit bottlenecks
and I think it will end in a bigger mess than we have now.

I don't think this should be done lightly, but I think it is going to
be required soon.


> > *** Reclaim
> > I have not done much reclaim testing yet. It should be more scalable and lower
> > latency due to significant reduction in lru locks interfering with other
> > critical sections in inode/dentry code, and because we have per-zone locks.
> > Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> > kswapd will operate on lists of node-local memory objects.
>
> This means we no longer have any global LRUness to inode or dentry
> reclaim, which is going to significantly change caching behaviour.
> It's also got interesting corner cases like a workload running on a
> single node with a dentry/icache working set larger than the VM
> wants to hold on a single node.
>
> We went through these sorts of problems with cpusets a few years
> back, and the workaround for it was not to limit the slab cache to
> the cpuset's nodes. Handling this sort of problem correctly seems
> distinctly non-trivial, so I'm really very reluctant to move in this
> direction without clear evidence that we have no other
> alternative....

As I explained in the other mail, that's not actually how the
per-zone reclaim works.

Thanks,
Nick

2010-06-30 17:09:28

by Frank Mayhar

Subject: Re: [patch 00/52] vfs scalability patches updated

On Wed, 2010-06-30 at 21:30 +1000, Dave Chinner wrote:
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

Well, we've repeatedly run into problems with contention on the
dcache_lock as well as the inode_lock; changes that improve those paths
are extremely interesting to us. I've also seen numbers from systems
with large (i.e. 32 to 64) numbers of cores that clearly show serious
problems in this area.

Further, while this seems like a bunch of patches, a close look shows
that it basically just pushes the dcache and inode locks down as far as
possible, making other improvements (such as removal of a few atomics
and no longer batching inode reclaims, among other things) based on that
work. I would be hard-pressed to find much to cherry-pick from this
patch set.

One interesting thing might be to do a set of performance tests for
kernels with increasingly more of the patchset, just to see the effect
of the earlier patches against a vanilla kernel and to measure the
cumulative effect of the later patches. (I'm not volunteering, however:
ENOTIME.)
--
Frank Mayhar <[email protected]>
Google, Inc.

2010-07-01 03:57:13

by Dave Chinner

Subject: Re: [patch 00/52] vfs scalability patches updated

On Wed, Jun 30, 2010 at 10:40:49PM +1000, Nick Piggin wrote:
> On Wed, Jun 30, 2010 at 09:30:54PM +1000, Dave Chinner wrote:
> > On Thu, Jun 24, 2010 at 01:02:12PM +1000, [email protected] wrote:
> > > Performance:
> > > Last time I was testing on a 32-node Altix which could be considered as not a
> > > sweet-spot for Linux performance target (ie. improvements there may not justify
> > > complexity). So recently I've been testing with a tightly interconnected
> > > 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> > > system.
> >
> > Sure, but I have to question how much of this is actually necessary?
> > A lot of it looks like scalability for scalability's sake, not
> > because there is a demonstrated need...
>
> People are complaining about vfs scalability already (at least Intel,
> Google, IBM, and networking people). By the time people start shouting,
> it's too late because it will take years to get the patches merged. I'm
> not counting -rt people who have a bad time with global vfs locks.

I'm not denying that we need to do work here - I'm questioning
the "change everything at once" approach this patch set takes.
You've started from the assumption that everything the dcache_lock
and inode_lock protect is a problem and gone from there.

However, if we move some things out from under the dcache lock, then
the pressure on the lock goes down and the remaining operations may
not hinder scalability. That's what I'm trying to understand, and
why I'm suggesting that you need to break this down into smaller,
more easily verifiable, benchmarked patch sets. IMO, I have no way of
verifying if any of these patches are necessary or not, and I need
to understand that as part of reviewing them...

> > > *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> > > vanilla vfs
> > > real 0m4.911s 0m0.183s
> > > user 0m1.920s 0m1.610s
> > > sys 4m58.670s 0m5.770s
> > > After the vfs patches, a 26x increase in throughput; however, parallelism is
> > > limited by the test's spawning and exit phases. The sys time shows closer to a
> > > 50x improvement. Vanilla is bottlenecked on dcache_lock.
> >
> > So if we cherry pick patches out of the series, what is the bare
> > minimum set needed to obtain a result in this ballpark? Same for the
> > other tests?
>
> Well it's very hard to just scale up bits and pieces because the
> dcache_lock is currently basically global (except for d_flags and
> some cases of d_count manipulations).
>
> Start chipping away at bits and pieces of it as people hit bottlenecks
> and I think it will end in a bigger mess than we have now.

I'm not suggesting that we should do this randomly. A more
structured approach that demonstrates the improvement as groups of
changes are made will help us evaluate the changes more effectively.
It may be that we need every single change in the patch series, but
there is no way we can verify that with the information that has
been provided.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-01 08:20:35

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Thu, Jul 01, 2010 at 01:56:57PM +1000, Dave Chinner wrote:
> On Wed, Jun 30, 2010 at 10:40:49PM +1000, Nick Piggin wrote:
> > On Wed, Jun 30, 2010 at 09:30:54PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 24, 2010 at 01:02:12PM +1000, [email protected] wrote:
> > > > Performance:
> > > > Last time I was testing on a 32-node Altix which could be considered as not a
> > > > sweet-spot for Linux performance target (ie. improvements there may not justify
> > > > complexity). So recently I've been testing with a tightly interconnected
> > > > 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> > > > system.
> > >
> > > Sure, but I have to question how much of this is actually necessary?
> > > A lot of it looks like scalability for scalability's sake, not
> > > because there is a demonstrated need...
> >
> > People are complaining about vfs scalability already (at least Intel,
> > Google, IBM, and networking people). By the time people start shouting,
> > it's too late because it will take years to get the patches merged. I'm
> > not counting -rt people who have a bad time with global vfs locks.
>
> I'm not denying that we need to do work here - I'm questioning
> the "change everything at once" approach this patch set takes.
> You've started from the assumption that everything the dcache_lock
> and inode_lock protect is a problem and gone from there.
>
> However, if we move some things out from under the dcache lock, then
> the pressure on the lock goes down and the remaining operations may
> not hinder scalability. That's what I'm trying to understand, and
> why I'm suggesting that you need to break this down into smaller,
> more easily verifiable, benchmarked patch sets. IMO, I have no way of
> verifying if any of these patches are necessary or not, and I need
> to understand that as part of reviewing them...

I can see where you're coming from, and I tried to do that, but it
got pretty hard and messy. Also, it was pretty difficult to lift the
dcache and inode locks out of many paths unless *everything* else
was protected by other locks. It is also hard not to introduce more
atomic operations and slow down single-threaded performance.

It's not so much the lock hold times as the cacheline bouncing that
hurts most. So when adding or removing a dentry for example, we
manipulate hash, lru, inode alias, parent, and the fields in the
dentry itself. If you have to take the dcache_lock for any of those
manipulations, you incur the global cacheline bounce for that operation.

Honestly, I like the way the locking turned out. In dcache.c, inode.c
and fs-writeback.c it is complex, but it always has been. For
filesystems I would say it is simpler.

Need to stabilize a dentry? Take dentry->d_lock. This freezes all its
fields and its refcount, pins it in (or out of) data structures, and
pins its immediate parent and children and the inode it points to. The
same goes for inodes.

The rest of the data structures they may belong to (hash, lru, io lists,
inode alias lists, etc.) are protected by individual, narrow locks.
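A small sketch of that rule in practice; dcache_lru_lock stands in for one of
those narrow per-structure locks, and the exact names are assumptions:

#include <linux/dcache.h>
#include <linux/spinlock.h>
#include <linux/list.h>

static DEFINE_SPINLOCK(dcache_lru_lock);	/* protects the dentry LRU only */

static void dentry_lru_del(struct dentry *dentry)
{
	/* d_lock stabilizes the dentry itself: fields, refcount, membership. */
	spin_lock(&dentry->d_lock);
	if (!list_empty(&dentry->d_lru)) {
		/* ...while the list it sits on has its own narrow lock. */
		spin_lock(&dcache_lru_lock);
		list_del_init(&dentry->d_lru);
		spin_unlock(&dcache_lru_lock);
	}
	spin_unlock(&dentry->d_lock);
}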


> > > > *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> > > > vanilla vfs
> > > > real 0m4.911s 0m0.183s
> > > > user 0m1.920s 0m1.610s
> > > > sys 4m58.670s 0m5.770s
> > > > After the vfs patches, a 26x increase in throughput; however, parallelism is
> > > > limited by the test's spawning and exit phases. The sys time shows closer to a
> > > > 50x improvement. Vanilla is bottlenecked on dcache_lock.
> > >
> > > So if we cherry pick patches out of the series, what is the bare
> > > minimum set needed to obtain a result in this ballpark? Same for the
> > > other tests?
> >
> > Well it's very hard to just scale up bits and pieces because the
> > dcache_lock is currently basically global (except for d_flags and
> > some cases of d_count manipulations).
> >
> > Start chipping away at bits and pieces of it as people hit bottlenecks
> > and I think it will end in a bigger mess than we have now.
>
> I'm not suggesting that we should do this randomly. A more
> structured approach that demonstrates the improvement as groups of
> changes are made will help us evaluate the changes more effectively.
> It may be that we need every single change in the patch series, but
> there is no way we can verify that with the information that has
> been provided.

I didn't say randomly, but piece-wise: reducing locks bit by bit as
problems are quantified. Doing that means that all the code has to go
through *far more* locking-scheme transitions, and it's harder to come to a
clean overall end result.

2010-07-01 17:23:34

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Wed, Jun 30, 2010 at 10:40:49PM +1000, Nick Piggin wrote:
> But actually it's not all for scalability. I have some follow-on patches
> (that require RCU inodes, among other things) that actually improve
> single-threaded performance significantly. The git diff workload IIRC was
> several % improved by speeding up stat(2).

I rewrote the store-free path walk patch that goes on top of this
patchset (it's now much cleaner and more optimised, I'll post a patch
soonish). It is quicker than I remembered.

A single thread running stat(2) in a loop on a file "./file" has the
following cost (on an 2s8c Barcelona):

2.6.35-rc3 595 ns/op
patched 336 ns/op

stat(2) takes 56% of the time with the patches. It's something like 13 fewer
atomic operations per syscall.

What's that good for? A single threaded, cached `git diff` on the linux
kernel tree takes just 81% of the time after the vfs patches (0.27s vs
0.33s).

2010-07-01 17:28:44

by Andi Kleen

Subject: Re: [patch 00/52] vfs scalability patches updated

Nick Piggin <[email protected]> writes:
>
> What's that good for? A single threaded, cached `git diff` on the linux
> kernel tree takes just 81% of the time after the vfs patches (0.27s vs
> 0.33s).

That's very cool!

Hopefully we can make some progress on the whole patchkit now.

-Andi

2010-07-01 17:36:23

by Linus Torvalds

Subject: Re: [patch 00/52] vfs scalability patches updated

On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <[email protected]> wrote:
>>
>> That's a pretty big ouch. Why does RCU freeing of inodes cause that
>> much regression? The RCU freeing is out of line, so where does the big
>> impact come from?
>
> That comes mostly from inability to reuse the cache-hot inode structure,
> and the cost to go over the deferred RCU list and free them after they
> get cache cold.

I do wonder if this isn't a big design bug.

Most of the time with RCU, we don't need to wait to actually do the
_freeing_ of the individual data structure, we only need to make sure
that the data structure remains of the same _type_. IOW, we can free
it (and re-use it), but the backing storage cannot be released to the
page cache. That's what SLAB_DESTROY_BY_RCU should give us.

Is that not possible in this situation? Do we really need to keep the
inode _identity_ around for RCU?

If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
cache behavior would be much improved. The usual requirement for
SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
re-validate the identity) in the RCU-reader paths. Could that be made
to work?
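For reference, a rough sketch of that pattern for a hypothetical cache of
"struct thing" objects: the memory always remains a struct thing, but it can
be freed and reused for a different thing at any moment, so readers take the
object lock and re-check identity (none of this is from the vfs patches):

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/rculist.h>

struct thing {
	spinlock_t		lock;	/* initialised once, in the ctor */
	unsigned long		key;
	struct hlist_node	hash;
};

static struct kmem_cache *thing_cachep;

static void thing_init_once(void *p)
{
	struct thing *t = p;

	spin_lock_init(&t->lock);
	INIT_HLIST_NODE(&t->hash);
}

static int __init thing_cache_init(void)
{
	thing_cachep = kmem_cache_create("thing", sizeof(struct thing), 0,
					 SLAB_DESTROY_BY_RCU, thing_init_once);
	return thing_cachep ? 0 : -ENOMEM;
}

static struct thing *thing_lookup(struct hlist_head *bucket, unsigned long key)
{
	struct hlist_node *pos;
	struct thing *t;

	rcu_read_lock();
	hlist_for_each_entry_rcu(t, pos, bucket, hash) {
		spin_lock(&t->lock);
		/*
		 * The object may have been freed and reused for another key
		 * since we loaded the pointer; the check re-validates its
		 * identity.  (If reuse can move objects between hash chains,
		 * the walk also needs a restart, e.g. via hlist_nulls.)
		 */
		if (t->key == key) {
			rcu_read_unlock();
			return t;	/* returned with t->lock held */
		}
		spin_unlock(&t->lock);
	}
	rcu_read_unlock();
	return NULL;
}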

Because that 27% drop really is pretty distressing.

That said, open (of the non-creating kind), close, and stat are
certainly more important than creating and freeing files. So as a
trade-off, it's probably the right thing to do. But if we can get all
the improvement _without_ that big downside, that would obviously be
better yet.

Linus

2010-07-01 17:36:49

by Andi Kleen

Subject: Re: [patch 00/52] vfs scalability patches updated

Dave Chinner <[email protected]> writes:
>
> I'm not denying that we need to do work here - I'm questioning
> the "change everything at once" approach this patch set takes.
> You've started from the assumption that everything the dcache_lock
> and inode_lock protect is a problem and gone from there.

Global code locks in a core subsystem are definitely a problem.

In many ways they're as bad as a BKL. There will always be
workloads where they hurt. They are bad coding style.
They just have to go.

I don't understand how anyone can even defend them.

Especially bad are code locks that protect lots of different
things. Those are not only bad for scalability, but also
bad for maintainability, because few people can really
understand them. With smaller, well-defined locks
that's usually easier.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-01 17:52:48

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Thu, Jul 01, 2010 at 10:35:35AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <[email protected]> wrote:
> >>
> >> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> >> much regression? The RCU freeing is out of line, so where does the big
> >> impact come from?
> >
> > That comes mostly from inability to reuse the cache-hot inode structure,
> > and the cost to go over the deferred RCU list and free them after they
> > get cache cold.
>
> I do wonder if this isn't a big design bug.

It's possible, yes. Although a lot of that drop does come from
hitting RCU and overrunning slab allocator queues. It was, what,
closer to 10% when doing small numbers of creat/unlink loops.


> Most of the time with RCU, we don't need to wait to actually do the
> _freeing_ of the individual data structure, we only need to make sure
> that the data structure remains of the same _type_. IOW, we can free
> it (and re-use it), but the backing storage cannot be released to the
> page cache. That's what SLAB_DESTROY_BY_RCU should give us.
>
> Is that not possible in this situation? Do we really need to keep the
> inode _identity_ around for RCU?
>
> If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
> cache behavior would be much improved. The usual requirement for
> SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
> re-validate the identity) in the RCU-reader paths. Could that be made
> to work?

I definitely thought of that. I actually thought it would not
be possible with the store-free path walk patches though, because
we need to check some inode properties (eg. permission). So I was
thinking the usual approach of taking a per-entry lock defeats
the whole purpose of store-free path walk.

But you've got me thinking about it again, and it should be possible to
do using just the dentry seqlock. IOW, if the inode gets disconnected
from the dentry (and can then possibly get freed and reused), then just
retry the lookup.
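Roughly, a sketch of that (d_seq being the per-dentry seqcount from the
store-free path walk patches, the helper name made up, and inodes assumed to
be type-stable so the lockless read is at least memory-safe):

#include <linux/dcache.h>
#include <linux/seqlock.h>

static int inode_exec_ok_rcu(struct dentry *dentry)
{
	struct inode *inode;
	unsigned seq;
	int ok;

	do {
		seq = read_seqcount_begin(&dentry->d_seq);
		inode = dentry->d_inode;
		/*
		 * The inode may be disconnected, freed and reused at any
		 * time; only trust the result if d_seq did not change,
		 * otherwise redo it (or drop back to the locked path).
		 */
		ok = inode && (inode->i_mode & S_IXUGO);
	} while (read_seqcount_retry(&dentry->d_seq, seq));

	return ok;
}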

It may be a little tricky. I'll wait until the path-walk code is
more polished first.

>
> Because that 27% drop really is pretty distressing.
>
> That said, open (of the non-creating kind), close, and stat are
> certainly more important than creating and freeing files. So as a
> trade-off, it's probably the right thing to do. But if we can get all
> the improvement _without_ that big downside, that would obviously be
> better yet.

We actually have bigger regressions than that for other code
paths. The RCU freeing of files structs causes a similar regression,
about 20-30%, in open/close.

I actually have a (proper) patch to make that use DESTROY_BY_RCU
too. It actually slows down fd lookup by a tiny bit, though
(lock, load, branch, increment, unlock versus atomic inc). But
same number of atomic ops.
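For reference, a rough sketch of that tradeoff, assuming struct file were made
type-stable with DESTROY_BY_RCU; fget_sketch() is a made-up name, the re-check
under f_lock is the "lock, load, branch, increment, unlock" sequence, and the
existing identity-safe version is essentially one atomic_long_inc_not_zero():

#include <linux/fdtable.h>
#include <linux/fs.h>

static struct file *fget_sketch(struct files_struct *files, unsigned int fd)
{
	struct file *file;

	rcu_read_lock();
	file = fcheck_files(files, fd);			/* RCU table lookup */
	if (file) {
		spin_lock(&file->f_lock);		/* lock */
		if (fcheck_files(files, fd) != file) {	/* load + branch */
			/* slot was reused; the struct may be another file now */
			spin_unlock(&file->f_lock);
			file = NULL;
		} else {
			atomic_long_inc(&file->f_count);/* increment */
			spin_unlock(&file->f_lock);	/* unlock */
		}
	}
	rcu_read_unlock();
	return file;
}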

2010-07-02 04:01:58

by Paul E. McKenney

Subject: Re: [patch 00/52] vfs scalability patches updated

On Thu, Jul 01, 2010 at 10:35:35AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2010 at 5:40 AM, Nick Piggin <[email protected]> wrote:
> >>
> >> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> >> much regression? The RCU freeing is out of line, so where does the big
> >> impact come from?
> >
> > That comes mostly from inability to reuse the cache-hot inode structure,
> > and the cost to go over the deferred RCU list and free them after they
> > get cache cold.
>
> I do wonder if this isn't a big design bug.
>
> Most of the time with RCU, we don't need to wait to actually do the
> _freeing_ of the individual data structure, we only need to make sure
> that the data structure remains of the same _type_. IOW, we can free
> it (and re-use it), but the backing storage cannot be released to the
> page cache. That's what SLAB_DESTROY_BY_RCU should give us.
>
> Is that not possible in this situation? Do we really need to keep the
> inode _identity_ around for RCU?

In this case, the workload can be very update-heavy, so this type-safe
(vs. identity-safe) approach indeed makes a lot of sense. But if this
was a read-heavy situation (think SELinux or many areas in networking),
the read-side simplifications and speedups that often come with
identity safety would probably more than make up for the occasional
grace-period-induced cache miss.

So, as a -very- rough rule of thumb, when less than a few percent
of the accesses are updates, you most likely want identity safety.
If more than half of the accesses can be updates, you probably want
SLAB_DESTROY_BY_RCU-style type safety instead -- or maybe just straight
locking. If you are somewhere in between, pick one randomly; if
it works, go with it; otherwise try something else. ;-)

In this situation, a create/rename/delete workload would be quite update
heavy, so, as you say, SLAB_DESTROY_BY_RCU is well worth looking into.

Thanx, Paul

> If you use just SLAB_DESTROY_BY_RCU, then inode re-use remains, and
> cache behavior would be much improved. The usual requirement for
> SLAB_DESTROY_BY_RCU is that you only touch a lock (and perhaps
> re-validate the identity) in the RCU-reader paths. Could that be made
> to work?
>
> Because that 27% drop really is pretty distressing.
>
> That said, open (of the non-creating kind), close, and stat are
> certainly more important than creating and freeing files. So as a
> trade-off, it's probably the right thing to do. But if we can get all
> the improvement _without_ that big downside, that would obviously be
> better yet.
>
> Linus

2010-07-06 17:49:45

by Nick Piggin

Subject: Re: [patch 00/52] vfs scalability patches updated

On Fri, Jul 02, 2010 at 03:23:17AM +1000, Nick Piggin wrote:
> On Wed, Jun 30, 2010 at 10:40:49PM +1000, Nick Piggin wrote:
> > But actually it's not all for scalability. I have some follow-on patches
> > (that require RCU inodes, among other things) that actually improve
> > single-threaded performance significantly. The git diff workload IIRC was
> > several % improved by speeding up stat(2).
>
> I rewrote the store-free path walk patch that goes on top of this
> patchset (it's now much cleaner and more optimised, I'll post a patch
> soonish). It is quicker than I remembered.
>
> A single thread running stat(2) in a loop on a file "./file" has the
> following cost (on a 2s8c Barcelona):
>
> 2.6.35-rc3 595 ns/op
> patched 336 ns/op
>
> stat(2) takes 56% of the time with the patches. It's something like 13 fewer
> atomic operations per syscall.
>
> What's that good for? A single threaded, cached `git diff` on the linux
> kernel tree takes just 81% of the time after the vfs patches (0.27s vs
> 0.33s).

At the other end of the scale, I tried dbench on ramfs on the little
32n64c Altix. Dbench actually has the statfs() call completely removed from
the workload here -- it's still a little problematic, and patched-kernel
throughput is ~halved with statfs() included.

dbench procs 1 64
2.6.35-rc3 235MB/s 95MB/s ( 0.6% scaling)
patched 245MB/s 14870MB/s (94.8% scaling)

(note all these numbers are with store-free path walking patches on top
of the posted patchset -- dbench procs do path walking from common cwds
so it will never scale this well if we have to take refcounts on common
dentries)

Thanks,
Nick