2009-10-16 20:07:28

by john stultz

Subject: -rt dbench scalability issue

See http://lwn.net/Articles/354690/ for a bit of background here.

I've been looking at scalability regressions in the -rt kernel. One easy
place to see regressions is with the dbench benchmark. While dbench can
be painfully noisy from run to run, it does clearly show some severe
regressions with -rt.

There's a chart in the article above that illustrates this, but here
are some specific numbers from an 8-way box running dbench-3.04 as follows:

./dbench 8 -t 10 -D . -c client.txt 2>&1

I ran both on an ext3 disk and a ramfs mounted directory.

(Again, the numbers are VERY rough due to the run-to-run variance seen)

                    ext3            ramfs
2.6.32-rc3:         ~1800 MB/sec    ~1600 MB/sec
2.6.31.2-rt13:      ~300 MB/sec     ~66 MB/sec

Ouch. Similar to the charts in the LWN article.

Dino pointed out that using lockstat with -rt, we can see the
dcache_lock is fairly hot with the -rt kernel. One of the issues with
the -rt tree is that the change from spinlocks to sleeping spinlocks
doesn't affect the uncontended case very much, but when there is
contention on the lock, the overhead is much worse than with vanilla.
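
As a rough userspace analogy of that contention cost (purely
illustrative; this is pthreads, not the -rt locking code, and the
thread/iteration counts are arbitrary): a contended spinning lock just
burns cycles, while a contended sleeping lock takes a trip through the
kernel's futex/scheduler path, much like the rtmutex slowpath that
shows up in the profiles below.

/* lock_demo.c: gcc -O2 -pthread lock_demo.c -o lock_demo */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
#define NTHREADS 4

static pthread_spinlock_t spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *spin_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_spin_lock(&spin);	/* contended: busy-waits */
		counter++;
		pthread_spin_unlock(&spin);
	}
	return NULL;
}

static void *mutex_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&mutex);	/* contended: sleeps via futex */
		counter++;
		pthread_mutex_unlock(&mutex);
	}
	return NULL;
}

static double run(void *(*fn)(void *))
{
	pthread_t t[NTHREADS];
	struct timespec a, b;

	counter = 0;
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, fn, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
	printf("spinlock: %.3fs\n", run(spin_worker));
	printf("mutex:    %.3fs\n", run(mutex_worker));
	return 0;
}

In a quick run you would typically expect the mutex pass to take
noticeably longer wall time for the same work, which is the same shape
as the -rt regression: similar cost uncontended, much worse contended.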

And as noted at the realtime mini-conf, Ingo saw this dcache_lock
bottleneck as well and suggested trying Nick Piggin's dcache_lock
removal patches.

So over the last week, I've ported Nick's fs-scale patches to -rt.

Specifically the tarball found here:
ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/06102009.tar.gz


Due to the 2.6.32 / 2.6.31-rt split, the port wasn't exactly
straightforward, but I believe I managed to do a decent job. Once I had
the patchset applied, built and booted, I eagerly ran dbench to see the
new results, aaaaaand.....

                    ext3            ramfs
2.6.31.2-rt13-nick: ~80 MB/sec      ~126 MB/sec


So yeah, a mixed bag there. The ramfs numbers got a little bit better,
but not that much, and the ext3 numbers regressed further.

I then looked into the perf tool to see if it would shed some light on
what's going on (snipped results below).

2.6.31.2-rt13 on ext3:
42.45% dbench [kernel] [k] _atomic_spin_lock_irqsave
|
|--85.61%-- rt_spin_lock_slowlock
| rt_spin_lock
| |
| |--23.91%-- start_this_handle
| | journal_start
| | ext3_journal_start_sb
| |--21.29%-- journal_stop
| |
| |--13.80%-- ext3_test_allocatable
| |
| |--12.15%-- bitmap_search_next_usable_block
| |
| |--9.79%-- journal_put_journal_head
| |
| |--5.93%-- journal_add_journal_head
| |
| |--2.59%-- atomic_dec_and_spin_lock
| | dput
| | |
| | |--65.31%-- path_put
| | | |
| | | |--53.37%-- __link_path_walk
...

So this is initially interesting: on ext3 it seems the journal locking
is really what's catching us, more than the dcache_lock.
Am I reading this right?


2.6.31.2-rt13 on ramfs:
45.98% dbench [kernel] [k] _atomic_spin_lock_irqsave
|
|--82.94%-- rt_spin_lock_slowlock
| rt_spin_lock
| |
| |--61.18%-- dcache_readdir
| | vfs_readdir
| | sys_getdents
| | system_call_fastpath
| | __getdents64
| |
| |--11.26%-- atomic_dec_and_spin_lock
| | dput
| |
| |--7.93%-- d_path
| | seq_path
| | show_vfsmnt
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI___libc_read
| |


So here we do see dcache_readdir's use of the dcache_lock pop up to
the top. And with ramfs we don't see any of the ext3 journal code.

Next up is with Nick's patchset:

2.6.31.2-rt13-nick on ext3:
45.48% dbench [kernel] [k] _atomic_spin_lock_irqsave
|
|--83.40%-- rt_spin_lock_slowlock
| |
| |--100.00%-- rt_spin_lock
| | |
| | |--43.35%-- dput
| | | |
| | | |--50.29%-- __link_path_walk
| | | --49.71%-- path_put
| | |--39.07%-- path_get
| | | |
| | | |--61.98%-- path_walk
| | | |--38.01%-- path_init
| | |
| | |--7.33%-- journal_put_journal_head
| | |
| | |--4.32%-- journal_add_journal_head
| | |
| | |--2.83%-- start_this_handle
| | | journal_start
| | | ext3_journal_start_sb
| | |
| | |--2.52%-- journal_stop
|
|--15.87%-- rt_spin_lock_slowunlock
| rt_spin_unlock
| |
| |--43.48%-- path_get
| |
| |--41.80%-- dput
| |
| |--5.34%-- journal_add_journal_head
...

With Nick's patches on ext3, it seems dput()'s locking is the bottleneck
more than the journal code (maybe due to the multiple spinning nested
trylocks?).

With the ramfs, it looks mostly the same, but without the journal calls:

2.6.31.2-rt13-nick on ramfs:
46.51% dbench [kernel] [k] _atomic_spin_lock_irqsave
|
|--86.95%-- rt_spin_lock_slowlock
| rt_spin_lock
| |
| |--50.08%-- dput
| | |
| | |--56.92%-- __link_path_walk
| | |
| | --43.08%-- path_put
| |
| |--49.12%-- path_get
| | |
| | |--63.22%-- path_walk
| | |
| | |--36.73%-- path_init
|
|--12.59%-- rt_spin_lock_slowunlock
| rt_spin_unlock
| |
| |--49.86%-- path_get
| | |
| | |--58.15%-- path_init
| | | |
...


So the net of this is: Nick's patches helped some, but not that much,
on ramfs, and hurt ext3 performance w/ -rt.

Maybe I just mis-applied the patches? I'll admit I'm unfamiliar with the
dcache code, and converting the patches to the -rt tree was not always
straightforward.

Or maybe these results are expected? With Nick's patch against
2.6.32-rc3 I got:

                    ext3            ramfs
2.6.32-rc3-nick:    ~1800 MB/sec    ~2200 MB/sec

So ext3 performance didn't change, but ramfs did see a nice bump. Maybe
Nick's patches helped where they could, but we still have other
contention points that are problematic with -rt's lock slowpath
overhead?


Ingo, Nick, Thomas: Any thoughts or comments here? Am I reading perf's
results incorrectly? Any idea why, with Nick's patch, the contention in
dput() hurts ext3 so much worse than in the ramfs case?


I'll be doing some further tests today w/ ext2 to see if getting the
journal code out of the way shows any benefit. But if folks have any
insight or suggestions for other ideas to look at please let me know.

thanks
-john


2009-10-17 00:45:06

by Paul E. McKenney

Subject: Re: -rt dbench scalability issue

On Fri, Oct 16, 2009 at 01:05:19PM -0700, john stultz wrote:
> See http://lwn.net/Articles/354690/ for a bit of background here.
>
> I've been looking at scalability regressions in the -rt kernel. One easy
> place to see regressions is with the dbench benchmark. While dbench can
> be painfully noisy from run to run, it does clearly show some severe
> regressions with -rt.
>
> There's a chart in the article above that illustrates this, but here's
> some specific numbers on an 8-way box running dbench-3.04 as follows:
>
> ./dbench 8 -t 10 -D . -c client.txt 2>&1
>
> I ran both on an ext3 disk and a ramfs mounted directory.
>
> (Again, the numbers are VERY rough due to the run-to-run variance seen)
>
> ext3 ramfs
> 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
>
> Ouch. Similar to the charts in the LWN article.
>
> Dino pointed out that using lockstat with -rt, we can see the
> dcache_lock is fairly hot with the -rt kernel. One of the issues with
> the -rt tree is that the change from spinlocks to sleeping-spinlocks
> doesn't effect the un-contended case very much, but when there is
> contention on the lock, the overhead is much worse then with vanilla.
>
> And as noted at the realtime mini-conf, Ingo saw this dcache_lock
> bottleneck as well and suggested trying Nick Piggin's dcache_lock
> removal patches.
>
> So over the last week, I've ported Nick's fs-scale patches to -rt.
>
> Specifically the tarball found here:
> ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/06102009.tar.gz
>
>
> Due to the 2.6.32 2.6.31-rt split, the port wasn't exactly straight
> forward, but I believe I managed to do a decent job. Once I had the
> patchset applied, building and booted, I eagerly ran dbench to see the
> new results, aaaaaand.....
>
> ext3 ramfs
> 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
>
>
> So yea, mixed bag there. The ramfs got a little bit better but not that
> much, and the ext3 numbers regressed further.

OK, I will ask the stupid question... What happens if you run on ext2?

Thanx, Paul

> I then looked into the perf tool, to see if it would shed some light on
> whats going on (snipped results below).
>
> 2.6.31.2-rt13 on ext3:
> 42.45% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--85.61%-- rt_spin_lock_slowlock
> | rt_spin_lock
> | |
> | |--23.91%-- start_this_handle
> | | journal_start
> | | ext3_journal_start_sb
> | |--21.29%-- journal_stop
> | |
> | |--13.80%-- ext3_test_allocatable
> | |
> | |--12.15%-- bitmap_search_next_usable_block
> | |
> | |--9.79%-- journal_put_journal_head
> | |
> | |--5.93%-- journal_add_journal_head
> | |
> | |--2.59%-- atomic_dec_and_spin_lock
> | | dput
> | | |
> | | |--65.31%-- path_put
> | | | |
> | | | |--53.37%-- __link_path_walk
> ...
>
> So this is initially interesting, as it seems on ext3 it seems the
> journal locking is really whats catching us more then the dcache_lock.
> Am I reading this right?
>
>
> 2.6.31.2-rt13 on ramfs:
> 45.98% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--82.94%-- rt_spin_lock_slowlock
> | rt_spin_lock
> | |
> | |--61.18%-- dcache_readdir
> | | vfs_readdir
> | | sys_getdents
> | | system_call_fastpath
> | | __getdents64
> | |
> | |--11.26%-- atomic_dec_and_spin_lock
> | | dput
> | |
> | |--7.93%-- d_path
> | | seq_path
> | | show_vfsmnt
> | | seq_read
> | | vfs_read
> | | sys_read
> | | system_call_fastpath
> | | __GI___libc_read
> | |
>
>
> So here we do see the dcache_readdir's use of the dcache lock pop up to
> the top. And with ramfs we don't see any of the ext3 journal code.
>
> Next up is with Nick's patchset:
>
> 2.6.31.2-rt13-nick on ext3:
> 45.48% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--83.40%-- rt_spin_lock_slowlock
> | |
> | |--100.00%-- rt_spin_lock
> | | |
> | | |--43.35%-- dput
> | | | |
> | | | |--50.29%-- __link_path_walk
> | | | --49.71%-- path_put
> | | |--39.07%-- path_get
> | | | |
> | | | |--61.98%-- path_walk
> | | | |--38.01%-- path_init
> | | |
> | | |--7.33%-- journal_put_journal_head
> | | |
> | | |--4.32%-- journal_add_journal_head
> | | |
> | | |--2.83%-- start_this_handle
> | | | journal_start
> | | | ext3_journal_start_sb
> | | |
> | | |--2.52%-- journal_stop
> |
> |--15.87%-- rt_spin_lock_slowunlock
> | rt_spin_unlock
> | |
> | |--43.48%-- path_get
> | |
> | |--41.80%-- dput
> | |
> | |--5.34%-- journal_add_journal_head
> ...
>
> With Nick's patches on ext3, it seems dput()'s locking is the bottleneck
> more then the journal code (maybe due to the multiple spinning nested
> trylocks?).
>
> With the ramfs, it looks mostly the same, but without the journal calls:
>
> 2.6.31.2-rt13-nick on ramfs:
> 46.51% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--86.95%-- rt_spin_lock_slowlock
> | rt_spin_lock
> | |
> | |--50.08%-- dput
> | | |
> | | |--56.92%-- __link_path_walk
> | | |
> | | --43.08%-- path_put
> | |
> | |--49.12%-- path_get
> | | |
> | | |--63.22%-- path_walk
> | | |
> | | |--36.73%-- path_init
> |
> |--12.59%-- rt_spin_lock_slowunlock
> | rt_spin_unlock
> | |
> | |--49.86%-- path_get
> | | |
> | | |--58.15%-- path_init
> | | | |
> ...
>
>
> So the net of this is: Nick's patches helped some but not that much in
> ramfs filesystems, and hurt ext3 performance w/ -rt.
>
> Maybe I just mis-applied the patches? I'll admit I'm unfamiliar with the
> dcache code, and converting the patches to the -rt tree was not always
> straight forward.
>
> Or maybe these results are expected? With Nick's patch against
> 2.6.32-rc3 I got:
>
> ext3 ramfs
> 2.6.32-rc3-nick ~1800 MB/sec ~2200 MB/sec
>
> So ext3 performance didn't change, but ramfs did see a nice bump. Maybe
> Nick's patches helped where they could, but we still have other
> contention points that are problematic with -rt's lock slowpath
> overhead?
>
>
> Ingo, Nick, Thomas: Any thoughts or comments here? Am I reading perf's
> results incorrectly? Any idea why with Nick's patch the contention in
> dput() hurts ext3 so much worse then in the ramfs case?
>
>
> I'll be doing some further tests today w/ ext2 to see if getting the
> journal code out of the way shows any benefit. But if folks have any
> insight or suggestions for other ideas to look at please let me know.
>
> thanks
> -john
>

2009-10-17 01:03:54

by john stultz

Subject: Re: -rt dbench scalability issue

On Fri, 2009-10-16 at 17:45 -0700, Paul E. McKenney wrote:
> On Fri, Oct 16, 2009 at 01:05:19PM -0700, john stultz wrote:
> > See http://lwn.net/Articles/354690/ for a bit of background here.
> >
> > I've been looking at scalability regressions in the -rt kernel. One easy
> > place to see regressions is with the dbench benchmark. While dbench can
> > be painfully noisy from run to run, it does clearly show some severe
> > regressions with -rt.
> >
> > There's a chart in the article above that illustrates this, but here's
> > some specific numbers on an 8-way box running dbench-3.04 as follows:
> >
> > ./dbench 8 -t 10 -D . -c client.txt 2>&1
> >
> > I ran both on an ext3 disk and a ramfs mounted directory.
> >
> > (Again, the numbers are VERY rough due to the run-to-run variance seen)
> >
> > ext3 ramfs
> > 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> > 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
> >
> > Ouch. Similar to the charts in the LWN article.
> >
> > Dino pointed out that using lockstat with -rt, we can see the
> > dcache_lock is fairly hot with the -rt kernel. One of the issues with
> > the -rt tree is that the change from spinlocks to sleeping-spinlocks
> > doesn't effect the un-contended case very much, but when there is
> > contention on the lock, the overhead is much worse then with vanilla.
> >
> > And as noted at the realtime mini-conf, Ingo saw this dcache_lock
> > bottleneck as well and suggested trying Nick Piggin's dcache_lock
> > removal patches.
> >
> > So over the last week, I've ported Nick's fs-scale patches to -rt.
> >
> > Specifically the tarball found here:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/06102009.tar.gz
> >
> >
> > Due to the 2.6.32 2.6.31-rt split, the port wasn't exactly straight
> > forward, but I believe I managed to do a decent job. Once I had the
> > patchset applied, building and booted, I eagerly ran dbench to see the
> > new results, aaaaaand.....
> >
> > ext3 ramfs
> > 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
> >
> >
> > So yea, mixed bag there. The ramfs got a little bit better but not that
> > much, and the ext3 numbers regressed further.
>
> OK, I will ask the stupid question... What happens if you run on ext2?

Yep. That was next on my list. Basically it's faster, but the regressions
are similar percentage-wise with each patchset.

                    ext3            ext2
2.6.32-rc3:         ~1800 MB/sec    ~2900 MB/sec
2.6.31.2-rt13:      ~300 MB/sec     ~600 MB/sec
2.6.31.2-rt13-nick: ~80 MB/sec      ~130 MB/sec

thanks
-john

2009-10-17 01:37:42

by john stultz

Subject: Re: -rt dbench scalability issue

On Fri, 2009-10-16 at 18:03 -0700, john stultz wrote:
> On Fri, 2009-10-16 at 17:45 -0700, Paul E. McKenney wrote:
> > On Fri, Oct 16, 2009 at 01:05:19PM -0700, john stultz wrote:
> > > See http://lwn.net/Articles/354690/ for a bit of background here.
> > >
> > > I've been looking at scalability regressions in the -rt kernel. One easy
> > > place to see regressions is with the dbench benchmark. While dbench can
> > > be painfully noisy from run to run, it does clearly show some severe
> > > regressions with -rt.
> > >
> > > There's a chart in the article above that illustrates this, but here's
> > > some specific numbers on an 8-way box running dbench-3.04 as follows:
> > >
> > > ./dbench 8 -t 10 -D . -c client.txt 2>&1
> > >
> > > I ran both on an ext3 disk and a ramfs mounted directory.
> > >
> > > (Again, the numbers are VERY rough due to the run-to-run variance seen)
> > >
> > > ext3 ramfs
> > > 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> > > 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
> > >
> > > Ouch. Similar to the charts in the LWN article.
> > >
> > > Dino pointed out that using lockstat with -rt, we can see the
> > > dcache_lock is fairly hot with the -rt kernel. One of the issues with
> > > the -rt tree is that the change from spinlocks to sleeping-spinlocks
> > > doesn't effect the un-contended case very much, but when there is
> > > contention on the lock, the overhead is much worse then with vanilla.
> > >
> > > And as noted at the realtime mini-conf, Ingo saw this dcache_lock
> > > bottleneck as well and suggested trying Nick Piggin's dcache_lock
> > > removal patches.
> > >
> > > So over the last week, I've ported Nick's fs-scale patches to -rt.
> > >
> > > Specifically the tarball found here:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/06102009.tar.gz
> > >
> > >
> > > Due to the 2.6.32 2.6.31-rt split, the port wasn't exactly straight
> > > forward, but I believe I managed to do a decent job. Once I had the
> > > patchset applied, building and booted, I eagerly ran dbench to see the
> > > new results, aaaaaand.....
> > >
> > > ext3 ramfs
> > > 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
> > >
> > >
> > > So yea, mixed bag there. The ramfs got a little bit better but not that
> > > much, and the ext3 numbers regressed further.
> >
> > OK, I will ask the stupid question... What happens if you run on ext2?
>
> Yep. That was next on my list. Basically its faster, but the regressions
> are similar % wise with each patchset.
>
> ext3 ext2
> 2.6.32-rc3: ~1800 MB/sec ~2900 MB/sec
> 2.6.31.2-rt13: ~300 MB/sec ~600 MB/sec
> 2.6.31.2-rt13-nick: ~80 MB/sec ~130 MB/sec

Additionally, looking at the perf data, it does seem the dcache_lock is
the contention point w/ ext2 on -rt13, but with Nick's patch, the
contention still stays mostly in the dput/path_get functions. So it
seems it's just been moved rather than eased with _my port_ of Nick's
patch (emphasis on "my port", since with Nick's patch against mainline
there is no regression at all). I don't want to drag Nick's patches
through the mud here. :)

thanks
-john

2009-10-17 22:39:05

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Fri, Oct 16, 2009 at 01:05:19PM -0700, john stultz wrote:
> 2.6.31.2-rt13-nick on ramfs:
> 46.51% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--86.95%-- rt_spin_lock_slowlock
> | rt_spin_lock
> | |
> | |--50.08%-- dput
> | | |
> | | |--56.92%-- __link_path_walk
> | | |
> | | --43.08%-- path_put
> | |
> | |--49.12%-- path_get
> | | |
> | | |--63.22%-- path_walk
> | | |
> | | |--36.73%-- path_init
> |
> |--12.59%-- rt_spin_lock_slowunlock
> | rt_spin_unlock
> | |
> | |--49.86%-- path_get
> | | |
> | | |--58.15%-- path_init
> | | | |
> ...
>
>
> So the net of this is: Nick's patches helped some but not that much in
> ramfs filesystems, and hurt ext3 performance w/ -rt.
>
> Maybe I just mis-applied the patches? I'll admit I'm unfamiliar with the
> dcache code, and converting the patches to the -rt tree was not always
> straight forward.

The above are dentry->d_lock, and they are from path walking. It has
become more pronounced because I use d_lock to protect d_count rather
than an atomic_t (which saves on atomic ops).
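
To sketch that trade-off (userspace stand-ins only; these are not the
real fs/dcache.c structures or the patchset itself): with the count
folded under d_lock, a reference bump taken while d_lock is already
held costs no extra atomic op, but every other refcount change now has
to take, and possibly contend on, d_lock; an atomic_t count is the
other way around.

/* dget_sketch.c: gcc -O2 -pthread dget_sketch.c -o dget_sketch */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct dentry_lockcount {		/* patchset-style: count under d_lock */
	pthread_spinlock_t d_lock;
	unsigned int d_count;
};

struct dentry_atomcount {		/* atomic_t-style refcount */
	pthread_spinlock_t d_lock;
	atomic_uint d_count;
};

static void dget_lockcount(struct dentry_lockcount *d)
{
	pthread_spin_lock(&d->d_lock);	/* every ref change hits d_lock */
	d->d_count++;			/* plain increment, no extra atomic */
	pthread_spin_unlock(&d->d_lock);
}

static void dget_atomcount(struct dentry_atomcount *d)
{
	atomic_fetch_add(&d->d_count, 1);	/* one atomic op, no d_lock */
}

int main(void)
{
	struct dentry_lockcount a = { .d_count = 1 };
	struct dentry_atomcount b = { .d_count = 1 };

	pthread_spin_init(&a.d_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&b.d_lock, PTHREAD_PROCESS_PRIVATE);
	dget_lockcount(&a);
	dget_atomcount(&b);
	printf("lock-protected d_count=%u, atomic d_count=%u\n",
	       a.d_count, atomic_load(&b.d_count));
	return 0;
}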

But the patchset you have converted is missing the store-free path walk
patches, which will get rid of most of this. The next thing you hit is
glibc reading /proc/mounts to implement statvfs :( If you turn that call
into statfs you'll get a little further (but we need to improve statfs
support for glibc so it doesn't need those hacks).

And then you run into something else; I'd say d_lock again, for creating
and unlinking things, but I haven't had a chance to profile it yet.

> Ingo, Nick, Thomas: Any thoughts or comments here? Am I reading perf's
> results incorrectly? Any idea why with Nick's patch the contention in
> dput() hurts ext3 so much worse then in the ramfs case?

ext3 may be doing more dentry refcounting, which is hitting the
spinlock. I _could_ be persuaded to turn it back to an atomic_t, however
I want to wait until other things like the path walking are more
mature, which should take a lot of pressure off it.

Also... dbench throughput in exchange for adding an extra atomic at
dput-time is... not a good idea. We would need some more important
workloads, I think (even real samba serving netbench would be
preferable).

2009-10-17 23:06:53

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Fri, Oct 16, 2009 at 06:37:36PM -0700, john stultz wrote:
> Additionally looking at the perf data, it does seem the dcache_lock is
> the contention point w/ ext2 on -rt13, but with Nick's patch, the
> contention still stays mostly in the dput/path_get functions. So it
> seems its just been moved rather then eased with _my port_ of Nick's
> patch (emphasis on "my port", since with nick's patch against mainline
> there is no regression at all.. I don't want to drag Nick's patches
> through the mud here :)

No, there is nothing to suggest you've done the wrong thing. The dentry
refcounting issue was always a known one, and yes, looking at the ext3
profiles, ext3/jbd is adding to the refcounting load, so it is not
surprising that it has slowed down. Also, -rt probably has more trouble
than mainline when you start hitting spinlock contention.

But before turning it back into an atomic, we need to see how far
the lock-free path walking stuff goes (and that is not finished
yet either).

2009-11-18 01:28:14

by john stultz

Subject: Re: -rt dbench scalability issue

On Sun, 2009-10-18 at 00:39 +0200, Nick Piggin wrote:
> On Fri, Oct 16, 2009 at 01:05:19PM -0700, john stultz wrote:
> > 2.6.31.2-rt13-nick on ramfs:
> > 46.51% dbench [kernel] [k] _atomic_spin_lock_irqsave
> > |
> > |--86.95%-- rt_spin_lock_slowlock
> > | rt_spin_lock
> > | |
> > | |--50.08%-- dput
> > | | |
> > | | |--56.92%-- __link_path_walk
> > | | |
> > | | --43.08%-- path_put
> > | |
> > | |--49.12%-- path_get
> > | | |
> > | | |--63.22%-- path_walk
> > | | |
> > | | |--36.73%-- path_init
> > |
> > |--12.59%-- rt_spin_lock_slowunlock
> > | rt_spin_unlock
> > | |
> > | |--49.86%-- path_get
> > | | |
> > | | |--58.15%-- path_init
> > | | | |
> > ...
> >
> >
> > So the net of this is: Nick's patches helped some but not that much in
> > ramfs filesystems, and hurt ext3 performance w/ -rt.
> >
> > Maybe I just mis-applied the patches? I'll admit I'm unfamiliar with the
> > dcache code, and converting the patches to the -rt tree was not always
> > straight forward.
>
> The above are dentry->d_lock, and they are rom path walking. It has
> become more pronounced because I use d_lock to protect d_count rather
> than an atomic_t (which saves on atomic ops).
>
> But the patchset you have converted it missing the store-free path wailk
> patches which will get rid of most of this. The next thing you hit is
> glibc reading /proc/mounts to implement statvfs :( If you turn that call
> into statfs you'll get a little further (but we need to improve statfs
> support for glibc so it doesn't need those hacks).
>
> And then you run into something else, I'd say d_lock again for creating
> and unlinking things, but I didn't get a chance to profile it yet.
>
> > Ingo, Nick, Thomas: Any thoughts or comments here? Am I reading perf's
> > results incorrectly? Any idea why with Nick's patch the contention in
> > dput() hurts ext3 so much worse then in the ramfs case?
>
> ext3 may be doing more dentry refcounting which is hitting the spin
> lock. I _could_ be persuaded to turn it back to an atomic_t, however
> I will want to wait until other things like the path walking is more
> mature which should take a lot of pressure off it.
>
> Also... dbench throughput in exchange for adding an extra atomic at
> dput-time is... not a good idea. We would need some more important
> workloads I think (even a real samba serving netbench would be
> preferable).


Hey Nick,
Just an update here: I moved up to your 09102009 patch and spent a
while playing with it.

Just as you theorized, moving d_count back to an atomic_t does seem to
greatly improve the performance on -rt.

Again, very very rough numbers for an 8-way system:

                            ext3            ramfs
2.6.32-rc3:                 ~1800 MB/sec    ~1600 MB/sec
2.6.32-rc3-nick:            ~1800 MB/sec    ~2200 MB/sec
2.6.31.2-rt13:              ~300 MB/sec     ~66 MB/sec
2.6.31.2-rt13-nick:         ~80 MB/sec      ~126 MB/sec
2.6.31.6-rt19-nick+atomic:  ~400 MB/sec     ~2200 MB/sec

From the perf report, all of the dcache-related overhead has fallen
away, and it all seems to be journal-related contention at this point
that's keeping the ext3 numbers down.

So yes, on -rt, the overhead from lock contention is way, way worse than
any extra atomic ops. :)

I'm not totally convinced I did the conversion back to atomic_t's
properly, so I'm doing some stress testing, but I'll hopefully have
something to send out for review soon.

As for your concern about dbench being a poor benchmark here, I'll try
to get some numbers on iozone or another suggested workload and get
those out to you shortly.

thanks
-john

2009-11-18 04:25:13

by Nick Piggin

Subject: Re: -rt dbench scalability issue

Hi John,

Great stuff, thanks for persisting with this. I've been a little busy
with some distro work recently, but I hope to get back to some mainline
projects soon.

On Tue, Nov 17, 2009 at 05:28:16PM -0800, john stultz wrote:
> Hey Nick,
> Just an update here, I moved up to your 09102009 patch, and spent
> awhile playing with it.
>
> Just as you theorized, moving d_count back to an atomic_t does seem to
> greatly improve the performance on -rt.
>
> Again, very very rough numbers for an 8-way system:
>
> ext3 ramfs
> 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> 2.6.32-rc3-nick ~1800 MB/sec ~2200 MB/sec
> 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
> 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
> 2.6.31.6-rt19-nick+atomic: ~400 MB/sec ~2200 MB/sec

OK, that's very interesting. The 09102009 patch contains the lock-free
path walk that I was hoping would improve some of your issues. I guess
it did improve them a little bit, but it is interesting that the
atomic_t conversion still gave such a huge speedup.

It would be interesting to know what d_count updates are causing the
most d_lock contention (without your +atomic patch).

One concern I have with +atomic is the extra atomic op required in
some cases. I still haven't gone over single-thread performance with
a fine-tooth comb, but even without +atomic, we have some areas that
need to be improved.

Nice numbers, btw. I never thought -rt would be able to completely
match mainline on dbench for that size of system (in vfs performance,
i.e. the ramfs case).


> >From the perf report, all of the dcache related overhead has fallen
> away, and it all seems to be journal related contention at this point
> that's keeping the ext3 numbers down.
>
> So yes, on -rt, the overhead from lock contention is way way worse then
> any extra atomic ops. :)

How about overhead for an uncontended lock? I.e., is the problem caused
because lock *contention* issues are magnified on -rt, or is it
because uncontended lock overheads are higher? Detailed callgraph
profiles and lockstat of the +/-atomic cases would be very interesting.

Ideally we just eliminate the cause of the d_count update, but I
concede that at some point and in some workloads, atomic d_count
is going to scale better.

I'd imagine that in the dbench case, contention comes on directory
dentries from things like adding child dentries, which causes the
lockless path walk to fail and retry the full locked walk from the
root. One important optimisation I have left to do is to just continue
with the locked walk at the point where the lockless one fails, rather
than redoing the full path. This should naturally help scalability as
well as single-threaded performance.
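
In rough pseudocode (the names and stub helpers below are invented for
illustration; this is not the fs/namei.c code), the current and
proposed fallback behaviour would look something like:

/* walk_sketch.c: gcc -O2 walk_sketch.c -o walk_sketch */
#include <stdio.h>

enum walk_result { WALK_OK, WALK_RETRY, WALK_ERR };

/* stand-in: pretend the lockless (RCU) walk always bails at component 2 */
static enum walk_result lockless_walk(const char *path, int *failed_at)
{
	(void)path;
	*failed_at = 2;
	return WALK_RETRY;
}

/* stand-in for the d_lock/refcount-taking walk */
static enum walk_result locked_walk(const char *path, int start_component)
{
	printf("locked walk of '%s' starting at component %d\n",
	       path, start_component);
	return WALK_OK;
}

static enum walk_result path_lookup_sketch(const char *path)
{
	int failed_at = 0;
	enum walk_result r = lockless_walk(path, &failed_at);

	if (r != WALK_RETRY)
		return r;	/* lockless walk got all the way, or hit a hard error */

	/*
	 * Current behaviour: restart the *whole* path with locks.
	 * The proposed optimisation is locked_walk(path, failed_at),
	 * i.e. resume where the lockless walk gave up.
	 */
	return locked_walk(path, 0);
}

int main(void)
{
	return path_lookup_sketch("/some/deep/dir/file") == WALK_OK ? 0 : 1;
}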


> I'm not totally convinced I did the conversion back to atomic_t's
> properly, so I'm doing some stress testing, but I'll hopefully have
> something to send out for review soon.
>
> As for your concern about dbench being a poor benchmark here, I'll try
> to get some numbers on iozone or another suggested workload and get
> those out to you shortly.

Well, I was mostly concerned that we needn't spend *lots* of time
trying to make dbench work. A badly performing dbench didn't
necessarily say too much, but a well-performing dbench is a good
indication (because it hits the vfs harder and in different ways than
a lot of other benchmarks).

Now, I don't think there is dispute that these patches vastly
improve scalability. So what I am personally most interested in
at this stage are any and all single thread performance benchmarks.
But please, the more numbers the merrier, so anything helps.

Thanks,
Nick

2009-11-18 10:19:49

by Thomas Gleixner

Subject: Re: -rt dbench scalability issue

Nick,

On Wed, 18 Nov 2009, Nick Piggin wrote:
> > So yes, on -rt, the overhead from lock contention is way way worse then
> > any extra atomic ops. :)
>
> How about overhead for an uncontended lock? Ie. is the problem caused
> because lock *contention* issues are magnified on -rt, or is it
> because uncontended lock overheads are higher? Detailed callgraph
> profiles and lockstat of +/-atomic case would be very interesting.

In the uncontended case we have the overhead of calling might_sleep()
before we acquire the lock with cmpxchg(). The uncontended unlock is a
cmpxchg() as well.
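
In sketch form (a userspace rendering for illustration only; the field
names and slowpath stubs are invented, and the real code lives in the
rtmutex implementation), the shape described above is:

/* rtlock_sketch.c: gcc -O2 rtlock_sketch.c -o rtlock_sketch */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct rt_lock_sketch {
	_Atomic(uintptr_t) owner;	/* 0 when unlocked */
};

/* stand-in for the contended lock path: the real rtmutex slowpath
 * enqueues the waiter, applies priority inheritance and sleeps;
 * here we simply busy-wait so the sketch stays self-contained */
static void slowpath_lock(struct rt_lock_sketch *l, uintptr_t self)
{
	uintptr_t expected;
	do {
		expected = 0;
	} while (!atomic_compare_exchange_weak(&l->owner, &expected, self));
}

/* stand-in for the contended unlock path (would wake the top waiter);
 * here it just clears the owner field */
static void slowpath_unlock(struct rt_lock_sketch *l)
{
	atomic_store(&l->owner, (uintptr_t)0);
}

static void rt_lock_sketch(struct rt_lock_sketch *l, uintptr_t self)
{
	uintptr_t expected = 0;
	/* might_sleep() debug check happens about here in the kernel */
	if (atomic_compare_exchange_strong(&l->owner, &expected, self))
		return;			/* uncontended: a single cmpxchg */
	slowpath_lock(l, self);		/* contended: the expensive path */
}

static void rt_unlock_sketch(struct rt_lock_sketch *l, uintptr_t self)
{
	uintptr_t expected = self;
	if (atomic_compare_exchange_strong(&l->owner, &expected, (uintptr_t)0))
		return;			/* uncontended: a single cmpxchg */
	slowpath_unlock(l);		/* waiters queued in the real code */
}

int main(void)
{
	struct rt_lock_sketch l = { 0 };

	rt_lock_sketch(&l, 1);
	rt_unlock_sketch(&l, 1);
	printf("owner after unlock: %lu\n",
	       (unsigned long)atomic_load(&l.owner));
	return 0;
}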

I don't think that this is significant overhead and we see real lock
contention issues magnified by at least an order of magnitude.

Thanks,

tglx

2009-11-18 10:52:47

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Wed, Nov 18, 2009 at 11:19:14AM +0100, Thomas Gleixner wrote:
> Nick,
>
> On Wed, 18 Nov 2009, Nick Piggin wrote:
> > > So yes, on -rt, the overhead from lock contention is way way worse then
> > > any extra atomic ops. :)
> >
> > How about overhead for an uncontended lock? Ie. is the problem caused
> > because lock *contention* issues are magnified on -rt, or is it
> > because uncontended lock overheads are higher? Detailed callgraph
> > profiles and lockstat of +/-atomic case would be very interesting.
>
> In the uncontended case we have the overhead of calling might_sleep()
> before we acquire the lock with cmpxchg(). The uncontended unlock is a
> cmpxchg() as well.

OK, well then you don't reduce atomic ops in the lookup/dput fastpaths
by protecting d_count with d_lock, so single-threaded performance should
not be hurt by using an atomic_t here.

I'll keep this in mind. As I said, I still need to do some more work on
the fast-path lookup and other single-threaded performance. In the worst
case, if mainline really doesn't like an atomic_t there, it probably
isn't hard to make some small wrappers for -rt.


> I don't think that this is significant overhead and we see real lock
> contention issues magnified by at least an order of magnitude.

Yeah, I'm sure you're right. I'm just interested in where it is coming
from in -rt.

2009-11-20 02:22:45

by john stultz

Subject: Re: -rt dbench scalability issue

On Wed, 2009-11-18 at 05:25 +0100, Nick Piggin wrote:
> On Tue, Nov 17, 2009 at 05:28:16PM -0800, john stultz wrote:
> > Just as you theorized, moving d_count back to an atomic_t does seem to
> > greatly improve the performance on -rt.
> >
> > Again, very very rough numbers for an 8-way system:
> >
> > ext3 ramfs
> > 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> > 2.6.32-rc3-nick ~1800 MB/sec ~2200 MB/sec
> > 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
> > 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
> > 2.6.31.6-rt19-nick+atomic: ~400 MB/sec ~2200 MB/sec

So I realized the above wasn't quite apples to apples, as the
2.6.32-rc3-nick and 2.6.31.2-rt13-nick numbers used your 06102009 patch,
so I went through and regenerated all the data using the 09102009 patch
so it's a fair comparison. I also collected 1cpu, 2cpu, 4cpu and 8cpu
data points to show the scalability from single-threaded on up. Dbench
still gives me more variability than I would like from run to run, so
I'd not trust the numbers as very precise, but they should give you an
idea of where things are.

I put the data I collected up here:
http://sr71.net/~jstultz/dbench-scalability/

The most interesting (ramfs) chart is here:
http://sr71.net/~jstultz/dbench-scalability/graphs/ramfs-scalability.png


> OK, that's very interesting. 09102009 patch contains the lock free path
> walk that I was hoping will improve some of your issues. I guess it did
> improve them a little bit but it is interesting that the atomic_t
> conversion still gave such a huge speedup.
>
> It would be interesting to know what d_count updates are causing the
> most d_lock contention (without your +atomic patch).

From the perf logs here:
http://sr71.net/~jstultz/dbench-scalability/perflogs/

It's mostly d_lock contention from dput() and path_get():

rt19-nick:
41.09% dbench [kernel] [k] _atomic_spin_lock_irqsave
|
|--85.45%-- rt_spin_lock_slowlock
| |
| |--100.00%-- rt_spin_lock
| | |
| | |--48.61%-- dput
| | | |
| | | |--53.56%-- __link_path_walk
| | | | |
| | | | |--100.00%-- path_walk
...
| | |--46.48%-- path_get
| | | |
| | | |--59.01%-- path_walk
| | | | |
| | | | |--90.13%-- do_path_lookup


rt19-nick-atomic:
13.04% dbench [kernel] [k] copy_user_generic_string
|
|--62.51%-- generic_file_aio_read
| do_sync_read
| vfs_read
| |
| |--97.80%-- sys_pread64
| | system_call_fastpath
| | __GI___libc_pread
| |
...
|--32.17%-- generic_file_buffered_write
| __generic_file_aio_write_nolock
| generic_file_aio_write
| do_sync_write
| vfs_write
| sys_pwrite64
| system_call_fastpath
| __GI_pwrite


> One concern I have with +atomic is the extra atomic op required in
> some cases. I still haven't gone over single thread performance with
> a fine tooth comb, but even without +atomic, we have some areas that
> need to be improved.
>
> Nice numbers, btw. I never thought -rt would be able to completely
> match mainline on dbench for that size of system (in vfs performance,
> ie. the ramfs case).

Yeah, I was *very* pleased to see the ramfs numbers. I've been working
on this without much progress for quite a while (mostly due to my vfs
ignorance).


> > >From the perf report, all of the dcache related overhead has fallen
> > away, and it all seems to be journal related contention at this point
> > that's keeping the ext3 numbers down.
> >
> > So yes, on -rt, the overhead from lock contention is way way worse then
> > any extra atomic ops. :)
>
> How about overhead for an uncontended lock? Ie. is the problem caused
> because lock *contention* issues are magnified on -rt, or is it
> because uncontended lock overheads are higher? Detailed callgraph
> profiles and lockstat of +/-atomic case would be very interesting.

Yeah, Thomas already addressed this, but spinlocks become rtmutexes with
-rt, so while the fast path is very similar to a spinlock overhead-wise,
the slowpath hit in the contended case is much more painful.

The callgraph perf logs are linked above. lockstat data is less useful
with -rt, because all the atomic_spin_lock (real spinlock) contention is
on the rtmutex waiter lock. Additionally, the actual lockstat output for
rtmutexes is much less helpful.

> Ideally we just eliminate the cause of the d_count update, but I
> concede that at some point and in some workloads, atomic d_count
> is going to scale better.
>
> I'd imagine in dbench case, contention comes on directory dentries
> from like adding child dentries, which causes lockless path walk
> to fail and retry the full locked walk from the root. One important
> optimisation I have left to do is to just continue with locked
> walk at the point where lockless fails, rather than the full path.
> This should naturally help scalability as well as single threaded
> performance.

Yeah, I was trying to figure out why the rcu path walk seems to always
fall back to the locked version, but I'm still not grokking the
vfs/dcache code well enough to understand.

The dput contention does seem mostly focused on the CWD that dbench is
run from (I'm guessing this is where the locked full path walk is
hurting, and your change to fall back to where things failed might help?)


> > I'm not totally convinced I did the conversion back to atomic_t's
> > properly, so I'm doing some stress testing, but I'll hopefully have
> > something to send out for review soon.

And to follow up here, I apparently don't have the conversion back to
atomic_t's done properly. I'm frequently hitting a bug in the unmount
path on restart. So that still needs work, but the dcount-atomic patch
(as well as the backport of your patch) can be found here if you'd like
to take a look:
http://sr71.net/~jstultz/dbench-scalability/patches/


> > As for your concern about dbench being a poor benchmark here, I'll try
> > to get some numbers on iozone or another suggested workload and get
> > those out to you shortly.
>
> Well I was mostly concerned that we needn't spend *lots* of time
> trying to make dbench work. Bad performing dbench didn't necessarily
> say too much, but good performing dbench is a good indication
> (because it hits the vfs harder and in different ways than a lot
> of other benchmarks).
>
> Now, I don't think there is dispute that these patches vastly
> improve scalability. So what I am personally most interested in
> at this stage are any and all single thread performance benchmarks.
> But please, the more numbers the merrier, so anything helps.

Ok, I'll still be working on getting iozone numbers, but I wanted to
collect and make more presentable the dbench data I had so far.

thanks
-john

2009-11-23 09:06:28

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Thu, Nov 19, 2009 at 06:22:44PM -0800, john stultz wrote:
> On Wed, 2009-11-18 at 05:25 +0100, Nick Piggin wrote:
> > On Tue, Nov 17, 2009 at 05:28:16PM -0800, john stultz wrote:
> > > Just as you theorized, moving d_count back to an atomic_t does seem to
> > > greatly improve the performance on -rt.
> > >
> > > Again, very very rough numbers for an 8-way system:
> > >
> > > ext3 ramfs
> > > 2.6.32-rc3: ~1800 MB/sec ~1600 MB/sec
> > > 2.6.32-rc3-nick ~1800 MB/sec ~2200 MB/sec
> > > 2.6.31.2-rt13: ~300 MB/sec ~66 MB/sec
> > > 2.6.31.2-rt13-nick: ~80 MB/sec ~126 MB/sec
> > > 2.6.31.6-rt19-nick+atomic: ~400 MB/sec ~2200 MB/sec
>
> So I realized the above wasn't quite apples to apples, as the
> 2.6.32-rc3-nick and 2.6.31.2-rt13-nick are using your 06102009 patch, so
> I went through and regenerated all the data using the 09102009 patch so
> its a fair comparison. I also collected 1cpu, 2cpu, 4cpu and 8cpu data
> points to show the scalability from single threaded on up. Dbench still
> gives me more variability then I would like from run to run, so I'd not
> trust the numbers as very precise, but it should give you an idea of
> where things are.
>
> I put the data I collected up here:
> http://sr71.net/~jstultz/dbench-scalability/
>
> The most interesting (ramfs) chart is here:
> http://sr71.net/~jstultz/dbench-scalability/graphs/ramfs-scalability.png

OK, cool. So -rt-nick is a lot better, but +atomic is better still.


> > OK, that's very interesting. 09102009 patch contains the lock free path
> > walk that I was hoping will improve some of your issues. I guess it did
> > improve them a little bit but it is interesting that the atomic_t
> > conversion still gave such a huge speedup.
> >
> > It would be interesting to know what d_count updates are causing the
> > most d_lock contention (without your +atomic patch).
>
> >From the perf logs here:
> http://sr71.net/~jstultz/dbench-scalability/perflogs/
>
> Its mostly d_lock contention from dput() and path_get()
>
> rt19-nick:
> 41.09% dbench [kernel] [k] _atomic_spin_lock_irqsave
> |
> |--85.45%-- rt_spin_lock_slowlock
> | |
> | |--100.00%-- rt_spin_lock
> | | |
> | | |--48.61%-- dput
> | | | |
> | | | |--53.56%-- __link_path_walk
> | | | | |
> | | | | |--100.00%-- path_walk
> ...
> | | |--46.48%-- path_get
> | | | |
> | | | |--59.01%-- path_walk
> | | | | |
> | | | | |--90.13%-- do_path_lookup

Yep, these are coming from the non-_rcu variants, so it's probably
mostly contention on the cwd. So I hope improving the lock-free
fallback case will improve this further.


> rt19-nick-atomic:

BTW, your next entry is this:
6.92% dbench [kernel] [k] rt_spin_lock
|
|--30.69%-- __d_path
| d_path
| seq_path
| show_vfsmnt
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI___libc_read

Which is glibc's braindead statvfs implementation. It makes several
syscalls, including reading and parsing /proc/mounts, which is pretty
heavy.

I replaced the statvfs call with a statfs call in dbench to get around
this (longer term we need to extend statfs to return mount flags and
get glibc to start using it). This would get rid of your most costly
spinlock and get you scaling a little further.
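
The kind of swap being described is roughly the following (a sketch of
the workaround, not the actual dbench change; per the above, glibc's
statvfs() ends up parsing /proc/mounts to fill in mount flags, while
statfs() is a single syscall):

/* statvfs_vs_statfs.c: build with -DUSE_STATVFS to get the old behaviour */
#include <stdio.h>
#include <sys/statvfs.h>
#include <sys/vfs.h>			/* statfs(2) */

static unsigned long long free_blocks(const char *path)
{
#ifdef USE_STATVFS
	struct statvfs sv;		/* glibc: statfs + parse /proc/mounts */
	if (statvfs(path, &sv) != 0)
		return 0;
	return (unsigned long long)sv.f_bavail;
#else
	struct statfs sf;		/* one statfs(2) call, no /proc/mounts */
	if (statfs(path, &sf) != 0)
		return 0;
	return (unsigned long long)sf.f_bavail;
#endif
}

int main(void)
{
	printf("free blocks on /: %llu\n", free_blocks("/"));
	return 0;
}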


> > One concern I have with +atomic is the extra atomic op required in
> > some cases. I still haven't gone over single thread performance with
> > a fine tooth comb, but even without +atomic, we have some areas that
> > need to be improved.
> >
> > Nice numbers, btw. I never thought -rt would be able to completely
> > match mainline on dbench for that size of system (in vfs performance,
> > ie. the ramfs case).
>
> Yea, I was *very* pleased to see the ramfs numbers. Been working on this
> without much progress for quite awhile (mostly due to my vfs ignorance).

BTW, have you tested the ext4, xfs and btrfs cases? I don't think ext3
is likely to see a huge amount of scalability work, so it will be
good to give an early heads-up to the more active projects...


> > > >From the perf report, all of the dcache related overhead has fallen
> > > away, and it all seems to be journal related contention at this point
> > > that's keeping the ext3 numbers down.
> > >
> > > So yes, on -rt, the overhead from lock contention is way way worse then
> > > any extra atomic ops. :)
> >
> > How about overhead for an uncontended lock? Ie. is the problem caused
> > because lock *contention* issues are magnified on -rt, or is it
> > because uncontended lock overheads are higher? Detailed callgraph
> > profiles and lockstat of +/-atomic case would be very interesting.
>
> Yea, Thomas already addressed this, but spinlocks become rtmutexes with
> -rt, so while the fast-path is very similar to a spinlock overhead wise,
> the slowpath hit in the contended case is much more painful.

Well, fast-path is similar but it has 2 atomics, so that means we
can do the atomic_t d_count without adding an extra atomic op in
the fastpath.

lookup does (for the last element):
spin_lock(d_lock)
// recheck fields
d_count++
spin_unlock(d_lock)

then release, dput:
spin_lock(d_lock)
d_count--
spin_unlock(d_lock)

So 2 atomics. Converting to atomic d_count would make it 3 atomics
because we can't easily avoid the d_lock in the lookup path. For
-rt, you have 4 atomics either way.


> The callgraphs perf logs are linked above. lockstat data is less useful
> with -rt, because all the atomic_spin_lock (real spinlock) contention is
> on the rtmutex waiter lock. Additionally, the actual lockstat output for
> rtmutexes is much less helpful.

Sure. I'd thought that perf may not be so good if a lot of the cost is
in sleeping on the lock.


> > Ideally we just eliminate the cause of the d_count update, but I
> > concede that at some point and in some workloads, atomic d_count
> > is going to scale better.
> >
> > I'd imagine in dbench case, contention comes on directory dentries
> > from like adding child dentries, which causes lockless path walk
> > to fail and retry the full locked walk from the root. One important
> > optimisation I have left to do is to just continue with locked
> > walk at the point where lockless fails, rather than the full path.
> > This should naturally help scalability as well as single threaded
> > performance.
>
> Yea, I was trying to figure out why the rcu path walk seems to always
> fail back to the locked version, but I'm still not groking the
> vfs/dcache code well enough to understand.
>
> The dput contention does seem mostly focused on the CWD that dbench is
> run from (I'm guessing this is where the locked full path walk is
> hurting and your change to fallback to where things failed might help?)

Exactly.


> > > I'm not totally convinced I did the conversion back to atomic_t's
> > > properly, so I'm doing some stress testing, but I'll hopefully have
> > > something to send out for review soon.
>
> And to follow up here, I apparently don't have the conversion back to
> atomic_t's done properly. I'm frequently hitting a bug in the unmount
> path on restart. So that still needs work, but the dcount-atomic patch
> (as well as the backport of your patch) can be found here if you'd like
> to take a look:
> http://sr71.net/~jstultz/dbench-scalability/patches/

I'll have to get some time to start working on this again. Soon.


> > > As for your concern about dbench being a poor benchmark here, I'll try
> > > to get some numbers on iozone or another suggested workload and get
> > > those out to you shortly.
> >
> > Well I was mostly concerned that we needn't spend *lots* of time
> > trying to make dbench work. Bad performing dbench didn't necessarily
> > say too much, but good performing dbench is a good indication
> > (because it hits the vfs harder and in different ways than a lot
> > of other benchmarks).
> >
> > Now, I don't think there is dispute that these patches vastly
> > improve scalability. So what I am personally most interested in
> > at this stage are any and all single thread performance benchmarks.
> > But please, the more numbers the merrier, so anything helps.
>
> Ok, I'll still be working on on getting iozone numbers, but I wanted to
> collect and make more presentable the dbench data I had so far.

Yeah, looks good; it's really interesting, thanks.

2009-11-25 02:16:16

by john stultz

Subject: Re: -rt dbench scalability issue

On Mon, 2009-11-23 at 10:06 +0100, Nick Piggin wrote:
> BTW. Have you tested like ext4, xfs and btrfs cases? I don't think ext3
> is likely to see a huge amount of scalability work, and so it will be
> good to give an early heads up to the more active projects...

Yeah, I need to give those a shot. I also generated the same numbers as
before with ext2 (all the raw numbers are in the dbench-scalability dir):

http://sr71.net/~jstultz/dbench-scalability/graphs/ext2-scalability.png

Again, it's similar to ext3, in that all the -rt variants are hitting
some contention issues. But I was a little surprised to see
2.6.32-rc7-nick below 2.6.32-rc7 there, so I generated perf data there
as well:

http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext2.perflog


31.91% dbench [kernel] [k] _spin_lock
|
|--44.17%-- dput
| |
| |--50.39%-- __link_path_walk
| | |
| | |--99.88%-- path_walk
| | | |
| | | |--96.18%-- do_path_lookup
| | | | |
| | | | |--51.49%-- user_path_at
| | | | | |
| | | | | |--96.84%-- vfs_fstatat
| | | | | | vfs_stat
| | | | | | sys_newstat
| | | | | | system_call_fastpath
| | | | | | _xstat
| | | | | |
| | | | | --3.16%-- do_utimes
| | | | | sys_utime
| | | | | system_call_fastpath
| | | | | __GI_utime
| | | | |
| | | | |--43.18%-- do_filp_open
| | | | | do_sys_open
| | | | | sys_open
| | | | | system_call_fastpath
| | | | | __GI___open
...

|--39.98%-- path_get
| |
| |--51.19%-- path_init
| | |
| | |--96.40%-- do_path_lookup
| | | |
| | | |--55.12%-- user_path_at
| | | | |
| | | | |--91.47%-- vfs_fstatat
| | | | | vfs_stat
| | | | | sys_newstat
| | | | | system_call_fastpath
| | | | | _xstat
| | | | | |
| | | | | --100.00%-- 0x7478742e746e65
| | | | |


So I'm a little confused here. Why is dput/path_get showing up in ext2
when it all but disappeared with ramfs? Does ramfs have some sort of
vfs_stat shortcut that avoids whatever is catching us here?

thanks
-john

2009-11-25 07:18:06

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Tue, Nov 24, 2009 at 06:16:17PM -0800, john stultz wrote:
> On Mon, 2009-11-23 at 10:06 +0100, Nick Piggin wrote:
> > BTW. Have you tested like ext4, xfs and btrfs cases? I don't think ext3
> > is likely to see a huge amount of scalability work, and so it will be
> > good to give an early heads up to the more active projects...
>
> Yea, I need to give those a shot. I also generated the same numbers as
> before with ext2 (all the raw numbers are in dbench-scalability dir):
>
> http://sr71.net/~jstultz/dbench-scalability/graphs/ext2-scalability.png
>
> Again, its similar to ext3, in that all the -rt variants are hitting
> some contention issues. But I was a little surprised to see
> 2.6.32-rc7-nick below 2.6.32-rc7 there, so generated perf data there as
> well:
>
> http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext2.perflog

Ext2 doesn't look too bad in the -rt profiles. The block allocator is
only taking up a couple of percent of the spinlock time. Most of the
contention is in path lookup, probably cwd dentry contention.

The -nick case is probably also hitting d_lock contention more. I don't
know why it shows up more on ext2. The stat path shouldn't involve
anything filesystem-specific outside the regular path lookup, until the
path lookup is done and we have found the inode.

If you're using ACLs or something on ext2, then the lock-free path walk
might fail more often.

Anyway, I think all these problems should largely go away when the path
walk fallback is improved.

2009-11-25 22:20:38

by john stultz

Subject: Re: -rt dbench scalability issue

On Wed, 2009-11-25 at 08:18 +0100, Nick Piggin wrote:
> On Tue, Nov 24, 2009 at 06:16:17PM -0800, john stultz wrote:
> > On Mon, 2009-11-23 at 10:06 +0100, Nick Piggin wrote:
> > > BTW. Have you tested like ext4, xfs and btrfs cases? I don't think ext3
> > > is likely to see a huge amount of scalability work, and so it will be
> > > good to give an early heads up to the more active projects...
> >
> > Yea, I need to give those a shot. I also generated the same numbers as
> > before with ext2 (all the raw numbers are in dbench-scalability dir):
> >
> > http://sr71.net/~jstultz/dbench-scalability/graphs/ext2-scalability.png
> >
> > Again, its similar to ext3, in that all the -rt variants are hitting
> > some contention issues. But I was a little surprised to see
> > 2.6.32-rc7-nick below 2.6.32-rc7 there, so generated perf data there as
> > well:
> >
> > http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext2.perflog
>
> Ext2 doesn't look too bad in the -rt profiles. The block allocator is
> only taking up a couple of % of the spinlocks. Most of the contention
> is in path lookup, probably cwd dentry contention.
>
> The -nick case probably also is hitting d_lock contention more. Don't
> know why it shows up more on ext2. The stat path should be nothing
> filesystem specific outside the regular path lookup, until path lookup
> is done and we found the inode.
>
> If you're using acls or something on ext2 then lock free path walk
> might fail more often.

CC'ed Ted and Mingming as they might be interested:

Got ext4 data up:
http://sr71.net/~jstultz/dbench-scalability/graphs/ext4-scalability.png

Looks pretty similar to ext2. I'm also seeing path_get contention with
your patch on ext4 in the perflogs:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext4.perflog


> Anyway, I think all these problems should largely go away when path
> walk fall back is improved.

Great, let me know when you have a rough shot at it ready for testing.

thanks
-john

2009-11-26 06:20:21

by Nick Piggin

Subject: Re: -rt dbench scalability issue

On Wed, Nov 25, 2009 at 02:20:33PM -0800, john stultz wrote:
> On Wed, 2009-11-25 at 08:18 +0100, Nick Piggin wrote:
> > If you're using acls or something on ext2 then lock free path walk
> > might fail more often.
>
> CC'ed Ted and Mingming as they might be interested:
>
> Got ext4 data up:
> http://sr71.net/~jstultz/dbench-scalability/graphs/ext4-scalability.png

Ahh, looks much nicer than ext3, at least on non-rt. -rt seems to
be running into journal lock contention.


> Looks pretty similar to ext2. I'm also seeing path_get contention as
> well with your patch on ext4 in the perflogs:
> http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext4.perflog

That's *all* coming from reading /proc/mounts, by the looks of it. I
don't think we're going to bother trying to make d_path incredibly
scalable; the fix is to fix glibc's statvfs call.

As I said, you can work around this by changing dbench's statvfs call
to statfs.

After that, the same journal locks look like they might hit next, but
it should get quite a lot further.


> > Anyway, I think all these problems should largely go away when path
> > walk fall back is improved.
>
> Great, let me know when you have a rough shot at it ready for testing.

Will do

2009-12-02 01:53:36

by john stultz

Subject: Re: -rt dbench scalability issue

On Thu, 2009-11-26 at 07:20 +0100, Nick Piggin wrote:
> On Wed, Nov 25, 2009 at 02:20:33PM -0800, john stultz wrote:
> > On Wed, 2009-11-25 at 08:18 +0100, Nick Piggin wrote:
> > > If you're using acls or something on ext2 then lock free path walk
> > > might fail more often.
> >
> > CC'ed Ted and Mingming as they might be interested:
> >
> > Got ext4 data up:
> > http://sr71.net/~jstultz/dbench-scalability/graphs/ext4-scalability.png
>
> Ahh, looks much nicer than ext3, at least on non-rt. -rt seems to
> be running into journal lock contention.

Yep.


>
> > Looks pretty similar to ext2. I'm also seeing path_get contention as
> > well with your patch on ext4 in the perflogs:
> > http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.32-rc7-nick.ext4.perflog
>
> That's *all* coming from reading /proc/mounts by the looks. I don't
> think we're going to bother trying to make d_path incredibly scalable,
> the fix is to fix glibc's statvfs call.
>
> As I said, you can work around this by changing dbench's statvfs call
> to statfs.
>
> After that, the same journal locks look like they might hit next, but
> it should get quite a lot further.

Thanks for reminding me. I had tried this for ext2, but didn't see much
change, so I forgot to try it for ext4.

But you're right. I've verified that changing dbench to use statfs()
does cause the path_get contention to fall out for the mainline ext4
case. It doesn't change too much in the -rt case, but it does help (and
gives a really nice boost for the non-rt case).

See:
http://sr71.net/~jstultz/dbench-scalability/graphs/ext4-statfs-scalability.png
vs
http://sr71.net/~jstultz/dbench-scalability/graphs/ext4-scalability.png


Perflogs here:
http://sr71.net/~jstultz/dbench-scalability/perflogs/


So yeah, from -rt's perspective, with this patchset we're down to
journal lock contention for ext3 and ext4 as the main issue now.


Nick: So what's your plan to upstream this work? With 2.6.32 around the
corner and 2.6.32-rt likely following shortly, I don't think pushing the
backport to 2.6.31-rt makes much sense right at this moment. But when
2.6.32-rt does arrive, it might be nice to have the broken-out patches.

Thomas/Ingo: Any thoughts on how receptive you guys would be to picking
up these changes for -rt (maybe for 2.6.32-rt)? Or should they go
mainline first?

thanks
-john