2011-02-12 09:22:21

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > > add Jan and Theodore to the loop.
> > >
> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > > system. We find this performance dropping is due to commit
> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > > recovered.
> > > >
> > > > iostat report show with the commit, read request merge number increased
> > > > and write request merge dropped. The total request size increased and
> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > >
> >
> > Yep, it does sound like reduce write merging. But moving journal commits
> > back to WRITE, then fsync performance will drop as there will be idling
> > introduced between fsync thread and journalling thread. So that does
> > not sound like a good idea either.
> >
> > Secondly, in presence of mixed workload (some other sync read happening)
> > WRITES can get less bandwidth and sync workload much more. So by
> > marking journal commits as WRITES you might increase the delay there
> > in completion in presence of other sync workload.
> >
> > So Jan Kara's approach makes sense that if somebody is waiting on
> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > did it not work for you. Is it possible to run some traces and do
> > more debugging that figure out what's happening.
> Sorry for the long delay.
>
> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> ext4, maybe this is the reason.
> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> Jan's patch, the test seems fine.
Let me clarify the bug situation again.
The regression shows up clearly in the following scenario:
1, ccache_dir is set up on the rootfs, which is ext3 on /dev/sda1; 2,
kbuild runs on /dev/sdb1 with ext4.
But if we disable ccache and only do kbuild on sdb1 with ext4, there is
no regression, with or without Jan's patch.
So the problem is focused on the ccache scenario (since Fedora 11, ccache
has been enabled by default).

If we compare the vmstat output with and without ccache, there are far
too many writes when ccache is enabled. According to the results, some
tuning should be done on the ext3 fs.


vmstat average output per 10 seconds, without ccache
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0

vmstat average output per 10 seconds, with ccache
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0


>
> Jan,
> can you send a patch with similar change for ext3? So we can do more tests.
>
> Thanks,
> Shaohua



2011-02-12 18:26:05

by Corrado Zoccolo

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <[email protected]> wrote:
> On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
>> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
>> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
>> > > add Jan and Theodore to the loop.
>> > >
>> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
>> > > > Shaohua and I tested kernel building performance on latest kernel. and
>> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
>> > > > system. We find this performance dropping is due to commit
>> > > > 749ef9f8423054e326f. If we revert this patch or just change the
>> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
>> > > > recovered.
>> > > >
>> > > > iostat report show with the commit, read request merge number increased
>> > > > and write request merge dropped. The total request size increased and
>> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
>> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
>> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
>> > >
>> >
>> > Yep, it does sound like reduce write merging. But moving journal commits
>> > back to WRITE, then fsync performance will drop as there will be idling
>> > introduced between fsync thread and journalling thread. So that does
>> > not sound like a good idea either.
>> >
>> > Secondly, in presence of mixed workload (some other sync read happening)
>> > WRITES can get less bandwidth and sync workload much more. So by
>> > marking journal commits as WRITES you might increase the delay there
>> > in completion in presence of other sync workload.
>> >
>> > So Jan Kara's approach makes sense that if somebody is waiting on
>> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
>> > did it not work for you. Is it possible to run some traces and do
>> > more debugging that figure out what's happening.
>> Sorry for the long delay.
>>
>> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
>> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
>> ext4, maybe this is the reason.
>> I changed jbd to use WRITE for journal_commit_transaction. With the change and
>> Jan's patch, the test seems fine.
> Let me clarify the bug situation again.
> With the following scenarios, the regression is clear.
> 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> kbuild on /dev/sdb1 with ext4.
> but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> no regressions whenever with or without Jan's patch.
> So, problem focus on the ccache scenario, (from fedora 11, ccache is
> default setting).
>
> If we compare the vmstat output with or without ccache, there is too
> many write when ccache enabled. According the result, it should to do
> some tunning on ext3 fs.
Is ext3 configured with data ordered or writeback?
I think ccache might be performing fsyncs, and this is a bad workload
for ext3, especially in ordered mode.
It might be that my patch introduced a regression in ext3 fsync
performance, but I don't understand how reverting only the change in
jbd2 (that is the ext4 specific journaling daemon) could restore it.
The two partitions are on different disks, so each one should be
isolated from the I/O perspective (do they share a single
controller?). The only interaction I can see happens at the VM level,
since changing the performance of either disk changes the rate at which
pages can be cleaned.
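
A quick way to check this hypothesis would be to time how long an fsync of a
small, freshly written file takes in the ccache directory on the ext3 rootfs
versus somewhere on the ext4 build disk. The sketch below is only an
illustration of such a probe, not part of the test setup used in this thread;
the default file name and the 4 KB size are arbitrary.

/*
 * fsync_probe.c - write a small file and report how long fsync() takes.
 * Build: cc -o fsync_probe fsync_probe.c   (add -lrt on older glibc)
 * Run it with a path inside the directory you want to probe.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "fsync_probe.tmp";
    char buf[4096];
    struct timespec t0, t1;
    int fd;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd))      /* the operation ccache is suspected of doing a lot */
        perror("fsync");
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("fsync of %zu bytes took %.3f ms\n", sizeof(buf),
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    close(fd);
    unlink(path);
    return 0;
}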

Corrado
>
>
> vmstat average output per 10 seconds, without ccache
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
>
> vmstat average output per 10 seconds, with ccache
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
>
>
>>
>> Jan,
>> can you send a patch with similar change for ext3? So we can do more tests.
>>
>> Thanks,
>> Shaohua
>
>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

2011-02-14 02:25:47

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <[email protected]> wrote:
> > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> >> > > add Jan and Theodore to the loop.
> >> > >
> >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> >> > > > system. We find this performance dropping is due to commit
> >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> >> > > > recovered.
> >> > > >
> >> > > > iostat report show with the commit, read request merge number increased
> >> > > > and write request merge dropped. The total request size increased and
> >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> >> > >
> >> >
> >> > Yep, it does sound like reduce write merging. But moving journal commits
> >> > back to WRITE, then fsync performance will drop as there will be idling
> >> > introduced between fsync thread and journalling thread. So that does
> >> > not sound like a good idea either.
> >> >
> >> > Secondly, in presence of mixed workload (some other sync read happening)
> >> > WRITES can get less bandwidth and sync workload much more. So by
> >> > marking journal commits as WRITES you might increase the delay there
> >> > in completion in presence of other sync workload.
> >> >
> >> > So Jan Kara's approach makes sense that if somebody is waiting on
> >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> >> > did it not work for you. Is it possible to run some traces and do
> >> > more debugging that figure out what's happening.
> >> Sorry for the long delay.
> >>
> >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> >> ext4, maybe this is the reason.
> >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> >> Jan's patch, the test seems fine.
> > Let me clarify the bug situation again.
> > With the following scenarios, the regression is clear.
> > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > kbuild on /dev/sdb1 with ext4.
> > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > no regressions whenever with or without Jan's patch.
> > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > default setting).
> >
> > If we compare the vmstat output with or without ccache, there is too
> > many write when ccache enabled. According the result, it should to do
> > some tunning on ext3 fs.
> Is ext3 configured with data ordered or writeback?

The ext3 on sda and the ext4 on sdb both use the 'ordered' data mode.

> I think ccache might be performing fsyncs, and this is a bad workload
> for ext3, especially in ordered mode.
> It might be that my patch introduced a regression in ext3 fsync
> performance, but I don't understand how reverting only the change in
> jbd2 (that is the ext4 specific journaling daemon) could restore it.
> The two partitions are on different disks, so each one should be
> isolated from the I/O perspective (do they share a single
> controller?).

No, sda and sdb use separate controllers.

> The only interaction I see happens at the VM level,
> since changing performance of any of the two changes the rate at which
> pages can be cleaned.
>
> Corrado
> >
> >
> > vmstat average output per 10 seconds, without ccache
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> >
> > vmstat average output per 10 seconds, with ccache
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> >
> >
> >>
> >> Jan,
> >> can you send a patch with similar change for ext3? So we can do more tests.
> >>
> >> Thanks,
> >> Shaohua
> >
> >
> >
> >
>
>
>

2011-02-15 01:10:09

by Shaohua Li

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote:
> On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <[email protected]> wrote:
> > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > >> > > add Jan and Theodore to the loop.
> > >> > >
> > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > >> > > > system. We find this performance dropping is due to commit
> > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > >> > > > recovered.
> > >> > > >
> > >> > > > iostat report show with the commit, read request merge number increased
> > >> > > > and write request merge dropped. The total request size increased and
> > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > >> > >
> > >> >
> > >> > Yep, it does sound like reduce write merging. But moving journal commits
> > >> > back to WRITE, then fsync performance will drop as there will be idling
> > >> > introduced between fsync thread and journalling thread. So that does
> > >> > not sound like a good idea either.
> > >> >
> > >> > Secondly, in presence of mixed workload (some other sync read happening)
> > >> > WRITES can get less bandwidth and sync workload much more. So by
> > >> > marking journal commits as WRITES you might increase the delay there
> > >> > in completion in presence of other sync workload.
> > >> >
> > >> > So Jan Kara's approach makes sense that if somebody is waiting on
> > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > >> > did it not work for you. Is it possible to run some traces and do
> > >> > more debugging that figure out what's happening.
> > >> Sorry for the long delay.
> > >>
> > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> > >> ext4, maybe this is the reason.
> > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> > >> Jan's patch, the test seems fine.
> > > Let me clarify the bug situation again.
> > > With the following scenarios, the regression is clear.
> > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > > kbuild on /dev/sdb1 with ext4.
> > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > > no regressions whenever with or without Jan's patch.
> > > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > > default setting).
> > >
> > > If we compare the vmstat output with or without ccache, there is too
> > > many write when ccache enabled. According the result, it should to do
> > > some tunning on ext3 fs.
> > Is ext3 configured with data ordered or writeback?
>
> The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode.
>
> > I think ccache might be performing fsyncs, and this is a bad workload
> > for ext3, especially in ordered mode.
> > It might be that my patch introduced a regression in ext3 fsync
> > performance, but I don't understand how reverting only the change in
> > jbd2 (that is the ext4 specific journaling daemon) could restore it.
> > The two partitions are on different disks, so each one should be
> > isolated from the I/O perspective (do they share a single
> > controller?).
>
> No, sda/sdb use separated controller.
>
> > The only interaction I see happens at the VM level,
> > since changing performance of any of the two changes the rate at which
> > pages can be cleaned.
> >
> > Corrado
> > >
> > >
> > > vmstat average output per 10 seconds, without ccache
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> > >
> > > vmstat average output per 10 seconds, with ccache
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> > >
> > >
> > >>
> > >> Jan,
> > >> can you send a patch with similar change for ext3? So we can do more tests.
Hi Jan,
can you send a patch with both the ext3 and ext4 changes? Our tests show
your patch has a positive effect, but we need to confirm that with the ext3 change.

Thanks,
Shaohua

2011-02-21 16:49:15

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Tue 15-02-11 09:10:01, Shaohua Li wrote:
> On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote:
> > On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote:
> > > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <[email protected]> wrote:
> > > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote:
> > > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote:
> > > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote:
> > > >> > > add Jan and Theodore to the loop.
> > > >> > >
> > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote:
> > > >> > > > Shaohua and I tested kernel building performance on latest kernel. and
> > > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
> > > >> > > > system. We find this performance dropping is due to commit
> > > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the
> > > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
> > > >> > > > recovered.
> > > >> > > >
> > > >> > > > iostat report show with the commit, read request merge number increased
> > > >> > > > and write request merge dropped. The total request size increased and
> > > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC
> > > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
> > > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue.
> > > >> > >
> > > >> >
> > > >> > Yep, it does sound like reduce write merging. But moving journal commits
> > > >> > back to WRITE, then fsync performance will drop as there will be idling
> > > >> > introduced between fsync thread and journalling thread. So that does
> > > >> > not sound like a good idea either.
> > > >> >
> > > >> > Secondly, in presence of mixed workload (some other sync read happening)
> > > >> > WRITES can get less bandwidth and sync workload much more. So by
> > > >> > marking journal commits as WRITES you might increase the delay there
> > > >> > in completion in presence of other sync workload.
> > > >> >
> > > >> > So Jan Kara's approach makes sense that if somebody is waiting on
> > > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why
> > > >> > did it not work for you. Is it possible to run some traces and do
> > > >> > more debugging that figure out what's happening.
> > > >> Sorry for the long delay.
> > > >>
> > > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk
> > > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers
> > > >> ext4, maybe this is the reason.
> > > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and
> > > >> Jan's patch, the test seems fine.
> > > > Let me clarify the bug situation again.
> > > > With the following scenarios, the regression is clear.
> > > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2,
> > > > kbuild on /dev/sdb1 with ext4.
> > > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is
> > > > no regressions whenever with or without Jan's patch.
> > > > So, problem focus on the ccache scenario, (from fedora 11, ccache is
> > > > default setting).
> > > >
> > > > If we compare the vmstat output with or without ccache, there is too
> > > > many write when ccache enabled. According the result, it should to do
> > > > some tunning on ext3 fs.
> > > Is ext3 configured with data ordered or writeback?
> >
> > The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode.
> >
> > > I think ccache might be performing fsyncs, and this is a bad workload
> > > for ext3, especially in ordered mode.
> > > It might be that my patch introduced a regression in ext3 fsync
> > > performance, but I don't understand how reverting only the change in
> > > jbd2 (that is the ext4 specific journaling daemon) could restore it.
> > > The two partitions are on different disks, so each one should be
> > > isolated from the I/O perspective (do they share a single
> > > controller?).
> >
> > No, sda/sdb use separated controller.
> >
> > > The only interaction I see happens at the VM level,
> > > since changing performance of any of the two changes the rate at which
> > > pages can be cleaned.
> > >
> > > Corrado
> > > >
> > > >
> > > > vmstat average output per 10 seconds, without ccache
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0
> > > >
> > > > vmstat average output per 10 seconds, with ccache
> > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > > > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0
> > > >
> > > >
> > > >>
> > > >> Jan,
> > > >> can you send a patch with similar change for ext3? So we can do more tests.
> Hi Jan,
> can you send a patch with both ext3 and ext4 changes? Our test shows
> your patch has positive effect, but need confirm with the ext3 change.
Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a
while to get to this.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (5.32 kB)
0001-jbd2-Refine-commit-writeout-logic.patch (10.79 kB)
0002-jbd-Refine-commit-writeout-logic.patch (9.61 kB)
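
The mechanism both patches are built around is the one discussed earlier in
the thread: mark the journal commit writeout as WRITE_SYNC only when some
task is actually waiting for the commit (e.g. in fsync), and use plain WRITE
for periodic background commits. The snippet below is only an illustrative
user-space model of that decision, with hypothetical names; it is not the
attached jbd/jbd2 code.

/*
 * Illustrative model of the commit writeout decision: sync writes only
 * when a waiter's latency depends on the commit, plain writes otherwise.
 */
#include <stdio.h>

enum commit_write_hint { COMMIT_WRITE, COMMIT_WRITE_SYNC };

struct commit_state {
    int waiters;    /* tasks blocked waiting for this commit to finish */
};

static enum commit_write_hint choose_write_hint(const struct commit_state *c)
{
    if (c->waiters > 0)
        return COMMIT_WRITE_SYNC;   /* somebody waits: latency matters */
    return COMMIT_WRITE;            /* background commit: let writes merge */
}

int main(void)
{
    struct commit_state background = { .waiters = 0 };
    struct commit_state fsync_driven = { .waiters = 3 };

    printf("background commit:   %s\n",
           choose_write_hint(&background) == COMMIT_WRITE_SYNC
               ? "WRITE_SYNC" : "WRITE");
    printf("fsync-driven commit: %s\n",
           choose_write_hint(&fsync_driven) == COMMIT_WRITE_SYNC
               ? "WRITE_SYNC" : "WRITE");
    return 0;
}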

2011-02-23 08:23:51

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Though these patches cannot totally recover the regression, they are
quite helpful in the ccache-enabled situation: they increase performance
by 10% on the 38-rc1 kernel.
I tried to apply them to the latest rc6 kernel but failed. The vmstat output is here:
with patches:
r b swpd free buff cache si so bi bo in cs us sy id wa st
1.5 24.7 0.0 64316199.8 17240.8 153376.9 0.0 0.0 1777.1 1788.0 6479.0 2605.3 1.7 0.9 89.1 8.2 0.0
original 38-rc1 kernel:
2.4 32.3 0.0 63653302.9 17170.6 153125.3 0.0 0.0 1579.7 1834.1 6016.4 2407.0 1.5 0.7 86.6 10.1 0.0

They clearly reduce the number of blocks written.

> > can you send a patch with both ext3 and ext4 changes? Our test shows
> > your patch has positive effect, but need confirm with the ext3 change.
> Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a
> while to get to this.
>
> Honza

2011-02-24 12:13:47

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> Though these patches can not totally recovered the problem, but they are
> quite helpful with ccache enabled situation. It increase 10% performance
> on 38-rc1 kernel.
OK, and what was the original performance drop with the WRITE_SYNC change?

> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
> with patches:
I'm attaching patches rebased on top of the latest Linus tree.
Corrado, could you possibly run your fsync-heavy tests so that we can see
whether my patches have any negative impact on your fsync-heavy
workload? Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (673.00 B)
0001-jbd2-Refine-commit-writeout-logic.patch (10.69 kB)
0002-jbd-Refine-commit-writeout-logic.patch (9.61 kB)

2011-02-25 00:52:32

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara wrote:
> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>
>> Though these patches can not totally recovered the problem, but they are
>> quite helpful with ccache enabled situation. It increase 10% performance
>> on 38-rc1 kernel.
>>
> OK and what was the original performance drop with WRITE_SYNC change?
>
The original drop is 30%.
>
>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> with patches:
>>
> I'm attaching patches rebased on top of latest Linus's tree.
> Corrado, could you possibly run your fsync-heavy tests so that we see
> whether there isn't negative impact of my patches on your fsync-heavy
> workload? Thanks.
>
> Honza
>

2011-02-26 14:45:17

by Corrado Zoccolo

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> Though these patches can not totally recovered the problem, but they are
>> quite helpful with ccache enabled situation. It increase 10% performance
>> on 38-rc1 kernel.
>  OK and what was the original performance drop with WRITE_SYNC change?
>
>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> with patches:
>  I'm attaching patches rebased on top of latest Linus's tree.
> Corrado, could you possibly run your fsync-heavy tests so that we see
> whether there isn't negative impact of my patches on your fsync-heavy
> workload? Thanks.
The workload was actually Jeff's, and the stalls that my change tried
to mitigate showed up on his enterprise class storage. Adding him so
he can test it.

Corrado
>                                                                Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>



--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

2011-03-01 19:56:53

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Corrado Zoccolo <[email protected]> writes:

> On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
>> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>>> Though these patches can not totally recovered the problem, but they are
>>> quite helpful with ccache enabled situation. It increase 10% performance
>>> on 38-rc1 kernel.
>>  OK and what was the original performance drop with WRITE_SYNC change?
>>
>>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>>> with patches:
>>  I'm attaching patches rebased on top of latest Linus's tree.
>> Corrado, could you possibly run your fsync-heavy tests so that we see
>> whether there isn't negative impact of my patches on your fsync-heavy
>> workload? Thanks.
> The workload was actually Jeff's, and the stalls that my change tried
> to mitigate showed up on his enterprise class storage. Adding him so
> he can test it.

Sorry for the late reply. You can use either fs_mark or iozone to
generate an fsync-heavy workload. The test I did was to mix this with a
sequential reader. If you can point me at patches, I should be able to
test this.
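
For reference, the fsync-heavy half of such a workload can be approximated
with a few lines of C. The sketch below is a stand-in, not fs_mark or iozone
themselves; the file count and the 64 KB file size are placeholders chosen to
resemble the runs reported later in this thread.

/*
 * Sequentially create small files and fsync each one -- a rough stand-in
 * for the fsync-heavy side of the test workload.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFILES   1000
#define FILESIZE (64 * 1024)

int main(void)
{
    static char buf[FILESIZE];
    char name[64];
    int i, fd;

    memset(buf, 'a', sizeof(buf));
    for (i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "file.%06d", i);
        fd = open(name, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
        if (fsync(fd))  /* the per-file fsync is what drives the journal commits */
            perror("fsync");
        close(fd);
    }
    return 0;
}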

Cheers,
Jeff

2011-03-02 09:42:49

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
> Corrado Zoccolo <[email protected]> writes:
>
> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> >>> Though these patches can not totally recovered the problem, but they are
> >>> quite helpful with ccache enabled situation. It increase 10% performance
> >>> on 38-rc1 kernel.
> >>  OK and what was the original performance drop with WRITE_SYNC change?
> >>
> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
> >>> with patches:
> >>  I'm attaching patches rebased on top of latest Linus's tree.
> >> Corrado, could you possibly run your fsync-heavy tests so that we see
> >> whether there isn't negative impact of my patches on your fsync-heavy
> >> workload? Thanks.
> > The workload was actually Jeff's, and the stalls that my change tried
> > to mitigate showed up on his enterprise class storage. Adding him so
> > he can test it.
>
> Sorry for the late reply. You can use either fs_mark or iozone to
> generate an fsync-heavy workload. The test I did was to mix this with a
> sequential reader. If you can point me at patches, I should be able to
> test this.
The latest version of patches is attached to:
https://lkml.org/lkml/2011/2/24/125

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-02 16:14:33

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara <[email protected]> writes:

> On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
>> Corrado Zoccolo <[email protected]> writes:
>>
>> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
>> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> >>> Though these patches can not totally recovered the problem, but they are
>> >>> quite helpful with ccache enabled situation. It increase 10% performance
>> >>> on 38-rc1 kernel.
>> >>  OK and what was the original performance drop with WRITE_SYNC change?
>> >>
>> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> >>> with patches:
>> >>  I'm attaching patches rebased on top of latest Linus's tree.
>> >> Corrado, could you possibly run your fsync-heavy tests so that we see
>> >> whether there isn't negative impact of my patches on your fsync-heavy
>> >> workload? Thanks.
>> > The workload was actually Jeff's, and the stalls that my change tried
>> > to mitigate showed up on his enterprise class storage. Adding him so
>> > he can test it.
>>
>> Sorry for the late reply. You can use either fs_mark or iozone to
>> generate an fsync-heavy workload. The test I did was to mix this with a
>> sequential reader. If you can point me at patches, I should be able to
>> test this.
> The latest version of patches is attached to:
> https://lkml.org/lkml/2011/2/24/125

Perhaps you should fix up the merge conflicts, first? ;-)

+<<<<<<< HEAD
tid = transaction->t_tid;
need_to_start = !tid_geq(journal->j_commit_request, tid);
+=======
+ __jbd2_log_start_commit(journal, transaction->t_tid, false);
+>>>>>>> jbd2: Refine commit writeout logic

2011-03-02 21:17:54

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Wed 02-03-11 11:13:53, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
> > On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
> >> Corrado Zoccolo <[email protected]> writes:
> >>
> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
> >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
> >> >>> Though these patches can not totally recovered the problem, but they are
> >> >>> quite helpful with ccache enabled situation. It increase 10% performance
> >> >>> on 38-rc1 kernel.
> >> >>  OK and what was the original performance drop with WRITE_SYNC change?
> >> >>
> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
> >> >>> with patches:
> >> >>  I'm attaching patches rebased on top of latest Linus's tree.
> >> >> Corrado, could you possibly run your fsync-heavy tests so that we see
> >> >> whether there isn't negative impact of my patches on your fsync-heavy
> >> >> workload? Thanks.
> >> > The workload was actually Jeff's, and the stalls that my change tried
> >> > to mitigate showed up on his enterprise class storage. Adding him so
> >> > he can test it.
> >>
> >> Sorry for the late reply. You can use either fs_mark or iozone to
> >> generate an fsync-heavy workload. The test I did was to mix this with a
> >> sequential reader. If you can point me at patches, I should be able to
> >> test this.
> > The latest version of patches is attached to:
> > https://lkml.org/lkml/2011/2/24/125
>
> Perhaps you should fix up the merge conflicts, first? ;-)
>
> +<<<<<<< HEAD
> tid = transaction->t_tid;
> need_to_start = !tid_geq(journal->j_commit_request, tid);
> +=======
> + __jbd2_log_start_commit(journal, transaction->t_tid, false);
> +>>>>>>> jbd2: Refine commit writeout logic
Doh, how embarrassing ;). Attached is a new version which compiles and
seems to run OK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (1.89 kB)
0001-jbd2-Refine-commit-writeout-logic.patch (10.57 kB)
0002-jbd-Refine-commit-writeout-logic.patch (9.61 kB)

2011-03-02 21:21:09

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara <[email protected]> writes:

> On Wed 02-03-11 11:13:53, Jeff Moyer wrote:
>> Jan Kara <[email protected]> writes:
>> > On Tue 01-03-11 14:56:43, Jeff Moyer wrote:
>> >> Corrado Zoccolo <[email protected]> writes:
>> >>
>> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <[email protected]> wrote:
>> >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote:
>> >> >>> Though these patches can not totally recovered the problem, but they are
>> >> >>> quite helpful with ccache enabled situation. It increase 10% performance
>> >> >>> on 38-rc1 kernel.
>> >> >>  OK and what was the original performance drop with WRITE_SYNC change?
>> >> >>
>> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here:
>> >> >>> with patches:
>> >> >>  I'm attaching patches rebased on top of latest Linus's tree.
>> >> >> Corrado, could you possibly run your fsync-heavy tests so that we see
>> >> >> whether there isn't negative impact of my patches on your fsync-heavy
>> >> >> workload? Thanks.
>> >> > The workload was actually Jeff's, and the stalls that my change tried
>> >> > to mitigate showed up on his enterprise class storage. Adding him so
>> >> > he can test it.
>> >>
>> >> Sorry for the late reply. You can use either fs_mark or iozone to
>> >> generate an fsync-heavy workload. The test I did was to mix this with a
>> >> sequential reader. If you can point me at patches, I should be able to
>> >> test this.
>> > The latest version of patches is attached to:
>> > https://lkml.org/lkml/2011/2/24/125
>>
>> Perhaps you should fix up the merge conflicts, first? ;-)
>>
>> +<<<<<<< HEAD
>> tid = transaction->t_tid;
>> need_to_start = !tid_geq(journal->j_commit_request, tid);
>> +=======
>> + __jbd2_log_start_commit(journal, transaction->t_tid, false);
>> +>>>>>>> jbd2: Refine commit writeout logic
> Doh, how embarrassing ;). Attached is a new version which compiles and
> seems to run OK.
>
> Honza

Thanks, Jan. I should have results for you tomorrow.

Cheers,
Jeff

2011-03-03 01:14:24

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Hi, Jan,

So, the results are in. The test workload is an fs_mark process writing
out 64k files and fsyncing each file after it's written. Concurrently
with this is a fio job running a buffered sequential reader (bsr). Each
data point is the average of 10 runs, after throwing out the first run.
File system mount options are left at their defaults, which means that
barriers are on. The storage is an HP EVA, connected to the host via a
single 4Gb FC path.
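
The buffered sequential reader half of the workload can be approximated just
as simply; the sketch below is a stand-in for the fio bsr job, with the file
path and read size as placeholders rather than the actual fio job parameters.

/*
 * Stream through a large existing file with ordinary buffered reads --
 * a rough stand-in for the fio buffered sequential reader (bsr) job.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "bigfile";
    static char buf[1024 * 1024];   /* 1 MB buffered reads */
    long long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;     /* these reads compete with the fsync-driven writes */
    if (n < 0)
        perror("read");
    printf("read %lld bytes\n", total);
    close(fd);
    return 0;
}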

ext3 looks marginally better with your patches. We get better files/sec
AND better throughput from the buffered reader. For ext4, the results
are less encouraging. We see a drop in files/sec, and an increase in
throughput for the sequential reader. So, the fsync-ing is being
starved a bit more than before.

        ||       ext3        ||       ext4        ||
        || fs_mark | fio bsr || fs_mark | fio bsr ||
--------++---------+---------++---------+---------||
vanilla || 517.535 |  178187 || 408.547 |  277130 ||
patched || 540.34  |  182312 || 342.813 |  294655 ||
====================================================
%diff   ||  +4.4%  |   +2.3% || -16.1%  |   +6.3% ||

I'm tired right now, but I'll have a look at your ext4 patch in the
morning and see if I can come up with a good reason for this drop.

Let me know if you have any questions.

Cheers,
Jeff

2011-03-04 15:32:54

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Hi Jeff,
On Wed 02-03-11 20:14:13, Jeff Moyer wrote:
> So, the results are in. The test workload is an fs_mark process writing
> out 64k files and fsyncing each file after it's written. Concurrently
> with this is a fio job running a buffered sequential reader (bsr). Each
> data point is the average of 10 runs, after throwing out the first run.
> File system mount options are left at their defaults, which means that
> barriers are on. The storage is an HP EVA, connected to the host via a
> single 4Gb FC path.
Thanks a lot for testing! BTW: does fs_mark run in a single thread, or do you
use more threads?

> ext3 looks marginally better with your patches. We get better files/sec
> AND better throughput from the buffered reader. For ext4, the results
> are less encouraging. We see a drop in files/sec, and an increase in
> throughput for the sequential reader. So, the fsync-ing is being
> starved a bit more than before.
>
> || ext3 || ext4 ||
> || fs_mark | fio bsr || fs_mark | fio bsr ||
> --------++---------+---------++---------+---------||
> vanilla || 517.535 | 178187 || 408.547 | 277130 ||
> patched || 540.34 | 182312 || 342.813 | 294655 ||
> ====================================================
> %diff || +4.4% | +2.3% || -16.1% | +6.3% ||
Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more
than happy with ext3 results since I just wanted to verify that fsync load
doesn't degrade too much with the improved logic preferring non-fsync load
more than we used to.

I'm not so happy with ext4 results. The difference between ext3 and ext4
might be that amount of data written by kjournald in ext3 is considerably
larger if it ends up pushing out data (because of data=ordered mode) as
well. With ext4, all data are written by filemap_fdatawrite() from fsync
because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
is hurting us with your fast storage and small amount of written data? With
WRITE_SYNC, the data would already be on its way to storage before we get to
wait for it...

Or it could be that we really send more data in WRITE mode rather than in
WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
blktrace). But I wonder how that could happen...

Bye
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-04 15:40:44

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara <[email protected]> writes:

> Hi Jeff,
> On Wed 02-03-11 20:14:13, Jeff Moyer wrote:
>> So, the results are in. The test workload is an fs_mark process writing
>> out 64k files and fsyncing each file after it's written. Concurrently
>> with this is a fio job running a buffered sequential reader (bsr). Each
>> data point is the average of 10 runs, after throwing out the first run.
>> File system mount options are left at their defaults, which means that
>> barriers are on. The storage is an HP EVA, connected to the host via a
>> single 4Gb FC path.
> Thanks a lot for testing! BTW: fs_mark runs in a single thread or do you
> use more threads?

I use a single fs_mark thread. FWIW, I also tested just fs_mark, and
those numbers look good.

>> ext3 looks marginally better with your patches. We get better files/sec
>> AND better throughput from the buffered reader. For ext4, the results
>> are less encouraging. We see a drop in files/sec, and an increase in
>> throughput for the sequential reader. So, the fsync-ing is being
>> starved a bit more than before.
>>
>> || ext3 || ext4 ||
>> || fs_mark | fio bsr || fs_mark | fio bsr ||
>> --------++---------+---------++---------+---------||
>> vanilla || 517.535 | 178187 || 408.547 | 277130 ||
>> patched || 540.34 | 182312 || 342.813 | 294655 ||
>> ====================================================
>> %diff || +4.4% | +2.3% || -16.1% | +6.3% ||
> Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more
> than happy with ext3 results since I just wanted to verify that fsync load
> doesn't degrade too much with the improved logic preferring non-fsync load
> more than we used to.
>
> I'm not so happy with ext4 results. The difference between ext3 and ext4
> might be that amount of data written by kjournald in ext3 is considerably
> larger if it ends up pushing out data (because of data=ordered mode) as
> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> is hurting us with your fast storage and small amount of written data? With
> WRITE_SYNC, data would be already on it's way to storage before we get to
> wait for them...
>
> Or it could be that we really send more data in WRITE mode rather than in
> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> blktrace). But I wonder how that could happen...

Yeah, I've collected blktrace data and I'll get to evaluating that.
Sorry, I ran out of time yesterday.

Cheers,
Jeff

2011-03-04 15:50:41

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara <[email protected]> writes:

> I'm not so happy with ext4 results. The difference between ext3 and ext4
> might be that amount of data written by kjournald in ext3 is considerably
> larger if it ends up pushing out data (because of data=ordered mode) as
> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> is hurting us with your fast storage and small amount of written data? With
> WRITE_SYNC, data would be already on it's way to storage before we get to
> wait for them...

> Or it could be that we really send more data in WRITE mode rather than in
> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> blktrace). But I wonder how that could happen...

It looks like this is the case; the I/O isn't coming down as
synchronous. I'm seeing a lot of writes and very few sync writes, which
means that the write stream will be preempted by the incoming reads.

Time to audit that fsync path and make sure it's marked properly, I
guess.

Cheers,
Jeff

2011-03-04 18:28:07

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jeff Moyer <[email protected]> writes:

> Jan Kara <[email protected]> writes:
>
>> I'm not so happy with ext4 results. The difference between ext3 and ext4
>> might be that amount of data written by kjournald in ext3 is considerably
>> larger if it ends up pushing out data (because of data=ordered mode) as
>> well. With ext4, all data are written by filemap_fdatawrite() from fsync
>> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
>> is hurting us with your fast storage and small amount of written data? With
>> WRITE_SYNC, data would be already on it's way to storage before we get to
>> wait for them...
>
>> Or it could be that we really send more data in WRITE mode rather than in
>> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
>> blktrace). But I wonder how that could happen...
>
> It looks like this is the case, the I/O isn't coming down as
> synchronous. I'm seeing a lot of writes, very few write sync's, which
> means that the write stream will be preempted by the incoming reads.
>
> Time to audit that fsync path and make sure it's marked properly, I
> guess.

OK, I spoke too soon. Here's the blktrace summary information (I re-ran
the tests using 3 samples, the blktrace is from the last run of the
three in each case):

Vanilla
-------
fs_mark: 307.288 files/sec
fio: 286509 KB/s

Total (sde):
Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB
Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB
Reads Requeued: 0 Writes Requeued: 27
Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB
Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB
IO unplugs: 578,102 Timer unplugs: 0

Throughput (R/W): 282,797KiB/s / 20,776KiB/s
Events (sde): 16,724,303 entries

Patched
-------
fs_mark: 278.587 files/sec
fio: 298007 KB/s

Total (sde):
Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB
Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB
Reads Requeued: 0 Writes Requeued: 33
Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB
Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB
IO unplugs: 580,308 Timer unplugs: 0

Throughput (R/W): 288,771KiB/s / 20,832KiB/s
Events (sde): 14,030,610 entries

So, it appears we flush out writes much more aggressively without the
patch in place. I'm not sure why the write bandwidth looks to be higher
in the patched case... odd.
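
One way to read the two summaries is through the average size of a dispatched
write: the amount of data written is nearly the same, but the patched kernel
issues roughly half as many write dispatches. A trivial calculation from the
numbers above:

/* Average dispatched write size, computed from the blktrace totals above. */
#include <stdio.h>

int main(void)
{
    double vanilla_mib = 6244.0, vanilla_dispatches = 648046.0;
    double patched_mib = 6264.0, patched_dispatches = 327404.0;

    printf("vanilla: %.1f KiB per dispatched write\n",
           vanilla_mib * 1024.0 / vanilla_dispatches);   /* ~9.9 KiB  */
    printf("patched: %.1f KiB per dispatched write\n",
           patched_mib * 1024.0 / patched_dispatches);   /* ~19.6 KiB */
    return 0;
}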

Cheers,
Jeff

2011-03-22 07:36:39

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine


On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote:
> Jeff Moyer <[email protected]> writes:
>
> > Jan Kara <[email protected]> writes:
> >
> >> I'm not so happy with ext4 results. The difference between ext3 and ext4
> >> might be that amount of data written by kjournald in ext3 is considerably
> >> larger if it ends up pushing out data (because of data=ordered mode) as
> >> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> >> is hurting us with your fast storage and small amount of written data? With
> >> WRITE_SYNC, data would be already on it's way to storage before we get to
> >> wait for them...
> >
> >> Or it could be that we really send more data in WRITE mode rather than in
> >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> >> blktrace). But I wonder how that could happen...
> >
> > It looks like this is the case, the I/O isn't coming down as
> > synchronous. I'm seeing a lot of writes, very few write sync's, which
> > means that the write stream will be preempted by the incoming reads.
> >
> > Time to audit that fsync path and make sure it's marked properly, I
> > guess.
>
> OK, I spoke too soon. Here's the blktrace summary information (I re-ran
> the tests using 3 samples, the blktrace is from the last run of the
> three in each case):
>
> Vanilla
> -------
> fs_mark: 307.288 files/sec
> fio: 286509 KB/s
>
> Total (sde):
> Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB
> Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB
> Reads Requeued: 0 Writes Requeued: 27
> Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB
> Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB
> IO unplugs: 578,102 Timer unplugs: 0
>
> Throughput (R/W): 282,797KiB/s / 20,776KiB/s
> Events (sde): 16,724,303 entries
>
> Patched
> -------
> fs_mark: 278.587 files/sec
> fio: 298007 KB/s
>
> Total (sde):
> Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB
> Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB
> Reads Requeued: 0 Writes Requeued: 33
> Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB
> Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB
> IO unplugs: 580,308 Timer unplugs: 0
>
> Throughput (R/W): 288,771KiB/s / 20,832KiB/s
> Events (sde): 14,030,610 entries
>
> So, it appears we flush out writes much more aggressively without the
> patch in place. I'm not sure why the write bandwidth looks to be higher
> in the patched case... odd.
>

Jan:
Do you have any new ideas on this?

2011-03-22 16:14:12

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Tue 22-03-11 15:38:19, Alex,Shi wrote:
> On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote:
> > Jeff Moyer <[email protected]> writes:
> >
> > > Jan Kara <[email protected]> writes:
> > >
> > >> I'm not so happy with ext4 results. The difference between ext3 and ext4
> > >> might be that amount of data written by kjournald in ext3 is considerably
> > >> larger if it ends up pushing out data (because of data=ordered mode) as
> > >> well. With ext4, all data are written by filemap_fdatawrite() from fsync
> > >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG
> > >> is hurting us with your fast storage and small amount of written data? With
> > >> WRITE_SYNC, data would be already on it's way to storage before we get to
> > >> wait for them...
> > >
> > >> Or it could be that we really send more data in WRITE mode rather than in
> > >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with
> > >> blktrace). But I wonder how that could happen...
> > >
> > > It looks like this is the case, the I/O isn't coming down as
> > > synchronous. I'm seeing a lot of writes, very few write sync's, which
> > > means that the write stream will be preempted by the incoming reads.
> > >
> > > Time to audit that fsync path and make sure it's marked properly, I
> > > guess.
> >
> > OK, I spoke too soon. Here's the blktrace summary information (I re-ran
> > the tests using 3 samples, the blktrace is from the last run of the
> > three in each case):
> >
> > Vanilla
> > -------
> > fs_mark: 307.288 files/sec
> > fio: 286509 KB/s
> >
> > Total (sde):
> > Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB
> > Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB
> > Reads Requeued: 0 Writes Requeued: 27
> > Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB
> > Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB
> > IO unplugs: 578,102 Timer unplugs: 0
> >
> > Throughput (R/W): 282,797KiB/s / 20,776KiB/s
> > Events (sde): 16,724,303 entries
> >
> > Patched
> > -------
> > fs_mark: 278.587 files/sec
> > fio: 298007 KB/s
> >
> > Total (sde):
> > Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB
> > Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB
> > Reads Requeued: 0 Writes Requeued: 33
> > Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB
> > Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB
> > IO unplugs: 580,308 Timer unplugs: 0
> >
> > Throughput (R/W): 288,771KiB/s / 20,832KiB/s
> > Events (sde): 14,030,610 entries
> >
> > So, it appears we flush out writes much more aggressively without the
> > patch in place. I'm not sure why the write bandwidth looks to be higher
> > in the patched case... odd.
>
> Jan:
> Do you have new idea on this?
I was looking at the block traces for quite some time but I couldn't find
the reason why fs_mark is slower with my patch. Actually, looking at the
data now, I don't even understand how fs_mark can report lower files/sec
values.

Both block traces were taken for 300 seconds. From the above stats, we see
that on both kernels, we wrote 6.2 GB over that time. Looking at more
detailed stats I made, fs_mark processes wrote 4094 MB on vanilla kernel
and 4107 MB on the patched kernel. Given that they just sequentially create
and fsync 64 KB files, files/sec ratio should be about the same with both
kernels. So I'm really missing how fs_mark arrives at different files/sec
values or how with such different values it happens that the amount written
is actually the same. Anyone has any idea?

Looking at how fs_mark works and at the differences in the trace files -
could it be that the difference is caused by a difference in how the log
files each fs_mark thread writes are flushed? Or possibly by I/O caused
by unlinking the created files somehow leaking into the time of the next
measured fs_mark run in one case and not in the other? Jeff, I suppose
the log files of the fs_mark processes are on the same device as the
test directory, aren't they - that might explain the flusher thread
doing IO? The attached patch should limit the interactions. If you have
time to test fs_mark with this patch applied - does it make any
difference?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (4.45 kB)
fs_mark.c.diff (462.00 B)

2011-03-22 17:46:19

by Jeff Moyer

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

Jan Kara <[email protected]> writes:

> Jeff, I suppose the log files of fs_mark processes are on the same
> device as the test directory, aren't they - that might explain the
> flusher thread doing IO? The patch below

No. My test environments are sane. The file system is on a completely
separate disk and it is unmounted and remounted in between each run.
That should take care of any lost flushes.

Did you receive my last couple of private emails commenting on the trace
data?

Cheers,
Jeff

2011-03-24 06:43:40

by Alex Shi

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Wed, 2011-03-23 at 01:46 +0800, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > Jeff, I suppose the log files of fs_mark processes are on the same
> > device as the test directory, aren't they - that might explain the
> > flusher thread doing IO? The patch below
>
> No. My test environments are sane. The file system is on a completely
> separate disk and it is unmounted and remounted in between each run.
> That should take care of any lost flushes.

Jeff, why not give Jan's fs_mark patch a try? Maybe the results will
show some change. :)

>
> Did you receive my last couple of private emails commenting on the trace
> data?
>
> Cheers,
> Jeff

2011-03-28 19:49:01

by Jan Kara

Subject: Re: [performance bug] kernel building regression on 64 LCPUs machine

On Tue 22-03-11 13:46:06, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > Jeff, I suppose the log files of fs_mark processes are on the same
> > device as the test directory, aren't they - that might explain the
> > flusher thread doing IO? The patch below
>
> No. My test environments are sane. The file system is on a completely
> separate disk and it is unmounted and remounted in between each run.
> That should take care of any lost flushes.
OK, I understand, but let me clear up one thing. What fs_mark seems to be
doing is:
for given number of iterations {               (-L option, default 1)
    fork each thread {
        start timer
        for i = 1 .. 10000 {                   (could be changed by option -n)
            write file i
        }
        stop timer
        for i = 1 .. 10000 {
            unlink file i
        }
        write statistics to log file
        exit
    }
    read all log files, write combined results to another log file
}

I see from the blktrace data that you are indeed running more than one
iteration of the fs_mark process, so the problem I was wondering about is
whether the unlinking or the writing of log files could interfere with the
IO happening while the timer is running.

Now the log files are rather small, so I don't really think they cause any
problem, but they could be the data the flusher thread writes - they are
stored in the current directory (unless you use the -l option). So my
question was aiming at where these files are stored - do you specify the
-l option, and if not, what is the current directory of the fs_mark
process in your setup?

I think unlinks could possibly cause problems and that's why I suggested we
add sync() to the parent process before it runs the next iteration...
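
The attached fs_mark.c.diff itself is not reproduced in this archive, but the
idea is just the one stated above. A hypothetical sketch of that kind of
change to the parent loop (the structure and names here are illustrative, not
the actual fs_mark source):

/*
 * Illustrative only -- not the real fs_mark code or the attached diff.
 * Flush all leftover dirty data (unlinked files, log writes) before the
 * next timed iteration starts, so it cannot leak into its measurement.
 */
#include <stdio.h>
#include <unistd.h>

/* Stub standing in for one measured fs_mark round: fork threads, write and
 * fsync files while the timer runs, then unlink them and log statistics. */
static void run_one_iteration(int round)
{
    printf("iteration %d\n", round);
}

static void run_all_iterations(int iterations)
{
    int i;

    for (i = 0; i < iterations; i++) {
        sync();                 /* proposed addition: drain pending I/O first */
        run_one_iteration(i);
    }
}

int main(void)
{
    run_all_iterations(3);      /* e.g. three -L iterations */
    return 0;
}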

> Did you receive my last couple of private emails commenting on the trace
> data?
Yes, I did get them. I'll reply to them in a while.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR