2007-01-10 22:38:10

by David Chinner

Subject: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown


Discussion thread:

http://oss.sgi.com/archives/xfs/2007-01/msg00052.html

Short story is that buffered writes slowed down by 20-30%
between 2.6.18 and 2.6.19 and became a lot more erratic.
Writing a single file to a single filesystem doesn't appear
to have major problems, but when writing a file per filesystem
and using 3 filesystems performance is much worse on 2.6.19
and is only slightly better on 2.6.20-rc3.

It doesn't appear to be fragmentation (I wrote quite a few
800GB files when testing this and they all had "perfect"
extent layouts, i.e. extents the size of allocation groups
and in sequential AGs). It's not the block devices, either,
as doing the same I/O to the block device gives the same
results.
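
Extent layouts like that can be checked with xfs_bmap; a minimal check,
assuming the mount points used in the test case below:

  # print the extent map of each test file - a handful of AG-sized,
  # sequential extents means the files are effectively unfragmented
  for f in /mnt/dm0/test /mnt/dm1/test /mnt/dm2/test; do
      xfs_bmap -v "$f"
  done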

My test case is effectively:

#!/bin/bash

mkfs.xfs -f -l version=2 -d sunit=512,swidth=2048 /dev/dm-0
mkfs.xfs -f -l version=2 -d sunit=512,swidth=2048 /dev/dm-1
mkfs.xfs -f -l version=2 -d sunit=512,swidth=2048 /dev/dm-2

mount /dev/dm-0 /mnt/dm0
mount /dev/dm-1 /mnt/dm1
mount /dev/dm-2 /mnt/dm2

dd if=/dev/zero of=/mnt/dm0/test bs=1024k count=800k &
dd if=/dev/zero of=/mnt/dm1/test bs=1024k count=800k &
dd if=/dev/zero of=/mnt/dm2/test bs=1024k count=800k &
wait

umount /mnt/dm0
umount /mnt/dm1
umount /mnt/dm2

#EOF

Overall, on 2.6.18 this gave an average of about 240MB/s per
filesystem with minimum write rates of about 190MB/s per fs
(when writing near the inner edge of the disks).

On 2.6.20-rc3, this gave an average of ~200MB/s per fs
with minimum write rates of about 110MB/s per fs, which
occurred randomly throughout the test.

The performance and smoothness is fully restored on 2.6.20-rc3
by setting dirty_ratio down to 10 (from the default 40), so
something in the VM is not working as well as it used to....
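
For reference, lowering the threshold is nothing more than a sysctl
tweak (a sketch using the standard /proc/sys/vm knobs, nothing
setup-specific):

  # drop the foreground dirty limit from the default 40% to 10% of RAM;
  # dirty_background_ratio is left at its default of 10
  sysctl -w vm.dirty_ratio=10
  # equivalent via procfs
  echo 10 > /proc/sys/vm/dirty_ratio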

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


2007-01-10 23:04:39

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, 11 Jan 2007, David Chinner wrote:

> The performance and smoothness is fully restored on 2.6.20-rc3
> by setting dirty_ratio down to 10 (from the default 40), so
> something in the VM is not working as well as it used to....

dirty_background_ratio is left as is at 10? So you gain performance
by switching off background writes via pdflush?

2007-01-10 23:09:19

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Wed, Jan 10, 2007 at 03:04:15PM -0800, Christoph Lameter wrote:
> On Thu, 11 Jan 2007, David Chinner wrote:
>
> > The performance and smoothness is fully restored on 2.6.20-rc3
> > by setting dirty_ratio down to 10 (from the default 40), so
> > something in the VM is not working as well as it used to....
>
> dirty_background_ratio is left as is at 10?

Yes.

> So you gain performance by switching off background writes via pdflush?

Well, pdflush appears to be doing very little on both 2.6.18 and
2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
all of the pdflush threads combined (I've seen up to 7 active at
once) use maybe 1-2% of cpu time. This occurs regardless of the
dirty_ratio setting.
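
One way to sample that while the test runs (a generic sketch, not the
exact commands used here):

  # poll CPU usage of kswapd and the pdflush pool every 5 seconds
  while sleep 5; do
      ps -eo pid,comm,pcpu | grep -E 'kswapd|pdflush'
  done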

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-10 23:12:35

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, 11 Jan 2007, David Chinner wrote:

> Well, pdflush appears to be doing very little on both 2.6.18 and
> 2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
> all of the pdflush threads combined (I've seen up to 7 active at
> once) use maybe 1-2% of cpu time. This occurs regardless of the
> dirty_ratio setting.

That sounds a bit much for kswapd. How many nodes? Any cpusets in use?

An upper limit of 8 exists on the number of pdflush threads. Are these
multiple files or single file transfers?

2007-01-10 23:14:23

by Nick Piggin

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

David Chinner wrote:
> On Wed, Jan 10, 2007 at 03:04:15PM -0800, Christoph Lameter wrote:
>
>>On Thu, 11 Jan 2007, David Chinner wrote:
>>
>>
>>>The performance and smoothness is fully restored on 2.6.20-rc3
>>>by setting dirty_ratio down to 10 (from the default 40), so
>>>something in the VM is not working as well as it used to....
>>
>>dirty_background_ratio is left as is at 10?
>
>
> Yes.
>
>
>>So you gain performance by switching off background writes via pdflush?
>
>
> Well, pdflush appears to be doing very little on both 2.6.18 and
> 2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
> all of the pdflush threads combined (I've seen up to 7 active at
> once) use maybe 1-2% of cpu time. This occurs regardless of the
> dirty_ratio setting.

Hi David,

Could you get /proc/vmstat deltas for each kernel, to start with?

I'm guessing CPU time isn't a problem, but if it is then I guess
profiles as well.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2007-01-10 23:19:08

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Wed, Jan 10, 2007 at 03:12:02PM -0800, Christoph Lameter wrote:
> On Thu, 11 Jan 2007, David Chinner wrote:
>
> > Well, pdflush appears to be doing very little on both 2.6.18 and
> > 2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
> > all of the pdflush threads combined (I've seen up to 7 active at
> > once) use maybe 1-2% of cpu time. This occurs regardless of the
> > dirty_ratio setting.
>
> That sounds a bit much for kswapd. How many nodes? Any cpusets in use?

It's an x86-64 box - an XE 240 - 4 core, 16GB RAM, single node, no cpusets.

> An upper limit of 8 exists on the number of pdflush threads. Are these
> multiple files or single file transfers?

See the test case I posted - a single file write per filesystem, three
filesystems being written to at once, all on different, unshared block
devices.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-11 00:32:34

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, Jan 11, 2007 at 10:13:55AM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >On Wed, Jan 10, 2007 at 03:04:15PM -0800, Christoph Lameter wrote:
> >
> >>On Thu, 11 Jan 2007, David Chinner wrote:
> >>
> >>
> >>>The performance and smoothness is fully restored on 2.6.20-rc3
> >>>by setting dirty_ratio down to 10 (from the default 40), so
> >>>something in the VM is not working as well as it used to....
> >>
> >>dirty_background_ratio is left as is at 10?
> >
> >
> >Yes.
> >
> >
> >>So you gain performance by switching off background writes via pdflush?
> >
> >
> >Well, pdflush appears to be doing very little on both 2.6.18 and
> >2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
> >all of the pdflush threads combined (I've seen up to 7 active at
> >once) use maybe 1-2% of cpu time. This occurs regardless of the
> >dirty_ratio setting.
>
> Hi David,
>
> Could you get /proc/vmstat deltas for each kernel, to start with?

Sure, but that doesn't really show how erratic the per-filesystem
throughput is because the test I'm running is PCI-X bus limited in
its throughput at about 750MB/s. Each dm device is capable of about
340MB/s write, so when one slows down, the others will typically
speed up.

So, what I've attached is three files which have both
'vmstat 5' output and 'iostat 5 |grep dm-' output in them.

- 2.6.18.out - 2.6.18 behaviour near start of writes.
  Behaviour does not change over the course of the test,
  just gets a bit slower as the test moves from the outer
  edge of the disk to the inner. Erratic behaviour is
  highlighted.

- 2.6.20-rc3.out - 2.6.20-rc3 behaviour near start of writes.
  Somewhat more erratic than 2.6.18, but about 100-150GB into
  the write test, things change with dirty_ratio=40. Erratic
  behaviour is highlighted.

- 2.6.20-rc3-worse.out - 2.6.20-rc3 behaviour when things go
  bad. We're not keeping the disks or the PCI-X bus fully
  utilised (each dm device can do about 300MB/s at this offset)
  and aggregate throughput has dropped to 500-600MB/s.

With 2.6.20-rc3 and dirty_ratio = 10, the performance drop-off part way
into the test does not occur and the output is almost identical to
2.6.18.out.

> I'm guessing CPU time isn't a problem, but if it is then I guess
> profiles as well.

Plenty of idle cpu so I don't think it's a problem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


Attachments:
2.6.18.out (14.69 kB)
2.6.20-rc3.out (20.74 kB)
2.6.20-rc3-worse.out (9.61 kB)

2007-01-11 00:43:58

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

You are comparing a debian 2.6.18 standard kernel with your tuned version
of 2.6.20-rc3. There may be a lot of differences. Could you get us the
config? Or use the same config file and build 2.6.20/18 the same way.

2007-01-11 01:06:36

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Wed, Jan 10, 2007 at 04:43:36PM -0800, Christoph Lameter wrote:
> You are comparing a debian 2.6.18 standard kernel with your tuned version
> of 2.6.20-rc3. There may be a lot of differences. Could you get us the
> config? Or use the same config file and build 2.6.20/18 the same way.

I took the /proc/config.gz from the debian 2.6.18-1 kernel as the
base config for the 2.6.20-rc3 kernel and did a make oldconfig on
it to make sure it was valid for the newer kernel but pretty much
the same. I think that's the right process, so I don't think
different build configs are the problem here.
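
That carry-forward is essentially the standard oldconfig dance (a
sketch; the exact source tree path is an assumption):

  # reuse the running Debian 2.6.18 config for the 2.6.20-rc3 tree
  cd /usr/src/linux-2.6.20-rc3
  zcat /proc/config.gz > .config
  make oldconfig          # only prompts for options new since 2.6.18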

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-11 01:08:44

by Nick Piggin

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

David Chinner wrote:
> On Thu, Jan 11, 2007 at 10:13:55AM +1100, Nick Piggin wrote:
>
>>David Chinner wrote:
>>
>>>On Wed, Jan 10, 2007 at 03:04:15PM -0800, Christoph Lameter wrote:
>>>
>>>
>>>>On Thu, 11 Jan 2007, David Chinner wrote:
>>>>
>>>>
>>>>
>>>>>The performance and smoothness is fully restored on 2.6.20-rc3
>>>>>by setting dirty_ratio down to 10 (from the default 40), so
>>>>>something in the VM is not working as well as it used to....
>>>>
>>>>dirty_background_ratio is left as is at 10?
>>>
>>>
>>>Yes.
>>>
>>>
>>>
>>>>So you gain performance by switching off background writes via pdflush?
>>>
>>>
>>>Well, pdflush appears to be doing very little on both 2.6.18 and
>>>2.6.20-rc3. In both cases kswapd is consuming 10-20% of a CPU and
>>>all of the pdflush threads combined (I've seen up to 7 active at
>>>once) use maybe 1-2% of cpu time. This occurs regardless of the
>>>dirty_ratio setting.
>>
>>Hi David,
>>
>>Could you get /proc/vmstat deltas for each kernel, to start with?
>
>
> Sure, but that doesn't really show how erratic the per-filesystem
> throughput is because the test I'm running is PCI-X bus limited in
> its throughput at about 750MB/s. Each dm device is capable of about
> 340MB/s write, so when one slows down, the others will typically
> speed up.

But you do also get aggregate throughput drops? (ie. 2.6.20-rc3-worse)

> So, what I've attached is three files which have both
> 'vmstat 5' output and 'iostat 5 |grep dm-' output in them.

Ahh, sorry to be unclear, I meant:

cat /proc/vmstat > pre
run_test
cat /proc/vmstat > post

It might just give us a hint about what is changing (however, vmstat
doesn't give much of interest in the way of pdflush stats, so it might
not show anything).
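
Once the two snapshots exist, the deltas are straightforward to compute,
since /proc/vmstat is just "name value" pairs; a minimal sketch:

  # print per-counter differences between the pre and post snapshots
  awk 'NR==FNR { pre[$1] = $2; next }
       { printf "%-28s %12d\n", $1, $2 - pre[$1] }' pre post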

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2007-01-11 01:11:22

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, Jan 11, 2007 at 10:08:55AM +1100, David Chinner wrote:
> On Wed, Jan 10, 2007 at 03:04:15PM -0800, Christoph Lameter wrote:
> > On Thu, 11 Jan 2007, David Chinner wrote:
> >
> > > The performance and smoothness is fully restored on 2.6.20-rc3
> > > by setting dirty_ratio down to 10 (from the default 40), so
> > > something in the VM is not working as well as it used to....
> >
> > dirty_background_ratio is left as is at 10?
>
> Yes.

FWIW, setting dirty_ratio to 20 instead of 10 fixes most of the
erraticness of the writeback and restores most of the performance as well.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-11 01:24:42

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, Jan 11, 2007 at 12:08:10PM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >Sure, but that doesn't really show how erratic the per-filesystem
> >throughput is because the test I'm running is PCI-X bus limited in
> >its throughput at about 750MB/s. Each dm device is capable of about
> >340MB/s write, so when one slows down, the others will typically
> >speed up.
>
> But you do also get aggregate throughput drops? (ie. 2.6.20-rc3-worse)

Yes - you can see that from the vmstat output I sent.

At 500GB into the write of each file (about 60% of the disks filled)
the per fs write rate should be around 220MB/s, so aggregate should
be around 650MB/s. That's what I'm seeing with 2.6.18 and 2.6.20-rc3
with a tweaked dirty_ratio. Without the dirty_ratio tweak, you see
what is in 2.6.20-rc3-worse.

e.g. I just changed dirty_ratio from 10 to 40 and I've gone from
consistent 210-215MB/s per filesystem (~630-650MB/s aggregate) to
ranging over 110-200MB/s per filesystem and aggregates of ~450-600MB/s.
I changed dirty_ratio back to 10, and within 15 seconds we are back
to consistent 210MB/s per filesystem and 630-650MB/s write.

> >So, what I've attached is three files which have both
> >'vmstat 5' output and 'iostat 5 |grep dm-' output in them.
>
> Ahh, sorry to be unclear, I meant:
>
> cat /proc/vmstat > pre
> run_test
> cat /proc/vmstat > post

Ok, I'll get back to you on that one - even at 600+MB/s, writing 5TB
of data takes some time....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-11 01:41:04

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, 11 Jan 2007, David Chinner wrote:

> On Wed, Jan 10, 2007 at 04:43:36PM -0800, Christoph Lameter wrote:
> > You are comparing a debian 2.6.18 standard kernel with your tuned version
> > of 2.6.20-rc3. There may be a lot of differences. Could you get us the
> > config? Or use the same config file and build 2.6.20/18 the same way.
>
> I took the /proc/config.gz from the debian 2.6.18-1 kernel as the
> base config for the 2.6.20-rc3 kernel and did a make oldconfig on
> it to make sure it was valid for the newer kernel but pretty much
> the same. I think that's the right process, so I don't think
> different build configs are the problem here.

Debian may have added extra patches that are not upstream. I see e.g. some
of my post-2.6.18 patches in there.

2007-01-11 02:58:18

by David Chinner

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Wed, Jan 10, 2007 at 05:40:26PM -0800, Christoph Lameter wrote:
> On Thu, 11 Jan 2007, David Chinner wrote:
>
> > On Wed, Jan 10, 2007 at 04:43:36PM -0800, Christoph Lameter wrote:
> > > You are comparing a debian 2.6.18 standard kernel with your tuned version
> > > of 2.6.20-rc3. There may be a lot of differences. Could you get us the
> > > config? Or use the same config file and build 2.6.20/18 the same way.
> >
> > I took the /proc/config.gz from the debian 2.6.18-1 kernel as the
> > base config for the 2.6.20-rc3 kernel and did a make oldconfig on
> > it to make sure it was valid for the newer kernel but pretty much
> > the same. I think that's the right process, so I don't think
> > different build configs are the problem here.
>
> Debian may have added extra patches that are not upstream. I see e.g. some
> of my post-2.6.18 patches in there.

Did you read the thread I linked in my original report? The original
bug report was for a regression from 2.6.18.1 to 2.6.20-rc3. I have
reproduced the same regression between the debian 2.6.18-1 kernel
and 2.6.20-rc3. I think you're looking in the wrong place for the
cause of the problem....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-01-11 09:23:55

by Nick Piggin

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

Thanks. BTW, you didn't cc this to the list, so I won't either in case
you want it kept private.

David Chinner wrote:
> On Thu, Jan 11, 2007 at 12:08:10PM +1100, Nick Piggin wrote:
>
>>Ahh, sorry to be unclear, I meant:
>>
>> cat /proc/vmstat > pre
>> run_test
>> cat /proc/vmstat > post
>
>
> 6 files attached - 2.6.18 pre/post, 2.6.20-rc3 dirty_ratio = 10 pre/post
> and 2.6.20-rc3 dirty_ratio=40 pre/post.
>
> Cheers,
>
> Dave.
>
>
> ------------------------------------------------------------------------

2007-01-11 09:27:41

by Nick Piggin

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

David Chinner wrote:
> On Thu, Jan 11, 2007 at 12:08:10PM +1100, Nick Piggin wrote:

>>>So, what I've attached is three files which have both
>>>'vmstat 5' output and 'iostat 5 |grep dm-' output in them.
>>
>>Ahh, sorry to be unclear, I meant:
>>
>> cat /proc/vmstat > pre
>> run_test
>> cat /proc/vmstat > post
>
>
> Ok, I'll get back to you on that one - even at 600+MB/s, writing 5TB
> of data takes some time....

OK, according to your vmstat deltas, you are doing an order of magnitude
more writeout off the LRU with 2.6.20-rc3 default than with the smaller
dirty_ratio (53GB of data vs 4GB of data). 2.6.18 does not have that stat,
unfortunately.
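
Assuming the counter in question is nr_vmscan_write (it is not present
in 2.6.18), the data volume can be derived from the snapshot delta with
something like this (a sketch, 4kB pages assumed):

  # pages written off the LRU by reclaim, converted to GB (4kB pages)
  before=$(awk '/^nr_vmscan_write/ { print $2 }' pre)
  after=$(awk '/^nr_vmscan_write/ { print $2 }' post)
  echo "$(( (after - before) * 4 / 1024 / 1024 )) GB written from reclaim"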

allocstall and direct reclaim are way down when the dirty ratio is lower,
but those numbers with vanilla 2.6.20-rc3 are comparable to 2.6.18, which
suggests that kswapd in 2.6.18 is probably also having trouble and may
also be writing out a lot off the LRU.

You're not turning on zone_reclaim, by any chance, are you?

Otherwise, nothing jumps out at me yet. I'll have a bit of a look through
changelogs tomorrow. I guess it could be a pdflush or vmscan change (XFS,
maybe?).

Can you narrow it down at all?

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2007-01-11 17:51:31

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Thu, 11 Jan 2007, Nick Piggin wrote:

> You're not turning on zone_reclaim, by any chance, are you?

It is not a NUMA system, so zone reclaim is not available. Zone reclaim
was already in 2.6.16.

2007-01-12 00:07:17

by Nick Piggin

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

Christoph Lameter wrote:
> On Thu, 11 Jan 2007, Nick Piggin wrote:
>
>
>>You're not turning on zone_reclaim, by any chance, are you?
>
>
> It is not a NUMA system so zone reclaim is not available.

Ah yes... Can't you force it on if you have a NUMA-compiled kernel?

> zone reclaim was
> already in 2.6.16.

Well it was a long shot, but that is something that has had a few
changes recently and is something that could interact badly with
the global pdflush.

--
SUSE Labs, Novell Inc.

2007-01-12 03:05:05

by Christoph Lameter

Subject: Re: [REGRESSION] 2.6.19/2.6.20-rc3 buffered write slowdown

On Fri, 12 Jan 2007, Nick Piggin wrote:

> Ah yes... Can't you force it on if you have a NUMA-compiled kernel?

But it won't do anything since it only comes into action if you have an
off-node allocation. If you run a NUMA kernel on an SMP system then you
only have one node. There is no way that an off-node allocation can occur.

> > zone reclaim was already in 2.6.16.
>
> Well it was a long shot, but that is something that has had a few
> changes recently and is something that could interact badly with
> the global pdflush.

Zone reclaim does not touch dirty pages in its default configuration. It
would only remove clean pagecache pages.
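
For context, zone reclaim behaviour is controlled by the
zone_reclaim_mode bitmask; a quick way to check it (a sketch, assuming
the standard sysctl knob):

  # bitmask: 1 = zone reclaim on, 2 = also write out dirty pages,
  # 4 = also swap pages; the default leaves the write/swap bits clear,
  # so reclaim only drops clean pagecache
  cat /proc/sys/vm/zone_reclaim_mode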