2008-01-18 08:19:08

by Martin Knoblauch

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

----- Original Message ----
> From: Mel Gorman <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Fengguang Wu <[email protected]>; Mike Snitzer <[email protected]>; Peter Zijlstra <[email protected]>; [email protected]; Ingo Molnar <[email protected]>; [email protected]; "[email protected]" <[email protected]>; Linus Torvalds <[email protected]>; [email protected]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > The effect definitely depends on the IO hardware. I performed the
> > same tests on a different box with an AACRAID controller and there
> > things look different.
>
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
dd1       325        350          290
dd1-dir   180        160          160
dd2       2x90       2x113        2x110
dd2-dir   2x120      2x92         2x93
dd3       3x54       3x70         3x70
dd3-dir   3x83       3x64         3x64
mix3      55,2x30    400,2x25     310,2x25
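
(For reference, the test cases are roughly of the following form - this is
a sketch with placeholder paths and sizes, and the -dir variants just add
oflag=direct:

  # dd1: one big sequential buffered write
  dd if=/dev/zero of=/scratch/big1 bs=1M count=5000

  # dd1-dir: the same write using direct IO
  dd if=/dev/zero of=/scratch/big1 bs=1M count=5000 oflag=direct

dd2/dd3 are two or three of these running in parallel against different
files, and mix3 combines one local-disk stream with two streams to the
other targets described earlier in the thread.)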

What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

So, on this box your patch is definitely needed to get the pre-2.6.24 performance
when writing a single big file.

Actually things on the CCISS box might be even more complicated. I forgot the fact
that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?

Anyway: does your patch only address this performance issue, or are there also
data integrity concerns without it? I may consider reverting the patch for my
production environment. It really helps two thirds of my boxes big time, while it does
not hurt the other third that much :-)
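
(For the production boxes that would just mean building a kernel with
that one commit backed out, e.g. in a git checkout of Linus' tree:

  git revert 81eabcbe0b991ddef5216f30ae91c4b226d54b6d

and rebuilding.)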

> >
> > I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> >
>
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough
> to randomise the freelists, affecting what color of page is used for the
> dd test.
>

ouch :-) OK, I will try that.
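
Presumably something along these lines (paths made up):

  # assumes /usr/src/linux is an already configured kernel tree
  for i in $(seq 1 24); do
      cp -a /usr/src/linux /scratch/linux-$i
      ( cd /scratch/linux-$i && make -j2 > /dev/null 2>&1 ) &
  done
  wait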

Martin


2008-01-18 16:01:23

by Mel Gorman

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On (18/01/08 00:19), Martin Knoblauch didst pronounce:
> > > The effect definitely depends on the IO hardware. I performed the
> > > same tests on a different box with an AACRAID controller and there
> > > things look different.
> >
> > I take it different also means it does not show this odd performance
> > behaviour and is similar whether the patch is applied or not?
> >
>
> Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:
>
> Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> dd1       325        350          290
> dd1-dir   180        160          160
> dd2       2x90       2x113        2x110
> dd2-dir   2x120      2x92         2x93
> dd3       3x54       3x70         3x70
> dd3-dir   3x83       3x64         3x64
> mix3      55,2x30    400,2x25     310,2x25
>
> What we are seeing here is that:
>
> a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
> b) Reverting your patch hurts single stream

Right, and this is consistent with other complaints about the PFN of the
page mattering to some hardware.

> c) dual/triple stream are not affected by your patch and are improved over 2.6.19

I am not very surprised. The callers to the page allocator are probably
making no special effort to get a batch of pages in PFN-order. They are just
assuming that subsequent calls give contiguous pages. With two or more
threads involved, there will not be a correlation between physical pages
and what is on disk any more.

> d) the mix3 performance is improved compared to 2.6.19.
> d1) reverting your patch hurts the local-disk part of mix3
> e) the AACRAID setup is definitely faster than the CCISS.
>
> So, on this box your patch is definitely needed to get the pre-2.6.24 performance
> when writing a single big file.
>
> Actually things on the CCISS box might be even more complicated. I forgot the fact
> that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
> ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?
>

I don't have enough experience with LVM setups to make an intelligent
guess.

> Anyway: does your patch only address this performance issue, or are there also
> data integrity concerns without it?

Performance issue only. There are no data integrity concerns with that
patch.

> I may consider reverting the patch for my
> production environment. It really helps two thirds of my boxes big time, while it does
> not hurt the other third that much :-)
>

That is certainly an option.

> > >
> > > I can certainly stress the box before doing the tests. Please
> > > define "many" for the kernel compiles :-)
> > >
> >
> > With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> > all simultaneously. Running that for 20-30 minutes should be enough
> > to randomise the freelists, affecting what color of page is used for the
> > dd test.
> >
>
> ouch :-) OK, I will try that.
>

Thanks.

--
Mel Gorman
Part-time Phd Student          Linux Technology Center
University of Limerick         IBM Dublin Software Lab

2008-01-18 17:48:06

by Linus Torvalds

Subject: Re: regression: 100% io-wait with 2.6.24-rcX



On Fri, 18 Jan 2008, Mel Gorman wrote:
>
> Right, and this is consistent with other complaints about the PFN of the
> page mattering to some hardware.

I don't think it's actually the PFN per se.

I think it's simply that some controllers (quite probably affected by both
driver and hardware limits) have some subtle interactions with the size of
the IO commands.

For example, let's say that you have a controller that has some limit X on
the size of IO in flight (whether due to hardware or driver issues doesn't
really matter) in addition to a limit on the size of the scatter-gather
size. They all tend to have limits, and they differ.

Now, the PFN doesn't matter per se, but the allocation pattern definitely
matters for whether the IO's are physically contiguous, and thus matters
for the size of the scatter-gather thing.

Now, generally the rule-of-thumb is that you want big commands, so
physical merging is good for you, but I could well imagine that the IO
limits interact, and end up hurting each other. Let's say that a better
allocation order allows for bigger contiguous physical areas, and thus
fewer scatter-gather entries.

What does that result in? The obvious answer is

"Better performance obviously, because the controller needs to do fewer
scatter-gather lookups, and the requests are bigger, because there are
fewer IO's that hit scatter-gather limits!"

Agreed?

Except maybe the *real* answer for some controllers ends up being

"Worse performance, because individual commands grow because they don't
hit the per-command limits, but now we hit the global size-in-flight
limits and have many fewer of these good commands in flight. And while
the commands are larger, it means that there are fewer outstanding
commands, which can mean that the disk cannot schedule things
as well, or makes high latency of command generation by the controller
much more visible because there aren't enough concurrent requests
queued up to hide it"

Is this the reason? I have no idea. But somebody who knows the AACRAID
hardware and driver limits might think about interactions like that.
Sometimes you actually might want to have smaller individual commands if
there is some other limit that means that it can be more advantageous to
have many small requests over a few big ones.
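
To put made-up numbers on it: if the total size-in-flight limit were, say,
1MB, then 64kB commands let you keep 16 commands outstanding, while growing
them to 256kB through better merging leaves only 4, which is more efficient
per command but gives the disk and firmware much less to work with.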

RAID might well make it worse. Maybe small requests work better because
they are simpler to schedule because they only hit one disk (eg if you
have simple striping)! So that's another reason why one *large* request
may actually be slower than two requests half the size, even if it's
against the "normal rule".

And it may be that that AACRAID box takes a big hit on DIO exactly because
DIO has been optimized almost purely for making one command as big as
possible.

Just a theory.

Linus

2008-01-18 19:01:13

by Martin Knoblauch

Subject: Re: regression: 100% io-wait with 2.6.24-rcX


--- Linus Torvalds <[email protected]> wrote:

>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by
> both driver and hardware limits) have some subtle interactions with
> the size of the IO commands.
>
> For example, let's say that you have a controller that has some limit
> X on the size of IO in flight (whether due to hardware or driver
> issues doesn't really matter) in addition to a limit on the size
> of the scatter-gather size. They all tend to have limits, and
> they differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern
> definitely matters for whether the IO's are physically
> contiguous, and thus matters for the size of the scatter-gather
> thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the
> IO limits interact, and end up hurting each other. Let's say that a
> better allocation order allows for bigger contiguous physical areas,
> and thus fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
> "Better performance obviously, because the controller needs to do
> fewer scatter-gather lookups, and the requests are bigger, because
> there are fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers ends up being
>
> "Worse performance, because individual commands grow because they
> don't hit the per-command limits, but now we hit the global
> size-in-flight limits and have many fewer of these good commands in
> flight. And while the commands are larger, it means that there
> are fewer outstanding commands, which can mean that the disk
> cannot schedule things as well, or makes high latency of command
> generation by the controller much more visible because there aren't
> enough concurrent requests queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the
> AACRAID hardware and driver limits might think about interactions
> like that. Sometimes you actually might want to have smaller
> individual commands if there is some other limit that means that
> it can be more advantageous to have many small requests over a
> few big ones.
>
> RAID might well make it worse. Maybe small requests work better
> because they are simpler to schedule because they only hit one
> disk (eg if you have simple striping)! So that's another reason
> why one *large* request may actually be slower than two requests
> half the size, even if it's against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO
> exactly because DIO has been optimized almost purely for making
> one command as big as possible.
>
> Just a theory.
>
> Linus

just to make one thing clear - I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch. It is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.

What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit of reverting Mel's patch.

I dirtied the system after reboot as Mel suggested (24 parallel kernel
builds) and repeated the tests. The dirtying did not make any
difference. Here are the results:

Test      -rc8      -rc8-without-Mels-Patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

The big IO size with Mel's patch really has a devastating effect on
the parallel writes - they are nowhere near the values one would expect,
while the numbers without Mel's patch are as good as in rc1-rc5. Too bad
I did not see this earlier. Maybe we could have found a solution for .24.

At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that.
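
So far the only thing I can think of is comparing the average request
size (avgrq-sz from "iostat -x") that reaches the LVM volume and the
underlying CCISS device while the dd tests run:

  # run alongside one of the dd tests; compare the dm-* row (the LVM
  # volume) with the cciss/c0d0 row to see where the requests get split
  iostat -x 5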

Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

2008-01-18 19:24:08

by Linus Torvalds

Subject: Re: regression: 100% io-wait with 2.6.24-rcX



On Fri, 18 Jan 2008, Martin Knoblauch wrote:
>
> just to make one thing clear - I am not so much concerned about the
> performance of AACRAID. It is OK with or without Mel's patch. It is
> better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
> completely independent of Mel's stuff.
>
> What interests me much more is the behaviour of the CCISS+LVM based
> system. Here I see a huge benefit of reverting Mel's patch.

Ok, I just got your usage cases confused.

The argument stays the same: some controllers/drivers may have subtle
behavioural differences that come from the IO limits themselves.

So it wasn't AACRAID, it was CCISS+LVM. The argument is the same: it may
well be that the *bigger* IO sizes are actually what hurts, even if the
conventional wisdom is traditionally that bigger submissions are better.

> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
> suggestions on how to debug that.

I think you need to ask the MD/DM people for suggestions.. who aren't
cc'd here.

Linus

2008-01-18 20:00:31

by Mike Snitzer

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 18, 2008 12:46 PM, Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN of the
> > page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by both
> driver and hardware limits) have some subtle interactions with the size of
> the IO commands.
>
> For example, let's say that you have a controller that has some limit X on
> the size of IO in flight (whether due to hardware or driver issues doesn't
> really matter) in addition to a limit on the size of the scatter-gather
> size. They all tend to have limits, and they differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern definitely
> matters for whether the IO's are physically contiguous, and thus matters
> for the size of the scatter-gather thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the IO
> limits interact, and end up hurting each other. Let's say that a better
> allocation order allows for bigger contiguous physical areas, and thus
> fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
> "Better performance obviously, because the controller needs to do fewer
> scatter-gather lookups, and the requests are bigger, because there are
> fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers ends up being
>
> "Worse performance, because individual commands grow because they don't
> hit the per-command limits, but now we hit the global size-in-flight
> limits and have many fewer of these good commands in flight. And while
> the commands are larger, it means that there are fewer outstanding
> commands, which can mean that the disk cannot schedule things
> as well, or makes high latency of command generation by the controller
> much more visible because there aren't enough concurrent requests
> queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the AACRAID
> hardware and driver limits might think about interactions like that.
> Sometimes you actually might want to have smaller individual commands if
> there is some other limit that means that it can be more advantageous to
> have many small requests over a few big ones.
>
> RAID might well make it worse. Maybe small requests work better because
> they are simpler to schedule because they only hit one disk (eg if you
> have simple striping)! So that's another reason why one *large* request
> may actually be slower than two requests half the size, even if it's
> against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO exactly because
> DIO has been optimized almost purely for making one command as big as
> possible.
>
> Just a theory.

Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
configured with 5 LUNs (each a 2-disk HW RAID0, 1024k stripe size). That
is, with dd the avgrq-sz (from iostat) shows DIO to be ~130k whereas
non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)

DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k oflag=direct
non-DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k

DIO is ~80MB/s on all 5 LUNs for a total of ~400MB/s
non-DIO is only ~12MB/s on all 5 LUNs for a mere ~70MB/s aggregate
(deadline w/ nr_requests=32)
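
The scheduler and queue depth are just set per LUN through the usual
sysfs knobs, i.e.:

  echo deadline > /sys/block/sdX/queue/scheduler
  echo 32 > /sys/block/sdX/queue/nr_requests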

That calls into question the theory of small requests being beneficial for
AACRAID. Martin, what are you seeing for the avg request size when
you're conducting your AACRAID tests?

I can fire up 2.6.24-rc8 in short order to see if things are vastly
improved (as Martin seems to indicate that he is happy with AACRAID on
2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
still quite good (relative to mine). Martin can you share any tuning
you may have done to get AACRAID to where it is for you right now?

regards,
Mike

2008-01-18 22:47:04

by Mike Snitzer

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Jan 18, 2008 3:00 PM, Mike Snitzer <[email protected]> wrote:
>
> On Jan 18, 2008 12:46 PM, Linus Torvalds <[email protected]> wrote:
> >
> >
> > On Fri, 18 Jan 2008, Mel Gorman wrote:
> > >
> > > Right, and this is consistent with other complaints about the PFN of the
> > > page mattering to some hardware.
> >
> > I don't think it's actually the PFN per se.
> >
> > I think it's simply that some controllers (quite probably affected by both
> > driver and hardware limits) have some subtle interactions with the size of
> > the IO commands.
> >
> > For example, let's say that you have a controller that has some limit X on
> > the size of IO in flight (whether due to hardware or driver issues doesn't
> > really matter) in addition to a limit on the size of the scatter-gather
> > size. They all tend to have limits, and they differ.
> >
> > Now, the PFN doesn't matter per se, but the allocation pattern definitely
> > matters for whether the IO's are physically contiguous, and thus matters
> > for the size of the scatter-gather thing.
> >
> > Now, generally the rule-of-thumb is that you want big commands, so
> > physical merging is good for you, but I could well imagine that the IO
> > limits interact, and end up hurting each other. Let's say that a better
> > allocation order allows for bigger contiguous physical areas, and thus
> > fewer scatter-gather entries.
> >
> > What does that result in? The obvious answer is
> >
> > "Better performance obviously, because the controller needs to do fewer
> > scatter-gather lookups, and the requests are bigger, because there are
> > fewer IO's that hit scatter-gather limits!"
> >
> > Agreed?
> >
> > Except maybe the *real* answer for some controllers ends up being
> >
> > "Worse performance, because individual commands grow because they don't
> > hit the per-command limits, but now we hit the global size-in-flight
> > limits and have many fewer of these good commands in flight. And while
> > the commands are larger, it means that there are fewer outstanding
> > commands, which can mean that the disk cannot schedule things
> > as well, or makes high latency of command generation by the controller
> > much more visible because there aren't enough concurrent requests
> > queued up to hide it"
> >
> > Is this the reason? I have no idea. But somebody who knows the AACRAID
> > hardware and driver limits might think about interactions like that.
> > Sometimes you actually might want to have smaller individual commands if
> > there is some other limit that means that it can be more advantageous to
> > have many small requests over a few big ones.
> >
> > RAID might well make it worse. Maybe small requests work better because
> > they are simpler to schedule because they only hit one disk (eg if you
> > have simple striping)! So that's another reason why one *large* request
> > may actually be slower than two requests half the size, even if it's
> > against the "normal rule".
> >
> > And it may be that that AACRAID box takes a big hit on DIO exactly because
> > DIO has been optimized almost purely for making one command as big as
> > possible.
> >
> > Just a theory.
>
> Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
> configured with 5 LUNs (each a 2-disk HW RAID0, 1024k stripe size). That
> is, with dd the avgrq-sz (from iostat) shows DIO to be ~130k whereas
> non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)
...
> I can fire up 2.6.24-rc8 in short order to see if things are vastly
> improved (as Martin seems to indicate that he is happy with AACRAID on
> 2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
> still quite good (relative to mine). Martin, can you share any tuning
> you may have done to get AACRAID to where it is for you right now?

I can confirm 2.6.24-rc8 behaves like Martin has posted for the
AACRAID: slower DIO with a smaller avgrq-sz, and much faster buffered IO
(for my config anyway) with a much larger avgrq-sz (180K).

I have no idea why 2.6.22.16's request size on non-DIO is _so_ small...

Mike

2008-01-22 14:41:09

by Alasdair G Kergon

Subject: Re: regression: 100% io-wait with 2.6.24-rcX

On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> suggestions on how to debug that.

What is your LVM device configuration?
E.g. 'dmsetup table' and 'dmsetup info -c' output.
Some configurations lead to large IOs getting split up on the way through
device-mapper.
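
It is also worth checking what the underlying cciss device advertises
(the device name here is just an example):

  cat /sys/block/cciss!c0d0/queue/max_sectors_kb
  cat /sys/block/cciss!c0d0/queue/max_hw_sectors_kb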

See if these patches make any difference:
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/

dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
dm-introduce-merge_bvec_fn.patch
dm-linear-add-merge.patch
dm-table-remove-merge_bvec-sector-restriction.patch

Alasdair
--
[email protected]