2003-07-02 22:15:19

by Marcelo Tosatti

Subject: Status of the IO scheduler fixes for 2.4


Hello people,

What is the status of the IO scheduler fixes for increased fairness for
2.4 ?

I haven't had time to read and think about everything you guys discussed,
so a brief summary would be very helpful for me.

Danke


2003-07-02 22:19:21

by Marcelo Tosatti

Subject: Re: Status of the IO scheduler fixes for 2.4



On Wed, 2 Jul 2003, Marcelo Tosatti wrote:

>
> Hello people,
>
> What is the status of the IO scheduler fixes for increased fairness for
> 2.4 ?
>
> I haven't had time to read and think about everything you guys discussed,
> so a brief summary would be very helpful for me.
>
> Danke

Ah, we all want the fairness issues to be fixed in 2.4.22, right?

2003-07-03 01:49:26

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
>
> >
> > Hello people,
> >
> > What is the status of the IO scheduler fixes for increased fairness for
> > 2.4 ?
> >
> > I haven't had time to read and think about everything you guys discussed,
> > so a brief summary would be very helpful for me.
> >
> > Danke
>
> Ah, we all want the fairness issues to be fixed in 2.4.22, right?

My current code is attached; it's basically a merge of these three patches,
with modifications based on benchmarks and latency measurements here.

fix_pausing: From Andrea, it fixes a few corner case races where
wakeups can be missed in wait_on_buffer, wait_on_page, and
__get_request_wait.
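
For context, the class of race these fix is the classic missed-wakeup
pattern; roughly (an illustrative sketch modeled on 2.4's unlock_buffer,
not the actual patch):

/* Without a barrier between clearing the lock bit and testing the
 * waitqueue, the two operations can be reordered and a task that is
 * just going to sleep in wait_on_buffer never sees the wakeup. */
void unlock_buffer(struct buffer_head *bh)
{
        clear_bit(BH_Lock, &bh->b_state);
        smp_mb__after_clear_bit();  /* order clear_bit vs. the test below */
        if (waitqueue_active(&bh->b_wait))
                wake_up(&bh->b_wait);
}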

elevator-low-latency: From Andrea, it limits the amount of io on a given
queue to a reasonable level. This prevents a small number of huge
requests from introducing large latencies for smaller requests.
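
The throttling idea, roughly (a sketch; q->max_queue_sectors is from the
patch, the other field names are illustrative):

/* Refuse to let a queue grow past a sector budget; submitters then
 * block in __get_request_wait until enough in-flight io completes. */
static inline int blk_oversized_queue(request_queue_t *q)
{
        if (!q->can_throttle)
                return 0;       /* driver did not opt in */
        return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
}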

q->full: From Nick, it reduces latency in __get_request_wait by making
sure new io can't come in and steal requests before old waiters are
served.
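
And the q->full idea, as a sketch (the idea only, not Nick's exact code):

/* Once a task has had to sleep for a free request, mark the queue
 * full so later arrivals can't steal requests freed in the meantime;
 * __get_request_wait clears q->full once the oldest waiter is served. */
static struct request *get_request(request_queue_t *q, int rw)
{
        struct request *rq = NULL;

        if (!q->full && !list_empty(&q->rq[rw].free))
                rq = blkdev_free_rq(&q->rq[rw].free);
        return rq;
}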

Those represent the big 3 areas I believe the latencies are coming
from. The q->full patch can hurt throughput badly as the number of
writers increases (50% of what 2.4.21 gets for 10 or more concurrent
streaming writers), but it really seems to help desktop workloads here.

Andrea's elevator-low-latency patch solves 90% of the latency problem,
it keeps heavy io from taking over the disk entirely and starving out
small readers (like ls). It also keeps good throughput numbers.

So, the patch attached includes the q->full code but has it off by
default. I've got code locally for an elvtune interface that can toggle
the q->full check on a per-device basis, as well as tune the max io per
queue. I've got two choices on how to submit it: I can either add a new
ioctl or abuse the max_bomb_segments field in the existing ioctl.

If we can agree on the userland tuning side, I can have some kind of
elvtune patch tomorrow.

Note: my merge of elevator-low-latency is a little different from
Andrea's. I added a blk_finished_sectors call to keep track of io as it
finishes instead of changing blk_finished_io. If a given driver is
going to call blk_finished_sectors when it calls blk_finished_io, it
should call blk_queue_throttle_sectors(q, 1) after blk_init_queue to
tell the rest of ll_rw_block to enable throttling. The extra calls are
there to keep compatibility with external drivers.
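
In other words, a converted driver would look something like this (a
sketch; exact signatures are whatever the patch defines):

/* At init time, opt in to sector-based throttling: */
blk_init_queue(q, do_xxx_request);
blk_queue_throttle_sectors(q, 1);

/* In the completion path, next to the existing accounting call: */
blk_finished_io(nsects);
blk_finished_sectors(req, nsects);  /* keep the queue's sector count honest */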

[ random summary of negative comments that come to mind ]

Andrea and I disagree about the code in get_request_wait_wakeup; he
wants to find a way to unplug the queue instead of triggering wakeups.
He also really doesn't like my change to __generic_unplug_device, which
is meant to lower latencies when a reader is running the task queue in
__wait_on_buffer. This code is off by default in the attached patch.

Nick would like to see a better balance of throughput/fairness; I wimped
out and went for the userspace toggle instead because I think anything
else requires pulling in larger changes from 2.5 land.

-chris


Attachments:
io-stalls-8-min.diff (23.69 kB)

2003-07-03 10:43:55

by Stephan von Krawczynski

Subject: Re: Status of the IO scheduler fixes for 2.4

On 02 Jul 2003 22:02:07 -0400
Chris Mason <[email protected]> wrote:

> [...]
> Nick would like to see a better balance of throughput/fairness; I wimped
> out and went for the userspace toggle instead because I think anything
> else requires pulling in larger changes from 2.5 land.

I have a short question on that: did you check if there are any drawbacks on
network performance through this? We had a phenomenon here with 2.4.21 with
both samba and simple ftp where network performance dropped to a crawl when
simply entering "sync" on the console. Even simple telnet sessions seemed to
be affected. As we could not create a reproducible setup I did not talk about
this until now, but I wonder if anyone else ever checked that out...

Regards,
Stephan


2003-07-03 11:48:18

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Thu, 2003-07-03 at 06:58, Stephan von Krawczynski wrote:
> On 02 Jul 2003 22:02:07 -0400
> Chris Mason <[email protected]> wrote:
>
> > [...]
> > Nick would like to see a better balance of throughput/fairness; I wimped
> > out and went for the userspace toggle instead because I think anything
> > else requires pulling in larger changes from 2.5 land.
>
> I have a short question on that: did you check if there are any drawbacks on
> network performance through this? We had a phenomenon here with 2.4.21 with
> both samba and simple ftp where network performance dropped to a crawl when
> simply entering "sync" on the console. Even simple telnet sessions seemed to
> be affected. As we could not create a reproducible setup I did not talk about
> this until now, but I wonder if anyone else ever checked that out...

It's possible the network programs involved are actually getting stuck
in atime updates somewhere. A decoded sysrq-t during one of the stalls
would help find the real cause.

-chris

2003-07-03 11:54:18

by Marc-Christian Petersen

Subject: Re: Status of the IO scheduler fixes for 2.4

On Thursday 03 July 2003 12:58, Stephan von Krawczynski wrote:

Hi Stephan,

> I have a short question on that: did you check if there are any drawbacks
> on network performance through this? We had a phenomenon here with 2.4.21
> with both samba and simple ftp where network performance dropped to a crawl
> when simply entering "sync" on the console. Even simple telnet sessions
> seemed to be affected. As we could not create a reproducible setup I did
> not talk about this until now, but I wonder if anyone else ever checked
> that out...
does the patch from Chris cause this?

ciao, Marc

2003-07-03 12:14:11

by Stephan von Krawczynski

Subject: Re: Status of the IO scheduler fixes for 2.4

On Thu, 3 Jul 2003 14:07:12 +0200
Marc-Christian Petersen <[email protected]> wrote:

> On Thursday 03 July 2003 12:58, Stephan von Krawczynski wrote:
>
> Hi Stephan,
>
> > I have a short question on that: did you check if there are any drawbacks
> > on network performance through this? We had a phenomenon here with 2.4.21
> > with both samba and simple ftp where network performance dropped to a crawl
> > when simply entering "sync" on the console. Even simple telnet sessions
> > seemed to be affected. As we could not create a reproducible setup I did
> > not talk about this until now, but I wonder if anyone else ever checked
> > that out...
> does the patch from Chris cause this?

No, this is an independent issue. We have not tried these patches so far;
the question was purely informative. We have not dug into the subject for
lack of time up to now - just a "try and see if anyone can back our
experience".

Regards,
Stephan

2003-07-03 12:18:59

by Marc-Christian Petersen

Subject: Re: Status of the IO scheduler fixes for 2.4

On Thursday 03 July 2003 04:02, Chris Mason wrote:

Hi Chris,

> So, the patch attached includes the q->full code but has it off by
> default. I've got code locally for an elvtune interface that can toggle
> the q->full check on a per-device basis, as well as tune the max io per
> queue. I've got two choices on how to submit it: I can either add a new
> ioctl or abuse the max_bomb_segments field in the existing ioctl.
> If we can agree on the userland tuning side, I can have some kind of
> elvtune patch tomorrow.
what about /proc?

ciao, Marc

2003-07-03 12:57:56

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Thu, 2003-07-03 at 08:31, Marc-Christian Petersen wrote:
> On Thursday 03 July 2003 04:02, Chris Mason wrote:
>
> Hi Chris,
>
> > So, the patch attached includes the q->full code but has it off by
> > default. I've got code locally for an elvtune interface that can toggle
> > the q->full check on a per-device basis, as well as tune the max io per
> > queue. I've got two choices on how to submit it: I can either add a new
> > ioctl or abuse the max_bomb_segments field in the existing ioctl.
> > If we can agree on the userland tuning side, I can have some kind of
> > elvtune patch tomorrow.
> what about /proc?

Always an option. If elvtune didn't exist at all I'd say proc was a
better choice. But I do want to be able to tune things on a per-device
basis, which probably means a new directory tree somewhere in proc. Our
chances are only 50/50 of getting that patch in without a long thread
about the one true way to access kernel tunables through an fs
interface visible to userland ;-)

For the most part I'm only visiting drivers/block/*.c right now, so I'll
code whatever interface the long-term maintainers hate the least.

-chris


2003-07-03 17:54:24

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
>
> >
> > Hello people,
> >
> > What is the status of the IO scheduler fixes for increased fairness for
> > 2.4 ?
> >
> > I haven't had time to read and think about everything you guys discussed,
> > so a brief summary would be very helpful for me.

Ok, based on mail from Andrea, it seems like reusing the existing
elvtune ioctl is the way to go for now. The patch below uses
max_bomb_segments for two things.

1) to set the number of sectors allowed on a given request queue before
new io triggers an unplug (q->max_queue_sectors)

2) to enable or disable the q->full checks. If max_bomb_segments is odd,
the q->full checks are on; if even, they are off (see the sketch below).

The ioctl code on the kernel side is set up to allow this:

elvtune -b 1 /dev/xxx
elvtune -b 0 /dev/xxx

for just switching q->full on and off without changing
q->max_queue_sectors.

elvtune -b 8192 /dev/xxx will get you the current default behavior (4MB
in flight, q->full checks off).
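
On the kernel side the decode could look roughly like this (illustrative
only; the function and field names here are hypothetical):

/* One integer carries two tunables: bit 0 toggles the q->full checks,
 * and values above 1 also set the in-flight sector budget.
 * 8192 sectors * 512 bytes = 4MB, the current default. */
static void blkelv_set_bomb(request_queue_t *q, int val)
{
        q->full_checks = val & 1;       /* odd = on, even = off */
        if (val > 1)
                q->max_queue_sectors = val & ~1;
}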

-chris


Attachments:
io-stalls-9.diff (26.68 kB)

2003-07-04 19:49:51

by Marcelo Tosatti

Subject: Re: Status of the IO scheduler fixes for 2.4



On Wed, 2 Jul 2003, Chris Mason wrote:

> On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> > On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
> >
> > >
> > > Hello people,
> > >
> > > What is the status of the IO scheduler fixes for increased fairness for
> > > 2.4 ?
> > >
> > > I haven't had time to read and think about everything you guys discussed,
> > > so a brief summary would be very helpful for me.
> > >
> > > Danke
> >
> > Ah, we all want the fairness issues to be fixed in 2.4.22, right?
>
> My current code is attached; it's basically a merge of these three patches,
> with modifications based on benchmarks and latency measurements here.
>
> fix_pausing: From Andrea, it fixes a few corner case races where
> wakeups can be missed in wait_on_buffer, wait_on_page, and
> __get_request_wait.
>
> elevator-low-latency: From Andrea, it limits the amount of io on a given
> queue to a reasonable level. This prevents a small number of huge
> requests from introducing large latencies for smaller requests.
>
> q->full: From Nick, it reduces latency in __get_request_wait by making
> sure new io can't come in and steal requests before old waiters are
> served.
>
> Those represent the big 3 areas I believe the latencies are coming
> from. The q->full patch can hurt throughput badly as the number of
> writers increases (50% of what 2.4.21 gets for 10 or more concurrent
> streaming writers), but it really seems to help desktop workloads here.

Chris,

Would you please separate those three fixes into separate diffs?

For me it seems the low-latency and fix-pausing patches should be enough
for "better" IO fairness. I might be wrong about that, though.

Let's try this: include elevator-low-latency in -pre3 (which I'm trying to
release today), then fix pausing in -pre4. If the IO fairness still doesn't
get somewhat better for general use (we'll get isolated user reports and
benchmarks for both patches), then I might consider the q->full
patch (it has throughput drawbacks and I prefer avoiding a tunable
there).


Sounds good?

2003-07-04 21:24:53

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Fri, 2003-07-04 at 16:01, Marcelo Tosatti wrote:

> Chris,
>
> Would you please separate those three fixes into separate diffs?
>
> For me it seems the low-latency and fix-pausing patches should be enough
> for "better" IO fairness. I might be wrong about that, though.
>
> Let's try this: include elevator-low-latency in -pre3 (which I'm trying to
> release today), then fix pausing in -pre4.

I don't believe elevator-low-latency is good without fix-pausing; we
need the smp corner-case fixes that fix-pausing provides.

> If the IO fairness still doesn't
> get somewhat better for general use (we'll get isolated user reports and
> benchmarks for both patches), then I might consider the q->full
> patch (it has throughput drawbacks and I prefer avoiding a tunable
> there).
>
>
> Sounds good?

We effectively had this in 2.4.19-aa and 2.4.20-aa, and people still
reported stalls on those kernels. So, aside from testing them
separately, I'm not sure what benefit we really get. I really like the
tunable for max_queue_sectors at least (without the q->full overload);
I'd like to get that in sometime during 2.4.22-pre if we take
elevator-low-latency.

It's a holiday weekend here, so I don't have a huge amount of time to
rediff and test. The code below is fix-pausing and elevator-low-latency
combined, without any of the q->full code. It is lightly tested, so
please run it through a few paces locally if you plan on applying it.

I've also attached a patch I've been working on to solve the latencies a
different way. bdflush-progress.diff changes balance_dirty to wait on
bdflush instead of trying to start io. It helps a lot here (both
throughput and latency) but hasn't yet been tested on a box with tons of
drives.
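
The idea, as a sketch (assuming a generation counter that bdflush bumps
after each pass; the helper names here are hypothetical):

/* Instead of submitting io from balance_dirty() itself, wake bdflush
 * and wait until it has demonstrably made progress. */
void balance_dirty(void)
{
        unsigned long gen;

        if (!dirty_buffers_over_limit())        /* hypothetical check */
                return;
        gen = bdflush_generation;
        wakeup_bdflush();
        wait_event(bdflush_waitq, bdflush_generation != gen);
}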

-chris


Attachments:
bdflush_progress-3.diff (2.45 kB)
io-stalls-10.diff (21.54 kB)

2003-07-04 23:45:52

by Andrea Arcangeli

Subject: Re: Status of the IO scheduler fixes for 2.4

On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> release today), then fix pausing in -pre4. If the IO fairness still doesn't

fix pausing is a showstopper bugfix; the box will hang for days without
it.

lowlatency elevator is for the desktop complaints we get about
interactivity compared to 2.5, so it's much lower prio than fix pausing.
I would never merge fix pausing after lowlatency elevator. But that's
just me.

Andrea

2003-07-04 23:51:25

by Andrea Arcangeli

Subject: Re: Status of the IO scheduler fixes for 2.4

On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> I've also attached a patch I've been working on to solve the latencies a
> different way. bdflush-progress.diff changes balance_dirty to wait on
> bdflush instead of trying to start io. It helps a lot here (both
> throughput and latency) but hasn't yet been tested on a box with tons of
> drives.

that's orthogonal: it changes the write throttling; it doesn't touch the
blkdev layer like the other patches do. So if it helps, it solves a
different kind of latency.

However, the implementation can in theory run the box oom, since it won't
limit the dirty buffers anymore. To be safe you need to wait 2
generations. I doubt in practice it would be easily reproducible though ;).
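
Waiting two generations, as a sketch (same hypothetical names as the
earlier balance_dirty sketch):

/* One bump may come from a pass that was already in flight when we
 * dirtied our buffers; two bumps guarantee a full pass started after us. */
gen = bdflush_generation;
wakeup_bdflush();
wait_event(bdflush_waitq, (long)(bdflush_generation - gen) >= 2);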

Andrea

2003-07-05 00:33:22

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Fri, 2003-07-04 at 20:05, Andrea Arcangeli wrote:
> On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> > I've also attached a patch I've been working on to solve the latencies a
> > different way. bdflush-progress.diff changes balance_dirty to wait on
> > bdflush instead of trying to start io. It helps a lot here (both
> > throughput and latency) but hasn't yet been tested on a box with tons of
> > drives.
>
> that's orthogonal: it changes the write throttling; it doesn't touch the
> blkdev layer like the other patches do. So if it helps, it solves a
> different kind of latency.

It's also almost useless without elevator-low-latency applied ;-) One
major source of our latencies is a bunch of procs hammering on
__get_request_wait, so bdflush-progress helps by reducing the number of
procs doing io. It does push some of the latency problem up higher into
balance_dirty, but I believe it is easier to manage there.

bdflush-progress doesn't help at all for non-async workloads.

> However, the implementation can in theory run the box oom, since it won't
> limit the dirty buffers anymore. To be safe you need to wait 2
> generations. I doubt in practice it would be easily reproducible though ;).

Hmmm, I think the critical part here is to write some buffers after
marking a buffer dirty. We don't check the generation number until
after marking the buffer dirty, so if the generation has incremented at
all we've made progress. What am I missing?

-chris


2003-07-05 00:50:29

by Andrea Arcangeli

Subject: Re: Status of the IO scheduler fixes for 2.4

On Fri, Jul 04, 2003 at 08:47:00PM -0400, Chris Mason wrote:
> On Fri, 2003-07-04 at 20:05, Andrea Arcangeli wrote:
> > On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> > > I've also attached a patch I've been working on to solve the latencies a
> > > different way. bdflush-progress.diff changes balance_dirty to wait on
> > > bdflush instead of trying to start io. It helps a lot here (both
> > > throughput and latency) but hasn't yet been tested on a box with tons of
> > > drives.
> >
> > that's orthogonal: it changes the write throttling; it doesn't touch the
> > blkdev layer like the other patches do. So if it helps, it solves a
> > different kind of latency.
>
> It's also almost useless without elevator-low-latency applied ;-) One
> major source of our latencies is a bunch of procs hammering on
> __get_request_wait, so bdflush-progress helps by reducing the number of
> procs doing io. It does push some of the latency problem up higher into
> balance_dirty, but I believe it is easier to manage there.
>
> bdflush-progress doesn't help at all for non-async workloads.

agreed.

> > However, the implementation can in theory run the box oom, since it won't
> > limit the dirty buffers anymore. To be safe you need to wait 2
> > generations. I doubt in practice it would be easily reproducible though ;).
>
> Hmmm, I think the critical part here is to write some buffers after
> marking a buffer dirty. We don't check the generation number until

btw, not after a buffer, but after as many as "buffers per page" buffers
(in fact, for the code to be correct on systems with a quite big page
size, the 32 should be replaced with max(bh_per_page, 32))

> after marking the buffer dirty, so if the generation has incremented at
> all we've made progress. What am I missing?

cpu0 (balance_dirty): write 32 buffers
cpu0 (balance_dirty): read generation
cpu1 (bdflush):       inc generation
cpu0 (balance_dirty): generation changed -> OOM

Am I missing any smp serialization between the two cpus?

Andrea

2003-07-05 15:47:19

by Marcelo Tosatti

Subject: Re: Status of the IO scheduler fixes for 2.4


On Sat, 5 Jul 2003, Andrea Arcangeli wrote:

> On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> > release today), then fix pausing in -pre4. If the IO fairness still doesn't
>
> fix pausing is a showstopper bugfix; the box will hang for days without
> it.
>
> lowlatency elevator is for the desktop complaints we get about
> interactivity compared to 2.5, so it's much lower prio than fix pausing.
> I would never merge fix pausing after lowlatency elevator. But that's
> just me.

You're right. I'll merge both patches in -pre3.

Danke

2003-07-05 16:22:17

by Andrea Arcangeli

Subject: Re: Status of the IO scheduler fixes for 2.4

On Sat, Jul 05, 2003 at 12:59:19PM -0300, Marcelo Tosatti wrote:
>
> On Sat, 5 Jul 2003, Andrea Arcangeli wrote:
>
> > On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> > > release today), then fix pausing in -pre4. If the IO fairness still doesn't
> >
> > fix pausing is a showstopper bugfix; the box will hang for days without
> > it.
> >
> > lowlatency elevator is for the desktop complaints we get about
> > interactivity compared to 2.5, so it's much lower prio than fix pausing.
> > I would never merge fix pausing after lowlatency elevator. But that's
> > just me.
>
> You're right. I'll merge both patches in -pre3.

agreed, thanks.

> Danke

prego ;)

Andrea

2003-07-06 07:46:31

by Marc-Christian Petersen

Subject: Re: Status of the IO scheduler fixes for 2.4

On Friday 04 July 2003 23:37, Chris Mason wrote:

Hi Chris,

> > If the IO fairness still doesn't
> > get somewhat better for general use (we'll get isolated user reports and
> > benchmarks for both patches), then I might consider the q->full
> > patch (it has throughput drawbacks and I prefer avoiding a tunable
> > there).
Now there is io-stalls-10 in .22-pre3 (lowlat elevator + fix-pausing). Could you
please send the "q->full patch" on top of -pre3? :-)

Thank you.

ciao, Marc

2003-07-06 18:38:20

by Chris Mason

Subject: Re: Status of the IO scheduler fixes for 2.4

On Sun, 2003-07-06 at 03:58, Marc-Christian Petersen wrote:
> On Friday 04 July 2003 23:37, Chris Mason wrote:
>
> Hi Chris,
>
> > > If the IO fairness still doesn't
> > > get somewhat better for general use (we'll get isolated user reports and
> > > benchmarks for both patches), then I might consider the q->full
> > > patch (it has throughput drawbacks and I prefer avoiding a tunable
> > > there).
> Now there is io-stalls-10 in .22-pre3 (lowlat elevator + fix-pausing). Could you
> please send the "q->full patch" on top of -pre3? :-)

Attached; this defaults to q->full off and keeps the elvtune changes.
So to turn on the q->full low-latency fixes, you need to:

elvtune -b 1 /dev/xxxx

Note that for lvm and md, you need to elvtune each underlying device.
Running it on an lvm/md device doesn't do anything.

-chris


Attachments:
q_full.diff (7.09 kB)

2003-07-07 03:52:50

by Nick Piggin

Subject: Re: Status of the IO scheduler fixes for 2.4



Chris Mason wrote:

>On Sun, 2003-07-06 at 03:58, Marc-Christian Petersen wrote:
>
>>On Friday 04 July 2003 23:37, Chris Mason wrote:
>>
>>Hi Chris,
>>
>>
>>>>If the IO fairness still doesn't
>>>>get somewhat better for general use (we'll get isolated user reports and
>>>>benchmarks for both patches), then I might consider the q->full
>>>>patch (it has throughput drawbacks and I prefer avoiding a tunable
>>>>there).
>>>>
>>Now there is io-stalls-10 in .22-pre3 (lowlat elevator + fix-pausing). Could you
>>please send the "q->full patch" on top of -pre3? :-)
>>
>
>Attached; this defaults to q->full off and keeps the elvtune changes.
>So to turn on the q->full low-latency fixes, you need to:
>
>

It's a shame to have it default off, seeing as it fixes a real
starvation/unfairness problem, and nobody is going to turn it on. It
would be nice to turn it on by default and see how many people shout
about the throughput drop, but I guess you can't do that in a stable
series :P

I guess it will be useful to be able to ask people to try it if they are
reporting bad behaviour.