LinuxLists.cc - [patch] Give kjournald a IOPRIO_CLASS

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, 15 Oct 2007 10:46:47 -0700
Arjan van de Ven <[email protected]> wrote:

>
> Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> From: Arjan van de Ven <[email protected]>
>
> With latencytop, I noticed that the (in memory) atime updates during a
> kernel build had latencies of 600 msec or longer; this is obviously not so
> nice behavior. Other EXT3 journal related operations had similar or even
> longer latencies.
>
> Digging into this a bit more, it appears to be an interaction between EXT3
> and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> However, in reality, kjournald is "special" in that it does a lot of journal
> work and effectively this leads to a twisted kind of "mass priority
> inversion" type of behavior.
>
> The good news is that CFQ already has the infrastructure to make certain
> processes special... JBD just wasn't using that quite yet.
>
> The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
> this priority inversion behavior. With this patch, the latencies for atime
> updates (and similar operation) go down by a factor of 3x to 4x !
>

Seems a pretty fundamental change which could do with some careful
benchmarking, methinks.

See, your patch amounts to "do more seeks to improve one test case".
Surely other testcases will worsen. What are they?

> Signed-off-by: Arjan van de Ven <[email protected]>
>
>
> diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
> --- linux-2.6.23-rc9.org/fs/jbd/journal.c 2007-10-02 05:24:52.000000000 +0200
> +++ linux-2.6.23-rc9.lt/fs/jbd/journal.c 2007-10-14 00:06:55.000000000 +0200
> @@ -35,6 +35,7 @@
> #include <linux/kthread.h>
> #include <linux/poison.h>
> #include <linux/proc_fs.h>
> +#include <linux/ioprio.h>
>
> #include <asm/uaccess.h>
> #include <asm/page.h>
> @@ -131,6 +132,8 @@ static int kjournald(void *arg)
> printk(KERN_INFO "kjournald starting. Commit interval %ld seconds\n",
> journal->j_commit_interval / HZ);
>
> + current->ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
> +

Might be worth a code comment?

2007-10-15 19:28:39

by Jens Axboe

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, Oct 15 2007, Andrew Morton wrote:
> On Mon, 15 Oct 2007 10:46:47 -0700
> Arjan van de Ven <[email protected]> wrote:
>
> >
> > Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> > From: Arjan van de Ven <[email protected]>
> >
> > With latencytop, I noticed that the (in memory) atime updates during a
> > kernel build had latencies of 600 msec or longer; this is obviously not so
> > nice behavior. Other EXT3 journal related operations had similar or even
> > longer latencies.
> >
> > Digging into this a bit more, it appears to be an interaction between EXT3
> > and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> > However, in reality, kjournald is "special" in that it does a lot of journal
> > work and effectively this leads to a twisted kind of "mass priority
> > inversion" type of behavior.
> >
> > The good news is that CFQ already has the infrastructure to make certain
> > processes special... JBD just wasn't using that quite yet.
> >
> > The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
> > this priority inversion behavior. With this patch, the latencies for atime
> > updates (and similar operation) go down by a factor of 3x to 4x !
> >
>
> Seems a pretty fundamental change which could do with some careful
> benchmarking, methinks.
>
> See, your patch amounts to "do more seeks to improve one test case".
> Surely other testcases will worsen. What are they?

Yes, completely agree! I think Arjans patch makes a heap of sense, but
some numbers would be great to see.

> > Signed-off-by: Arjan van de Ven <[email protected]>
> >
> >
> > diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
> > --- linux-2.6.23-rc9.org/fs/jbd/journal.c 2007-10-02 05:24:52.000000000 +0200
> > +++ linux-2.6.23-rc9.lt/fs/jbd/journal.c 2007-10-14 00:06:55.000000000 +0200
> > @@ -35,6 +35,7 @@
> > #include <linux/kthread.h>
> > #include <linux/poison.h>
> > #include <linux/proc_fs.h>
> > +#include <linux/ioprio.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/page.h>
> > @@ -131,6 +132,8 @@ static int kjournald(void *arg)
> > printk(KERN_INFO "kjournald starting. Commit interval %ld seconds\n",
> > journal->j_commit_interval / HZ);
> >
> > + current->ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
> > +
>
> Might be worth a code comment?

It should not be merged as-is, instead I'll provide a function to do
this. It should also set current->io_context->ioprio_changed. Since no
IO has been done yet at this point it doesn't matter. But we should cut
a piece of set_task_ioprio() out and provide that as a kernel helper for
this sort of thing.

Even just writing it as:

current->ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | IOPRIO_NORM;

would be more readable. Or perhaps this would suffice, given the above
restriction that IO hasn't been done yet:

current->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, IOPRIO_NORM);

--
Jens Axboe

2007-10-15 20:16:49

by Rik van Riel

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, 15 Oct 2007 11:47:38 -0700
Andrew Morton <[email protected]> wrote:
> On Mon, 15 Oct 2007 10:46:47 -0700
> Arjan van de Ven <[email protected]> wrote:
>
> >
> > Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> > From: Arjan van de Ven <[email protected]>

> Seems a pretty fundamental change which could do with some careful
> benchmarking, methinks.

FWIW, I have marked the kjournald processes on my system
realtime with "rtprio -c 1 `pidof kjournald`" and the
usual desktop stalls that plague my system have not yet
happened this afternoon.

> See, your patch amounts to "do more seeks to improve one test case".
> Surely other testcases will worsen. What are they?

The big problem I have seen here is that processes end up
waiting on kjournald to do something, and kjournald is
waiting due to the IO scheduler.

This can cause a lot of low IO (high IO priority) processes
to indirectly get stuck behind a few high IO (low priority)
processes.

Since you have been involved a lot with ext3 development,
which kinds of workloads do you think will show a performance
degradation with Arjan's patch? What should I test?

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-15 21:13:41

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, 15 Oct 2007 16:13:15 -0400
Rik van Riel <[email protected]> wrote:

> Since you have been involved a lot with ext3 development,
> which kinds of workloads do you think will show a performance
> degradation with Arjan's patch? What should I test?

Gee. Expect the unexpected ;)

One problem might be when kjournald is doing its ordered-mode data
writeback at the start of commit. That writeout will now be
higher-priority and might impact other tasks which are doing synchronous
file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
old reads.

If the aggregate number of seeks over the long term is the same as before
then of course the overall throughput should be the same, in which case the
impact might only be upon latency. However if for some reason the total
number of seeks is increased then there will be throughput impacts as well.

So as a starting point I guess one could set up a
copy-a-kernel-tree-in-a-loop running in the background and then see what
impact that has upon a large-linear-read, upon a
read-a-different-kernel-tree and upon some database-style multithreaded
O_DIRECT reader/overwriter.

2007-10-22 09:11:45

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

* Jens Axboe <[email protected]> wrote:

> > Seems a pretty fundamental change which could do with some careful
> > benchmarking, methinks.
> >
> > See, your patch amounts to "do more seeks to improve one test case".
> > Surely other testcases will worsen. What are they?
>
> Yes, completely agree! I think Arjans patch makes a heap of sense, but
> some numbers would be great to see.

Arjan gave the relevant hard numbers:

| With latencytop, I noticed that the (in memory) atime updates during a
| kernel build had latencies of 600 msec or longer [...]
|
| With this patch, the latencies for atime updates (and similar
| operation) go down by a factor of 3x to 4x !

atime update latencies went down by a factor of 3x-4x ...

but what bothers me even more is the large picture. Linux's development
is still fundamentally skewed towards bandwidth (which goes up with
hardware advances anyway), while the focus on latencies is very lacking
(which users do care about much more and which usually does _not_
improve with improved hardware), so i cannot see why we shouldnt apply
this. Reminds me of the illogical, almost superstitious resistence
against the relatime patch. (which is not in 2.6.24 mind you - killed
for good)

if bandwidth hurts anywhere, it will be pointed out and fixed, we've got
like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth
problems. But _finally_ we now have desktop latency tools, hard numbers
and patches that fix them, but what do we do ... we put up extra
roadblocks??

so lets just goddamn apply this _trivial_ patch. This isnt an intrusive
1000 line rewrite that is hard to revert. If it causes any bandwidth
problems, it will be just as trivial to undo. If we do anything else we
just stiffle the still young and very much under-represented "lets fix
latencies that bothers people" movement. If anything we need _positive_
discrimination for latency related fixes (which treatment this fix does
not need at all - all it needs is _equal_ footing with the countless
bandwidth patches that go into the kernel all the time), otherwise it
will never take off and become as healthy as bandwidth optimizations.
Ok?

Ingo

2007-10-22 09:23:51

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, 22 Oct 2007 11:10:57 +0200 Ingo Molnar <[email protected]> wrote:

>
> * Jens Axboe <[email protected]> wrote:
>
> > > Seems a pretty fundamental change which could do with some careful
> > > benchmarking, methinks.
> > >
> > > See, your patch amounts to "do more seeks to improve one test case".
> > > Surely other testcases will worsen. What are they?
> >
> > Yes, completely agree! I think Arjans patch makes a heap of sense, but
> > some numbers would be great to see.
>
> Arjan gave the relevant hard numbers:
>
> | With latencytop, I noticed that the (in memory) atime updates during a
> | kernel build had latencies of 600 msec or longer [...]
> |
> | With this patch, the latencies for atime updates (and similar
> | operation) go down by a factor of 3x to 4x !
>
> atime update latencies went down by a factor of 3x-4x ...
>
> but what bothers me even more is the large picture. Linux's development
> is still fundamentally skewed towards bandwidth (which goes up with
> hardware advances anyway), while the focus on latencies is very lacking
> (which users do care about much more and which usually does _not_
> improve with improved hardware), so i cannot see why we shouldnt apply
> this. Reminds me of the illogical, almost superstitious resistence
> against the relatime patch. (which is not in 2.6.24 mind you - killed
> for good)

Try `mount -o relatime' and prepare to be surprised ;)

> if bandwidth hurts anywhere, it will be pointed out and fixed, we've got
> like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth
> problems. But _finally_ we now have desktop latency tools, hard numbers
> and patches that fix them, but what do we do ... we put up extra
> roadblocks??
>
> so lets just goddamn apply this _trivial_ patch. This isnt an intrusive
> 1000 line rewrite that is hard to revert. If it causes any bandwidth
> problems, it will be just as trivial to undo. If we do anything else we
> just stiffle the still young and very much under-represented "lets fix
> latencies that bothers people" movement. If anything we need _positive_
> discrimination for latency related fixes (which treatment this fix does
> not need at all - all it needs is _equal_ footing with the countless
> bandwidth patches that go into the kernel all the time), otherwise it
> will never take off and become as healthy as bandwidth optimizations.
> Ok?
>

I think the situation is that we've asked for some additional
what-can-be-hurt-by-this testing.

Yes, we could sling it out there and wait for the reports. But often
that's a pretty painful process and regressions can be discovered too late
for us to do anything about them.

2007-10-22 09:28:18

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

* Andrew Morton <[email protected]> wrote:

> > but what bothers me even more is the large picture. Linux's
> > development is still fundamentally skewed towards bandwidth (which
> > goes up with hardware advances anyway), while the focus on latencies
> > is very lacking (which users do care about much more and which
> > usually does _not_ improve with improved hardware), so i cannot see
> > why we shouldnt apply this. Reminds me of the illogical, almost
> > superstitious resistence against the relatime patch. (which is not
> > in 2.6.24 mind you - killed for good)
>
> Try `mount -o relatime' and prepare to be surprised ;)

i mean this one:

http://people.redhat.com/mingo/relatime-patches/improve-relatime.patch

this actually makes relatime practical.

Ingo

2007-10-22 09:40:44

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

* Andrew Morton <[email protected]> wrote:

> > so lets just goddamn apply this _trivial_ patch. This isnt an
> > intrusive 1000 line rewrite that is hard to revert. If it causes any
> > bandwidth problems, it will be just as trivial to undo. If we do
> > anything else we just stiffle the still young and very much
> > under-represented "lets fix latencies that bothers people" movement.
> > If anything we need _positive_ discrimination for latency related
> > fixes (which treatment this fix does not need at all - all it needs
> > is _equal_ footing with the countless bandwidth patches that go into
> > the kernel all the time), otherwise it will never take off and
> > become as healthy as bandwidth optimizations. Ok?
>
> I think the situation is that we've asked for some additional
> what-can-be-hurt-by-this testing.
>
> Yes, we could sling it out there and wait for the reports. But often
> that's a pretty painful process and regressions can be discovered too
> late for us to do anything about them.

reverting this oneliner is trivial. Finding bandwidth problems and
tracking them down to this oneliner change is relatively easy too.
Finding latency problems and fixing them is _not_ trivial.

Boot up a Linux desktop and start OOo or firefox, and measure the time
it takes to start the app up. 10-20 seconds on a top-of-the-line
quad-core 3.2 GHz system - which is a shame. Same box can do in excess
of 1GB/sec block IO. Yes, one could blame the apps but in reality most
of the blame is mostly on the kernel side. We do not make bloat and
latency suckage apparent enough to user-space (due to lack of
intelligent instrumentation), we make latencies hard to fix, we have an
acceptance bias towards bandwidth fixes (because they are easier to
measure and justify) - and that's all what it takes to let such a
situation get out of control.

and i can bring up the scheduler as an example. CFS broke the bandwidth
performance of one particular app and it took only a few days to get it
back under control. But it was months to get good latency behavior out
of the scheduler. And that is with the help of excellent scheduler
instrumentation. In the IO space the latency situation is much, much
worse. Really.

Ingo

2007-10-22 09:50:46

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Mon, 22 Oct 2007 11:40:14 +0200 Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > > so lets just goddamn apply this _trivial_ patch. This isnt an
> > > intrusive 1000 line rewrite that is hard to revert. If it causes any
> > > bandwidth problems, it will be just as trivial to undo. If we do
> > > anything else we just stiffle the still young and very much
> > > under-represented "lets fix latencies that bothers people" movement.
> > > If anything we need _positive_ discrimination for latency related
> > > fixes (which treatment this fix does not need at all - all it needs
> > > is _equal_ footing with the countless bandwidth patches that go into
> > > the kernel all the time), otherwise it will never take off and
> > > become as healthy as bandwidth optimizations. Ok?
> >
> > I think the situation is that we've asked for some additional
> > what-can-be-hurt-by-this testing.
> >
> > Yes, we could sling it out there and wait for the reports. But often
> > that's a pretty painful process and regressions can be discovered too
> > late for us to do anything about them.
>
> reverting this oneliner is trivial. Finding bandwidth problems and
> tracking them down to this oneliner change is relatively easy too.
> Finding latency problems and fixing them is _not_ trivial.
>
> Boot up a Linux desktop and start OOo or firefox, and measure the time
> it takes to start the app up. 10-20 seconds on a top-of-the-line
> quad-core 3.2 GHz system - which is a shame. Same box can do in excess
> of 1GB/sec block IO. Yes, one could blame the apps but in reality most
> of the blame is mostly on the kernel side. We do not make bloat and
> latency suckage apparent enough to user-space (due to lack of
> intelligent instrumentation), we make latencies hard to fix, we have an
> acceptance bias towards bandwidth fixes (because they are easier to
> measure and justify) - and that's all what it takes to let such a
> situation get out of control.
>
> and i can bring up the scheduler as an example. CFS broke the bandwidth
> performance of one particular app and it took only a few days to get it
> back under control. But it was months to get good latency behavior out
> of the scheduler. And that is with the help of excellent scheduler
> instrumentation. In the IO space the latency situation is much, much
> worse. Really.
>

None of which is an argument for simply not bothering to do a bit more
developer testing before merging.

2007-11-14 17:16:32

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

(cc lkml restored, with permission)

On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <[email protected]> wrote:

> Andrew Morton wrote:
> > On Mon, 15 Oct 2007 16:13:15 -0400
> > Rik van Riel <[email protected]> wrote:
> >
> >
> >> Since you have been involved a lot with ext3 development,
> >> which kinds of workloads do you think will show a performance
> >> degradation with Arjan's patch? What should I test?
> >>
> >
> > Gee. Expect the unexpected ;)
> >
> > One problem might be when kjournald is doing its ordered-mode data
> > writeback at the start of commit. That writeout will now be
> > higher-priority and might impact other tasks which are doing synchronous
> > file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
> > old reads.
> >
> > If the aggregate number of seeks over the long term is the same as before
> > then of course the overall throughput should be the same, in which case the
> > impact might only be upon latency. However if for some reason the total
> > number of seeks is increased then there will be throughput impacts as well.
> >
> > So as a starting point I guess one could set up a
> > copy-a-kernel-tree-in-a-loop running in the background and then see what
> > impact that has upon a large-linear-read, upon a
> > read-a-different-kernel-tree and upon some database-style multithreaded
> > O_DIRECT reader/overwriter.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
> Hi folks -
>
> I noticed this thread recently (sorry, month late and a dollar short),
> but it was very interesting to me, and as I had some cycles yesterday I
> did a quick run of what Andrew suggested here (using 2.6.24-rc2 w/out
> and then w/ Ajran's patch). After doing the runs last night I'm not
> overly happy with the test setup, but before redoing that, I thought I'd
> ask to make sure that this patch was still being considered?

I'd consider its status to be "might be a good idea, more performance
testing needed".

> And, btw, the results from the first pass were rather mixed -

ooh, more performance testing. Thanks

> * The overwriter task (on an 8GiB file), average over 10 runs:
> o 2.6.24 - 300.88226 seconds
> o 2.6.24 + Arjan's patch - 403.85505 seconds
>
> * The read-a-different-kernel-tree task, average over 10 runs:
> o 2.6.24 - 46.8145945549 seconds
> o 2.6.24 + Arjan's patch - 39.6430601119 seconds
>
> * The large-linear-read task (on an 8GiB file), average over 10 runs:
> o 2.6.24 - 290.32522 seconds
> o 2.6.24 + Arjan's patch - 386.34860 seconds

These are *large* differences, making this a very signifcant patch. Much
care is needed now.

Could you expand a bit on what you're testing here? I think that in one
process you're doing a continuous copy-a-kernel-tree and in the other
process you're the above three things, yes?

I guess the other things we should look at are the impact on the
continuously-copy-a-kernel-tree process and also the overall IO throughput.
These things will of course be related. If the overall system-wide IO
throughput increases with the patch then we probably have a no-brainer. If
(as I suspect) the overall IO throughput is decreased then this will be a
difficult call.

> This was performed on a 4-way IA64 w/ 4GiB RAM w/ a FC disk containing
> an Ext3 (4KiB block) FS. The reason I woke up this morning and wasn't
> too happy with the benchmarking harness is that a kernel source tree is
> less than 1GiB and size, and thus the back ground task of reading it
> might all fit in the page cache. To do this right, I'd use another tree
> that has >4GiB of data stored in it. (And likewise, I'd change the
> read-a-different-kernel-tree task to use a larger tree as well.)

hm, yes. Back in the days when I used to do useful things I'd do most
testing of this sort on 256MB, 128MB or even 64MB machines. So that data
would get tossed out of cache quickly so that I could use smaller working
sets so that the tests could be executed faster.

> But before I do that, I want to make sure that the path is still being
> considered.

Sure, it hasn't been ruled out. Especially as those time deltas you're
measuring are so large. We haven't seen changes in IO throughput like that
in years. We just need to work out if they're net-positive or net-negative ;)

This will end up being a pretty large hunk of work I expect.

2007-11-14 17:18:44

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

* Andrew Morton <[email protected]> wrote:

> ooh, more performance testing. Thanks
>
> > * The overwriter task (on an 8GiB file), average over 10 runs:
> > o 2.6.24 - 300.88226 seconds
> > o 2.6.24 + Arjan's patch - 403.85505 seconds
> >
> > * The read-a-different-kernel-tree task, average over 10 runs:
> > o 2.6.24 - 46.8145945549 seconds
> > o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> >
> > * The large-linear-read task (on an 8GiB file), average over 10 runs:
> > o 2.6.24 - 290.32522 seconds
> > o 2.6.24 + Arjan's patch - 386.34860 seconds
>
> These are *large* differences, making this a very signifcant patch.
> Much care is needed now.

and the numbers suggest it's mostly a severe performance regression.
That's not what i have expected - ho hum. Apologies for my earlier
"please merge it already!" whining.

Ingo

2007-11-14 17:53:50

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Wed, 14 Nov 2007 18:18:05 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > ooh, more performance testing. Thanks
> >
> > > * The overwriter task (on an 8GiB file), average over 10 runs:
> > > o 2.6.24 - 300.88226 seconds
> > > o 2.6.24 + Arjan's patch - 403.85505 seconds
> > >
> > > * The read-a-different-kernel-tree task, average over 10 runs:
> > > o 2.6.24 - 46.8145945549 seconds
> > > o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> > >
> > > * The large-linear-read task (on an 8GiB file), average over
> > > 10 runs: o 2.6.24 - 290.32522 seconds
> > > o 2.6.24 + Arjan's patch - 386.34860 seconds
> >
> > These are *large* differences, making this a very signifcant
> > patch. Much care is needed now.
>
> and the numbers suggest it's mostly a severe performance regression.
> That's not what i have expected - ho hum. Apologies for my earlier
> "please merge it already!" whining.

that's.. not automatic; it depends on what the right thing is :-(
What for sure changes is that who gets to do IO changes. Some of the
tests we ran internally (we didn't publish yet because we saw REALLY
large variations for most of them even without any patch) show for
example that "dbench" got slower. But.. dbench gets slower when things
get more fair, and faster when things get unfair. What conclusion you
draw out of that is a whole different matter and depends on exactly
what the test is doing, and what is the right thing for the OS to do in
terms of who gets to do the IO.

THis makes the patch more tricky than the one line change suggests, and
this is also why I haven't published a ton of data yet; it's hard to
get useful tests for this (and the variation of the 2.6.23+ kernels
makes it even harder to do anything meaningful ;-( )

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-11-14 18:56:38

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

* Arjan van de Ven <[email protected]> wrote:

> > > > * The read-a-different-kernel-tree task, average over 10 runs:
> > > > o 2.6.24 - 46.8145945549 seconds
> > > > o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> > > >
> > > > * The large-linear-read task (on an 8GiB file), average over
> > > > 10 runs: o 2.6.24 - 290.32522 seconds
> > > > o 2.6.24 + Arjan's patch - 386.34860 seconds
> > >
> > > These are *large* differences, making this a very signifcant
> > > patch. Much care is needed now.
> >
> > and the numbers suggest it's mostly a severe performance regression.
> > That's not what i have expected - ho hum. Apologies for my earlier
> > "please merge it already!" whining.
>
> that's.. not automatic; it depends on what the right thing is :-( What
> for sure changes is that who gets to do IO changes. Some of the tests
> we ran internally (we didn't publish yet because we saw REALLY large
> variations for most of them even without any patch) show for example
> that "dbench" got slower. But.. dbench gets slower when things get
> more fair, and faster when things get unfair. What conclusion you draw
> out of that is a whole different matter and depends on exactly what
> the test is doing, and what is the right thing for the OS to do in
> terms of who gets to do the IO.

yeah, i'd agree to not too much faith into dbench results.

Ingo

2007-11-14 19:17:42

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

Andrew Morton wrote:
> (cc lkml restored, with permission)
>
> On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <[email protected]> wrote:
>
>
>> Andrew Morton wrote:
>>
>>> On Mon, 15 Oct 2007 16:13:15 -0400
>>> Rik van Riel <[email protected]> wrote:
>>>
>>>
>>>
>>>> Since you have been involved a lot with ext3 development,
>>>> which kinds of workloads do you think will show a performance
>>>> degradation with Arjan's patch? What should I test?
>>>>
>>>>
>>> Gee. Expect the unexpected ;)
>>>
>>> One problem might be when kjournald is doing its ordered-mode data
>>> writeback at the start of commit. That writeout will now be
>>> higher-priority and might impact other tasks which are doing synchronous
>>> file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
>>> old reads.
>>>
>>> If the aggregate number of seeks over the long term is the same as before
>>> then of course the overall throughput should be the same, in which case the
>>> impact might only be upon latency. However if for some reason the total
>>> number of seeks is increased then there will be throughput impacts as well.
>>>
>>> So as a starting point I guess one could set up a
>>> copy-a-kernel-tree-in-a-loop running in the background and then see what
>>> impact that has upon a large-linear-read, upon a
>>> read-a-different-kernel-tree and upon some database-style multithreaded
>>> O_DIRECT reader/overwriter.
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>>>
>> Hi folks -
>>
>> I noticed this thread recently (sorry, month late and a dollar short),
>> but it was very interesting to me, and as I had some cycles yesterday I
>> did a quick run of what Andrew suggested here (using 2.6.24-rc2 w/out
>> and then w/ Ajran's patch). After doing the runs last night I'm not
>> overly happy with the test setup, but before redoing that, I thought I'd
>> ask to make sure that this patch was still being considered?
>>
>
> I'd consider its status to be "might be a good idea, more performance
> testing needed".
>
>
>> And, btw, the results from the first pass were rather mixed -
>>
>
> ooh, more performance testing. Thanks
>
>
>> * The overwriter task (on an 8GiB file), average over 10 runs:
>> o 2.6.24 - 300.88226 seconds
>> o 2.6.24 + Arjan's patch - 403.85505 seconds
>>
>> * The read-a-different-kernel-tree task, average over 10 runs:
>> o 2.6.24 - 46.8145945549 seconds
>> o 2.6.24 + Arjan's patch - 39.6430601119 seconds
>>
>> * The large-linear-read task (on an 8GiB file), average over 10 runs:
>> o 2.6.24 - 290.32522 seconds
>> o 2.6.24 + Arjan's patch - 386.34860 seconds
>>
>
> These are *large* differences, making this a very signifcant patch. Much
> care is needed now.
>
> Could you expand a bit on what you're testing here? I think that in one
> process you're doing a continuous copy-a-kernel-tree and in the other
> process you're the above three things, yes?
>

The test works like this:

1. I ensure that the device under test (DUT) is set to run the CFQ
scheduler.
1. It is a Fibre Channel 72GiB disk
2. Single partition...
2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
3. Mount the device, and then:
1. Put an 8GiB file on the new FS
2. Put 3 copies of a Linux tree (w/ objs & kernel & such) onto
the FS in separate directories
1. Note: I'm going to do runs with 6 copies to each
directory tree to get to about 4.2GiB per directory tree
4. Then, for each of the tests:
1. Remount the device (purge page cache by umount & then mount)
2. Start up a copy of 1 kernel tree to another tree (you hadn't
specified if the copy in the background should be to a new
area or not, so I'm just re-using the same area so we don't
have to worry about removing the old). I keep doing the copy
as long as the tests are going
3. Perform the test (10 times)

The tests are:

* Linear read of a large file (8GiB)
* Tree read (foreach file in the tree, dd it to /dev/null)
* Overwrite of that large file: was doing 256KiB random&direct
read/writes, will go down to 4KiB read/writes as that is more
realistic I'd guess

I'm going to try and get the comparisons done by tomorrow, the results
should be very different due to the changes noted above (going to 4.2GiB
trees instead of 700MiB, going to 4K instead of 256K read/writes). This
may cause the runs to be much longer, and then I won't get it done as
quickly...

> I guess the other things we should look at are the impact on the
> continuously-copy-a-kernel-tree process and also the overall IO throughput.
> These things will of course be related. If the overall system-wide IO
> throughput increases with the patch then we probably have a no-brainer. If
> (as I suspect) the overall IO throughput is decreased then this will be a
> difficult call.
>

I'll add in continuous 'iostat' grabs, and present data on that too - it
would contain both generic IO information as well as grabbing
CPU-related stuff (in particular %system).

>
>> This was performed on a 4-way IA64 w/ 4GiB RAM w/ a FC disk containing
>> an Ext3 (4KiB block) FS. The reason I woke up this morning and wasn't
>> too happy with the benchmarking harness is that a kernel source tree is
>> less than 1GiB and size, and thus the back ground task of reading it
>> might all fit in the page cache. To do this right, I'd use another tree
>> that has >4GiB of data stored in it. (And likewise, I'd change the
>> read-a-different-kernel-tree task to use a larger tree as well.)
>>
>
> hm, yes. Back in the days when I used to do useful things I'd do most
> testing of this sort on 256MB, 128MB or even 64MB machines. So that data
> would get tossed out of cache quickly so that I could use smaller working
> sets so that the tests could be executed faster.
>
>
>> But before I do that, I want to make sure that the path is still being
>> considered.
>>
>
> Sure, it hasn't been ruled out. Especially as those time deltas you're
> measuring are so large. We haven't seen changes in IO throughput like that
> in years. We just need to work out if they're net-positive or net-negative ;)
>
> This will end up being a pretty large hunk of work I expect.
>
I'll get results out when I have the changes made to the script
(outlined above), and the runs done.

Alan

2007-11-14 19:46:07

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

Arjan van de Ven wrote:
> On Wed, 14 Nov 2007 18:18:05 +0100
> Ingo Molnar <[email protected]> wrote:
>
>
>> * Andrew Morton <[email protected]> wrote:
>>
>>
>>> ooh, more performance testing. Thanks
>>>
>>>
>>>> * The overwriter task (on an 8GiB file), average over 10 runs:
>>>> o 2.6.24 - 300.88226 seconds
>>>> o 2.6.24 + Arjan's patch - 403.85505 seconds
>>>>
>>>> * The read-a-different-kernel-tree task, average over 10 runs:
>>>> o 2.6.24 - 46.8145945549 seconds
>>>> o 2.6.24 + Arjan's patch - 39.6430601119 seconds
>>>>
>>>> * The large-linear-read task (on an 8GiB file), average over
>>>> 10 runs: o 2.6.24 - 290.32522 seconds
>>>> o 2.6.24 + Arjan's patch - 386.34860 seconds
>>>>
>>> These are *large* differences, making this a very signifcant
>>> patch. Much care is needed now.
>>>
>> and the numbers suggest it's mostly a severe performance regression.
>> That's not what i have expected - ho hum. Apologies for my earlier
>> "please merge it already!" whining.
>>
>
> that's.. not automatic; it depends on what the right thing is :-(
> What for sure changes is that who gets to do IO changes. Some of the
> tests we ran internally (we didn't publish yet because we saw REALLY
> large variations for most of them even without any patch) show for
> example that "dbench" got slower. But.. dbench gets slower when things
> get more fair, and faster when things get unfair. What conclusion you
> draw out of that is a whole different matter and depends on exactly
> what the test is doing, and what is the right thing for the OS to do in
> terms of who gets to do the IO.
>
> THis makes the patch more tricky than the one line change suggests, and
> this is also why I haven't published a ton of data yet; it's hard to
> get useful tests for this (and the variation of the 2.6.23+ kernels
> makes it even harder to do anything meaningful ;-( )
>
>
>
I'd also like to point out here that the run-to-run deviation was indeed
quite large for both the unpatched- and patched-kernels, I'll report on
that information with the next set of results...

Alan

2007-11-14 19:51:42

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Wed, 14 Nov 2007 14:24:03 -0500
"Alan D. Brunelle" <[email protected]> wrote:
> >
>
> The test works like this:
>
> 1. I ensure that the device under test (DUT) is set to run the CFQ
> scheduler.
> 1. It is a Fibre Channel 72GiB disk
> 2. Single partition...
> 2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
> 3. Mount the device, and then:
> 1. Put an 8GiB file on the new FS
> 2. Put 3 copies of a Linux tree (w/ objs & kernel & such)
> onto the FS in separate directories
> 1. Note: I'm going to do runs with 6 copies to each
> directory tree to get to about 4.2GiB per directory
> tree 4. Then, for each of the tests:
> 1. Remount the device (purge page cache by umount & then
> mount) 2. Start up a copy of 1 kernel tree to another tree (you hadn't
> specified if the copy in the background should be to a new
> area or not, so I'm just re-using the same area so we
> don't have to worry about removing the old). I keep doing the copy
> as long as the tests are going
> 3. Perform the test (10 times)
>
> The tests are:
>
> * Linear read of a large file (8GiB)
> * Tree read (foreach file in the tree, dd it to /dev/null)
> * Overwrite of that large file: was doing 256KiB random&direct
> read/writes, will go down to 4KiB read/writes as that is more
> realistic I'd guess
>

ok so the obvious meta-question is this: what does it mean that your
test takes longer or shorter. I can see IO "capacity" (trying to avoid
the use bandwidth here) moves from the foreground test to the
background test (and/or other way around)... but if that was starved
previously... it could or could not be the right result.
What do you think the measure of "it's at least not worse" is? Is there
any way to get to that concept? (and then looking at if that got met is
the second step ;( )

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-11-14 19:59:18

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

Oh, and the runs were done in single-user mode...

Alan

2007-11-16 16:35:39

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

Here are the results for the latest tests, some notes:

o The machine actually has 8GiB of RAM, so the tests still may end up
using (some) page cache. (But at least it was the same for both kernels!
:-) )

o Sorry the results took so long - the updated tree size caused the
runs to take > 12 hours...

o The longer runs seemed to bring down the standard deviation a bit,
although they are still quite large.

o 10 runs per test (read large file, read a tree, overwrite large
file), with averages presented.

o 1st 4 columns (min, avg, max, std dev) refer to the average run
lengths for the tests - real time, in seconds

o The last 3 columns are extracted from iostat results over the course
of the whole run.

o The read a tree test certainly stands out - the other 2 large file
manipulations have the two kernels within a couple of percent, but the
read a tree test has Arjan's patch taking about 47%(!) longer on
average. The increased %iowait & %system time in all 3 cases is interesting.

Read large file:

Kernel Min Avg Max Std Dev %user %system %iowait
--------------------------------------------------------------
base : 201.6 215.1 275.5 22.8 0.26% 4.69% 33.54%
arjan: 198.0 210.3 261.5 18.5 0.33% 10.24% 54.00%

Read a tree:

Kernel Min Avg Max Std Dev %user %system %iowait
--------------------------------------------------------------
base : 3518.2 4631.3 5991.3 784.6 0.19% 3.29% 23.56%
arjan: 5731.6 6849.8 7777.4 731.6 0.32% 9.90% 52.70%

Overwrite large file:

Kernel Min Avg Max Std Dev %user %system %iowait
--------------------------------------------------------------
base : 104.2 147.7 239.5 38.4 0.02% 0.05% 1.08%
arjan: 106.2 149.7 239.2 38.4 0.25% 0.79% 14.97%

Let me know if there is anything else I can do to elaborate, or if you
have suggestions for further testing.

Alan

2007-11-16 16:40:59

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

Alan D. Brunelle wrote:
>
> Read large file:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 201.6 215.1 275.5 22.8 0.26% 4.69% 33.54%
> arjan: 198.0 210.3 261.5 18.5 0.33% 10.24% 54.00%
>
> Read a tree:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 3518.2 4631.3 5991.3 784.6 0.19% 3.29% 23.56%
> arjan: 5731.6 6849.8 7777.4 731.6 0.32% 9.90% 52.70%
>
> Overwrite large file:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 104.2 147.7 239.5 38.4 0.02% 0.05% 1.08%
> arjan: 106.2 149.7 239.2 38.4 0.25% 0.79% 14.97%
>
>

I'm going to try and do some clean up work on the iostat CPU results -
the reason %user & %system are so low is (I think) because they also
include a lot of 0% results from the tail of the runs (as the unmount is
going on I think). I'm going to try and extract results for just the
"meat" of the runs.

Alan

2007-11-16 18:35:58

by Ray Lee

[permalink] [raw]

Subject: Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority

On Nov 16, 2007 8:25 AM, Alan D. Brunelle <[email protected]> wrote:
> Here are the results for the latest tests, some notes:
>
> o The machine actually has 8GiB of RAM, so the tests still may end up
> using (some) page cache. (But at least it was the same for both kernels!
> :-) )
>
> o Sorry the results took so long - the updated tree size caused the
> runs to take > 12 hours...
>
> o The longer runs seemed to bring down the standard deviation a bit,
> although they are still quite large.
>
> o 10 runs per test (read large file, read a tree, overwrite large
> file), with averages presented.
>
> o 1st 4 columns (min, avg, max, std dev) refer to the average run
> lengths for the tests - real time, in seconds
>
> o The last 3 columns are extracted from iostat results over the course
> of the whole run.
>
> o The read a tree test certainly stands out - the other 2 large file
> manipulations have the two kernels within a couple of percent, but the
> read a tree test has Arjan's patch taking about 47%(!) longer on
> average. The increased %iowait & %system time in all 3 cases is interesting.
>
>
> Read large file:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 201.6 215.1 275.5 22.8 0.26% 4.69% 33.54%
> arjan: 198.0 210.3 261.5 18.5 0.33% 10.24% 54.00%
>
> Read a tree:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 3518.2 4631.3 5991.3 784.6 0.19% 3.29% 23.56%
> arjan: 5731.6 6849.8 7777.4 731.6 0.32% 9.90% 52.70%
>
> Overwrite large file:
>
> Kernel Min Avg Max Std Dev %user %system %iowait
> --------------------------------------------------------------
> base : 104.2 147.7 239.5 38.4 0.02% 0.05% 1.08%
> arjan: 106.2 149.7 239.2 38.4 0.25% 0.79% 14.97%
>
> Let me know if there is anything else I can do to elaborate, or if you
> have suggestions for further testing.

Out of curiosity, what are the mount options for the freshly created
ext3 fs? In particular, are you using noatime, nodiratime?

Ray

2007-11-16 18:40:02