LinuxLists.cc - fio mmap randread 64k more than 40% regression with 2.6.33-rc1

2009-12-31 09:15:53

Subject: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

Compared with kernel 2.6.32, fio mmap randread 64k regresses by more than 40% on
2.6.33-rc1.

The test scenario: one JBOD has 12 disks, and every disk has 2 partitions. We create
8 1-GB files per partition and start 8 processes per partition to do random reads on
those 8 files. There are 8*24 = 192 processes in total. The randread block size is 64K.

We found the regression on 2 machines. One machine has 8GB of memory and the other has
6GB.

Bisecting is very unstable. Several patches are involved rather than just one.

1) commit 8e550632cccae34e265cb066691945515eaa7fb5
Author: Corrado Zoccolo <[email protected]>
Date: Thu Nov 26 10:02:58 2009 +0100

cfq-iosched: fix corner cases in idling logic

This patch introduces a regression of a bit less than 20%. I reverted just the section
below and this part of the regression disappeared. This part of the regression is stable
and is not affected by the other patches.

@@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;

/*
- * still requests with the driver, don't idle
+ * still active requests from this queue, don't idle
*/
- if (rq_in_driver(cfqd))
+ if (cfqq->dispatched)
return;

2) What about the other 20%~30% of the regression? It's complicated. My bisect plus
Li Shaohua's investigation located 3 patches:
df5fe3e8e13883f58dc97489076bbcc150789a21,
b3b6d0408c953524f979468562e7e210d8634150,
5db5d64277bf390056b1a87d0bb288c8b8553f96.

tiobench also regresses, and Li Shaohua located the same patches. See
http://lkml.indiana.edu/hypermail/linux/kernel/0912.2/03355.html.

Shaohua worked out patches to fix the tiobench regression. However, his patches
don't fix the fio randread 64k regression.
I retried the bisect manually and eventually located the patch below:

commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <[email protected]>
Date: Mon Oct 26 22:45:29 2009 +0100

cfq-iosched: fairness for sync no-idle queues

The patch is fairly big. After many tries, I found that the section below is the key.
@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+ (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;

That section deletes the check of !cfqd->cfq_latency, so enable_idle is set to 0
more often.
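
The effect of that hunk can be sketched as a small standalone truth-table model. This is
not kernel code: the struct and function names are hypothetical stand-ins for the kernel
fields, and the first two clauses of the real condition (nr_tasks and cfq_slice_idle) are
left out because the hunk does not change them.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the inputs read by the seeky clause. */
struct idle_inputs {
    bool low_latency;   /* stands in for cfqd->cfq_latency */
    bool hw_tag;        /* stands in for cfqd->hw_tag */
    bool samples_valid; /* stands in for sample_valid(cfqq->seek_samples) */
    bool seeky;         /* stands in for CFQQ_SEEKY(cfqq) */
};

/* Clause removed by the hunk: a seeky queue loses idling only when
 * low_latency is off and hw_tag is set. */
bool seeky_disables_idle_old(const struct idle_inputs *in)
{
    return !in->low_latency && in->hw_tag && in->seeky;
}

/* Clause added by the hunk: low_latency is no longer consulted. */
bool seeky_disables_idle_new(const struct idle_inputs *in)
{
    return in->samples_valid && in->seeky;
}
```

With low_latency set, a seeky queue with valid samples keeps its idle window under the
old clause and loses it under the new one, which is the case this report hits.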

I wrote a testing patch that simply works around the original 3 patches related
to the tiobench regression, and a patch that adds back the check of !cfqd->cfq_latency.
With both applied, the whole fio randread 64k regression disappears.

Then, instead of working around the original 3 patches, I applied Shaohua's 2 patches,
added the check of !cfqd->cfq_latency, and also reverted the section mentioned in 1).
The result still shows more than 20% regression, so Shaohua's patches do not fix the
fio randread 64k regression.

fio_mmap_randread_4k shows about a 10% improvement instead of a regression. I checked
that my patch plus the debugging patch has no impact on this improvement.

randwrite 64k has about a 25% regression. My method also restores its performance.

I worked out a patch that adds the check of !cfqd->cfq_latency back in
function cfq_update_idle_window.

In addition, as for item 1), could we just revert the section in cfq_arm_slice_timer?

As Shaohua's patches don't fix this regression, we may need to keep looking for a
better method. I will check it next week.

---

With kernel 2.6.33-rc1, fio rand read 64k has more than 40% regression. I located
the patch below.

commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <[email protected]>
Date: Mon Oct 26 22:45:29 2009 +0100

cfq-iosched: fairness for sync no-idle queues

It introduces more than 20% of the regression. The reason is that function
cfq_update_idle_window no longer checks cfqd->cfq_latency, so enable_idle is set to 0
more often.

The patch below, against 2.6.33-rc1, adds the check back.

Signed-off-by: Zhang Yanmin <[email protected]>

---

diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_rand64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_rand64k/block/cfq-iosched.c 2009-12-31 16:26:32.000000000 +0800
@@ -3064,8 +3064,8 @@ cfq_update_idle_window(struct cfq_data *
cfq_mark_cfqq_deep(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
- && CFQQ_SEEKY(cfqq)))
+ (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
+ sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)

2009-12-31 10:34:36

by Corrado Zoccolo

Subject: Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

Hi Yanmin,
On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
<[email protected]> wrote:
> Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> 2.6.33-rc1.

Can you compare the performance also with 2.6.31?
I think I understand what causes your problem.
2.6.32, with default settings, handled even random readers as
sequential ones to provide fairness. This has benefits on single disks
and JBODs, but causes harm on raids.
For 2.6.33, we changed the way in which this is handled, restoring the
enable_idle = 0 for seeky queues as it was in 2.6.31:
@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
struct cfq_queue *cfqq,
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+ (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
(compare with 2.6.31:
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
(cfqd->hw_tag && CIC_SEEKY(cic)))
enable_idle = 0;
excluding the sample_valid check, it should be equivalent for you (I
assume you have NCQ disks))
and we provide fairness for them by servicing all seeky queues
together, and then idling before switching to other ones.

The mmap 64k randreader will have a large seek_mean, resulting in
being marked seeky, but will send 16 * 4k sequential requests one
after the other, so alternating between those seeky queues will cause
harm.
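
A rough arithmetic sketch of why such a reader looks seeky when judged by average seek
distance alone. It assumes a plain arithmetic mean over one 64k block and 4k pages; how
the kernel actually averages seek distances is not modelled here, and the function name
is hypothetical.

```c
#include <stdint.h>

#define PAGES_PER_BLOCK 16u /* 64k block / 4k page */

/*
 * Mean seek distance, in sectors, over the 16 page-sized requests that
 * make up one 64k block: the first request jumps `jump_sectors` away from
 * the end of the previous block, and the other 15 start exactly where the
 * previous request ended (distance 0).
 */
uint64_t mean_seek_per_block(uint64_t jump_sectors)
{
    return jump_sectors / PAGES_PER_BLOCK;
}
```

A jump of 1,000,000 sectors (about 488 MiB with 512-byte sectors) gives a mean of 62,500
sectors per request, so a threshold of a few thousand sectors (the exact kernel constant
is not reproduced here) is exceeded even though 15 of every 16 requests are sequential.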

I'm working on a new way to compute the seekiness of queues that should
fix your issue by correctly identifying those queues as non-seeky (for
me, a queue should be considered seeky only if it submits more than 1
seeky request for every 8 sequential ones).
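
One way to picture the rule in the parenthesis, as a sketch only: count seeky and
sequential requests over some window and call the queue seeky when there is more than
one seeky request per eight sequential ones. The function name and the idea of a counting
window are assumptions for illustration, not the patch being developed.

```c
#include <stdbool.h>

/*
 * Ratio rule from the text: seeky only if seeky / sequential > 1 / 8.
 * Cross-multiplied to stay in integer arithmetic; intended for small
 * per-window counts, so overflow is not handled.
 */
bool queue_is_seeky(unsigned int seeky_reqs, unsigned int sequential_reqs)
{
    return seeky_reqs * 8u > sequential_reqs;
}
```

For the mmap 64k reader, each block contributes 1 seeky and 15 sequential requests, so
queue_is_seeky(1, 15) is false. A reader for which every request seeks, for example
queue_is_seeky(16, 0), is still classified as seeky.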

>
> The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> per partitions. There are 8*24 processes totally. randread block size is 64K.
>
> We found the regression on 2 machines. One machine has 8GB memory and the other has
> 6GB.
>
> Bisect is very unstable. The related patches are many instead of just one.
>
>
> 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> Author: Corrado Zoccolo <[email protected]>
> Date: Thu Nov 26 10:02:58 2009 +0100
>
> cfq-iosched: fix corner cases in idling logic
>
>
> This patch introduces about less than 20% regression. I just reverted below section
> and this part regression disappear. It shows this regression is stable and not impacted
> by other patches.
>
> @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> return;
>
> /*
> - * still requests with the driver, don't idle
> + * still active requests from this queue, don't idle
> */
> - if (rq_in_driver(cfqd))
> + if (cfqq->dispatched)
> return;
>
This shouldn't affect you if all queues are marked as idle. Does just
your patch:
> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> - && CFQQ_SEEKY(cfqq)))
> + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
fix most of the regression without touching arm_slice_timer?

I guess
> 5db5d64277bf390056b1a87d0bb288c8b8553f96.
will still introduce a 10% regression, but this is needed to improve
latency, and you can just disable low_latency to avoid it.

Thanks,
Corrado

2010-01-01 10:12:45

Hi Yanmin
On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <[email protected]> wrote:
> Hi Yanmin,
> On Mon, Jan 4, 2010 at 9:18 AM, Zhang, Yanmin
> <[email protected]> wrote:
>> On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote:
>>> Hi
>>> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin
>>> <[email protected]> wrote:
>>> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote:
>>> >> Hi Yanmin,
>>> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin
>>> >> <[email protected]> wrote:
>>> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
>>> >> >> Hi Yanmin,
>>> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
>>> >> >> <[email protected]> wrote:
>>> >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
>>> >> >> > 2.6.33-rc1.
>>> >> >>
>>> >> > Thanks for your timely reply. Some comments inlined below.
>>> >> >
>>> >> >> Can you compare the performance also with 2.6.31?
>>> >> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel
>>> >> > is released.
>>> >> >
>>> >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about
>>> >> > 8% better than the one of 2.6.31.
>>> >> >
>>> >> >> I think I understand what causes your problem.
>>> >> >> 2.6.32, with default settings, handled even random readers as
>>> >> >> sequential ones to provide fairness. This has benefits on single disks
>>> >> >> and JBODs, but causes harm on raids.
>>> >> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on
>>> >> > hardware RAID in HBA, mostly we use noop io scheduler.
>>> >> I think you should start testing cfq with them, too. From 2.6.33, we
>>> >> have some big improvements in this area.
>>> > Great! I once compared cfq and noop against non-raid and raid0. One interesting finding
>>> > about sequential read testing is when there are fewer processes to read files on the raid0
>>> > JBOD, noop on raid0 is pretty good, but when there are lots of processes to do so on a non-raid
>>> > JBOD, cfq is pretty better. I planed to investigate it, but too busy in other issues.
>>> >
>>> >> >
>>> >> >> For 2.6.33, we changed the way in which this is handled, restoring the
>>> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31:
>>> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
>>> >> >> struct cfq_queue *cfqq,
>>> >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>>> >> >>
>>> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>>> >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
>>> >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
>>> >> >> enable_idle = 0;
>>> >> >> (compare with 2.6.31:
>>> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>>> >> >> (cfqd->hw_tag && CIC_SEEKY(cic)))
>>> >> >> enable_idle = 0;
>>> >> >> excluding the sample_valid check, it should be equivalent for you (I
>>> >> >> assume you have NCQ disks))
>>> >> >> and we provide fairness for them by servicing all seeky queues
>>> >> >> together, and then idling before switching to other ones.
>>> >> > As for function cfq_update_idle_window, you is right. But since
>>> >> > 2.6.32, CFQ merges many patches and the patches have impact on each other.
>>> >> >
>>> >> >>
>>> >> >> The mmap 64k randreader will have a large seek_mean, resulting in
>>> >> >> being marked seeky, but will send 16 * 4k sequential requests one
>>> >> >> after the other, so alternating between those seeky queues will cause
>>> >> >> harm.
>>> >> >>
>>> >> >> I'm working on a new way to compute seekiness of queues, that should
>>> >> >> fix your issue, correctly identifying those queues as non-seeky (for
>>> >> >> me, a queue should be considered seeky only if it submits more than 1
>>> >> >> seeky requests for 8 sequential ones).
>>> >> >>
>>> >> >> >
>>> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
>>> >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
>>> >> >> > per partitions. There are 8*24 processes totally. randread block size is 64K.
>>> >> >> >
>>> >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has
>>> >> >> > 6GB.
>>> >> >> >
>>> >> >> > Bisect is very unstable. The related patches are many instead of just one.
>>> >> >> >
>>> >> >> >
>>> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
>>> >> >> > Author: Corrado Zoccolo <[email protected]>
>>> >> >> > Date: Thu Nov 26 10:02:58 2009 +0100
>>> >> >> >
>>> >> >> > cfq-iosched: fix corner cases in idling logic
>>> >> >> >
>>> >> >> >
>>> >> >> > This patch introduces about less than 20% regression. I just reverted below section
>>> >> >> > and this part regression disappear. It shows this regression is stable and not impacted
>>> >> >> > by other patches.
>>> >> >> >
>>> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>>> >> >> > return;
>>> >> >> >
>>> >> >> > /*
>>> >> >> > - * still requests with the driver, don't idle
>>> >> >> > + * still active requests from this queue, don't idle
>>> >> >> > */
>>> >> >> > - if (rq_in_driver(cfqd))
>>> >> >> > + if (cfqq->dispatched)
>>> >> >> > return;
>>> >> > Although 5 patches are related to the regression, above line is quite
>>> >> > independent. Reverting above line could always improve the result for about
>>> >> > 20%.
>>> >> I've looked at your fio script, and it is quite complex,
>>> > As we have about 40 fio sub cases, we have a script to create fio job file from
>>> > a specific parameter list. So there are some superfluous parameters.
>>> >
>>> My point is that there are so many things going on, that is more
>>> difficult to analyse the issues.
>>> I prefer looking at one problem at a time, so (initially) removing the
>>> possibility of queue merging, that Shaohua already investigated, can
>>> help in spotting the still not-well-understood problem.
>> Sounds reasonable.
>>
>>> Could you generate the same script, but with each process accessing
>>> only one of the files, instead of chosing it at random?
>> Ok. New testing starts 8 processes per partition and every process just works
>> on one file.
> Great, thanks.
>>
>>>
>>> > Another point is we need stable result.
>>> >
>>> >> with lot of
>>> >> things going on.
>>> >> Let's keep this for last.
>>> > Ok. But the change like what you do mostly reduces regresion.
>>> >
>>> >> I've created a smaller test, that already shows some regression:
>>> >> [global]
>>> >> direct=0
>>> >> ioengine=mmap
>>> >> size=8G
>>> >> bs=64k
>>> >> numjobs=1
>>> >> loops=5
>>> >> runtime=60
>>> >> #group_reporting
>>> >> invalidate=0
>>> >> directory=/media/hd/cfq-tests
>>> >>
>>> >> [job0]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile1
>>> >>
>>> >> [job1]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile2
>>> >>
>>> >> [job2]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile3
>>> >>
>>> >> [job3]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile4
>>> >>
>>> >> The attached patches, in particular 0005 (that apply on top of
>>> >> for-linus branch of Jen's tree
>>> >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this
>>> >> simplified workload.
>>> > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The
>>> > result isn't resolved.
>>> Can you quantify if there is an improvement, though?
>>
>> Ok. Because of company policy, I could only post percent instead of real number
> Sure, it is fine.
>>
>>> Please, also include Shahoua's patches.
>>> I'd like to see the comparison between (always with low_latency set to 0):
>>> plain 2.6.33
>>> plain 2.6.33 + shahoua's
>>> plain 2.6.33 + shahoua's + my patch
>>> plain 2.6.33 + shahoua's + my patch + rq_in_driver vs dispatched patch.
>>
>> 1) low_latency=0
>> 2.6.32 kernel 0
>> 2.6.33-rc1 -0.33
>> 2.6.33-rc1_shaohua -0.33
>> 2.6.33-rc1+corrado 0.03
>> 2.6.33-rc1_corrado+shaohua 0.02
>> 2.6.33-rc1_corrado+shaohua+rq_in_driver 0.01
>>
> So my patch fixes the situation for low_latency = 0, as I expected.
> I'll send it to Jens with a proper changelog.
>
>> 2) low_latency=1
>> 2.6.32 kernel 0
>> 2.6.33-rc1 -0.45
>> 2.6.33-rc1+corrado -0.24
>> 2.6.33-rc1_corrado+shaohua -0.23
>> 2.6.33-rc1_corrado+shaohua+rq_in_driver -0.23
> The results are as expected. With each process working on a separate
> file, Shahoua's patches do not influence the result sensibly.
> Interestingly, even rq_in_driver doesn't improve in this case, so
> maybe its effect is somewhat connected to queue merging.
> The remaining -23% is due to timeslice shrinking, that is done to
> reduce max latency when there are too many processes doing I/O, at the
> expense of throughput. It is a documented change, and the suggested
> way if you favor throughput over latency is to set low_latency = 0.
>
>>
>>
>> When low_latency=1, we get the biggest number with kernel 2.6.32.
>> Comparing with low_latency=0's result, the prior one is about 4% better.
> Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with
> fastest 2.6.32, so we can consider the first part of the problem
> solved.
>
I think we can return now to your full script with queue merging.
I'm wondering if (in arm_slice_timer):
- if (cfqq->dispatched)
+ if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
return;
gives the same improvement you were experiencing just reverting to rq_in_driver.

We saw that cfqq->dispatched worked fine when no queue merging was
happening, so it must be something concerning merging. Probably
dispatched is not accurate when we have set up a merge but the merge
has not yet been done.
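
To make the three variants of the early-return test easier to compare, here is a
standalone model. The struct is a hypothetical stand-in, not kernel code: rq_in_driver
models the device-wide count, dispatched the per-queue count, and merge_pending models
cfqq->new_cfqq being set.

```c
#include <stdbool.h>

struct timer_inputs {
    unsigned int rq_in_driver; /* requests held by the driver, all queues */
    unsigned int dispatched;   /* requests in flight from this queue only */
    bool merge_pending;        /* stands in for cfqq->new_cfqq != NULL */
};

/* Before commit 8e550632: skip idling if the driver holds any request. */
bool skip_idle_before(const struct timer_inputs *in)
{
    return in->rq_in_driver > 0;
}

/* After commit 8e550632: only this queue's own requests matter. */
bool skip_idle_after(const struct timer_inputs *in)
{
    return in->dispatched > 0;
}

/* Variant proposed above: use the device-wide check while a merge is pending. */
bool skip_idle_proposed(const struct timer_inputs *in)
{
    return in->dispatched > 0 || (in->merge_pending && in->rq_in_driver > 0);
}
```

The variants differ only when this queue has nothing in flight while other queues do:
with rq_in_driver = 3 and dispatched = 0, the old test skips idling, the current test
does not, and the proposed test skips only if merge_pending is set.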

Thanks,
Corrado