2008-08-24 20:24:48

by Daniel J Blueman

Subject: Re: performance "regression" in cfq compared to anticipatory, deadline and noop

Hi Fabio, Jens,

On Thu, May 15, 2008 at 1:21 PM, Fabio Checconi <[email protected]> wrote:
>> From: Jens Axboe <[email protected]>
>> Date: Thu, May 15, 2008 09:01:28AM +0200
>>
>> I don't think it's 2.6.25 vs 2.6.26-rc2, I can still reproduce some
>> request size offsets with the patch. So still fumbling around with this,
>> I'll be sending out another test patch when I'm confident it's solved
>> the size issue.
>
> IMO an interesting thing is how/why anticipatory doesn't show the
> issue. The device is not put into ANTIC_WAIT_NEXT unless a
> dispatch returns no requests while the queue is not empty. This
> seems to be enough in the reported workloads.
>
> I don't think this behavior is the correct one (it is still racy
> WRT merges after breaking anticipation); anyway, it should make things
> a little bit better. I fear that a complete solution would not
> involve only the scheduler.
>
> Introducing the very same behavior in cfq seems to be not so easy
> (i.e., start idling only if there was a dispatch round while the
> last request was being served) but an approximated version can be
> introduced quite easily. The patch below should do that, rescheduling
> the dispatch only if necessary; it is not tested at all, just posted
> for discussion.
>
> ---
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b399c62..41f1e0e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -169,6 +169,7 @@ enum cfqq_state_flags {
> CFQ_CFQQ_FLAG_queue_new, /* queue never been serviced */
> CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
> CFQ_CFQQ_FLAG_sync, /* synchronous queue */
> + CFQ_CFQQ_FLAG_dispatched, /* empty dispatch while idling */
> };
>
> #define CFQ_CFQQ_FNS(name) \
> @@ -196,6 +197,7 @@ CFQ_CFQQ_FNS(prio_changed);
> CFQ_CFQQ_FNS(queue_new);
> CFQ_CFQQ_FNS(slice_new);
> CFQ_CFQQ_FNS(sync);
> +CFQ_CFQQ_FNS(dispatched);
> #undef CFQ_CFQQ_FNS
>
> static void cfq_dispatch_insert(struct request_queue *, struct request *);
> @@ -749,6 +751,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
> cfqq->slice_end = 0;
> cfq_clear_cfqq_must_alloc_slice(cfqq);
> cfq_clear_cfqq_fifo_expire(cfqq);
> + cfq_clear_cfqq_dispatched(cfqq);
> cfq_mark_cfqq_slice_new(cfqq);
> cfq_clear_cfqq_queue_new(cfqq);
> }
> @@ -978,6 +981,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> */
> if (timer_pending(&cfqd->idle_slice_timer) ||
> (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
> + cfq_mark_cfqq_dispatched(cfqq);
> cfqq = NULL;
> goto keep_queue;
> }
> @@ -1784,7 +1788,10 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> if (cfq_cfqq_wait_request(cfqq)) {
> cfq_mark_cfqq_must_dispatch(cfqq);
> del_timer(&cfqd->idle_slice_timer);
> - blk_start_queueing(cfqd->queue);
> + if (cfq_cfqq_dispatched(cfqq)) {
> + cfq_clear_cfqq_dispatched(cfqq);
> + cfq_schedule_dispatch(cfqd);
> + }
> }
> } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> /*

This was the last test I didn't get around to. Alas, it did help, but
didn't give the merging required for full performance:

# echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=2000
262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s

# echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec

It is an improvement over the baseline performance of 2.6.27-rc4:

# echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=2000
262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s

# echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec

Note that platter speed is around 125MB/s (which I get near at smaller
read sizes).
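
For reference, a sweep along these lines (purely illustrative; it assumes the
same /dev/sda and reads 256MB per run from a cold cache) is one way to show
where throughput starts to fall off as the read size grows:

# Illustrative: cold-cache sequential-read throughput at increasing block sizes.
for bs_kb in 16 32 64 128 256 512 1024; do
    echo 1 > /proc/sys/vm/drop_caches
    printf 'bs=%4skB: ' "$bs_kb"
    dd if=/dev/sda of=/dev/null bs=${bs_kb}k count=$(( 262144 / bs_kb )) 2>&1 | grep copied
done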

I feel 128KB read requests are perhaps important, as this is a
commonly-used RAID stripe size, and may explain the read-performance
drop we sometimes see in hardware vs software RAID benchmarks.
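
Something that might be worth checking alongside the scheduler is what the
block layer itself allows per request on this device; a quick look at the
usual sysfs knobs (illustrative, again assuming sda):

# Illustrative: request-size, readahead and queueing limits applied to the device;
# merged requests cannot grow beyond max_sectors_kb.
for f in max_sectors_kb max_hw_sectors_kb read_ahead_kb nr_requests scheduler; do
    printf '%-18s %s\n' "$f:" "$(cat /sys/block/sda/queue/$f)"
done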

How can we generate some ideas or movement on fixing/improving this behaviour?

Thanks!
Daniel
--
Daniel J Blueman


2008-08-25 10:43:18

by Fabio Checconi

Subject: Re: performance "regression" in cfq compared to anticipatory, deadline and noop

Hi,

> From: Daniel J Blueman <[email protected]>
> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>
> Hi Fabio, Jens,
>
...
> This was the last test I didn't get around to. Alas, it did help, but
> didn't give the merging required for full performance:
>
> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> bs=128k count=2000
> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>
> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>
> It is an improvement over the baseline performance of 2.6.27-rc4:
>
> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> bs=128k count=2000
> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>
> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>
> Note that platter speed is around 125MB/s (which I get near at smaller
> read sizes).
>
> I feel 128KB read requests are perhaps important, as this is a
> commonly-used RAID stripe size, and may explain the read-performance
> drop we sometimes see in hardware vs software RAID benchmarks.
>
> How can we generate some ideas or movement on fixing/improving this behaviour?
>

Thank you for testing. The blktrace output for this run should be
interesting, esp. to compare it with a blktrace obtained from anticipatory
with the same workload - IIRC anticipatory didn't suffer from the problem,
and anticipatory has a slightly different dispatching mechanism that
this patch tried to bring into cfq.

Even if a proper fix may not belong to the elevator itself, I think
that this pair of traces (this last test + anticipatory) should help
in better understanding what is still going wrong.
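
In case it helps, a capture recipe along these lines (illustrative; device,
output names and scheduler list are just placeholders) would give one trace
per scheduler for the same dd workload:

# Illustrative: record a blktrace of the cold-cache dd read under each scheduler.
# (blktrace needs debugfs mounted, e.g. at /sys/kernel/debug.)
for sched in noop deadline anticipatory cfq; do
    echo $sched > /sys/block/sda/queue/scheduler
    echo 1 > /proc/sys/vm/drop_caches
    blktrace -d /dev/sda -o trace-$sched &
    btpid=$!
    dd if=/dev/sda of=/dev/null bs=128k count=2000
    kill -INT $btpid
    wait $btpid
done
# Then, e.g.: blkparse -i trace-cfq | less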

Thank you in advance.

2008-08-25 15:39:18

by Daniel J Blueman

Subject: Re: performance "regression" in cfq compared to anticipatory, deadline and noop

On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <[email protected]> wrote:
> Hi,
>
>> From: Daniel J Blueman <[email protected]>
>> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>>
>> Hi Fabio, Jens,
>>
> ...
>> This was the last test I didn't get around to. Alas, it did help, but
>> didn't give the merging required for full performance:
>>
>> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> bs=128k count=2000
>> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>>
>> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>>
>> It is an improvement over the baseline performance of 2.6.27-rc4:
>>
>> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> bs=128k count=2000
>> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>>
>> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>>
>> Note that platter speed is around 125MB/s (which I get near at smaller
>> read sizes).
>>
>> I feel 128KB read requests are perhaps important, as this is a
>> commonly-used RAID stripe size, and may explain the read-performance
>> drop we sometimes see in hardware vs software RAID benchmarks.
>>
>> How can we generate some ideas or movement on fixing/improving this behaviour?
>>
>
> Thank you for testing. The blktrace output for this run should be
> interesting, esp. to compare it with a blktrace obtained from anticipatory
> with the same workload - IIRC anticipatory didn't suffer from the problem,
> and anticipatory has a slightly different dispatching mechanism that
> this patch tried to bring into cfq.
>
> Even if a proper fix may not belong to the elevator itself, I think
> that this pair of traces (this last test + anticipatory) should help
> in better understanding what is still going wrong.
>
> Thank you in advance.

See http://quora.org/blktrace-n.tar.bz2

Where n is:
0 - 2.6.27-rc4 unpatched
1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler

I have found it's not always possible to reproduce this issue; e.g. now,
with stock CFQ, I'm seeing a consistent 117-123MB/s with hdparm and dd
(as above), whereas I was seeing a consistent 95-103MB/s, so the
blktraces may not show the slower-performance pattern, even with
precisely the same (controlled) environment.
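
Given that variability, something like the loop below (illustrative; five
cold-cache runs per scheduler on the same sda) may help show how much
run-to-run spread there is before reading too much into any single trace:

# Illustrative: repeat the cold-cache read a few times per scheduler to gauge variance.
for sched in cfq anticipatory; do
    echo $sched > /sys/block/sda/queue/scheduler
    echo "scheduler: $sched"
    for i in 1 2 3 4 5; do
        echo 1 > /proc/sys/vm/drop_caches
        dd if=/dev/sda of=/dev/null bs=128k count=2000 2>&1 | grep copied
    done
done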

Thanks,
Daniel
--
Daniel J Blueman

2008-08-25 17:06:55

by Fabio Checconi

Subject: Re: performance "regression" in cfq compared to anticipatory, deadline and noop

> From: Daniel J Blueman <[email protected]>
> Date: Mon, Aug 25, 2008 04:39:01PM +0100
>
> On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <[email protected]> wrote:
> > Hi,
> >
> >> From: Daniel J Blueman <[email protected]>
> >> Date: Sun, Aug 24, 2008 09:24:37PM +0100
> >>
> >> Hi Fabio, Jens,
> >>
> > ...
> >> This was the last test I didn't get around to. Alas, it did help, but
> >> didn't give the merging required for full performance:
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> >> bs=128k count=2000
> >> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> >> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
> >>
> >> It is an improvement over the baseline performance of 2.6.27-rc4:
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> >> bs=128k count=2000
> >> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> >> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
> >>
> >> Note that platter speed is around 125MB/s (which I get near at smaller
> >> read sizes).
> >>
> >> I feel 128KB read requests are perhaps important, as this is a
> >> commonly-used RAID stripe size, and may explain the read-performance
> >> drop we sometimes see in hardware vs software RAID benchmarks.
> >>
> >> How can we generate some ideas or movement on fixing/improving this behaviour?
> >>
> >
> > Thank you for testing. The blktrace output for this run should be
> > interesting, esp. to compare it with a blktrace obtained from anticipatory
> > with the same workload - IIRC anticipatory didn't suffer from the problem,
> > and anticipatory has a slightly different dispatching mechanism that
> > this patch tried to bring into cfq.
> >
> > Even if a proper fix may not belong to the elevator itself, I think
> > that this pair of traces (this last test + anticipatory) should help
> > in better understanding what is still going wrong.
> >
> > Thank you in advance.
>
> See http://quora.org/blktrace-n.tar.bz2
>
> Where n is:
> 0 - 2.6.27-rc4 unpatched
> 1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
> 2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
> 3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
>
> I have found it's not always possible to reproduce this issue; e.g. now,
> with stock CFQ, I'm seeing a consistent 117-123MB/s with hdparm and dd
> (as above), whereas I was seeing a consistent 95-103MB/s, so the
> blktraces may not show the slower-performance pattern, even with
> precisely the same (controlled) environment.
>

If I read them correctly, all the traces show dispatches with
requests still growing; the elevator cannot know if a request
will grow or not once it has been queued, and the heuristics
we tried so far to postpone dispatches gave no results.

I don't see any elevator-only solution to the problem...

2008-12-09 15:14:30

by Daniel J Blueman

Subject: Re: performance "regression" in cfq compared to anticipatory, deadline and noop

Hi Jens, Fabio,

On Mon, Aug 25, 2008 at 5:06 PM, Fabio Checconi <[email protected]> wrote:
>> From: Daniel J Blueman <[email protected]>
>> Date: Mon, Aug 25, 2008 04:39:01PM +0100
>>
>> On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <[email protected]> wrote:
>> > Hi,
>> >
>> >> From: Daniel J Blueman <[email protected]>
>> >> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>> >>
>> >> Hi Fabio, Jens,
>> >>
>> > ...
>> >> This was the last test I didn't get around to. Alas, it did help, but
>> >> didn't give the merging required for full performance:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>> >>
>> >> It is an improvement over the baseline performance of 2.6.27-rc4:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>> >>
>> >> Note that platter speed is around 125MB/s (which I get near at smaller
>> >> read sizes).
>> >>
>> >> I feel 128KB read requests are perhaps important, as this is a
>> >> commonly-used RAID stripe size, and may explain the read-performance
>> >> drop we sometimes see in hardware vs software RAID benchmarks.
>> >>
>> >> How can we generate some ideas or movement on fixing/improving this behaviour?
>> >>
>> >
>> > Thank you for testing. The blktrace output for this run should be
>> > interesting, esp. to compare it with a blktrace obtained from anticipatory
>> > with the same workload - IIRC anticipatory didn't suffer from the problem,
>> > and anticipatory has a slightly different dispatching mechanism that
>> > this patch tried to bring into cfq.
>> >
>> > Even if a proper fix may not belong to the elevator itself, I think
>> > that this pair of traces (this last test + anticipatory) should help
>> > in better understanding what is still going wrong.
>> >
>> > Thank you in advance.
>>
>> See http://quora.org/blktrace-n.tar.bz2
>>
>> Where n is:
>> 0 - 2.6.27-rc4 unpatched
>> 1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
>> 2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
>> 3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
>>
>> I have found it's not always possible to reproduce this issue; e.g. now,
>> with stock CFQ, I'm seeing a consistent 117-123MB/s with hdparm and dd
>> (as above), whereas I was seeing a consistent 95-103MB/s, so the
>> blktraces may not show the slower-performance pattern, even with
>> precisely the same (controlled) environment.
>>
>
> If I read them correctly, all the traces show dispatches with
> requests still growing; the elevator cannot know if a request
> will grow or not once it has been queued, and the heuristics
> we tried so far to postpone dispatches gave no results.
>
> I don't see any elevator-only solution to the problem...

I was running into this performance issue again:

Everything same as before, 2.6.24, CFQ scheduler, Seagate 7200.11
320GB SATA (SD11 firmware) on a quiescent and well-powered system:

# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 2.24231 s, 58.5 MB/s

I found that tuning the AHCI SATA TCQ depth to 2 provides exactly the
performance we expect:

# echo 2 >/sys/block/sda/device/queue_depth
# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 0.98503 s, 133 MB/s

depth 1: 132 MB/s
depth 2: 133 MB/s
depth 3: 69.1 MB/s
depth 4: 59.7 MB/s
depth 8: 54.9 MB/s
depth 16: 57.1 MB/s
depth 31: 58.0 MB/s

Very interesting interaction, and the figures are very stable. Could
this be a product of the maximum time the drive waits to coalesce
requests before acting on them? If so, how can we diagnose this, apart
from you guys getting one of these disks?
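
For reference, a sweep like the one below (illustrative; it just mirrors the
commands above and restores the original depth afterwards) reproduces the
depth table in one go:

# Illustrative: sweep the NCQ queue depth and measure cold-cache sequential reads.
orig=$(cat /sys/block/sda/device/queue_depth)
for depth in 1 2 3 4 8 16 31; do
    echo $depth > /sys/block/sda/device/queue_depth
    sync; echo 3 > /proc/sys/vm/drop_caches
    printf 'depth %-2s: ' "$depth"
    dd if=/dev/sda of=/dev/null bs=128k count=1000 2>&1 | grep copied
done
echo $orig > /sys/block/sda/device/queue_depth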

Thanks,
Daniel
--
Daniel J Blueman