2009-11-26 16:10:49

by Corrado Zoccolo

Subject: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

The introduction of ramp-up formula for async queue depths has
slowed down dirty page reclaim, by reducing async write performance.
This patch improves the formula by considering the remaining slice.

The new formula will allow more dispatches at the beginning of the
slice, reducing them at the end.
This will ensure that we achieve good throughput, without the risk of
overrunning the allotted timeslice.

The threshold is automatically increased when sync I/O is not
intermingled with async, in accordance with the previous incarnation of
the formula.

Signed-off-by: Corrado Zoccolo <[email protected]>
---
block/cfq-iosched.c | 24 ++++++++++++++++++------
1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a5de31f..799782d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1633,12 +1633,24 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * based on the last sync IO we serviced
 	 */
 	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
-		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
-		unsigned int depth;
-
-		depth = last_sync / cfqd->cfq_slice[1];
-		if (!depth && !cfqq->dispatched)
-			depth = 1;
+		unsigned long now = jiffies;
+		unsigned long last_sync = now - cfqd->last_end_sync_rq;
+		unsigned int depth = 1;
+		if (cfqq->slice_end > now) {
+			unsigned int num,den;
+			/*
+			 * (cfqq->slice_end - now) / cfqd->cfq_slice_idle
+			 * approximates the number of requests that can be
+			 * dispatched before our slice ends
+			 * last_sync/cfq_slice[1] gives a boost when no
+			 * concurrent sync activity is expected
+			 */
+			num = last_sync * (cfqq->slice_end - now);
+			den = cfqd->cfq_slice[1] * cfqd->cfq_slice_idle;
+			if (!den)
+				den++;
+			depth += num / den;
+		}
 		if (depth < max_dispatch)
 			max_dispatch = depth;
 	}
--
1.6.2.5
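
To make the change concrete, here is a minimal user-space sketch of the depth computation before and after the patch. It is not kernel code. It assumes HZ=1000 (so one jiffy is 1 ms), a sync slice of 100 ms standing in for cfqd->cfq_slice[1], and an idle window of 8 ms standing in for cfqd->cfq_slice_idle. The names old_depth, new_depth, slice_sync_ms and slice_idle_ms are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Assumed tunables in milliseconds (HZ=1000); not read from a real kernel. */
static const unsigned int slice_sync_ms = 100; /* stands in for cfqd->cfq_slice[1] */
static const unsigned int slice_idle_ms = 8;   /* stands in for cfqd->cfq_slice_idle */

/*
 * Depth limit before the patch: one extra request per sync slice that
 * has passed since the last sync completion. A depth of 0 is raised to
 * 1 only when the queue has nothing dispatched.
 */
unsigned int old_depth(unsigned long last_sync, bool nothing_dispatched)
{
	unsigned int depth = last_sync / slice_sync_ms;

	if (!depth && nothing_dispatched)
		depth = 1;
	return depth;
}

/*
 * Depth limit after the patch: always at least 1, plus a term that
 * grows with the time since the last sync completion and shrinks as
 * the remaining slice time approaches zero.
 */
unsigned int new_depth(unsigned long last_sync, unsigned long remaining)
{
	unsigned int depth = 1;

	if (remaining > 0) {
		unsigned int num = last_sync * remaining;
		unsigned int den = slice_sync_ms * slice_idle_ms;

		if (!den)	/* same guard as the patch: slice_idle may be 0 */
			den++;
		depth += num / den;
	}
	return depth;
}
```

With these assumed values, new_depth(100, 40) evaluates to 6 at the start of a 40 ms async slice and falls back to 1 once the slice is used up, while old_depth(150, true) stays at 1.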


2009-11-26 21:25:40

by Mel Gorman

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Thu, Nov 26, 2009 at 05:10:39PM +0100, Corrado Zoccolo wrote:
> The introduction of ramp-up formula for async queue depths has
> slowed down dirty page reclaim, by reducing async write performance.
> This patch improves the formula by considering the remaining slice.
>
> The new formula will allow more dispatches at the beginning of the
> slice, reducing them at the end.
> This will ensure that we achieve good throughput, without the risk of
> overrunning the allotted timeslice.
>
> The threshold is automatically increased when sync I/O is not
> intermingled with async, in accordance with the previous incarnation of
> the formula.
>

Thanks.

I don't quite get the patch but it certainly helps the situation for the
tests I was running. It's not as good as disabling the low_latency switch
but it's an improvement. The iozone figures are now comparable to disabling
low_latency and for sysbench and the gitk stuff, I now have

SYSBENCH
       sysbench-with       low-latency  sysbench-without
         low-latency      async-rampup       low-latency
 1  1266.02 ( 0.00%)  1265.15 (-0.07%)  1278.55 ( 0.98%)
 2  1182.58 ( 0.00%)  1223.03 ( 3.31%)  1379.25 (14.26%)
 3  1218.64 ( 0.00%)  1246.42 ( 2.23%)  1580.08 (22.87%)
 4  1212.11 ( 0.00%)  1325.17 ( 8.53%)  1534.17 (20.99%)
 5  1046.77 ( 0.00%)  1008.44 (-3.80%)  1552.48 (32.57%)
 6  1187.14 ( 0.00%)  1147.18 (-3.48%)  1661.19 (28.54%)
 7  1179.37 ( 0.00%)  1202.49 ( 1.92%)  790.26 (-49.24%)
 8  1164.62 ( 0.00%)  1184.56 ( 1.68%)  854.10 (-36.36%)
 9  1095.22 ( 0.00%)  1002.42 (-9.26%)  1655.04 (33.83%)
10  1147.52 ( 0.00%)  1151.73 ( 0.37%)  1653.89 (30.62%)
11   823.38 ( 0.00%)   754.15 (-9.18%)  1627.45 (49.41%)
12   813.73 ( 0.00%)   848.32 ( 4.08%)  1494.63 (45.56%)
13   898.22 ( 0.00%)   931.47 ( 3.57%)  1521.64 (40.97%)
14   873.50 ( 0.00%)   875.75 ( 0.26%)  1311.09 (33.38%)
15   808.32 ( 0.00%)   877.87 ( 7.92%)  1009.70 (19.94%)
16   758.17 ( 0.00%)   881.23 (13.96%)   725.17 (-4.55%)

Many gains there. Not as much as disabling the switch but an improvement
nonetheless.

desktop-net-gitk
                     gitk-with       low-latency      gitk-without
                   low-latency      async-rampup       low-latency
min            954.46 ( 0.00%)   796.22 (16.58%)   640.65 (32.88%)
mean           964.79 ( 0.00%)   798.01 (17.29%)   655.57 (32.05%)
stddev          10.01 ( 0.00%)     1.91 (80.95%)   13.33 (-33.18%)
max            981.23 ( 0.00%)   800.91 (18.38%)   675.65 (31.14%)
pgalloc-fail        0 ( 0.00%)        0 ( 0.00%)        0 ( 0.00%)

Interesting to note how much more stable the results for the gitk tests are
with the patch applied.

Tests on the for-2.6.33 branch of linux-2.6-block are now in progress and
I've queued up the high-order allocation tests, but they will take several
hours to complete.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-27 08:23:12

by Jens Axboe

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Thu, Nov 26 2009, Corrado Zoccolo wrote:
> The introduction of ramp-up formula for async queue depths has
> slowed down dirty page reclaim, by reducing async write performance.
> This patch improves the formula by considering the remaining slice.
>
> The new formula will allow more dispatches at the beginning of the
> slice, reducing them at the end.
> This will ensure that we achieve good throughput, without the risk of
> overrunning the allotted timeslice.
>
> The threshold is automatically increased when sync I/O is not
> intermingled with async, in accordance with the previous incarnation of
> the formula.

The slow ramp up is pretty much essential to being able to have low
latency for the sync reads, so I'm afraid this will break that. I would
prefer doing it through memory reclaim detection, like the other patch
you and Motohiro suggested.

--
Jens Axboe

2009-11-27 09:03:33

by Corrado Zoccolo

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

Hi Jens,
let me explain why my improved formula should work better.

The original problem was that, even if an async queue had a slice of 40ms,
it could take much longer to complete since it could have up to 31
requests dispatched at the moment of expiry.
In total, it could take up to 40 + 16 * 8 = 168 ms (worst case) to
complete all dispatched requests, if they were seeky (I'm taking 8ms
average service time of a seeky request).

With your patch, within the first 200ms from last sync, the max depth
will be 1, so a slice will take at most 48ms.
My patch still ensures that a slice will take at most 48ms within the
first 200ms from last sync, but lifts the restriction that depth will
be 1 at all times.
In fact, after the first 100ms, a new async slice will start allowing
5 requests (async_slice/slice_idle). Then, whenever a request
completes, we compute remaining_slice / slice_idle, and compare this
with the number of dispatched requests. If it is greater, it means we
were lucky, and the requests were sequential, so we can allow more
requests to be dispatched. The number of requests dispatched will
decrease when reaching the end of the slice, and at the end we will
allow only depth 1.
For the next 100ms, you will allow just depth 2, and my patch will allow
depth 2 at the end of the slice (but larger at the beginning), and so
on.
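
The arithmetic above can be checked with a small user-space sketch. It is a simplified model under stated assumptions, not a description of what the kernel does: all times are in milliseconds, every in-flight request is seeky and takes 8 ms, the requests in flight are serviced back to back, and the sync slice and idle window are taken as 100 ms and 8 ms. The function names are hypothetical. The second function only examines the case where the last sync request completed 100 ms ago; larger values raise the allowed depth, and the sketch makes no claim about them.

```c
/* Assumed values taken from the discussion, all in milliseconds. */
enum {
	ASYNC_SLICE_MS = 40,  /* async time slice */
	SERVICE_MS     = 8,   /* assumed service time of one seeky request */
	SLICE_SYNC_MS  = 100, /* stands in for cfq_slice[1] */
	SLICE_IDLE_MS  = 8,   /* stands in for cfq_slice_idle */
};

/*
 * Time until the device is done with a slice when `in_flight` seeky
 * requests are still outstanding at the moment the slice expires.
 */
unsigned int worst_case_ms(unsigned int slice_ms, unsigned int in_flight)
{
	return slice_ms + in_flight * SERVICE_MS;
}

/*
 * For each millisecond t of an async slice, compute the depth the
 * patched formula would allow and the time at which that many seeky
 * requests, all started at t, would be finished. Return the latest
 * finish time seen over the whole slice.
 */
unsigned int patched_worst_finish_ms(unsigned int last_sync)
{
	unsigned int worst = 0;
	unsigned int t;

	for (t = 0; t < ASYNC_SLICE_MS; t++) {
		unsigned int remaining = ASYNC_SLICE_MS - t;
		unsigned int depth = 1 + (last_sync * remaining) /
					 (SLICE_SYNC_MS * SLICE_IDLE_MS);
		unsigned int finish = t + depth * SERVICE_MS;

		if (finish > worst)
			worst = finish;
	}
	return worst;
}
```

In this model worst_case_ms(40, 16) gives the 168 ms figure, worst_case_ms(40, 1) gives 48 ms, and patched_worst_finish_ms(100) also gives 48 ms, because the allowed depth shrinks by one request for every 8 ms of slice that is consumed.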

I think the numbers by Mel show that this idea can give better and
more stable timings, and they were just with a single NCQ rotational
disk. I wonder how much improvement we can get on a raid, where
keeping the depth at 1 hits performance really hard.
Probably, waiting until memory reclaiming is noticeably active (since
in CFQ we will be sampling) may be too late.

Thanks,
Corrado



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-11-27 11:48:42

by Jens Axboe

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Fri, Nov 27 2009, Corrado Zoccolo wrote:
> Hi Jens,
> let me explain why my improved formula should work better.
>
> The original problem was that, even if an async queue had a slice of 40ms,
> it could take much more to complete since it could have up to 31
> requests dispatched at the moment of expiry.
> In total, it could take up to 40 + 16 * 8 = 168 ms (worst case) to
> complete all dispatched requests, if they were seeky (I'm taking 8ms
> average service time of a seeky request).
>
> With your patch, within the first 200ms from last sync, the max depth
> will be 1, so a slice will take at most 48ms.
> My patch still ensures that a slice will take at most 48ms within the
> first 200ms from last sync, but lifts the restriction that depth will
> be 1 at all time.
> In fact, after the first 100ms, a new async slice will start allowing
> 5 requests (async_slice/slice_idle). Then, whenever a request
> completes, we compute remaining_slice / slice_idle, and compare this
> with the number of dispatched requests. If it is greater, it means we
> were lucky, and the requests were sequential, so we can allow more
> requests to be dispatched. The number of requests dispatched will
> decrease when reaching the end of the slice, and at the end we will
> allow only depth 1.
> For next 100ms, you will allow just depth 2, and my patch will allow
> depth 2 at the end of the slice (but larger at the beginning), and so
> on.
>
> I think the numbers by Mel show that this idea can give better and
> more stable timings, and they were just with a single NCQ rotational
> disk. I wonder how much improvement we can get on a raid, where
> keeping the depth at 1 hits performance really hard.
> Probably, waiting until memory reclaiming is noticeably active (since
> in CFQ we will be sampling) may be too late.

I'm not saying it's a no-go, just that it invalidates the low latency
testing done through the 2.6.32 cycle and we should re-run those tests
before committing and submitting anything.

If the 'check for reclaim' hack isn't good enough, then that's probably
what we have to do.

--
Jens Axboe

2009-11-27 15:12:29

by Corrado Zoccolo

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Fri, Nov 27, 2009 at 12:48 PM, Jens Axboe <[email protected]> wrote:
> On Fri, Nov 27 2009, Corrado Zoccolo wrote:
>> Hi Jens,
>> let me explain why my improved formula should work better.
>>
>> The original problem was that, even if an async queue had a slice of 40ms,
>> it could take much more to complete since it could have up to 31
>> requests dispatched at the moment of expiry.
>> In total, it could take up to 40 + 16 * 8 = 168 ms (worst case) to
>> complete all dispatched requests, if they were seeky (I'm taking 8ms
>> average service time of a seeky request).
>>
>> With your patch, within the first 200ms from last sync, the max depth
>> will be 1, so a slice will take at most 48ms.
>> My patch still ensures that a slice will take at most 48ms within the
>> first 200ms from last sync, but lifts the restriction that depth will
>> be 1 at all time.
>> In fact, after the first 100ms, a new async slice will start allowing
>> 5 requests (async_slice/slice_idle). Then, whenever a request
>> completes, we compute remaining_slice / slice_idle, and compare this
>> with the number of dispatched requests. If it is greater, it means we
>> were lucky, and the requests were sequential, so we can allow more
>> requests to be dispatched. The number of requests dispatched will
>> decrease when reaching the end of the slice, and at the end we will
>> allow only depth 1.
>> For next 100ms, you will allow just depth 2, and my patch will allow
>> depth 2 at the end of the slice (but larger at the beginning), and so
>> on.
>>
>> I think the numbers by Mel show that this idea can give better and
>> more stable timings, and they were just with a single NCQ rotational
>> disk. I wonder how much improvement we can get on a raid, where
>> keeping the depth at 1 hits performance really hard.
>> Probably, waiting until memory reclaiming is noticeably active (since
>> in CFQ we will be sampling) may be too late.
>
> I'm not saying it's a no-go, just that it invalidates the low latency
> testing done through the 2.6.32 cycle and we should re-run those tests
> before committing and submitting anything.
Agreed, but it should be ok for 2.6.33.
BTW, when Jeff investigated the write performance drop in 2.6.32, he
found, too, that low_latency should be set to 0 to get performance
comparable with previous kernels.
>
> If the 'check for reclaim' hack isn't good enough, then that's probably
> what we have to do.
There is still something puzzling me. The write performance drop
affects mostly NCQ disks, so it can't be the only cause for hitting
OOM condition, otherwise we should have observed it also on previous
kernels, on non-NCQ disks, in which the phenomenon I described above
doesn't happen.
So probably there is something else to look at, and even this patch
can only be a palliative.

Thanks
Corrado



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-11-27 16:05:40

by Mel Gorman

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Fri, Nov 27, 2009 at 12:48:47PM +0100, Jens Axboe wrote:
> On Fri, Nov 27 2009, Corrado Zoccolo wrote:
> > Hi Jens,
> > let me explain why my improved formula should work better.
> >
> > The original problem was that, even if an async queue had a slice of 40ms,
> > it could take much more to complete since it could have up to 31
> > requests dispatched at the moment of expiry.
> > In total, it could take up to 40 + 16 * 8 = 168 ms (worst case) to
> > complete all dispatched requests, if they were seeky (I'm taking 8ms
> > average service time of a seeky request).
> >
> > With your patch, within the first 200ms from last sync, the max depth
> > will be 1, so a slice will take at most 48ms.
> > My patch still ensures that a slice will take at most 48ms within the
> > first 200ms from last sync, but lifts the restriction that depth will
> > be 1 at all time.
> > In fact, after the first 100ms, a new async slice will start allowing
> > 5 requests (async_slice/slice_idle). Then, whenever a request
> > completes, we compute remaining_slice / slice_idle, and compare this
> > with the number of dispatched requests. If it is greater, it means we
> > were lucky, and the requests were sequential, so we can allow more
> > requests to be dispatched. The number of requests dispatched will
> > decrease when reaching the end of the slice, and at the end we will
> > allow only depth 1.
> > For next 100ms, you will allow just depth 2, and my patch will allow
> > depth 2 at the end of the slice (but larger at the beginning), and so
> > on.
> >
> > I think the numbers by Mel show that this idea can give better and
> > more stable timings, and they were just with a single NCQ rotational
> > disk. I wonder how much improvement we can get on a raid, where
> > keeping the depth at 1 hits performance really hard.
> > Probably, waiting until memory reclaiming is noticeably active (since
> > in CFQ we will be sampling) may be too late.
>
> I'm not saying it's a no-go, just that it invalidates the low latency
> testing done through the 2.6.32 cycle and we should re-run those tests
> before committing and submitting anything.
>

Any chance there is a description of the tests that were used to
evaluate low_latency around?

> If the 'check for reclaim' hack isn't good enough, then that's probably
> what we have to do.
>

It isn't good enough. I'll try variations of the same idea but the
initial tests were not promising at all.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-30 17:07:24

by Vivek Goyal

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

On Fri, Nov 27, 2009 at 10:03:35AM +0100, Corrado Zoccolo wrote:
> Hi Jens,
> let me explain why my improved formula should work better.
>
> The original problem was that, even if an async queue had a slice of 40ms,
> it could take much more to complete since it could have up to 31
> requests dispatched at the moment of expiry.
> In total, it could take up to 40 + 16 * 8 = 168 ms (worst case) to
> complete all dispatched requests, if they were seeky (I'm taking 8ms
> average service time of a seeky request).
>
> With your patch, within the first 200ms from last sync, the max depth
> will be 1, so a slice will take at most 48ms.
> My patch still ensures that a slice will take at most 48ms within the
> first 200ms from last sync, but lifts the restriction that depth will
> be 1 at all time.
> In fact, after the first 100ms, a new async slice will start allowing
> 5 requests (async_slice/slice_idle). Then, whenever a request
> completes, we compute remaining_slice / slice_idle, and compare this
> with the number of dispatched requests. If it is greater, it means we
> were lucky, and the requests were sequential, so we can allow more
> requests to be dispatched. The number of requests dispatched will
> decrease when reaching the end of the slice, and at the end we will
> allow only depth 1.
> For next 100ms, you will allow just depth 2, and my patch will allow
> depth 2 at the end of the slice (but larger at the beginning), and so
> on.

Got a query. Here the assumption is that async queues are not being preempted.
So driving shallower queue depths at the end of slice will help in terms
of max latencies and driving deeper queue depths at the beginning of slice
will help in getting more out of disk, without increasing max latencies.

But if we allow deeper queue depths at the beginning of the async slice,
and then async queue is preempted, then we are back to the old problem of
first request taking more time to complete.

But I guess that problem will be less severe this time as for sync-noidle
workload we will idle. So ideally we will experience the higher delays
only for first request and not on subsequent request. Previously, we did
not enable idling on random seeky queues and after one dispatch from the
queue, we will again run async queue and there was high delay after every
read request. This assumes that, upon preemption, we started running
sync-noidle workload and did not continue to dispatch from async workload.

Corrado, can you clear the air here? What's the policy w.r.t.
preemption of async queues and workload slice?

Thanks
Vivek


2009-11-30 18:58:28

by Corrado Zoccolo

Subject: Re: [RFC,PATCH] cfq-iosched: improve async queue ramp up formula

Hi Vivek,
On Mon, Nov 30, 2009 at 6:06 PM, Vivek Goyal <[email protected]> wrote:
> Got a query. Here assumption is that async queues are not being preempted.
> So driving shallower queue depths at the end of slice will help in terms
> of max latencies and driving deeper queue depths at the beginning of slice
> will help in getting more out of disk, without increasing max latencies.
>
> But if we allow deeper queue depths at the beginning of the async slice,
> and then async queue is preempted, then we are back to the old problem of
> first request taking more time to complete.
The problem should be solved, nevertheless.
First, the deeper queue will start after the first 100ms.
Moreover, we will still dispatch fewer requests than before, and the
max delay is now limited by the time slice.
Since the async time slice is also reduced according to competing sync
processes, it will not hurt the latency seen by the other processes.

> But I guess that problem will be less severe this time as for sync-noidle
> workload we will idle. So ideally we will experience the higher delays
> only for first request and not on subsequent request. Previously, we did
> not enable idling on random seeky queues and after one dispatch from the
> queue, we will again run async queue and there was high delay after every
> read request.
Yes.
> This is assuming if upon preemption, we started running
> sync-noidle workload and did not continue to dispatch from async workload.
>
> Corrodo, you can clear up the air here. What's the policy w.r.t to
> preemption of async queues and workload slice.
Ok. The workload slice works as follows (this is not only for async,
but for all workloads).
A new slice in the workload can be started if the workload slice has
not expired, and there is a ready queue.
When a queue is active, even if the workload expires, it will still
finish its complete slice, unless it has no requests and times out.

Now, preemption of sync vs async:
* if sync comes when the async workload slice is not expired, it will
just change the rb_key of the preempting queue (and workload),
ensuring that the next selected workload will be the preempting one,
but the preemption will be actually delayed until the workload slice
ends
* if sync comes after the workload slice expired, but the async queue
still has some remaining slice, it will preempt it immediately.

Basically, we protect the async workload so it can do some work, even if minimal (the
workload slices can become very small since in async workload there is
usually only 1 queue, and this is scaled against all the queues in the
other workloads).
This has been shown to improve the situation when memory pressure is high,
and we need writeback to free some of it.
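
The two preemption cases can be sketched as a small decision function. This is only an illustration of the policy as described in this mail, not the code in cfq-iosched.c. The enum, the function name and the parameters are hypothetical, times are plain counters rather than jiffies (so wraparound is ignored), and the third outcome, where the async queue's own slice has already ended, is an assumption added to make the function total.

```c
/* Hypothetical outcomes for a sync request arriving during async service. */
enum preempt_action {
	PREEMPT_DEFERRED,   /* only re-key the sync queue; switch when the workload slice ends */
	PREEMPT_IMMEDIATE,  /* workload slice is over; take the disk from the async queue now */
	PREEMPT_NOT_NEEDED, /* assumption: the async queue's slice is over anyway */
};

/*
 * Decide what happens when a sync request arrives at time `now` while
 * an async queue is active.
 *   workload_expires: end of the async workload slice
 *   slice_end:        end of the active async queue's own slice
 */
enum preempt_action on_sync_arrival(unsigned long now,
				    unsigned long workload_expires,
				    unsigned long slice_end)
{
	if (now < workload_expires)
		return PREEMPT_DEFERRED;
	if (now < slice_end)
		return PREEMPT_IMMEDIATE;
	return PREEMPT_NOT_NEEDED;
}
```

For example, with a workload slice ending at 20 and a queue slice ending at 40, a sync arrival at time 10 is deferred, and one at time 30 preempts immediately.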

Thanks,
Corrado