When the number of processes performing I/O concurrently increases, a
fixed time slice per process will cause large latencies.
In the patch, if there are more than 3 processes performing concurrent
I/O, we scale the time slice down proportionally.
To safeguard sequential bandwidth, we impose a minimum time slice,
computed from cfq_slice_idle (the idea is that cfq_slice_idle
approximates the cost for a seek).
I performed two tests, on a rotational disk:
* 32 concurrent processes performing random reads
** the bandwidth is improved from 466KB/s to 477KB/s
** the maximum latency is reduced from 7.667s to 1.728s
* 32 concurrent processes performing sequential reads
** the bandwidth is reduced from 28093KB/s to 24393KB/s
** the maximum latency is reduced from 3.781s to 1.115s
I expect numbers to be even better on SSDs, where the penalty to
disrupt sequential read is much less.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail-com>
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..cff4ca8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -306,7 +306,15 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct
cfq_queue *cfqq)
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+ unsigned low_slice = cfqd->cfq_slice_idle * (1 + cfq_cfqq_sync(cfqq));
+ unsigned interested_queues = cfq_class_rt(cfqq) ?
cfqd->busy_rt_queues : cfqd->busy_queues;
+ unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+ if (interested_queues > 3) {
+ slice *= 3;
+ slice /= interested_queues;
+ }
+
+ cfqq->slice_end = jiffies + max(slice, low_slice);
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
Corrado Zoccolo <[email protected]> writes:
> When the number of processes performing I/O concurrently increases, a
> fixed time slice per process will cause large latencies.
> In the patch, if there are more than 3 processes performing concurrent
> I/O, we scale the time slice down proportionally.
> To safeguard sequential bandwidth, we impose a minimum time slice,
> computed from cfq_slice_idle (the idea is that cfq_slice_idle
> approximates the cost for a seek).
>
> I performed two tests, on a rotational disk:
> * 32 concurrent processes performing random reads
> ** the bandwidth is improved from 466KB/s to 477KB/s
> ** the maximum latency is reduced from 7.667s to 1.728
> * 32 concurrent processes performing sequential reads
> ** the bandwidth is reduced from 28093KB/s to 24393KB/s
> ** the maximum latency is reduced from 3.781s to 1.115s
>
> I expect numbers to be even better on SSDs, where the penalty to
> disrupt sequential read is much less.
Interesting approach. I'm not sure what the benefits will be on SSDs,
as the idling logic is disabled for them (when nonrot is set and they
support ncq). See cfq_arm_slice_timer.
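For reference, the check in question looks roughly like this (paraphrased
from memory from cfq_arm_slice_timer(); the exact code may differ):

	/*
	 * Non-rotational device that supports queuing: cfq skips idling
	 * entirely in this case, so idle-based slice tuning would not
	 * take effect there.
	 */
	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
		return;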
> Signed-off-by: Corrado Zoccolo <czoccolo@gmail-com>
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index fd7080e..cff4ca8 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -306,7 +306,15 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct
> cfq_queue *cfqq)
> static inline void
> cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> {
> - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
> + unsigned low_slice = cfqd->cfq_slice_idle * (1 + cfq_cfqq_sync(cfqq));
> + unsigned interested_queues = cfq_class_rt(cfqq) ?
> cfqd->busy_rt_queues : cfqd->busy_queues;
Either my mailer displayed this wrong, or yours wraps lines.
> + unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
> + if (interested_queues > 3) {
> + slice *= 3;
How did you come to this magic number of 3, both for the number of
competing tasks and the multiplier for the slice time? Did you
experiment with this number at all?
> + slice /= interested_queues;
Of course you realize this could disable the idling logic completely,
right? I'll run this patch through some tests and let you know how it
goes.
Thanks!
-Jeff
On Thu, Sep 03 2009, Jeff Moyer wrote:
> Corrado Zoccolo <[email protected]> writes:
>
> > When the number of processes performing I/O concurrently increases, a
> > fixed time slice per process will cause large latencies.
> > In the patch, if there are more than 3 processes performing concurrent
> > I/O, we scale the time slice down proportionally.
> > To safeguard sequential bandwidth, we impose a minimum time slice,
> > computed from cfq_slice_idle (the idea is that cfq_slice_idle
> > approximates the cost for a seek).
> >
> > I performed two tests, on a rotational disk:
> > * 32 concurrent processes performing random reads
> > ** the bandwidth is improved from 466KB/s to 477KB/s
> > ** the maximum latency is reduced from 7.667s to 1.728
> > * 32 concurrent processes performing sequential reads
> > ** the bandwidth is reduced from 28093KB/s to 24393KB/s
> > ** the maximum latency is reduced from 3.781s to 1.115s
> >
> > I expect numbers to be even better on SSDs, where the penalty to
> > disrupt sequential read is much less.
>
> Interesting approach. I'm not sure what the benefits will be on SSDs,
> as the idling logic is disabled for them (when nonrot is set and they
> support ncq). See cfq_arm_slice_timer.
Also, the problem with scaling the slice a lot is that throughput has a
tendency to fall off a cliff at some point. Have you tried benchmarking
buffered writes with reads?
--
Jens Axboe
Jeff Moyer <[email protected]> writes:
> Corrado Zoccolo <[email protected]> writes:
>
>> When the number of processes performing I/O concurrently increases, a
>> fixed time slice per process will cause large latencies.
>> In the patch, if there are more than 3 processes performing concurrent
>> I/O, we scale the time slice down proportionally.
>> To safeguard sequential bandwidth, we impose a minimum time slice,
>> computed from cfq_slice_idle (the idea is that cfq_slice_idle
>> approximates the cost for a seek).
>>
>> I performed two tests, on a rotational disk:
>> * 32 concurrent processes performing random reads
>> ** the bandwidth is improved from 466KB/s to 477KB/s
>> ** the maximum latency is reduced from 7.667s to 1.728
>> * 32 concurrent processes performing sequential reads
>> ** the bandwidth is reduced from 28093KB/s to 24393KB/s
>> ** the maximum latency is reduced from 3.781s to 1.115s
>>
>> I expect numbers to be even better on SSDs, where the penalty to
>> disrupt sequential read is much less.
>
> Interesting approach. I'm not sure what the benefits will be on SSDs,
> as the idling logic is disabled for them (when nonrot is set and they
> support ncq). See cfq_arm_slice_timer.
>
>> Signed-off-by: Corrado Zoccolo <czoccolo@gmail-com>
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index fd7080e..cff4ca8 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -306,7 +306,15 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct
>> cfq_queue *cfqq)
>> static inline void
>> cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> {
>> - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
>> + unsigned low_slice = cfqd->cfq_slice_idle * (1 + cfq_cfqq_sync(cfqq));
>> + unsigned interested_queues = cfq_class_rt(cfqq) ?
>> cfqd->busy_rt_queues : cfqd->busy_queues;
>
> Either my mailer displayed this wrong, or yours wraps lines.
>
>> + unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>> + if (interested_queues > 3) {
>> + slice *= 3;
>
> How did you come to this magic number of 3, both for the number of
> competing tasks and the multiplier for the slice time? Did you
> experiment with this number at all?
>
>> + slice /= interested_queues;
>
> Of course you realize this could disable the idling logic completely,
> right? I'll run this patch through some tests and let you know how it
> goes.
I missed that you updated the slice end based on a max of slice and
low_slice. Sorry about that.
This patch does not fare well when judging fairness between processes.
I have several fio jobs that generate read workloads, and I try to
figure out whether the I/O scheduler is providing fairness based on the
I/O priorities of the processes. With your patch applied, we get the
following results:
total priority: 880
total data transferred: 1045920
class prio ideal xferred %diff
be 0 213938 352500 64
be 1 190167 193012 1
be 2 166396 123380 -26
be 3 142625 86260 -40
be 4 118854 62964 -48
be 5 95083 40180 -58
be 6 71312 74484 4
be 7 47541 113140 137
Class and prio should be self-explanatory. ideal is my cooked up
version of the ideal number of bytes the given priority should have
transferred based on the total data transferred and all processes
weighted by priority competing for the disk. xferred is the actual
amount of data transferred, and %diff is the difference between those
last two columns.
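To make the weighting concrete, the ideal column is computed along these
lines (a simplified standalone sketch, not the actual script from the test
suite; it assumes CFQ's usual per-priority slice of 100 + 20 * (4 - prio) ms,
and prio_weight plus the little harness are names made up for the
illustration):

	#include <stdio.h>

	/* weight of a best-effort queue, mirroring cfq's per-priority slice:
	 * 100ms base, +/- 20ms per priority level (be0 = 180 ... be7 = 40) */
	static unsigned prio_weight(int prio)
	{
		return 100 + 20 * (4 - prio);
	}

	int main(void)
	{
		unsigned long long total_xferred = 1045920;	/* from the table above */
		unsigned total_prio = 0;

		for (int p = 0; p < 8; p++)
			total_prio += prio_weight(p);		/* = 880 */

		for (int p = 0; p < 8; p++)			/* ideal share per prio */
			printf("be %d ideal %llu\n", p,
			       total_xferred * prio_weight(p) / total_prio);
		return 0;
	}

With total_prio = 880 this reproduces the ideal figures in the first table
above.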
Notice that best effort priority 7 managed to transfer more data than be
prio 3. That's bad. Now, let's look at 8 processes all at the same
priority level:
total priority: 800
total data transferred: 1071036
class prio ideal xferred %diff
be 4 133879 222452 66
be 4 133879 243188 81
be 4 133879 187380 39
be 4 133879 42512 -69
be 4 133879 39156 -71
be 4 133879 47604 -65
be 4 133879 37364 -73
be 4 133879 251380 87
Hmm. That doesn't look good.
For comparison, here is the output from the vanilla kernel for those two
runs:
total priority: 880
total data transferred: 954272
class prio ideal xferred %diff
be 0 195192 229108 17
be 1 173504 202740 16
be 2 151816 156660 3
be 3 130128 152052 16
be 4 108440 91636 -16
be 5 86752 64244 -26
be 6 65064 34292 -48
be 7 43376 23540 -46
total priority: 800
total data transferred: 887264
class prio ideal xferred %diff
be 4 110908 124404 12
be 4 110908 123380 11
be 4 110908 118004 6
be 4 110908 113396 2
be 4 110908 107252 -4
be 4 110908 98356 -12
be 4 110908 96244 -14
be 4 110908 106228 -5
It's worth noting that the overall throughput went up in the patched
kernel for this second case. However, if we care at all about the
notion of I/O priorities, I think your patch needs more work.
Cheers,
Jeff
Hi Jeff,
On Thu, Sep 3, 2009 at 3:01 PM, Jeff Moyer<[email protected]> wrote:
> Corrado Zoccolo <[email protected]> writes:
>
>> When the number of processes performing I/O concurrently increases, a
>> fixed time slice per process will cause large latencies.
>> In the patch, if there are more than 3 processes performing concurrent
>> I/O, we scale the time slice down proportionally.
>> To safeguard sequential bandwidth, we impose a minimum time slice,
>> computed from cfq_slice_idle (the idea is that cfq_slice_idle
>> approximates the cost for a seek).
>>
>> I performed two tests, on a rotational disk:
>> * 32 concurrent processes performing random reads
>> ** the bandwidth is improved from 466KB/s to 477KB/s
>> ** the maximum latency is reduced from 7.667s to 1.728
>> * 32 concurrent processes performing sequential reads
>> ** the bandwidth is reduced from 28093KB/s to 24393KB/s
>> ** the maximum latency is reduced from 3.781s to 1.115s
>>
>> I expect numbers to be even better on SSDs, where the penalty to
>> disrupt sequential read is much less.
>
> Interesting approach. I'm not sure what the benefits will be on SSDs,
> as the idling logic is disabled for them (when nonrot is set and they
> support ncq). See cfq_arm_slice_timer.
>
Yes, I know. Unfortunately, not all SSD devices have ncq.
Moreover, idling is disabled only for seeky processes, and the current
threshold to identify seeky processes is so high that it rarely kicks
in.
>> Signed-off-by: Corrado Zoccolo <czoccolo@gmail-com>
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index fd7080e..cff4ca8 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -306,7 +306,15 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct
>> cfq_queue *cfqq)
>> static inline void
>> cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> {
>> - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
>> + unsigned low_slice = cfqd->cfq_slice_idle * (1 + cfq_cfqq_sync(cfqq));
>> + unsigned interested_queues = cfq_class_rt(cfqq) ?
>> cfqd->busy_rt_queues : cfqd->busy_queues;
>
> Either my mailer displayed this wrong, or yours wraps lines.
>
>> + unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>> + if (interested_queues > 3) {
>> + slice *= 3;
>
> How did you come to this magic number of 3, both for the number of
> competing tasks and the multiplier for the slice time? Did you
> experiment with this number at all?
>
The number is quickly explained. The base slice is 100ms, and on the
mailing list it was mentioned that latency becomes annoying for a
user when it is above 300ms.
This means that, up to 3 processes, the current slices are good,
and for more we have to scale them down.
This is good, because we don't change the behaviour unless we have
many competing processes.
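A minimal sketch of that reasoning, in the context of cfq-iosched.c
(CFQ_TARGET_LATENCY and cfq_scale_threshold are hypothetical names, not
part of the posted patch):

	/*
	 * With a ~300ms worst-case latency target and the default 100ms
	 * sync slice, up to 300/100 = 3 queues can each receive a full
	 * slice; beyond that the slice has to be scaled down.
	 */
	#define CFQ_TARGET_LATENCY	(3 * HZ / 10)	/* ~300ms */

	static unsigned cfq_scale_threshold(struct cfq_data *cfqd)
	{
		return CFQ_TARGET_LATENCY / cfqd->cfq_slice[1];	/* = 3 by default */
	}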
>> + slice /= interested_queues;
>
> Of course you realize this could disable the idling logic completely,
> right? I'll run this patch through some tests and let you know how it
> goes.
>
You already found that this is not the case.
Thanks for your interest.
Corrado
> Thanks!
>
> -Jeff
>
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
Hi Jens,
On Thu, Sep 3, 2009 at 3:07 PM, Jens Axboe<[email protected]> wrote:
> On Thu, Sep 03 2009, Jeff Moyer wrote:
>> Corrado Zoccolo <[email protected]> writes:
>>
>> > When the number of processes performing I/O concurrently increases, a
>> > fixed time slice per process will cause large latencies.
>> > In the patch, if there are more than 3 processes performing concurrent
>> > I/O, we scale the time slice down proportionally.
>> > To safeguard sequential bandwidth, we impose a minimum time slice,
>> > computed from cfq_slice_idle (the idea is that cfq_slice_idle
>> > approximates the cost for a seek).
>> >
>> > I performed two tests, on a rotational disk:
>> > * 32 concurrent processes performing random reads
>> > ** the bandwidth is improved from 466KB/s to 477KB/s
>> > ** the maximum latency is reduced from 7.667s to 1.728
>> > * 32 concurrent processes performing sequential reads
>> > ** the bandwidth is reduced from 28093KB/s to 24393KB/s
>> > ** the maximum latency is reduced from 3.781s to 1.115s
>> >
>> > I expect numbers to be even better on SSDs, where the penalty to
>> > disrupt sequential read is much less.
>>
>> Interesting approach. I'm not sure what the benefits will be on SSDs,
>> as the idling logic is disabled for them (when nonrot is set and they
>> support ncq). See cfq_arm_slice_timer.
>
> Also, the problem with scaling the slice a lot is that throughput has a
> tendency to fall off a cliff at some point.
This is the reason that I have a minimum slice. It is already reached
for 32 processes as in my example, so the throughput drop is at most
20%.
Currently it is computed as 2*slice_idle for sync, and 1*slice_idle
for async queues.
I think this causes the leveling of data transferred regardless of
priorities. I'll cook up a formula to better scale also the minimum
slice according to priority, to fix this issue.
> Have you tried benchmarking
> buffered writes with reads?
Yes. I used that workload for benchmarks while tuning the patch.
Adding async writes doesn't change the results, mostly because cfq
preempts async queues when sync queues have new requests, and with
many readers, there are always plenty of incoming reads. Writes almost
have no chance to happen.
Corrado
>
> --
> Jens Axboe
>
>
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
Hi Jeff,
can you share the benchmark?
I think I have to fix the min slice to consider priority, too, to
respect the priorities when there are many processes.
For the fairness at a single priority level, my tests show that
fairness is improved with the patches (comparing minimum and maximum
bandwidth for a set of 32 processes):
Original:
Run status group 0 (all jobs):
READ: io=14192KiB, aggrb=480KiB/s, minb=7KiB/s, maxb=20KiB/s,
mint=30001msec, maxt=30258msec
Run status group 1 (all jobs):
READ: io=829292KiB, aggrb=27816KiB/s, minb=723KiB/s,
maxb=1004KiB/s, mint=30004msec, maxt=30529msec
Adaptive:
Run status group 0 (all jobs):
READ: io=14444KiB, aggrb=488KiB/s, minb=12KiB/s, maxb=17KiB/s,
mint=30003msec, maxt=30298msec
Run status group 1 (all jobs):
READ: io=721324KiB, aggrb=24140KiB/s, minb=689KiB/s, maxb=795KiB/s,
mint=30003msec, maxt=30598msec
Are you using random think times? This could explain the discrepancy.
Corrado
On Thu, Sep 3, 2009 at 5:38 PM, Jeff Moyer<[email protected]> wrote:
> Jeff Moyer <[email protected]> writes:
>
>> Corrado Zoccolo <[email protected]> writes:
>>
>>> When the number of processes performing I/O concurrently increases, a
>>> fixed time slice per process will cause large latencies.
>>> In the patch, if there are more than 3 processes performing concurrent
>>> I/O, we scale the time slice down proportionally.
>>> To safeguard sequential bandwidth, we impose a minimum time slice,
>>> computed from cfq_slice_idle (the idea is that cfq_slice_idle
>>> approximates the cost for a seek).
>>>
>>> I performed two tests, on a rotational disk:
>>> * 32 concurrent processes performing random reads
>>> ** the bandwidth is improved from 466KB/s to 477KB/s
>>> ** the maximum latency is reduced from 7.667s to 1.728
>>> * 32 concurrent processes performing sequential reads
>>> ** the bandwidth is reduced from 28093KB/s to 24393KB/s
>>> ** the maximum latency is reduced from 3.781s to 1.115s
>>>
>>> I expect numbers to be even better on SSDs, where the penalty to
>>> disrupt sequential read is much less.
>>
>> Interesting approach. I'm not sure what the benefits will be on SSDs,
>> as the idling logic is disabled for them (when nonrot is set and they
>> support ncq). See cfq_arm_slice_timer.
>>
>>> Signed-off-by: Corrado Zoccolo <czoccolo@gmail-com>
>>>
>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> index fd7080e..cff4ca8 100644
>>> --- a/block/cfq-iosched.c
>>> +++ b/block/cfq-iosched.c
>>> @@ -306,7 +306,15 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct
>>> cfq_queue *cfqq)
>>> static inline void
>>> cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>> {
>>> - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
>>> + unsigned low_slice = cfqd->cfq_slice_idle * (1 + cfq_cfqq_sync(cfqq));
>>> + unsigned interested_queues = cfq_class_rt(cfqq) ?
>>> cfqd->busy_rt_queues : cfqd->busy_queues;
>>
>> Either my mailer displayed this wrong, or yours wraps lines.
>>
>>> + unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>>> + if (interested_queues > 3) {
>>> + slice *= 3;
>>
>> How did you come to this magic number of 3, both for the number of
>> competing tasks and the multiplier for the slice time? Did you
>> experiment with this number at all?
>>
>>> + slice /= interested_queues;
>>
>> Of course you realize this could disable the idling logic completely,
>> right? I'll run this patch through some tests and let you know how it
>> goes.
>
> I missed that you updated the slice end based on a max of slice and
> low_slice. Sorry about that.
>
> This patch does not fare well when judging fairness between processes.
> I have several fio jobs that generate read workloads, and I try to
> figure out whether the I/O scheduler is providing fairness based on the
> I/O priorities of the processes. With your patch applied, we get the
> following results:
>
> total priority: 880
> total data transferred: 1045920
> class prio ideal xferred %diff
> be 0 213938 352500 64
> be 1 190167 193012 1
> be 2 166396 123380 -26
> be 3 142625 86260 -40
> be 4 118854 62964 -48
> be 5 95083 40180 -58
> be 6 71312 74484 4
> be 7 47541 113140 137
>
> Class and prio should be self-explanatory. ideal is my cooked up
> version of the ideal number of bytes the given priority should have
> transferred based on the total data transferred and all processes
> weighted by priority competing for the disk. xferred is the actual
> amount of data transferred, and %diff is the difference between those
> last two columns.
>
> Notice that best effort priority 7 managed to transfer more data than be
> prio 3. That's bad. Now, let's look at 8 processes all at the same
> priority level:
>
> total priority: 800
> total data transferred: 1071036
> class prio ideal xferred %diff
> be 4 133879 222452 66
> be 4 133879 243188 81
> be 4 133879 187380 39
> be 4 133879 42512 -69
> be 4 133879 39156 -71
> be 4 133879 47604 -65
> be 4 133879 37364 -73
> be 4 133879 251380 87
>
> Hmm. That doesn't look good.
>
> For comparison, here is the output from the vanilla kernel for those two
> runs:
>
> total priority: 880
> total data transferred: 954272
> class prio ideal xferred %diff
> be 0 195192 229108 17
> be 1 173504 202740 16
> be 2 151816 156660 3
> be 3 130128 152052 16
> be 4 108440 91636 -16
> be 5 86752 64244 -26
> be 6 65064 34292 -48
> be 7 43376 23540 -46
>
> total priority: 800
> total data transferred: 887264
> class prio ideal xferred %diff
> be 4 110908 124404 12
> be 4 110908 123380 11
> be 4 110908 118004 6
> be 4 110908 113396 2
> be 4 110908 107252 -4
> be 4 110908 98356 -12
> be 4 110908 96244 -14
> be 4 110908 106228 -5
>
> It's worth noting that the overall throughput went up in the patched
> kernel for this second case. However, if we care at all about the
> notion of I/O priorities, I think your patch needs more work.
>
> Cheers,
> Jeff
>
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
Corrado Zoccolo <[email protected]> writes:
> Hi Jeff,
> can you share the benchmark?
Of course, how miserly of me!
http://people.redhat.com/jmoyer/cfq-regression-tests-0.0.1.tar.gz
> I think I have to fix the min slice to consider priority, too, to
> respect the priorities when there are many processes.
>
> For the fairness at a single priority level, my tests show that
> fairness is improved with the patches (comparing minimum and maximum
> bandwidth for a set of 32 processes):
>
> Original:
> Run status group 0 (all jobs):
> READ: io=14192KiB, aggrb=480KiB/s, minb=7KiB/s, maxb=20KiB/s,
> mint=30001msec, maxt=30258msec
>
> Run status group 1 (all jobs):
> READ: io=829292KiB, aggrb=27816KiB/s, minb=723KiB/s,
> maxb=1004KiB/s, mint=30004msec, maxt=30529msec
>
> Adaptive:
> Run status group 0 (all jobs):
> READ: io=14444KiB, aggrb=488KiB/s, minb=12KiB/s, maxb=17KiB/s,
> mint=30003msec, maxt=30298msec
>
> Run status group 1 (all jobs):
> READ: io=721324KiB, aggrb=24140KiB/s, minb=689KiB/s, maxb=795KiB/s,
> mint=30003msec, maxt=30598msec
>
> Are you using random think times? This could explain the discrepancy.
No, it's just a sync read benchmark. It's the be4-x-8.fio job file in
the tarball mentioned above. Note that the run-time is only 10 seconds,
so maybe that's not enough to get accurate data? If you try increasing
it, be careful that you don't read in the entire file and wrap back
around, as this is a buffered read test, that will skew the results.
Cheers,
Jeff
Corrado Zoccolo <[email protected]> writes:
> Hi Jeff,
>
>> How did you come to this magic number of 3, both for the number of
>> competing tasks and the multiplier for the slice time? Did you
>> experiment with this number at all?
>
> The number is quickly explained. The base slice is 100ms, and on the
> mailing list it was mentioned that the latency becomes annoing for an
> user when it is above 300ms. This means that, up to 3 processes, the
> current thresholds are good, and for more we have to scale. This is
> good, because we don't change the behaviour unless we have many
> competing processes.
OK, then in your next patch, could you make this more clear? Maybe
define a LATENCY_MAX and derive the number of processes from that,
instead of assuming that the base slice will always and forever be
100ms?
Cheers,
Jeff
Hi Jeff,
On Thu, Sep 3, 2009 at 7:16 PM, Jeff Moyer<[email protected]> wrote:
> Corrado Zoccolo <[email protected]> writes:
>
>> Hi Jeff,
>> can you share the benchmark?
>
> Of course, how miserly of me!
>
> http://people.redhat.com/jmoyer/cfq-regression-tests-0.0.1.tar.gz
>
>> I think I have to fix the min slice to consider priority, too, to
>> respect the priorities when there are many processes.
>>
>> For the fairness at a single priority level, my tests show that
>> fairness is improved with the patches (comparing minimum and maximum
>> bandwidth for a set of 32 processes):
>>
>> Original:
>> Run status group 0 (all jobs):
>> READ: io=14192KiB, aggrb=480KiB/s, minb=7KiB/s, maxb=20KiB/s,
>> mint=30001msec, maxt=30258msec
>>
>> Run status group 1 (all jobs):
>> READ: io=829292KiB, aggrb=27816KiB/s, minb=723KiB/s,
>> maxb=1004KiB/s, mint=30004msec, maxt=30529msec
>>
>> Adaptive:
>> Run status group 0 (all jobs):
>> READ: io=14444KiB, aggrb=488KiB/s, minb=12KiB/s, maxb=17KiB/s,
>> mint=30003msec, maxt=30298msec
>>
>> Run status group 1 (all jobs):
>> READ: io=721324KiB, aggrb=24140KiB/s, minb=689KiB/s, maxb=795KiB/s,
>> mint=30003msec, maxt=30598msec
>>
>> Are you using random think times? This could explain the discrepancy.
>
> No, it's just a sync read benchmark. It's the be4-x-8.fio job file in
> the tarball mentioned above. Note that the run-time is only 10 seconds,
> so maybe that's not enough to get accurate data? If you try increasing
> it, be careful that you don't read in the entire file and wrap back
> around, as this is a buffered read test, that will skew the results.
>
I could reproduce the skew on my setup, but with a different pattern than yours.
I think the reason, in my case, is that the files are very large (and
the fs is partially filled), so they end up allocated in places with
different read speeds.
Even cutting the file size in half, in order to fit the benchmark on my disk, I get:
[root@et2 cfq-regression-tests]# for i in `seq 1 8`; do sync; echo 3 >
/proc/sys/vm/drop_caches; dd if=/media/lacie/testfile$i of=/dev/null
bs=1M count=30; done
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 0,978737 s, 32,1 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 0,971206 s, 32,4 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,04638 s, 30,1 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,18266 s, 26,6 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,32781 s, 23,7 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,33056 s, 23,6 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,71092 s, 18,4 MB/s
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 1,56802 s, 20,1 MB/s
Original cfq gives:
/home/corrado/cfq-regression-tests/cfq_orig/be4-x-8.fio
/home/corrado/cfq-regression-tests
total priority: 800
total data transferred: 179072
class prio ideal xferred %diff
be 4 22384 28148 25
be 4 22384 29428 31
be 4 22384 27380 22
be 4 22384 23524 5
be 4 22384 20964 -7
be 4 22384 22004 -2
be 4 22384 12020 -47
be 4 22384 15604 -31
patched cfq is a bit more skewed, but follows the same pattern:
/home/corrado/cfq-regression-tests/cfq_adapt/be4-x-8.fio
/home/corrado/cfq-regression-tests
total priority: 800
total data transferred: 204088
class prio ideal xferred %diff
be 4 25511 34020 33
be 4 25511 34020 33
be 4 25511 32484 27
be 4 25511 28388 11
be 4 25511 24628 -4
be 4 25511 24940 -3
be 4 25511 10276 -60
be 4 25511 15332 -40
/home/corrado/cfq-regression-tests
I think it is more skewed because, with a smaller slice, the seek time
has a larger impact on the amount of transferred data for slower
transfer rates, as shown below:
The amount of data transferred per slice is given by:
(1) slice = seek + amount * rate
If we multiply slice by alpha < 1, we can transfer a different amount
of data (amount1):
(2) alpha * slice = seek + amount1 * rate
Substituting slice from (1) into (2), to express amount1 as a function
of amount:
alpha * (seek + amount * rate) = seek + amount1 * rate
amount1 = amount - (1 - alpha) * seek / rate
where we see that decreasing rate increases the impact of seek time.
Your data, though, seems very different from mine; in fact, vanilla
doesn't show the same difference.
Can you try running vanilla with smaller base slices, to see if the
same effect can be seen there?
echo 15 > /sys/block/sdb/queue/iosched/slice_async
echo 37 > /sys/block/sdb/queue/iosched/slice_sync
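(Presumably these values are just the default slices scaled by 3/8, i.e.
what the patch would pick with 8 competing queues: slice_sync
100 ms * 3/8 = 37.5, rounded to 37 ms, and slice_async 40 ms * 3/8 = 15 ms.)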
If this doesn't exhibit the same pattern, then maybe the problem is
that the number of busy queues observed by the processes is
consistently different, resulting in unequal time slices.
In that case, I'll add some averaging.
I also added a "sync; echo 3 > /proc/sys/vm/drop_caches" in ioprio.sh
before each test. It is not strictly necessary, because you already
unmount the test FS, but it helps reduce pressure on the VM, so that
doing I/O on our files doesn't cause dependent I/O on other files to
free up space.
Thanks,
Corrado
> Cheers,
> Jeff
>
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
On Thu, Sep 03 2009, Corrado Zoccolo wrote:
> Hi Jens,
> On Thu, Sep 3, 2009 at 3:07 PM, Jens Axboe<[email protected]> wrote:
> > On Thu, Sep 03 2009, Jeff Moyer wrote:
> >> Corrado Zoccolo <[email protected]> writes:
> >>
> >> > When the number of processes performing I/O concurrently increases, a
> >> > fixed time slice per process will cause large latencies.
> >> > In the patch, if there are more than 3 processes performing concurrent
> >> > I/O, we scale the time slice down proportionally.
> >> > To safeguard sequential bandwidth, we impose a minimum time slice,
> >> > computed from cfq_slice_idle (the idea is that cfq_slice_idle
> >> > approximates the cost for a seek).
> >> >
> >> > I performed two tests, on a rotational disk:
> >> > * 32 concurrent processes performing random reads
> >> > ** the bandwidth is improved from 466KB/s to 477KB/s
> >> > ** the maximum latency is reduced from 7.667s to 1.728
> >> > * 32 concurrent processes performing sequential reads
> >> > ** the bandwidth is reduced from 28093KB/s to 24393KB/s
> >> > ** the maximum latency is reduced from 3.781s to 1.115s
> >> >
> >> > I expect numbers to be even better on SSDs, where the penalty to
> >> > disrupt sequential read is much less.
> >>
> >> Interesting approach. I'm not sure what the benefits will be on SSDs,
> >> as the idling logic is disabled for them (when nonrot is set and they
> >> support ncq). See cfq_arm_slice_timer.
> >
> > Also, the problem with scaling the slice a lot is that throughput has a
> > tendency to fall off a cliff at some point.
>
> This is the reason that I have a minimum slice. It is already reached
> for 32 processes as in my example, so the throughput drop is at most
> 20%.
> Currently it is computed as 2*slice_idle for sync, and 1*slice_idle
> for async queues.
> I think this causes the leveling of data transferred regardless of
> priorities. I'll cook up a formula to better scale also the minimum
> slice according to priority, to fix this issue.
For your case, it may be different for other hardware. But I think the
approach is sane to some degree, it'll require more work though. One
problem is that the count of busy queues will fluctuate a lot for sync
IO, so you'll have fairness issues. The number of potentially interested
processes needs to be a rolling average of some sort, not just looking
at ->busy_queues.
> > Have you tried benchmarking
> > buffered writes with reads?
>
> Yes. I used that workload for benchmarks while tuning the patch.
> Adding async writes doesn't change the results, mostly because cfq
> preempts async queues when sync queues have new requests, and with
> many readers, there are always plenty of incoming reads. Writes almost
> have no chance to happen.
OK, it should not, if the slice start logic is working. Just wanted to
make sure :-)
--
Jens Axboe
Hi,
On Sat, Sep 05 2009 18:16:35, Jens Axboe wrote:
> On Thu, Sep 03 2009, Corrado Zoccolo wrote:
> > Hi Jens,
> >
> > [snip]
> >
> > This is the reason that I have a minimum slice. It is already reached
> > for 32 processes as in my example, so the throughput drop is at most
> > 20%.
> > Currently it is computed as 2*slice_idle for sync, and 1*slice_idle
> > for async queues.
> > I think this causes the leveling of data transferred regardless of
> > priorities. I'll cook up a formula to better scale also the minimum
> > slice according to priority, to fix this issue.
>
> For your case, it may be different for other hardware. But I think the
> approach is sane to some degree, it'll require more work though. One
> problem is that the count of busy queues will fluctuate a lot for sync
> IO, so you'll have fairness issues. The number of potentially interested
> processes needs to be a rolling average of some sort, not just looking
> at ->busy_queues.
Here is the new patch, with the improved formula for the minimum time
slice, and with a rolling average of the number of processes doing I/O.
The average is computed in such a way that it quickly follows sudden
increases (to keep latency low) and decreases slowly (to preserve
fairness in spite of rapid drops in the instantaneous value).
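To illustrate the asymmetry, here is a standalone sketch of the averaging
rule the patch uses, with the default divisor d = 4 (the update_avg helper
and the userspace harness are only for illustration):

	#include <stdio.h>

	/* The new average is pulled (d-1)/d of the way toward the larger of
	 * (old average, current busy count), so it rises almost at once on a
	 * burst and comes back down only a fraction of the gap per update. */
	static unsigned update_avg(unsigned avg, unsigned busy, unsigned d)
	{
		unsigned lo = avg < busy ? avg : busy;
		unsigned hi = avg < busy ? busy : avg;

		return ((d - 1) * hi + lo + d / 2) / d;
	}

	int main(void)
	{
		unsigned avg = 1;

		avg = update_avg(avg, 8, 4);		/* burst of 8 queues: 1 -> 6 */
		printf("after burst: %u\n", avg);

		for (int i = 0; i < 4; i++) {		/* queues drop back to 1 */
			avg = update_avg(avg, 1, 4);
			printf("decay step %d: %u\n", i + 1, avg);	/* 5, 4, 3, 3 */
		}
		return 0;
	}

Note how, in this example, the integer rounding keeps the decayed average
a little above the instantaneous count.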
I added two tunables, for testing:
* the preferred latency, used to compute the threshold on the number of processes (which was hardcoded in the previous version);
* the divisor in the averaging formula, to alter the weights (the weights are computed as 1/d and (d-1)/d).
They can be removed if/when we find the optimal values.
Signed-off-by: Corrado Zoccolo <[email protected]>
---
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..b200b13 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -27,6 +27,8 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_preferred_latency = HZ * 3/10; /* 300 ms */
+static int cfq_queue_hist_divisor = 4;
/*
* offset from end of service tree
@@ -134,11 +136,13 @@ struct cfq_data {
struct rb_root prio_trees[CFQ_PRIO_LISTS];
unsigned int busy_queues;
+ unsigned int busy_queues_avg;
/*
* Used to track any pending rt requests so we can pre-empt current
* non-RT cfqq in service when this value is non-zero.
*/
unsigned int busy_rt_queues;
+ unsigned int busy_rt_queues_avg;
int rq_in_driver;
int sync_flight;
@@ -178,6 +182,8 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_preferred_latency;
+ unsigned int cfq_queue_hist_divisor;
struct list_head cic_list;
@@ -303,10 +309,37 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}
+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+ unsigned min_q, max_q;
+ unsigned mult = cfqd->cfq_queue_hist_divisor -1;
+ unsigned round = cfqd->cfq_queue_hist_divisor / 2;
+ if (rt) {
+ min_q = min(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
+ max_q = max(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
+ cfqd->busy_rt_queues_avg = (mult * max_q + min_q + round) / cfqd->cfq_queue_hist_divisor;
+ return cfqd->busy_rt_queues_avg;
+ } else {
+ min_q = min(cfqd->busy_queues_avg, cfqd->busy_queues);
+ max_q = max(cfqd->busy_queues_avg, cfqd->busy_queues);
+ cfqd->busy_queues_avg = (mult * max_q + min_q + round) / cfqd->cfq_queue_hist_divisor;
+ return cfqd->busy_queues_avg;
+ }
+}
+
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+ unsigned process_threshold = cfqd->cfq_preferred_latency / cfqd->cfq_slice[1];
+ unsigned interested_queues = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+ unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+ if (interested_queues > process_threshold) {
+ unsigned low_slice = min(slice, 2 * slice * cfqd->cfq_slice_idle / cfqd->cfq_slice[1]);
+ slice = max(slice * process_threshold / interested_queues, low_slice);
+ }
+
+ cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
@@ -2494,6 +2527,8 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_preferred_latency = cfq_preferred_latency;
+ cfqd->cfq_queue_hist_divisor = cfq_queue_hist_divisor;
cfqd->hw_tag = 1;
return cfqd;
@@ -2563,6 +2598,8 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_preferred_latency_show, cfqd->cfq_preferred_latency, 1);
+SHOW_FUNCTION(cfq_queue_hist_divisor_show, cfqd->cfq_queue_hist_divisor, 0);
#undef SHOW_FUNCTION
#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -2594,6 +2631,10 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_preferred_latency_store, &cfqd->cfq_preferred_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_queue_hist_divisor_store, &cfqd->cfq_queue_hist_divisor, 1, 100, 0);
+
#undef STORE_FUNCTION
#define CFQ_ATTR(name) \
@@ -2609,6 +2650,8 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(preferred_latency),
+ CFQ_ATTR(queue_hist_divisor),
__ATTR_NULL
};