Found this whilst reviewing the CFQ I/O scheduler code: Currently, this
routine only sorts using the I/O priority class - it does not properly
sort prioritized queues within a specific class. The patch changes the
sort to utilize the full I/O priority (class & priority).
A simple test shows the problem & fixed results: on a 16-way box, for
each of 12 attached disks I started up 17 processes (one process at each
possible class/priority). Each process operated on a separate file in
the file system. I then did two types of tests: (a) direct/synchronous
and (b) direct/asynchronous w/ an 80/20 read/write split.
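Each process pinned its own class/priority with the ioprio_set() syscall
before touching its file; a minimal sketch of that setup (not the actual
test harness - the O_DIRECT open and the timed I/O loop are elided):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors include/linux/ioprio.h; glibc has no ioprio_set() wrapper. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS	1

int main(int argc, char **argv)
{
	int class, prio;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <class 1-3> <prio 0-7>\n", argv[0]);
		return 1;
	}
	class = atoi(argv[1]);	/* 1==RT, 2==BE, 3==IDLE */
	prio  = atoi(argv[2]);	/* 0 is the highest priority */

	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, getpid(),
		    IOPRIO_PRIO_VALUE(class, prio)) < 0) {
		perror("ioprio_set");
		return 1;
	}

	/* ... open this process' file with O_DIRECT and run the
	 * timed 120-second read/write loop here ... */
	return 0;
}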
I then tabulated the overall I/O performed per task: (first column is
priority class (1==RT, 2==BE, 3==IDLE), second column is the I/O
priority (0==highest), then two groupings of read/write data moved
(total KiBs over a span of 120 seconds):
Synchronous:
         2.6.32-rc8      2.6.32-rc8+patch
          Read    Write      Read    Write
      -----------------  -----------------
 1 0 |  311164   310760 |  424260   424116 |
 1 1 |  129712   129792 |  390208   393232 |
 1 2 |   72312    71284 |     448      420 |
 1 3 |   40364    41052 |      28       20 |
 1 4 |   26788    26352 |      28       24 |
 1 5 |   16936    16940 |      52       32 |
 1 6 |   11196    11140 |      28       20 |
 1 7 |    6476     6648 |      20       28 |
 2 0 |      24       24 |      40        8 |
 2 1 |      24       24 |      12       36 |
 2 2 |      20       28 |      20       28 |
 2 3 |      28       20 |      24       24 |
 2 4 |      28       20 |      28       20 |
 2 5 |      28       20 |      20       28 |
 2 6 |      24       24 |      20       28 |
 2 7 |      24       24 |      36       12 |

 3   |      36       12 |      28       20 |
      -----------------  -----------------
 Sum    615184   614164    815300   818096
You can see that, due to the "random" ordering in the unpatched kernel,
lower-priority real-time processes get elevated I/O amounts. With the
patched kernel, real-time priorities 0 & 1 get the vast majority of the
available throughput (as expected). [Basically, priorities 0 & 1
flip-flop: when priority 0's I/O finishes, priority 1's gets inserted;
then priority 0 comes back with another I/O quickly enough (most of the
time) and bumps all the other queues out of the way.]
More I/O is performed overall with the patched kernel, most likely
because there is much less thrashing/seeking on the disk.
Asynchronous:
         2.6.32-rc8      2.6.32-rc8+patch
          Read    Write      Read    Write
      -----------------  -----------------
 1 0 | 1969220  1967036 | 2272660  2266220 |
 1 1 |   65880    66216 |   71552    71424 |
 1 2 |   30760    30808 |    3532     3508 |
 1 3 |   17352    17336 |    2996     3148 |
 1 4 |   11496    11288 |    3028     3116 |
 1 5 |    7836     8036 |    3008     3136 |
 1 6 |    5432     5448 |    2992     3152 |
 1 7 |    3692     3860 |    3068     3076 |
 2 0 |    3172     2972 |    3052     3092 |
 2 1 |    3100     3044 |    3000     3144 |
 2 2 |    3140     3004 |    3056     3088 |
 2 3 |    3108     3036 |    3084     3060 |
 2 4 |    3116     3028 |    2968     3176 |
 2 5 |    3068     3076 |    3096     3048 |
 2 6 |    2884     3260 |    3084     3060 |
 2 7 |    3112     3032 |    3208     2936 |

 3   |    3172     2972 |    2968     3176 |
      -----------------  -----------------
 Sum   2139540  2137452   2390352  2384560
With asynchronous I/O, priority 0 gets the vast (vast!) majority of the
bandwidth because it issues more I/Os in one go (128 asynchronous I/Os
at a time).
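(For reference, "128 asynchronous I/Os at a time" means keeping a window
of 128 I/Os in flight per process; a sketch assuming a libaio-based
harness, read side only - this is not the actual test code:)

#include <stdlib.h>
#include <libaio.h>

#define DEPTH	128
#define IOSZ	4096

/* Submit one full window of DEPTH reads; ctx comes from io_setup(). */
static int submit_batch(io_context_t ctx, int fd, long long off)
{
	static struct iocb cbs[DEPTH];
	struct iocb *list[DEPTH];
	int i;

	for (i = 0; i < DEPTH; i++) {
		void *buf;

		if (posix_memalign(&buf, 512, IOSZ))	/* O_DIRECT alignment */
			return -1;
		io_prep_pread(&cbs[i], fd, buf, IOSZ, off + (long long)i * IOSZ);
		list[i] = &cbs[i];
	}
	return io_submit(ctx, DEPTH, list);	/* all 128 go down in one call */
}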
Sort order was being calculated using just the I/O priority class; this
patch also takes the priority level within the class into account when
deciding sort order.

Note: IOPRIO_CLASS_NONE is converted to IOPRIO_CLASS_BE before queues
reach this function, hence the WARN_ON in the added cfq_class_prio().
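(For clarity: IOPRIO_PRIO_VALUE() - from include/linux/ioprio.h - packs
the class into the high bits and the per-class priority into the low
bits, so a single numeric comparison orders queues by class first, then
by priority within the class:

#define IOPRIO_CLASS_SHIFT	(13)
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | data)

With IOPRIO_CLASS_RT==1, IOPRIO_CLASS_BE==2 and IOPRIO_CLASS_IDLE==3,
RT/prio1 yields (1 << 13) | 1 == 8193, which sorts ahead of RT/prio2
(8194) and ahead of any BE value (>= 16384); lower values mean higher
effective priority.)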
Signed-off-by: Alan D. Brunelle <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/cfq-iosched.c | 27 ++++++++++++++++-----------
1 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..7b9ca4d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -226,6 +226,12 @@ CFQ_CFQQ_FNS(coop);
 CFQ_CFQQ_FNS(coop_preempt);
 #undef CFQ_CFQQ_FNS
 
+static inline int cfq_class_prio(struct cfq_queue *cfqq)
+{
+	WARN_ON(cfqq->ioprio_class == IOPRIO_CLASS_NONE);
+	return IOPRIO_PRIO_VALUE(cfqq->ioprio_class, cfqq->ioprio);
+}
+
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
 #define cfq_log(cfqd, fmt, args...)	\
@@ -536,30 +542,29 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	p = &cfqd->service_tree.rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
+		int cfqq_prio, __cfqq_prio;
 
 		parent = *p;
 		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
+		cfqq_prio = cfq_class_prio(cfqq);
+		__cfqq_prio = cfq_class_prio(__cfqq);
+
 		/*
 		 * sort RT queues first, we always want to give
 		 * preference to them. IDLE queues goes to the back.
 		 * after that, sort on the next service time.
+		 * (Lower priority values represent higher priorities.)
 		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
+		if (cfqq_prio < __cfqq_prio)
 			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
+		else if (cfqq_prio == __cfqq_prio &&
+			 time_before(rb_key, __cfqq->rb_key))
 			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (time_before(rb_key, __cfqq->rb_key))
-			n = &(*p)->rb_left;
-		else
+		else {
 			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
 			left = 0;
+		}
 
 		p = n;
 	}
--
1.6.3.3
On Tue, Nov 24, 2009 at 2:11 PM, Alan D. Brunelle <[email protected]> wrote:
> Found this whilst reviewing the CFQ I/O scheduler code: Currently, this
> routine only sorts using the I/O priority class - it does not properly
> sort prioritized queues within a specific class. The patch changes the
> sort to utilize the full I/O priority (class & priority).
This change mixes the interpretation of classes and levels within a class.
In the original code, those different things have different meanings:
* priority class decides who can use the disk
* priority level within a class determines how much of the disk time
each queue will obtain
In your case, instead, you completely remove the second meaning, and
provide a larger number of levels that just decide the first.
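To make the second point concrete: the level only scales the length of
the time slice a queue gets, via cfq_prio_slice(); roughly the following
(a simplified sketch of the 2.6.32 code, not a verbatim quote):

static inline int cfq_prio_slice(struct cfq_data *cfqd, bool sync,
				 unsigned short prio)
{
	const int base_slice = cfqd->cfq_slice[sync];

	/* lower prio value => longer slice; every level still gets a turn */
	return base_slice + (base_slice / CFQ_SLICE_SCALE * (4 - prio));
}

So prio 0 gets a longer slice than prio 7, but nobody is shut out;
classes, instead, decide who may run at all.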
>
> A simple test shows the problem & fixed results: on a 16-way box, for
> each of 12 attached disks I started up 17 processes (one process at each
> possible class/priority). Each process operated on a separate file in
> the file system. I then did two types of tests: (a) direct/synchronous
> and (b) direct/asynchronous w/ an 80/20 read/write split.
>
> I then tabulated the overall I/O performed per task: (first column is
> priority class (1==RT, 2==BE, 3==IDLE), second column is the I/O
> priority (0==highest), then two groupings of read/write data moved
> (total KiBs over a span of 120 seconds):
>
> Synchronous:
>          2.6.32-rc8      2.6.32-rc8+patch
>           Read    Write      Read    Write
>       -----------------  -----------------
>  1 0 |  311164   310760 |  424260   424116 |
>  1 1 |  129712   129792 |  390208   393232 |
>  1 2 |   72312    71284 |     448      420 |
>  1 3 |   40364    41052 |      28       20 |
>  1 4 |   26788    26352 |      28       24 |
>  1 5 |   16936    16940 |      52       32 |
>  1 6 |   11196    11140 |      28       20 |
>  1 7 |    6476     6648 |      20       28 |
The numbers for the patched kernel are bad: all priority levels > 2 are
starved, completing only an amount of I/O comparable to that of the
lower priority classes:
>  2 0 |      24       24 |      40        8 |
>  2 1 |      24       24 |      12       36 |
>  2 2 |      20       28 |      20       28 |
>  2 3 |      28       20 |      24       24 |
>  2 4 |      28       20 |      28       20 |
>  2 5 |      28       20 |      20       28 |
>  2 6 |      24       24 |      20       28 |
>  2 7 |      24       24 |      36       12 |
>
>  3   |      36       12 |      28       20 |
>       -----------------  -----------------
>  Sum    615184   614164    815300   818096
>
This is not the intended behaviour, and you don't need 14 priority
levels just to decide which single one gets to use the disk.
Cheers,
Corrado
On Tue, 2009-11-24 at 15:03 +0100, Corrado Zoccolo wrote:
> On Tue, Nov 24, 2009 at 2:11 PM, Alan D. Brunelle <[email protected]> wrote:
> > Found this whilst reviewing the CFQ I/O scheduler code: Currently, this
> > routine only sorts using the I/O priority class - it does not properly
> > sort prioritized queues within a specific class. The patch changes the
> > sort to utilize the full I/O priority (class & priority).
>
> This changes mixes the interpretation of classes and levels within class.
> In the original code, those different things have different meanings:
> * priority class decides who can use the disk
> * priority level within a class determines how much of the disk time
> each queue will obtain
> In your case. instead, you completely remove the second meaning, and
> provide a larger number of levels to just decide the first.
Having read the ioprio.txt I had thought that the priorities within a
class should still be honored and that the time slice calculations in
cfq_prio_slice would be left as is. "ioprio" is probably the wrong field
name in the code (and text) then, as it is not meant as a priority but
as a time slice indicator?! The text in ioprio.txt and in the man page
for ionice are very inconsistent here: For example, the ionice man page
states: "This [best effort] class takes a priority argument from 0-7,
with lower number being higher priority. Programs running at the same
best effort priority are served in a round-robin fashion." Which implies
a secondary sort-order for priority within a class. Of course, both
ioprio.txt and the ionice man page also talk about class levels in a way
that may indicate it is not priority based. Hm...
Regards,
Alan
Hi Alan,
On Tue, Nov 24, 2009 at 3:49 PM, Alan D. Brunelle <[email protected]> wrote:
>
> Having read the ioprio.txt I had thought that the priorities within a
> class should still be honored and that the time slice calculations in
> cfq_prio_slice would be left as is. "ioprio" is probably the wrong field
> name in the code (and text) then, as it is not meant as a priority but
> as a time slice indicator?! The text in ioprio.txt and in the man page
> for ionice are very inconsistent here: For example, the ionice man page
> states: "This [best effort] class takes a priority argument from 0-7,
> with lower number being higher priority. Programs running at the same
> best effort priority are served in a round-robin fashion." Which implies
> a secondary sort-order for priority within a class. Of course, both
> ioprio.txt and the ionice man page also talk about class levels in a way
> that may indicate it is not priority based. Hm...
The ionice man page is usually not in sync with the kernel. It should
really just point at the kernel doc, since ionice is just an interface
for setting parameters that will then be interpreted by the kernel (it
could instead list the different interpretations for each kernel
incarnation, but that is infeasible).

ioprio.txt in my kernel tree says quite explicitly that RT can starve
BE, and that IDLE is starved by everyone else.
Moreover, it says:
RT: Within the RT class, there are 8 levels of class data that
determine exactly how much time this process needs the disk for on
each service.
BE: The class data determines how much io bandwidth the process will
get, it's directly mappable to the cpu nice levels just more coarsely
implemented. 0 is the highest BE prio level, 7 is the lowest.
This is still a bit confusing, since it says disk time for RT and
bandwidth for BE, while the implementation is the same for both: time
based.
BTW, if you want to review CFQ, you should look at
http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-2.6.33,
where cfq development for next kernel version is taking place.
Thanks
Corrado