Hi,
This is V2 of the group_idle implementation patchset. I have done more
testing since V1 and fixed a couple of bugs.
What's the problem
------------------
On high-end storage (in my case an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and we suffer a lot in terms of overall throughput.
All these problems primarily originate from two things: idling on a
per-cfq-queue basis, and quantum (dispatching only a limited number of
requests from a single queue, and until then not allowing dispatch from
other queues). Once you set slice_idle=0 and raise quantum, most of CFQ's
problems on high-end storage disappear.
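For reference, a sketch of how one might set these tunables through sysfs (the device name is hypothetical, and the paths assume the usual cfq iosched directory on a 2.6.35-era kernel; values are illustrative):

```shell
# Hypothetical device sdb; tunables live under the device's iosched
# directory once cfq is the active scheduler. Requires root.
echo cfq > /sys/block/sdb/queue/scheduler
echo 0   > /sys/block/sdb/queue/iosched/slice_idle   # disable per-queue idling
echo 16  > /sys/block/sdb/queue/iosched/quantum      # allow deeper dispatch per queue
```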
This problem also becomes visible with the IO controller, where one creates
multiple groups and gets fairness, but overall throughput drops. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups with weights 100 to 800.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6245 12776 16591 23471 28746 36799 43031 49778 217437
bsr 1 2 5100 11063 17216 23136 23738 30391 35910 40874 187428
bsr 1 4 4623 9718 14746 18356 22875 30407 33215 38073 172013
bsr 1 8 4720 10143 13499 19115 22157 29126 31688 30784 161232
Notice that overall throughput is only around 160MB/s with 8 sequential readers
in each group.
With this patch set, I set slice_idle=0 and re-ran the same test.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6789 12764 17174 23111 28528 36807 42753 48826 216752
bsr 1 2 9845 20617 30521 39061 45514 51607 63683 63770 324618
bsr 1 4 14835 24241 42444 55207 45914 51318 54661 60318 348938
bsr 1 8 12022 24478 36732 48651 54333 60974 64856 72930 374976
Notice how overall throughput has shot up to 374MB/s while retaining the ability
to do IO control.
So this is not the default mode. The new tunable, group_idle, allows one to
set slice_idle=0 to disable some of CFQ's features and primarily use the
group service differentiation feature.
If you have thoughts on other ways of solving the problem, I am all ears.
Thanks
Vivek
- Currently in CFQ there are many situations where we don't know how
  much time slice has been consumed by a queue. For example, all
  the random reader/writer queues, where we don't idle on
  individual queues and expire the queue immediately
  after the request dispatch.
- In such a case the time consumed by a queue is little more than a
  memory copy operation. Meaningful time measurement is possible only
  if we idle on a queue and allow dispatch from it for a significant
  amount of time.
- As of today, in such cases we calculate the time since
  dispatch from the queue started and charge all of it.
  Generally this rounds to 1 jiffy, but in some cases it can
  be more. For example, if we are driving a high request queue
  depth and the driver is too busy and does not ask for new
  requests for 8-10 jiffies, the active queue
  gets charged very unfairly.
- So, fundamentally, the whole notion of charging for a time slice
  is valid only if we have been idling on the queue. Otherwise,
  on an NCQ drive there may be other requests in the hardware queue
  and we cannot do the time-slice calculation.
- This patch tweaks the slice charging logic a bit so that
  in cases where we can't know the amount of time consumed, we
  charge in terms of the number of requests dispatched
  (IOPS). This effectively switches CFQ's fairness model to
  fairness in terms of IOPS when slice_idle=0.
- As of today this will primarily be useful only with the
  group_idle patches, so that we get fairness in terms of
  IOPS across groups. The idea is that on fast storage
  one can run CFQ with slice_idle=0 and still have the IO
  controller work without losing too much
  throughput.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
* if there are mutiple queues in the group, each can dispatch
* a single request on seeky media and cause lots of seek time
* and group will never know it.
+ *
+ * If drive is NCQ and we are driving deep queue depths, then
+ * it is not reasonable to charge the slice since dispatch
+ * started because this time will include time taken by all
+ * the other requests in the queue.
+ *
+ * Actually there is no reasonable way to know the disk time
+ * here and we need to come up with some approximation. If
+ * disk is non NCQ, we should be driving request queue depth
+ * 1, then charge for time since dispatch start and this will
+ * account for seek time properly on seeky media. If request
+ * queue depth is high, then charge for the number of requests
+ * dispatched from the queue. This effectively becomes
+ * charging in terms of IOPS.
*/
- slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
- 1);
+ if (cfqq->cfqd->hw_tag == 0)
+ slice_used = max_t(unsigned,
+ (jiffies - cfqq->dispatch_start), 1);
+ else
+ slice_used = cfqq->slice_dispatch;
} else {
slice_used = jiffies - cfqq->slice_start;
if (slice_used > cfqq->allocated_slice)
slice_used = cfqq->allocated_slice;
}
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+ cfqq->slice_dispatch);
return slice_used;
}
--
1.7.1.1
o Implement a new tunable, group_idle, which allows idling on the group
  instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
  on the individual queues but idle on the group. This way, on fast storage,
  we can get fairness between groups while at the same time improving
  overall throughput.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 53 insertions(+), 7 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f44064c..b23d7f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;
@@ -198,6 +199,8 @@ struct cfq_group {
struct hlist_node cfqd_node;
atomic_t ref;
#endif
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
};
/*
@@ -271,6 +274,7 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_group_idle;
unsigned int cfq_latency;
unsigned int cfq_group_isolation;
@@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);
+ if (!cfqd->cfq_slice_idle)
+ return false;
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
- unsigned long sl;
+ unsigned long sl, group_idle = 0;
/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
- return;
+ if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
+ /* no queue idling. Check for group idling */
+ if (cfqd->cfq_group_idle)
+ group_idle = cfqd->cfq_group_idle;
+ else
+ return;
+ }
/*
* still active requests from this queue, don't idle
@@ -1924,13 +1936,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}
+ /* There are other queues in the group, don't do group idle */
+ if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+ return;
+
cfq_mark_cfqq_wait_request(cfqq);
- sl = cfqd->cfq_slice_idle;
+ if (group_idle)
+ sl = cfqd->cfq_group_idle;
+ else
+ sl = cfqd->cfq_slice_idle;
mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+ group_idle ? 1 : 0);
}
/*
@@ -1946,6 +1966,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
+ (RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
cfqq = NULL;
goto keep_queue;
} else
- goto expire;
+ goto check_group_idle;
}
/*
@@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
goto keep_queue;
}
+ /*
+ * If group idle is enabled and there are requests dispatched from
+ * this group, wait for requests to complete.
+ */
+check_group_idle:
+ if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+ && cfqq->cfqg->dispatched) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
@@ -3391,6 +3423,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ (RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
rq_data_dir(rq), rq_is_sync(rq));
@@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* the queue.
*/
if (cfq_should_wait_busy(cfqd, cfqq)) {
- cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ unsigned long extend_sl = cfqd->cfq_slice_idle;
+ if (!cfqd->cfq_slice_idle)
+ extend_sl = cfqd->cfq_group_idle;
+ cfqq->slice_end = jiffies + extend_sl;
cfq_mark_cfqq_wait_busy(cfqq);
cfq_log_cfqq(cfqd, cfqq, "will busy wait");
}
@@ -3865,6 +3901,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_group_idle = cfq_group_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
@@ -3937,6 +3974,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3969,6 +4007,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3990,6 +4029,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(group_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_isolation),
__ATTR_NULL
@@ -4043,6 +4083,12 @@ static int __init cfq_init(void)
if (!cfq_slice_idle)
cfq_slice_idle = 1;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfq_group_idle)
+ cfq_group_idle = 1;
+#else
+ cfq_group_idle = 0;
+#endif
if (cfq_slab_setup())
return -ENOMEM;
--
1.7.1.1
- Divyesh had gotten rid of this code in the past. I want to re-introduce
  it, as it helps me a lot during debugging.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b23d7f4..1f0d4ea 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ /* Number of sectors dispatched from queue in single dispatch round */
+ unsigned long nr_sectors;
};
/*
@@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
- cfqq->slice_dispatch);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
+ slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
return slice_used;
}
@@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;
cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
rq_data_dir(rq), rq_is_sync(rq));
}
--
1.7.1.1
On Mon, Jul 19, 2010 at 10:14 AM, Vivek Goyal <[email protected]> wrote:
> @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>                  * the queue.
>                  */
>                 if (cfq_should_wait_busy(cfqd, cfqq)) {
> -                       cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> +                       unsigned long extend_sl = cfqd->cfq_slice_idle;
> +                       if (!cfqd->cfq_slice_idle)
> +                               extend_sl = cfqd->cfq_group_idle;
> +                       cfqq->slice_end = jiffies + extend_sl;
>                         cfq_mark_cfqq_wait_busy(cfqq);
>                         cfq_log_cfqq(cfqd, cfqq, "will busy wait");
>                 }
Vivek, I haven't looked at this particular code snippet for some time.
Can you tell me why we add the slice_idle (or w/ your change
extend_sl) to slice_end instead of arming the idle timer with that
amount of time?
-Divyesh
On Mon, Jul 19, 2010 at 01:54:53PM -0700, Divyesh Shah wrote:
> On Mon, Jul 19, 2010 at 10:14 AM, Vivek Goyal <[email protected]> wrote:
> > @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> >                  * the queue.
> >                  */
> >                 if (cfq_should_wait_busy(cfqd, cfqq)) {
> > -                       cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> > +                       unsigned long extend_sl = cfqd->cfq_slice_idle;
> > +                       if (!cfqd->cfq_slice_idle)
> > +                               extend_sl = cfqd->cfq_group_idle;
> > +                       cfqq->slice_end = jiffies + extend_sl;
> >                         cfq_mark_cfqq_wait_busy(cfqq);
> >                         cfq_log_cfqq(cfqd, cfqq, "will busy wait");
> >                 }
>
> Vivek, I haven't looked at this particular code snippet for some time.
> Can you tell me why we add the slice_idle (or w/ your change
> extend_sl) to slice_end instead of arming the idle timer with that
> amount of time?
Divyesh,
With wait busy we do arm the timer as well. Wait busy just says to extend
the slice a bit, so that we do arm the timer and select_queue()
does not expire the queue right away.
Vivek