Hi,
This is V4 of the patches for group_idle and CFQ group charge accounting in
terms of IOPS implementation. Since V3 not much has changed. Just more testing
and rebase on top of for-2.6.36 branch of block tree.
What's the problem
------------------
On high-end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and overall throughput suffers.
These problems primarily originate from two things: idling on a
per-cfq-queue basis, and the quantum (dispatching only a limited number of
requests from a single queue) while not allowing dispatch from other queues
until then. Once you set slice_idle=0 and quantum to a higher value, most of
CFQ's problems on high-end storage disappear.
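As a concrete sketch of that tuning (the device name and the quantum value
below are just example placeholders):

  # disable per-queue idling; allow more requests per dispatch round
  echo 0  > /sys/block/sdb/queue/iosched/slice_idle
  echo 16 > /sys/block/sdb/queue/iosched/quantum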
This problem also becomes visible with the IO controller, where one creates
multiple groups and gets fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1,2,4,8) in 8 groups with weights 100 to 800.
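For reference, the group setup is along these lines (a minimal sketch; the
mount point is arbitrary and blkio.weight is the standard blkio controller
weight file, valid range 100-1000):

  mount -t cgroup -o blkio none /cgroup
  for i in 1 2 3 4 5 6 7 8; do
          mkdir /cgroup/grp$i
          # weight 100 for grp1, 200 for grp2, ... 800 for grp8
          echo $((i * 100)) > /cgroup/grp$i/blkio.weight
  done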
Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1 NRGRP=8 DEV=/dev/dm-3
Workload=bsr iosched=cfq Filesz=512M bs=4K
gi=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6519 12742 16801 23109 28694 35988 43175 49272 216300
bsr 1 2 5522 10922 17174 22554 24151 30488 36572 42021 189404
bsr 1 4 4593 9620 13120 21405 25827 28097 33029 37335 173026
bsr 1 8 3622 8277 12557 18296 21775 26022 30760 35713 157022
Notice that overall throughput is just around 160MB/s with 8 sequential
readers in each group.
With this patch set, I set slice_idle=0 and re-ran the same test.
Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1 NRGRP=8 DEV=/dev/dm-3
Workload=bsr iosched=cfq Filesz=512M bs=4K
gi=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6652 12341 17335 23856 28740 36059 42833 48487 216303
bsr 1 2 10168 20292 29827 38363 45746 52842 60071 63957 321266
bsr 1 4 11176 21763 32713 42970 53222 58613 63598 69296 353351
bsr 1 8 11750 23718 34102 47144 56975 63613 69000 69666 375968
Notice how overall throughput has shot up to 350-370MB/s while retaining the
ability to do IO control.
Note that this is not the default mode. The new tunable group_idle allows one
to set slice_idle=0 to disable some of the CFQ features and rely primarily on
the group service differentiation feature.
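A hedged sketch of switching into this mode (again, sdb is a placeholder;
group_idle is left at its default, which the sysfs file reports in
milliseconds):

  echo 0 > /sys/block/sdb/queue/iosched/slice_idle
  cat /sys/block/sdb/queue/iosched/group_idle   # still 8 by default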
By default nothing should change for CFQ and this change should be fairly
low risk.
Thanks
Vivek
o Some documentation to provide help with tunables.
Signed-off-by: Vivek Goyal <[email protected]>
---
Documentation/block/cfq-iosched.txt | 45 ++++++++++++++++++++++++++++
Documentation/cgroups/blkio-controller.txt | 28 +++++++++++++++++
2 files changed, 73 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/cfq-iosched.txt
diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
new file mode 100644
index 0000000..3ea9e79
--- /dev/null
+++ b/Documentation/block/cfq-iosched.txt
@@ -0,0 +1,45 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non-zero value. That means by default we idle on
+queues/service trees. This can be very helpful on highly seeky media like
+single spindle SATA/SAS disks where we can cut down on overall number of
+seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+IO priority becomes weaker.
+
+So depending on storage and workload, it might be useful to set slice_idle=0.
+In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
+keeping slice_idle enabled should be useful. For any configurations where
+there are multiple spindles behind single LUN (Host based hardware RAID
+controller or for storage arrays), setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide priority based time slices. Higher priority
+process gets bigger time slice and lower priority process gets smaller time
+slice. Measuring time becomes harder if storage is fast and supports NCQ and
+it would be better to dispatch multiple requests from multiple cfq queues in
+request queue at a time. In such scenario, it is not possible to measure time
+consumed by single queue accurately.
+
+What is possible though is to measure number of requests dispatched from a
+single queue and also allow dispatch from multiple cfq queue at the same time.
+This effectively becomes the fairness in terms of IOPS (IO operations per
+second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non-cgroup users nothing should change.
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 48e0b21..6919d62 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -217,6 +217,7 @@ Details of cgroup files
CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------
If group_isolation=1, it provides stronger isolation between groups at the
expense of throughput. By default group_isolation is 0. In general that
@@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
and one wants stronger isolation between groups, then set group_isolation=1
but this will come at cost of reduced throughput.
+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On faster hardware CFQ can be slow, especially with sequential workloads.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to drive higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is the same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if multiple groups have been
+created and the applications put in those groups are not driving enough
+IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
+idle on individual groups and throughput should improve.
+
What works
==========
- Currently only sync IO queues are support. All the buffered writes are
--
1.7.1.1
o do not idle either on cfq queue or service tree if slice_idle=0. User does
not want any queue or service tree idling. Currently even if slice_idle=0,
we were waiting for request to finish before expiring the queue and that
can lead to lower queue depths.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index eb4086f..8830569 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1839,6 +1839,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);
+ if (!cfqd->cfq_slice_idle)
+ return false;
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1879,7 +1882,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
+ if (!cfq_should_idle(cfqd, cfqq))
return;
/*
--
1.7.1.1
o Implement another CFQ mode where we charge group in terms of number
of requests dispatched instead of measuring the time. Measuring in terms
of time is not possible when we are driving deeper queue depths and there
are requests from multiple cfq queues in the request queue.
o This mode currently gets activated if one sets slice_idle=0 and associated
disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
most of the queues will dispatch 1 or more requests and then cfq queue
expiry happens and we don't have a way to measure time. So start providing
fairness in terms of IOPS.
o Currently IOPS mode works only with cfq group scheduling. CFQ is following
different scheduling algorithms for queue and group scheduling. These IOPS
stats are used only for group scheduling hence in non-cgroup mode nothing
should change.
o For CFQ group scheduling one can disable slice idling so that we don't idle
on queue and drive deeper request queue depths (achieving better throughput),
at the same time group idle is enabled so one should get service
differentiation among groups.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8830569..3fc6be1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
&cfqg->service_trees[i][j]: NULL) \
+static inline bool iops_mode(struct cfq_data *cfqd)
+{
+ /*
+ * If we are not idling on queues and it is a NCQ drive, parallel
+ * execution of requests is on and measuring time is not possible
+ * in most of the cases until and unless we drive shallower queue
+ * depths and that becomes a performance bottleneck. In such cases
+ * switch to start providing fairness in terms of number of IOs.
+ */
+ if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
+ return true;
+ else
+ return false;
+}
+
static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
{
if (cfq_class_idle(cfqq))
@@ -906,7 +921,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
return slice_used;
}
@@ -914,19 +928,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl, charge_sl;
+ unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;
BUG_ON(nr_sync < 0);
- used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
+ used_sl = charge = cfq_cfqq_slice_usage(cfqq);
- if (!cfq_cfqq_sync(cfqq) && !nr_sync)
- charge_sl = cfqq->allocated_slice;
+ if (iops_mode(cfqd))
+ charge = cfqq->slice_dispatch;
+ else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
+ charge = cfqq->allocated_slice;
/* Can't update vdisktime while group is on service tree */
cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
+ cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
__cfq_group_service_tree_add(st, cfqg);
/* This group is being expired. Save the context */
@@ -940,6 +956,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
+ used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
--
1.7.1.1
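As a hedged way of sanity-checking the resulting service differentiation
(paths assume the cgroup setup sketched in the cover letter;
blkio.io_serviced is the standard per-group completed-request counter):

  cat /cgroup/grp1/blkio.io_serviced
  cat /cgroup/grp8/blkio.io_serviced
  # with weights 100 vs 800, the per-device counts should differ
  # roughly in a 1:8 ratio in IOPS mode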
o Implement a new tunable group_idle, which allows idling on the group
instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
on the individual queues but idle on the group. This way, on fast storage
we can get fairness between groups while overall throughput improves.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 56 insertions(+), 9 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3fc6be1..85e4819 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;
@@ -198,6 +199,8 @@ struct cfq_group {
struct hlist_node cfqd_node;
atomic_t ref;
#endif
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
};
/*
@@ -271,6 +274,7 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_group_idle;
unsigned int cfq_latency;
unsigned int cfq_group_isolation;
@@ -1884,7 +1888,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
- unsigned long sl;
+ unsigned long sl, group_idle = 0;
/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1900,8 +1904,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfq_should_idle(cfqd, cfqq))
- return;
+ if (!cfq_should_idle(cfqd, cfqq)) {
+ /* no queue idling. Check for group idling */
+ if (cfqd->cfq_group_idle)
+ group_idle = cfqd->cfq_group_idle;
+ else
+ return;
+ }
/*
* still active requests from this queue, don't idle
@@ -1928,13 +1937,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}
+ /* There are other queues in the group, don't do group idle */
+ if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+ return;
+
cfq_mark_cfqq_wait_request(cfqq);
- sl = cfqd->cfq_slice_idle;
+ if (group_idle)
+ sl = cfqd->cfq_group_idle;
+ else
+ sl = cfqd->cfq_slice_idle;
mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+ group_idle ? 1 : 0);
}
/*
@@ -1950,6 +1967,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
+ (RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2219,7 +2237,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
cfqq = NULL;
goto keep_queue;
} else
- goto expire;
+ goto check_group_idle;
}
/*
@@ -2247,8 +2265,23 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
* flight or is idling for a new request, allow either of these
* conditions to happen (or time out) before selecting a new queue.
*/
- if (timer_pending(&cfqd->idle_slice_timer) ||
- (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
+ if (timer_pending(&cfqd->idle_slice_timer)) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
+ if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
+ /*
+ * If group idle is enabled and there are requests dispatched from
+ * this group, wait for requests to complete.
+ */
+check_group_idle:
+ if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+ && cfqq->cfqg->dispatched) {
cfqq = NULL;
goto keep_queue;
}
@@ -3396,6 +3429,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ (RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
rq_data_dir(rq), rq_is_sync(rq));
@@ -3425,7 +3459,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* the queue.
*/
if (cfq_should_wait_busy(cfqd, cfqq)) {
- cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ unsigned long extend_sl = cfqd->cfq_slice_idle;
+ if (!cfqd->cfq_slice_idle)
+ extend_sl = cfqd->cfq_group_idle;
+ cfqq->slice_end = jiffies + extend_sl;
cfq_mark_cfqq_wait_busy(cfqq);
cfq_log_cfqq(cfqd, cfqq, "will busy wait");
}
@@ -3871,6 +3908,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_group_idle = cfq_group_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
@@ -3943,6 +3981,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3975,6 +4014,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3996,6 +4036,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(group_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_isolation),
__ATTR_NULL
@@ -4049,6 +4090,12 @@ static int __init cfq_init(void)
if (!cfq_slice_idle)
cfq_slice_idle = 1;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfq_group_idle)
+ cfq_group_idle = 1;
+#else
+ cfq_group_idle = 0;
+#endif
if (cfq_slab_setup())
return -ENOMEM;
--
1.7.1.1
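A usage note on the tunable added above (a sketch, not part of the patch):
if many groups exist but their applications rarely keep the disk busy, group
idling itself can cost throughput, and it can be turned off independently:

  echo 0 > /sys/block/sdb/queue/iosched/group_idle   # sdb is a placeholder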
o Divyesh had gotten rid of this code in the past. I want to re-introduce it
as it helps me a lot during debugging.
Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Divyesh Shah <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 85e4819..f65c6f0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ /* Number of sectors dispatched from queue in single dispatch round */
+ unsigned long nr_sectors;
};
/*
@@ -960,8 +962,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
- used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+ " sect=%u", used_sl, cfqq->slice_dispatch, charge,
+ iops_mode(cfqd), cfqq->nr_sectors);
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
@@ -1609,6 +1612,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;
cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1971,6 +1975,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
rq_data_dir(rq), rq_is_sync(rq));
}
--
1.7.1.1
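A hedged note on observing these messages: cfq_log output is emitted into the
blktrace stream, so something along these lines should surface the accounting
added above (the exact prefix and values are illustrative):

  blktrace -d /dev/sdb -o - | blkparse -i - | grep sl_used
  #   ... m   N cfq1234 sl_used=8 disp=10 charge=10 iops=1 sect=80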
Vivek Goyal <[email protected]> writes:
> o do not idle either on cfq queue or service tree if slice_idle=0. User does
> not want any queue or service tree idling. Currently even if slice_idle=0,
> we were waiting for request to finish before expiring the queue and that
> can lead to lower queue depths.
Hm, I submitted a similar patch that was supposed to make .35. Oh well,
Acked-by: Jeff Moyer <[email protected]>
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index eb4086f..8830569 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1839,6 +1839,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> BUG_ON(!service_tree);
> BUG_ON(!service_tree->count);
>
> + if (!cfqd->cfq_slice_idle)
> + return false;
> +
> /* We never do for idle class queues. */
> if (prio == IDLE_WORKLOAD)
> return false;
> @@ -1879,7 +1882,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> /*
> * idle is disabled, either manually or by past process history
> */
> - if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> + if (!cfq_should_idle(cfqd, cfqq))
> return;
>
> /*
Vivek Goyal <[email protected]> writes:
> o Implement another CFQ mode where we charge group in terms of number
> of requests dispatched instead of measuring the time. Measuring in terms
> of time is not possible when we are driving deeper queue depths and there
> are requests from multiple cfq queues in the request queue.
>
> o This mode currently gets activated if one sets slice_idle=0 and associated
> disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
> most of the queues will dispatch 1 or more requests and then cfq queue
> expiry happens and we don't have a way to measure time. So start providing
> fairness in terms of IOPS.
>
> o Currently IOPS mode works only with cfq group scheduling. CFQ is following
> different scheduling algorithms for queue and group scheduling. These IOPS
> stats are used only for group scheduling hence in non-cgroup mode nothing
> should change.
>
> o For CFQ group scheduling one can disable slice idling so that we don't idle
> on queue and drive deeper request queue depths (achieving better throughput),
> at the same time group idle is enabled so one should get service
> differentiation among groups.
>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
> ---
> block/cfq-iosched.c | 30 ++++++++++++++++++++++++------
> 1 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 8830569..3fc6be1 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
> &cfqg->service_trees[i][j]: NULL) \
>
>
> +static inline bool iops_mode(struct cfq_data *cfqd)
> +{
> + /*
> + * If we are not idling on queues and it is a NCQ drive, parallel
> + * execution of requests is on and measuring time is not possible
> + * in most of the cases until and unless we drive shallower queue
> + * depths and that becomes a performance bottleneck. In such cases
> + * switch to start providing fairness in terms of number of IOs.
> + */
> + if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
> + return true;
> + else
> + return false;
> +}
> +
> static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
> {
> if (cfq_class_idle(cfqq))
> @@ -906,7 +921,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> slice_used = cfqq->allocated_slice;
> }
>
> - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> return slice_used;
> }
>
> @@ -914,19 +928,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> struct cfq_queue *cfqq)
> {
> struct cfq_rb_root *st = &cfqd->grp_service_tree;
> - unsigned int used_sl, charge_sl;
> + unsigned int used_sl, charge;
> int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
> - cfqg->service_tree_idle.count;
>
> BUG_ON(nr_sync < 0);
> - used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
> + used_sl = charge = cfq_cfqq_slice_usage(cfqq);
>
> - if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> - charge_sl = cfqq->allocated_slice;
> + if (iops_mode(cfqd))
> + charge = cfqq->slice_dispatch;
> + else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> + charge = cfqq->allocated_slice;
>
> /* Can't update vdisktime while group is on service tree */
> cfq_rb_erase(&cfqg->rb_node, st);
> - cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
> + cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
> __cfq_group_service_tree_add(st, cfqg);
>
> /* This group is being expired. Save the context */
> @@ -940,6 +956,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>
> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
> st->min_vdisktime);
> + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
> + used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
> cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
> cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
> }
Vivek Goyal <[email protected]> writes:
> o Implement a new tunable group_idle, which allows idling on the group
> instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> on the individual queues but idle on the group. This way, on fast storage
> we can get fairness between groups while overall throughput improves.
>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
> ---
> block/cfq-iosched.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
> 1 files changed, 56 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3fc6be1..85e4819 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
> static int cfq_slice_async = HZ / 25;
> static const int cfq_slice_async_rq = 2;
> static int cfq_slice_idle = HZ / 125;
> +static int cfq_group_idle = HZ / 125;
> static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> static const int cfq_hist_divisor = 4;
>
> @@ -198,6 +199,8 @@ struct cfq_group {
> struct hlist_node cfqd_node;
> atomic_t ref;
> #endif
> + /* number of requests that are on the dispatch list or inside driver */
> + int dispatched;
> };
>
> /*
> @@ -271,6 +274,7 @@ struct cfq_data {
> unsigned int cfq_slice[2];
> unsigned int cfq_slice_async_rq;
> unsigned int cfq_slice_idle;
> + unsigned int cfq_group_idle;
> unsigned int cfq_latency;
> unsigned int cfq_group_isolation;
>
> @@ -1884,7 +1888,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> {
> struct cfq_queue *cfqq = cfqd->active_queue;
> struct cfq_io_context *cic;
> - unsigned long sl;
> + unsigned long sl, group_idle = 0;
>
> /*
> * SSD device without seek penalty, disable idling. But only do so
> @@ -1900,8 +1904,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> /*
> * idle is disabled, either manually or by past process history
> */
> - if (!cfq_should_idle(cfqd, cfqq))
> - return;
> + if (!cfq_should_idle(cfqd, cfqq)) {
> + /* no queue idling. Check for group idling */
> + if (cfqd->cfq_group_idle)
> + group_idle = cfqd->cfq_group_idle;
> + else
> + return;
> + }
>
> /*
> * still active requests from this queue, don't idle
> @@ -1928,13 +1937,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> return;
> }
>
> + /* There are other queues in the group, don't do group idle */
> + if (group_idle && cfqq->cfqg->nr_cfqq > 1)
> + return;
> +
> cfq_mark_cfqq_wait_request(cfqq);
>
> - sl = cfqd->cfq_slice_idle;
> + if (group_idle)
> + sl = cfqd->cfq_group_idle;
> + else
> + sl = cfqd->cfq_slice_idle;
>
> mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
> cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
> - cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
> + cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
> + group_idle ? 1 : 0);
> }
>
> /*
> @@ -1950,6 +1967,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
> cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
> cfq_remove_request(rq);
> cfqq->dispatched++;
> + (RQ_CFQG(rq))->dispatched++;
> elv_dispatch_sort(q, rq);
>
> cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> @@ -2219,7 +2237,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> cfqq = NULL;
> goto keep_queue;
> } else
> - goto expire;
> + goto check_group_idle;
> }
>
> /*
> @@ -2247,8 +2265,23 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> * flight or is idling for a new request, allow either of these
> * conditions to happen (or time out) before selecting a new queue.
> */
> - if (timer_pending(&cfqd->idle_slice_timer) ||
> - (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
> + if (timer_pending(&cfqd->idle_slice_timer)) {
> + cfqq = NULL;
> + goto keep_queue;
> + }
> +
> + if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
> + cfqq = NULL;
> + goto keep_queue;
> + }
> +
> + /*
> + * If group idle is enabled and there are requests dispatched from
> + * this group, wait for requests to complete.
> + */
> +check_group_idle:
> + if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> + && cfqq->cfqg->dispatched) {
> cfqq = NULL;
> goto keep_queue;
> }
> @@ -3396,6 +3429,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> WARN_ON(!cfqq->dispatched);
> cfqd->rq_in_driver--;
> cfqq->dispatched--;
> + (RQ_CFQG(rq))->dispatched--;
> cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
> rq_start_time_ns(rq), rq_io_start_time_ns(rq),
> rq_data_dir(rq), rq_is_sync(rq));
> @@ -3425,7 +3459,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> * the queue.
> */
> if (cfq_should_wait_busy(cfqd, cfqq)) {
> - cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> + unsigned long extend_sl = cfqd->cfq_slice_idle;
> + if (!cfqd->cfq_slice_idle)
> + extend_sl = cfqd->cfq_group_idle;
> + cfqq->slice_end = jiffies + extend_sl;
> cfq_mark_cfqq_wait_busy(cfqq);
> cfq_log_cfqq(cfqd, cfqq, "will busy wait");
> }
> @@ -3871,6 +3908,7 @@ static void *cfq_init_queue(struct request_queue *q)
> cfqd->cfq_slice[1] = cfq_slice_sync;
> cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> cfqd->cfq_slice_idle = cfq_slice_idle;
> + cfqd->cfq_group_idle = cfq_group_idle;
> cfqd->cfq_latency = 1;
> cfqd->cfq_group_isolation = 0;
> cfqd->hw_tag = -1;
> @@ -3943,6 +3981,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
> SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
> SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
> SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
> +SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
> SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
> SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
> SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> @@ -3975,6 +4014,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
> STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
> UINT_MAX, 0);
> STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
> +STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
> STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
> STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
> STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> @@ -3996,6 +4036,7 @@ static struct elv_fs_entry cfq_attrs[] = {
> CFQ_ATTR(slice_async),
> CFQ_ATTR(slice_async_rq),
> CFQ_ATTR(slice_idle),
> + CFQ_ATTR(group_idle),
> CFQ_ATTR(low_latency),
> CFQ_ATTR(group_isolation),
> __ATTR_NULL
> @@ -4049,6 +4090,12 @@ static int __init cfq_init(void)
> if (!cfq_slice_idle)
> cfq_slice_idle = 1;
>
> +#ifdef CONFIG_CFQ_GROUP_IOSCHED
> + if (!cfq_group_idle)
> + cfq_group_idle = 1;
> +#else
> + cfq_group_idle = 0;
> +#endif
> if (cfq_slab_setup())
> return -ENOMEM;
In general, I've resisted the urge to correct grammar. Comments below.
Vivek Goyal <[email protected]> writes:
> o Some documentation to provide help with tunables.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> Documentation/block/cfq-iosched.txt | 45 ++++++++++++++++++++++++++++
> Documentation/cgroups/blkio-controller.txt | 28 +++++++++++++++++
> 2 files changed, 73 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/block/cfq-iosched.txt
>
> diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
> new file mode 100644
> index 0000000..3ea9e79
> --- /dev/null
> +++ b/Documentation/block/cfq-iosched.txt
> @@ -0,0 +1,45 @@
> +CFQ ioscheduler tunables
> +========================
> +
> +slice_idle
> +----------
> +This specifies how long CFQ should idle for next request on certain cfq queues
> +(for sequential workloads) and service trees (for random workloads) before
> +queue is expired and CFQ selects next queue to dispatch from.
> +
> +By default slice_idle is a non-zero value. That means by default we idle on
> +queues/service trees. This can be very helpful on highly seeky media like
'seeky' is not a property of the media. I think you meant on storage
devices with a high seek cost.
> +single spindle SATA/SAS disks where we can cut down on overall number of
> +seeks and see improved throughput.
> +
> +Setting slice_idle to 0 will remove all the idling on queues/service tree
> +level and one should see an overall improved throughput on faster storage
> +devices like multiple SATA/SAS disks in hardware RAID configuration. The down
> +side is that isolation provided from WRITES also goes down and notion of
> +IO priority becomes weaker.
> +
> +So depending on storage and workload, it might be useful to set slice_idle=0.
> +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
You think? I'm pretty sure we've measured that. ;-)
> +keeping slice_idle enabled should be useful. For any configurations where
> +there are multiple spindles behind single LUN (Host based hardware RAID
> +controller or for storage arrays), setting slice_idle=0 might end up in better
> +throughput and acceptable latencies.
> +
> +CFQ IOPS Mode for group scheduling
> +==================================
> +Basic CFQ design is to provide priority based time slices. Higher priority
> +process gets bigger time slice and lower priority process gets smaller time
> +slice. Measuring time becomes harder if storage is fast and supports NCQ and
> +it would be better to dispatch multiple requests from multiple cfq queues in
> +request queue at a time. In such scenario, it is not possible to measure time
> +consumed by single queue accurately.
> +
> +What is possible though is to measure number of requests dispatched from a
> +single queue and also allow dispatch from multiple cfq queue at the same time.
> +This effectively becomes the fairness in terms of IOPS (IO operations per
> +second).
> +
> +If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
> +to IOPS mode and starts providing fairness in terms of number of requests
> +dispatched. Note that this mode switching takes effect only for group
> +scheduling. For non-cgroup users nothing should change.
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 48e0b21..6919d62 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -217,6 +217,7 @@ Details of cgroup files
> CFQ sysfs tunable
> =================
> /sys/block/<disk>/queue/iosched/group_isolation
> +-----------------------------------------------
>
> If group_isolation=1, it provides stronger isolation between groups at the
> expense of throughput. By default group_isolation is 0. In general that
> @@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
> and one wants stronger isolation between groups, then set group_isolation=1
> but this will come at cost of reduced throughput.
>
> +/sys/block/<disk>/queue/iosched/slice_idle
> +------------------------------------------
> +On faster hardware CFQ can be slow, especially with sequential workloads.
> +This happens because CFQ idles on a single queue and single queue might not
> +drive deeper request queue depths to keep the storage busy. In such scenarios
> +one can try setting slice_idle=0 and that would switch CFQ to IOPS
> +(IO operations per second) mode on NCQ supporting hardware.
> +
> +That means CFQ will not idle between cfq queues of a cfq group and hence be
> +able to drive higher queue depth and achieve better throughput. That also
> +means that cfq provides fairness among groups in terms of IOPS and not in
> +terms of disk time.
I'm not sure we need documentation of this tunable twice. Why not just
give guidance on when it should be set to 0 in the next section
(group_idle) and refer to cfq-iosched.txt?
> +
> +/sys/block/<disk>/queue/iosched/group_idle
> +------------------------------------------
> +If one disables idling on individual cfq queues and cfq service trees by
> +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
> +on the group in an attempt to provide fairness among groups.
> +
> +By default group_idle is the same as slice_idle and does not do anything if
> +slice_idle is enabled.
> +
> +One can experience an overall throughput drop if multiple groups have been
> +created and the applications put in those groups are not driving enough
> +IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
> +idle on individual groups and throughput should improve.
> +
> What works
> ==========
> - Currently only sync IO queues are support. All the buffered writes are
supported.
Looks like something is amiss. Your text was truncated somewhere.
Cheers,
Jeff
On Mon, Aug 16, 2010 at 03:00:53PM -0400, Jeff Moyer wrote:
>
> In general, I've resisted the urge to correct grammar. Comments below.
Thanks for the review Jeff.
[..]
> > +CFQ ioscheduler tunables
> > +========================
> > +
> > +slice_idle
> > +----------
> > +This specifies how long CFQ should idle for next request on certain cfq queues
> > +(for sequential workloads) and service trees (for random workloads) before
> > +queue is expired and CFQ selects next queue to dispatch from.
> > +
> > +By default slice_idle is a non-zero value. That means by default we idle on
> > +queues/service trees. This can be very helpful on highly seeky media like
>
> 'seeky' is not a property of the media. I think you meant on storage
> devices with a high seek cost.
Fixed
>
> > +single spindle SATA/SAS disks where we can cut down on overall number of
> > +seeks and see improved throughput.
> > +
> > +Setting slice_idle to 0 will remove all the idling on queues/service tree
> > +level and one should see an overall improved throughput on faster storage
> > +devices like multiple SATA/SAS disks in hardware RAID configuration. The down
> > +side is that isolation provided from WRITES also goes down and notion of
> > +IO priority becomes weaker.
> > +
> > +So depending on storage and workload, it might be useful to set slice_idle=0.
> > +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
>
> You think? I'm pretty sure we've measured that. ;-)
Fixed :-)
[..]
> > +/sys/block/<disk>/queue/iosched/slice_idle
> > +------------------------------------------
> > +On faster hardware CFQ can be slow, especially with sequential workloads.
> > +This happens because CFQ idles on a single queue and single queue might not
> > +drive deeper request queue depths to keep the storage busy. In such scenarios
> > +one can try setting slice_idle=0 and that would switch CFQ to IOPS
> > +(IO operations per second) mode on NCQ supporting hardware.
> > +
> > +That means CFQ will not idle between cfq queues of a cfq group and hence be
> > +able to drive higher queue depth and achieve better throughput. That also
> > +means that cfq provides fairness among groups in terms of IOPS and not in
> > +terms of disk time.
>
> I'm not sure we need documentation of this tunable twice. Why not just
> give guidance on when it should be set to 0 in the next section
> (group_idle) and refer to cfq-iosched.txt?
I have put one line saying for more details look at cfq-iosched.txt. I
have still retained the slice_idle entry because at the end of the day
this is the tunable I expect people to modify and not group_idle. Also
notice that these are two different files (cfq-iosched.txt and
blkio-controller.txt).
>
> > +
> > +/sys/block/<disk>/queue/iosched/group_idle
> > +------------------------------------------
> > +If one disables idling on individual cfq queues and cfq service trees by
> > +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
> > +on the group in an attempt to provide fairness among groups.
> > +
> > +By default group_idle is the same as slice_idle and does not do anything if
> > +slice_idle is enabled.
> > +
> > +One can experience an overall throughput drop if multiple groups have been
> > +created and the applications put in those groups are not driving enough
> > +IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
> > +idle on individual groups and throughput should improve.
> > +
> > What works
> > ==========
> > - Currently only sync IO queues are support. All the buffered writes are
> supported.
>
> Looks like something is amiss. Your text was truncated somewhere.
Actually, the above text consists of context lines (3 lines), so nothing is truncated.
Thanks
Vivek
o Some documentation to provide help with tunables.
Signed-off-by: Vivek Goyal <[email protected]>
---
Documentation/block/cfq-iosched.txt | 45 +++++++++++++++++++++++++++++
Documentation/cgroups/blkio-controller.txt | 32 +++++++++++++++++++-
2 files changed, 76 insertions(+), 1 deletion(-)
Index: linux-2.6-block/Documentation/block/cfq-iosched.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-block/Documentation/block/cfq-iosched.txt 2010-08-17 08:39:59.000000000 -0400
@@ -0,0 +1,45 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non-zero value. That means by default we idle on
+queues/service trees. This can be very helpful on storage devices with high
+seek cost like single spindle SATA/SAS disks where we can cut down on overall
+number of seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+IO priority becomes weaker.
+
+So depending on storage and workload, it might be useful to set slice_idle=0.
+In general for SATA/SAS disks and software RAID of SATA/SAS disks keeping
+slice_idle enabled should be useful. For any configurations where there are
+multiple spindles behind single LUN (Host based hardware RAID controller or
+for storage arrays), setting slice_idle=0 might end up in better throughput
+and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide priority based time slices. Higher priority
+process gets bigger time slice and lower priority process gets smaller time
+slice. Measuring time becomes harder if storage is fast and supports NCQ and
+it would be better to dispatch multiple requests from multiple cfq queues in
+request queue at a time. In such scenario, it is not possible to measure time
+consumed by single queue accurately.
+
+What is possible though is to measure number of requests dispatched from a
+single queue and also allow dispatch from multiple cfq queue at the same time.
+This effectively becomes the fairness in terms of IOPS (IO operations per
+second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non-cgroup users nothing should change.
Index: linux-2.6-block/Documentation/cgroups/blkio-controller.txt
===================================================================
--- linux-2.6-block.orig/Documentation/cgroups/blkio-controller.txt 2010-08-11 11:12:42.000000000 -0400
+++ linux-2.6-block/Documentation/cgroups/blkio-controller.txt 2010-08-17 08:44:42.000000000 -0400
@@ -217,6 +217,7 @@ Details of cgroup files
CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------
If group_isolation=1, it provides stronger isolation between groups at the
expense of throughput. By default group_isolation is 0. In general that
@@ -243,8 +244,37 @@ By default one should run with group_iso
and one wants stronger isolation between groups, then set group_isolation=1
but this will come at cost of reduced throughput.
+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On faster hardware CFQ can be slow, especially with sequential workloads.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to drive higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+For more details look at Documentation/block/cfq-iosched.txt
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is the same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if multiple groups have been
+created and the applications put in those groups are not driving enough
+IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
+idle on individual groups and throughput should improve.
+
What works
==========
-- Currently only sync IO queues are support. All the buffered writes are
+- Currently only sync IO queues are supported. All the buffered writes are
still system wide and not per group. Hence we will not see service
differentiation between buffered writes between groups.
Vivek Goyal <[email protected]> writes:
>> I'm not sure we need documentation of this tunable twice. Why not just
>> give guidance on when it should be set to 0 in the next section
>> (group_idle) and refer to cfq-iosched.txt?
>
> I have put one line saying for more details look at cfq-iosched.txt. I
> have still retained the slice_idle entry because at the end of the day
> this is the tunable I expect people to modify and not group_idle. Also
> notice that these are two different files (cfq-iosched.txt and
> blkio-controller.txt).
Yeah, I know it's in two places, and that means updating two places with
changes. However, I doubt we'll have to update this often, so I'm not
going to argue.
Acked-by: Jeff Moyer <[email protected]>
Cheers,
Jeff
On 2010-08-12 00:44, Vivek Goyal wrote:
>
> Hi,
>
> This is V4 of the patches for group_idle and CFQ group charge accounting in
> terms of IOPS implementation. Since V3 not much has changed. Just more testing
> and rebase on top of for-2.6.36 branch of block tree.
Thanks Vivek, I have applied the patches.
--
Jens Axboe