2010-07-21 19:07:21

by Vivek Goyal

Subject: [RFC PATCH] cfq-iosched: Implement IOPS mode and group_idle tunable V3


Hi,

This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
I have cleaned up the code a bit to clarify in which cases we charge time slice
and in which cases we charge number of requests.

What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy
and suffer a lot in terms of overall throughput.

All these problems primarily originate from two things: idling on a per
cfq queue basis, and the quantum (dispatching a limited number of requests
from a single queue while not allowing dispatch from other queues). Once
you set slice_idle=0 and raise the quantum, most of CFQ's problems on
higher end storage disappear.

This problem also becomes visible in the IO controller, where one creates
multiple groups and gets the fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.

Kernel=2.6.35-rc5-iops+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249

Notice that overall throughput is just around 160MB/s with 8 sequential readers
in each group.

With this patch set applied, I set slice_idle=0 and re-ran the same test.

Kernel=2.6.35-rc5-iops+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
--- --- -- ---------------------------------------------------------------
bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747

Notice how overall throughput has shot up to 348MB/s while retaining the ability
to do IO control.

So this is not the default mode. The new tunable, group_idle, allows one to
set slice_idle=0 to disable some of the CFQ features and primarily use the
group service differentiation feature.

If you have thoughts on other ways of solving the problem, I am all ears.

Thanks
Vivek


2010-07-21 19:06:44

by Vivek Goyal

Subject: [PATCH 2/3] cfq-iosched: Implement a tunable group_idle

o Implement a new tunable, group_idle, which allows idling on the group
instead of on a cfq queue. Hence one can set slice_idle = 0 and not idle
on the individual queues but idle on the group. This way, on fast storage
we can get fairness between groups while overall throughput improves.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4671c51..8ca5c39 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;

@@ -198,6 +199,8 @@ struct cfq_group {
struct hlist_node cfqd_node;
atomic_t ref;
#endif
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
};

/*
@@ -271,6 +274,7 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_group_idle;
unsigned int cfq_latency;
unsigned int cfq_group_isolation;

@@ -1861,6 +1865,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);

+ if (!cfqd->cfq_slice_idle)
+ return false;
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1885,7 +1892,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
- unsigned long sl;
+ unsigned long sl, group_idle = 0;

/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1901,8 +1908,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
- return;
+ if (!cfq_should_idle(cfqd, cfqq)) {
+ /* no queue idling. Check for group idling */
+ if (cfqd->cfq_group_idle)
+ group_idle = cfqd->cfq_group_idle;
+ else
+ return;
+ }

/*
* still active requests from this queue, don't idle
@@ -1929,13 +1941,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}

+ /* There are other queues in the group, don't do group idle */
+ if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+ return;
+
cfq_mark_cfqq_wait_request(cfqq);

- sl = cfqd->cfq_slice_idle;
+ if (group_idle)
+ sl = cfqd->cfq_group_idle;
+ else
+ sl = cfqd->cfq_slice_idle;

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+ group_idle ? 1 : 0);
}

/*
@@ -1951,6 +1971,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
+ (RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2220,7 +2241,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
cfqq = NULL;
goto keep_queue;
} else
- goto expire;
+ goto check_group_idle;
}

/*
@@ -2254,6 +2275,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
goto keep_queue;
}

+ /*
+ * If group idle is enabled and there are requests dispatched from
+ * this group, wait for requests to complete.
+ */
+check_group_idle:
+ if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+ && cfqq->cfqg->dispatched) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
@@ -3396,6 +3428,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ (RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
rq_data_dir(rq), rq_is_sync(rq));
@@ -3425,7 +3458,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* the queue.
*/
if (cfq_should_wait_busy(cfqd, cfqq)) {
- cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ unsigned long extend_sl = cfqd->cfq_slice_idle;
+ if (!cfqd->cfq_slice_idle)
+ extend_sl = cfqd->cfq_group_idle;
+ cfqq->slice_end = jiffies + extend_sl;
cfq_mark_cfqq_wait_busy(cfqq);
cfq_log_cfqq(cfqd, cfqq, "will busy wait");
}
@@ -3870,6 +3906,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_group_idle = cfq_group_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
@@ -3942,6 +3979,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3974,6 +4012,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3995,6 +4034,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(group_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_isolation),
__ATTR_NULL
@@ -4048,6 +4088,12 @@ static int __init cfq_init(void)
if (!cfq_slice_idle)
cfq_slice_idle = 1;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfq_group_idle)
+ cfq_group_idle = 1;
+#else
+ cfq_group_idle = 0;
+#endif
if (cfq_slab_setup())
return -ENOMEM;

--
1.7.1.1

2010-07-21 19:06:41

by Vivek Goyal

Subject: [PATCH 3/3] cfq-iosched: Print number of sectors dispatched per cfqq slice

o Divyesh got rid of this code in the past. I want to re-introduce it,
as it helps me a lot during debugging.

Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Divyesh Shah <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8ca5c39..68d16c4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ /* Number of sectors dispatched from queue in single dispatch round */
+ unsigned long nr_sectors;
};

/*
@@ -959,8 +961,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
- used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+ " sect=%u", used_sl, cfqq->slice_dispatch, charge,
+ iops_mode(cfqd), cfqq->nr_sectors);
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
@@ -1608,6 +1611,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1975,6 +1979,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
rq_data_dir(rq), rq_is_sync(rq));
}
--
1.7.1.1

2010-07-21 19:06:42

by Vivek Goyal

Subject: [PATCH 1/3] cfq-iosched: Implement IOPS mode

o Implement another CFQ mode where we charge queue/group in terms of number
of requests dispatched instead of measuring the time. Measuring in terms
of time is not possible when we are driving deeper queue depths and there
are requests from multiple cfq queues in the request queue.

o This mode currently gets activated if one sets slice_idle=0 and associated
disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
most of the queues will dispatch 1 or more requests and then cfq queue
expiry happens and we don't have a way to measure time. So start providing
fairness in terms of IOPS.

o Currently this primarily is beneficial with cfq group scheduling where one
can disable slice idling so that we don't idle on queue and drive deeper
request queue depths (achieving better throughput), at the same time group
idle is enabled so one should get service differentiation among groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 37 ++++++++++++++++++++++++++++++-------
1 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..4671c51 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
&cfqg->service_trees[i][j]: NULL) \


+static inline bool iops_mode(struct cfq_data *cfqd)
+{
+ /*
+ * If we are not idling on queues and it is a NCQ drive, parallel
+ * execution of requests is on and measuring time is not possible
+ * in most of the cases until and unless we drive shallower queue
+ * depths and that becomes a performance bottleneck. In such cases
+ * switch to start providing fairness in terms of number of IOs.
+ */
+ if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
+ return true;
+ else
+ return false;
+}
+
static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
{
if (cfq_class_idle(cfqq))
@@ -905,7 +920,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
return slice_used;
}

@@ -913,19 +927,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl, charge_sl;
+ unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;

BUG_ON(nr_sync < 0);
- used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
+ used_sl = charge = cfq_cfqq_slice_usage(cfqq);

- if (!cfq_cfqq_sync(cfqq) && !nr_sync)
- charge_sl = cfqq->allocated_slice;
+ if (iops_mode(cfqd))
+ charge = cfqq->slice_dispatch;
+ else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
+ charge = cfqq->allocated_slice;

/* Can't update vdisktime while group is on service tree */
cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
+ cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
__cfq_group_service_tree_add(st, cfqg);

/* This group is being expired. Save the context */
@@ -939,6 +955,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
+ used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
}
@@ -1625,8 +1643,13 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,

/*
* store what was left of this slice, if the queue idled/timed out
+ * Currently in IOPS mode I am not getting into the business of
+ * saving remaining slice/number of requests because I think it does
+ * not help much in most of the cases. We can fix it later, if that's
+ * not the case. IOPS mode is primarily more useful for group
+ * scheduling.
*/
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
+ if (timed_out && !cfq_cfqq_slice_new(cfqq) && !iops_mode(cfqd)) {
cfqq->slice_resid = cfqq->slice_end - jiffies;
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}
--
1.7.1.1

2010-07-21 19:41:05

by Jeff Moyer

Subject: Re: [PATCH 2/3] cfq-iosched: Implement a tunable group_idle

Vivek Goyal <[email protected]> writes:

> o Implement a new tunable group_idle, which allows idling on the group
> instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> on the individual queues but idle on the group. This way on fast storage
> we can get fairness between groups at the same time overall throughput
> improves.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
[snip]
> @@ -1929,13 +1941,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> return;
> }
>
> + /* There are other queues in the group, don't do group idle */
> + if (group_idle && cfqq->cfqg->nr_cfqq > 1)
> + return;
> +
> cfq_mark_cfqq_wait_request(cfqq);
>
> - sl = cfqd->cfq_slice_idle;
> + if (group_idle)
> + sl = cfqd->cfq_group_idle;
> + else
> + sl = cfqd->cfq_slice_idle;

What happens when both group_idle and slice_idle are set? Is that a
sane thing to do from a user's perspective? If not, please protect
against it in the configuration code. If so, then explain why we prefer
group_idle here, but slice_idle in completed request for the extend_sl:

> @@ -3425,7 +3458,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> * the queue.
> */
> if (cfq_should_wait_busy(cfqd, cfqq)) {
> - cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> + unsigned long extend_sl = cfqd->cfq_slice_idle;
> + if (!cfqd->cfq_slice_idle)
> + extend_sl = cfqd->cfq_group_idle;
> + cfqq->slice_end = jiffies + extend_sl;

Also, you'll need to add documentation for this new tunable.

Cheers,
Jeff

2010-07-21 20:13:32

by Vivek Goyal

Subject: Re: [PATCH 2/3] cfq-iosched: Implement a tunable group_idle

On Wed, Jul 21, 2010 at 03:40:44PM -0400, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Implement a new tunable group_idle, which allows idling on the group
> > instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> > on the individual queues but idle on the group. This way on fast storage
> > we can get fairness between groups at the same time overall throughput
> > improves.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> [snip]
> > @@ -1929,13 +1941,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> > return;
> > }
> >
> > + /* There are other queues in the group, don't do group idle */
> > + if (group_idle && cfqq->cfqg->nr_cfqq > 1)
> > + return;
> > +
> > cfq_mark_cfqq_wait_request(cfqq);
> >
> > - sl = cfqd->cfq_slice_idle;
> > + if (group_idle)
> > + sl = cfqd->cfq_group_idle;
> > + else
> > + sl = cfqd->cfq_slice_idle;
>
> What happens when both group_idle and slice_idle are set?

slice_idle prevails. Notice that "group_idle" is a local variable which
is set only if we decide not to idle on the cfq queue.


> Is that a
> sane thing to do from a user's perspective?

In fact, by default both slice_idle=8 and group_idle=8. It's just that in this
mode group_idle never kicks in, as the slice_idle logic always runs before
the group_idle logic gets any chance.

> If not, please protect
> against it in the configuration code. If so, then explain why we prefer
> group_idle here, but slice_idle in completed request for the extend_sl:
>

In both places we first prefer slice_idle. Just look at the value of
"group_idle" at the beginning of the arm_slice_timer() function, and notice
in what circumstances we set group_idle=1.

Thanks
Vivek

> > @@ -3425,7 +3458,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> > * the queue.
> > */
> > if (cfq_should_wait_busy(cfqd, cfqq)) {
> > - cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> > + unsigned long extend_sl = cfqd->cfq_slice_idle;
> > + if (!cfqd->cfq_slice_idle)
> > + extend_sl = cfqd->cfq_group_idle;
> > + cfqq->slice_end = jiffies + extend_sl;
>
> Also, you'll need to add documentation for this new tunable.
>
> Cheers,
> Jeff

2010-07-21 20:33:22

by Jeff Moyer

Subject: Re: [PATCH 1/3] cfq-iosched: Implement IOPS mode

Vivek Goyal <[email protected]> writes:

> o Implement another CFQ mode where we charge queue/group in terms of number
> of requests dispatched instead of measuring the time. Measuring in terms
> of time is not possible when we are driving deeper queue depths and there
> are requests from multiple cfq queues in the request queue.
>
> o This mode currently gets activated if one sets slice_idle=0 and associated
> disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
> most of the queues will dispatch 1 or more requests and then cfq queue
> expiry happens and we don't have a way to measure time. So start providing
> fairness in terms of IOPS.
>
> o Currently this primarily is beneficial with cfq group scheduling where one
> can disable slice idling so that we don't idle on queue and drive deeper
> request queue depths (achieving better throughput), at the same time group
> idle is enabled so one should get service differentiation among groups.

I like that this is more isolated now. I'm slowly warming up to it. I
have one question--just a curiosity, really. What do you see now for
the reported sl_used in blktrace when slice_idle is zero and the
hardware supports command queueing?

Cheers,
Jeff

2010-07-21 20:54:18

by Jeff Moyer

Subject: Re: [PATCH 2/3] cfq-iosched: Implement a tunable group_idle

Vivek Goyal <[email protected]> writes:

> On Wed, Jul 21, 2010 at 03:40:44PM -0400, Jeff Moyer wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>> > o Implement a new tunable group_idle, which allows idling on the group
>> > instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
>> > on the individual queues but idle on the group. This way on fast storage
>> > we can get fairness between groups at the same time overall throughput
>> > improves.
>> >
>> > Signed-off-by: Vivek Goyal <[email protected]>
>> > ---
>> [snip]
>> > @@ -1929,13 +1941,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>> > return;
>> > }
>> >
>> > + /* There are other queues in the group, don't do group idle */
>> > + if (group_idle && cfqq->cfqg->nr_cfqq > 1)
>> > + return;
>> > +
>> > cfq_mark_cfqq_wait_request(cfqq);
>> >
>> > - sl = cfqd->cfq_slice_idle;
>> > + if (group_idle)
>> > + sl = cfqd->cfq_group_idle;
>> > + else
>> > + sl = cfqd->cfq_slice_idle;
>>
>> What happens when both group_idle and slice_idle are set?
>
> slice_idle prevails. Notice that "group_idle" is a local variable which
> is set to 1 only if we decide not to idle on the cfq queue.

Ah, silly me.

Cheers,
Jeff

2010-07-21 20:57:41

by Vivek Goyal

Subject: Re: [PATCH 1/3] cfq-iosched: Implement IOPS mode

On Wed, Jul 21, 2010 at 04:33:04PM -0400, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Implement another CFQ mode where we charge queue/group in terms of number
> > of requests dispatched instead of measuring the time. Measuring in terms
> > of time is not possible when we are driving deeper queue depths and there
> > are requests from multiple cfq queues in the request queue.
> >
> > o This mode currently gets activated if one sets slice_idle=0 and associated
> > disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
> > most of the queues will dispatch 1 or more requests and then cfq queue
> > expiry happens and we don't have a way to measure time. So start providing
> > fairness in terms of IOPS.
> >
> > o Currently this primarily is beneficial with cfq group scheduling where one
> > can disable slice idling so that we don't idle on queue and drive deeper
> > request queue depths (achieving better throughput), at the same time group
> > idle is enabled so one should get service differentiation among groups.
>
> I like that this is more isolated now. I'm slowly warming up to it. I
> have one question--just a curiosity, really. What do you see now for
> the reported sl_used in blktrace when slice_idle is zero and the
> hardware supports command queueing?

sl_used still shows the amount of time elapsed since we started dispatching
from the queue. I retained that info because we export it through the cgroup
interface.

Just the charging logic for the group changed: in IOPS mode, instead of
charging sl_used, we charge IOPS. The following is sample blktrace output
after the patches.

253,0 0 0 0.014157613 0 m N cfq19226S /cgrp7 sl_used=3 disp=1 charge=1 iops=1 sect=8

Here the slice used since dispatch start is 3 jiffies, and we dispatched 1
request in this duration. Because we are in IOPS mode (iops=1), we charge
the group for 1 request and not 3 jiffies (charge=1). sect shows we dispatched
8 sectors in this duration.

Vivek




>
> Cheers,
> Jeff

2010-07-22 05:56:12

by Christoph Hellwig

Subject: Re: [RFC PATCH] cfq-iosched: Implement IOPS mode and group_idle tunable V3

On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> On high end storage (I got on HP EVA storage array with 12 SATA disks in
> RAID 5),

That's actually quite low end storage for a server these days :)

> So this is not the default mode. This new tunable group_idle, allows one to
> set slice_idle=0 to disable some of the CFQ features and use primarily
> group service differentiation feature.

While this is better than before, needing a sysfs tweak to get any
performance out of any kind of server class hardware is still pretty
horrible. And slice_idle=0 is not exactly the most obvious parameter
I would look for either. So having some way to automatically disable
this mode based on hardware characteristics would be really useful,
and if that's not possible, at least make sure it's very obviously
documented and easily found using web searches.

Btw, what effect does slice_idle=0 with your patches have on single SATA
disk and single SSD setups?

2010-07-22 07:10:28

by Gui Jianfeng

Subject: Re: [RFC PATCH] cfq-iosched: Implement IOPS mode and group_idle tunable V3

Vivek Goyal wrote:
> Hi,
>
> This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
> I have cleaned up the code a bit to clarify the confusion lingering around in
> what cases do we charge time slice and in what cases do we charge number of
> requests.
>
> What's the problem
> ------------------
> On high end storage (I got on HP EVA storage array with 12 SATA disks in
> RAID 5), CFQ's model of dispatching requests from a single queue at a
> time (sequential readers/write sync writers etc), becomes a bottleneck.
> Often we don't drive enough request queue depth to keep all the disks busy
> and suffer a lot in terms of overall throughput.
>
> All these problems primarily originate from two things. Idling on per
> cfq queue and quantum (dispatching limited number of requests from a
> single queue) and till then not allowing dispatch from other queues. Once
> you set the slice_idle=0 and quantum to higher value, most of the CFQ's
> problem on higher end storage disappear.
>
> This problem also becomes visible in IO controller where one creates
> multiple groups and gets the fairness but overall throughput is less. In
> the following table, I am running increasing number of sequential readers
> (1,2,4,8) in 8 groups of weight 100 to 800.
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
> bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
> bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
> bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
>
> Notice that overall throughput is just around 160MB/s with 8 sequential reader
> in each group.
>
> With this patch set, I have set slice_idle=0 and re-ran same test.
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
> bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
> bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
> bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
>
> Notice how overall throughput has shot upto 348MB/s while retaining the ability
> to do the IO control.
>
> So this is not the default mode. This new tunable group_idle, allows one to
> set slice_idle=0 to disable some of the CFQ features and use primarily
> group service differentiation feature.
>
> If you have thoughts on other ways of solving the problem, I am all ears
> to it.

Hi Vivek

Would you attach your fio job config file?

Thanks
Gui

>
> Thanks
> Vivek
>
>

2010-07-22 14:01:04

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosched: Implement IOPS mode and group_idle tunable V3

On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got on HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)
>

Yes it is. Just that this is the best I got access to. :-)

> > So this is not the default mode. This new tunable group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and use primarily
> > group service differentiation feature.
>
> While this is better than before needing a sysfs tweak to get any
> performance out of any kind of server class hardware still is pretty
> horrible. And slice_idle=0 is not exactly the most obvious parameter
> I would look for either. So having some way to automatically disable
> this mode based on hardware characteristics would be really useful,

An IO scheduler able to change its behavior based on underlying storage
properties is the ideal and most convenient thing. For that we will need
some kind of auto-tuning in CFQ, where we monitor the ongoing IO (for
sequentiality, for block size) and then try to make some predictions about
the storage's properties.

Auto-tuning is a little hard to implement. So I thought that as a first step
we can make sure things work reasonably well with the help of tunables, and
then look into auto-tuning the stuff.

I was actually thinking of writing a user space utility which can issue
some specific IO patterns to the disk/lun and set up some IO scheduler
tunables automatically.

> and if that's not possible at least make sure it's very obviously
> document and easily found using web searches.

Sure. I think I will create a new file, Documentation/block/cfq-iosched.txt,
and document this new mode there. Because this mode is primarily useful
for group scheduling, I will also add some info to
Documentation/cgroups/blkio-controller.txt.

>
> Btw, what effect does slice_idle=0 with your patches have to single SATA
> disk and single SSD setups?

I am not expecting any major effect of IOPS mode on a non-group setup on
any kind of storage.

IOW, currently if one sets slice_idle=0 in CFQ, then we become almost
like deadline (with some differences here and there). The notion of ioprio
almost disappears, except that in some cases you can still see some
service differentiation among queues of different prio levels.

With this patchset, one would switch to IOPS mode with slice_idle=0. We
will still show deadline-like behavior. The only difference will be that
there will be no service differentiation among ioprio levels.

I am not bothering about fixing it currently because in slice_idle=0 mode
the notion of ioprio is so weak and unpredictable that I think it is not
worth fixing at this point in time. If somebody is looking for service
differentiation with slice_idle=0, using cgroups might turn out to be a
better bet.

In summary, in a non-cgroup setup with slice_idle=0, one should not see a
significant change with this patchset on any kind of storage. With
slice_idle=0, CFQ stops idling and achieves much better throughput, and
in IOPS mode it will continue doing that.

The difference is primarily visible to cgroup users, where we get better
accounting done in IOPS mode and are able to provide service differentiation
among groups in a more predictable manner.
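[For reference, the group setup behind the 8-group runs above looks roughly
like this. A sketch only: the mount point and the weights 100..800 come from
the test logs in this thread, but the exact group names and the
weight-per-group rule are assumptions; the iostest script automates all of
this.]

```shell
#!/bin/sh
# Sketch of a blkio cgroup setup like the 8-group test above.
# Group names and the N*100 weight rule are assumptions; iostest automates this.
CGROOT=/cgroup/blkio

grp_weight() {            # group N gets weight N*100, i.e. 100..800
    echo $(( $1 * 100 ))
}

setup_groups() {          # needs root and CONFIG_BLK_CGROUP
    mountpoint -q "$CGROOT" || mount -t cgroup -o blkio none "$CGROOT"
    for i in 1 2 3 4 5 6 7 8; do
        mkdir -p "$CGROOT/test$i"
        grp_weight "$i" > "$CGROOT/test$i/blkio.weight"
        # a reader is then moved into the group before starting IO:
        #   echo <pid> > $CGROOT/test$i/tasks
    done
}
# setup_groups
```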

Thanks
Vivek

2010-07-22 14:49:47

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Thu, Jul 22, 2010 at 03:08:00PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi,
> >
> > This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
> > I have cleaned up the code a bit to clarify the confusion lingering around in
> > what cases do we charge time slice and in what cases do we charge number of
> > requests.
> >
> > What's the problem
> > ------------------
> > On high end storage (I got on HP EVA storage array with 12 SATA disks in
> > RAID 5), CFQ's model of dispatching requests from a single queue at a
> > time (sequential readers/write sync writers etc), becomes a bottleneck.
> > Often we don't drive enough request queue depth to keep all the disks busy
> > and suffer a lot in terms of overall throughput.
> >
> > All these problems primarily originate from two things. Idling on per
> > cfq queue and quantum (dispatching limited number of requests from a
> > single queue) and till then not allowing dispatch from other queues. Once
> > you set the slice_idle=0 and quantum to higher value, most of the CFQ's
> > problem on higher end storage disappear.
> >
> > This problem also becomes visible in IO controller where one creates
> > multiple groups and gets the fairness but overall throughput is less. In
> > the following table, I am running increasing number of sequential readers
> > (1,2,4,8) in 8 groups of weight 100 to 800.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1 NRGRP=8
> > DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> > Workload=bsr iosched=cfq Filesz=512M bs=4K
> > group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr] [bw in KB/s]
> > -------
> > job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> > --- --- -- ---------------------------------------------------------------
> > bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
> > bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
> > bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
> > bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
> >
> > Notice that overall throughput is just around 160MB/s with 8 sequential reader
> > in each group.
> >
> > With this patch set, I have set slice_idle=0 and re-ran same test.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1 NRGRP=8
> > DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> > Workload=bsr iosched=cfq Filesz=512M bs=4K
> > group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr] [bw in KB/s]
> > -------
> > job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> > --- --- -- ---------------------------------------------------------------
> > bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
> > bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
> > bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
> > bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
> >
> > Notice how overall throughput has shot upto 348MB/s while retaining the ability
> > to do the IO control.
> >
> > So this is not the default mode. This new tunable group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and and use primarily
> > group service differentation feature.
> >
> > If you have thoughts on other ways of solving the problem, I am all ears
> > to it.
>
> Hi Vivek
>
> Would you attach your fio job config file?
>

Hi Gui,

I have written a fio-based test script, "iostest", to be able to
do cgroup and other IO scheduler testing more smoothly, and I am using
that. I am attaching the compressed script with this mail. Try using it,
and if it works for you and you find it useful, I can think of hosting a
git tree somewhere.

I used the following command lines for the tests above.

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total

With slice idle disabled.

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total -I 0

Thanks
Vivek


Attachments:
iostest.tar.gz (20.37 kB)

2010-07-22 20:55:08

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got on HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)
>
> > So this is not the default mode. This new tunable group_idle, allows one to
> > set slice_idle=0 to disable some of the CFQ features and and use primarily
> > group service differentation feature.
>
> While this is better than before needing a sysfs tweak to get any
> performance out of any kind of server class hardware still is pretty
> horrible. And slice_idle=0 is not exactly the most obvious paramter
> I would look for either. So having some way to automatically disable
> this mode based on hardware characteristics would be really useful,
> and if that's not possible at least make sure it's very obviously
> document and easily found using web searches.
>
> Btw, what effect does slice_idle=0 with your patches have to single SATA
> disk and single SSD setups?

Well, after responding to your mail in the morning, I realized that it was
a twisted answer and not very clear.

That forced me to change the patch a bit. With the new patches (yet to be
posted), the answer to your question is that nothing will change for SATA
or SSD setups with slice_idle=0 with my patches.

Why? CFQ is using two different algorithms for cfq queue and cfq group
scheduling. This IOPS mode will only affect group scheduling and not
the cfqq scheduling.

So switching to IOPS mode should not change anything for non-cgroup users on
all kinds of storage. It will impact only group scheduling users, who will
start seeing fairness among groups in terms of IOPS and not time. Of course,
slice_idle needs to be set to 0 only on high end storage, so that we get
fairness among groups in IOPS and at the same time achieve the full
potential of the storage box.

Thanks
Vivek

2010-07-22 23:56:29

by Gui, Jianfeng/归 剑峰

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

Vivek Goyal wrote:
> On Thu, Jul 22, 2010 at 03:08:00PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi,
>>>
>>> This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
>>> I have cleaned up the code a bit to clarify the confusion lingering around in
>>> what cases do we charge time slice and in what cases do we charge number of
>>> requests.
>>>
>>> What's the problem
>>> ------------------
>>> On high end storage (I got on HP EVA storage array with 12 SATA disks in
>>> RAID 5), CFQ's model of dispatching requests from a single queue at a
>>> time (sequential readers/write sync writers etc), becomes a bottleneck.
>>> Often we don't drive enough request queue depth to keep all the disks busy
>>> and suffer a lot in terms of overall throughput.
>>>
>>> All these problems primarily originate from two things. Idling on per
>>> cfq queue and quantum (dispatching limited number of requests from a
>>> single queue) and till then not allowing dispatch from other queues. Once
>>> you set the slice_idle=0 and quantum to higher value, most of the CFQ's
>>> problem on higher end storage disappear.
>>>
>>> This problem also becomes visible in IO controller where one creates
>>> multiple groups and gets the fairness but overall throughput is less. In
>>> the following table, I am running increasing number of sequential readers
>>> (1,2,4,8) in 8 groups of weight 100 to 800.
>>>
>>> Kernel=2.6.35-rc5-iops+
>>> GROUPMODE=1 NRGRP=8
>>> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
>>> Workload=bsr iosched=cfq Filesz=512M bs=4K
>>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>>> =========================================================================
>>> AVERAGE[bsr] [bw in KB/s]
>>> -------
>>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
>>> --- --- -- ---------------------------------------------------------------
>>> bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
>>> bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
>>> bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
>>> bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
>>>
>>> Notice that overall throughput is just around 160MB/s with 8 sequential reader
>>> in each group.
>>>
>>> With this patch set, I have set slice_idle=0 and re-ran same test.
>>>
>>> Kernel=2.6.35-rc5-iops+
>>> GROUPMODE=1 NRGRP=8
>>> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
>>> Workload=bsr iosched=cfq Filesz=512M bs=4K
>>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>>> =========================================================================
>>> AVERAGE[bsr] [bw in KB/s]
>>> -------
>>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
>>> --- --- -- ---------------------------------------------------------------
>>> bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
>>> bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
>>> bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
>>> bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
>>>
>>> Notice how overall throughput has shot upto 348MB/s while retaining the ability
>>> to do the IO control.
>>>
>>> So this is not the default mode. This new tunable group_idle, allows one to
>>> set slice_idle=0 to disable some of the CFQ features and and use primarily
>>> group service differentation feature.
>>>
>>> If you have thoughts on other ways of solving the problem, I am all ears
>>> to it.
>> Hi Vivek
>>
>> Would you attach your fio job config file?
>>
>
> Hi Gui,
>
> I have written a fio based test script, "iostest", to be able to
> do cgroup and other IO scheduler testing more smoothly and I am using
> that. I am attaching the compressed script with the mail. Try using it
> and if it works for you and you find it useful, I can think of hosting a
> git tree somewhere.
>
> I used following following command lines to test above.
>
> # iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total
>
> With slice idle disabled.
>
> # iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total -I 0

That's cool! Very helpful, I'll try it.

Thanks,
Gui

>
> Thanks
> Vivek

--
Regards
Gui Jianfeng

2010-07-24 08:51:39

by Christoph Hellwig

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

To me this sounds like slice_idle=0 is the right default then, as it
gives useful behaviour for all systems linux runs on. Setups with
more than a few spindles are for sure more common than setups making
use of cgroups. Especially given that cgroups are more of a high end
feature you'd rarely use on a single SATA spindle anyway. So setting
a parameter to make this useful sounds like the much better option.

Especially given that the block cgroup code doesn't work particularly
well in the presence of barriers, which are on for any kind of real life
production setup anyway.

2010-07-24 09:07:12

by Corrado Zoccolo

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Sat, Jul 24, 2010 at 10:51 AM, Christoph Hellwig <[email protected]> wrote:
> To me this sounds like slice_idle=0 is the right default then, as it
> gives useful behaviour for all systems linux runs on.
No, it will give bad performance on single disks, possibly worse than
deadline (deadline at least sorts the requests between different
queues, while CFQ with slice_idle=0 doesn't even do this for readers).
Setting slice_idle to 0 should be considered only when a single
sequential reader cannot saturate the disk bandwidth, and this happens
only on smart enough hardware with a large number of spindles.
>  Setups with
> more than a few spindles are for sure more common than setups making
> use of cgroups.  Especially given that cgroups are more of a high end
> feature you'd rarely use on a single SATA spindle anyway.  So setting
> a paramter to make this useful sounds like the much better option.
>
> Especially given that the block cgroup code doesn't work particularly
> well in presence of barriers, which are on for any kind of real life
> production setup anyway.
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

2010-07-26 07:00:51

by Gui, Jianfeng/归 剑峰

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

Vivek Goyal wrote:
> Hi,
>
> This is V3 of the group_idle and CFQ IOPS mode implementation patchset. Since V2
> I have cleaned up the code a bit to clarify the confusion lingering around in
> what cases do we charge time slice and in what cases do we charge number of
> requests.
>
> What's the problem
> ------------------
> On high end storage (I got on HP EVA storage array with 12 SATA disks in
> RAID 5), CFQ's model of dispatching requests from a single queue at a
> time (sequential readers/write sync writers etc), becomes a bottleneck.
> Often we don't drive enough request queue depth to keep all the disks busy
> and suffer a lot in terms of overall throughput.
>
> All these problems primarily originate from two things. Idling on per
> cfq queue and quantum (dispatching limited number of requests from a
> single queue) and till then not allowing dispatch from other queues. Once
> you set the slice_idle=0 and quantum to higher value, most of the CFQ's
> problem on higher end storage disappear.
>
> This problem also becomes visible in IO controller where one creates
> multiple groups and gets the fairness but overall throughput is less. In
> the following table, I am running increasing number of sequential readers
> (1,2,4,8) in 8 groups of weight 100 to 800.
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6186 12752 16568 23068 28608 35785 42322 48409 213701
> bsr 3 2 5396 10902 16959 23471 25099 30643 37168 42820 192461
> bsr 3 4 4655 9463 14042 20537 24074 28499 34679 37895 173847
> bsr 3 8 4418 8783 12625 19015 21933 26354 29830 36290 159249
>
> Notice that overall throughput is just around 160MB/s with 8 sequential reader
> in each group.
>
> With this patch set, I have set slice_idle=0 and re-ran same test.
>
> Kernel=2.6.35-rc5-iops+
> GROUPMODE=1 NRGRP=8
> DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
> Workload=bsr iosched=cfq Filesz=512M bs=4K
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 cgrp5 cgrp6 cgrp7 cgrp8 total
> --- --- -- ---------------------------------------------------------------
> bsr 3 1 6523 12399 18116 24752 30481 36144 42185 48894 219496
> bsr 3 2 10072 20078 29614 38378 46354 52513 58315 64833 320159
> bsr 3 4 11045 22340 33013 44330 52663 58254 63883 70990 356520
> bsr 3 8 12362 25860 37920 47486 61415 47292 45581 70828 348747
>
> Notice how overall throughput has shot upto 348MB/s while retaining the ability
> to do the IO control.
>
> So this is not the default mode. This new tunable group_idle, allows one to
> set slice_idle=0 to disable some of the CFQ features and and use primarily
> group service differentation feature.
>
> If you have thoughts on other ways of solving the problem, I am all ears
> to it.

Hi Vivek,

I did some tests on a single SATA disk on my desktop. With the patches
applied, it seems no regression occurs so far, and there is some performance
improvement in the "Direct Random Reader" mode. Here are some numbers from
my box.

Vanilla kernel:

Blkio is already mounted at /cgroup/blkio. Unmounting it
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
GROUPMODE=1 NRGRP=4
Will run workloads for increasing number of threads upto a max of 4
Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
Finished test for workload [drr]
Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 761 761 762 760 3044
drr 1 2 185 420 727 1256 2588
drr 1 4 180 371 588 863 2002


Patched kernel:

Blkio is already mounted at /cgroup/blkio. Unmounting it
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
GROUPMODE=1 NRGRP=4
Will run workloads for increasing number of threads upto a max of 4
Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
Finished test for workload [drr]
Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 323 671 1030 1378 3402
drr 1 2 165 391 686 1144 2386
drr 1 4 185 373 612 873 2043

Thanks
Gui

>
> Thanks
> Vivek
>

2010-07-26 13:52:04

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Sat, Jul 24, 2010 at 04:51:35AM -0400, Christoph Hellwig wrote:
> To me this sounds like slice_idle=0 is the right default then, as it
> gives useful behaviour for all systems linux runs on. Setups with
> more than a few spindles are for sure more common than setups making
> use of cgroups. Especially given that cgroups are more of a high end
> feature you'd rarely use on a single SATA spindle anyway. So setting
> a paramter to make this useful sounds like the much better option.
>

Setting slice_idle=0 should give a very bad interactivity experience on
laptops/desktops having SATA disks. My previous tests showed that if
I start a buffered writer on the disk, then launching firefox took more
than 5 minutes.

So slice_idle=0 should not be the default. It should be done selectively
on hardware with multiple spindles, where a single cfq queue can't
keep all the spindles busy.

> Especially given that the block cgroup code doesn't work particularly
> well in presence of barriers, which are on for any kind of real life
> production setup anyway.

True. I was hoping that on battery backed up storage we should not need
barriers. Last we talked about it, it sounded as if there might be some
bugs in file systems we need to fix before we can confidently say that
yes, on battery backed up storage, one can mount a file system (ext3, ext4,
xfs) with barriers disabled and still expect data integrity.
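[For the record, the barrier controls in question are per-filesystem mount
options; the device path below is a placeholder, and as said above, disabling
them is only even a candidate on battery backed storage.]

```shell
# Illustrative only -- whether disabling barriers is safe is exactly the
# open question above. /dev/sdX1 is a placeholder device.
mount -t ext4 -o barrier=0 /dev/sdX1 /mnt   # ext3/ext4: barrier=0 disables, barrier=1 enables
mount -t xfs  -o nobarrier /dev/sdX1 /mnt   # xfs: nobarrier disables write barriers
```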

Thanks
Vivek

2010-07-26 14:11:10

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Mon, Jul 26, 2010 at 02:58:16PM +0800, Gui Jianfeng wrote:

[..]
> Hi Vivek,
>
> I did some tests on single SATA disk on my desktop. With patches applied, seems no
> regression occurs till now, and have some performance improvement in case of
> "Direct Random Reader" mode. Here're some numbers on my box.
>

Thanks for testing, Gui. "iostest" seems to be working for you. If you had
to make some fixes to get it working on your boxes, do send those to me,
and I can commit them in my internal git tree.

After running the script, you can also run "iostest -R <result-dir>" and
that will generate a report. It will not have all these "Starting test..."
lines and looks nicer.

Good to know that you don't see any regressions on the SATA disk in your
cgroup testing with this patchset. The little improvement in "drr" might
be due to the fact that with the existing slice_idle=0 we can still do
some extra idling on the service tree, and the first patch in the series
(V4) gets rid of that.

Thanks
Vivek

> Vallina kernel:
>
> Blkio is already mounted at /cgroup/blkio. Unmounting it
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> GROUPMODE=1 NRGRP=4
> Will run workloads for increasing number of threads upto a max of 4
> Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
> Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
> Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
> Finished test for workload [drr]
> Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 761 761 762 760 3044
> drr 1 2 185 420 727 1256 2588
> drr 1 4 180 371 588 863 2002
>
>
> Patched kernel:
>
> Blkio is already mounted at /cgroup/blkio. Unmounting it
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> GROUPMODE=1 NRGRP=4
> Will run workloads for increasing number of threads upto a max of 4
> Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
> Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
> Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
> Finished test for workload [drr]
> Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 323 671 1030 1378 3402
> drr 1 2 165 391 686 1144 2386
> drr 1 4 185 373 612 873 2043
>
> Thanks
> Gui
>
> >
> > Thanks
> > Vivek
> >

2010-07-26 14:30:40

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Sat, Jul 24, 2010 at 11:07:07AM +0200, Corrado Zoccolo wrote:
> On Sat, Jul 24, 2010 at 10:51 AM, Christoph Hellwig <[email protected]> wrote:
> > To me this sounds like slice_idle=0 is the right default then, as it
> > gives useful behaviour for all systems linux runs on.
> No, it will give bad performance on single disks, possibly worse than
> deadline (deadline at least sorts the requests between different
> queues, while CFQ with slice_idle=0 doesn't even do this for readers).

> Setting slice_idle to 0 should be considered only when a single
> sequential reader cannot saturate the disk bandwidth, and this happens
> only on smart enough hardware with large number of spindles.

I was thinking of writing a user space utility which can launch an
increasing number of parallel direct/buffered reads from the device; if the
device can sustain more than one parallel read with increasing throughput,
then that is probably a good indicator that one might be better off with
slice_idle=0.

Will try that today...

Vivek

2010-07-26 14:33:14

by Vivek Goyal

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Sat, Jul 24, 2010 at 11:07:07AM +0200, Corrado Zoccolo wrote:
> On Sat, Jul 24, 2010 at 10:51 AM, Christoph Hellwig <[email protected]> wrote:
> > To me this sounds like slice_idle=0 is the right default then, as it
> > gives useful behaviour for all systems linux runs on.
> No, it will give bad performance on single disks, possibly worse than
> deadline (deadline at least sorts the requests between different
> queues, while CFQ with slice_idle=0 doesn't even do this for readers).

Not sure if CFQ will be worse than deadline with slice_idle=0. CFQ has
some built-in things which should help:

- Readers preempt writers.
- All writers go into one single queue (at one prio level); readers get
their individual queues and can outnumber writers.

So I guess CFQ with slice_idle=0 should not be worse than deadline in terms
of read latencies.

Vivek

2010-07-26 21:22:17

by Vivek Goyal

Subject: Tuning IO scheduler (Was: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3)

On Mon, Jul 26, 2010 at 10:30:23AM -0400, Vivek Goyal wrote:
> On Sat, Jul 24, 2010 at 11:07:07AM +0200, Corrado Zoccolo wrote:
> > On Sat, Jul 24, 2010 at 10:51 AM, Christoph Hellwig <[email protected]> wrote:
> > > To me this sounds like slice_idle=0 is the right default then, as it
> > > gives useful behaviour for all systems linux runs on.
> > No, it will give bad performance on single disks, possibly worse than
> > deadline (deadline at least sorts the requests between different
> > queues, while CFQ with slice_idle=0 doesn't even do this for readers).
>
> > Setting slice_idle to 0 should be considered only when a single
> > sequential reader cannot saturate the disk bandwidth, and this happens
> > only on smart enough hardware with large number of spindles.
>
> I was thinking of writting a user space utility which can launch
> increasing number of parallel direct/buffered reads from device and if
> device can sustain more than 1 parallel reads with increasing throughput,
> then it probably is good indicator that one might be better off with
> slice_idle=0.
>
> Will try that today...

Ok, here is a small hackish bash script which takes a block device as
input. It runs multiple parallel sequential readers in raw mode (dd on the
block device) and measures the total throughput. I run the readers on
different areas of the disk so that they don't overlap and don't end up
reading the same blocks.

The idea is to write a simple script which can run a bunch of tests and
suggest to the user which IO scheduler to run or which IO scheduler
tunables to use. At this point I am only looking to identify whether we
should use slice_idle or not in CFQ on a given block device.
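[The attached script is roughly along these lines. A simplified sketch,
not the attached iostune itself; the region spacing, read size, and
coarse one-second timing are my assumptions.]

```shell
#!/bin/sh
# Sketch of a parallel-sequential-reader probe (not the attached iostune
# itself; region spacing and per-reader read size are assumptions).
SIZE_MB=256                   # amount each reader streams

region_offset_mb() {          # reader N starts N*4096 MB in: regions never overlap
    echo $(( $1 * 4096 ))
}

probe() {                     # probe <blockdev>: aggregate bandwidth for 1,2,4,8 readers
    dev=$1
    for nr in 1 2 4 8; do
        start=$(date +%s)
        i=0
        while [ "$i" -lt "$nr" ]; do
            dd if="$dev" of=/dev/null bs=1M count=$SIZE_MB \
               skip="$(region_offset_mb $i)" iflag=direct 2>/dev/null &
            i=$(( i + 1 ))
        done
        wait                  # readers run in parallel; wait for the slowest
        end=$(date +%s)
        # +1 guards against division by zero on very fast runs
        echo "$nr readers: $(( nr * SIZE_MB / (end - start + 1) )) MB/s aggregate"
    done
}

if [ -b "$1" ]; then
    probe "$1"
else
    echo "usage: $0 <block-device>" >&2
fi
```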

Here are some results of various runs. The first column represents the
number of processes run in parallel, the second column is the total BW, and
the third column is the bandwidth of the individual dd processes.
Throughputs are in MB/s.

SATA disk
=========
Noop
----
1 63.3 63.3
2 18.7 9.4 9.3
4 21.6 5.5 5.4 5.4 5.3
8 29.6 5.9 4.5 3.6 3.5 3.3 3.0 3.0 2.8

CFQ
---
1 63.2 63.2
2 54.8 29.2 25.6
4 50.3 13.9 12.8 12.1 11.5
8 42.9 6.0 5.8 5.5 5.4 5.2 5.1 5.0 4.9

Storage Array (12 disks in RAID 5 configuration)
================================================
Noop
----
1 62.5 62.5
2 86.5 46.1 40.4
4 98.7 32.4 24.3 21.9 20.1
8 112.5 15.8 15.5 15.3 13.6 13.6 13.3 13.2 12.2

CFQ
---
1 56.9 56.9
2 34.8 18.0 16.8
4 38.8 10.4 10.3 9.4 8.7
8 44.4 6.1 6.1 5.9 5.9 5.7 5.0 4.9 4.8

SSD
===
Noop
----
1 243 243
2 231 122 109
4 270.6 73.8 73.5 65.1 58.2
8 262.9 33.3 33.2 33.2 33.2 33.2 33.2 33.2 30.4

CFQ
---
1 244 244
2 228 120 108
4 260.6 67.1 67.0 67.0 59.5
8 266.0 35.0 33.4 33.4 33.4 33.4 33.4 33.4 30.6

Summary:

- On the SATA disk with a single spindle, as the number of processes
increases (to 2), the disk starts experiencing seeks and throughput drops
dramatically. Here CFQ idling helps.

- On the storage array, with noop, total throughput increases as the number
of dd processes increases. That means the underlying storage can support
multiple parallel readers without getting seek bound. In this case one
should probably set slice_idle=0.

- With the SSD, throughput does not deteriorate as the number of readers is
increased. CFQ also performs well because idling is disabled internally,
as the SSD is marked as a non-rotational device.

So, bottom line: if a device can support multiple parallel read streams
without a significant drop in throughput, one can set slice_idle=0 in CFQ
to achieve better overall throughput.

This will primarily be true for data disks and not the root disk, as it
does not guarantee better latencies in the presence of buffered WRITES.

Thanks
Vivek


Attachments:
iostune (2.17 kB)

2010-07-27 08:36:28

by Gui, Jianfeng/归 剑峰

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

Vivek Goyal wrote:
> On Mon, Jul 26, 2010 at 02:58:16PM +0800, Gui Jianfeng wrote:
>
> [..]
>> Hi Vivek,
>>
>> I did some tests on single SATA disk on my desktop. With patches applied, seems no
>> regression occurs till now, and have some performance improvement in case of
>> "Direct Random Reader" mode. Here're some numbers on my box.
>>
>
> Thanks for testing Gui. "iostest" seems to be working for you. If you had
> to some fixes to make it work on my boxes, do send those to me, and I can
> commit those in my internal git tree.

Hi Vivek,

I didn't modify iostest at all but just upgraded fio to 1.42

Gui

>
> After running the script, you can also run "iostest -R <result-dir>" and
> that will generate a report. It will not have all this "Starting test..."
> lines and looks nicer.
>
> Good to know that you don't see any regressions on SATA disk in your
> cgroup testing with this patchset. Little improvement in "drr" might
> be due to the fact that with existing slice_idle=0, we can still do
> some extra idling on service tree and first patch in the series (V4)
> gets rid of that.
>
> Thanks
> Vivek
>
>> Vallina kernel:
>>
>> Blkio is already mounted at /cgroup/blkio. Unmounting it
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> GROUPMODE=1 NRGRP=4
>> Will run workloads for increasing number of threads upto a max of 4
>> Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
>> Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
>> Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
>> Finished test for workload [drr]
>> Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 761 761 762 760 3044
>> drr 1 2 185 420 727 1256 2588
>> drr 1 4 180 371 588 863 2002
>>
>>
>> Patched kernel:
>>
>> Blkio is already mounted at /cgroup/blkio. Unmounting it
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> GROUPMODE=1 NRGRP=4
>> Will run workloads for increasing number of threads upto a max of 4
>> Starting test for [drr] with set=1 numjobs=1 filesz=512M bs=32k runtime=30
>> Starting test for [drr] with set=1 numjobs=2 filesz=512M bs=32k runtime=30
>> Starting test for [drr] with set=1 numjobs=4 filesz=512M bs=32k runtime=30
>> Finished test for workload [drr]
>> Host=localhost.localdomain Kernel=2.6.35-rc4-Vivek-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 323 671 1030 1378 3402
>> drr 1 2 165 391 686 1144 2386
>> drr 1 4 185 373 612 873 2043
>>
>> Thanks
>> Gui
>>
>>> Thanks
>>> Vivek
>>>
>
>

2010-07-29 19:57:52

by Corrado Zoccolo

Subject: Re: [RFC PATCH] cfq-iosced: Implement IOPS mode and group_idle tunable V3

On Mon, Jul 26, 2010 at 4:33 PM, Vivek Goyal <[email protected]> wrote:
> On Sat, Jul 24, 2010 at 11:07:07AM +0200, Corrado Zoccolo wrote:
>> On Sat, Jul 24, 2010 at 10:51 AM, Christoph Hellwig <[email protected]> wrote:
>> > To me this sounds like slice_idle=0 is the right default then, as it
>> > gives useful behaviour for all systems linux runs on.
>> No, it will give bad performance on single disks, possibly worse than
>> deadline (deadline at least sorts the requests between different
>> queues, while CFQ with slice_idle=0 doesn't even do this for readers).
>
> Not sure if CFQ will be worse than deadline with slice_idle=0. CFQ has
> some inbuilt things which should help.
>
> - Readers preempt Writers
> - All writers go in one single queue (at one prio level), readers get
>  their individual queues and can outnumber writers.
>
> So I guess CFQ with slice_idle=0 should not be worse than deadline in terms
> of read latencies.

I was thinking more of the fact that read requests are not sorted:
they will basically be serviced in FIFO order, while deadline will
sort them and possibly increase locality. In the reader vs. writer
case, cfq may have a small edge. Either way, both will severely
underperform compared to cfq with slice_idle != 0.
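To make the locality point concrete, here is a toy shell calculation (the
request sectors are made up) of the total head travel when the same
requests are served in FIFO arrival order versus sorted by sector, as
deadline would sort them:

```shell
# Hypothetical request positions (sectors), interleaved low/high as two
# readers alternating would produce them.
reqs="100 900 120 880 140 860"

# FIFO order: sum of absolute distances between consecutive requests
fifo=$(echo $reqs | awk '{d=0; for(i=2;i<=NF;i++) d+=($i>$(i-1))?$i-$(i-1):$(i-1)-$i; print d}')

# Sorted order: one sweep from lowest to highest sector
sorted=$(echo $reqs | tr ' ' '\n' | sort -n | tr '\n' ' ' \
         | awk '{d=0; for(i=2;i<=NF;i++) d+=$i-$(i-1); print d}')

echo "FIFO seek distance:   $fifo sectors"
echo "sorted seek distance: $sorted sectors"
```

For this made-up pattern the FIFO order travels 3800 sectors against 800
for the sorted sweep, which is the locality win sorting buys on rotational
media.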

>
> Vivek
>



--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda