2010-07-19 17:20:20

by Vivek Goyal

[permalink] [raw]
Subject: [RFC PATCH] cfq-iosched: Implement group idle V2


[ Got Jens's mail id wrong in last post hence reposting. Sorry for cluttering
your mailboxes.]

Hi,

This is V2 of the group_idle implementation patchset. I have done some more
testing since V1 and fixed couple of bugs since V1.

What's the problem
------------------
On high end storage (I got on HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers/write sync writers etc), becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy
and suffer a lot in terms of overall throughput.

All these problems primarily originate from two things. Idling on per
cfq queue and quantum (dispatching limited number of requests from a
single queue) and till then not allowing dispatch from other queues. Once
you set the slice_idle=0 and quantum to higher value, most of the CFQ's
problem on higher end storage disappear.

This problem also becomes visible in IO controller where one creates
multiple groups and gets the fairness but overall throughput is less. In
the following table, I am running increasing number of sequential readers
(1,2,4,8) in 8 groups of weight 100 to 800.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6245 12776 16591 23471 28746 36799 43031 49778 217437
bsr 1 2 5100 11063 17216 23136 23738 30391 35910 40874 187428
bsr 1 4 4623 9718 14746 18356 22875 30407 33215 38073 172013
bsr 1 8 4720 10143 13499 19115 22157 29126 31688 30784 161232

Notice that overall throughput is just around 160MB/s with 8 sequential reader
in each group.

With this patch set, I have set slice_idle=0 and re-ran same test.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6789 12764 17174 23111 28528 36807 42753 48826 216752
bsr 1 2 9845 20617 30521 39061 45514 51607 63683 63770 324618
bsr 1 4 14835 24241 42444 55207 45914 51318 54661 60318 348938
bsr 1 8 12022 24478 36732 48651 54333 60974 64856 72930 374976

Notice how overall throughput has shot upto 374MB/s while retaining the ability
to do the IO control.

So this is not the default mode. This new tunable group_idle, allows one to
set slice_idle=0 to disable some of the CFQ features and and use primarily
group service differentation feature.

If you have thoughts on other ways of solving the problem, I am all ears
to it.

Thanks
Vivek


2010-07-19 17:20:29

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

- Currently in CFQ there are many situations where don't know how
much time slice has been consumed by a queue. For example, all
the random reader/writer queues where we don't idle on
individual queues and we expire the queue either immediately
after the request dispatch.

- In this case time consumed by a queue is just a memory copy
operation. Actually time measurement is possible only if we
idle on a queue and allow dispatch from a queue for significant
amount of time.

- As of today, in such cases we calculate the time since the
dispatch from the queue started and charge all that time.
Generally this rounds to 1 jiffy but in some cases it can
be more. For example, if we are driving high request queue
depth and driver is too busy and does not ask for new
requests for 8-10 jiffies. In such cases, the active queue
gets charged very unfairly.

- So fundamentally, whole notion of charging for time slice
is valid only if we have been idling on the queue. Otherwise
in an NCQ queue, there might be other requests on the queue
and we can not do the time slice calculation.

- This patch tweaks the slice charging logic a bit so that
in the cases where we can't know the amount of time, we
start charging in terms of number of requests dispatched
(IOPS). This practically switching CFQ fairness model to
fairness in terms of IOPS with slice_idle=0.

- As of today this will primarily be useful only with
group_idle patches so that we get fairness in terms of
IOPS across groups. The idea is that on fast storage
one can run CFQ with slice_idle=0 and still get IO
controller working without losing too much of
throughput.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
* if there are mutiple queues in the group, each can dispatch
* a single request on seeky media and cause lots of seek time
* and group will never know it.
+ *
+ * If drive is NCQ and we are driving deep queue depths, then
+ * it is not reasonable to charge the slice since dispatch
+ * started because this time will include time taken by all
+ * the other requests in the queue.
+ *
+ * Actually there is no reasonable way to know the disk time
+ * here and we need to come up with some approximation. If
+ * disk is non NCQ, we should be driving request queue depth
+ * 1, then charge for time since dispatch start and this will
+ * account for seek time properly on seeky media. If request
+ * queue depth is high, then charge for number of requests
+ * dispatched from the queue. This will sort of becoming
+ * charging in terms of IOPS.
*/
- slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
- 1);
+ if (cfqq->cfqd->hw_tag == 0)
+ slice_used = max_t(unsigned,
+ (jiffies - cfqq->dispatch_start), 1);
+ else
+ slice_used = cfqq->slice_dispatch;
} else {
slice_used = jiffies - cfqq->slice_start;
if (slice_used > cfqq->allocated_slice)
slice_used = cfqq->allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+ cfqq->slice_dispatch);
return slice_used;
}

--
1.7.1.1

2010-07-19 17:20:31

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace

- Divyesh had gotten rid of this code in the past. I want to re-introduce it
back as it helps me a lot during debugging.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b23d7f4..1f0d4ea 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
+ /* Number of sectors dispatched from queue in single dispatch round */
+ unsigned long nr_sectors;
};

/*
@@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
- cfqq->slice_dispatch);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
+ slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
return slice_used;
}

@@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
rq_data_dir(rq), rq_is_sync(rq));
}
--
1.7.1.1

2010-07-19 17:20:59

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle

o Implement a new tunable group_idle, which allows idling on the group
instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
on the individual queues but idle on the group. This way on fast storage
we can get fairness between groups at the same time overall throughput
improves.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f44064c..b23d7f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
static const int cfq_hist_divisor = 4;

@@ -198,6 +199,8 @@ struct cfq_group {
struct hlist_node cfqd_node;
atomic_t ref;
#endif
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
};

/*
@@ -271,6 +274,7 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_group_idle;
unsigned int cfq_latency;
unsigned int cfq_group_isolation;

@@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);

+ if (!cfqd->cfq_slice_idle)
+ return false;
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
- unsigned long sl;
+ unsigned long sl, group_idle = 0;

/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
- return;
+ if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
+ /* no queue idling. Check for group idling */
+ if (cfqd->cfq_group_idle)
+ group_idle = cfqd->cfq_group_idle;
+ else
+ return;
+ }

/*
* still active requests from this queue, don't idle
@@ -1924,13 +1936,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
}

+ /* There are other queues in the group, don't do group idle */
+ if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+ return;
+
cfq_mark_cfqq_wait_request(cfqq);

- sl = cfqd->cfq_slice_idle;
+ if (group_idle)
+ sl = cfqd->cfq_group_idle;
+ else
+ sl = cfqd->cfq_slice_idle;

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+ group_idle ? 1 : 0);
}

/*
@@ -1946,6 +1966,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
+ (RQ_CFQG(rq))->dispatched++;
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
cfqq = NULL;
goto keep_queue;
} else
- goto expire;
+ goto check_group_idle;
}

/*
@@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
goto keep_queue;
}

+ /*
+ * If group idle is enabled and there are requests dispatched from
+ * this group, wait for requests to complete.
+ */
+check_group_idle:
+ if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+ && cfqq->cfqg->dispatched) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
@@ -3391,6 +3423,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ (RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
rq_data_dir(rq), rq_is_sync(rq));
@@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* the queue.
*/
if (cfq_should_wait_busy(cfqd, cfqq)) {
- cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ unsigned long extend_sl = cfqd->cfq_slice_idle;
+ if (!cfqd->cfq_slice_idle)
+ extend_sl = cfqd->cfq_group_idle;
+ cfqq->slice_end = jiffies + extend_sl;
cfq_mark_cfqq_wait_busy(cfqq);
cfq_log_cfqq(cfqd, cfqq, "will busy wait");
}
@@ -3865,6 +3901,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_group_idle = cfq_group_idle;
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
@@ -3937,6 +3974,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3969,6 +4007,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3990,6 +4029,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(group_idle),
CFQ_ATTR(low_latency),
CFQ_ATTR(group_isolation),
__ATTR_NULL
@@ -4043,6 +4083,12 @@ static int __init cfq_init(void)
if (!cfq_slice_idle)
cfq_slice_idle = 1;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfq_group_idle)
+ cfq_group_idle = 1;
+#else
+ cfq_group_idle = 0;
+#endif
if (cfq_slab_setup())
return -ENOMEM;

--
1.7.1.1

2010-07-19 18:47:37

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

Vivek Goyal <[email protected]> writes:

> - Currently in CFQ there are many situations where don't know how
> much time slice has been consumed by a queue. For example, all
> the random reader/writer queues where we don't idle on
> individual queues and we expire the queue either immediately
> after the request dispatch.
>
> - In this case time consumed by a queue is just a memory copy
> operation. Actually time measurement is possible only if we
> idle on a queue and allow dispatch from a queue for significant
> amount of time.
>
> - As of today, in such cases we calculate the time since the
> dispatch from the queue started and charge all that time.
> Generally this rounds to 1 jiffy but in some cases it can
> be more. For example, if we are driving high request queue
> depth and driver is too busy and does not ask for new
> requests for 8-10 jiffies. In such cases, the active queue
> gets charged very unfairly.
>
> - So fundamentally, whole notion of charging for time slice
> is valid only if we have been idling on the queue. Otherwise
> in an NCQ queue, there might be other requests on the queue
> and we can not do the time slice calculation.
>
> - This patch tweaks the slice charging logic a bit so that
> in the cases where we can't know the amount of time, we
> start charging in terms of number of requests dispatched
> (IOPS). This practically switching CFQ fairness model to
> fairness in terms of IOPS with slice_idle=0.
>
> - As of today this will primarily be useful only with
> group_idle patches so that we get fairness in terms of
> IOPS across groups. The idea is that on fast storage
> one can run CFQ with slice_idle=0 and still get IO
> controller working without losing too much of
> throughput.

I'm not fluent in the cgroup code, my apologies for that. However, just
trying to make sense of this is giving me a headache. Now, in some
cases you are using IOPS *in place of* jiffies. How are we to know
which is which and in what cases?

It sounds like this is addressing an important problem, but I'm having a
hard time picking out what that problem is. Is this problem noticable
for competing sync-noidle workloads (competing between groups, that is)?
If not, then what?

Thanks,
Jeff

> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 24 +++++++++++++++++++++---
> 1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 7982b83..f44064c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> * if there are mutiple queues in the group, each can dispatch
> * a single request on seeky media and cause lots of seek time
> * and group will never know it.
> + *
> + * If drive is NCQ and we are driving deep queue depths, then
> + * it is not reasonable to charge the slice since dispatch
> + * started because this time will include time taken by all
> + * the other requests in the queue.
> + *
> + * Actually there is no reasonable way to know the disk time
> + * here and we need to come up with some approximation. If
> + * disk is non NCQ, we should be driving request queue depth
> + * 1, then charge for time since dispatch start and this will
> + * account for seek time properly on seeky media. If request
> + * queue depth is high, then charge for number of requests
> + * dispatched from the queue. This will sort of becoming
> + * charging in terms of IOPS.
> */
> - slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> - 1);
> + if (cfqq->cfqd->hw_tag == 0)
> + slice_used = max_t(unsigned,
> + (jiffies - cfqq->dispatch_start), 1);
> + else
> + slice_used = cfqq->slice_dispatch;
> } else {
> slice_used = jiffies - cfqq->slice_start;
> if (slice_used > cfqq->allocated_slice)
> slice_used = cfqq->allocated_slice;
> }
>
> - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> + cfqq->slice_dispatch);
> return slice_used;
> }

2010-07-19 18:58:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

On Mon, Jul 19, 2010 at 02:47:20PM -0400, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > - Currently in CFQ there are many situations where don't know how
> > much time slice has been consumed by a queue. For example, all
> > the random reader/writer queues where we don't idle on
> > individual queues and we expire the queue either immediately
> > after the request dispatch.
> >
> > - In this case time consumed by a queue is just a memory copy
> > operation. Actually time measurement is possible only if we
> > idle on a queue and allow dispatch from a queue for significant
> > amount of time.
> >
> > - As of today, in such cases we calculate the time since the
> > dispatch from the queue started and charge all that time.
> > Generally this rounds to 1 jiffy but in some cases it can
> > be more. For example, if we are driving high request queue
> > depth and driver is too busy and does not ask for new
> > requests for 8-10 jiffies. In such cases, the active queue
> > gets charged very unfairly.
> >
> > - So fundamentally, whole notion of charging for time slice
> > is valid only if we have been idling on the queue. Otherwise
> > in an NCQ queue, there might be other requests on the queue
> > and we can not do the time slice calculation.
> >
> > - This patch tweaks the slice charging logic a bit so that
> > in the cases where we can't know the amount of time, we
> > start charging in terms of number of requests dispatched
> > (IOPS). This practically switching CFQ fairness model to
> > fairness in terms of IOPS with slice_idle=0.
> >
> > - As of today this will primarily be useful only with
> > group_idle patches so that we get fairness in terms of
> > IOPS across groups. The idea is that on fast storage
> > one can run CFQ with slice_idle=0 and still get IO
> > controller working without losing too much of
> > throughput.
>
> I'm not fluent in the cgroup code, my apologies for that. However, just
> trying to make sense of this is giving me a headache. Now, in some
> cases you are using IOPS *in place of* jiffies. How are we to know
> which is which and in what cases?

Yes it is mixed now for default CFQ case. Whereever we don't have the
capability to determine the slice_used, we charge IOPS.

For slice_idle=0 case, we should charge IOPS almost all the time. Though
if there is a workload where single cfqq can keep the request queue
saturated, then current code will charge in terms of time.

I agree that this is little confusing. May be in case of slice_idle=0
we can always charge in terms of IOPS.

>
> It sounds like this is addressing an important problem, but I'm having a
> hard time picking out what that problem is. Is this problem noticable
> for competing sync-noidle workloads (competing between groups, that is)?
> If not, then what?

I noticed problem during competing workloads in different groups. With
slice_idle 0, we will drive full queue depth of 32. Sometimes when we
hit high queue depth, say 32, for few jiffies, driver did not ask for
new requests. So say for 10-12 ms, requests only completed and new
requests did not get issued. In that case, all this 10-12 ms gets charged
to active queue and the fact is that this active queue did not even get
to dispatch more than 1 request. This queue was just unfortunate to be
there at that time. The higher weight queue ofen run into this situation
because CFQ tries to keep them as active queue more often.

So if you are driving full queue depth where in NCQ request queue there
are requests pending from multiple queues and groups, you have no way to
measure the time. My impression is that on fast devices, we can no longer
stick to the model of measuring the time. If we switch to IOPS model,
then we can drive deeper requests queue depths and keep the device
saturated, at the same time achieve group IO control.

Thanks
Vivek

>
> Thanks,
> Jeff
>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > block/cfq-iosched.c | 24 +++++++++++++++++++++---
> > 1 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 7982b83..f44064c 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> > * if there are mutiple queues in the group, each can dispatch
> > * a single request on seeky media and cause lots of seek time
> > * and group will never know it.
> > + *
> > + * If drive is NCQ and we are driving deep queue depths, then
> > + * it is not reasonable to charge the slice since dispatch
> > + * started because this time will include time taken by all
> > + * the other requests in the queue.
> > + *
> > + * Actually there is no reasonable way to know the disk time
> > + * queue depth is high, then charge for number of requests
> > + * dispatched from the queue. This will sort of becoming
> > + * charging in terms of IOPS.
> > */
> > - slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> > - 1);
> > + if (cfqq->cfqd->hw_tag == 0)
> > + slice_used = max_t(unsigned,
> > + (jiffies - cfqq->dispatch_start), 1);
> > + else
> > + slice_used = cfqq->slice_dispatch;
> > } else {
> > slice_used = jiffies - cfqq->slice_start;
> > if (slice_used > cfqq->allocated_slice)
> > slice_used = cfqq->allocated_slice;
> > }
> >
> > - cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> > + cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> > + cfqq->slice_dispatch);
> > return slice_used;
> > }

2010-07-19 18:58:55

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle

Vivek Goyal <[email protected]> writes:

> o Implement a new tunable group_idle, which allows idling on the group
> instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> on the individual queues but idle on the group. This way on fast storage
> we can get fairness between groups at the same time overall throughput
> improves.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
> 1 files changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f44064c..b23d7f4 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
> static int cfq_slice_async = HZ / 25;
> static const int cfq_slice_async_rq = 2;
> static int cfq_slice_idle = HZ / 125;
> +static int cfq_group_idle = HZ / 125;
> static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> static const int cfq_hist_divisor = 4;
>
> @@ -198,6 +199,8 @@ struct cfq_group {
> struct hlist_node cfqd_node;
> atomic_t ref;
> #endif
> + /* number of requests that are on the dispatch list or inside driver */
> + int dispatched;
> };
>
> /*
> @@ -271,6 +274,7 @@ struct cfq_data {
> unsigned int cfq_slice[2];
> unsigned int cfq_slice_async_rq;
> unsigned int cfq_slice_idle;
> + unsigned int cfq_group_idle;
> unsigned int cfq_latency;
> unsigned int cfq_group_isolation;
>
> @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> BUG_ON(!service_tree);
> BUG_ON(!service_tree->count);
>
> + if (!cfqd->cfq_slice_idle)
> + return false;
> +
> /* We never do for idle class queues. */
> if (prio == IDLE_WORKLOAD)
> return false;
> @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> {
> struct cfq_queue *cfqq = cfqd->active_queue;
> struct cfq_io_context *cic;
> - unsigned long sl;
> + unsigned long sl, group_idle = 0;
>
> /*
> * SSD device without seek penalty, disable idling. But only do so
> @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> /*
> * idle is disabled, either manually or by past process history
> */
> - if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> - return;
> + if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {

The check for cfqd->cfq_slice_idle is now redundant (as it's done in
cfq_should_idle).

> @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> cfqq = NULL;
> goto keep_queue;
> } else
> - goto expire;
> + goto check_group_idle;
> }
>
> /*
> @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> goto keep_queue;
> }
>
> + /*
> + * If group idle is enabled and there are requests dispatched from
> + * this group, wait for requests to complete.
> + */
> +check_group_idle:
> + if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> + && cfqq->cfqg->dispatched) {
> + cfqq = NULL;
> + goto keep_queue;
> + }

I really wish we could roll all of this logic into cfq_should_idle.

> @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> * the queue.
> */
> if (cfq_should_wait_busy(cfqd, cfqq)) {
> - cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> + unsigned long extend_sl = cfqd->cfq_slice_idle;
> + if (!cfqd->cfq_slice_idle)
> + extend_sl = cfqd->cfq_group_idle;
> + cfqq->slice_end = jiffies + extend_sl;

So, if slice_idle and group_idle are both set, slice_idle trumps
group_idle? Did you give that case any thought? If it doesn't make
sense to configure both, then we should probably make sure they can't
both be set.

Cheers,
Jeff

2010-07-19 19:00:31

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace

Vivek Goyal <[email protected]> writes:

> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
> back as it helps me a lot during debugging.

Fine with me.

Reviewed-by: Jeff Moyer <[email protected]>

2010-07-19 20:20:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle

On Mon, Jul 19, 2010 at 02:58:41PM -0400, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Implement a new tunable group_idle, which allows idling on the group
> > instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> > on the individual queues but idle on the group. This way on fast storage
> > we can get fairness between groups at the same time overall throughput
> > improves.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > block/cfq-iosched.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
> > 1 files changed, 53 insertions(+), 7 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index f44064c..b23d7f4 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
> > static int cfq_slice_async = HZ / 25;
> > static const int cfq_slice_async_rq = 2;
> > static int cfq_slice_idle = HZ / 125;
> > +static int cfq_group_idle = HZ / 125;
> > static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> > static const int cfq_hist_divisor = 4;
> >
> > @@ -198,6 +199,8 @@ struct cfq_group {
> > struct hlist_node cfqd_node;
> > atomic_t ref;
> > #endif
> > + /* number of requests that are on the dispatch list or inside driver */
> > + int dispatched;
> > };
> >
> > /*
> > @@ -271,6 +274,7 @@ struct cfq_data {
> > unsigned int cfq_slice[2];
> > unsigned int cfq_slice_async_rq;
> > unsigned int cfq_slice_idle;
> > + unsigned int cfq_group_idle;
> > unsigned int cfq_latency;
> > unsigned int cfq_group_isolation;
> >
> > @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > BUG_ON(!service_tree);
> > BUG_ON(!service_tree->count);
> >
> > + if (!cfqd->cfq_slice_idle)
> > + return false;
> > +
> > /* We never do for idle class queues. */
> > if (prio == IDLE_WORKLOAD)
> > return false;
> > @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> > {
> > struct cfq_queue *cfqq = cfqd->active_queue;
> > struct cfq_io_context *cic;
> > - unsigned long sl;
> > + unsigned long sl, group_idle = 0;
> >
> > /*
> > * SSD device without seek penalty, disable idling. But only do so
> > @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> > /*
> > * idle is disabled, either manually or by past process history
> > */
> > - if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> > - return;
> > + if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
>
> The check for cfqd->cfq_slice_idle is now redundant (as it's done in
> cfq_should_idle).

Yep. I will get rid of extra check.

>
> > @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> > cfqq = NULL;
> > goto keep_queue;
> > } else
> > - goto expire;
> > + goto check_group_idle;
> > }
> >
> > /*
> > @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> > goto keep_queue;
> > }
> >
> > + /*
> > + * If group idle is enabled and there are requests dispatched from
> > + * this group, wait for requests to complete.
> > + */
> > +check_group_idle:
> > + if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> > + && cfqq->cfqg->dispatched) {
> > + cfqq = NULL;
> > + goto keep_queue;
> > + }
>
> I really wish we could roll all of this logic into cfq_should_idle.

Currently whether to idle on queue or not logic is also part of select_queue().
So to keep it same it makes sense to keep group logic also in select_queue().

>
> > @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> > * the queue.
> > */
> > if (cfq_should_wait_busy(cfqd, cfqq)) {
> > - cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> > + unsigned long extend_sl = cfqd->cfq_slice_idle;
> > + if (!cfqd->cfq_slice_idle)
> > + extend_sl = cfqd->cfq_group_idle;
> > + cfqq->slice_end = jiffies + extend_sl;
>
> So, if slice_idle and group_idle are both set, slice_idle trumps
> group_idle? Did you give that case any thought? If it doesn't make
> sense to configure both, then we should probably make sure they can't
> both be set.

Actually, the wait busy logic makes sense in case of slice_idle, where we
can give an extended slice to a queue and to make sure a queue/group does
not get deleted immediately after completing the slice and hence losing
the share.

Now in case of slice_idle=0, if there is only one queue in the group then
it becomes the same case as one queue in the group with slice_idle=8.

So yes, slice_idle trumps group_idle. group_idle kicks in only if
slice_idle=0. I think by default it being set it does not harm. What good
it will do if slice_idle=8 and group_idle=0. Instead it will become
more work for user to also set one more tunable if he plans to set
slice_idle=0.

Thanks
Vivek



>
> Cheers,
> Jeff

2010-07-19 20:32:49

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <[email protected]> wrote:
> Yes it is mixed now for default CFQ case. Whereever we don't have the
> capability to determine the slice_used, we charge IOPS.
>
> For slice_idle=0 case, we should charge IOPS almost all the time. Though
> if there is a workload where single cfqq can keep the request queue
> saturated, then current code will charge in terms of time.
>
> I agree that this is little confusing. May be in case of slice_idle=0
> we can always charge in terms of IOPS.

I agree with Jeff that this is very confusing. Also there are
absolutely no bets that one job may end up getting charged in IOPs for
this behavior while other jobs continue getting charged in timefor
their IOs. Depending on the speed of the disk, this could be a huge
advantage or disadvantage for the cgroup being charged in IOPs.

It should be black or white, time or IOPs and also very clearly called
out not just in code comments but in the Documentation too.

2010-07-19 20:44:57

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <[email protected]> wrote:
> > Yes it is mixed now for default CFQ case. Whereever we don't have the
> > capability to determine the slice_used, we charge IOPS.
> >
> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> > if there is a workload where single cfqq can keep the request queue
> > saturated, then current code will charge in terms of time.
> >
> > I agree that this is little confusing. May be in case of slice_idle=0
> > we can always charge in terms of IOPS.
>
> I agree with Jeff that this is very confusing. Also there are
> absolutely no bets that one job may end up getting charged in IOPs for
> this behavior while other jobs continue getting charged in timefor
> their IOs. Depending on the speed of the disk, this could be a huge
> advantage or disadvantage for the cgroup being charged in IOPs.
>
> It should be black or white, time or IOPs and also very clearly called
> out not just in code comments but in the Documentation too.

Ok, how about always charging in IOPS when slice_idle=0?

So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
starts doing accounting in IOPS instead of time. On slow devices we
continue to run with slice_idle=8 and nothing changes.

Personally I feel that it is hard to sustain time based logic on high end
devices and still get good throughput. We could make CFQ a dual mode kind
of scheduler which is capable of doing accouting both in terms of time as
well as IOPS. When slice_idle !=0, we do accounting in terms of time and
it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
in terms of IOPS.

I think this change should bring us one step closer to our goal of one
IO sheduler for all devices.

Jens, what do you think?

Thanks
Vivek

2010-07-19 21:19:24

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <[email protected]> wrote:
> On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
>> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <[email protected]> wrote:
>> > Yes it is mixed now for default CFQ case. Whereever we don't have the
>> > capability to determine the slice_used, we charge IOPS.
>> >
>> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
>> > if there is a workload where single cfqq can keep the request queue
>> > saturated, then current code will charge in terms of time.
>> >
>> > I agree that this is little confusing. May be in case of slice_idle=0
>> > we can always charge in terms of IOPS.
>>
>> I agree with Jeff that this is very confusing. Also there are
>> absolutely no bets that one job may end up getting charged in IOPs for
>> this behavior while other jobs continue getting charged in timefor
>> their IOs. Depending on the speed of the disk, this could be a huge
>> advantage or disadvantage for the cgroup being charged in IOPs.
>>
>> It should be black or white, time or IOPs and also very clearly called
>> out not just in code comments but in the Documentation too.
>
> Ok, how about always charging in IOPS when slice_idle=0?
>
> So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> starts doing accounting in IOPS instead of time. On slow devices we
> continue to run with slice_idle=8 and nothing changes.
>
> Personally I feel that it is hard to sustain time based logic on high end
> devices and still get good throughput. We could make CFQ a dual mode kind
> of scheduler which is capable of doing accouting both in terms of time as
> well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> in terms of IOPS.
There is an other mode in which cfq can operate: for ncq ssds, it
basically ignores slice_idle, and operates as if it was 0.
This mode should also be handled as an IOPS counting mode.
SSD mode, though, differs from rotational mode for the definition of
"seekyness", and we should think if this mode is appropriate also for
the other hardware where slice_idle=0 is beneficial.
>
> I think this change should bring us one step closer to our goal of one
> IO sheduler for all devices.

I think this is an interesting instance of a more general problem: cfq
needs a cost function applicable to all requests on any hardware. The
current function is a concrete one (measured time), but unfortunately
it is not always applicable, because:
- for fast hardware the resolution is too coarse (this can be fixed
using higher resolution timers)
- for hardware that allows parallel dispatching, we can't measure the
cost of a single request (can we try something like average cost of
the requests executed in parallel?).
IOPS, instead, is a synthetic cost measure. It is a simplified model,
that will approximate some devices (SSDs) better than others
(multi-spindle rotational disks). But if we want to go for the
synthetic path, we can have more complex measures, that also take into
account other parameters, as sequentiality of the requests, their size
and so on, all parameters that may have still some impact on high-end
devices.

Thanks,
Corrado
>
> Jens, what do you think?
>
> Thanks
> Vivek

2010-07-19 22:05:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic

On Mon, Jul 19, 2010 at 11:19:21PM +0200, Corrado Zoccolo wrote:
> On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <[email protected]> wrote:
> > On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> >> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <[email protected]> wrote:
> >> > Yes it is mixed now for default CFQ case. Whereever we don't have the
> >> > capability to determine the slice_used, we charge IOPS.
> >> >
> >> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> >> > if there is a workload where single cfqq can keep the request queue
> >> > saturated, then current code will charge in terms of time.
> >> >
> >> > I agree that this is little confusing. May be in case of slice_idle=0
> >> > we can always charge in terms of IOPS.
> >>
> >> I agree with Jeff that this is very confusing. Also there are
> >> absolutely no bets that one job may end up getting charged in IOPs for
> >> this behavior while other jobs continue getting charged in timefor
> >> their IOs. Depending on the speed of the disk, this could be a huge
> >> advantage or disadvantage for the cgroup being charged in IOPs.
> >>
> >> It should be black or white, time or IOPs and also very clearly called
> >> out not just in code comments but in the Documentation too.
> >
> > Ok, how about always charging in IOPS when slice_idle=0?
> >
> > So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> > starts doing accounting in IOPS instead of time. On slow devices we
> > continue to run with slice_idle=8 and nothing changes.
> >
> > Personally I feel that it is hard to sustain time based logic on high end
> > devices and still get good throughput. We could make CFQ a dual mode kind
> > of scheduler which is capable of doing accouting both in terms of time as
> > well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> > it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> > in terms of IOPS.
> There is an other mode in which cfq can operate: for ncq ssds, it
> basically ignores slice_idle, and operates as if it was 0.
> This mode should also be handled as an IOPS counting mode.
> SSD mode, though, differs from rotational mode for the definition of
> "seekyness", and we should think if this mode is appropriate also for
> the other hardware where slice_idle=0 is beneficial.

I am always wondering that in practice, what is the difference between
slice_idle=0 and rotational=0. I think the only difference is NCQ queue
detection. slice_idle=0 will always not idle, irrespective of the fact
whether queue is NCQ or not and rotational=0 will disable idling only
if device supports NCQ.

If that's the case, then we can probably internally switch the
slice_idle=0 once we have detected that an SSD supports NCQ and we can
get rid of this confusion.

Well looking more closely, there seems to be one more difference. With
SSD, and NCQ, we still idle on sync-noidle tree. This seemingly, will
provide us protection from WRITES. Not sure if this is true for good
SSDs also. I am assuming they should be giving priority to reads and
balancing things out. cfq_should_idle() is interesting though, that
we disable idling for sync-idle tree. So we idle on sync-noidle tree
but do not provide any protection to sequential readers. Anyway, that's
a minor detail....

In fact we can switch to IOPS model for NCQ SSD also.

> >
> > I think this change should bring us one step closer to our goal of one
> > IO sheduler for all devices.
>
> I think this is an interesting instance of a more general problem: cfq
> needs a cost function applicable to all requests on any hardware. The
> current function is a concrete one (measured time), but unfortunately
> it is not always applicable, because:
> - for fast hardware the resolution is too coarse (this can be fixed
> using higher resolution timers)

Yes this is fixable.

> - for hardware that allows parallel dispatching, we can't measure the
> cost of a single request (can we try something like average cost of
> the requests executed in parallel?).

This is the biggest problem. How to get right estimate of time when
a request queue can have requests from multiple processes at the same
time.

> IOPS, instead, is a synthetic cost measure. It is a simplified model,
> that will approximate some devices (SSDs) better than others
> (multi-spindle rotational disks).

Agreed that IOPS is a simplified model.

> But if we want to go for the
> synthetic path, we can have more complex measures, that also take into
> account other parameters, as sequentiality of the requests,

Once we start dispatching requests from multiple cfq queues at a time,
notion of sequentiality is lost (at least on the device).

> their size
> and so on, all parameters that may have still some impact on high-end
> devices.

size is an interesting factor though. Again we can only come up with
some kind of approximation only as this cost will vary from device to
device

I think we can begin with something simple (IOPS) and if it works fine,
then we can take into account additional factors (especially size of
request) and factor that into the cost.

The only thing to keep in mind is that group scheduling will benefit
most from it. The notion of ioprio is fairly weak currently in CFQ
(especially on SSD and with slice_idle=0).

Thanks
Vivek

>
> Thanks,
> Corrado
> >
> > Jens, what do you think?
> >
> > Thanks
> > Vivek

2010-07-19 22:16:55

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace

Looks good to me.

Reviewed-by: Divyesh Shah <[email protected]>

On Mon, Jul 19, 2010 at 10:20 AM, Vivek Goyal <[email protected]> wrote:
> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
> ?back as it helps me a lot during debugging.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> ?block/cfq-iosched.c | ? ?8 ++++++--
> ?1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b23d7f4..1f0d4ea 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -148,6 +148,8 @@ struct cfq_queue {
> ? ? ? ?struct cfq_queue *new_cfqq;
> ? ? ? ?struct cfq_group *cfqg;
> ? ? ? ?struct cfq_group *orig_cfqg;
> + ? ? ? /* Number of sectors dispatched from queue in single dispatch round */
> + ? ? ? unsigned long nr_sectors;
> ?};
>
> ?/*
> @@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> ? ? ? ? ? ? ? ? ? ? ? ?slice_used = cfqq->allocated_slice;
> ? ? ? ?}
>
> - ? ? ? cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> - ? ? ? ? ? ? ? ? ? ? ? cfqq->slice_dispatch);
> + ? ? ? cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
> + ? ? ? ? ? ? ? ? ? ? ? slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
> ? ? ? ?return slice_used;
> ?}
>
> @@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
> ? ? ? ? ? ? ? ?cfqq->allocated_slice = 0;
> ? ? ? ? ? ? ? ?cfqq->slice_end = 0;
> ? ? ? ? ? ? ? ?cfqq->slice_dispatch = 0;
> + ? ? ? ? ? ? ? cfqq->nr_sectors = 0;
>
> ? ? ? ? ? ? ? ?cfq_clear_cfqq_wait_request(cfqq);
> ? ? ? ? ? ? ? ?cfq_clear_cfqq_must_dispatch(cfqq);
> @@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
> ? ? ? ?elv_dispatch_sort(q, rq);
>
> ? ? ? ?cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> + ? ? ? cfqq->nr_sectors += blk_rq_sectors(rq);
> ? ? ? ?cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rq_data_dir(rq), rq_is_sync(rq));
> ?}
> --
> 1.7.1.1
>
>