2009-11-30 03:04:29

by Vivek Goyal

[permalink] [raw]
Subject: Block IO Controller V4

Hi Jens,

This is V4 of the Block IO controller patches, on top of the "for-2.6.33"
branch of the block tree.

A consolidated patch can be found here:

http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch


Changes from V3:
- Removed the group_idle tunable and introduced a group_isolation tunable.
Thanks to Corrado for the idea, and thanks to Alan for testing and reporting
performance issues with random reads.

Generally, if random readers are put in separate groups, each group gets
exclusive access to the disk in turn, we drive a lower queue depth, and
performance drops. So by default random queues are now moved to the root
group, which reduces the performance drop caused by idling on each group's
sync-noidle tree.

If one wants stronger isolation/fairness for random IO, one needs to set
group_isolation=1; this will also result in a performance drop if a group
does not have enough IO going on to keep the disk busy.
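
As a rough sketch of the idea (the exact code is in a later patch of this
series; the cfqd->cfq_group_isolation field name is assumed here from the
tunable name), a backlogged sync-noidle queue is relinked to the root group
when isolation is off:

	/* sketch only; cfq_group_isolation is an assumed field name */
	if (!cfqd->cfq_group_isolation
	    && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
	    && cfqq->cfqg != &cfqd->root_group) {
		/* random IO: serve it from the shared root group tree */
		cfq_log_cfqq(cfqd, cfqq, "moving to root group");
		cfq_link_cfqq_cfqg(cfqq, &cfqd->root_group);
	}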

- Got rid of the wait_busy() function in select_queue(). Now I increase the
slice length of a queue by one slice_idle period to give it a chance
to get busy before it gets expired, so that the group does not lose its
share. This has simplified the logic a bit. Thanks again to Corrado for the
idea.
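
Roughly, the idea is (a simplified sketch, not the patch code): when the
queue being considered for expiry is the last one in its group and has no
request pending, give it one more slice_idle period before expiring it:

	/* simplified sketch of the slice extension described above */
	if (RB_EMPTY_ROOT(&cfqq->sort_list) && cfqq->cfqg->nr_cfqq == 1) {
		/* let the group's only queue get busy again before expiry */
		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
	}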

- Introduced a macro "for_each_cfqg_st" to traverse all the service
trees of a group.

- Async workload share is now calculated based on system-wide busy queues
and not just the queues in the root group.

- Allow async queues in the root group to be preempted by sync queues in
other groups.

Changes from V2:
- Made group target latency calculations proportional to group weight
instead of dividing the slice evenly among all the groups.
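For example, with the default 300ms target latency and two groups of weights
1000 and 500, the groups now get roughly 200ms and 100ms of target latency
respectively, instead of 150ms each (see cfq_group_slice() in patch 08).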

- Modified cfq_rb_first() to check "count" and return NULL if service tree
is empty.

- Did some reshuffling in patch order. Moved Documentation patch to the end.
Also moved group idling patch down the order.

- Fixed the "slice_end" issue raised by Gui during slice usage calculation.

Changes from V1:

- Rebased the patches onto the "for-2.6.33" branch.
- Dropped the support for priority classes of groups for now. For the time
being only BE class groups are supported.

After the discussions at the IO mini-summit in Tokyo, Japan, it was agreed
that a single IO control policy, at either the leaf nodes or at higher-level
nodes, does not meet all the requirements. We need the capability to support
more than one IO control policy (like proportional weight division and max
bandwidth control), and also the capability to implement some of these
policies at higher-level logical devices.

It was agreed that CFQ is the right place to implement a time-based
proportional weight division policy. Other policies like max bandwidth
control/throttling will make more sense at higher-level logical devices.

This patchset introduces the blkio cgroup controller. It provides the
management interface for block IO control. The idea is to keep the interface
common and, in the background, be able to switch policies based on user
options. Hence the user can control IO throughout the IO stack with a single
cgroup interface.

Apart from the blkio cgroup interface, this patchset also modifies CFQ to
implement time-based proportional weight division of disk time. CFQ already
does this in flat mode; it has been modified to do group IO scheduling as
well.

IO control is a huge problem, and the moment we start addressing all the
issues in one patchset it bloats to unmanageable proportions and nothing
gets into the kernel. So at the IO mini-summit we agreed to take small steps:
once a piece of code is inside the kernel and has stabilized, take the next
step. This is the first step.

Some parts of the code are based on BFQ patches posted by Paolo and Fabio.

Your feedback is welcome.

TODO
====
- Direct random writers seem to be very fickle in terms of workload
classification. They switch between the sync-idle and sync-noidle
workload types in a somewhat unpredictable manner. Debug and fix this.

- Support async IO control (buffered writes).

Buffered writes are a beast and require changes in many places, which makes
the patchset huge. Hence we plan to first support control of sync IO only
and then work on async IO too.

Some of the work items identified are:

- Per memory cgroup dirty ratio
- Possibly modification of writeback to force writeback from a
particular cgroup.
- Implement IO tracking support so that a bio can be mapped to a cgroup.
- Per group request descriptor infrastructure in block layer.
- At CFQ level, implement per cfq_group async queues.

In this patchset, all async IO goes into system-wide queues and there are
no per-group async queues. That means we will see service differentiation
only for sync IO. Async IO will be handled later.

- Support for higher level policies like max BW controller.
- Support groups of RT class also.

Thanks
Vivek

Documentation/cgroups/blkio-controller.txt | 135 +++++
block/Kconfig | 22 +
block/Kconfig.iosched | 17 +
block/Makefile | 1 +
block/blk-cgroup.c | 312 ++++++++++
block/blk-cgroup.h | 90 +++
block/cfq-iosched.c | 901 +++++++++++++++++++++++++---
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 4 +
9 files changed, 1401 insertions(+), 87 deletions(-)


2009-11-30 03:01:12

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 01/21] blkio: Set must_dispatch only if we decided to not dispatch the request

o The must_dispatch flag should be set only if we decided not to run the
queue and dispatch the request.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a5de31f..9adfa48 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2494,9 +2494,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
cfqd->busy_queues > 1) {
del_timer(&cfqd->idle_slice_timer);
- __blk_run_queue(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
+ __blk_run_queue(cfqd->queue);
+ } else
+ cfq_mark_cfqq_must_dispatch(cfqq);
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
--
1.6.2.5

2009-11-30 03:01:21

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 02/21] blkio: Introduce the notion of cfq groups

o This patch introduces the notion of cfq groups. Soon we will be able to
have multiple groups of different weights in the system.

o Various service trees (priority class and workload type trees) will become
per cfq group, so the hierarchy looks as follows:

cfq_groups
|
workload type
|
cfq queue

o When a scheduling decision has to be taken, we first select the cfq group,
then the workload within the group, and then the cfq queue within the
workload type.

o This patch just makes the various workload service trees per cfq group and
introduces the function to be able to choose a group for scheduling.
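
To illustrate the selection order (only a sketch stitched together from the
helpers touched in this patch, not code taken from the patch itself):

	/* illustration only */
	/* 1. pick the group (only the root group exists so far) */
	cfqd->serving_group = cfq_get_cfqg(cfqd, 0);
	/* 2. pick the workload (priority class + type) within the group */
	choose_service_tree(cfqd, cfqd->serving_group);
	/* 3. pick the cfq queue within that workload's service tree */
	cfqq = cfq_get_next_queue(cfqd);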

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 109 +++++++++++++++++++++++++++++++++++---------------
1 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9adfa48..3baa3f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -132,6 +132,7 @@ struct cfq_queue {

struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
+ struct cfq_group *cfqg;
};

/*
@@ -153,25 +154,30 @@ enum wl_type_t {
SYNC_WORKLOAD = 2
};

+/* This is per cgroup per device grouping structure */
+struct cfq_group {
+ /*
+ * rr lists of queues with requests, onle rr for each priority class.
+ * Counts are embedded in the cfq_rb_root
+ */
+ struct cfq_rb_root service_trees[2][3];
+ struct cfq_rb_root service_tree_idle;
+};

/*
* Per block device queue structure
*/
struct cfq_data {
struct request_queue *queue;
+ struct cfq_group root_group;

/*
- * rr lists of queues with requests, onle rr for each priority class.
- * Counts are embedded in the cfq_rb_root
- */
- struct cfq_rb_root service_trees[2][3];
- struct cfq_rb_root service_tree_idle;
- /*
* The priority currently being served
*/
enum wl_prio_t serving_prio;
enum wl_type_t serving_type;
unsigned long workload_expires;
+ struct cfq_group *serving_group;
bool noidle_tree_requires_idle;

/*
@@ -240,14 +246,15 @@ struct cfq_data {
unsigned long last_end_sync_rq;
};

-static struct cfq_rb_root *service_tree_for(enum wl_prio_t prio,
+static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
+ enum wl_prio_t prio,
enum wl_type_t type,
struct cfq_data *cfqd)
{
if (prio == IDLE_WORKLOAD)
- return &cfqd->service_tree_idle;
+ return &cfqg->service_tree_idle;

- return &cfqd->service_trees[prio][type];
+ return &cfqg->service_trees[prio][type];
}

enum cfqq_state_flags {
@@ -317,12 +324,14 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)

static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd)
{
+ struct cfq_group *cfqg = &cfqd->root_group;
+
if (wl == IDLE_WORKLOAD)
- return cfqd->service_tree_idle.count;
+ return cfqg->service_tree_idle.count;

- return cfqd->service_trees[wl][ASYNC_WORKLOAD].count
- + cfqd->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
- + cfqd->service_trees[wl][SYNC_WORKLOAD].count;
+ return cfqg->service_trees[wl][ASYNC_WORKLOAD].count
+ + cfqg->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
+ + cfqg->service_trees[wl][SYNC_WORKLOAD].count;
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
@@ -611,7 +620,8 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
{
struct cfq_rb_root *service_tree;

- service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd);
+ service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
+ cfqq_type(cfqq), cfqd);

/*
* just an approximation, should be ok.
@@ -634,7 +644,8 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_rb_root *service_tree;
int left;

- service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd);
+ service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
+ cfqq_type(cfqq), cfqd);
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
@@ -1070,7 +1081,8 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_rb_root *service_tree =
- service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
+ service_tree_for(cfqd->serving_group, cfqd->serving_prio,
+ cfqd->serving_type, cfqd);

if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
@@ -1222,7 +1234,8 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
* in their service tree.
*/
if (!service_tree)
- service_tree = service_tree_for(prio, cfqq_type(cfqq), cfqd);
+ service_tree = service_tree_for(cfqq->cfqg, prio,
+ cfqq_type(cfqq), cfqd);

if (service_tree->count == 0)
return true;
@@ -1381,8 +1394,9 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}
}

-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
- bool prio_changed)
+static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, enum wl_prio_t prio,
+ bool prio_changed)
{
struct cfq_queue *queue;
int i;
@@ -1396,10 +1410,10 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
* from SYNC_NOIDLE (first choice), or just SYNC
* over ASYNC
*/
- if (service_tree_for(prio, cur_best, cfqd)->count)
+ if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
return cur_best;
cur_best = SYNC_WORKLOAD;
- if (service_tree_for(prio, cur_best, cfqd)->count)
+ if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
return cur_best;

return ASYNC_WORKLOAD;
@@ -1407,7 +1421,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,

for (i = 0; i < 3; ++i) {
/* otherwise, select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(prio, i, cfqd));
+ queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
if (queue &&
(!key_valid || time_before(queue->rb_key, lowest_key))) {
lowest_key = queue->rb_key;
@@ -1419,12 +1433,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
return cur_best;
}

-static void choose_service_tree(struct cfq_data *cfqd)
+static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
enum wl_prio_t previous_prio = cfqd->serving_prio;
bool prio_changed;
unsigned slice;
unsigned count;
+ struct cfq_rb_root *st;

/* Choose next priority. RT > BE > IDLE */
if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
@@ -1443,8 +1458,9 @@ static void choose_service_tree(struct cfq_data *cfqd)
* expiration time
*/
prio_changed = (cfqd->serving_prio != previous_prio);
- count = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd)
- ->count;
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
+ cfqd);
+ count = st->count;

/*
* If priority didn't change, check workload expiration,
@@ -1456,9 +1472,10 @@ static void choose_service_tree(struct cfq_data *cfqd)

/* otherwise select new workload type */
cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqd->serving_prio, prio_changed);
- count = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd)
- ->count;
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
+ cfqd);
+ count = st->count;

/*
* the workload slice is computed as a fraction of target latency
@@ -1482,6 +1499,12 @@ static void choose_service_tree(struct cfq_data *cfqd)
cfqd->noidle_tree_requires_idle = false;
}

+static void cfq_choose_cfqg(struct cfq_data *cfqd)
+{
+ cfqd->serving_group = &cfqd->root_group;
+ choose_service_tree(cfqd, &cfqd->root_group);
+}
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -1539,7 +1562,7 @@ new_queue:
* service tree
*/
if (!new_cfqq)
- choose_service_tree(cfqd);
+ cfq_choose_cfqg(cfqd);

cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
@@ -1568,13 +1591,15 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
int i, j;
+ struct cfq_group *cfqg = &cfqd->root_group;
+
for (i = 0; i < 2; ++i)
for (j = 0; j < 3; ++j)
- while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j]))
+ while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
!= NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
+ while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
@@ -2045,14 +2070,26 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ cfqq->cfqg = cfqg;
+}
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
+ struct cfq_group *cfqg;

retry:
+ cfqg = cfq_get_cfqg(cfqd, 1);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -2083,6 +2120,7 @@ retry:
if (cfqq) {
cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, ioc);
+ cfq_link_cfqq_cfqg(cfqq, cfqg);
cfq_log_cfqq(cfqd, cfqq, "alloced");
} else
cfqq = &cfqd->oom_cfqq;
@@ -2935,15 +2973,19 @@ static void *cfq_init_queue(struct request_queue *q)
{
struct cfq_data *cfqd;
int i, j;
+ struct cfq_group *cfqg;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
return NULL;

+ /* Init root group */
+ cfqg = &cfqd->root_group;
+
for (i = 0; i < 2; ++i)
for (j = 0; j < 3; ++j)
- cfqd->service_trees[i][j] = CFQ_RB_ROOT;
- cfqd->service_tree_idle = CFQ_RB_ROOT;
+ cfqg->service_trees[i][j] = CFQ_RB_ROOT;
+ cfqg->service_tree_idle = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we
@@ -2960,6 +3002,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
atomic_inc(&cfqd->oom_cfqq.ref);
+ cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);

INIT_LIST_HEAD(&cfqd->cic_list);

--
1.6.2.5

2009-11-30 03:04:28

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 03/21] blkio: Implement macro to traverse each idle tree in group

o Implement a macro to traverse each service tree in the group. This avoids
open-coding the double for loop and the special case for the idle tree in
four places.

o The macro is a little twisted because of the special handling of the idle
class service tree.
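
Expanded, the macro visits the six service_trees[i][j] trees for the two
non-idle priority classes (i = 0, 1) and the three workload types
(j = 0, 1, 2), and then visits service_tree_idle exactly once (i = 2, j = 0);
the ternaries keep st pointing at the right tree for each (i, j) pair.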

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 35 +++++++++++++++++++++--------------
1 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3baa3f4..c73ff44 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -303,6 +303,15 @@ CFQ_CFQQ_FNS(deep);
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

+/* Traverses through cfq group service trees */
+#define for_each_cfqg_st(cfqg, i, j, st) \
+ for (i = 0; i < 3; i++) \
+ for (j = 0, st = i < 2 ? &cfqg->service_trees[i][j] : \
+ &cfqg->service_tree_idle; \
+ (i < 2 && j < 3) || (i == 2 && j < 1); \
+ j++, st = i < 2 ? &cfqg->service_trees[i][j]: NULL) \
+
+
static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
{
if (cfq_class_idle(cfqq))
@@ -565,6 +574,10 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
*/
static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
{
+ /* Service tree is empty */
+ if (!root->count)
+ return NULL;
+
if (!root->left)
root->left = rb_first(&root->rb);

@@ -1592,18 +1605,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
int dispatched = 0;
int i, j;
struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_rb_root *st;

- for (i = 0; i < 2; ++i)
- for (j = 0; j < 3; ++j)
- while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
- != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
-
- while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ for_each_cfqg_st(cfqg, i, j, st) {
+ while ((cfqq = cfq_rb_first(st)) != NULL)
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ }

cfq_slice_expired(cfqd, 0);
-
BUG_ON(cfqd->busy_queues);

cfq_log(cfqd, "forced_dispatch=%d", dispatched);
@@ -2974,6 +2983,7 @@ static void *cfq_init_queue(struct request_queue *q)
struct cfq_data *cfqd;
int i, j;
struct cfq_group *cfqg;
+ struct cfq_rb_root *st;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
@@ -2981,11 +2991,8 @@ static void *cfq_init_queue(struct request_queue *q)

/* Init root group */
cfqg = &cfqd->root_group;
-
- for (i = 0; i < 2; ++i)
- for (j = 0; j < 3; ++j)
- cfqg->service_trees[i][j] = CFQ_RB_ROOT;
- cfqg->service_tree_idle = CFQ_RB_ROOT;
+ for_each_cfqg_st(cfqg, i, j, st)
+ *st = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5

2009-11-30 03:01:24

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 04/21] blkio: Keep queue on service tree until we expire it

o Currently CFQ deletes a queue from the service tree when it becomes empty
(even if we might idle on the queue). This patch keeps the queue, and hence
the associated group, on the service tree until we decide that we are not
going to idle on the queue and expire it.

o This just helps with time accounting for the queue/group and with the
implementation of the rest of the patches.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 70 +++++++++++++++++++++++++++++++++++---------------
1 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73ff44..a0d0a83 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -391,7 +391,7 @@ static int cfq_queue_empty(struct request_queue *q)
{
struct cfq_data *cfqd = q->elevator->elevator_data;

- return !cfqd->busy_queues;
+ return !cfqd->rq_queued;
}

/*
@@ -845,7 +845,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);

BUG_ON(!cfqq->queued[sync]);
@@ -853,8 +852,17 @@ static void cfq_del_rq_rb(struct request *rq)

elv_rb_del(&cfqq->sort_list, rq);

- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list)) {
+ /*
+ * Queue will be deleted from service tree when we actually
+ * expire it later. Right now just remove it from prio tree
+ * as it is empty.
+ */
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}

static void cfq_add_rq_rb(struct request *rq)
@@ -1068,6 +1076,9 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}

+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_del_cfqq_rr(cfqd, cfqq);
+
cfq_resort_rr_list(cfqd, cfqq);

if (cfqq == cfqd->active_queue)
@@ -1097,11 +1108,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
service_tree_for(cfqd->serving_group, cfqd->serving_prio,
cfqd->serving_type, cfqd);

+ if (!cfqd->rq_queued)
+ return NULL;
+
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
return cfq_rb_first(service_tree);
}

+static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_queue *cfqq;
+ int i, j;
+ struct cfq_rb_root *st;
+
+ if (!cfqd->rq_queued)
+ return NULL;
+
+ for_each_cfqg_st(cfqg, i, j, st)
+ if ((cfqq = cfq_rb_first(st)) != NULL)
+ return cfqq;
+ return NULL;
+}
+
/*
* Get and set a new active queue for service.
*/
@@ -1234,6 +1264,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
enum wl_prio_t prio = cfqq_prio(cfqq);
struct cfq_rb_root *service_tree = cfqq->service_tree;

+ BUG_ON(!service_tree);
+ BUG_ON(!service_tree->count);
+
/* We never do for idle class queues. */
if (prio == IDLE_WORKLOAD)
return false;
@@ -1246,14 +1279,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
* Otherwise, we do only if they are the last ones
* in their service tree.
*/
- if (!service_tree)
- service_tree = service_tree_for(cfqq->cfqg, prio,
- cfqq_type(cfqq), cfqd);
-
- if (service_tree->count == 0)
- return true;
-
- return (service_tree->count == 1 && cfq_rb_first(service_tree) == cfqq);
+ return service_tree->count == 1;
}

static void cfq_arm_slice_timer(struct cfq_data *cfqd)
@@ -1530,6 +1556,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
if (!cfqq)
goto new_queue;

+ if (!cfqd->rq_queued)
+ return NULL;
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -1592,6 +1620,9 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
}

BUG_ON(!list_empty(&cfqq->fifo));
+
+ /* By default cfqq is not expired if it is empty. Do it explicitly */
+ __cfq_slice_expired(cfqq->cfqd, cfqq, 0);
return dispatched;
}

@@ -1603,14 +1634,9 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq;
int dispatched = 0;
- int i, j;
- struct cfq_group *cfqg = &cfqd->root_group;
- struct cfq_rb_root *st;

- for_each_cfqg_st(cfqg, i, j, st) {
- while ((cfqq = cfq_rb_first(st)) != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
- }
+ while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL)
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
BUG_ON(cfqd->busy_queues);
@@ -1779,13 +1805,13 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));

if (unlikely(cfqd->active_queue == cfqq)) {
__cfq_slice_expired(cfqd, cfqq, 0);
cfq_schedule_dispatch(cfqd);
}

+ BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
}

@@ -2447,9 +2473,11 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (cfq_class_idle(cfqq))
return true;

+ /* Allow preemption only if we are idling on sync-noidle tree */
if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
- new_cfqq->service_tree->count == 1)
+ new_cfqq->service_tree->count == 2 &&
+ RB_EMPTY_ROOT(&cfqq->sort_list))
return true;

/*
--
1.6.2.5

2009-11-30 03:01:16

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 05/21] blkio: Introduce the root service tree for cfq groups

o So far we just had one cfq_group in cfq_data. To make room for more than
one cfq_group, we need a service tree of groups on which all groups with
backlogged cfq queues can be queued.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 133 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a0d0a83..0a284be 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -77,8 +77,9 @@ struct cfq_rb_root {
struct rb_root rb;
struct rb_node *left;
unsigned count;
+ u64 min_vdisktime;
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, }
+#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

/*
* Per process-grouping structure
@@ -156,6 +157,16 @@ enum wl_type_t {

/* This is per cgroup per device grouping structure */
struct cfq_group {
+ /* group service_tree member */
+ struct rb_node rb_node;
+
+ /* group service_tree key */
+ u64 vdisktime;
+ bool on_st;
+
+ /* number of cfqq currently on this group */
+ int nr_cfqq;
+
/*
* rr lists of queues with requests, onle rr for each priority class.
* Counts are embedded in the cfq_rb_root
@@ -169,6 +180,8 @@ struct cfq_group {
*/
struct cfq_data {
struct request_queue *queue;
+ /* Root service tree for cfq_groups */
+ struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;

/*
@@ -251,6 +264,9 @@ static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_type_t type,
struct cfq_data *cfqd)
{
+ if (!cfqg)
+ return NULL;
+
if (prio == IDLE_WORKLOAD)
return &cfqg->service_tree_idle;

@@ -587,6 +603,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}

+static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+{
+ if (!root->left)
+ root->left = rb_first(&root->rb);
+
+ if (root->left)
+ return rb_entry(root->left, struct cfq_group, rb_node);
+
+ return NULL;
+}
+
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
@@ -643,6 +670,83 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

+static inline s64
+cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ return cfqg->vdisktime - st->min_vdisktime;
+}
+
+static void
+__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ struct rb_node **node = &st->rb.rb_node;
+ struct rb_node *parent = NULL;
+ struct cfq_group *__cfqg;
+ s64 key = cfqg_key(st, cfqg);
+ int left = 1;
+
+ while (*node != NULL) {
+ parent = *node;
+ __cfqg = rb_entry(parent, struct cfq_group, rb_node);
+
+ if (key < cfqg_key(st, __cfqg))
+ node = &parent->rb_left;
+ else {
+ node = &parent->rb_right;
+ left = 0;
+ }
+ }
+
+ if (left)
+ st->left = &cfqg->rb_node;
+
+ rb_link_node(&cfqg->rb_node, parent, node);
+ rb_insert_color(&cfqg->rb_node, &st->rb);
+}
+
+static void
+cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *__cfqg;
+ struct rb_node *n;
+
+ cfqg->nr_cfqq++;
+ if (cfqg->on_st)
+ return;
+
+ /*
+ * Currently put the group at the end. Later implement something
+ * so that groups get lesser vtime based on their weights, so that
+ * if group does not loose all if it was not continously backlogged.
+ */
+ n = rb_last(&st->rb);
+ if (n) {
+ __cfqg = rb_entry(n, struct cfq_group, rb_node);
+ cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ } else
+ cfqg->vdisktime = st->min_vdisktime;
+
+ __cfq_group_service_tree_add(st, cfqg);
+ cfqg->on_st = true;
+}
+
+static void
+cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ BUG_ON(cfqg->nr_cfqq < 1);
+ cfqg->nr_cfqq--;
+ /* If there are other cfq queues under this group, don't delete it */
+ if (cfqg->nr_cfqq)
+ return;
+
+ cfqg->on_st = false;
+ if (!RB_EMPTY_NODE(&cfqg->rb_node))
+ cfq_rb_erase(&cfqg->rb_node, st);
+}
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -725,6 +829,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
+ cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}

static struct cfq_queue *
@@ -835,6 +940,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}

+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
BUG_ON(!cfqd->busy_queues);
cfqd->busy_queues--;
}
@@ -1111,6 +1217,9 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
if (!cfqd->rq_queued)
return NULL;

+ /* There is nothing to dispatch */
+ if (!service_tree)
+ return NULL;
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
return cfq_rb_first(service_tree);
@@ -1480,6 +1589,12 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
unsigned count;
struct cfq_rb_root *st;

+ if (!cfqg) {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return;
+ }
+
/* Choose next priority. RT > BE > IDLE */
if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
cfqd->serving_prio = RT_WORKLOAD;
@@ -1538,10 +1653,21 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
cfqd->noidle_tree_requires_idle = false;
}

+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ if (RB_EMPTY_ROOT(&st->rb))
+ return NULL;
+ return cfq_rb_first_group(st);
+}
+
static void cfq_choose_cfqg(struct cfq_data *cfqd)
{
- cfqd->serving_group = &cfqd->root_group;
- choose_service_tree(cfqd, &cfqd->root_group);
+ struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
+
+ cfqd->serving_group = cfqg;
+ choose_service_tree(cfqd, cfqg);
}

/*
@@ -3017,10 +3143,14 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;

+ /* Init root service tree */
+ cfqd->grp_service_tree = CFQ_RB_ROOT;
+
/* Init root group */
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);

/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5

2009-11-30 03:01:27

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 06/21] blkio: Introduce blkio controller cgroup interface

o This is the basic implementation of the blkio controller cgroup interface.
This is the common interface visible to user space and should be used by the
different IO control policies as we implement them.
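
As a sketch of how an IO policy is expected to plug into this interface (the
my_policy_group names are hypothetical; CFQ does essentially this in a later
patch of the series): embed a struct blkio_group in the policy's per-group
structure, register it with blkiocg_add_blkio_group(), and find it again
later with blkiocg_lookup_group():

	/* hypothetical policy-side group structure */
	struct my_policy_group {
		struct blkio_group blkg;
		unsigned int weight;
	};

	/* caller holds rcu_read_lock() and the relevant queue lock */
	static struct my_policy_group *
	my_policy_get_group(struct cgroup *cgroup, void *key)
	{
		struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
		struct blkio_group *blkg = blkiocg_lookup_group(blkcg, key);
		struct my_policy_group *pg;

		if (blkg)
			return container_of(blkg, struct my_policy_group, blkg);

		pg = kzalloc(sizeof(*pg), GFP_ATOMIC);
		if (!pg)
			return NULL;

		pg->weight = blkcg->weight;	/* user-set cgroup weight */
		blkiocg_add_blkio_group(blkcg, &pg->blkg, key);
		return pg;
	}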

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 13 +++
block/Kconfig.iosched | 1 +
block/Makefile | 1 +
block/blk-cgroup.c | 177 +++++++++++++++++++++++++++++++++++++++++
block/blk-cgroup.h | 58 +++++++++++++
include/linux/cgroup_subsys.h | 6 ++
include/linux/iocontext.h | 4 +
7 files changed, 260 insertions(+), 0 deletions(-)
create mode 100644 block/blk-cgroup.c
create mode 100644 block/blk-cgroup.h

diff --git a/block/Kconfig b/block/Kconfig
index 9be0b56..6ba1a8e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection. If in doubt, say N.

+config BLK_CGROUP
+ bool
+ depends on CGROUPS
+ default n
+ ---help---
+ Generic block IO controller cgroup interface. This is the common
+ cgroup interface which should be used by various IO controlling
+ policies.
+
+ Currently, CFQ IO scheduler uses it to recognize task groups and
+ control disk bandwidth allocation (proportional time slice allocation)
+ to such task groups.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8bd1051..be0280d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,6 +23,7 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select BLK_CGROUP
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/Makefile b/block/Makefile
index 7914108..cb2d515 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
+obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
new file mode 100644
index 0000000..4f6afd7
--- /dev/null
+++ b/block/blk-cgroup.c
@@ -0,0 +1,177 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+#include <linux/ioprio.h>
+#include "blk-cgroup.h"
+
+struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkcg->lock, flags);
+ rcu_assign_pointer(blkg->key, key);
+ hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+}
+
+int blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ /* Implemented later */
+ return 0;
+}
+
+/* called under rcu_read_lock(). */
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
+{
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ __key = blkg->key;
+ if (__key == key)
+ return blkg;
+ }
+
+ return NULL;
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ return (u64)blkcg->__VAR; \
+}
+
+SHOW_FUNCTION(weight);
+#undef SHOW_FUNCTION
+
+static int
+blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->weight = (unsigned int)val;
+ return 0;
+}
+
+struct cftype blkio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = blkiocg_weight_read,
+ .write_u64 = blkiocg_weight_write,
+ },
+};
+
+static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+ free_css_id(&blkio_subsys, &blkcg->css);
+ kfree(blkcg);
+}
+
+static struct cgroup_subsys_state *
+blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg, *parent_blkcg;
+
+ if (!cgroup->parent) {
+ blkcg = &blkio_root_cgroup;
+ goto done;
+ }
+
+ /* Currently we do not support hierarchy deeper than two level (0,1) */
+ parent_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+ if (css_depth(&parent_blkcg->css) > 0)
+ return ERR_PTR(-EINVAL);
+
+ blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
+ if (!blkcg)
+ return ERR_PTR(-ENOMEM);
+
+ blkcg->weight = BLKIO_WEIGHT_DEFAULT;
+done:
+ spin_lock_init(&blkcg->lock);
+ INIT_HLIST_HEAD(&blkcg->blkg_list);
+
+ return &blkcg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures. For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc.
+ */
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+struct cgroup_subsys blkio_subsys = {
+ .name = "blkio",
+ .create = blkiocg_create,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
+ .destroy = blkiocg_destroy,
+ .populate = blkiocg_populate,
+ .subsys_id = blkio_subsys_id,
+ .use_id = 1,
+};
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
new file mode 100644
index 0000000..ba5703f
--- /dev/null
+++ b/block/blk-cgroup.h
@@ -0,0 +1,58 @@
+#ifndef _BLK_CGROUP_H
+#define _BLK_CGROUP_H
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#include <linux/cgroup.h>
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ unsigned int weight;
+ spinlock_t lock;
+ struct hlist_head blkg_list;
+};
+
+struct blkio_group {
+ /* An rcu protected unique identifier for the group */
+ void *key;
+ struct hlist_node blkcg_node;
+};
+
+#define BLKIO_WEIGHT_MIN 100
+#define BLKIO_WEIGHT_MAX 1000
+#define BLKIO_WEIGHT_DEFAULT 500
+
+#ifdef CONFIG_BLK_CGROUP
+extern struct blkio_cgroup blkio_root_cgroup;
+extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key);
+extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
+extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
+ void *key);
+#else
+static inline struct blkio_cgroup *
+cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
+
+static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+}
+
+static inline int
+blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
+
+static inline struct blkio_group *
+blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
+#endif
+#endif /* _BLK_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..ccefff0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
#endif

/* */
+
+#ifdef CONFIG_BLK_CGROUP
+SUBSYS(blkio)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index eb73632..d61b0b8 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -68,6 +68,10 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;

+#ifdef CONFIG_BLK_CGROUP
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
--
1.6.2.5

2009-11-30 03:02:21

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 07/21] blkio: Introduce per cfq group weights and vdisktime calculations

o Bring in the per cfq group weight and the way vdisktime is calculated for
the group. Also bring in the functionality for updating the min_vdisktime of
the group service tree.
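
For example, with CFQ_SERVICE_SHIFT = 12 and BLKIO_WEIGHT_DEFAULT = 500 (see
cfq_scale_slice() below), a group of weight 1000 that used 120 jiffies of
disk time is charged (120 << 12) * 500 / 1000 = 60 << 12 of vdisktime, i.e.
half of what a default-weight group would be charged for the same usage, so
it becomes eligible to be served again sooner.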

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig.iosched | 9 ++++++-
block/cfq-iosched.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index be0280d..fa95fa7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,7 +23,6 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
- select BLK_CGROUP
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
@@ -33,6 +32,14 @@ config IOSCHED_CFQ

This is the default I/O scheduler.

+config CFQ_GROUP_IOSCHED
+ bool "CFQ Group Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select BLK_CGROUP
+ default n
+ ---help---
+ Enable group IO scheduling in CFQ.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0a284be..fbb6bf5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -13,6 +13,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include "blk-cgroup.h"

/*
* tunables
@@ -49,6 +50,7 @@ static const int cfq_hist_divisor = 4;

#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)
+#define CFQ_SERVICE_SHIFT 12

#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
@@ -78,6 +80,7 @@ struct cfq_rb_root {
struct rb_node *left;
unsigned count;
u64 min_vdisktime;
+ struct rb_node *active;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

@@ -162,6 +165,7 @@ struct cfq_group {

/* group service_tree key */
u64 vdisktime;
+ unsigned int weight;
bool on_st;

/* number of cfqq currently on this group */
@@ -431,6 +435,51 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

+static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+{
+ u64 d = delta << CFQ_SERVICE_SHIFT;
+
+ d = d * BLKIO_WEIGHT_DEFAULT;
+ do_div(d, cfqg->weight);
+ return d;
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct cfq_rb_root *st)
+{
+ u64 vdisktime = st->min_vdisktime;
+ struct cfq_group *cfqg;
+
+ if (st->active) {
+ cfqg = rb_entry(st->active, struct cfq_group, rb_node);
+ vdisktime = cfqg->vdisktime;
+ }
+
+ if (st->left) {
+ cfqg = rb_entry(st->left, struct cfq_group, rb_node);
+ vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
/*
* get averaged number of queues of RT/BE priority.
* average is updated, with a formula that gives more weight to higher numbers,
@@ -736,8 +785,12 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;

+ if (st->active == &cfqg->rb_node)
+ st->active = NULL;
+
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
+
/* If there are other cfq queues under this group, don't delete it */
if (cfqg->nr_cfqq)
return;
@@ -1656,10 +1709,14 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *cfqg;

if (RB_EMPTY_ROOT(&st->rb))
return NULL;
- return cfq_rb_first_group(st);
+ cfqg = cfq_rb_first_group(st);
+ st->active = &cfqg->rb_node;
+ update_min_vdisktime(st);
+ return cfqg;
}

static void cfq_choose_cfqg(struct cfq_data *cfqd)
@@ -3152,6 +3209,9 @@ static void *cfq_init_queue(struct request_queue *q)
*st = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);

+ /* Give preference to root group over other groups */
+ cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
--
1.6.2.5

2009-11-30 03:03:40

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 08/21] blkio: Implement per cfq group latency target and busy queue avg

o So far we had a 300ms soft target latency system wide. Now, with the
introduction of cfq groups, divide that latency among the groups in
proportion to their weights to come up with a per-group target latency. This
is helpful in determining the workload slice within the group and also the
dynamic slice length of the cfq queue.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 65 +++++++++++++++++++++++++++++++++++---------------
1 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fbb6bf5..6466393 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -81,6 +81,7 @@ struct cfq_rb_root {
unsigned count;
u64 min_vdisktime;
struct rb_node *active;
+ unsigned total_weight;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

@@ -171,6 +172,8 @@ struct cfq_group {
/* number of cfqq currently on this group */
int nr_cfqq;

+ /* Per group busy queus average. Useful for workload slice calc. */
+ unsigned int busy_queues_avg[2];
/*
* rr lists of queues with requests, onle rr for each priority class.
* Counts are embedded in the cfq_rb_root
@@ -187,6 +190,8 @@ struct cfq_data {
/* Root service tree for cfq_groups */
struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;
+ /* Number of active cfq groups on group service tree */
+ int nr_groups;

/*
* The priority currently being served
@@ -205,7 +210,6 @@ struct cfq_data {
struct rb_root prio_trees[CFQ_PRIO_LISTS];

unsigned int busy_queues;
- unsigned int busy_queues_avg[2];

int rq_in_driver[2];
int sync_flight;
@@ -351,10 +355,10 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)
return SYNC_WORKLOAD;
}

-static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd)
+static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+ struct cfq_data *cfqd,
+ struct cfq_group *cfqg)
{
- struct cfq_group *cfqg = &cfqd->root_group;
-
if (wl == IDLE_WORKLOAD)
return cfqg->service_tree_idle.count;

@@ -486,18 +490,27 @@ static void update_min_vdisktime(struct cfq_rb_root *st)
* to quickly follows sudden increases and decrease slowly
*/

-static inline unsigned cfq_get_avg_queues(struct cfq_data *cfqd, bool rt)
+static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, bool rt)
{
unsigned min_q, max_q;
unsigned mult = cfq_hist_divisor - 1;
unsigned round = cfq_hist_divisor / 2;
- unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+ unsigned busy = cfq_group_busy_queues_wl(rt, cfqd, cfqg);

- min_q = min(cfqd->busy_queues_avg[rt], busy);
- max_q = max(cfqd->busy_queues_avg[rt], busy);
- cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+ min_q = min(cfqg->busy_queues_avg[rt], busy);
+ max_q = max(cfqg->busy_queues_avg[rt], busy);
+ cfqg->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
cfq_hist_divisor;
- return cfqd->busy_queues_avg[rt];
+ return cfqg->busy_queues_avg[rt];
+}
+
+static inline unsigned
+cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ return cfq_target_latency * cfqg->weight / st->total_weight;
}

static inline void
@@ -505,12 +518,17 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
if (cfqd->cfq_latency) {
- /* interested queues (we consider only the ones with the same
- * priority class) */
- unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
+ /*
+ * interested queues (we consider only the ones with the same
+ * priority class in the cfq group)
+ */
+ unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
+ cfq_class_rt(cfqq));
unsigned sync_slice = cfqd->cfq_slice[1];
unsigned expect_latency = sync_slice * iq;
- if (expect_latency > cfq_target_latency) {
+ unsigned group_slice = cfq_group_slice(cfqd, cfqq->cfqg);
+
+ if (expect_latency > group_slice) {
unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
/* scale low_slice according to IO priority
* and sync vs async */
@@ -518,7 +536,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
min(slice, base_low_slice * slice / sync_slice);
/* the adapted slice value is scaled to fit all iqs
* into the target latency */
- slice = max(slice * cfq_target_latency / expect_latency,
+ slice = max(slice * group_slice / expect_latency,
low_slice);
}
}
@@ -778,6 +796,8 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)

__cfq_group_service_tree_add(st, cfqg);
cfqg->on_st = true;
+ cfqd->nr_groups++;
+ st->total_weight += cfqg->weight;
}

static void
@@ -796,6 +816,8 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;

cfqg->on_st = false;
+ cfqd->nr_groups--;
+ st->total_weight -= cfqg->weight;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
}
@@ -1641,6 +1663,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
unsigned slice;
unsigned count;
struct cfq_rb_root *st;
+ unsigned group_slice;

if (!cfqg) {
cfqd->serving_prio = IDLE_WORKLOAD;
@@ -1649,9 +1672,9 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
}

/* Choose next priority. RT > BE > IDLE */
- if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
+ if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
cfqd->serving_prio = RT_WORKLOAD;
- else if (cfq_busy_queues_wl(BE_WORKLOAD, cfqd))
+ else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
cfqd->serving_prio = BE_WORKLOAD;
else {
cfqd->serving_prio = IDLE_WORKLOAD;
@@ -1689,9 +1712,11 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
* proportional to the number of queues in that workload, over
* all the queues in the same priority class
*/
- slice = cfq_target_latency * count /
- max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
- cfq_busy_queues_wl(cfqd->serving_prio, cfqd));
+ group_slice = cfq_group_slice(cfqd, cfqg);
+
+ slice = group_slice * count /
+ max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
+ cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

if (cfqd->serving_type == ASYNC_WORKLOAD)
/* async workload slice is scaled down according to
--
1.6.2.5

2009-11-30 03:03:56

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 09/21] blkio: Group time used accounting and workload context save restore

o This patch introduces the functionality to do the accounting of group time
when a queue expires. The time used decides which group goes next.

o Also introduce the functionality to save and restore the workload type
context within a group. Once we expire the cfq queue and group, a different
group may be scheduled in and we would lose the context of the workload
type. Hence save it upon queue expiry and restore it when the group is
selected again.
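
For example, if group A's queue is expired while A's sync-noidle workload
still has 40ms of its workload slice left, the remaining slice together with
the serving workload type and priority is stored in A's cfq_group
(saved_workload_slice, saved_workload, saved_serving_prio). When A is
selected again, cfq_choose_cfqg() restores them into cfqd->workload_expires,
cfqd->serving_type and cfqd->serving_prio, so choose_service_tree() continues
the interrupted workload instead of starting a new one.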

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6466393..5a75f81 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -114,6 +114,10 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;

+ /* time when queue got scheduled in to dispatch first request. */
+ unsigned long dispatch_start;
+ /* time when first request from queue completed and slice started. */
+ unsigned long slice_start;
unsigned long slice_end;
long slice_resid;
unsigned int slice_dispatch;
@@ -180,6 +184,10 @@ struct cfq_group {
*/
struct cfq_rb_root service_trees[2][3];
struct cfq_rb_root service_tree_idle;
+
+ unsigned long saved_workload_slice;
+ enum wl_type_t saved_workload;
+ enum wl_prio_t saved_serving_prio;
};

/*
@@ -540,6 +548,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
low_slice);
}
}
+ cfqq->slice_start = jiffies;
cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
@@ -820,6 +829,58 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
st->total_weight -= cfqg->weight;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
+ cfqg->saved_workload_slice = 0;
+}
+
+static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
+{
+ unsigned int slice_used, allocated_slice;
+
+ /*
+ * Queue got expired before even a single request completed or
+ * got expired immediately after first request completion.
+ */
+ if (!cfqq->slice_start || cfqq->slice_start == jiffies) {
+ /*
+ * Also charge the seek time incurred to the group, otherwise
+ * if there are mutiple queues in the group, each can dispatch
+ * a single request on seeky media and cause lots of seek time
+ * and group will never know it.
+ */
+ slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
+ 1);
+ } else {
+ slice_used = jiffies - cfqq->slice_start;
+ allocated_slice = cfqq->slice_end - cfqq->slice_start;
+ if (slice_used > allocated_slice)
+ slice_used = allocated_slice;
+ }
+
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+ return slice_used;
+}
+
+static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ struct cfq_queue *cfqq)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ unsigned int used_sl;
+
+ used_sl = cfq_cfqq_slice_usage(cfqq);
+
+ /* Can't update vdisktime while group is on service tree */
+ cfq_rb_erase(&cfqg->rb_node, st);
+ cfqg->vdisktime += cfq_scale_slice(used_sl, cfqg);
+ __cfq_group_service_tree_add(st, cfqg);
+
+ /* This group is being expired. Save the context */
+ if (time_after(cfqd->workload_expires, jiffies)) {
+ cfqg->saved_workload_slice = cfqd->workload_expires
+ - jiffies;
+ cfqg->saved_workload = cfqd->serving_type;
+ cfqg->saved_serving_prio = cfqd->serving_prio;
+ } else
+ cfqg->saved_workload_slice = 0;
}

/*
@@ -835,6 +896,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
unsigned long rb_key;
struct cfq_rb_root *service_tree;
int left;
+ int new_cfqq = 1;

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq), cfqd);
@@ -863,6 +925,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}

if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ new_cfqq = 0;
/*
* same position, nothing more to do
*/
@@ -904,6 +967,8 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
+ if (add_front || !new_cfqq)
+ return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}

@@ -1220,6 +1285,8 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
{
if (cfqq) {
cfq_log_cfqq(cfqd, cfqq, "set_active");
+ cfqq->slice_start = 0;
+ cfqq->dispatch_start = jiffies;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;

@@ -1257,6 +1324,8 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}

+ cfq_group_served(cfqd, cfqq->cfqg, cfqq);
+
if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
cfq_del_cfqq_rr(cfqd, cfqq);

@@ -1265,6 +1334,9 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfqq == cfqd->active_queue)
cfqd->active_queue = NULL;

+ if (&cfqq->cfqg->rb_node == cfqd->grp_service_tree.active)
+ cfqd->grp_service_tree.active = NULL;
+
if (cfqd->active_cic) {
put_io_context(cfqd->active_cic->ioc);
cfqd->active_cic = NULL;
@@ -1749,6 +1821,13 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);

cfqd->serving_group = cfqg;
+
+ /* Restore the workload type data */
+ if (cfqg->saved_workload_slice) {
+ cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
+ cfqd->serving_type = cfqg->saved_workload;
+ cfqd->serving_prio = cfqg->saved_serving_prio;
+ }
choose_service_tree(cfqd, cfqg);
}

--
1.6.2.5

2009-11-30 03:00:49

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 10/21] blkio: Dynamic cfq group creation based on cgroup tasks belongs to

o Determine the cgroup the IO-submitting task belongs to and create the cfq
group if it does not already exist.

o Also link the cfqq with its associated cfq group.

o Currently all async IO is mapped to the root group.
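
For illustration, a minimal user-space sketch that exercises the dynamic group
creation (paths and file names are examples only; it assumes
CONFIG_CFQ_GROUP_IOSCHED=y and the blkio controller mounted at /cgroup):

  mount -t cgroup -o blkio none /cgroup
  mkdir /cgroup/test1
  # Put the current shell into the new cgroup
  echo $$ > /cgroup/test1/tasks
  # The first sync IO submitted from this cgroup allocates a cfq group for it
  dd if=/mnt/sdb/zerofile1 of=/dev/null bs=1M count=16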

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 100 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5a75f81..3706f19 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -188,6 +188,10 @@ struct cfq_group {
unsigned long saved_workload_slice;
enum wl_type_t saved_workload;
enum wl_prio_t saved_serving_prio;
+ struct blkio_group blkg;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ struct hlist_node cfqd_node;
+#endif
};

/*
@@ -273,8 +277,13 @@ struct cfq_data {
struct cfq_queue oom_cfqq;

unsigned long last_end_sync_rq;
+
+ /* List of cfq groups being managed on this device*/
+ struct hlist_head cfqg_list;
};

+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
enum wl_type_t type,
@@ -883,6 +892,89 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_workload_slice = 0;
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
+{
+ if (blkg)
+ return container_of(blkg, struct cfq_group, blkg);
+ return NULL;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+ int i, j;
+ struct cfq_rb_root *st;
+
+ /* Do we need to take this reference */
+ if (!css_tryget(&blkcg->css))
+ return NULL;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg || !create)
+ goto done;
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+ if (!cfqg)
+ goto done;
+
+ cfqg->weight = blkcg->weight;
+ for_each_cfqg_st(cfqg, i, j, st)
+ *st = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);
+
+ /* Add group onto cgroup list */
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+
+done:
+ css_put(&blkcg->css);
+ return cfqg;
+}
+
+/*
+ * Search for the cfq group current task belongs to. If create = 1, then also
+ * create the cfq group if it does not exist. request_queue lock must be held.
+ */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ struct cgroup *cgroup;
+ struct cfq_group *cfqg = NULL;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
+ if (!cfqg && create)
+ cfqg = &cfqd->root_group;
+ rcu_read_unlock();
+ return cfqg;
+}
+
+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfqq->cfqg = cfqg;
+}
+#else /* GROUP_IOSCHED */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
+static inline void
+cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
+ cfqq->cfqg = cfqg;
+}
+
+#endif /* GROUP_IOSCHED */
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -1374,7 +1466,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)

static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
{
- struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_group *cfqg;
struct cfq_queue *cfqq;
int i, j;
struct cfq_rb_root *st;
@@ -1382,6 +1474,10 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
if (!cfqd->rq_queued)
return NULL;

+ cfqg = cfq_get_next_cfqg(cfqd);
+ if (!cfqg)
+ return NULL;
+
for_each_cfqg_st(cfqg, i, j, st)
if ((cfqq = cfq_rb_first(st)) != NULL)
return cfqq;
@@ -2392,16 +2488,6 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

-static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
-{
- cfqq->cfqg = cfqg;
-}
-
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
-{
- return &cfqd->root_group;
-}
-
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -3316,6 +3402,9 @@ static void *cfq_init_queue(struct request_queue *q)
/* Give preference to root group over other groups */
cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+#endif
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
--
1.6.2.5

2009-11-30 03:01:09

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 11/21] blkio: Take care of cgroup deletion and cfq group reference counting

o One can choose to change elevator or delete a cgroup. Implement group
reference counting so that both elevator exit and cgroup deletion can
take place gracefully.

Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
---
block/blk-cgroup.c | 66 ++++++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 1 +
block/cfq-iosched.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 160 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4f6afd7..0426ab6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -13,6 +13,8 @@
#include <linux/ioprio.h>
#include "blk-cgroup.h"

+extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
+
struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };

struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
@@ -28,14 +30,43 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,

spin_lock_irqsave(&blkcg->lock, flags);
rcu_assign_pointer(blkg->key, key);
+ blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
}

+static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ hlist_del_init_rcu(&blkg->blkcg_node);
+ blkg->blkcg_id = 0;
+}
+
+/*
+ * returns 0 if blkio_group was still on cgroup list. Otherwise returns 1
+ * indicating that blk_group was unhashed by the time we got to it.
+ */
int blkiocg_del_blkio_group(struct blkio_group *blkg)
{
- /* Implemented later */
- return 0;
+ struct blkio_cgroup *blkcg;
+ unsigned long flags;
+ struct cgroup_subsys_state *css;
+ int ret = 1;
+
+ rcu_read_lock();
+ css = css_lookup(&blkio_subsys, blkg->blkcg_id);
+ if (!css)
+ goto out;
+
+ blkcg = container_of(css, struct blkio_cgroup, css);
+ spin_lock_irqsave(&blkcg->lock, flags);
+ if (!hlist_unhashed(&blkg->blkcg_node)) {
+ __blkiocg_del_blkio_group(blkg);
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+out:
+ rcu_read_unlock();
+ return ret;
}

/* called under rcu_read_lock(). */
@@ -97,8 +128,39 @@ static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ unsigned long flags;
+ struct blkio_group *blkg;
+ void *key;

+ rcu_read_lock();
+remove_entry:
+ spin_lock_irqsave(&blkcg->lock, flags);
+
+ if (hlist_empty(&blkcg->blkg_list)) {
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+ goto done;
+ }
+
+ blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
+ blkcg_node);
+ key = rcu_dereference(blkg->key);
+ __blkiocg_del_blkio_group(blkg);
+
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+
+ /*
+ * This blkio_group is being unlinked as associated cgroup is going
+ * away. Let all the IO controlling policies know about this event.
+ *
+ * Currently this is static call to one io controlling policy. Once
+ * we have more policies in place, we need some dynamic registration
+ * of callback function.
+ */
+ cfq_unlink_blkio_group(key, blkg);
+ goto remove_entry;
+done:
free_css_id(&blkio_subsys, &blkcg->css);
+ rcu_read_unlock();
kfree(blkcg);
}

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ba5703f..cd50a2f 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -26,6 +26,7 @@ struct blkio_group {
/* An rcu protected unique identifier for the group */
void *key;
struct hlist_node blkcg_node;
+ unsigned short blkcg_id;
};

#define BLKIO_WEIGHT_MIN 100
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3706f19..af3d569 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -191,6 +191,7 @@ struct cfq_group {
struct blkio_group blkg;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct hlist_node cfqd_node;
+ atomic_t ref;
#endif
};

@@ -926,6 +927,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
*st = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);

+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of as a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ atomic_set(&cfqg->ref, 1);
+
/* Add group onto cgroup list */
blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);

@@ -962,7 +971,77 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
cfqg = &cfqq->cfqd->root_group;

cfqq->cfqg = cfqg;
+ /* cfqq reference on cfqg */
+ atomic_inc(&cfqq->cfqg->ref);
+}
+
+static void cfq_put_cfqg(struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st;
+ int i, j;
+
+ BUG_ON(atomic_read(&cfqg->ref) <= 0);
+ if (!atomic_dec_and_test(&cfqg->ref))
+ return;
+ for_each_cfqg_st(cfqg, i, j, st)
+ BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
+ kfree(cfqg);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ /* Something wrong if we are trying to remove same group twice */
+ BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
+
+ hlist_del_init(&cfqg->cfqd_node);
+
+ /*
+ * Put the reference taken at the time of creation so that when all
+ * queues are gone, group can be destroyed.
+ */
+ cfq_put_cfqg(cfqg);
+}
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd)
+{
+ struct hlist_node *pos, *n;
+ struct cfq_group *cfqg;
+
+ hlist_for_each_entry_safe(cfqg, pos, n, &cfqd->cfqg_list, cfqd_node) {
+ /*
+ * If cgroup removal path got to blk_group first and removed
+ * it from cgroup list, then it will take care of destroying
+ * cfqg also.
+ */
+ if (!blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+ }
}
+
+/*
+ * Blk cgroup controller notification saying that blkio_group object is being
+ * delinked as associated cgroup object is going away. That also means that
+ * no new IO will come in this group. So get rid of this group as soon as
+ * any pending IO in the group is finished.
+ *
+ * This function is called under rcu_read_lock(). key is the rcu protected
+ * pointer. That means "key" is a valid cfq_data pointer as long as we are
+ * holding the rcu read lock.
+ *
+ * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * it should not be NULL as even if the elevator was exiting, the cgroup
+ * deletion path got to it first.
+ */
+void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
+{
+ unsigned long flags;
+ struct cfq_data *cfqd = key;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+}
+
#else /* GROUP_IOSCHED */
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
@@ -973,6 +1052,9 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
cfqq->cfqg = cfqg;
}

+static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
+static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+
#endif /* GROUP_IOSCHED */

/*
@@ -2174,11 +2256,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
* task holds one reference to the queue, dropped when task exits. each rq
* in-flight on this queue also holds a reference, dropped when rq is freed.
*
+ * Each cfq queue took a reference on the parent group. Drop it now.
* queue lock must be held here.
*/
static void cfq_put_queue(struct cfq_queue *cfqq)
{
struct cfq_data *cfqd = cfqq->cfqd;
+ struct cfq_group *cfqg;

BUG_ON(atomic_read(&cfqq->ref) <= 0);

@@ -2188,6 +2272,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
+ cfqg = cfqq->cfqg;

if (unlikely(cfqd->active_queue == cfqq)) {
__cfq_slice_expired(cfqd, cfqq, 0);
@@ -2196,6 +2281,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)

BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
+ cfq_put_cfqg(cfqg);
}

/*
@@ -3371,11 +3457,15 @@ static void cfq_exit_queue(struct elevator_queue *e)
}

cfq_put_async_queues(cfqd);
+ cfq_release_cfq_groups(cfqd);
+ blkiocg_del_blkio_group(&cfqd->root_group.blkg);

spin_unlock_irq(q->queue_lock);

cfq_shutdown_timer_wq(cfqd);

+ /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
+ synchronize_rcu();
kfree(cfqd);
}

@@ -3403,6 +3493,11 @@ static void *cfq_init_queue(struct request_queue *q)
cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;

#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ /*
+ * Take a reference to root group which we never drop. This is just
+ * to make sure that cfq_put_cfqg() does not try to kfree root group
+ */
+ atomic_set(&cfqg->ref, 1);
blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
#endif
/*
--
1.6.2.5

2009-11-30 03:04:26

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 12/21] blkio: Some debugging aids for CFQ

o Some debugging aids for CFQ.
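
One rough way to see the extra information in practice (an illustrative
invocation only; it assumes /dev/sdb is the device under test and that the
blktrace/blkparse utilities are available):

  # With CONFIG_DEBUG_CFQ_IOSCHED=y, cfq's blktrace messages also carry the
  # cgroup path of the group the queue belongs to.
  blktrace -d /dev/sdb -o - | blkparse -i - | grep cfq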

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 9 +++++++++
block/Kconfig.iosched | 9 +++++++++
block/blk-cgroup.c | 4 ++++
block/blk-cgroup.h | 13 +++++++++++++
block/cfq-iosched.c | 19 ++++++++++++++++++-
5 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 6ba1a8e..e20fbde 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -90,6 +90,15 @@ config BLK_CGROUP
control disk bandwidth allocation (proportional time slice allocation)
to such task groups.

+config DEBUG_BLK_CGROUP
+ bool
+ depends on BLK_CGROUP
+ default n
+ ---help---
+ Enable some debugging help. Currently it stores the cgroup path
+ in the blk group which can be used by cfq for tracing various
+ group related activity.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fa95fa7..b71abfb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,15 @@ config CFQ_GROUP_IOSCHED
---help---
Enable group IO scheduling in CFQ.

+config DEBUG_CFQ_IOSCHED
+ bool "Debug CFQ Scheduling"
+ depends on CFQ_GROUP_IOSCHED
+ select DEBUG_BLK_CGROUP
+ default n
+ ---help---
+ Enable CFQ IO scheduling debugging in CFQ. Currently it makes
+ blktrace output more verbose.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0426ab6..6bc99a3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -33,6 +33,10 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Need to take css reference ? */
+ cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
+#endif
}

static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index cd50a2f..3573199 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -27,12 +27,25 @@ struct blkio_group {
void *key;
struct hlist_node blkcg_node;
unsigned short blkcg_id;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Store cgroup path */
+ char path[128];
+#endif
};

#define BLKIO_WEIGHT_MIN 100
#define BLKIO_WEIGHT_MAX 1000
#define BLKIO_WEIGHT_DEFAULT 500

+#ifdef CONFIG_DEBUG_BLK_CGROUP
+static inline char *blkg_path(struct blkio_group *blkg)
+{
+ return blkg->path;
+}
+#else
+static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+#endif
+
#ifdef CONFIG_BLK_CGROUP
extern struct blkio_cgroup blkio_root_cgroup;
extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index af3d569..62148ee 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -340,8 +340,21 @@ CFQ_CFQQ_FNS(coop);
CFQ_CFQQ_FNS(deep);
#undef CFQ_CFQQ_FNS

+#ifdef CONFIG_DEBUG_CFQ_IOSCHED
+#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
+ cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&(cfqq)->cfqg->blkg), ##args);
+
+#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "%s " fmt, \
+ blkg_path(&(cfqg)->blkg), ##args); \
+
+#else
#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do {} while (0);
+#endif
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

@@ -834,6 +847,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
if (cfqg->nr_cfqq)
return;

+ cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
cfqg->on_st = false;
cfqd->nr_groups--;
st->total_weight -= cfqg->weight;
@@ -891,6 +905,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_serving_prio = cfqd->serving_prio;
} else
cfqg->saved_workload_slice = 0;
+
+ cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
+ st->min_vdisktime);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -3104,7 +3121,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
unsigned long now;

now = jiffies;
- cfq_log_cfqq(cfqd, cfqq, "complete");
+ cfq_log_cfqq(cfqd, cfqq, "complete rqnoidle %d", !!rq_noidle(rq));

cfq_update_hw_tag(cfqd);

--
1.6.2.5

2009-11-30 03:01:06

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 13/21] blkio: Export disk time and sectors used by a group to user space

o Export the disk time and sectors used by a group to user space through the
cgroup interface.

o Also export a "dequeue" interface to cgroup which keeps track of how many
times a group was deleted from the service tree. Helps in debugging.
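
For example, the new files can be read per group like this (a sketch only,
assuming the blkio controller is mounted at /cgroup and a group test1 exists;
each line of output is of the form "<major>:<minor> <value>"):

  cat /cgroup/test1/blkio.time      # disk time allocated, in milliseconds
  cat /cgroup/test1/blkio.sectors   # sectors transferred to/from the disk
  cat /cgroup/test1/blkio.dequeue   # only present with CONFIG_DEBUG_CFQ_IOSCHED=y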

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 22 ++++++++++++++++-
block/cfq-iosched.c | 19 ++++++++++++--
3 files changed, 99 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6bc99a3..4ef78d3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -11,6 +11,8 @@
* Nauman Rafique <[email protected]>
*/
#include <linux/ioprio.h>
+#include <linux/seq_file.h>
+#include <linux/kdev_t.h>
#include "blk-cgroup.h"

extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
@@ -23,8 +25,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
struct blkio_cgroup, css);
}

+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+ blkg->time += time;
+ blkg->sectors += sectors;
+}
+
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
{
unsigned long flags;

@@ -37,6 +46,7 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
/* Need to take css reference ? */
cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
#endif
+ blkg->dev = dev;
}

static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
@@ -115,12 +125,64 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
return 0;
}

+#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype, struct seq_file *m) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ struct blkio_group *blkg; \
+ struct hlist_node *n; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
+ if (blkg->dev) \
+ seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
+ MINOR(blkg->dev), blkg->__VAR); \
+ } \
+ rcu_read_unlock(); \
+ cgroup_unlock(); \
+ return 0; \
+}
+
+SHOW_FUNCTION_PER_GROUP(time);
+SHOW_FUNCTION_PER_GROUP(sectors);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+SHOW_FUNCTION_PER_GROUP(dequeue);
+#endif
+#undef SHOW_FUNCTION_PER_GROUP
+
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue)
+{
+ blkg->dequeue += dequeue;
+}
+#endif
+
struct cftype blkio_files[] = {
{
.name = "weight",
.read_u64 = blkiocg_weight_read,
.write_u64 = blkiocg_weight_write,
},
+ {
+ .name = "time",
+ .read_seq_string = blkiocg_time_read,
+ },
+ {
+ .name = "sectors",
+ .read_seq_string = blkiocg_sectors_read,
+ },
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ {
+ .name = "dequeue",
+ .read_seq_string = blkiocg_dequeue_read,
+ },
+#endif
};

static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 3573199..b24ab71 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -30,7 +30,15 @@ struct blkio_group {
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Store cgroup path */
char path[128];
+ /* How many times this group has been removed from service tree */
+ unsigned long dequeue;
#endif
+ /* The device MKDEV(major, minor), this group has been created for */
+ dev_t dev;
+
+ /* total disk time and nr sectors dispatched by this group */
+ unsigned long time;
+ unsigned long sectors;
};

#define BLKIO_WEIGHT_MIN 100
@@ -42,24 +50,30 @@ static inline char *blkg_path(struct blkio_group *blkg)
{
return blkg->path;
}
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue);
#else
static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+static inline void blkiocg_update_blkio_group_dequeue_stats(
+ struct blkio_group *blkg, unsigned long dequeue) {}
#endif

#ifdef CONFIG_BLK_CGROUP
extern struct blkio_cgroup blkio_root_cgroup;
extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key);
+ struct blkio_group *blkg, void *key, dev_t dev);
extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors);
#else
static inline struct blkio_cgroup *
cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }

static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
{
}

@@ -68,5 +82,9 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }

static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
+static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 62148ee..8abc25b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -142,6 +142,8 @@ struct cfq_queue {
struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
+ /* Sectors dispatched in current dispatch round */
+ unsigned long nr_sectors;
};

/*
@@ -854,6 +856,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
cfqg->saved_workload_slice = 0;
+ blkiocg_update_blkio_group_dequeue_stats(&cfqg->blkg, 1);
}

static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -880,7 +883,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u sect=%lu", slice_used,
+ cfqq->nr_sectors);
return slice_used;
}

@@ -908,6 +912,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
+ blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl,
+ cfqq->nr_sectors);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -926,6 +932,8 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
void *key = cfqd;
int i, j;
struct cfq_rb_root *st;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;

/* Do we need to take this reference */
if (!css_tryget(&blkcg->css))
@@ -953,7 +961,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
atomic_set(&cfqg->ref, 1);

/* Add group onto cgroup list */
- blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));

/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -1480,6 +1490,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->dispatch_start = jiffies;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1803,6 +1814,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)

if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
}

/*
@@ -3515,7 +3527,8 @@ static void *cfq_init_queue(struct request_queue *q)
* to make sure that cfq_put_cfqg() does not try to kfree root group
*/
atomic_set(&cfqg->ref, 1);
- blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd,
+ 0);
#endif
/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5

2009-11-30 03:02:01

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 14/21] blkio: Provide some isolation between groups

o For isolation, do not allow the following three operations across groups:
- selection of co-operating queues
- preemptions across groups
- request merging across groups.

o Async queues are currently global and not per group. Allow preemption of
an async queue if a sync queue in another group gets backlogged.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 30 ++++++++++++++++++++----------
1 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8abc25b..ed18746 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1463,6 +1463,9 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
struct cfq_io_context *cic;
struct cfq_queue *cfqq;

+ /* Deny merge if bio and rq don't belong to same cfq group */
+ if ((RQ_CFQQ(rq))->cfqg != cfq_get_cfqg(cfqd, 0))
+ return false;
/*
* Disallow merge of a sync bio into an async request.
*/
@@ -1700,6 +1703,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!cfqq)
return NULL;

+ /* If new queue belongs to different cfq_group, don't choose it */
+ if (cur_cfqq->cfqg != cfqq->cfqg)
+ return NULL;
+
/*
* It only makes sense to merge sync queues.
*/
@@ -2952,22 +2959,12 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (!cfqq)
return false;

- if (cfq_slice_used(cfqq))
- return true;
-
if (cfq_class_idle(new_cfqq))
return false;

if (cfq_class_idle(cfqq))
return true;

- /* Allow preemption only if we are idling on sync-noidle tree */
- if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
- cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
- new_cfqq->service_tree->count == 2 &&
- RB_EMPTY_ROOT(&cfqq->sort_list))
- return true;
-
/*
* if the new request is sync, but the currently running queue is
* not, let the sync request have priority.
@@ -2975,6 +2972,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
return true;

+ if (new_cfqq->cfqg != cfqq->cfqg)
+ return false;
+
+ if (cfq_slice_used(cfqq))
+ return true;
+
+ /* Allow preemption only if we are idling on sync-noidle tree */
+ if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
+ cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
+ new_cfqq->service_tree->count == 2 &&
+ RB_EMPTY_ROOT(&cfqq->sort_list))
+ return true;
+
/*
* So both queues are sync. Let the new request get disk time if
* it's a metadata request and the current queue is doing regular IO.
--
1.6.2.5

2009-11-30 03:03:21

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 15/21] blkio: Drop the reference to queue once the task changes cgroup

o If a task changes cgroup, drop the reference to the cfqq associated with its
io context and set the cfqq pointer stored in the ioc to NULL, so that upon the
next request arrival we allocate a new queue in the new group.
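
As an illustration (paths and file names are examples only), moving a running
reader to another cgroup should drop its old sync queue and allocate a new one
in the new group on the next request:

  dd if=/mnt/sdb/zerofile1 of=/dev/null &
  echo $! > /cgroup/test1/tasks
  sleep 5
  # Moving the task sets ioc->cgroup_changed; the old sync cfqq is dropped
  # and a fresh one is created in test2 when the next request arrives.
  echo $! > /cgroup/test2/tasks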

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ed18746..d076d01 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2610,6 +2610,41 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ if (sync_cfqq) {
+ /*
+ * Drop reference to sync queue. A new sync queue will be
+ * assigned in new group upon arrival of a fresh request.
+ */
+ cfq_log_cfqq(cfqd, sync_cfqq, "changed cgroup");
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_CFQ_GROUP_IOSCHED */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -2842,6 +2877,10 @@ out:
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
--
1.6.2.5

2009-11-30 03:01:20

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 16/21] blkio: Propagate cgroup weight updation to cfq groups

o Propagate blkio cgroup weight updates to the associated cfq groups.
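
For example (a sketch, assuming groups test1 and test2 already exist under
/cgroup), a runtime weight change now takes effect on the already-created cfq
groups of every device:

  # Allowed range is 100 to 1000
  echo 1000 > /cgroup/test1/blkio.weight
  echo 250  > /cgroup/test2/blkio.weight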

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 7 +++++++
block/cfq-iosched.c | 6 ++++++
2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4ef78d3..179ddfa 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -16,6 +16,7 @@
#include "blk-cgroup.h"

extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
+extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);

struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };

@@ -116,12 +117,18 @@ static int
blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
{
struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;

if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
return -EINVAL;

blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
blkcg->weight = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_weight(blkg, blkcg->weight);
+ spin_unlock_irq(&blkcg->lock);
return 0;
}

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d076d01..a9fea9e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -924,6 +924,12 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
return NULL;
}

+void
+cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
+{
+ cfqg_of_blkg(blkg)->weight = weight;
+}
+
static struct cfq_group *
cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
{
--
1.6.2.5

2009-11-30 03:02:58

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 17/21] blkio: Wait for cfq queue to get backlogged if group is empty

o If a queue consumes its slice and then gets deleted from the service tree,
its associated group will also get deleted from the service tree if this was
the only queue in the group. That will make the group lose its share.

o For queues on which we idle, if they have used up their slice, wait a bit
for them to get backlogged again before expiring them, so that the group does
not lose its share.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 34 +++++++++++++++++++++++++++++-----
1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a9fea9e..52d504b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -116,6 +116,7 @@ struct cfq_queue {

/* time when queue got scheduled in to dispatch first request. */
unsigned long dispatch_start;
+ unsigned int allocated_slice;
/* time when first request from queue completed and slice started. */
unsigned long slice_start;
unsigned long slice_end;
@@ -313,6 +314,8 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_sync, /* synchronous queue */
CFQ_CFQQ_FLAG_coop, /* cfqq is shared */
CFQ_CFQQ_FLAG_deep, /* sync cfqq experienced large depth */
+ CFQ_CFQQ_FLAG_wait_busy, /* Waiting for next request */
+ CFQ_CFQQ_FLAG_wait_busy_done, /* Got new request. Expire the queue */
};

#define CFQ_CFQQ_FNS(name) \
@@ -340,6 +343,8 @@ CFQ_CFQQ_FNS(slice_new);
CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
CFQ_CFQQ_FNS(deep);
+CFQ_CFQQ_FNS(wait_busy);
+CFQ_CFQQ_FNS(wait_busy_done);
#undef CFQ_CFQQ_FNS

#ifdef CONFIG_DEBUG_CFQ_IOSCHED
@@ -575,6 +580,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
}
cfqq->slice_start = jiffies;
cfqq->slice_end = jiffies + slice;
+ cfqq->allocated_slice = slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}

@@ -861,7 +867,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)

static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
{
- unsigned int slice_used, allocated_slice;
+ unsigned int slice_used;

/*
* Queue got expired before even a single request completed or
@@ -878,9 +884,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
1);
} else {
slice_used = jiffies - cfqq->slice_start;
- allocated_slice = cfqq->slice_end - cfqq->slice_start;
- if (slice_used > allocated_slice)
- slice_used = allocated_slice;
+ if (slice_used > cfqq->allocated_slice)
+ slice_used = cfqq->allocated_slice;
}

cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u sect=%lu", slice_used,
@@ -1497,6 +1502,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfq_log_cfqq(cfqd, cfqq, "set_active");
cfqq->slice_start = 0;
cfqq->dispatch_start = jiffies;
+ cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
cfqq->nr_sectors = 0;
@@ -1526,6 +1532,8 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
del_timer(&cfqd->idle_slice_timer);

cfq_clear_cfqq_wait_request(cfqq);
+ cfq_clear_cfqq_wait_busy(cfqq);
+ cfq_clear_cfqq_wait_busy_done(cfqq);

/*
* store what was left of this slice, if the queue idled/timed out
@@ -2068,7 +2076,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
/*
* The active queue has run out of time, expire it and select new.
*/
- if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
+ if ((cfq_slice_used(cfqq) || cfq_cfqq_wait_busy_done(cfqq))
+ && !cfq_cfqq_must_dispatch(cfqq))
goto expire;

/*
@@ -3098,6 +3107,10 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);

if (cfqq == cfqd->active_queue) {
+ if (cfq_cfqq_wait_busy(cfqq)) {
+ cfq_clear_cfqq_wait_busy(cfqq);
+ cfq_mark_cfqq_wait_busy_done(cfqq);
+ }
/*
* Remember that we saw a request from this process, but
* don't start queuing just yet. Otherwise we risk seeing lots
@@ -3216,6 +3229,17 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
cfq_set_prio_slice(cfqd, cfqq);
cfq_clear_cfqq_slice_new(cfqq);
}
+
+ /*
+ * If this queue consumed its slice and this is last queue
+ * in the group, wait for next request before we expire
+ * the queue
+ */
+ if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
+ cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+ cfq_mark_cfqq_wait_busy(cfqq);
+ }
+
/*
* Idling is not enabled on:
* - expired queues
--
1.6.2.5

2009-11-30 03:01:00

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 18/21] blkio: Determine async workload length based on total number of queues

o Async queues are not per group. Instead they are system wide and maintained
in the root group. Hence their workload slice length should be calculated
based on the total number of queues in the system and not just the queues in
the root group.

o As the root group's default weight is 1000, make sure to charge the async
queue more in terms of vtime so that it does not get more disk time simply
because the root group has a higher weight.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 36 +++++++++++++++++++++++++++++++-----
1 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 52d504b..ba052cf 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -405,6 +405,13 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
}

+static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
+ struct cfq_group *cfqg)
+{
+ return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
+ + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+}
+
static void cfq_dispatch_insert(struct request_queue *, struct request *);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
@@ -897,13 +904,19 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl;
+ unsigned int used_sl, charge_sl;
+ int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
+ - cfqg->service_tree_idle.count;
+
+ BUG_ON(nr_sync < 0);
+ used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);

- used_sl = cfq_cfqq_slice_usage(cfqq);
+ if (!cfq_cfqq_sync(cfqq) && !nr_sync)
+ charge_sl = cfqq->allocated_slice;

/* Can't update vdisktime while group is on service tree */
cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(used_sl, cfqg);
+ cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
__cfq_group_service_tree_add(st, cfqg);

/* This group is being expired. Save the context */
@@ -2018,11 +2031,24 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

- if (cfqd->serving_type == ASYNC_WORKLOAD)
+ if (cfqd->serving_type == ASYNC_WORKLOAD) {
+ unsigned int tmp;
+
+ /*
+ * Async queues are currently system wide. Just taking
+ * proportion of queues within the same group will lead to a higher
+ * async ratio system wide, as generally the root group is going
+ * to have higher weight. A more accurate thing would be to
+ * calculate a system wide async/sync ratio.
+ */
+ tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
+ tmp = tmp/cfqd->busy_queues;
+ slice = min_t(unsigned, slice, tmp);
+
/* async workload slice is scaled down according to
* the sync/async slice ratio. */
slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
- else
+ } else
/* sync workload slice is at least 2 * cfq_slice_idle */
slice = max(slice, 2 * cfqd->cfq_slice_idle);

--
1.6.2.5

2009-11-30 03:00:45

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 19/21] blkio: Implement group_isolation tunable

o If a group is running only a random reader, then it will not have enough
traffic to keep the disk busy and we will reduce overall throughput. This
should result in better latencies for the random reader though. If we don't
idle on the random reader's service tree, then this random reader will
experience large latencies if other groups in the system are running
sequential readers.

o One solution suggested by Corrado is to, by default, keep the random readers
or sync-noidle workload in the root group so that during one dispatch round
we idle only once on the sync-noidle tree. This means that all the sync-idle
workload queues will be in their respective groups and we will see service
differentiation for those, but not for the sync-noidle workload.

o Provide a tunable group_isolation. If set, this will make sure that even
sync-noidle queues go into their respective groups and we idle on them. This
provides stronger isolation between groups, but at the expense of throughput
if a group does not have enough traffic to keep the disk busy.

o By default group_isolation = 0
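
The tunable is per device; for example (illustrative only, assuming the disk
is sdb and CFQ is the active scheduler):

  # 0 (default): random/seeky sync-noidle queues are moved to the root group
  cat /sys/block/sdb/queue/iosched/group_isolation
  # Trade some throughput for stronger per-group fairness on random IO
  echo 1 > /sys/block/sdb/queue/iosched/group_isolation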

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 37 ++++++++++++++++++++++++++++++++++++-
1 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ba052cf..e87f0d1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -143,6 +143,7 @@ struct cfq_queue {
struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
+ struct cfq_group *orig_cfqg;
/* Sectors dispatched in current dispatch round */
unsigned long nr_sectors;
};
@@ -272,6 +273,7 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
unsigned int cfq_latency;
+ unsigned int cfq_group_isolation;

struct list_head cic_list;

@@ -1122,6 +1124,33 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_rb_root *service_tree;
int left;
int new_cfqq = 1;
+ int group_changed = 0;
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfqd->cfq_group_isolation
+ && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
+ && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
+ /* Move this cfq to root group */
+ cfq_log_cfqq(cfqd, cfqq, "moving to root group");
+ if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ cfqq->orig_cfqg = cfqq->cfqg;
+ cfqq->cfqg = &cfqd->root_group;
+ atomic_inc(&cfqd->root_group.ref);
+ group_changed = 1;
+ } else if (!cfqd->cfq_group_isolation
+ && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
+ /* cfqq is sequential now needs to go to its original group */
+ BUG_ON(cfqq->cfqg != &cfqd->root_group);
+ if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ cfq_put_cfqg(cfqq->cfqg);
+ cfqq->cfqg = cfqq->orig_cfqg;
+ cfqq->orig_cfqg = NULL;
+ group_changed = 1;
+ cfq_log_cfqq(cfqd, cfqq, "moved to origin group");
+ }
+#endif

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq), cfqd);
@@ -1192,7 +1221,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
- if (add_front || !new_cfqq)
+ if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}
@@ -2359,6 +2388,8 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
cfq_put_cfqg(cfqg);
+ if (cfqq->orig_cfqg)
+ cfq_put_cfqg(cfqq->orig_cfqg);
}

/*
@@ -3672,6 +3703,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_latency = 1;
+ cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
cfqd->last_end_sync_rq = jiffies;
return cfqd;
@@ -3742,6 +3774,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
+SHOW_FUNCTION(cfq_group_isolation_show, cfqd->cfq_group_isolation, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -3774,6 +3807,7 @@ STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
+STORE_FUNCTION(cfq_group_isolation_store, &cfqd->cfq_group_isolation, 0, 1, 0);
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -3790,6 +3824,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
CFQ_ATTR(low_latency),
+ CFQ_ATTR(group_isolation),
__ATTR_NULL
};

--
1.6.2.5

2009-11-30 03:02:47

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 20/21] blkio: Wait on sync-noidle queue even if rq_noidle = 1

o rq_noidle() is supposed to tell cfq not to expect another request after this
one, and hence not to idle. But this does not seem to work very well. For
example, for direct random readers rq_noidle = 1, yet a next request does come
after this one. Not idling leads to a group not getting its share even if
group_isolation=1.

o The right solution for this issue is to scan the higher layers and set the
right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single-line
fix helps. It should not have any significant impact when we are not using
cgroups. I will later figure out the IO paths in the higher layers and fix it.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e87f0d1..8b528de 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3316,7 +3316,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* only if we processed at least one !rq_noidle request
*/
if (cfqd->serving_type == SYNC_WORKLOAD
- || cfqd->noidle_tree_requires_idle)
+ || cfqd->noidle_tree_requires_idle
+ || cfqq->cfqg->nr_cfqq == 1)
cfq_arm_slice_timer(cfqd);
}
}
--
1.6.2.5

2009-11-30 03:01:44

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 21/21] blkio: Documentation

Signed-off-by: Vivek Goyal <[email protected]>
---
Documentation/cgroups/blkio-controller.txt | 135 ++++++++++++++++++++++++++++
1 files changed, 135 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/blkio-controller.txt

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 0000000..630879c
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -0,0 +1,135 @@
+ Block IO Controller
+ ===================
+Overview
+========
+cgroup subsys "blkio" implements the block IO controller. There seems to be
+a need for various kinds of IO control policies (like proportional BW, max BW)
+both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
+The plan is to use the same cgroup based management interface for the blkio
+controller and, based on user options, switch IO policies in the background.
+
+In the first phase, this patchset implements a proportional weight, time based
+division of disk policy. It is implemented in CFQ. Hence this policy takes
+effect only on leaf nodes when CFQ is being used.
+
+HOWTO
+=====
+You can do a very simple test by running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable group scheduling in CFQ
+ CONFIG_CFQ_GROUP_IOSCHED=y
+
+- Compile and boot into kernel and mount IO controller (blkio).
+
+ mount -t cgroup -o blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/blkio.weight
+ echo 500 > /cgroup/test2/blkio.weight
+
+- Create two same size files (say 512MB each) on same disk (file1, file2) and
+ launch two dd threads in different cgroup to read those files.
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At a macro level, the first dd should finish first. To get more precise
+ data, keep looking (with the help of a script) at the blkio.time and
+ blkio.sectors files of both the test1 and test2 groups. This will tell how
+ much disk time (in milliseconds) each group got and how many sectors each
+ group dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally blkio.time of the cgroups should be in proportion to the weight.
+
+Various user visible config options
+===================================
+CONFIG_CFQ_GROUP_IOSCHED
+ - Enables group scheduling in CFQ. Currently only 1 level of group
+ creation is allowed.
+
+CONFIG_DEBUG_CFQ_IOSCHED
+ - Enables some debugging messages in blktrace. Also creates extra
+ cgroup file blkio.dequeue.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configuration.
+
+CONFIG_BLK_CGROUP
+ - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+
+CONFIG_DEBUG_BLK_CGROUP
+ - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+
+Details of cgroup files
+=======================
+- blkio.weight
+ - Specifies per cgroup weight.
+
+ Currently allowed range of weights is from 100 to 1000.
+
+- blkio.time
+ - disk time allocated to cgroup per device in milliseconds. First
+ two fields specify the major and minor number of the device and
+ third field specifies the disk time allocated to group in
+ milliseconds.
+
+- blkio.sectors
+ - number of sectors transferred to/from disk by the group. First
+ two fields specify the major and minor number of the device and
+ third field specifies the number of sectors transferred by the
+ group to/from the device.
+
+- blkio.dequeue
+ - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+ gives the statistics about how many times a group was dequeued
+ from service tree of the device. First two fields specify the major
+ and minor number of the device and third field specifies the number
+ of times a group was dequeued from a particular device.
+
+CFQ sysfs tunable
+=================
+/sys/block/<disk>/queue/iosched/group_isolation
+
+If group_isolation=1, it provides stronger isolation between groups at the
+expense of throughput. By default group_isolation is 0. In general that
+means that if group_isolation=0, expect fairness for sequential workload
+only. Set group_isolation=1 to see fairness for random IO workload also.
+
+Generally CFQ will put a random seeky workload in the sync-noidle category.
+CFQ will disable idling on these queues and instead does a collective idling
+on a group of such queues. Generally these are slow moving queues and if
+there is a sync-noidle service tree in each group, each such group gets
+exclusive access to the disk for a certain period. That will bring the
+throughput down if the group does not have enough IO to drive deeper queue
+depths and utilize disk capacity to the fullest in the slice allocated to it.
+But the flip side is that even a random reader should get better latencies
+and overall throughput if there are lots of sequential readers/sync-idle
+workloads running in the system.
+
+If group_isolation=0, then CFQ automatically moves all the random seeky queues
+into the root group. That means there will be no service differentiation for
+that kind of workload. This leads to better throughput as we do collective
+idling on the root sync-noidle tree.
+
+By default one should run with group_isolation=0. If that is not sufficient
+and one wants stronger isolation between groups, then set group_isolation=1
+but this will come at cost of reduced throughput.
+
+What works
+==========
+- Currently only sync IO queues are supported. All the buffered writes are
+ still system wide and not per group. Hence we will not see service
+ differentiation between groups for buffered writes.
--
1.6.2.5

2009-11-30 15:34:31

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Block IO Controller V4

Hi Vivek,
On Mon, Nov 30, 2009 at 3:59 AM, Vivek Goyal <[email protected]> wrote:
> Hi Jens,
> [snip]
> TODO
> ====
> - Direct random writers seem to be very fickle in terms of workload
>  classification. They seem to be switching between sync-idle and sync-noidle
>  workload type in a little unpredictable manner. Debug and fix it.
>

Are you still experiencing erratic behaviour after my patches were
integrated in for-2.6.33?

> - Support async IO control (buffered writes).
I was thinking about this.
Currently, writeback can either be issued by a kernel daemon (when
actual dirty ratio is > background dirty ratio, but < dirty_ratio) or
from various processes, if the actual dirty ratio is > dirty ratio.
Could the writeback issued in the context of a process be marked as sync?
In this way:
* normal writeback when system is not under pressure will run in the
root group, without interferring with sync workload
* the writeback issued when we have high dirty ratio will have more
priority, so the system will return in a normal condition quicker.
* your code will work out of the box, in fact processes with lower
weight will complete less I/O, therefore they will be slowed down more
than higher weight ones.

>
>  Buffered writes is a beast and requires changes at many a places to solve the
>  problem and patchset becomes huge. Hence first we plan to support only sync
>  IO in control then work on async IO too.
>
>  Some of the work items identified are.
>
>        - Per memory cgroup dirty ratio
>        - Possibly modification of writeback to force writeback from a
>          particular cgroup.
>        - Implement IO tracking support so that a bio can be mapped to a cgroup.
>        - Per group request descriptor infrastructure in block layer.
>        - At CFQ level, implement per cfq_group async queues.
>
>  In this patchset, all the async IO goes in system wide queues and there are
>  no per group async queues. That means we will see service differentiation
>  only for sync IO only. Async IO will be handled later.
>
> - Support for higher level policies like max BW controller.
> - Support groups of RT class also.

Thanks,
Corrado

2009-11-30 16:02:18

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Nov 30, 2009 at 04:34:36PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Mon, Nov 30, 2009 at 3:59 AM, Vivek Goyal <[email protected]> wrote:
> > Hi Jens,
> > [snip]
> > TODO
> > ====
> > - Direct random writers seem to be very fickle in terms of workload
> >  classification. They seem to be switching between sync-idle and sync-noidle
> >  workload type in a little unpredictable manner. Debug and fix it.
> >
>
> Are you still experiencing erratic behaviour after my patches were
> integrated in for-2.6.33?

Your patches helped with deep seeky queues. But if I am running a random
writer with the default iodepth of 1 (without libaio), I still see the idle
0/1 flipping happening quite frequently during a 30-second run.

As per the CFQ classification definition, a seeky random writer with shallow
depth should be classified as sync-noidle and stay there unless the
workload changes its nature. But that does not seem to be happening.

Just try two fio random writers and monitor the blktrace output to see how
frequently we enable and disable idling on the queues.

>
> > - Support async IO control (buffered writes).
> I was thinking about this.
> Currently, writeback can either be issued by a kernel daemon (when
> actual dirty ratio is > background dirty ratio, but < dirty_ratio) or
> from various processes, if the actual dirty ratio is > dirty ratio.

- If the actual dirty ratio goes above the background dirty ratio, then a
  process will be throttled and it can do one of the following (sketched in
  code after this list):

        - Pick one inode and start flushing its dirty pages. Now these
          pages could have been dirtied by another process in another
          group.

        - It might just wait for the flusher threads to flush some pages
          and sleep for that duration.
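
A hedged sketch of those two options (the helpers dirty_inodes_available()
and writeback_a_chunk() are invented for illustration; this is not the
actual mm/page-writeback.c code):

static void throttled_dirtier_work(struct address_space *mapping)
{
        if (dirty_inodes_available())
                /*
                 * Option 1: the throttled task flushes a chunk itself.
                 * The pages it writes may have been dirtied by a task
                 * in a different cgroup, which is what makes per-group
                 * accounting of buffered writes hard.
                 */
                writeback_a_chunk(mapping);
        else
                /*
                 * Option 2: nothing for this task to flush directly;
                 * wait a bit for the flusher threads to make progress.
                 */
                congestion_wait(BLK_RW_ASYNC, HZ / 10);
}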

> Could the writeback issued in the context of a process be marked as sync?
> In this way:
> * normal writeback when system is not under pressure will run in the
> root group, without interferring with sync workload
> * the writeback issued when we have high dirty ratio will have more
> priority, so the system will return in a normal condition quicker.

Marking async IO submitted in the context of processes, as opposed to kernel
threads, is interesting. We could try that, but in general the processes
that are being throttled are doing buffered writes, and these are generally
not very latency sensitive.

Group stuff apart, I would rather think of providing a consistent share to
the async workload, so that when there is a lot of sync as well as async IO
going on in the system, nobody starves and we provide access to the disk in
a deterministic manner.

That's why I do like the idea of fixing a workload share for the async
workload, so that it does not starve in the face of a lot of sync IO.
Not sure how effectively it is working, though.

Thanks
Vivek


> * your code will work out of the box, in fact processes with lower
> weight will complete less I/O, therefore they will be slowed down more
> than higher weight ones.
>
> >
> > ?Buffered writes is a beast and requires changes at many a places to solve the
> > ?problem and patchset becomes huge. Hence first we plan to support only sync
> > ?IO in control then work on async IO too.
> >
> > ?Some of the work items identified are.
> >
> > ? ? ? ?- Per memory cgroup dirty ratio
> > ? ? ? ?- Possibly modification of writeback to force writeback from a
> > ? ? ? ? ?particular cgroup.
> > ? ? ? ?- Implement IO tracking support so that a bio can be mapped to a cgroup.
> > ? ? ? ?- Per group request descriptor infrastructure in block layer.
> > ? ? ? ?- At CFQ level, implement per cfq_group async queues.
> >
> > ?In this patchset, all the async IO goes in system wide queues and there are
> > ?no per group async queues. That means we will see service differentiation
> > ?only for sync IO only. Async IO willl be handled later.
> >
> > - Support for higher level policies like max BW controller.
> > - Support groups of RT class also.
>
> Thanks,
> Corrado

2009-11-30 20:13:38

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 03/21] blkio: Implement macro to traverse each idle tree in group

On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> o Implement a macro to traverse each service tree in the group. This avoids
> ?usage of double for loop and special condition for idle tree 4 times.
>
> o Macro is little twisted because of special handling of idle class service
> ?tree.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> ?block/cfq-iosched.c | ? 35 +++++++++++++++++++++--------------
> ?1 files changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3baa3f4..c73ff44 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -303,6 +303,15 @@ CFQ_CFQQ_FNS(deep);
> ?#define cfq_log(cfqd, fmt, args...) ? ?\
> ? ? ? ?blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
>
> +/* Traverses through cfq group service trees */
> +#define for_each_cfqg_st(cfqg, i, j, st) \
> +        for (i = 0; i < 3; i++) \
> +                for (j = 0, st = i < 2 ? &cfqg->service_trees[i][j] : \
> +                        &cfqg->service_tree_idle; \
> +                        (i < 2 && j < 3) || (i == 2 && j < 1); \
> +                        j++, st = i < 2 ? &cfqg->service_trees[i][j]: NULL) \

Can this be simplified a bit by moving the service tree assignments
out of the for statement? Also is it possible to use macros for the
various service classes instead of 1, 2, 3?

> +
> +
> ?static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
> ?{
> ? ? ? ?if (cfq_class_idle(cfqq))
> @@ -565,6 +574,10 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
> ?*/
> ?static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
> ?{
> + ? ? ? /* Service tree is empty */
> + ? ? ? if (!root->count)
> + ? ? ? ? ? ? ? return NULL;
> +
> ? ? ? ?if (!root->left)
> ? ? ? ? ? ? ? ?root->left = rb_first(&root->rb);
>
> @@ -1592,18 +1605,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> ? ? ? ?int dispatched = 0;
> ? ? ? ?int i, j;
> ? ? ? ?struct cfq_group *cfqg = &cfqd->root_group;
> + ? ? ? struct cfq_rb_root *st;
>
> - ? ? ? for (i = 0; i < 2; ++i)
> - ? ? ? ? ? ? ? for (j = 0; j < 3; ++j)
> - ? ? ? ? ? ? ? ? ? ? ? while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? != NULL)
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> -
> - ? ? ? while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
> - ? ? ? ? ? ? ? dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> + ? ? ? for_each_cfqg_st(cfqg, i, j, st) {
> + ? ? ? ? ? ? ? while ((cfqq = cfq_rb_first(st)) != NULL)
> + ? ? ? ? ? ? ? ? ? ? ? dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> + ? ? ? }
>
> ? ? ? ?cfq_slice_expired(cfqd, 0);
> -
> ? ? ? ?BUG_ON(cfqd->busy_queues);
>
> ? ? ? ?cfq_log(cfqd, "forced_dispatch=%d", dispatched);
> @@ -2974,6 +2983,7 @@ static void *cfq_init_queue(struct request_queue *q)
> ? ? ? ?struct cfq_data *cfqd;
> ? ? ? ?int i, j;
> ? ? ? ?struct cfq_group *cfqg;
> + ? ? ? struct cfq_rb_root *st;
>
> ? ? ? ?cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
> ? ? ? ?if (!cfqd)
> @@ -2981,11 +2991,8 @@ static void *cfq_init_queue(struct request_queue *q)
>
> ? ? ? ?/* Init root group */
> ? ? ? ?cfqg = &cfqd->root_group;
> -
> - ? ? ? for (i = 0; i < 2; ++i)
> - ? ? ? ? ? ? ? for (j = 0; j < 3; ++j)
> - ? ? ? ? ? ? ? ? ? ? ? cfqg->service_trees[i][j] = CFQ_RB_ROOT;
> - ? ? ? cfqg->service_tree_idle = CFQ_RB_ROOT;
> + ? ? ? for_each_cfqg_st(cfqg, i, j, st)
> + ? ? ? ? ? ? ? *st = CFQ_RB_ROOT;
>
> ? ? ? ?/*
> ? ? ? ? * Not strictly needed (since RB_ROOT just clears the node and we
> --
> 1.6.2.5
>
>

2009-11-30 21:34:28

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Nov 30, 2009 at 5:00 PM, Vivek Goyal <[email protected]> wrote:
> On Mon, Nov 30, 2009 at 04:34:36PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Mon, Nov 30, 2009 at 3:59 AM, Vivek Goyal <[email protected]> wrote:
>> > Hi Jens,
>> > [snip]
>> > TODO
>> > ====
>> > - Direct random writers seem to be very fickle in terms of workload
>> >  classification. They seem to be switching between sync-idle and sync-noidle
>> >  workload type in a little unpredictable manner. Debug and fix it.
>> >
>>
>> Are you still experiencing erratic behaviour after my patches were
>> integrated in for-2.6.33?
>
> Your patches helped with deep seeky queues. But if I am running a random
> writer with default iodepth of 1 (without libaio), I still see that idle
> 0/1 flipping happens so frequently during 30 seconds duration of
> execution.
Ok. This is probably because the average seek goes below the threshold.
You can try a larger file, or reducing the threshold.
>
> As per CFQ classification definition, a seeky random writer with shallow
> depth should be classified as sync-noidle and stay there until and unless
> workload changes its nature. But that does not seem to be happening.
>
> Just try two fio random writers and monitor the blktrace and see how
> freqently we enable and disable idle on the queues.
>
>>
>> > - Support async IO control (buffered writes).
>> I was thinking about this.
>> Currently, writeback can either be issued by a kernel daemon (when
>> actual dirty ratio is > background dirty ratio, but < dirty_ratio) or
>> from various processes, if the actual dirty ratio is > dirty ratio.
>
> - If dirty_ratio > background_dirty_ratio, then a process will be
>  throttled and it can do one of the following actions.
>
>        - Pick one inode and start flushing its dirty pages. Now these
>          pages could have been dirtied by another process in another
>          group.
>
>        - It might just wait for flusher threads to flush some pages and
>          sleep for that duration.
>
>> Could the writeback issued in the context of a process be marked as sync?
>> In this way:
>> * normal writeback when system is not under pressure will run in the
>> root group, without interferring with sync workload
>> * the writeback issued when we have high dirty ratio will have more
>> priority, so the system will return in a normal condition quicker.
>
> Marking async IO submitted in the context of processes and not kernel
> threads is interesting. We could try that, but in general the processes
> that are being throttled are doing buffered writes and generally these
> are not very latency sensitive.
If we have too much dirty memory, then allocations could depend on freeing
some pages, so this would become latency sensitive. In fact, it seems that
the 2.6.32 low_latency patch is hurting some workloads in low memory scenarios.
2.6.33 provides improvements for async writes, but if writeback could
become sync when dirty ratio is too high, we could have a better response
to such extreme scenarios.

>
> Group stuff apart, I would rather think of providing consistent share to
> async workload. So that when there is lots of sync as well async IO is
> going on in the system, nobody starves and we provide access to disk in
> a deterministic manner.
>
> That's why I do like the idea of fixing a workload share of async
> workload so that async workload does not starve in the face of lot of sync
> IO going on. Not sure how effectively it is working though.
I described how the current patch works in another mail.

Thanks
Corrado

2009-11-30 22:00:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Nov 30, 2009 at 10:34:32PM +0100, Corrado Zoccolo wrote:
> On Mon, Nov 30, 2009 at 5:00 PM, Vivek Goyal <[email protected]> wrote:
> > On Mon, Nov 30, 2009 at 04:34:36PM +0100, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Mon, Nov 30, 2009 at 3:59 AM, Vivek Goyal <[email protected]> wrote:
> >> > Hi Jens,
> >> > [snip]
> >> > TODO
> >> > ====
> >> > - Direct random writers seem to be very fickle in terms of workload
> >> > ?classification. They seem to be switching between sync-idle and sync-noidle
> >> > ?workload type in a little unpredictable manner. Debug and fix it.
> >> >
> >>
> >> Are you still experiencing erratic behaviour after my patches were
> >> integrated in for-2.6.33?
> >
> > Your patches helped with deep seeky queues. But if I am running a random
> > writer with default iodepth of 1 (without libaio), I still see that idle
> > 0/1 flipping happens so frequently during 30 seconds duration of
> > execution.
> Ok. This is probably because the average seek goes below the threshold.
> You can try a larger file, or reducing the threshold.

Yes, sometimes the average seek goes below the threshold. The default seek
mean threshold is 8K. If I launch a random writer using fio on a 2G file, it
often gets classified as a sync-idle workload and sometimes as a sync-noidle
workload. It looks like the write pattern generated is not random enough all
the time to cross the seek threshold in the random-write case. In the case of
random reads, I saw that they always got classified as sync-noidle.

During one run when I launched two random writers, one process accessed a
really low address sector and then a high address sector. Its seek mean got
boosted so much to begin with that it took a long time for seek_mean to come
down, and until then the process was sync-noidle. In contrast, the other
process had a seek_mean below the threshold and was classified as a sync-idle
process. As the random writers were coming in with rq_noidle=1, we were idling
on one random writer and not the other, and hence they got different BW. But
it does not happen all the time.

So it seems to be a combination of multiple things.
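
For reference, a simplified sketch of the classification logic being
discussed here (the field names and the exact averaging are approximations,
not the literal cfq-iosched.c code): CFQ keeps a decayed per-queue seek mean,
and the queue counts as seeky (sync-noidle) only while that mean stays above
the threshold, so a short run of near-sequential writes can flip it back.

#define CFQQ_SEEK_THR           (8 * 1024)      /* the "8K" threshold above */

static void cfqq_update_seek_mean(struct cfq_queue *cfqq,
                                  sector_t last, sector_t now)
{
        u64 dist = (now > last) ? now - last : last - now;

        /* decayed running average: mostly history, a little new sample;
         * dist is shifted from sectors to bytes */
        cfqq->seek_mean = (7 * cfqq->seek_mean + (dist << 9)) / 8;
}

static int cfqq_is_seeky(struct cfq_queue *cfqq)
{
        return cfqq->seek_mean > CFQQ_SEEK_THR;
}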

> >
> > As per CFQ classification definition, a seeky random writer with shallow
> > depth should be classified as sync-noidle and stay there until and unless
> > workload changes its nature. But that does not seem to be happening.
> >
> > Just try two fio random writers and monitor the blktrace and see how
> > freqently we enable and disable idle on the queues.
> >
> >>
> >> > - Support async IO control (buffered writes).
> >> I was thinking about this.
> >> Currently, writeback can either be issued by a kernel daemon (when
> >> actual dirty ratio is > background dirty ratio, but < dirty_ratio) or
> >> from various processes, if the actual dirty ratio is > dirty ratio.
> >
> > - If dirty_ratio > background_dirty_ratio, then a process will be
> > ?throttled and it can do one of the following actions.
> >
> > ? ? ? ?- Pick one inode and start flushing its dirty pages. Now these
> > ? ? ? ? ?pages could have been dirtied by another process in another
> > ? ? ? ? ?group.
> >
> > ? ? ? ?- It might just wait for flusher threads to flush some pages and
> > ? ? ? ? ?sleep for that duration.
> >
> >> Could the writeback issued in the context of a process be marked as sync?
> >> In this way:
> >> * normal writeback when system is not under pressure will run in the
> >> root group, without interferring with sync workload
> >> * the writeback issued when we have high dirty ratio will have more
> >> priority, so the system will return in a normal condition quicker.
> >
> > Marking async IO submitted in the context of processes and not kernel
> > threads is interesting. We could try that, but in general the processes
> > that are being throttled are doing buffered writes and generally these
> > are not very latency sensitive.
> If we have too much dirty memory, then allocations could depend on freeing
> some pages, so this would become latency sensitive. In fact, it seems that
> the 2.6.32 low_latency patch is hurting some workloads in low memory scenarios.
> 2.6.33 provides improvements for async writes, but if writeback could
> become sync
> when dirty ratio is too high, we could have a better response to such
> extreme scenarios.

Ok, that makes sense. So we do have a fixed share for the async workload
(proportionate to the number of queues). But that does not seem to be enough
when other sync IO is going on and we are low on memory. In that case,
marking async writes submitted by user-space processes as sync should
help. Try it out. The only thing to watch for is that we don't overdo it
and it does not impact the sync workload a lot. Only trial and testing will
help. :-)

>
> >
> > Group stuff apart, I would rather think of providing consistent share to
> > async workload. So that when there is lots of sync as well async IO is
> > going on in the system, nobody starves and we provide access to disk in
> > a deterministic manner.
> >
> > That's why I do like the idea of fixing a workload share of async
> > workload so that async workload does not starve in the face of lot of sync
> > IO going on. Not sure how effectively it is working though.
> I described how the current patch work in an other mail.

Yep, that point is clear now.

Thanks
Vivek

2009-11-30 22:00:38

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: Block IO Controller V4

FYI: Results today from my test suite - haven't had time to parse them
in any depth...

---- ---- - --------- --------- ---------
Mode RdWr N base i1,s8 i1,s0
---- ---- - --------- --------- ---------
rnd rd 2 43.3 50.6 43.3
rnd rd 4 40.9 55.8 41.1
rnd rd 8 36.7 61.6 36.9

rnd wr 2 69.2 68.1 69.4
rnd wr 4 66.0 62.7 66.0
rnd wr 8 60.5 47.8 61.3

rnd rdwr 2 54.3 49.1 54.3
rnd rdwr 4 50.3 41.7 50.4
rnd rdwr 8 45.9 30.4 46.2

seq rd 2 613.7 606.0 602.8
seq rd 4 617.3 606.7 606.1
seq rd 8 618.3 602.9 605.0

seq wr 2 670.3 725.9 703.9
seq wr 4 680.0 722.0 627.0
seq wr 8 685.3 710.4 631.3

seq rdwr 2 703.4 665.3 680.2
seq rdwr 4 677.5 656.8 639.9
seq rdwr 8 683.3 646.4 633.7

===============================================================

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
base rnd rd 2 21.7 21.5
base rnd rd 4 11.3 11.4 9.4 8.8
base rnd rd 8 2.7 2.9 7.0 7.2 4.2 4.3 4.6 3.8

base rnd wr 2 34.2 34.9
base rnd wr 4 18.2 18.3 15.3 14.2
base rnd wr 8 3.9 3.8 16.8 17.3 4.7 4.6 5.1 4.3

base rnd rdwr 2 27.1 27.2
base rnd rdwr 4 13.8 13.3 11.8 11.4
base rnd rdwr 8 2.9 2.8 9.9 9.6 4.9 5.4 5.7 4.6


base seq rd 2 306.9 306.8
base seq rd 4 160.6 161.0 147.5 148.1
base seq rd 8 78.3 78.9 76.7 77.6 76.1 75.8 77.8 77.1

base seq wr 2 335.2 335.1
base seq wr 4 170.7 171.5 168.7 169.0
base seq wr 8 87.7 88.3 85.4 85.0 81.9 84.2 85.6 87.2

base seq rdwr 2 350.6 352.8
base seq rdwr 4 180.3 181.4 157.7 158.2
base seq rdwr 8 85.8 86.2 87.2 86.8 82.6 81.5 85.3 88.0

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,s8 rnd rd 2 20.6 30.0
i1,s8 rnd rd 4 2.0 4.8 26.1 22.8
i1,s8 rnd rd 8 0.7 1.3 3.5 4.6 15.2 16.1 10.0 10.2

i1,s8 rnd wr 2 18.5 49.6
i1,s8 rnd wr 4 1.0 2.1 19.7 40.0
i1,s8 rnd wr 8 0.5 0.7 0.9 1.2 1.6 3.2 15.1 24.5

i1,s8 rnd rdwr 2 16.4 32.7
i1,s8 rnd rdwr 4 1.2 3.5 16.2 20.8
i1,s8 rnd rdwr 8 0.6 0.8 1.1 1.6 2.1 3.6 9.3 11.3


i1,s8 seq rd 2 202.8 403.2
i1,s8 seq rd 4 91.9 115.3 181.9 217.7
i1,s8 seq rd 8 39.1 76.1 73.7 74.6 74.9 75.6 84.6 104.3

i1,s8 seq wr 2 246.8 479.1
i1,s8 seq wr 4 108.1 157.4 201.9 254.6
i1,s8 seq wr 8 52.2 81.0 80.8 83.0 90.9 95.6 108.6 118.3

i1,s8 seq rdwr 2 226.9 438.4
i1,s8 seq rdwr 4 103.4 139.4 186.4 227.7
i1,s8 seq rdwr 8 53.4 77.4 77.4 77.9 79.7 82.1 93.5 105.1

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
i1,s0 rnd rd 2 21.7 21.6
i1,s0 rnd rd 4 12.4 12.0 9.7 7.0
i1,s0 rnd rd 8 2.7 2.8 7.4 7.6 4.4 4.1 4.4 3.5

i1,s0 rnd wr 2 35.4 34.0
i1,s0 rnd wr 4 19.9 19.9 13.7 12.4
i1,s0 rnd wr 8 4.0 3.8 17.5 19.8 4.4 3.9 4.5 3.5

i1,s0 rnd rdwr 2 27.4 26.9
i1,s0 rnd rdwr 4 14.1 14.8 10.6 10.9
i1,s0 rnd rdwr 8 2.7 3.1 10.3 10.5 5.6 4.7 5.1 4.1


i1,s0 seq rd 2 301.4 301.3
i1,s0 seq rd 4 157.8 156.9 145.1 146.2
i1,s0 seq rd 8 76.4 76.4 75.2 74.9 76.7 75.4 74.3 75.7

i1,s0 seq wr 2 351.5 352.4
i1,s0 seq wr 4 156.5 156.4 156.1 158.1
i1,s0 seq wr 8 80.3 79.7 81.3 80.8 75.8 76.2 77.7 79.4

i1,s0 seq rdwr 2 340.6 339.6
i1,s0 seq rdwr 4 162.5 161.7 157.9 157.8
i1,s0 seq rdwr 8 77.2 77.1 80.1 80.4 78.6 79.1 80.8 80.3

2009-11-30 22:26:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 03/21] blkio: Implement macro to traverse each idle tree in group

On Tue, Dec 01, 2009 at 01:43:16AM +0530, Divyesh Shah wrote:
> On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> > o Implement a macro to traverse each service tree in the group. This avoids
> > ?usage of double for loop and special condition for idle tree 4 times.
> >
> > o Macro is little twisted because of special handling of idle class service
> > ?tree.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > ?block/cfq-iosched.c | ? 35 +++++++++++++++++++++--------------
> > ?1 files changed, 21 insertions(+), 14 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 3baa3f4..c73ff44 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -303,6 +303,15 @@ CFQ_CFQQ_FNS(deep);
> > ?#define cfq_log(cfqd, fmt, args...) ? ?\
> > ? ? ? ?blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
> >
> > +/* Traverses through cfq group service trees */
> > +#define for_each_cfqg_st(cfqg, i, j, st) \
> > + ? ? ? for (i = 0; i < 3; i++) \
> > + ? ? ? ? ? ? ? for (j = 0, st = i < 2 ? &cfqg->service_trees[i][j] : \
> > + ? ? ? ? ? ? ? ? ? ? ? &cfqg->service_tree_idle; \
> > + ? ? ? ? ? ? ? ? ? ? ? (i < 2 && j < 3) || (i == 2 && j < 1); \
> > + ? ? ? ? ? ? ? ? ? ? ? j++, st = i < 2 ? &cfqg->service_trees[i][j]: NULL) \
>
> Can this be simplified a bit by moving the service tree assignments
> out of the for statement?

Not sure how that can be done. One way is for the st assignment to happen in
the body of the for loop. That means I would have to open the brace for the
for loop inside the macro and the user of the macro would close it. That
would look really ugly:

for_each_cfqg_st()
}

Do you have other ideas?
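
For illustration only, one hedged alternative (cfqg_st() is a made-up
helper, not from the posted patch): a small lookup function that returns the
right service tree, so callers keep a plain nested loop and the assignment
stays out of the for statement.

static inline struct cfq_rb_root *
cfqg_st(struct cfq_group *cfqg, int prio, int type)
{
        /* the idle class has a single dedicated tree */
        if (prio == IDLE_WORKLOAD)
                return &cfqg->service_tree_idle;
        return &cfqg->service_trees[prio][type];
}

The caller would then iterate prio and type explicitly and fetch the tree
with cfqg_st(cfqg, i, j).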

> Also is it possible to use macros for the various service classes instead of
> 1, 2, 3?
>

Please find attached the new version of the patch. I have used macro names
for the various classes instead of numbers. Hopefully this one is a little
easier to read.


o Implement a macro to traverse each service tree in the group. This avoids
the use of a double for loop and a special condition for the idle tree in
four places.

o The macro is a little twisted because of the special handling of the idle
class service tree.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

Index: linux10/block/cfq-iosched.c
===================================================================
--- linux10.orig/block/cfq-iosched.c 2009-11-30 17:20:42.000000000 -0500
+++ linux10/block/cfq-iosched.c 2009-11-30 17:22:45.000000000 -0500
@@ -140,9 +140,9 @@ struct cfq_queue {
* IDLE is handled separately, so it has negative index
*/
enum wl_prio_t {
- IDLE_WORKLOAD = -1,
BE_WORKLOAD = 0,
- RT_WORKLOAD = 1
+ RT_WORKLOAD = 1,
+ IDLE_WORKLOAD = 2,
};

/*
@@ -303,6 +303,17 @@ CFQ_CFQQ_FNS(deep);
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

+/* Traverses through cfq group service trees */
+#define for_each_cfqg_st(cfqg, i, j, st) \
+        for (i = 0; i <= IDLE_WORKLOAD; i++) \
+                for (j = 0, st = i < IDLE_WORKLOAD ? &cfqg->service_trees[i][j]\
+                        : &cfqg->service_tree_idle; \
+                        (i < IDLE_WORKLOAD && j <= SYNC_WORKLOAD) || \
+                        (i == IDLE_WORKLOAD && j == 0); \
+                        j++, st = i < IDLE_WORKLOAD ? \
+                                &cfqg->service_trees[i][j]: NULL) \
+
+
static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
{
if (cfq_class_idle(cfqq))
@@ -565,6 +576,10 @@ cfq_choose_req(struct cfq_data *cfqd, st
*/
static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
{
+ /* Service tree is empty */
+ if (!root->count)
+ return NULL;
+
if (!root->left)
root->left = rb_first(&root->rb);

@@ -1592,18 +1607,14 @@ static int cfq_forced_dispatch(struct cf
int dispatched = 0;
int i, j;
struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_rb_root *st;

- for (i = 0; i < 2; ++i)
- for (j = 0; j < 3; ++j)
- while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
- != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
-
- while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ for_each_cfqg_st(cfqg, i, j, st) {
+ while ((cfqq = cfq_rb_first(st)) != NULL)
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ }

cfq_slice_expired(cfqd, 0);
-
BUG_ON(cfqd->busy_queues);

cfq_log(cfqd, "forced_dispatch=%d", dispatched);
@@ -2974,6 +2985,7 @@ static void *cfq_init_queue(struct reque
struct cfq_data *cfqd;
int i, j;
struct cfq_group *cfqg;
+ struct cfq_rb_root *st;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
@@ -2981,11 +2993,8 @@ static void *cfq_init_queue(struct reque

/* Init root group */
cfqg = &cfqd->root_group;
-
- for (i = 0; i < 2; ++i)
- for (j = 0; j < 3; ++j)
- cfqg->service_trees[i][j] = CFQ_RB_ROOT;
- cfqg->service_tree_idle = CFQ_RB_ROOT;
+ for_each_cfqg_st(cfqg, i, j, st)
+ *st = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we

2009-11-30 22:58:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Nov 30, 2009 at 05:00:33PM -0500, Alan D. Brunelle wrote:
> FYI: Results today from my test suite - haven't had time to parse them
> in any depth...

Thanks Alan. I am trying to parse the results below. Do s0 and s8 still mean
slice idle disabled/enabled? Instead of that, we could try group_isolation
enabled or disabled for all the tests.

>
> ---- ---- - --------- --------- ---------
> Mode RdWr N base i1,s8 i1,s0
> ---- ---- - --------- --------- ---------
> rnd rd 2 43.3 50.6 43.3
> rnd rd 4 40.9 55.8 41.1
> rnd rd 8 36.7 61.6 36.9

I am assuming that base still means no IO controller patches applied. I am
also assuming that base was run with slice_idle=8.

If so, the above is surprising. After applying the patches, the performance
of random reads has become much better with slice_idle=8. Maybe you ran the
base with slice_idle=0; that would explain why the results more or less
match the ioc patches applied with slice_idle=0.

>
> rnd wr 2 69.2 68.1 69.4
> rnd wr 4 66.0 62.7 66.0
> rnd wr 8 60.5 47.8 61.3

If you ran base with slice_idle=0, then the first and third columns match.
I can't conclude much about the i1,s8 case. I am curious, though, why
performance dropped when the number of writers reached 8.

>
> rnd rdwr 2 54.3 49.1 54.3
> rnd rdwr 4 50.3 41.7 50.4
> rnd rdwr 8 45.9 30.4 46.2

Same as random write.

>
> seq rd 2 613.7 606.0 602.8
> seq rd 4 617.3 606.7 606.1
> seq rd 8 618.3 602.9 605.0
>

This is surprising again. If s0 means slice_idle=0, then performance should
have sucked with N=8, as we should have been seeking all over the place.

> seq wr 2 670.3 725.9 703.9
> seq wr 4 680.0 722.0 627.0
> seq wr 8 685.3 710.4 631.3
>
> seq rdwr 2 703.4 665.3 680.2
> seq rdwr 4 677.5 656.8 639.9
> seq rdwr 8 683.3 646.4 633.7
>
> ===============================================================
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> base rnd rd 2 21.7 21.5
> base rnd rd 4 11.3 11.4 9.4 8.8
> base rnd rd 8 2.7 2.9 7.0 7.2 4.2 4.3 4.6 3.8
>
> base rnd wr 2 34.2 34.9
> base rnd wr 4 18.2 18.3 15.3 14.2
> base rnd wr 8 3.9 3.8 16.8 17.3 4.7 4.6 5.1 4.3
>
> base rnd rdwr 2 27.1 27.2
> base rnd rdwr 4 13.8 13.3 11.8 11.4
> base rnd rdwr 8 2.9 2.8 9.9 9.6 4.9 5.4 5.7 4.6
>
>
> base seq rd 2 306.9 306.8
> base seq rd 4 160.6 161.0 147.5 148.1
> base seq rd 8 78.3 78.9 76.7 77.6 76.1 75.8 77.8 77.1
>
> base seq wr 2 335.2 335.1
> base seq wr 4 170.7 171.5 168.7 169.0
> base seq wr 8 87.7 88.3 85.4 85.0 81.9 84.2 85.6 87.2
>
> base seq rdwr 2 350.6 352.8
> base seq rdwr 4 180.3 181.4 157.7 158.2
> base seq rdwr 8 85.8 86.2 87.2 86.8 82.6 81.5 85.3 88.0
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,s8 rnd rd 2 20.6 30.0
> i1,s8 rnd rd 4 2.0 4.8 26.1 22.8
> i1,s8 rnd rd 8 0.7 1.3 3.5 4.6 15.2 16.1 10.0 10.2
>

Are these rates in MB/s? I think we need to look at the disk time also,
because we try to provide fairness in terms of disk time. In many cases it
maps very closely to the rates, but not always.

Is group_isolation enabled for these test cases? If not, these results are
surprising, as I would expect all the random readers to be in the root group
and then almost match the base results.

But these seem to be very different from the base results, so maybe
group_isolation=1. If that's the case, then we do see service
differentiation, but it does not seem very proportionate to the weights.

Looking at the disk.time and disk.dequeue files will help here.

I'll stop parsing until you get a chance to let me know some of these
parameters.

Thanks
Vivek


> i1,s8 rnd wr 2 18.5 49.6
> i1,s8 rnd wr 4 1.0 2.1 19.7 40.0
> i1,s8 rnd wr 8 0.5 0.7 0.9 1.2 1.6 3.2 15.1 24.5
>
> i1,s8 rnd rdwr 2 16.4 32.7
> i1,s8 rnd rdwr 4 1.2 3.5 16.2 20.8
> i1,s8 rnd rdwr 8 0.6 0.8 1.1 1.6 2.1 3.6 9.3 11.3
>
>
> i1,s8 seq rd 2 202.8 403.2
> i1,s8 seq rd 4 91.9 115.3 181.9 217.7
> i1,s8 seq rd 8 39.1 76.1 73.7 74.6 74.9 75.6 84.6 104.3
>
> i1,s8 seq wr 2 246.8 479.1
> i1,s8 seq wr 4 108.1 157.4 201.9 254.6
> i1,s8 seq wr 8 52.2 81.0 80.8 83.0 90.9 95.6 108.6 118.3
>
> i1,s8 seq rdwr 2 226.9 438.4
> i1,s8 seq rdwr 4 103.4 139.4 186.4 227.7
> i1,s8 seq rdwr 8 53.4 77.4 77.4 77.9 79.7 82.1 93.5 105.1
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> i1,s0 rnd rd 2 21.7 21.6
> i1,s0 rnd rd 4 12.4 12.0 9.7 7.0
> i1,s0 rnd rd 8 2.7 2.8 7.4 7.6 4.4 4.1 4.4 3.5
>
> i1,s0 rnd wr 2 35.4 34.0
> i1,s0 rnd wr 4 19.9 19.9 13.7 12.4
> i1,s0 rnd wr 8 4.0 3.8 17.5 19.8 4.4 3.9 4.5 3.5
>
> i1,s0 rnd rdwr 2 27.4 26.9
> i1,s0 rnd rdwr 4 14.1 14.8 10.6 10.9
> i1,s0 rnd rdwr 8 2.7 3.1 10.3 10.5 5.6 4.7 5.1 4.1
>
>
> i1,s0 seq rd 2 301.4 301.3
> i1,s0 seq rd 4 157.8 156.9 145.1 146.2
> i1,s0 seq rd 8 76.4 76.4 75.2 74.9 76.7 75.4 74.3 75.7
>
> i1,s0 seq wr 2 351.5 352.4
> i1,s0 seq wr 4 156.5 156.4 156.1 158.1
> i1,s0 seq wr 8 80.3 79.7 81.3 80.8 75.8 76.2 77.7 79.4
>
> i1,s0 seq rdwr 2 340.6 339.6
> i1,s0 seq rdwr 4 162.5 161.7 157.9 157.8
> i1,s0 seq rdwr 8 77.2 77.1 80.1 80.4 78.6 79.1 80.8 80.3
>

2009-11-30 23:50:27

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, 2009-11-30 at 17:56 -0500, Vivek Goyal wrote:
> On Mon, Nov 30, 2009 at 05:00:33PM -0500, Alan D. Brunelle wrote:
> > FYI: Results today from my test suite - haven't had time to parse them
> > in any depth...
>
> Thanks Alan. I am trying to parse the results below. s0 and s8 still mean
> slice idle enabled disabled? Instead of that we can try group_isolation
> enabled or disabled for all the tests.

I'm concerned as well - I think I did the base run /after/ the i1,s0
run. I'll need to check that out tomorrow...

Alan

2009-11-30 23:55:58

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 05/21] blkio: Introduce the root service tree for cfq groups

On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> o So far we just had one cfq_group in cfq_data. To create space for more than
> ?one cfq_group, we need to have a service tree of groups where all the groups
> ?can be queued if they have active cfq queues backlogged in these.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> ?block/cfq-iosched.c | ?136 +++++++++++++++++++++++++++++++++++++++++++++++++-
> ?1 files changed, 133 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index a0d0a83..0a284be 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -77,8 +77,9 @@ struct cfq_rb_root {
> ? ? ? ?struct rb_root rb;
> ? ? ? ?struct rb_node *left;
> ? ? ? ?unsigned count;
> + ? ? ? u64 min_vdisktime;
> ?};
> -#define CFQ_RB_ROOT ? ?(struct cfq_rb_root) { RB_ROOT, NULL, 0, }
> +#define CFQ_RB_ROOT ? ?(struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }
>
> ?/*
> ?* Per process-grouping structure
> @@ -156,6 +157,16 @@ enum wl_type_t {
>
> ?/* This is per cgroup per device grouping structure */
> ?struct cfq_group {
> + ? ? ? /* group service_tree member */
> + ? ? ? struct rb_node rb_node;
> +
> + ? ? ? /* group service_tree key */
> + ? ? ? u64 vdisktime;
> + ? ? ? bool on_st;
> +
> + ? ? ? /* number of cfqq currently on this group */
> + ? ? ? int nr_cfqq;
> +
> ? ? ? ?/*
> ? ? ? ? * rr lists of queues with requests, onle rr for each priority class.
> ? ? ? ? * Counts are embedded in the cfq_rb_root
> @@ -169,6 +180,8 @@ struct cfq_group {
> ?*/
> ?struct cfq_data {
> ? ? ? ?struct request_queue *queue;
> + ? ? ? /* Root service tree for cfq_groups */
> + ? ? ? struct cfq_rb_root grp_service_tree;
> ? ? ? ?struct cfq_group root_group;
>
> ? ? ? ?/*
> @@ -251,6 +264,9 @@ static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum wl_type_t type,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct cfq_data *cfqd)
> ?{
> + ? ? ? if (!cfqg)
> + ? ? ? ? ? ? ? return NULL;
> +
> ? ? ? ?if (prio == IDLE_WORKLOAD)
> ? ? ? ? ? ? ? ?return &cfqg->service_tree_idle;
>
> @@ -587,6 +603,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
> ? ? ? ?return NULL;
> ?}
>
> +static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
> +{
> +        if (!root->left)
> +                root->left = rb_first(&root->rb);
> +
> +        if (root->left)
> +                return rb_entry(root->left, struct cfq_group, rb_node);

Can you please define a cfqg_entry macro and reuse that at different
places in the code?
#define cfqg_entry(ptr) rb_entry(ptr, struct cfq_group, rb_node)

> +
> +        return NULL;
> +}
> +
> ?static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> ?{
> ? ? ? ?rb_erase(n, root);
> @@ -643,6 +670,83 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
> ? ? ? ? ? ? ? ? ? cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> ?}
>
> +static inline s64
> +cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
> +{
> + ? ? ? return cfqg->vdisktime - st->min_vdisktime;
> +}
> +
> +static void
> +__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
> +{
> + ? ? ? struct rb_node **node = &st->rb.rb_node;
> + ? ? ? struct rb_node *parent = NULL;
> + ? ? ? struct cfq_group *__cfqg;
> + ? ? ? s64 key = cfqg_key(st, cfqg);
> + ? ? ? int left = 1;
> +
> + ? ? ? while (*node != NULL) {
> + ? ? ? ? ? ? ? parent = *node;
> + ? ? ? ? ? ? ? __cfqg = rb_entry(parent, struct cfq_group, rb_node);
> +
> + ? ? ? ? ? ? ? if (key < cfqg_key(st, __cfqg))
> + ? ? ? ? ? ? ? ? ? ? ? node = &parent->rb_left;
> + ? ? ? ? ? ? ? else {
> + ? ? ? ? ? ? ? ? ? ? ? node = &parent->rb_right;
> + ? ? ? ? ? ? ? ? ? ? ? left = 0;
> + ? ? ? ? ? ? ? }
> + ? ? ? }
> +
> + ? ? ? if (left)
> + ? ? ? ? ? ? ? st->left = &cfqg->rb_node;
> +
> + ? ? ? rb_link_node(&cfqg->rb_node, parent, node);
> + ? ? ? rb_insert_color(&cfqg->rb_node, &st->rb);
> +}
> +
> +static void
> +cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> +{
> + ? ? ? struct cfq_rb_root *st = &cfqd->grp_service_tree;
> + ? ? ? struct cfq_group *__cfqg;
> + ? ? ? struct rb_node *n;
> +
> + ? ? ? cfqg->nr_cfqq++;
> + ? ? ? if (cfqg->on_st)
> + ? ? ? ? ? ? ? return;
> +
> + ? ? ? /*
> + ? ? ? ?* Currently put the group at the end. Later implement something
> + ? ? ? ?* so that groups get lesser vtime based on their weights, so that
> + ? ? ? ?* if group does not loose all if it was not continously backlogged.
> + ? ? ? ?*/
> + ? ? ? n = rb_last(&st->rb);
> + ? ? ? if (n) {
> + ? ? ? ? ? ? ? __cfqg = rb_entry(n, struct cfq_group, rb_node);
> + ? ? ? ? ? ? ? cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> + ? ? ? } else
> + ? ? ? ? ? ? ? cfqg->vdisktime = st->min_vdisktime;
> +
> + ? ? ? __cfq_group_service_tree_add(st, cfqg);
> + ? ? ? cfqg->on_st = true;
> +}
> +
> +static void
> +cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> +{
> + ? ? ? struct cfq_rb_root *st = &cfqd->grp_service_tree;
> +
> + ? ? ? BUG_ON(cfqg->nr_cfqq < 1);
> + ? ? ? cfqg->nr_cfqq--;
> + ? ? ? /* If there are other cfq queues under this group, don't delete it */
> + ? ? ? if (cfqg->nr_cfqq)
> + ? ? ? ? ? ? ? return;
> +
> + ? ? ? cfqg->on_st = false;
> + ? ? ? if (!RB_EMPTY_NODE(&cfqg->rb_node))
> + ? ? ? ? ? ? ? cfq_rb_erase(&cfqg->rb_node, st);
> +}
> +
> ?/*
> ?* The cfqd->service_trees holds all pending cfq_queue's that have
> ?* requests waiting to be processed. It is sorted in the order that
> @@ -725,6 +829,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> ? ? ? ?rb_link_node(&cfqq->rb_node, parent, p);
> ? ? ? ?rb_insert_color(&cfqq->rb_node, &service_tree->rb);
> ? ? ? ?service_tree->count++;
> + ? ? ? cfq_group_service_tree_add(cfqd, cfqq->cfqg);
> ?}
>
> ?static struct cfq_queue *
> @@ -835,6 +940,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> ? ? ? ? ? ? ? ?cfqq->p_root = NULL;
> ? ? ? ?}
>
> + ? ? ? cfq_group_service_tree_del(cfqd, cfqq->cfqg);
> ? ? ? ?BUG_ON(!cfqd->busy_queues);
> ? ? ? ?cfqd->busy_queues--;
> ?}
> @@ -1111,6 +1217,9 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> ? ? ? ?if (!cfqd->rq_queued)
> ? ? ? ? ? ? ? ?return NULL;
>
> + ? ? ? /* There is nothing to dispatch */
> + ? ? ? if (!service_tree)
> + ? ? ? ? ? ? ? return NULL;
> ? ? ? ?if (RB_EMPTY_ROOT(&service_tree->rb))
> ? ? ? ? ? ? ? ?return NULL;
> ? ? ? ?return cfq_rb_first(service_tree);
> @@ -1480,6 +1589,12 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> ? ? ? ?unsigned count;
> ? ? ? ?struct cfq_rb_root *st;
>
> + ? ? ? if (!cfqg) {
> + ? ? ? ? ? ? ? cfqd->serving_prio = IDLE_WORKLOAD;
> + ? ? ? ? ? ? ? cfqd->workload_expires = jiffies + 1;
> + ? ? ? ? ? ? ? return;
> + ? ? ? }
> +
> ? ? ? ?/* Choose next priority. RT > BE > IDLE */
> ? ? ? ?if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
> ? ? ? ? ? ? ? ?cfqd->serving_prio = RT_WORKLOAD;
> @@ -1538,10 +1653,21 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> ? ? ? ?cfqd->noidle_tree_requires_idle = false;
> ?}
>
> +static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
> +{
> + ? ? ? struct cfq_rb_root *st = &cfqd->grp_service_tree;
> +
> + ? ? ? if (RB_EMPTY_ROOT(&st->rb))
> + ? ? ? ? ? ? ? return NULL;
> + ? ? ? return cfq_rb_first_group(st);
> +}
> +
> ?static void cfq_choose_cfqg(struct cfq_data *cfqd)
> ?{
> - ? ? ? cfqd->serving_group = &cfqd->root_group;
> - ? ? ? choose_service_tree(cfqd, &cfqd->root_group);
> + ? ? ? struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
> +
> + ? ? ? cfqd->serving_group = cfqg;
> + ? ? ? choose_service_tree(cfqd, cfqg);
> ?}
>
> ?/*
> @@ -3017,10 +3143,14 @@ static void *cfq_init_queue(struct request_queue *q)
> ? ? ? ?if (!cfqd)
> ? ? ? ? ? ? ? ?return NULL;
>
> + ? ? ? /* Init root service tree */
> + ? ? ? cfqd->grp_service_tree = CFQ_RB_ROOT;
> +
> ? ? ? ?/* Init root group */
> ? ? ? ?cfqg = &cfqd->root_group;
> ? ? ? ?for_each_cfqg_st(cfqg, i, j, st)
> ? ? ? ? ? ? ? ?*st = CFQ_RB_ROOT;
> + ? ? ? RB_CLEAR_NODE(&cfqg->rb_node);
>
> ? ? ? ?/*
> ? ? ? ? * Not strictly needed (since RB_ROOT just clears the node and we
> --
> 1.6.2.5
>
>

2009-12-01 00:05:06

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 06/21] blkio: Introduce blkio controller cgroup interface

On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> o This is basic implementation of blkio controller cgroup interface. This is
> ?the common interface visible to user space and should be used by different
> ?IO control policies as we implement those.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> ?block/Kconfig ? ? ? ? ? ? ? ? | ? 13 +++
> ?block/Kconfig.iosched ? ? ? ? | ? ?1 +
> ?block/Makefile ? ? ? ? ? ? ? ?| ? ?1 +
> ?block/blk-cgroup.c ? ? ? ? ? ?| ?177 +++++++++++++++++++++++++++++++++++++++++
> ?block/blk-cgroup.h ? ? ? ? ? ?| ? 58 +++++++++++++
> ?include/linux/cgroup_subsys.h | ? ?6 ++
> ?include/linux/iocontext.h ? ? | ? ?4 +
> ?7 files changed, 260 insertions(+), 0 deletions(-)
> ?create mode 100644 block/blk-cgroup.c
> ?create mode 100644 block/blk-cgroup.h
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 9be0b56..6ba1a8e 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
> ? ? ? ?T10/SCSI Data Integrity Field or the T13/ATA External Path
> ? ? ? ?Protection. ?If in doubt, say N.
>
> +config BLK_CGROUP
> + ? ? ? bool
> + ? ? ? depends on CGROUPS
> + ? ? ? default n
> + ? ? ? ---help---
> + ? ? ? Generic block IO controller cgroup interface. This is the common
> + ? ? ? cgroup interface which should be used by various IO controlling
> + ? ? ? policies.
> +
> + ? ? ? Currently, CFQ IO scheduler uses it to recognize task groups and
> + ? ? ? control disk bandwidth allocation (proportional time slice allocation)
> + ? ? ? to such task groups.
> +
> ?endif # BLOCK
>
> ?config BLOCK_COMPAT
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 8bd1051..be0280d 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -23,6 +23,7 @@ config IOSCHED_DEADLINE
>
> ?config IOSCHED_CFQ
> ? ? ? ?tristate "CFQ I/O scheduler"
> + ? ? ? select BLK_CGROUP
> ? ? ? ?default y
> ? ? ? ?---help---
> ? ? ? ? ?The CFQ I/O scheduler tries to distribute bandwidth equally
> diff --git a/block/Makefile b/block/Makefile
> index 7914108..cb2d515 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
> ? ? ? ? ? ? ? ? ? ? ? ?blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o
>
> ?obj-$(CONFIG_BLK_DEV_BSG) ? ? ?+= bsg.o
> +obj-$(CONFIG_BLK_CGROUP) ? ? ? += blk-cgroup.o
> ?obj-$(CONFIG_IOSCHED_NOOP) ? ? += noop-iosched.o
> ?obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> ?obj-$(CONFIG_IOSCHED_CFQ) ? ? ?+= cfq-iosched.o
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> new file mode 100644
> index 0000000..4f6afd7
> --- /dev/null
> +++ b/block/blk-cgroup.c
> @@ -0,0 +1,177 @@
> +/*
> + * Common Block IO controller cgroup interface
> + *
> + * Based on ideas and code from CFQ, CFS and BFQ:
> + * Copyright (C) 2003 Jens Axboe <[email protected]>
> + *
> + * Copyright (C) 2008 Fabio Checconi <[email protected]>
> + * ? ? ? ? ? ? ? ? ? Paolo Valente <[email protected]>
> + *
> + * Copyright (C) 2009 Vivek Goyal <[email protected]>
> + * ? ? ? ? ? ? ? ? ? Nauman Rafique <[email protected]>
> + */
> +#include <linux/ioprio.h>
> +#include "blk-cgroup.h"
> +
> +struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };

This should use BLKIO_WEIGHT_MAX, as 2*BLKIO_WEIGHT_DEFAULT is the same as
BLKIO_WEIGHT_MAX, unless there is a reason why you would want the value to
remain a multiple of the default weight instead of the max, in case the
constants change later.
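
In code, the suggestion amounts to the following (the value is the same
today; the intent is just clearer if the constants ever diverge):

struct blkio_cgroup blkio_root_cgroup = { .weight = BLKIO_WEIGHT_MAX };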

> +
> +struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
> +{
> + ? ? ? return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? struct blkio_cgroup, css);
> +}
> +
> +void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, void *key)
> +{
> + ? ? ? unsigned long flags;
> +
> + ? ? ? spin_lock_irqsave(&blkcg->lock, flags);
> + ? ? ? rcu_assign_pointer(blkg->key, key);
> + ? ? ? hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
> + ? ? ? spin_unlock_irqrestore(&blkcg->lock, flags);
> +}
> +
> +int blkiocg_del_blkio_group(struct blkio_group *blkg)
> +{
> + ? ? ? /* Implemented later */
> + ? ? ? return 0;
> +}
> +
> +/* called under rcu_read_lock(). */
> +struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
> +{
> + ? ? ? struct blkio_group *blkg;
> + ? ? ? struct hlist_node *n;
> + ? ? ? void *__key;
> +
> + ? ? ? hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
> + ? ? ? ? ? ? ? __key = blkg->key;
> + ? ? ? ? ? ? ? if (__key == key)
> + ? ? ? ? ? ? ? ? ? ? ? return blkg;
> + ? ? ? }
> +
> + ? ? ? return NULL;
> +}
> +
> +#define SHOW_FUNCTION(__VAR) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> +static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? ? \
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct cftype *cftype) ? ? ? ? ? \
> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> + ? ? ? struct blkio_cgroup *blkcg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> + ? ? ? blkcg = cgroup_to_blkio_cgroup(cgroup); ? ? ? ? ? ? ? ? ? ? ? ? \
> + ? ? ? return (u64)blkcg->__VAR; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> +}
> +
> +SHOW_FUNCTION(weight);
> +#undef SHOW_FUNCTION
> +
> +static int
> +blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
> +{
> + ? ? ? struct blkio_cgroup *blkcg;
> +
> + ? ? ? if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
> + ? ? ? ? ? ? ? return -EINVAL;
> +
> + ? ? ? blkcg = cgroup_to_blkio_cgroup(cgroup);
> + ? ? ? blkcg->weight = (unsigned int)val;
> + ? ? ? return 0;
> +}
> +
> +struct cftype blkio_files[] = {
> + ? ? ? {
> + ? ? ? ? ? ? ? .name = "weight",
> + ? ? ? ? ? ? ? .read_u64 = blkiocg_weight_read,
> + ? ? ? ? ? ? ? .write_u64 = blkiocg_weight_write,
> + ? ? ? },
> +};
> +
> +static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> + ? ? ? return cgroup_add_files(cgroup, subsys, blkio_files,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ARRAY_SIZE(blkio_files));
> +}
> +
> +static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> + ? ? ? struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> +
> + ? ? ? free_css_id(&blkio_subsys, &blkcg->css);
> + ? ? ? kfree(blkcg);
> +}
> +
> +static struct cgroup_subsys_state *
> +blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> + ? ? ? struct blkio_cgroup *blkcg, *parent_blkcg;
> +
> + ? ? ? if (!cgroup->parent) {
> + ? ? ? ? ? ? ? blkcg = &blkio_root_cgroup;
> + ? ? ? ? ? ? ? goto done;
> + ? ? ? }
> +
> + ? ? ? /* Currently we do not support hierarchy deeper than two level (0,1) */
> + ? ? ? parent_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
> + ? ? ? if (css_depth(&parent_blkcg->css) > 0)
> + ? ? ? ? ? ? ? return ERR_PTR(-EINVAL);
> +
> + ? ? ? blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
> + ? ? ? if (!blkcg)
> + ? ? ? ? ? ? ? return ERR_PTR(-ENOMEM);
> +
> + ? ? ? blkcg->weight = BLKIO_WEIGHT_DEFAULT;
> +done:
> + ? ? ? spin_lock_init(&blkcg->lock);
> + ? ? ? INIT_HLIST_HEAD(&blkcg->blkg_list);
> +
> + ? ? ? return &blkcg->css;
> +}
> +
> +/*
> + * We cannot support shared io contexts, as we have no mean to support
> + * two tasks with the same ioc in two different groups without major rework
> + * of the main cic data structures. ?For now we allow a task to change
> + * its cgroup only if it's the only owner of its ioc.
> + */
> +static int blkiocg_can_attach(struct cgroup_subsys *subsys,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cgroup *cgroup, struct task_struct *tsk,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool threadgroup)
> +{
> + ? ? ? struct io_context *ioc;
> + ? ? ? int ret = 0;
> +
> + ? ? ? /* task_lock() is needed to avoid races with exit_io_context() */
> + ? ? ? task_lock(tsk);
> + ? ? ? ioc = tsk->io_context;
> + ? ? ? if (ioc && atomic_read(&ioc->nr_tasks) > 1)
> + ? ? ? ? ? ? ? ret = -EINVAL;
> + ? ? ? task_unlock(tsk);
> +
> + ? ? ? return ret;
> +}
> +
> +static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct cgroup *prev, struct task_struct *tsk,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool threadgroup)
> +{
> + ? ? ? struct io_context *ioc;
> +
> + ? ? ? task_lock(tsk);
> + ? ? ? ioc = tsk->io_context;
> + ? ? ? if (ioc)
> + ? ? ? ? ? ? ? ioc->cgroup_changed = 1;
> + ? ? ? task_unlock(tsk);
> +}
> +
> +struct cgroup_subsys blkio_subsys = {
> + ? ? ? .name = "blkio",
> + ? ? ? .create = blkiocg_create,
> + ? ? ? .can_attach = blkiocg_can_attach,
> + ? ? ? .attach = blkiocg_attach,
> + ? ? ? .destroy = blkiocg_destroy,
> + ? ? ? .populate = blkiocg_populate,
> + ? ? ? .subsys_id = blkio_subsys_id,
> + ? ? ? .use_id = 1,
> +};
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> new file mode 100644
> index 0000000..ba5703f
> --- /dev/null
> +++ b/block/blk-cgroup.h
> @@ -0,0 +1,58 @@
> +#ifndef _BLK_CGROUP_H
> +#define _BLK_CGROUP_H
> +/*
> + * Common Block IO controller cgroup interface
> + *
> + * Based on ideas and code from CFQ, CFS and BFQ:
> + * Copyright (C) 2003 Jens Axboe <[email protected]>
> + *
> + * Copyright (C) 2008 Fabio Checconi <[email protected]>
> + * ? ? ? ? ? ? ? ? ? Paolo Valente <[email protected]>
> + *
> + * Copyright (C) 2009 Vivek Goyal <[email protected]>
> + * ? ? ? ? ? ? ? ? ? Nauman Rafique <[email protected]>
> + */
> +
> +#include <linux/cgroup.h>
> +
> +struct blkio_cgroup {
> + ? ? ? struct cgroup_subsys_state css;
> + ? ? ? unsigned int weight;
> + ? ? ? spinlock_t lock;
> + ? ? ? struct hlist_head blkg_list;
> +};
> +
> +struct blkio_group {
> + ? ? ? /* An rcu protected unique identifier for the group */
> + ? ? ? void *key;
> + ? ? ? struct hlist_node blkcg_node;
> +};
> +
> +#define BLKIO_WEIGHT_MIN ? ? ? 100
> +#define BLKIO_WEIGHT_MAX ? ? ? 1000
> +#define BLKIO_WEIGHT_DEFAULT ? 500
> +
> +#ifdef CONFIG_BLK_CGROUP
> +extern struct blkio_cgroup blkio_root_cgroup;
> +extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
> +extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, void *key);
> +extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
> +extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? void *key);
> +#else
> +static inline struct blkio_cgroup *
> +cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
> +
> +static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> + ? ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, void *key)
> +{
> +}
> +
> +static inline int
> +blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
> +
> +static inline struct blkio_group *
> +blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
> +#endif
> +#endif /* _BLK_CGROUP_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c8d31b..ccefff0 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -60,3 +60,9 @@ SUBSYS(net_cls)
> ?#endif
>
> ?/* */
> +
> +#ifdef CONFIG_BLK_CGROUP
> +SUBSYS(blkio)
> +#endif
> +
> +/* */
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index eb73632..d61b0b8 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -68,6 +68,10 @@ struct io_context {
> ? ? ? ?unsigned short ioprio;
> ? ? ? ?unsigned short ioprio_changed;
>
> +#ifdef CONFIG_BLK_CGROUP
> + ? ? ? unsigned short cgroup_changed;
> +#endif
> +
> ? ? ? ?/*
> ? ? ? ? * For request batching
> ? ? ? ? */
> --
> 1.6.2.5
>
>

2009-12-01 22:29:35

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Sun, Nov 29, 2009 at 09:59:07PM -0500, Vivek Goyal wrote:
> Hi Jens,
>
> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
> of block tree.
>
> A consolidated patch can be found here:
>
> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>

Hi All,

Here are some test results with V4 of the patches. Alan, I have tried to
create tables like yours to get some idea of what is happening.

I used one entry-level enterprise-class storage array. It has a few
rotational disks (5-6).

I have tried running sequential readers, random readers, sequential writers
and random writers in 8 cgroups with weights 100, 200, 300, 400, 500, 600,
700 and 800 respectively, to see how BW and disk time get distributed.
Cgroups are named test1, test2, test3 ... test8. All the IO is _direct_ IO,
with no buffered IO, for testing purposes.

I have also run the same test with everything in the root cgroup. So the
workload remains the same, that is, 8 instances of either seq readers,
random readers, seq writers or random writers, but everything runs in the
root cgroup instead of the test cgroups.

Some abbreviation details.

rcg--> All 8 fio jobs are running in the root cgroup.
ioc--> Each fio job is running in its respective cgroup.
gi0/1--> The /sys/block/<disk>/queue/iosched/group_isolation tunable is 0/1.
Tms--> Time in ms consumed by this group on the disk. This is obtained
from the cgroup file blkio.time.
S--> Number of sectors transferred by this group.
BW--> Aggregate BW achieved by the fio processes running either in the root
group or in the associated test group.

Summary
======
- To me the results look pretty good. We provide fairness in terms of disk
time and these numbers are pretty close. There are some glitches, but those
can be fixed by diving deeper. Nothing major.

Test Mode OT test1 test2 test3 test4 test5 test6 test7 test8
==========================================================================
rcg,gi0 seq,rd BW 1,357K 958K 1,890K 1,824K 1,898K 1,841K 1,912K 1,883K

ioc,gi0 seq,rd BW 321K 384K 1,182K 1,669K 2,181K 2,596K 2,977K 3,386K
ioc,gi0 seq,rd Tms 848 1665 2317 3234 4107 4901 5691 6611
ioc,gi0 seq,rd S 18K 23K 68K 100K 131K 156K 177K 203K

ioc,gi1 seq,rd BW 314K 307K 1,209K 1,603K 2,124K 2,562K 2,912K 3,336K
ioc,gi1 seq,rd Tms 833 1649 2476 3269 4101 4951 5743 6566
ioc,gi1 seq,rd S 18K 18K 72K 96K 127K 153K 174K 200K

----------------
rcg,gi0 rnd,rd BW 229K 225K 226K 228K 232K 224K 228K 216K

ioc,gi0 rnd,rd BW 234K 217K 221K 223K 235K 217K 214K 217K
ioc,gi0 rnd,rd Tms 20 21 50 85 41 52 51 92
ioc,gi0 rnd,rd S 0K 0K 0K 0K 0K 0K 0K 0K

ioc,gi1 rnd,rd BW 11K 22K 30K 39K 49K 55K 69K 80K
ioc,gi1 rnd,rd Tms 666 1301 1956 2617 3281 3901 4588 5215
ioc,gi1 rnd,rd S 1K 2K 3K 3K 4K 5K 5K 6K

Note:
- With group_isolation=0, all the random readers move to the root cgroup
automatically. Hence we don't see the disk time consumed or the number of
sectors transferred per test group; everything is in the root cgroup and
there is no service differentiation in this case.

- With group_isolation=1, we see service differentiation, but we also see a
tremendous overall throughput drop. This happens because now every group
gets exclusive access to the disk and a group does not have enough traffic
to keep the disk busy. So group_isolation=1 provides stronger isolation but
also brings throughput down if the groups don't have enough IO to do (a
sketch of what this tunable implies inside CFQ follows below).
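
A minimal sketch of what group_isolation=0 implies inside CFQ, for
illustration only (the tunable field cfq_group_isolation and the helper
cfq_move_cfqq_to_root() are approximations of the V4 code, not a verbatim
excerpt):

        /*
         * With group_isolation=0, a queue that would land on the
         * sync-noidle service tree of a non-root group is parked in
         * the root group instead, so all such seeky queues share one
         * collective idle window and the disk stays busy.
         */
        if (!cfqd->cfq_group_isolation &&
            cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD &&
            cfqq->cfqg != &cfqd->root_group)
                cfq_move_cfqq_to_root(cfqd, cfqq);

With group_isolation=1 this move is skipped, every group keeps its own
sync-noidle queues, and CFQ ends up idling per group, which is where the
throughput drop above comes from.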

----------------
rcg,gi0 seq,wr BW 1,748K 1,042K 2,131K 1,211K 1,170K 1,189K 1,262K 1,050K

ioc,gi0 seq,wr BW 294K 550K 1,048K 1,091K 1,666K 1,651K 2,137K 2,642K
ioc,gi0 seq,wr Tms 826 1484 2793 2943 4431 4459 5595 6989
ioc,gi0 seq,wr S 17K 31K 62K 65K 100K 99K 125K 158K

ioc,gi1 seq,wr BW 319K 603K 988K 1,174K 1,510K 1,871K 2,179K 2,567K
ioc,gi1 seq,wr Tms 891 1620 2592 3117 3969 4901 5722 6690
ioc,gi1 seq,wr S 19K 36K 59K 70K 90K 112K 130K 154K

Note:
- In the case of sequential writes, files have been preallocated so that
interference from kjournald is minimal and we see service differentiation.

----------------
rcg,gi0 rnd,wr BW 1,349K 1,417K 1,034K 1,018K 910K 1,301K 1,443K 1,387K

ioc,gi0 rnd,wr BW 319K 542K 837K 1,086K 1,389K 1,673K 1,932K 2,215K
ioc,gi0 rnd,wr Tms 926 1547 2353 3058 3843 4511 5228 6030
ioc,gi0 rnd,wr S 19K 32K 50K 65K 83K 98K 112K 130K

ioc,gi1 rnd,wr BW 299K 603K 843K 1,156K 1,467K 1,717K 2,002K 2,327K
ioc,gi1 rnd,wr Tms 845 1641 2286 3114 3922 4629 5364 6289
ioc,gi1 rnd,wr S 18K 36K 50K 69K 88K 103K 120K 139K

Thanks
Vivek

2009-12-02 01:55:41

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: Block IO Controller V4

Vivek Goyal wrote:
> Hi Jens,
>
> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
> of block tree.
>
> A consolidated patch can be found here:
>
> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>

Hi Vivek,

It seems this version doesn't work very well for the "direct (O_DIRECT)
sequential read" mode. For example, create group A and group B, assign
weight 100 to group A and weight 400 to group B, and run a "direct
sequential read" workload in groups A and B simultaneously. Ideally, we
should see a 1:4 disk time differentiation between groups A and B. But
actually, I see almost a 1:2 disk time differentiation between them. I'm
looking into this issue.
BTW, V3 works well for this case.

Thanks,
Gui

2009-12-02 14:08:37

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 01/21] blkio: Set must_dispatch only if we decided to not dispatch the request

Vivek Goyal <[email protected]> writes:

> o must_dispatch flag should be set only if we decided not to run the queue
> and dispatch the request.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index a5de31f..9adfa48 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2494,9 +2494,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
> cfqd->busy_queues > 1) {
> del_timer(&cfqd->idle_slice_timer);
> - __blk_run_queue(cfqd->queue);
> - }
> - cfq_mark_cfqq_must_dispatch(cfqq);
> + __blk_run_queue(cfqd->queue);
> + } else
> + cfq_mark_cfqq_must_dispatch(cfqq);
> }
> } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> /*

This makes sense, but could be submitted as a bugfix on its own.

Reviewed-by: Jeff Moyer <[email protected]>

2009-12-02 14:27:07

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi Jens,
> >
> > This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
> > of block tree.
> >
> > A consolidated patch can be found here:
> >
> > http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
> >
>
> Hi Vivek,
>
> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
> For example, you can create group A and group B, then assign weight 100 to group A and
> weight 400 to group B, and you run "direct sequence read" workload in group A and B
> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
> into this issue.
> BTW, V3 works well for this case.

Hi Gui,

In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seem to
be working fine.

http://lkml.org/lkml/2009/12/1/367

I suspect that in some cases we choose not to idle on the group and it gets
deleted from the service tree, hence we lose share. Can you have a look at
the blkio.dequeue files? If there are excessive deletions, that will signify
that we are losing share because we chose not to idle.

If yes, please also run blktrace to see in what cases we chose not to
idle.

In V3, I had a stronger check to idle on the group if it is empty, using the
wait_busy() function. In V4 I have removed that and instead try to wait busy
on a queue by extending its slice once it has consumed its allocated slice.

Thanks
Vivek

2009-12-02 15:29:35

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 06/21] blkio: Introduce blkio controller cgroup interface

On Tue, Dec 01, 2009 at 05:34:37AM +0530, Divyesh Shah wrote:
> On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> > o This is basic implementation of blkio controller cgroup interface. This is
> >  the common interface visible to user space and should be used by different
> >  IO control policies as we implement those.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> >  block/Kconfig                 |   13 +++
> >  block/Kconfig.iosched         |    1 +
> >  block/Makefile                |    1 +
> >  block/blk-cgroup.c            |  177 +++++++++++++++++++++++++++++++++++++++++
> >  block/blk-cgroup.h            |   58 +++++++++++++
> >  include/linux/cgroup_subsys.h |    6 ++
> >  include/linux/iocontext.h     |    4 +
> >  7 files changed, 260 insertions(+), 0 deletions(-)
> >  create mode 100644 block/blk-cgroup.c
> >  create mode 100644 block/blk-cgroup.h
> >
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 9be0b56..6ba1a8e 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
> >         T10/SCSI Data Integrity Field or the T13/ATA External Path
> >         Protection.  If in doubt, say N.
> >
> > +config BLK_CGROUP
> > +       bool
> > +       depends on CGROUPS
> > +       default n
> > +       ---help---
> > +       Generic block IO controller cgroup interface. This is the common
> > +       cgroup interface which should be used by various IO controlling
> > +       policies.
> > +
> > +       Currently, CFQ IO scheduler uses it to recognize task groups and
> > +       control disk bandwidth allocation (proportional time slice allocation)
> > +       to such task groups.
> > +
> >  endif # BLOCK
> >
> >  config BLOCK_COMPAT
> > diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> > index 8bd1051..be0280d 100644
> > --- a/block/Kconfig.iosched
> > +++ b/block/Kconfig.iosched
> > @@ -23,6 +23,7 @@ config IOSCHED_DEADLINE
> >
> >  config IOSCHED_CFQ
> >         tristate "CFQ I/O scheduler"
> > +       select BLK_CGROUP
> >         default y
> >         ---help---
> >           The CFQ I/O scheduler tries to distribute bandwidth equally
> > diff --git a/block/Makefile b/block/Makefile
> > index 7914108..cb2d515 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
> >                         blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o
> >
> >  obj-$(CONFIG_BLK_DEV_BSG)      += bsg.o
> > +obj-$(CONFIG_BLK_CGROUP)       += blk-cgroup.o
> >  obj-$(CONFIG_IOSCHED_NOOP)     += noop-iosched.o
> >  obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> >  obj-$(CONFIG_IOSCHED_CFQ)      += cfq-iosched.o
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > new file mode 100644
> > index 0000000..4f6afd7
> > --- /dev/null
> > +++ b/block/blk-cgroup.c
> > @@ -0,0 +1,177 @@
> > +/*
> > + * Common Block IO controller cgroup interface
> > + *
> > + * Based on ideas and code from CFQ, CFS and BFQ:
> > + * Copyright (C) 2003 Jens Axboe <[email protected]>
> > + *
> > + * Copyright (C) 2008 Fabio Checconi <[email protected]>
> > + *                    Paolo Valente <[email protected]>
> > + *
> > + * Copyright (C) 2009 Vivek Goyal <[email protected]>
> > + *                    Nauman Rafique <[email protected]>
> > + */
> > +#include <linux/ioprio.h>
> > +#include "blk-cgroup.h"
> > +
> > +struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
>
> This should use BLKIO_WEIGHT_MAX as 2*BLKIO_WEIGHT_DEFAULT is same as
> BLKIO_WEIGHT_MAX unless there is a reason why you would want the value
> to remain as a multiple of default_weight instead of max in case the
> constants change later.
>

Hi Divyesh,

Every new group gets BLKIO_WEIGHT_DEFAULT as its weight. For the root group,
by default, I wanted to give it double the weight to begin with, as it will
have all the async queues, will also have all the sync-noidle queues
(in the group_isolation=0 case) and will be doing all the system IO.

It is a separate matter that 2*BLKIO_WEIGHT_DEFAULT happens to be the same
as BLKIO_WEIGHT_MAX.

So functionality-wise it does not change anything. It is just a matter of
which constant is more intuitive as the root group default.

To me 2*BLKIO_WEIGHT_DEFAULT looks just fine, so I will keep it as it is,
unless you really think that it is not intuitive and BLKIO_WEIGHT_MAX is
more so.

Thanks
Vivek

2009-12-02 15:44:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 05/21] blkio: Introduce the root service tree for cfq groups

On Tue, Dec 01, 2009 at 05:25:36AM +0530, Divyesh Shah wrote:
> On Mon, Nov 30, 2009 at 8:29 AM, Vivek Goyal <[email protected]> wrote:
> > o So far we just had one cfq_group in cfq_data. To create space for more than
> >  one cfq_group, we need to have a service tree of groups where all the groups
> >  can be queued if they have active cfq queues backlogged in these.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> >  block/cfq-iosched.c |  136 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 files changed, 133 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index a0d0a83..0a284be 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -77,8 +77,9 @@ struct cfq_rb_root {
> >         struct rb_root rb;
> >         struct rb_node *left;
> >         unsigned count;
> > +       u64 min_vdisktime;
> >  };
> > -#define CFQ_RB_ROOT    (struct cfq_rb_root) { RB_ROOT, NULL, 0, }
> > +#define CFQ_RB_ROOT    (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }
> >
> >  /*
> >  * Per process-grouping structure
> > @@ -156,6 +157,16 @@ enum wl_type_t {
> >
> >  /* This is per cgroup per device grouping structure */
> >  struct cfq_group {
> > +       /* group service_tree member */
> > +       struct rb_node rb_node;
> > +
> > +       /* group service_tree key */
> > +       u64 vdisktime;
> > +       bool on_st;
> > +
> > +       /* number of cfqq currently on this group */
> > +       int nr_cfqq;
> > +
> >         /*
> >          * rr lists of queues with requests, onle rr for each priority class.
> >          * Counts are embedded in the cfq_rb_root
> > @@ -169,6 +180,8 @@ struct cfq_group {
> >  */
> >  struct cfq_data {
> >         struct request_queue *queue;
> > +       /* Root service tree for cfq_groups */
> > +       struct cfq_rb_root grp_service_tree;
> >         struct cfq_group root_group;
> >
> >         /*
> > @@ -251,6 +264,9 @@ static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> >                                            enum wl_type_t type,
> >                                            struct cfq_data *cfqd)
> >  {
> > +       if (!cfqg)
> > +               return NULL;
> > +
> >         if (prio == IDLE_WORKLOAD)
> >                 return &cfqg->service_tree_idle;
> >
> > @@ -587,6 +603,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
> >         return NULL;
> >  }
> >
> > +static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
> > +{
> > +       if (!root->left)
> > +               root->left = rb_first(&root->rb);
> > +
> > +       if (root->left)
> > +               return rb_entry(root->left, struct cfq_group, rb_node);
>
> Can you please define a cfqg_entry macro and reuse that at different
> places in the code?
> #define cfqg_entry(ptr) rb_entry(ptr, struct cfq_group, rb_node)

Ok, I have introduced rb_entry_cfqg() along the lines of rb_entry_rq().

The changes are in two patches. I will repost those patches in the same
thread soon.

Thanks
Vivek

2009-12-02 15:51:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 05/21] blkio: Introduce the root service tree for cfq groups

On Sun, Nov 29, 2009 at 09:59:12PM -0500, Vivek Goyal wrote:
> o So far we just had one cfq_group in cfq_data. To create space for more than
> one cfq_group, we need to have a service tree of groups where all the groups
> can be queued if they have active cfq queues backlogged in these.
>

Reposting this patch. Introduced a new macro rb_entry_cfqg() along the
lines of rb_entry_rq().


o So far we just had one cfq_group in cfq_data. To create space for more than
one cfq_group, we need to have a service tree of groups where all the groups
can be queued if they have active cfq queues backlogged in these.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 134 insertions(+), 3 deletions(-)

Index: linux10/block/cfq-iosched.c
===================================================================
--- linux10.orig/block/cfq-iosched.c 2009-11-30 17:22:57.000000000 -0500
+++ linux10/block/cfq-iosched.c 2009-12-02 10:47:17.000000000 -0500
@@ -66,6 +66,7 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)

#define sample_valid(samples) ((samples) > 80)
+#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)

/*
* Most of our rbtree usage is for sorting with min extraction, so
@@ -77,8 +78,9 @@ struct cfq_rb_root {
struct rb_root rb;
struct rb_node *left;
unsigned count;
+ u64 min_vdisktime;
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, }
+#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

/*
* Per process-grouping structure
@@ -156,6 +158,16 @@ enum wl_type_t {

/* This is per cgroup per device grouping structure */
struct cfq_group {
+ /* group service_tree member */
+ struct rb_node rb_node;
+
+ /* group service_tree key */
+ u64 vdisktime;
+ bool on_st;
+
+ /* number of cfqq currently on this group */
+ int nr_cfqq;
+
/*
* rr lists of queues with requests, onle rr for each priority class.
* Counts are embedded in the cfq_rb_root
@@ -169,6 +181,8 @@ struct cfq_group {
*/
struct cfq_data {
struct request_queue *queue;
+ /* Root service tree for cfq_groups */
+ struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;

/*
@@ -251,6 +265,9 @@ static struct cfq_rb_root *service_tree_
enum wl_type_t type,
struct cfq_data *cfqd)
{
+ if (!cfqg)
+ return NULL;
+
if (prio == IDLE_WORKLOAD)
return &cfqg->service_tree_idle;

@@ -589,6 +606,17 @@ static struct cfq_queue *cfq_rb_first(st
return NULL;
}

+static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+{
+ if (!root->left)
+ root->left = rb_first(&root->rb);
+
+ if (root->left)
+ return rb_entry_cfqg(root->left);
+
+ return NULL;
+}
+
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
@@ -645,6 +673,83 @@ static unsigned long cfq_slice_offset(st
cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

+static inline s64
+cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ return cfqg->vdisktime - st->min_vdisktime;
+}
+
+static void
+__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ struct rb_node **node = &st->rb.rb_node;
+ struct rb_node *parent = NULL;
+ struct cfq_group *__cfqg;
+ s64 key = cfqg_key(st, cfqg);
+ int left = 1;
+
+ while (*node != NULL) {
+ parent = *node;
+ __cfqg = rb_entry_cfqg(parent);
+
+ if (key < cfqg_key(st, __cfqg))
+ node = &parent->rb_left;
+ else {
+ node = &parent->rb_right;
+ left = 0;
+ }
+ }
+
+ if (left)
+ st->left = &cfqg->rb_node;
+
+ rb_link_node(&cfqg->rb_node, parent, node);
+ rb_insert_color(&cfqg->rb_node, &st->rb);
+}
+
+static void
+cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *__cfqg;
+ struct rb_node *n;
+
+ cfqg->nr_cfqq++;
+ if (cfqg->on_st)
+ return;
+
+ /*
+ * Currently put the group at the end. Later implement something
+ * so that groups get lesser vtime based on their weights, so that
+ * if group does not loose all if it was not continously backlogged.
+ */
+ n = rb_last(&st->rb);
+ if (n) {
+ __cfqg = rb_entry_cfqg(n);
+ cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ } else
+ cfqg->vdisktime = st->min_vdisktime;
+
+ __cfq_group_service_tree_add(st, cfqg);
+ cfqg->on_st = true;
+}
+
+static void
+cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ BUG_ON(cfqg->nr_cfqq < 1);
+ cfqg->nr_cfqq--;
+ /* If there are other cfq queues under this group, don't delete it */
+ if (cfqg->nr_cfqq)
+ return;
+
+ cfqg->on_st = false;
+ if (!RB_EMPTY_NODE(&cfqg->rb_node))
+ cfq_rb_erase(&cfqg->rb_node, st);
+}
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -727,6 +832,7 @@ static void cfq_service_tree_add(struct
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
+ cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}

static struct cfq_queue *
@@ -837,6 +943,7 @@ static void cfq_del_cfqq_rr(struct cfq_d
cfqq->p_root = NULL;
}

+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
BUG_ON(!cfqd->busy_queues);
cfqd->busy_queues--;
}
@@ -1113,6 +1220,9 @@ static struct cfq_queue *cfq_get_next_qu
if (!cfqd->rq_queued)
return NULL;

+ /* There is nothing to dispatch */
+ if (!service_tree)
+ return NULL;
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
return cfq_rb_first(service_tree);
@@ -1482,6 +1592,12 @@ static void choose_service_tree(struct c
unsigned count;
struct cfq_rb_root *st;

+ if (!cfqg) {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return;
+ }
+
/* Choose next priority. RT > BE > IDLE */
if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
cfqd->serving_prio = RT_WORKLOAD;
@@ -1540,10 +1656,21 @@ static void choose_service_tree(struct c
cfqd->noidle_tree_requires_idle = false;
}

+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ if (RB_EMPTY_ROOT(&st->rb))
+ return NULL;
+ return cfq_rb_first_group(st);
+}
+
static void cfq_choose_cfqg(struct cfq_data *cfqd)
{
- cfqd->serving_group = &cfqd->root_group;
- choose_service_tree(cfqd, &cfqd->root_group);
+ struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
+
+ cfqd->serving_group = cfqg;
+ choose_service_tree(cfqd, cfqg);
}

/*
@@ -3019,10 +3146,14 @@ static void *cfq_init_queue(struct reque
if (!cfqd)
return NULL;

+ /* Init root service tree */
+ cfqd->grp_service_tree = CFQ_RB_ROOT;
+
/* Init root group */
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);

/*
* Not strictly needed (since RB_ROOT just clears the node and we
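
A side note on the cfqg_key() hunk above: the key is a u64 subtraction
returned as s64, presumably so that ordering stays correct even if the
virtual disk time counters wrap, the same trick CFS uses for vruntime.
A small stand-alone sketch, not part of the patch, illustrating the idea:

#include <stdio.h>
#include <stdint.h>

/*
 * Same idea as cfqg_key(): compare u64 virtual times through a signed
 * difference so the ordering survives wraparound of the unsigned counters.
 */
static int64_t key(uint64_t vdisktime, uint64_t min_vdisktime)
{
	return (int64_t)(vdisktime - min_vdisktime);
}

int main(void)
{
	uint64_t min = UINT64_MAX - 5;	/* min_vdisktime just before wrap */
	uint64_t a   = UINT64_MAX - 2;	/* has not wrapped yet */
	uint64_t b   = 3;		/* has already wrapped past zero */

	/* Both keys come out small and positive, and b still sorts after a. */
	printf("key(a)=%lld key(b)=%lld\n",
	       (long long)key(a, min), (long long)key(b, min));
	return 0;
}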

2009-12-02 15:52:15

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 07/21] blkio: Introduce per cfq group weights and vdisktime calculations

On Sun, Nov 29, 2009 at 09:59:14PM -0500, Vivek Goyal wrote:
> o Bring in the per cfq group weight and how vdisktime is calculated for the
> group. Also bring in the functionality of updating the min_vdisktime of
> the group service tree.
>

Reposting this patch to make use of newly introduced rb_entry_cfqg().


o Bring in the per cfq group weight and how vdisktime is calculated for the
group. Also bring in the functionality of updating the min_vdisktime of
the group service tree.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig.iosched | 9 ++++++-
block/cfq-iosched.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 69 insertions(+), 2 deletions(-)

Index: linux10/block/cfq-iosched.c
===================================================================
--- linux10.orig/block/cfq-iosched.c 2009-12-02 10:47:17.000000000 -0500
+++ linux10/block/cfq-iosched.c 2009-12-02 10:47:53.000000000 -0500
@@ -13,6 +13,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include "blk-cgroup.h"

/*
* tunables
@@ -49,6 +50,7 @@ static const int cfq_hist_divisor = 4;

#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)
+#define CFQ_SERVICE_SHIFT 12

#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
@@ -79,6 +81,7 @@ struct cfq_rb_root {
struct rb_node *left;
unsigned count;
u64 min_vdisktime;
+ struct rb_node *active;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

@@ -163,6 +166,7 @@ struct cfq_group {

/* group service_tree key */
u64 vdisktime;
+ unsigned int weight;
bool on_st;

/* number of cfqq currently on this group */
@@ -434,6 +438,51 @@ cfq_prio_to_slice(struct cfq_data *cfqd,
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

+static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+{
+ u64 d = delta << CFQ_SERVICE_SHIFT;
+
+ d = d * BLKIO_WEIGHT_DEFAULT;
+ do_div(d, cfqg->weight);
+ return d;
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct cfq_rb_root *st)
+{
+ u64 vdisktime = st->min_vdisktime;
+ struct cfq_group *cfqg;
+
+ if (st->active) {
+ cfqg = rb_entry_cfqg(st->active);
+ vdisktime = cfqg->vdisktime;
+ }
+
+ if (st->left) {
+ cfqg = rb_entry_cfqg(st->left);
+ vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
/*
* get averaged number of queues of RT/BE priority.
* average is updated, with a formula that gives more weight to higher numbers,
@@ -739,8 +788,12 @@ cfq_group_service_tree_del(struct cfq_da
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;

+ if (st->active == &cfqg->rb_node)
+ st->active = NULL;
+
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
+
/* If there are other cfq queues under this group, don't delete it */
if (cfqg->nr_cfqq)
return;
@@ -1659,10 +1712,14 @@ static void choose_service_tree(struct c
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *cfqg;

if (RB_EMPTY_ROOT(&st->rb))
return NULL;
- return cfq_rb_first_group(st);
+ cfqg = cfq_rb_first_group(st);
+ st->active = &cfqg->rb_node;
+ update_min_vdisktime(st);
+ return cfqg;
}

static void cfq_choose_cfqg(struct cfq_data *cfqd)
@@ -3155,6 +3212,9 @@ static void *cfq_init_queue(struct reque
*st = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);

+ /* Give preference to root group over other groups */
+ cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
Index: linux10/block/Kconfig.iosched
===================================================================
--- linux10.orig/block/Kconfig.iosched 2009-12-02 10:47:22.000000000 -0500
+++ linux10/block/Kconfig.iosched 2009-12-02 10:47:24.000000000 -0500
@@ -23,7 +23,6 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
- select BLK_CGROUP
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
@@ -33,6 +32,14 @@ config IOSCHED_CFQ

This is the default I/O scheduler.

+config CFQ_GROUP_IOSCHED
+ bool "CFQ Group Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select BLK_CGROUP
+ default n
+ ---help---
+ Enable group IO scheduling in CFQ.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
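
To make the weight handling above concrete: cfq_scale_slice() charges a
group vdisktime in inverse proportion to its weight, so a group with four
times the weight accumulates vdisktime a quarter as fast and is selected
roughly four times as often. Below is a small stand-alone sketch, not part
of the patch; the BLKIO_WEIGHT_DEFAULT value of 500 is assumed purely for
illustration (the thread only states that twice the default equals the
maximum):

#include <stdio.h>

#define CFQ_SERVICE_SHIFT	12
#define BLKIO_WEIGHT_DEFAULT	500	/* assumed value, for illustration only */

/* Mirrors cfq_scale_slice(): vdisktime charged for 'delta' jiffies of service */
static unsigned long long scale_slice(unsigned long delta, unsigned int weight)
{
	unsigned long long d = (unsigned long long)delta << CFQ_SERVICE_SHIFT;

	return d * BLKIO_WEIGHT_DEFAULT / weight;
}

int main(void)
{
	/* Two groups each served 100 jiffies, with weights 100 and 400 */
	printf("charge(weight=100) = %llu\n", scale_slice(100, 100));
	printf("charge(weight=400) = %llu\n", scale_slice(100, 400));
	/*
	 * The weight-400 group is charged a quarter of the vdisktime, so it
	 * gets picked about four times as often: the 1:4 disk time split
	 * expected in the weight 100 vs 400 test discussed in this thread.
	 */
	return 0;
}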

2009-12-02 19:14:55

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Nov 30, 2009 at 06:50:25PM -0500, Alan D. Brunelle wrote:
> On Mon, 2009-11-30 at 17:56 -0500, Vivek Goyal wrote:
> > On Mon, Nov 30, 2009 at 05:00:33PM -0500, Alan D. Brunelle wrote:
> > > FYI: Results today from my test suite - haven't had time to parse them
> > > in any depth...
> >
> > Thanks Alan. I am trying to parse the results below. s0 and s8 still mean
> > slice idle enabled disabled? Instead of that we can try group_isolation
> > enabled or disabled for all the tests.
>
> I'm concerned as well - I think I did the base run /after/ the i1,s0
> run. I'll need to check that out tomorrow...
>

Hi Alan, Gui,

Currently fio does not have support for putting jobs in the appropriate
cgroups, so I wrote scripts to move jobs into the right cgroup. But this
had the issue of synchronizing the threads so that they all start at the
same time after laying out the files. I had written some programs to
synchronize on an external semaphore, and it was getting complicated.

Now I have written this small hacky patch that modifies fio to launch a job
in a specified cgroup. Just use the following in your job file or command line.

cgroup=<cgroup dir path>

In my initial testing it seems to be working both from the command line and
from a job file. I will test it more. I am sending the patch in this thread
because you and Gui seem to be testing this stuff, and it will make life a
little easier.

Thanks
Vivek


o A simple patch to run a fio job in a specific cgroup. It is a simple hack
with not much error checking, so make sure the cgroup exists.

Signed-off-by: Vivek Goyal <[email protected]>
---
HOWTO | 3 +++
fio.c | 8 ++++++++
fio.h | 1 +
options.c | 7 +++++++
4 files changed, 19 insertions(+)

Index: fio/options.c
===================================================================
--- fio.orig/options.c 2009-12-02 11:10:36.000000000 -0500
+++ fio/options.c 2009-12-02 13:40:49.000000000 -0500
@@ -1727,6 +1727,13 @@ static struct fio_option options[] = {
.help = "Select a specific builtin performance test",
},
{
+ .name = "cgroup",
+ .type = FIO_OPT_STR_STORE,
+ .off1 = td_var_offset(cgroup),
+ .cb = str_directory_cb,
+ .help = "cgroup directory to run the job in",
+ },
+ {
.name = NULL,
},
};
Index: fio/fio.h
===================================================================
--- fio.orig/fio.h 2009-12-02 11:10:36.000000000 -0500
+++ fio/fio.h 2009-12-02 13:42:26.000000000 -0500
@@ -271,6 +271,7 @@ struct thread_options {
* Benchmark profile type
*/
unsigned int profile;
+ char *cgroup;
};

#define FIO_VERROR_SIZE 128
Index: fio/fio.c
===================================================================
--- fio.orig/fio.c 2009-12-02 11:10:36.000000000 -0500
+++ fio/fio.c 2009-12-02 13:51:16.000000000 -0500
@@ -1025,6 +1025,14 @@ static void *thread_main(void *data)
pthread_cond_init(&td->verify_cond, &attr);
pthread_cond_init(&td->free_cond, &attr);

+ /* Move thread to right cgroup */
+ if (td->o.cgroup) {
+ char str[50];
+ sprintf(str, "echo %d > %s/tasks", td->pid, td->o.cgroup);
+ if (system(str) == -1)
+ log_err("fio: exec of cmd <%s> failed\n", str);
+ }
+
td_set_runstate(td, TD_INITIALIZED);
dprint(FD_MUTEX, "up startup_mutex\n");
fio_mutex_up(startup_mutex);
Index: fio/HOWTO
===================================================================
--- fio.orig/HOWTO 2009-12-02 11:10:36.000000000 -0500
+++ fio/HOWTO 2009-12-02 14:04:55.000000000 -0500
@@ -1003,6 +1003,9 @@ continue_on_error=bool Normally fio will
given in the stats is the first error that was hit during the
run.

+cgroup=str Specify the cgroup directory in which a job should run.
+ ex. cgroup=/cgroup/blkio/test1/
+

6.0 Interpreting the output
---------------------------
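
For reference, a minimal job-file sketch showing how the new option would be
used; the job name and file path are hypothetical, and it assumes the blkio
cgroup hierarchy is mounted at /cgroup/blkio with a test1 group already
created and its weight set through the blkio cgroup interface:

[seqread-test1]
filename=/mnt/test/file1
rw=read
bs=128k
direct=1
runtime=30
cgroup=/cgroup/blkio/test1/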

2009-12-03 08:45:56

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: Block IO Controller V4

Vivek Goyal wrote:
> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi Jens,
>>>
>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>> of block tree.
>>>
>>> A consolidated patch can be found here:
>>>
>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>
>> Hi Vivek,
>>
>> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
>> For example, you can create group A and group B, then assign weight 100 to group A and
>> weight 400 to group B, and you run "direct sequence read" workload in group A and B
>> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
>> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
>> into this issue.
>> BTW, V3 works well for this case.
>
> Hi Gui,
>
> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
> be working fine.
>
> http://lkml.org/lkml/2009/12/1/367
>
> I suspect that in some case we choose not to idle on the group and it gets
> deleted from service tree hence we loose share. Can you have a look at
> blkio.dequeue files. If there are excessive deletions, that will signify
> that we are loosing share because we chose not to idle.
>
> If yes, please also run blktrace to see in what cases we chose not to
> idle.
>
> In V3, I had a stronger check to idle on the group if it is empty using
> wait_busy() function. In V4 I have removed that and trying to wait busy
> on a queue by extending its slice if it has consumed its allocated slice.

Hi Vivek,

I checked the blktrace output; it seems that the io group was deleted all the
time because we don't have group idle any more. I pulled the wait_busy code
back into V4 and retested, and the problem seems to have disappeared.

So I suggest that we retain the wait_busy code.

Thanks,
Gui

2009-12-03 14:38:42

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> Hi Jens,
> >>>
> >>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
> >>> of block tree.
> >>>
> >>> A consolidated patch can be found here:
> >>>
> >>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
> >>>
> >> Hi Vivek,
> >>
> >> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
> >> For example, you can create group A and group B, then assign weight 100 to group A and
> >> weight 400 to group B, and you run "direct sequence read" workload in group A and B
> >> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
> >> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
> >> into this issue.
> >> BTW, V3 works well for this case.
> >
> > Hi Gui,
> >
> > In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
> > be working fine.
> >
> > http://lkml.org/lkml/2009/12/1/367
> >
> > I suspect that in some case we choose not to idle on the group and it gets
> > deleted from service tree hence we loose share. Can you have a look at
> > blkio.dequeue files. If there are excessive deletions, that will signify
> > that we are loosing share because we chose not to idle.
> >
> > If yes, please also run blktrace to see in what cases we chose not to
> > idle.
> >
> > In V3, I had a stronger check to idle on the group if it is empty using
> > wait_busy() function. In V4 I have removed that and trying to wait busy
> > on a queue by extending its slice if it has consumed its allocated slice.
>
> Hi Vivek,
>
> I ckecked the blktrace output, it seems that io group was deleted all the time,
> because we don't have group idle any more. I pulled the wait_busy code back to
> V4, and retest it, problem seems disappeared.
>
> So i suggest that we need to retain the wait_busy code.

Hi Gui,

We need to figure out why the existing code is not working on your system.
In V4, I introduced the functionality to extend the slice by slice_idle
so that we will arm slice idle timer and wait for new request to come in
and then expire the queue. Following is the code to extend the slice.

/*
* If this queue consumed its slice and this is last queue
* in the group, wait for next request before we expire
* the queue
*/
if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
cfq_mark_cfqq_wait_busy(cfqq);
}

One loophole I see is that I extend the slice only if the current slice has
been used. If we are on the boundary and the slice has not been used yet,
then I will not extend the slice. We also might not arm the timer, thinking
that the remaining slice is less than the think time of the process, and
that can lead to expiry of the queue. To rule out this possibility, can you
remove the following code in arm_slice_timer() and try again.

/*
* If our average think time is larger than the remaining time
* slice, then don't idle. This avoids overrunning the allotted
* time slice.
*/
if (sample_valid(cic->ttime_samples) &&
(cfqq->slice_end - jiffies < cic->ttime_mean))
return;

The other possibility is that at request completion time the slice has not
expired, hence we don't extend the slice and arm the timer, but then
select_queue() hits and by that time the slice has expired and we expire the
queue. I thought this would not happen very frequently.

Can you figure out what is happening on your system, i.e. why we are not
doing wait-busy on the queue/group (the new queue wait_busy and
wait_busy_done flags) and instead expiring the queue and hence the group?

You can send your blktrace logs to me also. I can also try figuring out
what is happening.

Thanks
Vivek

2009-12-03 18:11:59

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Thu, Dec 03, 2009 at 09:36:41AM -0500, Vivek Goyal wrote:
> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > > On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
> > >> Vivek Goyal wrote:
> > >>> Hi Jens,
> > >>>
> > >>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
> > >>> of block tree.
> > >>>
> > >>> A consolidated patch can be found here:
> > >>>
> > >>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
> > >>>
> > >> Hi Vivek,
> > >>
> > >> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
> > >> For example, you can create group A and group B, then assign weight 100 to group A and
> > >> weight 400 to group B, and you run "direct sequence read" workload in group A and B
> > >> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
> > >> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
> > >> into this issue.
> > >> BTW, V3 works well for this case.
> > >
> > > Hi Gui,
> > >
> > > In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
> > > be working fine.
> > >
> > > http://lkml.org/lkml/2009/12/1/367
> > >
> > > I suspect that in some case we choose not to idle on the group and it gets
> > > deleted from service tree hence we loose share. Can you have a look at
> > > blkio.dequeue files. If there are excessive deletions, that will signify
> > > that we are loosing share because we chose not to idle.
> > >
> > > If yes, please also run blktrace to see in what cases we chose not to
> > > idle.
> > >
> > > In V3, I had a stronger check to idle on the group if it is empty using
> > > wait_busy() function. In V4 I have removed that and trying to wait busy
> > > on a queue by extending its slice if it has consumed its allocated slice.
> >
> > Hi Vivek,
> >
> > I ckecked the blktrace output, it seems that io group was deleted all the time,
> > because we don't have group idle any more. I pulled the wait_busy code back to
> > V4, and retest it, problem seems disappeared.
> >
> > So i suggest that we need to retain the wait_busy code.
>
> Hi Gui,
>
> We need to figure out why the existing code is not working on your system.
> In V4, I introduced the functionality to extend the slice by slice_idle
> so that we will arm slice idle timer and wait for new request to come in
> and then expire the queue. Following is the code to extend the slice.
>
> /*
> * If this queue consumed its slice and this is last queue
> * in the group, wait for next request before we expire
> * the queue
> */
> if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
> cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> cfq_mark_cfqq_wait_busy(cfqq);
> }
>
> One loop hole I see is that, I extend the slice only if current slice has
> been used. If if we on the boundary and slice has not been used yet, then
> I will not extend the slice. We also might not arm the timer thinking that
> remaining slice is less than think time of process and that can lead to
> expiry of queue. To rule out this possibility, can you remove following
> code in arm_slice_timer() and try it again.
>
> /*
> * If our average think time is larger than the remaining time
> * slice, then don't idle. This avoids overrunning the allotted
> * time slice.
> */
> if (sample_valid(cic->ttime_samples) &&
> (cfqq->slice_end - jiffies < cic->ttime_mean))
> return;
>
> The other possiblity is that at the request completion time slice has not
> expired hence we don't extend the slice and arm the timer. But then
> select_queue() hits and by that time slice has expired and we expire the
> queue. I thought this will not happen very frequently.
>
> Can you figure out what is happening on your system. Why we are not doing
> wait busy on the queue/group (new queue wait_busy and wait_busy_done
> flags) and instead expiring the queue and hence group.
>
> You can send your blktrace logs to me also. I can also try figuring out
> what is happening.

Hi Gui,

Can you please try following patch and see if it helps you. If not, then
we need to figure out why we choose to not idle and delete the group from
service tree.

Thanks
Vivek


Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

Index: linux11/block/cfq-iosched.c
===================================================================
--- linux11.orig/block/cfq-iosched.c 2009-12-03 11:54:50.000000000 -0500
+++ linux11/block/cfq-iosched.c 2009-12-03 12:39:26.000000000 -0500
@@ -3248,6 +3248,26 @@ static void cfq_update_hw_tag(struct cfq
cfqd->hw_tag = 0;
}

+static inline bool
+cfq_should_wait_busy(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ struct cfq_io_context *cic = cfqd->active_cic;
+
+ /* If there are other queues in the group, don't wait */
+ if (cfqq->cfqg->nr_cfqq > 1)
+ return false;
+
+ if (cfq_slice_used(cfqq))
+ return true;
+
+ /* if slice left is less than think time, wait busy */
+ if (cic && sample_valid(cic->ttime_samples)
+ && (cfqq->slice_end - jiffies < cic->ttime_mean))
+ return true;
+
+ return false;
+}
+
static void cfq_completed_request(struct request_queue *q, struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -3286,11 +3306,10 @@ static void cfq_completed_request(struct
}

/*
- * If this queue consumed its slice and this is last queue
- * in the group, wait for next request before we expire
- * the queue
+ * Should we wait for next request to come in before we expire
+ * the queue.
*/
- if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
+ if (cfq_should_wait_busy(cfqd, cfqq)) {
cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
cfq_mark_cfqq_wait_busy(cfqq);
}

2009-12-03 23:53:57

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Thu, Dec 03, 2009 at 01:10:03PM -0500, Vivek Goyal wrote:

[..]
> Hi Gui,
>
> Can you please try following patch and see if it helps you. If not, then
> we need to figure out why we choose to not idle and delete the group from
> service tree.
>

Hi Gui,

Please try this version of the patch instead of the previous one. During more
testing I saw some additional deletions where we should have waited, and the
reason was that we were hitting a boundary condition: at request completion
time the slice has not expired, but 4-5 ns later select_queue hits, the jiffy
count has incremented by then and the slice expires.

ttime_mean is not covering this condition because this workload is so
sequential that ttime_mean=0.

So I am adding a new condition: if we are into the last ms of the slice,
mark the queue wait_busy.

Thanks
Vivek

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)

Index: linux11/block/cfq-iosched.c
===================================================================
--- linux11.orig/block/cfq-iosched.c 2009-12-03 15:10:53.000000000 -0500
+++ linux11/block/cfq-iosched.c 2009-12-03 18:37:35.000000000 -0500
@@ -3248,6 +3248,29 @@ static void cfq_update_hw_tag(struct cfq
cfqd->hw_tag = 0;
}

+static inline bool
+cfq_should_wait_busy(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ struct cfq_io_context *cic = cfqd->active_cic;
+
+ /* If there are other queues in the group, don't wait */
+ if (cfqq->cfqg->nr_cfqq > 1)
+ return false;
+
+ if (cfq_slice_used(cfqq))
+ return true;
+
+ if (cfqq->slice_end - jiffies == 1)
+ return true;
+
+ /* if slice left is less than think time, wait busy */
+ if (cic && sample_valid(cic->ttime_samples)
+ && (cfqq->slice_end - jiffies < cic->ttime_mean))
+ return true;
+
+ return false;
+}
+
static void cfq_completed_request(struct request_queue *q, struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -3286,11 +3309,10 @@ static void cfq_completed_request(struct
}

/*
- * If this queue consumed its slice and this is last queue
- * in the group, wait for next request before we expire
- * the queue
+ * Should we wait for next request to come in before we expire
+ * the queue.
*/
- if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
+ if (cfq_should_wait_busy(cfqd, cfqq)) {
cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
cfq_mark_cfqq_wait_busy(cfqq);
}

2009-12-07 01:39:17

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: Block IO Controller V4

Vivek Goyal wrote:
> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> Hi Jens,
>>>>>
>>>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>>>> of block tree.
>>>>>
>>>>> A consolidated patch can be found here:
>>>>>
>>>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>>>
>>>> Hi Vivek,
>>>>
>>>> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
>>>> For example, you can create group A and group B, then assign weight 100 to group A and
>>>> weight 400 to group B, and you run "direct sequence read" workload in group A and B
>>>> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
>>>> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
>>>> into this issue.
>>>> BTW, V3 works well for this case.
>>> Hi Gui,
>>>
>>> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
>>> be working fine.
>>>
>>> http://lkml.org/lkml/2009/12/1/367
>>>
>>> I suspect that in some case we choose not to idle on the group and it gets
>>> deleted from service tree hence we loose share. Can you have a look at
>>> blkio.dequeue files. If there are excessive deletions, that will signify
>>> that we are loosing share because we chose not to idle.
>>>
>>> If yes, please also run blktrace to see in what cases we chose not to
>>> idle.
>>>
>>> In V3, I had a stronger check to idle on the group if it is empty using
>>> wait_busy() function. In V4 I have removed that and trying to wait busy
>>> on a queue by extending its slice if it has consumed its allocated slice.
>> Hi Vivek,
>>
>> I ckecked the blktrace output, it seems that io group was deleted all the time,
>> because we don't have group idle any more. I pulled the wait_busy code back to
>> V4, and retest it, problem seems disappeared.
>>
>> So i suggest that we need to retain the wait_busy code.
>
> Hi Gui,
>
> We need to figure out why the existing code is not working on your system.
> In V4, I introduced the functionality to extend the slice by slice_idle
> so that we will arm slice idle timer and wait for new request to come in
> and then expire the queue. Following is the code to extend the slice.
>
> /*
> * If this queue consumed its slice and this is last queue
> * in the group, wait for next request before we expire
> * the queue
> */
> if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
> cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> cfq_mark_cfqq_wait_busy(cfqq);
> }
>
> One loop hole I see is that, I extend the slice only if current slice has
> been used. If if we on the boundary and slice has not been used yet, then
> I will not extend the slice. We also might not arm the timer thinking that
> remaining slice is less than think time of process and that can lead to
> expiry of queue. To rule out this possibility, can you remove following
> code in arm_slice_timer() and try it again.
>
> /*
> * If our average think time is larger than the remaining time
> * slice, then don't idle. This avoids overrunning the allotted
> * time slice.
> */
> if (sample_valid(cic->ttime_samples) &&
> (cfqq->slice_end - jiffies < cic->ttime_mean))
> return;
>
> The other possiblity is that at the request completion time slice has not
> expired hence we don't extend the slice and arm the timer. But then
> select_queue() hits and by that time slice has expired and we expire the
> queue. I thought this will not happen very frequently.
>
> Can you figure out what is happening on your system. Why we are not doing
> wait busy on the queue/group (new queue wait_busy and wait_busy_done
> flags) and instead expiring the queue and hence group.

Hi Vivek,

Sorry for the late reply.
In V4, we don't have wait_busy() in select_queue(), so if there isn't any
request on this queue and no cooperator queue is available, this queue will
expire immediately. We don't get a chance to have that queue backlogged
again, so the group gets removed frequently.


> You can send your blktrace logs to me also. I can also try figuring out
> what is happening.

I think this is the most significant part of the blktrace output for this issue.

8,16 0 4024 0.642072068 3924 Q R 320708977 + 8 [rwio]
8,16 0 4025 0.642078523 3924 G R 320708977 + 8 [rwio]
8,16 0 4026 0.642082632 3924 I R 320708977 + 8 [rwio]
8,16 0 0 0.642084075 0 m N cfq3924S /test1 insert_request
8,16 0 0 0.642087062 0 m N cfq3924S /test1 dispatch_insert
8,16 0 0 0.642088250 0 m N cfq3924S /test1 dispatched a request
8,16 0 0 0.642089242 0 m N cfq3924S /test1 activate rq, drv=1
8,16 0 4027 0.642089573 3924 D R 320708977 + 8 [rwio]
8,16 0 0 0.642185679 0 m N cfq3924S /test1 slice expired t=0 <= I think this happens in select_queue()
8,16 0 0 0.642187132 0 m N cfq3924S /test1 sl_used=60 sect=2056
8,16 0 0 0.642189007 0 m N /test1 served: vt=276536888 min_vt=275308088
8,16 0 0 0.642190265 0 m N cfq3924S /test1 del_from_rr
8,16 0 0 0.642190941 0 m N /test1 del_from_rr group
8,16 0 0 0.642192600 0 m N cfq3925S /test2 set_active
8,16 0 0 0.642194414 0 m N cfq3925S /test2 fifo=(null)
8,16 0 0 0.642195296 0 m N cfq3925S /test2 dispatch_insert
8,16 0 0 0.642196709 0 m N cfq3925S /test2 dispatched a request
8,16 0 0 0.642197737 0 m N cfq3925S /test2 activate rq, drv=2
8,16 0 4028 0.642198102 3924 D R 324900545 + 8 [rwio]
8,16 0 4029 0.642204612 3924 U N [rwio] 2



>
> Thanks
> Vivek
>
>
>

--
Regards
Gui Jianfeng

2009-12-07 08:45:22

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: Block IO Controller V4

Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>>>>> Vivek Goyal wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>>>>> of block tree.
>>>>>>
>>>>>> A consolidated patch can be found here:
>>>>>>
>>>>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>>>>
>>>>> Hi Vivek,
>>>>>
>>>>> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
>>>>> For example, you can create group A and group B, then assign weight 100 to group A and
>>>>> weight 400 to group B, and you run "direct sequence read" workload in group A and B
>>>>> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B.
>>>>> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
>>>>> into this issue.
>>>>> BTW, V3 works well for this case.
>>>> Hi Gui,
>>>>
>>>> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
>>>> be working fine.
>>>>
>>>> http://lkml.org/lkml/2009/12/1/367
>>>>
>>>> I suspect that in some case we choose not to idle on the group and it gets
>>>> deleted from service tree hence we loose share. Can you have a look at
>>>> blkio.dequeue files. If there are excessive deletions, that will signify
>>>> that we are loosing share because we chose not to idle.
>>>>
>>>> If yes, please also run blktrace to see in what cases we chose not to
>>>> idle.
>>>>
>>>> In V3, I had a stronger check to idle on the group if it is empty using
>>>> wait_busy() function. In V4 I have removed that and trying to wait busy
>>>> on a queue by extending its slice if it has consumed its allocated slice.
>>> Hi Vivek,
>>>
>>> I ckecked the blktrace output, it seems that io group was deleted all the time,
>>> because we don't have group idle any more. I pulled the wait_busy code back to
>>> V4, and retest it, problem seems disappeared.
>>>
>>> So i suggest that we need to retain the wait_busy code.
>> Hi Gui,
>>
>> We need to figure out why the existing code is not working on your system.
>> In V4, I introduced the functionality to extend the slice by slice_idle
>> so that we will arm slice idle timer and wait for new request to come in
>> and then expire the queue. Following is the code to extend the slice.
>>
>> /*
>> * If this queue consumed its slice and this is last queue
>> * in the group, wait for next request before we expire
>> * the queue
>> */
>> if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
>> cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
>> cfq_mark_cfqq_wait_busy(cfqq);
>> }
>>
>> One loop hole I see is that, I extend the slice only if current slice has
>> been used. If if we on the boundary and slice has not been used yet, then
>> I will not extend the slice. We also might not arm the timer thinking that
>> remaining slice is less than think time of process and that can lead to
>> expiry of queue. To rule out this possibility, can you remove following
>> code in arm_slice_timer() and try it again.
>>
>> /*
>> * If our average think time is larger than the remaining time
>> * slice, then don't idle. This avoids overrunning the allotted
>> * time slice.
>> */
>> if (sample_valid(cic->ttime_samples) &&
>> (cfqq->slice_end - jiffies < cic->ttime_mean))
>> return;
>>
>> The other possiblity is that at the request completion time slice has not
>> expired hence we don't extend the slice and arm the timer. But then
>> select_queue() hits and by that time slice has expired and we expire the
>> queue. I thought this will not happen very frequently.
>>
>> Can you figure out what is happening on your system. Why we are not doing
>> wait busy on the queue/group (new queue wait_busy and wait_busy_done
>> flags) and instead expiring the queue and hence group.
>
> Hi Vivek,
>
> Sorry for the late reply.
> In V4, we don't have wait_busy() in select_queue(), so if there isn't any
> request on this queue and no cooperator queue available, this queue will
> expire immediately. We don't have a chance to get that queue backlogged
> again. So group will get removed frequently.

Please ignore the above.
I confirm that the cfqq is expired because it used up its time slice.

Thanks
Gui

2009-12-07 08:50:03

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: Block IO Controller V4

Vivek Goyal wrote:
> On Thu, Dec 03, 2009 at 01:10:03PM -0500, Vivek Goyal wrote:
>
> [..]
>> Hi Gui,
>>
>> Can you please try following patch and see if it helps you. If not, then
>> we need to figure out why we choose to not idle and delete the group from
>> service tree.
>>
>
> Hi Gui,
>
> Please try this version of the patch instead of previous one. During more
> testing I saw some additional deletions where we should have waited and
> the reason being that we were hitting boundary condition. At the request
> completion time slice has not expired but after 4-5 ns, select_queue hits
> and jiffy has incremented by then and slice expires.
>
> ttime_mean, is not covering this condition because this workload is so
> sequential that ttime_mean=0.
>
> So I am checking for new condition where if we are into last ms of slice,
> mark the queue wait_busy.
>
> Thanks
> Vivek
>
> Signed-off-by: Vivek Goyal <[email protected]>

Hi, Vivek

I added some debug messages in select_queue(), and it does hit the boundary
condition. I tried this patch, and it works fine on my box.

Acked-by: Gui Jianfeng <[email protected]>

Thanks,
Gui

2009-12-07 15:26:44

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Mon, Dec 07, 2009 at 04:45:49PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Thu, Dec 03, 2009 at 01:10:03PM -0500, Vivek Goyal wrote:
> >
> > [..]
> >> Hi Gui,
> >>
> >> Can you please try following patch and see if it helps you. If not, then
> >> we need to figure out why we choose to not idle and delete the group from
> >> service tree.
> >>
> >
> > Hi Gui,
> >
> > Please try this version of the patch instead of previous one. During more
> > testing I saw some additional deletions where we should have waited and
> > the reason being that we were hitting boundary condition. At the request
> > completion time slice has not expired but after 4-5 ns, select_queue hits
> > and jiffy has incremented by then and slice expires.
> >
> > ttime_mean, is not covering this condition because this workload is so
> > sequential that ttime_mean=0.
> >
> > So I am checking for new condition where if we are into last ms of slice,
> > mark the queue wait_busy.
> >
> > Thanks
> > Vivek
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
>
> Hi, Vivek
>
> I add some debug message in select_queue, it does meet the boundary condition.
> I tried this patch, and works fine on my box.
>
> Acked-by: Gui Jianfeng <[email protected]>

Thanks Gui, I will send this patch to Jens in a separate mail.

Thanks
Vivek

2009-12-08 15:17:48

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: Block IO Controller V4

Hi Vivek -

Sorry, I've been off doing other work and haven't had time to follow up
on this (until recently). I have runs based upon Jens' for-2.6.33 tree
as of commit 0d99519efef15fd0cf84a849492c7b1deee1e4b7 and your V4 patch
sequence (the refresh patch you sent me on 3 December 2009). I _think_
things look pretty darn good. There are three modes compared:

(1) base - just Jens' for-2.6.33 tree, not patched.
(2) i1,s8 - Your patches added and slice_idle set to 8 (default)
(3) i1,s0 - Your patches added and slice_idle set to 0

I did both synchronous and asynchronous runs, direct I/Os in both cases,
random and sequential, with reads, writes and 80%/20% read/write cases.
The results are in throughput (as reported by fio). The first table
shows overall test results, the other tables show breakdowns per cgroup
(disk).

Regards,
Alan

---- ---- - --------- --------- --------- --------- --------- ---------
Mode RdWr N as,base as,i1,s8 as,i1,s0 sy,base sy,i1,s8 sy,i1,s0
---- ---- - --------- --------- --------- --------- --------- ---------
rnd rd 2 39.7 39.1 43.7 20.5 20.5 20.4
rnd rd 4 33.9 33.3 41.2 28.5 28.5 28.5
rnd rd 8 23.7 25.0 36.7 34.4 34.5 34.6

rnd wr 2 66.1 67.8 68.9 71.8 71.8 71.9
rnd wr 4 57.8 62.9 66.1 64.1 64.2 64.3
rnd wr 8 39.5 47.4 60.6 54.7 54.6 54.9

rnd rdwr 2 50.2 49.1 54.5 31.1 31.1 31.1
rnd rdwr 4 41.4 41.3 50.9 38.9 39.1 39.6
rnd rdwr 8 28.1 30.5 46.3 42.5 42.6 43.8

seq rd 2 612.3 605.7 611.2 509.6 528.3 608.6
seq rd 4 614.1 606.9 606.2 493.0 490.6 615.4
seq rd 8 613.6 603.8 605.9 453.0 461.8 617.6

seq wr 2 694.6 726.1 701.2 685.8 661.8 314.2
seq wr 4 687.6 715.3 628.3 702.9 702.3 317.8
seq wr 8 695.0 710.0 629.8 704.0 708.3 339.4

seq rdwr 2 692.3 664.9 693.8 508.4 504.0 642.8
seq rdwr 4 664.5 657.1 639.3 484.5 481.0 694.3
seq rdwr 8 659.0 648.0 634.4 458.1 460.4 709.6

===============================================================

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
as,base rnd rd 2 20.0 19.7
as,base rnd rd 4 8.8 8.5 8.3 8.3
as,base rnd rd 8 3.3 3.1 3.3 3.2 2.7 2.7 2.8 2.6

as,base rnd wr 2 33.2 32.9
as,base rnd wr 4 15.9 15.2 14.5 12.3
as,base rnd wr 8 5.8 3.4 7.8 8.7 3.5 3.4 3.8 3.1

as,base rnd rdwr 2 25.0 25.2
as,base rnd rdwr 4 10.6 10.4 10.2 10.2
as,base rnd rdwr 8 3.7 3.6 4.0 4.1 3.2 3.4 3.3 2.9


as,base seq rd 2 305.9 306.4
as,base seq rd 4 159.4 160.5 147.3 146.9
as,base seq rd 8 79.7 80.0 77.3 78.4 73.0 70.0 77.5 77.7

as,base seq wr 2 348.6 346.0
as,base seq wr 4 189.9 187.6 154.7 155.3
as,base seq wr 8 87.9 88.3 84.7 85.3 84.5 85.1 90.4 88.8

as,base seq rdwr 2 347.2 345.1
as,base seq rdwr 4 181.6 181.8 150.8 150.2
as,base seq rdwr 8 83.6 82.1 82.1 82.7 80.6 82.7 82.2 82.9

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
as,i1,s8 rnd rd 2 12.7 26.3
as,i1,s8 rnd rd 4 1.2 3.7 12.2 16.3
as,i1,s8 rnd rd 8 0.5 0.8 1.2 1.7 2.1 3.5 6.7 8.4

as,i1,s8 rnd wr 2 18.5 49.3
as,i1,s8 rnd wr 4 1.0 1.6 20.7 39.6
as,i1,s8 rnd wr 8 0.5 0.7 0.9 1.2 1.7 2.5 15.5 24.5

as,i1,s8 rnd rdwr 2 16.2 32.9
as,i1,s8 rnd rdwr 4 1.2 4.7 15.6 19.9
as,i1,s8 rnd rdwr 8 0.6 0.8 1.1 1.7 2.1 3.4 9.4 11.5

as,i1,s8 seq rd 2 202.7 403.0
as,i1,s8 seq rd 4 92.1 114.7 182.4 217.6
as,i1,s8 seq rd 8 38.7 76.2 74.0 73.9 74.5 74.7 84.7 107.0

as,i1,s8 seq wr 2 243.8 482.3
as,i1,s8 seq wr 4 107.7 155.5 200.4 251.7
as,i1,s8 seq wr 8 52.1 77.2 81.9 80.8 89.6 99.9 109.8 118.7

as,i1,s8 seq rdwr 2 225.8 439.1
as,i1,s8 seq rdwr 4 103.2 140.2 186.5 227.2
as,i1,s8 seq rdwr 8 50.3 77.4 77.5 78.9 80.5 83.9 94.3 105.2

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
as,i1,s0 rnd rd 2 21.9 21.8
as,i1,s0 rnd rd 4 11.4 12.0 9.1 8.7
as,i1,s0 rnd rd 8 3.2 3.2 6.7 6.7 4.7 4.0 4.7 3.5

as,i1,s0 rnd wr 2 34.5 34.4
as,i1,s0 rnd wr 4 21.6 20.5 12.6 11.4
as,i1,s0 rnd wr 8 5.1 4.8 18.2 16.9 4.1 4.0 4.0 3.3

as,i1,s0 rnd rdwr 2 27.5 27.0
as,i1,s0 rnd rdwr 4 16.1 15.4 10.2 9.2
as,i1,s0 rnd rdwr 8 5.3 4.6 9.9 9.7 4.6 4.0 4.4 3.8

as,i1,s0 seq rd 2 305.5 305.6
as,i1,s0 seq rd 4 159.5 157.3 144.1 145.3
as,i1,s0 seq rd 8 74.1 74.6 76.7 76.4 74.6 76.7 75.5 77.4

as,i1,s0 seq wr 2 350.3 350.9
as,i1,s0 seq wr 4 160.3 161.7 153.1 153.2
as,i1,s0 seq wr 8 79.5 80.9 78.2 78.7 79.7 78.3 77.8 76.7

as,i1,s0 seq rdwr 2 346.8 347.0
as,i1,s0 seq rdwr 4 163.3 163.5 156.7 155.8
as,i1,s0 seq rdwr 8 79.1 79.4 80.1 80.3 79.1 78.9 79.6 77.8

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
sy,base rnd rd 2 10.2 10.2
sy,base rnd rd 4 7.2 7.2 7.1 7.0
sy,base rnd rd 8 4.1 4.1 4.5 4.5 4.3 4.3 4.4 4.1

sy,base rnd wr 2 36.1 35.7
sy,base rnd wr 4 16.7 16.5 15.6 15.3
sy,base rnd wr 8 5.7 5.4 9.0 8.6 6.6 6.5 6.8 6.0

sy,base rnd rdwr 2 15.5 15.5
sy,base rnd rdwr 4 9.9 9.8 9.7 9.6
sy,base rnd rdwr 8 4.8 4.9 5.8 5.8 5.4 5.4 5.4 4.9

sy,base seq rd 2 254.7 254.8
sy,base seq rd 4 124.2 123.6 121.8 123.4
sy,base seq rd 8 56.9 56.5 56.1 56.8 56.6 56.7 56.5 56.9

sy,base seq wr 2 343.1 342.8
sy,base seq wr 4 177.4 177.9 173.1 174.7
sy,base seq wr 8 86.2 87.5 87.6 89.5 86.8 89.6 88.0 88.7

sy,base seq rdwr 2 254.0 254.4
sy,base seq rdwr 4 124.2 124.5 118.0 117.8
sy,base seq rdwr 8 57.2 56.8 57.0 58.8 56.8 56.3 57.5 57.8

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
sy,i1,s8 rnd rd 2 10.2 10.2
sy,i1,s8 rnd rd 4 7.2 7.2 7.1 7.1
sy,i1,s8 rnd rd 8 4.1 4.1 4.5 4.5 4.4 4.4 4.4 4.2

sy,i1,s8 rnd wr 2 36.2 35.5
sy,i1,s8 rnd wr 4 16.9 17.0 15.3 15.0
sy,i1,s8 rnd wr 8 5.7 5.6 8.5 8.7 6.7 6.5 6.6 6.3

sy,i1,s8 rnd rdwr 2 15.5 15.5
sy,i1,s8 rnd rdwr 4 9.8 9.8 9.7 9.6
sy,i1,s8 rnd rdwr 8 4.9 4.9 5.9 5.8 5.4 5.4 5.4 5.0

sy,i1,s8 seq rd 2 165.9 362.3
sy,i1,s8 seq rd 4 54.0 97.2 145.5 193.9
sy,i1,s8 seq rd 8 14.9 31.4 41.8 52.8 62.8 73.2 85.9 98.8

sy,i1,s8 seq wr 2 220.7 441.1
sy,i1,s8 seq wr 4 77.6 141.9 208.6 274.3
sy,i1,s8 seq wr 8 24.9 47.3 63.8 79.1 97.8 114.8 132.1 148.6

sy,i1,s8 seq rdwr 2 167.7 336.4
sy,i1,s8 seq rdwr 4 54.5 98.2 141.1 187.2
sy,i1,s8 seq rdwr 8 16.7 31.8 41.4 52.3 63.1 73.9 84.6 96.7

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
sy,i1,s0 rnd rd 2 10.2 10.2
sy,i1,s0 rnd rd 4 7.2 7.2 7.1 7.1
sy,i1,s0 rnd rd 8 4.1 4.1 4.6 4.6 4.4 4.4 4.4 4.2

sy,i1,s0 rnd wr 2 36.3 35.6
sy,i1,s0 rnd wr 4 16.9 17.0 15.3 15.2
sy,i1,s0 rnd wr 8 6.0 6.0 8.9 8.8 6.5 6.2 6.5 5.9

sy,i1,s0 rnd rdwr 2 15.6 15.6
sy,i1,s0 rnd rdwr 4 10.0 10.0 9.8 9.8
sy,i1,s0 rnd rdwr 8 5.0 5.0 6.0 6.0 5.5 5.5 5.6 5.1

sy,i1,s0 seq rd 2 304.2 304.3
sy,i1,s0 seq rd 4 154.2 154.2 153.4 153.7
sy,i1,s0 seq rd 8 76.9 76.8 77.3 76.9 77.1 77.2 77.4 78.0

sy,i1,s0 seq wr 2 156.8 157.4
sy,i1,s0 seq wr 4 80.7 79.6 78.5 79.0
sy,i1,s0 seq wr 8 43.2 41.7 41.7 42.6 42.1 42.6 42.8 42.7

sy,i1,s0 seq rdwr 2 321.1 321.7
sy,i1,s0 seq rdwr 4 174.2 174.0 172.6 173.6
sy,i1,s0 seq rdwr 8 86.6 86.3 88.6 88.9 90.2 89.8 90.1 89.0




2009-12-08 16:35:07

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Tue, Dec 08, 2009 at 10:17:48AM -0500, Alan D. Brunelle wrote:
> Hi Vivek -
>
> Sorry, I've been off doing other work and haven't had time to follow up
> on this (until recently). I have runs based upon Jens' for-2.6.33 tree
> as of commit 0d99519efef15fd0cf84a849492c7b1deee1e4b7 and your V4 patch
> sequence (the refresh patch you sent me on 3 December 2009). I _think_
> things look pretty darn good.

That's good to hear. :-)

>There are three modes compared:
>
> (1) base - just Jens' for-2.6.33 tree, not patched.
> (2) i1,s8 - Your patches added and slice_idle set to 8 (default)
> (3) i1,s0 - Your patches added and slice_idle set to 0
>

Thanks Alan. Whenever you run your tests again, it would be better to run
it against Jens's for-2.6.33 branch as Jens has merged block IO controller
patches.

> I did both synchronous and asynchronous runs, direct I/Os in both cases,
> random and sequential, with reads, writes and 80%/20% read/write mixes.
> The results are in throughput (as reported by fio). The first table
> shows overall test results, the other tables show breakdowns per cgroup
> (disk).

What is asynchronous direct sequential read? Reads done through libaio?

Few thoughts/questions inline.

>
> Regards,
> Alan
>

I am assuming that the purpose of the following table is to see what the
overhead of the IO controller patches is. If yes, this looks more or less
good except that there is a slight dip in the as seq rd case.

> ---- ---- - --------- --------- --------- --------- --------- ---------
> Mode RdWr N as,base as,i1,s8 as,i1,s0 sy,base sy,i1,s8 sy,i1,s0
> ---- ---- - --------- --------- --------- --------- --------- ---------
> rnd rd 2 39.7 39.1 43.7 20.5 20.5 20.4
> rnd rd 4 33.9 33.3 41.2 28.5 28.5 28.5
> rnd rd 8 23.7 25.0 36.7 34.4 34.5 34.6
>

slice_idle=0 improves throughput for the "as" case. That's interesting,
especially in the case of 8 random readers running. That should be a
general CFQ property and not an effect of group IO control.

I am not sure why you did not also capture base with slice_idle=0, so that
an apples-to-apples comparison could be done.


> rnd wr 2 66.1 67.8 68.9 71.8 71.8 71.9
> rnd wr 4 57.8 62.9 66.1 64.1 64.2 64.3
> rnd wr 8 39.5 47.4 60.6 54.7 54.6 54.9
>
> rnd rdwr 2 50.2 49.1 54.5 31.1 31.1 31.1
> rnd rdwr 4 41.4 41.3 50.9 38.9 39.1 39.6
> rnd rdwr 8 28.1 30.5 46.3 42.5 42.6 43.8
>
> seq rd 2 612.3 605.7 611.2 509.6 528.3 608.6
> seq rd 4 614.1 606.9 606.2 493.0 490.6 615.4
> seq rd 8 613.6 603.8 605.9 453.0 461.8 617.6
>

Not sure where this 1-2% dip in as seq read comes from.


> seq wr 2 694.6 726.1 701.2 685.8 661.8 314.2
> seq wr 4 687.6 715.3 628.3 702.9 702.3 317.8
> seq wr 8 695.0 710.0 629.8 704.0 708.3 339.4
>
> seq rdwr 2 692.3 664.9 693.8 508.4 504.0 642.8
> seq rdwr 4 664.5 657.1 639.3 484.5 481.0 694.3
> seq rdwr 8 659.0 648.0 634.4 458.1 460.4 709.6
>
> ===============================================================
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> as,base rnd rd 2 20.0 19.7
> as,base rnd rd 4 8.8 8.5 8.3 8.3
> as,base rnd rd 8 3.3 3.1 3.3 3.2 2.7 2.7 2.8 2.6
>
> as,base rnd wr 2 33.2 32.9
> as,base rnd wr 4 15.9 15.2 14.5 12.3
> as,base rnd wr 8 5.8 3.4 7.8 8.7 3.5 3.4 3.8 3.1
>
> as,base rnd rdwr 2 25.0 25.2
> as,base rnd rdwr 4 10.6 10.4 10.2 10.2
> as,base rnd rdwr 8 3.7 3.6 4.0 4.1 3.2 3.4 3.3 2.9
>
>
> as,base seq rd 2 305.9 306.4
> as,base seq rd 4 159.4 160.5 147.3 146.9
> as,base seq rd 8 79.7 80.0 77.3 78.4 73.0 70.0 77.5 77.7
>
> as,base seq wr 2 348.6 346.0
> as,base seq wr 4 189.9 187.6 154.7 155.3
> as,base seq wr 8 87.9 88.3 84.7 85.3 84.5 85.1 90.4 88.8
>
> as,base seq rdwr 2 347.2 345.1
> as,base seq rdwr 4 181.6 181.8 150.8 150.2
> as,base seq rdwr 8 83.6 82.1 82.1 82.7 80.6 82.7 82.2 82.9
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> as,i1,s8 rnd rd 2 12.7 26.3
> as,i1,s8 rnd rd 4 1.2 3.7 12.2 16.3
> as,i1,s8 rnd rd 8 0.5 0.8 1.2 1.7 2.1 3.5 6.7 8.4
>

This looks more or less good except for the fact that the last two groups
seem to have got a much larger share of the disk. In general it would be
nice to also capture the disk time, apart from BW.

> as,i1,s8 rnd wr 2 18.5 49.3
> as,i1,s8 rnd wr 4 1.0 1.6 20.7 39.6
> as,i1,s8 rnd wr 8 0.5 0.7 0.9 1.2 1.7 2.5 15.5 24.5
>

Same as random read. The last two groups got much more BW than their share.
Can you send me the exact fio command you used to run the async workload? I
would like to try it out on my system and see what's happening.

> as,i1,s8 rnd rdwr 2 16.2 32.9
> as,i1,s8 rnd rdwr 4 1.2 4.7 15.6 19.9
> as,i1,s8 rnd rdwr 8 0.6 0.8 1.1 1.7 2.1 3.4 9.4 11.5
>
> as,i1,s8 seq rd 2 202.7 403.0
> as,i1,s8 seq rd 4 92.1 114.7 182.4 217.6
> as,i1,s8 seq rd 8 38.7 76.2 74.0 73.9 74.5 74.7 84.7 107.0
>
> as,i1,s8 seq wr 2 243.8 482.3
> as,i1,s8 seq wr 4 107.7 155.5 200.4 251.7
> as,i1,s8 seq wr 8 52.1 77.2 81.9 80.8 89.6 99.9 109.8 118.7
>

We do see increasing BW in the async seq rd and seq wr cases, but again it
is not very proportionate to the weights. Again, disk time will help here.

> as,i1,s8 seq rdwr 2 225.8 439.1
> as,i1,s8 seq rdwr 4 103.2 140.2 186.5 227.2
> as,i1,s8 seq rdwr 8 50.3 77.4 77.5 78.9 80.5 83.9 94.3 105.2
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> as,i1,s0 rnd rd 2 21.9 21.8
> as,i1,s0 rnd rd 4 11.4 12.0 9.1 8.7
> as,i1,s0 rnd rd 8 3.2 3.2 6.7 6.7 4.7 4.0 4.7 3.5
>
> as,i1,s0 rnd wr 2 34.5 34.4
> as,i1,s0 rnd wr 4 21.6 20.5 12.6 11.4
> as,i1,s0 rnd wr 8 5.1 4.8 18.2 16.9 4.1 4.0 4.0 3.3
>
> as,i1,s0 rnd rdwr 2 27.5 27.0
> as,i1,s0 rnd rdwr 4 16.1 15.4 10.2 9.2
> as,i1,s0 rnd rdwr 8 5.3 4.6 9.9 9.7 4.6 4.0 4.4 3.8
>
> as,i1,s0 seq rd 2 305.5 305.6
> as,i1,s0 seq rd 4 159.5 157.3 144.1 145.3
> as,i1,s0 seq rd 8 74.1 74.6 76.7 76.4 74.6 76.7 75.5 77.4
>
> as,i1,s0 seq wr 2 350.3 350.9
> as,i1,s0 seq wr 4 160.3 161.7 153.1 153.2
> as,i1,s0 seq wr 8 79.5 80.9 78.2 78.7 79.7 78.3 77.8 76.7
>
> as,i1,s0 seq rdwr 2 346.8 347.0
> as,i1,s0 seq rdwr 4 163.3 163.5 156.7 155.8
> as,i1,s0 seq rdwr 8 79.1 79.4 80.1 80.3 79.1 78.9 79.6 77.8
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> sy,base rnd rd 2 10.2 10.2
> sy,base rnd rd 4 7.2 7.2 7.1 7.0
> sy,base rnd rd 8 4.1 4.1 4.5 4.5 4.3 4.3 4.4 4.1
>
> sy,base rnd wr 2 36.1 35.7
> sy,base rnd wr 4 16.7 16.5 15.6 15.3
> sy,base rnd wr 8 5.7 5.4 9.0 8.6 6.6 6.5 6.8 6.0
>
> sy,base rnd rdwr 2 15.5 15.5
> sy,base rnd rdwr 4 9.9 9.8 9.7 9.6
> sy,base rnd rdwr 8 4.8 4.9 5.8 5.8 5.4 5.4 5.4 4.9
>
> sy,base seq rd 2 254.7 254.8
> sy,base seq rd 4 124.2 123.6 121.8 123.4
> sy,base seq rd 8 56.9 56.5 56.1 56.8 56.6 56.7 56.5 56.9
>
> sy,base seq wr 2 343.1 342.8
> sy,base seq wr 4 177.4 177.9 173.1 174.7
> sy,base seq wr 8 86.2 87.5 87.6 89.5 86.8 89.6 88.0 88.7
>
> sy,base seq rdwr 2 254.0 254.4
> sy,base seq rdwr 4 124.2 124.5 118.0 117.8
> sy,base seq rdwr 8 57.2 56.8 57.0 58.8 56.8 56.3 57.5 57.8
>
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> sy,i1,s8 rnd rd 2 10.2 10.2
> sy,i1,s8 rnd rd 4 7.2 7.2 7.1 7.1
> sy,i1,s8 rnd rd 8 4.1 4.1 4.5 4.5 4.4 4.4 4.4 4.2
>

This is consistent. All random/sync-idle IO will be in the root group with
group_isolation=0, so we will not see service differentiation between
groups.

> sy,i1,s8 rnd wr 2 36.2 35.5
> sy,i1,s8 rnd wr 4 16.9 17.0 15.3 15.0
> sy,i1,s8 rnd wr 8 5.7 5.6 8.5 8.7 6.7 6.5 6.6 6.3
>

On my system I was seeing service differentiation for random writes also.
With the kind of pattern fio was generating, for most of the run CFQ
categorized these as a sync-idle workload, hence they got fairness even with
group_isolation=0.

If you run the same test with group_isolation=1, you should see better
numbers for this case.

> sy,i1,s8 rnd rdwr 2 15.5 15.5
> sy,i1,s8 rnd rdwr 4 9.8 9.8 9.7 9.6
> sy,i1,s8 rnd rdwr 8 4.9 4.9 5.9 5.8 5.4 5.4 5.4 5.0
>
> sy,i1,s8 seq rd 2 165.9 362.3
> sy,i1,s8 seq rd 4 54.0 97.2 145.5 193.9
> sy,i1,s8 seq rd 8 14.9 31.4 41.8 52.8 62.8 73.2 85.9 98.8
>
> sy,i1,s8 seq wr 2 220.7 441.1
> sy,i1,s8 seq wr 4 77.6 141.9 208.6 274.3
> sy,i1,s8 seq wr 8 24.9 47.3 63.8 79.1 97.8 114.8 132.1 148.6
>

The above seq rd and seq wr look very good. BW seems to be in proportion to
weight.

> sy,i1,s8 seq rdwr 2 167.7 336.4
> sy,i1,s8 seq rdwr 4 54.5 98.2 141.1 187.2
> sy,i1,s8 seq rdwr 8 16.7 31.8 41.4 52.3 63.1 73.9 84.6 96.7
>

With slice_idle=0 you will generally not get any service differentiation
unless the group is continuously backlogged. So if you launch multiple
processes in the group, then you should see service differentiation even
with slice_idle=0.
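
For illustration, a job of that shape could look like the sketch below (just
a sketch, not your actual script; the path, size and job count are made-up
placeholders, and I assume all of these processes are placed in the same
cgroup however your harness sets that up):

[global]
size=8g
runtime=120
ioengine=sync
direct=1
bs=4k
readwrite=read
; several concurrent sequential readers keep the group continuously backlogged
numjobs=4
[/mnt/sda/data.0]
filename=/mnt/sda/data.0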

> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> sy,i1,s0 rnd rd 2 10.2 10.2
> sy,i1,s0 rnd rd 4 7.2 7.2 7.1 7.1
> sy,i1,s0 rnd rd 8 4.1 4.1 4.6 4.6 4.4 4.4 4.4 4.2
>
> sy,i1,s0 rnd wr 2 36.3 35.6
> sy,i1,s0 rnd wr 4 16.9 17.0 15.3 15.2
> sy,i1,s0 rnd wr 8 6.0 6.0 8.9 8.8 6.5 6.2 6.5 5.9
>
> sy,i1,s0 rnd rdwr 2 15.6 15.6
> sy,i1,s0 rnd rdwr 4 10.0 10.0 9.8 9.8
> sy,i1,s0 rnd rdwr 8 5.0 5.0 6.0 6.0 5.5 5.5 5.6 5.1
>
> sy,i1,s0 seq rd 2 304.2 304.3
> sy,i1,s0 seq rd 4 154.2 154.2 153.4 153.7
> sy,i1,s0 seq rd 8 76.9 76.8 77.3 76.9 77.1 77.2 77.4 78.0
>
> sy,i1,s0 seq wr 2 156.8 157.4
> sy,i1,s0 seq wr 4 80.7 79.6 78.5 79.0
> sy,i1,s0 seq wr 8 43.2 41.7 41.7 42.6 42.1 42.6 42.8 42.7
>
> sy,i1,s0 seq rdwr 2 321.1 321.7
> sy,i1,s0 seq rdwr 4 174.2 174.0 172.6 173.6
> sy,i1,s0 seq rdwr 8 86.6 86.3 88.6 88.9 90.2 89.8 90.1 89.0
>

In summary, the async results look a little bit off and need investigation.
Can you please send me one sample async fio script?

Thanks
Vivek

2009-12-08 18:05:47

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Tue, 2009-12-08 at 11:32 -0500, Vivek Goyal wrote:
> On Tue, Dec 08, 2009 at 10:17:48AM -0500, Alan D. Brunelle wrote:
> > Hi Vivek -
> >
> > Sorry, I've been off doing other work and haven't had time to follow up
> > on this (until recently). I have runs based upon Jens' for-2.6.33 tree
> > as of commit 0d99519efef15fd0cf84a849492c7b1deee1e4b7 and your V4 patch
> > sequence (the refresh patch you sent me on 3 December 2009). I _think_
> > things look pretty darn good.
>
> That's good to hear. :-)
>
> >There are three modes compared:
> >
> > (1) base - just Jens' for-2.6.33 tree, not patched.
> > (2) i1,s8 - Your patches added and slice_idle set to 8 (default)
> > (3) i1,s0 - Your patches added and slice_idle set to 0
> >
>
> Thanks Alan. Whenever you run your tests again, it would be better to run
> it against Jens's for-2.6.33 branch as Jens has merged block IO controller
> patches.

Will do another set of runs w/ the straight branch.

>
> > I did both synchronous and asynchronous runs, direct I/Os in both cases,
> > random and sequential, with reads, writes and 80%/20% read/write mixes.
> > The results are in throughput (as reported by fio). The first table
> > shows overall test results, the other tables show breakdowns per cgroup
> > (disk).
>
> What is asynchronous direct sequential read? Reads done through libaio?

Yep - An asynchronous run would have fio job files like:

[global]
size=8g
overwrite=0
runtime=120
ioengine=libaio
iodepth=128
iodepth_low=128
iodepth_batch=128
iodepth_batch_complete=32
direct=1
bs=4k
readwrite=randread
[/mnt/sda/data.0]
filename=/mnt/sda/data.0

The equivalent synchronous run would be:

[global]
size=8g
overwrite=0
runtime=120
ioengine=sync
direct=1
bs=4k
readwrite=randread
[/mnt/sda/data.0]
filename=/mnt/sda/data.0

~
>
> Few thoughts/questions inline.
>
> >
> > Regards,
> > Alan
> >
>
> I am assuming that the purpose of the following table is to see what the
> overhead of the IO controller patches is. If yes, this looks more or less
> good except that there is a slight dip in the as seq rd case.
>
> > ---- ---- - --------- --------- --------- --------- --------- ---------
> > Mode RdWr N as,base as,i1,s8 as,i1,s0 sy,base sy,i1,s8 sy,i1,s0
> > ---- ---- - --------- --------- --------- --------- --------- ---------
> > rnd rd 2 39.7 39.1 43.7 20.5 20.5 20.4
> > rnd rd 4 33.9 33.3 41.2 28.5 28.5 28.5
> > rnd rd 8 23.7 25.0 36.7 34.4 34.5 34.6
> >
>
> slice_idle=0 improves throughput for the "as" case. That's interesting,
> especially in the case of 8 random readers running. That should be a
> general CFQ property and not an effect of group IO control.
>
> I am not sure why you did not also capture base with slice_idle=0, so that
> an apples-to-apples comparison could be done.

Could add that...will add that...

>
>
> > rnd wr 2 66.1 67.8 68.9 71.8 71.8 71.9
> > rnd wr 4 57.8 62.9 66.1 64.1 64.2 64.3
> > rnd wr 8 39.5 47.4 60.6 54.7 54.6 54.9
> >
> > rnd rdwr 2 50.2 49.1 54.5 31.1 31.1 31.1
> > rnd rdwr 4 41.4 41.3 50.9 38.9 39.1 39.6
> > rnd rdwr 8 28.1 30.5 46.3 42.5 42.6 43.8
> >
> > seq rd 2 612.3 605.7 611.2 509.6 528.3 608.6
> > seq rd 4 614.1 606.9 606.2 493.0 490.6 615.4
> > seq rd 8 613.6 603.8 605.9 453.0 461.8 617.6
> >
>
> Not sure where this 1-2% dip in as seq read comes from.
>
>
> > seq wr 2 694.6 726.1 701.2 685.8 661.8 314.2
> > seq wr 4 687.6 715.3 628.3 702.9 702.3 317.8
> > seq wr 8 695.0 710.0 629.8 704.0 708.3 339.4
> >
> > seq rdwr 2 692.3 664.9 693.8 508.4 504.0 642.8
> > seq rdwr 4 664.5 657.1 639.3 484.5 481.0 694.3
> > seq rdwr 8 659.0 648.0 634.4 458.1 460.4 709.6
> >
> > ===============================================================
> >
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > as,base rnd rd 2 20.0 19.7
> > as,base rnd rd 4 8.8 8.5 8.3 8.3
> > as,base rnd rd 8 3.3 3.1 3.3 3.2 2.7 2.7 2.8 2.6
> >
> > as,base rnd wr 2 33.2 32.9
> > as,base rnd wr 4 15.9 15.2 14.5 12.3
> > as,base rnd wr 8 5.8 3.4 7.8 8.7 3.5 3.4 3.8 3.1
> >
> > as,base rnd rdwr 2 25.0 25.2
> > as,base rnd rdwr 4 10.6 10.4 10.2 10.2
> > as,base rnd rdwr 8 3.7 3.6 4.0 4.1 3.2 3.4 3.3 2.9
> >
> >
> > as,base seq rd 2 305.9 306.4
> > as,base seq rd 4 159.4 160.5 147.3 146.9
> > as,base seq rd 8 79.7 80.0 77.3 78.4 73.0 70.0 77.5 77.7
> >
> > as,base seq wr 2 348.6 346.0
> > as,base seq wr 4 189.9 187.6 154.7 155.3
> > as,base seq wr 8 87.9 88.3 84.7 85.3 84.5 85.1 90.4 88.8
> >
> > as,base seq rdwr 2 347.2 345.1
> > as,base seq rdwr 4 181.6 181.8 150.8 150.2
> > as,base seq rdwr 8 83.6 82.1 82.1 82.7 80.6 82.7 82.2 82.9
> >
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > as,i1,s8 rnd rd 2 12.7 26.3
> > as,i1,s8 rnd rd 4 1.2 3.7 12.2 16.3
> > as,i1,s8 rnd rd 8 0.5 0.8 1.2 1.7 2.1 3.5 6.7 8.4
> >
>
> This looks more or less good except for the fact that the last two groups
> seem to have got a much larger share of the disk. In general it would be
> nice to also capture the disk time, apart from BW.

What specifically are you looking for? Any other fields from the fio
output? I have all that data & could reprocess it easily enough.

>
> > as,i1,s8 rnd wr 2 18.5 49.3
> > as,i1,s8 rnd wr 4 1.0 1.6 20.7 39.6
> > as,i1,s8 rnd wr 8 0.5 0.7 0.9 1.2 1.7 2.5 15.5 24.5
> >
>
> Same as random read. The last two groups got much more BW than their share.
> Can you send me the exact fio command you used to run the async workload? I
> would like to try it out on my system and see what's happening.
>
> > as,i1,s8 rnd rdwr 2 16.2 32.9
> > as,i1,s8 rnd rdwr 4 1.2 4.7 15.6 19.9
> > as,i1,s8 rnd rdwr 8 0.6 0.8 1.1 1.7 2.1 3.4 9.4 11.5
> >
> > as,i1,s8 seq rd 2 202.7 403.0
> > as,i1,s8 seq rd 4 92.1 114.7 182.4 217.6
> > as,i1,s8 seq rd 8 38.7 76.2 74.0 73.9 74.5 74.7 84.7 107.0
> >
> > as,i1,s8 seq wr 2 243.8 482.3
> > as,i1,s8 seq wr 4 107.7 155.5 200.4 251.7
> > as,i1,s8 seq wr 8 52.1 77.2 81.9 80.8 89.6 99.9 109.8 118.7
> >
>
> We do see increasing BW in the async seq rd and seq wr cases, but again it
> is not very proportionate to the weights. Again, disk time will help here.
>
> > as,i1,s8 seq rdwr 2 225.8 439.1
> > as,i1,s8 seq rdwr 4 103.2 140.2 186.5 227.2
> > as,i1,s8 seq rdwr 8 50.3 77.4 77.5 78.9 80.5 83.9 94.3 105.2
> >
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > as,i1,s0 rnd rd 2 21.9 21.8
> > as,i1,s0 rnd rd 4 11.4 12.0 9.1 8.7
> > as,i1,s0 rnd rd 8 3.2 3.2 6.7 6.7 4.7 4.0 4.7 3.5
> >
> > as,i1,s0 rnd wr 2 34.5 34.4
> > as,i1,s0 rnd wr 4 21.6 20.5 12.6 11.4
> > as,i1,s0 rnd wr 8 5.1 4.8 18.2 16.9 4.1 4.0 4.0 3.3
> >
> > as,i1,s0 rnd rdwr 2 27.5 27.0
> > as,i1,s0 rnd rdwr 4 16.1 15.4 10.2 9.2
> > as,i1,s0 rnd rdwr 8 5.3 4.6 9.9 9.7 4.6 4.0 4.4 3.8
> >
> > as,i1,s0 seq rd 2 305.5 305.6
> > as,i1,s0 seq rd 4 159.5 157.3 144.1 145.3
> > as,i1,s0 seq rd 8 74.1 74.6 76.7 76.4 74.6 76.7 75.5 77.4
> >
> > as,i1,s0 seq wr 2 350.3 350.9
> > as,i1,s0 seq wr 4 160.3 161.7 153.1 153.2
> > as,i1,s0 seq wr 8 79.5 80.9 78.2 78.7 79.7 78.3 77.8 76.7
> >
> > as,i1,s0 seq rdwr 2 346.8 347.0
> > as,i1,s0 seq rdwr 4 163.3 163.5 156.7 155.8
> > as,i1,s0 seq rdwr 8 79.1 79.4 80.1 80.3 79.1 78.9 79.6 77.8
> >
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > sy,base rnd rd 2 10.2 10.2
> > sy,base rnd rd 4 7.2 7.2 7.1 7.0
> > sy,base rnd rd 8 4.1 4.1 4.5 4.5 4.3 4.3 4.4 4.1
> >
> > sy,base rnd wr 2 36.1 35.7
> > sy,base rnd wr 4 16.7 16.5 15.6 15.3
> > sy,base rnd wr 8 5.7 5.4 9.0 8.6 6.6 6.5 6.8 6.0
> >
> > sy,base rnd rdwr 2 15.5 15.5
> > sy,base rnd rdwr 4 9.9 9.8 9.7 9.6
> > sy,base rnd rdwr 8 4.8 4.9 5.8 5.8 5.4 5.4 5.4 4.9
> >
> > sy,base seq rd 2 254.7 254.8
> > sy,base seq rd 4 124.2 123.6 121.8 123.4
> > sy,base seq rd 8 56.9 56.5 56.1 56.8 56.6 56.7 56.5 56.9
> >
> > sy,base seq wr 2 343.1 342.8
> > sy,base seq wr 4 177.4 177.9 173.1 174.7
> > sy,base seq wr 8 86.2 87.5 87.6 89.5 86.8 89.6 88.0 88.7
> >
> > sy,base seq rdwr 2 254.0 254.4
> > sy,base seq rdwr 4 124.2 124.5 118.0 117.8
> > sy,base seq rdwr 8 57.2 56.8 57.0 58.8 56.8 56.3 57.5 57.8
> >
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > sy,i1,s8 rnd rd 2 10.2 10.2
> > sy,i1,s8 rnd rd 4 7.2 7.2 7.1 7.1
> > sy,i1,s8 rnd rd 8 4.1 4.1 4.5 4.5 4.4 4.4 4.4 4.2
> >
>
> This is consistent. All random/sync-idle IO will be in the root group with
> group_isolation=0, so we will not see service differentiation between
> groups.
>
> > sy,i1,s8 rnd wr 2 36.2 35.5
> > sy,i1,s8 rnd wr 4 16.9 17.0 15.3 15.0
> > sy,i1,s8 rnd wr 8 5.7 5.6 8.5 8.7 6.7 6.5 6.6 6.3
> >
>
> On my system I was seeing service differentiation for random writes also.
> With the kind of pattern fio was generating, for most of the run CFQ
> categorized these as a sync-idle workload, hence they got fairness even with
> group_isolation=0.
>
> If you run the same test with group_isolation=1, you should see better
> numbers for this case.

I'll work on updating my script to work w/ the new FIO bits (that have
cgroup included).
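
Probably something along these lines - a rough sketch using fio's cgroup
support (the cgroup names, weights and paths here are just placeholders,
not the values from the runs above):

[global]
size=8g
runtime=120
ioengine=libaio
iodepth=128
direct=1
bs=4k
readwrite=read
[test0]
filename=/mnt/sda/data.0
; fio creates/joins this blkio cgroup and sets its weight
cgroup=test0
cgroup_weight=100
[test1]
filename=/mnt/sdb/data.0
cgroup=test1
cgroup_weight=200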

>
> > sy,i1,s8 rnd rdwr 2 15.5 15.5
> > sy,i1,s8 rnd rdwr 4 9.8 9.8 9.7 9.6
> > sy,i1,s8 rnd rdwr 8 4.9 4.9 5.9 5.8 5.4 5.4 5.4 5.0
> >
> > sy,i1,s8 seq rd 2 165.9 362.3
> > sy,i1,s8 seq rd 4 54.0 97.2 145.5 193.9
> > sy,i1,s8 seq rd 8 14.9 31.4 41.8 52.8 62.8 73.2 85.9 98.8
> >
> > sy,i1,s8 seq wr 2 220.7 441.1
> > sy,i1,s8 seq wr 4 77.6 141.9 208.6 274.3
> > sy,i1,s8 seq wr 8 24.9 47.3 63.8 79.1 97.8 114.8 132.1 148.6
> >
>
> The above seq rd and seq wr look very good. BW seems to be in proportion to
> weight.
>
> > sy,i1,s8 seq rdwr 2 167.7 336.4
> > sy,i1,s8 seq rdwr 4 54.5 98.2 141.1 187.2
> > sy,i1,s8 seq rdwr 8 16.7 31.8 41.4 52.3 63.1 73.9 84.6 96.7
> >
>
> With slice_idle=0 you will generally not get any service differentiation
> unless the group is continuously backlogged. So if you launch multiple
> processes in the group, then you should see service differentiation even
> with slice_idle=0.
>
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > sy,i1,s0 rnd rd 2 10.2 10.2
> > sy,i1,s0 rnd rd 4 7.2 7.2 7.1 7.1
> > sy,i1,s0 rnd rd 8 4.1 4.1 4.6 4.6 4.4 4.4 4.4 4.2
> >
> > sy,i1,s0 rnd wr 2 36.3 35.6
> > sy,i1,s0 rnd wr 4 16.9 17.0 15.3 15.2
> > sy,i1,s0 rnd wr 8 6.0 6.0 8.9 8.8 6.5 6.2 6.5 5.9
> >
> > sy,i1,s0 rnd rdwr 2 15.6 15.6
> > sy,i1,s0 rnd rdwr 4 10.0 10.0 9.8 9.8
> > sy,i1,s0 rnd rdwr 8 5.0 5.0 6.0 6.0 5.5 5.5 5.6 5.1
> >
> > sy,i1,s0 seq rd 2 304.2 304.3
> > sy,i1,s0 seq rd 4 154.2 154.2 153.4 153.7
> > sy,i1,s0 seq rd 8 76.9 76.8 77.3 76.9 77.1 77.2 77.4 78.0
> >
> > sy,i1,s0 seq wr 2 156.8 157.4
> > sy,i1,s0 seq wr 4 80.7 79.6 78.5 79.0
> > sy,i1,s0 seq wr 8 43.2 41.7 41.7 42.6 42.1 42.6 42.8 42.7
> >
> > sy,i1,s0 seq rdwr 2 321.1 321.7
> > sy,i1,s0 seq rdwr 4 174.2 174.0 172.6 173.6
> > sy,i1,s0 seq rdwr 8 86.6 86.3 88.6 88.9 90.2 89.8 90.1 89.0
> >
>
> In summary, the async results look a little bit off and need investigation.
> Can you please send me one sample async fio script?

The fio file I included above should help, right? If not, let me know,
I'll send you all the command files...

>
> Thanks
> Vivek


2009-12-10 03:46:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: Block IO Controller V4

On Tue, Dec 08, 2009 at 01:05:41PM -0500, Alan D. Brunelle wrote:

[..]
> > Thanks Alan. Whenever you run your tests again, it would be better to run
> > it against Jens's for-2.6.33 branch as Jens has merged block IO controller
> > patches.
>
> Will do another set of runs w/ the straight branch.
>
> >
> > > I did both synchronous and asynchronous runs, direct I/Os in both cases,
> > > random and sequential, with reads, writes and 80%/20% read/write mixes.
> > > The results are in throughput (as reported by fio). The first table
> > > shows overall test results, the other tables show breakdowns per cgroup
> > > (disk).
> >
> > What is asynchronous direct sequential read? Reads done through libaio?
>
> Yep - An asynchronous run would have fio job files like:
>
> [global]
> size=8g
> overwrite=0

Alan, can you try to run with overwrite=1? IIUC, overwrite=1 will first
lay out the files on disk for write operations and then start IO. This
should give us much better results with ext3, as the interference/serialization
introduced by kjournald comes down. (A sketch of what I mean follows your
quoted job file below.)

> runtime=120
> ioengine=libaio
> iodepth=128
> iodepth_low=128
> iodepth_batch=128
> iodepth_batch_complete=32
> direct=1
> bs=4k
> readwrite=randread
> [/mnt/sda/data.0]
> filename=/mnt/sda/data.0
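
That is, the same job file with preallocation turned on and switched to
writes - a sketch only, keeping your path and sizes:

[global]
size=8g
overwrite=1
runtime=120
ioengine=libaio
iodepth=128
iodepth_low=128
iodepth_batch=128
iodepth_batch_complete=32
direct=1
bs=4k
readwrite=randwrite
[/mnt/sda/data.0]
filename=/mnt/sda/data.0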

I am also migrating my scripts to the latest fio. I will also do some async
testing using libaio and report the results.

>
> The equivalent synchronous run would be:
>
> [global]
> size=8g
> overwrite=0
> runtime=120
> ioengine=sync
> direct=1
> bs=4k
> readwrite=randread
> [/mnt/sda/data.0]
> filename=/mnt/sda/data.0
>
> ~
> >
> > Few thoughts/questions inline.
> >
> > >
> > > Regards,
> > > Alan
> > >
> >
> > I am assuming that the purpose of the following table is to see what the
> > overhead of the IO controller patches is. If yes, this looks more or less
> > good except that there is a slight dip in the as seq rd case.
> >
> > > ---- ---- - --------- --------- --------- --------- --------- ---------
> > > Mode RdWr N as,base as,i1,s8 as,i1,s0 sy,base sy,i1,s8 sy,i1,s0
> > > ---- ---- - --------- --------- --------- --------- --------- ---------
> > > rnd rd 2 39.7 39.1 43.7 20.5 20.5 20.4
> > > rnd rd 4 33.9 33.3 41.2 28.5 28.5 28.5
> > > rnd rd 8 23.7 25.0 36.7 34.4 34.5 34.6
> > >
> >
> > slice_idle=0 improves throughput for the "as" case. That's interesting,
> > especially in the case of 8 random readers running. That should be a
> > general CFQ property and not an effect of group IO control.
> >
> > I am not sure why you did not also capture base with slice_idle=0, so that
> > an apples-to-apples comparison could be done.
>
> Could add that...will add that...

I think at this point the slice_idle=0 results are not very interesting.
You can ignore them both with and without the IO controller patches.

[..]
> > > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > > Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
> > > ----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
> > > as,i1,s8 rnd rd 2 12.7 26.3
> > > as,i1,s8 rnd rd 4 1.2 3.7 12.2 16.3
> > > as,i1,s8 rnd rd 8 0.5 0.8 1.2 1.7 2.1 3.5 6.7 8.4
> > >
> >
> > This looks more or less good except for the fact that the last two groups
> > seem to have got a much larger share of the disk. In general it would be
> > nice to also capture the disk time, apart from BW.
>
> What specifically are you looking for? Any other fields from the fio
> output? I have all that data & could reprocess it easily enough.

I want the disk time as well; it is available in each cgroup directory.
Read the blkio.time file of all the cgroups after the test has run.

[..]
> > In summary, the async results look a little bit off and need investigation.
> > Can you please send me one sample async fio script?
>
> The fio file I included above should help, right? If not, let me know,
> I'll send you all the command files...

I think this is good enough. I will do testing with your fio command file.

Thanks
Vivek