2011-02-21 06:09:56

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 0/6 v5] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Hi

Previously, I posted a patchset that added support for CFQ group hierarchical
scheduling by putting all CFQ queues into a hidden group and scheduling that group
alongside the other CFQ groups under their parent. The patchset is available here:
http://lkml.org/lkml/2010/8/30/30

Vivek felt this approach wasn't intuitive, and that we should instead treat CFQ
queues and groups at the same level. Here is the new approach for hierarchical
scheduling based on Vivek's suggestion. The biggest change in CFQ is that it gets
rid of the cfq_slice_offset logic and uses vdisktime for CFQ queue scheduling,
just as CFQ groups already do. A cfqq still gets a small jump in vdisktime based
on its ioprio; thanks to Vivek for pointing this out. Now CFQ queues and CFQ
groups use the same scheduling algorithm.

A "use_hierarchy" interface is now added to switch between hierarchical mode
and flat mode. It works like memcg's use_hierarchy.

V4 -> V5 Changes:
- Change boosting base to a smaller value.
- Rename repostion_time to position_time
- Replace duplicated code by calling cfq_scale_slice()
- Remove redundant use_hierarchy in cfqd
- Fix grp_service_tree comment
- Rename init_cfqe() to init_group_cfqe()

--
V3 -> V4 Changes:
- Take io class into account when calculating the boost value.
- Refine the vtime boosting logic per Vivek's suggestion.
- Make the group slice calculation span all service trees under a group.
- Update the documentation per Vivek's comments.

--
V2 -> V3 Changes:
- Start from cfqd->grp_service_tree in both hierarchical mode and flat mode
- Avoid recursion in cfqg allocation and in the force-dispatch logic
- Fix a bug when boosting vdisktime
- Adjust total_weight accordingly when changing a weight
- Make the group slice calculation hierarchical
- Keep flat mode rather than deleting it first and re-adding it later
- kfree the parent cfqg if nobody references it
- Simplify the select_queue logic by using wrapper functions
- Make the "use_hierarchy" interface work like memcg's
- Use time_before() for vdisktime comparisons
- Update the documentation
- Fix some code style problems

--
V1 -> V2 Changes:
- Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
  queue entities and group entities; just use cfqe instead.
- Give a newly added cfqq a small vdisktime jump according to its ioprio.
- Make flat mode the default CFQ group scheduling mode.
- Introduce the "use_hierarchy" interface.
- Update the blkio cgroup documentation

Documentation/cgroups/blkio-controller.txt | 81 +-
block/blk-cgroup.c | 61 +
block/blk-cgroup.h | 3
block/cfq-iosched.c | 959 ++++++++++++++++++++---------
4 files changed, 815 insertions(+), 289 deletions(-)

Thanks,
Gui


2011-02-23 03:01:55

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Hi

I rebased this series on top of the *for-next* branch; that should make merging easier.

Previously, I posted a patchset that added support for CFQ group hierarchical
scheduling by putting all CFQ queues into a hidden group and scheduling that group
alongside the other CFQ groups under their parent. The patchset is available here:
http://lkml.org/lkml/2010/8/30/30

Vivek felt this approach wasn't intuitive, and that we should instead treat CFQ
queues and groups at the same level. Here is the new approach for hierarchical
scheduling based on Vivek's suggestion. The biggest change in CFQ is that it gets
rid of the cfq_slice_offset logic and uses vdisktime for CFQ queue scheduling,
just as CFQ groups already do. A cfqq still gets a small jump in vdisktime based
on its ioprio; thanks to Vivek for pointing this out. Now CFQ queues and CFQ
groups use the same scheduling algorithm.

A "use_hierarchy" interface is now added to switch between hierarchical mode
and flat mode. It works like memcg's use_hierarchy.

V4 -> V5 Changes:
- Change boosting base to a smaller value.
- Rename repostion_time to position_time
- Replace duplicated code by calling cfq_scale_slice()
- Remove redundant use_hierarchy in cfqd
- Fix grp_service_tree comment
- Rename init_cfqe() to init_group_cfqe()

--
V3 -> V4 Changes:
- Take io class into account when calculating the boost value.
- Refine the vtime boosting logic per Vivek's suggestion.
- Make the group slice calculation span all service trees under a group.
- Update the documentation per Vivek's comments.

--
V2 -> V3 Changes:
- Start from cfqd->grp_service_tree in both hierarchical mode and flat mode
- Avoid recursion in cfqg allocation and in the force-dispatch logic
- Fix a bug when boosting vdisktime
- Adjust total_weight accordingly when changing a weight
- Make the group slice calculation hierarchical
- Keep flat mode rather than deleting it first and re-adding it later
- kfree the parent cfqg if nobody references it
- Simplify the select_queue logic by using wrapper functions
- Make the "use_hierarchy" interface work like memcg's
- Use time_before() for vdisktime comparisons
- Update the documentation
- Fix some code style problems

--
V1 -> V2 Changes:
- Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
  queue entities and group entities; just use cfqe instead.
- Give a newly added cfqq a small vdisktime jump according to its ioprio.
- Make flat mode the default CFQ group scheduling mode.
- Introduce the "use_hierarchy" interface.
- Update the blkio cgroup documentation

Documentation/cgroups/blkio-controller.txt | 81 +-
block/blk-cgroup.c | 61 +
block/blk-cgroup.h | 3
block/cfq-iosched.c | 959 ++++++++++++++++++++---------
4 files changed, 815 insertions(+), 289 deletions(-)

Thanks,
Gui

2011-02-23 03:07:27

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 1/6 v5.1] cfq-iosched: Introduce cfq_entity for CFQ queue

Introduce cfq_entity for CFQ queue

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 118 ++++++++++++++++++++++++++++++++------------------
1 files changed, 75 insertions(+), 43 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dbf01f1..15344fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -92,19 +92,29 @@ struct cfq_rb_root {
.count = 0, .min_vdisktime = 0, }

/*
+ * This's the CFQ queue schedule entity which is scheduled on service tree.
+ */
+struct cfq_entity {
+ /* service tree */
+ struct cfq_rb_root *service_tree;
+ /* service_tree member */
+ struct rb_node rb_node;
+ /* service_tree key, represent the position on the tree */
+ unsigned long rb_key;
+};
+
+/*
* Per process-grouping structure
*/
struct cfq_queue {
+ /* The schedule entity */
+ struct cfq_entity cfqe;
/* reference count */
int ref;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -143,7 +153,6 @@ struct cfq_queue {
u32 seek_history;
sector_t last_request_pos;

- struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
@@ -302,6 +311,15 @@ struct cfq_data {
struct rcu_head rcu;
};

+static inline struct cfq_queue *
+cfqq_of_entity(struct cfq_entity *cfqe)
+{
+ if (cfqe)
+ return container_of(cfqe, struct cfq_queue, cfqe);
+
+ return NULL;
+}
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -751,7 +769,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
/*
* The below is leftmost cache rbtree addon
*/
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
+static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
{
/* Service tree is empty */
if (!root->count)
@@ -761,7 +779,7 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
root->left = rb_first(&root->rb);

if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
+ return rb_entry(root->left, struct cfq_entity, rb_node);

return NULL;
}
@@ -1179,21 +1197,24 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool add_front)
{
+ struct cfq_entity *cfqe;
struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
+ struct cfq_entity *__cfqe;
unsigned long rb_key;
struct cfq_rb_root *service_tree;
int left;
int new_cfqq = 1;
int group_changed = 0;

+ cfqe = &cfqq->cfqe;
+
#ifdef CONFIG_CFQ_GROUP_IOSCHED
if (!cfqd->cfq_group_isolation
&& cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
&& cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
/* Move this cfq to root group */
cfq_log_cfqq(cfqd, cfqq, "moving to root group");
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
@@ -1203,7 +1224,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
/* cfqq is sequential now needs to go to its original group */
BUG_ON(cfqq->cfqg != &cfqd->root_group);
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
@@ -1218,9 +1239,9 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
+ rb_key += __cfqe->rb_key;
} else
rb_key += jiffies;
} else if (!add_front) {
@@ -1235,37 +1256,37 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->slice_resid = 0;
} else {
rb_key = -HZ;
- __cfqq = cfq_rb_first(service_tree);
- rb_key += __cfqq ? __cfqq->rb_key : jiffies;
+ __cfqe = cfq_rb_first(service_tree);
+ rb_key += __cfqe ? __cfqe->rb_key : jiffies;
}

- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
new_cfqq = 0;
/*
* same position, nothing more to do
*/
- if (rb_key == cfqq->rb_key &&
- cfqq->service_tree == service_tree)
+ if (rb_key == cfqe->rb_key &&
+ cfqe->service_tree == service_tree)
return;

- cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
- cfqq->service_tree = NULL;
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ cfqe->service_tree = NULL;
}

left = 1;
parent = NULL;
- cfqq->service_tree = service_tree;
+ cfqe->service_tree = service_tree;
p = &service_tree->rb.rb_node;
while (*p) {
struct rb_node **n;

parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);

/*
* sort by key, that represents service time.
*/
- if (time_before(rb_key, __cfqq->rb_key))
+ if (time_before(rb_key, __cfqe->rb_key))
n = &(*p)->rb_left;
else {
n = &(*p)->rb_right;
@@ -1276,11 +1297,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}

if (left)
- service_tree->left = &cfqq->rb_node;
+ service_tree->left = &cfqe->rb_node;

- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+ cfqe->rb_key = rb_key;
+ rb_link_node(&cfqe->rb_node, parent, p);
+ rb_insert_color(&cfqe->rb_node, &service_tree->rb);
service_tree->count++;
if ((add_front || !new_cfqq) && !group_changed)
return;
@@ -1382,13 +1403,16 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
*/
static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);

- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
- cfqq->service_tree = NULL;
+ cfqe = &cfqq->cfqe;
+
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ cfqe->service_tree = NULL;
}
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
@@ -1719,13 +1743,13 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
return NULL;
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
- return cfq_rb_first(service_tree);
+ return cfqq_of_entity(cfq_rb_first(service_tree));
}

static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
{
struct cfq_group *cfqg;
- struct cfq_queue *cfqq;
+ struct cfq_entity *cfqe;
int i, j;
struct cfq_rb_root *st;

@@ -1736,9 +1760,11 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
if (!cfqg)
return NULL;

- for_each_cfqg_st(cfqg, i, j, st)
- if ((cfqq = cfq_rb_first(st)) != NULL)
- return cfqq;
+ for_each_cfqg_st(cfqg, i, j, st) {
+ cfqe = cfq_rb_first(st);
+ if (cfqe != NULL)
+ return cfqq_of_entity(cfqe);
+ }
return NULL;
}

@@ -1875,9 +1901,12 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
enum wl_prio_t prio = cfqq_prio(cfqq);
- struct cfq_rb_root *service_tree = cfqq->service_tree;
+ struct cfq_rb_root *service_tree;

+ cfqe = &cfqq->cfqe;
+ service_tree = cfqe->service_tree;
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);

@@ -2087,7 +2116,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
- struct cfq_queue *queue;
+ struct cfq_entity *cfqe;
int i;
bool key_valid = false;
unsigned long lowest_key = 0;
@@ -2095,10 +2124,10 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,

for (i = 0; i <= SYNC_WORKLOAD; ++i) {
/* select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
- if (queue &&
- (!key_valid || time_before(queue->rb_key, lowest_key))) {
- lowest_key = queue->rb_key;
+ cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
+ if (cfqe &&
+ (!key_valid || time_before(cfqe->rb_key, lowest_key))) {
+ lowest_key = cfqe->rb_key;
cur_best = i;
key_valid = true;
}
@@ -2846,7 +2875,10 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
pid_t pid, bool is_sync)
{
- RB_CLEAR_NODE(&cfqq->rb_node);
+ struct cfq_entity *cfqe;
+
+ cfqe = &cfqq->cfqe;
+ RB_CLEAR_NODE(&cfqe->rb_node);
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);

@@ -3255,7 +3287,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
/* Allow preemption only if we are idling on sync-noidle tree */
if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
- new_cfqq->service_tree->count == 2 &&
+ new_cfqq->cfqe.service_tree->count == 2 &&
RB_EMPTY_ROOT(&cfqq->sort_list))
return true;

--
1.7.1

2011-02-23 03:08:52

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 2/6 v5.1] cfq-iosched: Introduce cfq_entity for CFQ group

Introduce cfq_entity for CFQ group

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 111 +++++++++++++++++++++++++++++++--------------------
1 files changed, 67 insertions(+), 44 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 15344fb..98a39eb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -73,7 +73,7 @@ static DEFINE_IDA(cic_index_ida);
#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)

#define sample_valid(samples) ((samples) > 80)
-#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)
+#define rb_entry_entity(node) rb_entry((node), struct cfq_entity, rb_node)

/*
* Most of our rbtree usage is for sorting with min extraction, so
@@ -101,6 +101,11 @@ struct cfq_entity {
struct rb_node rb_node;
/* service_tree key, represent the position on the tree */
unsigned long rb_key;
+
+ /* group service_tree key */
+ u64 vdisktime;
+ bool is_group_entity;
+ unsigned int weight;
};

/*
@@ -182,12 +187,8 @@ enum wl_type_t {

/* This is per cgroup per device grouping structure */
struct cfq_group {
- /* group service_tree member */
- struct rb_node rb_node;
-
- /* group service_tree key */
- u64 vdisktime;
- unsigned int weight;
+ /* cfq group sched entity */
+ struct cfq_entity cfqe;

/* number of cfqq currently on this group */
int nr_cfqq;
@@ -314,12 +315,22 @@ struct cfq_data {
static inline struct cfq_queue *
cfqq_of_entity(struct cfq_entity *cfqe)
{
- if (cfqe)
+ if (cfqe && !cfqe->is_group_entity)
return container_of(cfqe, struct cfq_queue, cfqe);

return NULL;
}

+static inline struct cfq_group *
+cfqg_of_entity(struct cfq_entity *cfqe)
+{
+ if (cfqe && cfqe->is_group_entity)
+ return container_of(cfqe, struct cfq_group, cfqe);
+
+ return NULL;
+}
+
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -547,12 +558,12 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_entity *cfqe)
{
u64 d = delta << CFQ_SERVICE_SHIFT;

d = d * BLKIO_WEIGHT_DEFAULT;
- do_div(d, cfqg->weight);
+ do_div(d, cfqe->weight);
return d;
}

@@ -577,11 +588,11 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
static void update_min_vdisktime(struct cfq_rb_root *st)
{
u64 vdisktime = st->min_vdisktime;
- struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;

if (st->left) {
- cfqg = rb_entry_cfqg(st->left);
- vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+ cfqe = rb_entry_entity(st->left);
+ vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
}

st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
@@ -612,8 +623,9 @@ static inline unsigned
cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_entity *cfqe = &cfqg->cfqe;

- return cfq_target_latency * cfqg->weight / st->total_weight;
+ return cfq_target_latency * cfqe->weight / st->total_weight;
}

static inline unsigned
@@ -784,13 +796,13 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}

-static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
{
if (!root->left)
root->left = rb_first(&root->rb);

if (root->left)
- return rb_entry_cfqg(root->left);
+ return rb_entry_entity(root->left);

return NULL;
}
@@ -847,9 +859,9 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
}

static inline s64
-cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
{
- return cfqg->vdisktime - st->min_vdisktime;
+ return entity->vdisktime - st->min_vdisktime;
}

static void
@@ -857,15 +869,16 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
{
struct rb_node **node = &st->rb.rb_node;
struct rb_node *parent = NULL;
- struct cfq_group *__cfqg;
- s64 key = cfqg_key(st, cfqg);
+ struct cfq_entity *__cfqe;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+ s64 key = entity_key(st, cfqe);
int left = 1;

while (*node != NULL) {
parent = *node;
- __cfqg = rb_entry_cfqg(parent);
+ __cfqe = rb_entry_entity(parent);

- if (key < cfqg_key(st, __cfqg))
+ if (key < entity_key(st, __cfqe))
node = &parent->rb_left;
else {
node = &parent->rb_right;
@@ -874,21 +887,22 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
}

if (left)
- st->left = &cfqg->rb_node;
+ st->left = &cfqe->rb_node;

- rb_link_node(&cfqg->rb_node, parent, node);
- rb_insert_color(&cfqg->rb_node, &st->rb);
+ rb_link_node(&cfqe->rb_node, parent, node);
+ rb_insert_color(&cfqe->rb_node, &st->rb);
}

static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- struct cfq_group *__cfqg;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_entity *__cfqe;
struct rb_node *n;

cfqg->nr_cfqq++;
- if (!RB_EMPTY_NODE(&cfqg->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;

/*
@@ -898,19 +912,20 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
*/
n = rb_last(&st->rb);
if (n) {
- __cfqg = rb_entry_cfqg(n);
- cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ __cfqe = rb_entry_entity(n);
+ cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
} else
- cfqg->vdisktime = st->min_vdisktime;
+ cfqe->vdisktime = st->min_vdisktime;

__cfq_group_service_tree_add(st, cfqg);
- st->total_weight += cfqg->weight;
+ st->total_weight += cfqe->weight;
}

static void
cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_entity *cfqe = &cfqg->cfqe;

BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
@@ -920,9 +935,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;

cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- st->total_weight -= cfqg->weight;
- if (!RB_EMPTY_NODE(&cfqg->rb_node))
- cfq_rb_erase(&cfqg->rb_node, st);
+ st->total_weight -= cfqe->weight;
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ cfq_rb_erase(&cfqe->rb_node, st);
cfqg->saved_workload_slice = 0;
cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}
@@ -960,6 +975,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;
+ struct cfq_entity *cfqe = &cfqg->cfqe;

BUG_ON(nr_sync < 0);
used_sl = charge = cfq_cfqq_slice_usage(cfqq);
@@ -970,8 +986,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
charge = cfqq->allocated_slice;

/* Can't update vdisktime while group is on service tree */
- cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
+ cfq_rb_erase(&cfqe->rb_node, st);
+ cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
__cfq_group_service_tree_add(st, cfqg);

/* This group is being expired. Save the context */
@@ -983,8 +999,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
} else
cfqg->saved_workload_slice = 0;

- cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
- st->min_vdisktime);
+ cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
+ cfqe->vdisktime, st->min_vdisktime);
cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
" sect=%u", used_sl, cfqq->slice_dispatch, charge,
iops_mode(cfqd), cfqq->nr_sectors);
@@ -1003,7 +1019,7 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
unsigned int weight)
{
- cfqg_of_blkg(blkg)->weight = weight;
+ cfqg_of_blkg(blkg)->cfqe.weight = weight;
}

static struct cfq_group *
@@ -1032,7 +1048,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)

for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->rb_node);
+ RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
+
+ cfqg->cfqe.is_group_entity = true;

/*
* Take the initial reference that will be released on destroy
@@ -1056,7 +1074,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
0);

- cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);

/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -2220,10 +2238,13 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;

if (RB_EMPTY_ROOT(&st->rb))
return NULL;
- cfqg = cfq_rb_first_group(st);
+ cfqe = cfq_rb_first_entity(st);
+ cfqg = cfqg_of_entity(cfqe);
+ BUG_ON(!cfqg);
update_min_vdisktime(st);
return cfqg;
}
@@ -2882,6 +2903,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);

+ cfqe->is_group_entity = false;
cfqq->ref = 0;
cfqq->cfqd = cfqd;

@@ -3931,10 +3953,11 @@ static void *cfq_init_queue(struct request_queue *q)
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->rb_node);
+ RB_CLEAR_NODE(&cfqg->cfqe.rb_node);

/* Give preference to root group over other groups */
- cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+ cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
+ cfqg->cfqe.is_group_entity = true;

#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
--
1.7.1

2011-02-23 03:09:17

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 3/6 v5.1] cfq-iosched: Introduce vdisktime and io weight for CFQ queue

Introduce vdisktime and io weight for CFQ queue scheduling. Currently, io priority
maps to the weight range [100, 1000]. This patch also gets rid of the
cfq_slice_offset() logic and uses the same scheduling algorithm as CFQ groups,
which prepares for scheduling CFQ queues and groups on the same service tree.

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 210 ++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 158 insertions(+), 52 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98a39eb..2cceeb1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -40,6 +40,13 @@ static const int cfq_hist_divisor = 4;
#define CFQ_IDLE_DELAY (HZ / 5)

/*
+ * The base boosting value.
+ */
+#define CFQ_BOOST_SYNC_BASE ((HZ / 10) / 4)
+#define CFQ_BOOST_ASYNC_BASE ((HZ / 25) / 4)
+
+
+/*
* below this threshold, we consider thinktime immediate
*/
#define CFQ_MIN_TT (2)
@@ -99,10 +106,7 @@ struct cfq_entity {
struct cfq_rb_root *service_tree;
/* service_tree member */
struct rb_node rb_node;
- /* service_tree key, represent the position on the tree */
- unsigned long rb_key;
-
- /* group service_tree key */
+ /* service_tree key */
u64 vdisktime;
bool is_group_entity;
unsigned int weight;
@@ -114,6 +118,8 @@ struct cfq_entity {
struct cfq_queue {
/* The schedule entity */
struct cfq_entity cfqe;
+ /* Position time */
+ unsigned long position_time;
/* reference count */
int ref;
/* various state flags, see below */
@@ -312,6 +318,24 @@ struct cfq_data {
struct rcu_head rcu;
};

+/*
+ * Map io priority(7 ~ 0) to io weight(100 ~ 1000) as follows
+ * prio 0 1 2 3 4 5 6 7
+ * weight 1000 868 740 612 484 356 228 100
+ */
+static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
+{
+ unsigned int step;
+
+ BUG_ON(ioprio >= IOPRIO_BE_NR);
+
+ step = (BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN) / (IOPRIO_BE_NR - 1);
+ if (ioprio == 0)
+ return BLKIO_WEIGHT_MAX;
+
+ return BLKIO_WEIGHT_MIN + (IOPRIO_BE_NR - ioprio - 1) * step;
+}
+
static inline struct cfq_queue *
cfqq_of_entity(struct cfq_entity *cfqe)
{
@@ -848,16 +872,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev, blk_rq_pos(last));
}

-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
static inline s64
entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
{
@@ -1207,6 +1221,15 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}

#endif /* GROUP_IOSCHED */

+static inline u64 cfq_get_boost(struct cfq_data *cfqd,
+ struct cfq_queue *cfqq)
+{
+ if (cfq_cfqq_sync(cfqq))
+ return cfq_scale_slice(CFQ_BOOST_SYNC_BASE, &cfqq->cfqe);
+ else
+ return cfq_scale_slice(CFQ_BOOST_ASYNC_BASE, &cfqq->cfqe);
+}
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -1218,13 +1241,14 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_entity *cfqe;
struct rb_node **p, *parent;
struct cfq_entity *__cfqe;
- unsigned long rb_key;
- struct cfq_rb_root *service_tree;
+ struct cfq_rb_root *service_tree, *orig_st;
int left;
int new_cfqq = 1;
int group_changed = 0;
+ s64 key;

cfqe = &cfqq->cfqe;
+ orig_st = cfqe->service_tree;

#ifdef CONFIG_CFQ_GROUP_IOSCHED
if (!cfqd->cfq_group_isolation
@@ -1232,8 +1256,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
/* Move this cfq to root group */
cfq_log_cfqq(cfqd, cfqq, "moving to root group");
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ /*
+ * Group changed, dequeue this CFQ queue from the
+ * original service tree.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+ }
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
cfqd->root_group.ref++;
@@ -1242,8 +1273,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
/* cfqq is sequential now needs to go to its original group */
BUG_ON(cfqq->cfqg != &cfqd->root_group);
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ /*
+ * Group changed, dequeue this CFQ queue from the
+ * original service tree.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+ }
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
cfqq->orig_cfqg = NULL;
@@ -1254,47 +1292,65 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq));
+ if (RB_EMPTY_NODE(&cfqe->rb_node)) {
+ /*
+ * If this CFQ queue moves to another group, the original
+ * vdisktime makes no sense any more, reset the vdisktime
+ * here.
+ */
+ parent = rb_last(&service_tree->rb);
+ if (parent) {
+ u64 pos_offset;
+
+ /*
+ * Estimate the position according to its weight and
+ * ioprio.
+ */
+ pos_offset = cfq_get_boost(cfqd, cfqq);
+ cfqe->vdisktime = service_tree->min_vdisktime +
+ pos_offset;
+ } else
+ cfqe->vdisktime = service_tree->min_vdisktime;
+
+ goto insert;
+ }
+
+ /*
+ * Ok, we get here, this CFQ queue is on the service tree, dequeue it
+ * firstly.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+
+ new_cfqq = 0;
+
if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
if (parent && parent != &cfqe->rb_node) {
__cfqe = rb_entry(parent, struct cfq_entity, rb_node);
- rb_key += __cfqe->rb_key;
+ cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
} else
- rb_key += jiffies;
+ cfqe->vdisktime = service_tree->min_vdisktime;
} else if (!add_front) {
/*
- * Get our rb key offset. Subtract any residual slice
- * value carried from last service. A negative resid
- * count indicates slice overrun, and this should position
- * the next service time further away in the tree.
+ * We charge the CFQ queue by the time this queue runs, and
+ * repsition it on the service tree.
*/
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key -= cfqq->slice_resid;
+ unsigned int used_sl;
+
+ used_sl = cfq_cfqq_slice_usage(cfqq);
+ cfqe->vdisktime += cfq_scale_slice(used_sl, cfqe);
cfqq->slice_resid = 0;
} else {
- rb_key = -HZ;
- __cfqe = cfq_rb_first(service_tree);
- rb_key += __cfqe ? __cfqe->rb_key : jiffies;
- }
-
- if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
- new_cfqq = 0;
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqe->rb_key &&
- cfqe->service_tree == service_tree)
- return;
-
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- cfqe->service_tree = NULL;
+ cfqe->vdisktime = service_tree->min_vdisktime;
}

+insert:
left = 1;
parent = NULL;
cfqe->service_tree = service_tree;
p = &service_tree->rb.rb_node;
+ key = entity_key(service_tree, cfqe);
while (*p) {
struct rb_node **n;

@@ -1304,7 +1360,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* sort by key, that represents service time.
*/
- if (time_before(rb_key, __cfqe->rb_key))
+ if (key < entity_key(service_tree, __cfqe))
n = &(*p)->rb_left;
else {
n = &(*p)->rb_right;
@@ -1317,10 +1373,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (left)
service_tree->left = &cfqe->rb_node;

- cfqe->rb_key = rb_key;
rb_link_node(&cfqe->rb_node, parent, p);
rb_insert_color(&cfqe->rb_node, &service_tree->rb);
+ update_min_vdisktime(service_tree);
service_tree->count++;
+ service_tree->total_weight += cfqe->weight;
+ cfqq->position_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
@@ -1422,14 +1480,18 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
struct cfq_entity *cfqe;
+ struct cfq_rb_root *service_tree;
+
cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);

cfqe = &cfqq->cfqe;
+ service_tree = cfqe->service_tree;

if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ service_tree->total_weight -= cfqe->weight;
cfqe->service_tree = NULL;
}
if (cfqq->p_root) {
@@ -2131,23 +2193,36 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}
}

+/*
+ * The time when a CFQ queue is put onto a service tree is recorded in
+ * cfqq->position_time. Currently, we check the first CFQ queue on each
+ * service tree, and select the workload type whose first queue has the
+ * lowest position_time among them.
+ */
static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
struct cfq_entity *cfqe;
+ struct cfq_queue *cfqq;
+ unsigned long lowest_start_time;
int i;
- bool key_valid = false;
- unsigned long lowest_key = 0;
+ bool time_valid = false;
enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;

+ /*
+ * TODO: We may take io priority and io class into account when
+ * choosing a workload type. For the time being, just use
+ * position_time.
+ */
for (i = 0; i <= SYNC_WORKLOAD; ++i) {
- /* select the one with lowest rb_key */
cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
- if (cfqe &&
- (!key_valid || time_before(cfqe->rb_key, lowest_key))) {
- lowest_key = cfqe->rb_key;
+ cfqq = cfqq_of_entity(cfqe);
+ if (cfqe && (!time_valid ||
+ time_before(cfqq->position_time,
+ lowest_start_time))) {
+ lowest_start_time = cfqq->position_time;
cur_best = i;
- key_valid = true;
+ time_valid = true;
}
}

@@ -2819,10 +2894,13 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
int ioprio_class;
+ struct cfq_entity *cfqe;

if (!cfq_cfqq_prio_changed(cfqq))
return;

+ cfqe = &cfqq->cfqe;
+
ioprio_class = IOPRIO_PRIO_CLASS(ioc->ioprio);
switch (ioprio_class) {
default:
@@ -2849,6 +2927,17 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
break;
}

+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ /*
+ * If this CFQ entity is already on service tree, we need to
+ * adjust service tree's total weight accordingly.
+ */
+ cfqe->service_tree->total_weight -= cfqe->weight;
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
+ cfqe->service_tree->total_weight += cfqe->weight;
+ } else
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
@@ -3596,6 +3685,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
+
+ cfqe = &cfqq->cfqe;
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
@@ -3612,6 +3704,20 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
cfqq->ioprio_class = cfqq->org_ioprio_class;
cfqq->ioprio = cfqq->org_ioprio;
}
+
+ /*
+ * Update the io weight if the io priority has changed.
+ */
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ /*
+ * If this CFQ entity is already on service tree, we need to
+ * adjust service tree's total weight accordingly.
+ */
+ cfqe->service_tree->total_weight -= cfqe->weight;
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
+ cfqe->service_tree->total_weight += cfqe->weight;
+ } else
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
}

static inline int __cfq_may_queue(struct cfq_queue *cfqq)
--
1.7.1

2011-02-23 03:10:06

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 4/6 v5.1] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group.

Extract common service tree handling code shared by CFQ queues
and CFQ groups. This helps when CFQ queues and CFQ groups are
scheduled together.

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 86 +++++++++++++++++++++-----------------------------
1 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2cceeb1..d10f776 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -879,12 +879,11 @@ entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
}

static void
-__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+__cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
struct rb_node **node = &st->rb.rb_node;
struct rb_node *parent = NULL;
struct cfq_entity *__cfqe;
- struct cfq_entity *cfqe = &cfqg->cfqe;
s64 key = entity_key(st, cfqe);
int left = 1;

@@ -908,6 +907,14 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
}

static void
+cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ __cfq_entity_service_tree_add(st, cfqe);
+ st->count++;
+ st->total_weight += cfqe->weight;
+}
+
+static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
@@ -931,8 +938,23 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
} else
cfqe->vdisktime = st->min_vdisktime;

- __cfq_group_service_tree_add(st, cfqg);
- st->total_weight += cfqe->weight;
+ cfq_entity_service_tree_add(st, cfqe);
+}
+
+static void
+__cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ cfq_rb_erase(&cfqe->rb_node, st);
+}
+
+static void
+cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ __cfq_entity_service_tree_del(st, cfqe);
+ st->total_weight -= cfqe->weight;
+ cfqe->service_tree = NULL;
+ }
}

static void
@@ -949,9 +971,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;

cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- st->total_weight -= cfqe->weight;
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
- cfq_rb_erase(&cfqe->rb_node, st);
+ cfq_entity_service_tree_del(st, cfqe);
cfqg->saved_workload_slice = 0;
cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}
@@ -1000,9 +1020,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
charge = cfqq->allocated_slice;

/* Can't update vdisktime while group is on service tree */
- cfq_rb_erase(&cfqe->rb_node, st);
+ __cfq_entity_service_tree_del(st, cfqe);
cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
- __cfq_group_service_tree_add(st, cfqg);
+ __cfq_entity_service_tree_add(st, cfqe);

/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1239,13 +1259,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool add_front)
{
struct cfq_entity *cfqe;
- struct rb_node **p, *parent;
+ struct rb_node *parent;
struct cfq_entity *__cfqe;
struct cfq_rb_root *service_tree, *orig_st;
- int left;
int new_cfqq = 1;
int group_changed = 0;
- s64 key;

cfqe = &cfqq->cfqe;
orig_st = cfqe->service_tree;
@@ -1262,8 +1280,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Group changed, dequeue this CFQ queue from the
* original service tree.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);
}
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
@@ -1279,8 +1296,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Group changed, dequeue this CFQ queue from the
* original service tree.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);
}
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
@@ -1319,8 +1335,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Getting here means this CFQ queue is already on the service tree;
* dequeue it first.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);

new_cfqq = 0;

@@ -1346,38 +1361,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}

insert:
- left = 1;
- parent = NULL;
cfqe->service_tree = service_tree;
- p = &service_tree->rb.rb_node;
- key = entity_key(service_tree, cfqe);
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
-
- /*
- * sort by key, that represents service time.
- */
- if (key < entity_key(service_tree, __cfqe))
- n = &(*p)->rb_left;
- else {
- n = &(*p)->rb_right;
- left = 0;
- }
-
- p = n;
- }

- if (left)
- service_tree->left = &cfqe->rb_node;
-
- rb_link_node(&cfqe->rb_node, parent, p);
- rb_insert_color(&cfqe->rb_node, &service_tree->rb);
+ /* Add cfqq onto service tree. */
+ cfq_entity_service_tree_add(service_tree, cfqe);
update_min_vdisktime(service_tree);
- service_tree->count++;
- service_tree->total_weight += cfqe->weight;
cfqq->position_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
@@ -1490,9 +1478,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
service_tree = cfqe->service_tree;

if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- service_tree->total_weight -= cfqe->weight;
- cfqe->service_tree = NULL;
+ cfq_entity_service_tree_del(service_tree, cfqe);
}
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
--
1.7.1

2011-02-23 03:10:42

by Gui, Jianfeng/归 剑峰

Subject: [PATCH 5/6 v5.1] cfq-iosched: CFQ group hierarchical scheduling and use_hierarchy interface.

CFQ group hierarchical scheduling and use_hierarchy interface.

Signed-off-by: Gui Jianfeng <[email protected]>
---
block/blk-cgroup.c | 61 +++++-
block/blk-cgroup.h | 3 +
block/cfq-iosched.c | 596 +++++++++++++++++++++++++++++++++++++--------------
3 files changed, 496 insertions(+), 164 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..c55fecd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -25,7 +25,10 @@
static DEFINE_SPINLOCK(blkio_list_lock);
static LIST_HEAD(blkio_list);

-struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
+struct blkio_cgroup blkio_root_cgroup = {
+ .weight = 2*BLKIO_WEIGHT_DEFAULT,
+ .use_hierarchy = 0
+};
EXPORT_SYMBOL_GPL(blkio_root_cgroup);

static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
@@ -454,6 +457,7 @@ static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
blkg->blkcg_id = 0;
}

+
/*
* returns 0 if blkio_group was still on cgroup list. Otherwise returns 1
* indicating that blk_group was unhashed by the time we got to it.
@@ -765,6 +769,12 @@ unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
}
EXPORT_SYMBOL_GPL(blkcg_get_weight);

+unsigned int blkcg_get_use_hierarchy(struct blkio_cgroup *blkcg)
+{
+ return blkcg->use_hierarchy;
+}
+EXPORT_SYMBOL_GPL(blkcg_get_use_hierarchy);
+
uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg, dev_t dev)
{
struct blkio_policy_node *pn;
@@ -1202,6 +1212,8 @@ static u64 blkiocg_file_read_u64 (struct cgroup *cgrp, struct cftype *cft) {
switch(name) {
case BLKIO_PROP_weight:
return (u64)blkcg->weight;
+ case BLKIO_PROP_use_hierarchy:
+ return (u64)blkcg->use_hierarchy;
}
break;
default:
@@ -1210,6 +1222,36 @@ static u64 blkiocg_file_read_u64 (struct cgroup *cgrp, struct cftype *cft) {
return 0;
}

+static int blkio_use_hierarchy_write(struct cgroup *cgrp, u64 val)
+{
+ struct cgroup *parent = cgrp->parent;
+ struct blkio_cgroup *blkcg, *parent_blkcg = NULL;
+ int ret = 0;
+
+ if (val != 0 && val != 1)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgrp);
+ if (parent)
+ parent_blkcg = cgroup_to_blkio_cgroup(parent);
+
+ cgroup_lock();
+ /*
+ * If parent's use_hierarchy is set, we can't make any modifications
+ * in the child subtrees. If it is unset, then the change can occur,
+ * provided the current cgroup has no children.
+ */
+ if (!parent_blkcg || !parent_blkcg->use_hierarchy) {
+ if (list_empty(&cgrp->children))
+ blkcg->use_hierarchy = val;
+ else
+ ret = -EBUSY;
+ } else
+ ret = -EINVAL;
+ cgroup_unlock();
+ return ret;
+}
+
static int
blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
@@ -1224,6 +1266,8 @@ blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
switch(name) {
case BLKIO_PROP_weight:
return blkio_weight_write(blkcg, val);
+ case BLKIO_PROP_use_hierarchy:
+ return blkio_use_hierarchy_write(cgrp, val);
}
break;
default:
@@ -1301,6 +1345,13 @@ struct cftype blkio_files[] = {
.name = "reset_stats",
.write_u64 = blkiocg_reset_stats,
},
+ {
+ .name = "use_hierarchy",
+ .private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+ BLKIO_PROP_use_hierarchy),
+ .read_u64 = blkiocg_file_read_u64,
+ .write_u64 = blkiocg_file_write_u64,
+ },
#ifdef CONFIG_BLK_DEV_THROTTLING
{
.name = "throttle.read_bps_device",
@@ -1444,7 +1495,7 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
static struct cgroup_subsys_state *
blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- struct blkio_cgroup *blkcg;
+ struct blkio_cgroup *blkcg, *parent_blkcg = NULL;
struct cgroup *parent = cgroup->parent;

if (!parent) {
@@ -1452,6 +1503,7 @@ blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
goto done;
}

+ parent_blkcg = cgroup_to_blkio_cgroup(parent);
blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
if (!blkcg)
return ERR_PTR(-ENOMEM);
@@ -1462,6 +1514,11 @@ done:
INIT_HLIST_HEAD(&blkcg->blkg_list);

INIT_LIST_HEAD(&blkcg->policy_list);
+ if (parent)
+ blkcg->use_hierarchy = parent_blkcg->use_hierarchy;
+ else
+ blkcg->use_hierarchy = 0;
+
return &blkcg->css;
}

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ea4861b..5b4b351 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -90,6 +90,7 @@ enum blkcg_file_name_prop {
BLKIO_PROP_idle_time,
BLKIO_PROP_empty_time,
BLKIO_PROP_dequeue,
+ BLKIO_PROP_use_hierarchy,
};

/* cgroup files owned by throttle policy */
@@ -105,6 +106,7 @@ enum blkcg_file_name_throtl {
struct blkio_cgroup {
struct cgroup_subsys_state css;
unsigned int weight;
+ bool use_hierarchy;
spinlock_t lock;
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
@@ -179,6 +181,7 @@ struct blkio_policy_node {

extern unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
dev_t dev);
+extern unsigned int blkcg_get_use_hierarchy(struct blkio_cgroup *blkcg);
extern uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg,
dev_t dev);
extern uint64_t blkcg_get_write_bps(struct blkio_cgroup *blkcg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d10f776..380d667 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -110,6 +110,9 @@ struct cfq_entity {
u64 vdisktime;
bool is_group_entity;
unsigned int weight;
+ struct cfq_entity *parent;
+ /* Position time */
+ unsigned long position_time;
};

/*
@@ -118,8 +121,6 @@ struct cfq_entity {
struct cfq_queue {
/* The schedule entity */
struct cfq_entity cfqe;
- /* Position time */
- unsigned long position_time;
/* reference count */
int ref;
/* various state flags, see below */
@@ -199,6 +200,9 @@ struct cfq_group {
/* number of cfqq currently on this group */
int nr_cfqq;

+ /* number of sub cfq groups */
+ int nr_subgp;
+
/*
* Per group busy queus average. Useful for workload slice calc. We
* create the array for each prio class but at run time it is used
@@ -234,8 +238,6 @@ struct cfq_group {
*/
struct cfq_data {
struct request_queue *queue;
- /* Root service tree for cfq_groups */
- struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;

/*
@@ -247,6 +249,12 @@ struct cfq_data {
struct cfq_group *serving_group;

/*
+ * Both flat mode and hierarchical mode start from the service
+ * tree here.
+ */
+ struct cfq_rb_root grp_service_tree;
+
+ /*
* Each priority tree is sorted by next_request position. These
* trees are used when determining if two or more queues are
* interleaving requests (see cfq_close_cooperator).
@@ -355,8 +363,6 @@ cfqg_of_entity(struct cfq_entity *cfqe)
}


-static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
-
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
enum wl_type_t type)
@@ -643,13 +649,50 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
return cfqg->busy_queues_avg[rt];
}

+static inline unsigned int
+cfq_group_get_total_weight(struct cfq_group *cfqg)
+{
+ int i, j;
+ struct cfq_rb_root *st;
+ unsigned int total_weight = 0;
+
+ for_each_cfqg_st(cfqg, i, j, st) {
+ total_weight += st->total_weight;
+ }
+
+ return total_weight;
+}
+
static inline unsigned
cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_rb_root *st;
+ int group_slice = cfq_target_latency;
+ unsigned int grp_total_weight;
+ struct cfq_group *p_cfqg;
+
+ /*
+ * Calculate group slice in a hierarchical way.
+ * Note, the calculation is cross all service trees under a group.
+ */
+ do {
+ if (cfqe->parent) {
+ p_cfqg = cfqg_of_entity(cfqe->parent);
+ grp_total_weight = cfq_group_get_total_weight(p_cfqg);
+ group_slice = group_slice * cfqe->weight /
+ grp_total_weight;
+ } else {
+ /* For top level groups */
+ st = cfqe->service_tree;
+ group_slice = group_slice * cfqe->weight /
+ st->total_weight;
+ }
+
+ cfqe = cfqe->parent;
+ } while (cfqe);

- return cfq_target_latency * cfqe->weight / st->total_weight;
+ return group_slice;
}

static inline unsigned
@@ -672,7 +715,8 @@ cfq_scaled_cfqq_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
/* scale low_slice according to IO priority
* and sync vs async */
unsigned low_slice =
- min(slice, base_low_slice * slice / sync_slice);
+ min(slice, base_low_slice * slice /
+ sync_slice);
/* the adapted slice value is scaled to fit all iqs
* into the target latency */
slice = max(slice * group_slice / expect_latency,
@@ -820,17 +864,6 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}

-static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry_entity(root->left);
-
- return NULL;
-}
-
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
@@ -904,12 +937,15 @@ __cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)

rb_link_node(&cfqe->rb_node, parent, node);
rb_insert_color(&cfqe->rb_node, &st->rb);
+
+ update_min_vdisktime(st);
}

static void
cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
__cfq_entity_service_tree_add(st, cfqe);
+ cfqe->position_time = jiffies;
st->count++;
st->total_weight += cfqe->weight;
}
@@ -917,34 +953,52 @@ cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
- struct cfq_entity *__cfqe;
struct rb_node *n;
+ struct cfq_entity *entity;
+ struct cfq_rb_root *st;
+ struct cfq_group *__cfqg;

cfqg->nr_cfqq++;
+
if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;

/*
- * Currently put the group at the end. Later implement something
- * so that groups get lesser vtime based on their weights, so that
- * if group does not loose all if it was not continously backlogged.
+ * Enqueue this group and its ancestors onto their service trees.
*/
- n = rb_last(&st->rb);
- if (n) {
- __cfqe = rb_entry_entity(n);
- cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
- } else
- cfqe->vdisktime = st->min_vdisktime;
+ while (cfqe) {
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ return;
+
+ /*
+ * Currently put the group at the end. Later implement
+ * something so that groups get a smaller vtime based on
+ * their weights, so that a group does not lose everything
+ * if it was not continuously backlogged.
+ */
+ st = cfqe->service_tree;
+ n = rb_last(&st->rb);
+ if (n) {
+ entity = rb_entry_entity(n);
+ cfqe->vdisktime = entity->vdisktime +
+ CFQ_IDLE_DELAY;
+ } else
+ cfqe->vdisktime = st->min_vdisktime;

- cfq_entity_service_tree_add(st, cfqe);
+ cfq_entity_service_tree_add(st, cfqe);
+ cfqe = cfqe->parent;
+ __cfqg = cfqg_of_entity(cfqe);
+ if (__cfqg)
+ __cfqg->nr_subgp++;
+ }
}

static void
__cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
cfq_rb_erase(&cfqe->rb_node, st);
+ update_min_vdisktime(st);
}

static void
@@ -953,27 +1007,43 @@ cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
__cfq_entity_service_tree_del(st, cfqe);
st->total_weight -= cfqe->weight;
- cfqe->service_tree = NULL;
}
}

static void
cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_group *__cfqg, *p_cfqg;

BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;

- /* If there are other cfq queues under this group, don't delete it */
- if (cfqg->nr_cfqq)
+ /*
+ * If there are other cfq queues or cfq groups under this group,
+ * don't delete it.
+ */
+ if (cfqg->nr_cfqq || cfqg->nr_subgp)
return;

- cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- cfq_entity_service_tree_del(st, cfqe);
- cfqg->saved_workload_slice = 0;
- cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
+ /*
+ * Dequeue this group and its ancestors from their service
+ * tree.
+ */
+ while (cfqe) {
+ __cfqg = cfqg_of_entity(cfqe);
+ p_cfqg = cfqg_of_entity(cfqe->parent);
+ cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
+ cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
+ cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
+ __cfqg->saved_workload_slice = 0;
+ cfqe = cfqe->parent;
+ if (p_cfqg) {
+ p_cfqg->nr_subgp--;
+ if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+ return;
+ }
+ }
}

static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -1005,7 +1075,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;
@@ -1019,10 +1088,23 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
charge = cfqq->allocated_slice;

- /* Can't update vdisktime while group is on service tree */
- __cfq_entity_service_tree_del(st, cfqe);
- cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
- __cfq_entity_service_tree_add(st, cfqe);
+ /*
+ * Update the vdisktime on the whole chain.
+ */
+ while (cfqe) {
+ struct cfq_rb_root *st = cfqe->service_tree;
+
+ /*
+ * Can't update vdisktime while group is on service
+ * tree.
+ */
+ __cfq_entity_service_tree_del(st, cfqe);
+ cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
+ __cfq_entity_service_tree_add(st, cfqe);
+ st->count++;
+ cfqe->position_time = jiffies;
+ cfqe = cfqe->parent;
+ }

/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1034,7 +1116,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_workload_slice = 0;

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
- cfqe->vdisktime, st->min_vdisktime);
+ cfqg->cfqe.vdisktime,
+ cfqg->cfqe.service_tree->min_vdisktime);
cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
" sect=%u", used_sl, cfqq->slice_dispatch, charge,
iops_mode(cfqd), cfqq->nr_sectors);
@@ -1056,35 +1139,27 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg_of_blkg(blkg)->cfqe.weight = weight;
}

-static struct cfq_group *
-cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+static void init_group_cfqe(struct blkio_cgroup *blkcg,
+ struct cfq_group *cfqg)
+{
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+
+ cfqe->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ RB_CLEAR_NODE(&cfqe->rb_node);
+ cfqe->is_group_entity = true;
+ cfqe->parent = NULL;
+}
+
+static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
+ struct cfq_group *cfqg)
{
- struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
- struct cfq_group *cfqg = NULL;
- void *key = cfqd;
int i, j;
struct cfq_rb_root *st;
- struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
unsigned int major, minor;
-
- cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfqg->blkg.dev = MKDEV(major, minor);
- goto done;
- }
- if (cfqg || !create)
- goto done;
-
- cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
- if (!cfqg)
- goto done;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;

for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
-
- cfqg->cfqe.is_group_entity = true;

/*
* Take the initial reference that will be released on destroy
@@ -1094,24 +1169,195 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
*/
cfqg->ref = 1;

+ /* Add group onto cgroup list */
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));
+ /* Initialize the group entity */
+ init_group_cfqe(blkcg, cfqg);
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg);
+
+static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+}
+
+static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ struct cfq_group *p_cfqg)
+{
+ struct cfq_entity *cfqe, *p_cfqe;
+
+ cfqe = &cfqg->cfqe;
+
/*
- * Add group onto cgroup list. It might happen that bdi->dev is
- * not initialized yet. Initialize this new group without major
- * and minor info and this info will be filled in once a new thread
- * comes for IO. See code above.
+ * 1. If use_hierarchy is not set in the CGroup that holds cfqg's
+ * parent, put this cfqg onto the global service tree.
+ * 2. If cfqg is the root cfqg, put it onto the global service tree.
*/
- if (bdi->dev) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- MKDEV(major, minor));
- } else
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- 0);
+ if (!p_cfqg) {
+ cfqe->service_tree = &cfqd->grp_service_tree;
+ cfqe->parent = NULL;
+ return;
+ }

- cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ p_cfqe = &p_cfqg->cfqe;

- /* Add group on cfqd list */
- hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+ cfqe->parent = p_cfqe;
+
+ /*
+ * Currently, just put cfq group entity on "BE:SYNC" workload
+ * service tree.
+ */
+ cfqe->service_tree = service_tree_for(p_cfqg, BE_WORKLOAD,
+ SYNC_WORKLOAD);
+
+ /* child reference */
+ p_cfqg->ref++;
+}
+
+static struct cfq_group *cfqg_get_parent(struct cfq_group * cfqg)
+{
+ struct cfq_entity *cfqe, *p_cfqe;
+
+ if (!cfqg)
+ return NULL;
+
+ cfqe = &cfqg->cfqe;
+ p_cfqe = cfqe->parent;
+ if (!p_cfqe)
+ return NULL;
+
+ return cfqg_of_entity(p_cfqe);
+}
+
+static struct cfq_group *
+cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;
+ struct cfq_group *cfqg, *leaf_cfqg, *child_cfqg, *tmp_cfqg;
+ void *key = cfqd;
+
+ /*
+ * If the CGroup's use_hierarchy is unset, we just need to allocate
+ * one CFQ group, and this group will be put onto the
+ * "grp_service_tree". We don't need to check whether the cfqg
+ * exists; the caller has already checked it.
+ */
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ if (!blkcg_get_use_hierarchy(blkcg)) {
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC,
+ cfqd->queue->node);
+ if (!cfqg)
+ return NULL;
+
+ init_cfqg(cfqd, blkcg, cfqg);
+ cfqg_set_parent(cfqd, cfqg, NULL);
+ return cfqg;
+ }
+
+ /*
+ * Allocate the CFQ group chain until we meet a group we've already
+ * allocated, or a CGroup whose use_hierarchy is not set.
+ */
+ leaf_cfqg = NULL;
+ child_cfqg = NULL;
+ for (; cgroup != NULL; cgroup = cgroup->parent) {
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg) {
+ if (!cfqg->blkg.dev && bdi->dev &&
+ dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u",
+ &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ }
+ /*
+ * The parent wasn't fully initialized yet; finish
+ * it now.
+ */
+ if (child_cfqg) {
+ if (blkcg_get_use_hierarchy(blkcg))
+ cfqg_set_parent(cfqd, child_cfqg,
+ cfqg);
+ else
+ cfqg_set_parent(cfqd, child_cfqg,
+ NULL);
+ }
+
+ /* chain has already been built */
+ break;
+ }
+ /*
+ * We only allocate a cfqg if the corresponding cgroup's
+ * use_hierarchy is set.
+ */
+ if (blkcg_get_use_hierarchy(blkcg)) {
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC,
+ cfqd->queue->node);
+ if (!cfqg)
+ goto clean_up;
+ if (!leaf_cfqg)
+ leaf_cfqg = cfqg;
+ init_cfqg(cfqd, blkcg, cfqg);
+ } else {
+ cfqg = NULL;
+ }
+
+ if (child_cfqg)
+ cfqg_set_parent(cfqd, child_cfqg, cfqg);
+
+ /*
+ * This CGroup's use_hierarchy isn't set, which means the CFQ
+ * group chain is complete.
+ */
+ if (!blkcg_get_use_hierarchy(blkcg))
+ break;
+
+ child_cfqg = cfqg;
+ }
+
+ return leaf_cfqg;
+
+clean_up:
+ /* clean up the allocated cfq groups. */
+ while (leaf_cfqg) {
+ tmp_cfqg = leaf_cfqg;
+ leaf_cfqg = cfqg_get_parent(leaf_cfqg);
+ uninit_cfqg(cfqd, tmp_cfqg);
+ }
+ return NULL;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ goto done;
+ }
+ if (cfqg || !create)
+ goto done;
+
+ /*
+ * Allocate a CFQ group chain up to the root group, or until we
+ * meet a CGroup with use_hierarchy disabled.
+ */
+ cfqg = cfqg_chain_alloc(cfqd, cgroup);

done:
return cfqg;
@@ -1156,6 +1402,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
{
struct cfq_rb_root *st;
int i, j;
+ struct cfq_group *p_cfqg;

BUG_ON(cfqg->ref <= 0);
cfqg->ref--;
@@ -1163,6 +1410,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
return;
for_each_cfqg_st(cfqg, i, j, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+
+ do {
+ p_cfqg = cfqg_get_parent(cfqg);
+ kfree(cfqg);
+ cfqg = NULL;
+ /*
+ * Drop the reference taken by the child; if nobody references
+ * the parent group, delete the parent as well.
+ */
+ if (p_cfqg) {
+ p_cfqg->ref--;
+ if (p_cfqg->ref == 0)
+ cfqg = p_cfqg;
+ }
+ } while (cfqg);
+
kfree(cfqg);
}

@@ -1364,9 +1627,8 @@ insert:
cfqe->service_tree = service_tree;

/* Add cfqq onto service tree. */
+
cfq_entity_service_tree_add(service_tree, cfqe);
- update_min_vdisktime(service_tree);
- cfqq->position_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
@@ -1812,28 +2074,43 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
return cfqq_of_entity(cfq_rb_first(service_tree));
}

-static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
+static struct cfq_rb_root *choose_service_tree_forced(struct cfq_group *cfqg)
{
- struct cfq_group *cfqg;
- struct cfq_entity *cfqe;
int i, j;
struct cfq_rb_root *st;

- if (!cfqd->rq_queued)
- return NULL;
+ for_each_cfqg_st(cfqg, i, j, st) {
+ if (st->count != 0)
+ return st;
+ }

- cfqg = cfq_get_next_cfqg(cfqd);
- if (!cfqg)
+ return NULL;
+}
+
+static struct cfq_entity *
+cfq_get_next_entity_forced(struct cfq_data *cfqd)
+{
+ struct cfq_entity *cfqe;
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *cfqg;
+
+ if (!cfqd->rq_queued)
return NULL;

- for_each_cfqg_st(cfqg, i, j, st) {
+ do {
cfqe = cfq_rb_first(st);
- if (cfqe != NULL)
- return cfqq_of_entity(cfqe);
- }
+ if (cfqe && !cfqe->is_group_entity)
+ return cfqe;
+ else if (cfqe && cfqe->is_group_entity)
+ cfqg = cfqg_of_entity(cfqe);
+
+ st = choose_service_tree_forced(cfqg);
+ } while (st);
+
return NULL;
}

+
/*
* Get and set a new active queue for service.
*/
@@ -2189,7 +2466,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
struct cfq_entity *cfqe;
- struct cfq_queue *cfqq;
unsigned long lowest_start_time;
int i;
bool time_valid = false;
@@ -2202,11 +2478,10 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
*/
for (i = 0; i <= SYNC_WORKLOAD; ++i) {
cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
- cfqq = cfqq_of_entity(cfqe);
if (cfqe && (!time_valid ||
- time_before(cfqq->position_time,
+ time_before(cfqe->position_time,
lowest_start_time))) {
- lowest_start_time = cfqq->position_time;
+ lowest_start_time = cfqe->position_time;
cur_best = i;
time_valid = true;
}
@@ -2215,46 +2490,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
return cur_best;
}

-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
unsigned slice;
unsigned count;
struct cfq_rb_root *st;
unsigned group_slice;
- enum wl_prio_t original_prio = cfqd->serving_prio;
-
- /* Choose next priority. RT > BE > IDLE */
- if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = RT_WORKLOAD;
- else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = BE_WORKLOAD;
- else {
- cfqd->serving_prio = IDLE_WORKLOAD;
- cfqd->workload_expires = jiffies + 1;
- return;
- }
-
- if (original_prio != cfqd->serving_prio)
- goto new_workload;
-
- /*
- * For RT and BE, we have to choose also the type
- * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
- * expiration time
- */
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
- count = st->count;
-
- /*
- * check workload expiration, and that we still have other queues ready
- */
- if (count && !time_after(jiffies, cfqd->workload_expires))
- return;

-new_workload:
- /* otherwise select new workload type */
- cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
count = st->count;

@@ -2293,38 +2535,63 @@ new_workload:
slice = max_t(unsigned, slice, CFQ_MIN_TT);
cfq_log(cfqd, "workload slice:%d", slice);
cfqd->workload_expires = jiffies + slice;
+ /* Restore the previous saved slice. */
+ if (cfqg->saved_workload_slice)
+ cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
}

-static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+static struct cfq_rb_root *choose_service_tree(struct cfq_data *cfqd,
+ struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
- struct cfq_group *cfqg;
- struct cfq_entity *cfqe;
-
- if (RB_EMPTY_ROOT(&st->rb))
+ if (!cfqg) {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
return NULL;
- cfqe = cfq_rb_first_entity(st);
- cfqg = cfqg_of_entity(cfqe);
- BUG_ON(!cfqg);
- update_min_vdisktime(st);
- return cfqg;
+ }
+
+ /* Choose next priority. RT > BE > IDLE */
+ if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
+ cfqd->serving_prio = RT_WORKLOAD;
+ else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
+ cfqd->serving_prio = BE_WORKLOAD;
+ else {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return service_tree_for(cfqg, cfqd->serving_prio,
+ cfqd->serving_type);
+ }
+
+ /* otherwise select new workload type */
+ cfqd->serving_type =
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
+
+ return service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
}

-static void cfq_choose_cfqg(struct cfq_data *cfqd)
+static struct cfq_entity *cfq_select_entity(struct cfq_data *cfqd)
{
- struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_entity *cfqe;
+ struct cfq_group *cfqg = NULL;

- cfqd->serving_group = cfqg;
+ if (!cfqd->rq_queued)
+ return NULL;

- /* Restore the workload type data */
- if (cfqg->saved_workload_slice) {
- cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
- cfqd->serving_type = cfqg->saved_workload;
- cfqd->serving_prio = cfqg->saved_serving_prio;
- } else
- cfqd->workload_expires = jiffies - 1;
+ do {
+ cfqe = cfq_rb_first(st);
+ if (!cfqe->is_group_entity) {
+ set_workload_expire(cfqd, cfqg);
+ cfqd->serving_group = cfqg;
+ return cfqe;
+ } else {
+ cfqg = cfqg_of_entity(cfqe);
+ st = choose_service_tree(cfqd, cfqg);
+ if (!st || RB_EMPTY_ROOT(&st->rb))
+ return NULL;
+ }
+ } while (st);

- choose_service_tree(cfqd, cfqg);
+ return NULL;
}

/*
@@ -2334,6 +2601,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
+ struct cfq_entity *entity;

cfqq = cfqd->active_queue;
if (!cfqq)
@@ -2433,9 +2701,9 @@ new_queue:
* Current queue expired. Check if we have to switch to a new
* service tree
*/
- if (!new_cfqq)
- cfq_choose_cfqg(cfqd);
-
+ entity = cfq_select_entity(cfqd);
+ BUG_ON(entity->is_group_entity);
+ new_cfqq = cfqq_of_entity(entity);
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
return cfqq;
@@ -2465,10 +2733,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq;
int dispatched = 0;
+ struct cfq_entity *cfqe;

/* Expire the timeslice of the current active queue first */
cfq_slice_expired(cfqd, 0);
- while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
+
+ while ((cfqe = cfq_get_next_entity_forced(cfqd)) != NULL) {
+ BUG_ON(cfqe->is_group_entity);
+ cfqq = cfqq_of_entity(cfqe);
__cfq_set_active_queue(cfqd, cfqq);
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
}
@@ -2476,6 +2748,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
BUG_ON(cfqd->busy_queues);

cfq_log(cfqd, "forced_dispatch=%d", dispatched);
+
return dispatched;
}

@@ -4038,9 +4311,6 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfqd->cic_index = i;

- /* Init root service tree */
- cfqd->grp_service_tree = CFQ_RB_ROOT;
-
/* Init root group */
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
@@ -4050,6 +4320,7 @@ static void *cfq_init_queue(struct request_queue *q)
/* Give preference to root group over other groups */
cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
cfqg->cfqe.is_group_entity = true;
+ cfqg_set_parent(cfqd, cfqg, NULL);

#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
@@ -4102,6 +4373,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
+
/*
* we optimistically start assuming sync ops weren't delayed in last
* second, in order to have larger depth for async operations.
--
1.7.1
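The central change in this patch, cfq_select_entity() starting at cfqd->grp_service_tree and descending into group service trees by smallest vdisktime until it reaches a queue, can be sketched in miniature. This is an illustrative model only: plain lists stand in for the kernel's rbtrees, and all names here are invented for clarity.

```python
class Entity:
    def __init__(self, name, vdisktime, children=None):
        self.name = name
        self.vdisktime = vdisktime
        self.children = children    # None => a cfqq (leaf); list => a cfqg

def select_entity(service_tree):
    """Repeatedly pick the entity with the smallest vdisktime,
    descending into groups until a queue (leaf) entity is reached."""
    while service_tree:
        first = min(service_tree, key=lambda e: e.vdisktime)
        if first.children is None:      # a queue: serve it
            return first
        service_tree = first.children   # a group: descend
    return None

# A queue and a group scheduled at the same level, as in hierarchical mode:
tree = [
    Entity("q1", vdisktime=300),
    Entity("grp1", vdisktime=100, children=[
        Entity("q2", vdisktime=50),
        Entity("q3", vdisktime=80),
    ]),
]
print(select_entity(tree).name)   # -> q2: grp1 wins at the top, then q2 inside it
```

Note how the queue q1 and the group grp1 compete directly on vdisktime at the top level, which is exactly the "queues and groups at the same level" design the cover letter describes.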

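The parent walk in cfq_put_cfqg() above (free a group, drop the reference it held on its parent, and keep walking up while refcounts reach zero) can likewise be modeled outside the kernel. A toy sketch with invented names:

```python
class Group:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.ref = 1                 # self reference
        if parent:
            parent.ref += 1          # a child pins its parent

def put_group(g, freed):
    """Drop one reference; free the group and walk up the parent
    chain while refcounts hit zero (kfree() in the kernel)."""
    g.ref -= 1
    while g and g.ref == 0:
        freed.append(g.name)
        p = g.parent
        if p:
            p.ref -= 1               # drop the reference the child held
        g = p

root = Group("root")
mid = Group("mid", root)
leaf = Group("leaf", mid)

freed = []
put_group(root, freed)   # root still pinned by mid: nothing freed yet
put_group(mid, freed)    # mid still pinned by leaf: nothing freed
put_group(leaf, freed)   # frees leaf, then mid, then root
print(freed)             # -> ['leaf', 'mid', 'root']
```

Dropping the last reference on the leaf tears down the whole otherwise-unreferenced chain in one pass, which is what the do/while loop in the patch implements.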
2011-02-23 03:11:05

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: [PATCH 6/6 v5.1] blkio-cgroup: Document for blkio.use_hierarchy interface

Document for blkio.use_hierarchy interface

Signed-off-by: Gui Jianfeng <[email protected]>
---
Documentation/cgroups/blkio-controller.txt | 81 +++++++++++++++++++++-------
1 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 4ed7b5c..24399f4 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -91,30 +91,62 @@ Throttling/Upper Limit policy

Hierarchical Cgroups
====================
-- Currently none of the IO control policy supports hierarhical groups. But
- cgroup interface does allow creation of hierarhical cgroups and internally
- IO policies treat them as flat hierarchy.
+- Cgroup interface allows creation of hierarchical cgroups. Internally, IO
+ policies can treat them either as a flat hierarchy or as a true hierarchy,
+ so both hierarchical bandwidth division and flat bandwidth division are
+ supported. "blkio.use_hierarchy" can be used to switch between flat mode
+ and hierarchical mode.

- So this patch will allow creation of cgroup hierarhcy but at the backend
- everything will be treated as flat. So if somebody created a hierarchy like
- as follows.
+ Note: Currently, "blkio.use_hierarchy" only affects proportional bandwidth
+ division. The throttling logic still treats everything as flat.

- root
- / \
- test1 test2
- |
- test3
+ Consider the following CGroup hierarchy:

- CFQ and throttling will practically treat all groups at same level.
+ Root
+ / | \
+ Grp1 Grp2 tsk1
+ / \
+ Grp3 tsk2

- pivot
- / | \ \
- root test1 test2 test3
+ If blkio.use_hierarchy is disabled in all CGroups, CFQ will practically treat all groups
+ at the same level.

- Down the line we can implement hierarchical accounting/control support
- and also introduce a new cgroup file "use_hierarchy" which will control
- whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
- This is how memory controller also has implemented the things.
+ Pivot tree
+ / | | \
+ Root Grp1 Grp2 Grp3
+ / |
+ tsk1 tsk2
+
+ If blkio.use_hierarchy is enabled in the Root group, all children inherit it, so
+ every child group has use_hierarchy=1 set automatically and the tree looks as follows.
+
+ Pivot tree
+ |
+ Root
+ / | \
+ Grp1 Grp2 tsk1
+ / \
+ Grp3 tsk2
+
+ If blkio.use_hierarchy is enabled in Grp1 and Grp3, CFQ schedules groups and tasks
+ according to the CGroup hierarchy view, which looks as follows.
+
+
+ Pivot tree
+ / | \
+ Root Grp1 Grp2
+ / / \
+ tsk1 Grp3 tsk2
+
+ Root, Grp1 and Grp2 are treated at the same level under the Pivot tree. tsk1 stays
+ under Root. Grp3 and tsk2 are treated at the same level under Grp1. Below is the
+ mapping between task io priority and io weight:
+
+ prio 0 1 2 3 4 5 6 7
+ weight 1000 868 740 612 484 356 228 100
+
+ Note: Regardless of its own use_hierarchy setting, the Root group is always put
+ onto the Pivot tree.

Various user visible config options
===================================
@@ -169,6 +201,17 @@ Proportional weight policy files
dev weight
8:16 300

+- blkio.use_hierarchy
+ - Switch between hierarchical mode and flat mode as described above.
+ blkio.use_hierarchy == 1 means hierarchical mode is enabled.
+ blkio.use_hierarchy == 0 means flat mode is enabled.
+ This interface can be set only while the CGroup has no children.
+ Once a CGroup's blkio.use_hierarchy is set, newly created children
+ inherit it, and it cannot be unset in the children.
+ The default mode in the Root CGroup is flat.
+ blkio.use_hierarchy only works for proportional bandwidth division
+ as of today and doesn't have any effect on throttling logic.
+
- blkio.time
- disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and
--
1.7.1
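A hypothetical walkthrough of the interface documented above; the mount point and group names are assumptions for illustration, not taken from the patch.

```shell
# Assumed mount point and group names; purely illustrative.
mount -t cgroup -o blkio none /cgroup

# Enable hierarchical scheduling in the root group before creating
# children: use_hierarchy cannot be changed once a child group exists.
echo 1 > /cgroup/blkio.use_hierarchy

mkdir /cgroup/grp1          # child inherits use_hierarchy=1
mkdir /cgroup/grp1/grp3     # nested group, scheduled under grp1

echo 500 > /cgroup/grp1/blkio.weight
cat /cgroup/grp1/blkio.use_hierarchy    # reads 1; cannot be unset here
```

These commands require root and a kernel built with this patchset; on a stock kernel of the era the blkio.use_hierarchy file does not exist.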

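The prio-to-weight table above follows a linear interpolation from 1000 down to 100 across the eight priorities, rounded down to a multiple of 4. The formula below is reverse-engineered from the table for illustration; the in-kernel computation may be written differently.

```python
# Reproduce the documented io-priority -> io-weight table.
# Hypothetical formula, inferred from the table, not from the patch.
BLKIO_WEIGHT_MAX = 1000
BLKIO_WEIGHT_MIN = 100

def prio_to_weight(prio):
    """Map io priority 0..7 linearly onto [100, 1000], rounding the
    result down to a multiple of 4."""
    span = BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN       # 900
    raw = BLKIO_WEIGHT_MAX - span * prio / 7
    return int(raw // 4) * 4

print([prio_to_weight(p) for p in range(8)])
# -> [1000, 868, 740, 612, 484, 356, 228, 100]
```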
2011-02-24 18:11:57

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
> Hi
>
> I rebased this series on top of the *for-next* branch; it should make merging easier.
>
> Previously, I posted a patchset that adds support for CFQ group hierarchical scheduling
> by putting all CFQ queues in a hidden group and scheduling it along with the other
> CFQ groups under their parent. The patchset is available here:
> http://lkml.org/lkml/2010/8/30/30

Gui,

I was running some tests (iostest) with these patches and my system crashed
after a while.

To be precise I was running "brrmmap" test of iostest.

train.lab.bos.redhat.com login: [72194.404201] EXT4-fs (dm-1): mounted
filesystem with ordered data mode. Opts: (null)
[72642.818976] EXT4-fs (dm-1): mounted filesystem with ordered data mode.
Opts: (null)
[72931.409460] BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
[72931.410216] IP: [<ffffffff812265ff>] __rb_rotate_left+0xb/0x64
[72931.410216] PGD 134d80067 PUD 12f524067 PMD 0
[72931.410216] Oops: 0000 [#1] SMP
[72931.410216] last sysfs file:
/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
[72931.410216] CPU 3
[72931.410216] Modules linked in: kvm_intel kvm qla2xxx scsi_transport_fc
[last unloaded: scsi_wait_scan]
[72931.410216]
[72931.410216] Pid: 18675, comm: sh Not tainted 2.6.38-rc4+ #3 0A98h/HP
xw8600 Workstation
[72931.410216] RIP: 0010:[<ffffffff812265ff>] [<ffffffff812265ff>]
__rb_rotate_left+0xb/0x64
[72931.410216] RSP: 0000:ffff88012f461480 EFLAGS: 00010086
[72931.410216] RAX: 0000000000000000 RBX: ffff880135f40c00 RCX:
ffffffffffffdcc8
[72931.410216] RDX: ffff880135f43800 RSI: ffff880135f43000 RDI:
ffff880135f42c00
[72931.410216] RBP: ffff88012f461480 R08: ffff880135f40c00 R09:
ffff880135f43018
[72931.410216] R10: 0000000000000000 R11: 0000001000000000 R12:
ffff880135f42c00
[72931.410216] R13: ffff880135f41808 R14: ffff880135f43000 R15:
ffff880135f40c00
[72931.410216] FS: 0000000000000000(0000) GS:ffff8800bfcc0000(0000)
knlGS:0000000000000000
[72931.410216] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[72931.410216] CR2: 0000000000000010 CR3: 000000013774f000 CR4:
00000000000006e0
[72931.410216] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[72931.410216] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[72931.410216] Process sh (pid: 18675, threadinfo ffff88012f460000, task
ffff8801376e6f90)
[72931.410216] Stack:
[72931.410216] ffff88012f4614b8 ffffffff81226778 ffff880135f43000
ffff880135f43000
[72931.410216] ffff88011c5bed00 0000000000000000 0000000000000001
ffff88012f4614d8
[72931.410216] ffffffff8121c521 0000001000000000 ffff880135f41800
ffff88012f461528
[72931.410216] Call Trace:
[72931.410216] [<ffffffff81226778>] rb_insert_color+0xbc/0xe5
[72931.410216] [<ffffffff8121c521>]
__cfq_entity_service_tree_add+0x76/0xa5
[72931.410216] [<ffffffff8121cb28>] cfq_service_tree_add+0x383/0x3eb
[72931.410216] [<ffffffff8121cbaa>] cfq_resort_rr_list+0x1a/0x2a
[72931.410216] [<ffffffff8121eb06>] cfq_add_rq_rb+0xbd/0xff
[72931.410216] [<ffffffff8121ec0a>] cfq_insert_request+0xc2/0x556
[72931.410216] [<ffffffff8120a44c>] elv_insert+0x118/0x188
[72931.410216] [<ffffffff8120a52a>] __elv_add_request+0x6e/0x75
[72931.410216] [<ffffffff812102d0>] __make_request+0x3ac/0x42f
[72931.410216] [<ffffffff8120e9ca>] generic_make_request+0x2ec/0x356
[72931.410216] [<ffffffff8120eb05>] submit_bio+0xd1/0xdc
[72931.410216] [<ffffffff8110bea3>] submit_bh+0xe6/0x108
[72931.410216] [<ffffffff8110eb9d>] __bread+0x4c/0x6f
[72931.410216] [<ffffffff811453ab>] ext3_get_branch+0x64/0xdf
[72931.410216] [<ffffffff81146f5c>] ext3_get_blocks_handle+0x9b/0x90b
[72931.410216] [<ffffffff81147882>] ext3_get_block+0xb6/0xf6
[72931.410216] [<ffffffff81113520>] do_mpage_readpage+0x198/0x4bd
[72931.410216] [<ffffffff810c01b2>] ? __inc_zone_page_state+0x29/0x2b
[72931.410216] [<ffffffff810ab6e4>] ? add_to_page_cache_locked+0xb6/0x10d
[72931.410216] [<ffffffff81113980>] mpage_readpages+0xd6/0x123
[72931.410216] [<ffffffff811477cc>] ? ext3_get_block+0x0/0xf6
[72931.410216] [<ffffffff811477cc>] ? ext3_get_block+0x0/0xf6
[72931.410216] [<ffffffff810da750>] ? alloc_pages_current+0xa2/0xc5
[72931.410216] [<ffffffff81145a6a>] ext3_readpages+0x18/0x1a
[72931.410216] [<ffffffff810b31fc>] __do_page_cache_readahead+0x111/0x1a7
[72931.410216] [<ffffffff810b32ae>] ra_submit+0x1c/0x20
[72931.410216] [<ffffffff810acb1b>] filemap_fault+0x165/0x35b
[72931.410216] [<ffffffff810c6ce1>] __do_fault+0x50/0x3e2
[72931.410216] [<ffffffff810c7cf8>] handle_pte_fault+0x2ff/0x779
[72931.410216] [<ffffffff810b05c9>] ? __free_pages+0x1b/0x24
[72931.410216] [<ffffffff810c82d1>] handle_mm_fault+0x15f/0x173
[72931.410216] [<ffffffff815b0963>] do_page_fault+0x348/0x36a
[72931.410216] [<ffffffff810f21c5>] ? path_put+0x1d/0x21
[72931.410216] [<ffffffff810f21c5>] ? path_put+0x1d/0x21
[72931.410216] [<ffffffff815adf1f>] page_fault+0x1f/0x30
[72931.410216] Code: 48 83 c4 18 44 89 e8 5b 41 5c 41 5d c9 c3 48 83 7b 18
00 0f 84 71 ff ff ff e9 77 ff ff ff 90 90 48 8b 47 08 55 48 8b 17 48 89 e5
<48> 8b 48 10 48 83 e2 fc 48 85 c9 48 89 4f 08 74 10 4c 8b 40 10
[72931.410216] RIP [<ffffffff812265ff>] __rb_rotate_left+0xb/0x64
[72931.410216] RSP <ffff88012f461480>
[72931.410216] CR2: 0000000000000010
[72931.410216] ---[ end trace cddc7a4456407f6a ]---

Thanks
Vivek

>
> Vivek felt this approach wasn't intuitive and that we should treat CFQ queues
> and groups at the same level. Here is the new approach for hierarchical
> scheduling, based on Vivek's suggestion. The biggest change in CFQ is that
> it gets rid of the cfq_slice_offset logic and makes use of vdisktime for CFQ
> queue scheduling, just as CFQ groups do. But I still give a cfqq a small jump
> in vdisktime based on its ioprio; thanks to Vivek for pointing this out. Now
> CFQ queues and CFQ groups use the same scheduling algorithm.
>
> "use_hierarchy" interface is now added to switch between hierarchical mode
> and flat mode. It works as it does in memcg.
>
> V4 -> V5 Changes:
> - Change boosting base to a smaller value.
> - Rename repostion_time to position_time
> - Replace duplicated code by calling cfq_scale_slice()
> - Remove redundant use_hierarchy in cfqd
> - Fix grp_service_tree comment
> - Rename init_cfqe() to init_group_cfqe()
>
> --
> V3 -> V4 Changes:
> - Take io class into account when calculating the boost value.
> - Refine the vtime boosting logic as Vivek suggested.
> - Make the calculation of group slice cross all service trees under a group.
> - Modify Documentation in terms of Vivek's comments.
>
> --
> V2 -> V3 Changes:
> - Starting from cfqd->grp_service_tree for both hierarchical mode and flat mode
> - Avoid recursion when allocating cfqg and force dispatch logic
> - Fix a bug when boosting vdisktime
> - Adjusting total_weight accordingly when changing weight
> - Change group slice calculation into a hierarchical way
> - Keep flat mode rather than deleting it first then adding it later
> - kfree the parent cfqg if nobody references it
> - Simplify select_queue logic by using some wrap function
> - Make "use_hierarchy" interface work as memcg
> - Make use of time_before() for vdisktime compare
> - Update Document
> - Fix some code style problems
>
> --
> V1 -> V2 Changes:
> - Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
> queue_entity and group_entity; just use cfqe instead.
> - Give a newly added cfqq a small vdisktime jump according to its ioprio.
> - Make flat mode the default CFQ group scheduling mode.
> - Introduce "use_hierarchy" interface.
> - Update blkio cgroup documents
>
> Documentation/cgroups/blkio-controller.txt | 81 +-
> block/blk-cgroup.c | 61 +
> block/blk-cgroup.h | 3
> block/cfq-iosched.c | 959 ++++++++++++++++++++---------
> 4 files changed, 815 insertions(+), 289 deletions(-)
>
> Thanks,
> Gui

2011-02-25 01:55:56

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Vivek Goyal wrote:
> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
>> Hi
>>
>> I rebase this series on top of *for-next* branch, it will make merging life easier.
>>
>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>> CFQ group under their parent. The patchset is available here,
>> http://lkml.org/lkml/2010/8/30/30
>
> Gui,
>
> I was running some tests (iostest) with these patches and my system crashed
> after a while.
>
> To be precise I was running "brrmmap" test of iostest.

Vivek,

I simply ran iostest in brrmmap mode, but I can't reproduce this bug.
Would you give more details?
Can you tell me the iostest command-line options?
Did you enable use_hierarchy in the root group?

Thanks,
Gui

>
> [full oops trace snipped]
>
> Thanks
> Vivek
>
>> [quoted changelog and diffstat snipped]
>

--
Regards
Gui Jianfeng

2011-02-27 23:16:29

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
> >> Hi
> >>
> >> I rebase this series on top of *for-next* branch, it will make merging life easier.
> >>
> >> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
> >> in the way that it puts all CFQ queues in a hidden group and schedules with other
> >> CFQ group under their parent. The patchset is available here,
> >> http://lkml.org/lkml/2010/8/30/30
> >
> > Gui,
> >
> > I was running some tests (iostest) with these patches and my system crashed
> > after a while.
> >
> > To be precise I was running "brrmmap" test of iostest.
>
> Vivek,
>
> I simply run iostest with brrmmap mode, I can't reproduce this bug.
> Would you give more details.
> Can you tell me the iostest command line options?

iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total

I was actually trying to run all the defined workloads, but after running
2 workloads it crashed on the 3rd.

Now I tried to re-run brrmmap and it did not crash. So I am trying to run
all the inbuilt workloads again.

> Did you enable use_hierarchy in root group?

No I did not. Trying to test the flat setup first.

Thanks
Vivek

2011-02-28 00:15:56

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > > On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
> > >> Hi
> > >>
> > >> I rebase this series on top of *for-next* branch, it will make merging life easier.
> > >>
> > >> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
> > >> in the way that it puts all CFQ queues in a hidden group and schedules with other
> > >> CFQ group under their parent. The patchset is available here,
> > >> http://lkml.org/lkml/2010/8/30/30
> > >
> > > Gui,
> > >
> > > I was running some tests (iostest) with these patches and my system crashed
> > > after a while.
> > >
> > > To be precise I was running "brrmmap" test of iostest.
> >
> > Vivek,
> >
> > I simply run iostest with brrmmap mode, I can't reproduce this bug.
> > Would you give more details.
> > Can you tell me the iostest command line options?
>
> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
>
> I was actually trying to run all the workloads defined but after running
> 2 workloads it crashed on 3rd workload.
>
> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
> all the inbuilt workloads again.
>
> > Did you enable use_hierarchy in root group?
>
> No I did not. Trying to test the flat setup first.

I was running the above job again, and after 3 workloads it ran into a
different BUG_ON().

Thanks
Vivek

login: [277063.539001] ------------[ cut here ]------------
[277063.539001] kernel BUG at block/cfq-iosched.c:1407!
[277063.539001] invalid opcode: 0000 [#1] SMP
[277063.539001] last sysfs file: /sys/devices/virtual/block/dm-1/queue/scheduler
[277063.539001] CPU 2
[277063.539001] Modules linked in: kvm_intel kvm qla2xxx scsi_transport_fc [last unloaded: scsi_wait_scan]
[277063.539001]
[277063.539001] Pid: 24628, comm: iostest Not tainted 2.6.38-rc4+ #3 0A98h/HP xw8600 Workstation
[277063.539001] RIP: 0010:[<ffffffff8121c6f0>] [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
[277063.539001] RSP: 0018:ffff880129a81d48 EFLAGS: 00010046
[277063.539001] RAX: 0000000000000000 RBX: ffff880135e4b800 RCX: ffff88012b9d8ed0
[277063.539001] RDX: ffff880135e4be30 RSI: ffff880135070c00 RDI: ffff880135070c00
[277063.539001] RBP: ffff880129a81d58 R08: ffff880135e4bbc8 R09: ffffffff81ad76d0
[277063.539001] R10: ffff880129a81d48 R11: ffff880129a81d78 R12: ffff880135e4bc18
[277063.539001] R13: ffff880135e4bbc8 R14: ffff8801359c3020 R15: ffff880135033310
[277063.539001] FS: 00007f1329ed4700(0000) GS:ffff8800bfc80000(0000) knlGS:0000000000000000
[277063.539001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[277063.539001] CR2: 0000000000b89c08 CR3: 00000001230a5000 CR4: 00000000000006e0
[277063.539001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[277063.539001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[277063.539001] Process iostest (pid: 24628, threadinfo ffff880129a80000, task ffff880131394a60)
[277063.539001] Stack:
[277063.539001] ffff880135e4b800 ffff880135e4b800 ffff880129a81d68 ffffffff8121d0a0
[277063.539001] ffff880129a81db8 ffffffff8121d6b6 ffff880133af1850 ffff880135070c00
[277063.539001] ffff880129a81db8 ffff88012c908900 ffff88012c908958 ffff880133af1840
[277063.539001] Call Trace:
[277063.539001] [<ffffffff8121d0a0>] cfq_destroy_cfqg+0x45/0x47
[277063.539001] [<ffffffff8121d6b6>] cfq_exit_queue+0xcc/0x164
[277063.539001] [<ffffffff81209f19>] elevator_exit+0x2a/0x47
[277063.539001] [<ffffffff8120a8cc>] elevator_change+0x12f/0x1b7
[277063.539001] [<ffffffff8120a976>] elv_iosched_store+0x22/0x4c
[277063.539001] [<ffffffff812111f2>] queue_attr_store+0x6a/0x89
[277063.539001] [<ffffffff8113e093>] sysfs_write_file+0xfc/0x138
[277063.539001] [<ffffffff810e9fd4>] vfs_write+0xa9/0x105
[277063.539001] [<ffffffff810ea0e9>] sys_write+0x45/0x6c
[277063.539001] [<ffffffff8100293b>] system_call_fastpath+0x16/0x1b
[277063.539001] Code: 75 09 48 85 db 0f 85 77 ff ff ff 41 5c 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 8b 87 20 03 00 00 85 c0 7f 04 <0f> 0b eb fe ff c8 85 c0 89 87 20 03 00 00 0f 85 9d 00 00 00 4c
[277063.539001] RIP [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
[277063.539001] RSP <ffff880129a81d48>
[277063.539001] ---[ end trace d7596ee55221d6a7 ]---

2011-02-28 09:34:50

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Vivek Goyal wrote:
> On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
>> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
>>>>> Hi
>>>>>
>>>>> I rebase this series on top of *for-next* branch, it will make merging life easier.
>>>>>
>>>>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>>>>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>>>>> CFQ group under their parent. The patchset is available here,
>>>>> http://lkml.org/lkml/2010/8/30/30
>>>> Gui,
>>>>
>>>> I was running some tests (iostest) with these patches and my system crashed
>>>> after a while.
>>>>
>>>> To be precise I was running "brrmmap" test of iostest.
>>> Vivek,
>>>
>>> I simply run iostest with brrmmap mode, I can't reproduce this bug.
>>> Would you give more details.
>>> Can you tell me the iostest command line options?
>> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
>>
>> I was actually trying to run all the workloads defined but after running
>> 2 workloads it crashed on 3rd workload.
>>
>> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
>> all the inbuilt workloads again.
>>
>>> Did you enable use_hierarchy in root group?
>> No I did not. Trying to test the flat setup first.
>
> Again was running above job and after 3 workloads it ran into a different
> BUG_ON().
>
> Thanks
> Vivek
>
> login: [277063.539001] ------------[ cut here ]------------
> [277063.539001] kernel BUG at block/cfq-iosched.c:1407!

Vivek,

It seems there's something wrong in the handling of the cfqg reference
counter, but I'm not sure why yet. I'll try to reproduce it and figure
out the cause.
Would you help take a look as well?

Thanks,
Gui


> [277063.539001] invalid opcode: 0000 [#1] SMP
> [277063.539001] last sysfs file: /sys/devices/virtual/block/dm-1/queue/scheduler
> [277063.539001] CPU 2
> [277063.539001] Modules linked in: kvm_intel kvm qla2xxx scsi_transport_fc [last unloaded: scsi_wait_scan]
> [277063.539001]
> [277063.539001] Pid: 24628, comm: iostest Not tainted 2.6.38-rc4+ #3 0A98h/HP xw8600 Workstation
> [277063.539001] RIP: 0010:[<ffffffff8121c6f0>] [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
> [277063.539001] RSP: 0018:ffff880129a81d48 EFLAGS: 00010046
> [277063.539001] RAX: 0000000000000000 RBX: ffff880135e4b800 RCX: ffff88012b9d8ed0
> [277063.539001] RDX: ffff880135e4be30 RSI: ffff880135070c00 RDI: ffff880135070c00
> [277063.539001] RBP: ffff880129a81d58 R08: ffff880135e4bbc8 R09: ffffffff81ad76d0
> [277063.539001] R10: ffff880129a81d48 R11: ffff880129a81d78 R12: ffff880135e4bc18
> [277063.539001] R13: ffff880135e4bbc8 R14: ffff8801359c3020 R15: ffff880135033310
> [277063.539001] FS: 00007f1329ed4700(0000) GS:ffff8800bfc80000(0000) knlGS:0000000000000000
> [277063.539001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [277063.539001] CR2: 0000000000b89c08 CR3: 00000001230a5000 CR4: 00000000000006e0
> [277063.539001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [277063.539001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [277063.539001] Process iostest (pid: 24628, threadinfo ffff880129a80000, task ffff880131394a60)
> [277063.539001] Stack:
> [277063.539001] ffff880135e4b800 ffff880135e4b800 ffff880129a81d68 ffffffff8121d0a0
> [277063.539001] ffff880129a81db8 ffffffff8121d6b6 ffff880133af1850 ffff880135070c00
> [277063.539001] ffff880129a81db8 ffff88012c908900 ffff88012c908958 ffff880133af1840
> [277063.539001] Call Trace:
> [277063.539001] [<ffffffff8121d0a0>] cfq_destroy_cfqg+0x45/0x47
> [277063.539001] [<ffffffff8121d6b6>] cfq_exit_queue+0xcc/0x164
> [277063.539001] [<ffffffff81209f19>] elevator_exit+0x2a/0x47
> [277063.539001] [<ffffffff8120a8cc>] elevator_change+0x12f/0x1b7
> [277063.539001] [<ffffffff8120a976>] elv_iosched_store+0x22/0x4c
> [277063.539001] [<ffffffff812111f2>] queue_attr_store+0x6a/0x89
> [277063.539001] [<ffffffff8113e093>] sysfs_write_file+0xfc/0x138
> [277063.539001] [<ffffffff810e9fd4>] vfs_write+0xa9/0x105
> [277063.539001] [<ffffffff810ea0e9>] sys_write+0x45/0x6c
> [277063.539001] [<ffffffff8100293b>] system_call_fastpath+0x16/0x1b
> [277063.539001] Code: 75 09 48 85 db 0f 85 77 ff ff ff 41 5c 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 8b 87 20 03 00 00 85 c0 7f 04 <0f> 0b eb fe ff c8 85 c0 89 87 20 03 00 00 0f 85 9d 00 00 00 4c
> [277063.539001] RIP [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
> [277063.539001] RSP <ffff880129a81d48>
> [277063.539001] ---[ end trace d7596ee55221d6a7 ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2011-03-02 10:01:25

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Vivek Goyal wrote:
> On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
>> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
>>>>> Hi
>>>>>
>>>>> I rebase this series on top of *for-next* branch, it will make merging life easier.
>>>>>
>>>>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>>>>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>>>>> CFQ group under their parent. The patchset is available here,
>>>>> http://lkml.org/lkml/2010/8/30/30
>>>> Gui,
>>>>
>>>> I was running some tests (iostest) with these patches and my system crashed
>>>> after a while.
>>>>
>>>> To be precise I was running "brrmmap" test of iostest.
>>> Vivek,
>>>
>>> I simply run iostest with brrmmap mode, I can't reproduce this bug.
>>> Would you give more details.
>>> Can you tell me the iostest command line options?
>> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
>>
>> I was actually trying to run all the workloads defined but after running
>> 2 workloads it crashed on 3rd workload.
>>
>> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
>> all the inbuilt workloads again.
>>
>>> Did you enable use_hierarchy in root group?
>> No I did not. Trying to test the flat setup first.
>
> Again was running above job and after 3 workloads it ran into a different
> BUG_ON().

Vivek,

It seems there's a race.
Would you try the following patch? It seems to work for me.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 380d667..abbbb0e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4126,11 +4126,12 @@ new_queue:
cfqq->allocated[rw]++;
cfqq->ref++;

- spin_unlock_irqrestore(q->queue_lock, flags);
-
rq->elevator_private[0] = cic;
rq->elevator_private[1] = cfqq;
rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
return 0;

queue_fail:

Thanks
Gui



>
> Thanks
> Vivek
>
> login: [277063.539001] ------------[ cut here ]------------
> [277063.539001] kernel BUG at block/cfq-iosched.c:1407!
> [277063.539001] invalid opcode: 0000 [#1] SMP
> [277063.539001] last sysfs file: /sys/devices/virtual/block/dm-1/queue/scheduler
> [277063.539001] CPU 2
> [277063.539001] Modules linked in: kvm_intel kvm qla2xxx scsi_transport_fc [last unloaded: scsi_wait_scan]
> [277063.539001]
> [277063.539001] Pid: 24628, comm: iostest Not tainted 2.6.38-rc4+ #3 0A98h/HP xw8600 Workstation
> [277063.539001] RIP: 0010:[<ffffffff8121c6f0>] [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
> [277063.539001] RSP: 0018:ffff880129a81d48 EFLAGS: 00010046
> [277063.539001] RAX: 0000000000000000 RBX: ffff880135e4b800 RCX: ffff88012b9d8ed0
> [277063.539001] RDX: ffff880135e4be30 RSI: ffff880135070c00 RDI: ffff880135070c00
> [277063.539001] RBP: ffff880129a81d58 R08: ffff880135e4bbc8 R09: ffffffff81ad76d0
> [277063.539001] R10: ffff880129a81d48 R11: ffff880129a81d78 R12: ffff880135e4bc18
> [277063.539001] R13: ffff880135e4bbc8 R14: ffff8801359c3020 R15: ffff880135033310
> [277063.539001] FS: 00007f1329ed4700(0000) GS:ffff8800bfc80000(0000) knlGS:0000000000000000
> [277063.539001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [277063.539001] CR2: 0000000000b89c08 CR3: 00000001230a5000 CR4: 00000000000006e0
> [277063.539001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [277063.539001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [277063.539001] Process iostest (pid: 24628, threadinfo ffff880129a80000, task ffff880131394a60)
> [277063.539001] Stack:
> [277063.539001] ffff880135e4b800 ffff880135e4b800 ffff880129a81d68 ffffffff8121d0a0
> [277063.539001] ffff880129a81db8 ffffffff8121d6b6 ffff880133af1850 ffff880135070c00
> [277063.539001] ffff880129a81db8 ffff88012c908900 ffff88012c908958 ffff880133af1840
> [277063.539001] Call Trace:
> [277063.539001] [<ffffffff8121d0a0>] cfq_destroy_cfqg+0x45/0x47
> [277063.539001] [<ffffffff8121d6b6>] cfq_exit_queue+0xcc/0x164
> [277063.539001] [<ffffffff81209f19>] elevator_exit+0x2a/0x47
> [277063.539001] [<ffffffff8120a8cc>] elevator_change+0x12f/0x1b7
> [277063.539001] [<ffffffff8120a976>] elv_iosched_store+0x22/0x4c
> [277063.539001] [<ffffffff812111f2>] queue_attr_store+0x6a/0x89
> [277063.539001] [<ffffffff8113e093>] sysfs_write_file+0xfc/0x138
> [277063.539001] [<ffffffff810e9fd4>] vfs_write+0xa9/0x105
> [277063.539001] [<ffffffff810ea0e9>] sys_write+0x45/0x6c
> [277063.539001] [<ffffffff8100293b>] system_call_fastpath+0x16/0x1b
> [277063.539001] Code: 75 09 48 85 db 0f 85 77 ff ff ff 41 5c 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 8b 87 20 03 00 00 85 c0 7f 04 <0f> 0b eb fe ff c8 85 c0 89 87 20 03 00 00 0f 85 9d 00 00 00 4c
> [277063.539001] RIP [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
> [277063.539001] RSP <ffff880129a81d48>
> [277063.539001] ---[ end trace d7596ee55221d6a7 ]---

2011-03-04 04:34:50

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
>>> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
>>>>>> Hi
>>>>>>
>>>>>> I rebase this series on top of *for-next* branch, it will make merging life easier.
>>>>>>
>>>>>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>>>>>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>>>>>> CFQ group under their parent. The patchset is available here,
>>>>>> http://lkml.org/lkml/2010/8/30/30
>>>>> Gui,
>>>>>
>>>>> I was running some tests (iostest) with these patches and my system crashed
>>>>> after a while.
>>>>>
>>>>> To be precise I was running "brrmmap" test of iostest.
>>>> Vivek,
>>>>
>>>> I simply run iostest with brrmmap mode, I can't reproduce this bug.
>>>> Would you give more details.
>>>> Can you tell me the iostest command line options?
>>> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
>>>
>>> I was actually trying to run all the workloads defined but after running
>>> 2 workloads it crashed on 3rd workload.
>>>
>>> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
>>> all the inbuilt workloads again.
>>>
>>>> Did you enable use_hierarchy in root group?
>>> No I did not. Trying to test the flat setup first.
>> Again was running above job and after 3 workloads it ran into a different
>> BUG_ON().
>
> Vivek,
>
> It seems there's a race.
> Would you try the following patch. This patch seems working for me.
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 380d667..abbbb0e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -4126,11 +4126,12 @@ new_queue:
> cfqq->allocated[rw]++;
> cfqq->ref++;
>
> - spin_unlock_irqrestore(q->queue_lock, flags);
> -
> rq->elevator_private[0] = cic;
> rq->elevator_private[1] = cfqq;
> rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
> +
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> return 0;
>
> queue_fail:
>
> Thanks
> Gui
>

Jens,

This bug seems to have been introduced in commit 763414b in the for-next
branch when merging the for-2.6.39/core branch.
Would you apply the above patch?

Vivek, can you try the patchset again with this fix? It works fine for me now.

Thanks,
Gui


>
>
>> Thanks
>> Vivek
>>
>> login: [277063.539001] ------------[ cut here ]------------
>> [277063.539001] kernel BUG at block/cfq-iosched.c:1407!
>> [277063.539001] invalid opcode: 0000 [#1] SMP
>> [277063.539001] last sysfs file: /sys/devices/virtual/block/dm-1/queue/scheduler
>> [277063.539001] CPU 2
>> [277063.539001] Modules linked in: kvm_intel kvm qla2xxx scsi_transport_fc [last unloaded: scsi_wait_scan]
>> [277063.539001]
>> [277063.539001] Pid: 24628, comm: iostest Not tainted 2.6.38-rc4+ #3 0A98h/HP xw8600 Workstation
>> [277063.539001] RIP: 0010:[<ffffffff8121c6f0>] [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
>> [277063.539001] RSP: 0018:ffff880129a81d48 EFLAGS: 00010046
>> [277063.539001] RAX: 0000000000000000 RBX: ffff880135e4b800 RCX: ffff88012b9d8ed0
>> [277063.539001] RDX: ffff880135e4be30 RSI: ffff880135070c00 RDI: ffff880135070c00
>> [277063.539001] RBP: ffff880129a81d58 R08: ffff880135e4bbc8 R09: ffffffff81ad76d0
>> [277063.539001] R10: ffff880129a81d48 R11: ffff880129a81d78 R12: ffff880135e4bc18
>> [277063.539001] R13: ffff880135e4bbc8 R14: ffff8801359c3020 R15: ffff880135033310
>> [277063.539001] FS: 00007f1329ed4700(0000) GS:ffff8800bfc80000(0000) knlGS:0000000000000000
>> [277063.539001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [277063.539001] CR2: 0000000000b89c08 CR3: 00000001230a5000 CR4: 00000000000006e0
>> [277063.539001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [277063.539001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [277063.539001] Process iostest (pid: 24628, threadinfo ffff880129a80000, task ffff880131394a60)
>> [277063.539001] Stack:
>> [277063.539001] ffff880135e4b800 ffff880135e4b800 ffff880129a81d68 ffffffff8121d0a0
>> [277063.539001] ffff880129a81db8 ffffffff8121d6b6 ffff880133af1850 ffff880135070c00
>> [277063.539001] ffff880129a81db8 ffff88012c908900 ffff88012c908958 ffff880133af1840
>> [277063.539001] Call Trace:
>> [277063.539001] [<ffffffff8121d0a0>] cfq_destroy_cfqg+0x45/0x47
>> [277063.539001] [<ffffffff8121d6b6>] cfq_exit_queue+0xcc/0x164
>> [277063.539001] [<ffffffff81209f19>] elevator_exit+0x2a/0x47
>> [277063.539001] [<ffffffff8120a8cc>] elevator_change+0x12f/0x1b7
>> [277063.539001] [<ffffffff8120a976>] elv_iosched_store+0x22/0x4c
>> [277063.539001] [<ffffffff812111f2>] queue_attr_store+0x6a/0x89
>> [277063.539001] [<ffffffff8113e093>] sysfs_write_file+0xfc/0x138
>> [277063.539001] [<ffffffff810e9fd4>] vfs_write+0xa9/0x105
>> [277063.539001] [<ffffffff810ea0e9>] sys_write+0x45/0x6c
>> [277063.539001] [<ffffffff8100293b>] system_call_fastpath+0x16/0x1b
>> [277063.539001] Code: 75 09 48 85 db 0f 85 77 ff ff ff 41 5c 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 8b 87 20 03 00 00 85 c0 7f 04 <0f> 0b eb fe ff c8 85 c0 89 87 20 03 00 00 0f 85 9d 00 00 00 4c
>> [277063.539001] RIP [<ffffffff8121c6f0>] cfq_put_cfqg+0x13/0xc8
>> [277063.539001] RSP <ffff880129a81d48>
>> [277063.539001] ---[ end trace d7596ee55221d6a7 ]---

2011-03-04 19:15:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Fri, Mar 04, 2011 at 12:34:11PM +0800, Gui Jianfeng wrote:
> Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> >> On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
> >>> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
> >>>> Vivek Goyal wrote:
> >>>>> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
> >>>>>> Hi
> >>>>>>
> >>>>>> I rebase this series on top of *for-next* branch, it will make merging life easier.
> >>>>>>
> >>>>>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
> >>>>>> in the way that it puts all CFQ queues in a hidden group and schedules with other
> >>>>>> CFQ group under their parent. The patchset is available here,
> >>>>>> http://lkml.org/lkml/2010/8/30/30
> >>>>> Gui,
> >>>>>
> >>>>> I was running some tests (iostest) with these patches and my system crashed
> >>>>> after a while.
> >>>>>
> >>>>> To be precise I was running "brrmmap" test of iostest.
> >>>> Vivek,
> >>>>
> >>>> I simply run iostest with brrmmap mode, I can't reproduce this bug.
> >>>> Would you give more details.
> >>>> Can you tell me the iostest command line options?
> >>> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
> >>>
> >>> I was actually trying to run all the workloads defined but after running
> >>> 2 workloads it crashed on 3rd workload.
> >>>
> >>> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
> >>> all the inbuilt workloads again.
> >>>
> >>>> Did you enable use_hierarchy in root group?
> >>> No I did not. Trying to test the flat setup first.
> >> Again was running above job and after 3 workloads it ran into a different
> >> BUG_ON().
> >
> > Vivek,
> >
> > It seems there's a race.
> > Would you try the following patch. This patch seems working for me.
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 380d667..abbbb0e 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -4126,11 +4126,12 @@ new_queue:
> > cfqq->allocated[rw]++;
> > cfqq->ref++;
> >
> > - spin_unlock_irqrestore(q->queue_lock, flags);
> > -
> > rq->elevator_private[0] = cic;
> > rq->elevator_private[1] = cfqq;
> > rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
> > +
> > + spin_unlock_irqrestore(q->queue_lock, flags);
> > +
> > return 0;
> >
> > queue_fail:
> >
> > Thanks
> > Gui
> >
>
> Jens,
>
> This bug seems being introduced in commmit 763414b in for-next branch when
> merging for-2.6.39/core branch.
> Would you apply the above patch?
>
> Vivek, can you try the patchset again with this fix? It works fine for me now.

Gui,

Ok, I ran iostest with this fix and it seems to have worked. I need to run
it for some more time, and I also need to spend more time reviewing your
patchset. There are so many details to it. Soon I will spare some time
to review it more and also test it a bit more.

Off the top of my head, I have one concern.

- How to map iopriority to weights. I am thinking that currently the weight
  range is 100-1000. If we decide to extend the range in the current scheme,
  it will change the ioprio entity weights too, and effectively the
  service differentiation between ioprio levels will change. I am
  wondering whether this is a concern and how the cpu scheduler has managed it.

Thanks
Vivek

2011-03-05 05:25:05

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

Vivek Goyal wrote:
> On Fri, Mar 04, 2011 at 12:34:11PM +0800, Gui Jianfeng wrote:
>> Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> On Sun, Feb 27, 2011 at 06:16:18PM -0500, Vivek Goyal wrote:
>>>>> On Fri, Feb 25, 2011 at 09:55:32AM +0800, Gui Jianfeng wrote:
>>>>>> Vivek Goyal wrote:
>>>>>>> On Wed, Feb 23, 2011 at 11:01:35AM +0800, Gui Jianfeng wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I rebase this series on top of *for-next* branch, it will make merging life easier.
>>>>>>>>
>>>>>>>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>>>>>>>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>>>>>>>> CFQ group under their parent. The patchset is available here,
>>>>>>>> http://lkml.org/lkml/2010/8/30/30
>>>>>>> Gui,
>>>>>>>
>>>>>>> I was running some tests (iostest) with these patches and my system crashed
>>>>>>> after a while.
>>>>>>>
>>>>>>> To be precise I was running "brrmmap" test of iostest.
>>>>>> Vivek,
>>>>>>
>>>>>> I simply run iostest with brrmmap mode, I can't reproduce this bug.
>>>>>> Would you give more details.
>>>>>> Can you tell me the iostest command line options?
>>>>> iostest /dev/dm-1 -G --nrgrp 4 -m 8 --cgtime --io_serviced --dequeue --total
>>>>>
>>>>> I was actually trying to run all the workloads defined but after running
>>>>> 2 workloads it crashed on 3rd workload.
>>>>>
>>>>> Now I tried to re-run brrmmap and it did not crash. So I am trying to run
>>>>> all the inbuilt workloads again.
>>>>>
>>>>>> Did you enable use_hierarchy in root group?
>>>>> No I did not. Trying to test the flat setup first.
>>>> Again was running above job and after 3 workloads it ran into a different
>>>> BUG_ON().
>>> Vivek,
>>>
>>> It seems there's a race.
>>> Would you try the following patch. This patch seems working for me.
>>>
>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> index 380d667..abbbb0e 100644
>>> --- a/block/cfq-iosched.c
>>> +++ b/block/cfq-iosched.c
>>> @@ -4126,11 +4126,12 @@ new_queue:
>>> cfqq->allocated[rw]++;
>>> cfqq->ref++;
>>>
>>> - spin_unlock_irqrestore(q->queue_lock, flags);
>>> -
>>> rq->elevator_private[0] = cic;
>>> rq->elevator_private[1] = cfqq;
>>> rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
>>> +
>>> + spin_unlock_irqrestore(q->queue_lock, flags);
>>> +
>>> return 0;
>>>
>>> queue_fail:
>>>
>>> Thanks
>>> Gui
>>>
>> Jens,
>>
>> This bug seems being introduced in commmit 763414b in for-next branch when
>> merging for-2.6.39/core branch.
>> Would you apply the above patch?
>>
>> Vivek, can you try the patchset again with this fix? It works fine for me now.
>
> Gui,
>
> Ok, I ran iostest with this fix and it seems to have worked. I need to run
> it for some more time. And I also need to spend more time reviewing your
> patchset. There are so many details to it. Soon I will spare some time
> to review it more and also test it bit more.

Vivek,

Ok, thanks.

>
> Of the top of my head I have one concern.
>
> - How to map iopriority to weights. I am thinking that currently weight
> range is 100-1000. If we decide to extend the range in current scheme,
> it will change the ioprio entity weight also and effectively the
> service differentiation between ioprio level will change. I am
> wondering if this is a concern and how cpu scheduler has managed it

Isn't a ten-fold weight difference enough? The old ioprio scheme provides
only a 4.5x service difference, so I think we don't need to extend the
range for the time being.

Thanks,
Gui

>
> Thanks
> Vivek
>

2011-03-07 14:28:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Sat, Mar 05, 2011 at 01:16:08PM +0800, Gui Jianfeng wrote:

[..]
> >> This bug seems being introduced in commmit 763414b in for-next branch when
> >> merging for-2.6.39/core branch.
> >> Would you apply the above patch?
> >>
> >> Vivek, can you try the patchset again with this fix? It works fine for me now.
> >
> > Gui,
> >
> > Ok, I ran iostest with this fix and it seems to have worked. I need to run
> > it for some more time. And I also need to spend more time reviewing your
> > patchset. There are so many details to it. Soon I will spare some time
> > to review it more and also test it bit more.
>
> Vivek,
>
> Ok, thanks.
>
> >
> > Of the top of my head I have one concern.
> >
> > - How to map iopriority to weights. I am thinking that currently weight
> > range is 100-1000. If we decide to extend the range in current scheme,
> > it will change the ioprio entity weight also and effectively the
> > service differentiation between ioprio level will change. I am
> > wondering if this is a concern and how cpu scheduler has managed it
>
> Isn't it enought for ten times of weight difference? The old ioprio scheme
> has only 4.5 times service difference. So I think we don't need to extend
> the range for the time being.

Well, never say never. I think the Google folks are already using a minimum
weight of 10, so don't rule it out.

Secondly, because we might not idle all the time, the effective service
differentiation might be much less than a factor of 10. In that case,
to get an effective 10x, one might have to go for a wider range of weights.

Thanks
Vivek

2011-03-07 18:08:34

by Justin TerAvest

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5.1] cfq-iosched: Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface

On Mon, Mar 7, 2011 at 6:28 AM, Vivek Goyal <[email protected]> wrote:
> On Sat, Mar 05, 2011 at 01:16:08PM +0800, Gui Jianfeng wrote:
>
> [..]
>> >> This bug seems being introduced in commmit 763414b in for-next branch when
>> >> merging for-2.6.39/core branch.
>> >> Would you apply the above patch?
>> >>
>> >> Vivek, can you try the patchset again with this fix? It works fine for me now.
>> >
>> > Gui,
>> >
>> > Ok, I ran iostest with this fix and it seems to have worked. I need to run
>> > it for some more time. And I also need to spend more time reviewing your
>> > patchset. There are so many details to it. Soon I will spare some time
>> > to review it more and also test it bit more.
>>
>> Vivek,
>>
>> Ok, thanks.
>>
>> >
>> > Of the top of my head I have one concern.
>> >
>> > - How to map iopriority to weights. I am thinking that currently weight
>> >   range is 100-1000. If we decide to extend the range in current scheme,
>> >   it will change the ioprio entity weight also and effectively the
>> >   service differentiation between ioprio level will change. I am
>> >   wondering if this is a concern and how cpu scheduler has managed it
>>
>> Isn't it enought for ten times of weight difference? The old ioprio scheme
>> has only 4.5 times service difference. So I think we don't need to extend
>> the range for the time being.
>
> Well, never say never. I think google guys are already using minimum
> weight of 10. So don't rule it out.

Yes, we're using a minimum weight of 10. We still see good isolation with
the minimum that low.

Thanks,
Justin

>
> Secondly, because we might not idle all the time the effective service
> differentiation might be much less than a factor of 10. In that case
> to get effective 10, one might have to go for wider range of weights.
>
> Thanks
> Vivek
>