2009-12-16 22:56:00

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] CFQ group scheduling structure organization

Hi All,

With some basic group scheduling support in CFQ, there are few questions
regarding how group structure should look like in CFQ.

Currently, grouping looks as follows. A, and B are two cgroups created by
user.

Proposal 1:
=========
grp-service-tree
/ | \
root A B

One issue with this structure is that RT tasks are not system wide. So an
RT tasks inside root group has RT priority only with-in root group. So a
BE task inside A will get it fair share despite the fact that root has got
RT tasks.


Proposal 2:
==========
One proposal to solve this issue is that make RT and IDLE tasks system
wide and provide weight based service differentiation only for BE class
tasks. So RT or IDLE tasks running in any of the groups will automatically
move to one global RT group maintained by CFQ internally. Same is true for
IDLE tasks. But BE class tasks will honor the cgroup limitations and will
get differentiated service according to weight.

Internal structure will look as follows.

grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
| / \ |
all_RT_task_group A B all_idle_tasks_grp


Here A and B are two cgroups and some BE tasks might be running inside
those groups. systemwide RT tasks will move under all_RT_task_group and
all idle tasks will move under all_idle_tasks_grp.

So one will notice service differentiation only for BE tasks.


Proposal 3:
===========

One can argue that we need group service differentiation for RT class
tasks also and don't move tasks automatically across groups. That means
we need to support "group class" type also. Probably we can support
three classes of cgroups RT, BE and IDLE and CFQ will use that data to
put cgroups in respective tree.

Things should look as follows.

grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
/ \ / \ / \
C D A B E F


Here A and B are BE type groups created by user.
C and D are RT type cgroups created by user.
E and F are IDLE type cgroups created by user.

Now in this scheme of things, by default root will be of type BE. Any task
RT task under "root" group will not be system wide RT task. It will be RT
only with-in root group. To make it system wide idle, admin shall have to
create a new cgroup, say C, of type RT and move task in that cgroup.
Because RT group C is system wide, now that task becomes system wide RT.

So this scheme might throw some surprise to existing users. They might
create a new group and not realize that their RT tasks are no more system
wide RT tasks and they need to specifically create one RT cgroup and move
all RT tasks in that cgroup.

Practically I am not sure how many people are looking for group service
differentiation for RT and IDLE class tasks also.

Proposal 4:
==========
Treat task and group at same level. Currently groups are at top level and
at second level are tasks. View the whole hierarchy as follows.


service-tree
/ | \ \
T1 T2 G1 G2

Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
created under root.

In this kind of scheme, any RT task in root group will still be system
wide RT even if we create groups G1 and G2.

So what are the issues?

- I talked to few folks and everybody found this scheme not so intutive.
Their argument was that once I create a cgroup, say A, under root, then
bandwidth should be divided between "root" and "A" proportionate to
the weight.

It is not very intutive that group is competing with all the tasks
running in root group. And disk share of newly created group will change
if more tasks fork in root group. So it is highly dynamic and not
static hence un-intutive.

To emulate the behavior of previous proposals, root shall have to create
a new group and move all root tasks there. But admin shall have to still
keep RT tasks in root group so that they still remain system-wide.

service-tree
/ | \ \
T1 root G1 G2
|
T2

Now admin has specifically created a group "root" along side G1 and G2
and moved T2 under root. T1 is still left in top level group as it might
be an RT task and we want it to remain RT task systemwide.

So to some people this scheme is un-intutive and requires more work in
user space to achive desired behavior. I am kind of 50:50 between two
kind of arrangements.


I am looking for some feedback on what makes most sense.

For the time being, I am little inclined towards proposal 2 and I have
implemented a proof of concept version on top of for-2.6.33 branch in block
tree. These patches are compile and boot tested only and I have yet to do
testing.

Thanks
Vivek


2009-12-16 22:55:18

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 1/4] cfq-iosced: Remove the check for same cfq group from allow_merge

o allow_merge() already checks if submitting task is pointing to same cfqq
as rq has been queued in. If everything is fine, we should not be having
a task in one cgroup and having a pointer to cfqq in other cgroup.

Well I guess in some situations it can happen and that is, when a random
IO queue has been moved into root cgroup for group_isolation=0. In
this case, tasks's cgroup/group is different from where actually cfqq is,
but this is intentional and in this case merging should be allowed.

The second situation is where due to close cooperator patches, multiple
processes can be sharing a cfqq. If everything implemented right, we should
not end up in a situation where tasks from different processes in different
groups are sharing the same cfqq as we allow merging of cooperating queues
only if they are in same group.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2f8046..a0e5347 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1513,9 +1513,6 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
struct cfq_io_context *cic;
struct cfq_queue *cfqq;

- /* Deny merge if bio and rq don't belong to same cfq group */
- if ((RQ_CFQQ(rq))->cfqg != cfq_get_cfqg(cfqd, 0))
- return false;
/*
* Disallow merge of a sync bio into an async request.
*/
--
1.6.2.5

2009-12-16 22:55:46

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 2/4] cfq-iosched: Get rid of nr_groups

o Currently code does not seem to be using cfqd->nr_groups. Get rid of it.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a0e5347..d9bfa09 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -208,8 +208,6 @@ struct cfq_data {
/* Root service tree for cfq_groups */
struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;
- /* Number of active cfq groups on group service tree */
- int nr_groups;

/*
* The priority currently being served
@@ -842,7 +840,6 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)

__cfq_group_service_tree_add(st, cfqg);
cfqg->on_st = true;
- cfqd->nr_groups++;
st->total_weight += cfqg->weight;
}

@@ -863,7 +860,6 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)

cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
cfqg->on_st = false;
- cfqd->nr_groups--;
st->total_weight -= cfqg->weight;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
--
1.6.2.5

2009-12-16 22:56:11

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 3/4] cfq-iosched: Remove prio_change logic for workload selection

o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
sync-noidle and async. Which workload to run depends primarily on rb_key
offset across three service trees. Which is a combination of mulitiple things
including what time queue got queued on the service tree.

There is one exception though. That is if we switched the prio class, say
we served some RT tasks and again started serving BE class, then with-in
BE class we always started with sync-noidle workload irrespective of rb_key
offset in service trees.

This can provide better latencies for sync-noidle workload in the presence
of RT tasks.

o This patch gets rid of that exception and which workload to run with-in
class always depends on lowest rb_key across service trees. The reason
being that now we have multiple BE class groups and if we always switch
to sync-noidle workload with-in group, we can potentially starve a sync-idle
workload with-in group. Same is true for async workload which will be in
root group. Also the workload-switching with-in group will become very
unpredictable as it now depends whether some RT workload was running in
the system or not.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 48 ++++++++++++------------------------------------
1 files changed, 12 insertions(+), 36 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d9bfa09..8df4fe5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
- enum wl_type_t type,
- struct cfq_data *cfqd)
+ enum wl_type_t type)
{
if (!cfqg)
return NULL;
@@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
#endif

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
- cfqq_type(cfqq), cfqd);
+ cfqq_type(cfqq));
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
@@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_rb_root *service_tree =
service_tree_for(cfqd->serving_group, cfqd->serving_prio,
- cfqd->serving_type, cfqd);
+ cfqd->serving_type);

if (!cfqd->rq_queued)
return NULL;
@@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}

static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
- struct cfq_group *cfqg, enum wl_prio_t prio,
- bool prio_changed)
+ struct cfq_group *cfqg, enum wl_prio_t prio)
{
struct cfq_queue *queue;
int i;
@@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
unsigned long lowest_key = 0;
enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;

- if (prio_changed) {
- /*
- * When priorities switched, we prefer starting
- * from SYNC_NOIDLE (first choice), or just SYNC
- * over ASYNC
- */
- if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
- return cur_best;
- cur_best = SYNC_WORKLOAD;
- if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
- return cur_best;
-
- return ASYNC_WORKLOAD;
- }
-
- for (i = 0; i < 3; ++i) {
- /* otherwise, select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
+ for (i = 0; i <= SYNC_WORKLOAD; ++i) {
+ /* select the one with lowest rb_key */
+ queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
if (queue &&
(!key_valid || time_before(queue->rb_key, lowest_key))) {
lowest_key = queue->rb_key;
@@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,

static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- enum wl_prio_t previous_prio = cfqd->serving_prio;
- bool prio_changed;
unsigned slice;
unsigned count;
struct cfq_rb_root *st;
@@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
* (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
* expiration time
*/
- prio_changed = (cfqd->serving_prio != previous_prio);
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
- cfqd);
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
count = st->count;

/*
- * If priority didn't change, check workload expiration,
- * and that we still have other queues ready
+ * check workload expiration, and that we still have other queues ready
*/
- if (!prio_changed && count &&
- !time_after(jiffies, cfqd->workload_expires))
+ if (count && !time_after(jiffies, cfqd->workload_expires))
return;

/* otherwise select new workload type */
cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
- cfqd);
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
count = st->count;

/*
--
1.6.2.5

2009-12-16 22:55:28

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 4/4] cfq-iosched: Implement system wide RT and IDLE groups

o This is the core patch which implements system wide RT and IDLE groups
and automatically moves idle and RT tasks in those groups irrespective
of the cgroup they belong to.

This is just proof of concept patch boot tested only.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 313 +++++++++++++++++++++++++++++++--------------------
1 files changed, 190 insertions(+), 123 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8df4fe5..c6235d5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -178,21 +178,23 @@ struct cfq_group {
unsigned int weight;
bool on_st;

+ /* Group's prio class (RT, BE, IDLE) */
+ enum wl_prio_t prio_class;
+
/* number of cfqq currently on this group */
int nr_cfqq;

/* Per group busy queus average. Useful for workload slice calc. */
- unsigned int busy_queues_avg[2];
+ unsigned int busy_queues_avg;
/*
- * rr lists of queues with requests, onle rr for each priority class.
- * Counts are embedded in the cfq_rb_root
+ * rr lists of cfq queues with requests, One service tree each for
+ * each kind of workload (sync-idle, sync-noidle, async). Counts are
+ * embedded in the cfq_rb_root.
*/
- struct cfq_rb_root service_trees[2][3];
- struct cfq_rb_root service_tree_idle;
+ struct cfq_rb_root service_trees[3];

unsigned long saved_workload_slice;
enum wl_type_t saved_workload;
- enum wl_prio_t saved_serving_prio;
struct blkio_group blkg;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct hlist_node cfqd_node;
@@ -206,11 +208,11 @@ struct cfq_group {
struct cfq_data {
struct request_queue *queue;
/* Root service tree for cfq_groups */
- struct cfq_rb_root grp_service_tree;
- struct cfq_group root_group;
+ struct cfq_rb_root grp_service_trees[3];
+ struct cfq_group root_groups[3];

/*
- * The priority currently being served
+ * The workload currently being served
*/
enum wl_prio_t serving_prio;
enum wl_type_t serving_type;
@@ -290,17 +292,17 @@ struct cfq_data {

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

-static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
- enum wl_prio_t prio,
- enum wl_type_t type)
+static struct cfq_rb_root *
+service_tree_for(struct cfq_group *cfqg, enum wl_type_t type)
{
if (!cfqg)
return NULL;

- if (prio == IDLE_WORKLOAD)
- return &cfqg->service_tree_idle;
+ /* For idle class group, always use first service tree */
+ if (cfqg->prio_class == IDLE_WORKLOAD)
+ return &cfqg->service_trees[0];

- return &cfqg->service_trees[prio][type];
+ return &cfqg->service_trees[type];
}

enum cfqq_state_flags {
@@ -365,14 +367,9 @@ CFQ_CFQQ_FNS(wait_busy);
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

/* Traverses through cfq group service trees */
-#define for_each_cfqg_st(cfqg, i, j, st) \
- for (i = 0; i <= IDLE_WORKLOAD; i++) \
- for (j = 0, st = i < IDLE_WORKLOAD ? &cfqg->service_trees[i][j]\
- : &cfqg->service_tree_idle; \
- (i < IDLE_WORKLOAD && j <= SYNC_WORKLOAD) || \
- (i == IDLE_WORKLOAD && j == 0); \
- j++, st = i < IDLE_WORKLOAD ? \
- &cfqg->service_trees[i][j]: NULL) \
+#define for_each_cfqg_st(cfqg, i, st) \
+ for (i = 0, st = &cfqg->service_trees[i]; \
+ i <= SYNC_WORKLOAD && (st = &cfqg->service_trees[i]); i++) \


static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
@@ -394,23 +391,18 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)
return SYNC_WORKLOAD;
}

-static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
- struct cfq_data *cfqd,
- struct cfq_group *cfqg)
+static inline int
+cfq_group_busy_queues(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- if (wl == IDLE_WORKLOAD)
- return cfqg->service_tree_idle.count;
-
- return cfqg->service_trees[wl][ASYNC_WORKLOAD].count
- + cfqg->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
- + cfqg->service_trees[wl][SYNC_WORKLOAD].count;
+ return cfqg->service_trees[ASYNC_WORKLOAD].count
+ + cfqg->service_trees[SYNC_NOIDLE_WORKLOAD].count
+ + cfqg->service_trees[SYNC_WORKLOAD].count;
}

-static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
- struct cfq_group *cfqg)
+static inline int
+cfqg_busy_async_queues(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
- + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+ return cfqg->service_trees[ASYNC_WORKLOAD].count;
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
@@ -531,30 +523,30 @@ static void update_min_vdisktime(struct cfq_rb_root *st)
}

/*
- * get averaged number of queues of RT/BE priority.
- * average is updated, with a formula that gives more weight to higher numbers,
- * to quickly follows sudden increases and decrease slowly
+ * get averaged number of queues in the group. average is updated, with a
+ * formula that gives more weight to higher numbers, to quickly follows sudden
+ * increases and decrease slowly.
*/

-static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
- struct cfq_group *cfqg, bool rt)
+static inline unsigned
+cfq_group_get_avg_queues(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
unsigned min_q, max_q;
unsigned mult = cfq_hist_divisor - 1;
unsigned round = cfq_hist_divisor / 2;
- unsigned busy = cfq_group_busy_queues_wl(rt, cfqd, cfqg);
+ unsigned busy = cfq_group_busy_queues(cfqd, cfqg);

- min_q = min(cfqg->busy_queues_avg[rt], busy);
- max_q = max(cfqg->busy_queues_avg[rt], busy);
- cfqg->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+ min_q = min(cfqg->busy_queues_avg, busy);
+ max_q = max(cfqg->busy_queues_avg, busy);
+ cfqg->busy_queues_avg = (mult * max_q + min_q + round) /
cfq_hist_divisor;
- return cfqg->busy_queues_avg[rt];
+ return cfqg->busy_queues_avg;
}

static inline unsigned
cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_rb_root *st = &cfqd->grp_service_trees[cfqg->prio_class];

return cfq_target_latency * cfqg->weight / st->total_weight;
}
@@ -568,8 +560,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
* interested queues (we consider only the ones with the same
* priority class in the cfq group)
*/
- unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
- cfq_class_rt(cfqq));
+ unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg);
unsigned sync_slice = cfqd->cfq_slice[1];
unsigned expect_latency = sync_slice * iq;
unsigned group_slice = cfq_group_slice(cfqd, cfqq->cfqg);
@@ -817,7 +808,7 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_rb_root *st = &cfqd->grp_service_trees[cfqg->prio_class];
struct cfq_group *__cfqg;
struct rb_node *n;

@@ -845,7 +836,7 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
static void
cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_rb_root *st = &cfqd->grp_service_trees[cfqg->prio_class];

if (st->active == &cfqg->rb_node)
st->active = NULL;
@@ -897,10 +888,9 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_rb_root *st = &cfqd->grp_service_trees[cfqg->prio_class];
unsigned int used_sl, charge_sl;
- int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- - cfqg->service_tree_idle.count;
+ int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg);

BUG_ON(nr_sync < 0);
used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
@@ -918,7 +908,6 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_workload_slice = cfqd->workload_expires
- jiffies;
cfqg->saved_workload = cfqd->serving_type;
- cfqg->saved_serving_prio = cfqd->serving_prio;
} else
cfqg->saved_workload_slice = 0;

@@ -948,7 +937,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
struct cfq_group *cfqg = NULL;
void *key = cfqd;
- int i, j;
+ int i;
struct cfq_rb_root *st;
struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
unsigned int major, minor;
@@ -966,9 +955,10 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
goto done;

cfqg->weight = blkcg->weight;
- for_each_cfqg_st(cfqg, i, j, st)
+ for_each_cfqg_st(cfqg, i, st)
*st = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);
+ cfqg->prio_class = BE_WORKLOAD;

/*
* Take the initial reference that will be released on destroy
@@ -999,12 +989,35 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
struct cgroup *cgroup;
struct cfq_group *cfqg = NULL;
+ struct task_struct *tsk = current;
+ int ioprio_class = IOPRIO_CLASS_NONE;
+
+ /*
+ * If task belongs to RT or IDLE class, statically assign it to root
+ * rt or idle group respectively.
+ */
+ if (tsk->io_context)
+ ioprio_class = IOPRIO_PRIO_CLASS(tsk->io_context->ioprio);
+
+ if (ioprio_class == IOPRIO_CLASS_NONE)
+ /*
+ * no prio set, inherit CPU scheduling settings
+ */
+ ioprio_class = task_nice_ioclass(tsk);
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ return &cfqd->root_groups[RT_WORKLOAD];
+ case IOPRIO_CLASS_IDLE:
+ return &cfqd->root_groups[IDLE_WORKLOAD];
+ }

+ /* Its a BE class task. Find alloc group */
rcu_read_lock();
cgroup = task_cgroup(current, blkio_subsys_id);
cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
if (!cfqg && create)
- cfqg = &cfqd->root_group;
+ cfqg = &cfqd->root_groups[BE_WORKLOAD];
rcu_read_unlock();
return cfqg;
}
@@ -1013,7 +1026,7 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
{
/* Currently, all async queues are mapped to root group */
if (!cfq_cfqq_sync(cfqq))
- cfqg = &cfqq->cfqd->root_group;
+ cfqg = &cfqq->cfqd->root_groups[cfqq_prio(cfqq)];

cfqq->cfqg = cfqg;
/* cfqq reference on cfqg */
@@ -1023,12 +1036,12 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
static void cfq_put_cfqg(struct cfq_group *cfqg)
{
struct cfq_rb_root *st;
- int i, j;
+ int i;

BUG_ON(atomic_read(&cfqg->ref) <= 0);
if (!atomic_dec_and_test(&cfqg->ref))
return;
- for_each_cfqg_st(cfqg, i, j, st)
+ for_each_cfqg_st(cfqg, i, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
kfree(cfqg);
}
@@ -1090,7 +1103,28 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
#else /* GROUP_IOSCHED */
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
- return &cfqd->root_group;
+ struct task_struct *tsk = current;
+ int ioprio_class;
+
+ /*
+ * If task belongs to RT or IDLE class, statically assign it to root
+ * rt or idle group respectively.
+ */
+ ioprio_class = IOPRIO_PRIO_CLASS(tsk->io_context->ioprio);
+ if (ioprio_class == IOPRIO_CLASS_NONE)
+ /*
+ * no prio set, inherit CPU scheduling settings
+ */
+ ioprio_class = task_nice_ioclass(tsk);
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ return &cfqd->root_groups[RT_WORKLOAD];
+ case IOPRIO_CLASS_IDLE:
+ return &cfqd->root_groups[IDLE_WORKLOAD];
+ }
+
+ return &cfqd->root_groups[BE_WORKLOAD];
}
static inline void
cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
@@ -1118,22 +1152,46 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
int new_cfqq = 1;
int group_changed = 0;

+ /* Check if cfqq's group changed because of cfqq prio class changes */
+ if (cfqq->cfqg && cfqq_prio(cfqq) != cfqq->cfqg->prio_class) {
+ /*
+ * Move cfqq to new group of right ioprio class. This movement
+ * happens in root group as we don't have any information
+ * about submitting task context hence cgroup here.
+ *
+ * TODO: when prio class is changed, make sure to drop the ioc
+ * reference to cfqq so that a new queue is setup for new
+ * request and this queue will complete IO in root group
+ */
+ if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+
+ if (cfqq->orig_cfqg) {
+ cfq_put_cfqg(cfqq->orig_cfqg);
+ cfqq->orig_cfqg = NULL;
+ }
+
+ cfqq->cfqg = &cfqd->root_groups[cfqq_prio(cfqq)];
+ group_changed = 1;
+ }
+
#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ /* Handle group movement because of cfqq workload type changes */
if (!cfqd->cfq_group_isolation
- && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
- && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
+ && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD && cfqq->cfqg
+ && cfqq->cfqg != &cfqd->root_groups[cfqq_prio(cfqq)]) {
/* Move this cfq to root group */
cfq_log_cfqq(cfqd, cfqq, "moving to root group");
if (!RB_EMPTY_NODE(&cfqq->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfqq->orig_cfqg = cfqq->cfqg;
- cfqq->cfqg = &cfqd->root_group;
- atomic_inc(&cfqd->root_group.ref);
+ cfqq->cfqg = &cfqd->root_groups[cfqq_prio(cfqq)];
+ atomic_inc(&cfqd->root_groups[cfqq_prio(cfqq)].ref);
group_changed = 1;
} else if (!cfqd->cfq_group_isolation
&& cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
/* cfqq is sequential now needs to go to its original group */
- BUG_ON(cfqq->cfqg != &cfqd->root_group);
+ BUG_ON(cfqq->cfqg != &cfqd->root_groups[cfqq_prio(cfqq)]);
if (!RB_EMPTY_NODE(&cfqq->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfq_put_cfqg(cfqq->cfqg);
@@ -1144,8 +1202,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
#endif

- service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
- cfqq_type(cfqq));
+ service_tree = service_tree_for(cfqq->cfqg, cfqq_type(cfqq));
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
@@ -1557,6 +1614,8 @@ static void
__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool timed_out)
{
+ struct cfq_group *cfqg = cfqq->cfqg;
+
cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);

if (cfq_cfqq_wait_request(cfqq))
@@ -1583,8 +1642,9 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfqq == cfqd->active_queue)
cfqd->active_queue = NULL;

- if (&cfqq->cfqg->rb_node == cfqd->grp_service_tree.active)
- cfqd->grp_service_tree.active = NULL;
+ /* Do not rely on cfqq->cfqg as after resort, cfqq might change group */
+ if (&cfqg->rb_node == cfqd->grp_service_trees[cfqg->prio_class].active)
+ cfqd->grp_service_trees[cfqg->prio_class].active = NULL;

if (cfqd->active_cic) {
put_io_context(cfqd->active_cic->ioc);
@@ -1607,8 +1667,7 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_rb_root *service_tree =
- service_tree_for(cfqd->serving_group, cfqd->serving_prio,
- cfqd->serving_type);
+ service_tree_for(cfqd->serving_group, cfqd->serving_type);

if (!cfqd->rq_queued)
return NULL;
@@ -1625,7 +1684,7 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
{
struct cfq_group *cfqg;
struct cfq_queue *cfqq;
- int i, j;
+ int i;
struct cfq_rb_root *st;

if (!cfqd->rq_queued)
@@ -1635,7 +1694,7 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
if (!cfqg)
return NULL;

- for_each_cfqg_st(cfqg, i, j, st)
+ for_each_cfqg_st(cfqg, i, st)
if ((cfqq = cfq_rb_first(st)) != NULL)
return cfqq;
return NULL;
@@ -1954,8 +2013,8 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}
}

-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
- struct cfq_group *cfqg, enum wl_prio_t prio)
+static enum wl_type_t
+cfq_choose_wl(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_queue *queue;
int i;
@@ -1965,7 +2024,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,

for (i = 0; i <= SYNC_WORKLOAD; ++i) {
/* select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
+ queue = cfq_rb_first(service_tree_for(cfqg, i));
if (queue &&
(!key_valid || time_before(queue->rb_key, lowest_key))) {
lowest_key = queue->rb_key;
@@ -1984,19 +2043,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
struct cfq_rb_root *st;
unsigned group_slice;

- if (!cfqg) {
- cfqd->serving_prio = IDLE_WORKLOAD;
- cfqd->workload_expires = jiffies + 1;
- return;
- }
-
- /* Choose next priority. RT > BE > IDLE */
- if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = RT_WORKLOAD;
- else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = BE_WORKLOAD;
- else {
- cfqd->serving_prio = IDLE_WORKLOAD;
+ if (cfqg->prio_class == IOPRIO_CLASS_IDLE) {
cfqd->workload_expires = jiffies + 1;
return;
}
@@ -2006,7 +2053,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
* (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
* expiration time
*/
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+ st = service_tree_for(cfqg, cfqd->serving_type);
count = st->count;

/*
@@ -2016,9 +2063,8 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;

/* otherwise select new workload type */
- cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+ cfqd->serving_type = cfq_choose_wl(cfqd, cfqg);
+ st = service_tree_for(cfqg, cfqd->serving_type);
count = st->count;

/*
@@ -2028,9 +2074,8 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
*/
group_slice = cfq_group_slice(cfqd, cfqg);

- slice = group_slice * count /
- max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
- cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
+ slice = group_slice * count / max_t(unsigned, cfqg->busy_queues_avg,
+ cfq_group_busy_queues(cfqd, cfqg));

if (cfqd->serving_type == ASYNC_WORKLOAD) {
unsigned int tmp;
@@ -2060,14 +2105,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
- struct cfq_group *cfqg;
+ struct cfq_rb_root *st;
+ struct cfq_group *cfqg = NULL;
+ int i;
+
+ for (i = 0; i <= IDLE_WORKLOAD; i++) {
+ st = &cfqd->grp_service_trees[i];
+ if (RB_EMPTY_ROOT(&st->rb))
+ continue;
+ cfqg = cfq_rb_first_group(st);
+ st->active = &cfqg->rb_node;
+ update_min_vdisktime(st);
+ }

- if (RB_EMPTY_ROOT(&st->rb))
- return NULL;
- cfqg = cfq_rb_first_group(st);
- st->active = &cfqg->rb_node;
- update_min_vdisktime(st);
return cfqg;
}

@@ -2077,11 +2127,20 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)

cfqd->serving_group = cfqg;

+ if (!cfqg) {
+ /*
+ * Nothing to dispatch. Mark workload as expired so that next
+ * time we choose a fresh workload
+ */
+ cfqd->serving_type = SYNC_NOIDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies - 1;
+ return;
+ }
+
/* Restore the workload type data */
if (cfqg->saved_workload_slice) {
cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
cfqd->serving_type = cfqg->saved_workload;
- cfqd->serving_prio = cfqg->saved_serving_prio;
} else
cfqd->workload_expires = jiffies - 1;

@@ -3651,7 +3710,7 @@ static void cfq_exit_queue(struct elevator_queue *e)

cfq_put_async_queues(cfqd);
cfq_release_cfq_groups(cfqd);
- blkiocg_del_blkio_group(&cfqd->root_group.blkg);
+ blkiocg_del_blkio_group(&cfqd->root_groups[BE_WORKLOAD].blkg);

spin_unlock_irq(q->queue_lock);

@@ -3672,27 +3731,35 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;

- /* Init root service tree */
- cfqd->grp_service_tree = CFQ_RB_ROOT;
+ /* Init root service trees */
+ for (i = 0; i <= IDLE_WORKLOAD; i++)
+ cfqd->grp_service_trees[i] = CFQ_RB_ROOT;

/* Init root group */
- cfqg = &cfqd->root_group;
- for_each_cfqg_st(cfqg, i, j, st)
- *st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->rb_node);
+ for (i = 0; i <= IDLE_WORKLOAD; i++) {
+ cfqg = &cfqd->root_groups[i];
+ for_each_cfqg_st(cfqg, j, st)
+ *st = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);

- /* Give preference to root group over other groups */
- cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+ /* Give preference to root group over other groups */
+ cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+ cfqg->prio_class = i;

#ifdef CONFIG_CFQ_GROUP_IOSCHED
- /*
- * Take a reference to root group which we never drop. This is just
- * to make sure that cfq_put_cfqg() does not try to kfree root group
- */
- atomic_set(&cfqg->ref, 1);
- blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd,
- 0);
+ /*
+ * Take a reference to root group which we never drop. This is
+ * just to make sure that cfq_put_cfqg() does not try to
+ * kfree root group.
+ */
+ atomic_set(&cfqg->ref, 1);
+
+ /* TODO: Fix it to add RT and IDLE groups also to root group */
+ if (cfqg->prio_class == BE_WORKLOAD)
+ blkiocg_add_blkio_group(&blkio_root_cgroup,
+ &cfqg->blkg, (void *)cfqd, 0);
#endif
+ }
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
@@ -3708,7 +3775,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
atomic_inc(&cfqd->oom_cfqq.ref);
- cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
+ cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_groups[BE_WORKLOAD]);

INIT_LIST_HEAD(&cfqd->cic_list);

--
1.6.2.5

2009-12-16 23:14:16

by Nauman Rafique

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Wed, Dec 16, 2009 at 2:52 PM, Vivek Goyal <[email protected]> wrote:
> Hi All,
>
> With some basic group scheduling support in CFQ, there are few questions
> regarding how group structure should look like in CFQ.
>
> Currently, grouping looks as follows. A, and B are two cgroups created by
> user.
>
> Proposal 1:
> =========
> ? ? ? ? ? ? ? ? ? ? ? ?grp-service-tree
> ? ? ? ? ? ? ? ? ? ? ? ?/ ? ? ?| ? ? \
> ? ? ? ? ? ? ? ? ? ?root ? ? ? A ? ? B
>
> One issue with this structure is that RT tasks are not system wide. So an
> RT tasks inside root group has RT priority only with-in root group. So a
> BE task inside A will get it fair share despite the fact that root has got
> RT tasks.
>
>
> Proposal 2:
> ==========
> One proposal to solve this issue is that make RT and IDLE tasks system
> wide and provide weight based service differentiation only for BE class
> tasks. So RT or IDLE tasks running in any of the groups will automatically
> move to one global RT group maintained by CFQ internally. Same is true for
> IDLE tasks. But BE class tasks will honor the cgroup limitations and will
> get differentiated service according to weight.
>
> Internal structure will look as follows.
>
> ? ? grp-RT-service-tree ?grp-BE-service-tree ? grp-IDLE-service-tree
> ? ? ? ? ? ? | ? ? ? ? ? ? ? ?/ ?\ ? ? ? ? ? ? ? ? ? ? ?|
> ? ? ? ?all_RT_task_group ? ?A ? B ? ? ? ? ? ? ? all_idle_tasks_grp
>
>
> Here A and B are two cgroups and some BE tasks might be running inside
> those groups. systemwide RT tasks will move under all_RT_task_group and
> all idle tasks will move under all_idle_tasks_grp.
>
> So one will notice service differentiation only for BE tasks.

What is the use case for this kind of setup? Do we need RT tasks
system wide that could basically starve out all the cgroups present in
the system?

>
>
> Proposal 3:
> ===========
>
> One can argue that we need group service differentiation for RT class
> tasks also and don't move tasks automatically across groups. That means
> we need to support "group class" type also. Probably we can support
> three classes of cgroups RT, BE and IDLE and CFQ will use that data to
> put cgroups in respective tree.
>
> Things should look as follows.
>
> ? ? grp-RT-service-tree ?grp-BE-service-tree ? grp-IDLE-service-tree
> ? ? ? ? ? ? / \ ? ? ? ? ? ? ? ? ? ? ?/ ?\ ? ? ? ? ? ? / ? \
> ? ? ? ? ? ?C ?D ? ? ? ? ? ? ? ? ? ? A ? B ? ? ? ? ? ?E ? ?F
>
>
> Here A and B are BE type groups created by user.
> C and D are RT type cgroups created by user.
> E and F are IDLE type cgroups created by user.
>
> Now in this scheme of things, by default root will be of type BE. Any task
> RT task under "root" group will not be system wide RT task. It will be RT
> only with-in root group. To make it system wide idle, admin shall have to
> create a new cgroup, say C, of type RT and move task in that cgroup.
> Because RT group C is system wide, now that task becomes system wide RT.
>
> So this scheme might throw some surprise to existing users. They might
> create a new group and not realize that their RT tasks are no more system
> wide RT tasks and they need to specifically create one RT cgroup and move
> all RT tasks in that cgroup.
>
> Practically I am not sure how many people are looking for group service
> differentiation for RT and IDLE class tasks also.

IMHO, this is the setup that offers the most flexibility. With this,
we could set up a cgroup which has RT tasks internal to it, but it
still obeys the proportions imposed by system. And if one wanted a
real RT behavior, a cgroup with RT class can be used.

>
> Proposal 4:
> ==========
> Treat task and group at same level. Currently groups are at top level and
> at second level are tasks. View the whole hierarchy as follows.
>
>
> ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ?\ ?\
> ? ? ? ? ? ? ? ? ? ? ? T1 ? T2 ?G1 G2
>
> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> created under root.
>
> In this kind of scheme, any RT task in root group will still be system
> wide RT even if we create groups G1 and G2.
>
> So what are the issues?
>
> - I talked to few folks and everybody found this scheme not so intutive.
> ?Their argument was that once I create a cgroup, say A, ?under root, then
> ?bandwidth should be divided between "root" and "A" proportionate to
> ?the weight.
>
> ?It is not very intutive that group is competing with all the tasks
> ?running in root group. And disk share of newly created group will change
> ?if more tasks fork in root group. So it is highly dynamic and not
> ?static hence un-intutive.
>
> ?To emulate the behavior of previous proposals, root shall have to create
> ?a new group and move all root tasks there. But admin shall have to still
> ?keep RT tasks in root group so that they still remain system-wide.
>
> ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ? ?\ ?\
> ? ? ? ? ? ? ? ? ? ? ? T1 ?root ?G1 G2
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?|
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?T2
>
> ?Now admin has specifically created a group "root" along side G1 and G2
> ?and moved T2 under root. T1 is still left in top level group as it might
> ?be an RT task and we want it to remain RT task systemwide.
>
> ?So to some people this scheme is un-intutive and requires more work in
> ?user space to achive desired behavior. I am kind of 50:50 between two
> ?kind of arrangements.
>
>
> I am looking for some feedback on what makes most sense.
>
> For the time being, I am little inclined towards proposal 2 and I have
> implemented a proof of concept version on top of for-2.6.33 branch in block
> tree. ?These patches are compile and boot tested only and I have yet to do
> testing.
>
> Thanks
> Vivek
>
>

2009-12-16 23:26:34

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Wed, Dec 16, 2009 at 03:14:08PM -0800, Nauman Rafique wrote:
> On Wed, Dec 16, 2009 at 2:52 PM, Vivek Goyal <[email protected]> wrote:
> > Hi All,
> >
> > With some basic group scheduling support in CFQ, there are few questions
> > regarding how group structure should look like in CFQ.
> >
> > Currently, grouping looks as follows. A, and B are two cgroups created by
> > user.
> >
> > Proposal 1:
> > =========
> > ? ? ? ? ? ? ? ? ? ? ? ?grp-service-tree
> > ? ? ? ? ? ? ? ? ? ? ? ?/ ? ? ?| ? ? \
> > ? ? ? ? ? ? ? ? ? ?root ? ? ? A ? ? B
> >
> > One issue with this structure is that RT tasks are not system wide. So an
> > RT tasks inside root group has RT priority only with-in root group. So a
> > BE task inside A will get it fair share despite the fact that root has got
> > RT tasks.
> >
> >
> > Proposal 2:
> > ==========
> > One proposal to solve this issue is that make RT and IDLE tasks system
> > wide and provide weight based service differentiation only for BE class
> > tasks. So RT or IDLE tasks running in any of the groups will automatically
> > move to one global RT group maintained by CFQ internally. Same is true for
> > IDLE tasks. But BE class tasks will honor the cgroup limitations and will
> > get differentiated service according to weight.
> >
> > Internal structure will look as follows.
> >
> > ? ? grp-RT-service-tree ?grp-BE-service-tree ? grp-IDLE-service-tree
> > ? ? ? ? ? ? | ? ? ? ? ? ? ? ?/ ?\ ? ? ? ? ? ? ? ? ? ? ?|
> > ? ? ? ?all_RT_task_group ? ?A ? B ? ? ? ? ? ? ? all_idle_tasks_grp
> >
> >
> > Here A and B are two cgroups and some BE tasks might be running inside
> > those groups. systemwide RT tasks will move under all_RT_task_group and
> > all idle tasks will move under all_idle_tasks_grp.
> >
> > So one will notice service differentiation only for BE tasks.
>
> What is the use case for this kind of setup? Do we need RT tasks
> system wide that could basically starve out all the cgroups present in
> the system?
>

Isn't that expected out of RT tasks? In fact only admin can launch RT
tasks and users can not. So what's the point of creating an RT task if
it is RT only with-in root group and not starving out user's BE tasks. It
better be then a simple BE task.

So until and unless admin wants some kind of group service differentiation
between differnt RT tasks, it makes sense to keep RT task system wide.

> >
> >
> > Proposal 3:
> > ===========
> >
> > One can argue that we need group service differentiation for RT class
> > tasks also and don't move tasks automatically across groups. That means
> > we need to support "group class" type also. Probably we can support
> > three classes of cgroups RT, BE and IDLE and CFQ will use that data to
> > put cgroups in respective tree.
> >
> > Things should look as follows.
> >
> > ? ? grp-RT-service-tree ?grp-BE-service-tree ? grp-IDLE-service-tree
> > ? ? ? ? ? ? / \ ? ? ? ? ? ? ? ? ? ? ?/ ?\ ? ? ? ? ? ? / ? \
> > ? ? ? ? ? ?C ?D ? ? ? ? ? ? ? ? ? ? A ? B ? ? ? ? ? ?E ? ?F
> >
> >
> > Here A and B are BE type groups created by user.
> > C and D are RT type cgroups created by user.
> > E and F are IDLE type cgroups created by user.
> >
> > Now in this scheme of things, by default root will be of type BE. Any task
> > RT task under "root" group will not be system wide RT task. It will be RT
> > only with-in root group. To make it system wide idle, admin shall have to
> > create a new cgroup, say C, of type RT and move task in that cgroup.
> > Because RT group C is system wide, now that task becomes system wide RT.
> >
> > So this scheme might throw some surprise to existing users. They might
> > create a new group and not realize that their RT tasks are no more system
> > wide RT tasks and they need to specifically create one RT cgroup and move
> > all RT tasks in that cgroup.
> >
> > Practically I am not sure how many people are looking for group service
> > differentiation for RT and IDLE class tasks also.
>
> IMHO, this is the setup that offers the most flexibility. With this,
> we could set up a cgroup which has RT tasks internal to it, but it
> still obeys the proportions imposed by system. And if one wanted a
> real RT behavior, a cgroup with RT class can be used.

True that this is more enhanced functionality and provides more
configuration choices. But then this will also make implementation more
complex and to maintain current RT behavior, root needs to specifically
move out RT tasks in an RT group. Otherwise surprise is in store.

I think key question here is that does admin want to get group isolation
for RT tasks also or not. Will it be common enough, worth the extra
complexity.

>
> >
> > Proposal 4:
> > ==========
> > Treat task and group at same level. Currently groups are at top level and
> > at second level are tasks. View the whole hierarchy as follows.
> >
> >
> > ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> > ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ?\ ?\
> > ? ? ? ? ? ? ? ? ? ? ? T1 ? T2 ?G1 G2
> >
> > Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> > created under root.
> >
> > In this kind of scheme, any RT task in root group will still be system
> > wide RT even if we create groups G1 and G2.
> >
> > So what are the issues?
> >
> > - I talked to few folks and everybody found this scheme not so intutive.
> > ?Their argument was that once I create a cgroup, say A, ?under root, then
> > ?bandwidth should be divided between "root" and "A" proportionate to
> > ?the weight.
> >
> > ?It is not very intutive that group is competing with all the tasks
> > ?running in root group. And disk share of newly created group will change
> > ?if more tasks fork in root group. So it is highly dynamic and not
> > ?static hence un-intutive.
> >
> > ?To emulate the behavior of previous proposals, root shall have to create
> > ?a new group and move all root tasks there. But admin shall have to still
> > ?keep RT tasks in root group so that they still remain system-wide.
> >
> > ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> > ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ? ?\ ?\
> > ? ? ? ? ? ? ? ? ? ? ? T1 ?root ?G1 G2
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ?|
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ?T2
> >
> > ?Now admin has specifically created a group "root" along side G1 and G2
> > ?and moved T2 under root. T1 is still left in top level group as it might
> > ?be an RT task and we want it to remain RT task systemwide.
> >
> > ?So to some people this scheme is un-intutive and requires more work in
> > ?user space to achive desired behavior. I am kind of 50:50 between two
> > ?kind of arrangements.
> >
> >
> > I am looking for some feedback on what makes most sense.
> >
> > For the time being, I am little inclined towards proposal 2 and I have
> > implemented a proof of concept version on top of for-2.6.33 branch in block
> > tree. ?These patches are compile and boot tested only and I have yet to do
> > testing.
> >
> > Thanks
> > Vivek
> >
> >

2009-12-17 09:25:09

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 3/4] cfq-iosched: Remove prio_change logic for workload selection

Vivek Goyal wrote:
> o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
> sync-noidle and async. Which workload to run depends primarily on rb_key
> offset across three service trees. Which is a combination of mulitiple things
> including what time queue got queued on the service tree.
>
> There is one exception though. That is if we switched the prio class, say
> we served some RT tasks and again started serving BE class, then with-in
> BE class we always started with sync-noidle workload irrespective of rb_key
> offset in service trees.
>
> This can provide better latencies for sync-noidle workload in the presence
> of RT tasks.
>
> o This patch gets rid of that exception and which workload to run with-in
> class always depends on lowest rb_key across service trees. The reason
> being that now we have multiple BE class groups and if we always switch
> to sync-noidle workload with-in group, we can potentially starve a sync-idle
> workload with-in group. Same is true for async workload which will be in
> root group. Also the workload-switching with-in group will become very
> unpredictable as it now depends whether some RT workload was running in
> the system or not.
>
> Signed-off-by: Vivek Goyal <[email protected]>

Actually, in current CFQ, there still has bug in prio_changed logic.
Because prio_changed = (cfqd->serving_prio != previous_prio), and previous_prio
might come from another group, in this case, prio_changed becomes invalid, but
cfq_choose_wl() still relies on prio_changed to calculate serving_type, this
doesn't make sence IMHO.
But with this change, this bug is gone.

Reviewed-by: Gui Jianfeng <[email protected]>

> ---
> block/cfq-iosched.c | 48 ++++++++++++------------------------------------
> 1 files changed, 12 insertions(+), 36 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index d9bfa09..8df4fe5 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>
> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> enum wl_prio_t prio,
> - enum wl_type_t type,
> - struct cfq_data *cfqd)
> + enum wl_type_t type)
> {
> if (!cfqg)
> return NULL;
> @@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> #endif
>
> service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
> - cfqq_type(cfqq), cfqd);
> + cfqq_type(cfqq));
> if (cfq_class_idle(cfqq)) {
> rb_key = CFQ_IDLE_DELAY;
> parent = rb_last(&service_tree->rb);
> @@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> {
> struct cfq_rb_root *service_tree =
> service_tree_for(cfqd->serving_group, cfqd->serving_prio,
> - cfqd->serving_type, cfqd);
> + cfqd->serving_type);
>
> if (!cfqd->rq_queued)
> return NULL;
> @@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> }
>
> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> - struct cfq_group *cfqg, enum wl_prio_t prio,
> - bool prio_changed)
> + struct cfq_group *cfqg, enum wl_prio_t prio)
> {
> struct cfq_queue *queue;
> int i;
> @@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> unsigned long lowest_key = 0;
> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>
> - if (prio_changed) {
> - /*
> - * When priorities switched, we prefer starting
> - * from SYNC_NOIDLE (first choice), or just SYNC
> - * over ASYNC
> - */
> - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> - return cur_best;
> - cur_best = SYNC_WORKLOAD;
> - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> - return cur_best;
> -
> - return ASYNC_WORKLOAD;
> - }
> -
> - for (i = 0; i < 3; ++i) {
> - /* otherwise, select the one with lowest rb_key */
> - queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
> + for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> + /* select the one with lowest rb_key */
> + queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
> if (queue &&
> (!key_valid || time_before(queue->rb_key, lowest_key))) {
> lowest_key = queue->rb_key;
> @@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>
> static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - enum wl_prio_t previous_prio = cfqd->serving_prio;
> - bool prio_changed;
> unsigned slice;
> unsigned count;
> struct cfq_rb_root *st;
> @@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> * expiration time
> */
> - prio_changed = (cfqd->serving_prio != previous_prio);
> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> - cfqd);
> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> count = st->count;
>
> /*
> - * If priority didn't change, check workload expiration,
> - * and that we still have other queues ready
> + * check workload expiration, and that we still have other queues ready
> */
> - if (!prio_changed && count &&
> - !time_after(jiffies, cfqd->workload_expires))
> + if (count && !time_after(jiffies, cfqd->workload_expires))
> return;
>
> /* otherwise select new workload type */
> cfqd->serving_type =
> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> - cfqd);
> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> count = st->count;
>
> /*

2009-12-17 09:30:35

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 1/4] cfq-iosced: Remove the check for same cfq group from allow_merge

Vivek Goyal wrote:
> o allow_merge() already checks if submitting task is pointing to same cfqq
> as rq has been queued in. If everything is fine, we should not be having
> a task in one cgroup and having a pointer to cfqq in other cgroup.
>
> Well I guess in some situations it can happen and that is, when a random
> IO queue has been moved into root cgroup for group_isolation=0. In
> this case, tasks's cgroup/group is different from where actually cfqq is,
> but this is intentional and in this case merging should be allowed.
>
> The second situation is where due to close cooperator patches, multiple
> processes can be sharing a cfqq. If everything implemented right, we should
> not end up in a situation where tasks from different processes in different
> groups are sharing the same cfqq as we allow merging of cooperating queues
> only if they are in same group.
>
> Signed-off-by: Vivek Goyal <[email protected]>

Reviewed-by: Gui Jianfeng <[email protected]>

2009-12-17 09:31:19

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 2/4] cfq-iosched: Get rid of nr_groups

Vivek Goyal wrote:
> o Currently code does not seem to be using cfqd->nr_groups. Get rid of it.
>
> Signed-off-by: Vivek Goyal <[email protected]>

Reviewed-by: Gui Jianfeng <[email protected]>

2009-12-17 10:22:16

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

Vivek Goyal wrote:
> Hi All,
>
> With some basic group scheduling support in CFQ, there are few questions
> regarding how group structure should look like in CFQ.
>
> Currently, grouping looks as follows. A, and B are two cgroups created by
> user.
>
> Proposal 1:
> =========
> grp-service-tree
> / | \
> root A B
>
> One issue with this structure is that RT tasks are not system wide. So an
> RT tasks inside root group has RT priority only with-in root group. So a
> BE task inside A will get it fair share despite the fact that root has got
> RT tasks.
>
>
> Proposal 2:
> ==========
> One proposal to solve this issue is that make RT and IDLE tasks system
> wide and provide weight based service differentiation only for BE class
> tasks. So RT or IDLE tasks running in any of the groups will automatically
> move to one global RT group maintained by CFQ internally. Same is true for
> IDLE tasks. But BE class tasks will honor the cgroup limitations and will
> get differentiated service according to weight.
>
> Internal structure will look as follows.
>
> grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
> | / \ |
> all_RT_task_group A B all_idle_tasks_grp
>
>
> Here A and B are two cgroups and some BE tasks might be running inside
> those groups. systemwide RT tasks will move under all_RT_task_group and
> all idle tasks will move under all_idle_tasks_grp.
>
> So one will notice service differentiation only for BE tasks.

Hi Vivek,

I still think that we need to give choices for users. When an user want to give
RT Tasks service differentiation, we shouldn't treat all RT tasks as systemwide.
But if a user want better latency for RT tasks, we treat them systemwide. CFQ can
rely on sysfs tunable to achieve this.

Thanks
Gui

>
>
> Proposal 3:
> ===========
>
> One can argue that we need group service differentiation for RT class
> tasks also and don't move tasks automatically across groups. That means
> we need to support "group class" type also. Probably we can support
> three classes of cgroups RT, BE and IDLE and CFQ will use that data to
> put cgroups in respective tree.
>
> Things should look as follows.
>
> grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
> / \ / \ / \
> C D A B E F
>
>
> Here A and B are BE type groups created by user.
> C and D are RT type cgroups created by user.
> E and F are IDLE type cgroups created by user.
>
> Now in this scheme of things, by default root will be of type BE. Any task
> RT task under "root" group will not be system wide RT task. It will be RT
> only with-in root group. To make it system wide idle, admin shall have to
> create a new cgroup, say C, of type RT and move task in that cgroup.
> Because RT group C is system wide, now that task becomes system wide RT.
>
> So this scheme might throw some surprise to existing users. They might
> create a new group and not realize that their RT tasks are no more system
> wide RT tasks and they need to specifically create one RT cgroup and move
> all RT tasks in that cgroup.
>
> Practically I am not sure how many people are looking for group service
> differentiation for RT and IDLE class tasks also.
>
> Proposal 4:
> ==========
> Treat task and group at same level. Currently groups are at top level and
> at second level are tasks. View the whole hierarchy as follows.
>
>
> service-tree
> / | \ \
> T1 T2 G1 G2
>
> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> created under root.
>
> In this kind of scheme, any RT task in root group will still be system
> wide RT even if we create groups G1 and G2.
>
> So what are the issues?
>
> - I talked to few folks and everybody found this scheme not so intutive.
> Their argument was that once I create a cgroup, say A, under root, then
> bandwidth should be divided between "root" and "A" proportionate to
> the weight.
>
> It is not very intutive that group is competing with all the tasks
> running in root group. And disk share of newly created group will change
> if more tasks fork in root group. So it is highly dynamic and not
> static hence un-intutive.
>
> To emulate the behavior of previous proposals, root shall have to create
> a new group and move all root tasks there. But admin shall have to still
> keep RT tasks in root group so that they still remain system-wide.
>
> service-tree
> / | \ \
> T1 root G1 G2
> |
> T2
>
> Now admin has specifically created a group "root" along side G1 and G2
> and moved T2 under root. T1 is still left in top level group as it might
> be an RT task and we want it to remain RT task systemwide.
>
> So to some people this scheme is un-intutive and requires more work in
> user space to achive desired behavior. I am kind of 50:50 between two
> kind of arrangements.
>
>
> I am looking for some feedback on what makes most sense.
>
> For the time being, I am little inclined towards proposal 2 and I have
> implemented a proof of concept version on top of for-2.6.33 branch in block
> tree. These patches are compile and boot tested only and I have yet to do
> testing.
>
> Thanks
> Vivek
>
>
>

2009-12-17 11:41:37

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

Hi,
On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal <[email protected]> wrote:
> Hi All,
>
> With some basic group scheduling support in CFQ, there are few questions
> regarding how group structure should look like in CFQ.
>
> Currently, grouping looks as follows. A, and B are two cgroups created by
> user.
>
> [snip]
>
> Proposal 4:
> ==========
> Treat task and group at same level. Currently groups are at top level and
> at second level are tasks. View the whole hierarchy as follows.
>
>
>                        service-tree
>                        /   |  \  \
>                       T1   T2  G1 G2
>
> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> created under root.
>
> In this kind of scheme, any RT task in root group will still be system
> wide RT even if we create groups G1 and G2.
>
> So what are the issues?
>
> - I talked to few folks and everybody found this scheme not so intutive.
>  Their argument was that once I create a cgroup, say A,  under root, then
>  bandwidth should be divided between "root" and "A" proportionate to
>  the weight.
>
>  It is not very intutive that group is competing with all the tasks
>  running in root group. And disk share of newly created group will change
>  if more tasks fork in root group. So it is highly dynamic and not
>  static hence un-intutive.
>
>  To emulate the behavior of previous proposals, root shall have to create
>  a new group and move all root tasks there. But admin shall have to still
>  keep RT tasks in root group so that they still remain system-wide.
>
>                        service-tree
>                        /   |    \  \
>                       T1  root  G1 G2
>                            |
>                            T2
>
>  Now admin has specifically created a group "root" along side G1 and G2
>  and moved T2 under root. T1 is still left in top level group as it might
>  be an RT task and we want it to remain RT task systemwide.
>
>  So to some people this scheme is un-intutive and requires more work in
>  user space to achive desired behavior. I am kind of 50:50 between two
>  kind of arrangements.
>
This is the one I prefer: it is the most natural one if you see that
groups are scheduling entities like any other task.
I think it becomes intuitive with an analogy with a qemu (e.g. kvm)
virtual machine model. If you think a group like a virtual machine, it
is clear that for the normal system, the whole virtual machine is a
single scheduling entity, and that it has to compete with other
virtual machines (as other single entities) and every process in the
real system (those are inherently more important, since without the
real system, the VMs cannot simply exist).
Having a designated root group, instead, resembles the xen VM model,
where you have a separated domain for each VM and for the real system.

I think the implementation of this approach can make the code simpler
and modular (CFQ could be abstracted to deal with scheduling entities,
and each scheduling entity could be defined in a separate file).
Within each group, you will now have the choice of how to schedule its
queues. This means that you could possibly have different I/O
schedulers within each group, and even have sub-groups within groups.
>
> I am looking for some feedback on what makes most sense.
I think that regardless of our preference, we should coordinate with
how the CPU scheduler works, since I think the users will be more
surprised to see cgroups behaving different w.r.t. CPU and disk, than
if the RT task behaviour changes when cgroups are introduced.

Thanks,
Corrado

>
> For the time being, I am little inclined towards proposal 2 and I have
> implemented a proof of concept version on top of for-2.6.33 branch in block
> tree.  These patches are compile and boot tested only and I have yet to do
> testing.
>
> Thanks
> Vivek
>

2009-12-17 11:49:21

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 3/4] cfq-iosched: Remove prio_change logic for workload selection

On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal <[email protected]> wrote:
> o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
>  sync-noidle and async. Which workload to run depends primarily on rb_key
>  offset across three service trees. Which is a combination of mulitiple things
>  including what time queue got queued on the service tree.
>
>  There is one exception though. That is if we switched the prio class, say
>  we served some RT tasks and again started serving BE class, then with-in
>  BE class we always started with sync-noidle workload irrespective of rb_key
>  offset in service trees.
>
>  This can provide better latencies for sync-noidle workload in the presence
>  of RT tasks.
>
> o This patch gets rid of that exception and which workload to run with-in
>  class always depends on lowest rb_key across service trees. The reason
>  being that now we have multiple BE class groups and if we always switch
>  to sync-noidle workload with-in group, we can potentially starve a sync-idle
>  workload with-in group. Same is true for async workload which will be in
>  root group. Also the workload-switching with-in group will become very
>  unpredictable as it now depends whether some RT workload was running in
>  the system or not.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
>  block/cfq-iosched.c |   48 ++++++++++++------------------------------------
>  1 files changed, 12 insertions(+), 36 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index d9bfa09..8df4fe5 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>
>  static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>                                            enum wl_prio_t prio,
> -                                           enum wl_type_t type,
> -                                           struct cfq_data *cfqd)
> +                                           enum wl_type_t type)
>  {
>        if (!cfqg)
>                return NULL;
> @@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  #endif
>
>        service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
> -                                               cfqq_type(cfqq), cfqd);
> +                                               cfqq_type(cfqq));
>        if (cfq_class_idle(cfqq)) {
>                rb_key = CFQ_IDLE_DELAY;
>                parent = rb_last(&service_tree->rb);
> @@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>  {
>        struct cfq_rb_root *service_tree =
>                service_tree_for(cfqd->serving_group, cfqd->serving_prio,
> -                                       cfqd->serving_type, cfqd);
> +                                       cfqd->serving_type);
>
>        if (!cfqd->rq_queued)
>                return NULL;
> @@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>  }
>
>  static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> -                               struct cfq_group *cfqg, enum wl_prio_t prio,
> -                               bool prio_changed)
> +                               struct cfq_group *cfqg, enum wl_prio_t prio)
>  {
>        struct cfq_queue *queue;
>        int i;
> @@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>        unsigned long lowest_key = 0;
>        enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>
> -       if (prio_changed) {
> -               /*
> -                * When priorities switched, we prefer starting
> -                * from SYNC_NOIDLE (first choice), or just SYNC
> -                * over ASYNC
> -                */
> -               if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> -                       return cur_best;
> -               cur_best = SYNC_WORKLOAD;
> -               if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> -                       return cur_best;
> -
> -               return ASYNC_WORKLOAD;
> -       }
> -
> -       for (i = 0; i < 3; ++i) {
> -               /* otherwise, select the one with lowest rb_key */
> -               queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
> +       for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> +               /* select the one with lowest rb_key */
> +               queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
>                if (queue &&
>                    (!key_valid || time_before(queue->rb_key, lowest_key))) {
>                        lowest_key = queue->rb_key;
> @@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>
>  static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  {
> -       enum wl_prio_t previous_prio = cfqd->serving_prio;
> -       bool prio_changed;
>        unsigned slice;
>        unsigned count;
>        struct cfq_rb_root *st;
> @@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>         * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>         * expiration time
>         */
> -       prio_changed = (cfqd->serving_prio != previous_prio);
> -       st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> -                               cfqd);
> +       st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>        count = st->count;
>
>        /*
> -        * If priority didn't change, check workload expiration,
> -        * and that we still have other queues ready
> +        * check workload expiration, and that we still have other queues ready
>         */
> -       if (!prio_changed && count &&
> -           !time_after(jiffies, cfqd->workload_expires))
> +       if (count && !time_after(jiffies, cfqd->workload_expires))
>                return;
>
>        /* otherwise select new workload type */
>        cfqd->serving_type =
> -               cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
> -       st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> -                               cfqd);
> +               cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> +       st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>        count = st->count;
>
>        /*
> --
> 1.6.2.5
>
>

Acked-by: Corrado Zoccolo <[email protected]>

2009-12-18 00:01:17

by Munehiro Ikeda

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

Hello,

Corrado Zoccolo wrote, on 12/17/2009 06:41 AM:
> Hi,
> On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal<[email protected]> wrote:
>> Hi All,
>>
>> With some basic group scheduling support in CFQ, there are few questions
>> regarding how group structure should look like in CFQ.
>>
>> Currently, grouping looks as follows. A, and B are two cgroups created by
>> user.
>>
>> [snip]
>>
>> Proposal 4:
>> ==========
>> Treat task and group at same level. Currently groups are at top level and
>> at second level are tasks. View the whole hierarchy as follows.
>>
>>
>> service-tree
>> / | \ \
>> T1 T2 G1 G2
>>
>> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
>> created under root.
>>
>> In this kind of scheme, any RT task in root group will still be system
>> wide RT even if we create groups G1 and G2.
>>
>> So what are the issues?
>>
>> - I talked to few folks and everybody found this scheme not so intutive.
>> Their argument was that once I create a cgroup, say A, under root, then
>> bandwidth should be divided between "root" and "A" proportionate to
>> the weight.
>>
>> It is not very intutive that group is competing with all the tasks
>> running in root group. And disk share of newly created group will change
>> if more tasks fork in root group. So it is highly dynamic and not
>> static hence un-intutive.

I agree it might be dynamic but I don't think it's un-intuitive.
I think it's reasonable that disk share of a group is
influenced by the number of tasks running in root group,
because the root group is shared by the tasks and groups from
the viewpoint of cgroup I/F, and they really share disk bandwidth.


>> To emulate the behavior of previous proposals, root shall have to create
>> a new group and move all root tasks there. But admin shall have to still
>> keep RT tasks in root group so that they still remain system-wide.
>>
>> service-tree
>> / | \ \
>> T1 root G1 G2
>> |
>> T2
>>
>> Now admin has specifically created a group "root" along side G1 and G2
>> and moved T2 under root. T1 is still left in top level group as it might
>> be an RT task and we want it to remain RT task systemwide.
>>
>> So to some people this scheme is un-intutive and requires more work in
>> user space to achive desired behavior. I am kind of 50:50 between two
>> kind of arrangements.
>>
> This is the one I prefer: it is the most natural one if you see that
> groups are scheduling entities like any other task.
> I think it becomes intuitive with an analogy with a qemu (e.g. kvm)
> virtual machine model. If you think a group like a virtual machine, it
> is clear that for the normal system, the whole virtual machine is a
> single scheduling entity, and that it has to compete with other
> virtual machines (as other single entities) and every process in the
> real system (those are inherently more important, since without the
> real system, the VMs cannot simply exist).
> Having a designated root group, instead, resembles the xen VM model,
> where you have a separated domain for each VM and for the real system.
>
> I think the implementation of this approach can make the code simpler
> and modular (CFQ could be abstracted to deal with scheduling entities,
> and each scheduling entity could be defined in a separate file).
> Within each group, you will now have the choice of how to schedule its
> queues. This means that you could possibly have different I/O
> schedulers within each group, and even have sub-groups within groups.

Corrado exactly says my preference.

I understand current implementation, like proposal 1, was
employed to make code simple and I believe it succeeded.
However, rather I feel it's un-intuitive because it's
inconsistent with cgroup I/F. Behavior which is inconsistent
with the I/F can lead to misconfiguration of sys-admins.
This might be problematic, IMHO.



Thanks,
Muuhh

--
IKEDA, Munehiro
NEC Corporation of America
[email protected]

2009-12-18 15:19:51

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 3/4] cfq-iosched: Remove prio_change logic for workload selection

On Thu, Dec 17, 2009 at 05:20:30PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
> > sync-noidle and async. Which workload to run depends primarily on rb_key
> > offset across three service trees. Which is a combination of mulitiple things
> > including what time queue got queued on the service tree.
> >
> > There is one exception though. That is if we switched the prio class, say
> > we served some RT tasks and again started serving BE class, then with-in
> > BE class we always started with sync-noidle workload irrespective of rb_key
> > offset in service trees.
> >
> > This can provide better latencies for sync-noidle workload in the presence
> > of RT tasks.
> >
> > o This patch gets rid of that exception and which workload to run with-in
> > class always depends on lowest rb_key across service trees. The reason
> > being that now we have multiple BE class groups and if we always switch
> > to sync-noidle workload with-in group, we can potentially starve a sync-idle
> > workload with-in group. Same is true for async workload which will be in
> > root group. Also the workload-switching with-in group will become very
> > unpredictable as it now depends whether some RT workload was running in
> > the system or not.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
>
> Actually, in current CFQ, there still has bug in prio_changed logic.
> Because prio_changed = (cfqd->serving_prio != previous_prio), and previous_prio
> might come from another group, in this case, prio_changed becomes invalid, but
> cfq_choose_wl() still relies on prio_changed to calculate serving_type, this
> doesn't make sence IMHO.

Ok, so you are referring to the case when we don't save workload state
because it has expired at the time of group expiry.

How about saving the serving prio all the time at the time of group expiry,
irrespective of the fact whether workload has expired or not at the time of
group expiry and while we also need to always restore the serving prio back
when group is rescheduled in.

Thanks
Vivek

> But with this change, this bug is gone.
>
> Reviewed-by: Gui Jianfeng <[email protected]>
>
> > ---
> > block/cfq-iosched.c | 48 ++++++++++++------------------------------------
> > 1 files changed, 12 insertions(+), 36 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index d9bfa09..8df4fe5 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> >
> > static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> > enum wl_prio_t prio,
> > - enum wl_type_t type,
> > - struct cfq_data *cfqd)
> > + enum wl_type_t type)
> > {
> > if (!cfqg)
> > return NULL;
> > @@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> > #endif
> >
> > service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
> > - cfqq_type(cfqq), cfqd);
> > + cfqq_type(cfqq));
> > if (cfq_class_idle(cfqq)) {
> > rb_key = CFQ_IDLE_DELAY;
> > parent = rb_last(&service_tree->rb);
> > @@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> > {
> > struct cfq_rb_root *service_tree =
> > service_tree_for(cfqd->serving_group, cfqd->serving_prio,
> > - cfqd->serving_type, cfqd);
> > + cfqd->serving_type);
> >
> > if (!cfqd->rq_queued)
> > return NULL;
> > @@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> > }
> >
> > static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> > - struct cfq_group *cfqg, enum wl_prio_t prio,
> > - bool prio_changed)
> > + struct cfq_group *cfqg, enum wl_prio_t prio)
> > {
> > struct cfq_queue *queue;
> > int i;
> > @@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> > unsigned long lowest_key = 0;
> > enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
> >
> > - if (prio_changed) {
> > - /*
> > - * When priorities switched, we prefer starting
> > - * from SYNC_NOIDLE (first choice), or just SYNC
> > - * over ASYNC
> > - */
> > - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> > - return cur_best;
> > - cur_best = SYNC_WORKLOAD;
> > - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
> > - return cur_best;
> > -
> > - return ASYNC_WORKLOAD;
> > - }
> > -
> > - for (i = 0; i < 3; ++i) {
> > - /* otherwise, select the one with lowest rb_key */
> > - queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
> > + for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> > + /* select the one with lowest rb_key */
> > + queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
> > if (queue &&
> > (!key_valid || time_before(queue->rb_key, lowest_key))) {
> > lowest_key = queue->rb_key;
> > @@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >
> > static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> > {
> > - enum wl_prio_t previous_prio = cfqd->serving_prio;
> > - bool prio_changed;
> > unsigned slice;
> > unsigned count;
> > struct cfq_rb_root *st;
> > @@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> > * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> > * expiration time
> > */
> > - prio_changed = (cfqd->serving_prio != previous_prio);
> > - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> > - cfqd);
> > + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> > count = st->count;
> >
> > /*
> > - * If priority didn't change, check workload expiration,
> > - * and that we still have other queues ready
> > + * check workload expiration, and that we still have other queues ready
> > */
> > - if (!prio_changed && count &&
> > - !time_after(jiffies, cfqd->workload_expires))
> > + if (count && !time_after(jiffies, cfqd->workload_expires))
> > return;
> >
> > /* otherwise select new workload type */
> > cfqd->serving_type =
> > - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
> > - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
> > - cfqd);
> > + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> > + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> > count = st->count;
> >
> > /*

2009-12-18 15:24:05

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Thu, Dec 17, 2009 at 06:17:43PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> >
> > With some basic group scheduling support in CFQ, there are few questions
> > regarding how group structure should look like in CFQ.
> >
> > Currently, grouping looks as follows. A, and B are two cgroups created by
> > user.
> >
> > Proposal 1:
> > =========
> > grp-service-tree
> > / | \
> > root A B
> >
> > One issue with this structure is that RT tasks are not system wide. So an
> > RT tasks inside root group has RT priority only with-in root group. So a
> > BE task inside A will get it fair share despite the fact that root has got
> > RT tasks.
> >
> >
> > Proposal 2:
> > ==========
> > One proposal to solve this issue is that make RT and IDLE tasks system
> > wide and provide weight based service differentiation only for BE class
> > tasks. So RT or IDLE tasks running in any of the groups will automatically
> > move to one global RT group maintained by CFQ internally. Same is true for
> > IDLE tasks. But BE class tasks will honor the cgroup limitations and will
> > get differentiated service according to weight.
> >
> > Internal structure will look as follows.
> >
> > grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
> > | / \ |
> > all_RT_task_group A B all_idle_tasks_grp
> >
> >
> > Here A and B are two cgroups and some BE tasks might be running inside
> > those groups. systemwide RT tasks will move under all_RT_task_group and
> > all idle tasks will move under all_idle_tasks_grp.
> >
> > So one will notice service differentiation only for BE tasks.
>
> Hi Vivek,
>
> I still think that we need to give choices for users. When an user want to give
> RT Tasks service differentiation, we shouldn't treat all RT tasks as systemwide.
> But if a user want better latency for RT tasks, we treat them systemwide. CFQ can
> rely on sysfs tunable to achieve this.
>

By user you mean "admin" because only admin can launch RT tasks. Why would
somebody want to limit RT tasks to only that group. That means you want
RT prio with-in group only and not across groups. So BE tasks in other BE
groups can very well be getting disk share. Though giving this choice does
not hurt, but I raised the same point with Nauman, that what's the utility
of this configuartion. Admin can very well keep that task BE instead of
RT.

Thanks
Vivek


> Thanks
> Gui
>
> >
> >
> > Proposal 3:
> > ===========
> >
> > One can argue that we need group service differentiation for RT class
> > tasks also and don't move tasks automatically across groups. That means
> > we need to support "group class" type also. Probably we can support
> > three classes of cgroups RT, BE and IDLE and CFQ will use that data to
> > put cgroups in respective tree.
> >
> > Things should look as follows.
> >
> > grp-RT-service-tree grp-BE-service-tree grp-IDLE-service-tree
> > / \ / \ / \
> > C D A B E F
> >
> >
> > Here A and B are BE type groups created by user.
> > C and D are RT type cgroups created by user.
> > E and F are IDLE type cgroups created by user.
> >
> > Now in this scheme of things, by default root will be of type BE. Any task
> > RT task under "root" group will not be system wide RT task. It will be RT
> > only with-in root group. To make it system wide idle, admin shall have to
> > create a new cgroup, say C, of type RT and move task in that cgroup.
> > Because RT group C is system wide, now that task becomes system wide RT.
> >
> > So this scheme might throw some surprise to existing users. They might
> > create a new group and not realize that their RT tasks are no more system
> > wide RT tasks and they need to specifically create one RT cgroup and move
> > all RT tasks in that cgroup.
> >
> > Practically I am not sure how many people are looking for group service
> > differentiation for RT and IDLE class tasks also.
> >
> > Proposal 4:
> > ==========
> > Treat task and group at same level. Currently groups are at top level and
> > at second level are tasks. View the whole hierarchy as follows.
> >
> >
> > service-tree
> > / | \ \
> > T1 T2 G1 G2
> >
> > Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> > created under root.
> >
> > In this kind of scheme, any RT task in root group will still be system
> > wide RT even if we create groups G1 and G2.
> >
> > So what are the issues?
> >
> > - I talked to few folks and everybody found this scheme not so intutive.
> > Their argument was that once I create a cgroup, say A, under root, then
> > bandwidth should be divided between "root" and "A" proportionate to
> > the weight.
> >
> > It is not very intutive that group is competing with all the tasks
> > running in root group. And disk share of newly created group will change
> > if more tasks fork in root group. So it is highly dynamic and not
> > static hence un-intutive.
> >
> > To emulate the behavior of previous proposals, root shall have to create
> > a new group and move all root tasks there. But admin shall have to still
> > keep RT tasks in root group so that they still remain system-wide.
> >
> > service-tree
> > / | \ \
> > T1 root G1 G2
> > |
> > T2
> >
> > Now admin has specifically created a group "root" along side G1 and G2
> > and moved T2 under root. T1 is still left in top level group as it might
> > be an RT task and we want it to remain RT task systemwide.
> >
> > So to some people this scheme is un-intutive and requires more work in
> > user space to achive desired behavior. I am kind of 50:50 between two
> > kind of arrangements.
> >
> >
> > I am looking for some feedback on what makes most sense.
> >
> > For the time being, I am little inclined towards proposal 2 and I have
> > implemented a proof of concept version on top of for-2.6.33 branch in block
> > tree. These patches are compile and boot tested only and I have yet to do
> > testing.
> >
> > Thanks
> > Vivek
> >
> >
> >

2009-12-18 15:51:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Thu, Dec 17, 2009 at 12:41:32PM +0100, Corrado Zoccolo wrote:
> Hi,
> On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal <[email protected]> wrote:
> > Hi All,
> >
> > With some basic group scheduling support in CFQ, there are few questions
> > regarding how group structure should look like in CFQ.
> >
> > Currently, grouping looks as follows. A, and B are two cgroups created by
> > user.
> >
> > [snip]
> >
> > Proposal 4:
> > ==========
> > Treat task and group at same level. Currently groups are at top level and
> > at second level are tasks. View the whole hierarchy as follows.
> >
> >
> > ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> > ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ?\ ?\
> > ? ? ? ? ? ? ? ? ? ? ? T1 ? T2 ?G1 G2
> >
> > Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> > created under root.
> >
> > In this kind of scheme, any RT task in root group will still be system
> > wide RT even if we create groups G1 and G2.
> >
> > So what are the issues?
> >
> > - I talked to few folks and everybody found this scheme not so intutive.
> > ?Their argument was that once I create a cgroup, say A, ?under root, then
> > ?bandwidth should be divided between "root" and "A" proportionate to
> > ?the weight.
> >
> > ?It is not very intutive that group is competing with all the tasks
> > ?running in root group. And disk share of newly created group will change
> > ?if more tasks fork in root group. So it is highly dynamic and not
> > ?static hence un-intutive.
> >
> > ?To emulate the behavior of previous proposals, root shall have to create
> > ?a new group and move all root tasks there. But admin shall have to still
> > ?keep RT tasks in root group so that they still remain system-wide.
> >
> > ? ? ? ? ? ? ? ? ? ? ? ?service-tree
> > ? ? ? ? ? ? ? ? ? ? ? ?/ ? | ? ?\ ?\
> > ? ? ? ? ? ? ? ? ? ? ? T1 ?root ?G1 G2
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ?|
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ?T2
> >
> > ?Now admin has specifically created a group "root" along side G1 and G2
> > ?and moved T2 under root. T1 is still left in top level group as it might
> > ?be an RT task and we want it to remain RT task systemwide.
> >
> > ?So to some people this scheme is un-intutive and requires more work in
> > ?user space to achive desired behavior. I am kind of 50:50 between two
> > ?kind of arrangements.
> >
> This is the one I prefer: it is the most natural one if you see that
> groups are scheduling entities like any other task.

This is the approach I had implemented in my earlier postings. I had
the notion of io_entity which was embedded in both cfq_queue and
cfq_groups. So cfq core scheduler had to worry about scheduling entities
and these entities could be either queues or groups. Something picked from
BFQ and CFS implementation.

> I think it becomes intuitive with an analogy with a qemu (e.g. kvm)
> virtual machine model. If you think a group like a virtual machine, it
> is clear that for the normal system, the whole virtual machine is a
> single scheduling entity, and that it has to compete with other
> virtual machines (as other single entities) and every process in the
> real system (those are inherently more important, since without the
> real system, the VMs cannot simply exist).
> Having a designated root group, instead, resembles the xen VM model,
> where you have a separated domain for each VM and for the real system.
>
> I think the implementation of this approach can make the code simpler
> and modular (CFQ could be abstracted to deal with scheduling entities,
> and each scheduling entity could be defined in a separate file).
> Within each group, you will now have the choice of how to schedule its
> queues. This means that you could possibly have different I/O
> schedulers within each group, and even have sub-groups within groups.

Abstracting in terms of scheduling entities and allowing tasks and groups
to be at same level definitely helps in terms of extending implementation
to hierarchical mode (sub groups with-in groups). My initial posting was
also hierarchical. I cut down later on functionality to reduce patch size.

At the same time it also imposes the restriction that we use the same
schedling algorithm for queues as well as groups. In current
implementation, I am using a vtime based algorithm for groups and we
continue to use original cfq logic (cfq_slice_offset()) for cfqq
scheduling.

Now I shall have to merge these two. The advantage of group scheduling
algorithm is that it can provide you accurate disk time distribution
accoriding to weight (as long as groups are continuously backlogged and
not deleted from service tree). Because it keeps track of vtime, it does
not enforce that group's entitled share needs to be consumed in one go (as
opposed to cfq queue scheduling algorithm). One can expire a queue and
select a different group for dispatch and still original group will not
loose the share.

Migrating cfq's queue scheduling algorithm to use group algorithm should
not be a problem, except the fact that I am not very sure about honoring
the task prio on NCQ SSDs. Currently in group scheduling algorithm we a
group is not continuously backlogged, it is deleted from service tree and
when it comes back, it is put at the end of queue hence it looses share.
In case of NCQ SSDs, we will not idle, and we will loose ioprio
differentiation between various cfqq on NCQ SSDs. In fact I am not even
sure how well cfq approximation is working on NCQ SSD when it comes to
service differentation between various prio queues.

I had tried putting deleted queues not at the end but with lower vtime
based on weight. But then it introduces inaccuracy w.r.t continuously
baklogged groups. So entities which get deleted gain share and
continuously backlogged one gain share.

> >
> > I am looking for some feedback on what makes most sense.
> I think that regardless of our preference, we should coordinate with
> how the CPU scheduler works, since I think the users will be more
> surprised to see cgroups behaving different w.r.t. CPU and disk, than
> if the RT task behaviour changes when cgroups are introduced.

True. AFAIK, cpu scheduler treats tasks and groups at same level. I think
initially they had also started with treating root group at the same level
as other groups in root group but later switched to task and groups at
same level.

CCing Peter Zijlstra. He might have thoughts why treating task and groups at
same level was considered a better approach as compared to treating root
group at same level with other groups in root group.

Thanks
Vivek

2009-12-18 16:03:56

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Thu, Dec 17, 2009 at 06:58:21PM -0500, Munehiro Ikeda wrote:
> Hello,
>
> Corrado Zoccolo wrote, on 12/17/2009 06:41 AM:
>> Hi,
>> On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal<[email protected]> wrote:
>>> Hi All,
>>>
>>> With some basic group scheduling support in CFQ, there are few questions
>>> regarding how group structure should look like in CFQ.
>>>
>>> Currently, grouping looks as follows. A, and B are two cgroups created by
>>> user.
>>>
>>> [snip]
>>>
>>> Proposal 4:
>>> ==========
>>> Treat task and group at same level. Currently groups are at top level and
>>> at second level are tasks. View the whole hierarchy as follows.
>>>
>>>
>>> service-tree
>>> / | \ \
>>> T1 T2 G1 G2
>>>
>>> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
>>> created under root.
>>>
>>> In this kind of scheme, any RT task in root group will still be system
>>> wide RT even if we create groups G1 and G2.
>>>
>>> So what are the issues?
>>>
>>> - I talked to few folks and everybody found this scheme not so intutive.
>>> Their argument was that once I create a cgroup, say A, under root, then
>>> bandwidth should be divided between "root" and "A" proportionate to
>>> the weight.
>>>
>>> It is not very intutive that group is competing with all the tasks
>>> running in root group. And disk share of newly created group will change
>>> if more tasks fork in root group. So it is highly dynamic and not
>>> static hence un-intutive.
>
> I agree it might be dynamic but I don't think it's un-intuitive.
> I think it's reasonable that disk share of a group is
> influenced by the number of tasks running in root group,
> because the root group is shared by the tasks and groups from
> the viewpoint of cgroup I/F, and they really share disk bandwidth.
>

That's true that it becomes more natural to view it that way. That's a
different thing that it might become little more work in user space to
then move root tasks into a sub group otherwise, the effective share of
a newly created group might be really less. All the tasks in a group are
effectively a single task when it comes to top level.

>
>>> To emulate the behavior of previous proposals, root shall have to create
>>> a new group and move all root tasks there. But admin shall have to still
>>> keep RT tasks in root group so that they still remain system-wide.
>>>
>>> service-tree
>>> / | \ \
>>> T1 root G1 G2
>>> |
>>> T2
>>>
>>> Now admin has specifically created a group "root" along side G1 and G2
>>> and moved T2 under root. T1 is still left in top level group as it might
>>> be an RT task and we want it to remain RT task systemwide.
>>>
>>> So to some people this scheme is un-intutive and requires more work in
>>> user space to achive desired behavior. I am kind of 50:50 between two
>>> kind of arrangements.
>>>
>> This is the one I prefer: it is the most natural one if you see that
>> groups are scheduling entities like any other task.
>> I think it becomes intuitive with an analogy with a qemu (e.g. kvm)
>> virtual machine model. If you think a group like a virtual machine, it
>> is clear that for the normal system, the whole virtual machine is a
>> single scheduling entity, and that it has to compete with other
>> virtual machines (as other single entities) and every process in the
>> real system (those are inherently more important, since without the
>> real system, the VMs cannot simply exist).
>> Having a designated root group, instead, resembles the xen VM model,
>> where you have a separated domain for each VM and for the real system.
>>
>> I think the implementation of this approach can make the code simpler
>> and modular (CFQ could be abstracted to deal with scheduling entities,
>> and each scheduling entity could be defined in a separate file).
>> Within each group, you will now have the choice of how to schedule its
>> queues. This means that you could possibly have different I/O
>> schedulers within each group, and even have sub-groups within groups.
>
> Corrado exactly says my preference.
>
> I understand current implementation, like proposal 1, was
> employed to make code simple and I believe it succeeded.
> However, rather I feel it's un-intuitive because it's
> inconsistent with cgroup I/F. Behavior which is inconsistent
> with the I/F can lead to misconfiguration of sys-admins.
> This might be problematic, IMHO.

Thanks Muuhh. It helps to get perspective from various folks before I
start implementing it.

Thanks
Vivek

2009-12-20 04:24:42

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 3/4] cfq-iosched: Remove prio_change logic for workload selection

Vivek Goyal wrote:
> On Thu, Dec 17, 2009 at 05:20:30PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
>>> sync-noidle and async. Which workload to run depends primarily on rb_key
>>> offset across three service trees. Which is a combination of mulitiple things
>>> including what time queue got queued on the service tree.
>>>
>>> There is one exception though. That is if we switched the prio class, say
>>> we served some RT tasks and again started serving BE class, then with-in
>>> BE class we always started with sync-noidle workload irrespective of rb_key
>>> offset in service trees.
>>>
>>> This can provide better latencies for sync-noidle workload in the presence
>>> of RT tasks.
>>>
>>> o This patch gets rid of that exception and which workload to run with-in
>>> class always depends on lowest rb_key across service trees. The reason
>>> being that now we have multiple BE class groups and if we always switch
>>> to sync-noidle workload with-in group, we can potentially starve a sync-idle
>>> workload with-in group. Same is true for async workload which will be in
>>> root group. Also the workload-switching with-in group will become very
>>> unpredictable as it now depends whether some RT workload was running in
>>> the system or not.
>>>
>>> Signed-off-by: Vivek Goyal <[email protected]>
>> Actually, in current CFQ, there still has bug in prio_changed logic.
>> Because prio_changed = (cfqd->serving_prio != previous_prio), and previous_prio
>> might come from another group, in this case, prio_changed becomes invalid, but
>> cfq_choose_wl() still relies on prio_changed to calculate serving_type, this
>> doesn't make sence IMHO.
>
> Ok, so you are referring to the case when we don't save workload state
> because it has expired at the time of group expiry.
>
> How about saving the serving prio all the time at the time of group expiry,
> irrespective of the fact whether workload has expired or not at the time of
> group expiry and while we also need to always restore the serving prio back
> when group is rescheduled in.

Hi Vivek,

Yes, i think so also.

Thanks,
Gui

>
> Thanks
> Vivek
>
>> But with this change, this bug is gone.
>>
>> Reviewed-by: Gui Jianfeng <[email protected]>
>>
>>> ---
>>> block/cfq-iosched.c | 48 ++++++++++++------------------------------------
>>> 1 files changed, 12 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> index d9bfa09..8df4fe5 100644
>>> --- a/block/cfq-iosched.c
>>> +++ b/block/cfq-iosched.c
>>> @@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>>>
>>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>>> enum wl_prio_t prio,
>>> - enum wl_type_t type,
>>> - struct cfq_data *cfqd)
>>> + enum wl_type_t type)
>>> {
>>> if (!cfqg)
>>> return NULL;
>>> @@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>> #endif
>>>
>>> service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
>>> - cfqq_type(cfqq), cfqd);
>>> + cfqq_type(cfqq));
>>> if (cfq_class_idle(cfqq)) {
>>> rb_key = CFQ_IDLE_DELAY;
>>> parent = rb_last(&service_tree->rb);
>>> @@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>>> {
>>> struct cfq_rb_root *service_tree =
>>> service_tree_for(cfqd->serving_group, cfqd->serving_prio,
>>> - cfqd->serving_type, cfqd);
>>> + cfqd->serving_type);
>>>
>>> if (!cfqd->rq_queued)
>>> return NULL;
>>> @@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>>> }
>>>
>>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>> - struct cfq_group *cfqg, enum wl_prio_t prio,
>>> - bool prio_changed)
>>> + struct cfq_group *cfqg, enum wl_prio_t prio)
>>> {
>>> struct cfq_queue *queue;
>>> int i;
>>> @@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>> unsigned long lowest_key = 0;
>>> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>>>
>>> - if (prio_changed) {
>>> - /*
>>> - * When priorities switched, we prefer starting
>>> - * from SYNC_NOIDLE (first choice), or just SYNC
>>> - * over ASYNC
>>> - */
>>> - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
>>> - return cur_best;
>>> - cur_best = SYNC_WORKLOAD;
>>> - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
>>> - return cur_best;
>>> -
>>> - return ASYNC_WORKLOAD;
>>> - }
>>> -
>>> - for (i = 0; i < 3; ++i) {
>>> - /* otherwise, select the one with lowest rb_key */
>>> - queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
>>> + for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>>> + /* select the one with lowest rb_key */
>>> + queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
>>> if (queue &&
>>> (!key_valid || time_before(queue->rb_key, lowest_key))) {
>>> lowest_key = queue->rb_key;
>>> @@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>>
>>> static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>> {
>>> - enum wl_prio_t previous_prio = cfqd->serving_prio;
>>> - bool prio_changed;
>>> unsigned slice;
>>> unsigned count;
>>> struct cfq_rb_root *st;
>>> @@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>> * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>>> * expiration time
>>> */
>>> - prio_changed = (cfqd->serving_prio != previous_prio);
>>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>>> - cfqd);
>>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>>> count = st->count;
>>>
>>> /*
>>> - * If priority didn't change, check workload expiration,
>>> - * and that we still have other queues ready
>>> + * check workload expiration, and that we still have other queues ready
>>> */
>>> - if (!prio_changed && count &&
>>> - !time_after(jiffies, cfqd->workload_expires))
>>> + if (count && !time_after(jiffies, cfqd->workload_expires))
>>> return;
>>>
>>> /* otherwise select new workload type */
>>> cfqd->serving_type =
>>> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
>>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
>>> - cfqd);
>>> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>>> count = st->count;
>>>
>>> /*
>
>
>

--
Regards
Gui Jianfeng

2009-12-21 12:16:22

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Thu, Dec 17 2009, Munehiro Ikeda wrote:
> Hello,
>
> Corrado Zoccolo wrote, on 12/17/2009 06:41 AM:
>> Hi,
>> On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal<[email protected]> wrote:
>>> Hi All,
>>>
>>> With some basic group scheduling support in CFQ, there are few questions
>>> regarding how group structure should look like in CFQ.
>>>
>>> Currently, grouping looks as follows. A, and B are two cgroups created by
>>> user.
>>>
>>> [snip]
>>>
>>> Proposal 4:
>>> ==========
>>> Treat task and group at same level. Currently groups are at top level and
>>> at second level are tasks. View the whole hierarchy as follows.
>>>
>>>
>>> service-tree
>>> / | \ \
>>> T1 T2 G1 G2
>>>
>>> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
>>> created under root.
>>>
>>> In this kind of scheme, any RT task in root group will still be system
>>> wide RT even if we create groups G1 and G2.
>>>
>>> So what are the issues?
>>>
>>> - I talked to few folks and everybody found this scheme not so intutive.
>>> Their argument was that once I create a cgroup, say A, under root, then
>>> bandwidth should be divided between "root" and "A" proportionate to
>>> the weight.
>>>
>>> It is not very intutive that group is competing with all the tasks
>>> running in root group. And disk share of newly created group will change
>>> if more tasks fork in root group. So it is highly dynamic and not
>>> static hence un-intutive.
>
> I agree it might be dynamic but I don't think it's un-intuitive.
> I think it's reasonable that disk share of a group is
> influenced by the number of tasks running in root group,
> because the root group is shared by the tasks and groups from
> the viewpoint of cgroup I/F, and they really share disk bandwidth.

Agree, this is my preferred solution as well. There are definitely valid
cases for both doing system wide RT and system wide idle, and there are
definitely valid reasons for doing that inside a single group as well.

--
Jens Axboe

2009-12-21 14:42:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] CFQ group scheduling structure organization

On Mon, Dec 21, 2009 at 01:16:19PM +0100, Jens Axboe wrote:
> On Thu, Dec 17 2009, Munehiro Ikeda wrote:
> > Hello,
> >
> > Corrado Zoccolo wrote, on 12/17/2009 06:41 AM:
> >> Hi,
> >> On Wed, Dec 16, 2009 at 11:52 PM, Vivek Goyal<[email protected]> wrote:
> >>> Hi All,
> >>>
> >>> With some basic group scheduling support in CFQ, there are few questions
> >>> regarding how group structure should look like in CFQ.
> >>>
> >>> Currently, grouping looks as follows. A, and B are two cgroups created by
> >>> user.
> >>>
> >>> [snip]
> >>>
> >>> Proposal 4:
> >>> ==========
> >>> Treat task and group at same level. Currently groups are at top level and
> >>> at second level are tasks. View the whole hierarchy as follows.
> >>>
> >>>
> >>> service-tree
> >>> / | \ \
> >>> T1 T2 G1 G2
> >>>
> >>> Here T1 and T2 are two tasks in root group and G1 and G2 are two cgroups
> >>> created under root.
> >>>
> >>> In this kind of scheme, any RT task in root group will still be system
> >>> wide RT even if we create groups G1 and G2.
> >>>
> >>> So what are the issues?
> >>>
> >>> - I talked to few folks and everybody found this scheme not so intutive.
> >>> Their argument was that once I create a cgroup, say A, under root, then
> >>> bandwidth should be divided between "root" and "A" proportionate to
> >>> the weight.
> >>>
> >>> It is not very intutive that group is competing with all the tasks
> >>> running in root group. And disk share of newly created group will change
> >>> if more tasks fork in root group. So it is highly dynamic and not
> >>> static hence un-intutive.
> >
> > I agree it might be dynamic but I don't think it's un-intuitive.
> > I think it's reasonable that disk share of a group is
> > influenced by the number of tasks running in root group,
> > because the root group is shared by the tasks and groups from
> > the viewpoint of cgroup I/F, and they really share disk bandwidth.
>
> Agree, this is my preferred solution as well. There are definitely valid
> cases for both doing system wide RT and system wide idle, and there are
> definitely valid reasons for doing that inside a single group as well.
>

Thanks Jens. I will write a patch to implement above.

Vivek