2009-11-12 23:43:59

by Vivek Goyal

[permalink] [raw]
Subject: [RFC] Block IO Controller V2

Hi Jens,

This is V2 of the Block IO controller patches on top of "for-2.6.33" branch
of block tree.

A consolidated patch can be found here:

http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v2.patch

Changes from V1:

- Rebased the patches for "for-2.6.33" branch.
- Currently dropped the support for priority class of groups. For the time
being only BE class groups are supported.

After the discussions at IO minisummit at Tokyo, Japan, it was agreed that
one single IO control policy at either leaf nodes or at higher level nodes
does not meet all the requirements and we need something so that we have
the capability to support more than one IO control policy (like proportional
weight division and max bandwidth control) and also have capability to
implement some of these policies at higher level logical devices.

It was agreed that CFQ is the right place to implement time based proportional
weight division policy. Other policies like max bandwidth control/throttling
will make more sense at higher level logical devices.

This patch introduces blkio cgroup controller. It provides the management
interface for the block IO control. The idea is that keep the interface
common and in the background we should be able to switch policies based on
user options. Hence user can control the IO throughout the IO stack with
a single cgroup interface.

Apart from blkio cgroup interface, this patchset also modifies CFQ to implement
time based proportional weight division of disk. CFQ already does it in flat
mode. It has been modified to do group IO scheduling also.

IO control is a huge problem and the moment we start addressing all the
issues in one patchset, it bloats to unmanageable proportions and then nothing
gets inside the kernel. So at io mini summit we agreed that lets take small
steps and once a piece of code is inside the kernel and stablized, take the
next step. So this is the first step.

Some parts of the code are based on BFQ patches posted by Paolo and Fabio.

Your feedback is welcome.

TODO
====
- Support async IO control (buffered writes).

Buffered writes is a beast and requires changes at many a places to solve the
problem and patchset becomes huge. Hence first we plan to support only sync
IO in control then work on async IO too.

Some of the work items identified are.

- Per memory cgroup dirty ratio
- Possibly modification of writeback to force writeback from a
particular cgroup.
- Implement IO tracking support so that a bio can be mapped to a cgroup.
- Per group request descriptor infrastructure in block layer.
- At CFQ level, implement per cfq_group async queues.

In this patchset, all the async IO goes in system wide queues and there are
no per group async queues. That means we will see service differentiation
only for sync IO only. Async IO willl be handled later.

- Support for higher level policies like max BW controller.
- Support groups of RT class also.

Thanks
Vivek

Documentation/cgroups/blkio-controller.txt | 100 +++
block/Kconfig | 22 +
block/Kconfig.iosched | 17 +
block/Makefile | 1 +
block/blk-cgroup.c | 312 ++++++++++
block/blk-cgroup.h | 90 +++
block/cfq-iosched.c | 901 ++++++++++++++++++++++++----
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 4 +
9 files changed, 1346 insertions(+), 107 deletions(-)


2009-11-13 01:41:53

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 01/16] blkio: Documentation

Signed-off-by: Vivek Goyal <[email protected]>
---
Documentation/cgroups/blkio-controller.txt | 100 ++++++++++++++++++++++++++++
1 files changed, 100 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/blkio-controller.txt

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 0000000..6fb772f
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -0,0 +1,100 @@
+ Block IO Controller
+ ===================
+Overview
+========
+cgroup subsys "blkio" implements the block io controller. There seems to be
+a need of various kind of IO control policies (like proportional BW, max BW)
+both at leaf nodes as well as at intermediate nodes in storage hierarchy. Plan
+is to use same cgroup based management interface for blkio controller and
+based on user options switch IO policies in the background.
+
+In the first phase, this patchset implements proportional weight time based
+division of disk policy. It is implemented in CFQ. Hence this policy takes
+effect only on leaf nodes when CFQ is being used.
+
+HOWTO
+=====
+You can do a very simple testing of running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable group scheduling in CFQ
+ CONFIG_CFQ_GROUP_IOSCHED=y
+
+- Compile and boot into kernel and mount IO controller (blkio).
+
+ mount -t cgroup -o blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/blkio.weight
+ echo 500 > /cgroup/test2/blkio.weight
+
+- Create two same size files (say 512MB each) on same disk (file1, file2) and
+ launch two dd threads in different cgroup to read those files.
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At macro level, first dd should finish first. To get more precise data, keep
+ on looking at (with the help of script), at blkio.disk_time and
+ blkio.disk_sectors files of both test1 and test2 groups. This will tell how
+ much disk time (in milli seconds), each group got and how many secotors each
+ group dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally io.disk_time of cgroups should be in proportion to the weight.
+
+Various user visible config options
+===================================
+CONFIG_CFQ_GROUP_IOSCHED
+ - Enables group scheduling in CFQ. Currently only 1 level of group
+ creation is allowed.
+
+CONFIG_DEBUG_CFQ_IOSCHED
+ - Enables some debugging messages in blktrace. Also creates extra
+ cgroup file blkio.dequeue.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configuration.
+
+CONFIG_BLK_CGROUP
+ - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+
+CONFIG_DEBUG_BLK_CGROUP
+ - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+
+Details of cgroup files
+=======================
+- blkio.weight
+ - Specifies per cgroup weight.
+
+ Currently allowed range of weights is from 100 to 1000.
+
+- blkio.time
+ - disk time allocated to cgroup per device in milliseconds. First
+ two fields specify the major and minor number of the device and
+ third field specifies the disk time allocated to group in
+ milliseconds.
+
+- blkio.sectors
+ - number of sectors transferred to/from disk by the group. First
+ two fields specify the major and minor number of the device and
+ third field specifies the number of sectors transferred by the
+ group to/from the device.
+
+- blkio.dequeue
+ - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+ gives the statistics about how many a times a group was dequeued
+ from service tree of the device. First two fields specify the major
+ and minor number of the device and third field specifies the number
+ of times a group was dequeued from a particular device.
--
1.6.2.5

2009-11-12 23:34:22

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 02/16] blkio: Introduce the notion of cfq groups

o This patch introduce the notion of cfq groups. Soon we will can have multiple
groups of different weights in the system.

o Various service trees (prioclass and workload type trees), will become per
cfq group. So hierarchy looks as follows.

cfq_groups
|
workload type
|
cfq queue

o When an scheduling decision has to be taken, first we select the cfq group
then workload with-in the group and then cfq queue with-in the workload
type.

o This patch just makes various workload service tree per cfq group and
introduce the function to be able to choose a group for scheduling.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 114 +++++++++++++++++++++++++++++++++++----------------
1 files changed, 79 insertions(+), 35 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1bcbd8c..aebb205 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -132,6 +132,7 @@ struct cfq_queue {

struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
+ struct cfq_group *cfqg;
};

/*
@@ -153,6 +154,15 @@ enum wl_type_t {
SYNC_WORKLOAD = 2
};

+/* This is per cgroup per device grouping structure */
+struct cfq_group {
+ /*
+ * rr lists of queues with requests, onle rr for each priority class.
+ * Counts are embedded in the cfq_rb_root
+ */
+ struct cfq_rb_root service_trees[2][3];
+ struct cfq_rb_root service_tree_idle;
+};

/*
* Per block device queue structure
@@ -160,18 +170,15 @@ enum wl_type_t {
struct cfq_data {
struct request_queue *queue;

- /*
- * rr lists of queues with requests, onle rr for each priority class.
- * Counts are embedded in the cfq_rb_root
- */
- struct cfq_rb_root service_trees[2][3];
- struct cfq_rb_root service_tree_idle;
+ struct cfq_group root_group;
+
/*
* The priority currently being served
*/
enum wl_prio_t serving_prio;
enum wl_type_t serving_type;
unsigned long workload_expires;
+ struct cfq_group *serving_group;

/*
* Each priority tree is sorted by next_request position. These
@@ -233,14 +240,15 @@ struct cfq_data {
unsigned long last_end_sync_rq;
};

-static struct cfq_rb_root *service_tree_for(enum wl_prio_t prio,
+static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
+ enum wl_prio_t prio,
enum wl_type_t type,
struct cfq_data *cfqd)
{
if (prio == IDLE_WORKLOAD)
- return &cfqd->service_tree_idle;
+ return &cfqg->service_tree_idle;

- return &cfqd->service_trees[prio][type];
+ return &cfqg->service_trees[prio][type];
}

enum cfqq_state_flags {
@@ -308,12 +316,14 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)

static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd)
{
+ struct cfq_group *cfqg = &cfqd->root_group;
+
if (wl == IDLE_WORKLOAD)
- return cfqd->service_tree_idle.count;
+ return cfqg->service_tree_idle.count;

- return cfqd->service_trees[wl][ASYNC_WORKLOAD].count
- + cfqd->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
- + cfqd->service_trees[wl][SYNC_WORKLOAD].count;
+ return cfqg->service_trees[wl][ASYNC_WORKLOAD].count
+ + cfqg->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
+ + cfqg->service_trees[wl][SYNC_WORKLOAD].count;
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
@@ -621,7 +631,8 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_rb_root *service_tree;
int left;

- service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd);
+ service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
+ cfqq_type(cfqq), cfqd);
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
@@ -1057,7 +1068,8 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_rb_root *service_tree =
- service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
+ service_tree_for(cfqd->serving_group, cfqd->serving_prio,
+ cfqd->serving_type, cfqd);

if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
@@ -1209,7 +1221,8 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
* in their service tree.
*/
if (!service_tree)
- service_tree = service_tree_for(prio, cfqq_type(cfqq), cfqd);
+ service_tree = service_tree_for(cfqq->cfqg, prio,
+ cfqq_type(cfqq), cfqd);

if (service_tree->count == 0)
return true;
@@ -1222,6 +1235,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
unsigned long sl;
+ struct cfq_rb_root *st;

/*
* SSD device without seek penalty, disable idling. But only do so
@@ -1271,9 +1285,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* fair distribution of slice time for a process doing back-to-back
* seeks.
*/
- if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
- service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WORKLOAD, cfqd)
- ->count > 0) {
+ st = service_tree_for(cfqq->cfqg, cfqd->serving_prio,
+ SYNC_NOIDLE_WORKLOAD, cfqd);
+ if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && st->count > 0) {
if (blk_queue_nonrot(cfqd->queue) || cfqd->hw_tag)
return;
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
@@ -1381,8 +1395,9 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}
}

-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
- bool prio_changed)
+static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, enum wl_prio_t prio,
+ bool prio_changed)
{
struct cfq_queue *queue;
int i;
@@ -1396,10 +1411,10 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
* from SYNC_NOIDLE (first choice), or just SYNC
* over ASYNC
*/
- if (service_tree_for(prio, cur_best, cfqd)->count)
+ if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
return cur_best;
cur_best = SYNC_WORKLOAD;
- if (service_tree_for(prio, cur_best, cfqd)->count)
+ if (service_tree_for(cfqg, prio, cur_best, cfqd)->count)
return cur_best;

return ASYNC_WORKLOAD;
@@ -1407,7 +1422,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,

for (i = 0; i < 3; ++i) {
/* otherwise, select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(prio, i, cfqd));
+ queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd));
if (queue &&
(!key_valid || time_before(queue->rb_key, lowest_key))) {
lowest_key = queue->rb_key;
@@ -1419,12 +1434,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, enum wl_prio_t prio,
return cur_best;
}

-static void choose_service_tree(struct cfq_data *cfqd)
+static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
enum wl_prio_t previous_prio = cfqd->serving_prio;
bool prio_changed;
unsigned slice;
unsigned count;
+ struct cfq_rb_root *st;

/* Choose next priority. RT > BE > IDLE */
if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
@@ -1443,8 +1459,9 @@ static void choose_service_tree(struct cfq_data *cfqd)
* expiration time
*/
prio_changed = (cfqd->serving_prio != previous_prio);
- count = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd)
- ->count;
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
+ cfqd);
+ count = st->count;

/*
* If priority didn't change, check workload expiration,
@@ -1456,9 +1473,10 @@ static void choose_service_tree(struct cfq_data *cfqd)

/* otherwise select new workload type */
cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqd->serving_prio, prio_changed);
- count = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd)
- ->count;
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed);
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type,
+ cfqd);
+ count = st->count;

/*
* the workload slice is computed as a fraction of target latency
@@ -1481,6 +1499,12 @@ static void choose_service_tree(struct cfq_data *cfqd)
cfqd->workload_expires = jiffies + slice;
}

+static void cfq_choose_cfqg(struct cfq_data *cfqd)
+{
+ cfqd->serving_group = &cfqd->root_group;
+ choose_service_tree(cfqd, &cfqd->root_group);
+}
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -1538,7 +1562,7 @@ new_queue:
* service tree
*/
if (!new_cfqq)
- choose_service_tree(cfqd);
+ cfq_choose_cfqg(cfqd);

cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
@@ -1567,13 +1591,15 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
int i, j;
+ struct cfq_group *cfqg = &cfqd->root_group;
+
for (i = 0; i < 2; ++i)
for (j = 0; j < 3; ++j)
- while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j]))
+ while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
!= NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
+ while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
@@ -2044,14 +2070,26 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ cfqq->cfqg = cfqg;
+}
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
+ struct cfq_group *cfqg;

retry:
+ cfqg = cfq_get_cfqg(cfqd, 1);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -2082,6 +2120,7 @@ retry:
if (cfqq) {
cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, ioc);
+ cfq_link_cfqq_cfqg(cfqq, cfqg);
cfq_log_cfqq(cfqd, cfqq, "alloced");
} else
cfqq = &cfqd->oom_cfqq;
@@ -2914,15 +2953,19 @@ static void *cfq_init_queue(struct request_queue *q)
{
struct cfq_data *cfqd;
int i, j;
+ struct cfq_group *cfqg;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
return NULL;

+ /* Init root group */
+ cfqg = &cfqd->root_group;
+
for (i = 0; i < 2; ++i)
for (j = 0; j < 3; ++j)
- cfqd->service_trees[i][j] = CFQ_RB_ROOT;
- cfqd->service_tree_idle = CFQ_RB_ROOT;
+ cfqg->service_trees[i][j] = CFQ_RB_ROOT;
+ cfqg->service_tree_idle = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we
@@ -2939,6 +2982,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
atomic_inc(&cfqd->oom_cfqq.ref);
+ cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);

INIT_LIST_HEAD(&cfqd->cic_list);

--
1.6.2.5

2009-11-12 23:35:37

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 03/16] blkio: Keep queue on service tree until we expire it

o Currently cfqq deletes a queue from service tree if it is empty (even if
we might idle on the queue). This patch keeps the queue on service tree
hence associated group remains on the service tree until we decide that
we are not going to idle on the queue and expire it.

o This just helps in time accounting for queue/group and in implementation
of rest of the patches.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 68 +++++++++++++++++++++++++++++++++++++++------------
1 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aebb205..c254917 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -374,7 +374,7 @@ static int cfq_queue_empty(struct request_queue *q)
{
struct cfq_data *cfqd = q->elevator->elevator_data;

- return !cfqd->busy_queues;
+ return !cfqd->rq_queued;
}

/*
@@ -819,7 +819,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);

BUG_ON(!cfqq->queued[sync]);
@@ -827,8 +826,17 @@ static void cfq_del_rq_rb(struct request *rq)

elv_rb_del(&cfqq->sort_list, rq);

- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list)) {
+ /*
+ * Queue will be deleted from service tree when we actually
+ * expire it later. Right now just remove it from prio tree
+ * as it is empty.
+ */
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}

static void cfq_add_rq_rb(struct request *rq)
@@ -1042,6 +1050,9 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}

+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_del_cfqq_rr(cfqd, cfqq);
+
cfq_resort_rr_list(cfqd, cfqq);

if (cfqq == cfqd->active_queue)
@@ -1065,17 +1076,43 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
* Get next queue for service. Unless we have a queue preemption,
* we'll simply select the first cfqq in the service tree.
*/
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
+static struct cfq_queue *__cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_rb_root *service_tree =
service_tree_for(cfqd->serving_group, cfqd->serving_prio,
cfqd->serving_type, cfqd);

+ if (!cfqd->rq_queued)
+ return NULL;
+
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
return cfq_rb_first(service_tree);
}

+static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_queue *cfqq;
+ int i, j;
+
+ if (!cfqd->rq_queued)
+ return NULL;
+
+ for (i = 0; i < 2; ++i) {
+ for (j = 0; j < 3; ++j) {
+ cfqq = cfq_rb_first(&cfqg->service_trees[i][j]);
+ if (cfqq)
+ return cfqq;
+ }
+ }
+
+ cfqq = cfq_rb_first(&cfqg->service_tree_idle);
+ if (cfqq)
+ return cfqq;
+ return NULL;
+}
+
/*
* Get and set a new active queue for service.
*/
@@ -1083,7 +1120,7 @@ static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
struct cfq_queue *cfqq)
{
if (!cfqq)
- cfqq = cfq_get_next_queue(cfqd);
+ cfqq = __cfq_get_next_queue(cfqd);

__cfq_set_active_queue(cfqd, cfqq);
return cfqq;
@@ -1517,6 +1554,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
if (!cfqq)
goto new_queue;

+ if (!cfqd->rq_queued)
+ return NULL;
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -1579,6 +1618,10 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
}

BUG_ON(!list_empty(&cfqq->fifo));
+
+ /* By default cfqq is not expired if it is empty. Do it explicitly */
+ __cfq_slice_expired(cfqq->cfqd, cfqq, 0);
+
return dispatched;
}

@@ -1590,16 +1633,8 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq;
int dispatched = 0;
- int i, j;
- struct cfq_group *cfqg = &cfqd->root_group;
-
- for (i = 0; i < 2; ++i)
- for (j = 0; j < 3; ++j)
- while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
- != NULL)
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
+ while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
@@ -1770,13 +1805,14 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));

if (unlikely(cfqd->active_queue == cfqq)) {
__cfq_slice_expired(cfqd, cfqq, 0);
cfq_schedule_dispatch(cfqd);
}

+ BUG_ON(cfq_cfqq_on_rr(cfqq));
+
kmem_cache_free(cfq_pool, cfqq);
}

--
1.6.2.5

2009-11-12 23:35:28

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 04/16] blkio: Introduce the root service tree for cfq groups

o So far we just had one cfq_group in cfq_data. To create space for more than
one cfq_group, we need to have a service tree of groups where all the groups
can be queued if they have active cfq queues backlogged in these.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 134 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c254917..33c3e62 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -77,8 +77,9 @@ struct cfq_rb_root {
struct rb_root rb;
struct rb_node *left;
unsigned count;
+ u64 min_vdisktime;
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, }
+#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

/*
* Per process-grouping structure
@@ -156,6 +157,16 @@ enum wl_type_t {

/* This is per cgroup per device grouping structure */
struct cfq_group {
+ /* group service_tree member */
+ struct rb_node rb_node;
+
+ /* group service_tree key */
+ u64 vdisktime;
+ bool on_st;
+
+ /* number of cfqq currently on this group */
+ int nr_cfqq;
+
/*
* rr lists of queues with requests, onle rr for each priority class.
* Counts are embedded in the cfq_rb_root
@@ -170,6 +181,8 @@ struct cfq_group {
struct cfq_data {
struct request_queue *queue;

+ /* Root service tree for cfq_groups */
+ struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;

/*
@@ -245,6 +258,9 @@ static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_type_t type,
struct cfq_data *cfqd)
{
+ if (!cfqg)
+ return NULL;
+
if (prio == IDLE_WORKLOAD)
return &cfqg->service_tree_idle;

@@ -566,6 +582,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}

+static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+{
+ if (!root->left)
+ root->left = rb_first(&root->rb);
+
+ if (root->left)
+ return rb_entry(root->left, struct cfq_group, rb_node);
+
+ return NULL;
+}
+
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
@@ -617,6 +644,83 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

+static inline s64
+cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ return cfqg->vdisktime - st->min_vdisktime;
+}
+
+static void
+__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+{
+ struct rb_node **node = &st->rb.rb_node;
+ struct rb_node *parent = NULL;
+ struct cfq_group *__cfqg;
+ s64 key = cfqg_key(st, cfqg);
+ int left = 1;
+
+ while (*node != NULL) {
+ parent = *node;
+ __cfqg = rb_entry(parent, struct cfq_group, rb_node);
+
+ if (key < cfqg_key(st, __cfqg))
+ node = &parent->rb_left;
+ else {
+ node = &parent->rb_right;
+ left = 0;
+ }
+ }
+
+ if (left)
+ st->left = &cfqg->rb_node;
+
+ rb_link_node(&cfqg->rb_node, parent, node);
+ rb_insert_color(&cfqg->rb_node, &st->rb);
+}
+
+static void
+cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *__cfqg;
+ struct rb_node *n;
+
+ cfqg->nr_cfqq++;
+ if (cfqg->on_st)
+ return;
+
+ /*
+ * Currently put the group at the end. Later implement something
+ * so that groups get lesser vtime based on their weights, so that
+ * if group does not loose all if it was not continously backlogged.
+ */
+ n = rb_last(&st->rb);
+ if (n) {
+ __cfqg = rb_entry(n, struct cfq_group, rb_node);
+ cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ } else
+ cfqg->vdisktime = st->min_vdisktime;
+
+ __cfq_group_service_tree_add(st, cfqg);
+ cfqg->on_st = true;
+}
+
+static void
+cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ BUG_ON(cfqg->nr_cfqq < 1);
+ cfqg->nr_cfqq--;
+ /* If there are other cfq queues under this group, don't delete it */
+ if (cfqg->nr_cfqq)
+ return;
+
+ cfqg->on_st = false;
+ if (!RB_EMPTY_NODE(&cfqg->rb_node))
+ cfq_rb_erase(&cfqg->rb_node, st);
+}
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -699,6 +803,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
+ cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}

static struct cfq_queue *
@@ -809,6 +914,8 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}

+ cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+
BUG_ON(!cfqd->busy_queues);
cfqd->busy_queues--;
}
@@ -1085,6 +1192,9 @@ static struct cfq_queue *__cfq_get_next_queue(struct cfq_data *cfqd)
if (!cfqd->rq_queued)
return NULL;

+ /* There is nothing to dispatch */
+ if (!service_tree)
+ return NULL;
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
return cfq_rb_first(service_tree);
@@ -1479,6 +1589,12 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
unsigned count;
struct cfq_rb_root *st;

+ if (!cfqg) {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return;
+ }
+
/* Choose next priority. RT > BE > IDLE */
if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
cfqd->serving_prio = RT_WORKLOAD;
@@ -1536,10 +1652,21 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
cfqd->workload_expires = jiffies + slice;
}

+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ if (RB_EMPTY_ROOT(&st->rb))
+ return NULL;
+ return cfq_rb_first_group(st);
+}
+
static void cfq_choose_cfqg(struct cfq_data *cfqd)
{
- cfqd->serving_group = &cfqd->root_group;
- choose_service_tree(cfqd, &cfqd->root_group);
+ struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
+
+ cfqd->serving_group = cfqg;
+ choose_service_tree(cfqd, cfqg);
}

/*
@@ -2995,6 +3122,9 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;

+ /* Init root service tree */
+ cfqd->grp_service_tree = CFQ_RB_ROOT;
+
/* Init root group */
cfqg = &cfqd->root_group;

@@ -3002,6 +3132,7 @@ static void *cfq_init_queue(struct request_queue *q)
for (j = 0; j < 3; ++j)
cfqg->service_trees[i][j] = CFQ_RB_ROOT;
cfqg->service_tree_idle = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);

/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5

2009-11-12 23:34:29

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

o So far we had 300ms soft target latency system wide. Now with the
introduction of cfq groups, divide that latency by number of groups so
that one can come up with group target latency which will be helpful
in determining the workload slice with-in group and also the dynamic
slice length of the cfq queue.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 48 ++++++++++++++++++++++++++++++------------------
1 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 33c3e62..f8688b4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -167,6 +167,8 @@ struct cfq_group {
/* number of cfqq currently on this group */
int nr_cfqq;

+ /* Per group busy queus average. Useful for workload slice calc. */
+ unsigned int busy_queues_avg[2];
/*
* rr lists of queues with requests, onle rr for each priority class.
* Counts are embedded in the cfq_rb_root
@@ -184,6 +186,8 @@ struct cfq_data {
/* Root service tree for cfq_groups */
struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;
+ /* Number of active cfq groups on group service tree */
+ int nr_groups;

/*
* The priority currently being served
@@ -201,7 +205,6 @@ struct cfq_data {
struct rb_root prio_trees[CFQ_PRIO_LISTS];

unsigned int busy_queues;
- unsigned int busy_queues_avg[2];

int rq_in_driver[2];
int sync_flight;
@@ -330,10 +333,10 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)
return SYNC_WORKLOAD;
}

-static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd)
+static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+ struct cfq_data *cfqd,
+ struct cfq_group *cfqg)
{
- struct cfq_group *cfqg = &cfqd->root_group;
-
if (wl == IDLE_WORKLOAD)
return cfqg->service_tree_idle.count;

@@ -420,18 +423,19 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
* to quickly follows sudden increases and decrease slowly
*/

-static inline unsigned cfq_get_avg_queues(struct cfq_data *cfqd, bool rt)
+static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, bool rt)
{
unsigned min_q, max_q;
unsigned mult = cfq_hist_divisor - 1;
unsigned round = cfq_hist_divisor / 2;
- unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+ unsigned busy = cfq_group_busy_queues_wl(rt, cfqd, cfqg);

- min_q = min(cfqd->busy_queues_avg[rt], busy);
- max_q = max(cfqd->busy_queues_avg[rt], busy);
- cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+ min_q = min(cfqg->busy_queues_avg[rt], busy);
+ max_q = max(cfqg->busy_queues_avg[rt], busy);
+ cfqg->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
cfq_hist_divisor;
- return cfqd->busy_queues_avg[rt];
+ return cfqg->busy_queues_avg[rt];
}

static inline void
@@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
if (cfqd->cfq_latency) {
/* interested queues (we consider only the ones with the same
* priority class) */
- unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
+ unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
+ cfq_class_rt(cfqq));
unsigned sync_slice = cfqd->cfq_slice[1];
unsigned expect_latency = sync_slice * iq;
- if (expect_latency > cfq_target_latency) {
+ unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
+
+ if (expect_latency > group_target_lat) {
unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
/* scale low_slice according to IO priority
* and sync vs async */
@@ -452,7 +459,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
min(slice, base_low_slice * slice / sync_slice);
/* the adapted slice value is scaled to fit all iqs
* into the target latency */
- slice = max(slice * cfq_target_latency / expect_latency,
+ slice = max(slice * group_target_lat / expect_latency,
low_slice);
}
}
@@ -703,6 +710,7 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)

__cfq_group_service_tree_add(st, cfqg);
cfqg->on_st = true;
+ cfqd->nr_groups++;
}

static void
@@ -717,6 +725,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;

cfqg->on_st = false;
+ cfqd->nr_groups--;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
}
@@ -1588,6 +1597,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
unsigned slice;
unsigned count;
struct cfq_rb_root *st;
+ int group_target_latency;

if (!cfqg) {
cfqd->serving_prio = IDLE_WORKLOAD;
@@ -1596,9 +1606,9 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
}

/* Choose next priority. RT > BE > IDLE */
- if (cfq_busy_queues_wl(RT_WORKLOAD, cfqd))
+ if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
cfqd->serving_prio = RT_WORKLOAD;
- else if (cfq_busy_queues_wl(BE_WORKLOAD, cfqd))
+ else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
cfqd->serving_prio = BE_WORKLOAD;
else {
cfqd->serving_prio = IDLE_WORKLOAD;
@@ -1636,9 +1646,11 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
* proportional to the number of queues in that workload, over
* all the queues in the same priority class
*/
- slice = cfq_target_latency * count /
- max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
- cfq_busy_queues_wl(cfqd->serving_prio, cfqd));
+ group_target_latency = cfq_target_latency/cfqd->nr_groups;
+
+ slice = group_target_latency * count /
+ max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
+ cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

if (cfqd->serving_type == ASYNC_WORKLOAD)
/* async workload slice is scaled down according to
--
1.6.2.5

2009-11-12 23:34:55

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 06/16] blkio: Introduce blkio controller cgroup interface

o This is basic implementation of blkio controller cgroup interface. This is
the common interface visible to user space and should be used by different
IO control policies as we implement those.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 13 +++
block/Kconfig.iosched | 1 +
block/Makefile | 1 +
block/blk-cgroup.c | 177 +++++++++++++++++++++++++++++++++++++++++
block/blk-cgroup.h | 58 +++++++++++++
include/linux/cgroup_subsys.h | 6 ++
include/linux/iocontext.h | 4 +
7 files changed, 260 insertions(+), 0 deletions(-)
create mode 100644 block/blk-cgroup.c
create mode 100644 block/blk-cgroup.h

diff --git a/block/Kconfig b/block/Kconfig
index 9be0b56..6ba1a8e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection. If in doubt, say N.

+config BLK_CGROUP
+ bool
+ depends on CGROUPS
+ default n
+ ---help---
+ Generic block IO controller cgroup interface. This is the common
+ cgroup interface which should be used by various IO controlling
+ policies.
+
+ Currently, CFQ IO scheduler uses it to recognize task groups and
+ control disk bandwidth allocation (proportional time slice allocation)
+ to such task groups.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8bd1051..be0280d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,6 +23,7 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select BLK_CGROUP
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/Makefile b/block/Makefile
index 7914108..cb2d515 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
+obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
new file mode 100644
index 0000000..4f6afd7
--- /dev/null
+++ b/block/blk-cgroup.c
@@ -0,0 +1,177 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+#include <linux/ioprio.h>
+#include "blk-cgroup.h"
+
+struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkcg->lock, flags);
+ rcu_assign_pointer(blkg->key, key);
+ hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+}
+
+int blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ /* Implemented later */
+ return 0;
+}
+
+/* called under rcu_read_lock(). */
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
+{
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ __key = blkg->key;
+ if (__key == key)
+ return blkg;
+ }
+
+ return NULL;
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ return (u64)blkcg->__VAR; \
+}
+
+SHOW_FUNCTION(weight);
+#undef SHOW_FUNCTION
+
+static int
+blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->weight = (unsigned int)val;
+ return 0;
+}
+
+struct cftype blkio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = blkiocg_weight_read,
+ .write_u64 = blkiocg_weight_write,
+ },
+};
+
+static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+ free_css_id(&blkio_subsys, &blkcg->css);
+ kfree(blkcg);
+}
+
+static struct cgroup_subsys_state *
+blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg, *parent_blkcg;
+
+ if (!cgroup->parent) {
+ blkcg = &blkio_root_cgroup;
+ goto done;
+ }
+
+ /* Currently we do not support hierarchy deeper than two level (0,1) */
+ parent_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+ if (css_depth(&parent_blkcg->css) > 0)
+ return ERR_PTR(-EINVAL);
+
+ blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
+ if (!blkcg)
+ return ERR_PTR(-ENOMEM);
+
+ blkcg->weight = BLKIO_WEIGHT_DEFAULT;
+done:
+ spin_lock_init(&blkcg->lock);
+ INIT_HLIST_HEAD(&blkcg->blkg_list);
+
+ return &blkcg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures. For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc.
+ */
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+struct cgroup_subsys blkio_subsys = {
+ .name = "blkio",
+ .create = blkiocg_create,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
+ .destroy = blkiocg_destroy,
+ .populate = blkiocg_populate,
+ .subsys_id = blkio_subsys_id,
+ .use_id = 1,
+};
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
new file mode 100644
index 0000000..ba5703f
--- /dev/null
+++ b/block/blk-cgroup.h
@@ -0,0 +1,58 @@
+#ifndef _BLK_CGROUP_H
+#define _BLK_CGROUP_H
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#include <linux/cgroup.h>
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ unsigned int weight;
+ spinlock_t lock;
+ struct hlist_head blkg_list;
+};
+
+struct blkio_group {
+ /* An rcu protected unique identifier for the group */
+ void *key;
+ struct hlist_node blkcg_node;
+};
+
+#define BLKIO_WEIGHT_MIN 100
+#define BLKIO_WEIGHT_MAX 1000
+#define BLKIO_WEIGHT_DEFAULT 500
+
+#ifdef CONFIG_BLK_CGROUP
+extern struct blkio_cgroup blkio_root_cgroup;
+extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key);
+extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
+extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
+ void *key);
+#else
+static inline struct blkio_cgroup *
+cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
+
+static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+}
+
+static inline int
+blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
+
+static inline struct blkio_group *
+blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
+#endif
+#endif /* _BLK_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..ccefff0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
#endif

/* */
+
+#ifdef CONFIG_BLK_CGROUP
+SUBSYS(blkio)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index eb73632..d61b0b8 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -68,6 +68,10 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;

+#ifdef CONFIG_BLK_CGROUP
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
--
1.6.2.5

2009-11-12 23:45:14

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 07/16] blkio: Introduce per cfq group weights and vdisktime calculations

o Bring in the per cfq group weight and how vdisktime is calculated for the
group. Also bring in the functionality of updating the min_vdisktime of
the group service tree.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig.iosched | 9 ++++++-
block/cfq-iosched.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index be0280d..fa95fa7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,7 +23,6 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
- select BLK_CGROUP
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
@@ -33,6 +32,14 @@ config IOSCHED_CFQ

This is the default I/O scheduler.

+config CFQ_GROUP_IOSCHED
+ bool "CFQ Group Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select BLK_CGROUP
+ default n
+ ---help---
+ Enable group IO scheduling in CFQ.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f8688b4..1354c6b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -13,6 +13,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include "blk-cgroup.h"

/*
* tunables
@@ -49,6 +50,7 @@ static const int cfq_hist_divisor = 4;

#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)
+#define CFQ_SERVICE_SHIFT 12

#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
@@ -78,6 +80,7 @@ struct cfq_rb_root {
struct rb_node *left;
unsigned count;
u64 min_vdisktime;
+ struct rb_node *active;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

@@ -162,6 +165,7 @@ struct cfq_group {

/* group service_tree key */
u64 vdisktime;
+ unsigned int weight;
bool on_st;

/* number of cfqq currently on this group */
@@ -417,6 +421,51 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

+static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+{
+ u64 d = delta << CFQ_SERVICE_SHIFT;
+
+ d = d * BLKIO_WEIGHT_DEFAULT;
+ do_div(d, cfqg->weight);
+ return d;
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct cfq_rb_root *st)
+{
+ u64 vdisktime = st->min_vdisktime;
+ struct cfq_group *cfqg;
+
+ if (st->active) {
+ cfqg = rb_entry(st->active, struct cfq_group, rb_node);
+ vdisktime = cfqg->vdisktime;
+ }
+
+ if (st->left) {
+ cfqg = rb_entry(st->left, struct cfq_group, rb_node);
+ vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
/*
* get averaged number of queues of RT/BE priority.
* average is updated, with a formula that gives more weight to higher numbers,
@@ -718,8 +767,12 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;

+ if (st->active == &cfqg->rb_node)
+ st->active = NULL;
+
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
+
/* If there are other cfq queues under this group, don't delete it */
if (cfqg->nr_cfqq)
return;
@@ -1667,10 +1720,14 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_group *cfqg;

if (RB_EMPTY_ROOT(&st->rb))
return NULL;
- return cfq_rb_first_group(st);
+ cfqg = cfq_rb_first_group(st);
+ st->active = &cfqg->rb_node;
+ update_min_vdisktime(st);
+ return cfqg;
}

static void cfq_choose_cfqg(struct cfq_data *cfqd)
@@ -3146,6 +3203,9 @@ static void *cfq_init_queue(struct request_queue *q)
cfqg->service_tree_idle = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);

+ /* Give preference to root group over other groups */
+ cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
--
1.6.2.5

2009-11-12 23:46:26

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 08/16] blkio: Group time used accounting and workload context save restore

o This patch introduces the functionality to do the accounting of group time
when a queue expires. This time used decides which is the group to go
next.

o Also introduce the functionlity to save and restore the workload type
context with-in group. It might happen that once we expire the cfq queue
and group, a different group will schedule in and we will lose the context
of the workload type. Hence save and restore it upon queue expiry.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1354c6b..1e20478 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -113,6 +113,10 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;

+ /* time when queue got scheduled in to dispatch first request. */
+ unsigned long dispatch_start;
+ /* time when first request from queue completed and slice started. */
+ unsigned long slice_start;
unsigned long slice_end;
long slice_resid;
unsigned int slice_dispatch;
@@ -179,6 +183,10 @@ struct cfq_group {
*/
struct cfq_rb_root service_trees[2][3];
struct cfq_rb_root service_tree_idle;
+
+ unsigned long saved_workload_slice;
+ enum wl_type_t saved_workload;
+ enum wl_prio_t saved_serving_prio;
};

/*
@@ -512,6 +520,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
low_slice);
}
}
+ cfqq->slice_start = jiffies;
cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
@@ -781,6 +790,55 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
cfqd->nr_groups--;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
+ cfqg->saved_workload_slice = 0;
+}
+
+static inline unsigned long cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
+{
+ unsigned long slice_used, allocated_slice;
+
+ /*
+ * Queue got expired before even a single request completed or
+ * got expired immediately after first request completion.
+ */
+ if (!cfqq->slice_end || cfqq->slice_start == jiffies) {
+ /*
+ * Also charge the seek time incurred to the group, otherwise
+ * if there are mutiple queues in the group, each can dispatch
+ * a single request on seeky media and cause lots of seek time
+ * and group will never know it.
+ */
+ slice_used = max_t(unsigned long,
+ (jiffies - cfqq->dispatch_start), 1);
+ } else {
+ slice_used = jiffies - cfqq->slice_start;
+ allocated_slice = cfqq->slice_end - cfqq->slice_start;
+ if (slice_used > allocated_slice)
+ slice_used = allocated_slice;
+ }
+
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%lu", slice_used);
+ return slice_used;
+}
+
+static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ unsigned long service)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ /* Can't update vdisktime while group is on service tree */
+ cfq_rb_erase(&cfqg->rb_node, st);
+ cfqg->vdisktime += cfq_scale_slice(service, cfqg);
+ __cfq_group_service_tree_add(st, cfqg);
+
+ /* This group is being expired. Save the context */
+ if (time_after(cfqd->workload_expires, jiffies)) {
+ cfqg->saved_workload_slice = cfqd->workload_expires
+ - jiffies;
+ cfqg->saved_workload = cfqd->serving_type;
+ cfqg->saved_serving_prio = cfqd->serving_prio;
+ } else
+ cfqg->saved_workload_slice = 0;
}

/*
@@ -796,6 +854,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
unsigned long rb_key;
struct cfq_rb_root *service_tree;
int left;
+ int new_cfqq = 1;

service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq), cfqd);
@@ -824,6 +883,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}

if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ new_cfqq = 0;
/*
* same position, nothing more to do
*/
@@ -865,6 +925,8 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
rb_link_node(&cfqq->rb_node, parent, p);
rb_insert_color(&cfqq->rb_node, &service_tree->rb);
service_tree->count++;
+ if (add_front || !new_cfqq)
+ return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
}

@@ -1182,6 +1244,8 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
{
if (cfqq) {
cfq_log_cfqq(cfqd, cfqq, "set_active");
+ cfqq->slice_start = 0;
+ cfqq->dispatch_start = jiffies;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;

@@ -1219,6 +1283,8 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}

+ cfq_group_served(cfqd, cfqq->cfqg, cfq_cfqq_slice_usage(cfqq));
+
if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
cfq_del_cfqq_rr(cfqd, cfqq);

@@ -1227,6 +1293,9 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfqq == cfqd->active_queue)
cfqd->active_queue = NULL;

+ if (&cfqq->cfqg->rb_node == cfqd->grp_service_tree.active)
+ cfqd->grp_service_tree.active = NULL;
+
if (cfqd->active_cic) {
put_io_context(cfqd->active_cic->ioc);
cfqd->active_cic = NULL;
@@ -1735,6 +1804,13 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);

cfqd->serving_group = cfqg;
+
+ /* Restore the workload type data */
+ if (cfqg->saved_workload_slice) {
+ cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
+ cfqd->serving_type = cfqg->saved_workload;
+ cfqd->serving_prio = cfqg->saved_serving_prio;
+ }
choose_service_tree(cfqd, cfqg);
}

--
1.6.2.5

2009-11-12 23:51:13

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 09/16] blkio: Dynamic cfq group creation based on cgroup tasks belongs to

o Determine the cgroup IO submitting task belongs to and create the cfq
group if it does not exist already.

o Also link cfqq and associated cfq group.

o Currently all async IO is mapped to root group.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 102 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e20478..788c1ff 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -187,6 +187,10 @@ struct cfq_group {
unsigned long saved_workload_slice;
enum wl_type_t saved_workload;
enum wl_prio_t saved_serving_prio;
+ struct blkio_group blkg;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ struct hlist_node cfqd_node;
+#endif
};

/*
@@ -266,8 +270,13 @@ struct cfq_data {
struct cfq_queue oom_cfqq;

unsigned long last_end_sync_rq;
+
+ /* List of cfq groups being managed on this device*/
+ struct hlist_head cfqg_list;
};

+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
enum wl_type_t type,
@@ -841,6 +850,91 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_workload_slice = 0;
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
+{
+ if (blkg)
+ return container_of(blkg, struct cfq_group, blkg);
+ return NULL;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+ int i, j;
+
+ /* Do we need to take this reference */
+ if (!css_tryget(&blkcg->css))
+ return NULL;;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg || !create)
+ goto done;
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+ if (!cfqg)
+ goto done;
+
+ cfqg->weight = blkcg->weight;
+
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j)
+ cfqg->service_trees[i][j] = CFQ_RB_ROOT;
+ cfqg->service_tree_idle = CFQ_RB_ROOT;
+ RB_CLEAR_NODE(&cfqg->rb_node);
+
+ /* Add group onto cgroup list */
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+
+done:
+ css_put(&blkcg->css);
+ return cfqg;
+}
+
+/*
+ * Search for the cfq group current task belongs to. If create = 1, then also
+ * create the cfq group if it does not exist. request_queue lock must be held.
+ */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ struct cgroup *cgroup;
+ struct cfq_group *cfqg = NULL;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
+ if (!cfqg && create)
+ cfqg = &cfqd->root_group;
+ rcu_read_unlock();
+ return cfqg;
+}
+
+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfqq->cfqg = cfqg;
+}
+#else /* GROUP_IOSCHED */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
+static inline void
+cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
+ cfqq->cfqg = cfqg;
+}
+
+#endif /* GROUP_IOSCHED */
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -1333,13 +1427,17 @@ static struct cfq_queue *__cfq_get_next_queue(struct cfq_data *cfqd)

static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
- struct cfq_group *cfqg = &cfqd->root_group;
+ struct cfq_group *cfqg;
struct cfq_queue *cfqq;
int i, j;

if (!cfqd->rq_queued)
return NULL;

+ cfqg = cfq_get_next_cfqg(cfqd);
+ if (!cfqg)
+ return NULL;
+
for (i = 0; i < 2; ++i) {
for (j = 0; j < 3; ++j) {
cfqq = cfq_rb_first(&cfqg->service_trees[i][j]);
@@ -2378,16 +2476,6 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

-static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
-{
- cfqq->cfqg = cfqg;
-}
-
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
-{
- return &cfqd->root_group;
-}
-
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -3282,6 +3370,9 @@ static void *cfq_init_queue(struct request_queue *q)
/* Give preference to root group over other groups */
cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+#endif
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
--
1.6.2.5

2009-11-12 23:35:12

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 10/16] blkio: Take care of cgroup deletion and cfq group reference counting

o One can choose to change elevator or delete a cgroup. Implement group
reference counting so that both elevator exit and cgroup deletion can
take place gracefully.

Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
---
block/blk-cgroup.c | 66 +++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 1 +
block/cfq-iosched.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4f6afd7..0426ab6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -13,6 +13,8 @@
#include <linux/ioprio.h>
#include "blk-cgroup.h"

+extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
+
struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };

struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
@@ -28,14 +30,43 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,

spin_lock_irqsave(&blkcg->lock, flags);
rcu_assign_pointer(blkg->key, key);
+ blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
}

+static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ hlist_del_init_rcu(&blkg->blkcg_node);
+ blkg->blkcg_id = 0;
+}
+
+/*
+ * returns 0 if blkio_group was still on cgroup list. Otherwise returns 1
+ * indicating that blk_group was unhashed by the time we got to it.
+ */
int blkiocg_del_blkio_group(struct blkio_group *blkg)
{
- /* Implemented later */
- return 0;
+ struct blkio_cgroup *blkcg;
+ unsigned long flags;
+ struct cgroup_subsys_state *css;
+ int ret = 1;
+
+ rcu_read_lock();
+ css = css_lookup(&blkio_subsys, blkg->blkcg_id);
+ if (!css)
+ goto out;
+
+ blkcg = container_of(css, struct blkio_cgroup, css);
+ spin_lock_irqsave(&blkcg->lock, flags);
+ if (!hlist_unhashed(&blkg->blkcg_node)) {
+ __blkiocg_del_blkio_group(blkg);
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+out:
+ rcu_read_unlock();
+ return ret;
}

/* called under rcu_read_lock(). */
@@ -97,8 +128,39 @@ static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ unsigned long flags;
+ struct blkio_group *blkg;
+ void *key;

+ rcu_read_lock();
+remove_entry:
+ spin_lock_irqsave(&blkcg->lock, flags);
+
+ if (hlist_empty(&blkcg->blkg_list)) {
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+ goto done;
+ }
+
+ blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
+ blkcg_node);
+ key = rcu_dereference(blkg->key);
+ __blkiocg_del_blkio_group(blkg);
+
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+
+ /*
+ * This blkio_group is being unlinked as associated cgroup is going
+ * away. Let all the IO controlling policies know about this event.
+ *
+ * Currently this is static call to one io controlling policy. Once
+ * we have more policies in place, we need some dynamic registration
+ * of callback function.
+ */
+ cfq_unlink_blkio_group(key, blkg);
+ goto remove_entry;
+done:
free_css_id(&blkio_subsys, &blkcg->css);
+ rcu_read_unlock();
kfree(blkcg);
}

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ba5703f..cd50a2f 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -26,6 +26,7 @@ struct blkio_group {
/* An rcu protected unique identifier for the group */
void *key;
struct hlist_node blkcg_node;
+ unsigned short blkcg_id;
};

#define BLKIO_WEIGHT_MIN 100
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 788c1ff..de21882 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -190,6 +190,7 @@ struct cfq_group {
struct blkio_group blkg;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct hlist_node cfqd_node;
+ atomic_t ref;
#endif
};

@@ -886,6 +887,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
cfqg->service_tree_idle = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->rb_node);

+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ atomic_set(&cfqg->ref, 1);
+
/* Add group onto cgroup list */
blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);

@@ -922,7 +931,86 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
cfqg = &cfqq->cfqd->root_group;

cfqq->cfqg = cfqg;
+ /* cfqq reference on cfqg */
+ atomic_inc(&cfqq->cfqg->ref);
+}
+
+static void cfq_put_cfqg(struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st;
+ int i, j;
+
+ BUG_ON(atomic_read(&cfqg->ref) <= 0);
+ if (!atomic_dec_and_test(&cfqg->ref))
+ return;
+
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j) {
+ st = &cfqg->service_trees[i][j];
+ BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ BUG_ON(st->active != NULL);
+ }
+
+ BUG_ON(!RB_EMPTY_ROOT(&cfqg->service_tree_idle.rb));
+ BUG_ON(cfqg->service_tree_idle.active != NULL);
+
+ kfree(cfqg);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ /* Something wrong if we are trying to remove same group twice */
+ BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
+
+ hlist_del_init(&cfqg->cfqd_node);
+
+ /*
+ * Put the reference taken at the time of creation so that when all
+ * queues are gone, group can be destroyed.
+ */
+ cfq_put_cfqg(cfqg);
}
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd)
+{
+ struct hlist_node *pos, *n;
+ struct cfq_group *cfqg;
+
+ hlist_for_each_entry_safe(cfqg, pos, n, &cfqd->cfqg_list, cfqd_node) {
+ /*
+ * If cgroup removal path got to blk_group first and removed
+ * it from cgroup list, then it will take care of destroying
+ * cfqg also.
+ */
+ if (!blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+ }
+}
+
+/*
+ * Blk cgroup controller notification saying that blkio_group object is being
+ * delinked as associated cgroup object is going away. That also means that
+ * no new IO will come in this group. So get rid of this group as soon as
+ * any pending IO in the group is finished.
+ *
+ * This function is called under rcu_read_lock(). key is the rcu protected
+ * pointer. That means "key" is a valid cfq_data pointer as long as we are rcu
+ * read lock.
+ *
+ * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * it should not be NULL as even if elevator was exiting, cgroup deltion
+ * path got to it first.
+ */
+void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
+{
+ unsigned long flags;
+ struct cfq_data *cfqd = key;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+}
+
#else /* GROUP_IOSCHED */
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
@@ -933,6 +1021,9 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
cfqq->cfqg = cfqg;
}

+static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
+static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+
#endif /* GROUP_IOSCHED */

/*
@@ -2161,11 +2252,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
* task holds one reference to the queue, dropped when task exits. each rq
* in-flight on this queue also holds a reference, dropped when rq is freed.
*
+ * Each cfq queue took a reference on the parent group. Drop it now.
* queue lock must be held here.
*/
static void cfq_put_queue(struct cfq_queue *cfqq)
{
struct cfq_data *cfqd = cfqq->cfqd;
+ struct cfq_group *cfqg;

BUG_ON(atomic_read(&cfqq->ref) <= 0);

@@ -2175,6 +2268,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
+ cfqg = cfqq->cfqg;

if (unlikely(cfqd->active_queue == cfqq)) {
__cfq_slice_expired(cfqd, cfqq, 0);
@@ -2184,6 +2278,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
BUG_ON(cfq_cfqq_on_rr(cfqq));

kmem_cache_free(cfq_pool, cfqq);
+ cfq_put_cfqg(cfqg);
}

/*
@@ -3337,11 +3432,15 @@ static void cfq_exit_queue(struct elevator_queue *e)
}

cfq_put_async_queues(cfqd);
+ cfq_release_cfq_groups(cfqd);
+ blkiocg_del_blkio_group(&cfqd->root_group.blkg);

spin_unlock_irq(q->queue_lock);

cfq_shutdown_timer_wq(cfqd);

+ /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
+ synchronize_rcu();
kfree(cfqd);
}

@@ -3371,6 +3470,11 @@ static void *cfq_init_queue(struct request_queue *q)
cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;

#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ /*
+ * Take a reference to root group which we never drop. This is just
+ * to make sure that cfq_put_cfqg() does not try to kfree root group
+ */
+ atomic_set(&cfqg->ref, 1);
blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
#endif
/*
--
1.6.2.5

2009-11-13 01:52:48

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 11/16] blkio: Some debugging aids for CFQ

o Some debugging aids for CFQ.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 9 +++++++++
block/Kconfig.iosched | 9 +++++++++
block/blk-cgroup.c | 4 ++++
block/blk-cgroup.h | 13 +++++++++++++
block/cfq-iosched.c | 17 +++++++++++++++++
5 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 6ba1a8e..e20fbde 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -90,6 +90,15 @@ config BLK_CGROUP
control disk bandwidth allocation (proportional time slice allocation)
to such task groups.

+config DEBUG_BLK_CGROUP
+ bool
+ depends on BLK_CGROUP
+ default n
+ ---help---
+ Enable some debugging help. Currently it stores the cgroup path
+ in the blk group which can be used by cfq for tracing various
+ group related activity.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fa95fa7..b71abfb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,15 @@ config CFQ_GROUP_IOSCHED
---help---
Enable group IO scheduling in CFQ.

+config DEBUG_CFQ_IOSCHED
+ bool "Debug CFQ Scheduling"
+ depends on CFQ_GROUP_IOSCHED
+ select DEBUG_BLK_CGROUP
+ default n
+ ---help---
+ Enable CFQ IO scheduling debugging in CFQ. Currently it makes
+ blktrace output more verbose.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0426ab6..6bc99a3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -33,6 +33,10 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Need to take css reference ? */
+ cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
+#endif
}

static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index cd50a2f..3573199 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -27,12 +27,25 @@ struct blkio_group {
void *key;
struct hlist_node blkcg_node;
unsigned short blkcg_id;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Store cgroup path */
+ char path[128];
+#endif
};

#define BLKIO_WEIGHT_MIN 100
#define BLKIO_WEIGHT_MAX 1000
#define BLKIO_WEIGHT_DEFAULT 500

+#ifdef CONFIG_DEBUG_BLK_CGROUP
+static inline char *blkg_path(struct blkio_group *blkg)
+{
+ return blkg->path;
+}
+#else
+static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+#endif
+
#ifdef CONFIG_BLK_CGROUP
extern struct blkio_cgroup blkio_root_cgroup;
extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index de21882..6b30a6b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -331,8 +331,21 @@ CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS

+#ifdef CONFIG_DEBUG_CFQ_IOSCHED
+#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
+ cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&(cfqq)->cfqg->blkg), ##args);
+
+#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "%s " fmt, \
+ blkg_path(&(cfqg)->blkg), ##args); \
+
+#else
#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do{}while(0);
+#endif
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

@@ -796,6 +809,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
if (cfqg->nr_cfqq)
return;

+ cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
cfqg->on_st = false;
cfqd->nr_groups--;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
@@ -849,6 +863,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_serving_prio = cfqd->serving_prio;
} else
cfqg->saved_workload_slice = 0;
+
+ cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
+ st->min_vdisktime);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
--
1.6.2.5

2009-11-12 23:46:11

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 12/16] blkio: Export disk time and sectors used by a group to user space

o Export disk time and sector used by a group to user space through cgroup
interface.

o Also export a "dequeue" interface to cgroup which keeps track of how many
a times a group was deleted from service tree. Helps in debugging.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 22 ++++++++++++++++-
block/cfq-iosched.c | 23 ++++++++++++++----
3 files changed, 101 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6bc99a3..4ef78d3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -11,6 +11,8 @@
* Nauman Rafique <[email protected]>
*/
#include <linux/ioprio.h>
+#include <linux/seq_file.h>
+#include <linux/kdev_t.h>
#include "blk-cgroup.h"

extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
@@ -23,8 +25,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
struct blkio_cgroup, css);
}

+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+ blkg->time += time;
+ blkg->sectors += sectors;
+}
+
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
{
unsigned long flags;

@@ -37,6 +46,7 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
/* Need to take css reference ? */
cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
#endif
+ blkg->dev = dev;
}

static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
@@ -115,12 +125,64 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
return 0;
}

+#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype, struct seq_file *m) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ struct blkio_group *blkg; \
+ struct hlist_node *n; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
+ if (blkg->dev) \
+ seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
+ MINOR(blkg->dev), blkg->__VAR); \
+ } \
+ rcu_read_unlock(); \
+ cgroup_unlock(); \
+ return 0; \
+}
+
+SHOW_FUNCTION_PER_GROUP(time);
+SHOW_FUNCTION_PER_GROUP(sectors);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+SHOW_FUNCTION_PER_GROUP(dequeue);
+#endif
+#undef SHOW_FUNCTION_PER_GROUP
+
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue)
+{
+ blkg->dequeue += dequeue;
+}
+#endif
+
struct cftype blkio_files[] = {
{
.name = "weight",
.read_u64 = blkiocg_weight_read,
.write_u64 = blkiocg_weight_write,
},
+ {
+ .name = "time",
+ .read_seq_string = blkiocg_time_read,
+ },
+ {
+ .name = "sectors",
+ .read_seq_string = blkiocg_sectors_read,
+ },
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ {
+ .name = "dequeue",
+ .read_seq_string = blkiocg_dequeue_read,
+ },
+#endif
};

static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 3573199..b24ab71 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -30,7 +30,15 @@ struct blkio_group {
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Store cgroup path */
char path[128];
+ /* How many times this group has been removed from service tree */
+ unsigned long dequeue;
#endif
+ /* The device MKDEV(major, minor), this group has been created for */
+ dev_t dev;
+
+ /* total disk time and nr sectors dispatched by this group */
+ unsigned long time;
+ unsigned long sectors;
};

#define BLKIO_WEIGHT_MIN 100
@@ -42,24 +50,30 @@ static inline char *blkg_path(struct blkio_group *blkg)
{
return blkg->path;
}
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue);
#else
static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+static inline void blkiocg_update_blkio_group_dequeue_stats(
+ struct blkio_group *blkg, unsigned long dequeue) {}
#endif

#ifdef CONFIG_BLK_CGROUP
extern struct blkio_cgroup blkio_root_cgroup;
extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key);
+ struct blkio_group *blkg, void *key, dev_t dev);
extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors);
#else
static inline struct blkio_cgroup *
cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }

static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
{
}

@@ -68,5 +82,9 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }

static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
+static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6b30a6b..24640f1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -141,6 +141,8 @@ struct cfq_queue {
struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
+ /* Sectors dispatched in current dispatch round */
+ unsigned long nr_sectors;
};

/*
@@ -815,6 +817,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
cfqg->saved_workload_slice = 0;
+ blkiocg_update_blkio_group_dequeue_stats(&cfqg->blkg, 1);
}

static inline unsigned long cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -841,12 +844,13 @@ static inline unsigned long cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%lu", slice_used);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%lu sect=%lu", slice_used,
+ cfqq->nr_sectors);
return slice_used;
}

static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
- unsigned long service)
+ unsigned long service, unsigned long sectors)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;

@@ -866,6 +870,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
+ blkiocg_update_blkio_group_stats(&cfqg->blkg, service, sectors);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -883,6 +888,8 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
struct cfq_group *cfqg = NULL;
void *key = cfqd;
int i, j;
+ unsigned int major, minor;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;

/* Do we need to take this reference */
if (!css_tryget(&blkcg->css))
@@ -913,7 +920,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
atomic_set(&cfqg->ref, 1);

/* Add group onto cgroup list */
- blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));

/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -1450,6 +1459,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->dispatch_start = jiffies;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1485,7 +1495,8 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
}

- cfq_group_served(cfqd, cfqq->cfqg, cfq_cfqq_slice_usage(cfqq));
+ cfq_group_served(cfqd, cfqq->cfqg, cfq_cfqq_slice_usage(cfqq),
+ cfqq->nr_sectors);

if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
cfq_del_cfqq_rr(cfqd, cfqq);
@@ -1798,6 +1809,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)

if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
}

/*
@@ -3492,7 +3504,8 @@ static void *cfq_init_queue(struct request_queue *q)
* to make sure that cfq_put_cfqg() does not try to kfree root group
*/
atomic_set(&cfqg->ref, 1);
- blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd,
+ 0);
#endif
/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5

2009-11-12 23:46:49

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 13/16] blkio: Provide some isolation between groups

o Do not allow following three operations across groups for isolation.
- selection of co-operating queues
- preemtpions across groups
- request merging across groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 24640f1..6081efe 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1432,6 +1432,9 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
struct cfq_io_context *cic;
struct cfq_queue *cfqq;

+ /* Deny merge if bio and rq don't belong to same cfq group */
+ if ((RQ_CFQQ(rq))->cfqg != cfq_get_cfqg(cfqd, 0))
+ return false;
/*
* Disallow merge of a sync bio into an async request.
*/
@@ -1677,6 +1680,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!cfqq)
return NULL;

+ /* If new queue belongs to different cfq_group, don't choose it */
+ if (cur_cfqq->cfqg != cfqq->cfqg)
+ return NULL;
+
/*
* It only makes sense to merge sync queues.
*/
@@ -2945,6 +2952,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (!cfqq)
return false;

+ if (new_cfqq->cfqg != cfqq->cfqg)
+ return false;
+
if (cfq_slice_used(cfqq))
return true;

--
1.6.2.5

2009-11-12 23:34:36

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 14/16] blkio: Idle on a group for some time on rotational media

o If a group is not continuously backlogged, then it will be deleted from
service tree and loose it share. For example, if a single random seeky
reader or a single sequential reader is running in group.

o One solution is to let group loose it share if it is not backlogged and
other solution is to wait a bit for the slow group so that it can get its
time slice. This patch implements waiting for a group to wait a bit.

o This waiting is disabled for NCQ SSDs.

o This patch also intorduces the tunable "group_idle" which can enable/disable
group idling manually.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 142 ++++++++++++++++++++++++++++++++++-----------------
1 files changed, 95 insertions(+), 47 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6081efe..3f7a350 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -264,6 +264,7 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
unsigned int cfq_latency;
+ unsigned int cfq_group_idle;

struct list_head cic_list;

@@ -873,6 +874,37 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
blkiocg_update_blkio_group_stats(&cfqg->blkg, service, sectors);
}

+/*
+ * Determine whether we should enforce idle window for this queue.
+ */
+
+static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ enum wl_prio_t prio = cfqq_prio(cfqq);
+ struct cfq_rb_root *service_tree = cfqq->service_tree;
+
+ /* We never do for idle class queues. */
+ if (prio == IDLE_WORKLOAD)
+ return false;
+
+ /* We do for queues that were marked with idle window flag. */
+ if (cfq_cfqq_idle_window(cfqq))
+ return true;
+
+ /*
+ * Otherwise, we do only if they are the last ones
+ * in their service tree.
+ */
+ if (!service_tree)
+ service_tree = service_tree_for(cfqq->cfqg, prio,
+ cfqq_type(cfqq), cfqd);
+
+ if (service_tree->count == 0)
+ return true;
+
+ return (service_tree->count == 1 && cfq_rb_first(service_tree) == cfqq);
+}
+
#ifdef CONFIG_CFQ_GROUP_IOSCHED
static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
{
@@ -1037,6 +1069,22 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
}

+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ /* Group idling is disabled */
+ if (!cfqq->cfqd->cfq_group_idle)
+ return false;
+
+ /* cfqq group still has got more requests to dispatch */
+ if (!RB_EMPTY_ROOT(&cfqq->sort_list) || cfqq->cfqg->nr_cfqq > 1)
+ return false;
+
+ if (!cfq_should_idle(cfqq->cfqd, cfqq))
+ return false;
+
+ return true;
+}
+
#else /* GROUP_IOSCHED */
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
@@ -1049,6 +1097,10 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {

static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ return false;
+}

#endif /* GROUP_IOSCHED */

@@ -1701,51 +1753,24 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
return cfqq;
}

-/*
- * Determine whether we should enforce idle window for this queue.
- */
-
-static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- enum wl_prio_t prio = cfqq_prio(cfqq);
- struct cfq_rb_root *service_tree = cfqq->service_tree;
-
- /* We never do for idle class queues. */
- if (prio == IDLE_WORKLOAD)
- return false;
-
- /* We do for queues that were marked with idle window flag. */
- if (cfq_cfqq_idle_window(cfqq))
- return true;
-
- /*
- * Otherwise, we do only if they are the last ones
- * in their service tree.
- */
- if (!service_tree)
- service_tree = service_tree_for(cfqq->cfqg, prio,
- cfqq_type(cfqq), cfqd);
-
- if (service_tree->count == 0)
- return true;
-
- return (service_tree->count == 1 && cfq_rb_first(service_tree) == cfqq);
-}
-
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int wait_busy)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
unsigned long sl;
struct cfq_rb_root *st;

+ /* If idle timer is already armed, nothing to do */
+ if (wait_busy && timer_pending(&cfqd->idle_slice_timer))
+ return true;
+
/*
* SSD device without seek penalty, disable idling. But only do so
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
+ return false;

WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
WARN_ON(cfq_cfqq_slice_new(cfqq));
@@ -1754,29 +1779,29 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* idle is disabled, either manually or by past process history
*/
if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
- return;
+ return false;

/*
* still requests with the driver, don't idle
*/
- if (rq_in_driver(cfqd))
- return;
+ if (rq_in_driver(cfqd) && !wait_busy)
+ return false;

/*
* task has exited, don't wait
*/
cic = cfqd->active_cic;
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
- return;
+ return false;

/*
* If our average think time is larger than the remaining time
* slice, then don't idle. This avoids overrunning the allotted
* time slice.
*/
- if (sample_valid(cic->ttime_samples) &&
+ if (!wait_busy && sample_valid(cic->ttime_samples) &&
(cfqq->slice_end - jiffies < cic->ttime_mean))
- return;
+ return false;

cfq_mark_cfqq_wait_request(cfqq);

@@ -1789,14 +1814,19 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
*/
st = service_tree_for(cfqq->cfqg, cfqd->serving_prio,
SYNC_NOIDLE_WORKLOAD, cfqd);
- if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && st->count > 0) {
+ if (!wait_busy && cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
+ && st->count > 0) {
if (blk_queue_nonrot(cfqd->queue) || cfqd->hw_tag)
- return;
+ return false;
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
}

+ if (wait_busy)
+ sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+
mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu wait_busy=%d", sl, wait_busy);
+ return true;
}

/*
@@ -2053,6 +2083,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)

if (!cfqd->rq_queued)
return NULL;
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -2091,6 +2122,16 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
}

expire:
+ /*
+ * Wait for a group to get busy before we expire it. No wait
+ * is done for NCQ SSDs. Do a small wait of 2ms on rotational
+ * media in the hope that group will get backlogged again and
+ * not loose its fair share.
+ */
+ if (cfqq_should_wait_busy(cfqq) && cfq_arm_slice_timer(cfqd, 1)) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
cfq_slice_expired(cfqd, 0);
new_queue:
/*
@@ -3057,9 +3098,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
cfqd->busy_queues > 1) {
del_timer(&cfqd->idle_slice_timer);
- __blk_run_queue(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
+ __blk_run_queue(cfqd->queue);
+ } else
+ cfq_mark_cfqq_must_dispatch(cfqq);
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
@@ -3169,10 +3210,13 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* of idling.
*/
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
+ if (!cfqq_should_wait_busy(cfqq))
+ cfq_slice_expired(cfqd, 0);
+ else
+ cfq_arm_slice_timer(cfqd, 1);
else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq) &&
sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
+ cfq_arm_slice_timer(cfqd, 0);
}

if (!rq_in_driver(cfqd))
@@ -3554,6 +3598,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_latency = 1;
+ cfqd->cfq_group_idle = 1;
cfqd->hw_tag = 1;
cfqd->last_end_sync_rq = jiffies;
return cfqd;
@@ -3624,6 +3669,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -3656,6 +3702,7 @@ STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, 1, 0);
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -3672,6 +3719,7 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
CFQ_ATTR(low_latency),
+ CFQ_ATTR(group_idle),
__ATTR_NULL
};

--
1.6.2.5

2009-11-12 23:34:53

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 15/16] blkio: Drop the reference to queue once the task changes cgroup

o If a task changes cgroup, drop reference to the cfqq associated with io
context and set cfqq pointer stored in ioc to NULL so that upon next request
arrival we will allocate a new queue in new group.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3f7a350..34f029c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2648,6 +2648,41 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ if (sync_cfqq) {
+ /*
+ * Drop reference to sync queue. A new sync queue will be
+ * assigned in new group upon arrival of a fresh request.
+ */
+ cfq_log_cfqq(cfqd, sync_cfqq, "changed cgroup");
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_CFQ_GROUP_IOSCHED */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -2880,6 +2915,10 @@ out:
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
--
1.6.2.5

2009-11-12 23:34:07

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 16/16] blkio: Propagate cgroup weight updation to cfq groups

o Propagate blkio cgroup weight updation to associated cfq groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 7 +++++++
block/cfq-iosched.c | 6 ++++++
2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4ef78d3..179ddfa 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -16,6 +16,7 @@
#include "blk-cgroup.h"

extern void cfq_unlink_blkio_group(void *, struct blkio_group *);
+extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);

struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };

@@ -116,12 +117,18 @@ static int
blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
{
struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;

if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
return -EINVAL;

blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
blkcg->weight = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_weight(blkg, blkcg->weight);
+ spin_unlock_irq(&blkcg->lock);
return 0;
}

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 34f029c..4e42b2e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -913,6 +913,12 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
return NULL;
}

+void
+cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
+{
+ cfqg_of_blkg(blkg)->weight = weight;
+}
+
static struct cfq_group *
cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
{
--
1.6.2.5

2009-11-13 10:40:01

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 03/16] blkio: Keep queue on service tree until we expire it

Hi Vivek,
the following is probably not on a critical path, but maybe it can be
written more efficiently.
Now, it will cicle through all service trees multiple times, to
retrieve the queues for the last one.
What about having a cfq_for_each_queue that takes a function pointer
and will apply it to all of them?

On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> +static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> +{
> +       struct cfq_group *cfqg = &cfqd->root_group;
> +       struct cfq_queue *cfqq;
> +       int i, j;
> +
> +       if (!cfqd->rq_queued)
> +               return NULL;
> +
> +       for (i = 0; i < 2; ++i) {
> +               for (j = 0; j < 3; ++j) {
> +                       cfqq = cfq_rb_first(&cfqg->service_trees[i][j]);
> +                       if (cfqq)
> +                               return cfqq;
> +               }
> +       }
> +
> +       cfqq = cfq_rb_first(&cfqg->service_tree_idle);
> +       if (cfqq)
> +               return cfqq;
> +       return NULL;
> +}
> +
>  /*
>  * Get and set a new active queue for service.
>  */
> @@ -1590,16 +1633,8 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>  {
>        struct cfq_queue *cfqq;
>        int dispatched = 0;
> -       int i, j;
> -       struct cfq_group *cfqg = &cfqd->root_group;
> -
> -       for (i = 0; i < 2; ++i)
> -               for (j = 0; j < 3; ++j)
> -                       while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
> -                               != NULL)
> -                               dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>
> -       while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
> +       while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
>                dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>
>        cfq_slice_expired(cfqd, 0);
> @@ -1770,13 +1805,14 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
>        cfq_log_cfqq(cfqd, cfqq, "put_queue");
>        BUG_ON(rb_first(&cfqq->sort_list));
>        BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
> -       BUG_ON(cfq_cfqq_on_rr(cfqq));
>
>        if (unlikely(cfqd->active_queue == cfqq)) {
>                __cfq_slice_expired(cfqd, cfqq, 0);
>                cfq_schedule_dispatch(cfqd);
>        }
>
> +       BUG_ON(cfq_cfqq_on_rr(cfqq));
> +
>        kmem_cache_free(cfq_pool, cfqq);
>  }
>
> --
> 1.6.2.5
>



Thanks
Corrado

2009-11-13 10:46:48

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:>  static inline void> @@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)>        if (cfqd->cfq_latency) {>                /* interested queues (we consider only the ones with the same>                 * priority class) */This comment needs to be updated> -               unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));> +               unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,> +                                               cfq_class_rt(cfqq));>                unsigned sync_slice = cfqd->cfq_slice[1];>                unsigned expect_latency = sync_slice * iq;> -               if (expect_latency > cfq_target_latency) {> +               unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
I'm not sure that we should divide the target latency evenly among groups.Groups with different weights will have different percentage of timein each 300ms round, so probably we should consider it here.
> +> +               if (expect_latency > group_target_lat) {>                        unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;>                        /* scale low_slice according to IO priority>                         * and sync vs async */> @@ -452,7 +459,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)>                                min(slice, base_low_slice * slice / sync_slice);>                        /* the adapted slice value is scaled to fit all iqs>                         * into the target latency */> -                       slice = max(slice * cfq_target_latency / expect_latency,> +                       slice = max(slice * group_target_lat / expect_latency,>                                    low_slice);>                }>        }
Thanks,Corrado????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2009-11-13 10:48:09

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 03/16] blkio: Keep queue on service tree until we expire it

On Fri, Nov 13 2009, Corrado Zoccolo wrote:
> Hi Vivek,
> the following is probably not on a critical path, but maybe it can be
> written more efficiently.
> Now, it will cicle through all service trees multiple times, to
> retrieve the queues for the last one.
> What about having a cfq_for_each_queue that takes a function pointer
> and will apply it to all of them?

I thought the same when reading this. Or perhaps have a small bitmap
instead/with the counter to avoid looping around known empty groups.

I want to start merging the initial prep patches soon, so we can both
cut back on the size of your patchset while getting the simpler bits in.
Unfortunately I had to stop here, but with a few corrections it can be
merged too.

>
> On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> > +static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> > +{
> > + ? ? ? struct cfq_group *cfqg = &cfqd->root_group;
> > + ? ? ? struct cfq_queue *cfqq;
> > + ? ? ? int i, j;
> > +
> > + ? ? ? if (!cfqd->rq_queued)
> > + ? ? ? ? ? ? ? return NULL;
> > +
> > + ? ? ? for (i = 0; i < 2; ++i) {
> > + ? ? ? ? ? ? ? for (j = 0; j < 3; ++j) {
> > + ? ? ? ? ? ? ? ? ? ? ? cfqq = cfq_rb_first(&cfqg->service_trees[i][j]);
> > + ? ? ? ? ? ? ? ? ? ? ? if (cfqq)
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return cfqq;
> > + ? ? ? ? ? ? ? }
> > + ? ? ? }
> > +
> > + ? ? ? cfqq = cfq_rb_first(&cfqg->service_tree_idle);
> > + ? ? ? if (cfqq)
> > + ? ? ? ? ? ? ? return cfqq;
> > + ? ? ? return NULL;
> > +}
> > +
> > ?/*
> > ?* Get and set a new active queue for service.
> > ?*/
> > @@ -1590,16 +1633,8 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> > ?{
> > ? ? ? ?struct cfq_queue *cfqq;
> > ? ? ? ?int dispatched = 0;
> > - ? ? ? int i, j;
> > - ? ? ? struct cfq_group *cfqg = &cfqd->root_group;
> > -
> > - ? ? ? for (i = 0; i < 2; ++i)
> > - ? ? ? ? ? ? ? for (j = 0; j < 3; ++j)
> > - ? ? ? ? ? ? ? ? ? ? ? while ((cfqq = cfq_rb_first(&cfqg->service_trees[i][j]))
> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? != NULL)
> > - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> >
> > - ? ? ? while ((cfqq = cfq_rb_first(&cfqg->service_tree_idle)) != NULL)
> > + ? ? ? while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
> > ? ? ? ? ? ? ? ?dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> >
> > ? ? ? ?cfq_slice_expired(cfqd, 0);
> > @@ -1770,13 +1805,14 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
> > ? ? ? ?cfq_log_cfqq(cfqd, cfqq, "put_queue");
> > ? ? ? ?BUG_ON(rb_first(&cfqq->sort_list));
> > ? ? ? ?BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
> > - ? ? ? BUG_ON(cfq_cfqq_on_rr(cfqq));
> >
> > ? ? ? ?if (unlikely(cfqd->active_queue == cfqq)) {
> > ? ? ? ? ? ? ? ?__cfq_slice_expired(cfqd, cfqq, 0);
> > ? ? ? ? ? ? ? ?cfq_schedule_dispatch(cfqd);
> > ? ? ? ?}
> >
> > + ? ? ? BUG_ON(cfq_cfqq_on_rr(cfqq));
> > +
> > ? ? ? ?kmem_cache_free(cfq_pool, cfqq);
> > ?}
> >
> > --
> > 1.6.2.5
> >
>
>
>
> Thanks
> Corrado

--
Jens Axboe

2009-11-13 10:48:30

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 01/16] blkio: Documentation

On Thu, Nov 12 2009, Vivek Goyal wrote:
> Signed-off-by: Vivek Goyal <[email protected]>

You should put this at the end for the next posting, no point in having
documentation without the code.

--
Jens Axboe

2009-11-13 10:58:53

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 14/16] blkio: Idle on a group for some time on rotational media

Hi Vivek,
On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> o If a group is not continuously backlogged, then it will be deleted from
>  service tree and loose it share. For example, if a single random seeky
>  reader or a single sequential reader is running in group.
>
Without groups, a single sequential reader would already have its 10ms
idle slice, and a single random reader on the noidle service tree
would have its 2ms idle, before switching to a new workload. Were
those removed in the previous patches (and this patches re-enable
them), or this introduces an additional idle between groups?

> o One solution is to let group loose it share if it is not backlogged and
>  other solution is to wait a bit for the slow group so that it can get its
>  time slice. This patch implements waiting for a group to wait a bit.
>
> o This waiting is disabled for NCQ SSDs.
>
> o This patch also intorduces the tunable "group_idle" which can enable/disable
>  group idling manually.
>
> Signed-off-by: Vivek Goyal <[email protected]>

Thanks,
Corrado

2009-11-13 15:05:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 03/16] blkio: Keep queue on service tree until we expire it

On Fri, Nov 13, 2009 at 11:48:08AM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, Corrado Zoccolo wrote:
> > Hi Vivek,
> > the following is probably not on a critical path, but maybe it can be
> > written more efficiently.
> > Now, it will cicle through all service trees multiple times, to
> > retrieve the queues for the last one.
> > What about having a cfq_for_each_queue that takes a function pointer
> > and will apply it to all of them?
>
> I thought the same when reading this. Or perhaps have a small bitmap
> instead/with the counter to avoid looping around known empty groups.

Hi Jens,

Empty groups will not be on the group service tree. Only exception is
the group belonging to current active queue where queue might be empty
but still on service tree because we might be idling on it.

So we don't have to worry about looping through empty groups.

>
> I want to start merging the initial prep patches soon, so we can both
> cut back on the size of your patchset while getting the simpler bits in.
> Unfortunately I had to stop here, but with a few corrections it can be
> merged too.

Now we are left with the issue of looping through various empty service
trees with-in group. We already keep a count of busy queues in each
service tree. One simple way is to just check for that count and if tree
is empty, simply return back. I have modified cfq_rb_first() to check for
count first before it tries to do rb_next().

Please let me know if you think that this is not good enough and we need
to maintain a bit map of busy service trees in a group.


o If service tree is emtpy, return NULL.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 4 ++++
1 file changed, 4 insertions(+)

Index: linux7/block/cfq-iosched.c
===================================================================
--- linux7.orig/block/cfq-iosched.c 2009-11-12 18:02:03.000000000 -0500
+++ linux7/block/cfq-iosched.c 2009-11-13 09:48:38.000000000 -0500
@@ -664,6 +664,10 @@ cfq_choose_req(struct cfq_data *cfqd, st
*/
static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
{
+ /* Service tree is empty */
+ if (!root->count)
+ return NULL;
+
if (!root->left)
root->left = rb_first(&root->rb);

2009-11-13 15:19:02

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 11:46:49AM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> > ?static inline void
> > @@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > ? ? ? ?if (cfqd->cfq_latency) {
> > ? ? ? ? ? ? ? ?/* interested queues (we consider only the ones with the same
> > ? ? ? ? ? ? ? ? * priority class) */
> This comment needs to be updated

Sure. Will do. Now the interested queues are the one with same priority
class with-in group.

> > ? ? ? ? ? ? ? ? * priority class) */
> > - ? ? ? ? ? ? ? unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
> > + ? ? ? ? ? ? ? unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfq_class_rt(cfqq));
> > ? ? ? ? ? ? ? ?unsigned sync_slice = cfqd->cfq_slice[1];
> > ? ? ? ? ? ? ? ?unsigned expect_latency = sync_slice * iq;
> > - ? ? ? ? ? ? ? if (expect_latency > cfq_target_latency) {
> > + ? ? ? ? ? ? ? unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
>
> I'm not sure that we should divide the target latency evenly among groups.
> Groups with different weights will have different percentage of time
> in each 300ms round, so probably we should consider it here.
>

Taking group weight into account will be more precise thing. So may be
I can keep track of total weight on the service tree and determine
group target latency as proportion of total weight.

group_target_lat = group_weight * cfq_target_latency/total_weight_of_groups

Thanks
Vivek

> > +
> > + ? ? ? ? ? ? ? if (expect_latency > group_target_lat) {
> > ? ? ? ? ? ? ? ? ? ? ? ?unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
> > ? ? ? ? ? ? ? ? ? ? ? ?/* scale low_slice according to IO priority
> > ? ? ? ? ? ? ? ? ? ? ? ? * and sync vs async */
> > @@ -452,7 +459,7 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?min(slice, base_low_slice * slice / sync_slice);
> > ? ? ? ? ? ? ? ? ? ? ? ?/* the adapted slice value is scaled to fit all iqs
> > ? ? ? ? ? ? ? ? ? ? ? ? * into the target latency */
> > - ? ? ? ? ? ? ? ? ? ? ? slice = max(slice * cfq_target_latency / expect_latency,
> > + ? ? ? ? ? ? ? ? ? ? ? slice = max(slice * group_target_lat / expect_latency,
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?low_slice);
> > ? ? ? ? ? ? ? ?}
> > ? ? ? ?}
>
> Thanks,
> Corrado

2009-11-13 15:19:12

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 01/16] blkio: Documentation

On Fri, Nov 13, 2009 at 11:48:34AM +0100, Jens Axboe wrote:
> On Thu, Nov 12 2009, Vivek Goyal wrote:
> > Signed-off-by: Vivek Goyal <[email protected]>
>
> You should put this at the end for the next posting, no point in having
> documentation without the code.
>

Sure. Will do that in next posting.

Thanks
Vivek

> --
> Jens Axboe

2009-11-13 15:37:53

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 14/16] blkio: Idle on a group for some time on rotational media

On Fri, Nov 13, 2009 at 11:58:53AM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> > o If a group is not continuously backlogged, then it will be deleted from
> > ?service tree and loose it share. For example, if a single random seeky
> > ?reader or a single sequential reader is running in group.
> >
> Without groups, a single sequential reader would already have its 10ms
> idle slice, and a single random reader on the noidle service tree
> would have its 2ms idle, before switching to a new workload. Were
> those removed in the previous patches (and this patches re-enable
> them), or this introduces an additional idle between groups?
>

Previous patches were based on 2.6.32-rc5 did not have the concept of idling
on no-idle group.

Group idling can be thought of an additional idling if group is empty and
and for some reason CFQ did not decide to idle on the queue (because slice
has expired). Even if cfqq slice has expired, we want to wait a bit to
make sure that this group gets backlogged and does not get deleted from
service tree so that it continues to get its fair share.

It helps in following circumstances. I have got a NCQ enabled rotational
disk which roughly does 90MB/s for buffered reads. If I launch two
sequential readers in two group of weight 400 and 200 and first group
should get double the disk time of second group. But after every slice
expiry the group gets deleted and looses share. Hence this extra idle on
group helps (if we already did not decide to idle on queue).

Having said that, I understand that idling can hurt in total throughput
if there is a NCQ enabled fast storage array. But it does not necessarily
hurt if there is a single NCQ enabled rotational disk. So as we move along
we need to figure out a way when to idle to achieve fairness and where we
let the fairness go becuase it hurts too much.

That's the reason I introduced the "group_idle" tunable so that one can
disable group_idle manually. This will also help us analyze when exactly
does it make sense to wait for slow groups to catch up or when it does
not.

So in summary, group_idling is an extra idle period we wait on group if
it is empty and we did not decide to idle on the cfqq. This is an effort
to make the group backlogged again so that it does not get deleted from
service tree and does not loose its share. This idling is disabled on
NCQ SSDs. Now only case left is fast storage arrays with rotational disk
and we need to figure out when to idle and when not to. Currently CFQ
seems to be idling on even fast storage array for sequential tasks and
this hurts if that sequential task is doing direct IO and not utilizing
the full capacity of the array.

Thanks
Vivek

> > o One solution is to let group loose it share if it is not backlogged and
> > ?other solution is to wait a bit for the slow group so that it can get its
> > ?time slice. This patch implements waiting for a group to wait a bit.
> >
> > o This waiting is disabled for NCQ SSDs.
> >
> > o This patch also intorduces the tunable "group_idle" which can enable/disable
> > ?group idling manually.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
>
> Thanks,
> Corrado

2009-11-13 16:15:32

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 10:18:15AM -0500, Vivek Goyal wrote:
> On Fri, Nov 13, 2009 at 11:46:49AM +0100, Corrado Zoccolo wrote:
> > On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> > > ?static inline void
> > > @@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > > ? ? ? ?if (cfqd->cfq_latency) {
> > > ? ? ? ? ? ? ? ?/* interested queues (we consider only the ones with the same
> > > ? ? ? ? ? ? ? ? * priority class) */
> > This comment needs to be updated
>
> Sure. Will do. Now the interested queues are the one with same priority
> class with-in group.
>
> > > ? ? ? ? ? ? ? ? * priority class) */
> > > - ? ? ? ? ? ? ? unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
> > > + ? ? ? ? ? ? ? unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
> > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfq_class_rt(cfqq));
> > > ? ? ? ? ? ? ? ?unsigned sync_slice = cfqd->cfq_slice[1];
> > > ? ? ? ? ? ? ? ?unsigned expect_latency = sync_slice * iq;
> > > - ? ? ? ? ? ? ? if (expect_latency > cfq_target_latency) {
> > > + ? ? ? ? ? ? ? unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
> >
> > I'm not sure that we should divide the target latency evenly among groups.
> > Groups with different weights will have different percentage of time
> > in each 300ms round, so probably we should consider it here.
> >
>
> Taking group weight into account will be more precise thing. So may be
> I can keep track of total weight on the service tree and determine
> group target latency as proportion of total weight.
>
> group_target_lat = group_weight * cfq_target_latency/total_weight_of_groups
>

Here is the patch I generated on top of all the patches in series.

o Determine group target latency in proportion to group weight instead of
just number of groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

Index: linux7/block/cfq-iosched.c
===================================================================
--- linux7.orig/block/cfq-iosched.c 2009-11-13 09:48:38.000000000 -0500
+++ linux7/block/cfq-iosched.c 2009-11-13 11:06:22.000000000 -0500
@@ -81,6 +81,7 @@ struct cfq_rb_root {
unsigned count;
u64 min_vdisktime;
struct rb_node *active;
+ unsigned total_weight;
};
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }

@@ -521,18 +522,28 @@ static inline unsigned cfq_group_get_avg
return cfqg->busy_queues_avg[rt];
}

+static inline unsigned
+cfq_group_latency(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+
+ return cfq_target_latency * cfqg->weight / st->total_weight;
+}
+
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
if (cfqd->cfq_latency) {
- /* interested queues (we consider only the ones with the same
- * priority class) */
+ /*
+ * interested queues (we consider only the ones with the same
+ * priority class in the cfq group)
+ */
unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
cfq_class_rt(cfqq));
unsigned sync_slice = cfqd->cfq_slice[1];
unsigned expect_latency = sync_slice * iq;
- unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
+ unsigned group_target_lat = cfq_group_latency(cfqd, cfqq->cfqg);

if (expect_latency > group_target_lat) {
unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
@@ -799,6 +810,7 @@ cfq_group_service_tree_add(struct cfq_da
__cfq_group_service_tree_add(st, cfqg);
cfqg->on_st = true;
cfqd->nr_groups++;
+ st->total_weight += cfqg->weight;
}

static void
@@ -819,6 +831,7 @@ cfq_group_service_tree_del(struct cfq_da
cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
cfqg->on_st = false;
cfqd->nr_groups--;
+ st->total_weight -= cfqg->weight;
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
cfqg->saved_workload_slice = 0;
@@ -2033,7 +2046,7 @@ static void choose_service_tree(struct c
* proportional to the number of queues in that workload, over
* all the queues in the same priority class
*/
- group_target_latency = cfq_target_latency/cfqd->nr_groups;
+ group_target_latency = cfq_group_latency(cfqd, cfqg);

slice = group_target_latency * count /
max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],

2009-11-13 18:40:52

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 5:15 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Nov 13, 2009 at 10:18:15AM -0500, Vivek Goyal wrote:
>> On Fri, Nov 13, 2009 at 11:46:49AM +0100, Corrado Zoccolo wrote:
>> > On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
>> > >  static inline void
>> > > @@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> > >        if (cfqd->cfq_latency) {
>> > >                /* interested queues (we consider only the ones with the same
>> > >                 * priority class) */
>> > This comment needs to be updated
>>
>> Sure. Will do. Now the interested queues are the one with same priority
>> class with-in group.
>>
>> > >                 * priority class) */
>> > > -               unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
>> > > +               unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
>> > > +                                               cfq_class_rt(cfqq));
>> > >                unsigned sync_slice = cfqd->cfq_slice[1];
>> > >                unsigned expect_latency = sync_slice * iq;
>> > > -               if (expect_latency > cfq_target_latency) {
>> > > +               unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
>> >
>> > I'm not sure that we should divide the target latency evenly among groups.
>> > Groups with different weights will have different percentage of time
>> > in each 300ms round, so probably we should consider it here.
>> >
>>
>> Taking group weight into account will be more precise thing. So may be
>> I can keep track of total weight on the service tree and determine
>> group target latency as proportion of total weight.
>>
>>  group_target_lat = group_weight * cfq_target_latency/total_weight_of_groups
>>
>
> Here is the patch I generated on top of all the patches in series.
>
> o Determine group target latency in proportion to group weight instead of
>  just number of groups.

Great.
I have only one concern, regarding variable naming:
group_target_lat is a bit misleading. The fact is that it will be
larger for higher weight groups, so people could ask why are you
giving more latency to higher weight group...
Actually, it is the group share of the scheduling round, so you should
name it accordingly.

>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
>  block/cfq-iosched.c |   21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> Index: linux7/block/cfq-iosched.c
> ===================================================================
> --- linux7.orig/block/cfq-iosched.c     2009-11-13 09:48:38.000000000 -0500
> +++ linux7/block/cfq-iosched.c  2009-11-13 11:06:22.000000000 -0500
> @@ -81,6 +81,7 @@ struct cfq_rb_root {
>        unsigned count;
>        u64 min_vdisktime;
>        struct rb_node *active;
> +       unsigned total_weight;
>  };
>  #define CFQ_RB_ROOT    (struct cfq_rb_root) { RB_ROOT, NULL, 0, 0, }
>
> @@ -521,18 +522,28 @@ static inline unsigned cfq_group_get_avg
>        return cfqg->busy_queues_avg[rt];
>  }
>
> +static inline unsigned
> +cfq_group_latency(struct cfq_data *cfqd, struct cfq_group *cfqg)
> +{
> +       struct cfq_rb_root *st = &cfqd->grp_service_tree;
> +
> +       return cfq_target_latency * cfqg->weight / st->total_weight;
> +}
> +
>  static inline void
>  cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>        unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>        if (cfqd->cfq_latency) {
> -               /* interested queues (we consider only the ones with the same
> -                * priority class) */
> +               /*
> +                * interested queues (we consider only the ones with the same
> +                * priority class in the cfq group)
> +                */
>                unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
>                                                cfq_class_rt(cfqq));
>                unsigned sync_slice = cfqd->cfq_slice[1];
>                unsigned expect_latency = sync_slice * iq;
> -               unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
> +               unsigned group_target_lat = cfq_group_latency(cfqd, cfqq->cfqg);
>
>                if (expect_latency > group_target_lat) {
>                        unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
> @@ -799,6 +810,7 @@ cfq_group_service_tree_add(struct cfq_da
>        __cfq_group_service_tree_add(st, cfqg);
>        cfqg->on_st = true;
>        cfqd->nr_groups++;
> +       st->total_weight += cfqg->weight;
>  }
>
>  static void
> @@ -819,6 +831,7 @@ cfq_group_service_tree_del(struct cfq_da
>        cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
>        cfqg->on_st = false;
>        cfqd->nr_groups--;
> +       st->total_weight -= cfqg->weight;
>        if (!RB_EMPTY_NODE(&cfqg->rb_node))
>                cfq_rb_erase(&cfqg->rb_node, st);
>        cfqg->saved_workload_slice = 0;
> @@ -2033,7 +2046,7 @@ static void choose_service_tree(struct c
>         * proportional to the number of queues in that workload, over
>         * all the queues in the same priority class
>         */
> -       group_target_latency = cfq_target_latency/cfqd->nr_groups;
> +       group_target_latency = cfq_group_latency(cfqd, cfqg);
>
>        slice = group_target_latency * count /
>                max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-11-13 18:44:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 03/16] blkio: Keep queue on service tree until we expire it

On Fri, Nov 13 2009, Vivek Goyal wrote:
> On Fri, Nov 13, 2009 at 11:48:08AM +0100, Jens Axboe wrote:
> > On Fri, Nov 13 2009, Corrado Zoccolo wrote:
> > > Hi Vivek,
> > > the following is probably not on a critical path, but maybe it can be
> > > written more efficiently.
> > > Now, it will cicle through all service trees multiple times, to
> > > retrieve the queues for the last one.
> > > What about having a cfq_for_each_queue that takes a function pointer
> > > and will apply it to all of them?
> >
> > I thought the same when reading this. Or perhaps have a small bitmap
> > instead/with the counter to avoid looping around known empty groups.
>
> Hi Jens,
>
> Empty groups will not be on the group service tree. Only exception is
> the group belonging to current active queue where queue might be empty
> but still on service tree because we might be idling on it.
>
> So we don't have to worry about looping through empty groups.

Looks ok, the forced dispatch is less of an issue.

> > I want to start merging the initial prep patches soon, so we can both
> > cut back on the size of your patchset while getting the simpler bits in.
> > Unfortunately I had to stop here, but with a few corrections it can be
> > merged too.
>
> Now we are left with the issue of looping through various empty service
> trees with-in group. We already keep a count of busy queues in each
> service tree. One simple way is to just check for that count and if tree
> is empty, simply return back. I have modified cfq_rb_first() to check for
> count first before it tries to do rb_next().
>
> Please let me know if you think that this is not good enough and we need
> to maintain a bit map of busy service trees in a group.

I think so, thanks!

--
Jens Axboe

2009-11-13 19:27:15

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 07:40:51PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 13, 2009 at 5:15 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Nov 13, 2009 at 10:18:15AM -0500, Vivek Goyal wrote:
> >> On Fri, Nov 13, 2009 at 11:46:49AM +0100, Corrado Zoccolo wrote:
> >> > On Fri, Nov 13, 2009 at 12:32 AM, Vivek Goyal <[email protected]> wrote:
> >> > > ?static inline void
> >> > > @@ -441,10 +445,13 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >> > > ? ? ? ?if (cfqd->cfq_latency) {
> >> > > ? ? ? ? ? ? ? ?/* interested queues (we consider only the ones with the same
> >> > > ? ? ? ? ? ? ? ? * priority class) */
> >> > This comment needs to be updated
> >>
> >> Sure. Will do. Now the interested queues are the one with same priority
> >> class with-in group.
> >>
> >> > > ? ? ? ? ? ? ? ? * priority class) */
> >> > > - ? ? ? ? ? ? ? unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
> >> > > + ? ? ? ? ? ? ? unsigned iq = cfq_group_get_avg_queues(cfqd, cfqq->cfqg,
> >> > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfq_class_rt(cfqq));
> >> > > ? ? ? ? ? ? ? ?unsigned sync_slice = cfqd->cfq_slice[1];
> >> > > ? ? ? ? ? ? ? ?unsigned expect_latency = sync_slice * iq;
> >> > > - ? ? ? ? ? ? ? if (expect_latency > cfq_target_latency) {
> >> > > + ? ? ? ? ? ? ? unsigned group_target_lat = cfq_target_latency/cfqd->nr_groups;
> >> >
> >> > I'm not sure that we should divide the target latency evenly among groups.
> >> > Groups with different weights will have different percentage of time
> >> > in each 300ms round, so probably we should consider it here.
> >> >
> >>
> >> Taking group weight into account will be more precise thing. So may be
> >> I can keep track of total weight on the service tree and determine
> >> group target latency as proportion of total weight.
> >>
> >> ?group_target_lat = group_weight * cfq_target_latency/total_weight_of_groups
> >>
> >
> > Here is the patch I generated on top of all the patches in series.
> >
> > o Determine group target latency in proportion to group weight instead of
> > ?just number of groups.
>
> Great.
> I have only one concern, regarding variable naming:
> group_target_lat is a bit misleading. The fact is that it will be
> larger for higher weight groups, so people could ask why are you
> giving more latency to higher weight group...
> Actually, it is the group share of the scheduling round, so you should
> name it accordingly.
>

How about "group_slice" ?

Thanks
Vivek

2009-11-13 19:38:05

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: [PATCH 05/16] blkio: Implement per cfq group latency target and busy queue avg

On Fri, Nov 13, 2009 at 8:26 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Nov 13, 2009 at 07:40:51PM +0100, Corrado Zoccolo wrote:
>> Great.
>> I have only one concern, regarding variable naming:
>> group_target_lat is a bit misleading. The fact is that it will be
>> larger for higher weight groups, so people could ask why are you
>> giving more latency to higher weight group...
>> Actually, it is the group share of the scheduling round, so you should
>> name it accordingly.
>>
>
> How about "group_slice" ?
If you haven't something similar in other parts of the code, it will be fine

Corrado
>
> Thanks
> Vivek
>
>