Hi All,
This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
A consolidated patch can be found here:
http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch
After the discussions at the IO minisummit in Tokyo, Japan, it was agreed that
a single IO control policy, implemented either at the leaf nodes or at higher
level nodes, does not meet all the requirements. We need the capability to
support more than one IO control policy (like proportional weight division and
max bandwidth control) and also the capability to implement some of these
policies at higher level logical devices.
It was agreed that CFQ is the right place to implement the time based
proportional weight division policy. Other policies like max bandwidth
control/throttling will make more sense at higher level logical devices.
This patchset introduces the blkio cgroup controller. It provides the
management interface for block IO control. The idea is to keep the interface
common and to switch policies in the background based on user options. Hence
the user can control IO throughout the IO stack with a single cgroup
interface.
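For example (illustrative only; the mount point and group name follow the
HOWTO in the documentation patch below), a management application could drive
this common interface directly from C:

#include <stdio.h>

int main(void)
{
	/* hypothetical example: set the weight of group "test1" */
	FILE *f = fopen("/cgroup/test1/blkio.weight", "w");

	if (!f) {
		perror("open blkio.weight");
		return 1;
	}
	fprintf(f, "%d\n", 1000);	/* allowed range is 100 to 1000 */
	fclose(f);
	return 0;
}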
Apart from the blkio cgroup interface, this patchset also modifies CFQ to
implement time based proportional weight division of disk time. CFQ already
does this in flat mode; it has been modified to do group IO scheduling as
well.
IO control is a huge problem, and the moment we start addressing all the
issues in one patchset, it bloats to unmanageable proportions and nothing gets
into the kernel. So at the IO minisummit we agreed to take small steps: once a
piece of code is inside the kernel and has stabilized, take the next step.
This is the first step.
Some parts of the code are based on BFQ patches posted by Paolo and Fabio.
Your feedback is welcome.
TODO
====
- Support async IO control (buffered writes).
  Buffered writes are a beast and require changes in many places, which makes
  the patchset huge. Hence we first plan to support control of sync IO only
  and then work on async IO as well.
  Some of the work items identified are:
  - Per memory cgroup dirty ratio
  - Possibly modification of writeback to force writeback from a
    particular cgroup.
  - Implement IO tracking support so that a bio can be mapped to a cgroup.
  - Per group request descriptor infrastructure in block layer.
  - At CFQ level, implement per cfq_group async queues.
  In this patchset, all async IO goes into system wide queues and there are
  no per group async queues. That means we will see service differentiation
  for sync IO only. Async IO will be handled later.
- Support for higher level policies like max BW controller.
Thanks
Vivek
Documentation/cgroups/blkio-controller.txt | 106 +++
block/Kconfig | 22 +
block/Kconfig.iosched | 17 +
block/Makefile | 1 +
block/blk-cgroup.c | 343 ++++++++
block/blk-cgroup.h | 67 ++
block/cfq-iosched.c | 1187 ++++++++++++++++++++++-----
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 4 +
9 files changed, 1533 insertions(+), 220 deletions(-)
Signed-off-by: Vivek Goyal <[email protected]>
---
Documentation/cgroups/blkio-controller.txt | 106 ++++++++++++++++++++++++++++
1 files changed, 106 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/blkio-controller.txt
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 0000000..dc8fb1a
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -0,0 +1,106 @@
+ Block IO Controller
+ ===================
+Overview
+========
+cgroup subsys "blkio" implements the block io controller. There seems to be
+a need for various kinds of IO control policies (like proportional BW, max BW)
+both at leaf nodes as well as at intermediate nodes in the storage hierarchy.
+The plan is to use the same cgroup based management interface for the blkio
+controller and switch IO policies in the background based on user options.
+
+In the first phase, this patchset implements a proportional weight, time
+based division of disk policy. It is implemented in CFQ. Hence this policy
+takes effect only on leaf nodes when CFQ is being used.
+
+HOWTO
+=====
+You can do a very simple test by running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable group scheduling in CFQ
+ CONFIG_CFQ_GROUP_IOSCHED=y
+
+- Compile and boot into kernel and mount IO controller (blkio).
+
+ mount -t cgroup -o blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/blkio.weight
+ echo 500 > /cgroup/test2/blkio.weight
+
+- Create two files of the same size (say 512MB each) on the same disk and
+  launch two dd threads in different cgroups to read those files.
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At the macro level, the first dd should finish first. To get more precise
+  data, keep looking (with the help of a script) at the blkio.disk_time and
+  blkio.disk_sectors files of both test1 and test2 groups. This tells how much
+  disk time (in milliseconds) each group got and how many sectors each group
+  dispatched to the disk. We provide fairness in terms of disk time, so ideally
+  blkio.disk_time of the cgroups should be in proportion to the weight.
+
+Various user visible config options
+===================================
+CONFIG_CFQ_GROUP_IOSCHED
+ - Enables group scheduling in CFQ. Currently only 1 level of group
+ creation is allowed.
+
+CONFIG_DEBUG_CFQ_IOSCHED
+ - Enables some debugging messages in blktrace. Also creates extra
+ cgroup file blkio.dequeue.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configuration.
+
+CONFIG_BLK_CGROUP
+ - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+
+CONFIG_DEBUG_BLK_CGROUP
+ - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+
+Details of cgroup files
+=======================
+- blkio.ioprio_class
+ - Specifies the class of the cgroup (RT, BE, IDLE). This is the
+ default io class of the group on all devices.
+
+ 1 = RT, 2 = BE, 3 = IDLE
+
+- blkio.weight
+ - Specifies per cgroup weight.
+
+ Currently allowed range of weights is from 100 to 1000.
+
+- blkio.time
+ - disk time allocated to the cgroup per device, in milliseconds. The
+ first two fields specify the major and minor number of the device
+ and the third field specifies the disk time allocated to the group
+ in milliseconds.
+
+- blkio.sectors
+ - number of sectors transferred to/from the disk by the group. The
+ first two fields specify the major and minor number of the device
+ and the third field specifies the number of sectors transferred by
+ the group to/from the device.
+
+- blkio.dequeue
+ - Debugging aid, only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+ gives statistics about how many times a group was dequeued from the
+ service tree of the device. The first two fields specify the major
+ and minor number of the device and the third field specifies the
+ number of times the group was dequeued from that particular device.
--
1.6.2.5
o Currently CFQ provides priority scaled time slices to processes. If a process
does not use its time slice, either because it did not have sufficient IO to do
or because its think time is large and CFQ decided to disable idling, then the
process loses its time slice share.
o This works well in a flat setup where the fair share of a process can be
achieved in one go (by scaled time slices), and CFQ does not have to time stamp
the queue. But once IO groups are introduced, it does not work very well.
Consider the following case.
          root
         /    \
       G1      G2
        |       |
       T1      T2
Here G1 and G2 are two groups with weight 100 each, and T1 and T2 are two
tasks of prio 0 and 4 respectively. Both groups should get 50% of the disk
time. CFQ assigns a slice length of 180ms to T1 (prio 0) and a slice length of
100ms to T2 (prio 4). Plain round robin of scaled slices does not work at the
group level.
o One possible way to handle this is to implement CFS-like time stamping of
the cfq queues and keep track of vtime. The next queue selected for execution
is the one with the lowest vtime. This patch implements such a time stamping
mechanism for cfq queues based on disk time used (a user-space sketch of the
idea follows this changelog).
o min_vdisktime represents the minimum vdisktime of the service tree, taken
from either the queue being serviced or the leftmost element on the service
tree.
o Previously CFQ had one service tree on which queues of all three prio
classes were queued. One side effect of this time stamping approach is that
the single tree approach no longer works, and we need to keep separate service
trees for the three prio classes.
o Some parts of the code in this patch are taken from CFS and BFQ.
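For illustration only, here is a minimal user-space sketch of the vdisktime
idea (not the kernel code; the helper names and the 100ms base slice are
assumptions of this sketch, mirroring the arithmetic of cfq_prio_to_slice()
and cfq_delta_fair() in the patch below):

#include <stdio.h>

#define BASE_SLICE	100	/* ms, like the default sync slice */
#define SLICE_SCALE	5

struct queue {
	const char *name;
	int prio;		/* 0 (highest) .. 7 (lowest) */
	long vdisktime;		/* virtual disk time consumed so far */
	long total_service;	/* real disk time received */
};

/* priority-scaled slice length, as CFQ computes it */
static long prio_slice(int prio)
{
	return BASE_SLICE + (BASE_SLICE / SLICE_SCALE) * (4 - prio);
}

/* charge the used time back with the inverse scaling */
static long charge(long served, int prio)
{
	return served + (BASE_SLICE / SLICE_SCALE) * (prio - 4);
}

int main(void)
{
	struct queue qs[] = { { "T1", 0 }, { "T2", 4 } };
	int round;

	for (round = 0; round < 6; round++) {
		/* always run the queue with the smallest vdisktime */
		struct queue *q = qs[0].vdisktime <= qs[1].vdisktime ?
					&qs[0] : &qs[1];
		long served = prio_slice(q->prio);

		q->total_service += served;
		q->vdisktime += charge(served, q->prio);
	}

	/*
	 * Both queues advance vdisktime equally per round, so they keep
	 * alternating, but T1's slices are longer: 540ms vs 300ms here.
	 */
	printf("T1 (prio 0) got %ldms, T2 (prio 4) got %ldms\n",
	       qs[0].total_service, qs[1].total_service);
	return 0;
}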
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 480 +++++++++++++++++++++++++++++++++++----------------
1 files changed, 335 insertions(+), 145 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 069a610..58ac8b7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -28,6 +28,8 @@ static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+#define IO_IOPRIO_CLASSES 3
+
/*
* offset from end of service tree
*/
@@ -64,11 +66,17 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
* to find it. Idea borrowed from Ingo Molnars CFS scheduler. We should
* move this into the elevator for the rq sorting as well.
*/
-struct cfq_rb_root {
+struct cfq_service_tree {
struct rb_root rb;
struct rb_node *left;
+ u64 min_vdisktime;
+ struct cfq_queue *active;
+};
+#define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
+
+struct cfq_sched_data {
+ struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, }
/*
* Per process-grouping structure
@@ -83,7 +91,9 @@ struct cfq_queue {
/* service_tree member */
struct rb_node rb_node;
/* service_tree key */
- unsigned long rb_key;
+ u64 vdisktime;
+ /* service tree we belong to */
+ struct cfq_service_tree *st;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -99,8 +109,9 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;
+ /* time when first request from queue completed and slice started. */
+ unsigned long slice_start;
unsigned long slice_end;
- long slice_resid;
unsigned int slice_dispatch;
/* pending metadata requests */
@@ -111,6 +122,7 @@ struct cfq_queue {
/* io prio of this group */
unsigned short ioprio, org_ioprio;
unsigned short ioprio_class, org_ioprio_class;
+ bool ioprio_class_changed;
pid_t pid;
};
@@ -124,7 +136,7 @@ struct cfq_data {
/*
* rr list of queues with requests and the count of them
*/
- struct cfq_rb_root service_tree;
+ struct cfq_sched_data sched_data;
/*
* Each priority tree is sorted by next_request position. These
@@ -234,6 +246,67 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
+static void cfq_put_queue(struct cfq_queue *cfqq);
+static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st);
+
+static inline void
+init_cfqq_service_tree(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ unsigned short idx = cfqq->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+ cfqq->st = &cfqd->sched_data.service_tree[idx];
+}
+
+static inline s64
+cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+{
+ return cfqq->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
+{
+ const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
+
+ return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct cfq_service_tree *st)
+{
+ u64 vdisktime = st->min_vdisktime;
+
+ if (st->active)
+ vdisktime = st->active->vdisktime;
+
+ if (st->left) {
+ struct cfq_queue *cfqq = rb_entry(st->left, struct cfq_queue,
+ rb_node);
+
+ vdisktime = min_vdisktime(vdisktime, cfqq->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
static inline int rq_in_driver(struct cfq_data *cfqd)
{
@@ -277,7 +350,7 @@ static int cfq_queue_empty(struct request_queue *q)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
- return !cfqd->busy_queues;
+ return !cfqd->rq_queued;
}
/*
@@ -304,6 +377,7 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ cfqq->slice_start = jiffies;
cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
@@ -419,33 +493,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}
/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
* would be nice to take fifo expire time into account as well
*/
static struct request *
@@ -472,102 +519,192 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
}
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- bool add_front)
+static void
+place_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq, int add_front)
{
- struct rb_node **p, *parent;
+ u64 vdisktime = st->min_vdisktime;
+ struct rb_node *parent;
struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
+ vdisktime = CFQ_IDLE_DELAY;
+ parent = rb_last(&st->rb);
if (parent && parent != &cfqq->rb_node) {
__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
+ vdisktime += __cfqq->vdisktime;
} else
- rb_key += jiffies;
+ vdisktime += st->min_vdisktime;
} else if (!add_front) {
- /*
- * Get our rb key offset. Subtract any residual slice
- * value carried from last service. A negative resid
- * count indicates slice overrun, and this should position
- * the next service time further away in the tree.
- */
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key -= cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else {
- rb_key = -HZ;
- __cfqq = cfq_rb_first(&cfqd->service_tree);
- rb_key += __cfqq ? __cfqq->rb_key : jiffies;
+ parent = rb_last(&st->rb);
+ if (parent && parent != &cfqq->rb_node) {
+ __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ vdisktime = __cfqq->vdisktime;
+ }
}
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ cfqq->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void cfqq_update_ioprio_class(struct cfq_queue *cfqq)
+{
+ if (unlikely(cfqq->ioprio_class_changed)) {
+ struct cfq_data *cfqd = cfqq->cfqd;
+
/*
- * same position, nothing more to do
+ * Re-initialize the service tree pointer as ioprio class
+ * change will lead to service tree change.
*/
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ init_cfqq_service_tree(cfqd, cfqq);
+ cfqq->ioprio_class_changed = 0;
+ cfqq->vdisktime = 0;
}
+}
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
+static void __dequeue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+{
+ /* Node is not on tree */
+ if (RB_EMPTY_NODE(&cfqq->rb_node))
+ return;
- parent = *p;
+ if (st->left == &cfqq->rb_node)
+ st->left = rb_next(&cfqq->rb_node);
+
+ rb_erase(&cfqq->rb_node, &st->rb);
+ RB_CLEAR_NODE(&cfqq->rb_node);
+}
+
+static void dequeue_cfqq(struct cfq_queue *cfqq)
+{
+ struct cfq_service_tree *st = cfqq->st;
+
+ if (st->active == cfqq)
+ st->active = NULL;
+
+ __dequeue_cfqq(st, cfqq);
+}
+
+static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
+ int add_front)
+{
+ struct rb_node **node = &st->rb.rb_node;
+ struct rb_node *parent = NULL;
+ struct cfq_queue *__cfqq;
+ s64 key = cfqq_key(st, cfqq);
+ int leftmost = 1;
+
+ while (*node != NULL) {
+ parent = *node;
__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (time_before(rb_key, __cfqq->rb_key))
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
+ if (key < cfqq_key(st, __cfqq) ||
+ (add_front && (key == cfqq_key(st, __cfqq)))) {
+ node = &parent->rb_left;
+ } else {
+ node = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used)
+ */
+ if (leftmost)
+ st->left = &cfqq->rb_node;
- if (n == &(*p)->rb_right)
- left = 0;
+ rb_link_node(&cfqq->rb_node, parent, node);
+ rb_insert_color(&cfqq->rb_node, &st->rb);
+}
- p = n;
+static void enqueue_cfqq(struct cfq_queue *cfqq)
+{
+ cfqq_update_ioprio_class(cfqq);
+ place_cfqq(cfqq->st, cfqq, 0);
+ __enqueue_cfqq(cfqq->st, cfqq, 0);
+}
+
+/* Requeue a cfqq which is already on the service tree */
+static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+{
+ struct cfq_service_tree *st = cfqq->st;
+ struct cfq_queue *next_cfqq;
+
+ if (add_front) {
+ next_cfqq = __cfq_get_next_queue(st);
+ if (next_cfqq && next_cfqq == cfqq)
+ return;
+ }
+
+ __dequeue_cfqq(st, cfqq);
+ place_cfqq(st, cfqq, add_front);
+ __enqueue_cfqq(st, cfqq, add_front);
+}
+
+static void __cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+{
+ /*
+ * Can't update entity disk time while it is on sorted rb-tree
+ * as vdisktime is used as key.
+ */
+ __dequeue_cfqq(cfqq->st, cfqq);
+ cfqq->vdisktime += cfq_delta_fair(served, cfqq);
+ update_min_vdisktime(cfqq->st);
+ __enqueue_cfqq(cfqq->st, cfqq, 0);
+}
+
+static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+{
+ /*
+ * We don't want to charge more than allocated slice otherwise this
+ * queue can miss one dispatch round doubling max latencies. On the
+ * other hand we don't want to charge less than allocated slice as
+ * we stick to CFQ theme of a queue losing its share if it does not
+ * use the slice and moves to the back of service tree (almost).
+ */
+ served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
+ __cfqq_served(cfqq, served);
+
+ /* If cfqq prio class has changed, take that into account */
+ if (unlikely(cfqq->ioprio_class_changed)) {
+ dequeue_cfqq(cfqq);
+ enqueue_cfqq(cfqq);
}
+}
+
+/*
+ * Handles three operations.
+ * Addition of a new queue to service tree, when a new request comes in.
+ * Resorting of an expiring queue (used after slice expired)
+ * Requeuing a queue at the front (used during preemption).
+ */
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ bool add_front, unsigned long service)
+{
+ if (RB_EMPTY_NODE(&cfqq->rb_node)) {
+ /* Its a new queue. Add it to service tree */
+ enqueue_cfqq(cfqq);
+ return;
+ }
+
+ if (service) {
+ /*
+ * This queue just got served. Compute the new key and requeue
+ * in the service tree
+ */
+ cfqq_served(cfqq, service);
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
+ /*
+ * Requeue async ioq so that these will be again placed at the
+ * end of service tree giving a chance to sync queues.
+ * TODO: Handle this case in a better manner.
+ */
+ if (!cfq_cfqq_sync(cfqq))
+ requeue_cfqq(cfqq, 0);
+ return;
+ }
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+ /* Just requeuing an existing queue, used during preemption */
+ requeue_cfqq(cfqq, add_front);
}
static struct cfq_queue *
@@ -634,13 +771,14 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
/*
* Update cfqq's position in the service tree.
*/
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ unsigned long service)
{
/*
* Resorting requires the cfqq to be on the RR list already.
*/
if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
+ cfq_service_tree_add(cfqd, cfqq, 0, service);
cfq_prio_tree_add(cfqd, cfqq);
}
}
@@ -656,7 +794,7 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfq_mark_cfqq_on_rr(cfqq);
cfqd->busy_queues++;
- cfq_resort_rr_list(cfqd, cfqq);
+ cfq_resort_rr_list(cfqd, cfqq, 0);
}
/*
@@ -669,8 +807,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ dequeue_cfqq(cfqq);
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
cfqq->p_root = NULL;
@@ -686,7 +823,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);
BUG_ON(!cfqq->queued[sync]);
@@ -694,8 +830,17 @@ static void cfq_del_rq_rb(struct request *rq)
elv_rb_del(&cfqq->sort_list, rq);
- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list)) {
+ /*
+ * Queue will be deleted from service tree when we actually
+ * expire it later. Right now just remove it from prio tree
+ * as it is empty.
+ */
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}
static void cfq_add_rq_rb(struct request *rq)
@@ -869,6 +1014,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
{
if (cfqq) {
cfq_log_cfqq(cfqd, cfqq, "set_active");
+ cfqq->slice_start = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
@@ -888,10 +1034,11 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
* current cfqq expired its slice (or was too idle), select new one
*/
static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- bool timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
+ long slice_used = 0;
+
+ cfq_log_cfqq(cfqd, cfqq, "slice expired");
if (cfq_cfqq_wait_request(cfqq))
del_timer(&cfqd->idle_slice_timer);
@@ -899,14 +1046,20 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_clear_cfqq_wait_request(cfqq);
/*
- * store what was left of this slice, if the queue idled/timed out
+ * Queue got expired before even a single request completed or
+ * got expired immediately after first request completion.
*/
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
+ if (!cfqq->slice_end || cfqq->slice_start == jiffies)
+ slice_used = 1;
+ else
+ slice_used = jiffies - cfqq->slice_start;
- cfq_resort_rr_list(cfqd, cfqq);
+ cfq_log_cfqq(cfqd, cfqq, "sl_used=%ld", slice_used);
+
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_del_cfqq_rr(cfqd, cfqq);
+
+ cfq_resort_rr_list(cfqd, cfqq, slice_used);
if (cfqq == cfqd->active_queue)
cfqd->active_queue = NULL;
@@ -917,12 +1070,22 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
}
-static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq = cfqd->active_queue;
if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
+ __cfq_slice_expired(cfqd, cfqq);
+}
+
+static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
+{
+ struct rb_node *left = st->left;
+
+ if (!left)
+ return NULL;
+
+ return rb_entry(left, struct cfq_queue, rb_node);
}
/*
@@ -931,10 +1094,24 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
*/
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
+ struct cfq_sched_data *sd = &cfqd->sched_data;
+ struct cfq_service_tree *st = sd->service_tree;
+ struct cfq_queue *cfqq = NULL;
+ int i;
+
+ if (!cfqd->rq_queued)
return NULL;
- return cfq_rb_first(&cfqd->service_tree);
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ cfqq = __cfq_get_next_queue(st);
+ if (cfqq) {
+ st->active = cfqq;
+ update_min_vdisktime(cfqq->st);
+ break;
+ }
+ }
+
+ return cfqq;
}
/*
@@ -1181,6 +1358,9 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
if (!cfqq)
goto new_queue;
+ if (!cfqd->rq_queued)
+ return NULL;
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -1216,7 +1396,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
}
expire:
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
new_queue:
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
@@ -1233,6 +1413,10 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
}
BUG_ON(!list_empty(&cfqq->fifo));
+
+ /* By default cfqq is not expired if it is empty. Do it explicitly */
+ __cfq_slice_expired(cfqq->cfqd, cfqq);
+
return dispatched;
}
@@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;
- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
BUG_ON(cfqd->busy_queues);
@@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
}
cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
@@ -1416,13 +1600,14 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));
if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
+ __cfq_slice_expired(cfqd, cfqq);
cfq_schedule_dispatch(cfqd);
}
+ BUG_ON(cfq_cfqq_on_rr(cfqq));
+
kmem_cache_free(cfq_pool, cfqq);
}
@@ -1514,7 +1699,7 @@ static void cfq_free_io_context(struct io_context *ioc)
static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
+ __cfq_slice_expired(cfqd, cfqq);
cfq_schedule_dispatch(cfqd);
}
@@ -1634,6 +1819,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
break;
}
+ if (cfqq->org_ioprio_class != cfqq->ioprio_class)
+ cfqq->ioprio_class_changed = 1;
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
@@ -2079,7 +2266,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
+ cfq_slice_expired(cfqd);
/*
* Put the new queue at the front of the of the current list,
@@ -2087,7 +2274,7 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
*/
BUG_ON(!cfq_cfqq_on_rr(cfqq));
- cfq_service_tree_add(cfqd, cfqq, 1);
+ cfq_service_tree_add(cfqd, cfqq, 1, 0);
cfqq->slice_end = 0;
cfq_mark_cfqq_slice_new(cfqq);
@@ -2229,7 +2416,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* of idling.
*/
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
+ cfq_slice_expired(cfqd);
else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
sync && !rq_noidle(rq))
cfq_arm_slice_timer(cfqd);
@@ -2250,16 +2437,20 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
* boost idle prio on transactions that would lock out other
* users of the filesystem
*/
- if (cfq_class_idle(cfqq))
+ if (cfq_class_idle(cfqq)) {
cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->ioprio_class_changed = 1;
+ }
if (cfqq->ioprio > IOPRIO_NORM)
cfqq->ioprio = IOPRIO_NORM;
} else {
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
+ if (cfqq->ioprio_class != cfqq->org_ioprio_class) {
cfqq->ioprio_class = cfqq->org_ioprio_class;
+ cfqq->ioprio_class_changed = 1;
+ }
if (cfqq->ioprio != cfqq->org_ioprio)
cfqq->ioprio = cfqq->org_ioprio;
}
@@ -2391,7 +2582,6 @@ static void cfq_idle_slice_timer(unsigned long data)
struct cfq_data *cfqd = (struct cfq_data *) data;
struct cfq_queue *cfqq;
unsigned long flags;
- int timed_out = 1;
cfq_log(cfqd, "idle timer fired");
@@ -2399,7 +2589,6 @@ static void cfq_idle_slice_timer(unsigned long data)
cfqq = cfqd->active_queue;
if (cfqq) {
- timed_out = 0;
/*
* We saw a request before the queue expired, let it through
@@ -2427,7 +2616,7 @@ static void cfq_idle_slice_timer(unsigned long data)
goto out_kick;
}
expire:
- cfq_slice_expired(cfqd, timed_out);
+ cfq_slice_expired(cfqd);
out_kick:
cfq_schedule_dispatch(cfqd);
out_cont:
@@ -2465,7 +2654,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
spin_lock_irq(q->queue_lock);
if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
+ __cfq_slice_expired(cfqd, cfqd->active_queue);
while (!list_empty(&cfqd->cic_list)) {
struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
@@ -2493,7 +2682,8 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;
- cfqd->service_tree = CFQ_RB_ROOT;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ cfqd->sched_data.service_tree[i] = CFQ_RB_ROOT;
/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5
o Introduce the notion of weights. Priorities are mapped to weights
internally. These weights will be useful once IO groups are introduced, and a
group's share will be decided by the group weight (the mapping is illustrated
in the sketch below).
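For reference, a small user-space illustration of the prio-to-weight mapping
and the weight-scaled vdisktime charge this patch introduces (a sketch only:
it reuses the arithmetic of cfq_ioprio_to_weight() but ignores
CFQ_SERVICE_SHIFT and uses plain integer division):

#include <stdio.h>

#define CFQ_WEIGHT_DEFAULT	500

static unsigned int ioprio_to_weight(int ioprio)
{
	/* prio 7..0 maps to weights 200..900 around the default of 500 */
	return CFQ_WEIGHT_DEFAULT + (CFQ_WEIGHT_DEFAULT / 5) * (4 - ioprio);
}

int main(void)
{
	unsigned long served = 100;	/* ms of disk time actually used */
	int prio;

	for (prio = 0; prio < 8; prio++) {
		unsigned int w = ioprio_to_weight(prio);

		/* higher weight => smaller vdisktime charge => more service */
		printf("prio %d -> weight %u, 100ms charged as %lu\n",
		       prio, w, served * CFQ_WEIGHT_DEFAULT / w);
	}
	return 0;
}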
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 58 ++++++++++++++++++++++++++++++++++----------------
1 files changed, 39 insertions(+), 19 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 58ac8b7..ca815ce 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -29,6 +29,10 @@ static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
#define IO_IOPRIO_CLASSES 3
+#define CFQ_WEIGHT_MIN 100
+#define CFQ_WEIGHT_MAX 1000
+#define CFQ_WEIGHT_DEFAULT 500
+#define CFQ_SERVICE_SHIFT 12
/*
* offset from end of service tree
@@ -40,7 +44,7 @@ static int cfq_slice_idle = HZ / 125;
*/
#define CFQ_MIN_TT (2)
-#define CFQ_SLICE_SCALE (5)
+#define CFQ_SLICE_SCALE (500)
#define CFQ_HW_QUEUE_MIN (5)
#define RQ_CIC(rq) \
@@ -123,6 +127,7 @@ struct cfq_queue {
unsigned short ioprio, org_ioprio;
unsigned short ioprio_class, org_ioprio_class;
bool ioprio_class_changed;
+ unsigned int weight;
pid_t pid;
};
@@ -266,11 +271,22 @@ cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
}
static inline u64
+cfq_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+ if (numerator_wt != denominator_wt) {
+ service = service * numerator_wt;
+ do_div(service, denominator_wt);
+ }
+
+ return service;
+}
+
+static inline u64
cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
{
- const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
+ u64 d = delta << CFQ_SERVICE_SHIFT;
- return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
+ return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqq->weight);
}
static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -308,6 +324,23 @@ static void update_min_vdisktime(struct cfq_service_tree *st)
st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
}
+static inline unsigned int cfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ /* Map prio 7 - 0 to weights 200 to 900 */
+ return CFQ_WEIGHT_DEFAULT + (CFQ_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline int
+cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
+{
+ const int base_slice = cfqd->cfq_slice[sync];
+
+ WARN_ON(weight > CFQ_WEIGHT_MAX);
+
+ return cfq_delta(base_slice, weight, CFQ_WEIGHT_DEFAULT);
+}
+
static inline int rq_in_driver(struct cfq_data *cfqd)
{
return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
@@ -353,25 +386,10 @@ static int cfq_queue_empty(struct request_queue *q)
return !cfqd->rq_queued;
}
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, bool sync,
- unsigned short prio)
-{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
static inline int
cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+ return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->weight);
}
static inline void
@@ -1819,6 +1837,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
break;
}
+ cfqq->weight = cfq_ioprio_to_weight(cfqq->ioprio);
+
if (cfqq->org_ioprio_class != cfqq->ioprio_class)
cfqq->ioprio_class_changed = 1;
/*
--
1.6.2.5
o Introduce the notion of a cfq entity. This is a common structure which will
be embedded in both cfq queues and cfq groups, much like the scheduling entity
of CFS (the pattern is illustrated in the sketch below).
o Once groups are introduced, it becomes easier to deal with entities while
enqueuing/dequeuing queues/groups on the service tree, and we can handle many
of the operations with single functions dealing in entities instead of
introducing separate functions for queues and groups.
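As an aside, the embedded-entity plus container_of() pattern used here can be
shown with a tiny standalone program (simplified, made-up field set; not the
kernel code):

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cfq_entity {
	unsigned long long vdisktime;
	unsigned int weight;
};

struct cfq_queue {
	struct cfq_entity entity;	/* embedded scheduling entity */
	int pid;
};

static struct cfq_queue *cfqq_of(struct cfq_entity *cfqe)
{
	return container_of(cfqe, struct cfq_queue, entity);
}

int main(void)
{
	struct cfq_queue q = { .entity = { .weight = 500 }, .pid = 42 };
	struct cfq_entity *cfqe = &q.entity;

	/*
	 * Generic scheduling code deals only in entities; the owner is
	 * recovered on demand, and later a cfq_group can embed one too.
	 */
	printf("entity of weight %u belongs to pid %d\n",
	       cfqe->weight, cfqq_of(cfqe)->pid);
	return 0;
}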
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 246 +++++++++++++++++++++++++++++----------------------
1 files changed, 141 insertions(+), 105 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ca815ce..922aa8e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -59,8 +59,10 @@ static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);
#define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
+#define cfqe_class_idle(cfqe) ((cfqe)->ioprio_class == IOPRIO_CLASS_IDLE)
+#define cfqe_class_rt(cfqe) ((cfqe)->ioprio_class == IOPRIO_CLASS_RT)
+#define cfq_class_idle(cfqq) (cfqe_class_idle(&(cfqq)->entity))
+#define cfq_class_rt(cfqq) (cfqe_class_rt(&(cfqq)->entity))
#define sample_valid(samples) ((samples) > 80)
@@ -74,7 +76,7 @@ struct cfq_service_tree {
struct rb_root rb;
struct rb_node *left;
u64 min_vdisktime;
- struct cfq_queue *active;
+ struct cfq_entity *active;
};
#define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
@@ -82,22 +84,26 @@ struct cfq_sched_data {
struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
};
+struct cfq_entity {
+ struct rb_node rb_node;
+ u64 vdisktime;
+ unsigned int weight;
+ struct cfq_service_tree *st;
+ unsigned short ioprio_class;
+ bool ioprio_class_changed;
+};
+
/*
* Per process-grouping structure
*/
struct cfq_queue {
+ struct cfq_entity entity;
/* reference count */
atomic_t ref;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- u64 vdisktime;
- /* service tree we belong to */
- struct cfq_service_tree *st;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -125,9 +131,7 @@ struct cfq_queue {
/* io prio of this group */
unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
- bool ioprio_class_changed;
- unsigned int weight;
+ unsigned short org_ioprio_class;
pid_t pid;
};
@@ -252,22 +256,27 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
static void cfq_put_queue(struct cfq_queue *cfqq);
-static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st);
+static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st);
+
+static inline struct cfq_queue *cfqq_of(struct cfq_entity *cfqe)
+{
+ return container_of(cfqe, struct cfq_queue, entity);
+}
static inline void
-init_cfqq_service_tree(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
{
- unsigned short idx = cfqq->ioprio_class - 1;
+ unsigned short idx = cfqe->ioprio_class - 1;
BUG_ON(idx >= IO_IOPRIO_CLASSES);
- cfqq->st = &cfqd->sched_data.service_tree[idx];
+ cfqe->st = &cfqd->sched_data.service_tree[idx];
}
static inline s64
-cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+cfqe_key(struct cfq_service_tree *st, struct cfq_entity *cfqe)
{
- return cfqq->vdisktime - st->min_vdisktime;
+ return cfqe->vdisktime - st->min_vdisktime;
}
static inline u64
@@ -282,11 +291,11 @@ cfq_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
}
static inline u64
-cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
+cfq_delta_fair(unsigned long delta, struct cfq_entity *cfqe)
{
u64 d = delta << CFQ_SERVICE_SHIFT;
- return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqq->weight);
+ return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqe->weight);
}
static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -315,10 +324,10 @@ static void update_min_vdisktime(struct cfq_service_tree *st)
vdisktime = st->active->vdisktime;
if (st->left) {
- struct cfq_queue *cfqq = rb_entry(st->left, struct cfq_queue,
+ struct cfq_entity *cfqe = rb_entry(st->left, struct cfq_entity,
rb_node);
- vdisktime = min_vdisktime(vdisktime, cfqq->vdisktime);
+ vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
}
st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
@@ -389,7 +398,7 @@ static int cfq_queue_empty(struct request_queue *q)
static inline int
cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->weight);
+ return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->entity.weight);
}
static inline void
@@ -538,84 +547,90 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
static void
-place_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq, int add_front)
+place_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe, int add_front)
{
u64 vdisktime = st->min_vdisktime;
struct rb_node *parent;
- struct cfq_queue *__cfqq;
+ struct cfq_entity *__cfqe;
- if (cfq_class_idle(cfqq)) {
+ if (cfqe_class_idle(cfqe)) {
vdisktime = CFQ_IDLE_DELAY;
parent = rb_last(&st->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- vdisktime += __cfqq->vdisktime;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
+ vdisktime += __cfqe->vdisktime;
} else
vdisktime += st->min_vdisktime;
} else if (!add_front) {
parent = rb_last(&st->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- vdisktime = __cfqq->vdisktime;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
+ vdisktime = __cfqe->vdisktime;
}
}
- cfqq->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+ cfqe->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
}
-static inline void cfqq_update_ioprio_class(struct cfq_queue *cfqq)
+static inline void cfqe_update_ioprio_class(struct cfq_entity *cfqe)
{
- if (unlikely(cfqq->ioprio_class_changed)) {
+ if (unlikely(cfqe->ioprio_class_changed)) {
+ struct cfq_queue *cfqq = cfqq_of(cfqe);
struct cfq_data *cfqd = cfqq->cfqd;
/*
* Re-initialize the service tree pointer as ioprio class
* change will lead to service tree change.
*/
- init_cfqq_service_tree(cfqd, cfqq);
- cfqq->ioprio_class_changed = 0;
- cfqq->vdisktime = 0;
+ init_cfqe_service_tree(cfqd, cfqe);
+ cfqe->ioprio_class_changed = 0;
+ cfqe->vdisktime = 0;
}
}
-static void __dequeue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+static void __dequeue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe)
{
/* Node is not on tree */
- if (RB_EMPTY_NODE(&cfqq->rb_node))
+ if (RB_EMPTY_NODE(&cfqe->rb_node))
return;
- if (st->left == &cfqq->rb_node)
- st->left = rb_next(&cfqq->rb_node);
+ if (st->left == &cfqe->rb_node)
+ st->left = rb_next(&cfqe->rb_node);
- rb_erase(&cfqq->rb_node, &st->rb);
- RB_CLEAR_NODE(&cfqq->rb_node);
+ rb_erase(&cfqe->rb_node, &st->rb);
+ RB_CLEAR_NODE(&cfqe->rb_node);
}
-static void dequeue_cfqq(struct cfq_queue *cfqq)
+static void dequeue_cfqe(struct cfq_entity *cfqe)
{
- struct cfq_service_tree *st = cfqq->st;
+ struct cfq_service_tree *st = cfqe->st;
- if (st->active == cfqq)
+ if (st->active == cfqe)
st->active = NULL;
- __dequeue_cfqq(st, cfqq);
+ __dequeue_cfqe(st, cfqe);
}
-static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
+static void dequeue_cfqq(struct cfq_queue *cfqq)
+{
+ dequeue_cfqe(&cfqq->entity);
+}
+
+static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
int add_front)
{
struct rb_node **node = &st->rb.rb_node;
struct rb_node *parent = NULL;
- struct cfq_queue *__cfqq;
- s64 key = cfqq_key(st, cfqq);
+ struct cfq_entity *__cfqe;
+ s64 key = cfqe_key(st, cfqe);
int leftmost = 1;
while (*node != NULL) {
parent = *node;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
- if (key < cfqq_key(st, __cfqq) ||
- (add_front && (key == cfqq_key(st, __cfqq)))) {
+ if (key < cfqe_key(st, __cfqe) ||
+ (add_front && (key == cfqe_key(st, __cfqe)))) {
node = &parent->rb_left;
} else {
node = &parent->rb_right;
@@ -628,46 +643,56 @@ static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
* used)
*/
if (leftmost)
- st->left = &cfqq->rb_node;
+ st->left = &cfqe->rb_node;
+
+ rb_link_node(&cfqe->rb_node, parent, node);
+ rb_insert_color(&cfqe->rb_node, &st->rb);
+}
- rb_link_node(&cfqq->rb_node, parent, node);
- rb_insert_color(&cfqq->rb_node, &st->rb);
+static void enqueue_cfqe(struct cfq_entity *cfqe)
+{
+ cfqe_update_ioprio_class(cfqe);
+ place_cfqe(cfqe->st, cfqe, 0);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
}
static void enqueue_cfqq(struct cfq_queue *cfqq)
{
- cfqq_update_ioprio_class(cfqq);
- place_cfqq(cfqq->st, cfqq, 0);
- __enqueue_cfqq(cfqq->st, cfqq, 0);
+ enqueue_cfqe(&cfqq->entity);
}
/* Requeue a cfqq which is already on the service tree */
-static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+static void requeue_cfqe(struct cfq_entity *cfqe, int add_front)
{
- struct cfq_service_tree *st = cfqq->st;
- struct cfq_queue *next_cfqq;
+ struct cfq_service_tree *st = cfqe->st;
+ struct cfq_entity *next_cfqe;
if (add_front) {
- next_cfqq = __cfq_get_next_queue(st);
- if (next_cfqq && next_cfqq == cfqq)
+ next_cfqe = __cfq_get_next_entity(st);
+ if (next_cfqe && next_cfqe == cfqe)
return;
}
- __dequeue_cfqq(st, cfqq);
- place_cfqq(st, cfqq, add_front);
- __enqueue_cfqq(st, cfqq, add_front);
+ __dequeue_cfqe(st, cfqe);
+ place_cfqe(st, cfqe, add_front);
+ __enqueue_cfqe(st, cfqe, add_front);
}
-static void __cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+{
+ requeue_cfqe(&cfqq->entity, add_front);
+}
+
+static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
{
/*
* Can't update entity disk time while it is on sorted rb-tree
* as vdisktime is used as key.
*/
- __dequeue_cfqq(cfqq->st, cfqq);
- cfqq->vdisktime += cfq_delta_fair(served, cfqq);
- update_min_vdisktime(cfqq->st);
- __enqueue_cfqq(cfqq->st, cfqq, 0);
+ __dequeue_cfqe(cfqe->st, cfqe);
+ cfqe->vdisktime += cfq_delta_fair(served, cfqe);
+ update_min_vdisktime(cfqe->st);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
}
static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
@@ -680,10 +705,10 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
* use the slice and moves to the back of service tree (almost).
*/
served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
- __cfqq_served(cfqq, served);
+ cfqe_served(&cfqq->entity, served);
/* If cfqq prio class has changed, take that into account */
- if (unlikely(cfqq->ioprio_class_changed)) {
+ if (unlikely(cfqq->entity.ioprio_class_changed)) {
dequeue_cfqq(cfqq);
enqueue_cfqq(cfqq);
}
@@ -698,7 +723,7 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool add_front, unsigned long service)
{
- if (RB_EMPTY_NODE(&cfqq->rb_node)) {
+ if (RB_EMPTY_NODE(&cfqq->entity.rb_node)) {
/* Its a new queue. Add it to service tree */
enqueue_cfqq(cfqq);
return;
@@ -1096,14 +1121,32 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd)
__cfq_slice_expired(cfqd, cfqq);
}
-static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
+static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st)
{
struct rb_node *left = st->left;
if (!left)
return NULL;
- return rb_entry(left, struct cfq_queue, rb_node);
+ return rb_entry(left, struct cfq_entity, rb_node);
+}
+
+static struct cfq_entity *cfq_get_next_entity(struct cfq_sched_data *sd)
+{
+ struct cfq_service_tree *st = sd->service_tree;
+ struct cfq_entity *cfqe = NULL;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ cfqe = __cfq_get_next_entity(st);
+ if (cfqe) {
+ st->active = cfqe;
+ update_min_vdisktime(cfqe->st);
+ break;
+ }
+ }
+
+ return cfqe;
}
/*
@@ -1112,24 +1155,17 @@ static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
*/
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
- struct cfq_sched_data *sd = &cfqd->sched_data;
- struct cfq_service_tree *st = sd->service_tree;
- struct cfq_queue *cfqq = NULL;
- int i;
+ struct cfq_entity *cfqe = NULL;
if (!cfqd->rq_queued)
return NULL;
- for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
- cfqq = __cfq_get_next_queue(st);
- if (cfqq) {
- st->active = cfqq;
- update_min_vdisktime(cfqq->st);
- break;
- }
- }
+ cfqe = cfq_get_next_entity(&cfqd->sched_data);
- return cfqq;
+ if (cfqe)
+ return cfqq_of(cfqe);
+ else
+ return NULL;
}
/*
@@ -1820,33 +1856,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
* no prio set, inherit CPU scheduling settings
*/
cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ cfqq->entity.ioprio_class = task_nice_ioclass(tsk);
break;
case IOPRIO_CLASS_RT:
cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_RT;
break;
case IOPRIO_CLASS_BE:
cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_IDLE;
cfqq->ioprio = 7;
cfq_clear_cfqq_idle_window(cfqq);
break;
}
- cfqq->weight = cfq_ioprio_to_weight(cfqq->ioprio);
+ cfqq->entity.weight = cfq_ioprio_to_weight(cfqq->ioprio);
- if (cfqq->org_ioprio_class != cfqq->ioprio_class)
- cfqq->ioprio_class_changed = 1;
+ if (cfqq->org_ioprio_class != cfqq->entity.ioprio_class)
+ cfqq->entity.ioprio_class_changed = 1;
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
*/
cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio_class = cfqq->entity.ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}
@@ -1888,7 +1924,7 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
pid_t pid, bool is_sync)
{
- RB_CLEAR_NODE(&cfqq->rb_node);
+ RB_CLEAR_NODE(&cfqq->entity.rb_node);
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
@@ -2458,8 +2494,8 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
* users of the filesystem
*/
if (cfq_class_idle(cfqq)) {
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- cfqq->ioprio_class_changed = 1;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->entity.ioprio_class_changed = 1;
}
if (cfqq->ioprio > IOPRIO_NORM)
cfqq->ioprio = IOPRIO_NORM;
@@ -2467,9 +2503,9 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class) {
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- cfqq->ioprio_class_changed = 1;
+ if (cfqq->entity.ioprio_class != cfqq->org_ioprio_class) {
+ cfqq->entity.ioprio_class = cfqq->org_ioprio_class;
+ cfqq->entity.ioprio_class_changed = 1;
}
if (cfqq->ioprio != cfqq->org_ioprio)
cfqq->ioprio = cfqq->org_ioprio;
--
1.6.2.5
o This is the first step in introducing cfq groups. Currently we define only
one cfq_group (the root cfq group), which is embedded in cfq_data.
o Down the line, each cfq_group will have its own service tree. Hence move
the service tree from cfqd to the root group so that it becomes a property of
the group (see the layout sketch below).
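For clarity, this is roughly how the pieces nest after this patch (a
standalone, simplified sketch; the service tree body is a placeholder):

#include <stdio.h>

#define IO_IOPRIO_CLASSES	3

struct cfq_service_tree { int nr_queued; };

struct cfq_sched_data {
	struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
};

struct cfq_group {			/* per cgroup grouping structure */
	struct cfq_sched_data sched_data;
};

struct cfq_data {			/* per block device queue structure */
	struct cfq_group root_group;	/* the only group, for now */
};

static struct cfq_data cfqd;

int main(void)
{
	/* queues of prio class i now hang off the root group's tree i */
	cfqd.root_group.sched_data.service_tree[1].nr_queued = 2;
	printf("BE queues in root group: %d\n",
	       cfqd.root_group.sched_data.service_tree[1].nr_queued);
	return 0;
}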
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 27 ++++++++++++++++++---------
1 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 922aa8e..323ed12 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -136,16 +136,17 @@ struct cfq_queue {
pid_t pid;
};
+/* Per cgroup grouping structure */
+struct cfq_group {
+ struct cfq_sched_data sched_data;
+};
+
/*
* Per block device queue structure
*/
struct cfq_data {
struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_sched_data sched_data;
+ struct cfq_group root_group;
/*
* Each priority tree is sorted by next_request position. These
@@ -270,7 +271,7 @@ init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
BUG_ON(idx >= IO_IOPRIO_CLASSES);
- cfqe->st = &cfqd->sched_data.service_tree[idx];
+ cfqe->st = &cfqd->root_group.sched_data.service_tree[idx];
}
static inline s64
@@ -1160,7 +1161,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
if (!cfqd->rq_queued)
return NULL;
- cfqe = cfq_get_next_entity(&cfqd->sched_data);
+ cfqe = cfq_get_next_entity(&cfqd->root_group.sched_data);
if (cfqe)
return cfqq_of(cfqe);
@@ -2700,6 +2701,15 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)
cfq_put_queue(cfqd->async_idle_cfqq);
}
+static void cfq_init_root_group(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg = &cfqd->root_group;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
+}
+
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
@@ -2738,8 +2748,7 @@ static void *cfq_init_queue(struct request_queue *q)
if (!cfqd)
return NULL;
- for (i = 0; i < IO_IOPRIO_CLASSES; i++)
- cfqd->sched_data.service_tree[i] = CFQ_RB_ROOT;
+ cfq_init_root_group(cfqd);
/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5
o This is the basic blkio controller cgroup interface. This is the common
interface which will be used by applications to control IO as it flows through
the IO stack.
o In some places it is assumed that the only policy present is the one
implemented by CFQ, hence things have been hardcoded. Once we have one more
policy implemented, we need to introduce some dynamic infrastructure like
policy registration and get rid of the hardcoded calls.
o Some parts of this code have been taken from BFQ patches.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 13 +++
block/Kconfig.iosched | 8 ++
block/Makefile | 1 +
block/blk-cgroup.c | 199 +++++++++++++++++++++++++++++++++++++++++
block/blk-cgroup.h | 38 ++++++++
block/cfq-iosched.c | 15 ++--
include/linux/cgroup_subsys.h | 6 ++
include/linux/iocontext.h | 4 +
8 files changed, 277 insertions(+), 7 deletions(-)
create mode 100644 block/blk-cgroup.c
create mode 100644 block/blk-cgroup.h
diff --git a/block/Kconfig b/block/Kconfig
index 9be0b56..6ba1a8e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection. If in doubt, say N.
+config BLK_CGROUP
+ bool
+ depends on CGROUPS
+ default n
+ ---help---
+ Generic block IO controller cgroup interface. This is the common
+ cgroup interface which should be used by various IO controlling
+ policies.
+
+ Currently, CFQ IO scheduler uses it to recognize task groups and
+ control disk bandwidth allocation (proportional time slice allocation)
+ to such task groups.
+
endif # BLOCK
config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..a521c69 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,14 @@ config IOSCHED_CFQ
working environment, suitable for desktop systems.
This is the default I/O scheduler.
+config CFQ_GROUP_IOSCHED
+ bool "CFQ Group Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select BLK_CGROUP
+ default n
+ ---help---
+ Enable group IO scheduling in CFQ.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/Makefile b/block/Makefile
index ba74ca6..16334c9 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o
obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
+obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
new file mode 100644
index 0000000..7bde5c4
--- /dev/null
+++ b/block/blk-cgroup.c
@@ -0,0 +1,199 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+#include <linux/ioprio.h>
+#include "blk-cgroup.h"
+
+struct blkio_cgroup blkio_root_cgroup = {
+ .weight = BLKIO_WEIGHT_DEFAULT,
+ .ioprio_class = IOPRIO_CLASS_BE,
+};
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkcg->lock, flags);
+ rcu_assign_pointer(blkg->key, key);
+ hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+}
+
+int blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ /* Implemented later */
+ return 0;
+}
+
+/* called under rcu_read_lock(). */
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
+{
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ __key = blkg->key;
+ if (__key == key)
+ return blkg;
+ }
+
+ return NULL;
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ return (u64)blkcg->__VAR; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+static int
+blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->weight = (unsigned int)val;
+ return 0;
+}
+
+static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
+ struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->ioprio_class = (unsigned int)val;
+ return 0;
+}
+
+struct cftype blkio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = blkiocg_weight_read,
+ .write_u64 = blkiocg_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = blkiocg_ioprio_class_read,
+ .write_u64 = blkiocg_ioprio_class_write,
+ },
+};
+
+static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+ free_css_id(&blkio_subsys, &blkcg->css);
+ kfree(blkcg);
+}
+
+static struct cgroup_subsys_state *
+blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg, *parent_blkcg;
+
+ if (!cgroup->parent) {
+ blkcg = &blkio_root_cgroup;
+ goto done;
+ }
+
+ /* Currently we do not support hierarchy deeper than two level (0,1) */
+ parent_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+ if (css_depth(&parent_blkcg->css) > 0)
+ return ERR_PTR(-EINVAL);
+
+ blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
+ if (!blkcg)
+ return ERR_PTR(-ENOMEM);
+done:
+ spin_lock_init(&blkcg->lock);
+ INIT_HLIST_HEAD(&blkcg->blkg_list);
+ blkcg->weight = BLKIO_WEIGHT_DEFAULT;
+ blkcg->ioprio_class = IOPRIO_CLASS_BE;
+
+ return &blkcg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures. For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc.
+ */
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+struct cgroup_subsys blkio_subsys = {
+ .name = "blkio",
+ .create = blkiocg_create,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
+ .destroy = blkiocg_destroy,
+ .populate = blkiocg_populate,
+ .subsys_id = blkio_subsys_id,
+ .use_id = 1,
+};
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
new file mode 100644
index 0000000..49ca84b
--- /dev/null
+++ b/block/blk-cgroup.h
@@ -0,0 +1,38 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#include <linux/cgroup.h>
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ unsigned int weight;
+ unsigned short ioprio_class;
+ spinlock_t lock;
+ struct hlist_head blkg_list;
+};
+
+struct blkio_group {
+ /* An rcu protected unique identifier for the group */
+ void *key;
+ struct hlist_node blkcg_node;
+};
+
+#define BLKIO_WEIGHT_MIN 100
+#define BLKIO_WEIGHT_MAX 1000
+#define BLKIO_WEIGHT_DEFAULT 500
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key);
+int blkiocg_del_blkio_group(struct blkio_group *blkg);
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 323ed12..bc99163 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include "blk-cgroup.h"
/*
* tunables
@@ -29,9 +30,6 @@ static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
#define IO_IOPRIO_CLASSES 3
-#define CFQ_WEIGHT_MIN 100
-#define CFQ_WEIGHT_MAX 1000
-#define CFQ_WEIGHT_DEFAULT 500
#define CFQ_SERVICE_SHIFT 12
/*
@@ -139,6 +137,9 @@ struct cfq_queue {
/* Per cgroup grouping structure */
struct cfq_group {
struct cfq_sched_data sched_data;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ struct blkio_group blkg;
+#endif
};
/*
@@ -296,7 +297,7 @@ cfq_delta_fair(unsigned long delta, struct cfq_entity *cfqe)
{
u64 d = delta << CFQ_SERVICE_SHIFT;
- return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqe->weight);
+ return cfq_delta(d, BLKIO_WEIGHT_DEFAULT, cfqe->weight);
}
static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -338,7 +339,7 @@ static inline unsigned int cfq_ioprio_to_weight(int ioprio)
{
WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
/* Map prio 7 - 0 to weights 200 to 900 */
- return CFQ_WEIGHT_DEFAULT + (CFQ_WEIGHT_DEFAULT/5 * (4 - ioprio));
+ return BLKIO_WEIGHT_DEFAULT + (BLKIO_WEIGHT_DEFAULT/5 * (4 - ioprio));
}
static inline int
@@ -346,9 +347,9 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
{
const int base_slice = cfqd->cfq_slice[sync];
- WARN_ON(weight > CFQ_WEIGHT_MAX);
+ WARN_ON(weight > BLKIO_WEIGHT_MAX);
- return cfq_delta(base_slice, weight, CFQ_WEIGHT_DEFAULT);
+ return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
}
static inline int rq_in_driver(struct cfq_data *cfqd)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..ccefff0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
#endif
/* */
+
+#ifdef CONFIG_BLK_CGROUP
+SUBSYS(blkio)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 4da4a75..5357d5c 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,10 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;
+#ifdef CONFIG_BLK_CGROUP
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
--
1.6.2.5
o This patch embeds a cfq_entity object in cfq_group and provides helper routines
so that group entities can be scheduled (see the sketch below).
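A minimal sketch of how the embedded entity is told apart (mirroring the
cfqq_of()/cfqg_of() helpers in the patch below; the helper names here are made
up for illustration): an entity whose my_sd pointer is set is a group entity
carrying its own scheduling data, otherwise it is a queue entity, and parent
pointers let common code walk from a queue up to its group:

	/* Sketch only, simplified from the helpers added below */
	static inline bool cfqe_is_group(struct cfq_entity *cfqe)
	{
		return cfqe->my_sd != NULL;	/* groups carry their own sched_data */
	}

	static inline struct cfq_entity *cfqe_topmost(struct cfq_entity *cfqe)
	{
		while (cfqe->parent)		/* queue -> group -> ... -> root group */
			cfqe = cfqe->parent;
		return cfqe;
	}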
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 110 +++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 93 insertions(+), 17 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index bc99163..8ec8a82 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -79,6 +79,7 @@ struct cfq_service_tree {
#define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
struct cfq_sched_data {
+ unsigned int nr_active;
struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
};
@@ -89,6 +90,10 @@ struct cfq_entity {
struct cfq_service_tree *st;
unsigned short ioprio_class;
bool ioprio_class_changed;
+ struct cfq_entity *parent;
+ bool on_st;
+ /* Points to the sched_data of group entity. Null for cfqq */
+ struct cfq_sched_data *my_sd;
};
/*
@@ -136,6 +141,7 @@ struct cfq_queue {
/* Per cgroup grouping structure */
struct cfq_group {
+ struct cfq_entity entity;
struct cfq_sched_data sched_data;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct blkio_group blkg;
@@ -260,9 +266,23 @@ static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
static void cfq_put_queue(struct cfq_queue *cfqq);
static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st);
+static inline struct cfq_entity *parent_entity(struct cfq_entity *cfqe)
+{
+ return cfqe->parent;
+}
+
static inline struct cfq_queue *cfqq_of(struct cfq_entity *cfqe)
{
- return container_of(cfqe, struct cfq_queue, entity);
+ if (!cfqe->my_sd)
+ return container_of(cfqe, struct cfq_queue, entity);
+ return NULL;
+}
+
+static inline struct cfq_group *cfqg_of(struct cfq_entity *cfqe)
+{
+ if (cfqe->my_sd)
+ return container_of(cfqe, struct cfq_group, entity);
+ return NULL;
}
static inline void
@@ -352,6 +372,33 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
}
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity) \
+ for (; entity && entity->parent; entity = entity->parent)
+
+static inline struct cfq_sched_data *
+cfq_entity_sched_data(struct cfq_entity *cfqe)
+{
+ return &cfqg_of(parent_entity(cfqe))->sched_data;
+}
+#else /* CONFIG_CFQ_GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
+{
+ return cfqq_of(cfqe)->cfqd;
+}
+
+static inline struct cfq_sched_data *
+cfq_entity_sched_data(struct cfq_entity *cfqe)
+{
+ struct cfq_data *cfqd = cfqd_of(cfqe);
+
+ return &cfqd->root_group.sched_data;
+}
+#endif /* CONFIG_CFQ_GROUP_IOSCHED */
+
static inline int rq_in_driver(struct cfq_data *cfqd)
{
return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
@@ -606,16 +653,28 @@ static void __dequeue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe)
static void dequeue_cfqe(struct cfq_entity *cfqe)
{
struct cfq_service_tree *st = cfqe->st;
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
if (st->active == cfqe)
st->active = NULL;
__dequeue_cfqe(st, cfqe);
+ sd->nr_active--;
+ cfqe->on_st = 0;
}
static void dequeue_cfqq(struct cfq_queue *cfqq)
{
- dequeue_cfqe(&cfqq->entity);
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
+
+ dequeue_cfqe(cfqe);
+ /* Do not dequeue parent if it has other entities under it */
+ if (sd->nr_active)
+ break;
+ }
}
static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
@@ -653,6 +712,10 @@ static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
static void enqueue_cfqe(struct cfq_entity *cfqe)
{
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
+
+ cfqe->on_st = 1;
+ sd->nr_active++;
cfqe_update_ioprio_class(cfqe);
place_cfqe(cfqe->st, cfqe, 0);
__enqueue_cfqe(cfqe->st, cfqe, 0);
@@ -660,7 +723,13 @@ static void enqueue_cfqe(struct cfq_entity *cfqe)
static void enqueue_cfqq(struct cfq_queue *cfqq)
{
- enqueue_cfqe(&cfqq->entity);
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ if (cfqe->on_st)
+ break;
+ enqueue_cfqe(cfqe);
+ }
}
/* Requeue a cfqq which is already on the service tree */
@@ -687,14 +756,22 @@ static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
{
- /*
- * Can't update entity disk time while it is on sorted rb-tree
- * as vdisktime is used as key.
- */
- __dequeue_cfqe(cfqe->st, cfqe);
- cfqe->vdisktime += cfq_delta_fair(served, cfqe);
- update_min_vdisktime(cfqe->st);
- __enqueue_cfqe(cfqe->st, cfqe, 0);
+ for_each_entity(cfqe) {
+ /*
+ * Can't update entity disk time while it is on sorted rb-tree
+ * as vdisktime is used as key.
+ */
+ __dequeue_cfqe(cfqe->st, cfqe);
+ cfqe->vdisktime += cfq_delta_fair(served, cfqe);
+ update_min_vdisktime(cfqe->st);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
+
+ /* If entity prio class has changed, take that into account */
+ if (unlikely(cfqe->ioprio_class_changed)) {
+ dequeue_cfqe(cfqe);
+ enqueue_cfqe(cfqe);
+ }
+ }
}
static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
@@ -708,12 +785,6 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
*/
served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
cfqe_served(&cfqq->entity, served);
-
- /* If cfqq prio class has changed, take that into account */
- if (unlikely(cfqq->entity.ioprio_class_changed)) {
- dequeue_cfqq(cfqq);
- enqueue_cfqq(cfqq);
- }
}
/*
@@ -1941,6 +2012,8 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_mark_cfqq_sync(cfqq);
}
cfqq->pid = pid;
+ cfqq->entity.parent = &cfqd->root_group.entity;
+ cfqq->entity.my_sd = NULL;
}
static struct cfq_queue *
@@ -2707,6 +2780,9 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
struct cfq_group *cfqg = &cfqd->root_group;
int i;
+ cfqg->entity.parent = NULL;
+ cfqg->entity.my_sd = &cfqg->sched_data;
+
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
}
--
1.6.2.5
o So far we assumed there is only one cfq_group in the system (the root group).
This patch introduces the code to map requests to their cgroup, create more
cfq_groups dynamically, and keep track of these groups (a simplified sketch of
the lookup path follows below).
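A simplified sketch of the lookup path added below: the current task is mapped
to its blkio cgroup, the cgroup's blkio_group list is searched using the
cfq_data pointer as key, and the containing cfq_group is recovered (or
allocated on demand):

	/* Sketch only, condensed from cfq_get_cfqg()/cfq_find_alloc_cfqg() below */
	static struct cfq_group *cfq_get_cfqg_sketch(struct cfq_data *cfqd, int create)
	{
		struct cgroup *cgroup;
		struct blkio_cgroup *blkcg;
		struct blkio_group *blkg;
		struct cfq_group *cfqg = NULL;

		rcu_read_lock();
		cgroup = task_cgroup(current, blkio_subsys_id);
		blkcg = cgroup_to_blkio_cgroup(cgroup);
		blkg = blkiocg_lookup_group(blkcg, cfqd);	/* key is the cfq_data pointer */
		if (blkg)
			cfqg = container_of(blkg, struct cfq_group, blkg);
		/* if (!cfqg && create): allocate, init and add to cgroup and cfqd lists */
		rcu_read_unlock();
		return cfqg;
	}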
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 111 insertions(+), 12 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8ec8a82..4481917 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -145,6 +145,7 @@ struct cfq_group {
struct cfq_sched_data sched_data;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct blkio_group blkg;
+ struct hlist_node cfqd_node;
#endif
};
@@ -212,6 +213,9 @@ struct cfq_data {
struct cfq_queue oom_cfqq;
unsigned long last_end_sync_rq;
+
+ /* List of cfq groups being managed on this device */
+ struct hlist_head cfqg_list;
};
enum cfqq_state_flags {
@@ -286,13 +290,14 @@ static inline struct cfq_group *cfqg_of(struct cfq_entity *cfqe)
}
static inline void
-init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
+init_cfqe_service_tree(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
{
+ struct cfq_group *p_cfqg = cfqg_of(p_cfqe);
unsigned short idx = cfqe->ioprio_class - 1;
BUG_ON(idx >= IO_IOPRIO_CLASSES);
- cfqe->st = &cfqd->root_group.sched_data.service_tree[idx];
+ cfqe->st = &p_cfqg->sched_data.service_tree[idx];
}
static inline s64
@@ -372,16 +377,93 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
}
+static inline void
+cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
+{
+ cfqe->parent = p_cfqe;
+ init_cfqe_service_tree(cfqe, p_cfqe);
+}
+
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/* check for entity->parent so that loop is not executed for root entity. */
#define for_each_entity(entity) \
for (; entity && entity->parent; entity = entity->parent)
+static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
+{
+ if (blkg)
+ return container_of(blkg, struct cfq_group, blkg);
+ return NULL;
+}
+
static inline struct cfq_sched_data *
cfq_entity_sched_data(struct cfq_entity *cfqe)
{
return &cfqg_of(parent_entity(cfqe))->sched_data;
}
+
+static void cfq_init_cfqg(struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
+{
+ struct cfq_entity *cfqe = &cfqg->entity;
+
+ cfqe->weight = blkcg->weight;
+ cfqe->ioprio_class = blkcg->ioprio_class;
+ cfqe->ioprio_class_changed = 1;
+ cfqe->my_sd = &cfqg->sched_data;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+
+ /* Do we need to take this reference? */
+ if (!css_tryget(&blkcg->css))
+ return NULL;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg || !create)
+ goto done;
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC | __GFP_ZERO,
+ cfqd->queue->node);
+ if (!cfqg)
+ goto done;
+
+ cfq_init_cfqg(cfqg, blkcg);
+ cfq_init_cfqe_parent(&cfqg->entity, &cfqd->root_group.entity);
+
+ /* Add group onto cgroup list */
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+
+done:
+ css_put(&blkcg->css);
+ return cfqg;
+}
+
+/*
+ * Search for the cfq group current task belongs to. If create = 1, then also
+ * create the cfq group if it does not exist.
+ * Should be called under request queue lock.
+ */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ struct cgroup *cgroup;
+ struct cfq_group *cfqg = NULL;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
+ if (!cfqg && create)
+ cfqg = &cfqd->root_group;
+ rcu_read_unlock();
+ return cfqg;
+}
#else /* CONFIG_CFQ_GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -397,6 +479,11 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
return &cfqd->root_group.sched_data;
}
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
#endif /* CONFIG_CFQ_GROUP_IOSCHED */
static inline int rq_in_driver(struct cfq_data *cfqd)
@@ -624,14 +711,11 @@ place_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe, int add_front)
static inline void cfqe_update_ioprio_class(struct cfq_entity *cfqe)
{
if (unlikely(cfqe->ioprio_class_changed)) {
- struct cfq_queue *cfqq = cfqq_of(cfqe);
- struct cfq_data *cfqd = cfqq->cfqd;
-
/*
* Re-initialize the service tree pointer as ioprio class
* change will lead to service tree change.
*/
- init_cfqe_service_tree(cfqd, cfqe);
+ init_cfqe_service_tree(cfqe, parent_entity(cfqe));
cfqe->ioprio_class_changed = 0;
cfqe->vdisktime = 0;
}
@@ -1229,16 +1313,19 @@ static struct cfq_entity *cfq_get_next_entity(struct cfq_sched_data *sd)
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
struct cfq_entity *cfqe = NULL;
+ struct cfq_sched_data *sd;
if (!cfqd->rq_queued)
return NULL;
- cfqe = cfq_get_next_entity(&cfqd->root_group.sched_data);
+ sd = &cfqd->root_group.sched_data;
+ for (; sd ; sd = cfqe->my_sd) {
+ cfqe = cfq_get_next_entity(sd);
+ if (!cfqe)
+ return NULL;
+ }
- if (cfqe)
- return cfqq_of(cfqe);
- else
- return NULL;
+ return cfqq_of(cfqe);
}
/*
@@ -2012,8 +2099,17 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_mark_cfqq_sync(cfqq);
}
cfqq->pid = pid;
- cfqq->entity.parent = &cfqd->root_group.entity;
+}
+
+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
cfqq->entity.my_sd = NULL;
+
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfq_init_cfqe_parent(&cfqq->entity, &cfqg->entity);
}
static struct cfq_queue *
@@ -2022,8 +2118,10 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
+ struct cfq_group *cfqg;
retry:
+ cfqg = cfq_get_cfqg(cfqd, 1);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -2054,6 +2152,7 @@ retry:
if (cfqq) {
cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, ioc);
+ cfq_link_cfqq_cfqg(cfqq, cfqg);
cfq_log_cfqq(cfqd, cfqq, "alloced");
} else
cfqq = &cfqd->oom_cfqq;
--
1.6.2.5
o If a user decides to change the weight or ioprio class of a cgroup, this
information also needs to be passed on to the IO controlling policy module so
that the new values can take effect.
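The effect on CFQ: once the new weight reaches the cfq_group's entity, the
group's vdisktime advances roughly in inverse proportion to its weight. So,
with the default weight of 500, a group bumped to weight 1000 accumulates
vdisktime half as fast for the same service and ends up with roughly twice the
disk time of a default-weight group on a contended disk (the numbers here are
only to illustrate the proportion).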
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 16 ++++++++++++++++
block/cfq-iosched.c | 18 ++++++++++++++++++
2 files changed, 34 insertions(+), 0 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7bde5c4..0d52a2c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -13,6 +13,10 @@
#include <linux/ioprio.h>
#include "blk-cgroup.h"
+extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
+extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
+ unsigned short);
+
struct blkio_cgroup blkio_root_cgroup = {
.weight = BLKIO_WEIGHT_DEFAULT,
.ioprio_class = IOPRIO_CLASS_BE,
@@ -75,12 +79,18 @@ static int
blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
{
struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
return -EINVAL;
blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
blkcg->weight = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_weight(blkg, blkcg->weight);
+ spin_unlock_irq(&blkcg->lock);
return 0;
}
@@ -88,12 +98,18 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
struct cftype *cftype, u64 val)
{
struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
return -EINVAL;
blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
blkcg->ioprio_class = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_ioprio_class(blkg, blkcg->ioprio_class);
+ spin_unlock_irq(&blkcg->lock);
return 0;
}
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4481917..3c0fa1b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -464,6 +464,24 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
rcu_read_unlock();
return cfqg;
}
+
+void
+cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
+{
+ struct cfq_group *cfqg = cfqg_of_blkg(blkg);
+
+ cfqg->entity.weight = weight;
+}
+
+void cfq_update_blkio_group_ioprio_class(struct blkio_group *blkg,
+ unsigned short ioprio_class)
+{
+ struct cfq_group *cfqg = cfqg_of_blkg(blkg);
+
+ cfqg->entity.ioprio_class = ioprio_class;
+ smp_wmb();
+ cfqg->entity.ioprio_class_changed = 1;
+}
#else /* CONFIG_CFQ_GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
--
1.6.2.5
o With dynamic cfq_groups comes the need to make sure cfq_groups can be
freed when either the elevator exits or someone deletes the cgroup.
o This patch handles the elevator exit and cgroup deletion paths and also
implements cfq_group reference counting so that a cgroup can be removed
even if there are backlogged requests in the associated cfq_groups (see the
lifecycle sketch below).
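The reference scheme, in brief: a group is created with one reference held
jointly by the cgroup and the elevator, every cfq_queue linked to the group
takes an additional reference, and the group memory is freed only when the
last reference is dropped. A rough lifecycle sketch (function names as in the
patch below):

	cfq_find_alloc_cfqg()     ref = 1   creation (joint cgroup/elevator) reference
	cfq_link_cfqq_cfqg()      ref++     one reference per cfqq in the group
	cfq_put_queue()           ref--     queue goes away
	cfq_destroy_cfqg()        ref--     elevator exit or cgroup deletion,
	                                    whichever happens first
	ref reaches 0             kfree(cfqg)

So the cgroup can be deleted while its cfq_groups still hold backlogged
requests; the groups linger until their queues drain and the last reference is
put.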
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
---
block/blk-cgroup.c | 66 +++++++++++++++++++++++-
block/blk-cgroup.h | 2 +
block/cfq-iosched.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 208 insertions(+), 3 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0d52a2c..a62b8a3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -16,6 +16,7 @@
extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
unsigned short);
+extern void cfq_delink_blkio_group(void *, struct blkio_group *);
struct blkio_cgroup blkio_root_cgroup = {
.weight = BLKIO_WEIGHT_DEFAULT,
@@ -35,14 +36,43 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
spin_lock_irqsave(&blkcg->lock, flags);
rcu_assign_pointer(blkg->key, key);
+ blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
}
+static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ hlist_del_init_rcu(&blkg->blkcg_node);
+ blkg->blkcg_id = 0;
+}
+
+/*
+ * Returns 0 if the blkio_group was still on the cgroup list. Otherwise returns 1,
+ * indicating that the blkio_group was unhashed by the time we got to it.
+ */
int blkiocg_del_blkio_group(struct blkio_group *blkg)
{
- /* Implemented later */
- return 0;
+ struct blkio_cgroup *blkcg;
+ unsigned long flags;
+ struct cgroup_subsys_state *css;
+ int ret = 1;
+
+ rcu_read_lock();
+ css = css_lookup(&blkio_subsys, blkg->blkcg_id);
+ if (!css)
+ goto out;
+
+ blkcg = container_of(css, struct blkio_cgroup, css);
+ spin_lock_irqsave(&blkcg->lock, flags);
+ if (!hlist_unhashed(&blkg->blkcg_node)) {
+ __blkiocg_del_blkio_group(blkg);
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+out:
+ rcu_read_unlock();
+ return ret;
}
/* called under rcu_read_lock(). */
@@ -135,8 +165,40 @@ static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ unsigned long flags;
+ struct blkio_group *blkg;
+ void *key;
+ rcu_read_lock();
+remove_entry:
+ spin_lock_irqsave(&blkcg->lock, flags);
+
+ if (hlist_empty(&blkcg->blkg_list)) {
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+ goto done;
+ }
+
+ blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
+ blkcg_node);
+ key = rcu_dereference(blkg->key);
+ __blkiocg_del_blkio_group(blkg);
+
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+
+ /*
+ * This blkio_group is being delinked as associated cgroup is going
+ * away. Let all the IO controlling policies know about this event.
+ *
+ * Currently this is static call to one io controlling policy. Once
+ * we have more policies in place, we need some dynamic registration
+ * of callback function.
+ */
+ cfq_delink_blkio_group(key, blkg);
+ goto remove_entry;
+done:
free_css_id(&blkio_subsys, &blkcg->css);
+ rcu_read_unlock();
+
kfree(blkcg);
}
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 49ca84b..2bf736b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -25,12 +25,14 @@ struct blkio_group {
/* An rcu protected unique identifier for the group */
void *key;
struct hlist_node blkcg_node;
+ unsigned short blkcg_id;
};
#define BLKIO_WEIGHT_MIN 100
#define BLKIO_WEIGHT_MAX 1000
#define BLKIO_WEIGHT_DEFAULT 500
+extern struct blkio_cgroup blkio_root_cgroup;
struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3c0fa1b..b9a052b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -146,6 +146,7 @@ struct cfq_group {
#ifdef CONFIG_CFQ_GROUP_IOSCHED
struct blkio_group blkg;
struct hlist_node cfqd_node;
+ atomic_t ref;
#endif
};
@@ -295,8 +296,18 @@ init_cfqe_service_tree(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
struct cfq_group *p_cfqg = cfqg_of(p_cfqe);
unsigned short idx = cfqe->ioprio_class - 1;
- BUG_ON(idx >= IO_IOPRIO_CLASSES);
+ /*
+ * ioprio class of the entity has not been initialized yet, don't
+ * init service tree right now. This can happen in the case of
+ * oom_cfqq, which will inherit its class and prio once the first
+ * request gets queued; at that point the prio update will make sure
+ * that the service tree gets initialized before the queue goes onto
+ * the tree.
+ */
+ if (cfqe->ioprio_class == IOPRIO_CLASS_NONE)
+ return;
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
cfqe->st = &p_cfqg->sched_data.service_tree[idx];
}
@@ -402,6 +413,16 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
return &cfqg_of(parent_entity(cfqe))->sched_data;
}
+static inline struct cfq_group *cfqq_to_cfqg(struct cfq_queue *cfqq)
+{
+ return cfqg_of(parent_entity(&cfqq->entity));
+}
+
+static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg)
+{
+ atomic_inc(&cfqg->ref);
+}
+
static void cfq_init_cfqg(struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
{
struct cfq_entity *cfqe = &cfqg->entity;
@@ -435,6 +456,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
cfq_init_cfqg(cfqg, blkcg);
cfq_init_cfqe_parent(&cfqg->entity, &cfqd->root_group.entity);
+ /*
+ * Take the initial reference that will be released on destroy.
+ * This can be thought of as a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ cfq_get_cfqg_ref(cfqg);
+
/* Add group onto cgroup list */
blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
@@ -482,9 +511,87 @@ void cfq_update_blkio_group_ioprio_class(struct blkio_group *blkg,
smp_wmb();
cfqg->entity.ioprio_class_changed = 1;
}
+
+static void cfq_put_cfqg(struct cfq_group *cfqg)
+{
+ struct cfq_service_tree *st;
+ int i;
+
+ BUG_ON(atomic_read(&cfqg->ref) <= 0);
+ if (!atomic_dec_and_test(&cfqg->ref))
+ return;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = cfqg->sched_data.service_tree + i;
+ BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ BUG_ON(st->active != NULL);
+ }
+
+ kfree(cfqg);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ /* Something wrong if we are trying to remove same group twice */
+ BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
+
+ hlist_del_init(&cfqg->cfqd_node);
+
+ /*
+ * Put the reference taken at the time of creation so that when all
+ * queues are gone, group can be destroyed.
+ */
+ cfq_put_cfqg(cfqg);
+}
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd)
+{
+ struct hlist_node *pos, *n;
+ struct cfq_group *cfqg;
+
+ hlist_for_each_entry_safe(cfqg, pos, n, &cfqd->cfqg_list, cfqd_node) {
+ /*
+ * If cgroup removal path got to blk_group first and removed
+ * it from cgroup list, then it will take care of destroying
+ * cfqg also.
+ */
+ if (!blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+ }
+}
+
+/*
+ * Blk cgroup controller notification saying that blkio_group object is being
+ * delinked as associated cgroup object is going away. That also means that
+ * no new IO will come in this group. So get rid of this group as soon as
+ * any pending IO in the group is finished.
+ *
+ * This function is called under rcu_read_lock(). key is the rcu protected
+ * pointer. That means "key" is a valid cfq_data pointer as long as we are
+ * under rcu read lock.
+ *
+ * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * it should not be NULL as even if the elevator was exiting, the cgroup deletion
+ * path got to it first.
+ */
+void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
+{
+ unsigned long flags;
+ struct cfq_data *cfqd = key;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+}
+
#else /* CONFIG_CFQ_GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
+static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
+static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+
static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
{
return cfqq_of(cfqe)->cfqd;
@@ -498,6 +605,11 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
return &cfqd->root_group.sched_data;
}
+static inline struct cfq_group *cfqq_to_cfqg(struct cfq_queue *cfqq)
+{
+ return &cfqq->cfqd->root_group;
+}
+
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
return &cfqd->root_group;
@@ -1818,11 +1930,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
* task holds one reference to the queue, dropped when task exits. each rq
* in-flight on this queue also holds a reference, dropped when rq is freed.
*
+ * Each cfq queue took a reference on the parent group. Drop it now.
* queue lock must be held here.
*/
static void cfq_put_queue(struct cfq_queue *cfqq)
{
struct cfq_data *cfqd = cfqq->cfqd;
+ struct cfq_group *cfqg;
BUG_ON(atomic_read(&cfqq->ref) <= 0);
@@ -1832,6 +1946,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
cfq_log_cfqq(cfqd, cfqq, "put_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
+ cfqg = cfqq_to_cfqg(cfqq);
if (unlikely(cfqd->active_queue == cfqq)) {
__cfq_slice_expired(cfqd, cfqq);
@@ -1841,6 +1956,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
+ cfq_put_cfqg(cfqg);
}
/*
@@ -2128,6 +2244,9 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
cfqg = &cfqq->cfqd->root_group;
cfq_init_cfqe_parent(&cfqq->entity, &cfqg->entity);
+
+ /* cfqq reference on cfqg */
+ cfq_get_cfqg_ref(cfqg);
}
static struct cfq_queue *
@@ -2902,6 +3021,23 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ atomic_set(&cfqg->ref, 0);
+ /*
+ * Take a reference to root group which we never drop. This is just
+ * to make sure that cfq_put_cfqg() does not try to kfree root group
+ */
+ cfq_get_cfqg_ref(cfqg);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+#endif
+}
+
+static void cfq_exit_root_group(struct cfq_data *cfqd)
+{
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ blkiocg_del_blkio_group(&cfqd->root_group.blkg);
+#endif
}
static void cfq_exit_queue(struct elevator_queue *e)
@@ -2926,10 +3062,14 @@ static void cfq_exit_queue(struct elevator_queue *e)
cfq_put_async_queues(cfqd);
+ cfq_release_cfq_groups(cfqd);
+ cfq_exit_root_group(cfqd);
spin_unlock_irq(q->queue_lock);
cfq_shutdown_timer_wq(cfqd);
+ /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
+ synchronize_rcu();
kfree(cfqd);
}
@@ -2959,6 +3099,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
atomic_inc(&cfqd->oom_cfqq.ref);
+ cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
INIT_LIST_HEAD(&cfqd->cic_list);
--
1.6.2.5
o Some CFQ debugging aids (an example of the resulting blktrace output follows below).
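With CONFIG_DEBUG_CFQ_IOSCHED enabled, the cfq_log_cfqq()/cfq_log_cfqe()
blktrace messages below include the sync/async flag and the cgroup path of the
group a queue belongs to, so a message that used to look like
"cfq1234 arm_idle: 8" would look like "cfq1234S /test1 arm_idle: 8" (the pid,
cgroup path and value are made up for illustration).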
Signed-off-by: Vivek Goyal <[email protected]>
---
block/Kconfig | 9 +++++++++
block/Kconfig.iosched | 9 +++++++++
block/blk-cgroup.c | 4 ++++
block/blk-cgroup.h | 13 +++++++++++++
block/cfq-iosched.c | 33 +++++++++++++++++++++++++++++++++
5 files changed, 68 insertions(+), 0 deletions(-)
diff --git a/block/Kconfig b/block/Kconfig
index 6ba1a8e..e20fbde 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -90,6 +90,15 @@ config BLK_CGROUP
control disk bandwidth allocation (proportional time slice allocation)
to such task groups.
+config DEBUG_BLK_CGROUP
+ bool
+ depends on BLK_CGROUP
+ default n
+ ---help---
+ Enable some debugging help. Currently it stores the cgroup path
+ in the blkio group which can be used by CFQ for tracing various
+ group-related activity.
+
endif # BLOCK
config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a521c69..9c5f0b5 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -48,6 +48,15 @@ config CFQ_GROUP_IOSCHED
---help---
Enable group IO scheduling in CFQ.
+config DEBUG_CFQ_IOSCHED
+ bool "Debug CFQ Scheduling"
+ depends on CFQ_GROUP_IOSCHED
+ select DEBUG_BLK_CGROUP
+ default n
+ ---help---
+ Enable IO scheduling debugging in CFQ. Currently it makes
+ blktrace output more verbose.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a62b8a3..4c68682 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -39,6 +39,10 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
blkg->blkcg_id = css_id(&blkcg->css);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
spin_unlock_irqrestore(&blkcg->lock, flags);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Need to take css reference? */
+ cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
+#endif
}
static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2bf736b..cb72c35 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -26,12 +26,25 @@ struct blkio_group {
void *key;
struct hlist_node blkcg_node;
unsigned short blkcg_id;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Store cgroup path */
+ char path[128];
+#endif
};
#define BLKIO_WEIGHT_MIN 100
#define BLKIO_WEIGHT_MAX 1000
#define BLKIO_WEIGHT_DEFAULT 500
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+static inline char *blkg_path(struct blkio_group *blkg)
+{
+ return blkg->path;
+}
+#else
+static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+#endif
+
extern struct blkio_cgroup blkio_root_cgroup;
struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b9a052b..2fde3c4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -258,8 +258,29 @@ CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS
+#ifdef CONFIG_DEBUG_CFQ_IOSCHED
+#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
+ cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);
+
+#define cfq_log_cfqe(cfqd, cfqe, fmt, args...) \
+ if (cfqq_of(cfqe)) { \
+ struct cfq_queue *cfqq = cfqq_of(cfqe); \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, \
+ (cfqq)->pid, cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);\
+ } else { \
+ struct cfq_group *cfqg = cfqg_of(cfqe); \
+ blk_add_trace_msg((cfqd)->queue, "%s " fmt, \
+ blkg_path(&(cfqg)->blkg), ##args); \
+ }
+#else
#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+#define cfq_log_cfqe(cfqd, cfqe, fmt, args...)
+#endif
+
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
@@ -400,6 +421,8 @@ cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
#define for_each_entity(entity) \
for (; entity && entity->parent; entity = entity->parent)
+#define cfqe_is_cfqq(cfqe) (!(cfqe)->my_sd)
+
static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
{
if (blkg)
@@ -588,6 +611,8 @@ void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
+#define cfqe_is_cfqq(cfqe) 1
+
static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
@@ -885,6 +910,10 @@ static void dequeue_cfqq(struct cfq_queue *cfqq)
struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
dequeue_cfqe(cfqe);
+ if (!cfqe_is_cfqq(cfqe)) {
+ cfq_log_cfqe(cfqq->cfqd, cfqe, "del_from_rr group");
+ }
+
/* Do not dequeue parent if it has other entities under it */
if (sd->nr_active)
break;
@@ -970,6 +999,8 @@ static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
{
+ struct cfq_data *cfqd = cfqq_of(cfqe)->cfqd;
+
for_each_entity(cfqe) {
/*
* Can't update entity disk time while it is on sorted rb-tree
@@ -979,6 +1010,8 @@ static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
cfqe->vdisktime += cfq_delta_fair(served, cfqe);
update_min_vdisktime(cfqe->st);
__enqueue_cfqe(cfqe->st, cfqe, 0);
+ cfq_log_cfqe(cfqd, cfqe, "served: vt=%llx min_vt=%llx",
+ cfqe->vdisktime, cfqe->st->min_vdisktime);
/* If entity prio class has changed, take that into account */
if (unlikely(cfqe->ioprio_class_changed)) {
--
1.6.2.5
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 10 +++++++++-
block/cfq-iosched.c | 29 +++++++++++++++++++++++++++--
3 files changed, 82 insertions(+), 4 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4c68682..47c0ce7 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -11,6 +11,8 @@
* Nauman Rafique <[email protected]>
*/
#include <linux/ioprio.h>
+#include <linux/seq_file.h>
+#include <linux/kdev_t.h>
#include "blk-cgroup.h"
extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
@@ -29,8 +31,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
struct blkio_cgroup, css);
}
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+ blkg->time += time;
+ blkg->sectors += sectors;
+}
+
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
{
unsigned long flags;
@@ -43,6 +52,7 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
/* Need to take css reference ? */
cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
#endif
+ blkg->dev = dev;
}
static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
@@ -147,6 +157,33 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
return 0;
}
+#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype, struct seq_file *m) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ struct blkio_group *blkg; \
+ struct hlist_node *n; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
+ if (blkg->dev) \
+ seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
+ MINOR(blkg->dev), blkg->__VAR); \
+ } \
+ rcu_read_unlock(); \
+ cgroup_unlock(); \
+ return 0; \
+}
+
+SHOW_FUNCTION_PER_GROUP(time);
+SHOW_FUNCTION_PER_GROUP(sectors);
+#undef SHOW_FUNCTION_PER_GROUP
+
struct cftype blkio_files[] = {
{
.name = "weight",
@@ -158,6 +195,14 @@ struct cftype blkio_files[] = {
.read_u64 = blkiocg_ioprio_class_read,
.write_u64 = blkiocg_ioprio_class_write,
},
+ {
+ .name = "time",
+ .read_seq_string = blkiocg_time_read,
+ },
+ {
+ .name = "sectors",
+ .read_seq_string = blkiocg_sectors_read,
+ },
};
static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index cb72c35..08f4ef8 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -30,6 +30,12 @@ struct blkio_group {
/* Store cgroup path */
char path[128];
#endif
+ /* The device MKDEV(major, minor) this group has been created for */
+ dev_t dev;
+
+ /* total disk time and nr sectors dispatched by this group */
+ unsigned long time;
+ unsigned long sectors;
};
#define BLKIO_WEIGHT_MIN 100
@@ -48,6 +54,8 @@ static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
extern struct blkio_cgroup blkio_root_cgroup;
struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key);
+ struct blkio_group *blkg, void *key, dev_t dev);
int blkiocg_del_blkio_group(struct blkio_group *blkg);
struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key);
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2fde3c4..21d487f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -137,6 +137,8 @@ struct cfq_queue {
unsigned short org_ioprio_class;
pid_t pid;
+ /* Sectors dispatched in current dispatch round */
+ unsigned long nr_sectors;
};
/* Per cgroup grouping structure */
@@ -462,6 +464,8 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
struct cfq_group *cfqg = NULL;
void *key = cfqd;
+ unsigned int major, minor;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
/* Do we need to take this reference */
if (!css_tryget(&blkcg->css))
@@ -488,7 +492,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
cfq_get_cfqg_ref(cfqg);
/* Add group onto cgroup list */
- blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));
/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -607,6 +613,18 @@ void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
}
+static void cfq_update_cfqq_stats(struct cfq_queue *cfqq,
+ unsigned long slice_used)
+{
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ struct cfq_group *cfqg = cfqg_of(parent_entity(cfqe));
+ blkiocg_update_blkio_group_stats(&cfqg->blkg, slice_used,
+ cfqq->nr_sectors);
+ }
+}
+
#else /* CONFIG_CFQ_GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -639,6 +657,9 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
return &cfqd->root_group;
}
+
+static inline void cfq_update_cfqq_stats(struct cfq_queue *cfqq,
+ unsigned long slice_used) {}
#endif /* CONFIG_CFQ_GROUP_IOSCHED */
static inline int rq_in_driver(struct cfq_data *cfqd)
@@ -1380,6 +1401,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->slice_start = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;
cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1418,6 +1440,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
slice_used = jiffies - cfqq->slice_start;
cfq_log_cfqq(cfqd, cfqq, "sl_used=%ld", slice_used);
+ cfq_update_cfqq_stats(cfqq, slice_used);
if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
cfq_del_cfqq_rr(cfqd, cfqq);
@@ -1688,6 +1711,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
}
/*
@@ -3062,7 +3086,8 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
* to make sure that cfq_put_cfqg() does not try to kfree root group
*/
cfq_get_cfqg_ref(cfqg);
- blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd,
+ 0);
#endif
}
--
1.6.2.5
o "dequeue" is a debugging interface which keeps track how many times a group
was dequeued from service tree. This helps if a group is not getting its
fair share.
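Like blkio.time and blkio.sectors, reading blkio.dequeue from a cgroup
directory gives one line per device in "major:minor value" format, e.g.
"8:16 4" (made-up numbers), meaning the group on device 8:16 has been removed
from the service tree four times.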
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 17 +++++++++++++++++
block/blk-cgroup.h | 6 ++++++
block/cfq-iosched.c | 6 ++++++
3 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 47c0ce7..6a46156 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,8 +182,19 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
SHOW_FUNCTION_PER_GROUP(time);
SHOW_FUNCTION_PER_GROUP(sectors);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+SHOW_FUNCTION_PER_GROUP(dequeue);
+#endif
#undef SHOW_FUNCTION_PER_GROUP
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue)
+{
+ blkg->dequeue += dequeue;
+}
+#endif
+
struct cftype blkio_files[] = {
{
.name = "weight",
@@ -203,6 +214,12 @@ struct cftype blkio_files[] = {
.name = "sectors",
.read_seq_string = blkiocg_sectors_read,
},
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ {
+ .name = "dequeue",
+ .read_seq_string = blkiocg_dequeue_read,
+ },
+#endif
};
static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 08f4ef8..4ca101d 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -29,6 +29,8 @@ struct blkio_group {
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Store cgroup path */
char path[128];
+ /* How many times this group has been removed from service tree */
+ unsigned long dequeue;
#endif
/* The device MKDEV(major, minor), this group has been created for */
dev_t dev;
@@ -47,8 +49,12 @@ static inline char *blkg_path(struct blkio_group *blkg)
{
return blkg->path;
}
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue);
#else
static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+static inline void blkiocg_update_blkio_group_dequeue_stats(
+ struct blkio_group *blkg, unsigned long dequeue) {}
#endif
extern struct blkio_cgroup blkio_root_cgroup;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 21d487f..6936519 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -921,6 +921,12 @@ static void dequeue_cfqe(struct cfq_entity *cfqe)
__dequeue_cfqe(st, cfqe);
sd->nr_active--;
cfqe->on_st = 0;
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfqe_is_cfqq(cfqe))
+ blkiocg_update_blkio_group_dequeue_stats(&cfqg_of(cfqe)->blkg,
+ 1);
+#endif
}
static void dequeue_cfqq(struct cfq_queue *cfqq)
--
1.6.2.5
o Do not allow request merging across cfq groups.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6936519..87b1799 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1381,6 +1381,9 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
struct cfq_io_context *cic;
struct cfq_queue *cfqq;
+ /* Deny merge if bio and rq don't belong to same cfq group */
+ if (cfqq_to_cfqg(RQ_CFQQ(rq)) != cfq_get_cfqg(cfqd, 0))
+ return false;
/*
* Disallow merge of a sync bio into an async request.
*/
--
1.6.2.5
o Add preemption checks for groups, where we travel up the hierarchy and check
whether one queue should preempt the other (a worked example follows below).
o Also prevent preemption across groups in some cases to provide isolation
between groups.
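A worked example (hypothetical scenario): a queue from a BE-class cgroup
currently owns the disk and a task in an RT-class cgroup issues IO. Since the
two queues belong to different groups, the check below compares the group
entities, finds a new RT entity against an active BE entity, and allows the
preemption. If both group entities are of the same class, the check returns
false and no preemption happens across the group boundary, which preserves
isolation between groups.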
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 33 +++++++++++++++++++++++++++++++++
1 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 87b1799..98dbead 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2636,6 +2636,36 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
}
+static bool cfq_should_preempt_group(struct cfq_data *cfqd,
+ struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
+{
+ struct cfq_entity *cfqe = &cfqq->entity;
+ struct cfq_entity *new_cfqe = &new_cfqq->entity;
+
+ if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
+ cfqe = parent_entity(&cfqq->entity);
+
+ if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
+ new_cfqe = parent_entity(&new_cfqq->entity);
+
+ /*
+ * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ */
+
+ if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
+ && cfqe->ioprio_class != IOPRIO_CLASS_RT)
+ return true;
+ /*
+ * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+ */
+
+ if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
+ && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
+ return true;
+
+ return false;
+}
+
/*
* Check if new_cfqq should preempt the currently active queue. Return 0 for
* no or if we aren't sure, a 1 will cause a preempt.
@@ -2666,6 +2696,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
return true;
+ if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
+ return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);
+
/*
* So both queues are sync. Let the new request get disk time if
* it's a metadata request and the current queue is doing regular IO.
--
1.6.2.5
o Select the co-operating queue from the same group, not from a different
cfq_group, to maintain fairness and isolation between groups.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98dbead..020d6dd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1635,6 +1635,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (!cfqq)
return NULL;
+ /* If new queue belongs to different cfq_group, don't choose it */
+ if (cfqq_to_cfqg(cur_cfqq) != cfqq_to_cfqg(cfqq))
+ return NULL;
+
if (cfq_cfqq_coop(cfqq))
return NULL;
--
1.6.2.5
o CFQ expires a cfqq once it has consumed its time slice. Expiry also means that
the queue gets deleted from the service tree. For sequential IO, most of the
time a new IO comes almost immediately and the cfqq gets backlogged again.
o This additional dequeuing creates issues. Dequeuing means that the associated
group will also be removed from the service tree, we select a new queue and a
new group for dispatch, a vdisktime jump takes place and the group loses its
fair share.
o One solution is to wait for the queue to get busy if it is empty at the time
of expiry and CFQ plans to idle on the queue (it expects a new request to come
within 8ms). A worked example follows below.
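A worked example (timings made up for illustration): a sequential reader in
group A dispatches its last queued request shortly before its slice expires.
Without wait-busy, the queue, and with it group A if that was its only queue,
is deleted from the service tree at expiry, a queue from group B is picked, and
when the reader's next request arrives a millisecond later group A is
re-inserted relative to the current min_vdisktime, losing the share it had
built up. With this patch, if the queue is empty at expiry but CFQ would have
idled on it, we arm the idle timer (up to cfq_slice_idle, 8ms by default) and
keep the queue on the service tree; if the next request shows up in time the
group keeps its position, otherwise the queue is expired as before.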
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 81 ++++++++++++++++++++++++++++++++++----------------
1 files changed, 55 insertions(+), 26 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 020d6dd..b7ef953 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -411,6 +411,21 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
}
+/*
+ * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
+ * isn't valid until the first request from the dispatch is activated
+ * and the slice time set.
+ */
+static inline bool cfq_slice_used(struct cfq_queue *cfqq)
+{
+ if (cfq_cfqq_slice_new(cfqq))
+ return 0;
+ if (time_before(jiffies, cfqq->slice_end))
+ return 0;
+
+ return 1;
+}
+
static inline void
cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
{
@@ -425,6 +440,17 @@ cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
#define cfqe_is_cfqq(cfqe) (!(cfqe)->my_sd)
+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ if (!RB_EMPTY_ROOT(&cfqq->sort_list) || !cfq_cfqq_idle_window(cfqq))
+ return false;
+
+ if (cfqq->dispatched && !cfq_slice_used(cfqq))
+ return false;
+
+ return true;
+}
+
static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
{
if (blkg)
@@ -635,6 +661,11 @@ static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ return false;
+}
+
static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
{
return cfqq_of(cfqe)->cfqd;
@@ -722,21 +753,6 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
}
/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline bool cfq_slice_used(struct cfq_queue *cfqq)
-{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
-
- return 1;
-}
-
-/*
* Lifted from AS - choose which of rq1 and rq2 that is best served now.
* We choose the request that is closest to the head right now. Distance
* behind the head is penalized and only allowed to a certain extent.
@@ -1647,19 +1663,22 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
return cfqq;
}
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
{
struct cfq_queue *cfqq = cfqd->active_queue;
struct cfq_io_context *cic;
unsigned long sl;
+ /* If idle timer is already armed, nothing to do */
+ if (!reset && timer_pending(&cfqd->idle_slice_timer))
+ return true;
/*
* SSD device without seek penalty, disable idling. But only do so
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
+ return false;
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
WARN_ON(cfq_cfqq_slice_new(cfqq));
@@ -1668,20 +1687,20 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* idle is disabled, either manually or by past process history
*/
if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
- return;
+ return false;
/*
* still requests with the driver, don't idle
*/
if (rq_in_driver(cfqd))
- return;
+ return false;
/*
* task has exited, don't wait
*/
cic = cfqd->active_cic;
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
- return;
+ return false;
/*
* If our average think time is larger than the remaining time
@@ -1690,7 +1709,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
*/
if (sample_valid(cic->ttime_samples) &&
(cfqq->slice_end - jiffies < cic->ttime_mean))
- return;
+ return false;
cfq_mark_cfqq_wait_request(cfqq);
@@ -1704,7 +1723,8 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu reset=%d", sl, reset);
+ return true;
}
/*
@@ -1775,6 +1795,12 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
if (!cfqd->rq_queued)
return NULL;
+ /* Wait for a queue to get busy before we expire it */
+ if (cfqq_should_wait_busy(cfqq) && cfq_arm_slice_timer(cfqd, 0)) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
@@ -2786,8 +2812,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqd->busy_queues > 1) {
del_timer(&cfqd->idle_slice_timer);
__blk_run_queue(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
+ } else
+ cfq_mark_cfqq_must_dispatch(cfqq);
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
@@ -2886,10 +2912,13 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
* of idling.
*/
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd);
+ if (!cfqq_should_wait_busy(cfqq))
+ cfq_slice_expired(cfqd);
+ else
+ cfq_arm_slice_timer(cfqd, 1);
else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
+ cfq_arm_slice_timer(cfqd, 1);
}
if (!rq_in_driver(cfqd))
--
1.6.2.5
o Now we plan to wait for a queue to get backlogged before we expire it. So
we need to arm the slice idle timer even if the think time is greater than the
slice left. If the process sends its next IO early and time slice is left, we
will dispatch it; otherwise we will expire the queue and move on to the next
queue.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 9 ---------
1 files changed, 0 insertions(+), 9 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b7ef953..963659a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1702,15 +1702,6 @@ static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
return false;
- /*
- * If our average think time is larger than the remaining time
- * slice, then don't idle. This avoids overrunning the allotted
- * time slice.
- */
- if (sample_valid(cic->ttime_samples) &&
- (cfqq->slice_end - jiffies < cic->ttime_mean))
- return false;
-
cfq_mark_cfqq_wait_request(cfqq);
/*
--
1.6.2.5
o To ensure fairness for a group, we need to make sure that at the time of
expiry the queue is backlogged and does not get deleted from the service tree.
That means for a sequential workload, wait for the next request before expiry.
o Sometimes we dispatch a request from a queue and do not wait busy on the
queue because arm_slice_timer() does not arm the slice idle timer, as it
thinks there are requests in the driver. Further down, in cfq_select_queue(),
we expire the cfqq because the time slice expired, and the queue loses its
share (vtime jump). Hence arm the idle timer even if there are requests in
the driver.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 963659a..d609a10 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1690,12 +1690,6 @@ static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
return false;
/*
- * still requests with the driver, don't idle
- */
- if (rq_in_driver(cfqd))
- return false;
-
- /*
* task has exited, don't wait
*/
cic = cfqd->active_cic;
--
1.6.2.5
o If a task changes cgroup, drop the reference to the cfqq associated with the
io context so that upon the next request arrival we will allocate a new queue
in the new group.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d609a10..f23d713 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2330,6 +2330,41 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
cfq_get_cfqg_ref(cfqg);
}
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ if (sync_cfqq) {
+ /*
+ * Drop reference to sync queue. A new sync queue will be
+ * assigned in new group upon arrival of a fresh request.
+ */
+ cfq_log_cfqq(cfqd, sync_cfqq, "changed cgroup");
+ cic_set_cfqq(cic, NULL, 1);
+ cfq_put_queue(sync_cfqq);
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_CFQ_GROUP_IOSCHED */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -2562,6 +2597,10 @@ out:
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
--
1.6.2.5
On Tue, Nov 03 2009, Vivek Goyal wrote:
> Hi All,
>
> This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
Thanks for posting this Vivek. Can you rebase the patches on top of my
for-2.6.33 branch, there are a bigger-than-usual number of pending CFQ
patches in there so things won't apply directly.
If you do that, I'll pull these patches into testing branch and
hopefully merge it in soonish. This patchset looks a lot more mergeable,
thanks!
--
Jens Axboe
Vivek Goyal <[email protected]> writes:
> +- blkio.weight
> + - Specifies per cgroup weight.
> +
> + Currently allowed range of weights is from 100 to 1000.
Why not 1 to 1000?
Cheers,
Jeff
On Wed, Nov 04, 2009 at 08:43:33AM +0100, Jens Axboe wrote:
> On Tue, Nov 03 2009, Vivek Goyal wrote:
> > Hi All,
> >
> > This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
>
> Thanks for posting this Vivek. Can you rebase the patches on top of my
> for-2.6.33 branch, there are a bigger-than-usual number of pending CFQ
> patches in there so things won't apply directly.
>
> If you do that, I'll pull these patches into testing branch and
> hopefully merge it in soonish. This patchset looks a lot more mergeable,
> thanks!
>
Thanks Jens. Sure, will rebase the patches on top of "for-2.6.33" branch
and repost.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> o Currently CFQ provides priority scaled time slices to processes. If a process
> does not use the time slice, either because process did not have sufficient
> IO to do or because think time of process is large and CFQ decided to disable
> idling, then processes looses it time slice share.
^^^^^^
loses
> o One possible way to handle this is implement CFS like time stamping of the
> cfq queues and keep track of vtime. Next queue for execution will be selected
> based on the one who got lowest vtime. This patch implemented time stamping
> mechanism of cfq queues based on disk time used.
>
> o min_vdisktime represents the minimum vdisktime of the queue, either being
^^^^^
> serviced or leftmost element on the serviec tree.
queue or service tree? The latter seems to make more sense to me.
> +static inline u64
> +cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
> +{
> + const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
> +
> + return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
> +}
cfq_scale_delta might be a better name.
> +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
> +{
> + s64 delta = (s64)(vdisktime - min_vdisktime);
> + if (delta > 0)
> + min_vdisktime = vdisktime;
> +
> + return min_vdisktime;
> +}
> +
> +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
> +{
> + s64 delta = (s64)(vdisktime - min_vdisktime);
> + if (delta < 0)
> + min_vdisktime = vdisktime;
> +
> + return min_vdisktime;
> +}
Is there a reason you've reimplemented min and max?
> + /*
> + * Maintain a cache of leftmost tree entries (it is frequently
> + * used)
> + */
You make it sound like there is a cache of more than one entry. Please
fix the comment.
> +static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
> +{
> + /*
> + * We don't want to charge more than allocated slice otherwise this
> + * queue can miss one dispatch round doubling max latencies. On the
> + * other hand we don't want to charge less than allocated slice as
> + * we stick to CFQ theme of queue loosing its share if it does not
^^^^^^^
losing
> +/*
> + * Handles three operations.
> + * Addition of a new queue to service tree, when a new request comes in.
> + * Resorting of an expiring queue (used after slice expired)
> + * Requeuing a queue at the front (used during preemption).
> + */
> +static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> + bool add_front, unsigned long service)
service? Can we come up with a better name that actually hints at what
this is? service_time, maybe?
Mostly this looks pretty good and is fairly easy to read.
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> o Introduce the notion of weights. Priorities are mapped to weights internally.
> These weights will be useful once IO groups are introduced and group's share
> will be decided by the group weight.
I'm sorry, but I need more background to review this patch. Where do
the min and max come from? Why do you scale 7-0 from 200-900? How does
this map to what was there before (exactly, approximately)?
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> +void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> + struct blkio_group *blkg, void *key)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&blkcg->lock, flags);
> + rcu_assign_pointer(blkg->key, key);
> + hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
> + spin_unlock_irqrestore(&blkcg->lock, flags);
> +}
I took a look at the rcu stuff, and it seems to be in order.
> +/*
> + * We cannot support shared io contexts, as we have no mean to support
> + * two tasks with the same ioc in two different groups without major rework
> + * of the main cic data structures. For now we allow a task to change
> + * its cgroup only if it's the only owner of its ioc.
> + */
Interesting. So is there no way at all to set the cgroup for a set of
processes that are cloned using CLONE_IO?
> +static int blkiocg_can_attach(struct cgroup_subsys *subsys,
> + struct cgroup *cgroup, struct task_struct *tsk,
> + bool threadgroup)
> +{
> + struct io_context *ioc;
> + int ret = 0;
> +
> + /* task_lock() is needed to avoid races with exit_io_context() */
> + task_lock(tsk);
> + ioc = tsk->io_context;
> + if (ioc && atomic_read(&ioc->nr_tasks) > 1)
> + ret = -EINVAL;
> + task_unlock(tsk);
> +
> + return ret;
> +}
This function's name implies that it returns a boolean.
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> o This patch embeds cfq_entity object in cfq_group and provides helper routines
> so that group entities can be scheduled.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 110 +++++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 93 insertions(+), 17 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index bc99163..8ec8a82 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -79,6 +79,7 @@ struct cfq_service_tree {
> #define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
>
> struct cfq_sched_data {
> + unsigned int nr_active;
> struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
> };
>
> @@ -89,6 +90,10 @@ struct cfq_entity {
> struct cfq_service_tree *st;
> unsigned short ioprio_class;
> bool ioprio_class_changed;
> + struct cfq_entity *parent;
> + bool on_st;
> + /* Points to the sched_data of group entity. Null for cfqq */
> + struct cfq_sched_data *my_sd;
Why my_sd? None of the other members required a my. ;-)
> +static inline struct cfq_entity *parent_entity(struct cfq_entity *cfqe)
> +{
> + return cfqe->parent;
> +}
Wow, is this really necessary for a pointer dereference?
> +#ifdef CONFIG_CFQ_GROUP_IOSCHED
> +/* check for entity->parent so that loop is not executed for root entity. */
> +#define for_each_entity(entity) \
> + for (; entity && entity->parent; entity = entity->parent)
See, you don't use it here!
> @@ -660,7 +723,13 @@ static void enqueue_cfqe(struct cfq_entity *cfqe)
>
> static void enqueue_cfqq(struct cfq_queue *cfqq)
> {
> - enqueue_cfqe(&cfqq->entity);
> + struct cfq_entity *cfqe = &cfqq->entity;
> +
> + for_each_entity(cfqe) {
> + if (cfqe->on_st)
> + break;
> + enqueue_cfqe(cfqe);
> + }
> }
Maybe I'm slow, but I would have benefitted from a comment above that
function stating that we have to walk up the tree to make sure that the
parent is also scheduled.
Cheers,
Jeff
On Wed, Nov 04, 2009 at 10:06:16AM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Introduce the notion of weights. Priorities are mapped to weights internally.
> > These weights will be useful once IO groups are introduced and group's share
> > will be decided by the group weight.
>
> I'm sorry, but I need more background to review this patch. Where do
> the min and max come from? Why do you scale 7-0 from 200-900? How does
> this map to what was there before (exactly, approximately)?
>
Well, so far we only have the notion of io priority for a process, and
based on that we determine the time slice length.
Soon we will throw cfq groups into the mix as well. Because the cpu controller
is weight driven, people have shown a preference that a group's share should
be decided based on its weight rather than introducing the notion of ioprio
for groups.
So now the core scheduling algorithm only recognizes weights for entities (be
they cfq queues or cfq groups), and it is required that we convert the ioprio
of a cfqq into a weight.
Now it is a matter of coming up with what weight range we support and
how ioprio should be mapped onto these weights. We can always change the
mappings, but to begin with I have done the following.
Allow a weight range from 100 to 1000. Allowing very small weights like
"1" can lead to interesting corner cases and I wanted to avoid that in the
first implementation. For example, if some group with weight "1" gets a
time slice of 100ms, its vtime will be really high and after that it will
not get scheduled in for a very long time.
Secondly, allowing very small weights can make the vtime of the tree move
very fast, with a faster wrap around of min_vdisktime (especially on SSDs
where idling might not be enabled, and for every queue expiry we attribute
a minimum of 1ms of slice; if the weight of the group is "1" it will get a
higher vtime and min_vdisktime will move very fast). We don't want too fast
a wrap around of min_vdisktime (especially in the case of the idle tree;
that infrastructure is not part of the current patches).
Hence, to begin with I wanted to limit the range of weights allowed, because
a wider range opens up a lot of interesting corner cases. That's why I limited
the minimum weight to 100. So at most the user can expect 1000/100 = 10 times
service differentiation between the highest and lowest weight groups. If folks
need more than that, we can look into it once things stabilize.
Priority values and weights follow reverse order: a numerically higher priority
value means a lower weight and vice versa.
Currently we support 8 priority levels and prio 4 is the middle point.
Prio values above 4 get a 20% smaller slice compared to prio 4, and prio
values below 4 get a 20% larger slice than prio 4 (20% higher/lower for each
priority level).
For the weight range 100 - 1000, 500 can be considered the mid point. This
is how the priority mapping looks:
100 200 300 400 500 600 700 800 900 1000  (Weights)
     7   6   5   4   3   2   1   0         (io prio)
Once priorities are converted to weights, we are able to retain the notion
of a 20% difference between prio levels by choosing 500 as the mid point and
mapping prio 0-7 to weights 900-200, hence this mapping.
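To make the mapping concrete, it boils down to something like the following
helper (an illustrative sketch only; the names and the 500/100 constants are
placeholders for this example, not necessarily what the patch uses):

/*
 * Illustrative sketch of the ioprio -> weight mapping described above.
 * Names and constants are assumptions for this example only.
 */
#define CFQ_WEIGHT_MID		500	/* weight corresponding to prio 4 */
#define CFQ_WEIGHT_STEP		100	/* one prio level ~= one 20% slice step */

static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
{
	/* prio 7 -> 200, prio 4 -> 500, prio 0 -> 900 */
	return CFQ_WEIGHT_MID + (4 - (int)ioprio) * CFQ_WEIGHT_STEP;
}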
I am all ears if you have any suggestions on how this can be handled
better.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> + /* Do we need to take this reference */
> + if (!css_tryget(&blkcg->css))
> + return NULL;;
Interesting question. It seems like, so long as you want to reference
that cgroup, you need a reference on it. ->alloc will only give you
one, so as long as we have pointers to these squirreled away, we need to
bump the reference count. Please take a look at this before your next
posting.
> +
> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> + if (cfqg || !create)
> + goto done;
> +
> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC | __GFP_ZERO,
> + cfqd->queue->node);
Do you really have to OR in __GFP_ZERO for kzalloc?
> + cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
> + if (!cfqg && create)
> + cfqg = &cfqd->root_group;
Hmm, is that really the behaviour you want?
Cheers,
Jeff
On Wed, Nov 04, 2009 at 09:30:34AM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
Thanks for the review Jeff.
> > o Currently CFQ provides priority scaled time slices to processes. If a process
> > does not use the time slice, either because process did not have sufficient
> > IO to do or because think time of process is large and CFQ decided to disable
> > idling, then processes looses it time slice share.
> ^^^^^^
> loses
>
Will fix it.
> > o One possible way to handle this is implement CFS like time stamping of the
> > cfq queues and keep track of vtime. Next queue for execution will be selected
> > based on the one who got lowest vtime. This patch implemented time stamping
> > mechanism of cfq queues based on disk time used.
> >
> > o min_vdisktime represents the minimum vdisktime of the queue, either being
> ^^^^^
> > serviced or leftmost element on the serviec tree.
>
> queue or service tree? The latter seems to make more sense to me.
Yes, it should be service tree. Will fix it.
>
> > +static inline u64
> > +cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
> > +{
> > + const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
> > +
> > + return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
> > +}
>
> cfq_scale_delta might be a better name.
>
cfq_scale_delta sounds good. Will use it in next version.
>
> > +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
> > +{
> > + s64 delta = (s64)(vdisktime - min_vdisktime);
> > + if (delta > 0)
> > + min_vdisktime = vdisktime;
> > +
> > + return min_vdisktime;
> > +}
> > +
> > +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
> > +{
> > + s64 delta = (s64)(vdisktime - min_vdisktime);
> > + if (delta < 0)
> > + min_vdisktime = vdisktime;
> > +
> > + return min_vdisktime;
> > +}
>
> Is there a reason you've reimplemented min and max?
I think you are referring to min_t and max_t. Will these macros take care
of wrapping too?
For example, if I used min_t(u64, A, B), then the unsigned comparison will
not work right if wrapping has just taken place for either A or B. So if
A=-1 and B=2, then min_t() would return B as the minimum. This is not right
in our case.
If we do a signed comparison (min_t(s64, A, B)), that also seems to be
broken in another case, where the value of a variable moves from 63 bits to
64 bits (A=0x7fffffffffffffff, B=0x8000000000000000). The above will return
B as the minimum, but in our scenario vdisktime will progress from
0x7fffffffffffffff to 0x8000000000000000 and A should be returned as the
minimum (unsigned comparison).
Hence I took these definitions from CFS.
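To illustrate the wrap case in user space (this is only a sketch reusing the
helper's name for illustration, not the kernel code itself):

#include <stdio.h>
#include <stdint.h>

/* Same idea as the kernel helper quoted above, in user-space types. */
static uint64_t min_vdisktime(uint64_t min_vdisktime, uint64_t vdisktime)
{
	int64_t delta = (int64_t)(vdisktime - min_vdisktime);

	if (delta < 0)
		min_vdisktime = vdisktime;

	return min_vdisktime;
}

int main(void)
{
	uint64_t a = UINT64_MAX;	/* vdisktime just before wrap */
	uint64_t b = 2;			/* vdisktime just after wrap */

	/* Plain unsigned comparison (what min_t(u64, ...) does) picks b. */
	printf("min_t-style min: %llu\n", (unsigned long long)(a < b ? a : b));

	/* Delta-based comparison picks a, the logically smaller vtime. */
	printf("wrap-aware min:  %llu\n", (unsigned long long)min_vdisktime(a, b));
	return 0;
}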
>
> > + /*
> > + * Maintain a cache of leftmost tree entries (it is frequently
> > + * used)
> > + */
>
> You make it sound like there is a cache of more than one entry. Please
> fix the comment.
Will fix it.
>
> > +static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
> > +{
> > + /*
> > + * We don't want to charge more than allocated slice otherwise this
> > + * queue can miss one dispatch round doubling max latencies. On the
> > + * other hand we don't want to charge less than allocated slice as
> > + * we stick to CFQ theme of queue loosing its share if it does not
> ^^^^^^^
> losing
>
Will fix it.
>
> > +/*
> > + * Handles three operations.
> > + * Addition of a new queue to service tree, when a new request comes in.
> > + * Resorting of an expiring queue (used after slice expired)
> > + * Requeuing a queue at the front (used during preemption).
> > + */
> > +static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> > + bool add_front, unsigned long service)
>
> service? Can we come up with a better name that actually hints at what
> this is? service_time, maybe?
Ok, service_time sounds good. Will change it.
>
>
> Mostly this looks pretty good and is fairly easy to read.
Thanks
Vivek
On Wed, Nov 04, 2009 at 10:23:16AM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > +void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> > + struct blkio_group *blkg, void *key)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&blkcg->lock, flags);
> > + rcu_assign_pointer(blkg->key, key);
> > + hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
> > + spin_unlock_irqrestore(&blkcg->lock, flags);
> > +}
>
> I took a look at the rcu stuff, and it seems to be in order.
>
> > +/*
> > + * We cannot support shared io contexts, as we have no mean to support
> > + * two tasks with the same ioc in two different groups without major rework
> > + * of the main cic data structures. For now we allow a task to change
> > + * its cgroup only if it's the only owner of its ioc.
> > + */
>
> Interesting. So is there no way at all to set the cgroup for a set of
> processes that are cloned using CLONE_IO?
>
In the current patchset, "no". This is bad and should be fixed. I am thinking
of the following.
- In case of CLONE_IO, when a thread moves, drop the reference to the sync
queue and allow movement of the thread to a different group. Once a new
request comes in, a new queue will be set up again. Because the two threads
sharing the IO context are now in two different groups, a sync queue will be
set up for the group from which a request comes first. So for some time we
will have a situation where a thread is in one group but its IO is going into
a queue of a different group. This will only be a temporary situation and will
correct itself once all the threads sharing the io context move to the same
group.
The downside is that the user might not know exactly which threads are sharing
an IO context and might end up with some threads in one group and others in a
different group.
> > +static int blkiocg_can_attach(struct cgroup_subsys *subsys,
> > + struct cgroup *cgroup, struct task_struct *tsk,
> > + bool threadgroup)
> > +{
> > + struct io_context *ioc;
> > + int ret = 0;
> > +
> > + /* task_lock() is needed to avoid races with exit_io_context() */
> > + task_lock(tsk);
> > + ioc = tsk->io_context;
> > + if (ioc && atomic_read(&ioc->nr_tasks) > 1)
> > + ret = -EINVAL;
> > + task_unlock(tsk);
> > +
> > + return ret;
> > +}
>
> This function's name implies that it returns a boolean.
Yes. Will change from int to bool. Thanks.
Vivek
On Wed, Nov 04, 2009 at 10:34:57AM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o This patch embeds cfq_entity object in cfq_group and provides helper routines
> > so that group entities can be scheduled.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > block/cfq-iosched.c | 110 +++++++++++++++++++++++++++++++++++++++++++--------
> > 1 files changed, 93 insertions(+), 17 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index bc99163..8ec8a82 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -79,6 +79,7 @@ struct cfq_service_tree {
> > #define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
> >
> > struct cfq_sched_data {
> > + unsigned int nr_active;
> > struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
> > };
> >
> > @@ -89,6 +90,10 @@ struct cfq_entity {
> > struct cfq_service_tree *st;
> > unsigned short ioprio_class;
> > bool ioprio_class_changed;
> > + struct cfq_entity *parent;
> > + bool on_st;
> > + /* Points to the sched_data of group entity. Null for cfqq */
> > + struct cfq_sched_data *my_sd;
>
> Why my_sd? None of the other members required a my. ;-)
>
Because in the rest of the code we have been using "struct cfq_sched_data *sd"
to mean a pointer to the scheduling data an entity/queue/group is queued on.
In this case the entity is embedded inside a cfq_group object and is itself
hosting a sched_data tree. So "*sd" means the sched_data the entity is
queued on and "*my_sd" means the sched_data the entity is hosting.
> > +static inline struct cfq_entity *parent_entity(struct cfq_entity *cfqe)
> > +{
> > + return cfqe->parent;
> > +}
>
> Wow, is this really necessary for a pointer dereference?
>
I guess not. As the core scheduling code is inspired by CFS, it is coming
from there. :-) Will get rid of it in the next posting.
> > +#ifdef CONFIG_CFQ_GROUP_IOSCHED
> > +/* check for entity->parent so that loop is not executed for root entity. */
> > +#define for_each_entity(entity) \
> > + for (; entity && entity->parent; entity = entity->parent)
>
> See, you don't use it here
:-)
>
> > @@ -660,7 +723,13 @@ static void enqueue_cfqe(struct cfq_entity *cfqe)
> >
> > static void enqueue_cfqq(struct cfq_queue *cfqq)
> > {
> > - enqueue_cfqe(&cfqq->entity);
> > + struct cfq_entity *cfqe = &cfqq->entity;
> > +
> > + for_each_entity(cfqe) {
> > + if (cfqe->on_st)
> > + break;
> > + enqueue_cfqe(cfqe);
> > + }
> > }
>
> Maybe I'm slow, but I would have benefitted from a comment above that
> function stating that we have to walk up the tree to make sure that the
> parent is also scheduled.
Sure, will put a comment here.
Thanks
Vivek
On Wed, Nov 4, 2009 at 7:41 AM, Vivek Goyal <[email protected]> wrote:
>
> On Wed, Nov 04, 2009 at 10:06:16AM -0500, Jeff Moyer wrote:
> > Vivek Goyal <[email protected]> writes:
> >
> > > o Introduce the notion of weights. Priorities are mapped to weights internally.
> > >   These weights will be useful once IO groups are introduced and group's share
> > >   will be decided by the group weight.
> >
> > I'm sorry, but I need more background to review this patch. Where do
> > the min and max come from? Why do you scale 7-0 from 200-900? How does
> > this map to what was there before (exactly, approximately)?
> >
>
> Well, So far we only have the notion of iopriority for the process and
> based on that we determine time slice length.
>
> Soon we will throw cfq groups also in the mix. Because cpu IO controller
> is weight driven, people have shown preference that group's share should
> be decided based on its weight and not introduce the notion of ioprio for
> groups.
>
> So now core scheduling algorithm only recognizes weights for entities (be it
> cfq queues or cfq groups), and it is required that we convert the ioprio
> of cfqq into weight.
>
> Now it is a matter of coming up with what weight range do we support and
> how ioprio should be mapped onto these weights. We can always change the
> mappings but to being with, I have followed following.
>
> Allow a weight range from 100 to 1000. Allowing too small a weights like
> "1", can lead to very interesting corner cases and I wanted to avoid that
> in first implementation. For example, if some group with weight "1" gets
> a time slice of 100ms, its vtime will be really high and after that it
> will not get scheduled in for a very long time.
>
> Seconly allowing too small a weights can make vtime of the tree move very
> fast with faster wrap around of min_vdistime. (especially on SSD where idling
> might not be enabled, and for every queue expiry we will attribute minimum of
> 1ms of slice. If weight of the group is "1" it will higher vtime and
> min_vdisktime will move very fast). We don't want too fast a wrap around
> of min_vdisktime (especially in case of idle tree. That infrastructure is
> not part of current patches).
>
> Hence, to begin with I wanted to limit the range of weights allowed because
> wider range opens up lot of interesting corner cases. That's why limited
> minimum weight to 100. So at max user can expect the 1000/100=10 times service
> differentiation between highest and lower weight groups. If folks need more
> than that, we can look into it once things stablize.
We definitely need the 1:100 differentiation. I'm ok with adding that
later after the core set of patches stabilize but just letting you
know that it is important to us. Also curious why you chose a higher
range 100-1000 instead of 10-100? For smaller vtime leaps?
>
> Priority and weights follow reverse order. Higher priority means low
> weight and vice-versa.
>
> Currently we support 8 priority levels and prio "4" is the middle point.
> Anything higher than prio 4 gets 20% less slice as compared to prio 4 and
> priorities lower than 4, get 20% higher slice of prio 4 (20% higher/lower
> for each priority level).
>
> For weight range 100 - 1000, 500 can be considered as mid point. Now this
> is how priority mapping looks like.
>
> 100 200 300 400 500 600 700 800 900 1000  (Weights)
>      7   6   5   4   3   2   1   0         (io prio).
>
> Once priorities are converted to weights, we are able to retain the notion
> of 20% difference between prio levels by choosing 500 as the mid point and
> mapping prio 0-7 to weights 900-200, hence this mapping.
>
> I am all ears if you have any suggestions on how this ca be handled
> better.
>
> Thanks
> Vivek
* Vivek Goyal <[email protected]> [2009-11-03 18:43:38]:
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> Documentation/cgroups/blkio-controller.txt | 106 ++++++++++++++++++++++++++++
> 1 files changed, 106 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/blkio-controller.txt
>
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> new file mode 100644
> index 0000000..dc8fb1a
> --- /dev/null
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -0,0 +1,106 @@
> + Block IO Controller
> + ===================
> +Overview
> +========
> +cgroup subsys "blkio" implements the block io controller. There seems to be
> +a need of various kind of IO control policies (like proportional BW, max BW)
> +both at leaf nodes as well as at intermediate nodes in storage hierarchy. Plan
> +is to use same cgroup based management interface for blkio controller and
> +based on user options switch IO policies in the background.
> +
> +In the first phase, this patchset implements proportional weight time based
> +division of disk policy. It is implemented in CFQ. Hence this policy takes
> +effect only on leaf nodes when CFQ is being used.
> +
> +HOWTO
> +=====
> +You can do a very simple testing of running two dd threads in two different
> +cgroups. Here is what you can do.
> +
> +- Enable group scheduling in CFQ
> + CONFIG_CFQ_GROUP_IOSCHED=y
> +
> +- Compile and boot into kernel and mount IO controller (blkio).
> +
> + mount -t cgroup -o blkio none /cgroup
> +
> +- Create two cgroups
> + mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set weights of group test1 and test2
> + echo 1000 > /cgroup/test1/blkio.weight
> + echo 500 > /cgroup/test2/blkio.weight
> +
> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> + launch two dd threads in different cgroup to read those files.
> +
> + sync
> + echo 3 > /proc/sys/vm/drop_caches
> +
> + dd if=/mnt/sdb/zerofile1 of=/dev/null &
> + echo $! > /cgroup/test1/tasks
> + cat /cgroup/test1/tasks
> +
> + dd if=/mnt/sdb/zerofile2 of=/dev/null &
> + echo $! > /cgroup/test2/tasks
> + cat /cgroup/test2/tasks
> +
> +- At macro level, first dd should finish first. To get more precise data, keep
> + on looking at (with the help of script), at blkio.disk_time and
> + blkio.disk_sectors files of both test1 and test2 groups. This will tell how
> + much disk time (in milli seconds), each group got and how many secotors each
> + group dispatched to the disk. We provide fairness in terms of disk time, so
> + ideally io.disk_time of cgroups should be in proportion to the weight.
> +
> +Various user visible config options
> +===================================
> +CONFIG_CFQ_GROUP_IOSCHED
> + - Enables group scheduling in CFQ. Currently only 1 level of group
> + creation is allowed.
> +
> +CONFIG_DEBUG_CFQ_IOSCHED
> + - Enables some debugging messages in blktrace. Also creates extra
> + cgroup file blkio.dequeue.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configuration.
> +
> +CONFIG_BLK_CGROUP
> + - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
> +
> +CONFIG_DEBUG_BLK_CGROUP
> + - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
> +
> +Details of cgroup files
> +=======================
> +- blkio.ioprio_class
> + - Specifies class of the cgroup (RT, BE, IDLE). This is default io
> + class of the group on all the devices.
> +
> + 1 = RT; 2 = BE, 3 = IDLE
> +
> +- blkio.weight
> + - Specifies per cgroup weight.
> +
> + Currently allowed range of weights is from 100 to 1000.
> +
> +- blkio.time
> + - disk time allocated to cgroup per device in milliseconds. First
> + two fields specify the major and minor number of the device and
> + third field specifies the disk time allocated to group in
> + milliseconds.
> +
> +- blkio.sectors
> + - number of sectors transferred to/from disk by the group. First
> + two fields specify the major and minor number of the device and
> + third field specifies the number of sectors transferred by the
> + group to/from the device.
> +
> +- blkio.dequeue
> + - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
> + gives the statistics about how many a times a group was dequeued
> + from service tree of the device. First two fields specify the major
> + and minor number of the device and third field specifies the number
> + of times a group was dequeued from a particular device.
Hi, Vivek,
Are the parameters inter-related? What if you have conflicts w.r.t.
time, sectors, etc?
--
Balbir
On Wed, Nov 04, 2009 at 10:51:00PM +0530, Balbir Singh wrote:
> * Vivek Goyal <[email protected]> [2009-11-03 18:43:38]:
>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > Documentation/cgroups/blkio-controller.txt | 106 ++++++++++++++++++++++++++++
> > 1 files changed, 106 insertions(+), 0 deletions(-)
> > create mode 100644 Documentation/cgroups/blkio-controller.txt
> >
> > diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> > new file mode 100644
> > index 0000000..dc8fb1a
> > --- /dev/null
> > +++ b/Documentation/cgroups/blkio-controller.txt
> > @@ -0,0 +1,106 @@
> > + Block IO Controller
> > + ===================
> > +Overview
> > +========
> > +cgroup subsys "blkio" implements the block io controller. There seems to be
> > +a need of various kind of IO control policies (like proportional BW, max BW)
> > +both at leaf nodes as well as at intermediate nodes in storage hierarchy. Plan
> > +is to use same cgroup based management interface for blkio controller and
> > +based on user options switch IO policies in the background.
> > +
> > +In the first phase, this patchset implements proportional weight time based
> > +division of disk policy. It is implemented in CFQ. Hence this policy takes
> > +effect only on leaf nodes when CFQ is being used.
> > +
> > +HOWTO
> > +=====
> > +You can do a very simple testing of running two dd threads in two different
> > +cgroups. Here is what you can do.
> > +
> > +- Enable group scheduling in CFQ
> > + CONFIG_CFQ_GROUP_IOSCHED=y
> > +
> > +- Compile and boot into kernel and mount IO controller (blkio).
> > +
> > + mount -t cgroup -o blkio none /cgroup
> > +
> > +- Create two cgroups
> > + mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set weights of group test1 and test2
> > + echo 1000 > /cgroup/test1/blkio.weight
> > + echo 500 > /cgroup/test2/blkio.weight
> > +
> > +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> > + launch two dd threads in different cgroup to read those files.
> > +
> > + sync
> > + echo 3 > /proc/sys/vm/drop_caches
> > +
> > + dd if=/mnt/sdb/zerofile1 of=/dev/null &
> > + echo $! > /cgroup/test1/tasks
> > + cat /cgroup/test1/tasks
> > +
> > + dd if=/mnt/sdb/zerofile2 of=/dev/null &
> > + echo $! > /cgroup/test2/tasks
> > + cat /cgroup/test2/tasks
> > +
> > +- At macro level, first dd should finish first. To get more precise data, keep
> > + on looking at (with the help of script), at blkio.disk_time and
> > + blkio.disk_sectors files of both test1 and test2 groups. This will tell how
> > + much disk time (in milli seconds), each group got and how many secotors each
> > + group dispatched to the disk. We provide fairness in terms of disk time, so
> > + ideally io.disk_time of cgroups should be in proportion to the weight.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_CFQ_GROUP_IOSCHED
> > + - Enables group scheduling in CFQ. Currently only 1 level of group
> > + creation is allowed.
> > +
> > +CONFIG_DEBUG_CFQ_IOSCHED
> > + - Enables some debugging messages in blktrace. Also creates extra
> > + cgroup file blkio.dequeue.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configuration.
> > +
> > +CONFIG_BLK_CGROUP
> > + - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
> > +
> > +CONFIG_DEBUG_BLK_CGROUP
> > + - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
> > +
> > +Details of cgroup files
> > +=======================
> > +- blkio.ioprio_class
> > + - Specifies class of the cgroup (RT, BE, IDLE). This is default io
> > + class of the group on all the devices.
> > +
> > + 1 = RT; 2 = BE, 3 = IDLE
> > +
> > +- blkio.weight
> > + - Specifies per cgroup weight.
> > +
> > + Currently allowed range of weights is from 100 to 1000.
> > +
> > +- blkio.time
> > + - disk time allocated to cgroup per device in milliseconds. First
> > + two fields specify the major and minor number of the device and
> > + third field specifies the disk time allocated to group in
> > + milliseconds.
> > +
> > +- blkio.sectors
> > + - number of sectors transferred to/from disk by the group. First
> > + two fields specify the major and minor number of the device and
> > + third field specifies the number of sectors transferred by the
> > + group to/from the device.
> > +
> > +- blkio.dequeue
> > + - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
> > + gives the statistics about how many a times a group was dequeued
> > + from service tree of the device. First two fields specify the major
> > + and minor number of the device and third field specifies the number
> > + of times a group was dequeued from a particular device.
>
> Hi, Vivek,
>
> Are the parameters inter-related? What if you have conflicts w.r.t.
> time, sectors, etc?
What kind of conflicts?
time, sectors, and dequeue are read-only files. They are mainly for
monitoring purposes. CFQ provides fairness in terms of disk time, so one
can monitor whether the time share received by a group is fair or not.
"sectors" just gives additional data about how many sectors were transferred.
It is not a necessary file. I just exported it to get some sense of both the
time and the amount of IO done by a cgroup.
So I am not sure what kind of conflicts you are referring to.
Thanks
Vivek
On Wed, Nov 4, 2009 at 5:37 PM, Vivek Goyal <[email protected]> wrote:
> On Wed, Nov 04, 2009 at 09:30:34AM -0500, Jeff Moyer wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>
> Thanks for the review Jeff.
>
>> > o Currently CFQ provides priority scaled time slices to processes. If a process
>> > does not use the time slice, either because process did not have sufficient
>> > IO to do or because think time of process is large and CFQ decided to disable
>> > idling, then processes looses it time slice share.
>> ^^^^^^
>> loses
>>
should be 'process loses'
>> > +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> > +{
>> > + s64 delta = (s64)(vdisktime - min_vdisktime);
>> > + if (delta > 0)
>> > + min_vdisktime = vdisktime;
>> > +
>> > + return min_vdisktime;
>> > +}
>> > +
>> > +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> > +{
>> > + s64 delta = (s64)(vdisktime - min_vdisktime);
>> > + if (delta < 0)
>> > + min_vdisktime = vdisktime;
>> > +
>> > + return min_vdisktime;
>> > +}
>>
>> Is there a reason you've reimplemented min and max?
>
> I think you are referring to min_t and max_t. Will these macros take care
> of wrapping too?
>
> For example, if I used min_t(u64, A, B), then unsigned comparision will
> not work right wrapping has just taken place for any of the A or B. So if
> A=-1 and B=2, then min_t() would return B as minimum. This is not right
> in our case.
>
> If we do signed comparison (min_t(s64, A, B)), that also seems to be
> broken in another case where a value of variable moves from 63bits to 64bits,
> (A=0x7fffffffffffffff, B=0x8000000000000000). Above will return B as minimum but
> in our scanario, vdisktime will progress from 0x7fffffffffffffff to
> 0x8000000000000000 and A should be returned as minimum (unsigned
> comparison).
>
> Hence I took these difnitions from CFS.
If those are times (measured in jiffies), why are you using u64? You
could use unsigned long and time_before/time_after, that perform the
proper wrap checking.
Corrado
Vivek Goyal <[email protected]> writes:
> +extern void cfq_delink_blkio_group(void *, struct blkio_group *);
s/delink/unlink/g
Vivek Goyal <[email protected]> writes:
> o Some CFQ debugging Aid.
Some CFQ debugging aids. Sorry, I couldn't help myself.
> +config DEBUG_BLK_CGROUP
> + bool
> + depends on BLK_CGROUP
> + default n
> + ---help---
> + Enable some debugging help. Currently it stores the cgroup path
> + in the blk group which can be used by cfq for tracing various
> + group related activity.
> +
> endif # BLOCK
>
> +config DEBUG_CFQ_IOSCHED
> + bool "Debug CFQ Scheduling"
> + depends on CFQ_GROUP_IOSCHED
> + select DEBUG_BLK_CGROUP
This seems wrong. DEBUG_CFQ_IOSCHED sounds like it enables debugging of
CFQ. In your implementation, though, it only enables this in tandem
with the blkio cgroup infrastructure. Please decouple these things.
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> + /* Store cgroup path */
> + char path[128];
> +#endif
Where does 128 come from? What's wrong with PATH_MAX?
Cheers,
Jeff
On Wed, Nov 04, 2009 at 06:59:45PM +0100, Corrado Zoccolo wrote:
> On Wed, Nov 4, 2009 at 5:37 PM, Vivek Goyal <[email protected]> wrote:
> > On Wed, Nov 04, 2009 at 09:30:34AM -0500, Jeff Moyer wrote:
> >> Vivek Goyal <[email protected]> writes:
> >>
> >
> > Thanks for the review Jeff.
> >
> >> > o Currently CFQ provides priority scaled time slices to processes. If a process
> >> >   does not use the time slice, either because process did not have sufficient
> >> >   IO to do or because think time of process is large and CFQ decided to disable
> >> >   idling, then processes looses it time slice share.
> >>                            ^^^^^^
> >> loses
> >>
> should be 'process loses'
>
Will fix it.
> >> > +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
> >> > +{
> >> > +   s64 delta = (s64)(vdisktime - min_vdisktime);
> >> > +   if (delta > 0)
> >> > +           min_vdisktime = vdisktime;
> >> > +
> >> > +   return min_vdisktime;
> >> > +}
> >> > +
> >> > +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
> >> > +{
> >> > +   s64 delta = (s64)(vdisktime - min_vdisktime);
> >> > +   if (delta < 0)
> >> > +           min_vdisktime = vdisktime;
> >> > +
> >> > +   return min_vdisktime;
> >> > +}
> >>
> >> Is there a reason you've reimplemented min and max?
> >
> > I think you are referring to min_t and max_t. Will these macros take care
> > of wrapping too?
> >
> > For example, if I used min_t(u64, A, B), then unsigned comparision will
> > not work right wrapping has just taken place for any of the A or B. So if
> > A=-1 and B=2, then min_t() would return B as minimum. This is not right
> > in our case.
> >
> > If we do signed comparison (min_t(s64, A, B)), that also seems to be
> > broken in another case where a value of variable moves from 63bits to 64bits,
> > (A=0x7fffffffffffffff, B=0x8000000000000000). Above will return B as minimum but
> > in our scanario, vdisktime will progress from 0x7fffffffffffffff to
> > 0x8000000000000000 and A should be returned as minimum (unsigned
> > comparison).
> >
> > Hence I took these difnitions from CFS.
> If those are times (measured in jiffies), why are you using u64? You
> could use unsigned long and time_before/time_after, that perform the
> proper wrap checking.
>
This is virtual time, not exactly jiffies, and it can run faster than
real time. In the current patchset there are two reasons for that.
- We charge a queue a minimum slice of 1ms even if we expire the queue
immediately after dispatching a request. So if we have really fast hardware
and we decide not to idle, we will be switching queues very fast, charging
each queue 1ms of slice used, and vtime will progress much faster than real
time.
- We shift real time by CFQ_SERVICE_SHIFT so that theoretically one can
see a service difference between weights x and x+1 for all values in the
weight range of 1 to 1000, and not lose the small difference in the
division.
The vtime calculation is as follows:
vtime = (slice_used << CFQ_SERVICE_SHIFT) * DEFAULT_WEIGHT / cfqe->weight
The minimum value of slice_used is 1 jiffy and DEFAULT_WEIGHT is 500. So if
one wants to get different vtime values for queue weights of 998 and 999, we
need to shift slice_used by at least 12 bits.
Giving a 12-bit shift to the real time, we are left with 20 bits on 32-bit
hardware. So even if vtime does not run faster than real time, we will wrap
around in about 1M jiffies (~1000 seconds). I did some 4K sized direct IO on
one of the SSDs and it achieved 7K IOs per second. In the worst case, if
these IOs were coming from two queues and we were interleaving the requests
between them, we would do 7 queue switches in 1ms. That means vtime travels
7 times faster and wrap around takes place in roughly 1000/7 ~= 140 seconds.
I thought that on 32-bit hardware we are really close to pushing the limits,
hence I continued to use a 64-bit vdisktime/key instead of an unsigned long
one.
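As a rough, illustrative user-space sketch of the numbers above (the shift,
default weight, HZ and IOPS values are assumptions taken from this discussion,
not lifted from the posted patches):

#include <stdio.h>

#define CFQ_SERVICE_SHIFT	12	/* assumed, per the discussion above */
#define DEFAULT_WEIGHT		500	/* assumed default weight */

/* vtime charged for a slice: scales inversely with the entity's weight. */
static unsigned long long charge(unsigned long slice_used, unsigned int weight)
{
	return ((unsigned long long)slice_used << CFQ_SERVICE_SHIFT) *
		DEFAULT_WEIGHT / weight;
}

int main(void)
{
	/* The 12-bit shift keeps weights 998 and 999 distinguishable even
	 * for a minimum 1-jiffy charge. */
	printf("charge(1, 998) = %llu\n", charge(1, 998));
	printf("charge(1, 999) = %llu\n", charge(1, 999));

	/* On 32 bits, 12 bits of shift leave ~2^20 jiffies before wrap:
	 * about 1000 seconds at HZ=1000, and roughly 1/7th of that if
	 * vtime runs 7x faster than real time. */
	printf("wrap after ~%lus, or ~%lus at 7x vtime speed\n",
	       (1UL << 20) / 1000, (1UL << 20) / 1000 / 7);
	return 0;
}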
Thanks
Vivek
On Wed, Nov 04, 2009 at 09:07:41AM -0800, Divyesh Shah wrote:
> On Wed, Nov 4, 2009 at 7:41 AM, Vivek Goyal <[email protected]> wrote:
> >
> > On Wed, Nov 04, 2009 at 10:06:16AM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <[email protected]> writes:
> > >
> > > > o Introduce the notion of weights. Priorities are mapped to weights internally.
> > > >   These weights will be useful once IO groups are introduced and group's share
> > > >   will be decided by the group weight.
> > >
> > > I'm sorry, but I need more background to review this patch. Where do
> > > the min and max come from? Why do you scale 7-0 from 200-900? How does
> > > this map to what was there before (exactly, approximately)?
> > >
> >
> > Well, So far we only have the notion of iopriority for the process and
> > based on that we determine time slice length.
> >
> > Soon we will throw cfq groups also in the mix. Because cpu IO controller
> > is weight driven, people have shown preference that group's share should
> > be decided based on its weight and not introduce the notion of ioprio for
> > groups.
> >
> > So now core scheduling algorithm only recognizes weights for entities (be it
> > cfq queues or cfq groups), and it is required that we convert the ioprio
> > of cfqq into weight.
> >
> > Now it is a matter of coming up with what weight range do we support and
> > how ioprio should be mapped onto these weights. We can always change the
> > mappings but to being with, I have followed following.
> >
> > Allow a weight range from 100 to 1000. Allowing too small a weights like
> > "1", can lead to very interesting corner cases and I wanted to avoid that
> > in first implementation. For example, if some group with weight "1" gets
> > a time slice of 100ms, its vtime will be really high and after that it
> > will not get scheduled in for a very long time.
> >
> > Seconly allowing too small a weights can make vtime of the tree move very
> > fast with faster wrap around of min_vdistime. (especially on SSD where idling
> > might not be enabled, and for every queue expiry we will attribute minimum of
> > 1ms of slice. If weight of the group is "1" it will higher vtime and
> > min_vdisktime will move very fast). We don't want too fast a wrap around
> > of min_vdisktime (especially in case of idle tree. That infrastructure is
> > not part of current patches).
> >
> > Hence, to begin with I wanted to limit the range of weights allowed because
> > wider range opens up lot of interesting corner cases. That's why limited
> > minimum weight to 100. So at max user can expect the 1000/100=10 times service
> > differentiation between highest and lower weight groups. If folks need more
> > than that, we can look into it once things stablize.
>
> We definitely need the 1:100 differentiation. I'm ok with adding that
> later after the core set of patches stabilize but just letting you
> know that it is important to us.
Good to know. I will begin with a max service difference of 10 times and,
once things stabilize, will enable a wider range of weights.
> Also curious why you chose a higher
> range 100-1000 instead of 10-100? For smaller vtime leaps?
Good question. Initially we had thought that a range of 1-1000 should be
good enough, and later decided to cap the minimum weight at 100. But the same
can be achieved with a smaller range of 1-100 and capping the minimum weight
at 10. This will also make vtime leap forward more slowly.
Later, if somebody needs a ratio higher than 1:100, we can think of
supporting an even wider weight range.
Thanks Divyesh for the idea. I think I will change the weight range to 10-100
and map ioprio 0-7 onto weights 20 to 90.
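For illustration, that alternative mapping would look roughly like this (a
hypothetical helper, not code from the posted patches):

/* Hypothetical helper for the proposed 10-100 range:
 * prio 0 -> 90, prio 4 -> 50, prio 7 -> 20. */
static inline unsigned int cfq_prio_to_weight_small(unsigned short ioprio)
{
	return 90 - (unsigned int)ioprio * 10;
}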
Thanks
Vivek
>
> >
> > Priority and weights follow reverse order. Higher priority means low
> > weight and vice-versa.
> >
> > Currently we support 8 priority levels and prio "4" is the middle point.
> > Anything higher than prio 4 gets 20% less slice as compared to prio 4 and
> > priorities lower than 4, get 20% higher slice of prio 4 (20% higher/lower
> > for each priority level).
> >
> > For weight range 100 - 1000, 500 can be considered as mid point. Now this
> > is how priority mapping looks like.
> >
> > 100 200 300 400 500 600 700 800 900 1000  (Weights)
> >      7   6   5   4   3   2   1   0         (io prio).
> >
> > Once priorities are converted to weights, we are able to retain the notion
> > of 20% difference between prio levels by choosing 500 as the mid point and
> > mapping prio 0-7 to weights 900-200, hence this mapping.
> >
> > I am all ears if you have any suggestions on how this ca be handled
> > better.
> >
> > Thanks
> > Vivek
Vivek Goyal <[email protected]> writes:
> +static bool cfq_should_preempt_group(struct cfq_data *cfqd,
> + struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> +{
> + struct cfq_entity *cfqe = &cfqq->entity;
> + struct cfq_entity *new_cfqe = &new_cfqq->entity;
> +
> + if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
> + cfqe = parent_entity(&cfqq->entity);
> +
> + if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
> + new_cfqe = parent_entity(&new_cfqq->entity);
> +
> + /*
> + * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
> + */
> +
> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
> + && cfqe->ioprio_class != IOPRIO_CLASS_RT)
> + return true;
> + /*
> + * Allow an BE request to pre-empt an ongoing IDLE clas timeslice.
> + */
> +
> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
> + && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
> + return true;
> +
> + return false;
> +}
What was the motivation for this? It seems like this would really break
isolation. What if one group has all RT priority tasks, will it starve
out the other groups?
Cheers,
Jeff
On Wed, Nov 04, 2009 at 01:44:42PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
>
> > +extern void cfq_delink_blkio_group(void *, struct blkio_group *);
>
> s/delink/unlink/g
Will do.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> o Now we plan to wait for a queue to get backlogged before we expire it. So
> we need to arm slice timer even if think time is greater than slice left.
> if process sends next IO early and time slice is left, we will dispatch it
> otherwise we will expire the queue and move on to next queue.
Should this be rolled into patch 17? I'm just worried about breaking
bisection runs. What happens if this patch isn't applied?
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> o If a task changes cgroup, drop reference to the cfqq associated with io
> context so that upon next request arrival we will allocate a new queue in
> new group.
You're doing more than dropping the reference, you're also setting the
pointer in the cic to NULL. This is what you need to do, of course, but
your changelog is misleading (and sounds incomplete, and sends off
alarms in my head).
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> Hi All,
>
> This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
>
> A consolidated patch can be found here:
>
> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch
Overall it looks good. Vivek, could you please run some benchmarks
against a vanilla kernel and then with your patch applied and cgroups
compiled in but not configured?
Cheers,
Jeff
On Wed, Nov 04, 2009 at 01:52:51PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Some CFQ debugging Aid.
>
> Some CFQ debugging aids. Sorry, I couldn't help myself.
>
I thought this was all about the IO controller and not English grammar. :-)
Will fix it.
> > +config DEBUG_BLK_CGROUP
> > + bool
> > + depends on BLK_CGROUP
> > + default n
> > + ---help---
> > + Enable some debugging help. Currently it stores the cgroup path
> > + in the blk group which can be used by cfq for tracing various
> > + group related activity.
> > +
> > endif # BLOCK
> >
>
> > +config DEBUG_CFQ_IOSCHED
> > + bool "Debug CFQ Scheduling"
> > + depends on CFQ_GROUP_IOSCHED
> > + select DEBUG_BLK_CGROUP
>
> This seems wrong. DEBUG_CFQ_IOSCHED sounds like it enables debugging of
> CFQ. In your implementation, though, it only enables this in tandem
> with the blkio cgroup infrastructure. Please decouple these things.
>
What's wrong with this? We are emitting more debugging information for
CFQ, like cgroup names and paths, in the blktrace output. It is a separate
matter that internally that information also depends on debugging being
enabled in the blkio controller.
The important thing here is that the blkio controller and its debugging option
are selected automatically depending on what the user wants from the end
policies. So I really can't see what's wrong if the CFQ debugging option
internally selects some other config option as well.
> > +#ifdef CONFIG_DEBUG_BLK_CGROUP
> > + /* Store cgroup path */
> > + char path[128];
> > +#endif
>
> Where does 128 come from? What's wrong with PATH_MAX?
CFS influence (kernel/sched_debug.c).
Actually PATH_MAX would be 4096 bytes. Too long to display in blktrace
output? I have not checked what the blktrace limit is, but we don't want
lines that are too long in the output.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> On Wed, Nov 04, 2009 at 10:06:16AM -0500, Jeff Moyer wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>> > o Introduce the notion of weights. Priorities are mapped to weights internally.
>> > These weights will be useful once IO groups are introduced and group's share
>> > will be decided by the group weight.
>>
>> I'm sorry, but I need more background to review this patch. Where do
>> the min and max come from? Why do you scale 7-0 from 200-900? How does
>> this map to what was there before (exactly, approximately)?
>>
>
> Well, So far we only have the notion of iopriority for the process and
> based on that we determine time slice length.
>
> Soon we will throw cfq groups also in the mix. Because cpu IO controller
> is weight driven, people have shown preference that group's share should
> be decided based on its weight and not introduce the notion of ioprio for
> groups.
I certainly agree with that.
> Hence, to begin with I wanted to limit the range of weights allowed because
> wider range opens up lot of interesting corner cases. That's why limited
> minimum weight to 100. So at max user can expect the 1000/100=10 times service
> differentiation between highest and lower weight groups. If folks need more
> than that, we can look into it once things stablize.
>
> Priority and weights follow reverse order. Higher priority means low
> weight and vice-versa.
>
> Currently we support 8 priority levels and prio "4" is the middle point.
> Anything higher than prio 4 gets 20% less slice as compared to prio 4 and
> priorities lower than 4, get 20% higher slice of prio 4 (20% higher/lower
> for each priority level).
>
> For weight range 100 - 1000, 500 can be considered as mid point. Now this
> is how priority mapping looks like.
>
> 100 200 300 400 500 600 700 800 900 1000 (Weights)
> 7 6 5 4 3 2 1 0 (io prio).
>
> Once priorities are converted to weights, we are able to retain the notion
> of 20% difference between prio levels by choosing 500 as the mid point and
> mapping prio 0-7 to weights 900-200, hence this mapping.
I see. So when using the old ioprio mechanism, we get a smaller range
of possible values than with the cgroup configuration.
> I am all ears if you have any suggestions on how this ca be handled
> better.
I think that's a fine way to handle it. I just needed to be spoon-fed.
It would be nice if you included a write-up of how service is
differentiated in your documentation patch. In other words, from the
point of view of the sysadmin, how does he use the thing? Simple math
would likely help, too.
Cheers,
Jeff
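To make the mapping quoted above concrete, here is a minimal sketch of the
prio-to-weight conversion being described; the helper name and constants
are illustrative, not the actual patch code:

/*
 * Illustrative sketch only.  ioprio 0..7 maps to weights 900..200, with
 * prio 4 at the 500 mid point, preserving the ~20% service step between
 * adjacent priority levels (each step is 100, i.e. 20% of 500).
 */
static inline unsigned int cfq_ioprio_to_weight(unsigned short ioprio)
{
	/* 0 -> 900, 1 -> 800, ..., 4 -> 500, ..., 7 -> 200 */
	return (9 - ioprio) * 100;
}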
On Wed, Nov 04, 2009 at 02:04:45PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o Now we plan to wait for a queue to get backlogged before we expire it. So
> > we need to arm slice timer even if think time is greater than slice left.
> > if process sends next IO early and time slice is left, we will dispatch it
> > otherwise we will expire the queue and move on to next queue.
>
> Should this be rolled into patch 17? I'm just worried about breaking
> bisection runs. What happens if this patch isn't applied?
>
Without this patch we don't wait for the queue to get busy before expiring
it. That means we will not get fairness numbers between groups, even in
the simple case of sequential readers.
So nothing catastrophic. Bisection will not be broken as such (the kernel
compiles and boots).
In fact, thinking more about it, I might have broken the close cooperator
logic, as I am waiting for the queue to get busy irrespective of whether
there is a close cooperating queue or not. I think I need to check for a
cooperating queue in cfqq_should_wait_busy() and, if one is available, not
busy wait (a rough sketch follows below). This should still maintain
fairness at the group level, as cooperating queues are not selected across
groups; if a cooperating queue is available in the same group, expiry of
the current queue will not delete the group from the service tree, and the
group will still get its fair share.
Thanks
Vivek
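A rough sketch of the check described above, assuming a helper along the
lines of CFQ's existing cfq_close_cooperator(); this is illustrative, not
the actual patch code:

/*
 * Illustrative sketch only, not the actual patch code.
 * cfq_close_cooperator() is assumed to behave like the existing CFQ
 * helper: it returns a backlogged queue that is close to cfqq, if any.
 */
static bool cfqq_should_wait_busy(struct cfq_data *cfqd,
				  struct cfq_queue *cfqq)
{
	/*
	 * If a close cooperating queue is already backlogged in the same
	 * group, expiring cfqq now does not remove the group from the
	 * service tree, so the group keeps its fair share and there is
	 * no point in busy waiting for cfqq's next request.
	 */
	if (cfq_close_cooperator(cfqd, cfqq))
		return false;

	/* Otherwise wait for cfqq to get backlogged before expiring it. */
	return true;
}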
On Wed, Nov 04, 2009 at 02:09:40PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > o If a task changes cgroup, drop reference to the cfqq associated with io
> > context so that upon next request arrival we will allocate a new queue in
> > new group.
>
> You're doing more than dropping the reference, you're also setting the
> pointer in the cic to NULL. This is what you need to do, of course, but
> your changelog is misleading (and sounds incomplete, and sends off
> alarms in my head).
>
Will enrich changelog in next version.
Thanks
Vivek
On Wed, Nov 04, 2009 at 02:12:19PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > Hi All,
> >
> > This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
> >
> > A consolidated patch can be found here:
> >
> > http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch
>
> Overall it looks good. Vivek, could you please run some benchmarks
> against a vanilla kernel and then with your patch applied and cgroups
> compiled in but not configured?
>
Sure I can. Do you have something specific in mind?
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> On Wed, Nov 04, 2009 at 01:52:51PM -0500, Jeff Moyer wrote:
>> > +config DEBUG_BLK_CGROUP
>> > + bool
>> > + depends on BLK_CGROUP
>> > + default n
>> > + ---help---
>> > + Enable some debugging help. Currently it stores the cgroup path
>> > + in the blk group which can be used by cfq for tracing various
>> > + group related activity.
>> > +
>> > endif # BLOCK
>> >
>>
>> > +config DEBUG_CFQ_IOSCHED
>> > + bool "Debug CFQ Scheduling"
>> > + depends on CFQ_GROUP_IOSCHED
>> > + select DEBUG_BLK_CGROUP
>>
>> This seems wrong. DEBUG_CFQ_IOSCHED sounds like it enables debugging of
>> CFQ. In your implementation, though, it only enables this in tandem
>> with the blkio cgroup infrastructure. Please decouple these things.
>>
>
> What's wrong with this? We are emitting more debugging information for
> CFQ like cgroup name and paths in blktrace output. That's a different
> thing that internally that information is also dependent debugging being
> enabled in blk io controller.
>
> Important thing here is that blk io controller and its debugging option is
> automatically selected depending on what user wants from end policies.
>
> So I really can't see that what's wrong if CFQ debugging option internally
> selects some other config option also.
Sorry, I wasn't very clear. DEBUG_CFQ_IOSCHED implies that we are
debugging CFQ. However, in this case, we're only debugging the
BLKIO_CGROUP bits in the CFQ I/O scheduler. I would prefer it if you did
something like the following:
#ifdef DEBUG_CFQ_IOSCHED
#ifdef DEBUG_BLK_CGROUP
#endif
#endif
Then, if there was later some debugging stuff added to CFQ that was not
specific to the cgroup infrastructure, it could be put in.
Maybe it's not worth worrying about right now. I'll leave it to Jens.
>> > +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> > + /* Store cgroup path */
>> > + char path[128];
>> > +#endif
>>
>> Where does 128 come from? What's wrong with PATH_MAX?
>
> CFS influence (kernel/sched_debug.c)
>
> Actually PATH_MAX will be 4096 bytes. Too long to display in blktrace
> output? I have not check what's the blktrace limit but we don't want too
> long a lines in output.
OK, you're right. I wouldn't want to embed 4k in there anyway.
Cheers,
Jeff
Vivek Goyal <[email protected]> writes:
> On Wed, Nov 04, 2009 at 02:12:19PM -0500, Jeff Moyer wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>> > Hi All,
>> >
>> > This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
>> >
>> > A consolidated patch can be found here:
>> >
>> > http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch
>>
>> Overall it looks good. Vivek, could you please run some benchmarks
>> against a vanilla kernel and then with your patch applied and cgroups
>> compiled in but not configured?
>>
>
> Sure I can. Do you have something specific in mind?
I don't, actually. iozone comes to mind as a simple test to set up and
run. Or you could run one or more of the fio sample job files. Really,
I just want to see if we're taking a huge performance hit so we can fix
that up before it's merged.
Thanks!
Jeff
On Wed, Nov 04, 2009 at 02:00:33PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > +static bool cfq_should_preempt_group(struct cfq_data *cfqd,
> > + struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> > +{
> > + struct cfq_entity *cfqe = &cfqq->entity;
> > + struct cfq_entity *new_cfqe = &new_cfqq->entity;
> > +
> > + if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
> > + cfqe = parent_entity(&cfqq->entity);
> > +
> > + if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
> > + new_cfqe = parent_entity(&new_cfqq->entity);
> > +
> > + /*
> > + * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
> > + */
> > +
> > + if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
> > + && cfqe->ioprio_class != IOPRIO_CLASS_RT)
> > + return true;
> > + /*
> > + * Allow an BE request to pre-empt an ongoing IDLE clas timeslice.
> > + */
> > +
> > + if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
> > + && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
> > + return true;
> > +
> > + return false;
> > +}
>
> What was the motivation for this? It seems like this would really break
> isolation. What if one group has all RT priority tasks, will it starve
> out the other groups?
>
It will not, as we traverse up the hierarchy and look at the ioprio class
of the group entity.
So if you have the following configuration, where G1 and G2 are two
groups, G1 is prio class RT and G2 is prio class BE, then any queue in G1
will preempt any queue in G2, because at the highest level G1 and G2 are
of different classes altogether.
        root
        /  \
      G1    G2
Normal cfqq preemption checks will not catch this. So if G2 has some BE
cfqq running and a BE queue gets backlogged in G1, the new queue will not
preempt the queue in G2, and it should have.
That's why we do preemption checks at the group level.
Secondly, if G1 and G2 are both of ioprio class BE and all the jobs in G1
are of RT nature, they will not preempt the queues in G2, hence providing
isolation.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> It will not as we traverse up the hierarchy and look for the ioprio class
> of the group entity.
>
> So if you got following configuration where G1 and G2 are two groups. G1
> is prio class RT and G2 is prio class BE, then any queue in G1 will
> preempt any queue in G2 as at highest level, G1 and G2 are different class
> altogether.
>
> root
> / \
> G1 G2
>
> Normal cfqq preemption checks will not catch this. So if G2 has some BE
> cfqq running, and some BE queue gets backlogged in G1, this new queue wil
> not preempt the queue in G2 and it should have.
>
> That's why preemption checks at group level.
>
> Secondly if G1 and G2 are of ioprioclass BE and all the jobs in G1 are of
> RT nature, they will not preempt the queues in G2, hence providing
> isolation.
Thanks for the explanation, Vivek. I somehow missed that we were
checking the class of the group entity and not the cfqq's entity.
Cheers,
Jeff
On Wed, Nov 04, 2009 at 02:27:10PM -0500, Jeff Moyer wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Wed, Nov 04, 2009 at 02:12:19PM -0500, Jeff Moyer wrote:
> >> Vivek Goyal <[email protected]> writes:
> >>
> >> > Hi All,
> >> >
> >> > This is V1 of the Block IO controller patches on top of 2.6.32-rc5.
> >> >
> >> > A consolidated patch can be found here:
> >> >
> >> > http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch
> >>
> >> Overall it looks good. Vivek, could you please run some benchmarks
> >> against a vanilla kernel and then with your patch applied and cgroups
> >> compiled in but not configured?
> >>
> >
> > Sure I can. Do you have something specific in mind?
>
> I don't, actually. iozone comes to mind as a simple test to setup and
> run. Or you could run one or more of the fio sample job files. Really,
> I just want to see if we're taking a huge performance hit so we can fix
> that up before it's merged.
Sure, will run some tests to see if this patchset is introducing any huge
performance regressions somewhere.
Thanks
Vivek
Hi Vivek,
On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
> o Previously CFQ had one service tree where queues of all theree prio classes
> were being queued. One side affect of this time stamping approach is that
> now single tree approach might not work and we need to keep separate service
> trees for three prio classes.
>
The single service tree is no longer true in cfq for-2.6.33.
Now we have a matrix of service trees, with the first dimension being the
priority class and the second dimension being the workload type
(synchronous idle, synchronous no-idle, async).
You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
It may have other interesting influences on your work, such as the idling
introduced at the end of the synchronous no-idle tree, which provides
fairness also for seeky or high-think-time queues.
Corrado
On Wed, Nov 04, 2009 at 10:18:15PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
> > o Previously CFQ had one service tree where queues of all theree prio classes
> > ?were being queued. One side affect of this time stamping approach is that
> > ?now single tree approach might not work and we need to keep separate service
> > ?trees for three prio classes.
> >
> Single service tree is no longer true in cfq for-2.6.33.
> Now we have a matrix of service trees, with first dimension being the
> priority class, and second dimension being the workload type
> (synchronous idle, synchronous no-idle, async).
> You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
> It may have other interesting influences on your work, as the idle
> introduced at the end of the synchronous no-idle tree, that provides
> fairness also for seeky or high-think-time queues.
>
Thanks. I am looking at your patches right now. Got one question about
following commit.
****************************************************************
commit a6d44e982d3734583b3b4e1d36921af8cfd61fc0
Author: Corrado Zoccolo <[email protected]>
Date: Mon Oct 26 22:45:11 2009 +0100
cfq-iosched: enable idling for last queue on priority class
cfq can disable idling for queues in various circumstances.
When workloads of different priorities are competing, if the higher
priority queue has idling disabled, lower priority queues may steal
its disk share. For example, in a scenario with an RT process
performing seeky reads vs a BE process performing sequential reads,
on an NCQ enabled hardware, with low_latency unset,
the RT process will dispatch only the few pending requests every full
slice of service for the BE process.
The patch solves this issue by always performing idle on the last
queue at a given priority class > idle. If the same process, or one
that can pre-empt it (so at the same priority or higher), submits a
new request within the idle window, the lower priority queue won't
dispatch, saving the disk bandwidth for higher priority ones.
Note: this doesn't touch the non_rotational + NCQ case (no hardware
to test if this is a benefit in that case).
*************************************************************************
I am not able to understand the logic of waiting for the last queue in a
prio class. This whole patch series seems to be about low latencies, so
why would somebody not set "low_latency" in the IO scheduler? And if
somebody sets "low_latency", then we will enable idling on random seeky
readers too, so the problem will not exist.
On top of that, even if we don't idle for the RT reader, we will always
preempt the BE reader immediately and get the disk. The only side effect
is that on rotational media the disk head might have moved, bringing the
overall throughput down.
So my concern is that with this idling on the last queue we are targeting
a fairness issue for random seeky readers with a think time within 8ms.
That can easily be solved by setting low_latency=1. Why are we going to
this length then?
Thanks
Vivek
On Wed, Nov 04, 2009 at 10:18:15PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
> > o Previously CFQ had one service tree where queues of all theree prio classes
> > ?were being queued. One side affect of this time stamping approach is that
> > ?now single tree approach might not work and we need to keep separate service
> > ?trees for three prio classes.
> >
> Single service tree is no longer true in cfq for-2.6.33.
> Now we have a matrix of service trees, with first dimension being the
> priority class, and second dimension being the workload type
> (synchronous idle, synchronous no-idle, async).
> You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
> It may have other interesting influences on your work, as the idle
> introduced at the end of the synchronous no-idle tree, that provides
> fairness also for seeky or high-think-time queues.
>
Hi Corrado,
Had one more question. Now with dynamic slice length (reducing the slice
length to meet the target latency), don't we see reduced throughput on
rotational media with a sequential workload?
I saw you posted some numbers for SSDs. Do you have some numbers for
rotational media also?
Thanks
Vivek
* Vivek Goyal <[email protected]> [2009-11-04 12:52:44]:
> On Wed, Nov 04, 2009 at 10:51:00PM +0530, Balbir Singh wrote:
> > * Vivek Goyal <[email protected]> [2009-11-03 18:43:38]:
> >
> > > Signed-off-by: Vivek Goyal <[email protected]>
> > > ---
> > > Documentation/cgroups/blkio-controller.txt | 106 ++++++++++++++++++++++++++++
> > > 1 files changed, 106 insertions(+), 0 deletions(-)
> > > create mode 100644 Documentation/cgroups/blkio-controller.txt
> > >
> > > diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> > > new file mode 100644
> > > index 0000000..dc8fb1a
> > > --- /dev/null
> > > +++ b/Documentation/cgroups/blkio-controller.txt
> > > @@ -0,0 +1,106 @@
> > > + Block IO Controller
> > > + ===================
> > > +Overview
> > > +========
> > > +cgroup subsys "blkio" implements the block io controller. There seems to be
> > > +a need of various kind of IO control policies (like proportional BW, max BW)
> > > +both at leaf nodes as well as at intermediate nodes in storage hierarchy. Plan
> > > +is to use same cgroup based management interface for blkio controller and
> > > +based on user options switch IO policies in the background.
> > > +
> > > +In the first phase, this patchset implements proportional weight time based
> > > +division of disk policy. It is implemented in CFQ. Hence this policy takes
> > > +effect only on leaf nodes when CFQ is being used.
> > > +
> > > +HOWTO
> > > +=====
> > > +You can do a very simple testing of running two dd threads in two different
> > > +cgroups. Here is what you can do.
> > > +
> > > +- Enable group scheduling in CFQ
> > > + CONFIG_CFQ_GROUP_IOSCHED=y
> > > +
> > > +- Compile and boot into kernel and mount IO controller (blkio).
> > > +
> > > + mount -t cgroup -o blkio none /cgroup
> > > +
> > > +- Create two cgroups
> > > + mkdir -p /cgroup/test1/ /cgroup/test2
> > > +
> > > +- Set weights of group test1 and test2
> > > + echo 1000 > /cgroup/test1/blkio.weight
> > > + echo 500 > /cgroup/test2/blkio.weight
> > > +
> > > +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> > > + launch two dd threads in different cgroup to read those files.
> > > +
> > > + sync
> > > + echo 3 > /proc/sys/vm/drop_caches
> > > +
> > > + dd if=/mnt/sdb/zerofile1 of=/dev/null &
> > > + echo $! > /cgroup/test1/tasks
> > > + cat /cgroup/test1/tasks
> > > +
> > > + dd if=/mnt/sdb/zerofile2 of=/dev/null &
> > > + echo $! > /cgroup/test2/tasks
> > > + cat /cgroup/test2/tasks
> > > +
> > > +- At macro level, first dd should finish first. To get more precise data, keep
> > > + on looking at (with the help of script), at blkio.disk_time and
> > > + blkio.disk_sectors files of both test1 and test2 groups. This will tell how
> > > + much disk time (in milli seconds), each group got and how many secotors each
> > > + group dispatched to the disk. We provide fairness in terms of disk time, so
> > > + ideally io.disk_time of cgroups should be in proportion to the weight.
> > > +
> > > +Various user visible config options
> > > +===================================
> > > +CONFIG_CFQ_GROUP_IOSCHED
> > > + - Enables group scheduling in CFQ. Currently only 1 level of group
> > > + creation is allowed.
> > > +
> > > +CONFIG_DEBUG_CFQ_IOSCHED
> > > + - Enables some debugging messages in blktrace. Also creates extra
> > > + cgroup file blkio.dequeue.
> > > +
> > > +Config options selected automatically
> > > +=====================================
> > > +These config options are not user visible and are selected/deselected
> > > +automatically based on IO scheduler configuration.
> > > +
> > > +CONFIG_BLK_CGROUP
> > > + - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
> > > +
> > > +CONFIG_DEBUG_BLK_CGROUP
> > > + - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
> > > +
> > > +Details of cgroup files
> > > +=======================
> > > +- blkio.ioprio_class
> > > + - Specifies class of the cgroup (RT, BE, IDLE). This is default io
> > > + class of the group on all the devices.
> > > +
> > > + 1 = RT; 2 = BE, 3 = IDLE
> > > +
> > > +- blkio.weight
> > > + - Specifies per cgroup weight.
> > > +
> > > + Currently allowed range of weights is from 100 to 1000.
> > > +
> > > +- blkio.time
> > > + - disk time allocated to cgroup per device in milliseconds. First
> > > + two fields specify the major and minor number of the device and
> > > + third field specifies the disk time allocated to group in
> > > + milliseconds.
> > > +
> > > +- blkio.sectors
> > > + - number of sectors transferred to/from disk by the group. First
> > > + two fields specify the major and minor number of the device and
> > > + third field specifies the number of sectors transferred by the
> > > + group to/from the device.
> > > +
> > > +- blkio.dequeue
> > > + - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
> > > + gives the statistics about how many a times a group was dequeued
> > > + from service tree of the device. First two fields specify the major
> > > + and minor number of the device and third field specifies the number
> > > + of times a group was dequeued from a particular device.
> >
> > Hi, Vivek,
> >
> > Are the parameters inter-related? What if you have conflicts w.r.t.
> > time, sectors, etc?
>
> What kind of conflicts?
>
> time, sectors, and dequeue are read only files. They are mainly for
> monitoring purposes. CFQ provides fairness in terms of disk time, so one
> can monitor whether time share received by a group is fair or not.
>
> "sectors" just gives additional data of how many sectors were transferred.
> It is not a necessary file. I just exported it to get some sense of both
> the time and the amount of IO done by a cgroup.
>
> So I am not sure what kind of conflicts you are referring to.
>
OK, so they are read-only statistics? I had parsed the document
differently.
--
Balbir
On Wed, Nov 04, 2009 at 10:18:15PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
> > o Previously CFQ had one service tree where queues of all theree prio classes
> > ?were being queued. One side affect of this time stamping approach is that
> > ?now single tree approach might not work and we need to keep separate service
> > ?trees for three prio classes.
> >
> Single service tree is no longer true in cfq for-2.6.33.
> Now we have a matrix of service trees, with first dimension being the
> priority class, and second dimension being the workload type
> (synchronous idle, synchronous no-idle, async).
> You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
> It may have other interesting influences on your work, as the idle
> introduced at the end of the synchronous no-idle tree, that provides
> fairness also for seeky or high-think-time queues.
>
I am sorry that I am asking questions about a different patchset in this
mail; I don't have ready access to the other mail thread currently.
I am looking at your patchset and trying to understand how you have
ensured fairness for queues of different priority levels.
The following seems to be the key piece of code, which determines the
slice length of a queue dynamically.
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
if (cfqd->cfq_latency) {
/* interested queues (we consider only the ones with the same
* priority class) */
unsigned iq = cfq_get_avg_queues(cfqd, cfq_class_rt(cfqq));
unsigned sync_slice = cfqd->cfq_slice[1];
unsigned expect_latency = sync_slice * iq;
if (expect_latency > cfq_target_latency) {
unsigned base_low_slice = 2 * cfqd->cfq_slice_idle;
/* scale low_slice according to IO priority
* and sync vs async */
unsigned low_slice =
min(slice, base_low_slice * slice / sync_slice);
/* the adapted slice value is scaled to fit all iqs
* into the target latency */
slice = max(slice * cfq_target_latency / expect_latency,
low_slice);
}
}
cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
A couple of questions.
- expect_latency seems to be calculated based on the base slice length
for sync queues (100ms). This will give the right number only if all the
queues in the system are of prio 4. What if there are 3 prio 0 queues?
They will/should get a 180ms slice each, resulting in a max latency of
540ms, but we will calculate expect_latency = 100 * 3 = 300ms, which is
less than cfq_target_latency, so we will not adjust the slice length?
- With the "no-idle" group, who benefits? As I said, all these
optimizations seem to be for low latency. In that case the user will set
the "low_latency" tunable in CFQ, and then we will anyway enable idling
for random seeky processes with a think time of less than 8ms, so they
get their fair share.
I guess this will provide a benefit if the user has not set "low_latency";
in that case we will not enable idling on random seeky readers, and we
will gain in terms of throughput on NCQ hardware, because we dispatch from
the other no-idle queues and then idle on the no-idle group.
Time for some testing...
Thanks
Vivek
On Wed, Nov 4, 2009 at 8:37 AM, Vivek Goyal <[email protected]> wrote:
> On Wed, Nov 04, 2009 at 09:30:34AM -0500, Jeff Moyer wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>
> Thanks for the review Jeff.
>
>> > o Currently CFQ provides priority scaled time slices to processes. If a process
>> > ? does not use the time slice, either because process did not have sufficient
>> > ? IO to do or because think time of process is large and CFQ decided to disable
>> > ? idling, then processes looses it time slice share.
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?^^^^^^
>> loses
>>
>
> Will fix it.
>
>> > o One possible way to handle this is implement CFS like time stamping of the
>> > ? cfq queues and keep track of vtime. Next queue for execution will be selected
>> > ? based on the one who got lowest vtime. This patch implemented time stamping
>> > ? mechanism of cfq queues based on disk time used.
>> >
>> > o min_vdisktime represents the minimum vdisktime of the queue, either being
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^^^^^
>> > ? serviced or leftmost element on the serviec tree.
>>
>> queue or service tree? ?The latter seems to make more sense to me.
>
> Yes, it should be service tree. Will fix it.
>
>>
>> > +static inline u64
>> > +cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
>> > +{
>> > + ? const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
>> > +
>> > + ? return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
>> > +}
>>
>> cfq_scale_delta might be a better name.
>>
>
> cfq_scale_delta sounds good. Will use it in next version.
>
>>
>> > +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> > +{
>> > + ? s64 delta = (s64)(vdisktime - min_vdisktime);
>> > + ? if (delta > 0)
>> > + ? ? ? ? ? min_vdisktime = vdisktime;
>> > +
>> > + ? return min_vdisktime;
>> > +}
>> > +
>> > +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> > +{
>> > + ? s64 delta = (s64)(vdisktime - min_vdisktime);
>> > + ? if (delta < 0)
>> > + ? ? ? ? ? min_vdisktime = vdisktime;
>> > +
>> > + ? return min_vdisktime;
>> > +}
>>
>> Is there a reason you've reimplemented min and max?
>
> I think you are referring to min_t and max_t. Will these macros take care
> of wrapping too?
>
> For example, if I used min_t(u64, A, B), then unsigned comparision will
> not work right wrapping has just taken place for any of the A or B. So if
> A=-1 and B=2, then min_t() would return B as minimum. This is not right
> in our case.
>
> If we do signed comparison (min_t(s64, A, B)), that also seems to be
> broken in another case where a value of variable moves from 63bits to 64bits,
> (A=0x7fffffffffffffff, B=0x8000000000000000). Above will return B as minimum but
> in our scanario, vdisktime will progress from 0x7fffffffffffffff to
> 0x8000000000000000 and A should be returned as minimum (unsigned
> comparison).
Can you define and use u64 versions of time_before() and time_after()
(from include/linux/jiffies.h) for your comparisons? These take care
of wrapping as well. Maybe call them timestamp_before()/after().
>
> Hence I took these difnitions from CFS.
Also if these are exactly the same and you decide to continue using
these, can we move them to a common header file (time.h or maybe add a
vtime.h) and reuse?
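For reference, a wrap-safe u64 comparison along the lines Divyesh suggests
could look like the sketch below, modeled on time_after()/time_before()
from include/linux/jiffies.h; the names are placeholders, not an existing
kernel API:

/* Sketch of wrap-safe u64 timestamp comparisons; placeholder names. */
#define vdisktime_after(a, b)	((s64)((b) - (a)) < 0)
#define vdisktime_before(a, b)	vdisktime_after(b, a)
#define vdisktime_max(a, b)	(vdisktime_after(a, b) ? (a) : (b))
#define vdisktime_min(a, b)	(vdisktime_before(a, b) ? (a) : (b))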
>
>>
>> > + ? /*
>> > + ? ?* Maintain a cache of leftmost tree entries (it is frequently
>> > + ? ?* used)
>> > + ? ?*/
>>
>> You make it sound like there is a cache of more than one entry. ?Please
>> fix the comment.
>
> Will fix it.
>
>>
>> > +static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
>> > +{
>> > + ? /*
>> > + ? ?* We don't want to charge more than allocated slice otherwise this
>> > + ? ?* queue can miss one dispatch round doubling max latencies. On the
>> > + ? ?* other hand we don't want to charge less than allocated slice as
>> > + ? ?* we stick to CFQ theme of queue loosing its share if it does not
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^^^^^^^
>> losing
>>
>
> Will fix it.
>
>>
>> > +/*
>> > + * Handles three operations.
>> > + * Addition of a new queue to service tree, when a new request comes in.
>> > + * Resorting of an expiring queue (used after slice expired)
>> > + * Requeuing a queue at the front (used during preemption).
>> > + */
>> > +static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? bool add_front, unsigned long service)
>>
>> service? ?Can we come up with a better name that actually hints at what
>> this is? ?service_time, maybe?
>
> Ok, service_time sounds good. Will change it.
>
>>
>>
>> Mostly this looks pretty good and is fairly easy to read.
>
> Thanks
> Vivek
>
In the previous patchset, when merged with the bio tracking patches, we
had a very convenient biocg_id we could use for debugging. If the eventual
plan is to merge the biotrack patches, do you think it makes sense to
introduce the per-blkio cgroup id here when DEBUG_BLK_CGROUP=y and use
that for all traces? The cgroup path name seems ugly to me.
-Divyesh
On Tue, Nov 3, 2009 at 3:43 PM, Vivek Goyal <[email protected]> wrote:
> o Some CFQ debugging Aid.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> ?block/Kconfig ? ? ? ? | ? ?9 +++++++++
> ?block/Kconfig.iosched | ? ?9 +++++++++
> ?block/blk-cgroup.c ? ?| ? ?4 ++++
> ?block/blk-cgroup.h ? ?| ? 13 +++++++++++++
> ?block/cfq-iosched.c ? | ? 33 +++++++++++++++++++++++++++++++++
> ?5 files changed, 68 insertions(+), 0 deletions(-)
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 6ba1a8e..e20fbde 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -90,6 +90,15 @@ config BLK_CGROUP
> ? ? ? ?control disk bandwidth allocation (proportional time slice allocation)
> ? ? ? ?to such task groups.
>
> +config DEBUG_BLK_CGROUP
> + ? ? ? bool
> + ? ? ? depends on BLK_CGROUP
> + ? ? ? default n
> + ? ? ? ---help---
> + ? ? ? Enable some debugging help. Currently it stores the cgroup path
> + ? ? ? in the blk group which can be used by cfq for tracing various
> + ? ? ? group related activity.
> +
> ?endif # BLOCK
>
> ?config BLOCK_COMPAT
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index a521c69..9c5f0b5 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -48,6 +48,15 @@ config CFQ_GROUP_IOSCHED
> ? ? ? ?---help---
> ? ? ? ? ?Enable group IO scheduling in CFQ.
>
> +config DEBUG_CFQ_IOSCHED
> + ? ? ? bool "Debug CFQ Scheduling"
> + ? ? ? depends on CFQ_GROUP_IOSCHED
> + ? ? ? select DEBUG_BLK_CGROUP
> + ? ? ? default n
> + ? ? ? ---help---
> + ? ? ? ? Enable CFQ IO scheduling debugging in CFQ. Currently it makes
> + ? ? ? ? blktrace output more verbose.
> +
> ?choice
> ? ? ? ?prompt "Default I/O scheduler"
> ? ? ? ?default DEFAULT_CFQ
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index a62b8a3..4c68682 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -39,6 +39,10 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> ? ? ? ?blkg->blkcg_id = css_id(&blkcg->css);
> ? ? ? ?hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
> ? ? ? ?spin_unlock_irqrestore(&blkcg->lock, flags);
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> + ? ? ? /* Need to take css reference ? */
> + ? ? ? cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
> +#endif
> ?}
>
> ?static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 2bf736b..cb72c35 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -26,12 +26,25 @@ struct blkio_group {
> ? ? ? ?void *key;
> ? ? ? ?struct hlist_node blkcg_node;
> ? ? ? ?unsigned short blkcg_id;
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> + ? ? ? /* Store cgroup path */
> + ? ? ? char path[128];
> +#endif
> ?};
>
> ?#define BLKIO_WEIGHT_MIN ? ? ? 100
> ?#define BLKIO_WEIGHT_MAX ? ? ? 1000
> ?#define BLKIO_WEIGHT_DEFAULT ? 500
>
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +static inline char *blkg_path(struct blkio_group *blkg)
> +{
> + ? ? ? return blkg->path;
> +}
> +#else
> +static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
> +#endif
> +
> ?extern struct blkio_cgroup blkio_root_cgroup;
> ?struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
> ?void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b9a052b..2fde3c4 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -258,8 +258,29 @@ CFQ_CFQQ_FNS(sync);
> ?CFQ_CFQQ_FNS(coop);
> ?#undef CFQ_CFQQ_FNS
>
> +#ifdef CONFIG_DEBUG_CFQ_IOSCHED
> +#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
> + ? ? ? blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
> + ? ? ? ? ? ? ? ? ? ? ? cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
> + ? ? ? ? ? ? ? ? ? ? ? blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);
> +
> +#define cfq_log_cfqe(cfqd, cfqe, fmt, args...) ? ? ? ? ? ? ? ? \
> + ? ? ? if (cfqq_of(cfqe)) { ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> + ? ? ? ? ? ? ? struct cfq_queue *cfqq = cfqq_of(cfqe); ? ? ? ? ? ? ? ? \
> + ? ? ? ? ? ? ? blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, ? ? \
> + ? ? ? ? ? ? ? ? ? ? ? (cfqq)->pid, cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
> + ? ? ? ? ? ? ? ? ? ? ? blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);\
> + ? ? ? } else { ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> + ? ? ? ? ? ? ? struct cfq_group *cfqg = cfqg_of(cfqe); ? ? ? ? ? ? ? ? \
> + ? ? ? ? ? ? ? blk_add_trace_msg((cfqd)->queue, "%s " fmt, ? ? ? ? ? ? \
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? blkg_path(&(cfqg)->blkg), ##args); ? ? ?\
> + ? ? ? }
> +#else
> ?#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
> ? ? ? ?blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
> +#define cfq_log_cfqe(cfqd, cfqe, fmt, args...)
> +#endif
> +
> ?#define cfq_log(cfqd, fmt, args...) ? ?\
> ? ? ? ?blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
>
> @@ -400,6 +421,8 @@ cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
> ?#define for_each_entity(entity) ? ? ? ?\
> ? ? ? ?for (; entity && entity->parent; entity = entity->parent)
>
> +#define cfqe_is_cfqq(cfqe) ? ? (!(cfqe)->my_sd)
> +
> ?static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
> ?{
> ? ? ? ?if (blkg)
> @@ -588,6 +611,8 @@ void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
> ?#define for_each_entity(entity) ? ? ? ?\
> ? ? ? ?for (; entity != NULL; entity = NULL)
>
> +#define cfqe_is_cfqq(cfqe) ? ? 1
> +
> ?static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
> ?static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
> ?static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
> @@ -885,6 +910,10 @@ static void dequeue_cfqq(struct cfq_queue *cfqq)
> ? ? ? ? ? ? ? ?struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
>
> ? ? ? ? ? ? ? ?dequeue_cfqe(cfqe);
> + ? ? ? ? ? ? ? if (!cfqe_is_cfqq(cfqe)) {
> + ? ? ? ? ? ? ? ? ? ? ? cfq_log_cfqe(cfqq->cfqd, cfqe, "del_from_rr group");
> + ? ? ? ? ? ? ? }
> +
> ? ? ? ? ? ? ? ?/* Do not dequeue parent if it has other entities under it */
> ? ? ? ? ? ? ? ?if (sd->nr_active)
> ? ? ? ? ? ? ? ? ? ? ? ?break;
> @@ -970,6 +999,8 @@ static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
>
> ?static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
> ?{
> + ? ? ? struct cfq_data *cfqd = cfqq_of(cfqe)->cfqd;
> +
> ? ? ? ?for_each_entity(cfqe) {
> ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? * Can't update entity disk time while it is on sorted rb-tree
> @@ -979,6 +1010,8 @@ static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
> ? ? ? ? ? ? ? ?cfqe->vdisktime += cfq_delta_fair(served, cfqe);
> ? ? ? ? ? ? ? ?update_min_vdisktime(cfqe->st);
> ? ? ? ? ? ? ? ?__enqueue_cfqe(cfqe->st, cfqe, 0);
> + ? ? ? ? ? ? ? cfq_log_cfqe(cfqd, cfqe, "served: vt=%llx min_vt=%llx",
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cfqe->vdisktime, cfqe->st->min_vdisktime);
>
> ? ? ? ? ? ? ? ?/* If entity prio class has changed, take that into account */
> ? ? ? ? ? ? ? ?if (unlikely(cfqe->ioprio_class_changed)) {
> --
> 1.6.2.5
>
>
Vivek Goyal wrote:
> o If a user decides the change the weight or ioprio class of a cgroup, this
> information needs to be passed on to io controlling policy module also so
> that new information can take effect.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/blk-cgroup.c | 16 ++++++++++++++++
> block/cfq-iosched.c | 18 ++++++++++++++++++
> 2 files changed, 34 insertions(+), 0 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 7bde5c4..0d52a2c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -13,6 +13,10 @@
> #include <linux/ioprio.h>
> #include "blk-cgroup.h"
>
> +extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
> +extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
> + unsigned short);
> +
> struct blkio_cgroup blkio_root_cgroup = {
> .weight = BLKIO_WEIGHT_DEFAULT,
> .ioprio_class = IOPRIO_CLASS_BE,
> @@ -75,12 +79,18 @@ static int
> blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
> {
> struct blkio_cgroup *blkcg;
> + struct blkio_group *blkg;
> + struct hlist_node *n;
>
> if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
> return -EINVAL;
>
> blkcg = cgroup_to_blkio_cgroup(cgroup);
> + spin_lock_irq(&blkcg->lock);
> blkcg->weight = (unsigned int)val;
> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
> + cfq_update_blkio_group_weight(blkg, blkcg->weight);
> + spin_unlock_irq(&blkcg->lock);
> return 0;
> }
>
> @@ -88,12 +98,18 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
> struct cftype *cftype, u64 val)
> {
> struct blkio_cgroup *blkcg;
> + struct blkio_group *blkg;
> + struct hlist_node *n;
>
> if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
> return -EINVAL;
>
> blkcg = cgroup_to_blkio_cgroup(cgroup);
> + spin_lock_irq(&blkcg->lock);
> blkcg->ioprio_class = (unsigned int)val;
> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
> + cfq_update_blkio_group_weight(blkg, blkcg->weight);
This should be cfq_update_blkio_group_ioprio_class().
Thanks
Gui
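For clarity, the corrected hunk Gui is pointing at would presumably look
something like the sketch below, mirroring the weight_write path shown
earlier in the patch (a sketch, not the actual fix):

static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
					struct cftype *cftype, u64 val)
{
	struct blkio_cgroup *blkcg;
	struct blkio_group *blkg;
	struct hlist_node *n;

	if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
		return -EINVAL;

	blkcg = cgroup_to_blkio_cgroup(cgroup);
	spin_lock_irq(&blkcg->lock);
	blkcg->ioprio_class = (unsigned int)val;
	/* Propagate the new class (not the weight) to all groups */
	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
		cfq_update_blkio_group_ioprio_class(blkg, blkcg->ioprio_class);
	spin_unlock_irq(&blkcg->lock);
	return 0;
}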
Hi Vivek,
let me answer all your questions in a single mail.
On Thu, Nov 5, 2009 at 12:22 AM, Vivek Goyal <[email protected]> wrote:
> Hi Corrado,
>
> Had one more question. Now with dynamic slice length (reduce slice length
> to meet target latency), don't wee see reduced throughput on rotational
> media with sequential workload?
>
Yes. This is the main reason for disabling dynamic slice length when
low_latency is not set. In this way, on servers where low latency is not a
must (but still desirable), this feature can be disabled, while the other
features, which have a positive impact on throughput, will not be.
> I saw some you posted numbers for SSD. Do you have some numbers for
> rotational media also?
Yes. I posted it in the first RFC for this patch, outside the series:
http://lkml.org/lkml/2009/9/3/87
The other patches in the series do not affect sequential bandwidth,
but can improve random read BW in case of NCQ hardware, regardless of
it being rotational, SSD, or SAN.
> I am looking at your patchset and trying to understand how have you
> ensured fairness for different priority level queues.
>
> Following seems to be the key piece of code which determines the slice
> length of the queue dynamically.
>
>
> static inline void
> cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> { [snipped code] }
>
> A question.
>
> - expect_latency seems to be being calculated based on based slice lenth
> for sync queues (100ms). This will give right number only if all the
> queues in the system were of prio 4. What if there are 3 prio 0 queues.
> They will/should get 180ms slice each resulting in max latency of 540 ms
> but we will be calculating expect_latency to = 100 * 3 =300 ms which is
> less than cfq_target_latency and we will not adjust slice length?
>
Yes. Those are soft latencies, so we don't *guarantee* 300ms. On an
average system, where the average slice length is 100ms, we will go
pretty close (but since CFQ doesn't count the first seek in the time
slice, we can still be some tenths of ms off), but if you have a
different distribution of priorities, then this will not be
guaranteed.
> - With "no-idle" group, who benefits? As I said, all these optimizations
> seems to be for low latency. In that case user will set "low_latency"
> tunable in CFQ. If that's the case, then we will anyway enable idling
> random seeky processes having think time less than 8ms. So they get
> their fair share.
My patch changes the meaning of low_latency. As we discussed some
months ago, I always thought that the solution of idling for seeky
processes was sub-optimal. With the new code, regardless of
low_latency settings, we won't idle between 'no-idle' queues. We will
idle only at the end of the no-idle tree, if we still have not reached
workload_expires. This provides fairness between 'no-idle' and normal
sync queues.
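As a rough illustration of the decision described above (illustrative
names and structure, not the actual for-2.6.33 code):

/*
 * Illustrative sketch of the end-of-tree idling decision; not the actual
 * code.  queues_left is the number of queues still on the no-idle
 * service tree being serviced.
 */
static bool should_idle_now(struct cfq_queue *cfqq, unsigned int queues_left,
			    unsigned long workload_expires)
{
	/* Queues that normally idle keep doing so. */
	if (cfq_cfqq_idle_window(cfqq))
		return true;

	/*
	 * No idling between individual no-idle queues; idle only when the
	 * last queue of the no-idle tree is being served and the workload
	 * slice has not expired yet, so other workloads/classes cannot
	 * steal the remaining share of the disk.
	 */
	return queues_left == 1 && time_before(jiffies, workload_expires);
}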
>
> I guess this will provide benefit if user has not set "low_latency" and
> in that case we will not enable idle on random seeky readers and we will
> gain in terms of throughput on NCQ hardware because we dispatch from
> other no-idle queues and then idle on the no-idle group.
It will improve both latency and bandwidth, and as I said, it is no
longer limited to the case where low_latency is not set. After my patch
series, low_latency will control just two things:
* the dynamic time slice adaptation
* the dynamic threshold for the number of writes dispatched
Thanks
Corrado
Hi Vivek,
On Wed, Nov 4, 2009 at 11:25 PM, Vivek Goyal <[email protected]> wrote:
> Thanks. I am looking at your patches right now. Got one question about
> following commit.
>
> ****************************************************************
> commit a6d44e982d3734583b3b4e1d36921af8cfd61fc0
> Author: Corrado Zoccolo <[email protected]>
> Date: Mon Oct 26 22:45:11 2009 +0100
>
> cfq-iosched: enable idling for last queue on priority class
>
> cfq can disable idling for queues in various circumstances.
> When workloads of different priorities are competing, if the higher
> priority queue has idling disabled, lower priority queues may steal
> its disk share. For example, in a scenario with an RT process
> performing seeky reads vs a BE process performing sequential reads,
> on an NCQ enabled hardware, with low_latency unset,
> the RT process will dispatch only the few pending requests every full
> slice of service for the BE process.
>
> The patch solves this issue by always performing idle on the last
> queue at a given priority class > idle. If the same process, or one
> that can pre-empt it (so at the same priority or higher), submits a
> new request within the idle window, the lower priority queue won't
> dispatch, saving the disk bandwidth for higher priority ones.
>
> Note: this doesn't touch the non_rotational + NCQ case (no hardware
> to test if this is a benefit in that case).
> *************************************************************************
>
[snipping questions I answered in the combo mail]
> On top of that, even if we don't idle for RT reader, we will always
> preempt BE reader immediately and get the disk. The only side affect
> is that on rotational media, disk head might have moved and bring the
> overall throughput down.
You bring down throughput and also increase latency, and not only on
rotational media, so you may not want to enable it on servers.
Without low_latency, I saw this bug in the current 'fairness' policy in
CFQ, so this patch fixes it.
>
> So my concern is that with this idling on last queue, we are targetting
> fairness issue for the random seeky readers with thinktime with-in 8ms.
> That can be easily solved by setting low_latency=1. Why are we going
> to this lenth then?
Maybe on the servers where you want to run RT tasks you don't want the
aforementioned drawbacks of low_latency.
Since I was going to change the implications of low_latency in the
following patches, I fixed the 'bug' here, so I was free to change the
implementation later without reintroducing it (the bug had been present
for a long time before being fixed by the introduction of low_latency).
Thanks
Corrado
>
> Thanks
> Vivek
>
On Wed, Nov 04, 2009 at 06:44:28PM -0800, Divyesh Shah wrote:
> On Wed, Nov 4, 2009 at 8:37 AM, Vivek Goyal <[email protected]> wrote:
> > On Wed, Nov 04, 2009 at 09:30:34AM -0500, Jeff Moyer wrote:
> >> Vivek Goyal <[email protected]> writes:
> >>
> >
> > Thanks for the review Jeff.
> >
> >> > o Currently CFQ provides priority scaled time slices to processes. If a process
> >> > ? does not use the time slice, either because process did not have sufficient
> >> > ? IO to do or because think time of process is large and CFQ decided to disable
> >> > ? idling, then processes looses it time slice share.
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ?^^^^^^
> >> loses
> >>
> >
> > Will fix it.
> >
> >> > o One possible way to handle this is implement CFS like time stamping of the
> >> > ? cfq queues and keep track of vtime. Next queue for execution will be selected
> >> > ? based on the one who got lowest vtime. This patch implemented time stamping
> >> > ? mechanism of cfq queues based on disk time used.
> >> >
> >> > o min_vdisktime represents the minimum vdisktime of the queue, either being
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^^^^^
> >> > ? serviced or leftmost element on the serviec tree.
> >>
> >> queue or service tree? ?The latter seems to make more sense to me.
> >
> > Yes, it should be service tree. Will fix it.
> >
> >>
> >> > +static inline u64
> >> > +cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
> >> > +{
> >> > + ? const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
> >> > +
> >> > + ? return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
> >> > +}
> >>
> >> cfq_scale_delta might be a better name.
> >>
> >
> > cfq_scale_delta sounds good. Will use it in next version.
> >
> >>
> >> > +static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
> >> > +{
> >> > + ? s64 delta = (s64)(vdisktime - min_vdisktime);
> >> > + ? if (delta > 0)
> >> > + ? ? ? ? ? min_vdisktime = vdisktime;
> >> > +
> >> > + ? return min_vdisktime;
> >> > +}
> >> > +
> >> > +static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
> >> > +{
> >> > + ? s64 delta = (s64)(vdisktime - min_vdisktime);
> >> > + ? if (delta < 0)
> >> > + ? ? ? ? ? min_vdisktime = vdisktime;
> >> > +
> >> > + ? return min_vdisktime;
> >> > +}
> >>
> >> Is there a reason you've reimplemented min and max?
> >
> > I think you are referring to min_t and max_t. Will these macros take care
> > of wrapping too?
> >
> > For example, if I used min_t(u64, A, B), then unsigned comparision will
> > not work right wrapping has just taken place for any of the A or B. So if
> > A=-1 and B=2, then min_t() would return B as minimum. This is not right
> > in our case.
> >
> > If we do signed comparison (min_t(s64, A, B)), that also seems to be
> > broken in another case where a value of variable moves from 63bits to 64bits,
> > (A=0x7fffffffffffffff, B=0x8000000000000000). Above will return B as minimum but
> > in our scanario, vdisktime will progress from 0x7fffffffffffffff to
> > 0x8000000000000000 and A should be returned as minimum (unsigned
> > comparison).
>
> Can you define and use u64 versions of time_before() and time_after()
> (from include/linux/jiffies.h) for your comparisons? These take care
> of wrapping as well. Maybe call them timestamp_before()/after().
>
> >
> > Hence I took these difnitions from CFS.
>
> Also if these are exactly the same and you decide to continue using
> these, can we move them to a common header file (time.h or maybe add a
> vtime.h) and reuse?
>
OK, I will look into it: sharing the function between the CFS scheduler
and the CFQ scheduler.
Thanks
Vivek
On Wed, Nov 04, 2009 at 07:10:00PM -0800, Divyesh Shah wrote:
> In the previous patchset when merged with the bio tracking patches we
> had a very convenient
> biocg_id we could use for debugging. If the eventual plan is to merge
> the biotrack patches, do you think it makes sense to introduce the
> per-blkio cgroup id here when DEBUG_BLK_CGROUP=y and use that for all
> traces? The cgroup path name seems ugly to me.
>
Hi Divyesh,
We still have biocg_id; it is just that the new name is "blkcg_id". So in
every blkio_group we embed this id to know which cgroup the blkio_group
belongs to.
We had the cgroup path in previous patches also. The only advantage of
storing the path is that we don't have to call cgroup_path() again and
again: we can call it once when the blkio_group is instantiated, store the
path in the blkio_group, and then use it.
Thanks
Vivek
On Thu, Nov 05, 2009 at 01:35:22PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > o If a user decides to change the weight or ioprio class of a cgroup, this
> > information needs to be passed on to io controlling policy module also so
> > that new information can take effect.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > block/blk-cgroup.c | 16 ++++++++++++++++
> > block/cfq-iosched.c | 18 ++++++++++++++++++
> > 2 files changed, 34 insertions(+), 0 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 7bde5c4..0d52a2c 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -13,6 +13,10 @@
> > #include <linux/ioprio.h>
> > #include "blk-cgroup.h"
> >
> > +extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
> > +extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
> > + unsigned short);
> > +
> > struct blkio_cgroup blkio_root_cgroup = {
> > .weight = BLKIO_WEIGHT_DEFAULT,
> > .ioprio_class = IOPRIO_CLASS_BE,
> > @@ -75,12 +79,18 @@ static int
> > blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
> > {
> > struct blkio_cgroup *blkcg;
> > + struct blkio_group *blkg;
> > + struct hlist_node *n;
> >
> > if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
> > return -EINVAL;
> >
> > blkcg = cgroup_to_blkio_cgroup(cgroup);
> > + spin_lock_irq(&blkcg->lock);
> > blkcg->weight = (unsigned int)val;
> > + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
> > + cfq_update_blkio_group_weight(blkg, blkcg->weight);
> > + spin_unlock_irq(&blkcg->lock);
> > return 0;
> > }
> >
> > @@ -88,12 +98,18 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
> > struct cftype *cftype, u64 val)
> > {
> > struct blkio_cgroup *blkcg;
> > + struct blkio_group *blkg;
> > + struct hlist_node *n;
> >
> > if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
> > return -EINVAL;
> >
> > blkcg = cgroup_to_blkio_cgroup(cgroup);
> > + spin_lock_irq(&blkcg->lock);
> > blkcg->ioprio_class = (unsigned int)val;
> > + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
> > + cfq_update_blkio_group_weight(blkg, blkcg->weight);
>
> Here should be cfq_update_blkio_group_ioprio_class()
>
Good catch Gui. Will fix it in next version.
Thanks
Vivek
On Thu, Nov 5, 2009 at 6:42 AM, Vivek Goyal <[email protected]> wrote:
> On Wed, Nov 04, 2009 at 07:10:00PM -0800, Divyesh Shah wrote:
>> In the previous patchset when merged with the bio tracking patches we
>> had a very convenient
>> biocg_id we could use for debugging. If the eventual plan is to merge
>> the biotrack patches, do you think it makes sense to introduce the
>> per-blkio cgroup id here when DEBUG_BLK_CGROUP=y and use that for all
>> traces? The cgroup path name seems ugly to me.
>>
>
> Hi Divyesh,
>
> We still have biocg_id. Just that the new name is "blkcg_id". So in every
> blkio_group, we embed this id to know which cgroup this blkio_group
> belongs to.
Ok. So isn't the blkcg_id enough to identify a particular cgroup in
the blktrace output? Is the path really necessary (stored in
blkio_group or referenced repeatedly, either way)?
>
> We had the cgroup path in previous patches also. The only advantage of the
> path is that we don't have to call cgroup_path() again and again; we can
> call it once when the blkio_group is instantiated and then store the path in
> the blkio_group and use it.
>
> Thanks
> Vivek
Vivek Goyal wrote:
> o Additional preemption checks for groups where we travel up the hierarchy
> and see if one queue should preempt other or not.
>
> o Also prevents preemption across groups in some cases to provide isolation
> between groups.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 33 +++++++++++++++++++++++++++++++++
> 1 files changed, 33 insertions(+), 0 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 87b1799..98dbead 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2636,6 +2636,36 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> }
> }
>
> +static bool cfq_should_preempt_group(struct cfq_data *cfqd,
> + struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> +{
> + struct cfq_entity *cfqe = &cfqq->entity;
> + struct cfq_entity *new_cfqe = &new_cfqq->entity;
> +
> + if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
> + cfqe = parent_entity(&cfqq->entity);
> +
> + if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
> + new_cfqe = parent_entity(&new_cfqq->entity);
> +
> + /*
> + * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
> + */
> +
> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
> + && cfqe->ioprio_class != IOPRIO_CLASS_RT)
> + return true;
> + /*
> > + * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
> + */
> +
> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
> + && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
> + return true;
> +
> + return false;
> +}
> +
> /*
> * Check if new_cfqq should preempt the currently active queue. Return 0 for
> * no or if we aren't sure, a 1 will cause a preempt.
> @@ -2666,6 +2696,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> return true;
>
> + if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
> + return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);
> +
Vivek, why not put cfq_should_preempt_group() at the beginning of cfq_should_preempt()
to prevent preemption across groups?
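For illustration, a rough sketch of the reordering being suggested (not the
actual V1 code; the function and helper names are taken from the hunks quoted
above, and the remaining same-group checks are elided):

static bool cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
                               struct request *rq)
{
        struct cfq_queue *cfqq = cfqd->active_queue;

        if (!cfqq)
                return false;

        /*
         * Handle the cross-group case first, so that none of the later
         * same-group checks can cause preemption across groups.
         */
        if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
                return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);

        /* ... existing same-group checks (idle class, RT class, sync vs
         * async, cooperating queues, ...) unchanged ... */

        return false;
}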
--
Regards
Gui Jianfeng
On Fri, Nov 06, 2009 at 03:55:58PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > o Additional preemption checks for groups where we travel up the hierarchy
> > and see if one queue should preempt other or not.
> >
> > o Also prevents preemption across groups in some cases to provide isolation
> > between groups.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > block/cfq-iosched.c | 33 +++++++++++++++++++++++++++++++++
> > 1 files changed, 33 insertions(+), 0 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 87b1799..98dbead 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -2636,6 +2636,36 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> > }
> > }
> >
> > +static bool cfq_should_preempt_group(struct cfq_data *cfqd,
> > + struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> > +{
> > + struct cfq_entity *cfqe = &cfqq->entity;
> > + struct cfq_entity *new_cfqe = &new_cfqq->entity;
> > +
> > + if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
> > + cfqe = parent_entity(&cfqq->entity);
> > +
> > + if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
> > + new_cfqe = parent_entity(&new_cfqq->entity);
> > +
> > + /*
> > + * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
> > + */
> > +
> > + if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
> > + && cfqe->ioprio_class != IOPRIO_CLASS_RT)
> > + return true;
> > + /*
> > > + * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
> > + */
> > +
> > + if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
> > + && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > /*
> > * Check if new_cfqq should preempt the currently active queue. Return 0 for
> > * no or if we aren't sure, a 1 will cause a preempt.
> > @@ -2666,6 +2696,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> > if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
> > return true;
> >
> > + if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
> > + return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);
> > +
>
> Vivek, why not put cfq_should_preempt_group() at the beginning of cfq_should_preempt()
> to prevent preemption across groups?
Hi Gui,
Currently the checks before the group check were not hurting much, that's
why.
The only contentious check will be whether a sync IO in one group should
preempt the async IO in another group or not.
Thanks
Vivek
On Wed, Nov 04, 2009 at 10:18:15PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
> > o Previously CFQ had one service tree where queues of all three prio classes
> >  were being queued. One side effect of this time stamping approach is that
> >  now single tree approach might not work and we need to keep separate service
> >  trees for three prio classes.
> >
> Single service tree is no longer true in cfq for-2.6.33.
> Now we have a matrix of service trees, with first dimension being the
> priority class, and second dimension being the workload type
> (synchronous idle, synchronous no-idle, async).
> You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
> It may have other interesting influences on your work, as the idle
> introduced at the end of the synchronous no-idle tree, that provides
> fairness also for seeky or high-think-time queues.
>
Hi All,
I am now rebasing my patches to the for-2.6.33 branch. There are a significant
number of changes in that branch, and especially the changes from Corrado bring
in an interesting question.
Currently Corrado has introduced the functionality of kind of grouping the
cfq queues based on workload type and giving the time slots to these sub
groups (sync-idle, sync-noidle, async).
I was thinking of placing groups on top of this model, so that we select
the group first and then select the type of workload and then finally
the queue to run.
Corrado came up with an interesting suggestion (in a private mail): what if
we implement workload type at the top and divide the share among groups
within a workload type?
So one would first select the workload to run, then select the group within
that workload, and then the cfq queue within the group.
The advantages of this approach are:
- For the sync-noidle group, we will not idle per group. We will idle only
at the root level. (Well, if we don't idle on the group once it becomes
empty, we will not see fairness for the group. So it will be a fairness vs
throughput call).
- It allows us to limit the system wide share of a workload type. So for
example, one can kind of fix the system wide share of async queues.
Generally it might not be very prudent to allocate a group 50% of the
disk share and then have that group decide to do only async IO, while the
sync IO in the rest of the groups suffers.
Disadvantage:
- The definition of fairness becomes a bit murkier. Now fairness will be
achieved for a group within the workload type. So if one group is doing
IO of type sync-idle as well as sync-noidle and the other group is doing
IO of type only sync-noidle, then the first group will get overall more
disk time even if both the groups have the same weight.
Looking for some feedback about which approach makes more sense before I
write patches.
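To make the two orderings concrete, here is a toy user space sketch (not taken
from any of the patches; the group and workload names are made up) that just
prints the order in which the scheduler would walk the entities under each
model:

#include <stdio.h>

enum wl_type { SYNC_IDLE, SYNC_NOIDLE, ASYNC, WL_NR };

static const char *wl_name[WL_NR] = { "sync-idle", "sync-noidle", "async" };
static const char *grp_name[2]    = { "groupA", "groupB" };

int main(void)
{
        int g, w;

        /* Model 1: select the group first, then the workload type inside it */
        printf("groups on top:\n");
        for (g = 0; g < 2; g++)
                for (w = 0; w < WL_NR; w++)
                        printf("  %s / %s\n", grp_name[g], wl_name[w]);

        /* Model 2: select the workload type first, then share it among groups */
        printf("workload type on top:\n");
        for (w = 0; w < WL_NR; w++)
                for (g = 0; g < 2; g++)
                        printf("  %s / %s\n", wl_name[w], grp_name[g]);

        return 0;
}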
Thanks
Vivek
Vivek Goyal wrote:
> On Fri, Nov 06, 2009 at 03:55:58PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> o Additional preemption checks for groups where we travel up the hierarchy
>>> and see if one queue should preempt other or not.
>>>
>>> o Also prevents preemption across groups in some cases to provide isolation
>>> between groups.
>>>
>>> Signed-off-by: Vivek Goyal <[email protected]>
>>> ---
>>> block/cfq-iosched.c | 33 +++++++++++++++++++++++++++++++++
>>> 1 files changed, 33 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> index 87b1799..98dbead 100644
>>> --- a/block/cfq-iosched.c
>>> +++ b/block/cfq-iosched.c
>>> @@ -2636,6 +2636,36 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>> }
>>> }
>>>
>>> +static bool cfq_should_preempt_group(struct cfq_data *cfqd,
>>> + struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>>> +{
>>> + struct cfq_entity *cfqe = &cfqq->entity;
>>> + struct cfq_entity *new_cfqe = &new_cfqq->entity;
>>> +
>>> + if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
>>> + cfqe = parent_entity(&cfqq->entity);
>>> +
>>> + if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
>>> + new_cfqe = parent_entity(&new_cfqq->entity);
>>> +
>>> + /*
>>> + * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
>>> + */
>>> +
>>> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
>>> + && cfqe->ioprio_class != IOPRIO_CLASS_RT)
>>> + return true;
>>> + /*
>>> + * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
>>> + */
>>> +
>>> + if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
>>> + && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
>>> + return true;
>>> +
>>> + return false;
>>> +}
>>> +
>>> /*
>>> * Check if new_cfqq should preempt the currently active queue. Return 0 for
>>> * no or if we aren't sure, a 1 will cause a preempt.
>>> @@ -2666,6 +2696,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>>> if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>>> return true;
>>>
>>> + if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
>>> + return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);
>>> +
>> Vivek, why not put cfq_should_preempt_group() at the beginning of cfq_should_preempt()
>> to prevent preemption across groups?
>
> Hi Gui,
>
> Currently the checks before the group check were not hurting much, that's
> why.
>
> The only contentious check will be if a sync IO in one group should
> preempt the async IO in other group or not.
In my opinion, sync IO should not preempt async IO in another group, from the
fairness point of view.
Thanks
Gui
On Fri, Nov 6, 2009 at 2:22 PM, Vivek Goyal <[email protected]> wrote:
> On Wed, Nov 04, 2009 at 10:18:15PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Wed, Nov 4, 2009 at 12:43 AM, Vivek Goyal <[email protected]> wrote:
>> > o Previously CFQ had one service tree where queues of all three prio classes
>> >  were being queued. One side effect of this time stamping approach is that
>> >  now single tree approach might not work and we need to keep separate service
>> >  trees for three prio classes.
>> >
>> Single service tree is no longer true in cfq for-2.6.33.
>> Now we have a matrix of service trees, with first dimension being the
>> priority class, and second dimension being the workload type
>> (synchronous idle, synchronous no-idle, async).
>> You can have a look at the series: http://lkml.org/lkml/2009/10/26/482 .
>> It may have other interesting influences on your work, as the idle
>> introduced at the end of the synchronous no-idle tree, that provides
>> fairness also for seeky or high-think-time queues.
>>
>
> Hi All,
>
> I am now rebasing my patches to for-2.6.33 branch. There are significant
> number of changes in that branch, especially changes from corrado bring
> in an interesting question.
>
> Currently corrado has introduced the functinality of kind of grouping the
> cfq queues based on workload type and gives the time slots to these sub
> groups (sync-idle, sync-noidle, async).
>
> I was thinking of placing groups on top of this model, so that we select
> the group first and then select the type of workload and then finally
> the queue to run.
>
> Corrodo came up with an interesting suggestion (in a private mail), that
> what if we implement workload type at top and divide the share among
> groups with-in workoad type.
>
> So one would first select the workload to run and then select group
> with-in workload and then cfq queue with-in group.
>
> The advantage of this approach are.
>
> - for sync-noidle group, we will not idle per group. We will idle only
>  only at root level. (Well if we don't idle on the group once it becomes
>  empty, we will not see fairness for group. So it will be fairness vs
>  throughput call).
>
> - It allows us to limit system wide share of workload type. So for
>  example, one can kind of fix system wide share of async queues.
>  Generally it might not be very prudent to allocate a group 50% of
>  disk share and then that group decides to just do async IO and sync
>  IO in rest of the groups suffer.
>
> Disadvantage
>
> - The definition of fairness becomes bit murkier. Now fairness will be
>  achieved for a group with-in the workload type. So if a group is doing
>  IO of type sync-idle as well as sync-noidle and other group is doing
>  IO of type only sync-noidle, then first group will get overall more
>  disk time even if both the groups have same weight.
>
> Looking for some feedback about which appraoch makes more sense before I
> write patches.
At first look, the first option did make some sense. But isn't the
whole point of adding cgroups to support fairness, or isolation? If
we are adding cgroups support in a way that does not provide
isolation, there is not much point to the whole effort.
The first approach seems to be directed towards keeping good overall
throughput. Fairness and isolation might always come with a
possibility of the loss in overall throughput. The assumption is that
once someone is using cgroup, the overall system efficiency is a
concern which is secondary to the performance we are supporting for
each cgroup.
Also, the second approach is a cleaner design. For each cgroup, we will
need one data structure, instead of having 3, one for each workload
type. And all the new functionality should still live under a config
option; so if someone does not want cgroups, they can just turn them
off and we will be back to just one set of threads for each workload
type.
>
> Thanks
> Vivek
>
On Fri, Nov 6, 2009 at 11:22 PM, Vivek Goyal <[email protected]> wrote:
> Hi All,
>
> I am now rebasing my patches to for-2.6.33 branch. There are significant
> number of changes in that branch, especially changes from corrado bring
> in an interesting question.
>
> Currently corrado has introduced the functinality of kind of grouping the
> cfq queues based on workload type and gives the time slots to these sub
> groups (sync-idle, sync-noidle, async).
>
> I was thinking of placing groups on top of this model, so that we select
> the group first and then select the type of workload and then finally
> the queue to run.
>
> Corrodo came up with an interesting suggestion (in a private mail), that
> what if we implement workload type at top and divide the share among
> groups with-in workoad type.
>
> So one would first select the workload to run and then select group
> with-in workload and then cfq queue with-in group.
>
> The advantage of this approach are.
>
> - for sync-noidle group, we will not idle per group. We will idle only
> only at root level. (Well if we don't idle on the group once it becomes
> empty, we will not see fairness for group. So it will be fairness vs
> throughput call).
>
> - It allows us to limit system wide share of workload type. So for
> example, one can kind of fix system wide share of async queues.
> Generally it might not be very prudent to allocate a group 50% of
> disk share and then that group decides to just do async IO and sync
> IO in rest of the groups suffer.
>
> Disadvantage
>
> - The definition of fairness becomes bit murkier. Now fairness will be
> achieved for a group with-in the workload type. So if a group is doing
> IO of type sync-idle as well as sync-noidle and other group is doing
> IO of type only sync-noidle, then first group will get overall more
> disk time even if both the groups have same weight.
The fairness definition was always debated (disk time vs data transferred).
I think both have some reason to exist.
Namely, disk time is good for sync-idle workloads, like sequential readers,
while data transferred is good for sync-noidle workloads, like random readers.
Unfortunately, the two measures do not seem comparable, so we seem
obliged to schedule the two kinds of workloads independently.
Actually, I think we can compute a feedback from each scheduling turn
that can be used to temporarily alter weights in the next turn, in order to
reach long term fairness.
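As a very rough user space sketch of the kind of feedback I have in mind (the
structure, names and scaling rule here are all made up, just to show the shape
of the idea): after each turn, scale every group's effective weight by the
ratio of the service it was entitled to over the service it actually received.

#include <stdio.h>

struct grp {
        unsigned long weight;      /* configured weight */
        unsigned long eff_weight;  /* weight to use in the next turn */
        unsigned long served;      /* service received in this turn */
};

/* Nudge effective weights so under-served groups catch up next turn. */
static void apply_feedback(struct grp *g, int nr)
{
        unsigned long total_w = 0, total_srv = 0;
        int i;

        for (i = 0; i < nr; i++) {
                total_w += g[i].weight;
                total_srv += g[i].served;
        }

        for (i = 0; i < nr; i++) {
                unsigned long entitled = total_srv * g[i].weight / total_w;
                unsigned long got = g[i].served ? g[i].served : 1;

                g[i].eff_weight = g[i].weight * entitled / got;
                if (g[i].eff_weight == 0)
                        g[i].eff_weight = 1;
                g[i].served = 0;
        }
}

int main(void)
{
        /* two equal-weight groups, one of which was over-served this turn */
        struct grp g[2] = { { 500, 500, 800 }, { 500, 500, 200 } };

        apply_feedback(g, 2);
        printf("next turn: eff_weight[0]=%lu eff_weight[1]=%lu\n",
               g[0].eff_weight, g[1].eff_weight);
        return 0;
}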
Thanks,
Corrado
>
> Looking for some feedback about which appraoch makes more sense before I
> write patches.
>
> Thanks
> Vivek
>
On Mon, Nov 09, 2009 at 10:47:48PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 6, 2009 at 11:22 PM, Vivek Goyal <[email protected]> wrote:
> > Hi All,
> >
> > I am now rebasing my patches to for-2.6.33 branch. There are significant
> > number of changes in that branch, especially changes from corrado bring
> > in an interesting question.
> >
> > Currently corrado has introduced the functinality of kind of grouping the
> > cfq queues based on workload type and gives the time slots to these sub
> > groups (sync-idle, sync-noidle, async).
> >
> > I was thinking of placing groups on top of this model, so that we select
> > the group first and then select the type of workload and then finally
> > the queue to run.
> >
> > Corrodo came up with an interesting suggestion (in a private mail), that
> > what if we implement workload type at top and divide the share among
> > groups with-in workoad type.
> >
> > So one would first select the workload to run and then select group
> > with-in workload and then cfq queue with-in group.
> >
> > The advantage of this approach are.
> >
> > - for sync-noidle group, we will not idle per group. We will idle only
> >  only at root level. (Well if we don't idle on the group once it becomes
> >  empty, we will not see fairness for group. So it will be fairness vs
> >  throughput call).
> >
> > - It allows us to limit system wide share of workload type. So for
> >  example, one can kind of fix system wide share of async queues.
> >  Generally it might not be very prudent to allocate a group 50% of
> >  disk share and then that group decides to just do async IO and sync
> >  IO in rest of the groups suffer.
> >
> > Disadvantage
> >
> > - The definition of fairness becomes bit murkier. Now fairness will be
> >  achieved for a group with-in the workload type. So if a group is doing
> >  IO of type sync-idle as well as sync-noidle and other group is doing
> >  IO of type only sync-noidle, then first group will get overall more
> >  disk time even if both the groups have same weight.
>
> The fairness definition was always debated (disk time vs data transferred).
> I think that the two have both some reason to exist.
> Namely, disk time is good for sync-idle workloads, like sequential readers,
> while data transferred is good for sync-noidle workloads, like random readers.
I thought it was the reverse. For sync-noidle workloads (typically seeky), we
do a lot less IO, and size of IO is not the right measure; otherwise most of
the disk time will be given to the sync-noidle queue/group, and the
sync-idle queues in other groups will be heavily punished.
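To put made-up numbers on that effect (nothing measured; assume a disk doing
roughly 100MB/s sequential and 1MB/s random 4K reads):

#include <stdio.h>

int main(void)
{
        const double seq_bw = 100.0; /* MB/s while streaming (assumed) */
        const double rnd_bw = 1.0;   /* MB/s while seeking for every 4K (assumed) */

        /* time-based fairness: each queue gets half of every second */
        printf("time-based: seq %.1f MB/s, seeky %.1f MB/s\n",
               seq_bw * 0.5, rnd_bw * 0.5);

        /* data-based fairness: equal MB each, so disk time splits ~1:100 */
        double t_rnd = seq_bw / (seq_bw + rnd_bw);
        printf("data-based: both %.2f MB/s, seeky gets %.0f%% of disk time\n",
               seq_bw * (1.0 - t_rnd), t_rnd * 100.0);
        return 0;
}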
Time based fairness generally should work better on seeky media. As the
seek cost starts to come down, size of IO also starts making sense.
In fact on SSDs, we switch queues so fast and don't idle on the queue, so
doing time accounting and providing fairness in terms of time is hard for
the groups which are not continuously backlogged.
> Unfortunately, the two measures seems not comparable, so we seem
> obliged to schedule independently the two kinds of workloads.
> Actually, I think we can compute a feedback from each scheduling turn,
> that can be used to temporary alter weights in next turn, in order to
> reach long term fairness.
As one simple solution, I thought that on SSDs one can think of using a
higher level IO controlling policy instead of CFQ group scheduling.
Or, we bring in some measure in CFQ for fairness based on size/amount of
IO.
Thanks
Vivek
On Tue, Nov 10, 2009 at 12:12 AM, Vivek Goyal <[email protected]> wrote:
>
> I thought it was reverse. For sync-noidle workoads (typically seeky), we
> do lot less IO and size of IO is not the right measure otherwise most of
> the disk time we will be giving to this sync-noidle queue/group and
> sync-idle queues will be heavily punished in other groups.
This happens only if you try to measure both sequential and seeky with
the same metric.
But as soon as you have a specific metric for each, it
becomes more natural to measure disk time for sequential (since in
order to keep the sequential pattern, you have to devote contiguous
disk time to each queue).
And for seeky workloads, data transferred is viable, since here you
don't have the contiguous time restriction.
Moreover, it is not affected by the amplitude of the seeks, which is mostly
dependent on how you schedule the requests from multiple queues, so it
cannot be attributed to a single queue. And it also works if you
schedule multiple requests in parallel with NCQ.
>
> time based fairness generally should work better on seeky media. As the
> seek cost starts to come down, size of IO also starts making sense.
>
> In fact on SSD, we do queue switching so fast and don't idle on the queue,
> doing time accounting and providing fairness in terms of time is hard, for
> the groups which are not continuously backlogged.
The mechanism in place still gives fairness in terms of I/Os for SSDs.
If one queue is not even nearly backlogged, then there is no point in
enforcing fairness for it: the backlogged one would get lower
bandwidth, but the non-backlogged one wouldn't get any higher (since it is
limited by its think time).
For me, fairness for SSDs should happen only when the total BW required
by all the queues is more than what the disk can deliver, or the
total number of active queues is more than the NCQ depth. Otherwise, each
queue will get exactly the bandwidth it wants, without affecting the
others, so no idling should happen. In the mentioned cases, instead,
no idling needs to be added, since the contention for the resource will
already introduce delays.
>
>> Unfortunately, the two measures seems not comparable, so we seem
>> obliged to schedule independently the two kinds of workloads.
>> Actually, I think we can compute a feedback from each scheduling turn,
>> that can be used to temporary alter weights in next turn, in order to
>> reach long term fairness.
>
> As one simple solution, I thought that on SSDs, one can think of using
> higher level IO controlling policy instead of CFQ group scheduling.
>
> Or, we bring in some measuer in CFQ for fairness based on size/amount of
> IO.
It is already working at the I/O scheduler level, when the conditions
above are met, so if you build on top of CFQ, it should work for
groups as well.
Corrado
>
> Thanks
> Vivek
>
On Tue, Nov 10, 2009 at 12:29:30PM +0100, Corrado Zoccolo wrote:
> On Tue, Nov 10, 2009 at 12:12 AM, Vivek Goyal <[email protected]> wrote:
> >
> > I thought it was reverse. For sync-noidle workoads (typically seeky), we
> > do lot less IO and size of IO is not the right measure otherwise most of
> > the disk time we will be giving to this sync-noidle queue/group and
> > sync-idle queues will be heavily punished in other groups.
>
> This happens only if you try to measure both sequential and seeky with
> the same metric.
Ok, we seem to be discussing many things. I will try to pull it back to the
core points.
To me there are only two key questions.
- Whether workload type should be on the topmost layer or groups should be on
the topmost layer.
- How to define fairness in case of an NCQ SSD where idling hurts and we
don't choose to idle.
For the first issue, if we keep workload type on top, then we weaken the
isolation between groups. We provide isolation only within the same kind of
workload type and not across workload types.
So if one group is running only sequential readers and the other group is
running a random seeky reader, then the share of the second group is not
determined by the group weight but by the number of queues in the first group.
Hence as we increase the number of queues in the first group, the share of the
second group keeps on coming down. This kind of implies that sequential reads
in the first group are more important as compared to the random seeky reader
in the second group. But in this case the relative importance of the workload
is specified by the user with the help of cgroups and weights, and the IO
scheduler should honor that.
So to me, groups on the topmost layer make more sense than having workload
type on the topmost layer.
> >
> > time based fairness generally should work better on seeky media. As the
> > seek cost starts to come down, size of IO also starts making sense.
> >
> > In fact on SSD, we do queue switching so fast and don't idle on the queue,
> > doing time accounting and providing fairness in terms of time is hard, for
> > the groups which are not continuously backlogged.
> The mechanism in place still gives fairness in terms of I/Os for SSDs.
> One queue is not even nearly backlogged, then there is no point in
> enforcing fairness for it so that the backlogged one gets lower
> bandwidth, but the not backlogged one doesn't get higher (since it is
> limited by its think time).
>
> For me fairness for SSDs should happen only when the total BW required
> by all the queues is more than the one the disk can deliver, or the
> total number of active queues is more than NCQ depth. Otherwise, each
> queue will get exactly the bandwidth it wants, without affecting the
> others, so no idling should happen. In the mentioned cases, instead,
> no idling needs to be added, since the contention for resource will
> already introduce delays.
>
Ok, the above is pertinent to the second issue of not idling on NCQ SSDs, as
it hurts and brings down the overall throughput. I tend to agree here
that idling on queues limited by think time does not make much sense on an
NCQ SSD. In this case fairness will probably be defined by how many
times a group got scheduled in for dispatch. If a group has a higher weight
then it should be able to dispatch more times (in proportionate ratio),
as compared to a lower weight group.
We should be able to achieve this without idling, hence the overall throughput
of the system should also be good. The only catch here is that it will be
hard to achieve this behavior if a group is not continuously backlogged.
You seem to be suggesting that the current CFQ formula for calculating slice
offset takes care of that. Looking at the formula, I can't
understand how it enables dispatch from a queue in proportion to
weight or priority. I will do some experiments on my NCQ SSD and do
more discussion on this aspect later.
Thoughts?
Thanks
Vivek
On Tue, Nov 10, 2009 at 08:31:13AM -0500, Vivek Goyal wrote:
> On Tue, Nov 10, 2009 at 12:29:30PM +0100, Corrado Zoccolo wrote:
> > On Tue, Nov 10, 2009 at 12:12 AM, Vivek Goyal <[email protected]> wrote:
> > >
> > > I thought it was reverse. For sync-noidle workoads (typically seeky), we
> > > do lot less IO and size of IO is not the right measure otherwise most of
> > > the disk time we will be giving to this sync-noidle queue/group and
> > > sync-idle queues will be heavily punished in other groups.
> >
> > This happens only if you try to measure both sequential and seeky with
> > the same metric.
>
> Ok, we seem to be discussing many things. I will try to pull it back on
> core points.
>
> To me there are only two key questions.
>
> - Whether workload type should be on topmost layer or groups should be on
> topmost layer.
>
> - How to define fairness in case of NCQ SSD where idling hurts and we
> don't choose to idle.
>
>
> For the first issue, if we keep workoad type on top, then we weaken the
> isolation between groups. We provide isolation between only same kind of
> workload type and not across the workloads types.
>
> So if a group is running only sequential readers and other group is runnig
> random seeky reaeder, then share of second group is not determined by the
> group weight but the number of queues in first group.
>
> Hence as we increase number of queues in first group, share of second
> group keep on coming down. This kind of implies that sequential reads
> in first group are more important as comapred to random seeky reader in
> second group. But in this case the relative importance of workload is
> specifed by the user with the help of cgroups and weights and IO scheduler
> should honor that.
>
> So to me, groups on topmost layer makes more sense than having workload
> type on topmost layer.
>
> > >
> > > time based fairness generally should work better on seeky media. As the
> > > seek cost starts to come down, size of IO also starts making sense.
> > >
> > > In fact on SSD, we do queue switching so fast and don't idle on the queue,
> > > doing time accounting and providing fairness in terms of time is hard, for
> > > the groups which are not continuously backlogged.
> > The mechanism in place still gives fairness in terms of I/Os for SSDs.
> > One queue is not even nearly backlogged, then there is no point in
> > enforcing fairness for it so that the backlogged one gets lower
> > bandwidth, but the not backlogged one doesn't get higher (since it is
> > limited by its think time).
> >
> > For me fairness for SSDs should happen only when the total BW required
> > by all the queues is more than the one the disk can deliver, or the
> > total number of active queues is more than NCQ depth. Otherwise, each
> > queue will get exactly the bandwidth it wants, without affecting the
> > others, so no idling should happen. In the mentioned cases, instead,
> > no idling needs to be added, since the contention for resource will
> > already introduce delays.
> >
>
> Ok, above is pertinent for the second issue of not idling on NCQ SSDs as
> it hurts and brings down the overall throughput. I tend to agree here,
> that idling on queues limited by think time does not make much sense on
> NCQ SSD. In this case probably fairness will be defined by how many a
> times a group got scheduled in for dispatch. If group has higher weight
> then it should be able to dispatch more times (in proportionate ratio),
> as compared to group lower weight group.
>
> We should be able to achieve this without idling hence overall thoughtput
> of the system should also be good. The only catch here is that it will be
> hard to achieve this behavior if group is not continuously backlogged.
>
> You seem to be suggesting that current CFQ formula for calculating slice
> offset provides take care of that. Looking at the formula I can't
> understand how does it enable dispatch from a queue in proportion to
> weight or priority. I will do some experiments on my NCQ SSD and do
> more discussion on this aspect later.
>
Ok, I ran some simple tests on my NCQ SSD. I had pulled Jens' branch a
few days back and it has your patches in it.
I am running three direct sequential readers of prio 0, 4 and 7
respectively using fio for 10 seconds and then monitoring who got how
much work done.
Following is my fio job file
****************************************************************
[global]
ioengine=sync
runtime=10
size=1G
rw=read
directory=/mnt/sdc/fio/
direct=1
bs=4K
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
[seqread0]
prio=0
[seqread4]
prio=4
[seqread7]
prio=7
************************************************************************
Following are the results of 4 runs. Every run lists three jobs of prio0,
prio4 and prio7 respectively.
First run
=========
read : io=75,996KB, bw=7,599KB/s, iops=1,899, runt= 10001msec
read : io=95,920KB, bw=9,591KB/s, iops=2,397, runt= 10001msec
read : io=21,068KB, bw=2,107KB/s, iops=526, runt= 10001msec
Second run
==========
read : io=103MB, bw=10,540KB/s, iops=2,635, runt= 10001msec
read : io=102MB, bw=10,479KB/s, iops=2,619, runt= 10001msec
read : io=720KB, bw=73,728B/s, iops=18, runt= 10000msec
Third Run
=========
read : io=103MB, bw=10,532KB/s, iops=2,632, runt= 10001msec
read : io=85,728KB, bw=8,572KB/s, iops=2,142, runt= 10001msec
read : io=19,696KB, bw=1,969KB/s, iops=492, runt= 10001msec
Fourth Run
==========
read : io=50,060KB, bw=5,005KB/s, iops=1,251, runt= 10001msec
read : io=102MB, bw=10,409KB/s, iops=2,602, runt= 10001msec
read : io=54,844KB, bw=5,484KB/s, iops=1,370, runt= 10001msec
I can't see fairness being provided to processes of different prio levels. In
the first run the prio 4 process got more BW than the prio 0 process.
In the second run the prio 7 process got completely starved. Based on the slice
calculation, the ratio between prio 0 and prio 7 should be 180/40 = 4.5.
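For reference, a quick standalone check of where the 180 and 40 figures come
from, assuming the usual cfq_prio_slice() style formula (base sync slice
100ms, CFQ_SLICE_SCALE 5):

#include <stdio.h>

int main(void)
{
        const int base = 100; /* ms, default cfq_slice_sync */
        int prio;

        /* slice(prio) = base + base/5 * (4 - prio) */
        for (prio = 0; prio < 8; prio++)
                printf("prio %d -> %d ms\n", prio, base + base / 5 * (4 - prio));

        /* prio 0 -> 180 ms, prio 4 -> 100 ms, prio 7 -> 40 ms, so 180/40 = 4.5 */
        return 0;
}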
The third run is still better.
In the fourth run, again, prio 4 got double the BW of prio 0.
So I can't see how you are achieving fairness on the NCQ SSD.
One more important thing to notice is that the throughput of the SSD has come
down significantly. If I just run one job then I get 73MB/s. With these three
jobs running, we are achieving close to 19 MB/s.
I think this is happening because a seek happens after almost every
dispatch, and that brings down the overall throughput. If we had idled
here, the overall throughput would probably have been better.
Thanks
Vivek
On Tue, Nov 10, 2009 at 3:12 PM, Vivek Goyal <[email protected]> wrote:
>
> Ok, I ran some simple tests on my NCQ SSD. I had pulled the Jen's branch
> few days back and it has your patches in it.
>
> I am running three direct sequential readers or prio 0, 4 and 7
> respectively using fio for 10 seconds and then monitoring who got how
> much job done.
>
> Following is my fio job file
>
> ****************************************************************
> [global]
> ioengine=sync
> runtime=10
> size=1G
> rw=read
> directory=/mnt/sdc/fio/
> direct=1
> bs=4K
> exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
>
> [seqread0]
> prio=0
>
> [seqread4]
> prio=4
>
> [seqread7]
> prio=7
> ************************************************************************
Can you try without direct and bs?
>
> Following are the results of 4 runs. Every run lists three jobs of prio0,
> prio4 and prio7 respectively.
>
> First run
> =========
> read : io=75,996KB, bw=7,599KB/s, iops=1,899, runt= 10001msec
> read : io=95,920KB, bw=9,591KB/s, iops=2,397, runt= 10001msec
> read : io=21,068KB, bw=2,107KB/s, iops=526, runt= 10001msec
>
> Second run
> ==========
> read : io=103MB, bw=10,540KB/s, iops=2,635, runt= 10001msec
> read : io=102MB, bw=10,479KB/s, iops=2,619, runt= 10001msec
> read : io=720KB, bw=73,728B/s, iops=18, runt= 10000msec
>
> Third Run
> =========
> read : io=103MB, bw=10,532KB/s, iops=2,632, runt= 10001msec
> read : io=85,728KB, bw=8,572KB/s, iops=2,142, runt= 10001msec
> read : io=19,696KB, bw=1,969KB/s, iops=492, runt= 10001msec
>
> Fourth Run
> ==========
> read : io=50,060KB, bw=5,005KB/s, iops=1,251, runt= 10001msec
> read : io=102MB, bw=10,409KB/s, iops=2,602, runt= 10001msec
> read : io=54,844KB, bw=5,484KB/s, iops=1,370, runt= 10001msec
>
> I can't see fairness being provided to processes of diff prio levels. In
> first run prio4 got more BW than prio0 process.
>
> In second run prio 7 process got completely starved. Based on slice
> calculation, the difference between prio 0 and prio 7 should be 180/40=4.5
>
> Third run is still better.
>
> In fourth run again prio 4 got double the BW of prio 0.
>
> So I can't see how are you achieving fariness on NCQ SSD?
>
> One more important thing to notice is that throughput of SSD has come down
> significantly. If I just run one job then I get 73MB/s. With these tree
> jobs running, we are achieving close to 19 MB/s.
I think it depends on the hardware. On Jeff's SSD, 32 random readers
were obtaining approximately the same aggregate bandwidth as a
single sequential reader. I think that the decision to avoid idling is
sane on that kind of hardware, but not on ones like yours, in
which a seek has a very large penalty (I have one in my netbook, for
which reading 4k takes 1ms). However, if you increase the block size, or
remove the direct I/O, the prefetch should still work for you.
>
> I think this is happening because of seeks happening almost after every
> dispatch and that brings down the overall throughput. If we had idled
> here, I think probably overall throughput would have been better.
Agreed. In fact, I'd like to add some measurements in cfq to
determine the idle parameters, instead of relying on those binary
rules of thumb.
Which hardware is this, btw?
>
> Thanks
> Vivek
>
Thanks
Corrado
On Tue, Nov 10, 2009 at 07:05:19PM +0100, Corrado Zoccolo wrote:
> On Tue, Nov 10, 2009 at 3:12 PM, Vivek Goyal <[email protected]> wrote:
> >
> > Ok, I ran some simple tests on my NCQ SSD. I had pulled the Jen's branch
> > few days back and it has your patches in it.
> >
> > I am running three direct sequential readers or prio 0, 4 and 7
> > respectively using fio for 10 seconds and then monitoring who got how
> > much job done.
> >
> > Following is my fio job file
> >
> > ****************************************************************
> > [global]
> > ioengine=sync
> > runtime=10
> > size=1G
> > rw=read
> > directory=/mnt/sdc/fio/
> > direct=1
> > bs=4K
> > exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
> >
> > [seqread0]
> > prio=0
> >
> > [seqread4]
> > prio=4
> >
> > [seqread7]
> > prio=7
> > ************************************************************************
>
> Can you try without direct and bs?
>
Ok, here are the results without direct and bs, so it is now buffered
reads. The fio file above remains more or less the same except that I had
to change size to 2G, as within 10 seconds some process can finish reading
1G and drop out of contention.
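For reference, the adjusted job file (the one above with direct=1 and bs=4K
dropped and size bumped to 2G) looks like this:
****************************************************************
[global]
ioengine=sync
runtime=10
size=2G
rw=read
directory=/mnt/sdc/fio/
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"

[seqread0]
prio=0

[seqread4]
prio=4

[seqread7]
prio=7
************************************************************************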
First Run
=========
read : io=382MB, bw=39,112KB/s, iops=9,777, runt= 10001msec
read : io=939MB, bw=96,194KB/s, iops=24,048, runt= 10001msec
read : io=765MB, bw=78,355KB/s, iops=19,588, runt= 10004msec
Second run
==========
read : io=443MB, bw=45,395KB/s, iops=11,348, runt= 10004msec
read : io=1,058MB, bw=106MB/s, iops=27,081, runt= 10001msec
read : io=650MB, bw=66,535KB/s, iops=16,633, runt= 10006msec
Third Run
=========
read : io=727MB, bw=74,465KB/s, iops=18,616, runt= 10004msec
read : io=890MB, bw=91,126KB/s, iops=22,781, runt= 10001msec
read : io=406MB, bw=41,608KB/s, iops=10,401, runt= 10004msec
Fourth Run
==========
read : io=792MB, bw=81,143KB/s, iops=20,285, runt= 10001msec
read : io=1,024MB, bw=102MB/s, iops=26,192, runt= 10009msec
read : io=314MB, bw=32,093KB/s, iops=8,023, runt= 10011msec
I still can't get a service difference proportionate to the priority levels.
In fact, in some cases it is more like priority inversion, where the higher
priority job is getting lower BW.
> >
> > Following are the results of 4 runs. Every run lists three jobs of prio0,
> > prio4 and prio7 respectively.
> >
> > First run
> > =========
> > read : io=75,996KB, bw=7,599KB/s, iops=1,899, runt= 10001msec
> > read : io=95,920KB, bw=9,591KB/s, iops=2,397, runt= 10001msec
> > read : io=21,068KB, bw=2,107KB/s, iops=526, runt= 10001msec
> >
> > Second run
> > ==========
> > read : io=103MB, bw=10,540KB/s, iops=2,635, runt= 10001msec
> > read : io=102MB, bw=10,479KB/s, iops=2,619, runt= 10001msec
> > read : io=720KB, bw=73,728B/s, iops=18, runt= 10000msec
> >
> > Third Run
> > =========
> > read : io=103MB, bw=10,532KB/s, iops=2,632, runt= 10001msec
> > read : io=85,728KB, bw=8,572KB/s, iops=2,142, runt= 10001msec
> > read : io=19,696KB, bw=1,969KB/s, iops=492, runt= 10001msec
> >
> > Fourth Run
> > ==========
> > read : io=50,060KB, bw=5,005KB/s, iops=1,251, runt= 10001msec
> > read : io=102MB, bw=10,409KB/s, iops=2,602, runt= 10001msec
> > read : io=54,844KB, bw=5,484KB/s, iops=1,370, runt= 10001msec
> >
> > I can't see fairness being provided to processes of diff prio levels. In
> > first run prio4 got more BW than prio0 process.
> >
> > In second run prio 7 process got completely starved. Based on slice
> > calculation, the difference between prio 0 and prio 7 should be 180/40=4.5
> >
> > Third run is still better.
> >
> > In fourth run again prio 4 got double the BW of prio 0.
> >
> > So I can't see how are you achieving fariness on NCQ SSD?
> >
> > One more important thing to notice is that throughput of SSD has come down
> > significantly. If I just run one job then I get 73MB/s. With these tree
> > jobs running, we are achieving close to 19 MB/s.
>
> I think it depends on the hardware. On Jeff's SSD, 32 random readers
> were obtaining approximately the same aggregate bandwidth than a
> single sequential reader. I think that the decision to avoid idling is
> sane on that kind of hardware, but not on the ones like yours, in
> which seek has a very large penalty (I have one in my netbook, for
> which reading 4k takes 1ms). However, if you increase block size, or
> remove the direct I/O, the prefetch should still work for you.
Of course increasing the block size, or making the IO buffered (which in
turn will increase the effective block size for sequential reads), will
increase the throughput.
Here I wanted to get the cache out of the picture so that we can see what is
happening at the IO scheduling layer.
Thanks
Vivek
> >
> > I think this is happening because of seeks happening almost after every
> > dispatch and that brings down the overall throughput. If we had idled
> > here, I think probably overall throughput would have been better.
> Agreed. In fact, I'd like to add some measurements in cfq, to
> determine the idle parameters, instead of relying on those binary
> rules of thumbs.
> Which hardware is this, btw?
>
> >
> > Thanks
> > Vivek
> >
> Thanks
> Corrado
Vivek Goyal wrote:
...
>
> @@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> struct cfq_queue *cfqq;
> int dispatched = 0;
>
> - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
> + while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>
> - cfq_slice_expired(cfqd, 0);
> + cfq_slice_expired(cfqd);
>
> BUG_ON(cfqd->busy_queues);
>
> @@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
> cfq_class_idle(cfqq))) {
> cfqq->slice_end = jiffies + 1;
> - cfq_slice_expired(cfqd, 0);
> + cfq_slice_expired(cfqd);
Hi Vivek,
I think here you should make sure that cfqq->slice_end is not 0 before updating it.
If cfqq->slice_end == 0, cfq_slice_expired() just charges 1 jiffy; but if
cfqq->slice_end is updated while it is still 0 (first request still in the air), then
cfqq->slice_start == 0 as well, and slice_used is charged as "jiffies - cfqq->slice_start".
The following patch fixes this bug.
Signed-off-by: Gui Jianfeng <[email protected]>
---
block/cfq-iosched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f23d713..12afc14 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1999,7 +1999,8 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfqq->slice_end = jiffies + 1;
+ if (cfqq->slice_end)
+ cfqq->slice_end = jiffies + 1;
cfq_slice_expired(cfqd);
}
--
1.5.4.rc3
On Tue, Nov 10, 2009 at 8:15 PM, Vivek Goyal <[email protected]> wrote:
> On Tue, Nov 10, 2009 at 07:05:19PM +0100, Corrado Zoccolo wrote:
>> On Tue, Nov 10, 2009 at 3:12 PM, Vivek Goyal <[email protected]> wrote:
>> >
>> > Ok, I ran some simple tests on my NCQ SSD. I had pulled the Jen's branch
>> > few days back and it has your patches in it.
>> >
>> > I am running three direct sequential readers or prio 0, 4 and 7
>> > respectively using fio for 10 seconds and then monitoring who got how
>> > much job done.
>> >
>> > Following is my fio job file
>> >
>> > ****************************************************************
>> > [global]
>> > ioengine=sync
>> > runtime=10
>> > size=1G
>> > rw=read
>> > directory=/mnt/sdc/fio/
>> > direct=1
>> > bs=4K
>> > exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
>> >
>> > [seqread0]
>> > prio=0
>> >
>> > [seqread4]
>> > prio=4
>> >
>> > [seqread7]
>> > prio=7
>> > ************************************************************************
>>
>> Can you try without direct and bs?
>>
>
> Ok, here are the results without direct and bs. So it is now buffered
> reads. The fio file above remains more or less the same, except that I had
> to change size to 2G, as within 10 seconds a process can finish reading
> 1G and get out of contention.
>
> First Run
> =========
> read : io=382MB, bw=39,112KB/s, iops=9,777, runt= 10001msec
> read : io=939MB, bw=96,194KB/s, iops=24,048, runt= 10001msec
> read : io=765MB, bw=78,355KB/s, iops=19,588, runt= 10004msec
>
> Second run
> ==========
> read : io=443MB, bw=45,395KB/s, iops=11,348, runt= 10004msec
> read : io=1,058MB, bw=106MB/s, iops=27,081, runt= 10001msec
> read : io=650MB, bw=66,535KB/s, iops=16,633, runt= 10006msec
>
> Third Run
> =========
> read : io=727MB, bw=74,465KB/s, iops=18,616, runt= 10004msec
> read : io=890MB, bw=91,126KB/s, iops=22,781, runt= 10001msec
> read : io=406MB, bw=41,608KB/s, iops=10,401, runt= 10004msec
>
> Fourth Run
> ==========
> read : io=792MB, bw=81,143KB/s, iops=20,285, runt= 10001msec
> read : io=1,024MB, bw=102MB/s, iops=26,192, runt= 10009msec
> read : io=314MB, bw=32,093KB/s, iops=8,023, runt= 10011msec
>
> Still can't get service differentiation proportionate to the priority levels.
> In fact, in some cases it looks more like priority inversion, where the
> higher priority reader is getting lower BW.
Jeff's numbers are:
~/tmp/for-cz/for-2.6.33/output/be0-through-7.fio ~/tmp/for-cz/for-2.6.33
total priority: 880
total data transferred: 4064576
class prio ideal xferred %diff
be 0 831390 645764 -23
be 1 739013 562932 -24
be 2 646637 2097156 224
be 3 554260 250612 -55
be 4 461883 185332 -60
be 5 369506 149492 -60
be 6 277130 98036 -65
be 7 184753 75252 -60
~/tmp/for-cz/for-2.6.33
~/tmp/for-cz/for-2.6.33/output/be0-vs-be1.fio ~/tmp/for-cz/for-2.6.33
total priority: 340
total data transferred: 2244584
class prio ideal xferred %diff
be 0 1188309 1179636 -1
be 1 1056274 1064948 0
~/tmp/for-cz/for-2.6.33
~/tmp/for-cz/for-2.6.33/output/be0-vs-be7.fio ~/tmp/for-cz/for-2.6.33
total priority: 220
total data transferred: 2232808
class prio ideal xferred %diff
be 0 1826842 1834484 0
be 7 405965 398324 -2
There is one big outlier, but usually the transferred data is in line
with priority.
Seeing your numbers, though, where the process with intermediate
priority is almost consistently getting more bandwidth than the
others, I think it must be some bug in the code that caused both your
results and the outlier seen in Jeff's test.
I'll have a closer look at the interactions of the various parts of
the code, to see if I can spot the problem.
Thanks
Corrado
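As an aside, the "ideal" column in Jeff's tables above appears to be the
total transfer split according to CFQ's per-priority slice lengths (180ms for
prio 0 down to 40ms for prio 7, which also matches "total priority: 880",
340 and 220). Assuming that is indeed how the numbers were produced (Jeff's
script is not shown here), the calculation is roughly:

	#include <stdio.h>

	/*
	 * Per-priority weight mirroring CFQ's slice scaling: 100ms base
	 * slice plus 20ms for each step a queue sits above prio 4, i.e.
	 * 180, 160, ..., 40 for prio 0..7. (Assumed, not taken from
	 * Jeff's script.)
	 */
	static long prio_weight(int prio)
	{
		return 100 + (100 / 5) * (4 - prio);
	}

	int main(void)
	{
		long total_xfer = 4064576;	/* total transferred, from the first table */
		long total_weight = 0;
		int p;

		for (p = 0; p < 8; p++)
			total_weight += prio_weight(p);		/* 880 */

		for (p = 0; p < 8; p++)
			printf("be %d ideal %ld\n", p,
			       total_xfer * prio_weight(p) / total_weight);
		return 0;
	}

With these weights the output matches the 831390, 739013, ..., 184753
figures shown for the eight-reader run.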
On Wed, Nov 11, 2009 at 08:48:09AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >
> > @@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> > struct cfq_queue *cfqq;
> > int dispatched = 0;
> >
> > - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
> > + while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
> > dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> >
> > - cfq_slice_expired(cfqd, 0);
> > + cfq_slice_expired(cfqd);
> >
> > BUG_ON(cfqd->busy_queues);
> >
> > @@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> > cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
> > cfq_class_idle(cfqq))) {
> > cfqq->slice_end = jiffies + 1;
> > - cfq_slice_expired(cfqd, 0);
> > + cfq_slice_expired(cfqd);
>
> Hi Vivek,
>
> I think here you should make sure that cfqq->slice_end is not 0 before updating it.
> If cfqq->slice_end == 0, cfq_slice_expired() just charges 1 jiffy; but if
> cfqq->slice_end is overwritten while it is still 0 (the first request is still in
> flight), then cfqq->slice_start is also 0, and slice_used is charged as
> "jiffies - cfqq->slice_start". The following patch fixes this bug.
>
Hi Gui,
This can happen only once during one wrap-around cycle of jiffies, and even
that depends on whether we hit jiffies+1 as 0 or not.
So I would not worry much about it right now.
In fact, not updating slice_end will make an idle or async queue's slice last
much longer than it should.
Thanks
Vivek
> Signed-off-by: Gui Jianfeng <[email protected]>
> ---
> block/cfq-iosched.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f23d713..12afc14 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1999,7 +1999,8 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
> cfq_class_idle(cfqq))) {
> - cfqq->slice_end = jiffies + 1;
> + if (cfqq->slice_end)
> + cfqq->slice_end = jiffies + 1;
> cfq_slice_expired(cfqd);
> }
>
> --
> 1.5.4.rc3
Vivek Goyal wrote:
> On Wed, Nov 11, 2009 at 08:48:09AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>
>>> @@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>>> struct cfq_queue *cfqq;
>>> int dispatched = 0;
>>>
>>> - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
>>> + while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
>>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>>>
>>> - cfq_slice_expired(cfqd, 0);
>>> + cfq_slice_expired(cfqd);
>>>
>>> BUG_ON(cfqd->busy_queues);
>>>
>>> @@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>>> cfq_class_idle(cfqq))) {
>>> cfqq->slice_end = jiffies + 1;
>>> - cfq_slice_expired(cfqd, 0);
>>> + cfq_slice_expired(cfqd);
>> Hi Vivek,
>>
>> I think here you should make sure that cfqq->slice_end is not 0 before updating it.
>> If cfqq->slice_end == 0, cfq_slice_expired() just charges 1 jiffy; but if
>> cfqq->slice_end is overwritten while it is still 0 (the first request is still in
>> flight), then cfqq->slice_start is also 0, and slice_used is charged as
>> "jiffies - cfqq->slice_start". The following patch fixes this bug.
>>
>
> Hi Gui,
>
> This can happen only once during one wrap-around cycle of jiffies, and even
> that depends on whether we hit jiffies+1 as 0 or not.
>
> So I would not worry much about it right now.
>
> In fact, not updating slice_end will make an idle or async queue's slice last
> much longer than it should.
I don't think so, Vivek. This bug can easily be triggered by creating two cgroups
and running an idle-class task in one group and a normal task in the other. When
the idle task sends out its first request, the bug occurs. I can reproduce it
every time with the following script.
#!/bin/sh
mkdir /cgroup
mount -t cgroup -o blkio io /cgroup
mkdir /cgroup/tst1
mkdir /cgroup/tst2
dd if=/dev/sdb2 of=/dev/null &
pid1=$!
echo $pid1 > /cgroup/tst1/tasks
dd if=/dev/sdb3 of=/dev/null &
pid2=$!
ionice -c3 -p$pid2
echo $pid2 > /cgroup/tst2/tasks
sleep 5
cat /cgroup/tst1/blkio.time
cat /cgroup/tst2/blkio.time
killall -9 dd
sleep 1
rmdir /cgroup/tst1
rmdir /cgroup/tst2
umount /cgroup
rmdir /cgroup
>
> Thanks
> Vivek
>
>
>> Signed-off-by: Gui Jianfeng <[email protected]>
>> ---
>> block/cfq-iosched.c | 3 ++-
>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index f23d713..12afc14 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1999,7 +1999,8 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>> if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
>> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>> cfq_class_idle(cfqq))) {
>> - cfqq->slice_end = jiffies + 1;
>> + if (cfqq->slice_end)
>> + cfqq->slice_end = jiffies + 1;
>> cfq_slice_expired(cfqd);
>> }
>>
>> --
>> 1.5.4.rc3
>
>
>
--
Regards
Gui Jianfeng
On Fri, Nov 13, 2009 at 08:59:08AM +0800, Gui Jianfeng wrote:
>
>
> Vivek Goyal wrote:
> > On Wed, Nov 11, 2009 at 08:48:09AM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >> ...
> >>>
> >>> @@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> >>> struct cfq_queue *cfqq;
> >>> int dispatched = 0;
> >>>
> >>> - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
> >>> + while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
> >>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> >>>
> >>> - cfq_slice_expired(cfqd, 0);
> >>> + cfq_slice_expired(cfqd);
> >>>
> >>> BUG_ON(cfqd->busy_queues);
> >>>
> >>> @@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> >>> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
> >>> cfq_class_idle(cfqq))) {
> >>> cfqq->slice_end = jiffies + 1;
> >>> - cfq_slice_expired(cfqd, 0);
> >>> + cfq_slice_expired(cfqd);
> >> Hi Vivek,
> >>
> >> I think here you should make sure that cfqq->slice_end is not 0 before updating it.
> >> If cfqq->slice_end == 0, cfq_slice_expired() just charges 1 jiffy; but if
> >> cfqq->slice_end is overwritten while it is still 0 (the first request is still in
> >> flight), then cfqq->slice_start is also 0, and slice_used is charged as
> >> "jiffies - cfqq->slice_start". The following patch fixes this bug.
> >>
> >
> > Hi Gui,
> >
> > This can happen only once during one wrap-around cycle of jiffies, and even
> > that depends on whether we hit jiffies+1 as 0 or not.
> >
> > So I would not worry much about it right now.
> >
> > In fact, not updating slice_end will make an idle or async queue's slice last
> > much longer than it should.
>
> I don't think so, Vivek. This bug can easily be triggered by creating two cgroups
> and running an idle-class task in one group and a normal task in the other. When
> the idle task sends out its first request, the bug occurs. I can reproduce it
> every time with the following script.
>
Oh, sorry, looks like I read your mail too fast. So you are saying that
in this case we should be charging 1 ms, but instead we will be charging
(jiffies - 0), which might be a huge number, and then that particular
group will not be scheduled for a long time?
How about changing the charging code to also check if slice_start == 0? So
in my V2 I will change cfq_cfqq_slice_usage() to also check slice_start, to
determine whether a slice has actually started or not.
if (!cfqq->slice_start || cfqq->slice_start == jiffies)
        charge_1ms;
else
        charge_based_on_time_elapsed;
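Spelled out in C, that check might look like the sketch below. This is only a
sketch of the proposed V2 change; charge_1ms and charge_based_on_time_elapsed
above are placeholders, and the actual helper may well end up different.

	/*
	 * Possible shape of the V2 change to cfq_cfqq_slice_usage() discussed
	 * above: if the slice never really started (slice_start == 0) or
	 * started in the current jiffy, charge a single jiffy rather than
	 * the wall-clock time elapsed since slice_start.
	 */
	static unsigned long cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
	{
		if (!cfqq->slice_start || cfqq->slice_start == jiffies)
			return 1;			/* charge 1 jiffy (~1ms at HZ=1000) */

		return jiffies - cfqq->slice_start;	/* charge the time actually used */
	}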
Thanks
Vivek
> #!/bin/sh
>
> mkdir /cgroup
> mount -t cgroup -o blkio io /cgroup
> mkdir /cgroup/tst1
> mkdir /cgroup/tst2
>
> dd if=/dev/sdb2 of=/dev/null &
> pid1=$!
> echo $pid1 > /cgroup/tst1/tasks
>
> dd if=/dev/sdb3 of=/dev/null &
> pid2=$!
> ionice -c3 -p$pid2
> echo $pid2 > /cgroup/tst2/tasks
>
> sleep 5
>
> cat /cgroup/tst1/blkio.time
> cat /cgroup/tst2/blkio.time
>
> killall -9 dd
> sleep 1
>
> rmdir /cgroup/tst1
> rmdir /cgroup/tst2
> umount /cgroup
> rmdir /cgroup
>
>
> >
> > Thanks
> > Vivek
> >
> >
> >> Signed-off-by: Gui Jianfeng <[email protected]>
> >> ---
> >> block/cfq-iosched.c | 3 ++-
> >> 1 files changed, 2 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index f23d713..12afc14 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -1999,7 +1999,8 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> >> if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
> >> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
> >> cfq_class_idle(cfqq))) {
> >> - cfqq->slice_end = jiffies + 1;
> >> + if (cfqq->slice_end)
> >> + cfqq->slice_end = jiffies + 1;
> >> cfq_slice_expired(cfqd);
> >> }
> >>
> >> --
> >> 1.5.4.rc3
> >
> >
> >
>
> --
> Regards
> Gui Jianfeng
Vivek Goyal wrote:
> On Fri, Nov 13, 2009 at 08:59:08AM +0800, Gui Jianfeng wrote:
>>
>> Vivek Goyal wrote:
>>> On Wed, Nov 11, 2009 at 08:48:09AM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>> ...
>>>>>
>>>>> @@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>>>>> struct cfq_queue *cfqq;
>>>>> int dispatched = 0;
>>>>>
>>>>> - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
>>>>> + while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
>>>>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>>>>>
>>>>> - cfq_slice_expired(cfqd, 0);
>>>>> + cfq_slice_expired(cfqd);
>>>>>
>>>>> BUG_ON(cfqd->busy_queues);
>>>>>
>>>>> @@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>>>> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>>>>> cfq_class_idle(cfqq))) {
>>>>> cfqq->slice_end = jiffies + 1;
>>>>> - cfq_slice_expired(cfqd, 0);
>>>>> + cfq_slice_expired(cfqd);
>>>> Hi Vivek,
>>>>
>>>> I think here you should make sure that cfqq->slice_end is not 0 before updating it.
>>>> If cfqq->slice_end == 0, cfq_slice_expired() just charges 1 jiffy; but if
>>>> cfqq->slice_end is overwritten while it is still 0 (the first request is still in
>>>> flight), then cfqq->slice_start is also 0, and slice_used is charged as
>>>> "jiffies - cfqq->slice_start". The following patch fixes this bug.
>>>>
>>> Hi Gui,
>>>
>>> This can happen only once during one wrap-around cycle of jiffies, and even
>>> that depends on whether we hit jiffies+1 as 0 or not.
>>>
>>> So I would not worry much about it right now.
>>>
>>> In fact, not updating slice_end will make an idle or async queue's slice last
>>> much longer than it should.
>> I don't think so, Vivek. This bug can easily be triggered by creating two cgroups
>> and running an idle-class task in one group and a normal task in the other. When
>> the idle task sends out its first request, the bug occurs. I can reproduce it
>> every time with the following script.
>>
>
> Oh, sorry, looks like I read your mail too fast. So you are saying that
> in this case we should be charging 1 ms, but instead we will be charging
> (jiffies - 0), which might be a huge number, and then that particular
> group will not be scheduled for a long time?
Yes, that's it.
>
> How about changing the charging code to also check if slice_start == 0? So
> in my V2 I will change cfq_cfqq_slice_usage() to also check slice_start, to
> determine whether a slice has actually started or not.
>
> if (!cfqq->slice_start || cfqq->slice_start == jiffies)
>         charge_1ms;
> else
>         charge_based_on_time_elapsed;
I think this change should also work :)
Thanks
Gui
>
> Thanks
> Vivek
>
>> #!/bin/sh
>>
>> mkdir /cgroup
>> mount -t cgroup -o blkio io /cgroup
>> mkdir /cgroup/tst1
>> mkdir /cgroup/tst2
>>
>> dd if=/dev/sdb2 of=/dev/null &
>> pid1=$!
>> echo $pid1 > /cgroup/tst1/tasks
>>
>> dd if=/dev/sdb3 of=/dev/null &
>> pid2=$!
>> ionice -c3 -p$pid2
>> echo $pid2 > /cgroup/tst2/tasks
>>
>> sleep 5
>>
>> cat /cgroup/tst1/blkio.time
>> cat /cgroup/tst2/blkio.time
>>
>> killall -9 dd
>> sleep 1
>>
>> rmdir /cgroup/tst1
>> rmdir /cgroup/tst2
>> umount /cgroup
>> rmdir /cgroup
>>
>>
>>> Thanks
>>> Vivek
>>>
>>>
>>>> Signed-off-by: Gui Jianfeng <[email protected]>
>>>> ---
>>>> block/cfq-iosched.c | 3 ++-
>>>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>>> index f23d713..12afc14 100644
>>>> --- a/block/cfq-iosched.c
>>>> +++ b/block/cfq-iosched.c
>>>> @@ -1999,7 +1999,8 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>>> if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
>>>> cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>>>> cfq_class_idle(cfqq))) {
>>>> - cfqq->slice_end = jiffies + 1;
>>>> + if (cfqq->slice_end)
>>>> + cfqq->slice_end = jiffies + 1;
>>>> cfq_slice_expired(cfqd);
>>>> }
>>>>
>>>> --
>>>> 1.5.4.rc3
>>>
>>>
>> --
>> Regards
>> Gui Jianfeng
>
>
>
--
Regards
Gui Jianfeng