2011-03-22 23:09:30

by Justin TerAvest

Subject: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

This patchset adds tracking to the page_cgroup structure to record which
cgroup has dirtied a page, and uses that information to provide isolation
between cgroups performing writeback.

I know that there is some discussion about removing request descriptor limits
entirely, but I included a patch introducing per-cgroup limits because this
functionality depends on it. Without it, we didn't see much isolation
improvement.

I think most of this material has been discussed on lkml previously; this is
just another attempt to make a patchset that handles buffered writes for CFQ.

There was a lot of previous discussion at:
http://thread.gmane.org/gmane.linux.kernel/1007922

Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
and Vivek Goyal for work on previous versions of these patches.

For version 2:
- I collected more statistics and provided data in the cover sheet
- blkio id is now stored inside "flags" in page_cgroup, with cmpxchg (sketch below)
- I cleaned up some patch names
- Added symmetric reference wrappers in cfq-iosched
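
The cmpxchg trick mentioned above is easiest to see outside the kernel. Below
is a minimal user-space sketch (C, illustrative only -- the variable names and
the __sync builtin stand in for page_cgroup->flags and the kernel's cmpxchg())
of how the blkio cgroup id is packed into the upper bits of the flags word
while the low 16 bits stay reserved for the existing flag bits. Patch 2 adds
the real page_cgroup_set_blkio_id()/page_cgroup_get_blkio_id() helpers.

#include <stdio.h>

#define TRACKING_ID_SHIFT 16UL

static unsigned long flags;	/* stand-in for page_cgroup->flags */

static void set_blkio_id(unsigned long *word, unsigned long id)
{
	unsigned long old, new;

	do {
		old = *word;
		new = old & ((1UL << TRACKING_ID_SHIFT) - 1);	/* keep low flag bits */
		new |= id << TRACKING_ID_SHIFT;			/* install the new id */
	} while (__sync_val_compare_and_swap(word, old, new) != old);
}

static unsigned long get_blkio_id(unsigned long *word)
{
	return *word >> TRACKING_ID_SHIFT;
}

int main(void)
{
	flags = 0x3;			/* pretend two low flag bits are already set */
	set_blkio_id(&flags, 42);	/* tag the "page" with blkio cgroup id 42 */
	printf("id=%lu, low bits=%#lx\n", get_blkio_id(&flags), flags & 0xffffUL);
	return 0;
}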

There are a couple of lingering issues in this patchset -- it's meant to be an
RFC to discuss the overall design for tracking buffered writes. I have at
least a couple of patches to finish to make absolutely sure that refcounts and
locking are handled properly; I just need to do more testing.

Documentation/block/biodoc.txt | 10 +
block/blk-cgroup.c | 203 +++++++++++++++++-
block/blk-cgroup.h | 9 +-
block/blk-core.c | 218 +++++++++++++------
block/blk-settings.c | 2 +-
block/blk-sysfs.c | 59 +++---
block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++----------
block/cfq.h | 6 +-
block/elevator.c | 7 +-
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 81 +++++++-
include/linux/blkio-track.h | 89 ++++++++
include/linux/elevator.h | 14 +-
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 38 +++-
init/Kconfig | 16 ++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 6 +
mm/page-writeback.c | 14 +-
mm/page_cgroup.c | 29 ++-
mm/swap_state.c | 2 +
28 files changed, 1066 insertions(+), 240 deletions(-)


8f0b0f4 cfq: Don't allow preemption across cgroups
a47cdc6 block: Per cgroup request descriptor counts
8dd7adb cfq: add per cgroup writeout done by flusher stat
1fa0b6d cfq: Fix up tracked async workload length.
e9e85d3 block: Modify CFQ to use IO tracking information.
f8ffb19 cfq-iosched: Make async queues per cgroup
1d9ee09 block,fs,mm: IO cgroup tracking for buffered write
31c7321 cfq-iosched: add symmetric reference wrappers


===================================== Isolation experiment results

For isolation testing, we run a test that's available at:
git://google3-2.osuosl.org/tests/blkcgroup.git

It creates containers, runs workloads, and checks to see how well we meet
isolation targets. For the purposes of this patchset, I only ran
tests among buffered writers.

Before patches
==============
10:32:06 INFO experiment 0 achieved DTFs: 666, 333
10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
10:32:51 INFO experiment 1 achieved DTFs: 647, 352
10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
10:33:35 INFO experiment 2 achieved DTFs: 298, 701
10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed

After patches
=============
11:05:22 INFO experiment 0 achieved DTFs: 501, 498
11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
11:06:07 INFO experiment 1 achieved DTFs: 874, 125
11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
11:06:53 INFO experiment 2 achieved DTFs: 121, 878
11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed

Summary
=======
Isolation between buffered writers is clearly better with this patch.


=============================== Read latency results
To test read latency, I created two containers:
- One called "readers", with weight 900
- One called "writers", with weight 100

I ran this fio workload in "readers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-read]
description="reader"
numjobs=16
rw=randread
new_group=1


...and this fio workload in "writers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-write]
description="writer"
cgroup=writers
numjobs=3
rw=write
new_group=1



I've pasted the results from the "read" workload inline.

Before patches
==============
Starting 16 processes

Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
Description : ["reader"]
read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
bw (KB/s) : min= 0, max= 260, per=0.00%, avg= 0.00, stdev= 0.00
cpu : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=3383/0/0, short=0/0/0

lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
lat (msec): 250=0.06%, >=2000=0.41%

Run status group 0 (all jobs):
READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec

Disk stats (read/write):
sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%



After patches
=============
Starting 16 processes
Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
Description : ["reader"]
read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
bw (KB/s) : min= 0, max= 198, per=31.69%, avg=156.52, stdev=17.83
cpu : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=3735/0/0, short=0/0/0

lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%

Run status group 0 (all jobs):
READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec

Disk stats (read/write):
sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%



Summary
=======
Read latencies are a bit worse, but this overhead is only imposed when users
ask for this feature by turning on CONFIG_BLKIOTRACK. We expect there to be
something of a latency vs isolation tradeoff.


2011-03-22 23:09:21

by Justin TerAvest

Subject: [PATCH v2 5/8] cfq: Fix up tracked async workload length.

This effectively reverts commit f26bd1f0a3a31bc5e16d285f5e1b00a56abf6238
("blkio: Determine async workload length based on total number of queues")
in the case where async IO tracking is enabled.

That commit was used to work around the fact that async queues were part of
the root cgroup. That is no longer the case when async write tracking is
enabled.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 90 +++++++++++++++++++++++++++++++-------------------
1 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c75bbbf..1b315c3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -448,13 +448,6 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
}

-static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
- struct cfq_group *cfqg)
-{
- return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
- + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
-}
-
static void cfq_dispatch_insert(struct request_queue *, struct request *);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
struct io_context *, gfp_t);
@@ -1014,27 +1007,50 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq,
return slice_used;
}

-static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
- struct cfq_queue *cfqq)
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+static inline int cfqg_busy_async_queues(struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl, charge, unaccounted_sl = 0;
- int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
+ return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
+ + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+}
+#endif
+
+static void cfq_charge_slice(struct cfq_rb_root *st, struct cfq_group *cfqg,
+ struct cfq_queue *cfqq, unsigned int used_sl)
+{
+ struct cfq_data *cfqd = cfqq->cfqd;
+ unsigned int charge = used_sl;
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+ int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqg)
- cfqg->service_tree_idle.count;

BUG_ON(nr_sync < 0);
- used_sl = charge = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);

if (iops_mode(cfqd))
charge = cfqq->slice_dispatch;
else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
charge = cfqq->allocated_slice;
+#endif

/* Can't update vdisktime while group is on service tree */
cfq_group_service_tree_del(st, cfqg);
cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
/* If a new weight was requested, update now, off tree */
+ cfq_update_group_weight(cfqg);
cfq_group_service_tree_add(st, cfqg);
+ cfq_log_cfqq(cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+ " sect=%u", used_sl, cfqq->slice_dispatch, charge,
+ iops_mode(cfqd), cfqq->nr_sectors);
+}
+
+static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ struct cfq_queue *cfqq)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ unsigned int used_sl, unaccounted_sl = 0;
+
+ used_sl = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);
+ cfq_charge_slice(st, cfqg, cfqq, used_sl);

/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1047,9 +1063,6 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
- " sect=%u", used_sl, cfqq->slice_dispatch, charge,
- iops_mode(cfqd), cfqq->nr_sectors);
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl,
unaccounted_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
@@ -2215,6 +2228,30 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
return cur_best;
}

+static unsigned cfq_async_slice(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ unsigned slice)
+{
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+ unsigned tmp;
+ /*
+ * Async queues are currently system wide. Just taking
+ * proportion of queues with-in same group will lead to higher
+ * async ratio system wide as generally root group is going
+ * to have higher weight. A more accurate thing would be to
+ * calculate system wide asnc/sync ratio.
+ */
+ tmp = cfq_target_latency * cfqg_busy_async_queues(cfqg);
+ tmp = tmp/cfqd->busy_queues;
+ slice = min_t(unsigned, slice, tmp);
+#endif
+ /*
+ * async workload slice is scaled down according to
+ * the sync/async slice ratio.
+ */
+ slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
+ return slice;
+}
+
static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
unsigned slice;
@@ -2269,24 +2306,9 @@ new_workload:
max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

- if (cfqd->serving_type == ASYNC_WORKLOAD) {
- unsigned int tmp;
-
- /*
- * Async queues are currently system wide. Just taking
- * proportion of queues with-in same group will lead to higher
- * async ratio system wide as generally root group is going
- * to have higher weight. A more accurate thing would be to
- * calculate system wide asnc/sync ratio.
- */
- tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
- tmp = tmp/cfqd->busy_queues;
- slice = min_t(unsigned, slice, tmp);
-
- /* async workload slice is scaled down according to
- * the sync/async slice ratio. */
- slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
- } else
+ if (cfqd->serving_type == ASYNC_WORKLOAD)
+ slice = cfq_async_slice(cfqd, cfqg, slice);
+ else
/* sync workload slice is at least 2 * cfq_slice_idle */
slice = max(slice, 2 * cfqd->cfq_slice_idle);

--
1.7.3.1

2011-03-22 23:09:25

by Justin TerAvest

Subject: [PATCH v2 3/8] cfq-iosched: Make async queues per cgroup

There is currently only one set of async queues. This patch moves them
to a per-cgroup data structure. Changes are made to make sure per-cgroup
async queue references are dropped when the cgroup goes away.

TESTED:
Verified by creating multiple cgroups that async queues were getting
created properly. Also made sure that the references are getting
dropped and queues getting deallocated properly in the two situations:
- Cgroup goes away first, while IOs are still being done.
- IOs stop getting done and then cgroup goes away.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 57 +++++++++++++++++++++++++--------------------------
1 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 74510f5..011d268 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -202,6 +202,12 @@ struct cfq_group {
struct cfq_rb_root service_trees[2][3];
struct cfq_rb_root service_tree_idle;

+ /*
+ * async queue for each priority case
+ */
+ struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
+ struct cfq_queue *async_idle_cfqq;
+
unsigned long saved_workload_slice;
enum wl_type_t saved_workload;
enum wl_prio_t saved_serving_prio;
@@ -267,12 +273,6 @@ struct cfq_data {
struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;

- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;

/*
@@ -455,6 +455,7 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
+static void cfq_put_async_queues(struct cfq_group *cfqg);

static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
bool is_sync)
@@ -1117,10 +1118,6 @@ static void cfq_put_group_ref(struct cfq_group *cfqg)

static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
{
- /* Currently, all async queues are mapped to root group */
- if (!cfq_cfqq_sync(cfqq))
- cfqg = &cfqq->cfqd->root_group;
-
cfqq->cfqg = cfqg;
/* cfqq reference on cfqg */
cfq_get_group_ref(cfqq->cfqg);
@@ -1132,6 +1129,7 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
BUG_ON(hlist_unhashed(&cfqg->cfqd_node));

hlist_del_init(&cfqg->cfqd_node);
+ cfq_put_async_queues(cfqg);

/*
* Put the reference taken at the time of creation so that when all
@@ -2929,15 +2927,13 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_CFQ_GROUP_IOSCHED */

static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
- struct cfq_group *cfqg;

retry:
- cfqg = cfq_get_cfqg(cfqd, 1);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -2981,15 +2977,15 @@ retry:
}

static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
{
switch (ioprio_class) {
case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
+ return &cfqg->async_cfqq[0][ioprio];
case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
+ return &cfqg->async_cfqq[1][ioprio];
case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
+ return &cfqg->async_idle_cfqq;
default:
BUG();
}
@@ -3003,17 +2999,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);

if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
+ ioprio);
cfqq = *async_cfqq;
}

if (!cfqq)
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, cfqg, is_sync, ioc, gfp_mask);

/*
- * pin the queue now that it's allocated, scheduler exit will prune it
+ * pin the queue now that it's allocated, cgroup deletion will prune it
*/
if (!is_sync && !(*async_cfqq)) {
cfq_get_queue_ref(cfqq);
@@ -3826,19 +3824,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
cancel_work_sync(&cfqd->unplug_work);
}

-static void cfq_put_async_queues(struct cfq_data *cfqd)
+static void cfq_put_async_queues(struct cfq_group *cfqg)
{
int i;

for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue_ref(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue_ref(cfqd->async_cfqq[1][i]);
+ if (cfqg->async_cfqq[0][i])
+ cfq_put_queue_ref(cfqg->async_cfqq[0][i]);
+ if (cfqg->async_cfqq[1][i])
+ cfq_put_queue_ref(cfqg->async_cfqq[1][i]);
}

- if (cfqd->async_idle_cfqq)
- cfq_put_queue_ref(cfqd->async_idle_cfqq);
+ if (cfqg->async_idle_cfqq)
+ cfq_put_queue_ref(cfqg->async_idle_cfqq);
}

static void cfq_cfqd_free(struct rcu_head *head)
@@ -3866,8 +3864,9 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}

- cfq_put_async_queues(cfqd);
cfq_release_cfq_groups(cfqd);
+ /* Release the queues of root group. */
+ cfq_put_async_queues(&cfqd->root_group);
cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);

spin_unlock_irq(q->queue_lock);
--
1.7.3.1

2011-03-22 23:09:44

by Justin TerAvest

Subject: [PATCH v2 4/8] block: Modify CFQ to use IO tracking information.

IO tracking bits are used to send IOs to the right async queue. The current
task is still used to identify the cgroup for synchronous IO, and is also
used if IO tracking is disabled.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 6 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 175 +++++++++++++++++++++++++++++++++++++++++-----
block/elevator.c | 5 +-
include/linux/elevator.h | 6 +-
5 files changed, 172 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f626c65..9732cfd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -111,6 +111,9 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,

struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &blkio_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
struct blkio_cgroup, css);
}
@@ -1550,6 +1553,7 @@ unsigned long get_blkio_cgroup_id(struct bio *bio)
id = page_cgroup_get_blkio_id(pc);
return id;
}
+EXPORT_SYMBOL(get_blkio_cgroup_id);

/**
* get_cgroup_from_page() - determine the cgroup from a page.
@@ -1576,8 +1580,6 @@ struct cgroup *get_cgroup_from_page(struct page *page)

return css->cgroup;
}
-
-EXPORT_SYMBOL(get_blkio_cgroup_id);
EXPORT_SYMBOL(get_cgroup_from_page);

#endif /* CONFIG_CGROUP_BLKIOTRACK */
diff --git a/block/blk-core.c b/block/blk-core.c
index 5256932..1b7936bf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -583,7 +583,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}

static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);

@@ -595,7 +596,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;

if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -757,7 +758,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);

- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 011d268..c75bbbf 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include <linux/blkio-track.h>
#include "cfq.h"

/*
@@ -309,6 +310,10 @@ static void cfq_get_queue_ref(struct cfq_queue *cfqq);
static void cfq_put_queue_ref(struct cfq_queue *cfqq);

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create);
+static struct cfq_queue **
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
@@ -451,7 +456,7 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
@@ -463,9 +468,55 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[is_sync];
}

+/*
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and iprio_class from task
+ * submitting async bio. Later save the task information in the page_cgroup
+ * and retrieve task's ioprio and class from there.
+ */
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
+{
+ struct cfq_queue *cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct cfq_group *cfqg;
+ struct cfq_queue **async_cfqq;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+ if (!cfqg) {
+ /*
+ * May be this is first rq/bio and io group has not
+ * been setup yet.
+ */
+ return NULL;
+ }
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+ return *async_cfqq;
+ }
+#endif
+ return cfqq;
+}
+
static inline void cic_set_cfqq(struct cfq_io_context *cic,
struct cfq_queue *cfqq, bool is_sync)
{
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ /*
+ * Don't cache async queue pointer as now one io context might
+ * be submitting async io for various different async queues
+ */
+ if (!is_sync)
+ return;
+#endif
cic->cfqq[is_sync] = cfqq;
}

@@ -1032,7 +1083,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
unsigned int major, minor;

cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ if (!bdi || !bdi->dev || !dev_name(bdi->dev))
+ goto done;
+ if (cfqg && !cfqg->blkg.dev) {
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
cfqg->blkg.dev = MKDEV(major, minor);
goto done;
@@ -1083,20 +1136,61 @@ done:
* Search for the cfq group current task belongs to. If create = 1, then also
* create the cfq group if it does not exist. request_queue lock must be held.
*/
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
+ int create)
{
struct cgroup *cgroup;
struct cfq_group *cfqg = NULL;

rcu_read_lock();
- cgroup = task_cgroup(current, blkio_subsys_id);
+
+ if (!page)
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ else
+ cgroup = get_cgroup_from_page(page);
+
+ if (!cgroup) {
+ cfqg = &cfqd->root_group;
+ goto out;
+ }
+
cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
if (!cfqg && create)
cfqg = &cfqd->root_group;
+out:
rcu_read_unlock();
return cfqg;
}

+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create)
+{
+ struct page *page = NULL;
+
+ /*
+ * Determine the group from task context. Even calls from
+ * blk_get_request() which don't have any bio info will be mapped
+ * to the task's group
+ */
+ if (!bio)
+ goto sync;
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!bio_has_data(bio))
+ goto sync;
+
+ /* Map the sync bio to the right group using task context */
+ if (cfq_bio_sync(bio))
+ goto sync;
+
+ /* Determine the group from info stored in page */
+ page = bio_iovec_idx(bio, 0)->bv_page;
+#endif
+
+sync:
+ return cfq_get_cfqg(cfqd, page, create);
+}
+
static void cfq_get_group_ref(struct cfq_group *cfqg)
{
cfqg->ref++;
@@ -1180,6 +1274,11 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)

#else /* GROUP_IOSCHED */

+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create)
+{
+}
+
static void cfq_get_group_ref(struct cfq_group *cfqg)
{
}
@@ -1491,7 +1590,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);

@@ -1615,7 +1714,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return false;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
return cfqq == RQ_CFQQ(rq);
}

@@ -2848,14 +2947,10 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(cfqd->queue->queue_lock, flags);

cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
- struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
- GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue_ref(cfqq);
- }
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue_ref(cfqq);
}

cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -2895,6 +2990,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
{
struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
struct cfq_data *cfqd = cic_to_cfqd(cic);
unsigned long flags;
struct request_queue *q;
@@ -2916,6 +3012,12 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
cfq_put_queue_ref(sync_cfqq);
}

+ if (async_cfqq != NULL) {
+ cfq_log_cfqq(cfqd, async_cfqq, "changed cgroup");
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue_ref(async_cfqq);
+ }
+
spin_unlock_irqrestore(q->queue_lock, flags);
}

@@ -2938,6 +3040,24 @@ retry:
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct cfq_queue **async_cfqq;
+
+ /*
+ * We have not cached async queue pointer as bio tracking
+ * is enabled. Look into group async queue array using ioc
+ * class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+ cfqq = *async_cfqq;
+ }
+#endif
+
/*
* Always try a new alloc if we fell back to the OOM cfqq
* originally, since it should just be a temporary situation.
@@ -2992,14 +3112,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
}

static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3018,7 +3138,25 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
*async_cfqq = cfqq;
}

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to async queue and everytime a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ */
+ if (is_sync)
+ cfq_get_queue_ref(cfqq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
+
cfq_get_queue_ref(cfqq);
+#endif
return cfqq;
}

@@ -3686,7 +3824,8 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
* Allocate cfq data structures associated with this request.
*/
static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+ gfp_t gfp_mask)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_io_context *cic;
@@ -3707,7 +3846,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
} else {
/*
diff --git a/block/elevator.c b/block/elevator.c
index c387d31..9edaeb5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -770,12 +770,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}

-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;

if (e->ops->elevator_set_req_fn)
- return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+ return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);

rq->elevator_private[0] = NULL;
return 0;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d93efcc445..c3a884c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -24,7 +24,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
typedef void (elevator_put_req_fn) (struct request *);
typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -117,7 +118,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);

--
1.7.3.1

2011-03-22 23:09:33

by Justin TerAvest

Subject: [PATCH v2 8/8] cfq: Don't allow preemption across cgroups

Previously, a sync queue in one cgroup could preempt an async queue in
another cgroup. This changes that behavior to disallow such preemption.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 23caa79..0d0189b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3430,6 +3430,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (!cfqq)
return false;

+ if (new_cfqq->cfqg != cfqq->cfqg)
+ return false;
+
if (cfq_class_idle(new_cfqq))
return false;

@@ -3449,9 +3452,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
return true;

- if (new_cfqq->cfqg != cfqq->cfqg)
- return false;
-
if (cfq_slice_used(cfqq))
return true;

--
1.7.3.1

2011-03-22 23:10:00

by Justin TerAvest

Subject: [PATCH v2 1/8] cfq-iosched: add symmetric reference wrappers

The cfq_group and cfq_queue objects did not have symmetric reference
wrapper functions, and much of the reference taking was open coded. For
clarity, add the following wrappers:
cfq_group:
- cfq_get_group_ref()
- cfq_put_group_ref()

cfq_queue:
- cfq_get_queue_ref()
- cfq_put_queue_ref()

The '_ref' suffix was used to avoid colliding with existing function
names.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 95 +++++++++++++++++++++++++++++---------------------
1 files changed, 55 insertions(+), 40 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 12e380b..74510f5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -303,6 +303,11 @@ struct cfq_data {
struct rcu_head rcu;
};

+static void cfq_get_group_ref(struct cfq_group *cfqg);
+static void cfq_put_group_ref(struct cfq_group *cfqg);
+static void cfq_get_queue_ref(struct cfq_queue *cfqq);
+static void cfq_put_queue_ref(struct cfq_queue *cfqq);
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -1048,7 +1053,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
* elevator which will be dropped by either elevator exit
* or cgroup deletion path depending on who is exiting first.
*/
- cfqg->ref = 1;
+ cfq_get_group_ref(cfqg);

/*
* Add group onto cgroup list. It might happen that bdi->dev is
@@ -1091,24 +1096,12 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
return cfqg;
}

-static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
+static void cfq_get_group_ref(struct cfq_group *cfqg)
{
cfqg->ref++;
- return cfqg;
}

-static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
-{
- /* Currently, all async queues are mapped to root group */
- if (!cfq_cfqq_sync(cfqq))
- cfqg = &cfqq->cfqd->root_group;
-
- cfqq->cfqg = cfqg;
- /* cfqq reference on cfqg */
- cfqq->cfqg->ref++;
-}
-
-static void cfq_put_cfqg(struct cfq_group *cfqg)
+static void cfq_put_group_ref(struct cfq_group *cfqg)
{
struct cfq_rb_root *st;
int i, j;
@@ -1122,6 +1115,17 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
kfree(cfqg);
}

+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfqq->cfqg = cfqg;
+ /* cfqq reference on cfqg */
+ cfq_get_group_ref(cfqq->cfqg);
+}
+
static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
/* Something wrong if we are trying to remove same group twice */
@@ -1133,7 +1137,7 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
- cfq_put_cfqg(cfqg);
+ cfq_put_group_ref(cfqg);
}

static void cfq_release_cfq_groups(struct cfq_data *cfqd)
@@ -1177,14 +1181,18 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
}

#else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+
+static void cfq_get_group_ref(struct cfq_group *cfqg)
{
- return &cfqd->root_group;
}

-static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
+static void cfq_put_group_ref(struct cfq_group *cfqg)
{
- return cfqg;
+}
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
}

static inline void
@@ -1193,7 +1201,6 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
}

static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
-static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}

#endif /* GROUP_IOSCHED */

@@ -2553,6 +2560,11 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
return 1;
}

+static void cfq_get_queue_ref(struct cfq_queue *cfqq)
+{
+ cfqq->ref++;
+}
+
/*
* task holds one reference to the queue, dropped when task exits. each rq
* in-flight on this queue also holds a reference, dropped when rq is freed.
@@ -2560,7 +2572,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
* Each cfq queue took a reference on the parent group. Drop it now.
* queue lock must be held here.
*/
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_put_queue_ref(struct cfq_queue *cfqq)
{
struct cfq_data *cfqd = cfqq->cfqd;
struct cfq_group *cfqg;
@@ -2583,7 +2595,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)

BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
- cfq_put_cfqg(cfqg);
+ cfq_put_group_ref(cfqg);
}

/*
@@ -2688,7 +2700,7 @@ static void cfq_put_cooperator(struct cfq_queue *cfqq)
break;
}
next = __cfqq->new_cfqq;
- cfq_put_queue(__cfqq);
+ cfq_put_queue_ref(__cfqq);
__cfqq = next;
}
}
@@ -2702,7 +2714,7 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)

cfq_put_cooperator(cfqq);

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}

static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
@@ -2844,7 +2856,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
GFP_ATOMIC);
if (new_cfqq) {
cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}
}

@@ -2903,7 +2915,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
*/
cfq_log_cfqq(cfqd, sync_cfqq, "changed cgroup");
cic_set_cfqq(cic, NULL, 1);
- cfq_put_queue(sync_cfqq);
+ cfq_put_queue_ref(sync_cfqq);
}

spin_unlock_irqrestore(q->queue_lock, flags);
@@ -3004,11 +3016,11 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
* pin the queue now that it's allocated, scheduler exit will prune it
*/
if (!is_sync && !(*async_cfqq)) {
- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
*async_cfqq = cfqq;
}

- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
return cfqq;
}

@@ -3633,10 +3645,10 @@ static void cfq_put_request(struct request *rq)
rq->elevator_private[1] = NULL;

/* Put down rq reference on cfqg */
- cfq_put_cfqg(RQ_CFQG(rq));
+ cfq_put_group_ref(RQ_CFQG(rq));
rq->elevator_private[2] = NULL;

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}
}

@@ -3647,7 +3659,7 @@ cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
cfq_log_cfqq(cfqd, cfqq, "merging with queue %p", cfqq->new_cfqq);
cic_set_cfqq(cic, cfqq->new_cfqq, 1);
cfq_mark_cfqq_coop(cfqq->new_cfqq);
- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
return cic_to_cfqq(cic, 1);
}

@@ -3669,7 +3681,7 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)

cfq_put_cooperator(cfqq);

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
return NULL;
}
/*
@@ -3722,10 +3734,12 @@ new_queue:

cfqq->allocated[rw]++;

- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
rq->elevator_private[0] = cic;
rq->elevator_private[1] = cfqq;
- rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+
+ cfq_get_group_ref(cfqq->cfqg);
+ rq->elevator_private[2] = cfqq->cfqg;
spin_unlock_irqrestore(q->queue_lock, flags);
return 0;

@@ -3818,13 +3832,13 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)

for (i = 0; i < IOPRIO_BE_NR; i++) {
if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
+ cfq_put_queue_ref(cfqd->async_cfqq[0][i]);
if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
+ cfq_put_queue_ref(cfqd->async_cfqq[1][i]);
}

if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
+ cfq_put_queue_ref(cfqd->async_idle_cfqq);
}

static void cfq_cfqd_free(struct rcu_head *head)
@@ -3922,9 +3936,10 @@ static void *cfq_init_queue(struct request_queue *q)
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
* Take a reference to root group which we never drop. This is just
- * to make sure that cfq_put_cfqg() does not try to kfree root group
+ * to make sure that cfq_put_group_ref() does not try to kfree
+ * root group
*/
- cfqg->ref = 1;
+ cfq_get_group_ref(cfqg);
rcu_read_lock();
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
--
1.7.3.1

2011-03-22 23:10:25

by Justin TerAvest

Subject: [PATCH v2 2/8] block,fs,mm: IO cgroup tracking for buffered write

This patch adds IO tracking code in the fs/ and mm/ trees so that
the block layer can provide isolation for buffered writes.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 183 +++++++++++++++++++++++++++++++++++++++++++
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/blkio-track.h | 89 +++++++++++++++++++++
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 ++
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 38 +++++++++-
init/Kconfig | 16 ++++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 ++
mm/memory.c | 6 ++
mm/page-writeback.c | 14 +++-
mm/page_cgroup.c | 29 +++++---
mm/swap_state.c | 2 +
17 files changed, 388 insertions(+), 17 deletions(-)
create mode 100644 include/linux/blkio-track.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 77ee3c1..f626c65 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,6 +19,8 @@
#include <linux/slab.h>
#include "blk-cgroup.h"
#include <linux/genhd.h>
+#include <linux/blkio-track.h>
+#include <linux/mm_inline.h>

#define MAX_KEY_LEN 100

@@ -175,6 +177,12 @@ static inline void blkio_update_group_iops(struct blkio_group *blkg,
}
}

+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+
/*
* Add to the appropriate stat variable depending on the request type.
* This should be called with the blkg->stats_lock held.
@@ -1241,8 +1249,20 @@ blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
return 0;
}

+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
+
+ return (u64)css_id(&blkcg->css);
+}
+
struct cftype blkio_files[] = {
{
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+ {
.name = "weight_device",
.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
BLKIO_PROP_weight_device),
@@ -1399,6 +1419,169 @@ struct cftype blkio_files[] = {
#endif
};

+/* Block IO tracking related functions */
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *blkcg;
+ struct page_cgroup *pc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ page_cgroup_set_blkio_id(pc, 0); /* 0: default blkio_cgroup id */
+ if (!mm)
+ return;
+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to blkio_cgroup.
+ */
+ rcu_read_lock();
+ blkcg = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!blkcg))
+ goto out;
+ /*
+ * css_get(&bio->css) isn't called to increment the reference
+ * count of this blkio_cgroup "blkcg" so pc's blkio_cgroup_id
+ * might turn invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ page_cgroup_set_blkio_id(pc, css_id(&blkcg->css));
+out:
+ rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ /*
+ * A little trick:
+ * Just call blkio_cgroup_set_owner() for pages which are already
+ * active since the blkio_cgroup_id member of page_cgroup can be
+ * updated without any locks.
+ */
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ /*
+ * Do this without any locks. The reason is the same as
+ * blkio_cgroup_reset_owner().
+ */
+ page_cgroup_set_blkio_id(npc, page_cgroup_get_blkio_id(opc));
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc)
+ id = page_cgroup_get_blkio_id(pc);
+ return id;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page: the page to be tracked
+ *
+ * Returns the cgroup of a given page. A return value zero means that
+ * the page associated with the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct cgroup_subsys_state *css;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return NULL;
+
+ css = css_lookup(&blkio_subsys, page_cgroup_get_blkio_id(pc));
+ if (!css)
+ return NULL;
+
+ return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
return cgroup_add_files(cgroup, subsys, blkio_files,
diff --git a/fs/buffer.c b/fs/buffer.c
index 2e6b1a3..f344929 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/blkio-track.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -659,6 +660,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4260831..7c684fa 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/blkio-track.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -852,6 +853,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);

while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/blkio-track.h b/include/linux/blkio-track.h
new file mode 100644
index 0000000..4a49c57
--- /dev/null
+++ b/include/linux/blkio-track.h
@@ -0,0 +1,89 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+struct block_device;
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ page_cgroup_set_blkio_id(pc, 0);
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIOTRACK */
+
+struct blkiotrack_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page,
+ struct page *opage)
+{
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b2eee89..3e70b21 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -76,6 +76,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(struct task_struct *task);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(struct task_struct *task)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f512e18..a8a7cf0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -49,6 +49,8 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -153,6 +155,10 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..a04c37a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -615,7 +615,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -975,7 +975,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 6d6cb7a..6a86c0d 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -11,12 +11,43 @@
* then the page cgroup for pfn always exists.
*/
struct page_cgroup {
+#ifdef CONFIG_CGROUP_PAGE
unsigned long flags;
+#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
struct page *page;
struct list_head lru; /* per cgroup LRU list */
+#endif
};

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+#define PAGE_TRACKING_ID_SHIFT (16)
+#define PAGE_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
+
+static inline void page_cgroup_set_blkio_id(struct page_cgroup *pc,
+ unsigned long id)
+{
+ unsigned long old, new, prev;
+
+ /* pc->flags isn't lock protected, so we must cmpxchg. */
+ WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
+ do {
+ new = old = pc->flags;
+ new &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
+ new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
+ prev = cmpxchg(&pc->flags, old, new);
+ } while (prev != old);
+}
+
+static inline unsigned long page_cgroup_get_blkio_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PAGE_TRACKING_ID_SHIFT;
+}
+#endif
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);

#ifdef CONFIG_SPARSEMEM
@@ -33,6 +64,8 @@ static inline void __init page_cgroup_init(void)

struct page_cgroup *lookup_page_cgroup(struct page *page);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+
enum {
/* flags for mem_cgroup */
PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */
@@ -131,8 +164,9 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
local_irq_restore(*flags);
}
+#endif

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index be788c0..ce1cbba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -742,6 +742,22 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config CGROUP_BLKIOTRACK
+ bool "Track async request block io"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ default n
+ ---help---
+ Provides a Resource Controller which enables tracking the owner
+ of every Block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIOTRACK
+
endif # CGROUPS

menuconfig NAMESPACES
diff --git a/mm/Makefile b/mm/Makefile
index 2b1b575..7da3bc8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,7 +38,8 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..64980fb 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
+#include <linux/blkio-track.h>
#include <asm/tlbflush.h>

#include <trace/events/block.h>
@@ -211,6 +212,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index f9a29c8..263a3f6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"

@@ -377,6 +378,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da53a25..e11c2cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,6 +359,12 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
diff --git a/mm/memory.c b/mm/memory.c
index 5823698..385bb54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -52,6 +52,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2391,6 +2392,7 @@ gotten:
*/
ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -2828,6 +2830,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
do_page_add_anon_rmap(page, vma, address, exclusive);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);

@@ -2959,6 +2962,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
+ /* Not setting the owner for special pages */
+ blkio_cgroup_set_owner(page, mm);
setpte:
set_pte_at(mm, address, page_table, entry);

@@ -3108,6 +3113,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24b7ac2..d49b398 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blkio-track.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1159,7 +1160,8 @@ EXPORT_SYMBOL(account_page_writeback);
* We take care to handle the case where the page was truncated from the
* mapping by re-checking page_mapping() inside tree_lock.
*/
-int __set_page_dirty_nobuffers(struct page *page)
+int __set_page_dirty_nobuffers_track_owner(struct page *page,
+ int update_owner)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
@@ -1174,6 +1176,9 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ if (update_owner)
+ blkio_cgroup_reset_owner_pagedirty(page,
+ current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
@@ -1186,6 +1191,11 @@ int __set_page_dirty_nobuffers(struct page *page)
}
return 0;
}
+
+int __set_page_dirty_nobuffers(struct page *page)
+{
+ return __set_page_dirty_nobuffers_track_owner(page, 1);
+}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);

/*
@@ -1196,7 +1206,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
{
wbc->pages_skipped++;
- return __set_page_dirty_nobuffers(page);
+ return __set_page_dirty_nobuffers_track_owner(page, 0);
}
EXPORT_SYMBOL(redirty_page_for_writepage);

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..78f5425 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -10,14 +10,17 @@
#include <linux/cgroup.h>
#include <linux/swapops.h>
#include <linux/kmemleak.h>
+#include <linux/blkio-track.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+#endif
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;

@@ -75,7 +78,7 @@ void __init page_cgroup_init_flatmem(void)

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for_each_online_node(nid) {
@@ -84,12 +87,13 @@ void __init page_cgroup_init_flatmem(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
- " don't want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup failed.\n");
- printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "please try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
}

@@ -134,6 +138,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
*/
kmemleak_not_leak(base);
} else {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* We don't have to allocate page_cgroup again, but
* address of memmap may be changed. So, we have to initialize
@@ -144,6 +149,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
/* check address of memmap is changed or not. */
if (base->page == pfn_to_page(pfn))
return 0;
+#else
+ return 0;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
}

if (!base) {
@@ -258,7 +266,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -267,14 +275,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
- " want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4668046..5c7c910 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/blkio-track.h>

#include <asm/pgtable.h>

@@ -327,6 +328,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/* May fail (-ENOMEM) if radix-tree node allocation failed. */
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
radix_tree_preload_end();
--
1.7.3.1

2011-03-22 23:10:29

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v2 6/8] cfq: add per cgroup writeout done by flusher stat

Tracking for buffered writes can detect when traffic comes from a
flusher thread, as opposed to directly from an application. This adds a
statistic to track I/O traffic from flusher threads.

This helps determine whether a flusher thread is being unfair to a
particular cgroup, and whether cgroup-based isolation of writeback behavior
is useful.
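
As a rough usage sketch (the cgroup mount point and group name below are
assumptions, not part of this patch), the new counter appears alongside the
other per-cgroup blkio statistics when CONFIG_DEBUG_BLK_CGROUP is enabled:

  # mount -t cgroup -o blkio none /cgroup/blkio      # assumed mount point
  # mkdir /cgroup/blkio/writers
  # cat /cgroup/blkio/writers/blkio.oo_ctx_io_count
  8:16 1324        # illustrative output; roughly one "<dev> <count>" per device

A count that grows relative to the cgroup's total completions suggests most of
its writeback is being submitted by the flusher rather than directly by the
tasks in the cgroup.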

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 18 ++++++++++++++++-
block/blk-cgroup.h | 9 ++++++-
block/cfq-iosched.c | 47 ++++++++++++++++++++++++++------------------
block/cfq.h | 6 +++-
include/linux/blk_types.h | 2 +
5 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9732cfd..7b63030 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -412,7 +412,8 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);

void blkiocg_update_completion_stats(struct blkio_group *blkg,
- uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+ uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+ bool out_of_ctx)
{
struct blkio_group_stats *stats;
unsigned long flags;
@@ -426,6 +427,8 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
if (time_after64(io_start_time, start_time))
blkio_add_stat(stats->stat_arr[BLKIO_STAT_WAIT_TIME],
io_start_time - start_time, direction, sync);
+ if (out_of_ctx)
+ blkg->stats.oo_ctx_io_count++;
spin_unlock_irqrestore(&blkg->stats_lock, flags);
}
EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
@@ -620,6 +623,9 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
blkg->stats.unaccounted_time, cb, dev);
#ifdef CONFIG_DEBUG_BLK_CGROUP
+ if (type == BLKIO_STAT_OO_CTX_IO_COUNT)
+ return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+ blkg->stats.oo_ctx_io_count, cb, dev);
if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
uint64_t sum = blkg->stats.avg_queue_size_sum;
uint64_t samples = blkg->stats.avg_queue_size_samples;
@@ -1159,6 +1165,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
case BLKIO_PROP_empty_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
BLKIO_STAT_EMPTY_TIME, 0);
+ case BLKIO_PROP_oo_ctx_io_count:
+ return blkio_read_blkg_stats(blkcg, cft, cb,
+ BLKIO_STAT_OO_CTX_IO_COUNT, 0);
+
#endif
default:
BUG();
@@ -1419,6 +1429,12 @@ struct cftype blkio_files[] = {
BLKIO_PROP_dequeue),
.read_map = blkiocg_file_read_map,
},
+ {
+ .name = "oo_ctx_io_count",
+ .private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+ BLKIO_PROP_oo_ctx_io_count),
+ .read_map = blkiocg_file_read_map,
+ },
#endif
};

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 10919fa..9556f2b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -52,6 +52,7 @@ enum stat_type {
/* Time not charged to this cgroup */
BLKIO_STAT_UNACCOUNTED_TIME,
#ifdef CONFIG_DEBUG_BLK_CGROUP
+ BLKIO_STAT_OO_CTX_IO_COUNT,
BLKIO_STAT_AVG_QUEUE_SIZE,
BLKIO_STAT_IDLE_TIME,
BLKIO_STAT_EMPTY_TIME,
@@ -93,6 +94,7 @@ enum blkcg_file_name_prop {
BLKIO_PROP_idle_time,
BLKIO_PROP_empty_time,
BLKIO_PROP_dequeue,
+ BLKIO_PROP_oo_ctx_io_count,
};

/* cgroup files owned by throttle policy */
@@ -119,6 +121,8 @@ struct blkio_group_stats {
uint64_t sectors;
/* Time not charged to this cgroup */
uint64_t unaccounted_time;
+ /* Number of IOs submitted out of process context */
+ uint64_t oo_ctx_io_count;
uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Sum of number of IOs queued across all samples */
@@ -303,7 +307,8 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg,
void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
bool direction, bool sync);
void blkiocg_update_completion_stats(struct blkio_group *blkg,
- uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
+ uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+ bool out_of_ctx);
void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
bool sync);
void blkiocg_update_io_add_stats(struct blkio_group *blkg,
@@ -332,7 +337,7 @@ static inline void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
uint64_t bytes, bool direction, bool sync) {}
static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
uint64_t start_time, uint64_t io_start_time, bool direction,
- bool sync) {}
+ bool sync, bool out_of_ctx) {}
static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
bool direction, bool sync) {}
static inline void blkiocg_update_io_add_stats(struct blkio_group *blkg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1b315c3..c885493 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -311,7 +311,7 @@ static void cfq_put_queue_ref(struct cfq_queue *cfqq);

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create);
+ struct bio *bio, int *is_oo_ctx, int create);
static struct cfq_queue **
cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);

@@ -449,8 +449,8 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*,
+ int *is_oo_ctx, bool, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
static void cfq_put_async_queues(struct cfq_group *cfqg);
@@ -484,7 +484,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
* async bio tracking is enabled and we are not caching
* async queue pointer in cic.
*/
- cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+ cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 0);
if (!cfqg) {
/*
* May be this is first rq/bio and io group has not
@@ -1150,17 +1150,21 @@ done:
* create the cfq group if it does not exist. request_queue lock must be held.
*/
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
- int create)
+ int *is_oo_ctx, int create)
{
- struct cgroup *cgroup;
+ struct cgroup *cgroup, *tracked_cgroup;
struct cfq_group *cfqg = NULL;

rcu_read_lock();

- if (!page)
- cgroup = task_cgroup(current, blkio_subsys_id);
- else
- cgroup = get_cgroup_from_page(page);
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ if (page) {
+ tracked_cgroup = get_cgroup_from_page(page);
+ if (is_oo_ctx)
+ *is_oo_ctx = cgroup && tracked_cgroup &&
+ tracked_cgroup != cgroup;
+ cgroup = tracked_cgroup;
+ }

if (!cgroup) {
cfqg = &cfqd->root_group;
@@ -1175,8 +1179,8 @@ out:
return cfqg;
}

-struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create)
+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd, struct bio *bio,
+ int *is_oo_ctx, int create)
{
struct page *page = NULL;

@@ -1201,7 +1205,7 @@ struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
#endif

sync:
- return cfq_get_cfqg(cfqd, page, create);
+ return cfq_get_cfqg(cfqd, page, is_oo_ctx, create);
}

static void cfq_get_group_ref(struct cfq_group *cfqg)
@@ -1288,7 +1292,7 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
#else /* GROUP_IOSCHED */

static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create)
+ struct bio *bio, int *is_oo_ctx, int create)
{
}

@@ -3134,14 +3138,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
}

static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
+ bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3667,7 +3671,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
(RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
- rq_data_dir(rq), rq_is_sync(rq));
+ rq_data_dir(rq), rq_is_sync(rq),
+ rq->cmd_flags & REQ_OUT_OF_CTX);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;

@@ -3855,6 +3860,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
const bool is_sync = rq_is_sync(rq);
struct cfq_queue *cfqq;
unsigned long flags;
+ int is_oo_ctx = 0;

might_sleep_if(gfp_mask & __GFP_WAIT);

@@ -3868,8 +3874,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
- cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, bio, &is_oo_ctx, is_sync, cic->ioc,
+ gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
+ if (is_oo_ctx)
+ rq->cmd_flags |= REQ_OUT_OF_CTX;
} else {
/*
* If the queue was seeky for too long, break it apart.
diff --git a/block/cfq.h b/block/cfq.h
index 2a15592..6afc10a 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -61,10 +61,12 @@ static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
blkiocg_update_dispatch_stats(blkg, bytes, direction, sync);
}

-static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
+ uint64_t start_time, uint64_t io_start_time,
+ bool direction, bool sync, bool out_of_ctx)
{
blkiocg_update_completion_stats(blkg, start_time, io_start_time,
- direction, sync);
+ direction, sync, out_of_ctx);
}

static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index be50d9e..d859395 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -152,6 +152,7 @@ enum rq_flag_bits {
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_SECURE, /* secure discard (used with __REQ_DISCARD) */
__REQ_ON_PLUG, /* on plug list */
+ __REQ_OUT_OF_CTX, /* request submitted out of process context */
__REQ_NR_BITS, /* stops here */
};

@@ -193,5 +194,6 @@ enum rq_flag_bits {
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
#define REQ_ON_PLUG (1 << __REQ_ON_PLUG)
+#define REQ_OUT_OF_CTX (1 << __REQ_OUT_OF_CTX)

#endif /* __LINUX_BLK_TYPES_H */
--
1.7.3.1

2011-03-22 23:10:32

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v2 7/8] block: Per cgroup request descriptor counts

If a cgroup gets starved for service, it keeps allocating request
descriptors until it hits the per-queue limit, and other cgroups can
then be starved of request descriptors, which is a problem.
This patch implements per-cgroup request descriptor limits.

There has been discussion of removing request descriptor limits
entirely, but per-cgroup limits are introduced here to enable tracking
for buffered writes without causing problems with the per-request-queue
limit.
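
As a usage sketch, with CONFIG_CFQ_GROUP_IOSCHED enabled the new limit sits
next to the existing per-queue knob in sysfs; the device name and the default
values shown here are assumptions based on this patch:

  # cat /sys/block/sdb/queue/nr_requests         # per-queue descriptor pool
  128
  # cat /sys/block/sdb/queue/nr_group_requests   # per-cgroup descriptor limit
  128
  # echo 64 > /sys/block/sdb/queue/nr_group_requests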

Some interesting corner cases are also covered. During an elevator
switch, we drain all the requests that were started before the switch,
and any new requests that arrive while the switch is in progress are
counted towards the common, shared request list of the root group. As
these requests complete, the counter in that common request list is
decremented properly.

Signed-off-by: Justin TerAvest <[email protected]>
---
Documentation/block/biodoc.txt | 10 ++
block/blk-core.c | 211 +++++++++++++++++++++++++++-------------
block/blk-settings.c | 2 +-
block/blk-sysfs.c | 59 ++++++-----
block/cfq-iosched.c | 61 +++++++++++-
block/elevator.c | 2 +-
include/linux/blkdev.h | 81 +++++++++++++++-
include/linux/elevator.h | 8 ++
8 files changed, 337 insertions(+), 97 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 2a7b38c..856d706 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -970,6 +970,16 @@ elevator_latter_req_fn These return the request before or after the

elevator_completed_req_fn called when a request is completed.

+elevator_req_list_fn called to obtain the active request list for
+ either a bio or a request; exactly one of
+ the two is passed on any given invocation.
+ This function must be defined only if the
+ scheduler supports per cgroup request lists.
+ When defined, it becomes the responsibility
+ of the scheduler to call blk_alloced_request
+ after the request and queue are allocated and
+ the cgroup has been determined.
+
elevator_may_queue_fn returns true if the scheduler wants to allow the
current context to queue a new request even if
it is over the queue limit. This must be used
diff --git a/block/blk-core.c b/block/blk-core.c
index 1b7936bf..465df36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -381,23 +381,28 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);

-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (unlikely(rl->rq_pool))
- return 0;
-
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}

- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+ if (unlikely(q->rq_pool.rq_pool))
+ return 0;
+
+ blk_init_request_list(&q->rq);

- if (!rl->rq_pool)
+ q->rq_pool.count[BLK_RW_SYNC] = 0;
+ q->rq_pool.count[BLK_RW_ASYNC] = 0;
+ q->rq_pool.elvpriv = 0;
+ q->rq_pool.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);
+ if (!q->rq_pool.rq_pool)
return -ENOMEM;

return 0;
@@ -579,14 +584,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)

if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_pool.rq_pool);
}

static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_pool.rq_pool, gfp_mask);

if (!rq)
return NULL;
@@ -596,11 +601,11 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
rq->cmd_flags = flags | REQ_ALLOCED;

if (priv) {
+ rq->cmd_flags |= REQ_ELVPRIV;
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_pool.rq_pool);
return NULL;
}
- rq->cmd_flags |= REQ_ELVPRIV;
}

return rq;
@@ -640,38 +645,67 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}

-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_pool.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);

- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_pool.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}

/*
* A request has just been released. Account for it, update the full and
- * congestion status, wake up any waiters. Called under q->queue_lock.
+ * congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
+ if (priv) {
+ q->rq_pool.elvpriv--;
+ BUG_ON(!rl->count[sync]);
+ rl->count[sync]--;
+ }

- rl->count[sync]--;
- if (priv)
- rl->elvpriv--;
+ BUG_ON(!q->rq_pool.count[sync]);
+ q->rq_pool.count[sync]--;

- __freed_request(q, sync);
+ __freed_request(q, sync, rl);

if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+}
+
+void blk_alloced_request(struct request_queue *q, struct request_list *rl,
+ int sync)
+{
+ int priv;
+
+ q->rq_pool.count[sync]++;
+ rl->starved[sync] = 0;
+
+ priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+ if (priv) {
+ q->rq_pool.elvpriv++;
+ /*
+ * Account the request to the request list only if the request
+ * is going to the elevator. During an elevator switch there will
+ * be a small window where the group is going away and the new
+ * group will not be allocated till the elevator switch is
+ * complete. So till then, instead of slowing down the
+ * application, we continue to allocate requests from the total
+ * common pool instead of against the per-group limit.
+ */
+ rl->count[sync]++;
+ }
}
+EXPORT_SYMBOL(blk_alloced_request);

/*
* Determine if elevator data should be initialized when allocating the
@@ -698,10 +732,10 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask,
+ struct request_list *rl)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv = 0;
@@ -710,31 +744,41 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;

- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
+ if (q->rq_pool.count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
+ blk_add_trace_msg(q, "Queue congested: setting congested flag");
+ blk_set_queue_congested(q, is_sync);
+ }
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for the time being.
+ */
+ if (q->rq_pool.count[is_sync]+1 >= q->nr_requests) {
+ blk_add_trace_msg(q, "Queue congested: setting full flag");
+ blk_set_queue_full(q, is_sync);
+ }
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The queue request descriptor group will fill after this
+ * allocation, so mark this process as "batching".
+ * This process will be allowed to complete a batch of
+ * requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
/*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
*/
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
- }
+ blk_add_trace_msg(q, "Queue congested: "
+ "not allocating request");
+ goto out;
}
- blk_set_queue_congested(q, is_sync);
}

/*
@@ -742,18 +786,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2)) {
+ blk_add_trace_msg(q, "50 percent over limit, not allocating");
goto out;
+ }

- rl->count[is_sync]++;
- rl->starved[is_sync] = 0;

+ priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (blk_rq_should_init_elevator(bio)) {
- priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
- rl->elvpriv++;
+ q->rq_pool.elvpriv++;
}

+ /*
+ * If the scheduler supports per cgroup request lists, it will call
+ * blk_alloced_request after the request and queue are allocated and
+ * the cgroup has been decided.
+ */
+ if (!blk_supports_cgroups(q) || !priv)
+ blk_alloced_request(q, rl, is_sync);
+
if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
@@ -768,7 +820,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ if (!blk_supports_cgroups(q) || !priv)
+ freed_request(q, is_sync, priv, rl);

/*
* in the very unlikely event that allocation failed and no
@@ -778,9 +831,11 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
+ if (unlikely(rl->count[is_sync] == 0)) {
+ blk_add_trace_msg(q, "Queue congested: "
+ "marking %d starved", is_sync);
rl->starved[is_sync] = 1;
-
+ }
goto out;
}

@@ -809,16 +864,23 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
while (!rq) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;

+ /*
+ * We are about to sleep on a request list and we
+ * drop queue lock. After waking up, we will do
+ * finish_wait() on request list and in the mean
+ * time group might be gone. Take a reference to
+ * the group now.
+ */
prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
TASK_UNINTERRUPTIBLE);
-
+ blk_get_rl_group(q, rl);
trace_block_sleeprq(q, bio, rw_flags & 1);

spin_unlock_irq(q->queue_lock);
@@ -836,7 +898,18 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
spin_lock_irq(q->queue_lock);
finish_wait(&rl->wait[is_sync], &wait);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ /*
+ * We had taken a reference to the request list group.
+ * Put that now.
+ */
+ blk_put_rl_group(q, rl);
+
+ /*
+ * After the sleep check the rl again in case the group the bio
+ * belonged to is gone and it is mapped to root group now
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
};

return rq;
@@ -845,6 +918,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl;

BUG_ON(rw != READ && rw != WRITE);

@@ -852,7 +926,8 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rl = blk_get_request_list(q, NULL);
+ rq = get_request(q, rw, NULL, gfp_mask, rl);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1059,12 +1134,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);

BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));
+ BUG_ON(!rl);

+ freed_request(q, is_sync, priv, rl);
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1fa7692..a3039dd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -158,7 +158,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
-
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
q->make_request_fn = mfn;
blk_queue_dma_alignment(q, 511);
blk_queue_congestion_threshold(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 261c75c..88116df 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -39,7 +39,6 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
unsigned long nr;
int ret;

@@ -53,33 +52,31 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}

- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_SYNC);
-
- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_ASYNC);
-
- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
- blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, BLK_RW_SYNC);
- wake_up(&rl->wait[BLK_RW_SYNC]);
- }
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, page);
+}

- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
- blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, BLK_RW_ASYNC);
- wake_up(&rl->wait[BLK_RW_ASYNC]);
- }
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
spin_unlock_irq(q->queue_lock);
return ret;
}
+#endif

static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -271,6 +268,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -381,6 +386,9 @@ static struct queue_sysfs_entry queue_random_entry = {

static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -467,12 +475,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;

blk_sync_queue(q);

- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_pool.rq_pool)
+ mempool_destroy(q->rq_pool.rq_pool);

if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c885493..23caa79 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -214,6 +214,9 @@ struct cfq_group {
enum wl_prio_t saved_serving_prio;
struct blkio_group blkg;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ /* Request list associated with the group */
+ struct request_list rl;
+
struct hlist_node cfqd_node;
int ref;
#endif
@@ -1137,6 +1140,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
0);

cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ blk_init_request_list(&cfqg->rl);

/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -1224,6 +1228,7 @@ static void cfq_put_group_ref(struct cfq_group *cfqg)
return;
for_each_cfqg_st(cfqg, i, j, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ BUG_ON(cfqg->rl.count[0] | cfqg->rl.count[1]);
kfree(cfqg);
}

@@ -3145,7 +3150,13 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);
+ /* If the cfqg was not allocated during request list allocation,
+ do not create it now. Otherwise, the request list counters could
+ get off-by-one errors.
+ */
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 0);
+ if (!cfqg)
+ cfqg = &cfqd->root_group;

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3581,6 +3592,46 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
cfq_rq_enqueued(cfqd, cfqq, rq);
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+/*
+ * This function is essentially overloaded to return a request list
+ * given either a bio or a request. Only one of these is expected to
+ * be provided on any given invocation. The other must be NULL.
+ */
+static struct request_list *cfq_request_list(struct request_queue *q,
+ struct bio *bio,
+ struct request *rq)
+{
+ if (rq) {
+ struct cfq_queue *cfqq = RQ_CFQQ(rq);
+ return &cfqq->cfqg->rl;
+ } else {
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 1);
+
+ if (!cfqg)
+ cfqg = &cfqd->root_group;
+ return &cfqg->rl;
+ }
+}
+
+#define RL_CFQG(rl) container_of((rl), struct cfq_group, rl)
+
+static void cfq_get_rl_group(struct request_list *rl)
+{
+ struct cfq_group *cfqg = RL_CFQG(rl);
+
+ cfqg->ref++;
+}
+
+static void cfq_put_rl_group(struct request_list *rl)
+{
+ struct cfq_group *cfqg = RL_CFQG(rl);
+
+ cfq_put_group_ref(cfqg);
+}
+#endif
+
/*
* Update hw_tag based on peak queue depth over 50 samples under
* sufficient load.
@@ -3908,6 +3959,8 @@ new_queue:

cfq_get_group_ref(cfqq->cfqg);
rq->elevator_private[2] = cfqq->cfqg;
+
+ blk_alloced_request(q, rq_rl(q, rq), is_sync != 0);
spin_unlock_irqrestore(q->queue_lock, flags);
return 0;

@@ -4110,6 +4163,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_get_group_ref(cfqg);
rcu_read_lock();
+ blk_init_request_list(&cfqg->rl);
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
rcu_read_unlock();
@@ -4294,6 +4348,11 @@ static struct elevator_type iosched_cfq = {
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ .elevator_req_list_fn = cfq_request_list,
+ .elevator_get_rl_group_fn = cfq_get_rl_group,
+ .elevator_put_rl_group_fn = cfq_put_rl_group,
+#endif
.elevator_set_req_fn = cfq_set_request,
.elevator_put_req_fn = cfq_put_request,
.elevator_may_queue_fn = cfq_may_queue,
diff --git a/block/elevator.c b/block/elevator.c
index 9edaeb5..f427fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -641,7 +641,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_pool.elvpriv) {
__blk_run_queue(q, false);
spin_unlock_irq(q->queue_lock);
msleep(10);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 16a902f..a8802ce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,7 +32,18 @@ struct request;
struct sg_io_hdr;

#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 128 /* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ 128 /* Default per group maximum */
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group (the root group) being
+ * present. Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ
+#endif

struct request;
typedef void (rq_end_io_fn)(struct request *, int);
@@ -44,9 +55,21 @@ struct request_list {
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_pool {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count.
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
};

/*
@@ -274,6 +297,11 @@ struct request_queue
*/
struct request_list rq;

+ /*
+ * Contains request pool and other data like overall request count
+ */
+ struct request_pool rq_pool;
+
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
@@ -330,6 +358,7 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -606,6 +635,53 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
}
#endif /* CONFIG_MMU */

+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+ int priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+
+ if (priv && e->ops->elevator_req_list_fn)
+ return e->ops->elevator_req_list_fn(q, bio, NULL);
+ return &q->rq;
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+ struct request *rq)
+{
+ struct elevator_queue *e = q->elevator;
+ int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+ if (priv && e->ops->elevator_req_list_fn)
+ return e->ops->elevator_req_list_fn(q, NULL, rq);
+ return &q->rq;
+}
+
+static inline void blk_get_rl_group(struct request_queue *q,
+ struct request_list *rl)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_get_rl_group_fn)
+ e->ops->elevator_get_rl_group_fn(rl);
+}
+
+static inline void blk_put_rl_group(struct request_queue *q,
+ struct request_list *rl)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_put_rl_group_fn)
+ e->ops->elevator_put_rl_group_fn(rl);
+}
+
+static inline int blk_supports_cgroups(struct request_queue *q)
+{
+ struct elevator_queue *e = q->elevator;
+
+ return (e->ops->elevator_req_list_fn) ? 1 : 0;
+}
+
struct rq_map_data {
struct page **pages;
int page_order;
@@ -802,6 +878,7 @@ extern struct request_queue *blk_init_allocated_queue_node(struct request_queue
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
request_fn_proc *, spinlock_t *);
+extern void blk_init_request_list(struct request_list *rl);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
@@ -853,6 +930,8 @@ int blk_get_queue(struct request_queue *);
struct request_queue *blk_alloc_queue(gfp_t);
struct request_queue *blk_alloc_queue_node(gfp_t, int);
extern void blk_put_queue(struct request_queue *);
+extern void blk_alloced_request(struct request_queue *q,
+ struct request_list *rl, int sync);

struct blk_plug {
unsigned long magic;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c3a884c..767bade 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -21,6 +21,10 @@ typedef int (elevator_dispatch_fn) (struct request_queue *, int);

typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
+typedef struct request_list *(elevator_req_list_fn)
+ (struct request_queue *, struct bio *, struct request *);
+typedef void (elevator_get_rl_group_fn)(struct request_list *);
+typedef void (elevator_put_rl_group_fn)(struct request_list *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

@@ -50,10 +54,14 @@ struct elevator_ops

elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
+ elevator_req_list_fn *elevator_req_list_fn;

elevator_set_req_fn *elevator_set_req_fn;
elevator_put_req_fn *elevator_put_req_fn;

+ elevator_get_rl_group_fn *elevator_get_rl_group_fn;
+ elevator_put_rl_group_fn *elevator_put_rl_group_fn;
+
elevator_may_queue_fn *elevator_may_queue_fn;

elevator_init_fn *elevator_init_fn;
--
1.7.3.1

2011-03-23 01:28:19

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Tue, Mar 22, 2011 at 04:08:47PM -0700, Justin TerAvest wrote:

[..]
> ===================================== Isolation experiment results
>
> For isolation testing, we run a test that's available at:
> git://google3-2.osuosl.org/tests/blkcgroup.git
>
> It creates containers, runs workloads, and checks to see how well we meet
> isolation targets. For the purposes of this patchset, I only ran
> tests among buffered writers.
>
> Before patches
> ==============
> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
>
> After patches
> =============
> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed
>
> Summary
> =======
> Isolation between buffered writers is clearly better with this patch.

Can you please explain what this test is doing. All I am seeing is passed
and failed, and I really don't understand what the test is doing.

Can you run, say, 4 simple dd buffered writers in 4 cgroups with weights
100, 200, 300 and 400 and see if you get better isolation?

Secondly, can you also please explain how it works. Without making
writeback cgroup aware, there are no guarantees that a higher weight
cgroup will get more IO done.

>
>
> =============================== Read latency results
> To test read latency, I created two containers:
> - One called "readers", with weight 900
> - One called "writers", with weight 100
>
> I ran this fio workload in "readers":
> [global]
> directory=/mnt/iostestmnt/fio
> runtime=30
> time_based=1
> group_reporting=1
> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
> cgroup_nodelete=1
> bs=4K
> size=512M
>
> [iostest-read]
> description="reader"
> numjobs=16
> rw=randread
> new_group=1
>
>
> ....and this fio workload in "writers"
> [global]
> directory=/mnt/iostestmnt/fio
> runtime=30
> time_based=1
> group_reporting=1
> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
> cgroup_nodelete=1
> bs=4K
> size=512M
>
> [iostest-write]
> description="writer"
> cgroup=writers
> numjobs=3
> rw=write
> new_group=1
>
>
>
> I've pasted the results from the "read" workload inline.
>
> Before patches
> ==============
> Starting 16 processes
>
> Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
> iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
> Description : ["reader"]
> read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
> clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
> lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
> bw (KB/s) : min= 0, max= 260, per=0.00%, avg= 0.00, stdev= 0.00
> cpu : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued r/w/d: total=3383/0/0, short=0/0/0
>
> lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
> lat (msec): 250=0.06%, >=2000=0.41%
>
> Run status group 0 (all jobs):
> READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec
>
> Disk stats (read/write):
> sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%
>
>
>
> After patches
> =============
> Starting 16 processes
> Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
> iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
> Description : ["reader"]
> read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
> clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
> lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
> bw (KB/s) : min= 0, max= 198, per=31.69%, avg=156.52, stdev=17.83
> cpu : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued r/w/d: total=3735/0/0, short=0/0/0
>
> lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
> lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%
>
> Run status group 0 (all jobs):
> READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec
>
> Disk stats (read/write):
> sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%
>
>
>
> Summary
> =======
> Read latencies are a bit worse, but this overhead is only imposed when users
> ask for this feature by turning on CONFIG_BLKIOTRACK. We expect there to be
> something of a latency vs isolation tradeoff.

- What number are you looking at to say READ latencies are worse?
- Who got isolated here? If READ latencies are worse and you are saying
that's the cost of isolation, that means you are looking for isolation
for WRITES? This is the first time I am hearing that READS starved
WRITES and I want better isolation for WRITES.

Also, CONFIG_BLKIOTRACK=n is not the solution. This will most likely be
set, and we need to figure out what makes sense.

To me, WRITE isolation comes in handy only if we want to create a speed
difference between multiple WRITE streams. And that cannot reliably be
done till we make the writeback logic cgroup aware.

If we try to put WRITES in a separate group, most likely WRITES will end
up getting a bigger share of the disk than they get by default, and I
seriously doubt anyone is looking for that. So far, all the complaints I
have heard are that in the presence of WRITES, READ latencies suffer, and
not vice versa.

Thanks
Vivek

2011-03-23 04:59:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v2 2/8] block,fs,mm: IO cgroup tracking for buffered write

On Tue, 22 Mar 2011 16:08:49 -0700
Justin TerAvest <[email protected]> wrote:

> +static inline void page_cgroup_set_blkio_id(struct page_cgroup *pc,
> + unsigned long id)
> +{
> + unsigned long old, new, prev;
> +
> + /* pc->flags isn't lock protected, so we must cmpxchg. */
> + WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
> + do {
> + new = old = pc->flags;
> + new &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
> + new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
> + prev = cmpxchg(&pc->flags, old, new);
> + } while (prev != old);

How do you support archs which don't have cmpxchg?
At least, you need to take care of the compile error.

Thanks,
-Kame

2011-03-23 16:28:16

by Justin TerAvest

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Tue, Mar 22, 2011 at 6:27 PM, Vivek Goyal <[email protected]> wrote:
> On Tue, Mar 22, 2011 at 04:08:47PM -0700, Justin TerAvest wrote:
>
> [..]
>> ===================================== Isolation experiment results
>>
>> For isolation testing, we run a test that's available at:
>>   git://google3-2.osuosl.org/tests/blkcgroup.git
>>
>> It creates containers, runs workloads, and checks to see how well we meet
>> isolation targets. For the purposes of this patchset, I only ran
>> tests among buffered writers.
>>
>> Before patches
>> ==============
>> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
>> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
>> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
>> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
>> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
>> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
>> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
>> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
>> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
>> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
>> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
>> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
>> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
>>
>> After patches
>> =============
>> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
>> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
>> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
>> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
>> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
>> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
>> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
>> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
>> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
>> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
>> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
>> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
>> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed
>>
>> Summary
>> =======
>> Isolation between buffered writers is clearly better with this patch.
>
> Can you please explain what this test is doing. All I am seeing is passed
> and failed and I really don't understand what the test is doing.

I should have brought in more context; I was trying to keep the email
from becoming so long that nobody would read it.

We create cgroups, and set blkio.weight_device in the cgroups so that
they are assigned different weights for a given device. To give a
concrete example, in this case:
11:05:23 INFO ----- Running experiment 1: 900 wrseq.buf*2, 100 wrseq.buf*2
11:06:07 INFO experiment 1 achieved DTFs: 874, 125
11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150

We create two cgroups, one with weight 900 for device, the other with
weight 100.
Then in each cgroup we run "/bin/dd if=/dev/zero of=$outputfile bs=64K ...".

After those complete, we measure blkio.time, and compare their ratios
to all of the time taken, to see how closely the time
reported in the cgroup matches the requested weight for the device.

For simplicity, we only did dd WRITER tasks in the testing, though
isolation is also improved when we have a writer and a reader in
separate containers.
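
(To spell out the arithmetic as I read it: weights of 900 and 100 ask for 900
and 100 per-mille of the device time, the measured blkio.time split works out
to 874 and 125 per-mille, so the per-cgroup errors are 26 and 25; the maximum,
26, is under the allowed 150, hence the PASS.)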

>
> Can you run say simple 4 dd buffered writers in 4 cgroups with weights
> 100, 200, 300 and 400 and see if you get better isolation.

Absolutely. :) This is pretty close to what I ran above, I should have
just provided a better description.

Baseline (Jens' tree):
08:43:02 INFO ----- Running experiment 0: 100 wrseq.buf, 200
wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
08:43:46 INFO experiment 0 achieved DTFs: 144, 192, 463, 198
08:43:46 INFO experiment 0 FAILED: max observed error is 202, allowed is 150
08:43:50 INFO -----ran 1 experiments, 0 passed, 1 failed


With patches:
08:36:08 INFO ----- Running experiment 0: 100 wrseq.buf, 200
wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
08:36:55 INFO experiment 0 achieved DTFs: 113, 211, 289, 385
08:36:55 INFO experiment 0 PASSED: max observed error is 15, allowed is 150
08:36:56 INFO -----ran 1 experiments, 1 passed, 0 failed

>
> Secondly, can you also please explain how this works. Without
> making writeback cgroup aware, there are no guarantees that a higher
> weight cgroup will get more IO done.

It is dependent on writeback sending enough requests to the I/O
scheduler that touch multiple groups so that they can be scheduled
properly. We are not guaranteed that writeback will appropriately
choose pages from different cgroups, you are correct.

However, from experiments, we can see that writeback can send enough
I/O to the scheduler (and from enough cgroups) to allow us to get
isolation between cgroups for writes. As writeback more predictably
can pick I/Os from multiple cgroups to issue, I would expect this to
improve.


>
>>
>>
>> =============================== Read latency results
>> To test read latency, I created two containers:
>>   - One called "readers", with weight 900
>>   - One called "writers", with weight 100
>>
>> I ran this fio workload in "readers":
>> [global]
>> directory=/mnt/iostestmnt/fio
>> runtime=30
>> time_based=1
>> group_reporting=1
>> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
>> cgroup_nodelete=1
>> bs=4K
>> size=512M
>>
>> [iostest-read]
>> description="reader"
>> numjobs=16
>> rw=randread
>> new_group=1
>>
>>
>> ....and this fio workload in "writers"
>> [global]
>> directory=/mnt/iostestmnt/fio
>> runtime=30
>> time_based=1
>> group_reporting=1
>> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
>> cgroup_nodelete=1
>> bs=4K
>> size=512M
>>
>> [iostest-write]
>> description="writer"
>> cgroup=writers
>> numjobs=3
>> rw=write
>> new_group=1
>>
>>
>>
>> I've pasted the results from the "read" workload inline.
>>
>> Before patches
>> ==============
>> Starting 16 processes
>>
>> Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
>> iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
>>   Description  : ["reader"]
>>   read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
>>     clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
>>      lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
>>     bw (KB/s) : min=    0, max=  260, per=0.00%, avg= 0.00, stdev= 0.00
>>   cpu          : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=3383/0/0, short=0/0/0
>>
>>      lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
>>      lat (msec): 250=0.06%, >=2000=0.41%
>>
>> Run status group 0 (all jobs):
>>    READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec
>>
>> Disk stats (read/write):
>>   sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%
>>
>>
>>
>> After patches
>> =============
>> Starting 16 processes
>> Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
>> iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
>>   Description  : ["reader"]
>>   read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
>>     clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
>>      lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
>>     bw (KB/s) : min=    0, max=  198, per=31.69%, avg=156.52, stdev=17.83
>>   cpu          : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=3735/0/0, short=0/0/0
>>
>>      lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
>>      lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%
>>
>> Run status group 0 (all jobs):
>>    READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec
>>
>> Disk stats (read/write):
>>   sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%
>>
>>
>>
>> Summary
>> =======
>> Read latencies are a bit worse, but this overhead is only imposed when users
>> ask for this feature by turning on CONFIG_BLKIOTRACK. We expect there to be
>> something of a latency vs isolation tradeoff.
>
> - What number are you looking at to say READ latencies are worse?

I am looking at the "lat (msec)" values below IO depths in the fio output.

Specifically, before this patch:
     lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
     lat (msec): 250=0.06%, >=2000=0.41%

...and after:
     lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
     lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%

We can see that a lot of IOs moved from the 20ms group to 50ms.


It might be clearer if I made a table of the percentage of IOs that
finished by a specific time:

  msec   Baseline   W/Patches
     4       0.03        0.05
    10       2.69        0.37
    20      77.53       33.36
    50      99.43       97.97
   100      99.52       99.23
   250      99.58       99.34
   inf      99.99      100.01
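
For what it's worth, here is a small sketch (mine, not something from the
patchset or fio) showing where the cumulative numbers come from: they are just
running sums of fio's per-bucket "lat (msec)" percentages, shown here for the
baseline run:

#include <stdio.h>

int main(void)
{
        /* upper bucket bounds and percentages from the baseline fio output */
        const char *bucket[] = { "4", "10", "20", "50", "100", "250", "inf" };
        double pct[] = { 0.03, 2.66, 74.84, 21.90, 0.09, 0.06, 0.41 };
        double cum = 0.0;
        int i;

        for (i = 0; i < 7; i++) {
                cum += pct[i];
                printf("%4s ms: %6.2f%% of reads completed\n", bucket[i], cum);
        }
        return 0;
}

The same running sum over the patched run's buckets gives the W/Patches column.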



> - Who got isolated here? If READ latencies are worse and you are saying
>   that's the cost of isolation, that means you are looking for isolation
>   for WRITES? This is the first time I am hearing that READS starved
>   WRITES and I want better isolation for WRITES.

Yes, we are trying to get isolation between writes. I think a lot of
the effect that causes read latencies to be hurt is that we no longer
allow sync queues to preempt all async work. Part of the focus here is
also to stop writers from becoming starved.


>
> Also CONFIG_BLKIOTRACK=n is not the solution. This will most likely be
> set, and we need to figure out what makes sense.

I would be open to this being a cgroup property or something similar
as well. My point is that this should probably be configurable, as
we're making a tradeoff between latency and isolation, and though I
know that many users care about isolation between groups, maybe not
everyone does.

>
> To me WRITE isolation comes in handy only if we want to create a speed
> difference between multiple WRITE streams. And that cannot reliably be
> done till we make the writeback logic cgroup aware.
>
> If we try to put WRITES in a separate group, most likely WRITES will end
> up getting a bigger share of the disk than what they are getting by default,
> and I seriously doubt anybody is looking for that. So far all the complaints
> I have heard are that in the presence of WRITES, my READ latencies suffer,
> and not vice versa.

READ latencies are an excellent thing to focus on, but so is isolation
(and fairness) between cgroups.

We run many varying workloads in our deployments, and we don't want a
job coming in doing a lot of read traffic to starve out another job
(in a different cgroup) that's writing to the disk. I'm not sure what
the concerns and goals are for other users of CFQ, but from my
perspective, the most important thing is to get isolation between the
tasks so that their traffic to the disk is more predictable.


Thanks,
Justin



>
> Thanks
> Vivek
>

2011-03-23 17:21:28

by Justin TerAvest

[permalink] [raw]
Subject: Re: [PATCH v2 2/8] block,fs,mm: IO cgroup tracking for buffered write

On Tue, Mar 22, 2011 at 9:52 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 22 Mar 2011 16:08:49 -0700
> Justin TerAvest <[email protected]> wrote:
>
>> +static inline void page_cgroup_set_blkio_id(struct page_cgroup *pc,
>> +                             unsigned long id)
>> +{
>> +     unsigned long old, new, prev;
>> +
>> +     /* pc->flags isn't lock protected, so we must cmpxchg. */
>> +     WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
>> +     do {
>> +             new = old = pc->flags;
>> +             new &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
>> +             new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
>> +             prev = cmpxchg(&pc->flags, old, new);
>> +     } while (prev != old);
>
> How do you support archs which don't have cmpxchg?
> At least, you need to take care of the compile error.

Hi Kame,

Is there really no generic cmpxchg()? I thought that was what
include/asm-generic/cmpxchg-local.h was for.

Thanks,
Justin

>
> Thanks,
> -Kame
>
>

2011-03-23 20:06:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Wed, Mar 23, 2011 at 09:27:47AM -0700, Justin TerAvest wrote:
> On Tue, Mar 22, 2011 at 6:27 PM, Vivek Goyal <[email protected]> wrote:
> > On Tue, Mar 22, 2011 at 04:08:47PM -0700, Justin TerAvest wrote:
> >
> > [..]
> >> ===================================== Isolation experiment results
> >>
> >> For isolation testing, we run a test that's available at:
> >>   git://google3-2.osuosl.org/tests/blkcgroup.git
> >>
> >> It creates containers, runs workloads, and checks to see how well we meet
> >> isolation targets. For the purposes of this patchset, I only ran
> >> tests among buffered writers.
> >>
> >> Before patches
> >> ==============
> >> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
> >> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
> >> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
> >> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
> >> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
> >> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
> >> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
> >> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
> >> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
> >> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
> >> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
> >> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
> >> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
> >>
> >> After patches
> >> =============
> >> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
> >> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
> >> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
> >> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
> >> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
> >> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
> >> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
> >> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
> >> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
> >> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
> >> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
> >> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
> >> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed
> >>
> >> Summary
> >> =======
> >> Isolation between buffered writers is clearly better with this patch.
> >
> > Can you please explain what this test is doing. All I am seeing is passed
> > and failed and I really don't understand what the test is doing.
>
> I should have brought in more context; I was trying to keep the email
> from becoming so long that nobody would read it.
>
> We create cgroups, and set blkio.weight_device in the cgroups so that
> they are assigned different weights for a given device. To give a
> concrete example, in this case:
> 11:05:23 INFO ----- Running experiment 1: 900 wrseq.buf*2, 100 wrseq.buf*2
> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
>
> We create two cgroups, one with weight 900 for device, the other with
> weight 100.
> Then in each cgroup we run "/bin/dd if=/dev/zero of=$outputfile bs=64K ...".
>
> After those complete, we measure blkio.time, and compare their ratios
> to all of the time taken, to see how closely the time
> reported in the cgroup matches the requested weight for the device.
>
> For simplicity, we only did dd WRITER tasks in the testing, though
> isolation is also improved when we have a writer and a reader in
> separate containers.
>
> >
> > Can you run say simple 4 dd buffered writers in 4 cgroups with weights
> > 100, 200, 300 and 400 and see if you get better isolation.
>
> Absolutely. :) This is pretty close to what I ran above, I should have
> just provided a better description.
>
> Baseline (Jens' tree):
> 08:43:02 INFO ----- Running experiment 0: 100 wrseq.buf, 200
> wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
> 08:43:46 INFO experiment 0 achieved DTFs: 144, 192, 463, 198
> 08:43:46 INFO experiment 0 FAILED: max observed error is 202, allowed is 150
> 08:43:50 INFO -----ran 1 experiments, 0 passed, 1 failed
>
>
> With patches:
> 08:36:08 INFO ----- Running experiment 0: 100 wrseq.buf, 200
> wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
> 08:36:55 INFO experiment 0 achieved DTFs: 113, 211, 289, 385
> 08:36:55 INFO experiment 0 PASSED: max observed error is 15, allowed is 150
> 08:36:56 INFO -----ran 1 experiments, 1 passed, 0 failed
>

Is it possible to actually paste blkio.time and blkio.sectors numbers for
all the 4 cgroups?

> >
> > Secondly, can you also please explain how this works. Without
> > making writeback cgroup aware, there are no guarantees that a higher
> > weight cgroup will get more IO done.
>
> It is dependent on writeback sending enough requests to the I/O
> scheduler that touch multiple groups so that they can be scheduled
> properly. We are not guaranteed that writeback will appropriately
> choose pages from different cgroups, you are correct.
>
> However, from experiments, we can see that writeback can send enough
> I/O to the scheduler (and from enough cgroups) to allow us to get
> isolation between cgroups for writes. As writeback more predictably
> can pick I/Os from multiple cgroups to issue, I would expect this to
> improve.

Ok, In the past I had tried it with 2 cgroups (running dd inside these
cgroups) and I had no success. I am wondering what has changed.

In the past a high priority throttled process could very well try to
pick up an inode from a low prio cgroup, start writing it, and get
blocked. I believe a similar thing should happen now.

Also, with IO-less throttling the situation will become worse. Right
now a throttled process tries to do IO in its own context, but with
IO-less throttling everything will go through flusher threads and
completions will be equally divided among throttled processes. So
it might happen that the high weight process is not woken up enough to
do more IO and there is no service differentiation. So I suspect that after
IO-less throttling goes in, the situation might become worse until and
unless we make writeback aware of cgroups.

Anyway, I tried booting with your patches applied and it crashes.

Thanks
Vivek

mdadm: ARRAY line /dev/md0 has no identity information.
Setting up Logical Volume Management: 3 logical volume(s) in volume group "vg_chilli" now active
[ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/mapper/vg_chilli-lv_root
/dev/mapper/vg_chilli-lv_root: clean, 367720/2313808 files, 5932252/9252864 blocks
[/sbin/fsck.[ 10.531127] BUG: unable to handle kernel ext4 (1) -- /booNULL pointer dereferencet] fsck.ext4 -a at 000000000000001f
/dev/sda1
[/sb[ 10.534662] IP:in/fsck.ext4 (2) [<ffffffff8123e67e>] cfq_put_request+0x40/0x83
[ 10.534662] PGD 135191067 -- /mnt/ssd-intPUD 135dad067 el] fsck.ext4 -aPMD 0 /dev/sdb
/dev
[ 10.534662] Oops: 0000 [#1] /sdb: clean, 507SMP 918/4890624 file
s, 10566022/1953[ 10.534662] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host3/target3:0:0/3:0:0:0/block/sdb/dev
[ 10.534662] CPU 3 7686 blocks

[ 10.534662] Modules linked in: floppy [last unloaded: scsi_wait_scan]
[ 10.534662]
[ 10.534662] Pid: 0, comm: kworker/0:1 Not tainted 2.6.38-rc6-justin-cfq-io-tracking+ #38 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[ 10.534662] RIP: 0010:[<ffffffff8123e67e>] [<ffffffff8123e67e>] cfq_put_request+0x40/0x83
[ 10.534662] RSP: 0018:ffff8800bfcc3c10 EFLAGS: 00010086
[ 10.534662] RAX: 0000000000000007 RBX: ffff880135a7b4a0 RCX: 0000000000000001
[ 10.534662] RDX: 00000000ffff8800 RSI: ffff880135a7b4a0 RDI: ffff880135a7b4a0
[ 10.534662] RBP: ffff8800bfcc3c20 R08: 0000000000000000 R09: 0000000000000000
[ 10.534662] R10: ffffffff81a19400 R11: 0000000000000001 R12: ffff880135a7b540
[ 10.534662] R13: 0000000000020000 R14: 0000000000000011 R15: 0000000000000001
[ 10.534662] FS: 0000000000000000(0000) GS:ffff8800bfcc0000(0000) knlGS:0000000000000000
[ 10.534662] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 10.534662] CR2: 000000000000001f CR3: 000000013613b000 CR4: 00000000000006e0
[ 10.534662] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.534662] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 10.534662] Process kworker/0:1 (pid: 0, threadinfo ffff880137756000, task ffff8801377441c0)
[ 10.534662] Stack:
[ 10.534662] ffff880135a7b4a0 ffff880135168000 ffff8800bfcc3c30 ffffffff81229617
[ 10.534662] ffff8800bfcc3c60 ffffffff8122f22d ffff880135a7b4a0 0000000000000000
[ 10.534662] ffff880135757048 0000000000000001 ffff8800bfcc3ca0 ffffffff8122f45b
[ 10.534662] Call Trace:
[ 10.534662] <IRQ>
[ 10.534662] [<ffffffff81229617>] elv_put_request+0x1e/0x20
[ 10.534662] [<ffffffff8122f22d>] __blk_put_request+0xea/0x103
[ 10.534662] [<ffffffff8122f45b>] blk_finish_request+0x215/0x222
[ 10.534662] [<ffffffff8122f4a8>] __blk_end_request_all+0x40/0x49
[ 10.534662] [<ffffffff81231ec6>] blk_flush_complete_seq+0x18b/0x256
[ 10.534662] [<ffffffff81232132>] flush_end_io+0xad/0xeb
[ 10.534662] [<ffffffff8122f438>] blk_finish_request+0x1f2/0x222
[ 10.534662] [<ffffffff8122f74a>] blk_end_bidi_request+0x42/0x5d
[ 10.534662] [<ffffffff8122f7a1>] blk_end_request+0x10/0x12
[ 10.534662] [<ffffffff8134292c>] scsi_io_completion+0x182/0x3f6
[ 10.534662] [<ffffffff8133c80b>] scsi_finish_command+0xb5/0xbe
[ 10.534662] [<ffffffff81342c97>] scsi_softirq_done+0xe2/0xeb
[ 10.534662] [<ffffffff81233ea2>] blk_done_softirq+0x72/0x82
[ 10.534662] [<ffffffff81045544>] __do_softirq+0xde/0x1c7
[ 10.534662] [<ffffffff81003a0c>] call_softirq+0x1c/0x28
[ 10.534662] [<ffffffff81004ec1>] do_softirq+0x3d/0x85
[ 10.534662] [<ffffffff810452bd>] irq_exit+0x4a/0x8c
[ 10.534662] [<ffffffff815ed1a5>] do_IRQ+0x9d/0xb4
[ 10.534662] [<ffffffff815e6d53>] ret_from_intr+0x0/0x13
[ 10.534662] <EOI>
[ 10.534662] [<ffffffff8100a494>] ? mwait_idle+0xac/0xdd
[ 10.534662] [<ffffffff8100a48b>] ? mwait_idle+0xa3/0xdd
[ 10.534662] [<ffffffff81001ceb>] cpu_idle+0x64/0x9b
[ 10.534662] [<ffffffff815e023e>] start_secondary+0x173/0x177
[ 10.534662] Code: fb 4d 85 e4 74 63 8b 47 40 83 e0 01 48 83 c0 18 41 8b 54 84 08 85 d2 75 04 0f 0b eb fe ff ca 41 89 54 84 08 48 8b 87 98 00 00 00 <48> 8b 78 18 e8 30 45 ff ff 48 8b bb a8 00 00 00 48 c7 83 98 00
[ 10.534662] RIP [<ffffffff8123e67e>] cfq_put_request+0x40/0x83
[ 10.534662] RSP <ffff8800bfcc3c10>
[ 10.534662] CR2: 000000000000001f
[ 10.534662] ---[ end trace 9b1d20dc7519f482 ]---
[ 10.534662] Kernel panic - not syncing: Fatal exception in interrupt
[ 10.534662] Pid: 0, comm: kworker/0:1 Tainted: G D 2.6.38-rc6-justin-cfq-io-tracking+ #38
[ 10.534662] Call Trace:
[ 10.534662] <IRQ> [<ffffffff815e3c5f>] ? panic+0x91/0x199
[ 10.534662] [<ffffffff8103f753>] ? kmsg_dump+0x106/0x12d
[ 10.534662] [<ffffffff815e7bcb>] ? oops_end+0xae/0xbe
[ 10.534662] [<ffffffff81027b2b>] ? no_context+0x1fc/0x20b
[ 10.534662] [<ffffffff81027ccf>] ? __bad_area_nosemaphore+0x195/0x1b8
[ 10.534662] [<ffffffff8100de02>] ? save_stack_trace+0x2d/0x4a
[ 10.534662] [<ffffffff81027d05>] ? bad_area_nosemaphore+0x13/0x15
[ 10.534662] [<ffffffff815e9b74>] ? do_page_fault+0x1b9/0x38c
[ 10.534662] [<ffffffff8106a055>] ? trace_hardirqs_off+0xd/0xf
[ 10.534662] [<ffffffff8106b436>] ? mark_lock+0x2d/0x22c
[ 10.534662] [<ffffffff815e610b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 10.534662] [<ffffffff815e6fef>] ? page_fault+0x1f/0x30
[ 10.534662] [<ffffffff8123e67e>] ? cfq_put_request+0x40/0x83
[ 10.534662] [<ffffffff81229617>] ? elv_put_request+0x1e/0x20
[ 10.534662] [<ffffffff8122f22d>] ? __blk_put_request+0xea/0x103
[ 10.534662] [<ffffffff8122f45b>] ? blk_finish_request+0x215/0x222
[ 10.534662] [<ffffffff8122f4a8>] ? __blk_end_request_all+0x40/0x49
[ 10.534662] [<ffffffff81231ec6>] ? blk_flush_complete_seq+0x18b/0x256
[ 10.534662] [<ffffffff81232132>] ? flush_end_io+0xad/0xeb
[ 10.534662] [<ffffffff8122f438>] ? blk_finish_request+0x1f2/0x222
[ 10.534662] [<ffffffff8122f74a>] ? blk_end_bidi_request+0x42/0x5d
[ 10.534662] [<ffffffff8122f7a1>] ? blk_end_request+0x10/0x12
[ 10.534662] [<ffffffff8134292c>] ? scsi_io_completion+0x182/0x3f6
[ 10.534662] [<ffffffff8133c80b>] ? scsi_finish_command+0xb5/0xbe
[ 10.534662] [<ffffffff81342c97>] ? scsi_softirq_done+0xe2/0xeb
[ 10.534662] [<ffffffff81233ea2>] ? blk_done_softirq+0x72/0x82
[ 10.534662] [<ffffffff81045544>] ? __do_softirq+0xde/0x1c7
[ 10.534662] [<ffffffff81003a0c>] ? call_softirq+0x1c/0x28
[ 10.534662] [<ffffffff81004ec1>] ? do_softirq+0x3d/0x85
[ 10.534662] [<ffffffff810452bd>] ? irq_exit+0x4a/0x8c
[ 10.534662] [<ffffffff815ed1a5>] ? do_IRQ+0x9d/0xb4
[ 10.534662] [<ffffffff815e6d53>] ? ret_from_intr+0x0/0x13
[ 10.534662] <EOI> [<ffffffff8100a494>] ? mwait_idle+0xac/0xdd
[ 10.534662] [<ffffffff8100a48b>] ? mwait_idle+0xa3/0xdd
[ 10.534662] [<ffffffff81001ceb>] ? cpu_idle+0x64/0x9b
[ 10.534662] [<ffffffff815e023e>] ? start_secondary+0x173/0x177

2011-03-23 22:33:37

by Justin TerAvest

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Wed, Mar 23, 2011 at 1:06 PM, Vivek Goyal <[email protected]> wrote:
> On Wed, Mar 23, 2011 at 09:27:47AM -0700, Justin TerAvest wrote:
>> On Tue, Mar 22, 2011 at 6:27 PM, Vivek Goyal <[email protected]> wrote:
>> > On Tue, Mar 22, 2011 at 04:08:47PM -0700, Justin TerAvest wrote:
>> >
>> > [..]
>> >> ===================================== Isolation experiment results
>> >>
>> >> For isolation testing, we run a test that's available at:
>> >>   git://google3-2.osuosl.org/tests/blkcgroup.git
>> >>
>> >> It creates containers, runs workloads, and checks to see how well we meet
>> >> isolation targets. For the purposes of this patchset, I only ran
>> >> tests among buffered writers.
>> >>
>> >> Before patches
>> >> ==============
>> >> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
>> >> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
>> >> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
>> >> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
>> >> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
>> >> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
>> >> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
>> >> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
>> >> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
>> >> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
>> >> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
>> >> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
>> >> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
>> >>
>> >> After patches
>> >> =============
>> >> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
>> >> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
>> >> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
>> >> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
>> >> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
>> >> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
>> >> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
>> >> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
>> >> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
>> >> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
>> >> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
>> >> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
>> >> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed
>> >>
>> >> Summary
>> >> =======
>> >> Isolation between buffered writers is clearly better with this patch.
>> >
>> > Can you please explain what this test is doing. All I am seeing is passed
>> > and failed and I really don't understand what the test is doing.
>>
>> I should have brought in more context; I was trying to keep the email
>> from becoming so long that nobody would read it.
>>
>> We create cgroups, and set blkio.weight_device in the cgroups so that
>> they are assigned different weights for a given device. To give a
>> concrete example, in this case:
>> 11:05:23 INFO ----- Running experiment 1: 900 wrseq.buf*2, 100 wrseq.buf*2
>> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
>> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
>>
>> We create two cgroups, one with weight 900 for device, the other with
>> weight 100.
>> Then in each cgroup we run "/bin/dd if=/dev/zero of=$outputfile bs=64K ...".
>>
>> After those complete, we measure blkio.time, and compare their ratios
>> to all of the time taken, to see how closely the time
>> reported in the cgroup matches the requested weight for the device.
>>
>> For simplicity, we only did dd WRITER tasks in the testing, though
>> isolation is also improved when we have a writer and a reader in
>> separate containers.
>>
>> >
>> > Can you run say simple 4 dd buffered writers in 4 cgroups with weights
>> > 100, 200, 300 and 400 and see if you get better isolation.
>>
>> Absolutely. :) This is pretty close to what I ran above, I should have
>> just provided a better description.
>>
>> Baseline (Jens' tree):
>> 08:43:02 INFO ----- Running experiment 0: 100 wrseq.buf, 200
>> wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
>> 08:43:46 INFO experiment 0 achieved DTFs: 144, 192, 463, 198
>> 08:43:46 INFO experiment 0 FAILED: max observed error is 202, allowed is 150
>> 08:43:50 INFO -----ran 1 experiments, 0 passed, 1 failed
>>
>>
>> With patches:
>> 08:36:08 INFO ----- Running experiment 0: 100 wrseq.buf, 200
>> wrseq.buf, 300 wrseq.buf, 400 wrseq.buf
>> 08:36:55 INFO experiment 0 achieved DTFs: 113, 211, 289, 385
>> 08:36:55 INFO experiment 0 PASSED: max observed error is 15, allowed is 150
>> 08:36:56 INFO -----ran 1 experiments, 1 passed, 0 failed
>>
>
> Is it possible to actually paste blkio.time and blkio.sectors numbers for
> all the 4 cgroups?

Yes. I'll try to find a good place to host the files so I can make a
good summary and host all the cgroup stats together.

Without the patches it isn't very interesting because all of the async
traffic is put in the root cgroup, so we don't see much time or
sectors in the cgroups. Let me know if you want me to email that and I
will too.

With the patches applied:
/dev/cgroup/blkcgroupt0/blkio.sectors
8:16 510040
8:0 376

/dev/cgroup/blkcgroupt1/blkio.sectors
8:16 941040

/dev/cgroup/blkcgroupt2/blkio.sectors
8:16 1224456
8:0 8

/dev/cgroup/blkcgroupt3/blkio.sectors
8:16 1509576
8:0 152

/dev/cgroup/blkcgroupt0/blkio.time
8:16 2651
8:0 20

/dev/cgroup/blkcgroupt1/blkio.time
8:16 5200

/dev/cgroup/blkcgroupt2/blkio.time
8:16 7350
8:0 8

/dev/cgroup/blkcgroupt3/blkio.time
8:16 9591
8:0 20

/dev/cgroup/blkcgroupt0/blkio.weight_device
8:16 100

/dev/cgroup/blkcgroupt1/blkio.weight_device
8:16 200

/dev/cgroup/blkcgroupt2/blkio.weight_device
8:16 300

/dev/cgroup/blkcgroupt3/blkio.weight_device
8:16 400
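
As a cross-check on how these raw numbers line up with the earlier pass/fail
output (my own sketch, not the blkcgroup test code): each cgroup's share of the
summed 8:16 blkio.time comes out to roughly 107, 210, 296 and 387 per-mille
against the 100/200/300/400 targets, i.e. a worst-case deviation of about 13,
in line with the small errors reported above.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        long time[]   = { 2651, 5200, 7350, 9591 };   /* blkio.time for 8:16 */
        long weight[] = {  100,  200,  300,  400 };   /* blkio.weight_device */
        long total = 0, wtotal = 0, max_err = 0;
        int i;

        for (i = 0; i < 4; i++) {
                total += time[i];
                wtotal += weight[i];
        }
        for (i = 0; i < 4; i++) {
                /* rounded per-mille shares: achieved vs. requested */
                long dtf = (time[i] * 1000 + total / 2) / total;
                long target = (weight[i] * 1000 + wtotal / 2) / wtotal;
                long err = labs(dtf - target);

                if (err > max_err)
                        max_err = err;
                printf("cgroup %d: achieved %3ld, target %3ld\n", i, dtf, target);
        }
        printf("max observed error: %ld (the test allows 150)\n", max_err);
        return 0;
}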






>
>> >
>> > Secondly, can you also please explain how this works. Without
>> > making writeback cgroup aware, there are no guarantees that a higher
>> > weight cgroup will get more IO done.
>>
>> It is dependent on writeback sending enough requests to the I/O
>> scheduler that touch multiple groups so that they can be scheduled
>> properly. We are not guaranteed that writeback will appropriately
>> choose pages from different cgroups, you are correct.
>>
>> However, from experiments, we can see that writeback can send enough
>> I/O to the scheduler (and from enough cgroups) to allow us to get
>> isolation between cgroups for writes. As writeback more predictably
>> can pick I/Os from multiple cgroups to issue, I would expect this to
>> improve.
>
> Ok, In the past I had tried it with 2 cgroups (running dd inside these
> cgroups) and I had no success. I am wondering what has changed.

It could just be a difference in workload, or dd size, or filesystem?

>
> In the past a high priority throttled process could very well try to
> pick up an inode from a low prio cgroup, start writing it, and get
> blocked. I believe a similar thing should happen now.

You're right that it's very dependent on what inodes get picked up,
and when, by writeback.

>
> Also, with IO-less throttling the situation will become worse. Right
> now a throttled process tries to do IO in its own context, but with
> IO-less throttling everything will go through flusher threads and
> completions will be equally divided among throttled processes. So
> it might happen that the high weight process is not woken up enough to
> do more IO and there is no service differentiation. So I suspect that after
> IO-less throttling goes in, the situation might become worse until and
> unless we make writeback aware of cgroups.

This is my primary concern. I really want to understand the issues and
get discussion moving on these patches so that any other effects of
writeback will be sane; we need to make sure that we can provide good
isolation between cgroups, and that isolation includes traffic to the
disk that comes from writeback. I think that if we do IO-less
throttling, it will definitely have to be cgroup aware to provide any
isolation.

>
> Anyway, I tried booting with your patches applied and it crashes.

Thanks, I'll spend time today trying to track down the cause. I have
been debugging an issue with flush_end_io; I have been having some
difficulties since rebasing to for-2.6.39/core.

Thanks,
Justin

>
> Thanks
> Vivek
>

2011-03-24 08:33:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v2 2/8] block,fs,mm: IO cgroup tracking for buffered write

On Wed, 23 Mar 2011 10:21:01 -0700
Justin TerAvest <[email protected]> wrote:

> On Tue, Mar 22, 2011 at 9:52 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Tue, 22 Mar 2011 16:08:49 -0700
> > Justin TerAvest <[email protected]> wrote:
> >
> >> +static inline void page_cgroup_set_blkio_id(struct page_cgroup *pc,
> >> +                             unsigned long id)
> >> +{
> >> +     unsigned long old, new, prev;
> >> +
> >> +     /* pc->flags isn't lock protected, so we must cmpxchg. */
> >> +     WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
> >> +     do {
> >> +             new = old = pc->flags;
> >> +             new &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
> >> +             new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
> >> +             prev = cmpxchg(&pc->flags, old, new);
> >> +     } while (prev != old);
> >
> > How do you support archs which don't have cmpxchg?
> > At least, you need to take care of the compile error.
>
> Hi Kame,
>
> Is there really no generic cmpxchg()? I thought that was what
> include/asm-generic/cmpxchg-local.h was for.
>

I'm not sure ;) But I myself tend to avoid using cmpxchg, which
cannot be used in CONFIG_SMP, sometimes.

Thanks,
-Kame


2011-03-24 13:57:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Wed, Mar 23, 2011 at 03:32:51PM -0700, Justin TerAvest wrote:

[..]
> > Ok, In the past I had tried it with 2 cgroups (running dd inside these
> > cgroups) and I had no success. I am wondering what has changed.
>
> It could just be a difference in workload, or dd size, or filesystem?

Once you have sorted out the bug and I can boot my system, I will test
it to see what's happening.

>
> >
> > In the past a high priority throttled process could very well try to
> > pick up an inode from a low prio cgroup, start writing it, and get
> > blocked. I believe a similar thing should happen now.
>
> You're right that it's very dependent on what inodes get picked up,
> and when, by writeback.

So then the above results should not be reproducible consistently? Are you
using additional patches internally to make this work reliably? My
concern is that it will not make sense to have half-baked pieces
in the upstream kernel. That's why I was hoping that this piece can go in
once we have sorted out the following.

- IO less throttling
- Per cgroup dirty ratio
- Some work w.r.t cgroup aware writeback.

In fact cgroup aware writeback can be part of this patch series once the
first two pieces are in the kernel.

Thanks
Vivek

2011-03-25 01:52:06

by Justin TerAvest

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Thu, Mar 24, 2011 at 6:56 AM, Vivek Goyal <[email protected]> wrote:
> On Wed, Mar 23, 2011 at 03:32:51PM -0700, Justin TerAvest wrote:
>
> [..]
>> > Ok, In the past I had tried it with 2 cgroups (running dd inside these
>> > cgroups) and I had no success. I am wondering what has changed.
>>
>> It could just be a difference in workload, or dd size, or filesystem?
>
> Once you have sorted out the bug and I can boot my system, I will test
> it to see what's happening.

Thanks. I'll send out a version 3 of the patch (and organize my
current results better in the cover letter), as soon as I have the
problem resolved. I see a very similar panic during fsck that I'm
trying to track down.

>
>>
>> >
>> > In the past a high priority throttled process could very well try to
>> > pick up an inode from a low prio cgroup, start writing it, and get
>> > blocked. I believe a similar thing should happen now.
>>
>> You're right that it's very dependent on what inodes get picked up,
>> and when, by writeback.
>
> So then the above results should not be reproducible consistently? Are you
> using additional patches internally to make this work reliably? My
> concern is that it will not make sense to have half-baked pieces
> in the upstream kernel. That's why I was hoping that this piece can go in
> once we have sorted out the following.

I am not using additional patches to make this work reliably. However,
I've noticed the behavior is better with some filesystems than others.

>
> - IO less throttling
> - Per cgroup dirty ratio
> - Some work w.r.t cgroup aware writeback.
>
> In fact cgroup aware writeback can be part of this patch series once the
> first two pieces are in the kernel.

I understand your preference for a complete, reliable solution for all
of this. Right now, I'm concerned that it's hard to tie all these
systems together, and I haven't seen an overall plan for how all of
these things work together. If this patchset works reliably for enough
workloads, we'll just see isolation improve as writeback is more
cgroup aware.

>
> Thanks
> Vivek
>

2011-03-25 07:47:03

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

* Justin TerAvest <[email protected]> [2011-03-22 16:08:47]:

> This patchset adds tracking to the page_cgroup structure for which cgroup has
> dirtied a page, and uses that information to provide isolation between
> cgroups performing writeback.
>
> I know that there is some discussion to remove request descriptor limits
> entirely, but I included a patch to introduce per-cgroup limits to enable
> this functionality. Without it, we didn't see much isolation improvement.
>
> I think most of this material has been discussed on lkml previously, this is
> just another attempt to make a patchset that handles buffered writes for CFQ.
>
> There was a lot of previous discussion at:
> http://thread.gmane.org/gmane.linux.kernel/1007922
>
> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
> and Vivek Goyal for work on previous versions of these patches.
>
> For version 2:
> - I collected more statistics and provided data in the cover sheet
> - blkio id is now stored inside "flags" in page_cgroup, with cmpxchg
> - I cleaned up some patch names
> - Added symmetric reference wrappers in cfq-iosched
>
> There are a couple lingering issues that exist in this patchset-- it's meant
> to be an RFC to discuss the overall design for tracking of buffered writes.
> I have at least a couple of patches to finish to make absolutely sure that
> refcounts and locking are handled properly, I just need to do more testing.
>
> Documentation/block/biodoc.txt | 10 +
> block/blk-cgroup.c | 203 +++++++++++++++++-
> block/blk-cgroup.h | 9 +-
> block/blk-core.c | 218 +++++++++++++------
> block/blk-settings.c | 2 +-
> block/blk-sysfs.c | 59 +++---
> block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++----------
> block/cfq.h | 6 +-
> block/elevator.c | 7 +-
> fs/buffer.c | 2 +
> fs/direct-io.c | 2 +
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 81 +++++++-
> include/linux/blkio-track.h | 89 ++++++++
> include/linux/elevator.h | 14 +-
> include/linux/iocontext.h | 1 +
> include/linux/memcontrol.h | 6 +
> include/linux/mmzone.h | 4 +-
> include/linux/page_cgroup.h | 38 +++-
> init/Kconfig | 16 ++
> mm/Makefile | 3 +-
> mm/bounce.c | 2 +
> mm/filemap.c | 2 +
> mm/memcontrol.c | 6 +
> mm/memory.c | 6 +
> mm/page-writeback.c | 14 +-
> mm/page_cgroup.c | 29 ++-
> mm/swap_state.c | 2 +
> 28 files changed, 1066 insertions(+), 240 deletions(-)
>
>
> 8f0b0f4 cfq: Don't allow preemption across cgroups
> a47cdc6 block: Per cgroup request descriptor counts
> 8dd7adb cfq: add per cgroup writeout done by flusher stat
> 1fa0b6d cfq: Fix up tracked async workload length.
> e9e85d3 block: Modify CFQ to use IO tracking information.
> f8ffb19 cfq-iosched: Make async queues per cgroup
> 1d9ee09 block,fs,mm: IO cgroup tracking for buffered write
> 31c7321 cfq-iosched: add symmetric reference wrappers
>
>
> ===================================== Isolation experiment results
>
> For isolation testing, we run a test that's available at:
> git://google3-2.osuosl.org/tests/blkcgroup.git
>
> It creates containers, runs workloads, and checks to see how well we meet
> isolation targets. For the purposes of this patchset, I only ran
> tests among buffered writers.
>
> Before patches
> ==============
> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
>
> After patches
> =============
> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed

Could you explain what max observed error is all about?

>
> Summary
> =======
> Isolation between buffered writers is clearly better with this patch.
>
>
> =============================== Read latency results
> To test read latency, I created two containers:
> - One called "readers", with weight 900
> - One called "writers", with weight 100
>
> I ran this fio workload in "readers":
> [global]
> directory=/mnt/iostestmnt/fio
> runtime=30
> time_based=1
> group_reporting=1
> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'

Is this sufficient, or do you need a sync prior to this?

> cgroup_nodelete=1
> bs=4K
> size=512M
>
> [iostest-read]
> description="reader"
> numjobs=16
> rw=randread
> new_group=1
>
>
> ....and this fio workload in "writers"
> [global]
> directory=/mnt/iostestmnt/fio
> runtime=30
> time_based=1
> group_reporting=1
> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
> cgroup_nodelete=1
> bs=4K
> size=512M
>
> [iostest-write]
> description="writer"
> cgroup=writers
> numjobs=3
> rw=write
> new_group=1
>
>
>
> I've pasted the results from the "read" workload inline.
>
> Before patches
> ==============
> Starting 16 processes
>
> Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
> iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
> Description : ["reader"]
> read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
> clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
> lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
> bw (KB/s) : min= 0, max= 260, per=0.00%, avg= 0.00, stdev= 0.00
> cpu : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued r/w/d: total=3383/0/0, short=0/0/0
>
> lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
> lat (msec): 250=0.06%, >=2000=0.41%
>
> Run status group 0 (all jobs):
> READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec
>
> Disk stats (read/write):
> sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%
>
>
>
> After patches
> =============
> Starting 16 processes
> Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
> iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
> Description : ["reader"]
> read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
> clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
> lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
> bw (KB/s) : min= 0, max= 198, per=31.69%, avg=156.52, stdev=17.83
> cpu : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued r/w/d: total=3735/0/0, short=0/0/0
>
> lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
> lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%
>
> Run status group 0 (all jobs):
> READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec
>
> Disk stats (read/write):
> sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%
>

This shows an improvement in read b/w; what does the writer
output look like?

--
Three Cheers,
Balbir

2011-03-28 15:22:18

by Justin TerAvest

[permalink] [raw]
Subject: Re: [RFC] [PATCH v2 0/8] Provide cgroup isolation for buffered writes.

On Fri, Mar 25, 2011 at 12:46 AM, Balbir Singh
<[email protected]> wrote:
> * Justin TerAvest <[email protected]> [2011-03-22 16:08:47]:
>
>> This patchset adds tracking to the page_cgroup structure for which cgroup has
>> dirtied a page, and uses that information to provide isolation between
>> cgroups performing writeback.
>>
>> I know that there is some discussion to remove request descriptor limits
>> entirely, but I included a patch to introduce per-cgroup limits to enable
>> this functionality. Without it, we didn't see much isolation improvement.
>>
>> I think most of this material has been discussed on lkml previously, this is
>> just another attempt to make a patchset that handles buffered writes for CFQ.
>>
>> There was a lot of previous discussion at:
>>  http://thread.gmane.org/gmane.linux.kernel/1007922
>>
>> Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
>> and Vivek Goyal for work on previous versions of these patches.
>>
>> For version 2:
>>   - I collected more statistics and provided data in the cover sheet
>>   - blkio id is now stored inside "flags" in page_cgroup, with cmpxchg
>>   - I cleaned up some patch names
>>   - Added symmetric reference wrappers in cfq-iosched
>>
>> There are a couple lingering issues that exist in this patchset-- it's meant
>> to be an RFC to discuss the overall design for tracking of buffered writes.
>> I have at least a couple of patches to finish to make absolutely sure that
>> refcounts and locking are handled properly, I just need to do more testing.
>>
>>  Documentation/block/biodoc.txt |   10 +
>>  block/blk-cgroup.c             |  203 +++++++++++++++++-
>>  block/blk-cgroup.h             |    9 +-
>>  block/blk-core.c               |  218 +++++++++++++------
>>  block/blk-settings.c           |    2 +-
>>  block/blk-sysfs.c              |   59 +++---
>>  block/cfq-iosched.c            |  473 ++++++++++++++++++++++++++++++----------
>>  block/cfq.h                    |    6 +-
>>  block/elevator.c               |    7 +-
>>  fs/buffer.c                    |    2 +
>>  fs/direct-io.c                 |    2 +
>>  include/linux/blk_types.h      |    2 +
>>  include/linux/blkdev.h         |   81 +++++++-
>>  include/linux/blkio-track.h    |   89 ++++++++
>>  include/linux/elevator.h       |   14 +-
>>  include/linux/iocontext.h      |    1 +
>>  include/linux/memcontrol.h     |    6 +
>>  include/linux/mmzone.h         |    4 +-
>>  include/linux/page_cgroup.h    |   38 +++-
>>  init/Kconfig                   |   16 ++
>>  mm/Makefile                    |    3 +-
>>  mm/bounce.c                    |    2 +
>>  mm/filemap.c                   |    2 +
>>  mm/memcontrol.c                |    6 +
>>  mm/memory.c                    |    6 +
>>  mm/page-writeback.c            |   14 +-
>>  mm/page_cgroup.c               |   29 ++-
>>  mm/swap_state.c                |    2 +
>>  28 files changed, 1066 insertions(+), 240 deletions(-)
>>
>>
>> 8f0b0f4 cfq: Don't allow preemption across cgroups
>> a47cdc6 block: Per cgroup request descriptor counts
>> 8dd7adb cfq: add per cgroup writeout done by flusher stat
>> 1fa0b6d cfq: Fix up tracked async workload length.
>> e9e85d3 block: Modify CFQ to use IO tracking information.
>> f8ffb19 cfq-iosched: Make async queues per cgroup
>> 1d9ee09 block,fs,mm: IO cgroup tracking for buffered write
>> 31c7321 cfq-iosched: add symmetric reference wrappers
>>
>>
>> ===================================== Isolation experiment results
>>
>> For isolation testing, we run a test that's available at:
>>   git://google3-2.osuosl.org/tests/blkcgroup.git
>>
>> It creates containers, runs workloads, and checks to see how well we meet
>> isolation targets. For the purposes of this patchset, I only ran
>> tests among buffered writers.
>>
>> Before patches
>> ==============
>> 10:32:06 INFO experiment 0 achieved DTFs: 666, 333
>> 10:32:06 INFO experiment 0 FAILED: max observed error is 167, allowed is 150
>> 10:32:51 INFO experiment 1 achieved DTFs: 647, 352
>> 10:32:51 INFO experiment 1 FAILED: max observed error is 253, allowed is 150
>> 10:33:35 INFO experiment 2 achieved DTFs: 298, 701
>> 10:33:35 INFO experiment 2 FAILED: max observed error is 199, allowed is 150
>> 10:34:19 INFO experiment 3 achieved DTFs: 445, 277, 277
>> 10:34:19 INFO experiment 3 FAILED: max observed error is 155, allowed is 150
>> 10:35:05 INFO experiment 4 achieved DTFs: 418, 104, 261, 215
>> 10:35:05 INFO experiment 4 FAILED: max observed error is 232, allowed is 150
>> 10:35:53 INFO experiment 5 achieved DTFs: 213, 136, 68, 102, 170, 136, 170
>> 10:35:53 INFO experiment 5 PASSED: max observed error is 73, allowed is 150
>> 10:36:04 INFO -----ran 6 experiments, 1 passed, 5 failed
>>
>> After patches
>> =============
>> 11:05:22 INFO experiment 0 achieved DTFs: 501, 498
>> 11:05:22 INFO experiment 0 PASSED: max observed error is 2, allowed is 150
>> 11:06:07 INFO experiment 1 achieved DTFs: 874, 125
>> 11:06:07 INFO experiment 1 PASSED: max observed error is 26, allowed is 150
>> 11:06:53 INFO experiment 2 achieved DTFs: 121, 878
>> 11:06:53 INFO experiment 2 PASSED: max observed error is 22, allowed is 150
>> 11:07:46 INFO experiment 3 achieved DTFs: 589, 205, 204
>> 11:07:46 INFO experiment 3 PASSED: max observed error is 11, allowed is 150
>> 11:08:34 INFO experiment 4 achieved DTFs: 616, 109, 109, 163
>> 11:08:34 INFO experiment 4 PASSED: max observed error is 34, allowed is 150
>> 11:09:29 INFO experiment 5 achieved DTFs: 139, 139, 139, 139, 140, 141, 160
>> 11:09:29 INFO experiment 5 PASSED: max observed error is 1, allowed is 150
>> 11:09:46 INFO -----ran 6 experiments, 6 passed, 0 failed
>
> Could you explain what max observed error is all about?

Hi Balbir,

"max observed error" is the difference between the requested weight
and the observed amount of time that reached a device. Lower error
values mean the isolation is more closely meeting the requested
weight.
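
As a concrete illustration (assuming experiment 0 above requests equal
weights, i.e. target DTFs of 500 and 500): before the patches the
achieved DTFs were 666/333, so the per-cgroup errors are 166 and 167
and the max observed error is 167; after the patches the achieved
DTFs were 501/498, for a max observed error of 2.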


>
>>
>> Summary
>> =======
>> Isolation between buffered writers is clearly better with this patch.
>>
>>
>> =============================== Read latency results
>> To test read latency, I created two containers:
>>   - One called "readers", with weight 900
>>   - One called "writers", with weight 100
>>
>> I ran this fio workload in "readers":
>> [global]
>> directory=/mnt/iostestmnt/fio
>> runtime=30
>> time_based=1
>> group_reporting=1
>> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
>
> Is this sufficient? Do you need a sync prior to this?

I should add a sync prior to this; you're correct. I'll add a sync and
rerun the tests when I clean up the test data for version 3 of the
patchset.
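
Concretely, the plan is to change the prerun hook to something like
the following (a sketch; this assumes fio hands exec_prerun to a
shell, so the compound command works):

exec_prerun='sync; echo 3 > /proc/sys/vm/drop_caches'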

>
>> cgroup_nodelete=1
>> bs=4K
>> size=512M
>>
>> [iostest-read]
>> description="reader"
>> numjobs=16
>> rw=randread
>> new_group=1
>>
>>
>> ....and this fio workload in "writers"
>> [global]
>> directory=/mnt/iostestmnt/fio
>> runtime=30
>> time_based=1
>> group_reporting=1
>> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
>> cgroup_nodelete=1
>> bs=4K
>> size=512M
>>
>> [iostest-write]
>> description="writer"
>> cgroup=writers
>> numjobs=3
>> rw=write
>> new_group=1
>>
>>
>>
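
(For reference, the containers here are ordinary blkio cgroups; the
setup amounts to something like the sketch below. The mount point is
an assumption and the exact commands may differ from what I actually
ran, but the weights are the 900/100 split described above.)

  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/readers /cgroup/blkio/writers
  echo 900 > /cgroup/blkio/readers/blkio.weight
  echo 100 > /cgroup/blkio/writers/blkio.weight

fio can then attach a job to its group with the cgroup= option, as the
writer job file above does.
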
>> I've pasted the results from the "read" workload inline.
>>
>> Before patches
>> ==============
>> Starting 16 processes
>>
>> Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0  iops] [eta 01m:00s]
>> iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
>>   Description  : ["reader"]
>>   read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
>>     clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
>>      lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
>>     bw (KB/s) : min=    0, max=  260, per=0.00%, avg= 0.00, stdev= 0.00
>>   cpu          : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=3383/0/0, short=0/0/0
>>
>>      lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
>>      lat (msec): 250=0.06%, >=2000=0.41%
>>
>> Run status group 0 (all jobs):
>>    READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec
>>
>> Disk stats (read/write):
>>   sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%
>>
>>
>>
>> After patches
>> =============
>> Starting 16 processes
>> Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0  iops] [eta 00m:00s]
>> iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
>>   Description  : ["reader"]
>>   read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
>>     clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
>>      lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
>>     bw (KB/s) : min=    0, max=  198, per=31.69%, avg=156.52, stdev=17.83
>>   cpu          : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=3735/0/0, short=0/0/0
>>
>>      lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
>>      lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%
>>
>> Run status group 0 (all jobs):
>>    READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec
>>
>> Disk stats (read/write):
>>   sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%
>>
>
> This shows an improvement in read b/w; what does the writer
> output look like?

Before patches:
iostest-write: (groupid=0, jobs=3): err= 0: pid=20654
Description : ["writer"]
write: io=282444KB, bw=9410.5KB/s, iops=2352 , runt= 30014msec
clat (usec): min=3 , max=28921K, avg=1468.79, stdev=108833.25
lat (usec): min=3 , max=28921K, avg=1468.89, stdev=108833.25
bw (KB/s) : min= 101, max= 5448, per=21.39%, avg=2013.25, stdev=1322.76
cpu : usr=0.11%, sys=0.41%, ctx=77, majf=0, minf=81
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/70611/0, short=0/0/0
lat (usec): 4=0.65%, 10=95.27%, 20=1.39%, 50=2.58%, 100=0.01%
lat (usec): 250=0.01%
lat (msec): 2=0.01%, 4=0.01%, 10=0.04%, 20=0.02%, 100=0.01%
lat (msec): 250=0.01%, 500=0.01%, 750=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=282444KB, aggrb=9410KB/s, minb=9636KB/s, maxb=9636KB/s,
mint=30014msec, maxt=30014msec

Disk stats (read/write):
sdb: ios=3716/0, merge=0/0, ticks=157011/0, in_queue=506264, util=99.09%


After patches:
Jobs: 3 (f=3): [WWW] [100.0% done] [0K/0K /s] [0 /0 iops] [eta 00m:00s]
iostest-write: (groupid=0, jobs=3): err= 0: pid=14178
Description : ["writer"]
write: io=90268KB, bw=3004.9KB/s, iops=751 , runt= 30041msec
clat (usec): min=3 , max=29612K, avg=4086.42, stdev=197096.83
lat (usec): min=3 , max=29612K, avg=4086.53, stdev=197096.83
bw (KB/s) : min= 956, max= 1092, per=32.58%, avg=978.67, stdev= 0.00
cpu : usr=0.03%, sys=0.14%, ctx=44, majf=1, minf=83
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/22567/0, short=0/0/0
lat (usec): 4=1.06%, 10=94.20%, 20=2.11%, 50=2.50%, 100=0.01%
lat (usec): 250=0.01%
lat (msec): 10=0.04%, 20=0.03%, 50=0.01%, 250=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=90268KB, aggrb=3004KB/s, minb=3076KB/s, maxb=3076KB/s,
mint=30041msec, maxt=30041msec

Disk stats (read/write):
sdb: ios=4158/0, merge=0/0, ticks=95747/0, in_queue=475051, util=100.00%

Thanks,
Justin


>
> --
>        Three Cheers,
>        Balbir