2011-03-30 16:51:15

by Justin TerAvest

Subject: [RFC] [PATCH v3 0/8] Provide cgroup isolation for buffered writes.

This patchset adds tracking to the page_cgroup structure of which cgroup
dirtied a page, and uses that information to provide isolation between
cgroups performing writeback.

I know there is some discussion about removing request descriptor limits
entirely, but I included a patch introducing per-cgroup limits to enable this
functionality. Without it, we didn't see much isolation improvement.

I think most of this material has been discussed on lkml previously; this is
just another attempt at a patchset that handles buffered writes for CFQ.

There was a lot of previous discussion at:
http://thread.gmane.org/gmane.linux.kernel/1007922

Thanks to Andrea Righi, Kamezawa Hiroyuki, Munehiro Ikeda, Nauman Rafique,
and Vivek Goyal for work on previous versions of these patches.

For version 3:
- Fixed a bug in blk_flush_restore_request() where restored flush requests
caused a BUG.
- Cleaned up the test information in the cover sheet further.
- There are still outstanding issues; this version is mostly to prevent an
oops Vivek noticed on boot.

For version 2:
- I collected more statistics and provided data in the cover sheet
- blkio id is now stored inside "flags" in page_cgroup, with cmpxchg
- I cleaned up some patch names
- Added symmetric reference wrappers in cfq-iosched

A couple of lingering issues remain in this patchset; it's meant as an RFC to
discuss the overall design for tracking buffered writes. I still have a
couple of patches to finish to make absolutely sure that refcounts and
locking are handled properly; I just need to do more testing.

No other patches were applied to Jens' tree when testing this code.

TODOs:
- Make sure we run sync as part of "exec_prerun" for fio testing
- Find a way to not use cmpxchg() to store data in page_cgroup->flags.
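
To illustrate what the last TODO refers to (a blkio id packed into the upper
bits of the existing page_cgroup "flags" word, updated with cmpxchg so that
concurrent flag updates are not lost), here is a minimal userspace sketch.
The bit widths and helper names are invented for the example; the actual
kernel implementation is in the series itself and is not reproduced here.

#include <stdio.h>
#include <stdatomic.h>

#define PC_FLAGS_BITS	16			/* low bits keep ordinary flags */
#define PC_ID_SHIFT	PC_FLAGS_BITS
#define PC_ID_MASK	(~0UL << PC_ID_SHIFT)	/* high bits hold the blkio id */

static void set_blkio_id(_Atomic unsigned long *flags, unsigned long id)
{
	unsigned long old = atomic_load(flags);
	unsigned long new;

	/* Retry until no other writer raced with us on the flags word. */
	do {
		new = (old & ~PC_ID_MASK) | (id << PC_ID_SHIFT);
	} while (!atomic_compare_exchange_weak(flags, &old, new));
}

static unsigned long get_blkio_id(_Atomic unsigned long *flags)
{
	return atomic_load(flags) >> PC_ID_SHIFT;
}

int main(void)
{
	_Atomic unsigned long flags = 0x3;	/* pretend two flag bits are set */

	set_blkio_id(&flags, 42);
	printf("id=%lu, low flag bits=%#lx\n",
	       get_blkio_id(&flags), atomic_load(&flags) & ~PC_ID_MASK);
	return 0;
}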

Documentation/block/biodoc.txt | 10 +
block/blk-cgroup.c | 203 +++++++++++++++++-
block/blk-cgroup.h | 9 +-
block/blk-core.c | 218 +++++++++++++------
block/blk-flush.c | 2 +
block/blk-settings.c | 2 +-
block/blk-sysfs.c | 59 +++---
block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++----------
block/cfq.h | 6 +-
block/elevator.c | 7 +-
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 81 +++++++-
include/linux/blkio-track.h | 89 ++++++++
include/linux/elevator.h | 14 +-
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 38 +++-
init/Kconfig | 16 ++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 6 +
mm/page-writeback.c | 14 +-
mm/page_cgroup.c | 29 ++-
mm/swap_state.c | 2 +
29 files changed, 1068 insertions(+), 240 deletions(-)

0001-cfq-iosched-add-symmetric-reference-wrappers.patch
0002-block-fs-mm-IO-cgroup-tracking-for-buffered-write.patch
0003-cfq-iosched-Make-async-queues-per-cgroup.patch
0004-block-Modify-CFQ-to-use-IO-tracking-information.patch
0005-cfq-Fix-up-tracked-async-workload-length.patch
0006-cfq-add-per-cgroup-writeout-done-by-flusher-stat.patch
0007-block-Per-cgroup-request-descriptor-counts.patch
0008-cfq-Don-t-allow-preemption-across-cgroups.patch

===================================== Isolation experiment results

For isolation testing, we run a test that's available at:
git://google3-2.osuosl.org/tests/blkcgroup.git

This test creates multiple containers, assigns weights to devices, and checks
how closely the values reported by blkio.time match the requested weights.
This is used as a measure of how effective the isolation provided by the
Linux kernel is.

For example, "900 wrseq.buf*2, 100 wrseq.buf*2", means that we create two
cgroups,
- One with weight_device 900 for a given device
- One with weight_device 100 for the same device
...Then, in the cgroups, we run an identical workload, performing buffered
writes with two processes.
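
To make the pass/fail criterion concrete, here is a small sketch of how the
requested weights and the observed blkio.time can both be normalized to a
0-1000 "DTF" scale and compared. This is my reading of the test's behavior,
not the blkcgroup test source; the numbers reuse experiment 13's "before"
results, with the reported DTFs standing in for raw blkio.time values.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* experiment 13 ("before" run below): weights 900/100 on one device */
	long weight[]  = { 900, 100 };
	long io_time[] = { 735, 264 };		/* stand-in for blkio.time */
	int i, n = 2;
	long wsum = 0, tsum = 0, max_err = 0;

	for (i = 0; i < n; i++) {
		wsum += weight[i];
		tsum += io_time[i];
	}
	for (i = 0; i < n; i++) {
		long want = weight[i] * 1000 / wsum;	/* requested DTF */
		long got  = io_time[i] * 1000 / tsum;	/* achieved DTF */
		long err  = labs(want - got);

		if (err > max_err)
			max_err = err;
		printf("cgroup %d: requested %ld, achieved %ld\n", i, want, got);
	}
	printf("max observed error: %ld (allowed: 150)\n", max_err);
	return 0;
}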

I've filtered the lines below to only mention experiments that involve buffered
writers, since that should be all that is affected by this patch.

All performance numbers below are with the ext2 filesystem.

Before patches
==============
----- Running experiment 12: 500 wrseq.buf*2, 500 wrseq.buf*2
experiment 12 achieved DTFs: 472, 527
experiment 12 PASSED: max observed error is 28, allowed is 150
----- Running experiment 13: 900 wrseq.buf*2, 100 wrseq.buf*2
experiment 13 achieved DTFs: 735, 264
experiment 13 FAILED: max observed error is 165, allowed is 150
----- Running experiment 14: 100 wrseq.buf*2, 900 wrseq.buf*2
experiment 14 achieved DTFs: 312, 687
experiment 14 FAILED: max observed error is 213, allowed is 150
----- Running experiment 15: 600 wrseq.buf*2, 200 wrseq.buf*2, 200 wrseq.buf*2
experiment 15 achieved DTFs: 443, 151, 405
experiment 15 FAILED: max observed error is 205, allowed is 150
----- Running experiment 16: 650 wrseq.buf*2, 100 wrseq.buf*2, 100 wrseq.buf*2, 150 wrseq.buf*2
experiment 16 achieved DTFs: 341, 365, 97, 195
experiment 16 FAILED: max observed error is 309, allowed is 150
----- Running experiment 17: 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 160 wrseq.buf*2
experiment 17 achieved DTFs: 192, 64, 100, 128, 160, 128, 224
experiment 17 PASSED: max observed error is 76, allowed is 150
----- Running experiment 27: 500 rdrand, 500 wrseq.buf*2
experiment 27 achieved DTFs: 985, 14
experiment 27 FAILED: max observed error is 486, allowed is 150
----- Running experiment 28: 900 rdrand, 100 wrseq.buf*2
experiment 28 achieved DTFs: 991, 8
experiment 28 PASSED: max observed error is 92, allowed is 150
----- Running experiment 29: 100 rdrand, 900 wrseq.buf*2
experiment 29 achieved DTFs: 961, 38
experiment 29 FAILED: max observed error is 862, allowed is 150


After patches
=============
----- Running experiment 12: 500 wrseq.buf*2, 500 wrseq.buf*2
experiment 12 achieved DTFs: 499, 500
experiment 12 PASSED: max observed error is 1, allowed is 150
----- Running experiment 13: 900 wrseq.buf*2, 100 wrseq.buf*2
experiment 13 achieved DTFs: 864, 135
experiment 13 PASSED: max observed error is 36, allowed is 150
----- Running experiment 14: 100 wrseq.buf*2, 900 wrseq.buf*2
experiment 14 achieved DTFs: 113, 886
experiment 14 PASSED: max observed error is 14, allowed is 150
----- Running experiment 15: 600 wrseq.buf*2, 200 wrseq.buf*2, 200 wrseq.buf*2
experiment 15 achieved DTFs: 593, 204, 201
experiment 15 PASSED: max observed error is 7, allowed is 150
----- Running experiment 16: 650 wrseq.buf*2, 100 wrseq.buf*2, 100 wrseq.buf*2, 150 wrseq.buf*2
experiment 16 achieved DTFs: 608, 113, 111, 166
experiment 16 PASSED: max observed error is 42, allowed is 150
----- Running experiment 17: 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 140 wrseq.buf*2, 160 wrseq.buf*2
experiment 17 achieved DTFs: 140, 139, 139, 139, 140, 141, 159
experiment 17 PASSED: max observed error is 1, allowed is 150
----- Running experiment 27: 500 rdrand, 500 wrseq.buf*2
experiment 27 achieved DTFs: 512, 487
experiment 27 PASSED: max observed error is 13, allowed is 150
----- Running experiment 28: 900 rdrand, 100 wrseq.buf*2
experiment 28 achieved DTFs: 869, 130
experiment 28 PASSED: max observed error is 31, allowed is 150
----- Running experiment 29: 100 rdrand, 900 wrseq.buf*2
experiment 29 achieved DTFs: 131, 868
experiment 29 PASSED: max observed error is 32, allowed is 150

With the patches applied and the workload
"100 wrseq.buf, 200 wrseq.buf, 300 wrseq.buf, 400 wrseq.buf", we confirm
how the test works by providing data from /dev/cgroup:

/dev/cgroup/blkcgroupt0/blkio.sectors
8:16 510040
8:0 376

/dev/cgroup/blkcgroupt1/blkio.sectors
8:16 941040

/dev/cgroup/blkcgroupt2/blkio.sectors
8:16 1224456
8:0 8

/dev/cgroup/blkcgroupt3/blkio.sectors
8:16 1509576
8:0 152

/dev/cgroup/blkcgroupt0/blkio.time
8:16 2651
8:0 20

/dev/cgroup/blkcgroupt1/blkio.time
8:16 5200

/dev/cgroup/blkcgroupt2/blkio.time
8:16 7350
8:0 8

/dev/cgroup/blkcgroupt3/blkio.time
8:16 9591
8:0 20

/dev/cgroup/blkcgroupt0/blkio.weight_device
8:16 100

/dev/cgroup/blkcgroupt1/blkio.weight_device
8:16 200

/dev/cgroup/blkcgroupt2/blkio.weight_device
8:16 300

/dev/cgroup/blkcgroupt3/blkio.weight_device
8:16 400
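
As a quick check on the 8:16 numbers above: the four blkio.time values sum to
24792, which puts the per-cgroup shares at roughly 107, 210, 296 and 387 per
mille, closely tracking the requested 100/200/300/400 weight_device ratios.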

Summary
=======
Isolation between buffered writers is clearly better with these patches.

"Error" is much lower with the patches, showing that blkio.time tracks the
requested weights much more closely than without them.


=============================== Read latency results
To test read latency, I created two containers:
- One called "readers", with weight 900
- One called "writers", with weight 100

Adding "sync;" to the exec_prerun= line causes fio to have some errors; I'll
continue to investigate this.

I ran this fio workload in "readers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-read]
description="reader"
numjobs=16
rw=randread
new_group=1


...and this fio workload in "writers":
[global]
directory=/mnt/iostestmnt/fio
runtime=30
time_based=1
group_reporting=1
exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
cgroup_nodelete=1
bs=4K
size=512M

[iostest-write]
description="writer"
cgroup=writers
numjobs=3
rw=write
new_group=1

I've pasted the results from the "read" workload inline.

Before patches
==============
Starting 16 processes

Jobs: 14 (f=14): [_rrrrrr_rrrrrrrr] [36.2% done] [352K/0K /s] [86 /0 iops] [eta 01m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid=20606
Description : ["reader"]
read : io=13532KB, bw=455814 B/s, iops=111 , runt= 30400msec
clat (usec): min=2190 , max=30399K, avg=30395175.13, stdev= 0.20
lat (usec): min=2190 , max=30399K, avg=30395177.07, stdev= 0.20
bw (KB/s) : min= 0, max= 260, per=0.00%, avg= 0.00, stdev= 0.00
cpu : usr=0.00%, sys=0.03%, ctx=3691, majf=2, minf=468
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=3383/0/0, short=0/0/0

lat (msec): 4=0.03%, 10=2.66%, 20=74.84%, 50=21.90%, 100=0.09%
lat (msec): 250=0.06%, >=2000=0.41%

Run status group 0 (all jobs):
READ: io=13532KB, aggrb=445KB/s, minb=455KB/s, maxb=455KB/s, mint=30400msec, maxt=30400msec

Disk stats (read/write):
sdb: ios=3744/18, merge=0/16, ticks=542713/1675, in_queue=550714, util=99.15%

iostest-write: (groupid=0, jobs=3): err= 0: pid=20654
Description : ["writer"]
write: io=282444KB, bw=9410.5KB/s, iops=2352 , runt= 30014msec
clat (usec): min=3 , max=28921K, avg=1468.79, stdev=108833.25
lat (usec): min=3 , max=28921K, avg=1468.89, stdev=108833.25
bw (KB/s) : min= 101, max= 5448, per=21.39%, avg=2013.25, stdev=1322.76
cpu : usr=0.11%, sys=0.41%, ctx=77, majf=0, minf=81
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/70611/0, short=0/0/0
lat (usec): 4=0.65%, 10=95.27%, 20=1.39%, 50=2.58%, 100=0.01%
lat (usec): 250=0.01%
lat (msec): 2=0.01%, 4=0.01%, 10=0.04%, 20=0.02%, 100=0.01%
lat (msec): 250=0.01%, 500=0.01%, 750=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=282444KB, aggrb=9410KB/s, minb=9636KB/s, maxb=9636KB/s,
mint=30014msec, maxt=30014msec

Disk stats (read/write):
sdb: ios=3716/0, merge=0/0, ticks=157011/0, in_queue=506264, util=99.09%


After patches
=============
Starting 16 processes
Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [100.0% done] [557K/0K /s] [136 /0 iops] [eta 00m:00s]
iostest-read: (groupid=0, jobs=16): err= 0: pid=14183
Description : ["reader"]
read : io=14940KB, bw=506105 B/s, iops=123 , runt= 30228msec
clat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
lat (msec): min=2 , max=29866 , avg=463.42, stdev=101.84
bw (KB/s) : min= 0, max= 198, per=31.69%, avg=156.52, stdev=17.83
cpu : usr=0.01%, sys=0.03%, ctx=4274, majf=2, minf=464
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=3735/0/0, short=0/0/0

lat (msec): 4=0.05%, 10=0.32%, 20=32.99%, 50=64.61%, 100=1.26%
lat (msec): 250=0.11%, 500=0.11%, 750=0.16%, 1000=0.05%, >=2000=0.35%

Run status group 0 (all jobs):
READ: io=14940KB, aggrb=494KB/s, minb=506KB/s, maxb=506KB/s, mint=30228msec, maxt=30228msec

Disk stats (read/write):
sdb: ios=4189/0, merge=0/0, ticks=96428/0, in_queue=478798, util=100.00%

Jobs: 3 (f=3): [WWW] [100.0% done] [0K/0K /s] [0 /0 iops] [eta 00m:00s]
iostest-write: (groupid=0, jobs=3): err= 0: pid=14178
Description : ["writer"]
write: io=90268KB, bw=3004.9KB/s, iops=751 , runt= 30041msec
clat (usec): min=3 , max=29612K, avg=4086.42, stdev=197096.83
lat (usec): min=3 , max=29612K, avg=4086.53, stdev=197096.83
bw (KB/s) : min= 956, max= 1092, per=32.58%, avg=978.67, stdev= 0.00
cpu : usr=0.03%, sys=0.14%, ctx=44, majf=1, minf=83
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/22567/0, short=0/0/0
lat (usec): 4=1.06%, 10=94.20%, 20=2.11%, 50=2.50%, 100=0.01%
lat (usec): 250=0.01%
lat (msec): 10=0.04%, 20=0.03%, 50=0.01%, 250=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=90268KB, aggrb=3004KB/s, minb=3076KB/s, maxb=3076KB/s,
mint=30041msec, maxt=30041msec

Disk stats (read/write):
sdb: ios=4158/0, merge=0/0, ticks=95747/0, in_queue=475051, util=100.00%


Summary
=======
Read latencies are a bit worse, but this overhead is only imposed when users
ask for this feature by turning on CONFIG_CGROUP_BLKIOTRACK. We expect there
to be something of a latency vs. isolation tradeoff.

The latency impact for reads can be seen in this table:

  msec   Baseline   W/Patches
     4       0.03        0.05
    10       2.69        0.37
    20      77.53       33.36
    50      99.43       97.97
   100      99.52       99.23
   250      99.58       99.34
   inf      99.99      100.01
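
That table matches the running sum of the per-bucket "lat (msec)" percentages
in the fio output above (the left column is the bucket upper bound in
milliseconds, with the >=2000 bucket folded into the "inf" row). A trivial
check against the baseline read job:

#include <stdio.h>

int main(void)
{
	/* baseline read job, "lat (msec)" buckets from the fio output above */
	double bucket[] = { 0.03, 2.66, 74.84, 21.90, 0.09, 0.06, 0.41 };
	const char *bound[] = { "4", "10", "20", "50", "100", "250", "inf" };
	int i, n = (int)(sizeof(bucket) / sizeof(bucket[0]));
	double cum = 0.0;

	for (i = 0; i < n; i++) {
		cum += bucket[i];
		printf("%4s %7.2f\n", bound[i], cum);
	}
	return 0;
}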


2011-03-30 16:51:13

by Justin TerAvest

Subject: [PATCH v3 5/8] cfq: Fix up tracked async workload length.

This effectively reverts commit f26bd1f0a3a31bc5e16d285f5e1b00a56abf6238
("blkio: Determine async workload length based on total number of queues")
in the case when async IO tracking is enabled.

That commit was used to work around the fact that async queues were part of
the root cgroup. That is no longer the case when we have async write
tracking enabled.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 90 +++++++++++++++++++++++++++++++-------------------
1 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c75bbbf..1b315c3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -448,13 +448,6 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
}

-static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
- struct cfq_group *cfqg)
-{
- return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
- + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
-}
-
static void cfq_dispatch_insert(struct request_queue *, struct request *);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
struct io_context *, gfp_t);
@@ -1014,27 +1007,50 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq,
return slice_used;
}

-static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
- struct cfq_queue *cfqq)
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+static inline int cfqg_busy_async_queues(struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
- unsigned int used_sl, charge, unaccounted_sl = 0;
- int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
+ return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
+ + cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+}
+#endif
+
+static void cfq_charge_slice(struct cfq_rb_root *st, struct cfq_group *cfqg,
+ struct cfq_queue *cfqq, unsigned int used_sl)
+{
+ struct cfq_data *cfqd = cfqq->cfqd;
+ unsigned int charge = used_sl;
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+ int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqg)
- cfqg->service_tree_idle.count;

BUG_ON(nr_sync < 0);
- used_sl = charge = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);

if (iops_mode(cfqd))
charge = cfqq->slice_dispatch;
else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
charge = cfqq->allocated_slice;
+#endif

/* Can't update vdisktime while group is on service tree */
cfq_group_service_tree_del(st, cfqg);
cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
/* If a new weight was requested, update now, off tree */
+ cfq_update_group_weight(cfqg);
cfq_group_service_tree_add(st, cfqg);
+ cfq_log_cfqq(cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+ " sect=%u", used_sl, cfqq->slice_dispatch, charge,
+ iops_mode(cfqd), cfqq->nr_sectors);
+}
+
+static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ struct cfq_queue *cfqq)
+{
+ struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ unsigned int used_sl, unaccounted_sl = 0;
+
+ used_sl = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);
+ cfq_charge_slice(st, cfqg, cfqq, used_sl);

/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1047,9 +1063,6 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
- " sect=%u", used_sl, cfqq->slice_dispatch, charge,
- iops_mode(cfqd), cfqq->nr_sectors);
cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl,
unaccounted_sl);
cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
@@ -2215,6 +2228,30 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
return cur_best;
}

+static unsigned cfq_async_slice(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ unsigned slice)
+{
+#ifndef CONFIG_CGROUP_BLKIOTRACK
+ unsigned tmp;
+ /*
+ * Async queues are currently system wide. Just taking
+ * proportion of queues with-in same group will lead to higher
+ * async ratio system wide as generally root group is going
+ * to have higher weight. A more accurate thing would be to
+ * calculate system wide asnc/sync ratio.
+ */
+ tmp = cfq_target_latency * cfqg_busy_async_queues(cfqg);
+ tmp = tmp/cfqd->busy_queues;
+ slice = min_t(unsigned, slice, tmp);
+#endif
+ /*
+ * async workload slice is scaled down according to
+ * the sync/async slice ratio.
+ */
+ slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
+ return slice;
+}
+
static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
unsigned slice;
@@ -2269,24 +2306,9 @@ new_workload:
max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

- if (cfqd->serving_type == ASYNC_WORKLOAD) {
- unsigned int tmp;
-
- /*
- * Async queues are currently system wide. Just taking
- * proportion of queues with-in same group will lead to higher
- * async ratio system wide as generally root group is going
- * to have higher weight. A more accurate thing would be to
- * calculate system wide asnc/sync ratio.
- */
- tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
- tmp = tmp/cfqd->busy_queues;
- slice = min_t(unsigned, slice, tmp);
-
- /* async workload slice is scaled down according to
- * the sync/async slice ratio. */
- slice = slice * cfqd->cfq_slice[0] / cfqd->cfq_slice[1];
- } else
+ if (cfqd->serving_type == ASYNC_WORKLOAD)
+ slice = cfq_async_slice(cfqd, cfqg, slice);
+ else
/* sync workload slice is at least 2 * cfq_slice_idle */
slice = max(slice, 2 * cfqd->cfq_slice_idle);

--
1.7.3.1

2011-03-30 16:51:28

by Justin TerAvest

Subject: [PATCH v3 4/8] block: Modify CFQ to use IO tracking information.

IO tracking bits are used to send IOs to the right async queue. The current
task is still used to identify the cgroup for synchronous IO, and is also
used if IO tracking is disabled.

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 6 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 175 +++++++++++++++++++++++++++++++++++++++++-----
block/elevator.c | 5 +-
include/linux/elevator.h | 6 +-
5 files changed, 172 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f626c65..9732cfd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -111,6 +111,9 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,

struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &blkio_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
struct blkio_cgroup, css);
}
@@ -1550,6 +1553,7 @@ unsigned long get_blkio_cgroup_id(struct bio *bio)
id = page_cgroup_get_blkio_id(pc);
return id;
}
+EXPORT_SYMBOL(get_blkio_cgroup_id);

/**
* get_cgroup_from_page() - determine the cgroup from a page.
@@ -1576,8 +1580,6 @@ struct cgroup *get_cgroup_from_page(struct page *page)

return css->cgroup;
}
-
-EXPORT_SYMBOL(get_blkio_cgroup_id);
EXPORT_SYMBOL(get_cgroup_from_page);

#endif /* CONFIG_CGROUP_BLKIOTRACK */
diff --git a/block/blk-core.c b/block/blk-core.c
index 5256932..1b7936bf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -583,7 +583,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}

static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);

@@ -595,7 +596,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;

if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -757,7 +758,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);

- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 011d268..c75bbbf 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include <linux/blkio-track.h>
#include "cfq.h"

/*
@@ -309,6 +310,10 @@ static void cfq_get_queue_ref(struct cfq_queue *cfqq);
static void cfq_put_queue_ref(struct cfq_queue *cfqq);

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create);
+static struct cfq_queue **
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
@@ -451,7 +456,7 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
@@ -463,9 +468,55 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[is_sync];
}

+/*
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and iprio_class from task
+ * submitting async bio. Later save the task information in the page_cgroup
+ * and retrieve task's ioprio and class from there.
+ */
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
+{
+ struct cfq_queue *cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct cfq_group *cfqg;
+ struct cfq_queue **async_cfqq;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+ if (!cfqg) {
+ /*
+ * May be this is first rq/bio and io group has not
+ * been setup yet.
+ */
+ return NULL;
+ }
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+ return *async_cfqq;
+ }
+#endif
+ return cfqq;
+}
+
static inline void cic_set_cfqq(struct cfq_io_context *cic,
struct cfq_queue *cfqq, bool is_sync)
{
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ /*
+ * Don't cache async queue pointer as now one io context might
+ * be submitting async io for various different async queues
+ */
+ if (!is_sync)
+ return;
+#endif
cic->cfqq[is_sync] = cfqq;
}

@@ -1032,7 +1083,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
unsigned int major, minor;

cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ if (!bdi || !bdi->dev || !dev_name(bdi->dev))
+ goto done;
+ if (cfqg && !cfqg->blkg.dev) {
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
cfqg->blkg.dev = MKDEV(major, minor);
goto done;
@@ -1083,20 +1136,61 @@ done:
* Search for the cfq group current task belongs to. If create = 1, then also
* create the cfq group if it does not exist. request_queue lock must be held.
*/
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
+ int create)
{
struct cgroup *cgroup;
struct cfq_group *cfqg = NULL;

rcu_read_lock();
- cgroup = task_cgroup(current, blkio_subsys_id);
+
+ if (!page)
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ else
+ cgroup = get_cgroup_from_page(page);
+
+ if (!cgroup) {
+ cfqg = &cfqd->root_group;
+ goto out;
+ }
+
cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
if (!cfqg && create)
cfqg = &cfqd->root_group;
+out:
rcu_read_unlock();
return cfqg;
}

+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create)
+{
+ struct page *page = NULL;
+
+ /*
+ * Determine the group from task context. Even calls from
+ * blk_get_request() which don't have any bio info will be mapped
+ * to the task's group
+ */
+ if (!bio)
+ goto sync;
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!bio_has_data(bio))
+ goto sync;
+
+ /* Map the sync bio to the right group using task context */
+ if (cfq_bio_sync(bio))
+ goto sync;
+
+ /* Determine the group from info stored in page */
+ page = bio_iovec_idx(bio, 0)->bv_page;
+#endif
+
+sync:
+ return cfq_get_cfqg(cfqd, page, create);
+}
+
static void cfq_get_group_ref(struct cfq_group *cfqg)
{
cfqg->ref++;
@@ -1180,6 +1274,11 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)

#else /* GROUP_IOSCHED */

+static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
+ struct bio *bio, int create)
+{
+}
+
static void cfq_get_group_ref(struct cfq_group *cfqg)
{
}
@@ -1491,7 +1590,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);

@@ -1615,7 +1714,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return false;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, cfq_bio_sync(bio));
return cfqq == RQ_CFQQ(rq);
}

@@ -2848,14 +2947,10 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(cfqd->queue->queue_lock, flags);

cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
- struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
- GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue_ref(cfqq);
- }
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue_ref(cfqq);
}

cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -2895,6 +2990,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
{
struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
struct cfq_data *cfqd = cic_to_cfqd(cic);
unsigned long flags;
struct request_queue *q;
@@ -2916,6 +3012,12 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
cfq_put_queue_ref(sync_cfqq);
}

+ if (async_cfqq != NULL) {
+ cfq_log_cfqq(cfqd, async_cfqq, "changed cgroup");
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue_ref(async_cfqq);
+ }
+
spin_unlock_irqrestore(q->queue_lock, flags);
}

@@ -2938,6 +3040,24 @@ retry:
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct cfq_queue **async_cfqq;
+
+ /*
+ * We have not cached async queue pointer as bio tracking
+ * is enabled. Look into group async queue array using ioc
+ * class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class, ioprio);
+ cfqq = *async_cfqq;
+ }
+#endif
+
/*
* Always try a new alloc if we fell back to the OOM cfqq
* originally, since it should just be a temporary situation.
@@ -2992,14 +3112,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
}

static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3018,7 +3138,25 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
*async_cfqq = cfqq;
}

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to async queue and everytime a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ */
+ if (is_sync)
+ cfq_get_queue_ref(cfqq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
+
cfq_get_queue_ref(cfqq);
+#endif
return cfqq;
}

@@ -3686,7 +3824,8 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
* Allocate cfq data structures associated with this request.
*/
static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+ gfp_t gfp_mask)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_io_context *cic;
@@ -3707,7 +3846,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
} else {
/*
diff --git a/block/elevator.c b/block/elevator.c
index c387d31..9edaeb5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -770,12 +770,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}

-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;

if (e->ops->elevator_set_req_fn)
- return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+ return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);

rq->elevator_private[0] = NULL;
return 0;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d93efcc445..c3a884c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -24,7 +24,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
typedef void (elevator_put_req_fn) (struct request *);
typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -117,7 +118,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);

--
1.7.3.1

2011-03-30 16:51:30

by Justin TerAvest

Subject: [PATCH v3 7/8] block: Per cgroup request descriptor counts

If a cgroup gets starved, it keeps allocating request descriptors until it
hits the per-queue limit, and other cgroups can in turn be starved of
request descriptors, which is a problem. This patch implements per-cgroup
request descriptor limits.

There has been discussion about removing limits entirely, but per-cgroup
limits are introduced here to enable tracking for buffered writes without
causing problems with the per-request-queue limit.

Some interesting corner cases are also covered. During an elevator switch,
we start counting request descriptors towards the request list of the root
group and drain all the requests that were started before the switch. If new
requests arrive while the switch is in progress, they are counted towards the
common, shared request list of the root group. As these requests complete,
the counter in the common request list is decremented properly.

Signed-off-by: Justin TerAvest <[email protected]>
---
Documentation/block/biodoc.txt | 10 ++
block/blk-core.c | 211 +++++++++++++++++++++++++++-------------
block/blk-flush.c | 2 +
block/blk-settings.c | 2 +-
block/blk-sysfs.c | 59 ++++++-----
block/cfq-iosched.c | 61 +++++++++++-
block/elevator.c | 2 +-
include/linux/blkdev.h | 81 +++++++++++++++-
include/linux/elevator.h | 8 ++
9 files changed, 339 insertions(+), 97 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 2a7b38c..856d706 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -970,6 +970,16 @@ elevator_latter_req_fn These return the request before or after the

elevator_completed_req_fn called when a request is completed.

+elevator_req_list_fn called to obtain the active request list for
+ either a bio or a request. This function can
+ be called with either a valid bio or request.
+ This function must be defined only if the
+ scheduler supports per cgroup request lists.
+ When defined, it becomes the responsibility
+ of the scheduler to call blk_alloced_request
+ after the request and queue are allocated and
+ the cgroup is declared.
+
elevator_may_queue_fn returns true if the scheduler wants to allow the
current context to queue a new request even if
it is over the queue limit. This must be used
diff --git a/block/blk-core.c b/block/blk-core.c
index 1b7936bf..465df36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -381,23 +381,28 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);

-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (unlikely(rl->rq_pool))
- return 0;
-
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}

- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+ if (unlikely(q->rq_pool.rq_pool))
+ return 0;
+
+ blk_init_request_list(&q->rq);

- if (!rl->rq_pool)
+ q->rq_pool.count[BLK_RW_SYNC] = 0;
+ q->rq_pool.count[BLK_RW_ASYNC] = 0;
+ q->rq_pool.elvpriv = 0;
+ q->rq_pool.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);
+ if (!q->rq_pool.rq_pool)
return -ENOMEM;

return 0;
@@ -579,14 +584,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)

if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_pool.rq_pool);
}

static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_pool.rq_pool, gfp_mask);

if (!rq)
return NULL;
@@ -596,11 +601,11 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
rq->cmd_flags = flags | REQ_ALLOCED;

if (priv) {
+ rq->cmd_flags |= REQ_ELVPRIV;
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_pool.rq_pool);
return NULL;
}
- rq->cmd_flags |= REQ_ELVPRIV;
}

return rq;
@@ -640,38 +645,67 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}

-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_pool.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);

- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_pool.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}

/*
* A request has just been released. Account for it, update the full and
- * congestion status, wake up any waiters. Called under q->queue_lock.
+ * congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
+ if (priv) {
+ q->rq_pool.elvpriv--;
+ BUG_ON(!rl->count[sync]);
+ rl->count[sync]--;
+ }

- rl->count[sync]--;
- if (priv)
- rl->elvpriv--;
+ BUG_ON(!q->rq_pool.count[sync]);
+ q->rq_pool.count[sync]--;

- __freed_request(q, sync);
+ __freed_request(q, sync, rl);

if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+}
+
+void blk_alloced_request(struct request_queue *q, struct request_list *rl,
+ int sync)
+{
+ int priv;
+
+ q->rq_pool.count[sync]++;
+ rl->starved[sync] = 0;
+
+ priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+ if (priv) {
+ q->rq_pool.elvpriv++;
+ /*
+ * Account the request to request list only if request is
+ * going to elevator. During elevator switch, there will
+ * be small window where group is going away and new group
+ * will not be allocated till elevator switch is complete.
+ * So till then instead of slowing down the application,
+ * we will continue to allocate request from total common
+ * pool instead of per group limit
+ */
+ rl->count[sync]++;
+ }
}
+EXPORT_SYMBOL(blk_alloced_request);

/*
* Determine if elevator data should be initialized when allocating the
@@ -698,10 +732,10 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask,
+ struct request_list *rl)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv = 0;
@@ -710,31 +744,41 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;

- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
+ if (q->rq_pool.count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
+ blk_add_trace_msg(q, "Queue congested: setting congested flag");
+ blk_set_queue_congested(q, is_sync);
+ }
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for time being.
+ */
+ if (q->rq_pool.count[is_sync]+1 >= q->nr_requests) {
+ blk_add_trace_msg(q, "Queue congested: setting full flag");
+ blk_set_queue_full(q, is_sync);
+ }
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The queue request descriptor group will fill after this
+ * allocation, so mark this process as "batching".
+ * This process will be allowed to complete a batch of
+ * requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
/*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
*/
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
- }
+ blk_add_trace_msg(q, "Queue congested: "
+ "not allocating request");
+ goto out;
}
- blk_set_queue_congested(q, is_sync);
}

/*
@@ -742,18 +786,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2)) {
+ blk_add_trace_msg(q, "50 percent over limit, not allocating");
goto out;
+ }

- rl->count[is_sync]++;
- rl->starved[is_sync] = 0;

+ priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (blk_rq_should_init_elevator(bio)) {
- priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
- rl->elvpriv++;
+ q->rq_pool.elvpriv++;
}

+ /*
+ * If the scheduler supports per cgroup request lists it will call
+ * blk_alloced_request after the request and queue is allocated and
+ * the cgroup has been decided.
+ */
+ if (!blk_supports_cgroups(q) || !priv)
+ blk_alloced_request(q, rl, is_sync);
+
if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);
@@ -768,7 +820,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ if (!blk_supports_cgroups(q) || !priv)
+ freed_request(q, is_sync, priv, rl);

/*
* in the very unlikely event that allocation failed and no
@@ -778,9 +831,11 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
+ if (unlikely(rl->count[is_sync] == 0)) {
+ blk_add_trace_msg(q, "Queue congested: "
+ "marking %d starved", is_sync);
rl->starved[is_sync] = 1;
-
+ }
goto out;
}

@@ -809,16 +864,23 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
while (!rq) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;

+ /*
+ * We are about to sleep on a request list and we
+ * drop queue lock. After waking up, we will do
+ * finish_wait() on request list and in the mean
+ * time group might be gone. Take a reference to
+ * the group now.
+ */
prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
TASK_UNINTERRUPTIBLE);
-
+ blk_get_rl_group(q, rl);
trace_block_sleeprq(q, bio, rw_flags & 1);

spin_unlock_irq(q->queue_lock);
@@ -836,7 +898,18 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
spin_lock_irq(q->queue_lock);
finish_wait(&rl->wait[is_sync], &wait);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ /*
+ * We had taken a reference to the request list goup.
+ * Put that now
+ */
+ blk_put_rl_group(q, rl);
+
+ /*
+ * After the sleep check the rl again in case the group the bio
+ * belonged to is gone and it is mapped to root group now
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
};

return rq;
@@ -845,6 +918,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl;

BUG_ON(rw != READ && rw != WRITE);

@@ -852,7 +926,8 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rl = blk_get_request_list(q, NULL);
+ rq = get_request(q, rw, NULL, gfp_mask, rl);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1059,12 +1134,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);

BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));
+ BUG_ON(!rl);

+ freed_request(q, is_sync, priv, rl);
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 93d5fd8..675d7b6 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -122,6 +122,8 @@ static void blk_flush_restore_request(struct request *rq)

/* make @rq a normal request */
rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+ /* Don't free request descriptors twice. */
+ rq->cmd_flags &= ~REQ_ALLOCED;
rq->end_io = NULL;
}

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1fa7692..a3039dd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -158,7 +158,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
-
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
q->make_request_fn = mfn;
blk_queue_dma_alignment(q, 511);
blk_queue_congestion_threshold(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 261c75c..88116df 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -39,7 +39,6 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
unsigned long nr;
int ret;

@@ -53,33 +52,31 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}

- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_SYNC);
-
- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_ASYNC);
-
- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
- blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, BLK_RW_SYNC);
- wake_up(&rl->wait[BLK_RW_SYNC]);
- }
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, page);
+}

- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
- blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, BLK_RW_ASYNC);
- wake_up(&rl->wait[BLK_RW_ASYNC]);
- }
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
spin_unlock_irq(q->queue_lock);
return ret;
}
+#endif

static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -271,6 +268,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -381,6 +386,9 @@ static struct queue_sysfs_entry queue_random_entry = {

static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -467,12 +475,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;

blk_sync_queue(q);

- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_pool.rq_pool)
+ mempool_destroy(q->rq_pool.rq_pool);

if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c885493..23caa79 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -214,6 +214,9 @@ struct cfq_group {
enum wl_prio_t saved_serving_prio;
struct blkio_group blkg;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ /* Request list associated with the group */
+ struct request_list rl;
+
struct hlist_node cfqd_node;
int ref;
#endif
@@ -1137,6 +1140,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
0);

cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ blk_init_request_list(&cfqg->rl);

/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -1224,6 +1228,7 @@ static void cfq_put_group_ref(struct cfq_group *cfqg)
return;
for_each_cfqg_st(cfqg, i, j, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ BUG_ON(cfqg->rl.count[0] | cfqg->rl.count[1]);
kfree(cfqg);
}

@@ -3145,7 +3150,13 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);
+ /* If cfqg has not been allocated during request list allocation,
+ do not create it now. Otherwise, request list counter could get
+ off-by one errors.
+ */
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 0);
+ if (!cfqg)
+ cfqg = &cfqd->root_group;

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3581,6 +3592,46 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
cfq_rq_enqueued(cfqd, cfqq, rq);
}

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+/*
+ * This function is essentially overloaded to return a request list
+ * given either a bio or a request. Only one of these is expected to
+ * be provided on any given invocation. The other must be NULL.
+ */
+static struct request_list *cfq_request_list(struct request_queue *q,
+ struct bio *bio,
+ struct request *rq)
+{
+ if (rq) {
+ struct cfq_queue *cfqq = RQ_CFQQ(rq);
+ return &cfqq->cfqg->rl;
+ } else {
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 1);
+
+ if (!cfqg)
+ cfqg = &cfqd->root_group;
+ return &cfqg->rl;
+ }
+}
+
+#define RL_CFQG(rl) container_of((rl), struct cfq_group, rl)
+
+static void cfq_get_rl_group(struct request_list *rl)
+{
+ struct cfq_group *cfqg = RL_CFQG(rl);
+
+ cfqg->ref++;
+}
+
+static void cfq_put_rl_group(struct request_list *rl)
+{
+ struct cfq_group *cfqg = RL_CFQG(rl);
+
+ cfq_put_group_ref(cfqg);
+}
+#endif
+
/*
* Update hw_tag based on peak queue depth over 50 samples under
* sufficient load.
@@ -3908,6 +3959,8 @@ new_queue:

cfq_get_group_ref(cfqq->cfqg);
rq->elevator_private[2] = cfqq->cfqg;
+
+ blk_alloced_request(q, rq_rl(q, rq), is_sync != 0);
spin_unlock_irqrestore(q->queue_lock, flags);
return 0;

@@ -4110,6 +4163,7 @@ static void *cfq_init_queue(struct request_queue *q)
*/
cfq_get_group_ref(cfqg);
rcu_read_lock();
+ blk_init_request_list(&cfqg->rl);
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
rcu_read_unlock();
@@ -4294,6 +4348,11 @@ static struct elevator_type iosched_cfq = {
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ .elevator_req_list_fn = cfq_request_list,
+ .elevator_get_rl_group_fn = cfq_get_rl_group,
+ .elevator_put_rl_group_fn = cfq_put_rl_group,
+#endif
.elevator_set_req_fn = cfq_set_request,
.elevator_put_req_fn = cfq_put_request,
.elevator_may_queue_fn = cfq_may_queue,
diff --git a/block/elevator.c b/block/elevator.c
index 9edaeb5..f427fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -641,7 +641,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_pool.elvpriv) {
__blk_run_queue(q, false);
spin_unlock_irq(q->queue_lock);
msleep(10);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 16a902f..a8802ce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,7 +32,18 @@ struct request;
struct sg_io_hdr;

#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 128 /* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ 128 /* Default per group maximum */
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is eqivalent to case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ
+#endif

struct request;
typedef void (rq_end_io_fn)(struct request *, int);
@@ -44,9 +55,21 @@ struct request_list {
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_pool {
+ /*
+ * Per-queue request descriptor count. This is in addition to the
+ * per-cgroup count.
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
};

/*
@@ -274,6 +297,11 @@ struct request_queue
*/
struct request_list rq;

+ /*
+ * Contains request pool and other data like overall request count
+ */
+ struct request_pool rq_pool;
+
request_fn_proc *request_fn;
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
@@ -330,6 +358,7 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -606,6 +635,53 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
}
#endif /* CONFIG_MMU */

+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+ int priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+
+ if (priv && e->ops->elevator_req_list_fn)
+ return e->ops->elevator_req_list_fn(q, bio, NULL);
+ return &q->rq;
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+ struct request *rq)
+{
+ struct elevator_queue *e = q->elevator;
+ int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+ if (priv && e->ops->elevator_req_list_fn)
+ return e->ops->elevator_req_list_fn(q, NULL, rq);
+ return &q->rq;
+}
+
+static inline void blk_get_rl_group(struct request_queue *q,
+ struct request_list *rl)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_get_rl_group_fn)
+ e->ops->elevator_get_rl_group_fn(rl);
+}
+
+static inline void blk_put_rl_group(struct request_queue *q,
+ struct request_list *rl)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->ops->elevator_put_rl_group_fn)
+ e->ops->elevator_put_rl_group_fn(rl);
+}
+
+static inline int blk_supports_cgroups(struct request_queue *q)
+{
+ struct elevator_queue *e = q->elevator;
+
+ return (e->ops->elevator_req_list_fn) ? 1 : 0;
+}
+
struct rq_map_data {
struct page **pages;
int page_order;
@@ -802,6 +878,7 @@ extern struct request_queue *blk_init_allocated_queue_node(struct request_queue
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
request_fn_proc *, spinlock_t *);
+extern void blk_init_request_list(struct request_list *rl);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
@@ -853,6 +930,8 @@ int blk_get_queue(struct request_queue *);
struct request_queue *blk_alloc_queue(gfp_t);
struct request_queue *blk_alloc_queue_node(gfp_t, int);
extern void blk_put_queue(struct request_queue *);
+extern void blk_alloced_request(struct request_queue *q,
+ struct request_list *rl, int sync);

struct blk_plug {
unsigned long magic;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c3a884c..767bade 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -21,6 +21,10 @@ typedef int (elevator_dispatch_fn) (struct request_queue *, int);

typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
+typedef struct request_list *(elevator_req_list_fn)
+ (struct request_queue *, struct bio *, struct request *);
+typedef void (elevator_get_rl_group_fn)(struct request_list *);
+typedef void (elevator_put_rl_group_fn)(struct request_list *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

@@ -50,10 +54,14 @@ struct elevator_ops

elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
+ elevator_req_list_fn *elevator_req_list_fn;

elevator_set_req_fn *elevator_set_req_fn;
elevator_put_req_fn *elevator_put_req_fn;

+ elevator_get_rl_group_fn *elevator_get_rl_group_fn;
+ elevator_put_rl_group_fn *elevator_put_rl_group_fn;
+
elevator_may_queue_fn *elevator_may_queue_fn;

elevator_init_fn *elevator_init_fn;
--
1.7.3.1

2011-03-30 16:51:43

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v3 6/8] cfq: add per cgroup writeout done by flusher stat

Tracking for buffered writes can detect when traffic comes from a
flusher thread, as opposed to directly from an application. This adds a
statistic to track I/O traffic from flusher threads.

This helps determine whether a flusher thread is being unfair to a
particular cgroup, and whether cgroup-based isolation of writeback behavior
is useful.
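
As a rough usage sketch, the new counter should show up as a per-device
cgroup file (likely blkio.oo_ctx_io_count, since the cftype name in the
diff below is "oo_ctx_io_count" and it sits in the CONFIG_DEBUG_BLK_CGROUP
block). The small userspace program below just dumps that file; the
/dev/cgroup mount point, the exact file name, and the "major:minor count"
line format are assumptions about the test environment, not part of this
patch.

    /* Hypothetical reader for the new stat; adjust paths as needed. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[512];
        char line[256];
        FILE *f;

        /* argv[1] names a blkio cgroup relative to the mount point. */
        snprintf(path, sizeof(path),
                 "/dev/cgroup/%s/blkio.oo_ctx_io_count",
                 argc > 1 ? argv[1] : "");
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);    /* "<major>:<minor> <count>" lines */
        fclose(f);
        return 0;
    }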

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 18 ++++++++++++++++-
block/blk-cgroup.h | 9 ++++++-
block/cfq-iosched.c | 47 ++++++++++++++++++++++++++------------------
block/cfq.h | 6 +++-
include/linux/blk_types.h | 2 +
5 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9732cfd..7b63030 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -412,7 +412,8 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);

void blkiocg_update_completion_stats(struct blkio_group *blkg,
- uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+ uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+ bool out_of_ctx)
{
struct blkio_group_stats *stats;
unsigned long flags;
@@ -426,6 +427,8 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
if (time_after64(io_start_time, start_time))
blkio_add_stat(stats->stat_arr[BLKIO_STAT_WAIT_TIME],
io_start_time - start_time, direction, sync);
+ if (out_of_ctx)
+ blkg->stats.oo_ctx_io_count++;
spin_unlock_irqrestore(&blkg->stats_lock, flags);
}
EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
@@ -620,6 +623,9 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
blkg->stats.unaccounted_time, cb, dev);
#ifdef CONFIG_DEBUG_BLK_CGROUP
+ if (type == BLKIO_STAT_OO_CTX_IO_COUNT)
+ return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+ blkg->stats.oo_ctx_io_count, cb, dev);
if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
uint64_t sum = blkg->stats.avg_queue_size_sum;
uint64_t samples = blkg->stats.avg_queue_size_samples;
@@ -1159,6 +1165,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
case BLKIO_PROP_empty_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
BLKIO_STAT_EMPTY_TIME, 0);
+ case BLKIO_PROP_oo_ctx_io_count:
+ return blkio_read_blkg_stats(blkcg, cft, cb,
+ BLKIO_STAT_OO_CTX_IO_COUNT, 0);
+
#endif
default:
BUG();
@@ -1419,6 +1429,12 @@ struct cftype blkio_files[] = {
BLKIO_PROP_dequeue),
.read_map = blkiocg_file_read_map,
},
+ {
+ .name = "oo_ctx_io_count",
+ .private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+ BLKIO_PROP_oo_ctx_io_count),
+ .read_map = blkiocg_file_read_map,
+ },
#endif
};

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 10919fa..9556f2b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -52,6 +52,7 @@ enum stat_type {
/* Time not charged to this cgroup */
BLKIO_STAT_UNACCOUNTED_TIME,
#ifdef CONFIG_DEBUG_BLK_CGROUP
+ BLKIO_STAT_OO_CTX_IO_COUNT,
BLKIO_STAT_AVG_QUEUE_SIZE,
BLKIO_STAT_IDLE_TIME,
BLKIO_STAT_EMPTY_TIME,
@@ -93,6 +94,7 @@ enum blkcg_file_name_prop {
BLKIO_PROP_idle_time,
BLKIO_PROP_empty_time,
BLKIO_PROP_dequeue,
+ BLKIO_PROP_oo_ctx_io_count,
};

/* cgroup files owned by throttle policy */
@@ -119,6 +121,8 @@ struct blkio_group_stats {
uint64_t sectors;
/* Time not charged to this cgroup */
uint64_t unaccounted_time;
+ /* Number of IOs submitted out of process context */
+ uint64_t oo_ctx_io_count;
uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Sum of number of IOs queued across all samples */
@@ -303,7 +307,8 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg,
void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
bool direction, bool sync);
void blkiocg_update_completion_stats(struct blkio_group *blkg,
- uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
+ uint64_t start_time, uint64_t io_start_time, bool direction, bool sync,
+ bool out_of_ctx);
void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
bool sync);
void blkiocg_update_io_add_stats(struct blkio_group *blkg,
@@ -332,7 +337,7 @@ static inline void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
uint64_t bytes, bool direction, bool sync) {}
static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
uint64_t start_time, uint64_t io_start_time, bool direction,
- bool sync) {}
+ bool sync, bool out_of_ctx) {}
static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
bool direction, bool sync) {}
static inline void blkiocg_update_io_add_stats(struct blkio_group *blkg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1b315c3..c885493 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -311,7 +311,7 @@ static void cfq_put_queue_ref(struct cfq_queue *cfqq);

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create);
+ struct bio *bio, int *is_oo_ctx, int create);
static struct cfq_queue **
cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio);

@@ -449,8 +449,8 @@ static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
}

static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*, bool,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio*,
+ int *is_oo_ctx, bool, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
static void cfq_put_async_queues(struct cfq_group *cfqg);
@@ -484,7 +484,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
* async bio tracking is enabled and we are not caching
* async queue pointer in cic.
*/
- cfqg = cfq_get_cfqg_bio(cfqd, bio, 0);
+ cfqg = cfq_get_cfqg_bio(cfqd, bio, NULL, 0);
if (!cfqg) {
/*
* May be this is first rq/bio and io group has not
@@ -1150,17 +1150,21 @@ done:
* create the cfq group if it does not exist. request_queue lock must be held.
*/
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, struct page *page,
- int create)
+ int *is_oo_ctx, int create)
{
- struct cgroup *cgroup;
+ struct cgroup *cgroup, *tracked_cgroup;
struct cfq_group *cfqg = NULL;

rcu_read_lock();

- if (!page)
- cgroup = task_cgroup(current, blkio_subsys_id);
- else
- cgroup = get_cgroup_from_page(page);
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ if (page) {
+ tracked_cgroup = get_cgroup_from_page(page);
+ if (is_oo_ctx)
+ *is_oo_ctx = cgroup && tracked_cgroup &&
+ tracked_cgroup != cgroup;
+ cgroup = tracked_cgroup;
+ }

if (!cgroup) {
cfqg = &cfqd->root_group;
@@ -1175,8 +1179,8 @@ out:
return cfqg;
}

-struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create)
+struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd, struct bio *bio,
+ int *is_oo_ctx, int create)
{
struct page *page = NULL;

@@ -1201,7 +1205,7 @@ struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
#endif

sync:
- return cfq_get_cfqg(cfqd, page, create);
+ return cfq_get_cfqg(cfqd, page, is_oo_ctx, create);
}

static void cfq_get_group_ref(struct cfq_group *cfqg)
@@ -1288,7 +1292,7 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
#else /* GROUP_IOSCHED */

static struct cfq_group *cfq_get_cfqg_bio(struct cfq_data *cfqd,
- struct bio *bio, int create)
+ struct bio *bio, int *is_oo_ctx, int create)
{
}

@@ -3134,14 +3138,14 @@ cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
}

static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, bool is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int *is_oo_ctx,
+ bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, 1);
+ struct cfq_group *cfqg = cfq_get_cfqg_bio(cfqd, bio, is_oo_ctx, 1);

if (!is_sync) {
async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
@@ -3667,7 +3671,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
(RQ_CFQG(rq))->dispatched--;
cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
rq_start_time_ns(rq), rq_io_start_time_ns(rq),
- rq_data_dir(rq), rq_is_sync(rq));
+ rq_data_dir(rq), rq_is_sync(rq),
+ rq->cmd_flags & REQ_OUT_OF_CTX);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;

@@ -3855,6 +3860,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
const bool is_sync = rq_is_sync(rq);
struct cfq_queue *cfqq;
unsigned long flags;
+ int is_oo_ctx = 0;

might_sleep_if(gfp_mask & __GFP_WAIT);

@@ -3868,8 +3874,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
- cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, bio, &is_oo_ctx, is_sync, cic->ioc,
+ gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
+ if (is_oo_ctx)
+ rq->cmd_flags |= REQ_OUT_OF_CTX;
} else {
/*
* If the queue was seeky for too long, break it apart.
diff --git a/block/cfq.h b/block/cfq.h
index 2a15592..6afc10a 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -61,10 +61,12 @@ static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
blkiocg_update_dispatch_stats(blkg, bytes, direction, sync);
}

-static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
+static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
+ uint64_t start_time, uint64_t io_start_time,
+ bool direction, bool sync, bool out_of_ctx)
{
blkiocg_update_completion_stats(blkg, start_time, io_start_time,
- direction, sync);
+ direction, sync, out_of_ctx);
}

static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index be50d9e..d859395 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -152,6 +152,7 @@ enum rq_flag_bits {
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_SECURE, /* secure discard (used with __REQ_DISCARD) */
__REQ_ON_PLUG, /* on plug list */
+ __REQ_OUT_OF_CTX, /* request submitted out of process context */
__REQ_NR_BITS, /* stops here */
};

@@ -193,5 +194,6 @@ enum rq_flag_bits {
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
#define REQ_ON_PLUG (1 << __REQ_ON_PLUG)
+#define REQ_OUT_OF_CTX (1 << __REQ_OUT_OF_CTX)

#endif /* __LINUX_BLK_TYPES_H */
--
1.7.3.1

2011-03-30 16:52:02

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v3 1/8] cfq-iosched: add symmetric reference wrappers

The cfq_group and cfq_queue objects did not have symmetric
wrapper functions, and much of the reference taking was
open coded. For clarity, add the following wrappers:

cfq_group:
- cfq_get_group_ref()
- cfq_put_group_ref()

cfq_queue:
- cfq_get_queue_ref()
- cfq_put_queue_ref()

The '_ref' suffix was chosen so the new names do not collide with
existing function names.
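
To make the intended pairing concrete, here is a tiny userspace model of
the get/put discipline (the struct and main() are invented for
illustration only; the real wrappers operate on cfq_group/cfq_queue under
the queue lock):

    #include <stdio.h>
    #include <stdlib.h>

    struct toy_group {
        int ref;
    };

    static void toy_get_group_ref(struct toy_group *g)
    {
        g->ref++;
    }

    static void toy_put_group_ref(struct toy_group *g)
    {
        if (--g->ref)
            return;
        printf("last reference dropped, freeing group\n");
        free(g);
    }

    int main(void)
    {
        struct toy_group *g = calloc(1, sizeof(*g));

        if (!g)
            return 1;
        toy_get_group_ref(g);    /* creation reference */
        toy_get_group_ref(g);    /* e.g. a queue linked to the group */
        toy_put_group_ref(g);    /* queue goes away */
        toy_put_group_ref(g);    /* creation ref dropped, group freed */
        return 0;
    }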

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 95 +++++++++++++++++++++++++++++---------------------
1 files changed, 55 insertions(+), 40 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 12e380b..74510f5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -303,6 +303,11 @@ struct cfq_data {
struct rcu_head rcu;
};

+static void cfq_get_group_ref(struct cfq_group *cfqg);
+static void cfq_put_group_ref(struct cfq_group *cfqg);
+static void cfq_get_queue_ref(struct cfq_queue *cfqq);
+static void cfq_put_queue_ref(struct cfq_queue *cfqq);
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -1048,7 +1053,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
* elevator which will be dropped by either elevator exit
* or cgroup deletion path depending on who is exiting first.
*/
- cfqg->ref = 1;
+ cfq_get_group_ref(cfqg);

/*
* Add group onto cgroup list. It might happen that bdi->dev is
@@ -1091,24 +1096,12 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
return cfqg;
}

-static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
+static void cfq_get_group_ref(struct cfq_group *cfqg)
{
cfqg->ref++;
- return cfqg;
}

-static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
-{
- /* Currently, all async queues are mapped to root group */
- if (!cfq_cfqq_sync(cfqq))
- cfqg = &cfqq->cfqd->root_group;
-
- cfqq->cfqg = cfqg;
- /* cfqq reference on cfqg */
- cfqq->cfqg->ref++;
-}
-
-static void cfq_put_cfqg(struct cfq_group *cfqg)
+static void cfq_put_group_ref(struct cfq_group *cfqg)
{
struct cfq_rb_root *st;
int i, j;
@@ -1122,6 +1115,17 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
kfree(cfqg);
}

+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfqq->cfqg = cfqg;
+ /* cfqq reference on cfqg */
+ cfq_get_group_ref(cfqq->cfqg);
+}
+
static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
/* Something wrong if we are trying to remove same group twice */
@@ -1133,7 +1137,7 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
- cfq_put_cfqg(cfqg);
+ cfq_put_group_ref(cfqg);
}

static void cfq_release_cfq_groups(struct cfq_data *cfqd)
@@ -1177,14 +1181,18 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
}

#else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+
+static void cfq_get_group_ref(struct cfq_group *cfqg)
{
- return &cfqd->root_group;
}

-static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
+static void cfq_put_group_ref(struct cfq_group *cfqg)
{
- return cfqg;
+}
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
}

static inline void
@@ -1193,7 +1201,6 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
}

static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
-static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}

#endif /* GROUP_IOSCHED */

@@ -2553,6 +2560,11 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
return 1;
}

+static void cfq_get_queue_ref(struct cfq_queue *cfqq)
+{
+ cfqq->ref++;
+}
+
/*
* task holds one reference to the queue, dropped when task exits. each rq
* in-flight on this queue also holds a reference, dropped when rq is freed.
@@ -2560,7 +2572,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
* Each cfq queue took a reference on the parent group. Drop it now.
* queue lock must be held here.
*/
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_put_queue_ref(struct cfq_queue *cfqq)
{
struct cfq_data *cfqd = cfqq->cfqd;
struct cfq_group *cfqg;
@@ -2583,7 +2595,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)

BUG_ON(cfq_cfqq_on_rr(cfqq));
kmem_cache_free(cfq_pool, cfqq);
- cfq_put_cfqg(cfqg);
+ cfq_put_group_ref(cfqg);
}

/*
@@ -2688,7 +2700,7 @@ static void cfq_put_cooperator(struct cfq_queue *cfqq)
break;
}
next = __cfqq->new_cfqq;
- cfq_put_queue(__cfqq);
+ cfq_put_queue_ref(__cfqq);
__cfqq = next;
}
}
@@ -2702,7 +2714,7 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)

cfq_put_cooperator(cfqq);

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}

static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
@@ -2844,7 +2856,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
GFP_ATOMIC);
if (new_cfqq) {
cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}
}

@@ -2903,7 +2915,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
*/
cfq_log_cfqq(cfqd, sync_cfqq, "changed cgroup");
cic_set_cfqq(cic, NULL, 1);
- cfq_put_queue(sync_cfqq);
+ cfq_put_queue_ref(sync_cfqq);
}

spin_unlock_irqrestore(q->queue_lock, flags);
@@ -3004,11 +3016,11 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
* pin the queue now that it's allocated, scheduler exit will prune it
*/
if (!is_sync && !(*async_cfqq)) {
- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
*async_cfqq = cfqq;
}

- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
return cfqq;
}

@@ -3633,10 +3645,10 @@ static void cfq_put_request(struct request *rq)
rq->elevator_private[1] = NULL;

/* Put down rq reference on cfqg */
- cfq_put_cfqg(RQ_CFQG(rq));
+ cfq_put_group_ref(RQ_CFQG(rq));
rq->elevator_private[2] = NULL;

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
}
}

@@ -3647,7 +3659,7 @@ cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
cfq_log_cfqq(cfqd, cfqq, "merging with queue %p", cfqq->new_cfqq);
cic_set_cfqq(cic, cfqq->new_cfqq, 1);
cfq_mark_cfqq_coop(cfqq->new_cfqq);
- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
return cic_to_cfqq(cic, 1);
}

@@ -3669,7 +3681,7 @@ split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)

cfq_put_cooperator(cfqq);

- cfq_put_queue(cfqq);
+ cfq_put_queue_ref(cfqq);
return NULL;
}
/*
@@ -3722,10 +3734,12 @@ new_queue:

cfqq->allocated[rw]++;

- cfqq->ref++;
+ cfq_get_queue_ref(cfqq);
rq->elevator_private[0] = cic;
rq->elevator_private[1] = cfqq;
- rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+
+ cfq_get_group_ref(cfqq->cfqg);
+ rq->elevator_private[2] = cfqq->cfqg;
spin_unlock_irqrestore(q->queue_lock, flags);
return 0;

@@ -3818,13 +3832,13 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)

for (i = 0; i < IOPRIO_BE_NR; i++) {
if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
+ cfq_put_queue_ref(cfqd->async_cfqq[0][i]);
if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
+ cfq_put_queue_ref(cfqd->async_cfqq[1][i]);
}

if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
+ cfq_put_queue_ref(cfqd->async_idle_cfqq);
}

static void cfq_cfqd_free(struct rcu_head *head)
@@ -3922,9 +3936,10 @@ static void *cfq_init_queue(struct request_queue *q)
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
* Take a reference to root group which we never drop. This is just
- * to make sure that cfq_put_cfqg() does not try to kfree root group
+ * to make sure that cfq_put_group_ref() does not try to kfree
+ * root group
*/
- cfqg->ref = 1;
+ cfq_get_group_ref(cfqg);
rcu_read_lock();
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
--
1.7.3.1

2011-03-30 16:52:03

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v3 2/8] block,fs,mm: IO cgroup tracking for buffered write

This patch adds IO tracking code in the fs/ and mm/ trees so that
the block layer can provide isolation for buffered writes.
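
The core of the tracking is stamping each page's page_cgroup with the
owning blkio cgroup's css ID in the upper bits of its flags word (see
page_cgroup_set_blkio_id()/page_cgroup_get_blkio_id() below). The
following is a minimal userspace model of that packing, using C11 atomics
in place of the kernel's cmpxchg(); it is a sketch of the idea, not the
kernel code:

    #include <stdatomic.h>
    #include <stdio.h>

    #define PAGE_TRACKING_ID_SHIFT 16

    /* Stand-in for page_cgroup->flags: low bits for memcg, high bits for the blkio id. */
    static _Atomic unsigned long flags;

    static void set_blkio_id(unsigned long id)
    {
        unsigned long old, new;

        old = atomic_load(&flags);
        do {
            new = old & ((1UL << PAGE_TRACKING_ID_SHIFT) - 1);
            new |= id << PAGE_TRACKING_ID_SHIFT;
        } while (!atomic_compare_exchange_weak(&flags, &old, new));
    }

    static unsigned long get_blkio_id(void)
    {
        return atomic_load(&flags) >> PAGE_TRACKING_ID_SHIFT;
    }

    int main(void)
    {
        atomic_store(&flags, 0x2aUL);    /* pretend memcg flag bits */
        set_blkio_id(7);
        printf("blkio id = %lu, low bits = 0x%lx\n",
               get_blkio_id(), atomic_load(&flags) & 0xffffUL);
        return 0;
    }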

Signed-off-by: Justin TerAvest <[email protected]>
---
block/blk-cgroup.c | 183 +++++++++++++++++++++++++++++++++++++++++++
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/blkio-track.h | 89 +++++++++++++++++++++
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 ++
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 38 +++++++++-
init/Kconfig | 16 ++++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 ++
mm/memory.c | 6 ++
mm/page-writeback.c | 14 +++-
mm/page_cgroup.c | 29 +++++---
mm/swap_state.c | 2 +
17 files changed, 388 insertions(+), 17 deletions(-)
create mode 100644 include/linux/blkio-track.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 77ee3c1..f626c65 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,6 +19,8 @@
#include <linux/slab.h>
#include "blk-cgroup.h"
#include <linux/genhd.h>
+#include <linux/blkio-track.h>
+#include <linux/mm_inline.h>

#define MAX_KEY_LEN 100

@@ -175,6 +177,12 @@ static inline void blkio_update_group_iops(struct blkio_group *blkg,
}
}

+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+
/*
* Add to the appropriate stat variable depending on the request type.
* This should be called with the blkg->stats_lock held.
@@ -1241,8 +1249,20 @@ blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
return 0;
}

+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
+
+ return (u64)css_id(&blkcg->css);
+}
+
struct cftype blkio_files[] = {
{
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+ {
.name = "weight_device",
.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
BLKIO_PROP_weight_device),
@@ -1399,6 +1419,169 @@ struct cftype blkio_files[] = {
#endif
};

+/* Block IO tracking related functions */
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined within that framework.
+ */
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *blkcg;
+ struct page_cgroup *pc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ page_cgroup_set_blkio_id(pc, 0); /* 0: default blkio_cgroup id */
+ if (!mm)
+ return;
+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to blkio_cgroup.
+ */
+ rcu_read_lock();
+ blkcg = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!blkcg))
+ goto out;
+ /*
+ * css_get(&bio->css) isn't called to increment the reference
+ * count of this blkio_cgroup "blkcg" so pc's blkio_cgroup_id
+ * might turn invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ page_cgroup_set_blkio_id(pc, css_id(&blkcg->css));
+out:
+ rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ /*
+ * A little trick:
+ * Just call blkio_cgroup_set_owner() for pages which are already
+ * active since the blkio_cgroup_id member of page_cgroup can be
+ * updated without any locks.
+ */
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ /*
+ * Do this without any locks. The reason is the same as
+ * blkio_cgroup_reset_owner().
+ */
+ page_cgroup_set_blkio_id(npc, page_cgroup_get_blkio_id(opc));
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc)
+ id = page_cgroup_get_blkio_id(pc);
+ return id;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page: the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that
+ * the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct cgroup_subsys_state *css;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return NULL;
+
+ css = css_lookup(&blkio_subsys, page_cgroup_get_blkio_id(pc));
+ if (!css)
+ return NULL;
+
+ return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
return cgroup_add_files(cgroup, subsys, blkio_files,
diff --git a/fs/buffer.c b/fs/buffer.c
index 2e6b1a3..f344929 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/blkio-track.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -659,6 +660,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4260831..7c684fa 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/blkio-track.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -852,6 +853,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);

while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/blkio-track.h b/include/linux/blkio-track.h
new file mode 100644
index 0000000..4a49c57
--- /dev/null
+++ b/include/linux/blkio-track.h
@@ -0,0 +1,89 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIOTRACK
+
+struct block_device;
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ page_cgroup_set_blkio_id(pc, 0);
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIOTRACK */
+
+struct blkiotrack_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page,
+ struct page *opage)
+{
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIOTRACK */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b2eee89..3e70b21 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -76,6 +76,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(struct task_struct *task);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(struct task_struct *task)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f512e18..a8a7cf0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -49,6 +49,8 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -153,6 +155,10 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..a04c37a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -615,7 +615,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -975,7 +975,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 6d6cb7a..6a86c0d 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -11,12 +11,43 @@
* then the page cgroup for pfn always exists.
*/
struct page_cgroup {
+#ifdef CONFIG_CGROUP_PAGE
unsigned long flags;
+#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
struct page *page;
struct list_head lru; /* per cgroup LRU list */
+#endif
};

+#ifdef CONFIG_CGROUP_BLKIOTRACK
+#define PAGE_TRACKING_ID_SHIFT (16)
+#define PAGE_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
+
+static inline void page_cgroup_set_blkio_id(struct page_cgroup *pc,
+ unsigned long id)
+{
+ unsigned long old, new, prev;
+
+ /* pc->flags isn't lock protected, so we must cmpxchg. */
+ WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
+ do {
+ new = old = pc->flags;
+ new &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
+ new |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
+ prev = cmpxchg(&pc->flags, old, new);
+ } while (prev != old);
+}
+
+static inline unsigned long page_cgroup_get_blkio_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PAGE_TRACKING_ID_SHIFT;
+}
+#endif
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);

#ifdef CONFIG_SPARSEMEM
@@ -33,6 +64,8 @@ static inline void __init page_cgroup_init(void)

struct page_cgroup *lookup_page_cgroup(struct page *page);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+
enum {
/* flags for mem_cgroup */
PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */
@@ -131,8 +164,9 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
local_irq_restore(*flags);
}
+#endif

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index be788c0..ce1cbba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -742,6 +742,22 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config CGROUP_BLKIOTRACK
+ bool "Track async request block io"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ default n
+ ---help---
+ Provides a Resource Controller which enables tracking of the owner
+ of every block I/O request.
+ The information this subsystem provides can be used by any
+ kind of module, such as the dm-ioband device mapper module or
+ the CFQ I/O scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIOTRACK
+
endif # CGROUPS

menuconfig NAMESPACES
diff --git a/mm/Makefile b/mm/Makefile
index 2b1b575..7da3bc8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,7 +38,8 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..64980fb 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
+#include <linux/blkio-track.h>
#include <asm/tlbflush.h>

#include <trace/events/block.h>
@@ -211,6 +212,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index f9a29c8..263a3f6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"

@@ -377,6 +378,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da53a25..e11c2cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,6 +359,12 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
diff --git a/mm/memory.c b/mm/memory.c
index 5823698..385bb54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -52,6 +52,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/blkio-track.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2391,6 +2392,7 @@ gotten:
*/
ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -2828,6 +2830,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
do_page_add_anon_rmap(page, vma, address, exclusive);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);

@@ -2959,6 +2962,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
+ /* Not setting the owner for special pages */
+ blkio_cgroup_set_owner(page, mm);
setpte:
set_pte_at(mm, address, page_table, entry);

@@ -3108,6 +3113,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24b7ac2..d49b398 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blkio-track.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1159,7 +1160,8 @@ EXPORT_SYMBOL(account_page_writeback);
* We take care to handle the case where the page was truncated from the
* mapping by re-checking page_mapping() inside tree_lock.
*/
-int __set_page_dirty_nobuffers(struct page *page)
+int __set_page_dirty_nobuffers_track_owner(struct page *page,
+ int update_owner)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
@@ -1174,6 +1176,9 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ if (update_owner)
+ blkio_cgroup_reset_owner_pagedirty(page,
+ current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
@@ -1186,6 +1191,11 @@ int __set_page_dirty_nobuffers(struct page *page)
}
return 0;
}
+
+int __set_page_dirty_nobuffers(struct page *page)
+{
+ return __set_page_dirty_nobuffers_track_owner(page, 1);
+}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);

/*
@@ -1196,7 +1206,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
{
wbc->pages_skipped++;
- return __set_page_dirty_nobuffers(page);
+ return __set_page_dirty_nobuffers_track_owner(page, 0);
}
EXPORT_SYMBOL(redirty_page_for_writepage);

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5bffada..78f5425 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -10,14 +10,17 @@
#include <linux/cgroup.h>
#include <linux/swapops.h>
#include <linux/kmemleak.h>
+#include <linux/blkio-track.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+#endif
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;

@@ -75,7 +78,7 @@ void __init page_cgroup_init_flatmem(void)

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for_each_online_node(nid) {
@@ -84,12 +87,13 @@ void __init page_cgroup_init_flatmem(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
- " don't want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup failed.\n");
- printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "please try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
}

@@ -134,6 +138,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
*/
kmemleak_not_leak(base);
} else {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* We don't have to allocate page_cgroup again, but
* address of memmap may be changed. So, we have to initialize
@@ -144,6 +149,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
/* check address of memmap is changed or not. */
if (base->page == pfn_to_page(pfn))
return 0;
+#else
+ return 0;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
}

if (!base) {
@@ -258,7 +266,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -267,14 +275,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
- " want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4668046..5c7c910 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/blkio-track.h>

#include <asm/pgtable.h>

@@ -327,6 +328,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/* May fail (-ENOMEM) if radix-tree node allocation failed. */
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
radix_tree_preload_end();
--
1.7.3.1

2011-03-30 16:52:39

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v3 3/8] cfq-iosched: Make async queues per cgroup

There is currently only one set of async queues. This patch moves them
into a per-cgroup data structure. Changes are made to ensure that
per-cgroup async queue references are dropped when the cgroup goes away.

TESTED:
Created multiple cgroups and verified that async queues were being
created properly. Also made sure that references are dropped and queues
are deallocated properly in the two situations:
- Cgroup goes away first, while IOs are still being done.
- IOs stop getting done and then cgroup goes away.
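
For reference, here is a toy userspace model of the per-group async queue
table and the class/priority selection done by cfq_async_queue_prio() in
the diff below (the types, class values and main() are illustrative only,
not kernel code):

    #include <stdio.h>

    #define IOPRIO_BE_NR 8

    struct toy_queue { int dummy; };

    struct toy_group {
        /* [0] = RT, [1] = BE; one cached async queue per priority level */
        struct toy_queue *async_cfqq[2][IOPRIO_BE_NR];
        struct toy_queue *async_idle_cfqq;
    };

    /* Pick the slot for a given ioprio class/priority, as CFQ does. */
    static struct toy_queue **async_slot(struct toy_group *g,
                                         int class, int prio)
    {
        switch (class) {
        case 1:            /* IOPRIO_CLASS_RT */
            return &g->async_cfqq[0][prio];
        case 2:            /* IOPRIO_CLASS_BE */
            return &g->async_cfqq[1][prio];
        default:           /* IOPRIO_CLASS_IDLE */
            return &g->async_idle_cfqq;
        }
    }

    int main(void)
    {
        static struct toy_group g;
        static struct toy_queue q;

        *async_slot(&g, 2, 4) = &q;    /* cache a BE prio-4 async queue */
        printf("cached %p, lookup returns %p\n",
               (void *)&q, (void *)*async_slot(&g, 2, 4));
        return 0;
    }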

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 57 +++++++++++++++++++++++++--------------------------
1 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 74510f5..011d268 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -202,6 +202,12 @@ struct cfq_group {
struct cfq_rb_root service_trees[2][3];
struct cfq_rb_root service_tree_idle;

+ /*
+ * async queue for each priority case
+ */
+ struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
+ struct cfq_queue *async_idle_cfqq;
+
unsigned long saved_workload_slice;
enum wl_type_t saved_workload;
enum wl_prio_t saved_serving_prio;
@@ -267,12 +273,6 @@ struct cfq_data {
struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;

- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;

/*
@@ -455,6 +455,7 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);
+static void cfq_put_async_queues(struct cfq_group *cfqg);

static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
bool is_sync)
@@ -1117,10 +1118,6 @@ static void cfq_put_group_ref(struct cfq_group *cfqg)

static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
{
- /* Currently, all async queues are mapped to root group */
- if (!cfq_cfqq_sync(cfqq))
- cfqg = &cfqq->cfqd->root_group;
-
cfqq->cfqg = cfqg;
/* cfqq reference on cfqg */
cfq_get_group_ref(cfqq->cfqg);
@@ -1132,6 +1129,7 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
BUG_ON(hlist_unhashed(&cfqg->cfqd_node));

hlist_del_init(&cfqg->cfqd_node);
+ cfq_put_async_queues(cfqg);

/*
* Put the reference taken at the time of creation so that when all
@@ -2929,15 +2927,13 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_CFQ_GROUP_IOSCHED */

static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
- struct io_context *ioc, gfp_t gfp_mask)
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
- struct cfq_group *cfqg;

retry:
- cfqg = cfq_get_cfqg(cfqd, 1);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -2981,15 +2977,15 @@ retry:
}

static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
+cfq_async_queue_prio(struct cfq_group *cfqg, int ioprio_class, int ioprio)
{
switch (ioprio_class) {
case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
+ return &cfqg->async_cfqq[0][ioprio];
case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
+ return &cfqg->async_cfqq[1][ioprio];
case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
+ return &cfqg->async_idle_cfqq;
default:
BUG();
}
@@ -3003,17 +2999,19 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue **async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct cfq_group *cfqg = cfq_get_cfqg(cfqd, 1);

if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
+ async_cfqq = cfq_async_queue_prio(cfqg, ioprio_class,
+ ioprio);
cfqq = *async_cfqq;
}

if (!cfqq)
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, cfqg, is_sync, ioc, gfp_mask);

/*
- * pin the queue now that it's allocated, scheduler exit will prune it
+ * pin the queue now that it's allocated, cgroup deletion will prune it
*/
if (!is_sync && !(*async_cfqq)) {
cfq_get_queue_ref(cfqq);
@@ -3826,19 +3824,19 @@ static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
cancel_work_sync(&cfqd->unplug_work);
}

-static void cfq_put_async_queues(struct cfq_data *cfqd)
+static void cfq_put_async_queues(struct cfq_group *cfqg)
{
int i;

for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue_ref(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue_ref(cfqd->async_cfqq[1][i]);
+ if (cfqg->async_cfqq[0][i])
+ cfq_put_queue_ref(cfqg->async_cfqq[0][i]);
+ if (cfqg->async_cfqq[1][i])
+ cfq_put_queue_ref(cfqg->async_cfqq[1][i]);
}

- if (cfqd->async_idle_cfqq)
- cfq_put_queue_ref(cfqd->async_idle_cfqq);
+ if (cfqg->async_idle_cfqq)
+ cfq_put_queue_ref(cfqg->async_idle_cfqq);
}

static void cfq_cfqd_free(struct rcu_head *head)
@@ -3866,8 +3864,9 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}

- cfq_put_async_queues(cfqd);
cfq_release_cfq_groups(cfqd);
+ /* Release the queues of root group. */
+ cfq_put_async_queues(&cfqd->root_group);
cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);

spin_unlock_irq(q->queue_lock);
--
1.7.3.1

2011-03-30 16:51:11

by Justin TerAvest

[permalink] [raw]
Subject: [PATCH v3 8/8] cfq: Don't allow preemption across cgroups

Previously, a sync queue in one cgroup could preempt an async queue in
another cgroup. This change disallows such cross-cgroup preemption.
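
A simplified sketch of the check ordering after this patch (standalone C
mock-up, not the actual cfq_should_preempt(); the real function has many
more conditions):

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_queue {
        int group_id;
        bool sync;
    };

    static bool should_preempt(const struct toy_queue *newq,
                               const struct toy_queue *curq)
    {
        /* The cross-cgroup check now comes before the sync/async test. */
        if (newq->group_id != curq->group_id)
            return false;
        /* Within a group, a sync queue may still preempt an async one. */
        if (newq->sync && !curq->sync)
            return true;
        return false;
    }

    int main(void)
    {
        struct toy_queue sync_a  = { .group_id = 1, .sync = true };
        struct toy_queue async_a = { .group_id = 1, .sync = false };
        struct toy_queue async_b = { .group_id = 2, .sync = false };

        printf("same group: %d, cross group: %d\n",
               should_preempt(&sync_a, &async_a),
               should_preempt(&sync_a, &async_b));
        return 0;
    }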

Signed-off-by: Justin TerAvest <[email protected]>
---
block/cfq-iosched.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 23caa79..0d0189b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3430,6 +3430,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (!cfqq)
return false;

+ if (new_cfqq->cfqg != cfqq->cfqg)
+ return false;
+
if (cfq_class_idle(new_cfqq))
return false;

@@ -3449,9 +3452,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
return true;

- if (new_cfqq->cfqg != cfqq->cfqg)
- return false;
-
if (cfq_slice_used(cfqq))
return true;

--
1.7.3.1