Hi,
This is V2 of the patches. The first version was posted here:
https://lkml.org/lkml/2011/6/3/375
There are no changes from the first version except that I have rebased it
onto the for-3.1/core branch of Jens's block tree.
I have been trying to find ways to solve two problems with block IO controller
cgroups.
- Current throttling logic in the IO controller does not throttle buffered
WRITEs. It does throttle all WRITEs at the device, but by that time buffered
WRITEs have lost the submitter's context and most of the IO reaches the
device in the flusher thread's context. Hence buffered write throttling is
currently not supported.
- All WRITEs are throttled at the device level and this can easily lead to
filesystem serialization.
One simple example: if a process writes some pages to the cache and then
does fsync(), and the process gets throttled, it locks up the filesystem.
With ext4, I noticed that even a simple "ls" does not make progress. The
reason boils down to the fact that filesystems are not aware of cgroups,
and one of the things which gets serialized is journalling in ordered mode.
So even if we do something to carry the submitter's cgroup information to
the device and do the throttling there, it will lead to serialization of
filesystems and is not a good idea.
So how do we go about fixing it? There seem to be two options.
- Throttling should still be done at the device level. Make filesystems
aware of cgroups so that multiple transactions can make progress in
parallel (per cgroup) and there are no shared resources across cgroups in
the filesystem which could lead to serialization.
- Throttle WRITEs while they are entering the cache and not after that,
something like balance_dirty_pages(). Direct IO is still throttled at the
device level. That way, we can avoid these journalling-related
serialization issues w.r.t. throttling.
But the big issue with this approach is that we control the IO rate
entering the cache and not the IO rate at the device. It can then happen
that the flusher later submits lots of WRITEs to the device and we see
periodic IO spikes on the end node.
So this mechanism helps a bit but is not the complete solution. It can
primarily help those folks who have the system resources and plenty of IO
bandwidth available but do not want to give it all to a customer because
it is not a premium customer, etc.
Option 1 seems really hard to implement. Filesystems have not been written
with cgroups in mind, so I am really skeptical that I can convince filesystem
designers to make fundamental changes in filesystems and journalling code to
make them cgroup aware.
Hence with this patch series I have implemented option 2. Option 2 is not
the best solution, but at least it gives us some control rather than no
control at all over buffered writes. Andrea Righi posted similar patches in
the past:
https://lkml.org/lkml/2011/2/28/115
That series had issues w.r.t. the interaction between bio and task
throttling, so I redid it.
Design
------
The IO controller already has the capability to keep track of the IO rate
of a group, enqueue bios in internal queues if the group exceeds its rate,
and dispatch these bios later.
This patch series also introduces the capability to throttle a dirtying
task in balance_dirty_pages_ratelimited_nr(). Now no WRITEs except direct
WRITEs are throttled at the device level. If a dirtying task exceeds its
configured IO rate, it is put on a group wait queue and woken up when it
can dirty more pages.
No new interface has been introduced; both direct IO and buffered IO make
use of a common IO rate limit.
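At a high level, the buffered-write path added by this series looks roughly
like the condensed sketch below (simplified from patches 6 and 8; locking,
error handling and the group lookup are omitted, so td/tg are assumed to be
resolved already):

  /* Called from balance_dirty_pages_ratelimited_nr() for each chunk of
   * freshly dirtied pages. */
  void blk_throtl_dirty_pages(struct address_space *mapping,
                              unsigned long nr_dirty)
  {
          struct throtl_data *td;         /* q->td of the backing device */
          struct throtl_grp *tg;          /* group of the current task */
          bool rw = WRITE;
          unsigned int sz = nr_dirty << PAGE_SHIFT;
          DEFINE_WAIT(wait);

          if (!tg_wait_dispatch(td, tg, rw, sz, nr_dirty)) {
                  /* Within the rate limit: charge the group and return */
                  throtl_charge_dirty_io(tg, rw, sz, nr_dirty, 0);
                  return;
          }

          /* Over the limit: sleep on the group's wait queue until the
           * dispatch timer wakes us up and charges the IO to the group. */
          prepare_to_wait_exclusive(&tg->wait[rw], &wait, TASK_UNINTERRUPTIBLE);
          io_schedule();
          finish_wait(&tg->wait[rw], &wait);
  }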
How To
=====
- Create a cgroup and limit it to 1MB/s for writes.
echo "8:16 1024000" > /cgroup/blk/test1/blkio.throttle.write_bps_device
- Launch dd thread in the cgroup
dd if=/dev/zero of=zerofile bs=4K count=1K
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.00428 s, 1.0 MB/s
Any feedback is welcome.
Thanks
Vivek
Vivek Goyal (8):
blk-throttle: convert wait routines to return jiffies to wait
blk-throttle: do not enforce first queued bio check in
tg_wait_dispatch
blk-throttle: use io size and direction as parameters to wait
routines
blk-throttle: specify number of ios during dispatch update
blk-throttle: get rid of extend slice trace message
blk-throttle: core logic to throttle task while dirtying pages
blk-throttle: do not throttle writes at device level except direct io
blk-throttle: enable throttling of task while dirtying pages
block/blk-cgroup.c | 6 +-
block/blk-cgroup.h | 2 +-
block/blk-throttle.c | 506 +++++++++++++++++++++++++++++++++++---------
block/cfq-iosched.c | 2 +-
block/cfq.h | 6 +-
fs/direct-io.c | 1 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 5 +
mm/page-writeback.c | 3 +
9 files changed, 421 insertions(+), 112 deletions(-)
--
1.7.4.4
Currently we return the number of jiffies to wait in a function parameter.
A cleaner way is to return it as the return value: 0 means there is no need
to wait, and anything > 0 is the number of jiffies to wait before the bio
can be dispatched.
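For illustration, the caller side changes roughly like this (sketch only;
dispatch_now() and schedule_after() are placeholder helpers, not real code):

  /* Before: boolean return plus a wait-time out-parameter */
  unsigned long wait;
  if (tg_may_dispatch(td, tg, bio, &wait))
          dispatch_now();
  else
          schedule_after(wait);           /* wait jiffies */

  /* After: the wait time is the return value, 0 means dispatch now */
  wait = tg_wait_dispatch(td, tg, bio);
  if (!wait)
          dispatch_now();
  else
          schedule_after(wait);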
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 72 +++++++++++++++++++++-----------------------------
1 files changed, 30 insertions(+), 42 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f6a7941..d76717a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -590,8 +590,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
tg->slice_start[rw], tg->slice_end[rw], jiffies);
}
-static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+static unsigned long tg_wait_iops_limit(struct throtl_data *td,
+ struct throtl_grp *tg, struct bio *bio)
{
bool rw = bio_data_dir(bio);
unsigned int io_allowed;
@@ -621,11 +621,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
else
io_allowed = tmp;
- if (tg->io_disp[rw] + 1 <= io_allowed) {
- if (wait)
- *wait = 0;
- return 1;
- }
+ if (tg->io_disp[rw] + 1 <= io_allowed)
+ return 0;
/* Calc approx time to dispatch */
jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
@@ -635,13 +632,15 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
else
jiffy_wait = 1;
- if (wait)
- *wait = jiffy_wait;
- return 0;
+ return jiffy_wait;
}
-static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+/*
+ * Returns number of jiffies to wait before IO can be dispatched according
+ * to bps limit.
+ */
+static unsigned long tg_wait_bps_limit(struct throtl_data *td,
+ struct throtl_grp *tg, struct bio *bio)
{
bool rw = bio_data_dir(bio);
u64 bytes_allowed, extra_bytes, tmp;
@@ -659,11 +658,8 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
do_div(tmp, HZ);
bytes_allowed = tmp;
- if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
- if (wait)
- *wait = 0;
- return 1;
- }
+ if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed)
+ return 0;
/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
@@ -677,9 +673,8 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
* up we did. Add that time also.
*/
jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
- if (wait)
- *wait = jiffy_wait;
- return 0;
+
+ return jiffy_wait;
}
static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
@@ -691,9 +686,12 @@ static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
/*
* Returns whether one can dispatch a bio or not. Also returns approx number
* of jiffies to wait before this bio is with-in IO rate and can be dispatched
+ *
+ * Returns the number of jiffies one needs to wait before IO can be dispatched.
+ * 0 means, IO can be dispatched now.
*/
-static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+static unsigned long
+tg_wait_dispatch(struct throtl_data *td, struct throtl_grp *tg, struct bio *bio)
{
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
@@ -707,11 +705,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
- if (wait)
- *wait = 0;
- return 1;
- }
+ if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
+ return 0;
/*
* If previous slice expired, start a new one otherwise renew/extend
@@ -725,22 +720,15 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
}
- if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
- && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
- if (wait)
- *wait = 0;
- return 1;
- }
+ bps_wait = tg_wait_bps_limit(td, tg, bio);
+ iops_wait = tg_wait_iops_limit(td, tg, bio);
max_wait = max(bps_wait, iops_wait);
- if (wait)
- *wait = max_wait;
-
if (time_before(tg->slice_end[rw], jiffies + max_wait))
throtl_extend_slice(td, tg, rw, jiffies + max_wait);
- return 0;
+ return max_wait;
}
static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
@@ -774,10 +762,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
struct bio *bio;
if ((bio = bio_list_peek(&tg->bio_lists[READ])))
- tg_may_dispatch(td, tg, bio, &read_wait);
+ read_wait = tg_wait_dispatch(td, tg, bio);
if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
- tg_may_dispatch(td, tg, bio, &write_wait);
+ write_wait = tg_wait_dispatch(td, tg, bio);
min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -819,7 +807,7 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
/* Try to dispatch 75% READS and 25% WRITES */
while ((bio = bio_list_peek(&tg->bio_lists[READ]))
- && tg_may_dispatch(td, tg, bio, NULL)) {
+ && !tg_wait_dispatch(td, tg, bio)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_reads++;
@@ -829,7 +817,7 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
}
while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
- && tg_may_dispatch(td, tg, bio, NULL)) {
+ && !tg_wait_dispatch(td, tg, bio)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_writes++;
@@ -1185,7 +1173,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}
/* Bio is with-in rate limit of group */
- if (tg_may_dispatch(td, tg, bio, NULL)) {
+ if (!tg_wait_dispatch(td, tg, bio)) {
throtl_charge_bio(tg, bio);
/*
--
1.7.4.4
Soon we will be using the same wait routines to check the wait time for
tasks as well, where no bio is involved. So get rid of any bio dependencies
and leave it to the caller not to call this routine if a bio or task is
already queued in the group in the same direction.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d76717a..8facd17 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -696,14 +696,6 @@ tg_wait_dispatch(struct throtl_data *td, struct throtl_grp *tg, struct bio *bio)
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
- /*
- * Currently whole state machine of group depends on first bio
- * queued in the group bio list. So one should not be calling
- * this function with a different bio if there are other bios
- * queued.
- */
- BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
-
/* If tg->bps = -1, then BW is unlimited */
if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
return 0;
--
1.7.4.4
I want to reuse the wait routines for task waits as well. Hence get rid of
the dependency on a bio being passed in; instead pass in the direction and
the size of the IO.
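For example, a bio caller and a (later) task caller can then share the same
routine, roughly as follows (sketch, condensed from this and later patches):

  /* bio: size in bytes, counted as a single IO */
  wait = tg_wait_dispatch(td, tg, bio_data_dir(bio), bio->bi_size, 1);

  /* dirtying task: nr_dirty pages, charged as one IO per page */
  wait = tg_wait_dispatch(td, tg, WRITE, nr_dirty << PAGE_SHIFT, nr_dirty);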
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 52 +++++++++++++++++++++++--------------------------
1 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8facd17..885ee4a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -591,9 +591,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
}
static unsigned long tg_wait_iops_limit(struct throtl_data *td,
- struct throtl_grp *tg, struct bio *bio)
+ struct throtl_grp *tg, bool rw, unsigned int nr_ios)
{
- bool rw = bio_data_dir(bio);
unsigned int io_allowed;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
u64 tmp;
@@ -621,11 +620,11 @@ static unsigned long tg_wait_iops_limit(struct throtl_data *td,
else
io_allowed = tmp;
- if (tg->io_disp[rw] + 1 <= io_allowed)
+ if (tg->io_disp[rw] + nr_ios <= io_allowed)
return 0;
/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+ jiffy_wait = ((tg->io_disp[rw] + nr_ios) * HZ)/tg->iops[rw] + 1;
if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -640,9 +639,8 @@ static unsigned long tg_wait_iops_limit(struct throtl_data *td,
* to bps limit.
*/
static unsigned long tg_wait_bps_limit(struct throtl_data *td,
- struct throtl_grp *tg, struct bio *bio)
+ struct throtl_grp *tg, bool rw, unsigned int sz)
{
- bool rw = bio_data_dir(bio);
u64 bytes_allowed, extra_bytes, tmp;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -658,11 +656,11 @@ static unsigned long tg_wait_bps_limit(struct throtl_data *td,
do_div(tmp, HZ);
bytes_allowed = tmp;
- if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed)
+ if (tg->bytes_disp[rw] + sz <= bytes_allowed)
return 0;
/* Calc approx time to dispatch */
- extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
+ extra_bytes = tg->bytes_disp[rw] + sz - bytes_allowed;
jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
if (!jiffy_wait)
@@ -690,10 +688,9 @@ static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
* Returns the number of jiffies one needs to wait before IO can be dispatched.
* 0 means, IO can be dispatched now.
*/
-static unsigned long
-tg_wait_dispatch(struct throtl_data *td, struct throtl_grp *tg, struct bio *bio)
+static unsigned long tg_wait_dispatch(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw, unsigned int sz, unsigned int nr_ios)
{
- bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
/* If tg->bps = -1, then BW is unlimited */
@@ -712,8 +709,8 @@ tg_wait_dispatch(struct throtl_data *td, struct throtl_grp *tg, struct bio *bio)
throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
}
- bps_wait = tg_wait_bps_limit(td, tg, bio);
- iops_wait = tg_wait_iops_limit(td, tg, bio);
+ bps_wait = tg_wait_bps_limit(td, tg, rw, sz);
+ iops_wait = tg_wait_iops_limit(td, tg, rw, nr_ios);
max_wait = max(bps_wait, iops_wait);
@@ -723,16 +720,14 @@ tg_wait_dispatch(struct throtl_data *td, struct throtl_grp *tg, struct bio *bio)
return max_wait;
}
-static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
+static void throtl_charge_io(struct throtl_grp *tg, bool rw, unsigned int sz,
+ unsigned int nr_ios, bool sync)
{
- bool rw = bio_data_dir(bio);
- bool sync = bio->bi_rw & REQ_SYNC;
-
- /* Charge the bio to the group */
- tg->bytes_disp[rw] += bio->bi_size;
- tg->io_disp[rw]++;
+ /* Charge the io to the group */
+ tg->bytes_disp[rw] += sz;
+ tg->io_disp[rw] += nr_ios;
- blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
+ blkiocg_update_dispatch_stats(&tg->blkg, sz, rw, sync);
}
static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
@@ -754,10 +749,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
struct bio *bio;
if ((bio = bio_list_peek(&tg->bio_lists[READ])))
- read_wait = tg_wait_dispatch(td, tg, bio);
+ read_wait = tg_wait_dispatch(td, tg, READ, bio->bi_size, 1);
if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
- write_wait = tg_wait_dispatch(td, tg, bio);
+ write_wait = tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1);
min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -781,7 +776,7 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
BUG_ON(td->nr_queued[rw] <= 0);
td->nr_queued[rw]--;
- throtl_charge_bio(tg, bio);
+ throtl_charge_io(tg, rw, bio->bi_size, 1, bio->bi_rw & REQ_SYNC);
bio_list_add(bl, bio);
bio->bi_rw |= REQ_THROTTLED;
@@ -799,7 +794,7 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
/* Try to dispatch 75% READS and 25% WRITES */
while ((bio = bio_list_peek(&tg->bio_lists[READ]))
- && !tg_wait_dispatch(td, tg, bio)) {
+ && !tg_wait_dispatch(td, tg, READ, bio->bi_size, 1)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_reads++;
@@ -809,7 +804,7 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
}
while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
- && !tg_wait_dispatch(td, tg, bio)) {
+ && !tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_writes++;
@@ -1165,8 +1160,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}
/* Bio is with-in rate limit of group */
- if (!tg_wait_dispatch(td, tg, bio)) {
- throtl_charge_bio(tg, bio);
+ if (!tg_wait_dispatch(td, tg, rw, bio->bi_size, 1)) {
+ throtl_charge_io(tg, rw, bio->bi_size, 1,
+ bio->bi_rw & REQ_SYNC);
/*
* We need to trim slice even when bios are not being queued
--
1.7.4.4
Currently we assume the number of IOs to be 1 while dispatch stats are
being updated. It can happen that we are clubbing multiple dispatches and
updating them in one shot. The throttling logic will count one page as one
IO for dirty page throttling. Hence modify the logic to allow specifying
the number of IOs to associate with the update.
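Callers now pass the number of IOs explicitly, for example (sketch,
condensed from this patch):

  /* a single bio still counts as one IO */
  blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, 1, rw, sync);

  /* a dirtying task can be charged for nr_dirty pages in one shot */
  blkiocg_update_dispatch_stats(&tg->blkg, nr_dirty << PAGE_SHIFT,
                                nr_dirty, rw, sync);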
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 6 +++---
block/blk-cgroup.h | 2 +-
block/blk-throttle.c | 2 +-
block/cfq-iosched.c | 2 +-
block/cfq.h | 6 +++---
5 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..c0815c5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -394,8 +394,8 @@ EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
* should be called under rcu read lock or queue lock to make sure blkg pointer
* is valid.
*/
-void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
- uint64_t bytes, bool direction, bool sync)
+void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
+ unsigned int nr_ios, bool direction, bool sync)
{
struct blkio_group_stats_cpu *stats_cpu;
unsigned long flags;
@@ -412,7 +412,7 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
u64_stats_update_begin(&stats_cpu->syncp);
stats_cpu->sectors += bytes >> 9;
blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICED],
- 1, direction, sync);
+ nr_ios, direction, sync);
blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICE_BYTES],
bytes, direction, sync);
u64_stats_update_end(&stats_cpu->syncp);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index a71d290..7202dcc 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -318,7 +318,7 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg,
unsigned long time,
unsigned long unaccounted_time);
void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
- bool direction, bool sync);
+ unsigned int nr_ios, bool direction, bool sync);
void blkiocg_update_completion_stats(struct blkio_group *blkg,
uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 885ee4a..506f4ec 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -727,7 +727,7 @@ static void throtl_charge_io(struct throtl_grp *tg, bool rw, unsigned int sz,
tg->bytes_disp[rw] += sz;
tg->io_disp[rw] += nr_ios;
- blkiocg_update_dispatch_stats(&tg->blkg, sz, rw, sync);
+ blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_ios, rw, sync);
}
static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3d403a1..3b88563 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2060,7 +2060,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
cfqq->nr_sectors += blk_rq_sectors(rq);
cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
- rq_data_dir(rq), rq_is_sync(rq));
+ 1, rq_data_dir(rq), rq_is_sync(rq));
}
/*
diff --git a/block/cfq.h b/block/cfq.h
index 2a15592..1795b3d 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -56,9 +56,9 @@ cfq_blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
}
static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
- uint64_t bytes, bool direction, bool sync)
+ uint64_t bytes, unsigned int nr_ios, bool direction, bool sync)
{
- blkiocg_update_dispatch_stats(blkg, bytes, direction, sync);
+ blkiocg_update_dispatch_stats(blkg, bytes, nr_ios, direction, sync);
}
static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync)
@@ -101,7 +101,7 @@ static inline void
cfq_blkiocg_update_set_idle_time_stats(struct blkio_group *blkg) {}
static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
- uint64_t bytes, bool direction, bool sync) {}
+ uint64_t bytes, unsigned int nr_ios, bool direction, bool sync) {}
static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync) {}
static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
--
1.7.4.4
I am about to introduce more trace messages for task throttling. There are
already quite a few messages, and the extend-slice one does not seem that
useful, so kill it.
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 506f4ec..ec41f6d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -515,9 +515,6 @@ static inline void throtl_extend_slice(struct throtl_data *td,
struct throtl_grp *tg, bool rw, unsigned long jiffy_end)
{
tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
- throtl_log_tg(td, tg, "[%c] extend slice start=%lu end=%lu jiffies=%lu",
- rw == READ ? 'R' : 'W', tg->slice_start[rw],
- tg->slice_end[rw], jiffies);
}
/* Determine if previously allocated or extended slice is complete or not */
--
1.7.4.4
This is the core logic which enables the throttling code to also throttle
tasks while they are dirtying pages. It creates the infrastructure where a
wait queue is maintained per group; if a task exceeds its rate limit, it is
put to sleep on the wait queue and woken up later.
Though currently we throttle tasks only for WRITEs and not for READs, the
logic supports task throttling in both directions (similar to bio
throttling). This will allow us to add task READ throttling easily, if
needed in the future.
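On the dispatch side, each group now alternates per direction between
dispatching queued bios and waking up sleeping tasks, roughly as below
(sketch, condensed from throtl_dispatch_tg_rw() in this patch):

  /* Dispatch either bios or one waiting task in this direction, then
   * flip the active dispatch type so bios and tasks take turns. */
  if (tg->active_dispatch[rw] == DISPATCH_BIO)
          nr_disp = throtl_dispatch_tg_bio(td, tg, rw, bl, max);
  else
          throtl_dispatch_tg_task(td, tg, rw, max);
  tg_switch_active_dispatch(td, tg, rw);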
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 399 ++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 358 insertions(+), 41 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ec41f6d..be3b4b4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -32,6 +32,12 @@ struct throtl_rb_root {
unsigned long min_disptime;
};
+/* Whether to dispatch a bio or task in next round of a dispatch from a group */
+enum dispatch_type {
+ DISPATCH_BIO,
+ DISPATCH_TASK,
+};
+
#define THROTL_RB_ROOT (struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \
.count = 0, .min_disptime = 0}
@@ -59,7 +65,10 @@ struct throtl_grp {
struct bio_list bio_lists[2];
/* Number of queued bios on READ and WRITE lists */
- unsigned int nr_queued[2];
+ unsigned int nr_queued_bio[2];
+
+ /* Number of queued tasks */
+ unsigned int nr_queued_tsk[2];
/* bytes per second rate limits */
uint64_t bps[2];
@@ -80,6 +89,21 @@ struct throtl_grp {
int limits_changed;
struct rcu_head rcu_head;
+
+ /* READ and WRITE wait lists for task throttle */
+ wait_queue_head_t wait[2];
+
+ /* Number of unaccounted dirty pages in this group */
+ unsigned int unaccounted_dirty;
+
+ /* The number of pages to be charged to first task on wait queue */
+ unsigned int active_task_charge[2];
+
+ /*
+ * Keeps track of whether we will dispatch a bio or task in next
+ * round of dispatch (for each dir)
+ */
+ enum dispatch_type active_dispatch[2];
};
struct throtl_data
@@ -94,7 +118,10 @@ struct throtl_data
struct request_queue *queue;
/* Total Number of queued bios on READ and WRITE lists */
- unsigned int nr_queued[2];
+ unsigned int nr_queued_bio[2];
+
+ /* Total Number of queued tasks on READ and WRITE lists */
+ unsigned int nr_queued_tsk[2];
/*
* number of total undestroyed groups
@@ -142,9 +169,19 @@ static inline struct throtl_grp *tg_of_blkg(struct blkio_group *blkg)
return NULL;
}
+static inline int total_queued_bios(struct throtl_data *td)
+{
+ return td->nr_queued_bio[0] + td->nr_queued_bio[1];
+}
+
+static inline int total_queued_tasks(struct throtl_data *td)
+{
+ return td->nr_queued_tsk[0] + td->nr_queued_tsk[1];
+}
+
static inline unsigned int total_nr_queued(struct throtl_data *td)
{
- return td->nr_queued[0] + td->nr_queued[1];
+ return total_queued_bios(td) + total_queued_tasks(td);
}
static inline struct throtl_grp *throtl_ref_get_tg(struct throtl_grp *tg)
@@ -192,6 +229,9 @@ static void throtl_init_group(struct throtl_grp *tg)
tg->bps[0] = tg->bps[1] = -1;
tg->iops[0] = tg->iops[1] = -1;
+ init_waitqueue_head(&tg->wait[READ]);
+ init_waitqueue_head(&tg->wait[WRITE]);
+
/*
* Take the initial reference that will be released on destroy
* This can be thought of a joint reference by cgroup and
@@ -727,6 +767,14 @@ static void throtl_charge_io(struct throtl_grp *tg, bool rw, unsigned int sz,
blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_ios, rw, sync);
}
+static void throtl_charge_dirty_io(struct throtl_grp *tg, bool rw,
+ unsigned int sz, unsigned int nr_ios, bool sync)
+{
+ tg->unaccounted_dirty -= nr_ios;
+ BUG_ON(tg->unaccounted_dirty < 0);
+ throtl_charge_io(tg, rw, sz, nr_ios, sync);
+}
+
static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
struct bio *bio)
{
@@ -735,21 +783,38 @@ static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
bio_list_add(&tg->bio_lists[rw], bio);
/* Take a bio reference on tg */
throtl_ref_get_tg(tg);
- tg->nr_queued[rw]++;
- td->nr_queued[rw]++;
+ tg->nr_queued_bio[rw]++;
+ td->nr_queued_bio[rw]++;
throtl_enqueue_tg(td, tg);
}
-static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+/*
+ * Returns the wait time of first bio or task in the specified direction. If
+ * nothing is queued then wait time is -1.
+ */
+static unsigned long
+tg_active_wait_time(struct throtl_data *td, struct throtl_grp *tg, bool rw)
{
- unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
struct bio *bio;
+ unsigned long sz;
- if ((bio = bio_list_peek(&tg->bio_lists[READ])))
- read_wait = tg_wait_dispatch(td, tg, READ, bio->bi_size, 1);
+ bio = bio_list_peek(&tg->bio_lists[rw]);
+ if (bio && tg->active_dispatch[rw] == DISPATCH_BIO)
+ return tg_wait_dispatch(td, tg, rw, bio->bi_size, 1);
- if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
- write_wait = tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1);
+ if (!waitqueue_active(&tg->wait[rw]))
+ return -1;
+
+ sz = tg->active_task_charge[rw] << PAGE_SHIFT;
+ return tg_wait_dispatch(td, tg, rw, sz, tg->active_task_charge[rw]);
+}
+
+static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+{
+ unsigned long read_wait, write_wait, min_wait, disptime;
+
+ read_wait = tg_active_wait_time(td, tg, READ);
+ write_wait = tg_active_wait_time(td, tg, WRITE);
min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -760,18 +825,39 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
throtl_enqueue_tg(td, tg);
}
+/* Calculate how many pages must be charged to a task at this point of time. */
+static unsigned int tg_calc_charge_pages(struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_pages;
+
+ /* Divide unaccounted number of pages among queued tasks */
+ nr_pages = tg->unaccounted_dirty/tg->nr_queued_tsk[rw];
+
+ return max_t(unsigned int, nr_pages, 1);
+}
+
+/* set the active task charge for the first task in the queue */
+static void tg_set_active_task_charge(struct throtl_grp *tg, bool rw)
+{
+ /* If there are more tasks queued, calculate the charge */
+ if (tg->nr_queued_tsk[rw])
+ tg->active_task_charge[rw] = tg_calc_charge_pages(tg, rw);
+ else
+ tg->active_task_charge[rw] = 0;
+}
+
static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
bool rw, struct bio_list *bl)
{
struct bio *bio;
bio = bio_list_pop(&tg->bio_lists[rw]);
- tg->nr_queued[rw]--;
+ tg->nr_queued_bio[rw]--;
/* Drop bio reference on tg */
throtl_put_tg(tg);
- BUG_ON(td->nr_queued[rw] <= 0);
- td->nr_queued[rw]--;
+ BUG_ON(td->nr_queued_bio[rw] <= 0);
+ td->nr_queued_bio[rw]--;
throtl_charge_io(tg, rw, bio->bi_size, 1, bio->bi_rw & REQ_SYNC);
bio_list_add(bl, bio);
@@ -780,39 +866,137 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
throtl_trim_slice(td, tg, rw);
}
-static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
- struct bio_list *bl)
+static int throtl_dispatch_tg_bio(struct throtl_data *td, struct throtl_grp *tg,
+ bool rw, struct bio_list *bl, unsigned int max_disp)
{
- unsigned int nr_reads = 0, nr_writes = 0;
- unsigned int max_nr_reads = throtl_grp_quantum*3/4;
- unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
struct bio *bio;
+ unsigned int nr_disp = 0;
- /* Try to dispatch 75% READS and 25% WRITES */
-
- while ((bio = bio_list_peek(&tg->bio_lists[READ]))
- && !tg_wait_dispatch(td, tg, READ, bio->bi_size, 1)) {
+ while ((bio = bio_list_peek(&tg->bio_lists[rw]))
+ && !tg_wait_dispatch(td, tg, rw, bio->bi_size, 1)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
- nr_reads++;
+ nr_disp++;
- if (nr_reads >= max_nr_reads)
+ if (nr_disp >= max_disp)
break;
}
- while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
- && !tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1)) {
+ return nr_disp;
+}
- tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
- nr_writes++;
+static void throtl_dispatch_tg_task(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw, unsigned int max_disp)
+{
+ unsigned int sz;
- if (nr_writes >= max_nr_writes)
- break;
+ if (!waitqueue_active(&tg->wait[rw]))
+ return;
+
+ /* Wake up a task */
+ wake_up(&tg->wait[rw]);
+
+ tg->nr_queued_tsk[rw]--;
+ td->nr_queued_tsk[rw]--;
+
+ /* Charge the woken up task for IO */
+ sz = tg->active_task_charge[rw] << PAGE_SHIFT;
+ throtl_charge_dirty_io(tg, rw, sz, tg->active_task_charge[rw], false);
+ throtl_trim_slice(td, tg, rw);
+
+ /*
+ * We dispatched one task. Set the charge for other queued tasks,
+ * if any.
+ */
+ tg_set_active_task_charge(tg, rw);
+ throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%u sz=%u bps=%llu"
+ " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
+ rw == READ ? 'R' : 'W', tg->bytes_disp[rw], sz,
+ tg->bps[rw], tg->io_disp[rw], tg->iops[rw],
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
+}
+
+static void tg_switch_active_dispatch(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_tasks = tg->nr_queued_tsk[rw];
+ unsigned int nr_bios = tg->nr_queued_bio[rw];
+ enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
+
+ /* Nothing queued. Whoever gets queued next first, sets dispatch type */
+ if (!nr_bios && !nr_tasks)
+ return;
+
+ if (curr_dispatch == DISPATCH_BIO && nr_tasks) {
+ tg->active_dispatch[rw] = DISPATCH_TASK;
+ return;
+ }
+
+ if (curr_dispatch == DISPATCH_TASK && nr_bios)
+ tg->active_dispatch[rw] = DISPATCH_BIO;
+}
+
+static void tg_update_active_dispatch(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_tasks = tg->nr_queued_tsk[rw];
+ unsigned int nr_bios = tg->nr_queued_bio[rw];
+ enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
+
+ BUG_ON(nr_bios < 0 || nr_tasks < 0);
+
+ if (curr_dispatch == DISPATCH_BIO && !nr_bios) {
+ tg->active_dispatch[rw] = DISPATCH_TASK;
+ return;
}
+ if (curr_dispatch == DISPATCH_TASK && !nr_tasks)
+ tg->active_dispatch[rw] = DISPATCH_BIO;
+}
+
+static int throtl_dispatch_tg_rw(struct throtl_data *td, struct throtl_grp *tg,
+ bool rw, struct bio_list *bl, unsigned int max)
+{
+ unsigned int nr_disp = 0;
+
+ if (tg->active_dispatch[rw] == DISPATCH_BIO)
+ nr_disp = throtl_dispatch_tg_bio(td, tg, rw, bl, max);
+ else
+ /* Only number of bios dispatched is kept track of here */
+ throtl_dispatch_tg_task(td, tg, rw, max);
+
+ tg_switch_active_dispatch(td, tg, rw);
+ return nr_disp;
+}
+
+static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
+ struct bio_list *bl)
+{
+ /* Try to dispatch 75% READS and 25% WRITES */
+ unsigned int nr_reads = 0, nr_writes = 0;
+ unsigned int max_nr_reads = throtl_grp_quantum*3/4;
+ unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
+
+ nr_reads = throtl_dispatch_tg_rw(td, tg, READ, bl, max_nr_reads);
+ nr_writes = throtl_dispatch_tg_rw(td, tg, WRITE, bl, max_nr_writes);
+
return nr_reads + nr_writes;
}
+static bool tg_should_requeue(struct throtl_grp *tg)
+{
+ /* If there are queued bios, requeue */
+ if (tg->nr_queued_bio[0] || tg->nr_queued_bio[1])
+ return 1;
+
+ /* If there are queued tasks, requeue */
+ if (tg->nr_queued_tsk[0] || tg->nr_queued_tsk[1])
+ return 1;
+
+ return 0;
+}
+
static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
{
unsigned int nr_disp = 0;
@@ -832,7 +1016,7 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
nr_disp += throtl_dispatch_tg(td, tg, bl);
- if (tg->nr_queued[0] || tg->nr_queued[1]) {
+ if (tg_should_requeue(tg)) {
tg_update_disptime(td, tg);
throtl_enqueue_tg(td, tg);
}
@@ -899,9 +1083,9 @@ static int throtl_dispatch(struct request_queue *q)
bio_list_init(&bio_list_on_stack);
- throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
- total_nr_queued(td), td->nr_queued[READ],
- td->nr_queued[WRITE]);
+ throtl_log(td, "dispatch bioq=%u/%u tskq=%u/%u",
+ td->nr_queued_bio[READ], td->nr_queued_bio[WRITE],
+ td->nr_queued_tsk[READ], td->nr_queued_tsk[WRITE]);
nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
@@ -1122,7 +1306,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
if (tg_no_rule_group(tg, rw)) {
blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
- rw, bio->bi_rw & REQ_SYNC);
+ 1, rw, bio->bi_rw & REQ_SYNC);
rcu_read_unlock();
return 0;
}
@@ -1146,14 +1330,14 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}
}
- if (tg->nr_queued[rw]) {
+ /* If there are already queued bio or task in same dir, queue bio */
+ if (tg->nr_queued_bio[rw] || tg->nr_queued_tsk[rw]) {
/*
- * There is already another bio queued in same dir. No
+ * There is already another bio/task queued in same dir. No
* need to update dispatch time.
*/
update_disptime = false;
goto queue_bio;
-
}
/* Bio is with-in rate limit of group */
@@ -1178,16 +1362,18 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
queue_bio:
throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
- " iodisp=%u iops=%u queued=%d/%d",
+ " iodisp=%u iops=%u bioq=%u/%u taskq=%u/%u",
rw == READ ? 'R' : 'W',
tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
tg->io_disp[rw], tg->iops[rw],
- tg->nr_queued[READ], tg->nr_queued[WRITE]);
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
throtl_add_bio_tg(q->td, tg, bio);
*biop = NULL;
if (update_disptime) {
+ tg_update_active_dispatch(td, tg, rw);
tg_update_disptime(td, tg);
throtl_schedule_next_dispatch(td);
}
@@ -1197,6 +1383,137 @@ out:
return 0;
}
+static void
+__blk_throtl_dirty_pages(struct request_queue *q, unsigned long nr_dirty)
+{
+ struct throtl_data *td = q->td;
+ struct throtl_grp *tg;
+ struct blkio_cgroup *blkcg;
+ bool rw = WRITE, update_disptime = true, first_task = false;
+ unsigned int sz = nr_dirty << PAGE_SHIFT;
+ DEFINE_WAIT(wait);
+
+ /*
+ * A throtl_grp pointer retrieved under rcu can be used to access
+ * basic fields like stats and io rates. If a group has no rules,
+ * just update the dispatch stats in lockless manner and return.
+ */
+
+ rcu_read_lock();
+ blkcg = task_blkio_cgroup(current);
+ tg = throtl_find_tg(td, blkcg);
+ if (tg) {
+ throtl_tg_fill_dev_details(td, tg);
+
+ if (tg_no_rule_group(tg, rw)) {
+ blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_dirty,
+ rw, 0);
+ rcu_read_unlock();
+ return;
+ }
+ }
+ rcu_read_unlock();
+
+ spin_lock_irq(q->queue_lock);
+
+ tg = throtl_get_tg(td);
+
+ /* Queue is gone. No queue lock held here. */
+ if (IS_ERR(tg))
+ return;
+
+ tg->unaccounted_dirty += nr_dirty;
+
+ /* If there are already queued tasks, put this task also on waitq */
+ if (tg->nr_queued_tsk[rw]) {
+ update_disptime = false;
+ goto queue_task;
+ } else
+ first_task = true;
+
+ /* If there are bios already throttled in same dir, queue task */
+ if (!bio_list_empty(&tg->bio_lists[rw])) {
+ update_disptime = false;
+ goto queue_task;
+ }
+
+ /*
+ * Task is with-in rate limit of group.
+ *
+ * Note: How many IOPS we should charge for this operation. For
+ * the time being I am sticking to number of pages as number of
+ * IOPS.
+ */
+ if (!tg_wait_dispatch(td, tg, rw, sz, nr_dirty)) {
+ throtl_charge_dirty_io(tg, rw, sz, nr_dirty, 0);
+ throtl_trim_slice(td, tg, rw);
+ goto out;
+ }
+
+queue_task:
+ throtl_log_tg(td, tg, "[%c] task. bdisp=%u sz=%u bps=%llu"
+ " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
+ rw == READ ? 'R' : 'W',
+ tg->bytes_disp[rw], sz, tg->bps[rw],
+ tg->io_disp[rw], tg->iops[rw],
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
+
+ /* Take a task reference on tg */
+ throtl_ref_get_tg(tg);
+ tg->nr_queued_tsk[rw]++;
+ td->nr_queued_tsk[rw]++;
+ prepare_to_wait_exclusive(&tg->wait[rw], &wait, TASK_UNINTERRUPTIBLE);
+
+ if (first_task)
+ tg_set_active_task_charge(tg, rw);
+
+ if (update_disptime) {
+ tg_update_active_dispatch(td, tg, rw);
+ tg_update_disptime(td, tg);
+ throtl_schedule_next_dispatch(td);
+ } else
+ throtl_enqueue_tg(td, tg);
+
+ spin_unlock_irq(q->queue_lock);
+
+ /* Task has been put on a wait queue */
+ io_schedule();
+
+ /*
+ * Task has woken up. Finish wait, drop reference to the group.
+ */
+ spin_lock_irq(q->queue_lock);
+ finish_wait(&tg->wait[rw], &wait);
+
+ /* Drop task reference on group */
+ throtl_put_tg(tg);
+out:
+ spin_unlock_irq(q->queue_lock);
+ return;
+}
+
+void
+blk_throtl_dirty_pages(struct address_space *mapping, unsigned long nr_dirty)
+{
+ struct request_queue *q;
+ struct block_device *bdev;
+
+ bdev = mapping->host->i_sb->s_bdev;
+ if (!bdev)
+ return;
+ q = bdev_get_queue(bdev);
+
+ if (unlikely(!q))
+ return;
+
+ if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+ return;
+
+ __blk_throtl_dirty_pages(q, nr_dirty);
+}
+
+
int blk_throtl_init(struct request_queue *q)
{
struct throtl_data *td;
--
1.7.4.4
Now WRITEs are not throttled at the device level except those marked as
direct IO. I have lifted this patch from Andrea Righi's previously posted
series on lkml.
This means that any filesystem-internal WRITEs will also not be throttled.
If need be, the situation can be improved down the line if the filesystem
can give a hint to the throttling logic that it is ok to throttle certain
WRITEs without the risk of serialization.
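The device-level check then reduces to something like the following
(sketch, condensed from this patch):

  /* In blk_throtl_bio(): buffered writes were already throttled while
   * entering the page cache, so only throttle direct IO writes here. */
  if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT_IO))
          return 0;

with fs/direct-io.c tagging its bios via:

  bio->bi_rw |= REQ_DIRECT_IO;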
Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 8 ++++++++
fs/direct-io.c | 1 +
include/linux/blk_types.h | 2 ++
3 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index be3b4b4..c6b9d76 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1293,6 +1293,14 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}
/*
+ * WRITE bios are throttled here only if they are direct IO. Otherwise
+ * they are coming from the page cache and must already have been
+ * subjected to throttling rules while the process was writing to it.
+ */
+ if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT_IO))
+ return 0;
+
+ /*
* A throtl_grp pointer retrieved under rcu can be used to access
* basic fields like stats and io rates. If a group has no rules,
* just update the dispatch stats in lockless manner and return.
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ac5f164..7858828 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -361,6 +361,7 @@ static void dio_bio_submit(struct dio *dio)
unsigned long flags;
bio->bi_private = dio;
+ bio->bi_rw |= REQ_DIRECT_IO;
spin_lock_irqsave(&dio->bio_lock, flags);
dio->refcount++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2a7cea5..f46e8a4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -131,6 +131,7 @@ enum rq_flag_bits {
__REQ_RAHEAD, /* read ahead, can fail anytime */
__REQ_THROTTLED, /* This bio has already been subjected to
* throttling rules. Don't do it again. */
+ __REQ_DIRECT_IO, /* It is a direct IO operation */
/* request only flags */
__REQ_SORTED, /* elevator knows about this request */
@@ -172,6 +173,7 @@ enum rq_flag_bits {
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
#define REQ_THROTTLED (1 << __REQ_THROTTLED)
+#define REQ_DIRECT_IO (1 << __REQ_DIRECT_IO)
#define REQ_SORTED (1 << __REQ_SORTED)
#define REQ_SOFTBARRIER (1 << __REQ_SOFTBARRIER)
--
1.7.4.4
Put the blk_throtl_dirty_pages() hook in
balance_dirty_pages_ratelimited_nr() to enable task throttling.
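The hook sits at the very top of the ratelimiting path, roughly (sketch,
condensed from this patch):

  void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                          unsigned long nr_pages_dirtied)
  {
          /* Subject buffered writes to IO controller throttling before
           * the usual writeback ratelimiting below. */
          blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
          ...
  }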
Signed-off-by: Vivek Goyal <[email protected]>
---
include/linux/blkdev.h | 5 +++++
mm/page-writeback.c | 3 +++
2 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4ce6e68..5d4a57e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1180,12 +1180,17 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern void blk_throtl_dirty_pages(struct address_space *mapping,
+ unsigned long nr_dirty);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
{
return 0;
}
+static inline void blk_throtl_dirty_pages(struct address_space *mapping,
+ unsigned long nr_dirty) {}
+
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
#endif /* CONFIG_BLK_DEV_THROTTLING */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31f6988..943e551 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -629,6 +629,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long ratelimit;
unsigned long *p;
+ /* Subject writes to IO controller throttling */
+ blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
+
ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
ratelimit = 8;
--
1.7.4.4
On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> Hi,
>
> This is V2 of the patches. First version is posted here.
>
> https://lkml.org/lkml/2011/6/3/375
>
> There are no changes from first version except that I have rebased it to
> for-3.1/core branch of Jens's block tree.
>
> I have been trying to find ways to solve two problems with block IO controller
> cgroups.
>
> - Current throttling logic in IO controller does not throttle buffered WRITES.
> Well it does throttle all the WRITEs at device and by that time buffered
> WRITE have lost the submitter's context and most of the IO comes in flusher
> thread's context at device. Hence currently buffered write throttling is
> not supported.
>
> - All WRITEs are throttled at device level and this can easily lead to
> filesystem serialization.
>
> One simple example is that if a process writes some pages to cache and
> then does fsync(), and process gets throttled then it locks up the
> filesystem. With ext4, I noticed that even a simple "ls" does not make
> progress. The reason boils down to the fact that filesystems are not
> aware of cgroups and one of the things which get serialized is journalling
> in ordered mode.
>
> So even if we do something to carry submitter's cgroup information
> to device and do throttling there, it will lead to serialization of
> filesystems and is not a good idea.
>
> So how to go about fixing it. There seem to be two options.
>
> - Throttling should still be done at device level. Make filesystems aware
> of cgroups so that multiple transactions can make progress in parallel
> (per cgroup) and there are no shared resources across cgroups in
> filesystems which can lead to serialization.
>
> - Throttle WRITEs while they are entering the cache and not after that.
> Something like balance_dirty_pages(). Direct IO is still throttled
> at device level. That way, we can avoid these journalling related
> serialization issues w.r.t trottling.
I think that O_DIRECT WRITEs can hit the same serialization problem if
we throttle them at the device level.
Have you tried to do some tests? (i.e. create multiple cgroups with very
low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
time "ls" or other simple commands from the root cgroup or unlimited
cgroup).
If we hit the same serialization problem, I think we should do something
similar for O_DIRECT WRITEs as well (e.g., throttle them at the VFS layer),
as a temporary solution.
The best solution is always to address this problem at the filesystem
layer (option 1), but it's a *huge* change, because all the filesystems
need to be redesigned to be cgroup-aware. For now the temporary solution
could help at least to avoid system lockups while doing large O_DIRECT
writes from I/O-limited cgroups.
Thanks,
-Andrea
>
> But the big issue with this approach is that we control the IO rate
> entering into the cache and not IO rate at the device. That way it
> can happen that flusher later submits lots of WRITEs to device and
> we will see a periodic IO spike on end node.
>
> So this mechanism helps a bit but is not the complete solution. It
> can primarily help those folks which have the system resources and
> plenty of IO bandwidth available but they don't want to give it to
> customer because it is not a premium customer etc.
>
> Option 1 seem to be really hard to fix. Filesystems have not been written
> keeping cgroups in mind. So I am really skeptical that I can convince file
> system designers to make fundamental changes in filesystems and journalling
> code to make them cgroup aware.
>
> Hence with this patch series I have implemented option 2. Option 2 is not
> the best solution but atleast it gives us some control then not having any
> control on buffered writes. Andrea Righi did similar patches in the past
> here.
>
> https://lkml.org/lkml/2011/2/28/115
>
> This patch series had issues w.r.t to interaction between bio and task
> throttling, so I redid it.
>
> Design
> ------
>
> IO controller already has the capability to keep track of IO rates of
> a group and enqueue the bio in internal queues if group exceeds the
> rate and dispatch these bios later.
>
> This patch series also introduce the capability to throttle a dirtying
> task in balance_dirty_pages_ratelimited_nr(). Now no WRITES except
> direct WRITES will be throttled at device level. If a dirtying task
> exceeds its configured IO rate, it is put on a group wait queue and
> woken up when it can dirty more pages.
>
> No new interface has been introduced and both direct IO as well as buffered
> IO make use of common IO rate limit.
>
> How To
> =====
> - Create a cgroup and limit it to 1MB/s for writes.
> echo "8:16 1024000" > /cgroup/blk/test1/blkio.throttle.write_bps_device
>
> - Launch dd thread in the cgroup
> dd if=/dev/zero of=zerofile bs=4K count=1K
>
> 1024+0 records in
> 1024+0 records out
> 4194304 bytes (4.2 MB) copied, 4.00428 s, 1.0 MB/s
>
> Any feedback is welcome.
>
> Thanks
> Vivek
>
> Vivek Goyal (8):
> blk-throttle: convert wait routines to return jiffies to wait
> blk-throttle: do not enforce first queued bio check in
> tg_wait_dispatch
> blk-throttle: use io size and direction as parameters to wait
> routines
> blk-throttle: specify number of ios during dispatch update
> blk-throttle: get rid of extend slice trace message
> blk-throttle: core logic to throttle task while dirtying pages
> blk-throttle: do not throttle writes at device level except direct io
> blk-throttle: enable throttling of task while dirtying pages
>
> block/blk-cgroup.c | 6 +-
> block/blk-cgroup.h | 2 +-
> block/blk-throttle.c | 506 +++++++++++++++++++++++++++++++++++---------
> block/cfq-iosched.c | 2 +-
> block/cfq.h | 6 +-
> fs/direct-io.c | 1 +
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 5 +
> mm/page-writeback.c | 3 +
> 9 files changed, 421 insertions(+), 112 deletions(-)
>
> --
> 1.7.4.4
On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > Hi,
> >
> > This is V2 of the patches. First version is posted here.
> >
> > https://lkml.org/lkml/2011/6/3/375
> >
> > There are no changes from first version except that I have rebased it to
> > for-3.1/core branch of Jens's block tree.
> >
> > I have been trying to find ways to solve two problems with block IO controller
> > cgroups.
> >
> > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > Well it does throttle all the WRITEs at device and by that time buffered
> > WRITE have lost the submitter's context and most of the IO comes in flusher
> > thread's context at device. Hence currently buffered write throttling is
> > not supported.
> >
> > - All WRITEs are throttled at device level and this can easily lead to
> > filesystem serialization.
> >
> > One simple example is that if a process writes some pages to cache and
> > then does fsync(), and process gets throttled then it locks up the
> > filesystem. With ext4, I noticed that even a simple "ls" does not make
> > progress. The reason boils down to the fact that filesystems are not
> > aware of cgroups and one of the things which get serialized is journalling
> > in ordered mode.
> >
> > So even if we do something to carry submitter's cgroup information
> > to device and do throttling there, it will lead to serialization of
> > filesystems and is not a good idea.
> >
> > So how to go about fixing it. There seem to be two options.
> >
> > - Throttling should still be done at device level. Make filesystems aware
> > of cgroups so that multiple transactions can make progress in parallel
> > (per cgroup) and there are no shared resources across cgroups in
> > filesystems which can lead to serialization.
> >
> > - Throttle WRITEs while they are entering the cache and not after that.
> > Something like balance_dirty_pages(). Direct IO is still throttled
> > at device level. That way, we can avoid these journalling related
> > serialization issues w.r.t trottling.
>
> I think that O_DIRECT WRITEs can hit the same serialization problem if
> we throttle them at device level.
I think it can, but the number of cases probably comes down significantly.
One of the main problems seems to be the sync-related variants (sync/fsync
etc.), and I think we do not make any guarantees for inflight requests (not
completed yet).
So it will boil down to how dependent these sync primitives are on inflight
direct WRITEs. I did basic testing with ext4 and it looked fine. On XFS,
sync gets blocked behind inflight direct writes. I raised that issue last
time and it looks like Christoph has plans to do something about it.
So currently my understanding is that the dependency on direct writes might
not be a major issue in practice (unless there is more to it that I am not
aware of).
>
> Have you tried to do some tests? (i.e. create multiple cgroups with very
> low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> time "ls" or other simple commands from the root cgroup or unlimited
> cgroup).
I did. On ext4, I created a cgroup with a limit of 1 byte per second,
started a direct write, and did "ls", "sync" and some directory traversal
operations in the same directory, and it seems to work.
>
> If we hit the same serialization problem I think we should do something
> similar also for O_DIRECT WRITEs (e.g, throttle them at the VFS layer),
> as a temporary solution.
Yep, we could do that if need be. In fact I was thinking of creating a
switch so that a user can choose to throttle IO either at the device level
or at the page cache level.
>
> The best solution is always to address this problem at the filesystem
> layer (option 1), but it's a *huge* change, because all the filesystems
> need to be redesigned to be cgroup-aware. For now the temporary solution
> could help at least to avoid system lockups while doing large O_DIRECT
> writes from I/O-limited cgroups.
Yep, handling it at the filesystem level is the best solution, but so far
I have not seen any positive response on that front from filesystem
developers. Dave Chinner, though, seemed open to the idea of associating
one allocation group with one cgroup and bringing some cgroup awareness
into the filesystem. But that is just one.
It is just a 300-line simple change, and we can always change it if
filesystems ever decide to be cgroup aware and prefer write throttling at
the device level rather than at the page cache level.
I raised the buffered write issue at LSF this year, and at least the
feedback there was that we need to throttle buffered writes at the time
they enter the page cache.
Thanks
Vivek
On Tue, Jun 28, 2011 at 01:06:24PM -0400, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is V2 of the patches. First version is posted here.
> > >
> > > https://lkml.org/lkml/2011/6/3/375
> > >
> > > There are no changes from first version except that I have rebased it to
> > > for-3.1/core branch of Jens's block tree.
> > >
> > > I have been trying to find ways to solve two problems with block IO controller
> > > cgroups.
> > >
> > > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > > Well it does throttle all the WRITEs at device and by that time buffered
> > > WRITE have lost the submitter's context and most of the IO comes in flusher
> > > thread's context at device. Hence currently buffered write throttling is
> > > not supported.
> > >
> > > - All WRITEs are throttled at device level and this can easily lead to
> > > filesystem serialization.
> > >
> > > One simple example is that if a process writes some pages to cache and
> > > then does fsync(), and process gets throttled then it locks up the
> > > filesystem. With ext4, I noticed that even a simple "ls" does not make
> > > progress. The reason boils down to the fact that filesystems are not
> > > aware of cgroups and one of the things which get serialized is journalling
> > > in ordered mode.
> > >
> > > So even if we do something to carry submitter's cgroup information
> > > to device and do throttling there, it will lead to serialization of
> > > filesystems and is not a good idea.
> > >
> > > So how to go about fixing it. There seem to be two options.
> > >
> > > - Throttling should still be done at device level. Make filesystems aware
> > > of cgroups so that multiple transactions can make progress in parallel
> > > (per cgroup) and there are no shared resources across cgroups in
> > > filesystems which can lead to serialization.
> > >
> > > - Throttle WRITEs while they are entering the cache and not after that.
> > > Something like balance_dirty_pages(). Direct IO is still throttled
> > > at device level. That way, we can avoid these journalling related
> > > serialization issues w.r.t trottling.
> >
> > I think that O_DIRECT WRITEs can hit the same serialization problem if
> > we throttle them at device level.
>
> I think it can but number of cases probably comes down significantly. One
> of the main problems seems to be sync related variants sync/fsync etc.
> And I think we do not make any gurantees for inflight requests
> (not completed yet).
>
> So it will boil down to how dependent these sync primitives are on
> inflight direct WRITEs. I did basic testing with ext4 and it looked fine.
> On XFS, sync gets blocked behind inflight direct writes. Last time I
> raised that issue and looks like Christoph has plans to do something
> about it.
>
> So currently my understanding is that dependency on direct writes might
> not be a major issue in practice. (Until and unless there is more to
> it I am not aware about).
OK, I was asking because I remember having seen some problems with my
old io-throttle controller in the presence of many O_DIRECT writes.
I'll repeat those tests with this patch set as well.
>
> >
> > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > time "ls" or other simple commands from the root cgroup or unlimited
> > cgroup).
>
> I did. On ext4, I created a cgroup with limit 1byte per second and
> started a direct write and did "ls", "sync" and some directory traversal
> operations in same directory and it seems to work.
Good.
>
> >
> > If we hit the same serialization problem I think we should do something
> > similar also for O_DIRECT WRITEs (e.g, throttle them at the VFS layer),
> > as a temporary solution.
>
> Yep, we could do that if need be. In fact I was thinking of creating
> a switch so that a user can also choose to throttle IO either at
> device level or page cache level.
I think it would be great to have this switch.
Throttling at the VFS level would probably have "granularity" problems. If a
task performs a large WRITE, the only thing we can do is put the task to
sleep for a long time, and when the timer expires the whole large WRITE
will be submitted to the block layer at once - something like
the I/O spike issue with writeback I/O...
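One way to mitigate that could be to split a large buffered write into
smaller throttling units before sleeping, so that the wakeups (and the
writeback that follows) are spread out. A purely hypothetical sketch, not
part of this patch set (the chunk size is arbitrary and
throttle_dirty_in_chunks() does not exist anywhere):

	#define THROTL_CHUNK_PAGES	32

	/* charge/sleep for a small chunk of dirtied pages at a time */
	static void throttle_dirty_in_chunks(struct address_space *mapping,
					     unsigned long nr_dirty)
	{
		while (nr_dirty) {
			unsigned long chunk = min_t(unsigned long, nr_dirty,
						    THROTL_CHUNK_PAGES);

			blk_throtl_dirty_pages(mapping, chunk);
			nr_dirty -= chunk;
		}
	}

Anyway, just an idea - it would only smooth the burstiness, not remove it.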
>
> >
> > The best solution is always to address this problem at the filesystem
> > layer (option 1), but it's a *huge* change, because all the filesystems
> > need to be redesigned to be cgroup-aware. For now the temporary solution
> > could help at least to avoid system lockups while doing large O_DIRECT
> > writes from I/O-limited cgroups.
>
> Yep, handling it at file system level is the best solution but so far
> I have not seen any positive response on that front from filesystem
> developers. Dave Chinner though seemed open to the idea of associating
> one allocation group to one cgroup and bring some filesystem awareness
> in filesystem. But that is just one.
>
> It is just 300 lines of simple change and we can always change it if
> filesystems ever decide to be cgroup aware and prefer write throttling
> at device level and not at page cache level.
>
> I had raised buffered write issue at LSF this year and atleast there
> feedback was that we need to throttle buffered writes at the time of
> entering page cache.
Yes, it seems the best option right now.
-Andrea
On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> Hi,
>
> This is V2 of the patches. First version is posted here.
>
> https://lkml.org/lkml/2011/6/3/375
>
> There are no changes from first version except that I have rebased it to
> for-3.1/core branch of Jens's block tree.
>
> I have been trying to find ways to solve two problems with block IO controller
> cgroups.
>
> - Current throttling logic in IO controller does not throttle buffered WRITES.
> Well it does throttle all the WRITEs at device and by that time buffered
> WRITE have lost the submitter's context and most of the IO comes in flusher
> thread's context at device. Hence currently buffered write throttling is
> not supported.
This problem is being solved in a different manner - by making the
bdi-flusher writeback cgroup aware.
That is, writeback will be done in the context of the cgroup that
dirtied the inode in the first place. Hence writeback will be done
in the context that the existing block throttle can understand
without modification.
And with cgroup-aware throttling in balance-dirty-pages (also part of
the same piece of work), we get the throttling based
on dirty memory usage of the cgroup and the rate at which the
bdi-flusher for the cgroup can clean pages. This is directly related
to the block throttle configuration of the specific cgroup....
There are already prototypes of this infrastructure being written,
and we are currently waiting on the IO-less dirty throttling to be
merged before moving forward with it.
There is still one part missing, though - a necessary precursor to
this is that we need a bdi flush context per cgroup so we don't get
flushing of one cgroup blocking the flushing of another on the same
bdi. The easiest way to do this is to convert the bdi-flusher
threads to use workqueues. We can then easily extend the flush
context to be per-cgroup without an explosion of threads and the
management problems that would introduce.....
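Conceptually it ends up looking something like the following - a very
rough sketch only, with hypothetical types and helpers (the real
prototypes are structured differently):

	/* one flush work item per (bdi, cgroup) pair, not one thread per bdi */
	struct cgroup_flush_work {
		struct work_struct	work;
		struct backing_dev_info	*bdi;
		struct cgroup		*cgrp;	/* cgroup whose dirty inodes we flush */
	};

	static void cgroup_flush_fn(struct work_struct *work)
	{
		struct cgroup_flush_work *fw =
			container_of(work, struct cgroup_flush_work, work);

		/* hypothetical helper: write back only inodes dirtied by fw->cgrp */
		writeback_inodes_for_cgroup(fw->bdi, fw->cgrp);
	}

	static void queue_cgroup_flush(struct cgroup_flush_work *fw)
	{
		INIT_WORK(&fw->work, cgroup_flush_fn);
		/* the workqueue provides concurrency without a thread per cgroup */
		queue_work(system_unbound_wq, &fw->work);
	}

That way flushing for one cgroup doesn't have to block behind flushing
for another cgroup on the same bdi.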
> - All WRITEs are throttled at device level and this can easily lead to
> filesystem serialization.
>
> One simple example is that if a process writes some pages to cache and
> then does fsync(), and process gets throttled then it locks up the
> filesystem. With ext4, I noticed that even a simple "ls" does not make
> progress. The reason boils down to the fact that filesystems are not
> aware of cgroups and one of the things which get serialized is journalling
> in ordered mode.
As I've said before - solving this problem is a filesystem problem,
not a throttling or cgroup infrastructure problem...
> So even if we do something to carry submitter's cgroup information
> to device and do throttling there, it will lead to serialization of
> filesystems and is not a good idea.
>
> So how to go about fixing it. There seem to be two options.
>
> - Throttling should still be done at device level.
Yes, that is the way it should be done - all types of IO should be
throttled in the one place by the same mechanism.
> Make filesystems aware
> of cgroups so that multiple transactions can make progress in parallel
> (per cgroup) and there are no shared resources across cgroups in
> filesystems which can lead to serialization.
How a specific filesystem solves this problem (if indeed it is a problem)
needs to be dealt with on a per-filesystem basis.
> - Throttle WRITEs while they are entering the cache and not after that.
> Something like balance_dirty_pages(). Direct IO is still throttled
> at device level. That way, we can avoid these journalling related
> serialization issues w.r.t trottling.
>
> But the big issue with this approach is that we control the IO rate
> entering into the cache and not IO rate at the device. That way it
> can happen that flusher later submits lots of WRITEs to device and
> we will see a periodic IO spike on end node.
>
> So this mechanism helps a bit but is not the complete solution. It
> can primarily help those folks which have the system resources and
> plenty of IO bandwidth available but they don't want to give it to
> customer because it is not a premium customer etc.
As I said earlier - the cgroup-aware bdi-flushing infrastructure
solves this problem directly inside balance_dirty_pages. i.e. the
bdi flusher variant integrates much more cleanly with the way the MM
and writeback subsystems work, and it also works even when block layer
throttling is not being used at all.
If we weren't doing cgroup-aware bdi writeback and IO-less
throttling, then this block throttle method would probably be a good
idea. However, I think we have a more integrated solution already
designed and slowly being implemented....
> Option 1 seem to be really hard to fix. Filesystems have not been written
> keeping cgroups in mind. So I am really skeptical that I can convince file
> system designers to make fundamental changes in filesystems and journalling
> code to make them cgroup aware.
Again, you're assuming that cgroup-awareness is the solution to the
filesystem problem and that filesystems will require fundamental
changes. Some may, but different filesystems will require different
types of changes to work in this environment.
FYI, filesystem development cycles are slow and engineers are
conservative because of the absolute requirement for data integrity.
Hence we tend to focus development on problems that users are
reporting (i.e. known pain points) or functionality they have
requested.
In this case, block throttling works OK on most filesystems out of
the box, but it has some known problems. If there are people out
there hitting these known problems then they'll report them, we'll
hear about them and they'll eventually get fixed.
However, if no-one is reporting problems related to block throttling
then it either works well enough for the existing user base or
nobody is using the functionality. Either way we don't need to spend
time on optimising the filesystem for such functionality.
So while you may be skeptical about whether filesystems will be
changed, it really comes down to behaviour in real-world
deployments. If what we already have is good enough, then we don't
need to spend resources on fixing problems no-one is seeing...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jun 29, 2011 at 10:42:19AM +1000, Dave Chinner wrote:
> On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > Hi,
> >
> > This is V2 of the patches. First version is posted here.
> >
> > https://lkml.org/lkml/2011/6/3/375
> >
> > There are no changes from first version except that I have rebased it to
> > for-3.1/core branch of Jens's block tree.
> >
> > I have been trying to find ways to solve two problems with block IO controller
> > cgroups.
> >
> > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > Well it does throttle all the WRITEs at device and by that time buffered
> > WRITE have lost the submitter's context and most of the IO comes in flusher
> > thread's context at device. Hence currently buffered write throttling is
> > not supported.
>
> This problem is being solved in a different manner - by making the
> bdi-flusher writeback cgroup aware.
That's good. I am looking forward to that work; it should make things
better for blkio cgroups in general.
>
> That is, writeback will be done in the context of the cgroup that
> dirtied the inode in the first place. Hence writeback will be done
> in the context that the existing block throttle can understand
> without modification.
>
> And with cgroup-aware throttling in balance-dirty-pages (also part of
> the same piece of work), we get the throttling based
> on dirty memory usage of the cgroup and the rate at which the
> bdi-flusher for the cgroup can clean pages. This is directly related
> to the block throttle configuration of the specific cgroup....
>
> There are already prototpyes for this infrastructure been written,
> and we are currently waiting on the IO-less dirty throttling to be
> merged before moving forward with it.
>
> There is still one part missing, though - a necessary precursor to
> this is that we need a bdi flush context per cgroup so we don't get
> flushing of one cgroup blocking the flushing of another on the same
> bdi. The easiest way to do this is to convert the bdi-flusher
> threads to use workqueues. We can then easily extend the flush
> context to be per-cgroup without an explosion of threads and the
> management problems that would introduce.....
Agreed.
[..]
> > - Throttle WRITEs while they are entering the cache and not after that.
> > Something like balance_dirty_pages(). Direct IO is still throttled
> > at device level. That way, we can avoid these journalling related
> > serialization issues w.r.t trottling.
> >
> > But the big issue with this approach is that we control the IO rate
> > entering into the cache and not IO rate at the device. That way it
> > can happen that flusher later submits lots of WRITEs to device and
> > we will see a periodic IO spike on end node.
> >
> > So this mechanism helps a bit but is not the complete solution. It
> > can primarily help those folks which have the system resources and
> > plenty of IO bandwidth available but they don't want to give it to
> > customer because it is not a premium customer etc.
>
> As I said earlier - the cgroup aware bdi-flushing infrastructure
> solves this problem directly inside balance_dirty_pages. i.e. The
> bdi flusher variant integrates much more cleanly with the way the MM
> and writeback subsystems work and also work even when block layer
> throttling is not being used at all.
>
> If we weren't doing cgroup-aware bdi writeback and IO-less
> throttling, then this block throttle method would probably be a good
> idea. However, I think we have a more integrated solution already
> designed and slowly being implemented....
>
> > Option 1 seem to be really hard to fix. Filesystems have not been written
> > keeping cgroups in mind. So I am really skeptical that I can convince file
> > system designers to make fundamental changes in filesystems and journalling
> > code to make them cgroup aware.
>
> Again, you're assuming that cgroup-awareness is the solution to the
> filesystem problem and that filesystems will require fundamental
> changes. Some may, but different filesystems will require different
> types of changes to work in this environment.
>
So for the case of ext3/ext4 at least, the journaling design seems to be
a fundamental problem (my example of an fsync in a slow cgroup serializing
everything behind it), and per-bdi per-cgroup flusher changes are not going
to address it.
So how would we go about handling this?
- We change the journaling design to fix it.
- It is not a problem, and I should simply ask people to use throttling
with ext3/ext4 as is.
- It is a problem, but a hard one to solve, so we should ask users to
just live with it.
> FYI, filesystem development cycles are slow and engineers are
> conservative because of the absolute requirement for data integrity.
> Hence we tend to focus development on problems that users are
> reporting (i.e. known pain points) or functionality they have
> requested.
>
> In this case, block throttling works OK on most filesystems out of
> the box, but it has some known problems. If there are people out
> there hitting these known problems then they'll report them, we'll
> hear about them and they'll eventually get fixed.
>
> However, if no-one is reporting problems related to block throttling
> then it either works well enough for the existing user base or
> nobody is using the functionality. Either way we don't need to spend
> time on optimising the filesystem for such functionality.
>
> So while you may be skeptical about whether filesystems will be
> changed, it really comes down to behaviour in real-world
> deployments. If what we already have is good enough, then we don't
> need to spend resources on fixing problems no-one is seeing...
Ok. I was kind of being proactive and wanted to bring this issue forward
now, before I really ask my customers to deploy throttling and then, two
months down the line, they come back either with long delays in dependent
operations or with filesystem scalability problems.
But it looks like you prefer to hear from other users before this can be
considered a problem.
Thanks
Vivek
On Tue, Jun 28, 2011 at 11:35:07AM -0400, Vivek Goyal wrote:
...
> + /*
> + * We dispatched one task. Set the charge for other queued tasks,
> + * if any.
> + */
> + tg_set_active_task_charge(tg, rw);
> + throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%u sz=%u bps=%llu"
> + " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
> + rw == READ ? 'R' : 'W', tg->bytes_disp[rw], sz,
> + tg->bps[rw], tg->io_disp[rw], tg->iops[rw],
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
A small nitpick:
tg->bytes_disp[rw] is uint64_t, so we should use bdisp=%llu.
> +}
> +
> +static void tg_switch_active_dispatch(struct throtl_data *td,
> + struct throtl_grp *tg, bool rw)
> +{
> + unsigned int nr_tasks = tg->nr_queued_tsk[rw];
> + unsigned int nr_bios = tg->nr_queued_bio[rw];
> + enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
> +
> + /* Nothing queued. Whoever gets queued next first, sets dispatch type */
> + if (!nr_bios && !nr_tasks)
> + return;
> +
> + if (curr_dispatch == DISPATCH_BIO && nr_tasks) {
> + tg->active_dispatch[rw] = DISPATCH_TASK;
> + return;
> + }
> +
> + if (curr_dispatch == DISPATCH_TASK && nr_bios)
> + tg->active_dispatch[rw] = DISPATCH_BIO;
> +}
> +
> +static void tg_update_active_dispatch(struct throtl_data *td,
> + struct throtl_grp *tg, bool rw)
> +{
> + unsigned int nr_tasks = tg->nr_queued_tsk[rw];
> + unsigned int nr_bios = tg->nr_queued_bio[rw];
> + enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
> +
> + BUG_ON(nr_bios < 0 || nr_tasks < 0);
> +
> + if (curr_dispatch == DISPATCH_BIO && !nr_bios) {
> + tg->active_dispatch[rw] = DISPATCH_TASK;
> + return;
> }
>
> + if (curr_dispatch == DISPATCH_TASK && !nr_tasks)
> + tg->active_dispatch[rw] = DISPATCH_BIO;
> +}
> +
> +static int throtl_dispatch_tg_rw(struct throtl_data *td, struct throtl_grp *tg,
> + bool rw, struct bio_list *bl, unsigned int max)
> +{
> + unsigned int nr_disp = 0;
> +
> + if (tg->active_dispatch[rw] == DISPATCH_BIO)
> + nr_disp = throtl_dispatch_tg_bio(td, tg, rw, bl, max);
> + else
> + /* Only number of bios dispatched is kept track of here */
> + throtl_dispatch_tg_task(td, tg, rw, max);
> +
> + tg_switch_active_dispatch(td, tg, rw);
> + return nr_disp;
> +}
> +
> +static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> + struct bio_list *bl)
> +{
> + /* Try to dispatch 75% READS and 25% WRITES */
> + unsigned int nr_reads = 0, nr_writes = 0;
> + unsigned int max_nr_reads = throtl_grp_quantum*3/4;
> + unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
> +
> + nr_reads = throtl_dispatch_tg_rw(td, tg, READ, bl, max_nr_reads);
> + nr_writes = throtl_dispatch_tg_rw(td, tg, WRITE, bl, max_nr_writes);
> +
> return nr_reads + nr_writes;
> }
>
> +static bool tg_should_requeue(struct throtl_grp *tg)
> +{
> + /* If there are queued bios, requeue */
> + if (tg->nr_queued_bio[0] || tg->nr_queued_bio[1])
> + return 1;
> +
> + /* If there are queued tasks reueue */
> + if (tg->nr_queued_tsk[0] || tg->nr_queued_tsk[1])
> + return 1;
> +
> + return 0;
> +}
> +
> static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
> {
> unsigned int nr_disp = 0;
> @@ -832,7 +1016,7 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
>
> nr_disp += throtl_dispatch_tg(td, tg, bl);
>
> - if (tg->nr_queued[0] || tg->nr_queued[1]) {
> + if (tg_should_requeue(tg)) {
> tg_update_disptime(td, tg);
> throtl_enqueue_tg(td, tg);
> }
> @@ -899,9 +1083,9 @@ static int throtl_dispatch(struct request_queue *q)
>
> bio_list_init(&bio_list_on_stack);
>
> - throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
> - total_nr_queued(td), td->nr_queued[READ],
> - td->nr_queued[WRITE]);
> + throtl_log(td, "dispatch bioq=%u/%u tskq=%u/%u",
> + td->nr_queued_bio[READ], td->nr_queued_bio[WRITE],
> + td->nr_queued_tsk[READ], td->nr_queued_tsk[WRITE]);
>
> nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
>
> @@ -1122,7 +1306,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
>
> if (tg_no_rule_group(tg, rw)) {
> blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
> - rw, bio->bi_rw & REQ_SYNC);
> + 1, rw, bio->bi_rw & REQ_SYNC);
> rcu_read_unlock();
> return 0;
> }
> @@ -1146,14 +1330,14 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> }
> }
>
> - if (tg->nr_queued[rw]) {
> + /* If there are already queued bio or task in same dir, queue bio */
> + if (tg->nr_queued_bio[rw] || tg->nr_queued_tsk[rw]) {
> /*
> - * There is already another bio queued in same dir. No
> + * There is already another bio/task queued in same dir. No
> * need to update dispatch time.
> */
> update_disptime = false;
> goto queue_bio;
> -
> }
>
> /* Bio is with-in rate limit of group */
> @@ -1178,16 +1362,18 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
>
> queue_bio:
> throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
> - " iodisp=%u iops=%u queued=%d/%d",
> + " iodisp=%u iops=%u bioq=%u/%u taskq=%u/%u",
> rw == READ ? 'R' : 'W',
> tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
> tg->io_disp[rw], tg->iops[rw],
> - tg->nr_queued[READ], tg->nr_queued[WRITE]);
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
>
> throtl_add_bio_tg(q->td, tg, bio);
> *biop = NULL;
>
> if (update_disptime) {
> + tg_update_active_dispatch(td, tg, rw);
> tg_update_disptime(td, tg);
> throtl_schedule_next_dispatch(td);
> }
> @@ -1197,6 +1383,137 @@ out:
> return 0;
> }
>
> +static void
> +__blk_throtl_dirty_pages(struct request_queue *q, unsigned long nr_dirty)
> +{
> + struct throtl_data *td = q->td;
> + struct throtl_grp *tg;
> + struct blkio_cgroup *blkcg;
> + bool rw = WRITE, update_disptime = true, first_task = false;
> + unsigned int sz = nr_dirty << PAGE_SHIFT;
> + DEFINE_WAIT(wait);
> +
> + /*
> + * A throtl_grp pointer retrieved under rcu can be used to access
> + * basic fields like stats and io rates. If a group has no rules,
> + * just update the dispatch stats in lockless manner and return.
> + */
> +
> + rcu_read_lock();
> + blkcg = task_blkio_cgroup(current);
> + tg = throtl_find_tg(td, blkcg);
> + if (tg) {
> + throtl_tg_fill_dev_details(td, tg);
> +
> + if (tg_no_rule_group(tg, rw)) {
> + blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_dirty,
> + rw, 0);
> + rcu_read_unlock();
> + return;
> + }
> + }
> + rcu_read_unlock();
> +
> + spin_lock_irq(q->queue_lock);
> +
> + tg = throtl_get_tg(td);
> +
> + /* Queue is gone. No queue lock held here. */
> + if (IS_ERR(tg))
> + return;
> +
> + tg->unaccounted_dirty += nr_dirty;
> +
> + /* If there are already queued task, put this task also on waitq */
> + if (tg->nr_queued_tsk[rw]) {
> + update_disptime = false;
> + goto queue_task;
> + } else
> + first_task = true;
> +
> + /* If there are bios already throttled in same dir, queue task */
> + if (!bio_list_empty(&tg->bio_lists[rw])) {
> + update_disptime = false;
> + goto queue_task;
> + }
> +
> + /*
> + * Task is with-in rate limit of group.
> + *
> + * Note: How many IOPS we should charge for this operation. For
> + * the time being I am sticking to number of pages as number of
> + * IOPS.
> + */
> + if (!tg_wait_dispatch(td, tg, rw, sz, nr_dirty)) {
> + throtl_charge_dirty_io(tg, rw, sz, nr_dirty, 0);
> + throtl_trim_slice(td, tg, rw);
> + goto out;
> + }
> +
> +queue_task:
> + throtl_log_tg(td, tg, "[%c] task. bdisp=%u sz=%u bps=%llu"
> + " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
> + rw == READ ? 'R' : 'W',
> + tg->bytes_disp[rw], sz, tg->bps[rw],
> + tg->io_disp[rw], tg->iops[rw],
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
Ditto.
See the fix below.
Thanks,
-Andrea
---
block/blk-throttle.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 623cc05..f94a2a8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -909,7 +909,7 @@ static void throtl_dispatch_tg_task(struct throtl_data *td,
* if any.
*/
tg_set_active_task_charge(tg, rw);
- throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%u sz=%u bps=%llu"
+ throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%llu sz=%u bps=%llu"
" iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
rw == READ ? 'R' : 'W', tg->bytes_disp[rw], sz,
tg->bps[rw], tg->io_disp[rw], tg->iops[rw],
@@ -1459,7 +1459,7 @@ __blk_throtl_dirty_pages(struct request_queue *q, unsigned long nr_dirty)
}
queue_task:
- throtl_log_tg(td, tg, "[%c] task. bdisp=%u sz=%u bps=%llu"
+ throtl_log_tg(td, tg, "[%c] task. bdisp=%llu sz=%u bps=%llu"
" iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
rw == READ ? 'R' : 'W',
tg->bytes_disp[rw], sz, tg->bps[rw],
On Tue, Jun 28, 2011 at 11:35:07AM -0400, Vivek Goyal wrote:
> This is the core logic which enables throttling logic to also
> throttle tasks while dirtying pages. I basically create the
> infrastructure where a wait queue is maintained per group
> and if task exceeds its rate limit, task is put to sleep on
> wait queue and woken up later.
>
> Though currently we throttle tasks for WRITE requests and
> not for READ requests, I have built the logic to support
> task throttling for both direction (similar to bio throttling).
> This will allow us to do any task READ throttling also easily,
> if needed in future.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/blk-throttle.c | 399 ++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 358 insertions(+), 41 deletions(-)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index ec41f6d..be3b4b4 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -32,6 +32,12 @@ struct throtl_rb_root {
> unsigned long min_disptime;
> };
>
> +/* Whether to dispatch a bio or task in next round of a dispatch from a group */
> +enum dispatch_type {
> + DISPATCH_BIO,
> + DISPATCH_TASK,
> +};
> +
> #define THROTL_RB_ROOT (struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \
> .count = 0, .min_disptime = 0}
>
> @@ -59,7 +65,10 @@ struct throtl_grp {
> struct bio_list bio_lists[2];
>
> /* Number of queued bios on READ and WRITE lists */
> - unsigned int nr_queued[2];
> + unsigned int nr_queued_bio[2];
> +
> + /* Number of queued tasks */
> + unsigned int nr_queued_tsk[2];
>
> /* bytes per second rate limits */
> uint64_t bps[2];
> @@ -80,6 +89,21 @@ struct throtl_grp {
> int limits_changed;
>
> struct rcu_head rcu_head;
> +
> + /* READ and WRITE wait lists for task throttle */
> + wait_queue_head_t wait[2];
> +
> + /* Number of unaccounted dirty pages in this group */
> + unsigned int unaccounted_dirty;
> +
> + /* The number of pages to be charged to first task on wait queue */
> + unsigned int active_task_charge[2];
> +
> + /*
> + * Keeps track of whether we will dispatch a bio or task in next
> + * round of dispatch (for each dir)
> + */
> + enum dispatch_type active_dispatch[2];
> };
>
> struct throtl_data
> @@ -94,7 +118,10 @@ struct throtl_data
> struct request_queue *queue;
>
> /* Total Number of queued bios on READ and WRITE lists */
> - unsigned int nr_queued[2];
> + unsigned int nr_queued_bio[2];
> +
> + /* Total Number of queued tasks on READ and WRITE lists */
> + unsigned int nr_queued_tsk[2];
>
> /*
> * number of total undestroyed groups
> @@ -142,9 +169,19 @@ static inline struct throtl_grp *tg_of_blkg(struct blkio_group *blkg)
> return NULL;
> }
>
> +static inline int total_queued_bios(struct throtl_data *td)
> +{
> + return td->nr_queued_bio[0] + td->nr_queued_bio[1];
> +}
> +
> +static inline int total_queued_tasks(struct throtl_data *td)
> +{
> + return td->nr_queued_tsk[0] + td->nr_queued_tsk[1];
> +}
> +
> static inline unsigned int total_nr_queued(struct throtl_data *td)
> {
> - return td->nr_queued[0] + td->nr_queued[1];
> + return total_queued_bios(td) + total_queued_tasks(td);
> }
>
> static inline struct throtl_grp *throtl_ref_get_tg(struct throtl_grp *tg)
> @@ -192,6 +229,9 @@ static void throtl_init_group(struct throtl_grp *tg)
> tg->bps[0] = tg->bps[1] = -1;
> tg->iops[0] = tg->iops[1] = -1;
>
> + init_waitqueue_head(&tg->wait[READ]);
> + init_waitqueue_head(&tg->wait[WRITE]);
> +
> /*
> * Take the initial reference that will be released on destroy
> * This can be thought of a joint reference by cgroup and
> @@ -727,6 +767,14 @@ static void throtl_charge_io(struct throtl_grp *tg, bool rw, unsigned int sz,
> blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_ios, rw, sync);
> }
>
> +static void throtl_charge_dirty_io(struct throtl_grp *tg, bool rw,
> + unsigned int sz, unsigned int nr_ios, bool sync)
> +{
> + tg->unaccounted_dirty -= nr_ios;
> + BUG_ON(tg->unaccounted_dirty < 0);
> + throtl_charge_io(tg, rw, sz, nr_ios, sync);
> +}
> +
> static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
> struct bio *bio)
> {
> @@ -735,21 +783,38 @@ static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
> bio_list_add(&tg->bio_lists[rw], bio);
> /* Take a bio reference on tg */
> throtl_ref_get_tg(tg);
> - tg->nr_queued[rw]++;
> - td->nr_queued[rw]++;
> + tg->nr_queued_bio[rw]++;
> + td->nr_queued_bio[rw]++;
> throtl_enqueue_tg(td, tg);
> }
>
> -static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> +/*
> + * Returns the wait time of first bio or task in the specified direction. If
> + * nothing is queued then wait time is -1.
> + */
> +static unsigned long
> +tg_active_wait_time(struct throtl_data *td, struct throtl_grp *tg, bool rw)
> {
> - unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
> struct bio *bio;
> + unsigned long sz;
>
> - if ((bio = bio_list_peek(&tg->bio_lists[READ])))
> - read_wait = tg_wait_dispatch(td, tg, READ, bio->bi_size, 1);
> + bio = bio_list_peek(&tg->bio_lists[rw]);
> + if (bio && tg->active_dispatch[rw] == DISPATCH_BIO)
> + return tg_wait_dispatch(td, tg, rw, bio->bi_size, 1);
>
> - if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
> - write_wait = tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1);
> + if (!waitqueue_active(&tg->wait[rw]))
> + return -1;
> +
> + sz = tg->active_task_charge[rw] << PAGE_SHIFT;
> + return tg_wait_dispatch(td, tg, rw, sz, tg->active_task_charge[rw]);
> +}
> +
> +static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> +{
> + unsigned long read_wait, write_wait, min_wait, disptime;
> +
> + read_wait = tg_active_wait_time(td, tg, READ);
> + write_wait = tg_active_wait_time(td, tg, WRITE);
>
> min_wait = min(read_wait, write_wait);
> disptime = jiffies + min_wait;
> @@ -760,18 +825,39 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> throtl_enqueue_tg(td, tg);
> }
>
> +/* Calculate how many pages must be charged to a task at this point of time. */
> +static unsigned int tg_calc_charge_pages(struct throtl_grp *tg, bool rw)
> +{
> + unsigned int nr_pages;
> +
> + /* Divide unaccounted number of pages among queued tasks */
> + nr_pages = tg->unaccounted_dirty/tg->nr_queued_tsk[rw];
> +
> + return max_t(unsigned int, nr_pages, 1);
> +}
> +
> +/* set the active task charge for the first task in the queue */
> +static void tg_set_active_task_charge(struct throtl_grp *tg, bool rw)
> +{
> + /* If there are more tasks queued, calculate the charge */
> + if (tg->nr_queued_tsk[rw])
> + tg->active_task_charge[rw] = tg_calc_charge_pages(tg, rw);
> + else
> + tg->active_task_charge[rw] = 0;
> +}
> +
> static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
> bool rw, struct bio_list *bl)
> {
> struct bio *bio;
>
> bio = bio_list_pop(&tg->bio_lists[rw]);
> - tg->nr_queued[rw]--;
> + tg->nr_queued_bio[rw]--;
> /* Drop bio reference on tg */
> throtl_put_tg(tg);
>
> - BUG_ON(td->nr_queued[rw] <= 0);
> - td->nr_queued[rw]--;
> + BUG_ON(td->nr_queued_bio[rw] <= 0);
> + td->nr_queued_bio[rw]--;
>
> throtl_charge_io(tg, rw, bio->bi_size, 1, bio->bi_rw & REQ_SYNC);
> bio_list_add(bl, bio);
> @@ -780,39 +866,137 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
> throtl_trim_slice(td, tg, rw);
> }
>
> -static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> - struct bio_list *bl)
> +static int throtl_dispatch_tg_bio(struct throtl_data *td, struct throtl_grp *tg,
> + bool rw, struct bio_list *bl, unsigned int max_disp)
> {
> - unsigned int nr_reads = 0, nr_writes = 0;
> - unsigned int max_nr_reads = throtl_grp_quantum*3/4;
> - unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
> struct bio *bio;
> + unsigned int nr_disp = 0;
>
> - /* Try to dispatch 75% READS and 25% WRITES */
> -
> - while ((bio = bio_list_peek(&tg->bio_lists[READ]))
> - && !tg_wait_dispatch(td, tg, READ, bio->bi_size, 1)) {
> + while ((bio = bio_list_peek(&tg->bio_lists[rw]))
> + && !tg_wait_dispatch(td, tg, rw, bio->bi_size, 1)) {
>
> tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> - nr_reads++;
> + nr_disp++;
>
> - if (nr_reads >= max_nr_reads)
> + if (nr_disp >= max_disp)
> break;
> }
>
> - while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
> - && !tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1)) {
> + return nr_disp;
> +}
>
> - tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> - nr_writes++;
> +static void throtl_dispatch_tg_task(struct throtl_data *td,
> + struct throtl_grp *tg, bool rw, unsigned int max_disp)
> +{
> + unsigned int sz;
>
> - if (nr_writes >= max_nr_writes)
> - break;
> + if (!waitqueue_active(&tg->wait[rw]))
> + return;
> +
> + /* Wake up a task */
> + wake_up(&tg->wait[rw]);
> +
> + tg->nr_queued_tsk[rw]--;
> + td->nr_queued_tsk[rw]--;
> +
> + /* Charge the woken up task for IO */
> + sz = tg->active_task_charge[rw] << PAGE_SHIFT;
> + throtl_charge_dirty_io(tg, rw, sz, tg->active_task_charge[rw], false);
> + throtl_trim_slice(td, tg, rw);
> +
> + /*
> + * We dispatched one task. Set the charge for other queued tasks,
> + * if any.
> + */
> + tg_set_active_task_charge(tg, rw);
> + throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%u sz=%u bps=%llu"
> + " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
> + rw == READ ? 'R' : 'W', tg->bytes_disp[rw], sz,
> + tg->bps[rw], tg->io_disp[rw], tg->iops[rw],
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
> +}
> +
> +static void tg_switch_active_dispatch(struct throtl_data *td,
> + struct throtl_grp *tg, bool rw)
> +{
> + unsigned int nr_tasks = tg->nr_queued_tsk[rw];
> + unsigned int nr_bios = tg->nr_queued_bio[rw];
> + enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
> +
> + /* Nothing queued. Whoever gets queued next first, sets dispatch type */
> + if (!nr_bios && !nr_tasks)
> + return;
> +
> + if (curr_dispatch == DISPATCH_BIO && nr_tasks) {
> + tg->active_dispatch[rw] = DISPATCH_TASK;
> + return;
> + }
> +
> + if (curr_dispatch == DISPATCH_TASK && nr_bios)
> + tg->active_dispatch[rw] = DISPATCH_BIO;
> +}
> +
> +static void tg_update_active_dispatch(struct throtl_data *td,
> + struct throtl_grp *tg, bool rw)
> +{
> + unsigned int nr_tasks = tg->nr_queued_tsk[rw];
> + unsigned int nr_bios = tg->nr_queued_bio[rw];
> + enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
> +
> + BUG_ON(nr_bios < 0 || nr_tasks < 0);
> +
> + if (curr_dispatch == DISPATCH_BIO && !nr_bios) {
> + tg->active_dispatch[rw] = DISPATCH_TASK;
> + return;
> }
>
> + if (curr_dispatch == DISPATCH_TASK && !nr_tasks)
> + tg->active_dispatch[rw] = DISPATCH_BIO;
> +}
> +
> +static int throtl_dispatch_tg_rw(struct throtl_data *td, struct throtl_grp *tg,
> + bool rw, struct bio_list *bl, unsigned int max)
> +{
> + unsigned int nr_disp = 0;
> +
> + if (tg->active_dispatch[rw] == DISPATCH_BIO)
> + nr_disp = throtl_dispatch_tg_bio(td, tg, rw, bl, max);
> + else
> + /* Only number of bios dispatched is kept track of here */
> + throtl_dispatch_tg_task(td, tg, rw, max);
> +
> + tg_switch_active_dispatch(td, tg, rw);
> + return nr_disp;
> +}
> +
> +static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> + struct bio_list *bl)
> +{
> + /* Try to dispatch 75% READS and 25% WRITES */
> + unsigned int nr_reads = 0, nr_writes = 0;
> + unsigned int max_nr_reads = throtl_grp_quantum*3/4;
> + unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
> +
> + nr_reads = throtl_dispatch_tg_rw(td, tg, READ, bl, max_nr_reads);
> + nr_writes = throtl_dispatch_tg_rw(td, tg, WRITE, bl, max_nr_writes);
> +
> return nr_reads + nr_writes;
> }
>
> +static bool tg_should_requeue(struct throtl_grp *tg)
> +{
> + /* If there are queued bios, requeue */
> + if (tg->nr_queued_bio[0] || tg->nr_queued_bio[1])
> + return 1;
> +
> + /* If there are queued tasks reueue */
> + if (tg->nr_queued_tsk[0] || tg->nr_queued_tsk[1])
> + return 1;
> +
> + return 0;
> +}
> +
> static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
> {
> unsigned int nr_disp = 0;
> @@ -832,7 +1016,7 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
>
> nr_disp += throtl_dispatch_tg(td, tg, bl);
>
> - if (tg->nr_queued[0] || tg->nr_queued[1]) {
> + if (tg_should_requeue(tg)) {
> tg_update_disptime(td, tg);
> throtl_enqueue_tg(td, tg);
> }
> @@ -899,9 +1083,9 @@ static int throtl_dispatch(struct request_queue *q)
>
> bio_list_init(&bio_list_on_stack);
>
> - throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
> - total_nr_queued(td), td->nr_queued[READ],
> - td->nr_queued[WRITE]);
> + throtl_log(td, "dispatch bioq=%u/%u tskq=%u/%u",
> + td->nr_queued_bio[READ], td->nr_queued_bio[WRITE],
> + td->nr_queued_tsk[READ], td->nr_queued_tsk[WRITE]);
>
> nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
>
> @@ -1122,7 +1306,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
>
> if (tg_no_rule_group(tg, rw)) {
> blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
> - rw, bio->bi_rw & REQ_SYNC);
> + 1, rw, bio->bi_rw & REQ_SYNC);
> rcu_read_unlock();
> return 0;
> }
> @@ -1146,14 +1330,14 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> }
> }
>
> - if (tg->nr_queued[rw]) {
> + /* If there are already queued bio or task in same dir, queue bio */
> + if (tg->nr_queued_bio[rw] || tg->nr_queued_tsk[rw]) {
> /*
> - * There is already another bio queued in same dir. No
> + * There is already another bio/task queued in same dir. No
> * need to update dispatch time.
> */
> update_disptime = false;
> goto queue_bio;
> -
> }
>
> /* Bio is with-in rate limit of group */
> @@ -1178,16 +1362,18 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
>
> queue_bio:
> throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
> - " iodisp=%u iops=%u queued=%d/%d",
> + " iodisp=%u iops=%u bioq=%u/%u taskq=%u/%u",
> rw == READ ? 'R' : 'W',
> tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
> tg->io_disp[rw], tg->iops[rw],
> - tg->nr_queued[READ], tg->nr_queued[WRITE]);
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
>
> throtl_add_bio_tg(q->td, tg, bio);
> *biop = NULL;
>
> if (update_disptime) {
> + tg_update_active_dispatch(td, tg, rw);
> tg_update_disptime(td, tg);
> throtl_schedule_next_dispatch(td);
> }
> @@ -1197,6 +1383,137 @@ out:
> return 0;
> }
>
> +static void
> +__blk_throtl_dirty_pages(struct request_queue *q, unsigned long nr_dirty)
> +{
> + struct throtl_data *td = q->td;
> + struct throtl_grp *tg;
> + struct blkio_cgroup *blkcg;
> + bool rw = WRITE, update_disptime = true, first_task = false;
> + unsigned int sz = nr_dirty << PAGE_SHIFT;
> + DEFINE_WAIT(wait);
> +
> + /*
> + * A throtl_grp pointer retrieved under rcu can be used to access
> + * basic fields like stats and io rates. If a group has no rules,
> + * just update the dispatch stats in lockless manner and return.
> + */
> +
> + rcu_read_lock();
> + blkcg = task_blkio_cgroup(current);
> + tg = throtl_find_tg(td, blkcg);
> + if (tg) {
> + throtl_tg_fill_dev_details(td, tg);
> +
> + if (tg_no_rule_group(tg, rw)) {
> + blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_dirty,
> + rw, 0);
> + rcu_read_unlock();
> + return;
> + }
> + }
> + rcu_read_unlock();
> +
> + spin_lock_irq(q->queue_lock);
> +
> + tg = throtl_get_tg(td);
> +
> + /* Queue is gone. No queue lock held here. */
> + if (IS_ERR(tg))
> + return;
> +
> + tg->unaccounted_dirty += nr_dirty;
> +
> + /* If there are already queued task, put this task also on waitq */
> + if (tg->nr_queued_tsk[rw]) {
> + update_disptime = false;
> + goto queue_task;
> + } else
> + first_task = true;
> +
> + /* If there are bios already throttled in same dir, queue task */
> + if (!bio_list_empty(&tg->bio_lists[rw])) {
> + update_disptime = false;
> + goto queue_task;
> + }
> +
> + /*
> + * Task is with-in rate limit of group.
> + *
> + * Note: How many IOPS we should charge for this operation. For
> + * the time being I am sticking to number of pages as number of
> + * IOPS.
> + */
> + if (!tg_wait_dispatch(td, tg, rw, sz, nr_dirty)) {
> + throtl_charge_dirty_io(tg, rw, sz, nr_dirty, 0);
> + throtl_trim_slice(td, tg, rw);
> + goto out;
> + }
> +
> +queue_task:
> + throtl_log_tg(td, tg, "[%c] task. bdisp=%u sz=%u bps=%llu"
> + " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
> + rw == READ ? 'R' : 'W',
> + tg->bytes_disp[rw], sz, tg->bps[rw],
> + tg->io_disp[rw], tg->iops[rw],
> + tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
> + tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
> +
> + /* Take a task reference on tg */
> + throtl_ref_get_tg(tg);
> + tg->nr_queued_tsk[rw]++;
> + td->nr_queued_tsk[rw]++;
> + prepare_to_wait_exclusive(&tg->wait[rw], &wait, TASK_UNINTERRUPTIBLE);
> +
> + if (first_task)
> + tg_set_active_task_charge(tg, rw);
> +
> + if (update_disptime) {
> + tg_update_active_dispatch(td, tg, rw);
> + tg_update_disptime(td, tg);
> + throtl_schedule_next_dispatch(td);
> + } else
> + throtl_enqueue_tg(td, tg);
> +
> + spin_unlock_irq(q->queue_lock);
> +
> + /* Task has been put on a wait queue */
> + io_schedule();
> +
> + /*
> + * Task has woken up. Finish wait, drop reference to the group.
> + */
> + spin_lock_irq(q->queue_lock);
> + finish_wait(&tg->wait[rw], &wait);
> +
> + /* Drop task reference on group */
> + throtl_put_tg(tg);
> +out:
> + spin_unlock_irq(q->queue_lock);
> + return;
> +}
> +
> +void
> +blk_throtl_dirty_pages(struct address_space *mapping, unsigned long nr_dirty)
> +{
> + struct request_queue *q;
> + struct block_device *bdev;
> +
> + bdev = mapping->host->i_sb->s_bdev;
> + if (!bdev)
> + return;
> + q = bdev_get_queue(bdev);
> +
> + if (unlikely(!q))
> + return;
> +
> + if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> + return;
> +
> + __blk_throtl_dirty_pages(q, nr_dirty);
> +}
This doesn't throttle buffered WRITEs to raw devices. Maybe we should do
something like this.
-Andrea
===
blk-throttle: control buffered writes to raw block devices
Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-throttle.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 623cc05..a5ddf4a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1505,11 +1505,16 @@ void
blk_throtl_dirty_pages(struct address_space *mapping, unsigned long nr_dirty)
{
struct request_queue *q;
+ struct inode *inode = mapping->host;
struct block_device *bdev;
- bdev = mapping->host->i_sb->s_bdev;
- if (!bdev)
- return;
+ if (S_ISBLK(inode->i_mode) && inode->i_bdev) {
+ bdev = inode->i_bdev;
+ } else {
+ bdev = inode->i_sb->s_bdev;
+ if (!bdev)
+ return;
+ }
q = bdev_get_queue(bdev);
if (unlikely(!q))
On Tue, Jun 28, 2011 at 01:06:24PM -0400, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is V2 of the patches. First version is posted here.
> > >
> > > https://lkml.org/lkml/2011/6/3/375
> > >
> > > There are no changes from first version except that I have rebased it to
> > > for-3.1/core branch of Jens's block tree.
> > >
> > > I have been trying to find ways to solve two problems with block IO controller
> > > cgroups.
> > >
> > > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > > Well it does throttle all the WRITEs at device and by that time buffered
> > > WRITE have lost the submitter's context and most of the IO comes in flusher
> > > thread's context at device. Hence currently buffered write throttling is
> > > not supported.
> > >
> > > - All WRITEs are throttled at device level and this can easily lead to
> > > filesystem serialization.
> > >
> > > One simple example is that if a process writes some pages to cache and
> > > then does fsync(), and process gets throttled then it locks up the
> > > filesystem. With ext4, I noticed that even a simple "ls" does not make
> > > progress. The reason boils down to the fact that filesystems are not
> > > aware of cgroups and one of the things which get serialized is journalling
> > > in ordered mode.
> > >
> > > So even if we do something to carry submitter's cgroup information
> > > to device and do throttling there, it will lead to serialization of
> > > filesystems and is not a good idea.
> > >
> > > So how to go about fixing it. There seem to be two options.
> > >
> > > - Throttling should still be done at device level. Make filesystems aware
> > > of cgroups so that multiple transactions can make progress in parallel
> > > (per cgroup) and there are no shared resources across cgroups in
> > > filesystems which can lead to serialization.
> > >
> > > - Throttle WRITEs while they are entering the cache and not after that.
> > > Something like balance_dirty_pages(). Direct IO is still throttled
> > > at device level. That way, we can avoid these journalling related
> > > serialization issues w.r.t trottling.
> >
> > I think that O_DIRECT WRITEs can hit the same serialization problem if
> > we throttle them at device level.
>
> I think it can but number of cases probably comes down significantly. One
> of the main problems seems to be sync related variants sync/fsync etc.
> > And I think we do not make any guarantees for inflight requests
> (not completed yet).
>
> So it will boil down to how dependent these sync primitives are on
> inflight direct WRITEs. I did basic testing with ext4 and it looked fine.
> On XFS, sync gets blocked behind inflight direct writes. Last time I
> raised that issue and looks like Christoph has plans to do something
> about it.
>
> So currently my understanding is that dependency on direct writes might
> not be a major issue in practice. (Until and unless there is more to
> it I am not aware about).
>
> >
> > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > time "ls" or other simple commands from the root cgroup or unlimited
> > cgroup).
>
> I did. On ext4, I created a cgroup with limit 1byte per second and
> started a direct write and did "ls", "sync" and some directory traversal
> > operations in same directory and it seems to work.
Confirmed. Everything seems to work fine on my side as well.
Tested-by: Andrea Righi <[email protected]>
FYI, I've used the following script to test it, in case you're interested.
I tested both with O_DIRECT=1 and O_DIRECT=0.
-Andrea
---
#!/bin/bash
#
# blkio.throttle unit test
#
# This script creates many cgroups and spawns many parallel IO workers inside
# each cgroup.
# cgroupfs mount point
CGROUP_MP=/sys/fs/cgroup/blkio
# temporary directory used to generate IO
TMPDIR=/tmp
# how many cgroups?
CGROUPS=16
# how many IO workers per cgroup?
WORKERS=16
# max IO bandwidth of each cgroup
BW_MAX=$((1 * 1024 * 1024))
# max IO operations per second of each cgroup
IOPS_MAX=0
# IO block size
IO_BLOCK_SIZE=$((1 * 1024 * 1024))
# how many blocks to read/write (for each worker)
IO_BLOCK_NUM=4
# how many times does each worker have to repeat the IO operation?
IO_COUNT=16
# enable O_DIRECT?
O_DIRECT=0
# timeout after which a task is considered blocked for too long and a
# message is dumped to the kernel log (set to 0 to disable this check)
HUNG_TASK_TIMEOUT=60
cleanup_handler() {
pkill sleep
pkill dd
echo "terminating..."
sleep 10
rmdir $CGROUP_MP/grp_*
rm -rf $TMPDIR/grp_*
sleep 1
exit 1
}
worker() {
out=$1
if [ "z$O_DIRECT" = "z1" ]; then
out_flags=oflag=direct
in_flags=iflag=direct
else
out_flags=
in_flags=
fi
sleep 5
for i in `seq 1 $IO_COUNT`; do
dd if=/dev/zero of=$out \
bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
$out_flags 2>/dev/null
done
for i in `seq 1 $IO_COUNT`; do
dd if=$out of=/dev/null \
bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
$in_flags 2>/dev/null
done
rm -f $out
unset out
}
spawn_workers() {
grp=$1
device=`df $TMPDIR | sed '1d' | awk '{print $1}' | sed 's/[0-9]$//'`
devnum=`grep $(basename $device)$ /proc/partitions | awk '{print $1":"$2}'`
mkdir $CGROUP_MP/$grp
echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.read_bps_device
echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.write_bps_device
echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.read_iops_device
echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.write_iops_device
mkdir -p $TMPDIR/$grp
for i in `seq 1 $WORKERS`; do
worker $TMPDIR/$grp/zero$i &
echo $! > $CGROUP_MP/$grp/tasks
done
for i in `seq 1 $WORKERS`; do
wait
done
rmdir $TMPDIR/$grp
rmdir $CGROUP_MP/$grp
unset grp
}
# mount cgroupfs
mount -t cgroup -o blkio none $CGROUP_MP
# set hung task check timeout (help to catch system-wide lockups)
echo $HUNG_TASK_TIMEOUT > /proc/sys/kernel/hung_task_timeout_secs
# invalidate page cache
sync
echo 3 > /proc/sys/vm/drop_caches
# show expected bandwidth
bw=$(($CGROUPS * $BW_MAX / 1024))
space=$(($CGROUPS * $WORKERS * $IO_BLOCK_SIZE * $IO_BLOCK_NUM / 1024 / 1024))
echo -ne "\n\n"
echo creating $CGROUPS cgroups, $WORKERS tasks per cgroup, bw=$BW_MAX
echo required disk space: $space MiB
echo expected average bandwidth: $bw MiB/s
echo -ne "\n\n"
# trap SIGINT and SIGTERM to quit cleanly
trap cleanup_handler SIGINT SIGTERM
# run workers
for i in `seq 1 $CGROUPS`; do
spawn_workers grp_$i &
done
# wait the completion of the workers
for i in `seq 1 $CGROUPS`; do
wait
done
echo "test completed."
On Wed, Jun 29, 2011 at 05:25:17PM +0200, Andrea Righi wrote:
[..]
> > +void
> > +blk_throtl_dirty_pages(struct address_space *mapping, unsigned long nr_dirty)
> > +{
> > + struct request_queue *q;
> > + struct block_device *bdev;
> > +
> > + bdev = mapping->host->i_sb->s_bdev;
> > + if (!bdev)
> > + return;
> > + q = bdev_get_queue(bdev);
> > +
> > + if (unlikely(!q))
> > + return;
> > +
> > + if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> > + return;
> > +
> > + __blk_throtl_dirty_pages(q, nr_dirty);
> > +}
>
> This doesn't throttle buffered WRITEs to raw devices. Maybe we should do
> something like this.
>
Thanks Andrea. I have updated the patch to fix both of the issues you
pointed out (the raw device fix as well as the compile warning). I have also
added your name in Signed-off-by in the core patch, as you provided the raw
device fix.
Thanks
Vivek
Subject: blk-throttle: core logic to throttle task while dirtying pages
This is the core logic which enables the throttling code to also
throttle tasks while they are dirtying pages. I basically create the
infrastructure where a wait queue is maintained per group,
and if a task exceeds its rate limit, the task is put to sleep on the
wait queue and woken up later.
Though currently we throttle tasks only for WRITE requests and
not for READ requests, I have built the logic to support
task throttling in both directions (similar to bio throttling).
This will allow us to add task READ throttling easily as well,
if needed in the future.
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-throttle.c | 405 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 364 insertions(+), 41 deletions(-)
Index: linux-2.6-block/block/blk-throttle.c
===================================================================
--- linux-2.6-block.orig/block/blk-throttle.c 2011-06-29 15:45:46.415553948 -0400
+++ linux-2.6-block/block/blk-throttle.c 2011-06-29 15:49:52.592309807 -0400
@@ -32,6 +32,12 @@ struct throtl_rb_root {
unsigned long min_disptime;
};
+/* Whether to dispatch a bio or task in next round of a dispatch from a group */
+enum dispatch_type {
+ DISPATCH_BIO,
+ DISPATCH_TASK,
+};
+
#define THROTL_RB_ROOT (struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \
.count = 0, .min_disptime = 0}
@@ -59,7 +65,10 @@ struct throtl_grp {
struct bio_list bio_lists[2];
/* Number of queued bios on READ and WRITE lists */
- unsigned int nr_queued[2];
+ unsigned int nr_queued_bio[2];
+
+ /* Number of queued tasks */
+ unsigned int nr_queued_tsk[2];
/* bytes per second rate limits */
uint64_t bps[2];
@@ -80,6 +89,21 @@ struct throtl_grp {
int limits_changed;
struct rcu_head rcu_head;
+
+ /* READ and WRITE wait lists for task throttle */
+ wait_queue_head_t wait[2];
+
+ /* Number of unaccounted dirty pages in this group */
+ unsigned int unaccounted_dirty;
+
+ /* The number of pages to be charged to first task on wait queue */
+ unsigned int active_task_charge[2];
+
+ /*
+ * Keeps track of whether we will dispatch a bio or task in next
+ * round of dispatch (for each dir)
+ */
+ enum dispatch_type active_dispatch[2];
};
struct throtl_data
@@ -94,7 +118,10 @@ struct throtl_data
struct request_queue *queue;
/* Total Number of queued bios on READ and WRITE lists */
- unsigned int nr_queued[2];
+ unsigned int nr_queued_bio[2];
+
+ /* Total Number of queued tasks on READ and WRITE lists */
+ unsigned int nr_queued_tsk[2];
/*
* number of total undestroyed groups
@@ -142,9 +169,19 @@ static inline struct throtl_grp *tg_of_b
return NULL;
}
+static inline int total_queued_bios(struct throtl_data *td)
+{
+ return td->nr_queued_bio[0] + td->nr_queued_bio[1];
+}
+
+static inline int total_queued_tasks(struct throtl_data *td)
+{
+ return td->nr_queued_tsk[0] + td->nr_queued_tsk[1];
+}
+
static inline unsigned int total_nr_queued(struct throtl_data *td)
{
- return td->nr_queued[0] + td->nr_queued[1];
+ return total_queued_bios(td) + total_queued_tasks(td);
}
static inline struct throtl_grp *throtl_ref_get_tg(struct throtl_grp *tg)
@@ -192,6 +229,9 @@ static void throtl_init_group(struct thr
tg->bps[0] = tg->bps[1] = -1;
tg->iops[0] = tg->iops[1] = -1;
+ init_waitqueue_head(&tg->wait[READ]);
+ init_waitqueue_head(&tg->wait[WRITE]);
+
/*
* Take the initial reference that will be released on destroy
* This can be thought of a joint reference by cgroup and
@@ -727,6 +767,14 @@ static void throtl_charge_io(struct thro
blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_ios, rw, sync);
}
+static void throtl_charge_dirty_io(struct throtl_grp *tg, bool rw,
+ unsigned int sz, unsigned int nr_ios, bool sync)
+{
+ tg->unaccounted_dirty -= nr_ios;
+ BUG_ON(tg->unaccounted_dirty < 0);
+ throtl_charge_io(tg, rw, sz, nr_ios, sync);
+}
+
static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
struct bio *bio)
{
@@ -735,21 +783,38 @@ static void throtl_add_bio_tg(struct thr
bio_list_add(&tg->bio_lists[rw], bio);
/* Take a bio reference on tg */
throtl_ref_get_tg(tg);
- tg->nr_queued[rw]++;
- td->nr_queued[rw]++;
+ tg->nr_queued_bio[rw]++;
+ td->nr_queued_bio[rw]++;
throtl_enqueue_tg(td, tg);
}
-static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+/*
+ * Returns the wait time of first bio or task in the specified direction. If
+ * nothing is queued then wait time is -1.
+ */
+static unsigned long
+tg_active_wait_time(struct throtl_data *td, struct throtl_grp *tg, bool rw)
{
- unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
struct bio *bio;
+ unsigned long sz;
+
+ bio = bio_list_peek(&tg->bio_lists[rw]);
+ if (bio && tg->active_dispatch[rw] == DISPATCH_BIO)
+ return tg_wait_dispatch(td, tg, rw, bio->bi_size, 1);
+
+ if (!waitqueue_active(&tg->wait[rw]))
+ return -1;
- if ((bio = bio_list_peek(&tg->bio_lists[READ])))
- read_wait = tg_wait_dispatch(td, tg, READ, bio->bi_size, 1);
+ sz = tg->active_task_charge[rw] << PAGE_SHIFT;
+ return tg_wait_dispatch(td, tg, rw, sz, tg->active_task_charge[rw]);
+}
- if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
- write_wait = tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1);
+static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+{
+ unsigned long read_wait, write_wait, min_wait, disptime;
+
+ read_wait = tg_active_wait_time(td, tg, READ);
+ write_wait = tg_active_wait_time(td, tg, WRITE);
min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -760,18 +825,39 @@ static void tg_update_disptime(struct th
throtl_enqueue_tg(td, tg);
}
+/* Calculate how many pages must be charged to a task at this point of time. */
+static unsigned int tg_calc_charge_pages(struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_pages;
+
+ /* Divide unaccounted number of pages among queued tasks */
+ nr_pages = tg->unaccounted_dirty/tg->nr_queued_tsk[rw];
+
+ return max_t(unsigned int, nr_pages, 1);
+}
+
+/* set the active task charge for the first task in the queue */
+static void tg_set_active_task_charge(struct throtl_grp *tg, bool rw)
+{
+ /* If there are more tasks queued, calculate the charge */
+ if (tg->nr_queued_tsk[rw])
+ tg->active_task_charge[rw] = tg_calc_charge_pages(tg, rw);
+ else
+ tg->active_task_charge[rw] = 0;
+}
+
static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
bool rw, struct bio_list *bl)
{
struct bio *bio;
bio = bio_list_pop(&tg->bio_lists[rw]);
- tg->nr_queued[rw]--;
+ tg->nr_queued_bio[rw]--;
/* Drop bio reference on tg */
throtl_put_tg(tg);
- BUG_ON(td->nr_queued[rw] <= 0);
- td->nr_queued[rw]--;
+ BUG_ON(td->nr_queued_bio[rw] <= 0);
+ td->nr_queued_bio[rw]--;
throtl_charge_io(tg, rw, bio->bi_size, 1, bio->bi_rw & REQ_SYNC);
bio_list_add(bl, bio);
@@ -780,39 +866,137 @@ static void tg_dispatch_one_bio(struct t
throtl_trim_slice(td, tg, rw);
}
-static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
- struct bio_list *bl)
+static int throtl_dispatch_tg_bio(struct throtl_data *td, struct throtl_grp *tg,
+ bool rw, struct bio_list *bl, unsigned int max_disp)
{
- unsigned int nr_reads = 0, nr_writes = 0;
- unsigned int max_nr_reads = throtl_grp_quantum*3/4;
- unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
struct bio *bio;
+ unsigned int nr_disp = 0;
- /* Try to dispatch 75% READS and 25% WRITES */
-
- while ((bio = bio_list_peek(&tg->bio_lists[READ]))
- && !tg_wait_dispatch(td, tg, READ, bio->bi_size, 1)) {
+ while ((bio = bio_list_peek(&tg->bio_lists[rw]))
+ && !tg_wait_dispatch(td, tg, rw, bio->bi_size, 1)) {
tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
- nr_reads++;
+ nr_disp++;
- if (nr_reads >= max_nr_reads)
+ if (nr_disp >= max_disp)
break;
}
- while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
- && !tg_wait_dispatch(td, tg, WRITE, bio->bi_size, 1)) {
+ return nr_disp;
+}
- tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
- nr_writes++;
+static void throtl_dispatch_tg_task(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw, unsigned int max_disp)
+{
+ unsigned int sz;
- if (nr_writes >= max_nr_writes)
- break;
+ if (!waitqueue_active(&tg->wait[rw]))
+ return;
+
+ /* Wake up a task */
+ wake_up(&tg->wait[rw]);
+
+ tg->nr_queued_tsk[rw]--;
+ td->nr_queued_tsk[rw]--;
+
+ /* Charge the woken up task for IO */
+ sz = tg->active_task_charge[rw] << PAGE_SHIFT;
+ throtl_charge_dirty_io(tg, rw, sz, tg->active_task_charge[rw], false);
+ throtl_trim_slice(td, tg, rw);
+
+ /*
+ * We dispatched one task. Set the charge for other queued tasks,
+ * if any.
+ */
+ tg_set_active_task_charge(tg, rw);
+ throtl_log_tg(td, tg, "[%c] Wake up a task. bdisp=%llu sz=%u bps=%llu"
+ " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
+ rw == READ ? 'R' : 'W', tg->bytes_disp[rw], sz,
+ tg->bps[rw], tg->io_disp[rw], tg->iops[rw],
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
+}
+
+static void tg_switch_active_dispatch(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_tasks = tg->nr_queued_tsk[rw];
+ unsigned int nr_bios = tg->nr_queued_bio[rw];
+ enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
+
+ /* Nothing queued. Whoever gets queued next sets the dispatch type */
+ if (!nr_bios && !nr_tasks)
+ return;
+
+ if (curr_dispatch == DISPATCH_BIO && nr_tasks) {
+ tg->active_dispatch[rw] = DISPATCH_TASK;
+ return;
+ }
+
+ if (curr_dispatch == DISPATCH_TASK && nr_bios)
+ tg->active_dispatch[rw] = DISPATCH_BIO;
+}
+
+static void tg_update_active_dispatch(struct throtl_data *td,
+ struct throtl_grp *tg, bool rw)
+{
+ unsigned int nr_tasks = tg->nr_queued_tsk[rw];
+ unsigned int nr_bios = tg->nr_queued_bio[rw];
+ enum dispatch_type curr_dispatch = tg->active_dispatch[rw];
+
+ BUG_ON(nr_bios < 0 || nr_tasks < 0);
+
+ if (curr_dispatch == DISPATCH_BIO && !nr_bios) {
+ tg->active_dispatch[rw] = DISPATCH_TASK;
+ return;
}
+ if (curr_dispatch == DISPATCH_TASK && !nr_tasks)
+ tg->active_dispatch[rw] = DISPATCH_BIO;
+}
+
+static int throtl_dispatch_tg_rw(struct throtl_data *td, struct throtl_grp *tg,
+ bool rw, struct bio_list *bl, unsigned int max)
+{
+ unsigned int nr_disp = 0;
+
+ if (tg->active_dispatch[rw] == DISPATCH_BIO)
+ nr_disp = throtl_dispatch_tg_bio(td, tg, rw, bl, max);
+ else
+ /* Only number of bios dispatched is kept track of here */
+ throtl_dispatch_tg_task(td, tg, rw, max);
+
+ tg_switch_active_dispatch(td, tg, rw);
+ return nr_disp;
+}
+
+static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
+ struct bio_list *bl)
+{
+ /* Try to dispatch 75% READS and 25% WRITES */
+ unsigned int nr_reads = 0, nr_writes = 0;
+ unsigned int max_nr_reads = throtl_grp_quantum*3/4;
+ unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
+
+ nr_reads = throtl_dispatch_tg_rw(td, tg, READ, bl, max_nr_reads);
+ nr_writes = throtl_dispatch_tg_rw(td, tg, WRITE, bl, max_nr_writes);
+
return nr_reads + nr_writes;
}
+static bool tg_should_requeue(struct throtl_grp *tg)
+{
+ /* If there are queued bios, requeue */
+ if (tg->nr_queued_bio[0] || tg->nr_queued_bio[1])
+ return 1;
+
+ /* If there are queued tasks, requeue */
+ if (tg->nr_queued_tsk[0] || tg->nr_queued_tsk[1])
+ return 1;
+
+ return 0;
+}
+
static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
{
unsigned int nr_disp = 0;
@@ -832,7 +1016,7 @@ static int throtl_select_dispatch(struct
nr_disp += throtl_dispatch_tg(td, tg, bl);
- if (tg->nr_queued[0] || tg->nr_queued[1]) {
+ if (tg_should_requeue(tg)) {
tg_update_disptime(td, tg);
throtl_enqueue_tg(td, tg);
}
@@ -899,9 +1083,9 @@ static int throtl_dispatch(struct reques
bio_list_init(&bio_list_on_stack);
- throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
- total_nr_queued(td), td->nr_queued[READ],
- td->nr_queued[WRITE]);
+ throtl_log(td, "dispatch bioq=%u/%u tskq=%u/%u",
+ td->nr_queued_bio[READ], td->nr_queued_bio[WRITE],
+ td->nr_queued_tsk[READ], td->nr_queued_tsk[WRITE]);
nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
@@ -1122,7 +1306,7 @@ int blk_throtl_bio(struct request_queue
if (tg_no_rule_group(tg, rw)) {
blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
- rw, bio->bi_rw & REQ_SYNC);
+ 1, rw, bio->bi_rw & REQ_SYNC);
rcu_read_unlock();
return 0;
}
@@ -1146,14 +1330,14 @@ int blk_throtl_bio(struct request_queue
}
}
- if (tg->nr_queued[rw]) {
+ /* If there are already queued bio or task in same dir, queue bio */
+ if (tg->nr_queued_bio[rw] || tg->nr_queued_tsk[rw]) {
/*
- * There is already another bio queued in same dir. No
+ * There is already another bio/task queued in same dir. No
* need to update dispatch time.
*/
update_disptime = false;
goto queue_bio;
-
}
/* Bio is with-in rate limit of group */
@@ -1178,16 +1362,18 @@ int blk_throtl_bio(struct request_queue
queue_bio:
throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
- " iodisp=%u iops=%u queued=%d/%d",
+ " iodisp=%u iops=%u bioq=%u/%u taskq=%u/%u",
rw == READ ? 'R' : 'W',
tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
tg->io_disp[rw], tg->iops[rw],
- tg->nr_queued[READ], tg->nr_queued[WRITE]);
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
throtl_add_bio_tg(q->td, tg, bio);
*biop = NULL;
if (update_disptime) {
+ tg_update_active_dispatch(td, tg, rw);
tg_update_disptime(td, tg);
throtl_schedule_next_dispatch(td);
}
@@ -1197,6 +1383,143 @@ out:
return 0;
}
+static void
+__blk_throtl_dirty_pages(struct request_queue *q, unsigned long nr_dirty)
+{
+ struct throtl_data *td = q->td;
+ struct throtl_grp *tg;
+ struct blkio_cgroup *blkcg;
+ bool rw = WRITE, update_disptime = true, first_task = false;
+ unsigned int sz = nr_dirty << PAGE_SHIFT;
+ DEFINE_WAIT(wait);
+
+ /*
+ * A throtl_grp pointer retrieved under rcu can be used to access
+ * basic fields like stats and io rates. If a group has no rules,
+ * just update the dispatch stats in lockless manner and return.
+ */
+
+ rcu_read_lock();
+ blkcg = task_blkio_cgroup(current);
+ tg = throtl_find_tg(td, blkcg);
+ if (tg) {
+ throtl_tg_fill_dev_details(td, tg);
+
+ if (tg_no_rule_group(tg, rw)) {
+ blkiocg_update_dispatch_stats(&tg->blkg, sz, nr_dirty,
+ rw, 0);
+ rcu_read_unlock();
+ return;
+ }
+ }
+ rcu_read_unlock();
+
+ spin_lock_irq(q->queue_lock);
+
+ tg = throtl_get_tg(td);
+
+ /* Queue is gone. No queue lock held here. */
+ if (IS_ERR(tg))
+ return;
+
+ tg->unaccounted_dirty += nr_dirty;
+
+ /* If there are already queued tasks, put this task also on waitq */
+ if (tg->nr_queued_tsk[rw]) {
+ update_disptime = false;
+ goto queue_task;
+ } else
+ first_task = true;
+
+ /* If there are bios already throttled in same dir, queue task */
+ if (!bio_list_empty(&tg->bio_lists[rw])) {
+ update_disptime = false;
+ goto queue_task;
+ }
+
+ /*
+ * Task is with-in rate limit of group.
+ *
+ * Note: it is not obvious how many IOPS we should charge for this
+ * operation. For the time being I am sticking to the number of
+ * pages as the number of IOPS.
+ */
+ if (!tg_wait_dispatch(td, tg, rw, sz, nr_dirty)) {
+ throtl_charge_dirty_io(tg, rw, sz, nr_dirty, 0);
+ throtl_trim_slice(td, tg, rw);
+ goto out;
+ }
+
+queue_task:
+ throtl_log_tg(td, tg, "[%c] task. bdisp=%llu sz=%u bps=%llu"
+ " iodisp=%u iops=%u bioq=%d/%d taskq=%d/%d",
+ rw == READ ? 'R' : 'W',
+ tg->bytes_disp[rw], sz, tg->bps[rw],
+ tg->io_disp[rw], tg->iops[rw],
+ tg->nr_queued_bio[READ], tg->nr_queued_bio[WRITE],
+ tg->nr_queued_tsk[READ], tg->nr_queued_tsk[WRITE]);
+
+ /* Take a task reference on tg */
+ throtl_ref_get_tg(tg);
+ tg->nr_queued_tsk[rw]++;
+ td->nr_queued_tsk[rw]++;
+ prepare_to_wait_exclusive(&tg->wait[rw], &wait, TASK_UNINTERRUPTIBLE);
+
+ if (first_task)
+ tg_set_active_task_charge(tg, rw);
+
+ if (update_disptime) {
+ tg_update_active_dispatch(td, tg, rw);
+ tg_update_disptime(td, tg);
+ throtl_schedule_next_dispatch(td);
+ } else
+ throtl_enqueue_tg(td, tg);
+
+ spin_unlock_irq(q->queue_lock);
+
+ /* Task has been put on a wait queue */
+ io_schedule();
+
+ /*
+ * Task has woken up. Finish wait, drop reference to the group.
+ */
+ spin_lock_irq(q->queue_lock);
+ finish_wait(&tg->wait[rw], &wait);
+
+ /* Drop task reference on group */
+ throtl_put_tg(tg);
+out:
+ spin_unlock_irq(q->queue_lock);
+ return;
+}
+
+void
+blk_throtl_dirty_pages(struct address_space *mapping, unsigned long nr_dirty)
+{
+ struct request_queue *q;
+ struct block_device *bdev;
+ struct inode *inode = mapping->host;
+
+ if (S_ISBLK(inode->i_mode) && inode->i_bdev) {
+ bdev = inode->i_bdev;
+ } else {
+ bdev = inode->i_sb->s_bdev;
+ if (!bdev)
+ return;
+ }
+
+ q = bdev_get_queue(bdev);
+
+ if (unlikely(!q))
+ return;
+
+ if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+ return;
+
+ __blk_throtl_dirty_pages(q, nr_dirty);
+}
+
+
int blk_throtl_init(struct request_queue *q)
{
struct throtl_data *td;
On Wed, Jun 29, 2011 at 06:05:32PM +0200, Andrea Righi wrote:
[..]
> > > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > > time "ls" or other simple commands from the root cgroup or unlimited
> > > cgroup).
> >
> > I did. On ext4, I created a cgroup with limit 1byte per second and
> > started a direct write and did "ls", "sync" and some directory traversal
> > operations in the same directory and it seems to work.
>
> Confirm. Everything seems to work fine also on my side.
>
> Tested-by: Andrea Righi <[email protected]>
>
> FYI, I've used the following script to test it if you're interested.
> I tested both with O_DIRECT=1 and O_DIRECT=0.
Thanks for testing these patches. This script looks good. Will save it.
Thanks
Vivek
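For reference, a minimal O_DIRECT writer of the kind used in the test above could look like the sketch below (illustrative only; Andrea's actual script is not reproduced in this thread, and the file name, buffer size and iteration count are arbitrary). Run it from the throttled cgroup and then try "ls" or "sync" in the same directory from an unthrottled cgroup.

/*
 * direct-writer.c - illustrative sketch, not part of the posted patches.
 * Issues O_DIRECT writes, which bypass the page cache, so they are
 * throttled at the device level.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)

int main(void)
{
	void *buf;
	int fd, i;

	/* O_DIRECT needs a suitably aligned buffer and I/O size */
	if (posix_memalign(&buf, 4096, BUF_SIZE)) {
		perror("posix_memalign");
		exit(1);
	}
	memset(buf, 'a', BUF_SIZE);

	fd = open("direct-writer.tst-file", O_WRONLY | O_CREAT | O_DIRECT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < 64; i++) {
		if (write(fd, buf, BUF_SIZE) < 0) {
			perror("write");
			exit(1);
		}
	}
	close(fd);
	free(buf);
	return 0;
}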
On Tue, Jun 28, 2011 at 11:35:09AM -0400, Vivek Goyal wrote:
> Put the blk_throtl_dirty_pages() hook in
> balance_dirty_pages_ratelimited_nr() to enable task throttling.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> include/linux/blkdev.h | 5 +++++
> mm/page-writeback.c | 3 +++
> 2 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 4ce6e68..5d4a57e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1180,12 +1180,17 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
> extern int blk_throtl_init(struct request_queue *q);
> extern void blk_throtl_exit(struct request_queue *q);
> extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> +extern void blk_throtl_dirty_pages(struct address_space *mapping,
> + unsigned long nr_dirty);
> #else /* CONFIG_BLK_DEV_THROTTLING */
> static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> {
> return 0;
> }
>
> +static inline void blk_throtl_dirty_pages(struct address_space *mapping,
> + unsigned long nr_dirty) {}
> +
> static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> #endif /* CONFIG_BLK_DEV_THROTTLING */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 31f6988..943e551 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -629,6 +629,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> unsigned long ratelimit;
> unsigned long *p;
>
> + /* Subject writes to IO controller throttling */
> + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> +
mmmh.. in this way we also throttle tasks that are re-writing dirty pages
multiple times.
From the controller perspective, what is actually generating I/O on block
devices is the generation of _new_ dirty pages. Multiple re-writes in the page
cache should never be throttled IMHO.
I would re-write this patch in the following way. What do you think?
Thanks,
-Andrea
---
Subject: [PATCH 8/8] blk-throttle: enable throttling of task while dirtying pages
From: Andrea Righi <[email protected]>
Put the blk_throtl_dirty_pages() hook in balance_dirty_pages_ratelimited_nr()
to enable task throttling.
Moreover, modify balance_dirty_pages_ratelimited_nr() to accept the additional
parameter "redirty". This parameter can be used to notify if the pages have
been dirtied for the first time or re-dirtied.
This information can be used by the blkio.throttle controller to distinguish
between a WRITE in the page cache, which will eventually generate I/O activity
on the block device via the writeback code, and a re-WRITE operation that most
of the time will not generate additional I/O activity.
This means that a task that re-writes the same blocks of a file multiple times
is affected by the blkio limits only for the actual I/O that will be performed
to the underlying block devices during the writeback process.
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
include/linux/writeback.h | 8 ++++++--
mm/filemap.c | 4 +++-
mm/page-writeback.c | 11 +++++++----
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 17e7ccc..82d69c9 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -129,8 +129,12 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty);
void page_writeback_init(void);
-void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
- unsigned long nr_pages_dirtied);
+
+#define balance_dirty_pages_ratelimited_nr(__mapping, __nr) \
+ __balance_dirty_pages_ratelimited_nr(__mapping, __nr, false)
+void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
+ unsigned long nr_pages_dirtied,
+ bool redirty);
static inline void
balance_dirty_pages_ratelimited(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index a8251a8..ff4bdc5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2377,6 +2377,7 @@ static ssize_t generic_perform_write(struct file *file,
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
+ unsigned int dirty;
offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2412,6 +2413,7 @@ again:
pagefault_enable();
flush_dcache_page(page);
+ dirty = PageDirty(page);
mark_page_accessed(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
@@ -2438,7 +2440,7 @@ again:
pos += copied;
written += copied;
- balance_dirty_pages_ratelimited(mapping);
+ __balance_dirty_pages_ratelimited_nr(mapping, 1, dirty);
} while (iov_iter_count(i));
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 943e551..4ca505d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -613,6 +613,7 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
* balance_dirty_pages_ratelimited_nr - balance dirty memory state
* @mapping: address_space which was dirtied
* @nr_pages_dirtied: number of pages which the caller has just dirtied
+ * @redirty: false if the pages have been dirtied for the first time
*
* Processes which are dirtying memory should call in here once for each page
* which was newly dirtied. The function will periodically check the system's
@@ -623,14 +624,16 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
* limit we decrease the ratelimiting by a lot, to prevent individual processes
* from overshooting the limit by (ratelimit_pages) each.
*/
-void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
- unsigned long nr_pages_dirtied)
+void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
+ unsigned long nr_pages_dirtied,
+ bool redirty)
{
unsigned long ratelimit;
unsigned long *p;
/* Subject writes to IO controller throttling */
- blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
+ if (!redirty)
+ blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
@@ -652,7 +655,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
}
preempt_enable();
}
-EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
+EXPORT_SYMBOL(__balance_dirty_pages_ratelimited_nr);
void throttle_vm_writeout(gfp_t gfp_mask)
{
On Thu, Jun 30, 2011 at 04:52:29PM +0200, Andrea Righi wrote:
> On Tue, Jun 28, 2011 at 11:35:09AM -0400, Vivek Goyal wrote:
> > Put the blk_throtl_dirty_pages() hook in
> > balance_dirty_pages_ratelimited_nr() to enable task throttling.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > include/linux/blkdev.h | 5 +++++
> > mm/page-writeback.c | 3 +++
> > 2 files changed, 8 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 4ce6e68..5d4a57e 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1180,12 +1180,17 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
> > extern int blk_throtl_init(struct request_queue *q);
> > extern void blk_throtl_exit(struct request_queue *q);
> > extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> > +extern void blk_throtl_dirty_pages(struct address_space *mapping,
> > + unsigned long nr_dirty);
> > #else /* CONFIG_BLK_DEV_THROTTLING */
> > static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> > {
> > return 0;
> > }
> >
> > +static inline void blk_throtl_dirty_pages(struct address_space *mapping,
> > + unsigned long nr_dirty) {}
> > +
> > static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> > static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> > #endif /* CONFIG_BLK_DEV_THROTTLING */
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 31f6988..943e551 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -629,6 +629,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > unsigned long ratelimit;
> > unsigned long *p;
> >
> > + /* Subject writes to IO controller throttling */
> > + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> > +
>
> mmmh.. in this way we throttle also tasks that are re-writing dirty pages
> multiple times.
>
> From the controller perspective what is actually generating I/O on block
> devices is the generation of _new_ dirty pages. Multiple re-writes in page
> cache should never be throttled IMHO.
>
> I would re-write this patch in the following way. What do you think?
>
> Thanks,
> -Andrea
>
> ---
> Subject: [PATCH 8/8] blk-throttle: enable throttling of task while dirtying pages
>
> From: Andrea Righi <[email protected]>
>
> Put the blk_throtl_dirty_pages() hook in balance_dirty_pages_ratelimited_nr()
> to enable task throttling.
>
> Moreover, modify balance_dirty_pages_ratelimited_nr() to accept the additional
> parameter "redirty". This parameter can be used to notify if the pages have
> been dirtied for the first time or re-dirtied.
>
> This information can be used by the blkio.throttle controller to distinguish
> between a WRITE in the page cache, that will eventually generates I/O activity
> on block device by the writeback code, and a re-WRITE operation that most of
> the time will not generate additional I/O activity.
>
> This means that a task that re-writes multiple times the same blocks of a file
> is affected by the blkio limitations only for the actual I/O that will be
> performed to the underlying block devices during the writeback process.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
A simple test (see rewrite.c below):
# echo 8:0 1000000 > /sys/fs/cgroup/blkio/foo/blkio.throttle.write_bps_device
- before:
$ ./rewrite
0: 4s <-- first write
1: 4s \
2: 4s |
3: 5s |
4: 4s |
5: 4s | <-- re-writes (not generating additional I/O)
6: 4s |
7: 4s |
8: 5s |
9: 4s /
- after:
$ ./rewrite
0: 4s <-- first write
1: 0s \
2: 0s |
3: 0s |
4: 0s |
5: 0s | <-- re-writes (not generating additional I/O)
6: 0s |
7: 0s |
8: 0s |
9: 0s /
-Andrea
---
/*
 * rewrite.c
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>
#include <sys/types.h>

static char buf[4 * 1024 * 1024];

int main(int argc, char **argv)
{
	int fd, i;

	fd = open("junk", O_WRONLY | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (i = 0; i < 10; i++) {
		time_t start, end;

		lseek(fd, 0, SEEK_SET);
		start = time(NULL);
		if (write(fd, buf, sizeof(buf)) < 0) {
			perror("write");
			exit(1);
		}
		end = time(NULL);
		printf("%d: %zus\n", i, end - start);
		fflush(stdout);
	}
	unlink("junk");
	return 0;
}
On Thu, Jun 30, 2011 at 04:52:29PM +0200, Andrea Righi wrote:
> On Tue, Jun 28, 2011 at 11:35:09AM -0400, Vivek Goyal wrote:
> > Put the blk_throtl_dirty_pages() hook in
> > balance_dirty_pages_ratelimited_nr() to enable task throttling.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > include/linux/blkdev.h | 5 +++++
> > mm/page-writeback.c | 3 +++
> > 2 files changed, 8 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 4ce6e68..5d4a57e 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1180,12 +1180,17 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
> > extern int blk_throtl_init(struct request_queue *q);
> > extern void blk_throtl_exit(struct request_queue *q);
> > extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> > +extern void blk_throtl_dirty_pages(struct address_space *mapping,
> > + unsigned long nr_dirty);
> > #else /* CONFIG_BLK_DEV_THROTTLING */
> > static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> > {
> > return 0;
> > }
> >
> > +static inline void blk_throtl_dirty_pages(struct address_space *mapping,
> > + unsigned long nr_dirty) {}
> > +
> > static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> > static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> > #endif /* CONFIG_BLK_DEV_THROTTLING */
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 31f6988..943e551 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -629,6 +629,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > unsigned long ratelimit;
> > unsigned long *p;
> >
> > + /* Subject writes to IO controller throttling */
> > + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> > +
>
> mmmh.. in this way we throttle also tasks that are re-writing dirty pages
> multiple times.
>
> From the controller perspective what is actually generating I/O on block
> devices is the generation of _new_ dirty pages. Multiple re-writes in page
> cache should never be throttled IMHO.
>
> I would re-write this patch in the following way. What do you think?
>
> Thanks,
> -Andrea
>
> ---
> Subject: [PATCH 8/8] blk-throttle: enable throttling of task while dirtying pages
>
> From: Andrea Righi <[email protected]>
>
> Put the blk_throtl_dirty_pages() hook in balance_dirty_pages_ratelimited_nr()
> to enable task throttling.
>
> Moreover, modify balance_dirty_pages_ratelimited_nr() to accept the additional
> parameter "redirty". This parameter can be used to notify if the pages have
> been dirtied for the first time or re-dirtied.
>
> This information can be used by the blkio.throttle controller to distinguish
> between a WRITE in the page cache, that will eventually generates I/O activity
> on block device by the writeback code, and a re-WRITE operation that most of
> the time will not generate additional I/O activity.
>
> This means that a task that re-writes multiple times the same blocks of a file
> is affected by the blkio limitations only for the actual I/O that will be
> performed to the underlying block devices during the writeback process.
Hi Andrea,
It kind of makes sense to me. Why not do it for regular dirty throttling
also? There too we want to control the overall dirty memory in the system,
and as long as we are only re-dirtying pages and not increasing the number
of dirty pages, does it make sense to not subject them to throttling?
If yes, then we can simply skip calling balance_dirty_pages_ratelimited()
in case the page was already dirty, instead of introducing another variant
__balance_dirty_pages_ratelimited_nr()?
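Concretely (an illustrative sketch only, not one of the posted patches), with
the dirty = PageDirty(page) sample taken before write_end() as in Andrea's
mm/filemap.c hunk above, the tail of the copy loop in generic_perform_write()
would then look roughly like:

		pos += copied;
		written += copied;

		/* A re-dirtied page adds no new dirty memory; skip throttling */
		if (!dirty)
			balance_dirty_pages_ratelimited(mapping);

	} while (iov_iter_count(i));

i.e. the existing entry point is kept and simply not called for re-dirties, so
no __balance_dirty_pages_ratelimited_nr() variant would be needed.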
Thanks
Vivek
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> include/linux/writeback.h | 8 ++++++--
> mm/filemap.c | 4 +++-
> mm/page-writeback.c | 11 +++++++----
> 3 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 17e7ccc..82d69c9 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -129,8 +129,12 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> unsigned long dirty);
>
> void page_writeback_init(void);
> -void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> - unsigned long nr_pages_dirtied);
> +
> +#define balance_dirty_pages_ratelimited_nr(__mapping, __nr) \
> + __balance_dirty_pages_ratelimited_nr(__mapping, __nr, false)
> +void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> + unsigned long nr_pages_dirtied,
> + bool redirty);
>
> static inline void
> balance_dirty_pages_ratelimited(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a8251a8..ff4bdc5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2377,6 +2377,7 @@ static ssize_t generic_perform_write(struct file *file,
> unsigned long bytes; /* Bytes to write to page */
> size_t copied; /* Bytes copied from user */
> void *fsdata;
> + unsigned int dirty;
>
> offset = (pos & (PAGE_CACHE_SIZE - 1));
> bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> @@ -2412,6 +2413,7 @@ again:
> pagefault_enable();
> flush_dcache_page(page);
>
> + dirty = PageDirty(page);
> mark_page_accessed(page);
> status = a_ops->write_end(file, mapping, pos, bytes, copied,
> page, fsdata);
> @@ -2438,7 +2440,7 @@ again:
> pos += copied;
> written += copied;
>
> - balance_dirty_pages_ratelimited(mapping);
> + __balance_dirty_pages_ratelimited_nr(mapping, 1, dirty);
>
> } while (iov_iter_count(i));
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 943e551..4ca505d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -613,6 +613,7 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> * @mapping: address_space which was dirtied
> * @nr_pages_dirtied: number of pages which the caller has just dirtied
> + * @redirty: false if the pages have been dirtied for the first time
> *
> * Processes which are dirtying memory should call in here once for each page
> * which was newly dirtied. The function will periodically check the system's
> @@ -623,14 +624,16 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> * limit we decrease the ratelimiting by a lot, to prevent individual processes
> * from overshooting the limit by (ratelimit_pages) each.
> */
> -void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> - unsigned long nr_pages_dirtied)
> +void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> + unsigned long nr_pages_dirtied,
> + bool redirty)
> {
> unsigned long ratelimit;
> unsigned long *p;
>
> /* Subject writes to IO controller throttling */
> - blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> + if (!redirty)
> + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
>
> ratelimit = ratelimit_pages;
> if (mapping->backing_dev_info->dirty_exceeded)
> @@ -652,7 +655,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> }
> preempt_enable();
> }
> -EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
> +EXPORT_SYMBOL(__balance_dirty_pages_ratelimited_nr);
>
> void throttle_vm_writeout(gfp_t gfp_mask)
> {
On Tue, Jun 28, 2011 at 09:53:36PM -0400, Vivek Goyal wrote:
[..]
> > FYI, filesystem development cycles are slow and engineers are
> > conservative because of the absolute requirement for data integrity.
> > Hence we tend to focus development on problems that users are
> > reporting (i.e. known pain points) or functionality they have
> > requested.
> >
> > In this case, block throttling works OK on most filesystems out of
> > the box, but it has some known problems. If there are people out
> > there hitting these known problems then they'll report them, we'll
> > hear about them and they'll eventually get fixed.
> >
> > However, if no-one is reporting problems related to block throttling
> > then it either works well enough for the existing user base or
> > nobody is using the functionality. Either way we don't need to spend
> > time on optimising the filesystem for such functionality.
> >
> > So while you may be skeptical about whether filesystems will be
> > changed, it really comes down to behaviour in real-world
> > deployments. If what we already have is good enough, then we don't
> > need to spend resources on fixing problems no-one is seeing...
>
[CC linux-ext4 list]
Dave,
Just another example where serialization is taking place with ext4.
I created a group with a 1MB/s write limit and ran Ted Ts'o's fsync-tester
program with a small modification: I used the write() system call instead
of pwrite() so that the file size grows. The program basically writes
1MB of data, then fsync's it and measures the fsync time.
I ran two instances of the program in two groups on two separate files. One
instance is throttled to 1MB/s and the other is in the root group, unthrottled.
The unthrottled program gets serialized behind the throttled one. Following
are the fsync times.
Throttled instance Unthrottled Instance
------------------ --------------------
fsync time: 1.0051 fsync time: 1.0067
fsync time: 1.0049 fsync time: 1.0075
fsync time: 1.0048 fsync time: 1.0063
fsync time: 1.0073 fsync time: 1.0062
fsync time: 1.0070 fsync time: 1.0078
fsync time: 1.0032 fsync time: 1.0049
fsync time: 0.0154 fsync time: 1.0068
fsync time: 0.0137 fsync time: 1.0048
Without any throttling both the instances do fine
-------------------------------------------------
Throttled instance Unthrottled Instance
------------------ --------------------
fsync time: 0.0139 fsync time: 0.0162
fsync time: 0.0132 fsync time: 0.0156
fsync time: 0.0149 fsync time: 0.0169
fsync time: 0.0165 fsync time: 0.0152
fsync time: 0.0188 fsync time: 0.0135
fsync time: 0.0137 fsync time: 0.0142
fsync time: 0.0148 fsync time: 0.0149
fsync time: 0.0168 fsync time: 0.0163
fsync time: 0.0153 fsync time: 0.0143
So when we are increasing the size of a file and fsyncing it, other
unthrottled instances of similar activity will get throttled behind it.
IMHO, this is a problem and should be fixed. If the filesystem can fix it,
great. But if not, then we should consider the option of throttling buffered
writes in balance_dirty_pages().
Following is the test program.
/*
 * fsync-tester.c
 *
 * Written by Theodore Ts'o, 3/21/09.
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, version 2.
 */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define SIZE (1024*1024)

static float timeval_subtract(struct timeval *tv1, struct timeval *tv2)
{
	return ((tv1->tv_sec - tv2->tv_sec) +
		((float) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
}

int main(int argc, char **argv)
{
	int fd;
	struct timeval tv, tv2;
	char buf[SIZE];

	fd = open("fsync-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	memset(buf, 'a', SIZE);
	while (1) {
		write(fd, buf, SIZE);
		gettimeofday(&tv, NULL);
		fsync(fd);
		gettimeofday(&tv2, NULL);
		printf("fsync time: %5.4f\n", timeval_subtract(&tv2, &tv));
		sleep(1);
	}
}
Thanks
Vivek
On Thu, Jun 30, 2011 at 04:04:59PM -0400, Vivek Goyal wrote:
[..]
> Dave,
>
> Just another example where serialization is taking place with ext4.
>
> I created a group with 1MB/s write limit and ran tedso's fsync tester
> program with little modification. I used write() system call instead
> of pwrite() so that file size grows. This program basically writes
> 1MB of data and then fsync's it and then measures the fsync time.
>
> I ran two instances of prgram in two groups on two separate files. One
> instances is throttled to 1MB/s and other is in root group unthrottled.
>
> Unthrottled program gets serialized behind throttled one. Following
> are fsync times.
>
> Throttled instance Unthrottled Instance
> ------------------ --------------------
> fsync time: 1.0051 fsync time: 1.0067
> fsync time: 1.0049 fsync time: 1.0075
> fsync time: 1.0048 fsync time: 1.0063
> fsync time: 1.0073 fsync time: 1.0062
> fsync time: 1.0070 fsync time: 1.0078
> fsync time: 1.0032 fsync time: 1.0049
> fsync time: 0.0154 fsync time: 1.0068
> fsync time: 0.0137 fsync time: 1.0048
>
> Without any throttling both the instances do fine
> -------------------------------------------------
> Throttled instance Unthrottled Instance
> ------------------ --------------------
> fsync time: 0.0139 fsync time: 0.0162
> fsync time: 0.0132 fsync time: 0.0156
> fsync time: 0.0149 fsync time: 0.0169
> fsync time: 0.0165 fsync time: 0.0152
> fsync time: 0.0188 fsync time: 0.0135
> fsync time: 0.0137 fsync time: 0.0142
> fsync time: 0.0148 fsync time: 0.0149
> fsync time: 0.0168 fsync time: 0.0163
> fsync time: 0.0153 fsync time: 0.0143
>
> So when we are inreasing the size of file and fsyncing it, other
> unthrottled instances of similar activities will get throttled
> behind it.
>
> IMHO, this is a problem and should be fixed. If filesystem can fix it great.
> But if not, then we should consider the option of throttling buffered writes
> in balance_dirty_pages().
XFS seems to be doing well for this particular test. Unthrottled
fsyncer does not get serialized behind throttled one.
Throttled instance Unthrottled Instance
------------------ --------------------
fsync time: 1.0511 fsync time: 0.0204
fsync time: 1.0486 fsync time: 0.0260
fsync time: 1.0445 fsync time: 0.0260
fsync time: 1.0485 fsync time: 0.0260
fsync time: 1.0446 fsync time: 0.0260
fsync time: 1.2157 fsync time: 0.0260
fsync time: 1.0446 fsync time: 0.0300
fsync time: 1.0484 fsync time: 0.0340
fsync time: 1.0446 fsync time: 0.0221
fsync time: 1.0486 fsync time: 0.0340
fsync time: 1.0406 fsync time: 0.0340
Thanks
Vivek
On Thu, Jun 30, 2011 at 01:14:05PM -0400, Vivek Goyal wrote:
> On Thu, Jun 30, 2011 at 04:52:29PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:09AM -0400, Vivek Goyal wrote:
> > > Put the blk_throtl_dirty_pages() hook in
> > > balance_dirty_pages_ratelimited_nr() to enable task throttling.
> > >
> > > Signed-off-by: Vivek Goyal <[email protected]>
> > > ---
> > > include/linux/blkdev.h | 5 +++++
> > > mm/page-writeback.c | 3 +++
> > > 2 files changed, 8 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index 4ce6e68..5d4a57e 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -1180,12 +1180,17 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
> > > extern int blk_throtl_init(struct request_queue *q);
> > > extern void blk_throtl_exit(struct request_queue *q);
> > > extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> > > +extern void blk_throtl_dirty_pages(struct address_space *mapping,
> > > + unsigned long nr_dirty);
> > > #else /* CONFIG_BLK_DEV_THROTTLING */
> > > static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> > > {
> > > return 0;
> > > }
> > >
> > > +static inline void blk_throtl_dirty_pages(struct address_space *mapping,
> > > + unsigned long nr_dirty) {}
> > > +
> > > static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> > > static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> > > #endif /* CONFIG_BLK_DEV_THROTTLING */
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 31f6988..943e551 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -629,6 +629,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > > unsigned long ratelimit;
> > > unsigned long *p;
> > >
> > > + /* Subject writes to IO controller throttling */
> > > + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> > > +
> >
> > mmmh.. in this way we throttle also tasks that are re-writing dirty pages
> > multiple times.
> >
> > From the controller perspective what is actually generating I/O on block
> > devices is the generation of _new_ dirty pages. Multiple re-writes in page
> > cache should never be throttled IMHO.
> >
> > I would re-write this patch in the following way. What do you think?
> >
> > Thanks,
> > -Andrea
> >
> > ---
> > Subject: [PATCH 8/8] blk-throttle: enable throttling of task while dirtying pages
> >
> > From: Andrea Righi <[email protected]>
> >
> > Put the blk_throtl_dirty_pages() hook in balance_dirty_pages_ratelimited_nr()
> > to enable task throttling.
> >
> > Moreover, modify balance_dirty_pages_ratelimited_nr() to accept the additional
> > parameter "redirty". This parameter can be used to notify if the pages have
> > been dirtied for the first time or re-dirtied.
> >
> > This information can be used by the blkio.throttle controller to distinguish
> > between a WRITE in the page cache, that will eventually generates I/O activity
> > on block device by the writeback code, and a re-WRITE operation that most of
> > the time will not generate additional I/O activity.
> >
> > This means that a task that re-writes multiple times the same blocks of a file
> > is affected by the blkio limitations only for the actual I/O that will be
> > performed to the underlying block devices during the writeback process.
>
> Hi Andrea,
>
> It kind of makes sense to me. Why not do it for regular dirty throttling
> also. There also we want to control overall dirty memory in the system
> and as long as we are not redirtying and not increasing number of
> dirty pages then not subjecting them to throttling will make sense?
>
> If yes, then we can simply skill calling balance_dirty_pages_ratelimited()
> in case page was already dirty instead of introducing another variant
> __balance_dirty_pages_ratelimited_nr()?
I tend to agree. However, I think we're adding a bit of fuzziness by
checking PageDirty() there in generic_perform_write(), and sometimes we
can miss a dirty page.
This fuzziness is probably acceptable for the throttling controller
(I think it's acceptable for us to skip a dirty page in rare conditions,
if the benefit is avoiding throttling every rewrite in the page cache),
but maybe it's not acceptable for checking the global dirty state of the
system.
Introducing a new variant of balance_dirty_pages_ratelimited_nr() surely
doesn't break the previous behavior.
-Andrea
>
> Thanks
> Vivek
>
>
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > include/linux/writeback.h | 8 ++++++--
> > mm/filemap.c | 4 +++-
> > mm/page-writeback.c | 11 +++++++----
> > 3 files changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 17e7ccc..82d69c9 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -129,8 +129,12 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > unsigned long dirty);
> >
> > void page_writeback_init(void);
> > -void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > - unsigned long nr_pages_dirtied);
> > +
> > +#define balance_dirty_pages_ratelimited_nr(__mapping, __nr) \
> > + __balance_dirty_pages_ratelimited_nr(__mapping, __nr, false)
> > +void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > + unsigned long nr_pages_dirtied,
> > + bool redirty);
> >
> > static inline void
> > balance_dirty_pages_ratelimited(struct address_space *mapping)
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index a8251a8..ff4bdc5 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2377,6 +2377,7 @@ static ssize_t generic_perform_write(struct file *file,
> > unsigned long bytes; /* Bytes to write to page */
> > size_t copied; /* Bytes copied from user */
> > void *fsdata;
> > + unsigned int dirty;
> >
> > offset = (pos & (PAGE_CACHE_SIZE - 1));
> > bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> > @@ -2412,6 +2413,7 @@ again:
> > pagefault_enable();
> > flush_dcache_page(page);
> >
> > + dirty = PageDirty(page);
> > mark_page_accessed(page);
> > status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > page, fsdata);
> > @@ -2438,7 +2440,7 @@ again:
> > pos += copied;
> > written += copied;
> >
> > - balance_dirty_pages_ratelimited(mapping);
> > + __balance_dirty_pages_ratelimited_nr(mapping, 1, dirty);
> >
> > } while (iov_iter_count(i));
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 943e551..4ca505d 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -613,6 +613,7 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> > * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> > * @mapping: address_space which was dirtied
> > * @nr_pages_dirtied: number of pages which the caller has just dirtied
> > + * @redirty: false if the pages have been dirtied for the first time
> > *
> > * Processes which are dirtying memory should call in here once for each page
> > * which was newly dirtied. The function will periodically check the system's
> > @@ -623,14 +624,16 @@ static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> > * limit we decrease the ratelimiting by a lot, to prevent individual processes
> > * from overshooting the limit by (ratelimit_pages) each.
> > */
> > -void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > - unsigned long nr_pages_dirtied)
> > +void __balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > + unsigned long nr_pages_dirtied,
> > + bool redirty)
> > {
> > unsigned long ratelimit;
> > unsigned long *p;
> >
> > /* Subject writes to IO controller throttling */
> > - blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> > + if (!redirty)
> > + blk_throtl_dirty_pages(mapping, nr_pages_dirtied);
> >
> > ratelimit = ratelimit_pages;
> > if (mapping->backing_dev_info->dirty_exceeded)
> > @@ -652,7 +655,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> > }
> > preempt_enable();
> > }
> > -EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
> > +EXPORT_SYMBOL(__balance_dirty_pages_ratelimited_nr);
> >
> > void throttle_vm_writeout(gfp_t gfp_mask)
> > {
On Thu, Jun 30, 2011 at 04:44:32PM -0400, Vivek Goyal wrote:
> On Thu, Jun 30, 2011 at 04:04:59PM -0400, Vivek Goyal wrote:
>
> [..]
> > Dave,
> >
> > Just another example where serialization is taking place with ext4.
> >
> > I created a group with 1MB/s write limit and ran tedso's fsync tester
> > program with little modification. I used write() system call instead
> > of pwrite() so that file size grows. This program basically writes
> > 1MB of data and then fsync's it and then measures the fsync time.
> >
> > I ran two instances of prgram in two groups on two separate files. One
> > instances is throttled to 1MB/s and other is in root group unthrottled.
> >
> > Unthrottled program gets serialized behind throttled one. Following
> > are fsync times.
> >
> > Throttled instance Unthrottled Instance
> > ------------------ --------------------
> > fsync time: 1.0051 fsync time: 1.0067
> > fsync time: 1.0049 fsync time: 1.0075
> > fsync time: 1.0048 fsync time: 1.0063
> > fsync time: 1.0073 fsync time: 1.0062
> > fsync time: 1.0070 fsync time: 1.0078
> > fsync time: 1.0032 fsync time: 1.0049
> > fsync time: 0.0154 fsync time: 1.0068
> > fsync time: 0.0137 fsync time: 1.0048
> >
> > Without any throttling both the instances do fine
> > -------------------------------------------------
> > Throttled instance Unthrottled Instance
> > ------------------ --------------------
> > fsync time: 0.0139 fsync time: 0.0162
> > fsync time: 0.0132 fsync time: 0.0156
> > fsync time: 0.0149 fsync time: 0.0169
> > fsync time: 0.0165 fsync time: 0.0152
> > fsync time: 0.0188 fsync time: 0.0135
> > fsync time: 0.0137 fsync time: 0.0142
> > fsync time: 0.0148 fsync time: 0.0149
> > fsync time: 0.0168 fsync time: 0.0163
> > fsync time: 0.0153 fsync time: 0.0143
> >
> > So when we are inreasing the size of file and fsyncing it, other
> > unthrottled instances of similar activities will get throttled
> > behind it.
> >
> > IMHO, this is a problem and should be fixed. If filesystem can fix it great.
> > But if not, then we should consider the option of throttling buffered writes
> > in balance_dirty_pages().
>
> XFS seems to be doing well for this particular test. Unthrottled
> fsyncer does not get serialized behind throttled one.
>
> Throttled instance Unthrottled Instance
> ------------------ --------------------
> fsync time: 1.0511 fsync time: 0.0204
> fsync time: 1.0486 fsync time: 0.0260
> fsync time: 1.0445 fsync time: 0.0260
> fsync time: 1.0485 fsync time: 0.0260
> fsync time: 1.0446 fsync time: 0.0260
> fsync time: 1.2157 fsync time: 0.0260
> fsync time: 1.0446 fsync time: 0.0300
> fsync time: 1.0484 fsync time: 0.0340
> fsync time: 1.0446 fsync time: 0.0221
> fsync time: 1.0486 fsync time: 0.0340
> fsync time: 1.0406 fsync time: 0.0340
And you've just illustrated my point better than I did - that
different filesystems will suffer from different problems, but some
filesystems will work better than others out of the box. :)
Of course, there's no guarantee XFS will remain this way - if you
want us to care about regressions of this sort at all, you need to
encapsulate all this behaviour in a set of automated tests.
Preferably within the xfstests infrastructure, because that now has
fairly wide usage within the fs dev community....
Cheers,
Dave.
--
Dave Chinner
[email protected]