2011-02-28 10:17:07

by Andrea Righi

Subject: [PATCH 0/3] blk-throttle: async write throttling

Overview
========
Currently the blkio.throttle controller only supports synchronous IO requests.
This means that we always look at the current task to identify the "owner" of
each IO request.

However dirty pages in the page cache can be written to disk asynchronously by
the per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may occur in
a different IO context with respect to the task that originally generated the
dirty pages involved in the IO operation. This makes the tracking and
throttling of writeback IO more complicated than that of synchronous IO, from
the blkio controller's perspective.

Proposed solution
=================
In the previous patch set http://lwn.net/Articles/429292/ I proposed to address
the problem of limiting buffered writes by tracking the ownership of all the
dirty pages in the system.

This would make it possible to always identify the owner of each IO operation
at the block layer and apply the appropriate throttling policy implemented by
the blkio.throttle controller.

This solution makes the blkio.throttle controller work as expected for
writeback IO as well, but it does not solve the problem of faster cgroups
getting blocked by slower cgroups (which would expose a potential way to
create a DoS in the system).

In fact, at the moment critical IO requests (those with dependencies on IO
requests made by other cgroups) and non-critical requests are mixed together
at the filesystem layer, so that throttling a single write request may also
stall other requests in the system, and at the block layer it is not possible
to retrieve this information to make the right decision.

A simple solution to this problem could be to just limit the rate of async
writes at the time a task is generating dirty pages in the page cache. The
big advantage of this approach is that it does not need the overhead of
tracking the ownership of the dirty pages, because this way, from the blkio
controller's perspective, all IO operations happen from process context:
writes to memory and synchronous reads from the block device.

The drawback of this approach is that the blkio.throttle controller becomes a
bit leaky, because the controller is still affected by the IO spikes generated
when the kernel threads write back the dirty pages.

Probably an even better approach would be to introduce dirty page ownership
tracking to properly account the cost of each IO operation at the block
layer, and apply the throttling of async writes in memory only when the IO
limits are exceeded.

To summarize, we can identify three possible solutions to properly throttle the
buffered writes:

1) account & throttle everything at block IO layer (bad for "priority
inversion" problems, needs page tracking for blkio)

2) account at block IO layer and throttle in memory (needs page tracking for
blkio)

3) account & throttle in memory (affected by IO spikes, depending on
dirty_ratio / dirty_background_ratio settings)

For now we start with solution 3), which seems to be the simplest way to
proceed.
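
In practice, solution 3) boils down to a single hook in the page-dirtying
path (patches 2/3 below implement exactly this); in simplified form:

	/* mm/page-writeback.c, balance_dirty_pages_ratelimited_nr() */
	q = as_to_rq(mapping);	/* request queue of the inode's block device */
	if (q)
		/* charge the dirtied bytes, sleep if the cgroup limit is exceeded */
		blk_throtl_async(q, ratelimit << PAGE_SHIFT);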

Testcase
========
- create a cgroup with 4MiB/s write limit:
# mount -t cgroup -o blkio none /mnt/cgroup
# mkdir /mnt/cgroup/foo
# echo 8:0 $((4 * 1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device

NOTE: async io is still limited per-device, as well as sync io

- move a task into the cgroup and run a dd to generate some writeback IO

Results:

- 2.6.38-rc6 vanilla:

$ cat /proc/$$/cgroup
1:blkio:/foo
$ dd if=/dev/zero of=zero bs=1M count=128 &
$ dstat -df
--dsk/sda--
read writ
0 0
...
0 0
0 22M <--- writeback starts here and is not limited at all
0 43M
0 45M
0 18M
...

- 2.6.38-rc6 + async write throttling:

$ cat /proc/$$/cgroup
1:blkio:/foo
$ dd if=/dev/zero of=zero bs=1M count=128 &
$ dstat -df
--dsk/sda--
read writ
0 0
0 0
0 0
0 0
0 22M <--- we have some IO spikes but the overall writeback IO
0 0 is controlled according to the blkio write limit
0 0
0 0
0 0
0 29M
0 0
0 0
0 0
0 0
0 26M
0 0
0 0
0 0
0 0
0 30M
0 0
0 0
0 0
0 0
0 20M

TODO
~~~~
- Consider adding the following new files to the blkio controller to allow the
  user to explicitly limit async writes as well as sync writes:

blkio.throttle.async.write_bps_limit
blkio.throttle.async.write_iops_limit

Any feedback is welcome.
-Andrea

[PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio
[PATCH 2/3] blkio-throttle: infrastructure to throttle async io
[PATCH 3/3] blkio-throttle: async write io instrumentation

block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++---------------
fs/direct-io.c | 1 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 6 +++
mm/page-writeback.c | 17 +++++++
5 files changed, 97 insertions(+), 35 deletions(-)


2011-02-28 10:17:12

by Andrea Righi

Subject: [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio

Introduce a new flag to identify whether a bio has been generated by a
direct IO operation.

This flag is used by the blkio controller to distinguish write IO requests
that have been issued by the current task, and can therefore be throttled
directly, from those generated in another IO context as a result of a
buffered IO operation.

Signed-off-by: Andrea Righi <[email protected]>
---
fs/direct-io.c | 1 +
include/linux/blk_types.h | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b044705..fe364a4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -361,6 +361,7 @@ static void dio_bio_submit(struct dio *dio)
unsigned long flags;

bio->bi_private = dio;
+ bio->bi_rw |= REQ_DIRECT;

spin_lock_irqsave(&dio->bio_lock, flags);
dio->refcount++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..2f98c03 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -130,6 +130,7 @@ enum rq_flag_bits {
/* bio only flags */
__REQ_UNPLUG, /* unplug the immediately after submission */
__REQ_RAHEAD, /* read ahead, can fail anytime */
+ __REQ_DIRECT, /* direct io request */
__REQ_THROTTLED, /* This bio has already been subjected to
* throttling rules. Don't do it again. */

@@ -173,6 +174,7 @@ enum rq_flag_bits {
#define REQ_UNPLUG (1 << __REQ_UNPLUG)
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
#define REQ_THROTTLED (1 << __REQ_THROTTLED)
+#define REQ_DIRECT (1 << __REQ_DIRECT)

#define REQ_SORTED (1 << __REQ_SORTED)
#define REQ_SOFTBARRIER (1 << __REQ_SOFTBARRIER)
--
1.7.1

2011-02-28 10:17:17

by Andrea Righi

Subject: [PATCH 3/3] blkio-throttle: async write io instrumentation

Enforce the IO throttling policy on asynchronous writes at the time tasks
dirty pages in the page cache.

Signed-off-by: Andrea Righi <[email protected]>
---
mm/page-writeback.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..e3f5f4f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -607,6 +607,19 @@ void set_page_dirty_balance(struct page *page, int page_mkwrite)
}
}

+/*
+ * Get a request_queue of the underlying superblock device from an
+ * address_space.
+ */
+static struct request_queue *as_to_rq(struct address_space *mapping)
+{
+ struct block_device *bdev;
+
+ bdev = (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+ return bdev ? bdev_get_queue(bdev) : NULL;
+}
+
static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;

/**
@@ -628,6 +641,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
{
unsigned long ratelimit;
unsigned long *p;
+ struct request_queue *q;

ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
@@ -644,6 +658,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
ratelimit = sync_writeback_pages(*p);
*p = 0;
preempt_enable();
+ q = as_to_rq(mapping);
+ if (q)
+ blk_throtl_async(q, ratelimit << PAGE_SHIFT);
balance_dirty_pages(mapping, ratelimit);
return;
}
--
1.7.1

2011-02-28 10:17:19

by Andrea Righi

Subject: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io

Add blk_throtl_async() to throttle async io writes.

The idea is to stop applications before they can generate too many dirty
pages in the page cache. Each time a chunk of pages is written to memory we
charge the current task / cgroup for the size of the pages dirtied.

If the blkio limits are exceeded we also enforce throttling at this layer,
instead of throttling writes at the block IO layer, where it is not possible
to associate the pages with the process that dirtied them.

In this case throttling can be simply enforced via a
schedule_timeout_killable().

Also change some internal blkio functions to make them more generic, allowing
the size and type of the IO operation to be passed in without extracting this
information directly from a struct bio. In this way the same functions can
also be used in contexts where a struct bio has not been created yet (e.g.,
writes in the page cache).

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++----------------
include/linux/blkdev.h | 6 +++
2 files changed, 77 insertions(+), 35 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a89043a..07ef429 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
}

static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+ bool rw, size_t size, unsigned long *wait)
{
- bool rw = bio_data_dir(bio);
unsigned int io_allowed;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
u64 tmp;
@@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
}

static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+ bool rw, size_t size, unsigned long *wait)
{
- bool rw = bio_data_dir(bio);
u64 bytes_allowed, extra_bytes, tmp;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;

@@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
do_div(tmp, HZ);
bytes_allowed = tmp;

- if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
+ if (tg->bytes_disp[rw] + size <= bytes_allowed) {
if (wait)
*wait = 0;
return 1;
}

/* Calc approx time to dispatch */
- extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
+ extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed;
jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);

if (!jiffy_wait)
@@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
return 0;
}

-/*
- * Returns whether one can dispatch a bio or not. Also returns approx number
- * of jiffies to wait before this bio is with-in IO rate and can be dispatched
- */
static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
- struct bio *bio, unsigned long *wait)
+ bool rw, size_t size, unsigned long *wait)
{
- bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;

- /*
- * Currently whole state machine of group depends on first bio
- * queued in the group bio list. So one should not be calling
- * this function with a different bio if there are other bios
- * queued.
- */
- BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
-
/* If tg->bps = -1, then BW is unlimited */
if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
if (wait)
@@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
}

- if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
- && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
+ if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) &&
+ tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) {
if (wait)
*wait = 0;
return 1;
@@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
return 0;
}

-static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
+/*
+ * Returns whether one can dispatch a bio or not. Also returns approx number
+ * of jiffies to wait before this bio is with-in IO rate and can be dispatched
+ */
+static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg,
+ struct bio *bio, unsigned long *wait)
{
bool rw = bio_data_dir(bio);
- bool sync = bio->bi_rw & REQ_SYNC;

+ /*
+ * Currently whole state machine of group depends on first bio queued
+ * in the group bio list. So one should not be calling this function
+ * with a different bio if there are other bios queued.
+ */
+ BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
+
+ return tg_may_dispatch(td, tg, rw, bio->bi_size, wait);
+}
+
+static void
+throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size)
+{
/* Charge the bio to the group */
- tg->bytes_disp[rw] += bio->bi_size;
+ tg->bytes_disp[rw] += size;
tg->io_disp[rw]++;

/*
* TODO: This will take blkg->stats_lock. Figure out a way
* to avoid this cost.
*/
- blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
+ blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync);
}

static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
@@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
struct bio *bio;

if ((bio = bio_list_peek(&tg->bio_lists[READ])))
- tg_may_dispatch(td, tg, bio, &read_wait);
+ tg_may_dispatch_bio(td, tg, bio, &read_wait);

if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
- tg_may_dispatch(td, tg, bio, &write_wait);
+ tg_may_dispatch_bio(td, tg, bio, &write_wait);

min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
BUG_ON(td->nr_queued[rw] <= 0);
td->nr_queued[rw]--;

- throtl_charge_bio(tg, bio);
+ throtl_charge_io(tg, bio_data_dir(bio),
+ bio->bi_rw & REQ_SYNC, bio->bi_size);
bio_list_add(bl, bio);
bio->bi_rw |= REQ_THROTTLED;

@@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,

/* Try to dispatch 75% READS and 25% WRITES */

- while ((bio = bio_list_peek(&tg->bio_lists[READ]))
- && tg_may_dispatch(td, tg, bio, NULL)) {
+ while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
+ tg_may_dispatch_bio(td, tg, bio, NULL)) {

tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_reads++;
@@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
break;
}

- while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
- && tg_may_dispatch(td, tg, bio, NULL)) {
+ while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
+ tg_may_dispatch_bio(td, tg, bio, NULL)) {

tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
nr_writes++;
@@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
bio->bi_rw &= ~REQ_THROTTLED;
return 0;
}
+ /* Async writes are ratelimited in blk_throtl_async() */
+ if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT))
+ return 0;

spin_lock_irq(q->queue_lock);
tg = throtl_get_tg(td);
@@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}

/* Bio is with-in rate limit of group */
- if (tg_may_dispatch(td, tg, bio, NULL)) {
- throtl_charge_bio(tg, bio);
+ if (tg_may_dispatch_bio(td, tg, bio, NULL)) {
+ throtl_charge_io(tg, bio_data_dir(bio),
+ bio->bi_rw & REQ_SYNC, bio->bi_size);
goto out;
}

@@ -1044,6 +1051,35 @@ out:
return 0;
}

+/*
+ * Enforce throttling on async i/o writes
+ */
+int blk_throtl_async(struct request_queue *q, size_t size)
+{
+ struct throtl_data *td = q->td;
+ struct throtl_grp *tg;
+ bool rw = 1;
+ unsigned long wait = 0;
+
+ spin_lock_irq(q->queue_lock);
+ tg = throtl_get_tg(td);
+ if (tg_may_dispatch(td, tg, rw, size, &wait))
+ throtl_charge_io(tg, rw, false, size);
+ else
+ throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu"
+ " iodisp=%u iops=%u queued=%d/%d",
+ rw == READ ? 'R' : 'W',
+ tg->bytes_disp[rw], size, tg->bps[rw],
+ tg->io_disp[rw], tg->iops[rw],
+ tg->nr_queued[READ], tg->nr_queued[WRITE]);
+ spin_unlock_irq(q->queue_lock);
+
+ if (wait >= throtl_slice)
+ schedule_timeout_killable(wait);
+
+ return 0;
+}
+
int blk_throtl_init(struct request_queue *q)
{
struct throtl_data *td;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..01c8241 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1136,6 +1136,7 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern int blk_throtl_async(struct request_queue *q, size_t size);
extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay);
extern void throtl_shutdown_timer_wq(struct request_queue *q);
#else /* CONFIG_BLK_DEV_THROTTLING */
@@ -1144,6 +1145,11 @@ static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
return 0;
}

+static inline int blk_throtl_async(struct request_queue *q, size_t size)
+{
+ return 0;
+}
+
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {}
--
1.7.1

2011-02-28 23:02:41

by Vivek Goyal

Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> Overview
> ========
> Currently the blkio.throttle controller only support synchronous IO requests.
> This means that we always look at the current task to identify the "owner" of
> each IO request.
>
> However dirty pages in the page cache can be wrote to disk asynchronously by
> the per-bdi flusher kernel threads or by any other thread in the system,
> according to the writeback policy.
>
> For this reason the real writes to the underlying block devices may
> occur in a different IO context respect to the task that originally
> generated the dirty pages involved in the IO operation. This makes the
> tracking and throttling of writeback IO more complicate respect to the
> synchronous IO from the blkio controller's perspective.
>
> Proposed solution
> =================
> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve
> the problem of the buffered writes limitation by tracking the ownership of all
> the dirty pages in the system.
>
> This would allow to always identify the owner of each IO operation at the block
> layer and apply the appropriate throttling policy implemented by the
> blkio.throttle controller.
>
> This solution makes the blkio.throttle controller to work as expected also for
> writeback IO, but it does not resolve the problem of faster cgroups getting
> blocked by slower cgroups (that would expose a potential way to create DoS in
> the system).
>
> In fact, at the moment critical IO requests (that have dependency with other IO
> requests made by other cgroups) and non-critical requests are mixed together at
> the filesystem layer in a way that throttling a single write request may stop
> also other requests in the system, and at the block layer it's not possible to
> retrieve such informations to make the right decision.
>
> A simple solution to this problem could be to just limit the rate of async
> writes at the time a task is generating dirty pages in the page cache. The
> big advantage of this approach is that it does not need the overhead of
> tracking the ownership of the dirty pages, because in this way from the blkio
> controller perspective all the IO operations will happen from the process
> context: writes in memory and synchronous reads from the block device.
>
> The drawback of this approach is that the blkio.throttle controller becomes a
> little bit leaky, because with this solution the controller is still affected
> by the IO spikes during the writeback of dirty pages executed by the kernel
> threads.
>
> Probably an even better approach would be to introduce the tracking of the
> dirty page ownership to properly account the cost of each IO operation at the
> block layer and apply the throttling of async writes in memory only when IO
> limits are exceeded.

Andrea, I am curious to know more about the third option. Can you give more
details about accounting at the block layer but throttling in memory? Say a
process starts IO; it will still be within the throttle limits at the block
layer (because no writeback has started yet), so the process will write a
bunch of pages into the cache. By the time the throttle limits are crossed at
the block layer, we already have lots of dirty data in the page cache, and
throttling the process at that point is already too late?

>
> To summarize, we can identify three possible solutions to properly throttle the
> buffered writes:
>
> 1) account & throttle everything at block IO layer (bad for "priority
> inversion" problems, needs page tracking for blkio)
>
> 2) account at block IO layer and throttle in memory (needs page tracking for
> blkio)
>
> 3) account & throttle in memory (affected by IO spikes, depending on
> dirty_ratio / dirty_background_ratio settings)
>
> For now we start with the solution 3) that seems to be the simplest way to
> proceed.

Yes, IO spikes are the weakness of this 3rd solution. But it should be
simple too. Also, as you said, the problem can be reduced to some extent by
lowering dirty_ratio and the background dirty ratio, but that will have other
trade-offs, I guess.

Thanks
Vivek

>
> Testcase
> ========
> - create a cgroup with 4MiB/s write limit:
> # mount -t cgroup -o blkio none /mnt/cgroup
> # mkdir /mnt/cgroup/foo
> # echo 8:0 $((4 * 1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device
>
> NOTE: async io is still limited per-device, as well as sync io
>
> - move a task into the cgroup and run a dd to generate some writeback IO
>
> Results:
>
> - 2.6.38-rc6 vanilla:
>
> $ cat /proc/$$/cgroup
> 1:blkio:/foo
> $ dd if=/dev/zero of=zero bs=1M count=128 &
> $ dstat -df
> --dsk/sda--
> read writ
> 0 0
> ...
> 0 0
> 0 22M <--- writeback starts here and is not limited at all
> 0 43M
> 0 45M
> 0 18M
> ...
>
> - 2.6.38-rc6 + async write throttling:
>
> $ cat /proc/$$/cgroup
> 1:blkio:/foo
> $ dd if=/dev/zero of=zero bs=1M count=128 &
> $ dstat -df
> --dsk/sda--
> read writ
> 0 0
> 0 0
> 0 0
> 0 0
> 0 22M <--- we have some IO spikes but the overall writeback IO
> 0 0 is controlled according to the blkio write limit
> 0 0
> 0 0
> 0 0
> 0 29M
> 0 0
> 0 0
> 0 0
> 0 0
> 0 26M
> 0 0
> 0 0
> 0 0
> 0 0
> 0 30M
> 0 0
> 0 0
> 0 0
> 0 0
> 0 20M
>
> TODO
> ~~~~
> - Consider to add the following new files in the blkio controller to allow the
> user to explicitly limit async writes as well as sync writes:
>
> blkio.throttle.async.write_bps_limit
> blkio.throttle.async.write_iops_limit
>
> Any feedback is welcome.
> -Andrea
>
> [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio
> [PATCH 2/3] blkio-throttle: infrastructure to throttle async io
> [PATCH 3/3] blkio-throttle: async write io instrumentation
>
> block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++---------------
> fs/direct-io.c | 1 +
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 6 +++
> mm/page-writeback.c | 17 +++++++
> 5 files changed, 97 insertions(+), 35 deletions(-)

2011-03-01 16:08:29

by Vivek Goyal

Subject: Re: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io

On Mon, Feb 28, 2011 at 11:15:04AM +0100, Andrea Righi wrote:
> Add blk_throtl_async() to throttle async io writes.
>
> The idea is to stop applications before they can generate too much dirty
> pages in the page cache. Each time a chunk of pages is wrote in memory
> we charge the current task / cgroup for the size of the pages dirtied.
>
> If blkio limit are exceeded we also enforce throttling at this layer,
> instead of throttling writes at the block IO layer, where it's not
> possible to associate the pages with the process that dirtied them.
>
> In this case throttling can be simply enforced via a
> schedule_timeout_killable().
>
> Also change some internal blkio function to make them more generic, and
> allow to get the size and type of the IO operation without extracting
> these information directly from a struct bio. In this way the same
> functions can be used also in the contextes where a struct bio is not
> yet created (e.g., writes in the page cache).
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++----------------
> include/linux/blkdev.h | 6 +++
> 2 files changed, 77 insertions(+), 35 deletions(-)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index a89043a..07ef429 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
> }
>
> static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> - struct bio *bio, unsigned long *wait)
> + bool rw, size_t size, unsigned long *wait)
> {
> - bool rw = bio_data_dir(bio);
> unsigned int io_allowed;
> unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> u64 tmp;
> @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> }
>
> static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> - struct bio *bio, unsigned long *wait)
> + bool rw, size_t size, unsigned long *wait)
> {
> - bool rw = bio_data_dir(bio);
> u64 bytes_allowed, extra_bytes, tmp;
> unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
>
> @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> do_div(tmp, HZ);
> bytes_allowed = tmp;
>
> - if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
> + if (tg->bytes_disp[rw] + size <= bytes_allowed) {
> if (wait)
> *wait = 0;
> return 1;
> }
>
> /* Calc approx time to dispatch */
> - extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
> + extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed;
> jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
>
> if (!jiffy_wait)
> @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> return 0;
> }
>
> -/*
> - * Returns whether one can dispatch a bio or not. Also returns approx number
> - * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> - */
> static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> - struct bio *bio, unsigned long *wait)
> + bool rw, size_t size, unsigned long *wait)
> {
> - bool rw = bio_data_dir(bio);
> unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
>
> - /*
> - * Currently whole state machine of group depends on first bio
> - * queued in the group bio list. So one should not be calling
> - * this function with a different bio if there are other bios
> - * queued.
> - */
> - BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> -
> /* If tg->bps = -1, then BW is unlimited */
> if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
> if (wait)
> @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
> }
>
> - if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
> - && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
> + if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) &&
> + tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) {
> if (wait)
> *wait = 0;
> return 1;
> @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> return 0;
> }
>
> -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
> +/*
> + * Returns whether one can dispatch a bio or not. Also returns approx number
> + * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> + */
> +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg,
> + struct bio *bio, unsigned long *wait)
> {
> bool rw = bio_data_dir(bio);
> - bool sync = bio->bi_rw & REQ_SYNC;
>
> + /*
> + * Currently whole state machine of group depends on first bio queued
> + * in the group bio list. So one should not be calling this function
> + * with a different bio if there are other bios queued.
> + */
> + BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> +
> + return tg_may_dispatch(td, tg, rw, bio->bi_size, wait);
> +}
> +
> +static void
> +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size)
> +{
> /* Charge the bio to the group */
> - tg->bytes_disp[rw] += bio->bi_size;
> + tg->bytes_disp[rw] += size;
> tg->io_disp[rw]++;
>
> /*
> * TODO: This will take blkg->stats_lock. Figure out a way
> * to avoid this cost.
> */
> - blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
> + blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync);
> }
>
> static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
> @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> struct bio *bio;
>
> if ((bio = bio_list_peek(&tg->bio_lists[READ])))
> - tg_may_dispatch(td, tg, bio, &read_wait);
> + tg_may_dispatch_bio(td, tg, bio, &read_wait);
>
> if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
> - tg_may_dispatch(td, tg, bio, &write_wait);
> + tg_may_dispatch_bio(td, tg, bio, &write_wait);
>
> min_wait = min(read_wait, write_wait);
> disptime = jiffies + min_wait;
> @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
> BUG_ON(td->nr_queued[rw] <= 0);
> td->nr_queued[rw]--;
>
> - throtl_charge_bio(tg, bio);
> + throtl_charge_io(tg, bio_data_dir(bio),
> + bio->bi_rw & REQ_SYNC, bio->bi_size);
> bio_list_add(bl, bio);
> bio->bi_rw |= REQ_THROTTLED;
>
> @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
>
> /* Try to dispatch 75% READS and 25% WRITES */
>
> - while ((bio = bio_list_peek(&tg->bio_lists[READ]))
> - && tg_may_dispatch(td, tg, bio, NULL)) {
> + while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
> + tg_may_dispatch_bio(td, tg, bio, NULL)) {
>
> tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> nr_reads++;
> @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> break;
> }
>
> - while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
> - && tg_may_dispatch(td, tg, bio, NULL)) {
> + while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
> + tg_may_dispatch_bio(td, tg, bio, NULL)) {
>
> tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> nr_writes++;
> @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> bio->bi_rw &= ~REQ_THROTTLED;
> return 0;
> }
> + /* Async writes are ratelimited in blk_throtl_async() */
> + if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT))
> + return 0;
>
> spin_lock_irq(q->queue_lock);
> tg = throtl_get_tg(td);
> @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> }
>
> /* Bio is with-in rate limit of group */
> - if (tg_may_dispatch(td, tg, bio, NULL)) {
> - throtl_charge_bio(tg, bio);
> + if (tg_may_dispatch_bio(td, tg, bio, NULL)) {
> + throtl_charge_io(tg, bio_data_dir(bio),
> + bio->bi_rw & REQ_SYNC, bio->bi_size);
> goto out;
> }
>
> @@ -1044,6 +1051,35 @@ out:
> return 0;
> }
>
> +/*
> + * Enforce throttling on async i/o writes
> + */
> +int blk_throtl_async(struct request_queue *q, size_t size)
> +{
> + struct throtl_data *td = q->td;
> + struct throtl_grp *tg;
> + bool rw = 1;
> + unsigned long wait = 0;
> +
> + spin_lock_irq(q->queue_lock);
> + tg = throtl_get_tg(td);
> + if (tg_may_dispatch(td, tg, rw, size, &wait))
> + throtl_charge_io(tg, rw, false, size);
> + else
> + throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu"
> + " iodisp=%u iops=%u queued=%d/%d",
> + rw == READ ? 'R' : 'W',
> + tg->bytes_disp[rw], size, tg->bps[rw],
> + tg->io_disp[rw], tg->iops[rw],
> + tg->nr_queued[READ], tg->nr_queued[WRITE]);
> + spin_unlock_irq(q->queue_lock);
> +
> + if (wait >= throtl_slice)
> + schedule_timeout_killable(wait);
> +
> + return 0;
> +}

I think the above is not correct.

We have this notion of stretching/renewing the throtl_slice if a bio is big
and cannot be dispatched in one slice; and once a bio has been dispatched, we
trim the slice.

So first of all, the above code does not take care of other IO going on in
the same group. We might be extending a slice for a bigger queued bio, say
1MB, and suddenly a write comes along, decides it can dispatch, and starves
the dispatch of that bio.

Similarly, if more than one task in a group is doing writes, they will all
wait for a certain time and then start dispatching. I think we might need a
wait queue per group: put the async tasks there, start the time slice on
behalf of the task at the front of the queue, and wake it up once that task
meets its quota (a rough sketch of this follows below).

By keeping track of which bio the current slice is operating for, we reduce
the number of wakeups of the throtl work. The work is woken up only when a
bio is ready to be dispatched. So if a bio is queued, that means a slice in
that direction is already running, and any new IO in that group is simply
queued instead of checking again whether it can be dispatched or not.
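
Something along these lines; just a rough sketch to illustrate the idea, the
async_waitq field and the helper below are made up and only reuse the
signatures introduced in this patch:

/* new field in struct throtl_grp (hypothetical) */
	wait_queue_head_t async_waitq;	/* buffered writers sleep here */

/* called from the page-dirtying path instead of sleeping unconditionally */
static void tg_wait_async_quota(struct throtl_data *td, struct throtl_grp *tg,
				size_t size)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&tg->async_waitq, &wait, TASK_KILLABLE);
		spin_lock_irq(td->queue->queue_lock);
		if (tg_may_dispatch(td, tg, WRITE, size, NULL)) {
			throtl_charge_io(tg, WRITE, false, size);
			spin_unlock_irq(td->queue->queue_lock);
			break;
		}
		spin_unlock_irq(td->queue->queue_lock);
		if (fatal_signal_pending(current))
			break;
		/* woken up by the throttle work once the slice renews */
		io_schedule();
	}
	finish_wait(&tg->async_waitq, &wait);
}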

Thanks
Vivek

> +
> int blk_throtl_init(struct request_queue *q)
> {
> struct throtl_data *td;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 4d18ff3..01c8241 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1136,6 +1136,7 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
> extern int blk_throtl_init(struct request_queue *q);
> extern void blk_throtl_exit(struct request_queue *q);
> extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> +extern int blk_throtl_async(struct request_queue *q, size_t size);
> extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay);
> extern void throtl_shutdown_timer_wq(struct request_queue *q);
> #else /* CONFIG_BLK_DEV_THROTTLING */
> @@ -1144,6 +1145,11 @@ static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> return 0;
> }
>
> +static inline int blk_throtl_async(struct request_queue *q, size_t size)
> +{
> + return 0;
> +}
> +
> static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {}
> --
> 1.7.1

2011-03-01 16:28:23

by Vivek Goyal

Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:

[..]
> TODO
> ~~~~
> - Consider to add the following new files in the blkio controller to allow the
> user to explicitly limit async writes as well as sync writes:
>
> blkio.throttle.async.write_bps_limit
> blkio.throttle.async.write_iops_limit

I am kind of split on this.

- One way of thinking is that blkio.throttle.read/write_limits represent
limits on the request queue, for IO that is actually passing through the
queue now. So we should not mix the two and should keep the async limits
separate. This also tells the user explicitly that async throttling does not
mean the same thing, since the throttling happens before entering the page
cache and there can be/will be IO spikes later at the request queue.

Also, creating separate files leaves the door open for a future extension
implementing async control when the async IO is actually submitted to the
request queue. (Though I think that will be hard, as it requires making sure
that all the filesystems, the writeback logic and the device mapper drivers
are aware of throttling and take steps to ensure that faster groups are not
stuck behind slower groups.)

So keeping the async accounting separate will help make clear that async
control is not the same as sync control. There are fundamental differences.


- On the other hand, it makes life a bit simpler for users, as they don't
have to specify the async limits separately and there is one aggregate limit
for sync and async (assuming we fix the throttling state issues so that the
throttling logic can handle both bio and task throttling out of a single
limit).

Any thoughts?

Thanks
Vivek

2011-03-01 21:41:44

by Vivek Goyal

Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Tue, Mar 01, 2011 at 11:27:53AM -0500, Vivek Goyal wrote:
> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
>
> [..]
> > TODO
> > ~~~~
> > - Consider to add the following new files in the blkio controller to allow the
> > user to explicitly limit async writes as well as sync writes:
> >
> > blkio.throttle.async.write_bps_limit
> > blkio.throttle.async.write_iops_limit
>
> I am kind of split on this.
>
> - One way of thinking is that blkio.throttle.read/write_limits represent
> the limits on requeuest queue of the IO which is actually passing
> through queue now. So we should not mix the two and keep async limits
> separately. This will also tell the customer explicitly that async
> throttling does not mean the same thing as throttling happens before
> entering the page cache and there can be/will be IO spikes later
> at the request queue.
>
> Also creating the separate files leaves the door open for future
> extension of implementing async control when async IO is actually
> submitted to request queue. (Though I think that will be hard as
> making sure all the filesystems, writeback logic, device mapper
> drivers are aware of throttling and will take steps to ensure faster
> groups are not stuck behind slower groups).
>
> So keeping async accounting separate will help differentiating that
> async control is not same as sync control. There are fundamental
> differences.
>
>
> - On the other hand, it makes life a bit simple for user as they don't
> have to specify the async limits separately and there is one aggregate
> limit for sync and async (assuming we fix the throttling state issues
> so that throttling logic can handle both bio and task throttling out
> of single limit).
>
> Any thoughts?

Thinking more about it.

I was chatting with Jeff Moyer about this and he mentioned that applications
might not like this frequent blocking in the buffered write path. That is a
valid concern, but I don't know how to address it. dirty_ratio, dirty_bytes
and dirty_background_ratio will not be enough to throttle the overall WRITE
rate of an application.

I am now a little inclined towards creating separate files for async
writeout and not mixing this with the throttling at the device level, for the
following reasons.

- It is a different throttling functionality at a different layer, and it is
not the same as throttling sync IO at the device level when the IO is
submitted to the device. We are just making use of the existing throttling
infrastructure to account for write rates into the page cache. So having a
separate file and separate async throttle state per group in blkio makes
sense to me.

- There will be these unintuitive side effects of frequent blocking, and
keeping it separate at least allows us to disable this functionality.

- And we might come up with throttling of async IO at the device level, and
keeping it separate helps us at that point.

So to me, throttling the async rate while writing to the page cache is a
different functionality, and it should be controlled by separate blkio files.
The bigger question is in which cases this kind of functionality is really
useful, given that it can still produce IO spikes and given that the
application will be put to sleep frequently, depending on the configured
throttling rate.

Thanks
Vivek

2011-03-02 13:28:34

by Andrea Righi

Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> > Overview
> > ========
> > Currently the blkio.throttle controller only support synchronous IO requests.
> > This means that we always look at the current task to identify the "owner" of
> > each IO request.
> >
> > However dirty pages in the page cache can be wrote to disk asynchronously by
> > the per-bdi flusher kernel threads or by any other thread in the system,
> > according to the writeback policy.
> >
> > For this reason the real writes to the underlying block devices may
> > occur in a different IO context respect to the task that originally
> > generated the dirty pages involved in the IO operation. This makes the
> > tracking and throttling of writeback IO more complicate respect to the
> > synchronous IO from the blkio controller's perspective.
> >
> > Proposed solution
> > =================
> > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve
> > the problem of the buffered writes limitation by tracking the ownership of all
> > the dirty pages in the system.
> >
> > This would allow to always identify the owner of each IO operation at the block
> > layer and apply the appropriate throttling policy implemented by the
> > blkio.throttle controller.
> >
> > This solution makes the blkio.throttle controller to work as expected also for
> > writeback IO, but it does not resolve the problem of faster cgroups getting
> > blocked by slower cgroups (that would expose a potential way to create DoS in
> > the system).
> >
> > In fact, at the moment critical IO requests (that have dependency with other IO
> > requests made by other cgroups) and non-critical requests are mixed together at
> > the filesystem layer in a way that throttling a single write request may stop
> > also other requests in the system, and at the block layer it's not possible to
> > retrieve such informations to make the right decision.
> >
> > A simple solution to this problem could be to just limit the rate of async
> > writes at the time a task is generating dirty pages in the page cache. The
> > big advantage of this approach is that it does not need the overhead of
> > tracking the ownership of the dirty pages, because in this way from the blkio
> > controller perspective all the IO operations will happen from the process
> > context: writes in memory and synchronous reads from the block device.
> >
> > The drawback of this approach is that the blkio.throttle controller becomes a
> > little bit leaky, because with this solution the controller is still affected
> > by the IO spikes during the writeback of dirty pages executed by the kernel
> > threads.
> >
> > Probably an even better approach would be to introduce the tracking of the
> > dirty page ownership to properly account the cost of each IO operation at the
> > block layer and apply the throttling of async writes in memory only when IO
> > limits are exceeded.
>
> Andrea, I am curious to know more about it third option. Can you give more
> details about accouting in block layer but throttling in memory. So say
> a process starts IO, then it will still be in throttle limits at block
> layer (because no writeback has started), then the process will write
> bunch of pages in cache. By the time throttle limits are crossed at
> block layer, we already have lots of dirty data in page cache and
> throttling process now is already late?

Charging the cost of each IO operation at the block layer would allow tasks
to write to memory at maximum speed. Instead, with the 3rd approach, tasks
are forced to write to memory at the rate defined by
blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device).

Once we have the per-cgroup dirty memory accounting and limiting feature,
with this approach each cgroup could still fill its own dirty memory quota at
the maximum rate.

BTW, another thing we probably need is that each cgroup should be forced to
write back its own inodes when the limit defined by dirty_ratio is exceeded.
I mean, instead of selecting inodes from the list of dirty inodes in a FIFO
way, provide better logic to "assign" the ownership of each inode to a cgroup
(maybe the one that generated most of the dirty pages in the inode) and use,
for example, a list of dirty inodes per cgroup or something similar.
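
Just to give an idea of the data structures I have in mind (all names below
are hypothetical, nothing of this is implemented yet):

/* per-cgroup list of "owned" dirty inodes */
struct blkio_dirty_inode {
	struct inode		*inode;
	struct list_head	node;	/* linked in the cgroup's list */
};

struct blkio_cgroup_dirty {
	spinlock_t		lock;
	struct list_head	inodes;	/* inodes mostly dirtied by this cgroup */
};

/*
 * When a task dirties a page, the inode is (re)assigned to the cgroup that
 * generated most of its dirty pages; when a cgroup exceeds its dirty limit,
 * writeback walks that cgroup's ->inodes list instead of the global FIFO
 * list of dirty inodes.
 */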

In this way we should really be able to provide a good quality of service in
most cases, IMHO.

I also plan to write another patch set to implement this logic.

>
> >
> > To summarize, we can identify three possible solutions to properly throttle the
> > buffered writes:
> >
> > 1) account & throttle everything at block IO layer (bad for "priority
> > inversion" problems, needs page tracking for blkio)
> >
> > 2) account at block IO layer and throttle in memory (needs page tracking for
> > blkio)
> >
> > 3) account & throttle in memory (affected by IO spikes, depending on
> > dirty_ratio / dirty_background_ratio settings)
> >
> > For now we start with the solution 3) that seems to be the simplest way to
> > proceed.
>
> Yes, IO spikes is the weakness of this 3rd solution. But it should be
> simple too. Also as you said problem can be reduced up to some extent
> by changing reducing dirty_ratio and background dirty ratio But that
> will have other trade offs, I guess.

Agreed.

-Andrea

2011-03-02 13:31:53

by Andrea Righi

Subject: Re: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io

On Tue, Mar 01, 2011 at 11:06:27AM -0500, Vivek Goyal wrote:
> On Mon, Feb 28, 2011 at 11:15:04AM +0100, Andrea Righi wrote:
> > Add blk_throtl_async() to throttle async io writes.
> >
> > The idea is to stop applications before they can generate too much dirty
> > pages in the page cache. Each time a chunk of pages is wrote in memory
> > we charge the current task / cgroup for the size of the pages dirtied.
> >
> > If blkio limit are exceeded we also enforce throttling at this layer,
> > instead of throttling writes at the block IO layer, where it's not
> > possible to associate the pages with the process that dirtied them.
> >
> > In this case throttling can be simply enforced via a
> > schedule_timeout_killable().
> >
> > Also change some internal blkio function to make them more generic, and
> > allow to get the size and type of the IO operation without extracting
> > these information directly from a struct bio. In this way the same
> > functions can be used also in the contextes where a struct bio is not
> > yet created (e.g., writes in the page cache).
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++----------------
> > include/linux/blkdev.h | 6 +++
> > 2 files changed, 77 insertions(+), 35 deletions(-)
> >
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index a89043a..07ef429 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
> > }
> >
> > static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> > - struct bio *bio, unsigned long *wait)
> > + bool rw, size_t size, unsigned long *wait)
> > {
> > - bool rw = bio_data_dir(bio);
> > unsigned int io_allowed;
> > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> > u64 tmp;
> > @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> > }
> >
> > static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> > - struct bio *bio, unsigned long *wait)
> > + bool rw, size_t size, unsigned long *wait)
> > {
> > - bool rw = bio_data_dir(bio);
> > u64 bytes_allowed, extra_bytes, tmp;
> > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> >
> > @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> > do_div(tmp, HZ);
> > bytes_allowed = tmp;
> >
> > - if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
> > + if (tg->bytes_disp[rw] + size <= bytes_allowed) {
> > if (wait)
> > *wait = 0;
> > return 1;
> > }
> >
> > /* Calc approx time to dispatch */
> > - extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
> > + extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed;
> > jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> >
> > if (!jiffy_wait)
> > @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> > return 0;
> > }
> >
> > -/*
> > - * Returns whether one can dispatch a bio or not. Also returns approx number
> > - * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> > - */
> > static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> > - struct bio *bio, unsigned long *wait)
> > + bool rw, size_t size, unsigned long *wait)
> > {
> > - bool rw = bio_data_dir(bio);
> > unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
> >
> > - /*
> > - * Currently whole state machine of group depends on first bio
> > - * queued in the group bio list. So one should not be calling
> > - * this function with a different bio if there are other bios
> > - * queued.
> > - */
> > - BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> > -
> > /* If tg->bps = -1, then BW is unlimited */
> > if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
> > if (wait)
> > @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> > throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
> > }
> >
> > - if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
> > - && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
> > + if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) &&
> > + tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) {
> > if (wait)
> > *wait = 0;
> > return 1;
> > @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> > return 0;
> > }
> >
> > -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
> > +/*
> > + * Returns whether one can dispatch a bio or not. Also returns approx number
> > + * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> > + */
> > +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg,
> > + struct bio *bio, unsigned long *wait)
> > {
> > bool rw = bio_data_dir(bio);
> > - bool sync = bio->bi_rw & REQ_SYNC;
> >
> > + /*
> > + * Currently whole state machine of group depends on first bio queued
> > + * in the group bio list. So one should not be calling this function
> > + * with a different bio if there are other bios queued.
> > + */
> > + BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> > +
> > + return tg_may_dispatch(td, tg, rw, bio->bi_size, wait);
> > +}
> > +
> > +static void
> > +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size)
> > +{
> > /* Charge the bio to the group */
> > - tg->bytes_disp[rw] += bio->bi_size;
> > + tg->bytes_disp[rw] += size;
> > tg->io_disp[rw]++;
> >
> > /*
> > * TODO: This will take blkg->stats_lock. Figure out a way
> > * to avoid this cost.
> > */
> > - blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
> > + blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync);
> > }
> >
> > static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
> > @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> > struct bio *bio;
> >
> > if ((bio = bio_list_peek(&tg->bio_lists[READ])))
> > - tg_may_dispatch(td, tg, bio, &read_wait);
> > + tg_may_dispatch_bio(td, tg, bio, &read_wait);
> >
> > if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
> > - tg_may_dispatch(td, tg, bio, &write_wait);
> > + tg_may_dispatch_bio(td, tg, bio, &write_wait);
> >
> > min_wait = min(read_wait, write_wait);
> > disptime = jiffies + min_wait;
> > @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
> > BUG_ON(td->nr_queued[rw] <= 0);
> > td->nr_queued[rw]--;
> >
> > - throtl_charge_bio(tg, bio);
> > + throtl_charge_io(tg, bio_data_dir(bio),
> > + bio->bi_rw & REQ_SYNC, bio->bi_size);
> > bio_list_add(bl, bio);
> > bio->bi_rw |= REQ_THROTTLED;
> >
> > @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> >
> > /* Try to dispatch 75% READS and 25% WRITES */
> >
> > - while ((bio = bio_list_peek(&tg->bio_lists[READ]))
> > - && tg_may_dispatch(td, tg, bio, NULL)) {
> > + while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
> > + tg_may_dispatch_bio(td, tg, bio, NULL)) {
> >
> > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> > nr_reads++;
> > @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> > break;
> > }
> >
> > - while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
> > - && tg_may_dispatch(td, tg, bio, NULL)) {
> > + while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
> > + tg_may_dispatch_bio(td, tg, bio, NULL)) {
> >
> > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> > nr_writes++;
> > @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> > bio->bi_rw &= ~REQ_THROTTLED;
> > return 0;
> > }
> > + /* Async writes are ratelimited in blk_throtl_async() */
> > + if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT))
> > + return 0;
> >
> > spin_lock_irq(q->queue_lock);
> > tg = throtl_get_tg(td);
> > @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> > }
> >
> > /* Bio is with-in rate limit of group */
> > - if (tg_may_dispatch(td, tg, bio, NULL)) {
> > - throtl_charge_bio(tg, bio);
> > + if (tg_may_dispatch_bio(td, tg, bio, NULL)) {
> > + throtl_charge_io(tg, bio_data_dir(bio),
> > + bio->bi_rw & REQ_SYNC, bio->bi_size);
> > goto out;
> > }
> >
> > @@ -1044,6 +1051,35 @@ out:
> > return 0;
> > }
> >
> > +/*
> > + * Enforce throttling on async i/o writes
> > + */
> > +int blk_throtl_async(struct request_queue *q, size_t size)
> > +{
> > + struct throtl_data *td = q->td;
> > + struct throtl_grp *tg;
> > + bool rw = 1;
> > + unsigned long wait = 0;
> > +
> > + spin_lock_irq(q->queue_lock);
> > + tg = throtl_get_tg(td);
> > + if (tg_may_dispatch(td, tg, rw, size, &wait))
> > + throtl_charge_io(tg, rw, false, size);
> > + else
> > + throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu"
> > + " iodisp=%u iops=%u queued=%d/%d",
> > + rw == READ ? 'R' : 'W',
> > + tg->bytes_disp[rw], size, tg->bps[rw],
> > + tg->io_disp[rw], tg->iops[rw],
> > + tg->nr_queued[READ], tg->nr_queued[WRITE]);
> > + spin_unlock_irq(q->queue_lock);
> > +
> > + if (wait >= throtl_slice)
> > + schedule_timeout_killable(wait);
> > +
> > + return 0;
> > +}
>
> I think above is not correct.
>
> We have this notion of stretching/renewing the throtl_slice if bio
> is big and can not be dispatched in one slice and one bio has been
> dispatched, we trim the slice.
>
> So first of all, above code does not take care of other IO going on
> in same group. We might be extending a slice for a bigger queued bio
> say 1MB, and suddenly a write happens saying oh, i can dispatch
> and startving dispatch of that bio.
>
> Similiarly, if there are more than 1 tasks in a group doing write, they
> all will wait for certain time and after that time start dispatching.
> I think we might have to have a wait queue per group and put async
> tasks there and start the time slice on behalf of the task at the
> front of the queue and wake it up once that task meets its quota.
>
> Keeping track of for which bio the current slice is operating, we
> reduce the number of wakeups for throtl work. A work is woken up
> only when bio is ready to be dispatched. So if a bio is queued
> that means a slice of that direction is already on, any new IO
> in that group is simply queued instead of trying to check again
> whether we can dispatch it or not.

OK, I understand. I'll try to apply this logic in the next version of
the patch set.

Thanks for the clarification and the review.

-Andrea

2011-03-02 21:48:36

by Vivek Goyal

Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> > > [...]
> > >
> > > Probably an even better approach would be to introduce the tracking of the
> > > dirty page ownership to properly account the cost of each IO operation at the
> > > block layer and apply the throttling of async writes in memory only when IO
> > > limits are exceeded.
> >
> > Andrea, I am curious to know more about this third option. Can you give
> > more details about accounting at the block layer but throttling in
> > memory? Say a process starts IO: it will still be within the throttle
> > limits at the block layer (because no writeback has started yet), so the
> > process will write a bunch of pages into the page cache. By the time the
> > throttle limits are crossed at the block layer, we already have lots of
> > dirty data in the page cache, and throttling the process at that point
> > is already too late?
>
> Charging the cost of each IO operation at the block layer would allow
> tasks to write in memory at the maximum speed. Instead, with the 3rd
> approach, tasks are forced to write in memory at the rate defined by the
> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device).
>
> Once we have the per-cgroup dirty memory accounting and limiting
> feature, with this approach each cgroup could write to its dirty memory
> quota at the maximum rate.

Ok, so this is option 3 which you have already implemented in this
patchset.

I guess then I am confused with option 2. Can you elaborate a little
more there.

>
> BTW, another thing that we probably need is that any cgroup should be
> forced to write back its own inodes when the limit defined by dirty_ratio
> is exceeded. I mean, avoid selecting inodes from the list of dirty
> inodes in a FIFO way, and instead provide a better logic to "assign" the
> ownership of each inode to a cgroup (maybe the one that has generated
> most of the dirty pages in the inode) and use, for example, a list of
> dirty inodes per cgroup or something similar.
>
> In this way we should really be able to provide a good quality of
> service in most cases, IMHO.
>
> I also plan to write another patch set to implement this logic.

Yes, triggering writeout of a cgroup's inodes once it crosses its dirty
ratio is desirable. A patch like that can go on top of Greg's per-cgroup
dirty ratio patches.

Thanks
Vivek

2011-03-02 22:00:36

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Wed, Mar 2, 2011 at 5:28 AM, Andrea Righi <[email protected]> wrote:
> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
>> > [...]
>> >
>> > Probably an even better approach would be to introduce the tracking of the
>> > dirty page ownership to properly account the cost of each IO operation at the
>> > block layer and apply the throttling of async writes in memory only when IO
>> > limits are exceeded.
>>
>> Andrea, I am curious to know more about this third option. Can you give
>> more details about accounting at the block layer but throttling in
>> memory? Say a process starts IO: it will still be within the throttle
>> limits at the block layer (because no writeback has started yet), so the
>> process will write a bunch of pages into the page cache. By the time the
>> throttle limits are crossed at the block layer, we already have lots of
>> dirty data in the page cache, and throttling the process at that point
>> is already too late?
>
> Charging the cost of each IO operation at the block layer would allow
> tasks to write in memory at the maximum speed. Instead, with the 3rd
> approach, tasks are forced to write in memory at the rate defined by the
> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device).
>
> Once we have the per-cgroup dirty memory accounting and limiting
> feature, with this approach each cgroup could write to its dirty memory
> quota at the maximum rate.
>
> BTW, another thing that we probably need is that any cgroup should be
> forced to write back its own inodes when the limit defined by dirty_ratio
> is exceeded. I mean, avoid selecting inodes from the list of dirty
> inodes in a FIFO way, and instead provide a better logic to "assign" the
> ownership of each inode to a cgroup (maybe the one that has generated
> most of the dirty pages in the inode) and use, for example, a list of
> dirty inodes per cgroup or something similar.
>
> In this way we should really be able to provide a good quality of
> service in most cases, IMHO.
>
> I also plan to write another patch set to implement this logic.

I have also been thinking about this problem. I had assumed that the
per-bdi inode list would be retained, but a memcg_inode filter
function would be used to determine which dirty inodes contribute any
dirty pages to the over-dirty-limit memcg.

To efficiently determine if an inode contributes dirty pages to a
memcg, I was thinking about implementing a shallow memcg identifier
cache within the inode's address space. Here is a summary of what I
am thinking about:

This approach adds an N deep cache of dirtying cgroups into struct
address_space. N would likely be 1. Each address_space would have an
array of N as_memcg identifiers and an as_memcg_overflow bit. The
memcg identifier may be a memcg pointer or a small integer (maybe
css_id). Unset cache entries would be NULL if using pointers, 0 if
using css_id (an illegal css_id value). The memcg identifier would
not be dereferenced - it would only be used as a fuzzy comparison
value. So a reference to the memcg is not needed. Modifications to
as_memcg[] and as_memcg_overflow are protected by the mapping
tree_lock.

The kernel integration with this approach involves:

1. When an address_space is created in inode_init_always():
1.1 Set as_memcg[*]=NULL and as_memcg_overflow=0.

2. When a mapping page is dirtied in __set_page_dirty():
2.1 The tree lock is already held
2.2 If any of as_memcg[*] == current_memcg, then do nothing. The
dirtying memcg is already associated with the address space.
2.3 If any of as_memcg[x] is NULL, then set as_memcg[x] =
current_memcg. This associates the dirtying memcg with the address
space.
2.4 Otherwise, set as_memcg_overflow=1 indicating that more than N
memcg have dirtied the address space since it was last cleaned.

3. When an inode is completely clean - this is a case within
writeback_single_inode():
3.1 Grab the tree_lock
3.2 Set as_memcg[*]=NULL and as_memcg_overflow=0. These are the same
steps as inode creation.
3.3 Release the tree_lock

4. When per-cgroup inode writeback is performed, walk the list of
dirty inodes, matching only the inodes that contribute dirty pages to
cmp_memcg. For each dirty inode:
4.1 Tree locking is not needed here. Races are allowed.
4.2 If as_memcg_overflow is set, then schedule inode for writeback.
Assume it contains dirty pages for cmp_memcg. This is the case where
there are more than N memcg that have dirtied the address space since
it was completely clean.
4.3 If any of as_memcg[*] == cmp_memcg, then schedule the inode for writeback.
4.4 Otherwise, do not schedule the inode for writeback because the
current memcg has not contributed dirty pages to the inode.
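
A minimal sketch of the cache and of the hooks above (illustrative
only: the helper names, the use of css_id as the identifier and N=1 are
just what I would start with, none of this is existing code):

#define AS_MEMCG_CACHE_SIZE 1    /* "N" */

struct as_memcg_cache {
    unsigned short as_memcg[AS_MEMCG_CACHE_SIZE]; /* css_id of dirtiers, 0 == unset */
    bool as_memcg_overflow;                       /* more than N memcg dirtied us */
};

/* 1. and 3. - on inode creation and when the inode becomes completely clean */
static void as_memcg_reset(struct as_memcg_cache *c)
{
    int i;

    for (i = 0; i < AS_MEMCG_CACHE_SIZE; i++)
        c->as_memcg[i] = 0;
    c->as_memcg_overflow = false;
}

/* 2. - called with mapping->tree_lock held when a page of the mapping is dirtied */
static void as_memcg_note_dirtier(struct as_memcg_cache *c, unsigned short cur_id)
{
    int i;

    for (i = 0; i < AS_MEMCG_CACHE_SIZE; i++) {
        if (c->as_memcg[i] == cur_id)
            return;                  /* 2.2 dirtier already cached */
        if (c->as_memcg[i] == 0) {
            c->as_memcg[i] = cur_id; /* 2.3 remember the dirtier */
            return;
        }
    }
    c->as_memcg_overflow = true;     /* 2.4 too many dirtiers since last clean */
}

/* 4. - lockless filter used by per-cgroup writeback, races are allowed */
static bool as_memcg_contributes(const struct as_memcg_cache *c, unsigned short cmp_id)
{
    int i;

    if (c->as_memcg_overflow)
        return true;                 /* 4.2 assume it has our dirty pages */
    for (i = 0; i < AS_MEMCG_CACHE_SIZE; i++)
        if (c->as_memcg[i] == cmp_id)
            return true;             /* 4.3 */
    return false;                    /* 4.4 */
}

In the real implementation as_memcg[] and as_memcg_overflow would of
course be fields of struct address_space rather than a separate struct.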

Weaknesses:

1. If the N deep cache is not sufficiently large, then inodes will be
written back more often than is ideal. Example: if more than N cgroups
dirty an inode, then as_memcg_overflow is set and remains set until
the inode is fully clean. Until the inode is fully cleaned, any
per-memcg writeback will write back the inode. This will unfairly
affect writeback of cgroups that have never touched or dirtied the
overflowing inode. If this is a significant problem, then increasing
the value of N will solve it. Theoretically each inode, or each
filesystem, could have a different value of N, but the preferred
implementation has a compile-time constant value of N.

2. If any cgroup continually dirties an inode, then no other cgroups
that have ever dirtied the inode will ever be expired from the N deep
as_memcg[] cache because the "when an inode is completely clean" logic
is never run. If the past set of dirtying cgroups is greater than N,
then every per-cgroup writeback will write back the inode. If this is a
problem, then a dirty_page counter could be added to each as_memcg[]
record to keep track of the number of dirty pages each memcg
contributes to an inode. When the per-memcg count reaches zero then
the particular as_memcg entry could be removed. This would allow for
memcg expiration from as_memcg[] before the inode becomes completely
clean.

3. There is no way to cheaply identify the set of inodes contributing
dirty pages to a memcg that has exceeded its dirty memory limit. When
a memcg blows its limit, which list of dirty inodes is traversed? The
bdi that was last being written to, which in the worst case will have
very little dirty data from the current memcg. If this is an issue,
we could walk all bdi looking for inodes that contribute dirty pages
to the over-limit memcg.

>> > To summarize, we can identify three possible solutions to properly throttle the
>> > buffered writes:
>> >
>> > 1) account & throttle everything at block IO layer (bad for "priority
>> >    inversion" problems, needs page tracking for blkio)
>> >
>> > 2) account at block IO layer and throttle in memory (needs page tracking for
>> >    blkio)
>> >
>> > 3) account & throttle in memory (affected by IO spikes, depending on
>> >    dirty_ratio / dirty_background_ratio settings)
>> >
>> > For now we start with the solution 3) that seems to be the simplest way to
>> > proceed.
>>
>> Yes, IO spikes are the weakness of this 3rd solution. But it should be
>> simple too. Also, as you said, the problem can be reduced to some extent
>> by lowering dirty_ratio and the background dirty ratio, but that
>> will have other trade-offs, I guess.
>
> Agreed.
>
> -Andrea
>

2011-03-06 15:52:53

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
> > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> > > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> > > > [...]
> > > >
> > > > Probably an even better approach would be to introduce the tracking of the
> > > > dirty page ownership to properly account the cost of each IO operation at the
> > > > block layer and apply the throttling of async writes in memory only when IO
> > > > limits are exceeded.
> > >
> > > Andrea, I am curious to know more about this third option. Can you give
> > > more details about accounting at the block layer but throttling in
> > > memory? Say a process starts IO: it will still be within the throttle
> > > limits at the block layer (because no writeback has started yet), so the
> > > process will write a bunch of pages into the page cache. By the time the
> > > throttle limits are crossed at the block layer, we already have lots of
> > > dirty data in the page cache, and throttling the process at that point
> > > is already too late?
> >
> > Charging the cost of each IO operation at the block layer would allow
> > tasks to write in memory at the maximum speed. Instead, with the 3rd
> > approach, tasks are forced to write in memory at the rate defined by the
> > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device).
> >
> > Once we have the per-cgroup dirty memory accounting and limiting
> > feature, with this approach each cgroup could write to its dirty memory
> > quota at the maximum rate.
>
> Ok, so this is option 3 which you have already implemented in this
> patchset.
>
> I guess then I am confused with option 2. Can you elaborate a little
> more there.

With option 3, we can just limit the rate at which dirty pages are
generated in memory. This can be done by introducing the files
blkio.throttle.async.write_bps/iops_device.

At the moment, in blk_throtl_bio() we charge the dispatched bytes/iops
_and_ check whether the bio can be dispatched; these two distinct
operations are currently done by the same function.

With option 2, I'm proposing to split these two operations and place
throtl_charge_io() at the block layer in __generic_make_request() and an
equivalent of tg_may_dispatch_bio() (maybe a better name would be
blk_is_throttled()) at the page cache layer, in
balance_dirty_pages_ratelimited_nr().

A prototype for blk_is_throttled() could be the following:

bool blk_is_throttled(void);

This means that in balance_dirty_pages_ratelimited_nr() we won't charge any
bytes/iops to the cgroup, but we'll just check if the limits are
exceeded, and stop the task in that case, so that no more dirty pages can be
generated by this cgroup.

Instead, at the block layer WRITEs will always be dispatched in
blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
throtl_charge_io() will charge the cost of the IO operation to the
right cgroup.

To summarize:

__generic_make_request():
    blk_throtl_bio(q, &bio);

balance_dirty_pages_ratelimited_nr():
    if (blk_is_throttled())
        // add the current task into a per-group wait queue and
        // wake up once this cgroup meets its quota
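
A rough sketch of how the two halves could look (untested;
blk_is_throttled() and blk_throtl_wait() are just the names proposed
above and do not exist yet):

/* block layer: WRITEs are never queued here, only charged to the right cgroup */
static void blk_throtl_charge_write(struct request_queue *q, struct bio *bio)
{
    struct throtl_data *td = q->td;
    struct throtl_grp *tg;

    spin_lock_irq(q->queue_lock);
    tg = throtl_get_tg(td);
    throtl_charge_io(tg, bio_data_dir(bio), bio->bi_rw & REQ_SYNC,
                     bio->bi_size);
    spin_unlock_irq(q->queue_lock);
}

/*
 * page cache layer: stop the dirtier when its cgroup is over the limit,
 * called from balance_dirty_pages_ratelimited_nr()
 */
static void blk_throtl_dirtier_check(void)
{
    if (blk_is_throttled())
        /* put current on the per-group wait queue until under the limit */
        blk_throtl_wait();
}

(How blk_is_throttled() finds the right queue and group from the
balance_dirty_pages() context is left out here, probably via the
inode's backing_dev_info.)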

What do you think?

Thanks,
-Andrea

2011-03-07 07:30:17

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

Andrea Righi wrote:
> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
>>>>> [...]
>> Ok, so this is option 3 which you have already implemented in this
>> patchset.
>>
>> I guess then I am confused with option 2. Can you elaborate a little
>> more there.
>
> With option 3, we can just limit the rate at which dirty pages are
> generated in memory. And this can be done introducing the files
> blkio.throttle.async.write_bps/iops_device.
>
> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
> _and_ we check if the bio can be dispatched. These two distinct
> operations are now done by the same function.
>
> With option 2, I'm proposing to split these two operations and place
> throtl_charge_io() at the block layer in __generic_make_request() and an
> equivalent of tg_may_dispatch_bio() (maybe a better name would be
> blk_is_throttled()) at the page cache layer, in
> balance_dirty_pages_ratelimited_nr():
>
> A prototype for blk_is_throttled() could be the following:
>
> bool blk_is_throttled(void);
>
> This means in balance_dirty_pages_ratelimited_nr() we won't charge any
> bytes/iops to the cgroup, but we'll just check if the limits are
> exceeded. And stop it in that case, so that no more dirty pages can be
> generated by this cgroup.
>
> Instead at the block layer WRITEs will be always dispatched in
> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
> the throtl_charge_io() would charge the cost of the IO operation to the
> right cgroup.
>
> To summarize:
>
> __generic_make_request():
> blk_throtl_bio(q, &bio);
>
> balance_dirty_pages_ratelimited_nr():
> if (blk_is_throttled())
> // add the current task into a per-group wait queue and
> // wake up once this cgroup meets its quota
>
> What do you think?

Hi Andrea,

This means when you throttle writes, the reads issued by this task are also throttled?

Thanks,
Gui

>
> Thanks,
> -Andrea
>

2011-03-07 11:34:16

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
> >> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
> >>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> >>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> >>>>> [...]
> >> Ok, so this is option 3 which you have already implemented in this
> >> patchset.
> >>
> >> I guess then I am confused with option 2. Can you elaborate a little
> >> more there.
> >
> > With option 3, we can just limit the rate at which dirty pages are
> > generated in memory. And this can be done introducing the files
> > blkio.throttle.async.write_bps/iops_device.
> >
> > At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
> > _and_ we check if the bio can be dispatched. These two distinct
> > operations are now done by the same function.
> >
> > With option 2, I'm proposing to split these two operations and place
> > throtl_charge_io() at the block layer in __generic_make_request() and an
> > equivalent of tg_may_dispatch_bio() (maybe a better name would be
> > blk_is_throttled()) at the page cache layer, in
> > balance_dirty_pages_ratelimited_nr():
> >
> > A prototype for blk_is_throttled() could be the following:
> >
> > bool blk_is_throttled(void);
> >
> > This means in balance_dirty_pages_ratelimited_nr() we won't charge any
> > bytes/iops to the cgroup, but we'll just check if the limits are
> > exceeded. And stop it in that case, so that no more dirty pages can be
> > generated by this cgroup.
> >
> > Instead at the block layer WRITEs will be always dispatched in
> > blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
> > the throtl_charge_io() would charge the cost of the IO operation to the
> > right cgroup.
> >
> > To summarize:
> >
> > __generic_make_request():
> > blk_throtl_bio(q, &bio);
> >
> > balance_dirty_pages_ratelimited_nr():
> > if (blk_is_throttled())
> > // add the current task into a per-group wait queue and
> > // wake up once this cgroup meets its quota
> >
> > What do you think?
>
> Hi Andrea,
>
> This means when you throttle writes, the reads issued by this task are also throttled?
>
> Thanks,
> Gui

Exactly, we're treating the throttling of READs and WRITEs in two
different ways.

READs will be always throttled synchronously in the
__generic_make_request() -> blk_throtl_bio() path.

-Andrea

2011-03-07 11:43:56

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

Andrea Righi wrote:
> On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote:
>> Andrea Righi wrote:
>>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
>>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
>>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
>>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
>>>>>>> [...]
>>>> Ok, so this is option 3 which you have already implemented in this
>>>> patchset.
>>>>
>>>> I guess then I am confused with option 2. Can you elaborate a little
>>>> more there.
>>> With option 3, we can just limit the rate at which dirty pages are
>>> generated in memory. And this can be done introducing the files
>>> blkio.throttle.async.write_bps/iops_device.
>>>
>>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
>>> _and_ we check if the bio can be dispatched. These two distinct
>>> operations are now done by the same function.
>>>
>>> With option 2, I'm proposing to split these two operations and place
>>> throtl_charge_io() at the block layer in __generic_make_request() and an
>>> equivalent of tg_may_dispatch_bio() (maybe a better name would be
>>> blk_is_throttled()) at the page cache layer, in
>>> balance_dirty_pages_ratelimited_nr():
>>>
>>> A prototype for blk_is_throttled() could be the following:
>>>
>>> bool blk_is_throttled(void);
>>>
>>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any
>>> bytes/iops to the cgroup, but we'll just check if the limits are
>>> exceeded. And stop it in that case, so that no more dirty pages can be
>>> generated by this cgroup.
>>>
>>> Instead at the block layer WRITEs will be always dispatched in
>>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
>>> the throtl_charge_io() would charge the cost of the IO operation to the
>>> right cgroup.
>>>
>>> To summarize:
>>>
>>> __generic_make_request():
>>> blk_throtl_bio(q, &bio);
>>>
>>> balance_dirty_pages_ratelimited_nr():
>>> if (blk_is_throttled())
>>> // add the current task into a per-group wait queue and
>>> // wake up once this cgroup meets its quota
>>>
>>> What do you think?
>> Hi Andrea,
>>
>> This means when you throttle writes, the reads issued by this task are also throttled?
>>
>> Thanks,
>> Gui
>
> Exactly, we're treating the throttling of READs and WRITEs in two
> different ways.
>
> READs will be always throttled synchronously in the
> __generic_make_request() -> blk_throtl_bio() path.

Andrea,

I mean, if the task exceeds the write limit, the task will be put to sleep, right?
So it doesn't get a chance to issue read requests.

Gui

>
> -Andrea

--
Regards
Gui Jianfeng

2011-03-07 11:59:22

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Mon, Mar 07, 2011 at 07:44:49PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote:
> >> Andrea Righi wrote:
> >>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
> >>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
> >>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> >>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> >>>>>>> [...]
> >>>> Ok, so this is option 3 which you have already implemented in this
> >>>> patchset.
> >>>>
> >>>> I guess then I am confused with option 2. Can you elaborate a little
> >>>> more there.
> >>> With option 3, we can just limit the rate at which dirty pages are
> >>> generated in memory. And this can be done introducing the files
> >>> blkio.throttle.async.write_bps/iops_device.
> >>>
> >>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
> >>> _and_ we check if the bio can be dispatched. These two distinct
> >>> operations are now done by the same function.
> >>>
> >>> With option 2, I'm proposing to split these two operations and place
> >>> throtl_charge_io() at the block layer in __generic_make_request() and an
> >>> equivalent of tg_may_dispatch_bio() (maybe a better name would be
> >>> blk_is_throttled()) at the page cache layer, in
> >>> balance_dirty_pages_ratelimited_nr():
> >>>
> >>> A prototype for blk_is_throttled() could be the following:
> >>>
> >>> bool blk_is_throttled(void);
> >>>
> >>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any
> >>> bytes/iops to the cgroup, but we'll just check if the limits are
> >>> exceeded. And stop it in that case, so that no more dirty pages can be
> >>> generated by this cgroup.
> >>>
> >>> Instead at the block layer WRITEs will be always dispatched in
> >>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
> >>> the throtl_charge_io() would charge the cost of the IO operation to the
> >>> right cgroup.
> >>>
> >>> To summarize:
> >>>
> >>> __generic_make_request():
> >>> blk_throtl_bio(q, &bio);
> >>>
> >>> balance_dirty_pages_ratelimited_nr():
> >>> if (blk_is_throttled())
> >>> // add the current task into a per-group wait queue and
> >>> // wake up once this cgroup meets its quota
> >>>
> >>> What do you think?
> >> Hi Andrea,
> >>
> >> This means when you throttle writes, the reads issued by this task are also throttled?
> >>
> >> Thanks,
> >> Gui
> >
> > Exactly, we're treating the throttling of READs and WRITEs in two
> > different ways.
> >
> > READs will be always throttled synchronously in the
> > __generic_make_request() -> blk_throtl_bio() path.
>
> Andrea,
>
> I mean, if the task exceeds the write limit, the task will be put to sleep, right?
> So it doesn't get a chance to issue read requests.

Oh yes, you're right. This could be a problem. OTOH I wouldn't like to
introduce an additional queue to submit the write requests in the page
cache and dispatch them asynchronously.

mmh... ideas?

-Andrea

2011-03-07 12:14:30

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

Andrea Righi wrote:
> On Mon, Mar 07, 2011 at 07:44:49PM +0800, Gui Jianfeng wrote:
>> Andrea Righi wrote:
>>> On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote:
>>>> Andrea Righi wrote:
>>>>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
>>>>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
>>>>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
>>>>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
>>>>>>>>> [...]
>>>>>> Ok, so this is option 3 which you have already implemented in this
>>>>>> patchset.
>>>>>>
>>>>>> I guess then I am confused with option 2. Can you elaborate a little
>>>>>> more there.
>>>>> With option 3, we can just limit the rate at which dirty pages are
>>>>> generated in memory. And this can be done introducing the files
>>>>> blkio.throttle.async.write_bps/iops_device.
>>>>>
>>>>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
>>>>> _and_ we check if the bio can be dispatched. These two distinct
>>>>> operations are now done by the same function.
>>>>>
>>>>> With option 2, I'm proposing to split these two operations and place
>>>>> throtl_charge_io() at the block layer in __generic_make_request() and an
>>>>> equivalent of tg_may_dispatch_bio() (maybe a better name would be
>>>>> blk_is_throttled()) at the page cache layer, in
>>>>> balance_dirty_pages_ratelimited_nr():
>>>>>
>>>>> A prototype for blk_is_throttled() could be the following:
>>>>>
>>>>> bool blk_is_throttled(void);
>>>>>
>>>>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any
>>>>> bytes/iops to the cgroup, but we'll just check if the limits are
>>>>> exceeded. And stop it in that case, so that no more dirty pages can be
>>>>> generated by this cgroup.
>>>>>
>>>>> Instead at the block layer WRITEs will be always dispatched in
>>>>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
>>>>> the throtl_charge_io() would charge the cost of the IO operation to the
>>>>> right cgroup.
>>>>>
>>>>> To summarize:
>>>>>
>>>>> __generic_make_request():
>>>>> blk_throtl_bio(q, &bio);
>>>>>
>>>>> balance_dirty_pages_ratelimited_nr():
>>>>> if (blk_is_throttled())
>>>>> // add the current task into a per-group wait queue and
>>>>> // wake up once this cgroup meets its quota
>>>>>
>>>>> What do you think?
>>>> Hi Andrea,
>>>>
>>>> This means when you throttle writes, the reads issued by this task are also throttled?
>>>>
>>>> Thanks,
>>>> Gui
>>> Exactly, we're treating the throttling of READs and WRITEs in two
>>> different ways.
>>>
>>> READs will be always throttled synchronously in the
>>> __generic_make_request() -> blk_throtl_bio() path.
>> Andrea,
>>
>> I mean, if the task exceeds the write limit, the task will be put to sleep, right?
>> So it doesn't get a chance to issue read requests.
>
> Oh yes, you're right. This could be a problem. OTOH I wouldn't like to
> introduce an additional queue to submit the write requests in the page
> cache and dispatch them asynchronously.
>
> mmh... ideas?
>

hmm, dispatching asynchronously will make things more complicated.
But writes blocking reads goes against the idea of the page cache.
I'm not sure how to solve this...

Gui

> -Andrea
>

2011-03-07 15:23:21

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/3] blk-throttle: async write throttling

On Sun, Mar 06, 2011 at 04:52:47PM +0100, Andrea Righi wrote:
> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote:
> > On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote:
> > > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote:
> > > > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote:
> > > > > [...]
> > > > >
> > > > > Probably an even better approach would be to introduce the tracking of the
> > > > > dirty page ownership to properly account the cost of each IO operation at the
> > > > > block layer and apply the throttling of async writes in memory only when IO
> > > > > limits are exceeded.
> > > >
> > > > Andrea, I am curious to know more about the third option. Can you give more
> > > > details about accounting at the block layer but throttling in memory. So say
> > > > a process starts IO, then it will still be within the throttle limits at the
> > > > block layer (because no writeback has started), and the process will write a
> > > > bunch of pages in the cache. By the time the throttle limits are crossed at
> > > > the block layer, we already have lots of dirty data in the page cache and
> > > > throttling the process now is already too late?
> > >
> > > Charging the cost of each IO operation at the block layer would allow
> > > tasks to write in memory at the maximum speed. Instead, with the 3rd
> > > approach, tasks are forced to write in memory at the rate defined by the
> > > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device).
> > >
> > > Once we have the per-cgroup dirty memory accounting and limiting
> > > feature, with this approach each cgroup could write to its dirty memory
> > > quota at the maximum rate.
> >
> > Ok, so this is option 3 which you have already implemented in this
> > patchset.
> >
> > I guess then I am confused with option 2. Can you elaborate a little
> > more there.
>
> With option 3, we can just limit the rate at which dirty pages are
> generated in memory. And this can be done by introducing the files
> blkio.throttle.async.write_bps/iops_device.
>
> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops
> _and_ we check if the bio can be dispatched. These two distinct
> operations are now done by the same function.
>
> With option 2, I'm proposing to split these two operations and place
> throtl_charge_io() at the block layer in __generic_make_request() and an
> equivalent of tg_may_dispatch_bio() (maybe a better name would be
> blk_is_throttled()) at the page cache layer, in
> balance_dirty_pages_ratelimited_nr():
>
> A prototype for blk_is_throttled() could be the following:
>
> bool blk_is_throttled(void);
>
> This means in balance_dirty_pages_ratelimited_nr() we won't charge any
> bytes/iops to the cgroup, but we'll just check if the limits are
> exceeded. And stop it in that case, so that no more dirty pages can be
> generated by this cgroup.
>
> Instead, at the block layer WRITEs will always be dispatched in
> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but
> the throtl_charge_io() would charge the cost of the IO operation to the
> right cgroup.
>
> To summarize:
>
> __generic_make_request():
> blk_throtl_bio(q, &bio);
>
> balance_dirty_pages_ratelimited_nr():
> if (blk_is_throttled())
> // add the current task into a per-group wait queue and
> // wake up once this cgroup meets its quota
>
> What do you think?
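
To make the quoted proposal a bit more concrete, here is a minimal sketch,
in kernel-style C, of what the option-2 split could look like. It is only
illustrative, not the actual patch: blk_throtl_bio(), throtl_charge_io()
and blk_is_throttled() are the names used in this thread, while
throtl_tg_from_current() and throtl_wait_for_quota() are hypothetical
helpers invented for the example.

/* block/blk-throttle.c: WRITEs are always dispatched, only accounted */
int blk_throtl_bio(struct request_queue *q, struct bio **biop)
{
        /* hypothetical helper: throttle group of the submitting task */
        struct throtl_grp *tg = throtl_tg_from_current(q);

        /* charge bytes/iops to the right cgroup, never delay the bio */
        throtl_charge_io(tg, *biop);

        return 0;       /* the bio proceeds immediately */
}

/* mm/page-writeback.c: stop the dirtier once its group exceeds the limit */
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                        unsigned long nr_pages_dirtied)
{
        if (blk_is_throttled())
                /* hypothetical helper: sleep on a per-group wait queue
                 * until the group's quota is replenished */
                throtl_wait_for_quota(current);

        /* existing dirty-ratio balancing continues from here */
}

The point of the split is that a bio which has already reached the block
layer is never delayed; only the task that is generating dirty pages can
be put to sleep.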

Ok, so an IO is charged only when it hits the block layer. Because a
group is not charged until some IO actually happens, the group will not
be throttled while it is only dirtying pages in the cache.

- To me, this still does not solve the problem of IO spikes. When a process
starts writing, the IO goes into the cache, the group is not blocked, and the
process can dump lots of WRITES into the cache; by the time writeout starts
and we throttle the submitter, it's too late. All the WRITES in the cache
will show up at the block device in bulk and impact read latencies.

Having said that, I guess that will be the case only initially, and then the
process will be throttled. Still, every time a process writes, it can dump a
bunch of IO into the cache and remain unthrottled for some time.

So to me it might work well for the cases where long writeouts happen
continuously. But if some process dumps a bunch of MBs, then goes on to do
other things for a few seconds, and then comes back again to dump another
bunch of MBs, this scheme will not be effective.

- Will it handle the case of fsync? If a bunch of pages are in the cache and
a process calls fsync, then nothing will be throttled; that will also impact
read latencies negatively and we are back to IO spikes?

Will it make sense to try to make throttled writeback almost synchronous?
In other words, say we implement option 3, and once we realize that we
are doing IO in a cgroup where writes are throttled, we wake up the flusher
threads and possibly mark the inodes which we want to be selected for
IO? That way IO to an inode will become more or less synchronous and the
effect of IO spikes will come down.
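
As a purely hypothetical sketch of that idea (wakeup_flusher_threads()
already exists in the writeback code; mark_inode_for_early_writeback() is
invented for the example), the check in balance_dirty_pages_ratelimited_nr()
could do something like:

        if (blk_is_throttled()) {
                /* hypothetical: tag the inode so the flusher threads
                 * select it for IO as soon as they run */
                mark_inode_for_early_writeback(mapping->host);

                /* kick the per-bdi flusher threads so the throttled
                 * cgroup's dirty pages go out almost synchronously
                 * instead of piling up in the cache */
                wakeup_flusher_threads(nr_pages_dirtied);
        }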

Well, thinking more about it, I guess something similar could be done by
implementing a per-cgroup dirty ratio and setting dirty_ratio to zero or a
very low value. Then a user could co-mount the memory and IO controllers
and reduce the impact of IO spikes from the WRITES of a cgroup by
controlling the dirty ratio of that cgroup.

Thanks
Vivek