Date: Wed, 2 Mar 2011 14:31:48 +0100
From: Andrea Righi
To: Vivek Goyal
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
	Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
	Jens Axboe, Jonathan Corbet, Andrew Morton,
	containers@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io
Message-ID: <20110302133148.GC2061@linux.develer.com>
In-Reply-To: <20110301160627.GA2539@redhat.com>
References: <1298888105-3778-1-git-send-email-arighi@develer.com>
	<1298888105-3778-3-git-send-email-arighi@develer.com>
	<20110301160627.GA2539@redhat.com>

On Tue, Mar 01, 2011 at 11:06:27AM -0500, Vivek Goyal wrote:
> On Mon, Feb 28, 2011 at 11:15:04AM +0100, Andrea Righi wrote:
> > Add blk_throtl_async() to throttle async io writes.
> >
> > The idea is to stop applications before they can generate too many
> > dirty pages in the page cache. Each time a chunk of pages is written
> > to memory we charge the current task / cgroup for the size of the
> > pages dirtied.
> >
> > If the blkio limits are exceeded we also enforce throttling at this
> > layer, instead of throttling writes at the block IO layer, where it
> > is not possible to associate the pages with the process that dirtied
> > them.
> >
> > In this case throttling can simply be enforced via
> > schedule_timeout_killable().
> >
> > Also change some internal blkio functions to make them more generic
> > and allow callers to get the size and type of the IO operation
> > without extracting this information directly from a struct bio. This
> > way the same functions can also be used in contexts where a struct
> > bio has not been created yet (e.g., writes in the page cache).
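
To make the intended usage above concrete, the write path would call the
new hook before the dirtied pages ever reach the block layer, roughly as
in the sketch below. The call site, the helper name, and the way the
request_queue is obtained here are illustrative assumptions only; the
actual hook into the page cache write path is not part of this patch.

/*
 * Illustrative sketch, not code from this patch: charge 'bytes' of
 * freshly dirtied page cache to the current task's cgroup and, if the
 * group is over its bps/iops limit, sleep via
 * schedule_timeout_killable() before returning.
 */
static void account_dirtied_pages(struct block_device *bdev, size_t bytes)
{
	struct request_queue *q = bdev_get_queue(bdev);

	blk_throtl_async(q, bytes);
}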
> >
> > Signed-off-by: Andrea Righi
> > ---
> >  block/blk-throttle.c   |  106 ++++++++++++++++++++++++++++++++----------------
> >  include/linux/blkdev.h |    6 +++
> >  2 files changed, 77 insertions(+), 35 deletions(-)
> >
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index a89043a..07ef429 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
> >  }
> >
> >  static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> > -		struct bio *bio, unsigned long *wait)
> > +		bool rw, size_t size, unsigned long *wait)
> >  {
> > -	bool rw = bio_data_dir(bio);
> >  	unsigned int io_allowed;
> >  	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> >  	u64 tmp;
> > @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
> >  }
> >
> >  static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> > -		struct bio *bio, unsigned long *wait)
> > +		bool rw, size_t size, unsigned long *wait)
> >  {
> > -	bool rw = bio_data_dir(bio);
> >  	u64 bytes_allowed, extra_bytes, tmp;
> >  	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> >
> > @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> >  	do_div(tmp, HZ);
> >  	bytes_allowed = tmp;
> >
> > -	if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
> > +	if (tg->bytes_disp[rw] + size <= bytes_allowed) {
> >  		if (wait)
> >  			*wait = 0;
> >  		return 1;
> >  	}
> >
> >  	/* Calc approx time to dispatch */
> > -	extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
> > +	extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed;
> >  	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> >
> >  	if (!jiffy_wait)
> > @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
> >  	return 0;
> >  }
> >
> > -/*
> > - * Returns whether one can dispatch a bio or not. Also returns approx number
> > - * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> > - */
> >  static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> > -			struct bio *bio, unsigned long *wait)
> > +			bool rw, size_t size, unsigned long *wait)
> >  {
> > -	bool rw = bio_data_dir(bio);
> >  	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
> >
> > -	/*
> > -	 * Currently whole state machine of group depends on first bio
> > -	 * queued in the group bio list. So one should not be calling
> > -	 * this function with a different bio if there are other bios
> > -	 * queued.
> > -	 */
> > -	BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> > -
> >  	/* If tg->bps = -1, then BW is unlimited */
> >  	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
> >  		if (wait)
> > @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> >  		throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
> >  	}
> >
> > -	if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
> > -	    && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
> > +	if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) &&
> > +	    tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) {
> >  		if (wait)
> >  			*wait = 0;
> >  		return 1;
> > @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
> >  	return 0;
> >  }
> >
> > -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
> > +/*
> > + * Returns whether one can dispatch a bio or not. Also returns approx number
> > + * of jiffies to wait before this bio is with-in IO rate and can be dispatched
> > + */
> > +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg,
> > +			struct bio *bio, unsigned long *wait)
> >  {
> >  	bool rw = bio_data_dir(bio);
> > -	bool sync = bio->bi_rw & REQ_SYNC;
> >
> > +	/*
> > +	 * Currently whole state machine of group depends on first bio queued
> > +	 * in the group bio list. So one should not be calling this function
> > +	 * with a different bio if there are other bios queued.
> > +	 */
> > +	BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
> > +
> > +	return tg_may_dispatch(td, tg, rw, bio->bi_size, wait);
> > +}
> > +
> > +static void
> > +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size)
> > +{
> >  	/* Charge the bio to the group */
> > -	tg->bytes_disp[rw] += bio->bi_size;
> > +	tg->bytes_disp[rw] += size;
> >  	tg->io_disp[rw]++;
> >
> >  	/*
> >  	 * TODO: This will take blkg->stats_lock. Figure out a way
> >  	 * to avoid this cost.
> >  	 */
> > -	blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
> > +	blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync);
> >  }
> >
> >  static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
> > @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
> >  	struct bio *bio;
> >
> >  	if ((bio = bio_list_peek(&tg->bio_lists[READ])))
> > -		tg_may_dispatch(td, tg, bio, &read_wait);
> > +		tg_may_dispatch_bio(td, tg, bio, &read_wait);
> >
> >  	if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
> > -		tg_may_dispatch(td, tg, bio, &write_wait);
> > +		tg_may_dispatch_bio(td, tg, bio, &write_wait);
> >
> >  	min_wait = min(read_wait, write_wait);
> >  	disptime = jiffies + min_wait;
> > @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
> >  	BUG_ON(td->nr_queued[rw] <= 0);
> >  	td->nr_queued[rw]--;
> >
> > -	throtl_charge_bio(tg, bio);
> > +	throtl_charge_io(tg, bio_data_dir(bio),
> > +			bio->bi_rw & REQ_SYNC, bio->bi_size);
> >  	bio_list_add(bl, bio);
> >  	bio->bi_rw |= REQ_THROTTLED;
> >
> > @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> >
> >  	/* Try to dispatch 75% READS and 25% WRITES */
> >
> > -	while ((bio = bio_list_peek(&tg->bio_lists[READ]))
> > -		&& tg_may_dispatch(td, tg, bio, NULL)) {
> > +	while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
> > +		tg_may_dispatch_bio(td, tg, bio, NULL)) {
> >
> >  		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> >  		nr_reads++;
> > @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
> >  			break;
> >  	}
> >
> > -	while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
> > -		&& tg_may_dispatch(td, tg, bio, NULL)) {
> > +	while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
> > +		tg_may_dispatch_bio(td, tg, bio, NULL)) {
> >
> >  		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
> >  		nr_writes++;
> > @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> >  		bio->bi_rw &= ~REQ_THROTTLED;
> >  		return 0;
> >  	}
> > +	/* Async writes are ratelimited in blk_throtl_async() */
> > +	if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT))
> > +		return 0;
> >
> >  	spin_lock_irq(q->queue_lock);
> >  	tg = throtl_get_tg(td);
> > @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
> >  	}
> >
> >  	/* Bio is with-in rate limit of group */
> > -	if (tg_may_dispatch(td, tg, bio, NULL)) {
> > -		throtl_charge_bio(tg, bio);
> > +	if (tg_may_dispatch_bio(td, tg, bio, NULL)) {
> > +		throtl_charge_io(tg, bio_data_dir(bio),
> > +				bio->bi_rw & REQ_SYNC, bio->bi_size);
> >  		goto out;
> >  	}
> >
> > @@ -1044,6 +1051,35 @@ out:
> >  	return 0;
> >  }
> >
> > +/*
> > + * Enforce throttling on async i/o writes
> > + */
> > +int blk_throtl_async(struct request_queue *q, size_t size)
> > +{
> > +	struct throtl_data *td = q->td;
> > +	struct throtl_grp *tg;
> > +	bool rw = 1;
> > +	unsigned long wait = 0;
> > +
> > +	spin_lock_irq(q->queue_lock);
> > +	tg = throtl_get_tg(td);
> > +	if (tg_may_dispatch(td, tg, rw, size, &wait))
> > +		throtl_charge_io(tg, rw, false, size);
> > +	else
> > +		throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu"
> > +			" iodisp=%u iops=%u queued=%d/%d",
> > +			rw == READ ? 'R' : 'W',
> > +			tg->bytes_disp[rw], size, tg->bps[rw],
> > +			tg->io_disp[rw], tg->iops[rw],
> > +			tg->nr_queued[READ], tg->nr_queued[WRITE]);
> > +	spin_unlock_irq(q->queue_lock);
> > +
> > +	if (wait >= throtl_slice)
> > +		schedule_timeout_killable(wait);
> > +
> > +	return 0;
> > +}
>
> I think the above is not correct.
>
> We have this notion of stretching/renewing the throtl_slice if a bio
> is big and cannot be dispatched in one slice, and once a bio has been
> dispatched, we trim the slice.
>
> So first of all, the above code does not take care of other IO going
> on in the same group. We might be extending a slice for a bigger
> queued bio, say 1MB, when suddenly a write comes in, sees that it can
> dispatch, and ends up starving the dispatch of that queued bio.
>
> Similarly, if there is more than one task in a group doing writes,
> they will all wait for a certain time and after that time start
> dispatching. I think we might have to have a wait queue per group,
> put async tasks there, start the time slice on behalf of the task at
> the front of the queue, and wake it up once that task meets its quota.
>
> By keeping track of which bio the current slice is operating for, we
> reduce the number of wakeups for the throtl work. The work is woken
> up only when a bio is ready to be dispatched. So if a bio is queued,
> that means a slice for that direction is already running, and any new
> IO in that group is simply queued instead of checking again whether
> we can dispatch it or not.

OK, I understand. I'll try to apply this logic in the next version of
the patch set.

Thanks for the clarification and the review.

-Andrea
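
For reference, the per-group wait queue scheme suggested in the review
above could take roughly the following shape. This is only an
illustration: the names (async_waitq, throtl_async_wait) are invented
for the sketch and are not code from this series, and the worker-side
path that charges the head waiter and wakes the next one is omitted.

/* illustrative extra field that would live in struct throtl_grp */
	wait_queue_head_t async_waitq;	/* async writers queued FIFO */

/* called from the write path instead of a fixed-length sleep */
static void throtl_async_wait(struct request_queue *q, struct throtl_data *td,
			      struct throtl_grp *tg, size_t size)
{
	const bool rw = WRITE;		/* async throttling is write-only */
	unsigned long sleep = 0;
	DEFINE_WAIT(wait);

	spin_lock_irq(q->queue_lock);
	if (!waitqueue_active(&tg->async_waitq) &&
	    tg_may_dispatch(td, tg, rw, size, &sleep)) {
		/* Nobody queued and we fit in the current slice: charge. */
		throtl_charge_io(tg, rw, false, size);
		spin_unlock_irq(q->queue_lock);
		return;
	}

	/*
	 * Over the limit, or other writers are already waiting: queue
	 * behind them. The slice is extended only on behalf of the head
	 * waiter, and the throtl worker wakes it once its quota becomes
	 * available, so a later writer cannot starve an earlier one.
	 */
	prepare_to_wait_exclusive(&tg->async_waitq, &wait, TASK_KILLABLE);
	spin_unlock_irq(q->queue_lock);
	io_schedule();
	finish_wait(&tg->async_waitq, &wait);
}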