Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755120AbZDULQt (ORCPT ); Tue, 21 Apr 2009 07:16:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753257AbZDULQk (ORCPT ); Tue, 21 Apr 2009 07:16:40 -0400 Received: from mail-bw0-f163.google.com ([209.85.218.163]:49954 "EHLO mail-bw0-f163.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753036AbZDULQj (ORCPT ); Tue, 21 Apr 2009 07:16:39 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=Y5lQfBk9KF64DFnlhvCr1abXtXyM1oB2MbL61Un7zj6fF0palS/2za5a5H+XZFj1Lc 9RDazy+2+TuNI3fymYT34AukZKIrJ65m0u3YLrAAQx7TT6KSJ05GViey68iHDHWlToYi /2gYerLHcytTPw3pC6vy1f5IsitKrBAOztm60= Date: Tue, 21 Apr 2009 13:16:34 +0200 From: Andrea Righi To: Balbir Singh Cc: Paul Menage , Gui Jianfeng , KAMEZAWA Hiroyuki , agk@sourceware.org, akpm@linux-foundation.org, axboe@kernel.dk, baramsori72@gmail.com, Carl Henrik Lunde , dave@linux.vnet.ibm.com, Divyesh Shah , eric.rannaud@gmail.com, fernando@oss.ntt.co.jp, Hirokazu Takahashi , Li Zefan , matt@bluehost.com, dradford@bluehost.com, ngupta@google.com, randy.dunlap@oracle.com, roberto@unbit.it, Ryo Tsuruta , Satoshi UCHIDA , subrata@linux.vnet.ibm.com, yoshikawa.takuya@oss.ntt.co.jp, Nauman Rafique , fchecconi@gmail.com, paolo.valente@unimore.it, containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/7] res_counter: introduce ratelimiting attributes Message-ID: <20090421111633.GC13699@linux> References: <1240090712-1058-1-git-send-email-righi.andrea@gmail.com> <1240090712-1058-3-git-send-email-righi.andrea@gmail.com> <20090421101326.GE19637@balbir.in.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090421101326.GE19637@balbir.in.ibm.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 10041 Lines: 313 On Tue, Apr 21, 2009 at 03:43:26PM +0530, Balbir Singh wrote: > * Andrea Righi [2009-04-18 23:38:27]: > > > Introduce attributes and functions in res_counter to implement throttling-based > > cgroup subsystems. > > > > The following attributes have been added to struct res_counter: > > * @policy: the limiting policy / algorithm > > * @capacity: the maximum capacity of the resource > > * @timestamp: timestamp of the last accounted resource request > > > > Units of each of the above would be desirable, without them it is hard > to understand what you are trying to add. What is the unit of > capacity? Theoretically it can be any unit. At the moment it is used by the io-throttle controller only for the token bucket strategy (@policy = RATELIMIT_TOKEN_BUCKET) and it can be either bytes or IO operations. Maybe I should add a comment like this. > > > Currently the available policies are: token-bucket and leaky-bucket and the > > attribute @capacity is only used by token-bucket policy (to represent the > > bucket size). > > > > The following function has been implemented to return the amount of time a > > cgroup should sleep to remain within the defined resource limits. > > > > unsigned long long > > res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val); > > > > [ Note: only the interfaces needed by the cgroup IO controller are implemented > > right now ] > > > > This is a good RFC, but I would hold off merging till the subsystem > gets in. Having said that I am not convinced about the subsystem > sleeping, if the subsystem is not IO intensive, should it still sleep > because it is over its IO b/w? This might make sense for the CPU > controller, since not having CPU b/w does imply sleeping. > > Could you please use the word throttle instead of sleep. OK, will do in the next version. > > > > Signed-off-by: Andrea Righi > > --- > > include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++---------- > > kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 124 insertions(+), 17 deletions(-) > > > > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h > > index 4c5bcf6..9bed6af 100644 > > --- a/include/linux/res_counter.h > > +++ b/include/linux/res_counter.h > > @@ -14,30 +14,36 @@ > > */ > > > > #include > > +#include > > > > -/* > > - * The core object. the cgroup that wishes to account for some > > - * resource may include this counter into its structures and use > > - * the helpers described beyond > > - */ > > +/* The various policies that can be used for ratelimiting resources */ > > +#define RATELIMIT_LEAKY_BUCKET 0 > > +#define RATELIMIT_TOKEN_BUCKET 1 > > > > +/** > > + * struct res_counter - the core object to account cgroup resources > > + * > > + * @usage: the current resource consumption level > > + * @max_usage: the maximal value of the usage from the counter creation > > + * @limit: the limit that usage cannot be exceeded > > + * @failcnt: the number of unsuccessful attempts to consume the resource > > + * @policy: the limiting policy / algorithm > > + * @capacity: the maximum capacity of the resource > > + * @timestamp: timestamp of the last accounted resource request > > + * @lock: the lock to protect all of the above. > > + * The routines below consider this to be IRQ-safe > > + * > > + * The cgroup that wishes to account for some resource may include this counter > > + * into its structures and use the helpers described beyond. > > + */ > > struct res_counter { > > - /* > > - * the current resource consumption level > > - */ > > unsigned long long usage; > > - /* > > - * the maximal value of the usage from the counter creation > > - */ > > unsigned long long max_usage; > > - /* > > - * the limit that usage cannot exceed > > - */ > > unsigned long long limit; > > - /* > > - * the number of unsuccessful attempts to consume the resource > > - */ > > Don't understand why res_counter is removed? Am I reading the diff > correctly? It is not removed. I've just used the kernel-doc style comment (Documentation/kernel-doc-nano-HOWTO.txt). I think Randy suggested this in the past. > > > unsigned long long failcnt; > > + unsigned long long policy; > > + unsigned long long capacity; > > + unsigned long long timestamp; > > /* > > * the lock to protect all of the above. > > * the routines below consider this to be IRQ-safe > > @@ -84,6 +90,9 @@ enum { > > RES_USAGE, > > RES_MAX_USAGE, > > RES_LIMIT, > > + RES_POLICY, > > + RES_TIMESTAMP, > > + RES_CAPACITY, > > RES_FAILCNT, > > }; > > > > @@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt) > > return false; > > } > > > > +static inline unsigned long long > > +res_counter_ratelimit_delta_t(struct res_counter *res) > > +{ > > + return (long long)get_jiffies_64() - (long long)res->timestamp; > > +} > > + > > +unsigned long long > > +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val); > > + > > /* > > * Helper function to detect if the cgroup is within it's limit or > > * not. It's currently called from cgroup_rss_prepare() > > @@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt) > > spin_unlock_irqrestore(&cnt->lock, flags); > > } > > > > +static inline int > > +res_counter_ratelimit_set_limit(struct res_counter *cnt, > > + unsigned long long policy, > > + unsigned long long limit, unsigned long long max) > > +{ > > + unsigned long flags; > > + > > + spin_lock_irqsave(&cnt->lock, flags); > > + cnt->limit = limit; > > + cnt->capacity = max; > > + cnt->policy = policy; > > + cnt->timestamp = get_jiffies_64(); > > + cnt->usage = 0; > > + spin_unlock_irqrestore(&cnt->lock, flags); > > + return 0; > > +} > > + > > static inline int res_counter_set_limit(struct res_counter *cnt, > > unsigned long long limit) > > { > > diff --git a/kernel/res_counter.c b/kernel/res_counter.c > > index bf8e753..b62319c 100644 > > --- a/kernel/res_counter.c > > +++ b/kernel/res_counter.c > > @@ -9,6 +9,7 @@ > > > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) > > spin_lock_init(&counter->lock); > > counter->limit = (unsigned long long)LLONG_MAX; > > counter->parent = parent; > > + counter->capacity = (unsigned long long)LLONG_MAX; > > + counter->timestamp = get_jiffies_64(); > > } > > > > int res_counter_charge_locked(struct res_counter *counter, unsigned long val) > > @@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member) > > return &counter->max_usage; > > case RES_LIMIT: > > return &counter->limit; > > + case RES_POLICY: > > + return &counter->policy; > > + case RES_TIMESTAMP: > > + return &counter->timestamp; > > + case RES_CAPACITY: > > + return &counter->capacity; > > case RES_FAILCNT: > > return &counter->failcnt; > > }; > > @@ -163,3 +172,66 @@ int res_counter_write(struct res_counter *counter, int member, > > spin_unlock_irqrestore(&counter->lock, flags); > > return 0; > > } > > + > > +static unsigned long long > > +ratelimit_leaky_bucket(struct res_counter *res, ssize_t val) > > +{ > > + unsigned long long delta, t; > > + > > + res->usage += val; > > Is this called from a protected context (w.r.t. res)? Yes, it is called with res->lock held (look at res_counter_ratelimit_sleep()). I can add a comment anyway. > > > + delta = res_counter_ratelimit_delta_t(res); > > + if (!delta) > > + return 0; > > + t = res->usage * USEC_PER_SEC; > > + t = usecs_to_jiffies(div_u64(t, res->limit)); > > + if (t > delta) > > + return t - delta; > > + /* Reset i/o statistics */ > > + res->usage = 0; > > + res->timestamp = get_jiffies_64(); > > + return 0; > > +} > > + > > +static unsigned long long > > +ratelimit_token_bucket(struct res_counter *res, ssize_t val) > > +{ > > + unsigned long long delta; > > + long long tok; > > + > > + res->usage -= val; > > + delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res)); > > + res->timestamp = get_jiffies_64(); > > + tok = (long long)res->usage * MSEC_PER_SEC; > > + if (delta) { > > + long long max = (long long)res->capacity * MSEC_PER_SEC; > > + > > + tok += delta * res->limit; > > + if (tok > max) > > + tok = max; > > Use max_t() here ok. > > > + res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC); > > + } > > + return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0; > > +} > > I don't like the usage of MSEC and USEC for res->usage based on > policy. I used a different granularity only because in the io-throttle tests token bucket worked better with USEC and leaky bucket with MSEC. But we can generalize and encode this "granularity" information in a res_counter->flags attribute. > > > + > > +unsigned long long > > +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val) > > +{ > > + unsigned long long sleep = 0; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&res->lock, flags); > > + if (res->limit) > > + switch (res->policy) { > > + case RATELIMIT_LEAKY_BUCKET: > > + sleep = ratelimit_leaky_bucket(res, val); > > + break; > > + case RATELIMIT_TOKEN_BUCKET: > > + sleep = ratelimit_token_bucket(res, val); > > + break; > > + default: > > + WARN_ON(1); > > + break; > > + } > > + spin_unlock_irqrestore(&res->lock, flags); > > + return sleep; > > +} > > -- > > 1.5.6.3 > > > > > > -- > Balbir Thanks for your comments! -Andrea -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/