Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753302AbaBYPAK (ORCPT ); Tue, 25 Feb 2014 10:00:10 -0500
Received: from forward20.mail.yandex.net ([95.108.253.145]:54405 "EHLO
        forward20.mail.yandex.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1752875AbaBYO6v (ORCPT );
        Tue, 25 Feb 2014 09:58:51 -0500
From: Kirill Tkhai
To: Juri Lelli
Cc: Peter Zijlstra, "linux-kernel@vger.kernel.org", Steven Rostedt, Ingo Molnar
In-Reply-To: <20140225151515.617714e2f2cd6c558531ba61@gmail.com>
References: <230991392848160@web13m.yandex.ru>
        <20140221103715.GP9987@twins.programming.kicks-ass.net>
        <20140221173641.a060b3d6c0993c21e77f29c2@gmail.com>
        <5307F5DB.3000705@yandex.ru>
        <20140225151515.617714e2f2cd6c558531ba61@gmail.com>
Subject: Re: [RFC] sched/deadline: Prevent rt_time growth to infinity
MIME-Version: 1.0
Message-Id: <188151393340326@web24g.yandex.ru>
X-Mailer: Yamail [ http://yandex.ru ] 5.0
Date: Tue, 25 Feb 2014 18:58:46 +0400
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=koi8-r
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

25.02.2014, 18:14, "Juri Lelli":
> On Sat, 22 Feb 2014 04:56:59 +0400
> Kirill Tkhai wrote:
>
>> On 21.02.2014 20:36, Juri Lelli wrote:
>>> On Fri, 21 Feb 2014 11:37:15 +0100
>>> Peter Zijlstra wrote:
>>>> On Thu, Feb 20, 2014 at 02:16:00AM +0400, Kirill Tkhai wrote:
>>>>> Since deadline tasks share rt bandwidth, we must care about
>>>>> bandwidth timer set. Otherwise rt_time may grow up to infinity
>>>>> in update_curr_dl(), if there are no other available RT tasks
>>>>> on top level bandwidth.
>>>>>
>>>>> I'm going to solve the problem the way below. Almost untested,
>>>>> because I skipped almost all of the recent patches which have to
>>>>> be applied from lkml.
>>>>>
>>>>> Please say if I missed anything in the idea. Maybe it is better to
>>>>> put start_top_rt_bandwidth() into set_curr_task_dl()?
>>>> How about we only increment rt_time when there's an RT bandwidth timer
>>>> active?
>>>>
>>>> ---
>>>> --- a/kernel/sched/rt.c
>>>> +++ b/kernel/sched/rt.c
>>>> @@ -568,6 +568,12 @@ static inline struct rt_bandwidth *sched
>>>>
>>>>  #endif /* CONFIG_RT_GROUP_SCHED */
>>>>
>>>> +bool sched_rt_bandwidth_active(struct rt_rq *rt_rq)
>>>> +{
>>>> +        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
>>>> +        return hrtimer_active(&rt_b->rt_period_timer);
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_SMP
>>>>  /*
>>>>   * We ran out of runtime, see if we can borrow some from our neighbours.
>>>> --- a/kernel/sched/deadline.c
>>>> +++ b/kernel/sched/deadline.c
>>>> @@ -587,6 +587,8 @@ int dl_runtime_exceeded(struct rq *rq, s
>>>>          return 1;
>>>>  }
>>>>
>>>> +extern bool sched_rt_bandwidth_active(struct rt_rq *rt_rq);
>>>> +
>>>>  /*
>>>>   * Update the current task's runtime statistics (provided it is still
>>>>   * a -deadline task and has not been removed from the dl_rq).
>>>> @@ -650,11 +652,13 @@ static void update_curr_dl(struct rq *rq
>>>>                  struct rt_rq *rt_rq = &rq->rt;
>>>>
>>>>                  raw_spin_lock(&rt_rq->rt_runtime_lock);
>>>> -                rt_rq->rt_time += delta_exec;
>>>>                  /*
>>>>                   * We'll let actual RT tasks worry about the overflow here, we
>>>> -                 * have our own CBS to keep us inline -- see above.
>>>> +                 * have our own CBS to keep us inline; only account when RT
>>>> +                 * bandwidth is relevant.
>>>>                   */
>>>> +                if (sched_rt_bandwidth_active(rt_rq))
>>>> +                        rt_rq->rt_time += delta_exec;
>>>>                  raw_spin_unlock(&rt_rq->rt_runtime_lock);
>>>>          }
>>>>  }
>>> So, I ran some tests with the above and I'd like to share with you what
>>> I've found.
>>> You can find here a trace-cmd trace that should be fed
>>> to kernelshark to be able to understand what follows (or feel free to
>>> reproduce the same scenario :)):
>>> http://retis.sssup.it/~jlelli/traces/trace_rt_time.dat
>>>
>>> Here you have a DL task (4/10) and a while(1) RT task, both running
>>> inside a rt_bw of 0.5. The RT task is activated 500ms after DL. As I
>>> filtered in sched_rt_period_timer(), you can search for time instants
>>> when the rt_bw is replenished. It is evident that the first time after
>>> rt timer is activated back (search for start_bandwidth_timer), we can
>>> eat some bw from FAIR tasks (if any). This is due to the fact that we
>>> reset rt_bw budget at this time, start accruing rt_time for both DL
>>> and RT tasks, throttle RT tasks when rt_time > runtime, but, since DL
>>> tasks actually execute inside their own server, they don't care about
>>> rt_bw. Good news is that steady state is ok: keeping track of overruns
>>> we are able to stop eating bw from the other guys.
>>>
>>> My thoughts:
>>>
>>>  - Peter's patch is an easy fix to Kirill's problem (RT tasks were
>>>    throttled too early);
>>>  - something to add to this solution could be to pre-calculate bw of
>>>    ready DL tasks and subtract it from rt_bw at replenishment time, but
>>>    it sounds quite awkward, pessimistic, and I'm not sure it is gonna
>>>    work;
>>>  - we are stealing bw from best-effort tasks, and just at the beginning
>>>    of the transition, is it really a problem?
>>>  - I mean, if you want guarantees make your tasks DL! :);
>>>  - in the long run we are gonna have RT tasks scheduled inside CBS
>>>    servers, and all this will be properly fixed up.
>>>
>>> Comments?
>>>
>>> BTW, rt timer activation/deactivation should probably be fixed for
>>> !RT_GROUP_SCHED with something like this:
>>>
>>> ---
>>>  kernel/sched/rt.c |   10 +++++++---
>>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index 6161de8..274f992 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -86,12 +86,12 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
>>>          raw_spin_lock_init(&rt_rq->rt_runtime_lock);
>>>  }
>>>
>>> -#ifdef CONFIG_RT_GROUP_SCHED
>>>  static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
>>>  {
>>>          hrtimer_cancel(&rt_b->rt_period_timer);
>>>  }
>>>
>>> +#ifdef CONFIG_RT_GROUP_SCHED
>>>  #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
>>>
>>>  static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
>>> @@ -1017,8 +1017,12 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>>>          start_rt_bandwidth(&def_rt_bandwidth);
>>>  }
>>>
>>> -static inline
>>> -void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
>>> +static void
>>> +dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>>> +{
>>> +        if (!rt_rq->rt_nr_running)
>>> +                destroy_rt_bandwidth(&def_rt_bandwidth);
>>> +}
>>>
>>>  #endif /* CONFIG_RT_GROUP_SCHED */
>> It looks like, with both patches applied, we may get into a situation
>> where all CPU time is shared between RT and DL tasks:
>>
>> rt_runtime = n
>> rt_period  = 2n
>>
>> | RT working, DL sleeping  | DL working, RT sleeping      |
>> -----------------------------------------------------------
>> | (1)     duration = n     | (2)     duration = n         | (repeat)
>> |--------------------------|------------------------------|
>> | (rt_bw timer is running) | (rt_bw timer is not running) |
>>
>> No time for fair tasks at all.
>
> Ok, this situation is pathological.
> DL bandwidth is guaranteed at
> admission control, while RT isn't. In this case RT tasks are doomed by
> construction. Still you'd like to let FAIR tasks execute :).
>
> I argue for a slightly different solution in what follows, what do you
> think?
>
> Thanks,
>
> - Juri
>
> From e44fe2eef34433a7799cfc153f467f7c62813596 Mon Sep 17 00:00:00 2001
> From: Juri Lelli
> Date: Fri, 21 Feb 2014 11:37:15 +0100
> Subject: [PATCH] sched/deadline: Prevent rt_time growth to infinity
>
> Kirill Tkhai noted:
> Since deadline tasks share rt bandwidth, we must care about
> bandwidth timer set. Otherwise rt_time may grow up to infinity
> in update_curr_dl(), if there are no other available RT tasks
> on top level bandwidth.
>
> RT tasks were in fact throttled right after they got enqueued,
> and never executed again (rt_time never again went below rt_runtime).
>
> Peter then proposed to accrue DL execution on rt_time only when
> rt timer is active, and proposed a patch (this patch is a slight
> modification of that) to implement that behavior. While this
> solves Kirill's problem, it has a drawback.
>
> Indeed, Kirill noted again:
> It looks like we may get into a situation where all CPU time is shared
> between RT and DL tasks:
>
> rt_runtime = n
> rt_period  = 2n
>
> | RT working, DL sleeping  | DL working, RT sleeping      |
> -----------------------------------------------------------
> | (1)     duration = n     | (2)     duration = n         | (repeat)
> |--------------------------|------------------------------|
> | (rt_bw timer is running) | (rt_bw timer is not running) |
>
> No time for fair tasks at all.
>
> While this can happen during the first period, if rq is always backlogged,
> RT tasks won't have the opportunity to execute anymore: rt_time reached
> rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
> throttled since rt timer didn't fire, replenishment is from now on eaten up
> by DL tasks that accrue their execution on rt_time (while rt timer is
> active - we have an RT task waiting for replenishment). FAIR tasks are
> not touched after this first period. Ok, this is not ideal, and the situation
> is even worse!
>
> What is described above (the nice case) practically never happens in
> reality, where your rt timer is not aligned to task periods, tasks are in
> general not periodic, etc. Long story short, you always risk overloading
> your system.
>
> This patch is based on Peter's idea, but exploits an additional fact:
> if you don't have RT tasks enqueued, it makes little sense to continue
> incrementing rt_time once you reached the upper limit (DL tasks have their
> own mechanism for throttling).
>
> This cures both problems:
>
>  - no matter how many DL instances in the past, you'll have an rt_time
>    slightly above rt_runtime when an RT task is enqueued, and from that
>    point on (after the first replenishment), the task will normally execute;
>
>  - you can still eat up all bandwidth during the first period, but not
>    anymore after that, remember that DL execution will increment rt_time
>    till the upper limit is reached.
>
> The situation is still not perfect! But, we have a simple solution for now,
> that limits how much you can jeopardize your system, as we keep working
> towards the right answer: RT groups scheduled using deadline servers.

Excellent, Juri! This is almost perfect.
Thanks,
Kirill

> Signed-off-by: Juri Lelli
> ---
>  kernel/sched/deadline.c |    8 ++++++--
>  kernel/sched/rt.c       |    8 ++++++++
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 15cbc17..f59d774 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -564,6 +564,8 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>          return 1;
>  }
>
> +extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
> +
>  /*
>   * Update the current task's runtime statistics (provided it is still
>   * a -deadline task and has not been removed from the dl_rq).
> @@ -627,11 +629,13 @@ static void update_curr_dl(struct rq *rq)
>                  struct rt_rq *rt_rq = &rq->rt;
>
>                  raw_spin_lock(&rt_rq->rt_runtime_lock);
> -                rt_rq->rt_time += delta_exec;
>                  /*
>                   * We'll let actual RT tasks worry about the overflow here, we
> -                 * have our own CBS to keep us inline -- see above.
> +                 * have our own CBS to keep us inline; only account when RT
> +                 * bandwidth is relevant.
>                   */
> +                if (sched_rt_bandwidth_account(rt_rq))
> +                        rt_rq->rt_time += delta_exec;
>                  raw_spin_unlock(&rt_rq->rt_runtime_lock);
>          }
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7dba25a..7f372e1 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -538,6 +538,14 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
>
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
> +bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
> +{
> +        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> +
> +        return (hrtimer_active(&rt_b->rt_period_timer) ||
> +                rt_rq->rt_time < rt_b->rt_runtime);
> +}
> +
>  #ifdef CONFIG_SMP
>  /*
>   * We ran out of runtime, see if we can borrow some from our neighbours.
> --
> 1.7.9.5
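
The claim in the changelog that rt_time ends up only "slightly above rt_runtime" when no RT task is enqueued can be checked with a small userspace model. This is only a sketch, not kernel code: struct model_rt_rq, model_bandwidth_account() and model_run_dl() are hypothetical names made up for illustration, the timer_active flag stands in for hrtimer_active(&rt_b->rt_period_timer), and replenishment by the period timer is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the fields the patch looks at. */
struct model_rt_rq {
        uint64_t rt_time;       /* execution accrued so far */
        uint64_t rt_runtime;    /* budget per period */
        bool timer_active;      /* models hrtimer_active() on the period timer */
};

/* Same condition as sched_rt_bandwidth_account() in the patch above. */
bool model_bandwidth_account(const struct model_rt_rq *rt_rq)
{
        assert(rt_rq != NULL);
        return rt_rq->timer_active || rt_rq->rt_time < rt_rq->rt_runtime;
}

/*
 * Charge 'ticks' DL updates of 'delta_exec' each, as the DL branch of
 * update_curr_dl() would. With use_patch == false the charge is
 * unconditional (the behaviour before the patch).
 */
uint64_t model_run_dl(struct model_rt_rq *rt_rq, unsigned int ticks,
                      uint64_t delta_exec, bool use_patch)
{
        for (unsigned int i = 0; i < ticks; i++) {
                if (!use_patch || model_bandwidth_account(rt_rq))
                        rt_rq->rt_time += delta_exec;
        }
        return rt_rq->rt_time;
}
```

With rt_runtime = 950, delta_exec = 100, the timer idle and 1000 updates, the unconditional variant ends at rt_time = 100000, while the patched condition stops at 1000, i.e. less than one delta_exec above rt_runtime. In this model that is the bound an RT task would see when it is enqueued later, whatever the number of DL instances that ran before it.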