From: Kirill Tkhai <tkhai@yandex.ru>
To: Juri Lelli <juri.lelli@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@redhat.com>
In-Reply-To: <20140221175305.1e170b45be08fe05c93a33b4@gmail.com>
References: <230991392848160@web13m.yandex.ru>
	<20140221103715.GP9987@twins.programming.kicks-ass.net>
	<20140221173641.a060b3d6c0993c21e77f29c2@gmail.com> <20140221175305.1e170b45be08fe05c93a33b4@gmail.com>
Subject: Re: [RFC] sched/deadline: Prevent rt_time growth to infinity
MIME-Version: 1.0
Message-Id: <57671393026602@web10j.yandex.ru>
Date: Sat, 22 Feb 2014 03:50:02 +0400
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=koi8-r
Sender: linux-kernel-owner@vger.kernel.org


21.02.2014, 20:52, "Juri Lelli" <juri.lelli@gmail.com>:
> On Fri, 21 Feb 2014 17:36:41 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
>
>> On Fri, 21 Feb 2014 11:37:15 +0100
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Feb 20, 2014 at 02:16:00AM +0400, Kirill Tkhai wrote:
>>>> Since deadline tasks share rt bandwidth, we must take care that the
>>>> bandwidth timer is set. Otherwise rt_time may grow to infinity
>>>> in update_curr_dl() if there are no other available RT tasks
>>>> on the top-level bandwidth.
>>>>
>>>> I'm going to solve the problem the way shown below. It is almost
>>>> untested, because I skipped almost all of the recent patches from
>>>> lkml which have to be applied.
>>>>
>>>> Please say if I missed anything in the idea. Maybe it is better to put
>>>> start_top_rt_bandwidth() into set_curr_task_dl()?
>>> How about we only increment rt_time when there's an RT bandwidth timer
>>> active?
>>>
>>> ---
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -568,6 +568,12 @@ static inline struct rt_bandwidth *sched
>>>
>>>  #endif /* CONFIG_RT_GROUP_SCHED */
>>>
>>> +bool sched_rt_bandwidth_active(struct rt_rq *rt_rq)
>>> +{
>>> +        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
>>> +        return hrtimer_active(&rt_b->rt_period_timer);
>>> +}
>>> +
>>>  #ifdef CONFIG_SMP
>>>  /*
>>>   * We ran out of runtime, see if we can borrow some from our neighbours.
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -587,6 +587,8 @@ int dl_runtime_exceeded(struct rq *rq, s
>>>          return 1;
>>>  }
>>>
>>> +extern bool sched_rt_bandwidth_active(struct rt_rq *rt_rq);
>>> +
>>>  /*
>>>   * Update the current task's runtime statistics (provided it is still
>>>   * a -deadline task and has not been removed from the dl_rq).
>>> @@ -650,11 +652,13 @@ static void update_curr_dl(struct rq *rq
>>>                  struct rt_rq *rt_rq = &rq->rt;
>>>
>>>                  raw_spin_lock(&rt_rq->rt_runtime_lock);
>>> -                rt_rq->rt_time += delta_exec;
>>>                  /*
>>>                   * We'll let actual RT tasks worry about the overflow here, we
>>> -                 * have our own CBS to keep us inline -- see above.
>>> +                 * have our own CBS to keep us inline; only account when RT
>>> +                 * bandwidth is relevant.
>>>                   */
>>> +                if (sched_rt_bandwidth_active(rt_rq))
>>> +                        rt_rq->rt_time += delta_exec;
>>>                  raw_spin_unlock(&rt_rq->rt_runtime_lock);
>>>          }
>>>  }
>> So, I ran some tests with the above and I'd like to share with you what
>> I've found. You can find here a trace-cmd trace that should be fed
>> to kernelshark to be able to understand what follows (or feel free to
>> reproduce the same scenario :)):
>> http://retis.sssup.it/~jlelli/traces/trace_rt_time.dat
>>
>> Here you have a DL task (4/10) and a while(1) RT task, both running
>> inside an rt_bw of 0.5. The RT task is activated 500ms after the DL one.
>> As I filtered on sched_rt_period_timer(), you can search for the time
>> instants when the rt_bw is replenished. It is evident that the first time
>> after the rt timer is activated back (search for start_bandwidth_timer),
>> we can eat some bw from FAIR tasks (if any). This is due to the fact that
>> we reset the rt_bw budget at this time, start decrementing rt_time for both DL
>
> The reset happens when rt_bw replenishment timer fires, after a bit:
>
>  sched_rt_period_timer <-- __run_hrtimer

Juri, sorry, I forgot to write that I mean the situation where only one task
is on_rq at any given moment.

DL, RT, DL, RT, ...

rt_runtime = n;
rt_period  = 2n;

| DL's working, RT's sleeping  | RT's working, DL's sleeping  | all sleep                    |
|------------------------------|------------------------------|------------------------------|
| (1) duration = n             | (2) duration = n             | (3) duration = n             | (repeat)
|------------------------------|------------------------------|------------------------------|
| (rt_bw timer is not running) | (rt_bw timer is running)                                    |


According to the patch, the rt_bw timer runs only while an RT task is queued.

In the case above, part (1) has no queued RT tasks, so the timer is not
running and rt_time is not increased either.

So DL and RT together run for 2n out of every 3n. We have a ratio of 2/3,
although rt_runtime/rt_period is only 1/2.

Thanks,
Kirill

>
> Apologies,
>
> - Juri
>
>> and RT tasks, throttle RT tasks when rt_time > runtime, but, since DL
>> tasks actually execute inside their own server, they don't care about
>> rt_bw. Good news is that steady state is ok: keeping track of overruns
>> we are able to stop eating bw from the other guys.
>>
>> My thoughts:
>>
>>  - Peter's patch is an easy fix to Kirill's problem (RT tasks were
>>    throttled too early);
>>  - something to add to this solution could be to pre-calculate the bw of
>>    ready DL tasks and subtract it from rt_bw at replenishment time, but
>>    it sounds quite awkward, pessimistic, and I'm not sure it is gonna
>>    work;
>>  - we are stealing bw from best-effort tasks, and just at the beginning
>>    of the transition, is it really a problem?
>>  - I mean, if you want guarantees make your tasks DL! :);
>>  - in the long run we are gonna have RT tasks scheduled inside CBS
>>    servers, and all this will be properly fixed up.
>>
>> Comments?
>>
>> BTW, rt timer activation/deactivation should probably be fixed for
>> !RT_GROUP_SCHED with something like this:
>>
>> ---
>>  kernel/sched/rt.c |   10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 6161de8..274f992 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -86,12 +86,12 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
>>          raw_spin_lock_init(&rt_rq->rt_runtime_lock);
>>  }
>>
>> -#ifdef CONFIG_RT_GROUP_SCHED
>>  static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
>>  {
>>          hrtimer_cancel(&rt_b->rt_period_timer);
>>  }
>>
>> +#ifdef CONFIG_RT_GROUP_SCHED
>>  #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
>>
>>  static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
>> @@ -1017,8 +1017,12 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>>          start_rt_bandwidth(&def_rt_bandwidth);
>>  }
>>
>> -static inline
>> -void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
>> +static void
>> +dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>> +{
>> +        if (!rt_rq->rt_nr_running)
>> +                destroy_rt_bandwidth(&def_rt_bandwidth);
>> +}
>>
>>  #endif /* CONFIG_RT_GROUP_SCHED */