2012-05-17 04:57:40

by Colin Cross

[permalink] [raw]
Subject: [PATCH] sched/rt: fix SCHED_RR across cgroups

task_tick_rt has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq. However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup. In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice. If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.

Modify task_tick_rt to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.

Signed-off-by: Colin Cross <[email protected]>
---
kernel/sched/rt.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 44af55e..8f32475 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1983,6 +1983,8 @@ static void watchdog(struct rq *rq, struct task_struct *p)

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+
update_curr_rt(rq);

watchdog(rq, p);
@@ -2000,12 +2002,15 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
p->rt.time_slice = RR_TIMESLICE;

/*
- * Requeue to the end of queue if we are not the only element
- * on the queue:
+ * Requeue to the end of queue if we (and all of our ancestors) are the
+ * only element on the queue
*/
- if (p->rt.run_list.prev != p->rt.run_list.next) {
- requeue_task_rt(rq, p, 0);
- set_tsk_need_resched(p);
+ for_each_sched_rt_entity(rt_se) {
+ if (rt_se->run_list.prev != rt_se->run_list.next) {
+ requeue_task_rt(rq, p, 0);
+ set_tsk_need_resched(p);
+ return;
+ }
}
}

--
1.7.7.3


2012-05-18 08:56:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Wed, 2012-05-16 at 21:34 -0700, Colin Cross wrote:
> task_tick_rt has an optimization to only reschedule SCHED_RR tasks
> if they were the only element on their rq. However, with cgroups
> a SCHED_RR task could be the only element on its per-cgroup rq but
> still be competing with other SCHED_RR tasks in its parent's
> cgroup. In this case, the SCHED_RR task in the child cgroup would
> never yield at the end of its timeslice. If the child cgroup
> rt_runtime_us was the same as the parent cgroup rt_runtime_us,
> the task in the parent cgroup would starve completely.
>
> Modify task_tick_rt to check that the task is the only task on its
> rq, and that the each of the scheduling entities of its ancestors
> is also the only entity on its rq.
>
> Signed-off-by: Colin Cross <[email protected]>

OK, fair enough.. one does wonder though, WTH is android doing with
SCHED_RR?

2012-05-18 17:52:07

by Colin Cross

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, May 18, 2012 at 1:56 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-05-16 at 21:34 -0700, Colin Cross wrote:
>> task_tick_rt has an optimization to only reschedule SCHED_RR tasks
>> if they were the only element on their rq. ?However, with cgroups
>> a SCHED_RR task could be the only element on its per-cgroup rq but
>> still be competing with other SCHED_RR tasks in its parent's
>> cgroup. ?In this case, the SCHED_RR task in the child cgroup would
>> never yield at the end of its timeslice. ?If the child cgroup
>> rt_runtime_us was the same as the parent cgroup rt_runtime_us,
>> the task in the parent cgroup would starve completely.
>>
>> Modify task_tick_rt to check that the task is the only task on its
>> rq, and that the each of the scheduling entities of its ancestors
>> is also the only entity on its rq.
>>
>> Signed-off-by: Colin Cross <[email protected]>
>
> OK, fair enough.. one does wonder though, WTH is android doing with
> SCHED_RR?

Nothing, I was just experimenting with how it interacted with cgroups
and the numbers didn't make sense.

2012-05-18 18:37:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, 2012-05-18 at 10:52 -0700, Colin Cross wrote:

> > OK, fair enough.. one does wonder though, WTH is android doing with
> > SCHED_RR?
>
> Nothing, I was just experimenting with how it interacted with cgroups
> and the numbers didn't make sense.

OK. Thanks anyway!

2012-05-19 00:13:31

by Colin Cross

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, May 18, 2012 at 11:37 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2012-05-18 at 10:52 -0700, Colin Cross wrote:
>
>> > OK, fair enough.. one does wonder though, WTH is android doing with
>> > SCHED_RR?
>>
>> Nothing, I was just experimenting with how it interacted with cgroups
>> and the numbers didn't make sense.
>
> OK. Thanks anyway!

Even with this patch, scheduling of SCHED_RR tasks in cgroups is a
little odd. Each cgroup is treated as a schedulable entity alongside
the tasks in the same parent cgroup, and then the tasks inside the
child cgroup round robin through the child cgroup's time slices. So
in the setup:
root_cgroup
task 1
cgroup
task 2
task 3

The RR will be:
task 1, cgroup(task 2), task 1, cgroup(task 3), ...

task 1 will run twice as often, for a full RR_TIMESLICE each time, as
tasks 2 and 3.

Is that the way SCHED_RR is intended to interact with cgroups?

2012-05-19 13:11:31

by Dario Faggioli

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, 2012-05-18 at 17:13 -0700, Colin Cross wrote:
> Even with this patch, scheduling of SCHED_RR tasks in cgroups is a
> little odd. Each cgroup is treated as a schedulable entity alongside
> the tasks in the same parent cgroup, and then the tasks inside the
> child cgroup round robin through the child cgroup's time slices. So
> in the setup:
> root_cgroup
> task 1
> cgroup
> task 2
> task 3
>
> The RR will be:
> task 1, cgroup(task 2), task 1, cgroup(task 3), ...
>
> task 1 will run twice as often, for a full RR_TIMESLICE each time, as
> tasks 2 and 3.
>
That looks right to me...

> Is that the way SCHED_RR is intended to interact with cgroups?
>
I would say it is. That's what you get because of putting task1 and
cgroup at the same level in the "hierarchy". I'm curious, what kind of
behaviour were you expecting?

Of course, the actual schedule also depends on the real-time priority of
the various tasks (groups don't have a priority, they inherit it from
their tasks, or at least it was like this when I used to work with
it :-P), but I guess you're putting all the tasks in the same queue
(i.e., same rt-prio), is it that the case?

Dario

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



Attachments:
signature.asc (198.00 B)
This is a digitally signed message part

2012-05-19 20:37:38

by Colin Cross

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Sat, May 19, 2012 at 6:11 AM, Dario Faggioli <[email protected]> wrote:
> On Fri, 2012-05-18 at 17:13 -0700, Colin Cross wrote:
>> Even with this patch, scheduling of SCHED_RR tasks in cgroups is a
>> little odd. ?Each cgroup is treated as a schedulable entity alongside
>> the tasks in the same parent cgroup, and then the tasks inside the
>> child cgroup round robin through the child cgroup's time slices. ?So
>> in the setup:
>> root_cgroup
>> ? ?task 1
>> ? ?cgroup
>> ? ? ? task 2
>> ? ? ? task 3
>>
>> The RR will be:
>> task 1, cgroup(task 2), task 1, cgroup(task 3), ...
>>
>> task 1 will run twice as often, for a full RR_TIMESLICE each time, as
>> tasks 2 and 3.
>>
> That looks right to me...
>
>> Is that the way SCHED_RR is intended to interact with cgroups?
>>
> I would say it is. That's what you get because of putting task1 and
> cgroup at the same level in the "hierarchy". I'm curious, what kind of
> behaviour were you expecting?

That behavior matches exactly with scheduling of normal tasks and
cgroups with default cpu.shares, but doesn't match too well with what
I can see of the posix SCHED_RR description, which suggests all the
SCHED_RR threads go into a single queue. I was just curious if the
behavior my patch restored was correct, since it can't be adjusted by
tweaking any parameters like cpu.shares.

> Of course, the actual schedule also depends on the real-time priority of
> the various tasks (groups don't have a priority, they inherit it from
> their tasks, or at least it was like this when I used to work with
> it :-P), but I guess you're putting all the tasks in the same queue
> (i.e., same rt-prio), is it that the case?

Yes.

2012-05-23 13:32:46

by Dario Faggioli

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Sat, 2012-05-19 at 13:37 -0700, Colin Cross wrote:
> > I would say it is. That's what you get because of putting task1 and
> > cgroup at the same level in the "hierarchy". I'm curious, what kind of
> > behaviour were you expecting?
>
> That behavior matches exactly with scheduling of normal tasks and
> cgroups with default cpu.shares, but doesn't match too well with what
> I can see of the posix SCHED_RR description, which suggests all the
> SCHED_RR threads go into a single queue. I was just curious if the
> behavior my patch restored was correct, since it can't be adjusted by
> tweaking any parameters like cpu.shares.
>
Again, I really think it is the intended behaviour, and yes, real-time
group scheduling "breaks" the POSIX specification of the SCHED_{FIFO,RR}
policies intentionally (and _proudly_, as Peter would say it, am I
wrong? :-P).

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


Attachments:
signature.asc (198.00 B)
This is a digitally signed message part

2012-05-25 11:52:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Wed, 2012-05-23 at 15:32 +0200, Dario Faggioli wrote:
> Again, I really think it is the intended behaviour, and yes, real-time
> group scheduling "breaks" the POSIX specification of the SCHED_{FIFO,RR}
> policies intentionally (and _proudly_, as Peter would say it, am I
> wrong? :-P).

No, cgroups are well outside of POSIX ;-) as is SMP in fact.

2012-05-25 13:12:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, 2012-05-25 at 13:52 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-23 at 15:32 +0200, Dario Faggioli wrote:
> > Again, I really think it is the intended behaviour, and yes, real-time
> > group scheduling "breaks" the POSIX specification of the SCHED_{FIFO,RR}
> > policies intentionally (and _proudly_, as Peter would say it, am I
> > wrong? :-P).
>
> No, cgroups are well outside of POSIX ;-) as is SMP in fact.

That's because the POSIX standards committee is still struggling to come
up with standardized SMP calls to handle NR_CPUS = 0

Isn't Paul on that committee? ;-)

-- Steve

2012-05-25 17:56:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] sched/rt: fix SCHED_RR across cgroups

On Fri, May 25, 2012 at 09:12:06AM -0400, Steven Rostedt wrote:
> On Fri, 2012-05-25 at 13:52 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-05-23 at 15:32 +0200, Dario Faggioli wrote:
> > > Again, I really think it is the intended behaviour, and yes, real-time
> > > group scheduling "breaks" the POSIX specification of the SCHED_{FIFO,RR}
> > > policies intentionally (and _proudly_, as Peter would say it, am I
> > > wrong? :-P).
> >
> > No, cgroups are well outside of POSIX ;-) as is SMP in fact.
>
> That's because the POSIX standards committee is still struggling to come
> up with standardized SMP calls to handle NR_CPUS = 0

;-) ;-) ;-)

> Isn't Paul on that committee? ;-)

I have met with them occasionally, but have spent most of my time on
the C/C++ committees. I didn't try sounding them out on NR_CPUS=0,
partly because they were choking pretty hard on NR_CPUS=4096.

At least part of the problem is that every OS out there has different
SMP feature, so the only way it would be possible to get this sort of
thing through the committee would be to invent something that was
roughly equally incompatible with everyone.

However, there are some SMP features standardized by various random
committees and aggregated by The Open Group, including:

o The pthread_mutex_lock() API
o pthread_getspecific() and pthread_setspecific()
o pthread_getconcurrency() and pthread_setconcurrency()

But yes, even the aggregated standard is quite limiting.

Thanx, Paul

2012-05-30 13:40:52

by Colin Cross

[permalink] [raw]
Subject: [tip:sched/urgent] sched/rt: Fix SCHED_RR across cgroups

Commit-ID: 454c79999f7eaedcdf4c15c449e43902980cbdf5
Gitweb: http://git.kernel.org/tip/454c79999f7eaedcdf4c15c449e43902980cbdf5
Author: Colin Cross <[email protected]>
AuthorDate: Wed, 16 May 2012 21:34:23 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 30 May 2012 14:02:25 +0200

sched/rt: Fix SCHED_RR across cgroups

task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq. However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup. In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice. If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.

Modify task_tick_rt() to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.

Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/rt.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 295da73..2a4e8df 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1985,6 +1985,8 @@ static void watchdog(struct rq *rq, struct task_struct *p)

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+
update_curr_rt(rq);

watchdog(rq, p);
@@ -2002,12 +2004,15 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
p->rt.time_slice = RR_TIMESLICE;

/*
- * Requeue to the end of queue if we are not the only element
- * on the queue:
+ * Requeue to the end of queue if we (and all of our ancestors) are the
+ * only element on the queue
*/
- if (p->rt.run_list.prev != p->rt.run_list.next) {
- requeue_task_rt(rq, p, 0);
- set_tsk_need_resched(p);
+ for_each_sched_rt_entity(rt_se) {
+ if (rt_se->run_list.prev != rt_se->run_list.next) {
+ requeue_task_rt(rq, p, 0);
+ set_tsk_need_resched(p);
+ return;
+ }
}
}