2015-06-01 14:13:26

by Peter Zijlstra

[permalink] [raw]
Subject: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

In order to remove dropping rq->lock from the
switched_{to,from}()/prio_changed() sched_class methods, run the
balance callbacks after it.

We need to remove dropping rq->lock because its buggy,
suppose using sched_setattr()/sched_setscheduler() to change a running
task from FIFO to OTHER.

By the time we get to switched_from_rt() the task is already enqueued
on the cfs runqueues. If switched_from_rt() does pull_rt_task() and
drops rq->lock, load-balancing can come in and move our task @p to
another rq.

The subsequent switched_to_fair() still assumes @p is on @rq and bad
things will happen.

By using balance callbacks we delay the load-balancing operations
{rt,dl}x{push,pull} until we've done all the important work and the
task is fully set up.

Furthermore, the balance callbacks do not know about @p, therefore
they cannot get confused like this.

Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1001,7 +1001,11 @@ inline int task_curr(const struct task_s
}

/*
- * Can drop rq->lock because from sched_class::switched_from() methods drop it.
+ * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
+ * use the balance_callback list if you want balancing.
+ *
+ * this means any call to check_class_changed() must be followed by a call to
+ * balance_callback().
*/
static inline void check_class_changed(struct rq *rq, struct task_struct *p,
const struct sched_class *prev_class,
@@ -1010,7 +1014,7 @@ static inline void check_class_changed(s
if (prev_class != p->sched_class) {
if (prev_class->switched_from)
prev_class->switched_from(rq, p);
- /* Possble rq->lock 'hole'. */
+
p->sched_class->switched_to(rq, p);
} else if (oldprio != p->prio || dl_task(p))
p->sched_class->prio_changed(rq, p, oldprio);
@@ -1491,8 +1495,12 @@ ttwu_do_wakeup(struct rq *rq, struct tas

p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
- if (p->sched_class->task_woken)
+ if (p->sched_class->task_woken) {
+ /*
+ * XXX can drop rq->lock; most likely ok.
+ */
p->sched_class->task_woken(rq, p);
+ }

if (rq->idle_stamp) {
u64 delta = rq_clock(rq) - rq->idle_stamp;
@@ -3094,7 +3102,11 @@ void rt_mutex_setprio(struct task_struct

check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
+ preempt_disable(); /* avoid rq from going away on us */
__task_rq_unlock(rq);
+
+ balance_callback(rq);
+ preempt_enable();
}
#endif

@@ -3655,11 +3667,18 @@ static int __sched_setscheduler(struct t
}

check_class_changed(rq, p, prev_class, oldprio);
+ preempt_disable(); /* avoid rq from going away on us */
task_rq_unlock(rq, p, &flags);

if (pi)
rt_mutex_adjust_pi(p);

+ /*
+ * Run balance callbacks after we've adjusted the PI chain.
+ */
+ balance_callback(rq);
+ preempt_enable();
+
return 0;
}



2015-06-02 13:58:55

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

01.06.2015, 17:13, "Peter Zijlstra" <[email protected]>:
> In order to remove dropping rq->lock from the
> switched_{to,from}()/prio_changed() sched_class methods, run the
> balance callbacks after it.
>
> We need to remove dropping rq->lock because its buggy,
> suppose using sched_setattr()/sched_setscheduler() to change a running
> task from FIFO to OTHER.
>
> By the time we get to switched_from_rt() the task is already enqueued
> on the cfs runqueues. If switched_from_rt() does pull_rt_task() and
> drops rq->lock, load-balancing can come in and move our task @p to
> another rq.
>
> The subsequent switched_to_fair() still assumes @p is on @rq and bad
> things will happen.
>
> By using balance callbacks we delay the load-balancing operations
> {rt,dl}x{push,pull} until we've done all the important work and the
> task is fully set up.
>
> Furthermore, the balance callbacks do not know about @p, therefore
> they cannot get confused like this.
>
> Reported-by: Mike Galbraith <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> ?kernel/sched/core.c | ??25 ++++++++++++++++++++++---
> ?1 file changed, 22 insertions(+), 3 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1001,7 +1001,11 @@ inline int task_curr(const struct task_s
> ?}
>
> ?/*
> - * Can drop rq->lock because from sched_class::switched_from() methods drop it.
> + * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
> + * use the balance_callback list if you want balancing.
> + *
> + * this means any call to check_class_changed() must be followed by a call to
> + * balance_callback().
> ??*/
> ?static inline void check_class_changed(struct rq *rq, struct task_struct *p,
> ????????????????????????????????????????const struct sched_class *prev_class,
> @@ -1010,7 +1014,7 @@ static inline void check_class_changed(s
> ?????????if (prev_class != p->sched_class) {
> ?????????????????if (prev_class->switched_from)
> ?????????????????????????prev_class->switched_from(rq, p);
> - /* Possble rq->lock 'hole'. ?*/
> +

But switched_from_dl()->cancel_dl_timer() still unlocks rq->lock.

It seems we should drop it (cancel_dl_timer) and move hrtimer_cancel()
from switched_from_dl() to finish_task_switch(). It will be executed
for all classes and completely take the functionality we implement
cancel_dl_timer() for.

> ?????????????????p->sched_class->switched_to(rq, p);
> ?????????} else if (oldprio != p->prio || dl_task(p))
> ?????????????????p->sched_class->prio_changed(rq, p, oldprio);

2015-06-02 14:27:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

On Tue, 2015-06-02 at 16:58 +0300, Kirill Tkhai wrote:
> 01.06.2015, 17:13, "Peter Zijlstra" <[email protected]>:

> > @@ -1010,7 +1014,7 @@ static inline void check_class_changed(s
> > if (prev_class != p->sched_class) {
> > if (prev_class->switched_from)
> > prev_class->switched_from(rq, p);
> > - /* Possble rq->lock 'hole'. */
> > +
>
> But switched_from_dl()->cancel_dl_timer() still unlocks rq->lock.
>
> It seems we should drop it (cancel_dl_timer) and move hrtimer_cancel()
> from switched_from_dl() to finish_task_switch(). It will be executed
> for all classes and completely take the functionality we implement
> cancel_dl_timer() for.

*groan* yes.. I don't like moving it into generic code though.

more thinking required.

> > p->sched_class->switched_to(rq, p);
> > } else if (oldprio != p->prio || dl_task(p))
> > p->sched_class->prio_changed(rq, p, oldprio);

2015-06-02 16:07:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

On Tue, Jun 02, 2015 at 04:27:10PM +0200, Peter Zijlstra wrote:
> On Tue, 2015-06-02 at 16:58 +0300, Kirill Tkhai wrote:
> > 01.06.2015, 17:13, "Peter Zijlstra" <[email protected]>:
>
> > > @@ -1010,7 +1014,7 @@ static inline void check_class_changed(s
> > > if (prev_class != p->sched_class) {
> > > if (prev_class->switched_from)
> > > prev_class->switched_from(rq, p);
> > > - /* Possble rq->lock 'hole'. */
> > > +
> >
> > But switched_from_dl()->cancel_dl_timer() still unlocks rq->lock.
> >
> > It seems we should drop it (cancel_dl_timer) and move hrtimer_cancel()
> > from switched_from_dl() to finish_task_switch(). It will be executed
> > for all classes and completely take the functionality we implement
> > cancel_dl_timer() for.
>
> *groan* yes.. I don't like moving it into generic code though.
>
> more thinking required.

How about something like so then?

---
Subject: sched,deadline: Fix sched class hopping CBS hole

We still have a few pending issues with the deadline code, one of which
is that switching between scheduling classes can 'leak' CBS state.

Close the hole by retaining the current CBS state when leaving
SCHED_DEADLINE and unconditionally programming the deadline timer.

The timer will then reset the CBS state if the task is still
!SCHED_DEADLINE by the time it hits.

If the task were to die, task_dead_dl() will cancel the timer, so no
lingering state.

Cc: Luca Abeni <[email protected]>
Cc: Juri Lelli <[email protected]>
Fixes: 40767b0dc768 ("sched/deadline: Fix deadline parameter modification handling")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/deadline.c | 71 ++++++++++++++++++++-----------------------------
1 file changed, 29 insertions(+), 42 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7336fe5fea30..625ed74f1467 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -572,20 +572,26 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
rq = task_rq_lock(p, &flags);

/*
- * We need to take care of several possible races here:
- *
- * - the task might have changed its scheduling policy
- * to something different than SCHED_DEADLINE
- * - the task might have changed its reservation parameters
- * (through sched_setattr())
- * - the task might have been boosted by someone else and
- * might be in the boosting/deboosting path
- *
- * In all this cases we bail out, as the task is already
- * in the runqueue or is going to be enqueued back anyway.
+ * The task might have changed its scheduling policy to something
+ * different than SCHED_DEADLINE (through switched_fromd_dl()).
*/
- if (!dl_task(p) || dl_se->dl_new ||
- dl_se->dl_boosted || !dl_se->dl_throttled)
+ if (!dl_task(p)) {
+ __dl_clear_params(p);
+ goto unlock;
+ }
+
+ /*
+ * There's only two ways to have dl_new set; 1) __sched_fork(), and a
+ * fresh task should not have a pending timer, and 2) the above, which
+ * does not get here.
+ */
+ WARN_ON_ONCE(dl_se->dl_new);
+
+ /*
+ * The task might have been boosted by someone else and might be in the
+ * boosting/deboosting path, its not throttled.
+ */
+ if (dl_se->dl_boosted || !dl_se->dl_throttled)
goto unlock;

sched_clock_tick();
@@ -1683,37 +1689,18 @@ void __init init_sched_dl_class(void)

#endif /* CONFIG_SMP */

-/*
- * Ensure p's dl_timer is cancelled. May drop rq->lock for a while.
- */
-static void cancel_dl_timer(struct rq *rq, struct task_struct *p)
-{
- struct hrtimer *dl_timer = &p->dl.dl_timer;
-
- /* Nobody will change task's class if pi_lock is held */
- lockdep_assert_held(&p->pi_lock);
-
- if (hrtimer_active(dl_timer)) {
- int ret = hrtimer_try_to_cancel(dl_timer);
-
- if (unlikely(ret == -1)) {
- /*
- * Note, p may migrate OR new deadline tasks
- * may appear in rq when we are unlocking it.
- * A caller of us must be fine with that.
- */
- raw_spin_unlock(&rq->lock);
- hrtimer_cancel(dl_timer);
- raw_spin_lock(&rq->lock);
- }
- }
-}
-
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
- /* XXX we should retain the bw until 0-lag */
- cancel_dl_timer(rq, p);
- __dl_clear_params(p);
+ /*
+ * Start the deadline timer; if we switch back to dl before this we'll
+ * continue consuming our current CBS slice. If we stay outside of
+ * SCHED_DEADLINE until the deadline passes, the timer will reset the
+ * task.
+ *
+ * task_dead_dl() will cancel our timer if we happen to die while
+ * its still pending.
+ */
+ start_dl_timer(&p->dl, false);

/*
* Since this might be the only -deadline task on the rq,

2015-06-02 16:27:27

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

? ??, 02/06/2015 ? 18:07 +0200, Peter Zijlstra ?????:
On Tue, Jun 02, 2015 at 04:27:10PM +0200, Peter Zijlstra wrote:
> > On Tue, 2015-06-02 at 16:58 +0300, Kirill Tkhai wrote:
> > > 01.06.2015, 17:13, "Peter Zijlstra" <[email protected]>:
> >
> > > > @@ -1010,7 +1014,7 @@ static inline void check_class_changed(s
> > > > if (prev_class != p->sched_class) {
> > > > if (prev_class->switched_from)
> > > > prev_class->switched_from(rq, p);
> > > > - /* Possble rq->lock 'hole'. */
> > > > +
> > >
> > > But switched_from_dl()->cancel_dl_timer() still unlocks rq->lock.
> > >
> > > It seems we should drop it (cancel_dl_timer) and move hrtimer_cancel()
> > > from switched_from_dl() to finish_task_switch(). It will be executed
> > > for all classes and completely take the functionality we implement
> > > cancel_dl_timer() for.
> >
> > *groan* yes.. I don't like moving it into generic code though.
> >
> > more thinking required.
>
> How about something like so then?
>
> ---
> Subject: sched,deadline: Fix sched class hopping CBS hole
>
> We still have a few pending issues with the deadline code, one of which
> is that switching between scheduling classes can 'leak' CBS state.
>
> Close the hole by retaining the current CBS state when leaving
> SCHED_DEADLINE and unconditionally programming the deadline timer.
>
> The timer will then reset the CBS state if the task is still
> !SCHED_DEADLINE by the time it hits.
>
> If the task were to die, task_dead_dl() will cancel the timer, so no
> lingering state.
>
> Cc: Luca Abeni <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Fixes: 40767b0dc768 ("sched/deadline: Fix deadline parameter modification handling")
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/deadline.c | 71 ++++++++++++++++++++-----------------------------
> 1 file changed, 29 insertions(+), 42 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 7336fe5fea30..625ed74f1467 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -572,20 +572,26 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> rq = task_rq_lock(p, &flags);
>
> /*
> - * We need to take care of several possible races here:
> - *
> - * - the task might have changed its scheduling policy
> - * to something different than SCHED_DEADLINE
> - * - the task might have changed its reservation parameters
> - * (through sched_setattr())
> - * - the task might have been boosted by someone else and
> - * might be in the boosting/deboosting path
> - *
> - * In all this cases we bail out, as the task is already
> - * in the runqueue or is going to be enqueued back anyway.
> + * The task might have changed its scheduling policy to something
> + * different than SCHED_DEADLINE (through switched_fromd_dl()).
> */
> - if (!dl_task(p) || dl_se->dl_new ||
> - dl_se->dl_boosted || !dl_se->dl_throttled)
> + if (!dl_task(p)) {
> + __dl_clear_params(p);
> + goto unlock;
> + }
> +
> + /*
> + * There's only two ways to have dl_new set; 1) __sched_fork(), and a
> + * fresh task should not have a pending timer, and 2) the above, which
> + * does not get here.
> + */
> + WARN_ON_ONCE(dl_se->dl_new);
> +
> + /*
> + * The task might have been boosted by someone else and might be in the
> + * boosting/deboosting path, its not throttled.
> + */
> + if (dl_se->dl_boosted || !dl_se->dl_throttled)
> goto unlock;
>
> sched_clock_tick();
> @@ -1683,37 +1689,18 @@ void __init init_sched_dl_class(void)
>
> #endif /* CONFIG_SMP */
>
> -/*
> - * Ensure p's dl_timer is cancelled. May drop rq->lock for a while.
> - */
> -static void cancel_dl_timer(struct rq *rq, struct task_struct *p)
> -{
> - struct hrtimer *dl_timer = &p->dl.dl_timer;
> -
> - /* Nobody will change task's class if pi_lock is held */
> - lockdep_assert_held(&p->pi_lock);
> -
> - if (hrtimer_active(dl_timer)) {
> - int ret = hrtimer_try_to_cancel(dl_timer);
> -
> - if (unlikely(ret == -1)) {
> - /*
> - * Note, p may migrate OR new deadline tasks
> - * may appear in rq when we are unlocking it.
> - * A caller of us must be fine with that.
> - */
> - raw_spin_unlock(&rq->lock);
> - hrtimer_cancel(dl_timer);
> - raw_spin_lock(&rq->lock);
> - }
> - }
> -}
> -
> static void switched_from_dl(struct rq *rq, struct task_struct *p)
> {
> - /* XXX we should retain the bw until 0-lag */
> - cancel_dl_timer(rq, p);
> - __dl_clear_params(p);
> + /*
> + * Start the deadline timer; if we switch back to dl before this we'll
> + * continue consuming our current CBS slice. If we stay outside of
> + * SCHED_DEADLINE until the deadline passes, the timer will reset the
> + * task.
> + *
> + * task_dead_dl() will cancel our timer if we happen to die while
> + * its still pending.

task_dead_dl() is called for tasks of deadline class only. So if we do that,
the timer may be executed after final task's dead.

> + */
> + start_dl_timer(&p->dl, false);
>
> /*
> * Since this might be the only -deadline task on the rq,
>

2015-06-03 07:32:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

On Tue, Jun 02, 2015 at 07:27:19PM +0300, Kirill Tkhai wrote:
> > + * task_dead_dl() will cancel our timer if we happen to die while
> > + * its still pending.
>
> task_dead_dl() is called for tasks of deadline class only. So if we do that,
> the timer may be executed after final task's dead.

Indeed; sleep deprived brain misses the obvious :/

I can't seem to come up with anything much better than pulling that
hrtimer_cancel() into finish_task_switch(), however sad that is.

Something like:

for_each_class(class) {
if (class->task_dead)
class->task_dead(prev);
}

Would be nicest, but will slow down the common case. And a callback list
has the downside of having to preallocate entries for every task,
causing mostly pointless memory overhead.

2015-06-03 10:40:46

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

В Ср, 03/06/2015 в 09:32 +0200, Peter Zijlstra пишет:
> On Tue, Jun 02, 2015 at 07:27:19PM +0300, Kirill Tkhai wrote:
> > > + * task_dead_dl() will cancel our timer if we happen to die while
> > > + * its still pending.
> >
> > task_dead_dl() is called for tasks of deadline class only. So if we do that,
> > the timer may be executed after final task's dead.
>
> Indeed; sleep deprived brain misses the obvious :/
>
> I can't seem to come up with anything much better than pulling that
> hrtimer_cancel() into finish_task_switch(), however sad that is.

Yeah, class-specific manipulation in generic function finish_task_switch()
worsen modularity.

I have an idea, but it require small hrtimer modification. We may use
task_struct counters {get,put}_task_struct(), when we're starting or cancelling
the timer, and in timer handler. But it require to return back the commit:

commit 61699e13072a89880aa584dcc64c6da465fb2ccc
Author: Thomas Gleixner <[email protected]>
Date: Tue Apr 14 21:09:23 2015 +0000

hrtimer: Remove hrtimer_start() return value

because restart of the timer races with time handler.

Also, it's not clearly for me if it's safe to do the final put_task_struct()
from irq context and how it worsens life of RT people.

> Something like:
>
> for_each_class(class) {
> if (class->task_dead)
> class->task_dead(prev);
> }
>
> Would be nicest, but will slow down the common case. And a callback list
> has the downside of having to preallocate entries for every task,
> causing mostly pointless memory overhead.
>
>

2015-06-03 10:42:54

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

[resend]

03.06.2015, 10:32, "Peter Zijlstra" <[email protected]>:
> On Tue, Jun 02, 2015 at 07:27:19PM +0300, Kirill Tkhai wrote:
>>> ?+ * task_dead_dl() will cancel our timer if we happen to die while
>>> ?+ * its still pending.
>> ?task_dead_dl() is called for tasks of deadline class only. So if we do that,
>> ?the timer may be executed after final task's dead.
>
> Indeed; sleep deprived brain misses the obvious :/

I have an idea, but it require small hrtimer modification. We may use
task_struct counters {get,put}_task_struct(), when we're starting or cancelling
the timer, and in timer handler. But it require to return back the commit:

commit 61699e13072a89880aa584dcc64c6da465fb2ccc
Author: Thomas Gleixner <[email protected]>
Date: Tue Apr 14 21:09:23 2015 +0000

hrtimer: Remove hrtimer_start() return value

because restart of the timer races with time handler.

Also, it's not clearly for me if it's safe to do the final put_task_struct()
from irq context and how it worsens life of RT people.

> I can't seem to come up with anything much better than pulling that
> hrtimer_cancel() into finish_task_switch(), however sad that is.
>
> Something like:
>
> ??for_each_class(class) {
> ????????if (class->task_dead)
> ????????????????class->task_dead(prev);
> ??}
>
> Would be nicest, but will slow down the common case. And a callback list
> has the downside of having to preallocate entries for every task,
> causing mostly pointless memory overhead.

2015-06-03 10:45:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/7] sched: Allow balance callbacks for check_class_changed()

On Wed, Jun 03, 2015 at 01:42:41PM +0300, Kirill Tkhai wrote:
> [resend]
>
> 03.06.2015, 10:32, "Peter Zijlstra" <[email protected]>:
> > On Tue, Jun 02, 2015 at 07:27:19PM +0300, Kirill Tkhai wrote:
> >>> ?+ * task_dead_dl() will cancel our timer if we happen to die while
> >>> ?+ * its still pending.
> >> ?task_dead_dl() is called for tasks of deadline class only. So if we do that,
> >> ?the timer may be executed after final task's dead.
> >
> > Indeed; sleep deprived brain misses the obvious :/
>
> I have an idea, but it require small hrtimer modification. We may use
> task_struct counters {get,put}_task_struct(), when we're starting or cancelling
> the timer, and in timer handler. But it require to return back the commit:

Yeah, already working on that. :-)