2020-10-05 15:14:58

by Peter Zijlstra

Subject: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task, teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).

This adds migration interference to the higher priority task, but
restores bandwidth to the system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.

This way we can still approximate running the M highest priority tasks
on the system.

Migrating the top task away is (of course) still subject to
migrate_disable() too, which means the lower priority task is subject to
an interference equivalent to the worst case migrate_disable() section.
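
For illustration only (not part of this patch), a minimal sketch of such a
migrate_disable() section; the per-CPU counter is hypothetical, but the
structure is the typical use:

	static DEFINE_PER_CPU(unsigned long, example_count);	/* made up */

	static void example_section(void)
	{
		migrate_disable();
		/*
		 * Preemption stays enabled here, so a higher priority task
		 * can run right away; but this task cannot be moved to
		 * another CPU until migrate_enable(), even if it gets
		 * preempted and a CPU frees up elsewhere.
		 */
		this_cpu_inc(example_count);
		migrate_enable();
	}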

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/preempt.h | 38 +++++++++++++++------------
include/linux/sched.h | 3 +-
kernel/sched/core.c | 67 ++++++++++++++++++++++++++++++++++++++++--------
kernel/sched/deadline.c | 26 +++++++++++++-----
kernel/sched/rt.c | 63 ++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 32 ++++++++++++++++++++++
6 files changed, 182 insertions(+), 47 deletions(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -325,24 +325,28 @@ static inline void preempt_notifier_init
#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)

/*
- * Migrate-Disable and why it is (strongly) undesired.
+ * Migrate-Disable and why it is undesired.
*
- * The premise of the Real-Time schedulers we have on Linux
- * (SCHED_FIFO/SCHED_DEADLINE) is that M CPUs can/will run M tasks
- * concurrently, provided there are sufficient runnable tasks, also known as
- * work-conserving. For instance SCHED_DEADLINE tries to schedule the M
- * earliest deadline threads, and SCHED_FIFO the M highest priority threads.
- *
- * The correctness of various scheduling models depends on this, but is it
- * broken by migrate_disable() that doesn't imply preempt_disable(). Where
- * preempt_disable() implies an immediate priority ceiling, preemptible
- * migrate_disable() allows nesting.
- *
- * The worst case is that all tasks preempt one another in a migrate_disable()
- * region and stack on a single CPU. This then reduces the available bandwidth
- * to a single CPU. And since Real-Time schedulability theory considers the
- * Worst-Case only, all Real-Time analysis shall revert to single-CPU
- * (instantly solving the SMP analysis problem).
+ * When a preempted task becomes eligible to run under the ideal model (IOW it
+ * becomes one of the M highest priority tasks), it might still have to wait
+ * for the preemptee's migrate_disable() section to complete, thereby suffering
+ * a reduction in bandwidth for the exact duration of the migrate_disable()
+ * section.
+ *
+ * Per this argument, the change from preempt_disable() to migrate_disable()
+ * gets us:
+ *
+ * - a higher priority task gains reduced wake-up latency; with preempt_disable()
+ * it would have had to wait for the lower priority task.
+ *
+ * - a lower priority task, which under preempt_disable() could've instantly
+ * migrated away when another CPU becomes available, is now constrained
+ * by the ability to push the higher priority task away, which might itself be
+ * in a migrate_disable() section, reducing its available bandwidth.
+ *
+ * IOW it trades latency / moves the interference term, but it stays in the
+ * system, and as long as it remains unbounded, the system is not fully
+ * deterministic.
*
*
* The reason we have it anyway.
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -715,8 +715,9 @@ struct task_struct {
cpumask_t cpus_mask;
void *migration_pending;
#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
- int migration_disabled;
+ unsigned short migration_disabled;
#endif
+ unsigned short migration_flags;

#ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1767,11 +1767,6 @@ void migrate_enable(void)
}
EXPORT_SYMBOL_GPL(migrate_enable);

-static inline bool is_migration_disabled(struct task_struct *p)
-{
- return p->migration_disabled;
-}
-
static inline bool rq_has_pinned_tasks(struct rq *rq)
{
return rq->nr_pinned;
@@ -1917,6 +1912,49 @@ static int migration_cpu_stop(void *data
return 0;
}

+int push_cpu_stop(void *arg)
+{
+ struct rq *lowest_rq = NULL, *rq = this_rq();
+ struct task_struct *p = arg;
+
+ raw_spin_lock_irq(&p->pi_lock);
+ raw_spin_lock(&rq->lock);
+
+ if (task_rq(p) != rq)
+ goto out_unlock;
+
+ if (is_migration_disabled(p)) {
+ p->migration_flags |= MDF_PUSH;
+ goto out_unlock;
+ }
+
+ p->migration_flags &= ~MDF_PUSH;
+
+ if (p->sched_class->find_lock_rq)
+ lowest_rq = p->sched_class->find_lock_rq(p, rq);
+
+ if (!lowest_rq)
+ goto out_unlock;
+
+ // XXX validate p is still the highest prio task
+ if (task_rq(p) == rq) {
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, lowest_rq->cpu);
+ activate_task(lowest_rq, p, 0);
+ resched_curr(lowest_rq);
+ }
+
+ double_unlock_balance(rq, lowest_rq);
+
+out_unlock:
+ rq->push_busy = false;
+ raw_spin_unlock(&rq->lock);
+ raw_spin_unlock_irq(&p->pi_lock);
+
+ put_task_struct(p);
+ return 0;
+}
+
/*
* sched_class::set_cpus_allowed must do the below, but is not required to
* actually call this function.
@@ -2001,6 +2039,14 @@ static int affine_move_task(struct rq *rq, stru

/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+ struct task_struct *push_task = NULL;
+
+ if ((flags & SCA_MIGRATE_ENABLE) &&
+ (p->migration_flags & MDF_PUSH) && !rq->push_busy) {
+ rq->push_busy = true;
+ push_task = get_task_struct(p);
+ }
+
pending = p->migration_pending;
if (pending) {
p->migration_pending = NULL;
@@ -2008,6 +2054,11 @@ static int affine_move_task(struct rq *rq, stru
}
task_rq_unlock(rq, p, rf);

+ if (push_task) {
+ stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
+ p, &rq->push_work);
+ }
+
if (complete)
goto do_complete;

@@ -2045,6 +2096,7 @@ static int affine_move_task(struct rq *rq, stru

if (flags & SCA_MIGRATE_ENABLE) {

+ p->migration_flags &= ~MDF_PUSH;
task_rq_unlock(rq, p, rf);
pending->arg = arg;
stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
@@ -2657,11 +2709,6 @@ static inline int __set_cpus_allowed_ptr

static inline void migrate_disable_switch(struct rq *rq, struct task_struct *p) { }

-static inline bool is_migration_disabled(struct task_struct *p)
-{
- return false;
-}
-
static inline bool rq_has_pinned_tasks(struct rq *rq)
{
return false;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2182,7 +2182,7 @@ static void push_dl_tasks(struct rq *rq)
static void pull_dl_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, cpu;
- struct task_struct *p;
+ struct task_struct *p, *push_task;
bool resched = false;
struct rq *src_rq;
u64 dmin = LONG_MAX;
@@ -2212,6 +2212,7 @@ static void pull_dl_task(struct rq *this
continue;

/* Might drop this_rq->lock */
+ push_task = NULL;
double_lock_balance(this_rq, src_rq);

/*
@@ -2243,17 +2244,27 @@ static void pull_dl_task(struct rq *this
src_rq->curr->dl.deadline))
goto skip;

- resched = true;
-
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
- dmin = p->dl.deadline;
+ if (is_migration_disabled(p)) {
+ push_task = get_push_task(src_rq);
+ } else {
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, this_cpu);
+ activate_task(this_rq, p, 0);
+ dmin = p->dl.deadline;
+ resched = true;
+ }

/* Is there any other task even earlier? */
}
skip:
double_unlock_balance(this_rq, src_rq);
+
+ if (push_task) {
+ raw_spin_unlock(&this_rq->lock);
+ stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
+ push_task, &src_rq->push_work);
+ raw_spin_lock(&this_rq->lock);
+ }
}

if (resched)
@@ -2497,6 +2508,7 @@ const struct sched_class dl_sched_class
.rq_online = rq_online_dl,
.rq_offline = rq_offline_dl,
.task_woken = task_woken_dl,
+ .find_lock_rq = find_lock_later_rq,
#endif

.task_tick = task_tick_dl,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1859,7 +1859,7 @@ static struct task_struct *pick_next_pus
* running task can migrate over to a CPU that is running a task
* of lesser priority.
*/
-static int push_rt_task(struct rq *rq)
+static int push_rt_task(struct rq *rq, bool pull)
{
struct task_struct *next_task;
struct rq *lowest_rq;
@@ -1873,6 +1873,34 @@ static int push_rt_task(struct rq *rq)
return 0;

retry:
+ if (is_migration_disabled(next_task)) {
+ struct task_struct *push_task = NULL;
+ int cpu;
+
+ if (!pull || rq->push_busy)
+ return 0;
+
+ cpu = find_lowest_rq(rq->curr);
+ if (cpu == -1 || cpu == rq->cpu)
+ return 0;
+
+ /*
+ * Given we found a CPU with lower priority than @next_task,
+ * therefore it should be running. However we cannot migrate it
+ * to this other CPU, instead attempt to push the current
+ * running task on this CPU away.
+ */
+ push_task = get_push_task(rq);
+ if (push_task) {
+ raw_spin_unlock(&rq->lock);
+ stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
+ push_task, &rq->push_work);
+ raw_spin_lock(&rq->lock);
+ }
+
+ return 0;
+ }
+
if (WARN_ON(next_task == rq->curr))
return 0;

@@ -1927,12 +1955,10 @@ static int push_rt_task(struct rq *rq)
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
- ret = 1;
-
resched_curr(lowest_rq);
+ ret = 1;

double_unlock_balance(rq, lowest_rq);
-
out:
put_task_struct(next_task);

@@ -1942,7 +1968,7 @@ static int push_rt_task(struct rq *rq)
static void push_rt_tasks(struct rq *rq)
{
/* push_rt_task will return true if it moved an RT */
- while (push_rt_task(rq))
+ while (push_rt_task(rq, false))
;
}

@@ -2095,7 +2121,8 @@ void rto_push_irq_work_func(struct irq_w
*/
if (has_pushable_tasks(rq)) {
raw_spin_lock(&rq->lock);
- push_rt_tasks(rq);
+ while (push_rt_task(rq, true))
+ ;
raw_spin_unlock(&rq->lock);
}

@@ -2120,7 +2147,7 @@ static void pull_rt_task(struct rq *this
{
int this_cpu = this_rq->cpu, cpu;
bool resched = false;
- struct task_struct *p;
+ struct task_struct *p, *push_task;
struct rq *src_rq;
int rt_overload_count = rt_overloaded(this_rq);

@@ -2167,6 +2194,7 @@ static void pull_rt_task(struct rq *this
* double_lock_balance, and another CPU could
* alter this_rq
*/
+ push_task = NULL;
double_lock_balance(this_rq, src_rq);

/*
@@ -2194,11 +2222,14 @@ static void pull_rt_task(struct rq *this
if (p->prio < src_rq->curr->prio)
goto skip;

- resched = true;
-
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ if (is_migration_disabled(p)) {
+ push_task = get_push_task(src_rq);
+ } else {
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, this_cpu);
+ activate_task(this_rq, p, 0);
+ resched = true;
+ }
/*
* We continue with the search, just in
* case there's an even higher prio task
@@ -2208,6 +2239,13 @@ static void pull_rt_task(struct rq *this
}
skip:
double_unlock_balance(this_rq, src_rq);
+
+ if (push_task) {
+ raw_spin_unlock(&this_rq->lock);
+ stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
+ push_task, &src_rq->push_work);
+ raw_spin_lock(&this_rq->lock);
+ }
}

if (resched)
@@ -2446,6 +2484,7 @@ const struct sched_class rt_sched_class
.rq_offline = rq_offline_rt,
.task_woken = task_woken_rt,
.switched_from = switched_from_rt,
+ .find_lock_rq = find_lock_lowest_rq,
#endif

.task_tick = task_tick_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,6 +1057,8 @@ struct rq {
#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
unsigned int nr_pinned;
#endif
+ unsigned int push_busy;
+ struct cpu_stop_work push_work;
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1084,6 +1086,16 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#define MDF_PUSH 0x01
+
+static inline bool is_migration_disabled(struct task_struct *p)
+{
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
+ return p->migration_disabled;
+#else
+ return false;
+#endif
+}

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
@@ -1816,6 +1828,8 @@ struct sched_class {

void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
+
+ struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif

void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
@@ -1911,6 +1925,24 @@ extern void trigger_load_balance(struct

extern void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask, u32 flags);

+static inline struct task_struct *get_push_task(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ lockdep_assert_held(&rq->lock);
+
+ if (rq->push_busy)
+ return NULL;
+
+ if (p->nr_cpus_allowed == 1)
+ return NULL;
+
+ rq->push_busy = true;
+ return get_task_struct(p);
+}
+
+extern int push_cpu_stop(void *arg);
+
#endif

#ifdef CONFIG_CPU_IDLE



2020-10-06 08:03:51

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Mon, Oct 05, 2020 at 04:57:32PM +0200, Peter Zijlstra wrote:
> +static inline struct task_struct *get_push_task(struct rq *rq)
> +{
> + struct task_struct *p = rq->curr;
> +
> + lockdep_assert_held(&rq->lock);
> +
> + if (rq->push_busy)
> + return NULL;
> +
> + if (p->nr_cpus_allowed == 1)
> + return NULL;

This; that means that when we're stuck below a per-cpu thread, we're
toast. There's just nothing much you can do... :/

> +
> + rq->push_busy = true;
> + return get_task_struct(p);
> +}

2020-10-06 11:23:04

by Valentin Schneider

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing


On 05/10/20 15:57, Peter Zijlstra wrote:
> In order to minimize the interference of migrate_disable() on lower
> priority tasks, which can be deprived of runtime due to being stuck
> below a higher priority task. Teach the RT/DL balancers to push away
> these higher priority tasks when a lower priority task gets selected
> to run on a freshly demoted CPU (pull).
>
> This adds migration interference to the higher priority task, but
> restores bandwidth to system that would otherwise be irrevocably lost.
> Without this it would be possible to have all tasks on the system
> stuck on a single CPU, each task preempted in a migrate_disable()
> section with a single high priority task running.
>
> This way we can still approximate running the M highest priority tasks
> on the system.
>

Ah, so IIUC that's the important bit that means we can't just go through
the pushable_tasks list and skip migrate_disable() tasks.

Once the highest-prio task exits its migrate_disable() region, your patch
pushes it away. If we ended up with a single busy CPU, it'll spread the
tasks around one migrate_enable() at a time.
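
Condensing the relevant hunks as I read them (not a literal excerpt, and
IIUC migrate_enable() lands in affine_move_task() via
__set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE)), the deferred push is roughly:

	/* push_cpu_stop(): top task still migrate-disabled, leave a note */
	if (is_migration_disabled(p)) {
		p->migration_flags |= MDF_PUSH;
		goto out_unlock;
	}

	/* later, in affine_move_task(), the note re-arms the push */
	if ((flags & SCA_MIGRATE_ENABLE) &&
	    (p->migration_flags & MDF_PUSH) && !rq->push_busy) {
		rq->push_busy = true;
		push_task = get_task_struct(p);
	}
	...
	stop_one_cpu_nowait(rq->cpu, push_cpu_stop, p, &rq->push_work);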

That time where the top task is migrate_disable() is still a crappy time,
and as you pointed out earlier today if it is a genuine pcpu task then the
whole thing is -EBORKED...

An alternative I could see would be to prevent those piles from forming
altogether, say by issuing a similar push_cpu_stop() on migrate_disable()
if the next pushable task is already migrate_disable(); but that's a
proactive approach whereas yours is reactive, so I'm pretty sure that's
bound to perform worse.

> Migrating the top task away is (ofcourse) still subject to
> migrate_disable() too, which means the lower task is subject to an
> interference equivalent to the worst case migrate_disable() section.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

2020-10-06 13:46:15

by Steven Rostedt

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Tue, 6 Oct 2020 09:59:39 +0200
Peter Zijlstra <[email protected]> wrote:

> On Mon, Oct 05, 2020 at 04:57:32PM +0200, Peter Zijlstra wrote:
> > +static inline struct task_struct *get_push_task(struct rq *rq)
> > +{
> > + struct task_struct *p = rq->curr;
> > +
> > + lockdep_assert_held(&rq->lock);
> > +
> > + if (rq->push_busy)
> > + return NULL;
> > +
> > + if (p->nr_cpus_allowed == 1)
> > + return NULL;
>
> This; that means what when we're stuck below a per-cpu thread, we're
> toast. There's just nothing much you can do... :/

Well, hopefully, per CPU threads don't run for long periods of time. I'm
working with folks having issues with non-stop RT threads that every so
often go into the kernel and kick off per CPU kernel threads; those
threads then get starved when the RT tasks go back to user space, causing
the rest of the system to hang.

As I've always said. When dealing with real-time systems, you need to be
careful about how you organize your tasks. Ideally, any RT task that is
pinned to a CPU shouldn't be sharing that CPU with anything else that may
be critical.

-- Steve


>
> > +
> > + rq->push_busy = true;
> > + return get_task_struct(p);
> > +}

2020-10-06 13:53:27

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Tue, Oct 06, 2020 at 12:20:43PM +0100, Valentin Schneider wrote:
>
> On 05/10/20 15:57, Peter Zijlstra wrote:
> > In order to minimize the interference of migrate_disable() on lower
> > priority tasks, which can be deprived of runtime due to being stuck
> > below a higher priority task. Teach the RT/DL balancers to push away
> > these higher priority tasks when a lower priority task gets selected
> > to run on a freshly demoted CPU (pull).
> >
> > This adds migration interference to the higher priority task, but
> > restores bandwidth to system that would otherwise be irrevocably lost.
> > Without this it would be possible to have all tasks on the system
> > stuck on a single CPU, each task preempted in a migrate_disable()
> > section with a single high priority task running.
> >
> > This way we can still approximate running the M highest priority tasks
> > on the system.
> >
>
> Ah, so IIUC that's the important bit that makes it we can't just say go
> through the pushable_tasks list and skip migrate_disable() tasks.
>
> Once the highest-prio task exits its migrate_disable() region, your patch
> pushes it away. If we ended up with a single busy CPU, it'll spread the
> tasks around one migrate_enable() at a time.
>
> That time where the top task is migrate_disable() is still a crappy time,
> and as you pointed out earlier today if it is a genuine pcpu task then the
> whole thing is -EBORKED...
>
> An alternative I could see would be to prevent those piles from forming
> altogether, say by issuing a similar push_cpu_stop() on migrate_disable()
> if the next pushable task is already migrate_disable(); but that's a
> proactive approach whereas yours is reactive, so I'm pretty sure that's
> bound to perform worse.

I think it is always possible to form pileups. Just start enough tasks
such that newer, higher priority, tasks have to preempt existing tasks.

Also, we might not be able to place the task elsewhere; suppose we have
all our M CPUs filled with an RT task, and then, while the lowest priority
task is in migrate_disable(), we wake the highest priority task.

Per the SMP invariant, this new highest priority task must preempt the
lowest priority task currently running, otherwise we would not be
running the M highest prio tasks.

That's not to say it might not still be beneficial to try to avoid
them, but we must assume a pileup will occur, therefore my focus was on
dealing with them as best we can first.

2020-10-06 13:53:37

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Tue, Oct 06, 2020 at 09:44:45AM -0400, Steven Rostedt wrote:
> As I've always said. When dealing with real-time systems, you need to be
> careful about how you organize your tasks. Ideally, any RT task that is
> pinned to a CPU shouldn't be sharing that CPU with anything else that may
> be critical.

Due to locks we're always sharing.. :-)

2020-10-06 14:41:23

by Juri Lelli

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On 06/10/20 15:48, Peter Zijlstra wrote:
> On Tue, Oct 06, 2020 at 12:20:43PM +0100, Valentin Schneider wrote:
> >
> > On 05/10/20 15:57, Peter Zijlstra wrote:
> > > In order to minimize the interference of migrate_disable() on lower
> > > priority tasks, which can be deprived of runtime due to being stuck
> > > below a higher priority task. Teach the RT/DL balancers to push away
> > > these higher priority tasks when a lower priority task gets selected
> > > to run on a freshly demoted CPU (pull).

Still digesting the whole lot, but can't we "simply" force push the
highest prio (that we preempt to make space for the migrate_disabled
lower prio) directly to the cpu that would accept the lower prio that
cannot move?

Asking because AFAIU we are calling find_lock_rq from push_cpu_stop and
that selects the best cpu for the high prio. I'm basically wondering if
we could avoid moving, potentially multiple, high prio tasks around to
make space for a lower prio task.

> > > This adds migration interference to the higher priority task, but
> > > restores bandwidth to system that would otherwise be irrevocably lost.
> > > Without this it would be possible to have all tasks on the system
> > > stuck on a single CPU, each task preempted in a migrate_disable()
> > > section with a single high priority task running.
> > >
> > > This way we can still approximate running the M highest priority tasks
> > > on the system.
> > >
> >
> > Ah, so IIUC that's the important bit that makes it we can't just say go
> > through the pushable_tasks list and skip migrate_disable() tasks.
> >
> > Once the highest-prio task exits its migrate_disable() region, your patch
> > pushes it away. If we ended up with a single busy CPU, it'll spread the
> > tasks around one migrate_enable() at a time.
> >
> > That time where the top task is migrate_disable() is still a crappy time,
> > and as you pointed out earlier today if it is a genuine pcpu task then the
> > whole thing is -EBORKED...
> >
> > An alternative I could see would be to prevent those piles from forming
> > altogether, say by issuing a similar push_cpu_stop() on migrate_disable()
> > if the next pushable task is already migrate_disable(); but that's a
> > proactive approach whereas yours is reactive, so I'm pretty sure that's
> > bound to perform worse.
>
> I think it is always possible to form pileups. Just start enough tasks
> such that newer, higher priority, tasks have to preempt existing tasks.
>
> Also, we might not be able to place the task elsewhere, suppose we have
> all our M CPUs filled with an RT task, then when the lowest priority
> task has migrate_disable(), wake the highest priority task.
>
> Per the SMP invariant, this new highest priority task must preempt the
> lowest priority task currently running, otherwise we would not be
> running the M highest prio tasks.
>
> That's not to say it might not still be beneficial from trying to avoid
> them, but we must assume a pilup will occur, therefore my focus was on
> dealing with them as best we can first.
>

2020-10-06 14:52:12

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Tue, Oct 06, 2020 at 04:37:04PM +0200, Juri Lelli wrote:
> On 06/10/20 15:48, Peter Zijlstra wrote:
> > On Tue, Oct 06, 2020 at 12:20:43PM +0100, Valentin Schneider wrote:
> > >
> > > On 05/10/20 15:57, Peter Zijlstra wrote:
> > > > In order to minimize the interference of migrate_disable() on lower
> > > > priority tasks, which can be deprived of runtime due to being stuck
> > > > below a higher priority task. Teach the RT/DL balancers to push away
> > > > these higher priority tasks when a lower priority task gets selected
> > > > to run on a freshly demoted CPU (pull).
>
> Still digesting the whole lot, but can't we "simply" force push the
> higest prio (that we preempt to make space for the migrate_disabled
> lower prio) directly to the cpu that would accept the lower prio that
> cannot move?
>
> Asking because AFAIU we are calling find_lock_rq from push_cpu_stop and
> that selects the best cpu for the high prio. I'm basically wondering if
> we could avoid moving, potentially multiple, high prio tasks around to
> make space for a lower prio task.

The intention was to do as you describe in the first paragraph, and
isn't pull also using find_lock_rq() to select the 'lowest' priority
runqueue to move the task to?

That is, both actions should end up at the same 'lowest' prio CPU.

2020-10-06 16:21:16

by Valentin Schneider

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing


On 06/10/20 14:48, Peter Zijlstra wrote:
> On Tue, Oct 06, 2020 at 12:20:43PM +0100, Valentin Schneider wrote:
>>
>> An alternative I could see would be to prevent those piles from forming
>> altogether, say by issuing a similar push_cpu_stop() on migrate_disable()
>> if the next pushable task is already migrate_disable(); but that's a
>> proactive approach whereas yours is reactive, so I'm pretty sure that's
>> bound to perform worse.
>
> I think it is always possible to form pileups. Just start enough tasks
> such that newer, higher priority, tasks have to preempt existing tasks.
>
> Also, we might not be able to place the task elsewhere, suppose we have
> all our M CPUs filled with an RT task, then when the lowest priority
> task has migrate_disable(), wake the highest priority task.
>
> Per the SMP invariant, this new highest priority task must preempt the
> lowest priority task currently running, otherwise we would not be
> running the M highest prio tasks.
>

Right, and it goes the other way around for the migrate_disable() task: if
it becomes one of the M highest prio tasks, then it *must* run, and
push/pulling its CPU's current away is the only way to do so...

> That's not to say it might not still be beneficial from trying to avoid
> them, but we must assume a pilup will occur, therefore my focus was on
> dealing with them as best we can first.

"Funny" all that... Thanks!

2020-10-07 05:32:16

by Juri Lelli

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On 06/10/20 16:48, Peter Zijlstra wrote:
> On Tue, Oct 06, 2020 at 04:37:04PM +0200, Juri Lelli wrote:
> > On 06/10/20 15:48, Peter Zijlstra wrote:
> > > On Tue, Oct 06, 2020 at 12:20:43PM +0100, Valentin Schneider wrote:
> > > >
> > > > On 05/10/20 15:57, Peter Zijlstra wrote:
> > > > > In order to minimize the interference of migrate_disable() on lower
> > > > > priority tasks, which can be deprived of runtime due to being stuck
> > > > > below a higher priority task. Teach the RT/DL balancers to push away
> > > > > these higher priority tasks when a lower priority task gets selected
> > > > > to run on a freshly demoted CPU (pull).
> >
> > Still digesting the whole lot, but can't we "simply" force push the
> > higest prio (that we preempt to make space for the migrate_disabled
> > lower prio) directly to the cpu that would accept the lower prio that
> > cannot move?
> >
> > Asking because AFAIU we are calling find_lock_rq from push_cpu_stop and
> > that selects the best cpu for the high prio. I'm basically wondering if
> > we could avoid moving, potentially multiple, high prio tasks around to
> > make space for a lower prio task.
>
> The intention was to do as you describe in the first paragraph, and
> isn't pull also using find_lock_rq() to select the 'lowest' priority
> runqueue to move the task to?
>
> That is, both actions should end up at the same 'lowest' prio CPU.
>

OK, right. Think there might still be differences, since successive calls
to find_lock_rq() could select different candidates (after the
introduction of cpumask_..._distribute), but that shouldn't really
matter much (also things might have changed in the meanwhile and we
really have to call find_lock_rq again I guess).

2020-10-07 21:11:03

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Wed, Oct 07, 2020 at 08:22:44PM +0100, Valentin Schneider wrote:
> + struct task_struct *curr = class->peek_next_task(rq);

If you look at the core-sched patches they have something very similar
:-)

2020-10-07 22:35:38

by Valentin Schneider

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing


On 07/10/20 22:08, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 08:22:44PM +0100, Valentin Schneider wrote:
>> + struct task_struct *curr = class->peek_next_task(rq);
>
> If you look at the core-sched patches they have something very similar
> :-)

Shiny, there's the fair thing I chickened out of doing all prechewed!

I was trying very hard to forget about that series, seems like I did partly
succeed in that :)

2020-10-07 22:50:51

by Valentin Schneider

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing


On 05/10/20 15:57, Peter Zijlstra wrote:
> In order to minimize the interference of migrate_disable() on lower
> priority tasks, which can be deprived of runtime due to being stuck
> below a higher priority task. Teach the RT/DL balancers to push away
> these higher priority tasks when a lower priority task gets selected
> to run on a freshly demoted CPU (pull).
>
> This adds migration interference to the higher priority task, but
> restores bandwidth to system that would otherwise be irrevocably lost.
> Without this it would be possible to have all tasks on the system
> stuck on a single CPU, each task preempted in a migrate_disable()
> section with a single high priority task running.
>
> This way we can still approximate running the M highest priority tasks
> on the system.
>
> Migrating the top task away is (ofcourse) still subject to
> migrate_disable() too, which means the lower task is subject to an
> interference equivalent to the worst case migrate_disable() section.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[...]
> @@ -1917,6 +1912,49 @@ static int migration_cpu_stop(void *data
> return 0;
> }
>
> +int push_cpu_stop(void *arg)
> +{
> + struct rq *lowest_rq = NULL, *rq = this_rq();
> + struct task_struct *p = arg;
> +
> + raw_spin_lock_irq(&p->pi_lock);
> + raw_spin_lock(&rq->lock);
> +
> + if (task_rq(p) != rq)
> + goto out_unlock;
> +
> + if (is_migration_disabled(p)) {
> + p->migration_flags |= MDF_PUSH;
> + goto out_unlock;
> + }
> +
> + p->migration_flags &= ~MDF_PUSH;
> +
> + if (p->sched_class->find_lock_rq)
> + lowest_rq = p->sched_class->find_lock_rq(p, rq);
> +
> + if (!lowest_rq)
> + goto out_unlock;
> +
> + // XXX validate p is still the highest prio task

So we want to move some !migrate_disable(), highest priority task to make
room for a migrate_disable(), lower priority task. We're working with the
task that was highest prio at the time of kicking the CPU stopper
(i.e. back when we invoked get_push_task()), but as you point out here it
might no longer be of highest prio.

I was thinking that since this is done in stopper context we could peek at
what actually is the current (!stopper) highest prio task, but that implies
first grabbing the rq lock and *then* some p->pi_lock, which is a no-no :(

Regarding the check, I've cobbled the below. I'm not fond of the implicit
expectation that p will always be > CFS, but there's no CFS .find_lock_rq()
(... for now).

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69b1173eaf45..3ed339980f87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1919,6 +1919,7 @@ static int migration_cpu_stop(void *data)
int push_cpu_stop(void *arg)
{
struct rq *lowest_rq = NULL, *rq = this_rq();
+ const struct sched_class *class;
struct task_struct *p = arg;

raw_spin_lock_irq(&p->pi_lock);
@@ -1940,14 +1941,23 @@ int push_cpu_stop(void *arg)
if (!lowest_rq)
goto out_unlock;

- // XXX validate p is still the highest prio task
- if (task_rq(p) == rq) {
- deactivate_task(rq, p, 0);
- set_task_cpu(p, lowest_rq->cpu);
- activate_task(lowest_rq, p, 0);
- resched_curr(lowest_rq);
+ // Validate p is still the highest prio task
+ for_class_range(class, &stop_sched_class - 1, p->sched_class - 1) {
+ struct task_struct *curr = class->peek_next_task(rq);
+
+ if (!curr)
+ continue;
+ if (curr != p)
+ goto out_unlock;
+ else
+ break;
}

+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, lowest_rq->cpu);
+ activate_task(lowest_rq, p, 0);
+ resched_curr(lowest_rq);
+
double_unlock_balance(rq, lowest_rq);

out_unlock:
@@ -7312,7 +7322,7 @@ int sched_cpu_deactivate(unsigned int cpu)

rq_lock_irqsave(rq, &rf);
if (rq->rd) {
- update_rq_clock();
+ update_rq_clock(rq);
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
set_rq_offline(rq);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 15320ede2f45..7964c42b8604 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1840,6 +1840,19 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
return p;
}

+static struct task_struct *peek_next_task_dl(struct rq *rq)
+{
+ struct sched_dl_entity *dl_se;
+ struct dl_rq *dl_rq = &rq->dl;
+
+ if (!sched_dl_runnable(rq))
+ return NULL;
+
+ dl_se = pick_next_dl_entity(rq, dl_rq);
+ BUG_ON(!dl_se);
+ return dl_task_of(dl_se);
+}
+
static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
update_curr_dl(rq);
@@ -2498,6 +2511,8 @@ const struct sched_class dl_sched_class
.check_preempt_curr = check_preempt_curr_dl,

.pick_next_task = pick_next_task_dl,
+ .peek_next_task = peek_next_task_dl,
+
.put_prev_task = put_prev_task_dl,
.set_next_task = set_next_task_dl,

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e90a69b3e85c..83949e9018a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1636,6 +1636,14 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
return p;
}

+static struct task_struct *peek_next_task_rt(struct rq *rq)
+{
+ if (!sched_rt_runnable(rq))
+ return NULL;
+
+ return _pick_next_task_rt(rq);
+}
+
static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);
@@ -2479,6 +2487,8 @@ const struct sched_class rt_sched_class
.check_preempt_curr = check_preempt_curr_rt,

.pick_next_task = pick_next_task_rt,
+ .peek_next_task = peek_next_task_rt,
+
.put_prev_task = put_prev_task_rt,
.set_next_task = set_next_task_rt,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2621155393c..7cd3b8682375 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1811,6 +1811,7 @@ struct sched_class {
void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

struct task_struct *(*pick_next_task)(struct rq *rq);
+ struct task_struct *(*peek_next_task)(struct rq *rq);

void (*put_prev_task)(struct rq *rq, struct task_struct *p);
void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

by Sebastian Andrzej Siewior

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On 2020-10-05 16:57:32 [+0200], Peter Zijlstra wrote:
> In order to minimize the interference of migrate_disable() on lower
> priority tasks, which can be deprived of runtime due to being stuck
> below a higher priority task. Teach the RT/DL balancers to push away
> these higher priority tasks when a lower priority task gets selected
> to run on a freshly demoted CPU (pull).
>
> This adds migration interference to the higher priority task, but
> restores bandwidth to system that would otherwise be irrevocably lost.
> Without this it would be possible to have all tasks on the system
> stuck on a single CPU, each task preempted in a migrate_disable()
> section with a single high priority task running.

So there is a task running at priority 99.9 and then the scheduler decides
to interrupt it while pushing it to a new CPU? But this doesn't happen if
the task is pinned to one CPU, so it shouldn't do much harm.

Usually the tasks with high priority are pinned to a single CPU because
otherwise it causes noise/latency when the scheduler bounces them to
different CPUs.
Then there are the cases where the first lock limits the CPU mask and
the second lock is occupied. After the lock has been released, the task
can't acquire it because the CPU is occupied by a task with higher
priority. This would be a win if the high-prio task moved to another
CPU when possible.

Sebastian

2020-10-12 09:58:25

by Dietmar Eggemann

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On 05/10/2020 16:57, Peter Zijlstra wrote:

[...]

> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1859,7 +1859,7 @@ static struct task_struct *pick_next_pus
> * running task can migrate over to a CPU that is running a task
> * of lesser priority.
> */
> -static int push_rt_task(struct rq *rq)
> +static int push_rt_task(struct rq *rq, bool pull)
> {
> struct task_struct *next_task;
> struct rq *lowest_rq;
> @@ -1873,6 +1873,34 @@ static int push_rt_task(struct rq *rq)
> return 0;
>
> retry:
> + if (is_migration_disabled(next_task)) {
> + struct task_struct *push_task = NULL;
> + int cpu;
> +
> + if (!pull || rq->push_busy)
> + return 0;

Shouldn't there be the same functionality in push_dl_task(), i.e.
returning 0 earlier for a task with migration_disabled?

[...]

2020-10-12 11:30:59

by Peter Zijlstra

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On Mon, Oct 12, 2020 at 11:56:09AM +0200, Dietmar Eggemann wrote:
> On 05/10/2020 16:57, Peter Zijlstra wrote:
>
> [...]
>
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1859,7 +1859,7 @@ static struct task_struct *pick_next_pus
> > * running task can migrate over to a CPU that is running a task
> > * of lesser priority.
> > */
> > -static int push_rt_task(struct rq *rq)
> > +static int push_rt_task(struct rq *rq, bool pull)
> > {
> > struct task_struct *next_task;
> > struct rq *lowest_rq;
> > @@ -1873,6 +1873,34 @@ static int push_rt_task(struct rq *rq)
> > return 0;
> >
> > retry:
> > + if (is_migration_disabled(next_task)) {
> > + struct task_struct *push_task = NULL;
> > + int cpu;
> > +
> > + if (!pull || rq->push_busy)
> > + return 0;
>
> Shouldn't there be the same functionality in push_dl_task(), i.e.
> returning 0 earlier for a task with migration_disabled?

No, deadline didn't implement HAVE_RT_PUSH_IPI.

2020-10-12 12:27:13

by Dietmar Eggemann

Subject: Re: [PATCH -v2 15/17] sched: Fix migrate_disable() vs rt/dl balancing

On 12/10/2020 13:28, Peter Zijlstra wrote:
> On Mon, Oct 12, 2020 at 11:56:09AM +0200, Dietmar Eggemann wrote:
>> On 05/10/2020 16:57, Peter Zijlstra wrote:
>>
>> [...]
>>
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -1859,7 +1859,7 @@ static struct task_struct *pick_next_pus
>>> * running task can migrate over to a CPU that is running a task
>>> * of lesser priority.
>>> */
>>> -static int push_rt_task(struct rq *rq)
>>> +static int push_rt_task(struct rq *rq, bool pull)
>>> {
>>> struct task_struct *next_task;
>>> struct rq *lowest_rq;
>>> @@ -1873,6 +1873,34 @@ static int push_rt_task(struct rq *rq)
>>> return 0;
>>>
>>> retry:
>>> + if (is_migration_disabled(next_task)) {
>>> + struct task_struct *push_task = NULL;
>>> + int cpu;
>>> +
>>> + if (!pull || rq->push_busy)
>>> + return 0;
>>
>> Shouldn't there be the same functionality in push_dl_task(), i.e.
>> returning 0 earlier for a task with migration_disabled?
>
> No, deadline didn't implement HAVE_RT_PUSH_IPI.

Right, so 'is_migration_disabled(next_task) && !pull' should never
happen then (for RT and DL).