2012-06-02 18:58:29

by Srivatsa S. Bhat

Subject: [PATCH] sched/cpu hotplug: Don't traverse sched domain tree of cpus going offline

The cpu_active_mask is used to help the scheduler make the right decisions
during CPU hotplug transitions, and thereby enable CPU hotplug operations to
run smoothly.
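
For reference, cpu_active() is just a test against that mask; on SMP builds
of this era it is roughly (include/linux/cpumask.h):

	#define cpu_active(cpu)		cpumask_test_cpu((cpu), cpu_active_mask)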

However, there are a few places where the scheduler doesn't consult the
cpu_active_mask and hence can trip up during CPU hotplug operations by
making decisions based on stale data structures.

In particular, the call to partition_sched_domains() in the CPU offline path
sets cpu_rq(dying_cpu)->sd (sched domain pointer) to NULL. However, until
we get to that point, it remains non-NULL. And during that time period,
the scheduler could end up traversing that non-NULL sched domain tree and
making decisions based on that, which could lead to undesirable consequences.
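
To see why, note that for_each_domain() simply walks the rq->sd chain under
RCU; in this era it is roughly (kernel/sched/sched.h):

	#define for_each_domain(cpu, __sd) \
		for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
		     __sd; __sd = __sd->parent)

So a NULL rq->sd terminates the walk immediately, but until
partition_sched_domains() clears it, the walk happily traverses the stale
tree.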

For example, commit 8f2f748 (CPU hotplug, cpusets, suspend: Don't touch
cpusets during suspend/resume) caused suspend hangs at the CPU hotplug stage,
in some configurations. And the root cause of that hang was that the
scheduler was traversing the sched domain trees of cpus going offline, thereby
messing up its decisions. The analysis of that hang can be found in:
http://marc.info/?l=linux-kernel&m=133518704022532&w=4

In short, our assumption that we are good to go (even with stale sched
domains) as long as the dying cpu is not in the cpu_active_mask was wrong,
because there are several places where the scheduler code doesn't really
check the cpu_active_mask, but instead traverses the sched domains directly
to make its decisions.

Anyway, that commit got reverted. However, the same problem could very well
occur during regular CPU hotplug too, even now. Ideally, we would want the
scheduler to make all its decisions during the CPU hotplug transition by
looking up the cpu_active_mask (which is updated very early during CPU
hotplug), to avoid such disasters. So, strengthen the scheduler by making it
consult the cpu_active_mask as a validity check before traversing the sched
domain tree.
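
For context, the active mask is cleared very early in the offline path, at
CPU_DOWN_PREPARE time, well before the sched domains are torn down; the
notifier in kernel/sched/core.c of this era looks roughly like:

	static int __cpuinit sched_cpu_inactive(struct notifier_block *nfb,
						unsigned long action, void *hcpu)
	{
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_DOWN_PREPARE:
			/* Drop the cpu from the active mask right away */
			set_cpu_active((long)hcpu, false);
			return NOTIFY_OK;
		default:
			return NOTIFY_DONE;
		}
	}

So any code that checks cpu_active() gets an accurate "this cpu is going
down" indication throughout the rest of the offline sequence.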

Here we are specifically fixing the 2 known instances: idle_balance() and
find_lowest_rq().

1. kernel/sched/core.c and fair.c:

   schedule()
     __schedule()
       idle_balance()

idle_balance() can end up doing:

	for_each_domain(dying_cpu, sd) {
		...
	}

2. kernel/sched/core.c and rt.c:

   migration_call()
     sched_ttwu_pending()
       ttwu_do_activate()
         ttwu_do_wakeup()
           task_woken()
             task_woken_rt()
               push_rt_tasks()
                 find_lowest_rq()

find_lowest_rq() can end up doing:

	for_each_domain(dying_cpu, sd) {
		...
	}

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

 kernel/sched/fair.c |    5 +++++
 kernel/sched/rt.c   |    4 ++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 940e6d1..72cc7c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4402,6 +4402,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	raw_spin_unlock(&this_rq->lock);
 
 	update_shares(this_cpu);
+
+	if (!cpu_active(this_cpu))
+		goto out;
+
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -4426,6 +4430,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	}
 	rcu_read_unlock();
 
+out:
 	raw_spin_lock(&this_rq->lock);
 
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c5565c3..8cbafcd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1488,6 +1488,9 @@ static int find_lowest_rq(struct task_struct *task)
 	if (!cpumask_test_cpu(this_cpu, lowest_mask))
 		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
 
+	if (!cpu_active(cpu))
+		goto out;
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		if (sd->flags & SD_WAKE_AFFINE) {
@@ -1513,6 +1516,7 @@ static int find_lowest_rq(struct task_struct *task)
 	}
 	rcu_read_unlock();
 
+out:
 	/*
 	 * And finally, if there were no matches within the domains
 	 * just give the caller *something* to work with from the compatible


2012-06-20 10:06:52

by Srivatsa S. Bhat

Subject: Re: [PATCH] sched/cpu hotplug: Don't traverse sched domain tree of cpus going offline

On 06/03/2012 12:27 AM, Srivatsa S. Bhat wrote:

> The cpu_active_mask is used to help the scheduler make the right decisions
> during CPU hotplug transitions, and thereby enable CPU hotplug operations to
> run smoothly.
>
> [...]
>
> Signed-off-by: Srivatsa S. Bhat <[email protected]>

Any thoughts on this patch?

Regards,
Srivatsa S. Bhat

2012-06-20 10:17:50

by Peter Zijlstra

Subject: Re: [PATCH] sched/cpu hotplug: Don't traverse sched domain tree of cpus going offline

On Wed, 2012-06-20 at 15:20 +0530, Srivatsa S. Bhat wrote:
> Any thoughts on this patch?
>
Yeah, I stared at it for a while and got a head-ache.. the idle-balance
part is ok, the rt.c part needs a little more thinking.

Thanks for reminding me.. I'll try and have another look.