2015-07-16 09:17:23

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

From: Joerg Roedel <[email protected]>

Currently the code to bring up secondary CPUs only checks
for cpu_online before it proceeds with launching the per-cpu
threads for the freshly booted remote CPU.

But the code to move these threads to the new CPU checks for
cpu_active to do so. If this check fails the threads end up
on the wrong CPU, causing warnings and bugs like:

WARNING: CPU: 0 PID: 1 at ../kernel/workqueue.c:4417 workqueue_cpu_up_callback

and/or:

kernel BUG at ../kernel/smpboot.c:135!

The reason is that the cpu_active bit for the new CPU
becomes visible significantly later than the cpu_online bit.
One reason could be that the kernel runs in a KVM guest, where the
vCPU thread gets preempted after the cpu_online bit is set, but while
cpu_active is still clear.

But this can also happen on bare-metal systems with many CPUs; we
have observed this issue on a bare-metal 88-core x86 system.

To fix this issue, wait until the remote CPU is both online
*and* active before launching the per-cpu threads.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/kernel/smpboot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index d3010aa..30b7b8b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1006,7 +1006,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 	check_tsc_sync_source(cpu);
 	local_irq_restore(flags);
 
-	while (!cpu_online(cpu)) {
+	while (!cpu_online(cpu) || !cpu_active(cpu)) {
 		cpu_relax();
 		touch_nmi_watchdog();
 	}
--
1.9.1


2015-07-20 14:46:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

On Thu, Jul 16, 2015 at 11:17:17AM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <[email protected]>
>
> Currently the code to bring up secondary CPUs only checks
> for cpu_online before it proceeds with launching the per-cpu
> threads for the freshly booted remote CPU.
>
> But the code to move these threads to the new CPU checks for
> cpu_active to do so. If this check fails the threads end up
> on the wrong CPU, causing warnings and bugs like:
>
> WARNING: CPU: 0 PID: 1 at ../kernel/workqueue.c:4417 workqueue_cpu_up_callback
>
> and/or:
>
> kernel BUG at ../kernel/smpboot.c:135!
>
> The reason is that the cpu_active bit for the new CPU
> becomes visible significantly later than the cpu_online bit.

I see

void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
	} else {

which is called in start_secondary().

Do you mean that setting the bit in cpu_active_mask gets delayed soo
much? Because it comes right after setting the bit in cpu_online_mask.

> The reasons could be that the kernel runs in a KVM guest,
> where the vCPU thread gets preempted when the cpu_online bit
> is set, but with cpu_active still clear.
>
> But this could also happen on bare-metal systems with lots
> of CPUs. We have observed this issue on an 88 core x86
> system on bare-metal.
>
> To fix this issue, wait until the remote CPU is both online
> *and* active before launching the per-cpu threads.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/kernel/smpboot.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index d3010aa..30b7b8b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1006,7 +1006,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
>  	check_tsc_sync_source(cpu);
>  	local_irq_restore(flags);
>
> -	while (!cpu_online(cpu)) {
> +	while (!cpu_online(cpu) || !cpu_active(cpu)) {
>  		cpu_relax();
>  		touch_nmi_watchdog();

Maybe we should just swap the calls in set_cpu_online() instead? I.e.,


	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
	}

?

I see cpu_online() being called much more than cpu_active()...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-07-20 15:02:46

by Jörg Rödel

[permalink] [raw]
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

On Mon, Jul 20, 2015 at 04:46:19PM +0200, Borislav Petkov wrote:
> On Thu, Jul 16, 2015 at 11:17:17AM +0200, Joerg Roedel wrote:
> > The reason is that the cpu_active bit for the new CPU
> > becomes visible significantly later than the cpu_online bit.
>
> I see
>
> void set_cpu_online(unsigned int cpu, bool online)
> {
> 	if (online) {
> 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
> 		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
> 	} else {
>
> which is called in start_secondary().
>
> Do you mean that setting the bit in cpu_active_mask gets delayed soo
> much? Because it comes right after setting the bit in cpu_online_mask.

Yes. Either cpu_active is set a lot later in a KVM guest, when the
host decides to preempt the vCPU right after cpu_online is set but
before cpu_active is, or the delay shows up on bare metal.

I have seen a report where this happens on bare metal, where the change
to the cpu_active bit becomes visible on the other CPU significantly
later than the cpu_online bit. This happened on a pretty big machine
with 88 cores.


Joerg

2015-07-20 15:10:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

On Mon, Jul 20, 2015 at 05:02:40PM +0200, Joerg Roedel wrote:
> I have seen a report where this happens on bare metal, when the change
> to the cpu_active bit becomes visible on the other CPU significantly
> later than the cpu_online bit. This happened on a pretty big machine
> with 88 cores.

So how about what I proposed at the end of my previous mail?

--
Regards/Gruss,
Boris.


2015-07-20 15:18:35

by Jörg Rödel

[permalink] [raw]
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

On Mon, Jul 20, 2015 at 05:10:00PM +0200, Borislav Petkov wrote:
> On Mon, Jul 20, 2015 at 05:02:40PM +0200, Joerg Roedel wrote:
> > I have seen a report where this happens on bare metal, when the change
> > to the cpu_active bit becomes visible on the other CPU significantly
> > later than the cpu_online bit. This happened on a pretty big machine
> > with 88 cores.
>
> So how about what I proposed at the end of my previous mail?

Oh sorry, I missed that. Setting cpu_active first should work on x86,
where writes become visible in the order they are executed. But this
function is in generic code and I have no idea what this change might
break on other architectures.

In the end, cpu_active means that the scheduler may push tasks to the
CPU; I have no idea whether some arch code breaks when the scheduler is
already working on a CPU before it becomes visibly online.


Joerg

2015-07-20 15:27:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization

On Mon, Jul 20, 2015 at 05:18:31PM +0200, Joerg Roedel wrote:
> On Mon, Jul 20, 2015 at 05:10:00PM +0200, Borislav Petkov wrote:
> > On Mon, Jul 20, 2015 at 05:02:40PM +0200, Joerg Roedel wrote:
> > > I have seen a report where this happens on bare metal, when the change
> > > to the cpu_active bit becomes visible on the other CPU significantly
> > > later than the cpu_online bit. This happened on a pretty big machine
> > > with 88 cores.
> >
> > So how about what I proposed at the end of my previous mail?
>
> Oh sorry, I missed that. Setting cpu_active first should work on x86,
> where writes become visible in the order they are executed. But this
> function is in generic code and I have no idea what this change might
> break on other architectures.
>
> In the end cpu_active means that the scheduler can push tasks to the
> CPU, no idea if some arch code breaks when the scheduler is already
> working on a CPU before it becomes visibly online.

Hmm...

Let's run it by Peter.

@Peter: see the first patch in the mail for the problem of which cpumask
bit to test wrt scheduler and migrating tasks to newly appearing
cores...

Thanks.

--
Regards/Gruss,
Boris.


2015-07-27 18:21:52

by Jan H. Schönherr

[permalink] [raw]
Subject: [PATCH] sched: fix cpu_active_mask/cpu_online_mask race

From: Jan H. Schönherr <[email protected]>

There is a race condition in SMP bootup code, which may result in

WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
or
kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running on
busy hosts.

CPU0                              CPUn
====                              ====

_cpu_up()
  __cpu_up()
                                  start_secondary()
                                    set_cpu_online()
                                      cpumask_set_cpu(cpu,
                                         to_cpumask(cpu_online_bits));
  cpu_notify(CPU_ONLINE)
    <do stuff, see below>
                                      cpumask_set_cpu(cpu,
                                         to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not active.
Several things can go wrong at that point, depending on the scheduling
of tasks on CPU0.

Variant 1:

cpu_notify(CPU_ONLINE)
  workqueue_cpu_up_callback()
    rebind_workers()
      set_cpus_allowed_ptr()

This call fails because it requires an active CPU; rebind_workers()
ends with a warning:

WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()

Variant 2:

cpu_notify(CPU_ONLINE)
  smpboot_thread_call()
    smpboot_unpark_threads()
    ..
      __kthread_unpark()
        __kthread_bind()
        wake_up_state()
        ..
          select_task_rq()
            select_fallback_rq()

The ->wake_cpu of the unparked thread is not allowed, making a call
to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
find an allowed, active CPU and promptly resets the allowed CPUs, so
that the task in question ends up on CPU0.

When those unparked tasks are eventually executed, they run
immediately into a BUG:

kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set (and
adding some memory barriers) would solve the two issues above.
However, it would change the order of operations back to the one
before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular problem.

Going further back into history, we have at least the following commits
touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs
cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:
- secondary CPU sets active before online, because active is assumed to
be a subset of online;
- secondary CPU sets online before active, because the primary CPU
assumes that an online CPU is also active;
- secondary CPU sets online and waits for primary CPU to set active,
because it might deadlock.

Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are active &
online") introduces an arch-specific solution to this arch-independent
problem.

Now, go for a more general solution without explicit waiting and simply
set active twice: once on the secondary CPU after online was set and
once on the primary CPU after online was seen.

Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Signed-off-by: Jan H. Schönherr <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10..155bb4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5433,6 +5433,7 @@ static int sched_cpu_active(struct notifier_block *nfb,
 	case CPU_STARTING:
 		set_cpu_rq_start_time();
 		return NOTIFY_OK;
+	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		set_cpu_active((long)hcpu, true);
 		return NOTIFY_OK;

2015-07-30 16:01:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fix cpu_active_mask/cpu_online_mask race

On Mon, Jul 27, 2015 at 08:21:28PM +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <[email protected]>
>
> Together, these give us the following non-working solutions:
> - secondary CPU sets active before online, because active is assumed to
> be a subset of online;
> - secondary CPU sets online before active, because the primary CPU
> assumes that an online CPU is also active;
> - secondary CPU sets online and waits for primary CPU to set active,
> because it might deadlock.
>
> Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are active &
> online") introduces an arch-specific solution to this arch-independent
> problem.
>
> Now, go for a more general solution without explicit waiting and simply
> set active twice: once on the secondary CPU after online was set and
> once on the primary CPU after online was seen.

Very vile indeed ;-)

> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Jan H. Schönherr <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/core.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78b4bad10..155bb4a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5433,6 +5433,7 @@ static int sched_cpu_active(struct notifier_block *nfb,
>  	case CPU_STARTING:
>  		set_cpu_rq_start_time();
>  		return NOTIFY_OK;
> +	case CPU_ONLINE:

Not without a big honking comment right there :-)

>  	case CPU_DOWN_FAILED:
>  		set_cpu_active((long)hcpu, true);
>  		return NOTIFY_OK;

2015-07-30 19:17:38

by Jan H. Schönherr

[permalink] [raw]
Subject: [PATCH v2] sched: fix cpu_active_mask/cpu_online_mask race

From: Jan H. Schönherr <[email protected]>

There is a race condition in SMP bootup code, which may result in

WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
or
kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running on
busy hosts.

CPU0                              CPUn
====                              ====

_cpu_up()
  __cpu_up()
                                  start_secondary()
                                    set_cpu_online()
                                      cpumask_set_cpu(cpu,
                                         to_cpumask(cpu_online_bits));
  cpu_notify(CPU_ONLINE)
    <do stuff, see below>
                                      cpumask_set_cpu(cpu,
                                         to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not active.
Several things can go wrong at that point, depending on the scheduling
of tasks on CPU0.

Variant 1:

cpu_notify(CPU_ONLINE)
  workqueue_cpu_up_callback()
    rebind_workers()
      set_cpus_allowed_ptr()

This call fails because it requires an active CPU; rebind_workers()
ends with a warning:

WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()

Variant 2:

cpu_notify(CPU_ONLINE)
  smpboot_thread_call()
    smpboot_unpark_threads()
    ..
      __kthread_unpark()
        __kthread_bind()
        wake_up_state()
        ..
          select_task_rq()
            select_fallback_rq()

The ->wake_cpu of the unparked thread is not allowed, making a call
to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
find an allowed, active CPU and promptly resets the allowed CPUs, so
that the task in question ends up on CPU0.

When those unparked tasks are eventually executed, they run
immediately into a BUG:

kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set (and
adding some memory barriers) would solve the two issues above.
However, it would change the order of operations back to the one
before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular problem.

Going further back into history, we have at least the following commits
touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs
cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:
- secondary CPU sets active before online, because active is assumed to
be a subset of online;
- secondary CPU sets online before active, because the primary CPU
assumes that an online CPU is also active;
- secondary CPU sets online and waits for primary CPU to set active,
because it might deadlock.

Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are active &
online") introduces an arch-specific solution to this arch-independent
problem.

Now, go for a more general solution without explicit waiting and simply
set active twice: once on the secondary CPU after online was set and
once on the primary CPU after online was seen.

Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Signed-off-by: Jan H. Schönherr <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---

v2:
- Big honking comment in the CPU_ONLINE path for Peter. :)

kernel/sched/core.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10..e967343 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5433,6 +5433,14 @@ static int sched_cpu_active(struct notifier_block *nfb,
 	case CPU_STARTING:
 		set_cpu_rq_start_time();
 		return NOTIFY_OK;
+	case CPU_ONLINE:
+		/*
+		 * At this point a starting CPU has marked itself as online via
+		 * set_cpu_online(). But it might not yet have marked itself
+		 * as active, which is essential from here on.
+		 *
+		 * Thus, fall-through and help the starting CPU along.
+		 */
 	case CPU_DOWN_FAILED:
 		set_cpu_active((long)hcpu, true);
 		return NOTIFY_OK;
--
2.4.6.1.gea4e83c