I have a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:
BUG_ON(td->cpu != smp_processor_id());
Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:
CPU: 0
Comm: watchdog/130
The issue is in kthread_bind where we set the cpus_allowed mask, but do
not touch task_thread_info(p)->cpu. The scheduler assumes the previously
scheduled CPU is in the cpus_allowed mask, but in this case we are
moving a thread to another CPU so it is not.
We used to call set_task_cpu which sets task_thread_info(p)->cpu (in fact
kthread_bind still has a comment suggesting this). That was removed in
e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked
tasks").
Since we cannot call set_task_cpu (the task is in a sleeping state),
just do an explicit set of task_thread_info(p)->cpu.
Fixes: e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked tasks")
Cc: [email protected]
Signed-off-by: Anton Blanchard <[email protected]>
---
kernel/kthread.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 10e489c..e40ab1d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -327,13 +327,14 @@ EXPORT_SYMBOL(kthread_create_on_node);
static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
+ /* Must have done schedule() in kthread() before we change affinity */
if (!wait_task_inactive(p, state)) {
WARN_ON(1);
return;
}
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
+ task_thread_info(p)->cpu = cpu;
p->flags |= PF_NO_SETAFFINITY;
}
--
2.1.0
On Sun, Dec 7, 2014 at 7:27 PM, Anton Blanchard <[email protected]> wrote:
>
> Since we cannot call set_task_cpu (the task is in a sleeping state),
> just do an explicit set of task_thread_info(p)->cpu.
Scheduler people: is this sufficient and ok?
The __set_task_cpu() function does various other things too:
	set_task_rq(p, cpu);
#ifdef CONFIG_SMP
	/*
	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
	 * successfully executed on another CPU. We must ensure that updates of
	 * per-task data have been completed by this moment.
	 */
	smp_wmb();
	task_thread_info(p)->cpu = cpu;
	p->wake_cpu = cpu;
#endif
which makes me worry about just setting the thread_info->cpu value.
That set_task_rq() initializes various group scheduling things, and
that whole "wake_cpu" thing seems relevant too.
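For reference, set_task_rq() at the time was roughly this (an abridged
sketch of the ~3.18 kernel/sched/sched.h, not the verbatim source):

	static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
	{
	#ifdef CONFIG_FAIR_GROUP_SCHED
		/* Point the CFS entity at the group's per-cpu runqueue. */
		p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
		p->se.parent = task_group(p)->se[cpu];
	#endif
	#ifdef CONFIG_RT_GROUP_SCHED
		/* Same for the RT scheduling entity. */
		p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
		p->rt.parent = task_group(p)->rt_se[cpu];
	#endif
	}

so writing thread_info->cpu alone leaves those group-scheduling pointers
aimed at the old CPU's runqueues.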
I'm not saying the patch is wrong, I just want confirmation/ack for
it. Although to be honest, I'd rather see this come in through the
scheduler tree anyway.
Hmm? This seems to go back to 2.6.33 if that "Fixes" line is accurate,
so it's been around forever (almost exactly five years). Comments?
Linus
Hi Linus,
> The __set_task_cpu() function does various other things too:
>
>	set_task_rq(p, cpu);
> #ifdef CONFIG_SMP
>	/*
>	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
>	 * successfully executed on another CPU. We must ensure that updates of
>	 * per-task data have been completed by this moment.
>	 */
>	smp_wmb();
>	task_thread_info(p)->cpu = cpu;
>	p->wake_cpu = cpu;
> #endif
>
> which makes me worry about just setting the thread_info->cpu value.
> That set_task_rq() initializes various group scheduling things, and
> that whole "wake_cpu" thing seems relevant too.
Yeah, I would definitely like the scheduler guys to weigh in on this,
especially considering how difficult it can be to hit.
Anton
* Anton Blanchard <[email protected]> wrote:
> I have a busy ppc64le KVM box where guests sometimes hit the
> infamous "kernel BUG at kernel/smpboot.c:134!" issue during
> boot:
>
> BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
> CPU: 0
> Comm: watchdog/130
>
> The issue is in kthread_bind where we set the cpus_allowed
> mask, but do not touch task_thread_info(p)->cpu. The scheduler
> assumes the previously scheduled CPU is in the cpus_allowed
> mask, but in this case we are moving a thread to another CPU so
> it is not.
>
> We used to call set_task_cpu which sets
> task_thread_info(p)->cpu (in fact kthread_bind still has a
> comment suggesting this). That was removed in e2912009fb7b
> ("sched: Ensure set_task_cpu() is never called on blocked
> tasks").
>
> Since we cannot call set_task_cpu (the task is in a sleeping
> state), just do an explicit set of task_thread_info(p)->cpu.
So we cannot call set_task_cpu() because in the normal lifetime
of a task the ->cpu value gets set on wakeup. So if a task is
blocked right now, and its affinity changes, it ought to get a
correct ->cpu selected on wakeup. The affinity mask and the
current value of ->cpu getting out of sync is thus 'normal'.
(Check for example how set_cpus_allowed_ptr() works: we first set
the new allowed mask, and only then migrate the task away if
necessary.)
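The relevant shape (a heavily abridged sketch of the ~3.18
kernel/sched/core.c:set_cpus_allowed_ptr(); error paths and most
locking detail elided):

	rq = task_rq_lock(p, &flags);
	...
	do_set_cpus_allowed(p, new_mask);	/* step 1: set the new mask */

	/* Can the task run on its current CPU? If so, we're done. */
	if (cpumask_test_cpu(task_cpu(p), new_mask))
		goto out;

	/* Step 2: only now migrate it away, if it is actually running. */
	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
	if (task_running(rq, p)) {
		struct migration_arg arg = { p, dest_cpu };

		task_rq_unlock(rq, p, &flags);
		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
		return 0;
	}
out:
	task_rq_unlock(rq, p, &flags);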
In the kthread_bind() case this is explicitly assumed: it only
calls do_set_cpus_allowed().
But obviously the bug triggers in kernel/smpboot.c, and that
assert shows a real bug - and your patch makes the assert go
away, so the question is, how did the kthread get woken up and
put on a runqueue without its ->cpu getting set?
One possibility is a generic scheduler bug in ttwu(), resulting
in ->cpu not getting set properly. If this was the case then
other places would be blowing up as well, and I don't think we
are seeing this currently, especially not over such a long
timespan.
Another possibility would be that kthread_bind()'s assumption
that the task is inactive is false: if the task activates when we
think it's blocked and we just hotplug-migrate it away while it's
running (setting its td->cpu?), the assert could trigger I think
- and the patch would make the assert go away.
A third possibility would be, if this is a freshly created
thread, some sort of initialization race - either in the kthread
or in the scheduler code.
Weird.
Thanks,
Ingo
Hi Ingo,
> So we cannot call set_task_cpu() because in the normal lifetime
> of a task the ->cpu value gets set on wakeup. So if a task is
> blocked right now, and its affinity changes, it ought to get a
> correct ->cpu selected on wakeup. The affinity mask and the
> current value of ->cpu getting out of sync is thus 'normal'.
>
> (Check for example how set_cpus_allowed_ptr() works: we first set
> the new allowed mask, and only then migrate the task away if
> necessary.)
>
> In the kthread_bind() case this is explicitly assumed: it only
> calls do_set_cpus_allowed().
>
> But obviously the bug triggers in kernel/smpboot.c, and that
> assert shows a real bug - and your patch makes the assert go
> away, so the question is, how did the kthread get woken up and
> put on a runqueue without its ->cpu getting set?
I started going down this line earlier today, and found things like:
select_task_rq_fair:

	if (p->nr_cpus_allowed == 1)
		return prev_cpu;
I tried returning cpumask_first(tsk_cpus_allowed(p)) instead, and while
I couldn't hit the BUG I did manage to get a scheduler lockup during
testing.
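The change I tried was roughly this (reconstructed for illustration,
not the exact diff):

	if (p->nr_cpus_allowed == 1)
		return cpumask_first(tsk_cpus_allowed(p));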
At that point I thought the previous task_cpu() was somewhat ingrained
in the scheduler and came up with the patch. If not, we could go on a
hunt to see what else needs fixing.
Anton
On Mon, 8 Dec 2014 14:27:01 +1100
Anton Blanchard <[email protected]> wrote:
> I have a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
> BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
> CPU: 0
> Comm: watchdog/130
>
> The issue is in kthread_bind where we set the cpus_allowed mask, but do
> not touch task_thread_info(p)->cpu. The scheduler assumes the previously
> scheduled CPU is in the cpus_allowed mask, but in this case we are
> moving a thread to another CPU so it is not.
>
Does this happen always on boot up, and always with the watchdog thread?
I followed the logic that starts the watchdog threads.
watchdog_enable_all_cpus()
  smpboot_register_percpu_thread() {
    for_each_online_cpu(cpu) { ... }
  }
watchdog_enable_all_cpus() can be called by lockup_detector_init()
before SMP is started, but also by proc_dowatchdog(), which is called
by the sysctl commands (after SMP is up and running).
I noticed there's no "get_online_cpus()" anywhere, although the
unregister_percpu_thread() has it. Is it possible that we created a
thread on a CPU that wasn't fully online yet?
Perhaps the following patch is needed? Even if this isn't the solution
to this bug, it is probably needed as watchdog_enable_all_cpus() can be
called after boot up too.
-- Steve
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index eb89e1807408..60d35ac5d3f1 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -279,6 +279,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
unsigned int cpu;
int ret = 0;
+ get_online_cpus();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
ret = __smpboot_create_thread(plug_thread, cpu);
@@ -291,6 +292,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
list_add(&plug_thread->list, &hotplug_threads);
out:
mutex_unlock(&smpboot_threads_lock);
+ put_online_cpus();
return ret;
}
EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
Hi Ingo,
> At that point I thought the previous task_cpu() was somewhat ingrained
> in the scheduler and came up with the patch. If not, we could go on a
> hunt to see what else needs fixing.
I had another look. The scheduler does indeed make assumptions about the
previous task_cpu, but we have a hammer to fix it up called
select_fallback_rq.
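For context, the relevant part of select_fallback_rq() (an abridged
sketch of the ~3.18 kernel/sched/core.c, trimmed to the checks that
matter here):

	static int select_fallback_rq(int cpu, struct task_struct *p)
	{
		const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
		int dest_cpu;

		/* Look for an allowed, online *and active* CPU on the same node. */
		for_each_cpu(dest_cpu, nodemask) {
			if (!cpu_online(dest_cpu))
				continue;
			if (!cpu_active(dest_cpu))
				continue;
			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
				return dest_cpu;
		}
		/* ... otherwise widen the search, eventually breaking affinity. */
	}

If the bound CPU is online but not yet active, the per-cpu kthread gets
placed on some other CPU - exactly what the BUG_ON catches.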
I annotated select_fallback_rq, and did hit a case where the CPU was
not active. ppc64 patch below.
I think x86 has a similar (although harder to hit) issue. While it
does wait for the cpu_online bit to be set:
	while (!cpu_online(cpu)) {
		cpu_relax();
		touch_nmi_watchdog();
	}
The cpu_active bit is set after the cpu_online bit:
void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
If the CPU got delayed between the two stores (e.g. a KVM guest had the CPU
scheduled out), then we'd end up with cpu_active unset and hit the same
issue in select_fallback_rq.
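A userspace sketch of that window (illustrative only - made-up names,
not kernel code; whether the printf fires depends on scheduling):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int online_bit;	/* stands in for cpu_online_bits */
	static atomic_int active_bit;	/* stands in for cpu_active_bits */

	static void *secondary(void *unused)
	{
		atomic_store(&online_bit, 1);
		/* If the "CPU" stalls here (say, the KVM vcpu is scheduled
		 * out), the master runs on with online set, active clear. */
		atomic_store(&active_bit, 1);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, secondary, NULL);

		/* The master's wait loop only checks the online bit. */
		while (!atomic_load(&online_bit))
			;

		if (!atomic_load(&active_bit))
			printf("window hit: online set, active still clear\n");

		pthread_join(t, NULL);
		return 0;
	}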
Anton
--
I have a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:
BUG_ON(td->cpu != smp_processor_id());
Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:
CPU: 0
Comm: watchdog/130
The problem is that we aren't ensuring the CPU active and online bits are set
before allowing the master to continue on. The master unparks the secondary
CPUs kthreads and the scheduler looks for a CPU to run on. It calls
select_task_rq and realises the suggested CPU is not in the cpus_allowed
mask. It then ends up in select_fallback_rq, and since the active and
online bits aren't set we choose some other CPU to run on.
Cc: [email protected]
Signed-off-by: Anton Blanchard <[email protected]>
---
arch/powerpc/kernel/smp.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 71e186d..d40e46e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -700,7 +700,6 @@ void start_secondary(void *unused)
smp_store_cpu_info(cpu);
set_dec(tb_ticks_per_jiffy);
preempt_disable();
- cpu_callin_map[cpu] = 1;
if (smp_ops->setup_cpu)
smp_ops->setup_cpu(cpu);
@@ -739,6 +738,14 @@ void start_secondary(void *unused)
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
+ /*
+ * CPU must be marked active and online before we signal back to the
+ * master, because the scheduler needs to see the cpu_online and
+ * cpu_active bits set.
+ */
+ smp_wmb();
+ cpu_callin_map[cpu] = 1;
+
local_irq_enable();
cpu_startup_entry(CPUHP_ONLINE);
--
2.1.0
On 12/08/2014 09:54 PM, Steven Rostedt wrote:
> On Mon, 8 Dec 2014 14:27:01 +1100
> Anton Blanchard <[email protected]> wrote:
>
>> I have a busy ppc64le KVM box where guests sometimes hit the infamous
>> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>>
>> BUG_ON(td->cpu != smp_processor_id());
>>
>> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
>> output confirms it:
>>
>> CPU: 0
>> Comm: watchdog/130
>>
>> The issue is in kthread_bind where we set the cpus_allowed mask, but do
>> not touch task_thread_info(p)->cpu. The scheduler assumes the previously
>> scheduled CPU is in the cpus_allowed mask, but in this case we are
>> moving a thread to another CPU so it is not.
>>
>
> Does this happen always on boot up, and always with the watchdog thread?
>
> I followed the logic that starts the watchdog threads.
>
> watchdog_enable_all_cpus()
>   smpboot_register_percpu_thread() {
>     for_each_online_cpu(cpu) { ... }
>   }
>
> watchdog_enable_all_cpus() can be called by lockup_detector_init()
> before SMP is started, but also by proc_dowatchdog(), which is called
> by the sysctl commands (after SMP is up and running).
>
> I noticed there's no "get_online_cpus()" anywhere, although the
> unregister_percpu_thread() has it. Is it possible that we created a
> thread on a CPU that wasn't fully online yet?
>
> Perhaps the following patch is needed? Even if this isn't the solution
> to this bug, it is probably needed as watchdog_enable_all_cpus() can be
> called after boot up too.
>
> -- Steve
Hi, Steven, tglx
See this https://lkml.org/lkml/2014/7/30/804
"[PATCH] smpboot: add missing get_online_cpus() when register"
Thanks,
Lai
>
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index eb89e1807408..60d35ac5d3f1 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -279,6 +279,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
> unsigned int cpu;
> int ret = 0;
>
> + get_online_cpus();
> mutex_lock(&smpboot_threads_lock);
> for_each_online_cpu(cpu) {
> ret = __smpboot_create_thread(plug_thread, cpu);
> @@ -291,6 +292,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
> list_add(&plug_thread->list, &hotplug_threads);
> out:
> mutex_unlock(&smpboot_threads_lock);
> + put_online_cpus();
> return ret;
> }
> EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
On Mon, Dec 8, 2014 at 3:58 PM, Anton Blanchard <[email protected]> wrote:
> Hi Ingo,
>
>> At that point I thought the previous task_cpu() was somewhat ingrained
>> in the scheduler and came up with the patch. If not, we could go on a
>> hunt to see what else needs fixing.
>
> I had another look. The scheduler does indeed make assumptions about the
> previous task_cpu, but we have a hammer to fix it up called
> select_fallback_rq.
>
> I annotated select_fallback_rq, and did hit a case where the CPU was
> not active. ppc64 patch below.
Anton, I'll assume I will get this through the usual powerpc pull requests?
> I think x86 has a similar (although harder to hit) issue.
Ingo?
Linus
On Tue, 9 Dec 2014, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 3:58 PM, Anton Blanchard <[email protected]> wrote:
> > Hi Ingo,
> >
> >> At that point I thought the previous task_cpu() was somewhat ingrained
> >> in the scheduler and came up with the patch. If not, we could go on a
> >> hunt to see what else needs fixing.
> >
> > I had another look. The scheduler does indeed make assumptions about the
> > previous task_cpu, but we have a hammer to fix it up called
> > select_fallback_rq.
> >
> > I annotated select_fallback_rq, and did hit a case where the CPU was
> > not active. ppc64 patch below.
>
> Anton, I'll assume I will get this through the usual powerpc pull requests?
>
> > I think x86 has a similar (although harder to hit) issue.
Indeed way harder to hit:
CPU 0                            CPU 1

                                 set_cpu_online(1, true) {
while (!cpu_online(cpu1))            cpumask_set_cpu(1, to_cpumask(cpu_online_bits));
	relax();

wakeup_thread_on_cpu1();
                                     cpumask_set_cpu(1, to_cpumask(cpu_active_bits));
On bare metal probably impossible, but on virt it should be
observable. Fix is simple.
Thanks,
tglx
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 668d8f2a8781..534f3384f03f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -222,7 +222,6 @@ static void notrace start_secondary(void *unused)
lock_vector_lock();
set_cpu_online(smp_processor_id(), true);
unlock_vector_lock();
- per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
x86_platform.nmi_init();
/* enable local interrupts */
@@ -234,6 +233,7 @@ static void notrace start_secondary(void *unused)
x86_cpuinit.setup_percpu_clockev();
wmb();
+ per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
cpu_startup_entry(CPUHP_ONLINE);
}
@@ -932,7 +932,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
check_tsc_sync_source(cpu);
local_irq_restore(flags);
- while (!cpu_online(cpu)) {
+ while (per_cpu(cpu_state,cpu) != CPU_ONLINE) {
cpu_relax();
touch_nmi_watchdog();
}
On Tue, 2014-12-09 at 12:54 -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 3:58 PM, Anton Blanchard <[email protected]> wrote:
> > Hi Ingo,
> >
> >> At that point I thought the previous task_cpu() was somewhat ingrained
> >> in the scheduler and came up with the patch. If not, we could go on a
> >> hunt to see what else needs fixing.
> >
> > I had another look. The scheduler does indeed make assumptions about the
> > previous task_cpu, but we have a hammer to fix it up called
> > select_fallback_rq.
> >
> > I annotated select_fallback_rq, and did hit a case where the CPU was
> > not active. ppc64 patch below.
>
> Anton, I'll assume I will get this through the usual powerpc pull requests?
Yeah I'll put it in my tree unless Anton objects.
cheers