2013-04-05 21:43:26

by Dave Hansen

[permalink] [raw]
Subject: kernel BUG at kernel/smpboot.c:134!

Hey Thomas,

I seem to be running in to smpboot_thread_fn()'s

BUG_ON(td->cpu != smp_processor_id());

pretty regularly, both at boot and if I boot with maxcpus=x and then
online the CPUs from sysfs after boot. It's a 160-logical-cpu system,
so it's quite a beast. I _seem_ to be hitting it more often at higher
cpu counts, but it doesn't trigger on bringing up a particular CPU as
far as I can tell.

This is on a pull of mainline from today, e0a77f263. Any ideas?

> [ 790.223270] ------------[ cut here ]------------
> [ 790.223966] kernel BUG at kernel/smpboot.c:134!
> [ 790.224739] invalid opcode: 0000 [#1] SMP
> [ 790.225671] Modules linked in:
> [ 790.226428] CPU 81
> [ 790.226909] Pid: 3909, comm: migration/135 Tainted: G W 3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [ 790.228775] RIP: 0010:[<ffffffff8110bee8>] [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
> [ 790.230205] RSP: 0018:ffff88bfef9c1e08 EFLAGS: 00010202
> [ 790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
> [ 790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
> [ 790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
> [ 790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
> [ 790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
> [ 790.234004] FS: 0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
> [ 790.234918] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
> [ 790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
> [ 790.238321] Stack:
> [ 790.238882] ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
> [ 790.245415] ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
> [ 790.250755] 0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
> [ 790.253365] Call Trace:
> [ 790.254121] [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
> [ 790.255428] [<ffffffff810ff1df>] kthread+0xef/0x100
> [ 790.256071] [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
> [ 790.256697] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 790.257325] [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
> [ 790.258233] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48
> [ 790.276178] RIP [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
> [ 790.276735] RSP <ffff88bfef9c1e08>
> [ 790.278348] ---[ end trace 84baa2bee1434240 ]---


2013-04-06 07:15:20

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

Hi Dave,

On 04/06/2013 03:13 AM, Dave Hansen wrote:
> Hey Thomas,
>
> I seem to be running in to smpboot_thread_fn()'s
>
> BUG_ON(td->cpu != smp_processor_id());
>
> pretty regularly, both at boot and if I boot with maxcpus=x and then
> online the CPUs from sysfs after boot. It's a 160-logical-cpu system,
> so it's quite a beast. I _seem_ to be hitting it more often at higher
> cpu counts, but it doesn't trigger on bringing up a particular CPU as
> far as I can tell.
>
> This is on a pull of mainline from today, e0a77f263. Any ideas?
>

Dave Jones had reported a similar problem some time back and Hillf had
proposed a fix. I guess it slipped through the cracks and never went
upstream.

Here is the link: https://lkml.org/lkml/2013/1/19/1

Can you please try it and see if it improves anything?

Regards,
Srivatsa S. Bhat

>> [ 790.223270] ------------[ cut here ]------------
>> [ 790.223966] kernel BUG at kernel/smpboot.c:134!
>> [ 790.224739] invalid opcode: 0000 [#1] SMP
>> [ 790.225671] Modules linked in:
>> [ 790.226428] CPU 81
>> [ 790.226909] Pid: 3909, comm: migration/135 Tainted: G W 3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
>> [ 790.228775] RIP: 0010:[<ffffffff8110bee8>] [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
>> [ 790.230205] RSP: 0018:ffff88bfef9c1e08 EFLAGS: 00010202
>> [ 790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
>> [ 790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
>> [ 790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
>> [ 790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
>> [ 790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
>> [ 790.234004] FS: 0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
>> [ 790.234918] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
>> [ 790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
>> [ 790.238321] Stack:
>> [ 790.238882] ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
>> [ 790.245415] ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
>> [ 790.250755] 0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
>> [ 790.253365] Call Trace:
>> [ 790.254121] [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
>> [ 790.255428] [<ffffffff810ff1df>] kthread+0xef/0x100
>> [ 790.256071] [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
>> [ 790.256697] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
>> [ 790.257325] [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
>> [ 790.258233] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
>> [ 790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48
>> [ 790.276178] RIP [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
>> [ 790.276735] RSP <ffff88bfef9c1e08>
>> [ 790.278348] ---[ end trace 84baa2bee1434240 ]---
>
>

2013-04-06 08:31:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Sat, 6 Apr 2013, Srivatsa S. Bhat wrote:

> Hi Dave,
>
> On 04/06/2013 03:13 AM, Dave Hansen wrote:
> > Hey Thomas,
> >
> > I seem to be running in to smpboot_thread_fn()'s
> >
> > BUG_ON(td->cpu != smp_processor_id());

That should be WARN_ON of course. Stupid me.

> > pretty regularly, both at boot and if I boot with maxcpus=x and then
> > online the CPUs from sysfs after boot. It's a 160-logical-cpu system,
> > so it's quite a beast. I _seem_ to be hitting it more often at higher
> > cpu counts, but it doesn't trigger on bringing up a particular CPU as
> > far as I can tell.
> >
> > This is on a pull of mainline from today, e0a77f263. Any ideas?
> >
>
> Dave Jones had reported a similar problem some time back and Hillf had
> proposed a fix. I guess it slipped through the cracks and never went
> upstream.
>
> Here is the link: https://lkml.org/lkml/2013/1/19/1

This is Hillfs proposed patch:

> --- a/kernel/kthread.c Sat Jan 19 13:03:52 2013
> +++ b/kernel/kthread.c Sat Jan 19 13:17:54 2013
> @@ -306,6 +306,7 @@ struct task_struct *kthread_create_on_cp
> return p;
> set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
> to_kthread(p)->cpu = cpu;
> + __kthread_bind(p, cpu);
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> return p;

That's bogus. Simply because when we create the thread then the thread
status is HP_THREAD_NONE and the path with the BUG_ON is only entered
with status == HP_THREAD_ACTIVE:

if (ht->park && td->status == HP_THREAD_ACTIVE) {

So in Dave's case the thread was already created and has entered
active state.

> >> [ 790.226909] Pid: 3909, comm: migration/135 Tainted: G W 3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB

Hmm, it's the migration thread which trips over this. Oh joy!

Dave, does the issue reproduce with function tracing enabled? For a
first shot it's probably enough to filter on smpboot_* functions plus
sched_switch and sched_wakeup events.

Thanks,

tglx

2013-04-07 09:20:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Sat, 6 Apr 2013, Thomas Gleixner wrote:
> This is Hillfs proposed patch:
>
> > --- a/kernel/kthread.c Sat Jan 19 13:03:52 2013
> > +++ b/kernel/kthread.c Sat Jan 19 13:17:54 2013
> > @@ -306,6 +306,7 @@ struct task_struct *kthread_create_on_cp
> > return p;
> > set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
> > to_kthread(p)->cpu = cpu;
> > + __kthread_bind(p, cpu);
> > /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> > kthread_park(p);
> > return p;
>
> That's bogus. Simply because when we create the thread then the thread

And it's even more bogus because the cpu to which we would bind in
kthread_create_on_cpu() is not yet online.

Thanks,

tglx

2013-04-07 09:50:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> And it's even more bogus because the cpu to which we would bind in
> kthread_create_on_cpu() is not yet online.

In case you guys are wondering about reproducibility, I saw the same
BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
:-))

And yes, this was without kvm (software emulation only in qemu).

[ 0.395000] ------------[ cut here ]------------
[ 0.395000] kernel BUG at kernel/smpboot.c:134!
[ 0.395000] invalid opcode: 0000 [#1] PREEMPT SMP
[ 0.395000] Modules linked in:
[ 0.395000] Pid: 13, comm: ksoftirqd/1 Not tainted 3.9.0-rc5+ #31 Bochs Bochs
[ 0.395000] EIP: 0060:[<c106542c>] EFLAGS: 00000202 CPU: 0
[ 0.395000] EIP is at smpboot_thread_fn+0x20c/0x260
[ 0.395000] EAX: 00000000 EBX: f60ba430 ECX: 00000000 EDX: 00000001
[ 0.395000] ESI: f60024f0 EDI: c1736480 EBP: f60c1f24 ESP: f60c1f04
[ 0.395000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 0.395000] CR0: 8005003b CR2: ffe17000 CR3: 01943000 CR4: 000006e0
[ 0.395000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 0.395000] DR6: 00000000 DR7: 00000000
[ 0.395000] Process ksoftirqd/1 (pid: 13, ti=f60c0000 task=f60ba430 task.ti=f60c0000)
[ 0.395000] Stack:
[ 0.395000] f60ba430 00000001 00000000 f60c1f24 c105c49d f6083e7c f60024f0 c1065220
[ 0.395000] f60c1fac c105c745 00000001 00000001 f60024f0 00000000 c1510303 dead4ead
[ 0.395000] ffffffff ffffffff c1b510ac c1c28450 00000000 c167e791 00000000 c151744f
[ 0.395000] Call Trace:
[ 0.395000] [<c105c49d>] ? __kthread_parkme+0x2d/0x70
[ 0.395000] [<c1065220>] ? lg_lock_init+0x20/0x20
[ 0.395000] [<c105c745>] kthread+0x95/0xa0
[ 0.395000] [<c1510303>] ? dump_pagetable+0xa/0x101
[ 0.395000] [<c151744f>] ? wait_for_common+0xcf/0x110
[ 0.395000] [<c1060000>] ? __remove_hrtimer+0x90/0xa0
[ 0.395000] [<c151a477>] ret_from_kernel_thread+0x1b/0x28
[ 0.395000] [<c105c6b0>] ? __init_kthread_worker+0x60/0x60
[ 0.395000] Code: 0f a3 02 19 d2 85 d2 0f 95 c2 0f b6 d2 ff d1 89 f0 e8 69 25 0b 00 8d 64 24 14 31 c0 5b 5e 5f 5d c3 e8 d9 2b 4b 00 e9 68
fe ff ff <0f> 0b 66 90 e8 cb 2b 4b 00 e9 ac fe ff ff 8d b6 00 00 00 00 e8
[ 0.395000] EIP: [<c106542c>] smpboot_thread_fn+0x20c/0x260 SS:ESP 0068:f60c1f04
[ 0.403466] ---[ end trace f5696bc4e08aa88f ]---

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-04-08 09:24:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Sun, 7 Apr 2013, Borislav Petkov wrote:

> On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> > And it's even more bogus because the cpu to which we would bind in
> > kthread_create_on_cpu() is not yet online.
>
> In case you guys are wondering about reproducibility, I saw the same
> BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
> :-))

I don't have to ask. It's caused by massive brain failure on my side.

Does the patch below fix the issue ?

Thanks,

tglx
---
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -131,8 +131,14 @@ static void __kthread_parkme(struct kthr
schedule();
__set_current_state(TASK_INTERRUPTIBLE);
}
- clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
+ clear_bit(KTHREAD_IS_PARKED, &self->flags);
+
+ /* Rebind ourself to the target cpu */
+ if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
+ set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));
+ current->flags |= PF_THREAD_BOUND;
+ }
}

void kthread_parkme(void)
@@ -344,11 +350,8 @@ void kthread_unpark(struct task_struct *
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_process(k);
- }
}
put_task_struct(k);
}

2013-04-08 11:56:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Mon, Apr 08, 2013 at 11:24:14AM +0200, Thomas Gleixner wrote:
> On Sun, 7 Apr 2013, Borislav Petkov wrote:
>
> > On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> > > And it's even more bogus because the cpu to which we would bind in
> > > kthread_create_on_cpu() is not yet online.
> >
> > In case you guys are wondering about reproducibility, I saw the same
> > BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
> > :-))
>
> I don't have to ask.

Yeah, I actually meant "Don't ask why I'm testing 32-bit atom smp qemu
guests." :-)

> ---
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -131,8 +131,14 @@ static void __kthread_parkme(struct kthr
> schedule();
> __set_current_state(TASK_INTERRUPTIBLE);
> }
> - clear_bit(KTHREAD_IS_PARKED, &self->flags);
> __set_current_state(TASK_RUNNING);
> + clear_bit(KTHREAD_IS_PARKED, &self->flags);
> +
> + /* Rebind ourself to the target cpu */
> + if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
> + set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));

"currrent" is a typo, I fixed it up to the obvious "current" so that it
builds.

> + current->flags |= PF_THREAD_BOUND;
> + }
> }
>
> void kthread_parkme(void)
> @@ -344,11 +350,8 @@ void kthread_unpark(struct task_struct *
> * park before that happens we'd see the IS_PARKED bit
> * which might be about to be cleared.
> */
> - if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> - if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> - __kthread_bind(k, kthread->cpu);
> + if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
> wake_up_process(k);
> - }
> }
> put_task_struct(k);

However, it needs more love, see below.

The thing is, I had to do a couple of runs to actually trigger it again
which means, it does not fire *always*. It seems that increasing the
core count for the qemu guest increases the hit rate. Anyway, this is
the whole qemu command:

$ qemu-system-i386 -gdb tcp::1234 -cpu n270,+movbe -hda /home/boris/kvm/debian/sid-i386.img -name "Debian i386:1235" -boot menu=off,order=c -m 2048 -localtime -net nic -net user,hostfwd=tcp::1235-:22 -usbdevice tablet -kernel /w/kernel/linux-2.6/arch/x86/boot/bzImage -append "root=/dev/sda1 debug ignore_loglevel console=ttyS0,115200 console=tty0" -serial file:/home/boris/kvm/test-i386.log -snapshot -smp 4

and it needs a patched qemu with MOVBE support.

Yep, 4 cores on a single-core atom N270. This doesn't really exist as a
hw platform but it seems to boot. With your patch applied, I get:

[ 0.000000] Linux version 3.9.0-rc5+ (boris@pd) (gcc version 4.7.2 (Debian 4.7.2-5) ) #5 SMP PREEMPT Mon Apr 8 13:46:14 CEST 2013
[ 0.000000] Atom PSE erratum detected, BIOS microcode update recommended
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007fffdfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007fffe000-0x000000007fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] debug: ignoring loglevel setting.
[ 0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
[ 0.000000] SMBIOS 2.4 present.
[ 0.000000] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x7fffe max_arch_pfn = 0x100000
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 080000000 mask 080000000 uncachable
[ 0.000000] 1 disabled
[ 0.000000] 2 disabled
[ 0.000000] 3 disabled
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[ 0.000000] initial memory mapped: [mem 0x00000000-0x023fffff]
[ 0.000000] Base memory trampoline at [c009b000] 9b000 size 16384
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] [mem 0x00000000-0x000fffff] page 4k
[ 0.000000] init_memory_mapping: [mem 0x37000000-0x373fffff]
[ 0.000000] [mem 0x37000000-0x373fffff] page 4k
[ 0.000000] BRK [0x020b3000, 0x020b3fff] PGTABLE
[ 0.000000] init_memory_mapping: [mem 0x30000000-0x36ffffff]
[ 0.000000] [mem 0x30000000-0x36ffffff] page 4k
[ 0.000000] BRK [0x020b4000, 0x020b4fff] PGTABLE
[ 0.000000] BRK [0x020b5000, 0x020b5fff] PGTABLE
[ 0.000000] BRK [0x020b6000, 0x020b6fff] PGTABLE
[ 0.000000] BRK [0x020b7000, 0x020b7fff] PGTABLE
[ 0.000000] init_memory_mapping: [mem 0x00100000-0x2fffffff]
[ 0.000000] [mem 0x00100000-0x2fffffff] page 4k
[ 0.000000] init_memory_mapping: [mem 0x37400000-0x377fdfff]
[ 0.000000] [mem 0x37400000-0x377fdfff] page 4k
[ 0.000000] ACPI: RSDP 000fd8d0 00014 (v00 BOCHS )
[ 0.000000] ACPI: RSDT 7fffe380 00034 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: FACP 7fffff80 00074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.000000] ACPI: DSDT 7fffe3c0 011A9 (v01 BXPC BXDSDT 00000001 INTL 20100528)
[ 0.000000] ACPI: FACS 7fffff40 00040
[ 0.000000] ACPI: SSDT 7ffff6e0 00858 (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: APIC 7ffff5b0 00090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
[ 0.000000] ACPI: HPET 7ffff570 00038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] 1160MB HIGHMEM available.
[ 0.000000] 887MB LOWMEM available.
[ 0.000000] mapped low ram: 0 - 377fe000
[ 0.000000] low ram: 0 - 377fe000
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] Normal [mem 0x01000000-0x377fdfff]
[ 0.000000] HighMem [mem 0x377fe000-0x7fffdfff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009efff]
[ 0.000000] node 0: [mem 0x00100000-0x7fffdfff]
[ 0.000000] On node 0 totalpages: 524188
[ 0.000000] DMA zone: 32 pages used for memmap
[ 0.000000] DMA zone: 0 pages reserved
[ 0.000000] DMA zone: 3998 pages, LIFO batch:0
[ 0.000000] Normal zone: 1744 pages used for memmap
[ 0.000000] Normal zone: 223230 pages, LIFO batch:31
[ 0.000000] HighMem zone: 2320 pages used for memmap
[ 0.000000] HighMem zone: 296960 pages, LIFO batch:31
[ 0.000000] Using APIC driver default
[ 0.000000] ACPI: PM-Timer IO Port: 0xb008
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[ 0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 0 already used, trying 4
[ 0.000000] IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[ 0.000000] ACPI: IRQ0 used by override.
[ 0.000000] ACPI: IRQ2 used by override.
[ 0.000000] ACPI: IRQ5 used by override.
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] ACPI: IRQ10 used by override.
[ 0.000000] ACPI: IRQ11 used by override.
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] nr_irqs_gsi: 40
[ 0.000000] PM: Registered nosave memory: 000000000009f000 - 00000000000a0000
[ 0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000f0000
[ 0.000000] PM: Registered nosave memory: 00000000000f0000 - 0000000000100000
[ 0.000000] e820: [mem 0x80000000-0xfffbffff] available for PCI devices
[ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:4 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 334 pages/cpu @f6178000 s1343936 r0 d24128 u1368064
[ 0.000000] pcpu-alloc: s1343936 r0 d24128 u1368064 alloc=334*4096
[ 0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 522412
[ 0.000000] Kernel command line: root=/dev/sda1 debug ignore_loglevel console=ttyS0,115200 console=tty0
[ 0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[ 0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[ 0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[ 0.000000] __ex_table already sorted, skipping sort
[ 0.000000] Initializing CPU#0
[ 0.000000] Initializing HighMem for node 0 (000377fe:0007fffe)
[ 0.000000] Memory: 2056060k/2097144k available (5202k kernel code, 40692k reserved, 2447k data, 1756k init, 1187840k highmem)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] fixmap : 0xfff17000 - 0xfffff000 ( 928 kB)
[ 0.000000] pkmap : 0xff800000 - 0xffc00000 (4096 kB)
[ 0.000000] vmalloc : 0xf7ffe000 - 0xff7fe000 ( 120 MB)
[ 0.000000] lowmem : 0xc0000000 - 0xf77fe000 ( 887 MB)
[ 0.000000] .init : 0xc1779000 - 0xc1930000 (1756 kB)
[ 0.000000] .data : 0xc1514bfa - 0xc1778940 (2447 kB)
[ 0.000000] .text : 0xc1000000 - 0xc1514bfa (5202 kB)
[ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[ 0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[ 0.000000] Preemptible hierarchical RCU implementation.
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 0.000000] RCU lockdep checking is enabled.
[ 0.000000] Additional per-CPU info printed with stalls.
[ 0.000000] RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
[ 0.000000] NR_IRQS:2304 nr_irqs:712 16
[ 0.000000] CPU 0 irqstacks, hard=f5c08000 soft=f5c0a000
[ 0.000000] Console: colour VGA+ 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] console [ttyS0] enabled
[ 0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 0.000000] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.000000] ... MAX_LOCK_DEPTH: 48
[ 0.000000] ... MAX_LOCKDEP_KEYS: 8191
[ 0.000000] ... CLASSHASH_SIZE: 4096
[ 0.000000] ... MAX_LOCKDEP_ENTRIES: 16384
[ 0.000000] ... MAX_LOCKDEP_CHAINS: 32768
[ 0.000000] ... CHAINHASH_SIZE: 16384
[ 0.000000] memory used by lock dependency info: 3823 kB
[ 0.000000] per task-struct memory footprint: 1920 bytes
[ 0.000000] ------------------------
[ 0.000000] | Locking API testsuite:
[ 0.000000] ----------------------------------------------------------------------------
[ 0.000000] | spin |wlock |rlock |mutex | wsem | rsem |
[ 0.000000] --------------------------------------------------------------------------
[ 0.000000] A-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-B-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] double unlock: ok | ok | ok | ok | ok | ok |
[ 0.000000] initialize held: ok | ok | ok | ok | ok | ok |
[ 0.000000] bad unlock order: ok | ok | ok | ok | ok | ok |
[ 0.000000] --------------------------------------------------------------------------
[ 0.000000] recursive read-lock: | ok | | ok |
[ 0.000000] recursive read-lock #2: | ok | | ok |
[ 0.000000] mixed read-write-lock: | ok | | ok |
[ 0.000000] mixed write-read-lock: | ok | | ok |
[ 0.000000] --------------------------------------------------------------------------
[ 0.000000] hard-irqs-on + irq-safe-A/12: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/12: ok | ok | ok |
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/132: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/132: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/213: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/213: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/231: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/231: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/312: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/312: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/321: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/321: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/123: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/132: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/132: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/213: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/213: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/231: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/231: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/312: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/312: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #2/321: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #2/321: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/123: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/123: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/132: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/132: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/213: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/213: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/231: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/231: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/312: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/312: ok | ok | ok |
[ 0.000000] hard-irq lock-inversion/321: ok | ok | ok |
[ 0.000000] soft-irq lock-inversion/321: ok | ok | ok |
[ 0.000000] hard-irq read-recursion/123: ok |
[ 0.000000] soft-irq read-recursion/123: ok |
[ 0.000000] hard-irq read-recursion/132: ok |
[ 0.000000] soft-irq read-recursion/132: ok |
[ 0.000000] hard-irq read-recursion/213: ok |
[ 0.000000] soft-irq read-recursion/213: ok |
[ 0.000000] hard-irq read-recursion/231: ok |
[ 0.000000] soft-irq read-recursion/231: ok |
[ 0.000000] hard-irq read-recursion/312: ok |
[ 0.000000] soft-irq read-recursion/312: ok |
[ 0.000000] hard-irq read-recursion/321: ok |
[ 0.000000] soft-irq read-recursion/321: ok |
[ 0.000000] -------------------------------------------------------
[ 0.000000] Good, all 218 testcases passed! |
[ 0.000000] ---------------------------------
[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration failed
[ 0.000000] tsc: Unable to calibrate against PIT
[ 0.000000] tsc: using HPET reference calibration
[ 0.000000] tsc: Detected 4013.467 MHz processor
[ 0.016000] Calibrating delay loop (skipped), value calculated using timer frequency.. 8026.93 BogoMIPS (lpj=4013467)
[ 0.019168] pid_max: default: 32768 minimum: 301
[ 0.031115] Mount-cache hash table entries: 512
[ 0.108223] Atom PSE erratum detected, BIOS microcode update recommended
[ 0.116305] mce: CPU supports 10 MCE banks
[ 0.127569] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.127569] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.127569] tlb_flushall_shift: 6
[ 0.142361] Freeing SMP alternatives: 20k freed
[ 0.152465] ACPI: Core revision 20130117
[ 0.406397] ACPI: All ACPI Tables successfully acquired
[ 0.419468] ftrace: allocating 19508 entries in 39 pages
[ 0.810447] Enabling APIC mode: Flat. Using 1 I/O APICs
[ 0.841000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.852146] smpboot: CPU0: Intel(R) Atom(TM) CPU N270 @ 1.60GHz (fam: 06, model: 1c, stepping: 02)
[ 0.862000] APIC calibration not consistent with PM-Timer: 111ms instead of 100ms
[ 0.862000] APIC delta adjusted to PM-Timer: 6249993 (6937930)
[ 0.868380] Performance Events: unsupported p6 CPU model 28 no PMU driver, software events only.
[ 0.932068] SMP alternatives: lockdep: fixing up alternatives
[ 0.933327] CPU 1 irqstacks, hard=f5cf4000 soft=f5cf6000
[ 0.934073] smpboot: Booting Node 0, Processors #1[ 0.016000] Initializing CPU#1
[ 0.016000] Atom PSE erratum detected, BIOS microcode update recommended

[ 1.156192] SMP alternatives: lockdep: fixing up alternatives
[ 1.161141] CPU 2 irqstacks, hard=f5d08000 soft=f5d0a000
[ 1.162733] #2[ 0.016000] Initializing CPU#2
[ 0.016000] Atom PSE erratum detected, BIOS microcode update recommended
[ 0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[ 0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.

[ 1.522125] SMP alternatives: lockdep: fixing up alternatives
[ 1.525834] CPU 3 irqstacks, hard=f5d28000 soft=f5d2a000
[ 1.527101] #3 OK
[ 0.016000] Initializing CPU#3
[ 0.016000] Atom PSE erratum detected, BIOS microcode update recommended
[ 0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[ 0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.
[ 1.635184] Brought up 4 CPUs
[ 1.636126] smpboot: Total of 4 processors activated (28540.37 BogoMIPS)
[ 1.639142] ------------[ cut here ]------------
[ 1.640000] WARNING: at kernel/sched/core.c:963 set_task_cpu+0x12c/0x280()
[ 1.640000] Hardware name: Bochs
[ 1.640000] Modules linked in:
[ 1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5
[ 1.640000] Call Trace:
[ 1.640000] [<c1035633>] warn_slowpath_common+0x73/0xb0
[ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[ 1.640000] [<c1035693>] warn_slowpath_null+0x23/0x30
[ 1.640000] [<c106a6cc>] set_task_cpu+0x12c/0x280
[ 1.640000] [<c1068e2b>] ? update_rq_clock+0x3b/0x70
[ 1.640000] [<c1068ee1>] ? dequeue_task+0x81/0xc0
[ 1.640000] [<c106a922>] __migrate_task+0x102/0x140
[ 1.640000] [<c106a960>] ? __migrate_task+0x140/0x140
[ 1.640000] [<c106a982>] migration_cpu_stop+0x22/0x30
[ 1.640000] [<c10a0a88>] cpu_stopper_thread+0x68/0x130
[ 1.640000] [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
[ 1.640000] [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
[ 1.640000] [<c106eaee>] ? local_clock+0x4e/0x60
[ 1.640000] [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
[ 1.640000] [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
[ 1.640000] [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
[ 1.640000] [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[ 1.640000] [<c1063f06>] smpboot_thread_fn+0x176/0x260
[ 1.640000] [<c1063d90>] ? lg_lock_init+0x20/0x20
[ 1.640000] [<c105b305>] kthread+0x95/0xa0
[ 1.640000] [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
[ 1.640000] [<c151077f>] ? wait_for_common+0xcf/0x110
[ 1.640000] [<c1060000>] ? hrtimer_run_pending+0x110/0x110
[ 1.640000] [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[ 1.640000] [<c105b270>] ? __kthread_parkme+0xa0/0xa0
[ 1.640000]
[ 1.640000] =================================
[ 1.640000] [ INFO: inconsistent lock state ]
[ 1.640000] 3.9.0-rc5+ #5 Not tainted
[ 1.640000] ---------------------------------
[ 1.640000] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1.640000] migration/0/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1.640000] (&p->pi_lock){?.-...}, at: [<c106a86e>] __migrate_task+0x4e/0x140
[ 1.640000] {IN-HARDIRQ-W} state was registered at:
[ 1.640000] [<c109123d>] __lock_acquire+0x77d/0x1910
[ 1.640000] [<c10929c3>] lock_acquire+0x83/0x1b0
[ 1.640000] [<c15121f5>] _raw_spin_lock_irqsave+0x55/0x90
[ 1.640000] [<c106c635>] try_to_wake_up+0x25/0x270
[ 1.640000] [<c106c8bf>] wake_up_process+0x1f/0x40
[ 1.640000] [<c105e7ed>] hrtimer_wakeup+0x1d/0x30
[ 1.640000] [<c105ebfd>] __run_hrtimer+0x7d/0x3c0
[ 1.640000] [<c1060077>] hrtimer_run_queues+0x77/0x170
[ 1.640000] [<c104787d>] run_local_timers+0xd/0x20
[ 1.640000] [<c10478c3>] update_process_times+0x33/0x70
[ 1.640000] [<c108911a>] tick_periodic+0x2a/0x80
[ 1.640000] [<c10891ee>] tick_handle_periodic+0x1e/0x80
[ 1.640000] [<c1513e95>] smp_apic_timer_interrupt+0x55/0x87
[ 1.640000] [<c1513352>] apic_timer_interrupt+0x32/0x38
[ 1.640000] [<c103e6f6>] irq_exit+0x86/0xa0
[ 1.640000] [<c1513e9a>] smp_apic_timer_interrupt+0x5a/0x87
[ 1.640000] [<c1513352>] apic_timer_interrupt+0x32/0x38
[ 1.640000] [<c105f686>] __hrtimer_start_range_ns+0x1b6/0x560
[ 1.640000] [<c105fa87>] hrtimer_start_range_ns+0x27/0x30
[ 1.640000] [<c1510251>] schedule_hrtimeout_range_clock+0xb1/0x170
[ 1.640000] [<c1510328>] schedule_hrtimeout_range+0x18/0x20
[ 1.640000] [<c1510341>] schedule_hrtimeout+0x11/0x20
[ 1.640000] [<c106ab40>] wait_task_inactive+0x1b0/0x1d0
[ 1.640000] [<c105b620>] kthread_bind+0x20/0x70
[ 1.640000] [<c10530d3>] create_worker+0x1d3/0x200
[ 1.640000] [<c17892be>] init_workqueues+0x235/0x401
[ 1.640000] [<c1000202>] do_one_initcall+0x112/0x160
[ 1.640000] [<c17799e7>] kernel_init_freeable+0x58/0x184
[ 1.640000] [<c14fd0e1>] kernel_init+0x11/0xe0
[ 1.640000] [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[ 1.640000] irq event stamp: 410
[ 1.640000] hardirqs last enabled at (409): [<c1512b67>] _raw_spin_unlock_irq+0x27/0x50
[ 1.640000] hardirqs last disabled at (410): [<c106a971>] migration_cpu_stop+0x11/0x30
[ 1.640000] softirqs last enabled at (394): [<c103e3a6>] __do_softirq+0x136/0x340
[ 1.640000] softirqs last disabled at (389): [<c103e6f6>] irq_exit+0x86/0xa0
[ 1.640000]
[ 1.640000] other info that might help us debug this:
[ 1.640000] Possible unsafe locking scenario:
[ 1.640000]
[ 1.640000] CPU0
[ 1.640000] ----
[ 1.640000] lock(&p->pi_lock);
[ 1.640000] <Interrupt>
[ 1.640000] lock(&p->pi_lock);
[ 1.640000]
[ 1.640000] *** DEADLOCK ***
[ 1.640000]
[ 1.640000] 4 locks held by migration/0/8:
[ 1.640000] #0: (&p->pi_lock){?.-...}, at: [<c106a86e>] __migrate_task+0x4e/0x140
[ 1.640000] #1: (&rq->lock){-.-...}, at: [<c106a8d7>] __migrate_task+0xb7/0x140
[ 1.640000] #2: (&rq->lock/1){......}, at: [<c106a8e3>] __migrate_task+0xc3/0x140
[ 1.640000] #3: (nonblocking_pool.lock){......}, at: [<c12fd41b>] extract_buf+0x3b/0x140
[ 1.640000]
[ 1.640000] stack backtrace:
[ 1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5
[ 1.640000] Call Trace:
[ 1.640000] [<c150b5c4>] print_usage_bug.part.34+0x239/0x242
[ 1.640000] [<c108ee53>] mark_lock+0x4e3/0x5f0
[ 1.640000] [<c106eaee>] ? local_clock+0x4e/0x60
[ 1.640000] [<c108ded0>] ? print_shortest_lock_dependencies+0x1c0/0x1c0
[ 1.640000] [<c108efd5>] mark_held_locks+0x75/0xe0
[ 1.640000] [<c1001dd5>] ? math_state_restore+0x85/0x160
[ 1.640000] [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[ 1.640000] [<c108f0de>] trace_hardirqs_on_caller+0x9e/0x1b0
[ 1.640000] [<c108f1fc>] trace_hardirqs_on+0xc/0x10
[ 1.640000] [<c1001dd5>] math_state_restore+0x85/0x160
[ 1.640000] [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[ 1.640000] [<c1002b15>] do_device_not_available+0x15/0x30
[ 1.640000] [<c1513593>] error_code+0x5f/0x64
[ 1.640000] [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[ 1.640000] [<c14fbee3>] ? sha_transform+0x33/0x1090
[ 1.640000] [<c12fd41b>] ? extract_buf+0x3b/0x140
[ 1.640000] [<c12fd443>] extract_buf+0x63/0x140
[ 1.640000] [<c1512b01>] ? _raw_spin_unlock_irqrestore+0x31/0x70
[ 1.640000] [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[ 1.640000] [<c1065316>] ? __wake_up+0x46/0x60
[ 1.640000] [<c12fd2d4>] ? account+0xa4/0x1b0
[ 1.640000] [<c1003968>] ? dump_trace+0x78/0xf0
[ 1.640000] [<c12fe64b>] extract_entropy+0x7b/0x1b0
[ 1.640000] [<c12fe919>] ? get_random_bytes+0x29/0x30
[ 1.640000] [<c1004c00>] ? show_trace+0x20/0x30
[ 1.640000] [<c12fe919>] get_random_bytes+0x29/0x30
[ 1.640000] [<c103530f>] init_oops_id+0x3f/0x50
[ 1.640000] [<c1035591>] print_oops_end_marker+0x11/0x40
[ 1.640000] [<c1035638>] warn_slowpath_common+0x78/0xb0
[ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[ 1.640000] [<c1035693>] warn_slowpath_null+0x23/0x30
[ 1.640000] [<c106a6cc>] set_task_cpu+0x12c/0x280
[ 1.640000] [<c1068e2b>] ? update_rq_clock+0x3b/0x70
[ 1.640000] [<c1068ee1>] ? dequeue_task+0x81/0xc0
[ 1.640000] [<c106a922>] __migrate_task+0x102/0x140
[ 1.640000] [<c106a960>] ? __migrate_task+0x140/0x140
[ 1.640000] [<c106a982>] migration_cpu_stop+0x22/0x30
[ 1.640000] [<c10a0a88>] cpu_stopper_thread+0x68/0x130
[ 1.640000] [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
[ 1.640000] [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
[ 1.640000] [<c106eaee>] ? local_clock+0x4e/0x60
[ 1.640000] [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
[ 1.640000] [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
[ 1.640000] [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
[ 1.640000] [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[ 1.640000] [<c1063f06>] smpboot_thread_fn+0x176/0x260
[ 1.640000] [<c1063d90>] ? lg_lock_init+0x20/0x20
[ 1.640000] [<c105b305>] kthread+0x95/0xa0
[ 1.640000] [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
[ 1.640000] [<c151077f>] ? wait_for_common+0xcf/0x110
[ 1.640000] [<c1060000>] ? hrtimer_run_pending+0x110/0x110
[ 1.640000] [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[ 1.640000] [<c105b270>] ? __kthread_parkme+0xa0/0xa0

HTH and thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-04-08 12:18:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kernel BUG at kernel/smpboot.c:134!

On Mon, 8 Apr 2013, Borislav Petkov wrote:
> On Mon, Apr 08, 2013 at 11:24:14AM +0200, Thomas Gleixner wrote:
> > + /* Rebind ourself to the target cpu */
> > + if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
> > + set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));
>
> "currrent" is a typo, I fixed it up to the obvious "current" so that it
> builds.

Bah.

> [ 1.640000] WARNING: at kernel/sched/core.c:963 set_task_cpu+0x12c/0x280()
> [ 1.640000] Hardware name: Bochs
> [ 1.640000] Modules linked in:
> [ 1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5

Hmm, crap migration thread trying to migrate itself via stomp
machine. Not really a brilliant idea.

/me goes back to drawing board.

> [ 1.640000] Call Trace:
> [ 1.640000] [<c1035633>] warn_slowpath_common+0x73/0xb0
> [ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
> [ 1.640000] [<c106a6cc>] ? set_task_cpu+0x12c/0x280
> [ 1.640000] [<c1035693>] warn_slowpath_null+0x23/0x30
> [ 1.640000] [<c106a6cc>] set_task_cpu+0x12c/0x280
> [ 1.640000] [<c1068e2b>] ? update_rq_clock+0x3b/0x70
> [ 1.640000] [<c1068ee1>] ? dequeue_task+0x81/0xc0
> [ 1.640000] [<c106a922>] __migrate_task+0x102/0x140
> [ 1.640000] [<c106a960>] ? __migrate_task+0x140/0x140
> [ 1.640000] [<c106a982>] migration_cpu_stop+0x22/0x30
> [ 1.640000] [<c10a0a88>] cpu_stopper_thread+0x68/0x130
> [ 1.640000] [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
> [ 1.640000] [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
> [ 1.640000] [<c106eaee>] ? local_clock+0x4e/0x60
> [ 1.640000] [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
> [ 1.640000] [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
> [ 1.640000] [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
> [ 1.640000] [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
> [ 1.640000] [<c1063f06>] smpboot_thread_fn+0x176/0x260
> [ 1.640000] [<c1063d90>] ? lg_lock_init+0x20/0x20
> [ 1.640000] [<c105b305>] kthread+0x95/0xa0
> [ 1.640000] [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
> [ 1.640000] [<c151077f>] ? wait_for_common+0xcf/0x110
> [ 1.640000] [<c1060000>] ? hrtimer_run_pending+0x110/0x110
> [ 1.640000] [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
> [ 1.640000] [<c105b270>] ? __kthread_parkme+0xa0/0xa0

2013-04-09 14:38:38

by Thomas Gleixner

[permalink] [raw]
Subject: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The only rock solid solution to this problem is to prevent wakeups in
park state which are not from unpark(). That way we can guarantee that
the association of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitely for the unpark wakeup.

Reported-by: Dave Jones <[email protected]>
Reported-by: Dave Hansen <[email protected]>
Reported-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
fs/proc/array.c | 1 +
include/linux/sched.h | 5 +++--
kernel/kthread.c | 38 +++++++++++++++++++++-----------------
3 files changed, 25 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_arr
"x (dead)", /* 64 */
"K (wakekill)", /* 128 */
"W (waking)", /* 256 */
+ "P (parked)", /* 512 */
};

static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_PARKED 512
+#define TASK_STATE_MAX 1024

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t

static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
return NULL;
}

+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * We clear the IS_PARKED bit here as we don't wait
+ * until the task has left the park code. So if we'd
+ * park before that happens we'd see the IS_PARKED bit
+ * which might be about to be cleared.
+ */
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu);
+ wake_up_state(k, TASK_PARKED);
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
{
struct kthread *kthread = task_get_live_kthread(k);

- if (kthread) {
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- /*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
- */
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
- wake_up_process(k);
- }
- }
+ if (kthread)
+ __kthread_unpark(k, kthread);
put_task_struct(k);
}

@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
trace_sched_kthread_stop(k);
if (kthread) {
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ __kthread_unpark(k, kthread);
wake_up_process(k);
wait_for_completion(&kthread->exited);
}

2013-04-09 15:55:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Hey Thomas,

I don't think the patch helped my case. Looks like the same BUG_ON().

I accidentally booted with possible_cpus=10 instead of 160. I wasn't
able to trigger this in that case, even repeatedly on/offlining them.
But, once I booted with possible_cpus=160, it triggered in a jiffy.

Two oopses below (bottom one has cpu numbers):

> [ 467.106219] ------------[ cut here ]------------
> [ 467.106400] kernel BUG at kernel/smpboot.c:134!
> [ 467.106556] invalid opcode: 0000 [#1] SMP
> [ 467.106831] Modules linked in:
> [ 467.107039] CPU 0
> [ 467.107109] Pid: 3095, comm: migration/115 Tainted: G W 3.9.0-rc6-00020-g84ee980-dirty #132 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [ 467.107507] RIP: 0010:[<ffffffff8110bed8>] [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [ 467.107820] RSP: 0018:ffff887ff0561e08 EFLAGS: 00010202
> [ 467.107980] RAX: 0000000000000000 RBX: ffff887ff04ef010 RCX: 000000000000b888
> [ 467.108142] RDX: ffff887ff0561fd8 RSI: ffff881ffda00000 RDI: 0000000000000073
> [ 467.108303] RBP: ffff887ff0561e38 R08: 0000000000000001 R09: 0000000000000000
> [ 467.108465] R10: 0000000000000018 R11: 0000000000000000 R12: ffff887ff053c5c0
> [ 467.108629] R13: ffffffff81e587a0 R14: ffff887ff053c5c0 R15: 0000000000000000
> [ 467.108791] FS: 0000000000000000(0000) GS:ffff881ffda00000(0000) knlGS:0000000000000000
> [ 467.109037] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 467.109194] CR2: 000000000117c278 CR3: 0000000001e0b000 CR4: 00000000000007f0
> [ 467.109357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 467.109519] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 467.109684] Process migration/115 (pid: 3095, threadinfo ffff887ff0560000, task ffff887ff053c5c0)
> [ 467.109930] Stack:
> [ 467.110075] ffff887ff0561e38 0000000000000000 ffff881fe60adcc0 ffff887ff0561ec0
> [ 467.110580] ffff887ff04ef010 ffffffff8110bc80 ffff887ff0561f48 ffffffff810ff1df
> [ 467.111075] 0000000000000001 ffff881f00000073 ffff887ff04ef010 ffff887f00000001
> [ 467.111568] Call Trace:
> [ 467.111726] [<ffffffff8110bc80>] ? __smpboot_create_thread+0x180/0x180
> [ 467.111893] [<ffffffff810ff1df>] kthread+0xef/0x100
> [ 467.112057] [<ffffffff8110e340>] ? complete+0x30/0x80
> [ 467.112216] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 467.112386] [<ffffffff819db99c>] ret_from_fork+0x7c/0xb0
> [ 467.112548] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 467.112708] Code: ef 3d 01 01 48 89 df e8 c7 af 16 00 48 83 05 97 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 c0 ef 3d 01 01 <0f> 0b 48 83 05 c6 ef 3d 01 01 48 83 05 86 ef 3d 01 01 0f 0b 48
> [ 467.117014] RIP [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [ 467.117233] RSP <ffff887ff0561e08>
> [ 467.117414] ---[ end trace d851dfb0bce51ca2 ]---

Here's the same oops, but with the line numbers munged because I added
some printks:

> [ 161.551788] smpboot_thread_fn():
> [ 161.551807] td->cpu: 132
> [ 161.551808] smp_processor_id(): 121
> [ 161.551811] comm: migration/%u
> [ 161.551840] ------------[ cut here ]------------
> [ 161.551939] kernel BUG at kernel/smpboot.c:149!
> [ 161.552030] invalid opcode: 0000 [#1] SMP
> [ 161.552255] Modules linked in:
> [ 161.552397] CPU 121
> [ 161.552474] Pid: 2957, comm: migration/132 Tainted: G W 3.9.0-rc6-00020-g84ee980-dirty #136 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [ 161.552655] RIP: 0010:[<ffffffff8110bf29>] [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [ 161.552852] RSP: 0018:ffff88bff0403de8 EFLAGS: 00010202
> [ 161.552935] RAX: 0000000000000079 RBX: ffff88bff02ac070 RCX: 0000000000000006
> [ 161.553025] RDX: 0000000000000007 RSI: 0000000000000007 RDI: ffff889ffec0d190
> [ 161.553115] RBP: ffff88bff0403e38 R08: 0000000000000001 R09: 0000000000000001
> [ 161.553204] R10: 0000000000000000 R11: 0000000000000b09 R12: ffff88bff04745c0
> [ 161.553319] R13: ffffffff81e587a0 R14: ffffffff8110bb20 R15: ffff88bff04745c0
> [ 161.553411] FS: 0000000000000000(0000) GS:ffff889ffec00000(0000) knlGS:0000000000000000
> [ 161.553534] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 161.553619] CR2: 00007f0c4155c6d0 CR3: 0000000001e0b000 CR4: 00000000000007e0
> [ 161.553709] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 161.553799] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 161.553889] Process migration/132 (pid: 2957, threadinfo ffff88bff0402000, task ffff88bff04745c0)
> [ 161.554156] Stack:
> [ 161.554312] ffffffff8110bb20 ffff88bff04745c0 ffff88bff0403e08 0000000000000000
> [ 161.554839] ffff88bff0403e38 ffff881fef323cc0 ffff88bff0403ec0 ffff88bff02ac070
> [ 161.555370] ffffffff8110bb20 0000000000000000 ffff88bff0403f48 ffffffff810ff08f
> [ 161.555891] Call Trace:
> [ 161.556055] [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [ 161.556230] [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [ 161.556409] [<ffffffff810ff08f>] kthread+0xef/0x100
> [ 161.556590] [<ffffffff819d5154>] ? wait_for_completion+0x124/0x180
> [ 161.556761] [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [ 161.556982] [<ffffffff819e59dc>] ret_from_fork+0x7c/0xb0
> [ 161.557148] [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [ 161.557316] Code: 05 e4 f1 3d 01 01 e8 2b cf 8b 00 48 83 05 df f1 3d 01 01 65 8b 04 25 64 b0 00 00 39 03 0f 84 0c fd ff ff 48 83 05 cf f1 3d 01 01 <0f> 0b 48 83 05 cd f1 3d 01 01 0f 1f 44 00 00 b9 8b 00 00 00 48
> [ 161.561934] RIP [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [ 161.562171] RSP <ffff88bff0403de8>
> [ 161.562352] ---[ end trace 6a3b5261afedf7da ]---

2013-04-09 18:43:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Dave,

On Tue, 9 Apr 2013, Dave Hansen wrote:

> Hey Thomas,
>
> I don't think the patch helped my case. Looks like the same BUG_ON().
>
> I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.

Darn. We should have merged Paul McKenneys NR_CPUS=0 patch last year
and we would have avoided the whole shebang.

> > [ 161.551788] smpboot_thread_fn():
> > [ 161.551807] td->cpu: 132
> > [ 161.551808] smp_processor_id(): 121
> > [ 161.551811] comm: migration/%u
> > [ 161.551840] ------------[ cut here ]------------
> > [ 161.551939] kernel BUG at kernel/smpboot.c:149!

Any chance, that you get a trace for that?

Thanks,

tglx

2013-04-09 19:30:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Dave,
>
> On Tue, 9 Apr 2013, Dave Hansen wrote:
>
> > Hey Thomas,
> >
> > I don't think the patch helped my case. Looks like the same BUG_ON().
> >
> > I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> > able to trigger this in that case, even repeatedly on/offlining them.
> > But, once I booted with possible_cpus=160, it triggered in a jiffy.
>
> Darn. We should have merged Paul McKenneys NR_CPUS=0 patch last year
> and we would have avoided the whole shebang.
>
> > > [ 161.551788] smpboot_thread_fn():
> > > [ 161.551807] td->cpu: 132
> > > [ 161.551808] smp_processor_id(): 121
> > > [ 161.551811] comm: migration/%u
> > > [ 161.551840] ------------[ cut here ]------------
> > > [ 161.551939] kernel BUG at kernel/smpboot.c:149!
>
> Any chance, that you get a trace for that?

Thought more about it and found, that the stupid binding only works
when the task is really descheduled. So there is a small window left,
which could lead to this. Revised patch below.

Anyway a trace for that issue would be appreciated nevertheless.

Thanks,

tglx
---
Subject: kthread: Prevent unpark race which puts threads on the wrong cpu
From: Thomas Gleixner <[email protected]>
Date: Tue, 09 Apr 2013 09:33:34 +0200

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The only rock solid solution to this problem is to prevent wakeups in
park mode which are not from unpark(). That way we can guarantee that
the association of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitely for the unpark wakeup.

Reported-by: Dave Jones <[email protected]>
Reported-by: Dave Hansen <[email protected]>
Reported-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
fs/proc/array.c | 1
include/linux/sched.h | 5 ++--
kernel/kthread.c | 52 ++++++++++++++++++++++++++------------------------
3 files changed, 32 insertions(+), 26 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_arr
"x (dead)", /* 64 */
"K (wakekill)", /* 128 */
"W (waking)", /* 256 */
+ "P (parked)", /* 512 */
};

static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_PARKED 512
+#define TASK_STATE_MAX 1024

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t

static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_no
}
EXPORT_SYMBOL(kthread_create_on_node);

-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, state)) {
+ WARN_ON(1);
+ return;
+ }
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_s
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- __kthread_bind(p, cpu);
+ __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(kthread_bind);

@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
return NULL;
}

+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * We clear the IS_PARKED bit here as we don't wait
+ * until the task has left the park code. So if we'd
+ * park before that happens we'd see the IS_PARKED bit
+ * which might be about to be cleared.
+ */
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ wake_up_state(k, TASK_PARKED);
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
{
struct kthread *kthread = task_get_live_kthread(k);

- if (kthread) {
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- /*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
- */
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
- wake_up_process(k);
- }
- }
+ if (kthread)
+ __kthread_unpark(k, kthread);
put_task_struct(k);
}

@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
trace_sched_kthread_stop(k);
if (kthread) {
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ __kthread_unpark(k, kthread);
wake_up_process(k);
wait_for_completion(&kthread->exited);
}

2013-04-09 20:38:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Thought more about it and found, that the stupid binding only works
> when the task is really descheduled. So there is a small window left,
> which could lead to this. Revised patch below.
>
> Anyway a trace for that issue would be appreciated nevertheless.

Here you go:

http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz

It oopsed in exit.c:

https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink

2013-04-09 20:54:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/09/2013 01:38 PM, Dave Hansen wrote:
> It oopsed in exit.c:
>
> https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink

This was just secondary fallout after the first BUG_ON(). This exit.c
thing isn't a new issue.

2013-04-10 08:30:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Tue, 9 Apr 2013, Dave Hansen wrote:

> On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > Thought more about it and found, that the stupid binding only works
> > when the task is really descheduled. So there is a small window left,
> > which could lead to this. Revised patch below.
> >
> > Anyway a trace for that issue would be appreciated nevertheless.
>
> Here you go:
>
> http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz

Hmm. Unfortunately migration/146 is not in the trace.

Can you please apply the patch below? That avoids the oops, but might
hang an online operation. Though the machine should stay up and you
should be able to retrieve the trace.

Thanks,

tglx
---
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
continue;
}

- BUG_ON(td->cpu != smp_processor_id());
+ if (td->cpu != smp_processor_id()) {
+ tracing_off();
+ schedule();
+ }

/* Check for state change setup */
switch (td->status) {

2013-04-10 10:51:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Wed, 10 Apr 2013, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Dave Hansen wrote:
>
> > On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > > Thought more about it and found, that the stupid binding only works
> > > when the task is really descheduled. So there is a small window left,
> > > which could lead to this. Revised patch below.
> > >
> > > Anyway a trace for that issue would be appreciated nevertheless.
> >
> > Here you go:
> >
> > http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz
>
> Hmm. Unfortunately migration/146 is not in the trace.
>
> Can you please apply the patch below? That avoids the oops, but might
> hang an online operation. Though the machine should stay up and you
> should be able to retrieve the trace.
>
> Thanks,
>
> tglx
> ---
> Index: linux-2.6/kernel/smpboot.c
> ===================================================================
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
> continue;
> }
>
> - BUG_ON(td->cpu != smp_processor_id());
> + if (td->cpu != smp_processor_id()) {
> + tracing_off();
> + schedule();

Bah, that wants a continue. Revised patch below.

> + }
>
> /* Check for state change setup */
> switch (td->status) {

Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,11 @@ static int smpboot_thread_fn(void *data)
continue;
}

- BUG_ON(td->cpu != smp_processor_id());
+ if (td->cpu != smp_processor_id()) {
+ tracing_off();
+ schedule();
+ continue;
+ }

/* Check for state change setup */
switch (td->status) {

2013-04-10 14:06:25

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()

Hi Dave,

On 04/09/2013 09:25 PM, Dave Hansen wrote:
> Hey Thomas,
>
> I don't think the patch helped my case. Looks like the same BUG_ON().
>
> I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.
>
> Two oopses below (bottom one has cpu numbers):
>

So here is a little bit of a hunch + guess work ;) Could you kindly test
this patch on top of mainline and let me know if it helps?

Regards,
Srivatsa S. Bhat

-------------------------------------------------------------------------->

From: Srivatsa S. Bhat <[email protected]>
Subject: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()

Dave Hansen and Borislav Petkov reported the following crash, which
corresponds the following BUG_ON in smpboot_thread_fn():

BUG_ON(td->cpu != smp_processor_id());

[ 790.223270] ------------[ cut here ]------------
[ 790.223966] kernel BUG at kernel/smpboot.c:134!
[ 790.224739] invalid opcode: 0000 [#1] SMP
[ 790.225671] Modules linked in:
[ 790.226428] CPU 81
[ 790.226909] Pid: 3909, comm: migration/135 Tainted: G W 3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
[ 790.228775] RIP: 0010:[<ffffffff8110bee8>] [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[ 790.230205] RSP: 0018:ffff88bfef9c1e08 EFLAGS: 00010202
[ 790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
[ 790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
[ 790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
[ 790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
[ 790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
[ 790.234004] FS: 0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
[ 790.234918] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
[ 790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
[ 790.238321] Stack:
[ 790.238882] ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
[ 790.245415] ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
[ 790.250755] 0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
[ 790.253365] Call Trace:
[ 790.254121] [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
[ 790.255428] [<ffffffff810ff1df>] kthread+0xef/0x100
[ 790.256071] [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
[ 790.256697] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[ 790.257325] [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
[ 790.258233] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[ 790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48
[ 790.276178] RIP [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[ 790.276735] RSP <ffff88bfef9c1e08>
[ 790.278348] ---[ end trace 84baa2bee1434240 ]---


Interestingly, in every single stack trace, the crashing task is the migration
thread. Now, migration thread belongs to the highest priority stop_task sched
class, and this particular sched class is very unique in the way it implements
its internal sched class functions, and I suspect this has a lot of bearing
on how functions like kthread_bind(), wake_up_process() etc react with it
(by looking at how it implements its functions such as select_task_rq(),
enqueue_task(), dequeue_task() etc).

Previous to commit 14e568e (stop_machine: Use smpboot threads), the migration
threads were set to stop_task sched class *after* binding them to the
appropriate cpus. But that commit reversed that sequence inadvertently.
So revert back to the old sequence to fix this issue.

But note that __kthread_bind() can wake up the task if the task is an RT
task. So it can be called only when the CPU (to which we want to bind the task)
is already online. So add a new function called kthread_unpark_prepare() and
call it in smpboot_unpark_thread(), to bind the kthread to the appropriate CPU.
And in stop-machine, move the sched class initialization to ->pre_unpark().
This way, it can be ensured that for stop-task class tasks (such as migration
thread), the sched class is set *after* binding the task to the CPU (and
also, the kthreads belonging to other sched classes will continue to function
properly). Also, teach sched_set_stop_task() to ignore repeated invocations
with the same task as argument.

Reported-by: Dave Jones <[email protected]>
Reported-by: Dave Hansen <[email protected]>
Reported-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/kthread.h | 1 +
kernel/kthread.c | 24 ++++++++++++++++++++----
kernel/sched/core.c | 3 +++
kernel/smpboot.c | 2 ++
kernel/stop_machine.c | 7 +------
5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d81664..7552eb0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -44,6 +44,7 @@ bool kthread_should_park(void);
bool kthread_freezable_should_stop(bool *was_frozen);
void *kthread_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
+void kthread_unpark_prepare(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..e3325cc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -325,6 +325,25 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
}

/**
+ * kthread_unpark_prepare - prepare to unpark a thread created by
+ * kthread_create().
+ * @k: thread created by kthread_create().
+ *
+ * If the thread is marked per-cpu, it is bound to the cpu, in
+ * preparation for waking it up in the near future.
+ */
+void kthread_unpark_prepare(struct task_struct *k)
+{
+ struct kthread *kthread = task_get_live_kthread(k);
+
+ if (kthread) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu);
+ }
+ put_task_struct(k);
+}
+
+/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
*
@@ -344,11 +363,8 @@ void kthread_unpark(struct task_struct *k)
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_process(k);
- }
}
put_task_struct(k);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..7370fad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -814,6 +814,9 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
struct task_struct *old_stop = cpu_rq(cpu)->stop;

+ if (stop == old_stop)
+ return;
+
if (stop) {
/*
* Make it appear like a SCHED_FIFO task, its something
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..bd1a9175 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -209,6 +209,8 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cp
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);

+ kthread_unpark_prepare(tsk);
+
if (ht->pre_unpark)
ht->pre_unpark(cpu);
kthread_unpark(tsk);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..db32f3a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -300,11 +300,6 @@ repeat:

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

-static void cpu_stop_create(unsigned int cpu)
-{
- sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
-}
-
static void cpu_stop_park(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
@@ -323,6 +318,7 @@ static void cpu_stop_unpark(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);

+ sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
spin_lock_irq(&stopper->lock);
stopper->enabled = true;
spin_unlock_irq(&stopper->lock);
@@ -333,7 +329,6 @@ static struct smp_hotplug_thread cpu_stop_threads = {
.thread_should_run = cpu_stop_should_run,
.thread_fn = cpu_stopper_thread,
.thread_comm = "migration/%u",
- .create = cpu_stop_create,
.setup = cpu_stop_unpark,
.park = cpu_stop_park,
.pre_unpark = cpu_stop_unpark,

2013-04-10 19:42:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

I think I got a full trace this time:

http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz

The last timestamp is pretty close to the timestamp on the console:

[ 2071.033434] smpboot_thread_fn():
[ 2071.033455] smpboot_thread_fn() cpu: 22 159
[ 2071.033470] td->cpu: 22
[ 2071.033475] smp_processor_id(): 21
[ 2071.033486] comm: migration/%u

2013-04-11 08:11:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()

On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:

> Interestingly, in every single stack trace, the crashing task is the migration
> thread. Now, migration thread belongs to the highest priority stop_task sched
> class, and this particular sched class is very unique in the way it implements
> its internal sched class functions, and I suspect this has a lot of bearing
> on how functions like kthread_bind(), wake_up_process() etc react with it
> (by looking at how it implements its functions such as select_task_rq(),
> enqueue_task(), dequeue_task() etc).

I don't think that's relevant. The migration thread can only be woken
via try_to_wakeup and my previous patch which implements a separate
task state makes sure that it cannot be woken accidentaly by anything
else than unpark.

> But note that __kthread_bind() can wake up the task if the task is an RT
> task. So it can be called only when the CPU (to which we want to bind the task)

kthread_bind() does NOT wakeup anything. It merily sets the cpus
allowed ptr without further ado.


Thanks,

tglx

2013-04-11 10:20:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Dave,

On Wed, 10 Apr 2013, Dave Hansen wrote:

> I think I got a full trace this time:
>
> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>
> The last timestamp is pretty close to the timestamp on the console:
>
> [ 2071.033434] smpboot_thread_fn():
> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> [ 2071.033470] td->cpu: 22
> [ 2071.033475] smp_processor_id(): 21
> [ 2071.033486] comm: migration/%u

Yes, that's helpful. Though it makes my mind boggle:

migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread^M

So the migration thread schedules out with state TASK_PARKED and gets
scheduled back in right away without a wakeup. Srivatsa was about
right, that this might be related to the sched_set_stop_task() call,
but the changelog led completely down the wrong road.

So the issue is:

CPU14 CPU21
create_thread(for CPU22)
park_thread()
wait_for_completion() park_me()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU21 and that confuses the hell out of the stop_task
class on that cpu. So as we have to synchronize the state for the
bind call (the issue observed by Borislav) we need to do the same
before issuing sched_set_stop_task(). Delta patch below.

Thanks,

tglx
---
Index: linux-2.6/include/trace/events/sched.h
===================================================================
--- linux-2.6.orig/include/trace/events/sched.h
+++ linux-2.6/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "W" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
}
get_task_struct(tsk);
*per_cpu_ptr(ht->store, cpu) = tsk;
- if (ht->create)
- ht->create(cpu);
+ if (ht->create) {
+ /*
+ * Make sure that the task has actually scheduled out
+ * into park position, before calling the create
+ * callback. At least the migration thread callback
+ * requires that the task is off the runqueue.
+ */
+ if (!wait_task_inactive(tsk, TASK_PARKED))
+ WARN_ON(1);
+ else
+ ht->create(cpu);
+ }
return 0;
}


2013-04-11 10:22:25

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()

On 04/11/2013 01:40 PM, Thomas Gleixner wrote:
> On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:
>
>> Interestingly, in every single stack trace, the crashing task is the migration
>> thread. Now, migration thread belongs to the highest priority stop_task sched
>> class, and this particular sched class is very unique in the way it implements
>> its internal sched class functions, and I suspect this has a lot of bearing
>> on how functions like kthread_bind(), wake_up_process() etc react with it
>> (by looking at how it implements its functions such as select_task_rq(),
>> enqueue_task(), dequeue_task() etc).
>
> I don't think that's relevant. The migration thread can only be woken
> via try_to_wakeup and my previous patch which implements a separate
> task state makes sure that it cannot be woken accidentaly by anything
> else than unpark.
>

Hmm, but it got to be simpler than that, no? Given that it used to work fine
before...

>> But note that __kthread_bind() can wake up the task if the task is an RT
>> task. So it can be called only when the CPU (to which we want to bind the task)
>
> kthread_bind() does NOT wakeup anything. It merily sets the cpus
> allowed ptr without further ado.
>

Sorry, I was mistaken and was carried away by a bug in the code I was testing.
I had intended to move kthread_bind() to the body of kthread_create_on_cpu()
and place it after the call to kthread_park(), as shown below:

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ wait_task_inactive(p, TASK_INTERRUPTIBLE);
+ __kthread_bind(p, cpu);
return p;
}

But by mistake, I had written the code as:

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ __kthread_bind(p, cpu);
return p;
}

So, no wonder it never actually bound the task to the CPU. So when I gave this a run,
I saw watchdog threads hitting the same BUG_ON(), and since watchdog threads are
of RT priority, and RT is the only class that implements ->set_cpus_allowed(), I
thought that those threads got woken up due to the bind. But I was mistaken of
course, because I had checked for the wrong return value of wait_task_inactive().

Regards,
Srivatsa S. Bhat

2013-04-11 10:51:31

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> Dave,
>
> On Wed, 10 Apr 2013, Dave Hansen wrote:
>
>> I think I got a full trace this time:
>>
>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>
>> The last timestamp is pretty close to the timestamp on the console:
>>
>> [ 2071.033434] smpboot_thread_fn():
>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>> [ 2071.033470] td->cpu: 22
>> [ 2071.033475] smp_processor_id(): 21
>> [ 2071.033486] comm: migration/%u
>
> Yes, that's helpful. Though it makes my mind boggle:
>
> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread^M
>
> So the migration thread schedules out with state TASK_PARKED and gets
> scheduled back in right away without a wakeup. Srivatsa was about
> right, that this might be related to the sched_set_stop_task() call,
> but the changelog led completely down the wrong road.
>
> So the issue is:
>
> CPU14 CPU21
> create_thread(for CPU22)
> park_thread()
> wait_for_completion() park_me()
> complete()
> sched_set_stop_task()
> schedule(TASK_PARKED)
>
> The sched_set_stop_task() call is issued while the task is on the
> runqueue of CPU21 and that confuses the hell out of the stop_task
> class on that cpu. So as we have to synchronize the state for the
> bind call (the issue observed by Borislav) we need to do the same
> before issuing sched_set_stop_task(). Delta patch below.
>

In that case, why not just apply this 2 line patch on mainline?
The patch I sent yesterday was more elaborate because I wrongly assumed
that kthread_bind() can cause a wakeup. But now, I feel the patch shown
below should work just fine too. Yeah, it binds the task during creation
as well as during unpark, but that should be ok (see below).

Somehow, I believe we can handle this issue without introducing that
whole TASK_PARKED thing..


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..74af4a8 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ wait_task_inactive(p, TASK_INTERRUPTIBLE);
+ __kthread_bind(p, cpu);
return p;
}


The reason why we can't get rid of the bind in the unpark code is because,
the threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE.
And during CPU_DOWN_PREPARE, the scheduler removes the CPU from the cpu_active_mask.
So on any subsequent wakeup of these threads before they are parked, the scheduler
will force migrate them to some other CPU (but alas it wont print this event
because of the p->mm != NULL check in select_fallback_rq()). So during unpark
during the next online operation we need to bind it again. But that's fine, IMHO.

Regards,
Srivatsa S. Bhat

2013-04-11 11:46:54

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
>> Dave,
>>
>> On Wed, 10 Apr 2013, Dave Hansen wrote:
>>
>>> I think I got a full trace this time:
>>>
>>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>>
>>> The last timestamp is pretty close to the timestamp on the console:
>>>
>>> [ 2071.033434] smpboot_thread_fn():
>>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>>> [ 2071.033470] td->cpu: 22
>>> [ 2071.033475] smp_processor_id(): 21
>>> [ 2071.033486] comm: migration/%u
>>
>> Yes, that's helpful. Though it makes my mind boggle:
>>
>> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
>> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
>> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
>> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread^M
>>
>> So the migration thread schedules out with state TASK_PARKED and gets
>> scheduled back in right away without a wakeup. Srivatsa was about
>> right, that this might be related to the sched_set_stop_task() call,
>> but the changelog led completely down the wrong road.
>>
>> So the issue is:
>>
>> CPU14 CPU21
>> create_thread(for CPU22)
>> park_thread()
>> wait_for_completion() park_me()
>> complete()
>> sched_set_stop_task()
>> schedule(TASK_PARKED)
>>
>> The sched_set_stop_task() call is issued while the task is on the
>> runqueue of CPU21 and that confuses the hell out of the stop_task
>> class on that cpu. So as we have to synchronize the state for the
>> bind call (the issue observed by Borislav) we need to do the same
>> before issuing sched_set_stop_task(). Delta patch below.
>>
>
> In that case, why not just apply this 2 line patch on mainline?
> The patch I sent yesterday was more elaborate because I wrongly assumed
> that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> below should work just fine too. Yeah, it binds the task during creation
> as well as during unpark, but that should be ok (see below).
>
> Somehow, I believe we can handle this issue without introducing that
> whole TASK_PARKED thing..
>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..74af4a8 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + wait_task_inactive(p, TASK_INTERRUPTIBLE);
> + __kthread_bind(p, cpu);
> return p;
> }
>

Oh wait a minute, its simpler than even that I believe! We don't need that
extra kthread_bind() also.

So, here is what is happening, from what I understand:
sched_set_stop_task() will make the newly created migration thread as the
stop task, during ht->create(). But looking into __sched_setscheduler(),
we observe that if p->on_rq is true (where p is the migration thread),
the task is enqueued on the rq by calling ->enqueue_task() of stop_task
sched class, which simply increments the rq->nr_running.

So, on the next call to schedule(), the pick_next_task() in core.c sees
that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
the stop_task sched class' ->pick_next_task(). And that will return the
stop task, which is nothing but this migration thread that we just created.
That's why the migration thread starts running prematurely, even before
it is bound, and even before its corresponding CPU is brought online.

So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
calling ht->create(), the task doesn't get queued to the runqueue with
stop task priority. So that's it - we simply wait till it gets off the
rq and everything should be fine. And during the subsequent online, during
the unpark phase, the task gets bound to the appropriate cpu and only
then woken up, as desired.

So Dave, could you kindly test the below patch on mainline?


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9558355 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ /*
+ * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+ * migration thread (which belongs to the stop_task sched class)
+ * doesn't run until the cpu is actually onlined and the thread is
+ * unparked.
+ */
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ WARN_ON(1);
return p;
}

Regards,
Srivatsa S. Bhat

2013-04-11 12:02:25

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/11/2013 05:13 PM, Srivatsa S. Bhat wrote:
[...]
> So Dave, could you kindly test the below patch on mainline?
>

BTW, you don't need to try out any of the previous patches that I sent,
just this one is good enough. Thanks!

Regards,
Srivatsa S. Bhat

>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9558355 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + /*
> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> + * migration thread (which belongs to the stop_task sched class)
> + * doesn't run until the cpu is actually onlined and the thread is
> + * unparked.
> + */
> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> + WARN_ON(1);
> return p;
> }
>

2013-04-11 12:52:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
> >> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
> >> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
> >> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread^M
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14 CPU21
> >> create_thread(for CPU22)
> >> park_thread()
> >> wait_for_completion() park_me()
> >> complete()
> >> sched_set_stop_task()
> >> schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> >
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> >
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..
> >
> >
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 691dc2e..74af4a8 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> > to_kthread(p)->cpu = cpu;
> > /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> > kthread_park(p);
> > +
> > + wait_task_inactive(p, TASK_INTERRUPTIBLE);
> > + __kthread_bind(p, cpu);
> > return p;
> > }
> >
>
> Oh wait a minute, its simpler than even that I believe! We don't need that
> extra kthread_bind() also.
>
> So, here is what is happening, from what I understand:
> sched_set_stop_task() will make the newly created migration thread as the
> stop task, during ht->create(). But looking into __sched_setscheduler(),
> we observe that if p->on_rq is true (where p is the migration thread),
> the task is enqueued on the rq by calling ->enqueue_task() of stop_task
> sched class, which simply increments the rq->nr_running.
>
> So, on the next call to schedule(), the pick_next_task() in core.c sees
> that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
> the stop_task sched class' ->pick_next_task(). And that will return the
> stop task, which is nothing but this migration thread that we just created.
> That's why the migration thread starts running prematurely, even before
> it is bound, and even before its corresponding CPU is brought online.
>
> So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
> calling ht->create(), the task doesn't get queued to the runqueue with
> stop task priority. So that's it - we simply wait till it gets off the
> rq and everything should be fine. And during the subsequent online, during
> the unpark phase, the task gets bound to the appropriate cpu and only
> then woken up, as desired.

Right, but I want to avoid that wait for the common case. In most of
the cases the newly created thread reaches PARKED and scheduled out
before we do bind or any other stuff.

Thanks,

tglx

2013-04-11 12:54:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:

> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
> >> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
> >> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
> >> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread^M
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14 CPU21
> >> create_thread(for CPU22)
> >> park_thread()
> >> wait_for_completion() park_me()
> >> complete()
> >> sched_set_stop_task()
> >> schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> >
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> >
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..

We need it as Borislav showed.

Thread is created and parked. Now after online we unpark the thread,
but something else wakes it before we reach the unpark code and boom
it's on the wrong cpu. We want that PARKED thing for
robustness. Anything else is just voodoo programming.

Thanks,

tglx

2013-04-11 13:46:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> The reason why we can't get rid of the bind in the unpark code is because,
> the threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE.
> And during CPU_DOWN_PREPARE, the scheduler removes the CPU from the cpu_active_mask.
> So on any subsequent wakeup of these threads before they are parked, the scheduler
> will force migrate them to some other CPU (but alas it wont print this event
> because of the p->mm != NULL check in select_fallback_rq()). So during unpark
> during the next online operation we need to bind it again. But that's fine, IMHO.

No, it's not fine. If we do not have TASK_PARKED a wakeup which
happens after onlining the task and races with the unpark code will
cause the task to end up on the wrong CPU. Borislav showed traces
which prove that with ksoftirqd. So no, TASK_PARKED is going to stay.

Thanks,

tglx

2013-04-11 18:07:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
> }
> get_task_struct(tsk);
> *per_cpu_ptr(ht->store, cpu) = tsk;
> - if (ht->create)
> - ht->create(cpu);
> + if (ht->create) {
> + /*
> + * Make sure that the task has actually scheduled out
> + * into park position, before calling the create
> + * callback. At least the migration thread callback
> + * requires that the task is off the runqueue.
> + */
> + if (!wait_task_inactive(tsk, TASK_PARKED))
> + WARN_ON(1);
> + else
> + ht->create(cpu);
> + }
> return 0;
> }

This one appears to be doing the trick. I'll run the cpus in an
online/offline loop for a bit and make sure it's stable. It's passed
several round so far, which is way more than it's done up to this point,
though.

2013-04-11 19:19:20

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
>
> CPU0 CPU1 CPU2
> unpark(T) wake_up_process(T)
> clear(SHOULD_PARK) T runs
> leave parkme() due to !SHOULD_PARK
> bind_to(CPU2) BUG_ON(wrong CPU)
>

OK, I must admit that I had missed noticing that the task was ksoftirqd
and not the migration thread in Boris' trace. And yes, this unexpected
wakeup is a problem for ksoftirqd.

> We cannot let the tasks move themself to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
>

Of course, we can't use set_cpus_allowed_ptr(), but we can still achieve
the desired bind effect without any race, see below.

> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
>
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.
>

Again, I think this is unnecessary. We are good as long as no one other
than the unpark code can kick the kthreads out of the loop in the park
code. Now that I understand the race you explained above, why not just
fix that race itself by reversing the ordering of clear(SHOULD_PARK)
and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
kthread, it will just remain confined to the park code, as intended.

A patch like below should do it IMHO. I guess I'm being a little too
persistent, sorry!


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9512fc5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ /*
+ * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+ * migration thread (which belongs to the stop_task sched class)
+ * doesn't run until the cpu is actually onlined and the thread is
+ * unparked.
+ */
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ WARN_ON(1);
return p;
}

@@ -324,6 +333,17 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
return NULL;
}

+static void __kthread_park(struct task_struct *k, struct kthread *kthread)
+{
+ if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ if (k != current) {
+ wake_up_process(k);
+ wait_for_completion(&kthread->parked);
+ }
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
struct kthread *kthread = task_get_live_kthread(k);

if (kthread) {
+ /*
+ * Per-cpu kthreads such as ksoftirqd can get woken up by
+ * other events. So after binding the thread, ensure that
+ * it goes off the CPU atleast once, by parking it again.
+ * This way, we can ensure that it will run on the correct
+ * CPU on subsequent wakeup.
+ */
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
+ __kthread_bind(k, kthread->cpu);
+ clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
+ __kthread_park(k, kthread);
+ }
+
clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+
/*
* We clear the IS_PARKED bit here as we don't wait
* until the task has left the park code. So if we'd
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_process(k);
- }
}
put_task_struct(k);
}
@@ -371,13 +402,7 @@ int kthread_park(struct task_struct *k)
int ret = -ENOSYS;

if (kthread) {
- if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- if (k != current) {
- wake_up_process(k);
- wait_for_completion(&kthread->parked);
- }
- }
+ __kthread_park(k, kthread);
ret = 0;
}
put_task_struct(k);

2013-04-11 19:48:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Thu, 11 Apr 2013, Dave Hansen wrote:
> On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> > --- linux-2.6.orig/kernel/smpboot.c
> > +++ linux-2.6/kernel/smpboot.c
> > @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
> > }
> > get_task_struct(tsk);
> > *per_cpu_ptr(ht->store, cpu) = tsk;
> > - if (ht->create)
> > - ht->create(cpu);
> > + if (ht->create) {
> > + /*
> > + * Make sure that the task has actually scheduled out
> > + * into park position, before calling the create
> > + * callback. At least the migration thread callback
> > + * requires that the task is off the runqueue.
> > + */
> > + if (!wait_task_inactive(tsk, TASK_PARKED))
> > + WARN_ON(1);
> > + else
> > + ht->create(cpu);
> > + }
> > return 0;
> > }
>
> This one appears to be doing the trick. I'll run the cpus in an
> online/offline loop for a bit and make sure it's stable. It's passed
> several round so far, which is way more than it's done up to this point,
> though.

I think the most critical part is bootup when the threads are
created. The online/offline one is an issue as well, which is covered
by the patches, but way harder to trigger.

Thanks,

tglx

2013-04-11 20:47:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Srivatsa,

On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> > Add a new task state (TASK_PARKED) which prevents other wakeups and
> > use this state explicitely for the unpark wakeup.
> >
>
> Again, I think this is unnecessary. We are good as long as no one other
> than the unpark code can kick the kthreads out of the loop in the park
> code. Now that I understand the race you explained above, why not just
> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
> kthread, it will just remain confined to the park code, as intended.

In theory.

> A patch like below should do it IMHO. I guess I'm being a little too
> persistent, sorry!

No it's not about being persistent, you're JUST too much into voodoo
programming instead of going for the straight forward and robust
solutions.

Darn, I hate it as much as everyone else to introduce a new task
state, but that state allows us to make guarantees and gives us
semantical clarity. A parked thread is parked and can only be woken up
by the unpark code. That's clear semantics and not a magic construct
which will work in most cases and for the remaining ones (See below)
it will give us problems which are way harder to decode than the ones
we tried to fix with that magic.

> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9512fc5 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + /*
> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> + * migration thread (which belongs to the stop_task sched class)
> + * doesn't run until the cpu is actually onlined and the thread is
> + * unparked.
> + */
> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> + WARN_ON(1);

Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
references outside the creation code. And then we _HOPE_ that nothing
wakes it up _BEFORE_ we do something else.

Aside of that, you are still insisting to enforce that for every per
cpu thread even if the only one which needs that at this point are
thos which have a create() callback (i.e. the migration thread). And
next week you figure out that this is a performance impact on bringing
up large machines....

> /**
> * kthread_unpark - unpark a thread created by kthread_create().
> * @k: thread created by kthread_create().
> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
> struct kthread *kthread = task_get_live_kthread(k);
>
> if (kthread) {
> + /*
> + * Per-cpu kthreads such as ksoftirqd can get woken up by
> + * other events. So after binding the thread, ensure that
> + * it goes off the CPU atleast once, by parking it again.
> + * This way, we can ensure that it will run on the correct
> + * CPU on subsequent wakeup.
> + */
> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> + __kthread_bind(k, kthread->cpu);
> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);

And how is that f*cking different from the previous code?

CPU0 CPU1 CPU2
wakeup(T) -> run on CPU1 (last cpu)

switch_to(T)

__kthread_bind(T, CPU2)

clear(KTHREAD_IS_PARKED)

leave loop due to !KTHREAD_IS_PARKED

BUG(wrong cpu) <--- VOODOO FAILURE

kthread_park(T) <-- VOODOO TOO LATE

You can turn around the order of clearing/setting the flags as much as
you want, I'm going to punch an hole in it.

TASK_PARKED is the very obvious and robust solution which fixes _ALL_
of the corner cases, at least as far as I can imagine them. And
robustness rules at least in my world.

Thanks,

tglx

2013-04-11 21:22:40

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> Srivatsa,
>
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
>>> Add a new task state (TASK_PARKED) which prevents other wakeups and
>>> use this state explicitely for the unpark wakeup.
>>>
>>
>> Again, I think this is unnecessary. We are good as long as no one other
>> than the unpark code can kick the kthreads out of the loop in the park
>> code. Now that I understand the race you explained above, why not just
>> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
>> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
>> kthread, it will just remain confined to the park code, as intended.
>
> In theory.
>
>> A patch like below should do it IMHO. I guess I'm being a little too
>> persistent, sorry!
>
> No it's not about being persistent, you're JUST too much into voodoo
> programming instead of going for the straight forward and robust
> solutions.
>
> Darn, I hate it as much as everyone else to introduce a new task
> state, but that state allows us to make guarantees and gives us
> semantical clarity. A parked thread is parked and can only be woken up
> by the unpark code. That's clear semantics and not a magic construct
> which will work in most cases and for the remaining ones (See below)
> it will give us problems which are way harder to decode than the ones
> we tried to fix with that magic.
>
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 691dc2e..9512fc5 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>> to_kthread(p)->cpu = cpu;
>> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>> kthread_park(p);
>> +
>> + /*
>> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>> + * migration thread (which belongs to the stop_task sched class)
>> + * doesn't run until the cpu is actually onlined and the thread is
>> + * unparked.
>> + */
>> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>> + WARN_ON(1);
>
> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> references outside the creation code.

I doubt that. We have not even onlined the CPU, how would any else even
_know_ that we created this kthread??

*per_cpu_ptr(ht->store, cpu) = tsk; is executed _after_ returning from
this function.

The problem with ksoftirqd is very clear - we unpark threads _after_ we
online the CPU. So, in between the 2 steps, somebody on that CPU can call
__do_softirq(), leading to the race you described in your cover-letter.
That's why I tried to fix that race.

> And then we _HOPE_ that nothing
> wakes it up _BEFORE_ we do something else.
>

Nothing can wake it up, because no one is aware of the newly created
kthread.

> Aside of that, you are still insisting to enforce that for every per
> cpu thread even if the only one which needs that at this point are
> thos which have a create() callback (i.e. the migration thread). And
> next week you figure out that this is a performance impact on bringing
> up large machines....
>

Making this wait call specific to those kthreads with the ->create callback
won't be that much of a big deal, IMHO. But see below, I'm not going to
insist on going with my suggestions.

>> /**
>> * kthread_unpark - unpark a thread created by kthread_create().
>> * @k: thread created by kthread_create().
>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>> struct kthread *kthread = task_get_live_kthread(k);
>>
>> if (kthread) {
>> + /*
>> + * Per-cpu kthreads such as ksoftirqd can get woken up by
>> + * other events. So after binding the thread, ensure that
>> + * it goes off the CPU atleast once, by parking it again.
>> + * This way, we can ensure that it will run on the correct
>> + * CPU on subsequent wakeup.
>> + */
>> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>> + __kthread_bind(k, kthread->cpu);
>> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
>
> And how is that f*cking different from the previous code?
>
> CPU0 CPU1 CPU2
> wakeup(T) -> run on CPU1 (last cpu)
>
> switch_to(T)
>
> __kthread_bind(T, CPU2)
>
> clear(KTHREAD_IS_PARKED)
>
> leave loop due to !KTHREAD_IS_PARKED

How?? The task will leave the loop only when we clear
SHOULD_PARK, not when we clear IS_PARKED. So it won't
leave the loop here. It will cause the kthread to
perform a fresh complete() for the waiting kthread_park()
on CPU0.
>
> BUG(wrong cpu) <--- VOODOO FAILURE
>
> kthread_park(T) <-- VOODOO TOO LATE
>

No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
ensure that the task gets *descheduled* atleast once after we did the
kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
migrate a running kthread (because the kthread could be the migration
thread). So instead, we use kthread_bind() and depend on sleep->wakeup
to put the task on the right CPU.

> You can turn around the order of clearing/setting the flags as much as
> you want, I'm going to punch an hole in it.
>
> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> of the corner cases, at least as far as I can imagine them. And
> robustness rules at least in my world.
>

Yes, I agree that it is robust and has clear semantics. No doubt about
that. So I won't insist on going with my suggestions.

Regards,
Srivatsa S. Bhat

2013-04-12 10:41:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

On Tue, 2013-04-09 at 16:38 +0200, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
>
> CPU0 CPU1 CPU2
> unpark(T) wake_up_process(T)
> clear(SHOULD_PARK) T runs
> leave parkme() due to !SHOULD_PARK
> bind_to(CPU2) BUG_ON(wrong CPU)
>
> We cannot let the tasks move themself to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
>
> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
>
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.

explicitly

Also, since the task state is visible to userspace and all the parked
tasks are still in the PID space, its a good hint in ps and friends
that these tasks aren't really there for the moment.

>
> Reported-by: Dave Jones <[email protected]>
> Reported-by: Dave Hansen <[email protected]>
> Reported-by: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Signed-off-by: Thomas Gleixner <[email protected]>

Assuming you're going to merge in the missing trace hunk you posted
further down the thread...

Acked-by: Peter Zijlstra <[email protected]>

> ---
> fs/proc/array.c | 1 +
> include/linux/sched.h | 5 +++--
> kernel/kthread.c | 38 +++++++++++++++++++++-----------------
> 3 files changed, 25 insertions(+), 19 deletions(-)
>
> Index: linux-2.6/fs/proc/array.c
> ===================================================================
> --- linux-2.6.orig/fs/proc/array.c
> +++ linux-2.6/fs/proc/array.c
> @@ -143,6 +143,7 @@ static const char * const task_state_arr
> "x (dead)", /* 64 */
> "K (wakekill)", /* 128 */
> "W (waking)", /* 256 */
> + "P (parked)", /* 512 */
> };
>
> static inline const char *get_task_state(struct task_struct *tsk)
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
> #define TASK_DEAD 64
> #define TASK_WAKEKILL 128
> #define TASK_WAKING 256
> -#define TASK_STATE_MAX 512
> +#define TASK_PARKED 512
> +#define TASK_STATE_MAX 1024
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
>
> extern char ___assert_task_state[1 - 2*!!(
> sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
>
> static void __kthread_parkme(struct kthread *self)
> {
> - __set_current_state(TASK_INTERRUPTIBLE);
> + __set_current_state(TASK_PARKED);
> while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
> complete(&self->parked);
> schedule();
> - __set_current_state(TASK_INTERRUPTIBLE);
> + __set_current_state(TASK_PARKED);
> }
> clear_bit(KTHREAD_IS_PARKED, &self->flags);
> __set_current_state(TASK_RUNNING);
> @@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
> return NULL;
> }
>
> +static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
> +{
> + clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> + /*
> + * We clear the IS_PARKED bit here as we don't wait
> + * until the task has left the park code. So if we'd
> + * park before that happens we'd see the IS_PARKED bit
> + * which might be about to be cleared.
> + */
> + if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> + __kthread_bind(k, kthread->cpu);
> + wake_up_state(k, TASK_PARKED);
> + }
> +}
> +
> /**
> * kthread_unpark - unpark a thread created by kthread_create().
> * @k: thread created by kthread_create().
> @@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
> {
> struct kthread *kthread = task_get_live_kthread(k);
>
> - if (kthread) {
> - clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> - /*
> - * We clear the IS_PARKED bit here as we don't wait
> - * until the task has left the park code. So if we'd
> - * park before that happens we'd see the IS_PARKED bit
> - * which might be about to be cleared.
> - */
> - if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> - if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> - __kthread_bind(k, kthread->cpu);
> - wake_up_process(k);
> - }
> - }
> + if (kthread)
> + __kthread_unpark(k, kthread);
> put_task_struct(k);
> }
>
> @@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
> trace_sched_kthread_stop(k);
> if (kthread) {
> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
> - clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> + __kthread_unpark(k, kthread);
> wake_up_process(k);
> wait_for_completion(&kthread->exited);
> }

2013-04-12 11:00:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Srivatsa,

On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> >> +
> >> + /*
> >> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> >> + * migration thread (which belongs to the stop_task sched class)
> >> + * doesn't run until the cpu is actually onlined and the thread is
> >> + * unparked.
> >> + */
> >> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> >> + WARN_ON(1);
> >
> > Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> > references outside the creation code.
>
> I doubt that. We have not even onlined the CPU, how would any else even
> _know_ that we created this kthread??

The problem is not only at the thread creation time. We have the same
issue at offline/online and there we have a reference to that very
thread.

> >> /**
> >> * kthread_unpark - unpark a thread created by kthread_create().
> >> * @k: thread created by kthread_create().
> >> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
> >> struct kthread *kthread = task_get_live_kthread(k);
> >>
> >> if (kthread) {
> >> + /*
> >> + * Per-cpu kthreads such as ksoftirqd can get woken up by
> >> + * other events. So after binding the thread, ensure that
> >> + * it goes off the CPU atleast once, by parking it again.
> >> + * This way, we can ensure that it will run on the correct
> >> + * CPU on subsequent wakeup.
> >> + */
> >> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> >> + __kthread_bind(k, kthread->cpu);
> >> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
> >
> > And how is that f*cking different from the previous code?
> >
> > CPU0 CPU1 CPU2
> > wakeup(T) -> run on CPU1 (last cpu)
> >
> > switch_to(T)
> >
> > __kthread_bind(T, CPU2)
> >
> > clear(KTHREAD_IS_PARKED)
> >
> > leave loop due to !KTHREAD_IS_PARKED
>
> How?? The task will leave the loop only when we clear
> SHOULD_PARK, not when we clear IS_PARKED. So it won't
> leave the loop here. It will cause the kthread to
> perform a fresh complete() for the waiting kthread_park()
> on CPU0.

You are right on that, but you tricked me into misreading your
patch. Why? Simply because it is too complex for no reason.

> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
> ensure that the task gets *descheduled* atleast once after we did the
> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
> migrate a running kthread (because the kthread could be the migration
> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
> to put the task on the right CPU.

Yeah, it's a nice workaround, though I really prefer a guaranteed well
defined state over this wakeup/sleep/wakeup trickery, which also adds
the additional cost of a wakeup/sleep cycle to the online operation.

> > TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> > of the corner cases, at least as far as I can imagine them. And
> > robustness rules at least in my world.
> >
>
> Yes, I agree that it is robust and has clear semantics. No doubt about
> that. So I won't insist on going with my suggestions.

I'm glad, that we can agree on the robust solution :)

Thanks,

tglx

2013-04-12 11:29:39

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

Hi Thomas,

On 04/12/2013 04:29 PM, Thomas Gleixner wrote:
> Srivatsa,
>
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
>>>> +
>>>> + /*
>>>> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>>>> + * migration thread (which belongs to the stop_task sched class)
>>>> + * doesn't run until the cpu is actually onlined and the thread is
>>>> + * unparked.
>>>> + */
>>>> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>>>> + WARN_ON(1);
>>>
>>> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
>>> references outside the creation code.
>>
>> I doubt that. We have not even onlined the CPU, how would any else even
>> _know_ that we created this kthread??
>
> The problem is not only at the thread creation time. We have the same
> issue at offline/online and there we have a reference to that very
> thread.
>

Right. So our solutions differ in how that is handled, like this:
Yours: ensures that nobody can wakeup the parked thread, except the unpark
code.
Mine: ensures that nobody can make the parked thread leave its park loop
(even if it is woken up), except the unpark code.

Apart from this, everything else is mostly same - for eg., both the patches
depend on that wait_task_inactive() call, in order to make the migration
thread behave.

Either way, the purpose is served, so I'm fine with your solution.

(One of the reasons why I was confident of coming up with a working solution
without adding a new state was because I've worked on the freezer code before,
and IIRC, we have more or less similar problems there and we manage to deal
with it without having a dedicated TASK_FROZEN state. Anyway, nevermind... )

>>>> /**
>>>> * kthread_unpark - unpark a thread created by kthread_create().
>>>> * @k: thread created by kthread_create().
>>>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>>>> struct kthread *kthread = task_get_live_kthread(k);
>>>>
>>>> if (kthread) {
>>>> + /*
>>>> + * Per-cpu kthreads such as ksoftirqd can get woken up by
>>>> + * other events. So after binding the thread, ensure that
>>>> + * it goes off the CPU atleast once, by parking it again.
>>>> + * This way, we can ensure that it will run on the correct
>>>> + * CPU on subsequent wakeup.
>>>> + */
>>>> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>>>> + __kthread_bind(k, kthread->cpu);
>>>> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
>>>
>>> And how is that f*cking different from the previous code?
>>>
>>> CPU0 CPU1 CPU2
>>> wakeup(T) -> run on CPU1 (last cpu)
>>>
>>> switch_to(T)
>>>
>>> __kthread_bind(T, CPU2)
>>>
>>> clear(KTHREAD_IS_PARKED)
>>>
>>> leave loop due to !KTHREAD_IS_PARKED
>>
>> How?? The task will leave the loop only when we clear
>> SHOULD_PARK, not when we clear IS_PARKED. So it won't
>> leave the loop here. It will cause the kthread to
>> perform a fresh complete() for the waiting kthread_park()
>> on CPU0.
>
> You are right on that, but you tricked me into misreading your
> patch. Why? Simply because it is too complex for no reason.
>

;-)

>> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
>> ensure that the task gets *descheduled* atleast once after we did the
>> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
>> migrate a running kthread (because the kthread could be the migration
>> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
>> to put the task on the right CPU.
>
> Yeah, it's a nice workaround, though I really prefer a guaranteed well
> defined state over this wakeup/sleep/wakeup trickery, which also adds
> the additional cost of a wakeup/sleep cycle to the online operation.
>

Sure, no objections from me!

>>> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
>>> of the corner cases, at least as far as I can imagine them. And
>>> robustness rules at least in my world.
>>>
>>
>> Yes, I agree that it is robust and has clear semantics. No doubt about
>> that. So I won't insist on going with my suggestions.
>
> I'm glad, that we can agree on the robust solution :)
>

I'm glad too :-) Thanks a lot!

Regards,
Srivatsa S. Bhat

Subject: [tip:core/urgent] kthread: Prevent unpark race which puts threads on the wrong cpu

Commit-ID: f2530dc71cf0822f90bb63ea4600caaef33a66bb
Gitweb: http://git.kernel.org/tip/f2530dc71cf0822f90bb63ea4600caaef33a66bb
Author: Thomas Gleixner <[email protected]>
AuthorDate: Tue, 9 Apr 2013 09:33:34 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 12 Apr 2013 14:18:43 +0200

kthread: Prevent unpark race which puts threads on the wrong cpu

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Reported-by: Dave Jones <[email protected]>
Reported-and-tested-by: Dave Hansen <[email protected]>
Reported-and-tested-by: Borislav Petkov <[email protected]>
Acked-by: Peter Ziljstra <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>

---
fs/proc/array.c | 1 +
include/linux/sched.h | 5 +++--
include/trace/events/sched.h | 2 +-
kernel/kthread.c | 52 ++++++++++++++++++++++++--------------------
kernel/smpboot.c | 14 ++++++++++--
5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..cbd0f1b 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_array[] = {
"x (dead)", /* 64 */
"K (wakekill)", /* 128 */
"W (waking)", /* 256 */
+ "P (parked)", /* 512 */
};

static inline const char *get_task_state(struct task_struct *tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..e692a02 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_PARKED 512
+#define TASK_STATE_MAX 1024

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 5a8671e..e5586ca 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "W" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9eb7fed 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *task)

static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create_on_node);

-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, state)) {
+ WARN_ON(1);
+ return;
+ }
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu)
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- __kthread_bind(p, cpu);
+ __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(kthread_bind);

@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
return NULL;
}

+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * We clear the IS_PARKED bit here as we don't wait
+ * until the task has left the park code. So if we'd
+ * park before that happens we'd see the IS_PARKED bit
+ * which might be about to be cleared.
+ */
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ wake_up_state(k, TASK_PARKED);
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *k)
{
struct kthread *kthread = task_get_live_kthread(k);

- if (kthread) {
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- /*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
- */
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
- wake_up_process(k);
- }
- }
+ if (kthread)
+ __kthread_unpark(k, kthread);
put_task_struct(k);
}

@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
trace_sched_kthread_stop(k);
if (kthread) {
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ __kthread_unpark(k, kthread);
wake_up_process(k);
wait_for_completion(&kthread->exited);
}
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..02fc5c9 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
}
get_task_struct(tsk);
*per_cpu_ptr(ht->store, cpu) = tsk;
- if (ht->create)
- ht->create(cpu);
+ if (ht->create) {
+ /*
+ * Make sure that the task has actually scheduled out
+ * into park position, before calling the create
+ * callback. At least the migration thread callback
+ * requires that the task is off the runqueue.
+ */
+ if (!wait_task_inactive(tsk, TASK_PARKED))
+ WARN_ON(1);
+ else
+ ht->create(cpu);
+ }
return 0;
}

2013-04-15 19:49:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu

I just tested out Linus's current tree (bb33db7). It is quite happy on
the large system which was exciting this issue. Thanks, Thomas!