2005-02-11 23:29:28

by Nathan Lynch

[permalink] [raw]
Subject: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

Hi-

With 2.6.11-rc3-bk7 on ppc64 I am seeing lots of smp_processor_id
warnings whenever I hotplug cpus:

# echo 0 > /sys/devices/system/cpu/cpu1/online
cpu 1 (hwid 1) Ready to die...
BUG: using smp_processor_id() in preemptible [00000001] code:
ksoftirqd/1/5
caller is .ksoftirqd+0xbc/0x1f8
Call Trace:
[c0000000fffbbce0] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
[c0000000fffbbd60] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
[c0000000fffbbe20] [c00000000005f414] .ksoftirqd+0xbc/0x1f8
[c0000000fffbbed0] [c0000000000764cc] .kthread+0x128/0x134
[c0000000fffbbf90] [c000000000014248] .kernel_thread+0x4c/0x6c

I believe the above warning is caused by the local_softirq_pending
call on a "foreign" cpu before ksoftirqd/1 has been stopped. Looking
at the code, I think this doesn't indicate a real bug, but it would be
better if ksoftirqd didn't check local_softirq_pending after it's been
kicked off its cpu, right?


# echo 1 > /sys/devices/system/cpu/cpu1/online
BUG: using smp_processor_id() in preemptible [00000001] code:
swapper/0
caller is .dedicated_idle+0x68/0x22c
Call Trace:
[c0000000fffafc50] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
[c0000000fffafcd0] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
[c0000000fffafd90] [c00000000000f998] .dedicated_idle+0x68/0x22c
[c0000000fffafe80] [c00000000000fce8] .cpu_idle+0x34/0x4c
[c0000000fffaff00] [c00000000003a744] .start_secondary+0x10c/0x150
[c0000000fffaff90] [c00000000000bd28] .enable_64b_mode+0x0/0x28

Should ppc64 simply use _smp_processor_id() in its idle loop code
(like i386)?

If I online and offline cpus rapidly enough I can eventually get the
following:

printk: 49 messages suppressed.
BUG: using smp_processor_id() in preemptible [00000001] code:
events/3/1262
caller is .cache_reap+0x21c/0x2b8
Call Trace:
[c0000000ed67bb90] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
[c0000000ed67bc10] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
[c0000000ed67bcd0] [c0000000000938e8] .cache_reap+0x21c/0x2b8
[c0000000ed67bda0] [c00000000006f4bc] .worker_thread+0x230/0x310
[c0000000ed67bed0] [c0000000000764cc] .kthread+0x128/0x134
[c0000000ed67bf90] [c000000000014248] .kernel_thread+0x4c/0x6c

And this will repeat over and over even after I stop hotplugging
cpus... from the same events thread so I think it's somehow gotten
"stuck"?

Anything I can do to further debug?


Nathan


2005-02-11 23:57:00

by Matthias-Christian Ott

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

Nathan Lynch wrote:

>Hi-
>
>With 2.6.11-rc3-bk7 on ppc64 I am seeing lots of smp_processor_id
>warnings whenever I hotplug cpus:
>
># echo 0 > /sys/devices/system/cpu/cpu1/online
>cpu 1 (hwid 1) Ready to die...
>BUG: using smp_processor_id() in preemptible [00000001] code:
>ksoftirqd/1/5
>caller is .ksoftirqd+0xbc/0x1f8
>Call Trace:
>[c0000000fffbbce0] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
>[c0000000fffbbd60] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
>[c0000000fffbbe20] [c00000000005f414] .ksoftirqd+0xbc/0x1f8
>[c0000000fffbbed0] [c0000000000764cc] .kthread+0x128/0x134
>[c0000000fffbbf90] [c000000000014248] .kernel_thread+0x4c/0x6c
>
>I believe the above warning is caused by the local_softirq_pending
>call on a "foreign" cpu before ksoftirqd/1 has been stopped. Looking
>at the code, I think this doesn't indicate a real bug, but it would be
>better if ksoftirqd didn't check local_softirq_pending after it's been
>kicked off its cpu, right?
>
>
># echo 1 > /sys/devices/system/cpu/cpu1/online
>BUG: using smp_processor_id() in preemptible [00000001] code:
>swapper/0
>caller is .dedicated_idle+0x68/0x22c
>Call Trace:
>[c0000000fffafc50] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
>[c0000000fffafcd0] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
>[c0000000fffafd90] [c00000000000f998] .dedicated_idle+0x68/0x22c
>[c0000000fffafe80] [c00000000000fce8] .cpu_idle+0x34/0x4c
>[c0000000fffaff00] [c00000000003a744] .start_secondary+0x10c/0x150
>[c0000000fffaff90] [c00000000000bd28] .enable_64b_mode+0x0/0x28
>
>Should ppc64 simply use _smp_processor_id() in its idle loop code
>(like i386)?
>
>If I online and offline cpus rapidly enough I can eventually get the
>following:
>
>printk: 49 messages suppressed.
>BUG: using smp_processor_id() in preemptible [00000001] code:
>events/3/1262
>caller is .cache_reap+0x21c/0x2b8
>Call Trace:
>[c0000000ed67bb90] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
>[c0000000ed67bc10] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
>[c0000000ed67bcd0] [c0000000000938e8] .cache_reap+0x21c/0x2b8
>[c0000000ed67bda0] [c00000000006f4bc] .worker_thread+0x230/0x310
>[c0000000ed67bed0] [c0000000000764cc] .kthread+0x128/0x134
>[c0000000ed67bf90] [c000000000014248] .kernel_thread+0x4c/0x6c
>
>And this will repeat over and over even after I stop hotplugging
>cpus... from the same events thread so I think it's somehow gotten
>"stuck"?
>
>Anything I can do to further debug?
>
>
>Nathan
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>
>
Hi!
Use get_cpu() (It disables preemption) or __smp_processor_id () (on a smp).

Matthias-Christian Ott

2005-02-12 00:56:21

by Nathan Lynch

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

On Sat, Feb 12, 2005 at 12:56:54AM +0100, Matthias-Christian Ott wrote:
> Nathan Lynch wrote:
>
> >With 2.6.11-rc3-bk7 on ppc64 I am seeing lots of smp_processor_id
> >warnings whenever I hotplug cpus:
...
>
> Use get_cpu() (It disables preemption) or __smp_processor_id () (on a smp).

It's not necessarily that simple (ok, maybe the idle loop warning is).
But at least one of the warnings I listed appears to be caused by a
kernel thread that is normally bound to a particular cpu trying to do
normal processing on another cpu before it has stopped. Injudicious
use of __smp_processor_id or get_cpu in this case would only obscure
the problem.


Nathan

2005-02-12 18:58:32

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

On Fri, 11 Feb 2005, Nathan Lynch wrote:

> With 2.6.11-rc3-bk7 on ppc64 I am seeing lots of smp_processor_id
> warnings whenever I hotplug cpus:
>
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> cpu 1 (hwid 1) Ready to die...
> BUG: using smp_processor_id() in preemptible [00000001] code:
> ksoftirqd/1/5
> caller is .ksoftirqd+0xbc/0x1f8
> Call Trace:
> [c0000000fffbbce0] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
> [c0000000fffbbd60] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
> [c0000000fffbbe20] [c00000000005f414] .ksoftirqd+0xbc/0x1f8
> [c0000000fffbbed0] [c0000000000764cc] .kthread+0x128/0x134
> [c0000000fffbbf90] [c000000000014248] .kernel_thread+0x4c/0x6c
>
> I believe the above warning is caused by the local_softirq_pending
> call on a "foreign" cpu before ksoftirqd/1 has been stopped. Looking
> at the code, I think this doesn't indicate a real bug, but it would be
> better if ksoftirqd didn't check local_softirq_pending after it's been
> kicked off its cpu, right?

How about;

Index: linux-2.6.11-rc3-mm2/kernel/softirq.c
===================================================================
RCS file: /home/cvsroot/linux-2.6.11-rc3-mm2/kernel/softirq.c,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 softirq.c
--- linux-2.6.11-rc3-mm2/kernel/softirq.c 11 Feb 2005 05:14:57 -0000 1.1.1.1
+++ linux-2.6.11-rc3-mm2/kernel/softirq.c 12 Feb 2005 18:24:54 -0000
@@ -355,8 +355,12 @@ static int ksoftirqd(void * __bind_cpu)
set_current_state(TASK_INTERRUPTIBLE);

while (!kthread_should_stop()) {
- if (!local_softirq_pending())
+ preempt_disable();
+ if (!local_softirq_pending()) {
+ preempt_enable_no_resched();
schedule();
+ preempt_disable();
+ }

__set_current_state(TASK_RUNNING);

@@ -364,14 +368,14 @@ static int ksoftirqd(void * __bind_cpu)
/* Preempt disable stops cpu going offline.
If already offline, we'll be on wrong CPU:
don't process */
- preempt_disable();
if (cpu_is_offline((long)__bind_cpu))
goto wait_to_die;
do_softirq();
- preempt_enable();
+ preempt_enable_no_resched();
cond_resched();
+ preempt_disable();
}
-
+ preempt_enable();
set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);

> # echo 1 > /sys/devices/system/cpu/cpu1/online
> BUG: using smp_processor_id() in preemptible [00000001] code:
> swapper/0
> caller is .dedicated_idle+0x68/0x22c
> Call Trace:
> [c0000000fffafc50] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
> [c0000000fffafcd0] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
> [c0000000fffafd90] [c00000000000f998] .dedicated_idle+0x68/0x22c
> [c0000000fffafe80] [c00000000000fce8] .cpu_idle+0x34/0x4c
> [c0000000fffaff00] [c00000000003a744] .start_secondary+0x10c/0x150
> [c0000000fffaff90] [c00000000000bd28] .enable_64b_mode+0x0/0x28
>
> Should ppc64 simply use _smp_processor_id() in its idle loop code
> (like i386)?

I would say so, it's definitely safe.

> If I online and offline cpus rapidly enough I can eventually get the
> following:
>
> printk: 49 messages suppressed.
> BUG: using smp_processor_id() in preemptible [00000001] code:
> events/3/1262
> caller is .cache_reap+0x21c/0x2b8
> Call Trace:
> [c0000000ed67bb90] [ffffffffffffffff] 0xffffffffffffffff (unreliable)
> [c0000000ed67bc10] [c0000000001c9f1c] .smp_processor_id+0x154/0x168
> [c0000000ed67bcd0] [c0000000000938e8] .cache_reap+0x21c/0x2b8
> [c0000000ed67bda0] [c00000000006f4bc] .worker_thread+0x230/0x310
> [c0000000ed67bed0] [c0000000000764cc] .kthread+0x128/0x134
> [c0000000ed67bf90] [c000000000014248] .kernel_thread+0x4c/0x6c
>
> And this will repeat over and over even after I stop hotplugging
> cpus... from the same events thread so I think it's somehow gotten
> "stuck"?

Hmm this should be covered by the local_bh_disable() in softirq
processing.

2005-02-14 22:00:05

by Nathan Lynch

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

On Sat, Feb 12, 2005 at 11:59:04AM -0700, Zwane Mwaikambo wrote:
> How about;
>
> Index: linux-2.6.11-rc3-mm2/kernel/softirq.c
> ===================================================================
> RCS file: /home/cvsroot/linux-2.6.11-rc3-mm2/kernel/softirq.c,v
> retrieving revision 1.1.1.1
> diff -u -p -B -r1.1.1.1 softirq.c
> --- linux-2.6.11-rc3-mm2/kernel/softirq.c 11 Feb 2005 05:14:57 -0000 1.1.1.1
> +++ linux-2.6.11-rc3-mm2/kernel/softirq.c 12 Feb 2005 18:24:54 -0000
> @@ -355,8 +355,12 @@ static int ksoftirqd(void * __bind_cpu)
> set_current_state(TASK_INTERRUPTIBLE);
>
> while (!kthread_should_stop()) {
> - if (!local_softirq_pending())
> + preempt_disable();
> + if (!local_softirq_pending()) {
> + preempt_enable_no_resched();
> schedule();
> + preempt_disable();
> + }
>
> __set_current_state(TASK_RUNNING);
>
> @@ -364,14 +368,14 @@ static int ksoftirqd(void * __bind_cpu)
> /* Preempt disable stops cpu going offline.
> If already offline, we'll be on wrong CPU:
> don't process */
> - preempt_disable();
> if (cpu_is_offline((long)__bind_cpu))
> goto wait_to_die;
> do_softirq();
> - preempt_enable();
> + preempt_enable_no_resched();
> cond_resched();
> + preempt_disable();
> }
> -
> + preempt_enable();
> set_current_state(TASK_INTERRUPTIBLE);
> }
> __set_current_state(TASK_RUNNING);
>

Works ok here.

It looks as if we need to explicitly bind worker threads to a newly
onlined cpu. This gets rid of the smp_processor_id warnings from
cache_reap. Adding a little more instrumentation to the debug
smp_processor_id showed that new worker threads were actually running
on the wrong cpu...

Does this look ok?

Index: linux-2.6.11-rc4-bk2/kernel/workqueue.c
===================================================================
--- linux-2.6.11-rc4-bk2.orig/kernel/workqueue.c 2005-02-14 11:13:08.000000000 -0600
+++ linux-2.6.11-rc4-bk2/kernel/workqueue.c 2005-02-14 15:18:35.000000000 -0600
@@ -485,8 +485,10 @@

case CPU_ONLINE:
/* Kick off worker threads. */
- list_for_each_entry(wq, &workqueues, list)
+ list_for_each_entry(wq, &workqueues, list) {
+ kthread_bind(wq->cpu_wq[hotcpu].thread, hotcpu);
wake_up_process(wq->cpu_wq[hotcpu].thread);
+ }
break;

case CPU_UP_CANCELED:

2005-02-15 07:03:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore


* Nathan Lynch <[email protected]> wrote:

> Works ok here.
>
> It looks as if we need to explicitly bind worker threads to a newly
> onlined cpu. This gets rid of the smp_processor_id warnings from
> cache_reap. Adding a little more instrumentation to the debug
> smp_processor_id showed that new worker threads were actually running
> on the wrong cpu...
>
> Does this look ok?

indeed - looks much better than the 'turn off the warning' solution.

Acked-by: Ingo Molnar <[email protected]>

Ingo

2005-02-15 15:28:53

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

On Mon, 14 Feb 2005, Nathan Lynch wrote:

> It looks as if we need to explicitly bind worker threads to a newly
> onlined cpu. This gets rid of the smp_processor_id warnings from
> cache_reap. Adding a little more instrumentation to the debug
> smp_processor_id showed that new worker threads were actually running
> on the wrong cpu...
>
> Does this look ok?

Yeah, does that patch suffice for all the warnings?

2005-02-15 17:28:44

by Nathan Lynch

[permalink] [raw]
Subject: Re: 2.6-bk: cpu hotplug + preempt = smp_processor_id warnings galore

On Tue, Feb 15, 2005 at 08:29:33AM -0700, Zwane Mwaikambo wrote:
> On Mon, 14 Feb 2005, Nathan Lynch wrote:
>
> > It looks as if we need to explicitly bind worker threads to a newly
> > onlined cpu. This gets rid of the smp_processor_id warnings from
> > cache_reap. Adding a little more instrumentation to the debug
> > smp_processor_id showed that new worker threads were actually running
> > on the wrong cpu...
> >
> > Does this look ok?
>
> Yeah, does that patch suffice for all the warnings?

Nope, ksoftirqd still requires your patch in order to kill the
warnings there.

Nathan

2005-02-16 02:06:38

by Nathan Lynch

[permalink] [raw]
Subject: [PATCH] kthread_bind new worker threads when onlining cpu

Hi Andrew-

On Tue, Feb 15, 2005 at 08:02:17AM +0100, Ingo Molnar wrote:
>
> * Nathan Lynch <[email protected]> wrote:
>
> >
> > It looks as if we need to explicitly bind worker threads to a newly
> > onlined cpu. This gets rid of the smp_processor_id warnings from
> > cache_reap. Adding a little more instrumentation to the debug
> > smp_processor_id showed that new worker threads were actually running
> > on the wrong cpu...
> >
> > Does this look ok?
>
> indeed - looks much better than the 'turn off the warning' solution.
>
> Acked-by: Ingo Molnar <[email protected]>

We weren't binding new worker threads to their cpu when onlining.
Using preempt and the debug version of smp_processor_id found this.

Signed-off-by: Nathan Lynch <[email protected]>

Index: linux-2.6.11-rc4-bk2/kernel/workqueue.c
===================================================================
--- linux-2.6.11-rc4-bk2.orig/kernel/workqueue.c 2005-02-14 11:13:08.000000000 -0600
+++ linux-2.6.11-rc4-bk2/kernel/workqueue.c 2005-02-14 15:18:35.000000000 -0600
@@ -485,8 +485,10 @@

case CPU_ONLINE:
/* Kick off worker threads. */
- list_for_each_entry(wq, &workqueues, list)
+ list_for_each_entry(wq, &workqueues, list) {
+ kthread_bind(wq->cpu_wq[hotcpu].thread, hotcpu);
wake_up_process(wq->cpu_wq[hotcpu].thread);
+ }
break;

case CPU_UP_CANCELED:

2005-02-16 05:30:57

by Zwane Mwaikambo

[permalink] [raw]
Subject: [PATCH] Run softirqs on proper processor on offline

Ensure that we only offline the processor when it's safe and never run
softirqs in another processor's ksoftirqd context. This also gets rid of
the warnings in ksoftirqd on cpu offline.

Signed-off-by: Zwane Mwaikambo <[email protected]>

Index: linux-2.6.11-rc3-mm2/kernel/softirq.c
===================================================================
RCS file: /home/cvsroot/linux-2.6.11-rc3-mm2/kernel/softirq.c,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 softirq.c
--- linux-2.6.11-rc3-mm2/kernel/softirq.c 11 Feb 2005 05:14:57 -0000 1.1.1.1
+++ linux-2.6.11-rc3-mm2/kernel/softirq.c 12 Feb 2005 18:24:54 -0000
@@ -355,8 +355,12 @@ static int ksoftirqd(void * __bind_cpu)
set_current_state(TASK_INTERRUPTIBLE);

while (!kthread_should_stop()) {
- if (!local_softirq_pending())
+ preempt_disable();
+ if (!local_softirq_pending()) {
+ preempt_enable_no_resched();
schedule();
+ preempt_disable();
+ }

__set_current_state(TASK_RUNNING);

@@ -364,14 +368,14 @@ static int ksoftirqd(void * __bind_cpu)
/* Preempt disable stops cpu going offline.
If already offline, we'll be on wrong CPU:
don't process */
- preempt_disable();
if (cpu_is_offline((long)__bind_cpu))
goto wait_to_die;
do_softirq();
- preempt_enable();
+ preempt_enable_no_resched();
cond_resched();
+ preempt_disable();
}
-
+ preempt_enable();
set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);

2005-02-16 05:34:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Run softirqs on proper processor on offline


* Zwane Mwaikambo <[email protected]> wrote:

> Ensure that we only offline the processor when it's safe and never run
> softirqs in another processor's ksoftirqd context. This also gets rid of
> the warnings in ksoftirqd on cpu offline.
>
> Signed-off-by: Zwane Mwaikambo <[email protected]>

looks good.

Acked-by: Ingo Molnar <[email protected]>

Ingo

2005-02-16 05:52:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Run softirqs on proper processor on offline

Zwane Mwaikambo <[email protected]> wrote:
>
> Ensure that we only offline the processor when it's safe and never run
> softirqs in another processor's ksoftirqd context. This also gets rid of
> the warnings in ksoftirqd on cpu offline.

I don't get it. ksoftirqd is pinned to its cpu, so why does any of this
matter?

2005-02-16 06:16:32

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] Run softirqs on proper processor on offline

On Tue, 15 Feb 2005, Andrew Morton wrote:

> Zwane Mwaikambo <[email protected]> wrote:
> >
> > Ensure that we only offline the processor when it's safe and never run
> > softirqs in another processor's ksoftirqd context. This also gets rid of
> > the warnings in ksoftirqd on cpu offline.
>
> I don't get it. ksoftirqd is pinned to its cpu, so why does any of this
> matter?

We take down ksoftirqds at CPU_DEAD time, so there is a brief period
whereupon there is a ksoftirqd thread for an offline processor, it is at
this point that ->cpus_allowed won't have it pinned anymore. An online
processor would then take down that ksoftirqd and exit it.