2006-10-27 10:49:44

by Satoru Takeuchi

[permalink] [raw]
Subject: [BUGFIX] [PATCH 1/2] cpu-hotplug: Fixing confliction between CPU hot-add and IPI

From: Kenji Kaneshige <[email protected]>

---
Fixing the confliction between CPU hot-add and IPI.

Signed-off-by: Kenji Kaneshige <[email protected]>
Acked-by: Satoru Takeuchi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Index: linux-2.6.19-rc3/arch/ia64/kernel/smp.c
===================================================================
--- linux-2.6.19-rc3.orig/arch/ia64/kernel/smp.c 2006-10-27 18:40:47.000000000 +0900
+++ linux-2.6.19-rc3/arch/ia64/kernel/smp.c 2006-10-27 18:49:48.000000000 +0900
@@ -328,10 +328,14 @@
smp_call_function (void (*func) (void *info), void *info, int nonatomic, int wait)
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
+ int cpus;

- if (!cpus)
+ spin_lock(&call_lock);
+ cpus = num_online_cpus() - 1;
+ if (!cpus) {
+ spin_unlock(&call_lock);
return 0;
+ }

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
@@ -343,8 +347,6 @@
if (wait)
atomic_set(&data.finished, 0);

- spin_lock(&call_lock);
-
call_data = &data;
mb(); /* ensure store to call_data precedes setting of IPI_CALL_FUNC */
send_IPI_allbutself(IPI_CALL_FUNC);


2006-10-27 13:25:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX] [PATCH 1/2] cpu-hotplug: Fixing confliction between CPU hot-add and IPI

On Fri, 27 Oct 2006 19:49:53 +0900
Satoru Takeuchi <[email protected]> wrote:

> Fixing the confliction between CPU hot-add and IPI.
>
Some more verbose inoformation ;)

This patch fixes following bug. found in our test.
==
Pid: 7905, CPU 4, comm: memload
psr : 0000121008022018 ifs : 800000000000038a ip : [<a000000100054681>] Not
tainted
ip is at handle_IPI+0x101/0x380
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 0000000000565aa9 bsps: a0000001000943b0 pr : 00000000005685a9
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001000e4340 b6 : a000000100054580 b7 : a000000000010640
f6 : 000000000000000000000 f7 : 000000000000000000000
f8 : 000000000000000000000 f9 : 1003effffffffffffc000
f10 : 1003e0000000000000000 f11 : 000000000000000000000
r1 : a000000100c14e40 r2 : fffffffffffffffe r3 : fffffffffffffffe
r8 : 0000000000000001 r9 : 0000000000000000 r10 : 0000000000000000
r11 : 0000000000017d00 r12 : e0000001341dfe30 r13 : e0000001341d8000
r14 : 0000000000000001 r15 : e0000140040165f8 r16 : a000000100861400
r17 : 0000000000000000 r18 : 0000000000000010 r19 : a000000100a2cc48
r20 : 00000000000002fa r21 : ffffffffffff0028 r22 : a0000001009ce2c0
r23 : a0000001009ce298 r24 : a000000100011a60 r25 : e0000001341dfe58
r26 : c000000000000008 r27 : 000000000000000f r28 : a000000000010620
r29 : a000000100879120 r30 : 0000000000000008 r31 : 0000000000550a41

Call Trace:
[<a000000100013e80>] show_stack+0x40/0xa0
sp=e0000001341df9c0 bsp=e0000001341d9338
[<a000000100014780>] show_regs+0x840/0x880
sp=e0000001341dfb90 bsp=e0000001341d92e0
[<a000000100037b60>] die+0x1c0/0x2a0
sp=e0000001341dfb90 bsp=e0000001341d9298
[<a000000100629040>] ia64_do_page_fault+0x8a0/0x9e0
sp=e0000001341dfbb0 bsp=e0000001341d9248
[<a00000010000c700>] __ia64_leave_kernel+0x0/0x280
sp=e0000001341dfc60 bsp=e0000001341d9248
[<a000000100054680>] handle_IPI+0x100/0x380
sp=e0000001341dfe30 bsp=e0000001341d91f0
[<a0000001000e4340>] handle_IRQ_event+0xa0/0x140
sp=e0000001341dfe30 bsp=e0000001341d91b0
[<a0000001000e4510>] __do_IRQ+0x130/0x3e0
sp=e0000001341dfe30 bsp=e0000001341d9168
[<a000000100011990>] ia64_handle_irq+0xf0/0x1a0
sp=e0000001341dfe30 bsp=e0000001341d9138
[<a00000010000c700>] __ia64_leave_kernel+0x0/0x280
sp=e0000001341dfe30 bsp=e0000001341d9138
<3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():1, irqs_disabled():0
==
This is because handle_IPI() accesses call_data, which is NULL.

The scenario is

1. on CPU A: smp_call_funcion() is called.
2. on CPU A: ncpus = num_online_cpus() - 1 is called. set ncpus=X.
3. on CPU B: by cpu_hotplug, some cpus comes up to be online. changes online_map.
4. on CPU A: smp_call_function() sends IPI. The targets of IPI is determined by
cpu_online_map(), which was changed. Then, IPIs are sent to *X+1* processors.
5. on CPU A: wait for X(=ncpus) acks.
6. on CPU A: after getting all Acks. set call_data = NULL
7. on some other cpu, which received *delayed* IPI(by some irq-disable ops)
will access call_data and panics.

This patch moves ncpus=num_online_cpu() under ipi_call_lock.
x86 implemtation does this.

-Kame