LinuxLists.cc - cpu hotplug oops on 2.6.15-rc5

2005-12-19 05:17:26

Subject: cpu hotplug oops on 2.6.15-rc5

(apologies if this is a dup)

Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5; SMT was disabled.

Here's the backtrace:

0:mon> t
[c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
[c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
[c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
[c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
[c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
[c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
[c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
[c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
[c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
[c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffc4c4f0) is in userspace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
pc: c00000000048bd30: ._spin_lock+0x18/0x80
lr: c000000000096a7c: .kfree+0x250/0x280
sp: c0000001ad0337a0
msr: 8000000000001032
dar: 48
dsisr: 40000000
current = 0xc0000001aff12040
paca = 0xc0000000005c1000
pid = 17376, comm = bash

Should I try this with CONFIG_DEBUG_SLAB?

Sonny

2005-12-19 06:42:29

by Benjamin Herrenschmidt

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> (apologies if this is a dup)
>
> Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5; SMT was disabled.

First try on -rc6 just in case it's related to the SCSI fix (the bug was
corrupting the SLAB) that got merged just after rc5 iirc.

Ben.

2005-12-19 07:09:17

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Mon, Dec 19, 2005 at 05:41:57PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2005-12-19 at 00:16 -0500, Sonny Rao wrote:
> > (apologies if this is a dup)
> >
> > Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5; SMT was disabled.
>
> First try on -rc6 just in case it's related to the SCSI fix (the bug was
> corrupting the SLAB) that got merged just after rc5 iirc.

Ok, tried it: same crash on -rc6

2:mon> t
[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ffa204f0) is in userspace
2:mon>

Sonny

2005-12-19 21:17:46

by Manfred Spraul

Subject: Re: cpu hotplug oops on 2.6.15-rc5

Sonny Rao wrote:

>Ok, tried it: same crash on -rc6
>
>2:mon> t
>[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
>[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
>[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
>[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
>[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
>[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
>[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
>[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
>[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
>[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
>
>
Very odd call chain.
Could you enable slab debugging?

--
Manfred

2005-12-19 23:16:33

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Mon, Dec 19, 2005 at 10:17:04PM +0100, Manfred Spraul wrote:
> Sonny Rao wrote:
>
> >Ok, tried it: same crash on -rc6
> >
> >2:mon> t
> >[c000000d9f33b820] c000000000097cd0 .kfree+0x29c/0x2cc
> >[c000000d9f33b8d0] c00000000009c3a8 .cpuup_callback+0x4f8/0x5fc
> >[c000000d9f33b9c0] c00000000048ff4c .notifier_call_chain+0x68/0x9c
> >[c000000d9f33ba50] c000000000078da8 .cpu_down+0x1fc/0x368
> >[c000000d9f33bb40] c0000000002ae514 .store_online+0x88/0xe8
> >[c000000d9f33bbd0] c0000000002a8dd0 .sysdev_store+0x4c/0x68
> >[c000000d9f33bc50] c000000000111e70 .sysfs_write_file+0x100/0x1a0
> >[c000000d9f33bcf0] c0000000000c0360 .vfs_write+0x100/0x200
> >[c000000d9f33bd90] c0000000000c0570 .sys_write+0x54/0x9c
> >[c000000d9f33be30] c000000000008600 syscall_exit+0x0/0x18
> >
> >
> Very odd call chain.
> Could you enable slab debugging?

Actually, I did turn on slab debugging on -rc6, but it did not seem to
make any difference.

Sonny

2005-12-19 23:54:27

by Anton Blanchard

Subject: Re: cpu hotplug oops on 2.6.15-rc5

Hi Manfred,

> Very odd call chain.
> Could you enable slab debugging?

Sonny and I had a look around; the crash seems to be in the
cpuup_callback() / CPU_DEAD case:

	if (!cpus_empty(mask)) {
		spin_unlock(&l3->list_lock);
		goto unlock_cache;
	}

	if (l3->shared) {
		free_block(cachep, l3->shared->entry,
			   l3->shared->avail, node);
		kfree(l3->shared);	<-------- HERE
		l3->shared = NULL;
	}

So we are removing the last cpu in a node, and tearing down the node
related structures. We looked at kfree() -> __cache_free() and we couldn't
convince ourselves that all the CONFIG_NUMA stuff in there wouldn't trip
over itself (since we would be doing the free on an alien node).

Anton

2005-12-22 09:27:55

by Ravikiran G Thirumalai

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> (apologies if this is a dup)
>
> Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5; SMT was disabled.
>
> Here's the backtrace:
>
> 0:mon> t
> [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ffc4c4f0) is in userspace
>
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
> pc: c00000000048bd30: ._spin_lock+0x18/0x80
> lr: c000000000096a7c: .kfree+0x250/0x280
> sp: c0000001ad0337a0
> msr: 8000000000001032
> dar: 48
> dsisr: 40000000
> current = 0xc0000001aff12040
> paca = 0xc0000000005c1000
> pid = 17376, comm = bash
>
>

Sonny,
Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
unlike the one that was sent to you earlier.

Thanks,
Kiran

From: Alok N Kataria <[email protected]>

Fixes a bug in the CPU_DOWN call path: we shouldn't call kfree while
holding kmem_list3's list lock, nor should drain_alien_cache be called
with l3's list lock held.

Signed-off-by : Alok N Kataria <[email protected]>
Signed-off-by : Ravikiran Thirumalai <[email protected]>
Signed-off-by : Shai Fultheim <[email protected]>

Index: linux-2.6.15-rc6/mm/slab.c
===================================================================
--- linux-2.6.15-rc6.orig/mm/slab.c 2005-12-21 22:32:14.000000000 -0800
+++ linux-2.6.15-rc6/mm/slab.c 2005-12-21 22:32:58.000000000 -0800
@@ -824,14 +824,14 @@ static inline void __drain_alien_cache(k
 	}
 }

-static void drain_alien_cache(kmem_cache_t *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(kmem_cache_t *cachep, struct array_cache **alien)
 {
 	int i=0;
 	struct array_cache *ac;
 	unsigned long flags;

 	for_each_online_node(i) {
-		ac = l3->alien[i];
+		ac = alien[i];
 		if (ac) {
 			spin_lock_irqsave(&ac->lock, flags);
 			__drain_alien_cache(cachep, ac, i);
@@ -842,7 +842,7 @@ static void drain_alien_cache(kmem_cache
 #else
 #define alloc_alien_cache(node, limit) do { } while (0)
 #define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
 #endif

 static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -921,7 +921,7 @@ static int __devinit cpuup_callback(stru
 		down(&cache_chain_sem);

 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 			cpumask_t mask;

 			mask = node_to_cpumask(node);
@@ -932,7 +932,7 @@ static int __devinit cpuup_callback(stru
 			l3 = cachep->nodelists[node];

 			if (!l3)
-				goto unlock_cache;
+				goto free_array_cache;

 			spin_lock(&l3->list_lock);

@@ -943,32 +943,40 @@ static int __devinit cpuup_callback(stru

 			if (!cpus_empty(mask)) {
 				spin_unlock(&l3->list_lock);
-				goto unlock_cache;
+				goto free_array_cache;
 			}

-			if (l3->shared) {
+			if ((shared = l3->shared)) {
 				free_block(cachep, l3->shared->entry,
 					l3->shared->avail, node);
 				kfree(l3->shared);
 				l3->shared = NULL;
 			}
-			if (l3->alien) {
-				drain_alien_cache(cachep, l3);
-				free_alien_cache(l3->alien);
-				l3->alien = NULL;
+
+			alien = l3->alien;
+			l3->alien = NULL;
+
+			spin_unlock(&l3->list_lock);
+
+			kfree(nc);
+			kfree(shared);
+			if (alien) {
+				drain_alien_cache(cachep, alien);
+				free_alien_cache(alien);
 			}

 			/* free slabs belonging to this node */
 			if (__node_shrink(cachep, node)) {
+				spin_lock(&l3->list_lock);
 				cachep->nodelists[node] = NULL;
 				spin_unlock(&l3->list_lock);
 				kfree(l3);
-			} else {
-				spin_unlock(&l3->list_lock);
 			}
+			goto unlock_cache;
+free_array_cache:
+			kfree(nc);
 unlock_cache:
 			spin_unlock_irq(&cachep->spinlock);
-			kfree(nc);
 		}
 		up(&cache_chain_sem);
 		break;
@@ -1918,7 +1926,7 @@ static void drain_cpu_caches(kmem_cache_
 			drain_array_locked(cachep, l3->shared, 1, node);
 			spin_unlock(&l3->list_lock);
 			if (l3->alien)
-				drain_alien_cache(cachep, l3);
+				drain_alien_cache(cachep, l3->alien);
 		}
 	}
 	spin_unlock_irq(&cachep->spinlock);
@@ -3310,7 +3318,7 @@ static void cache_reap(void *unused)

 		l3 = searchp->nodelists[numa_node_id()];
 		if (l3->alien)
-			drain_alien_cache(searchp, l3);
+			drain_alien_cache(searchp, l3->alien);
 		spin_lock_irq(&l3->list_lock);

 		drain_array_locked(searchp, ac_data(searchp), 0,

2005-12-22 17:54:01

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > (apologies if this is a dup)
> > >
> > > Hi, I'm crashing 2.6.15-rc5 when I try to offline the last and only CPU in a node on a ppc64 Power5; SMT was disabled.
> > >
> > > Here's the backtrace:
> > >
> > > 0:mon> t
> > > [c0000001ad033820] c000000000096a7c .kfree+0x250/0x280
> > > [c0000001ad0338d0] c00000000009a544 .cpuup_callback+0x238/0x5fc
> > > [c0000001ad0339c0] c000000000068114 .notifier_call_chain+0x68/0x9c
> > > [c0000001ad033a50] c0000000000789fc .cpu_down+0x1fc/0x368
> > > [c0000001ad033b40] c0000000002ac658 .store_online+0x88/0xe8
> > > [c0000001ad033bd0] c0000000002a6f14 .sysdev_store+0x4c/0x68
> > > [c0000001ad033c50] c000000000110368 .sysfs_write_file+0x100/0x1a0
> > > [c0000001ad033cf0] c0000000000be854 .vfs_write+0x100/0x200
> > > [c0000001ad033d90] c0000000000bea64 .sys_write+0x54/0x9c
> > > [c0000001ad033e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ffc4c4f0) is in userspace
> > >
> > > 0:mon> e
> > > cpu 0x0: Vector: 300 (Data Access) at [c0000001ad033520]
> > > pc: c00000000048bd30: ._spin_lock+0x18/0x80
> > > lr: c000000000096a7c: .kfree+0x250/0x280
> > > sp: c0000001ad0337a0
> > > msr: 8000000000001032
> > > dar: 48
> > > dsisr: 40000000
> > > current = 0xc0000001aff12040
> > > paca = 0xc0000000005c1000
> > > pid = 17376, comm = bash
> > >
> > >
> >
> > Sonny,
> > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > unlike the one that was sent to you earlier.
>
> Hi, thanks, now I'm getting a slightly different error,
> hitting a BUG in the slab debug code:
>
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> pc: c00000000009bb2c: .check_slabp+0x130/0x188
> lr: c00000000009bb28: .check_slabp+0x12c/0x188
> sp: c0000003a8c23670
> msr: 8000000000021032
> current = 0xc0000001b95297f0
> paca = 0xc0000000005d7000
> pid = 11116, comm = bash
> kernel BUG in check_slabp at mm/slab.c:2368!
> enter ? for help
>
>
> 4:mon> t
> [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff865560) is in userspace

More details:

The above crash was with SMT on, and I had already off-lined the SMT
sibling thread.

When I boot with SMT off, I get a slightly different crash:

ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
pc: c00000000009d960: .free_block+0x1b0/0x294
lr: c00000000009d95c: .free_block+0x1ac/0x294
sp: c0000003afa13700
msr: 8000000000021032
current = 0xc0000003afe04000
paca = 0xc0000000005d5000
pid = 10998, comm = bash
kernel BUG in free_block at mm/slab.c:2664!
enter ? for help

0:mon> t
[c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
[c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
[c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
[c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
[c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
[c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
[c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
[c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
[c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
[c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000fe5ec10
SP (ff8b4560) is in userspace

This one points to a double free somewhere

Sonny

2005-12-22 18:38:00

by Ravikiran G Thirumalai

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > (apologies if this is a dup)
> > > ...
> > > Sonny,
> > > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > > unlike the one that was sent to you earlier.
> >
> > Hi, thanks, now I'm getting a slightly different error,
> > hitting a BUG in the slab debug code:
> >
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > sp: c0000003a8c23670
> > msr: 8000000000021032
> > current = 0xc0000001b95297f0
> > paca = 0xc0000000005d7000
> > pid = 11116, comm = bash
> > kernel BUG in check_slabp at mm/slab.c:2368!
> > enter ? for help
> >
> >
> > 4:mon> t
> > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff865560) is in userspace
>
> More details:
>
> The above crash was with SMT on, and I had already off-lined the SMT
> sibling thread.
>
> When I boot with SMT off, I get a slightly different crash:

I think I missed the first reply above (I can't seem to find it on lkml
either). So just to confirm, both these crashes are with the new patch on
top of rc6?

Thanks,
Kiran

>
> ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
> pc: c00000000009d960: .free_block+0x1b0/0x294
> lr: c00000000009d95c: .free_block+0x1ac/0x294
> sp: c0000003afa13700
> msr: 8000000000021032
> current = 0xc0000003afe04000
> paca = 0xc0000000005d5000
> pid = 10998, comm = bash
> kernel BUG in free_block at mm/slab.c:2664!
> enter ? for help
>
> 0:mon> t
> [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> --- Exception: c01 (System Call) at 000000000fe5ec10
> SP (ff8b4560) is in userspace
>
> This one points to a double free somewhere
>
>

2005-12-22 18:40:42

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

Yes, rc6 + the patch you provided.

2005-12-22 18:54:37

by Christoph Lameter

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Thu, 22 Dec 2005, Sonny Rao wrote:

> Yes, rc6 + the patch you provided.

We may be going down the wrong path here. Has someone other than Sonny
reproduced the problem?

2005-12-22 19:10:09

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Thu, Dec 22, 2005 at 10:54:08AM -0800, Christoph Lameter wrote:
> On Thu, 22 Dec 2005, Sonny Rao wrote:
>
> > Yes, rc6 + the patch you provided.
>
> We may be going down the wrong path here. Has someone other than Sonny
> reproduced the problem?

Hi, I've also just reproduced the problem on another machine which does
have multiple cpus/node rather than just one cpu/node. The crash
occurs at the same place when I attempt to offline the last cpu in a
node.

But, I agree that someone else should repro this. I only have ppc64
machines available to me right now.

Sonny

2005-12-22 19:45:52

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Thu, Dec 22, 2005 at 10:37:50AM -0800, Ravikiran G Thirumalai wrote:
> On Thu, Dec 22, 2005 at 12:53:11PM -0500, Sonny Rao wrote:
> > On Thu, Dec 22, 2005 at 11:37:00AM -0600, Sonny Rao wrote:
> > > On Thu, Dec 22, 2005 at 01:27:43AM -0800, Ravikiran G Thirumalai wrote:
> > > > On Mon, Dec 19, 2005 at 12:16:59AM -0500, Sonny Rao wrote:
> > > > > (apologies if this is a dup)
> > > > ...
> > > > Sonny,
> > > > Does this patch fix the issue? This one applies cleanly on 2.6.15-rc6
> > > > unlike the one that was sent to you earlier.
> > >
> > > Hi, thanks, now I'm getting a slightly different error,
> > > hitting a BUG in the slab debug code:
> > >
> > > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > > cpu 0x4: Vector: 700 (Program Check) at [c0000003a8c233f0]
> > > pc: c00000000009bb2c: .check_slabp+0x130/0x188
> > > lr: c00000000009bb28: .check_slabp+0x12c/0x188
> > > sp: c0000003a8c23670
> > > msr: 8000000000021032
> > > current = 0xc0000001b95297f0
> > > paca = 0xc0000000005d7000
> > > pid = 11116, comm = bash
> > > kernel BUG in check_slabp at mm/slab.c:2368!
> > > enter ? for help
> > >
> > >
> > > 4:mon> t
> > > [c0000003a8c23700] c00000000009d918 .free_block+0x168/0x294
> > > [c0000003a8c237e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > > [c0000003a8c238a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > > [c0000003a8c239b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > > [c0000003a8c23a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > > [c0000003a8c23b30] c0000000002bb4ec .store_online+0x88/0xe8
> > > [c0000003a8c23bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > > [c0000003a8c23c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > > [c0000003a8c23cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > > [c0000003a8c23d90] c0000000000c6288 .sys_write+0x54/0x9c
> > > [c0000003a8c23e30] c000000000008600 syscall_exit+0x0/0x18
> > > --- Exception: c01 (System Call) at 000000000fe5ec10
> > > SP (ff865560) is in userspace
> >
> > More details:
> >
> > The above crash was with SMT on, and I had already off-lined the SMT
> > sibling thread.
> >
> > When I boot with SMT off, I get a slightly different crash:
>
> I think I missed the first reply above (I can't seem to find it on lkml
> either). So just to confirm, both these crashes are with the new patch on
> top of rc6?
>
> Thanks,
> Kiran
>
> >
> > ihplus:~ # echo 0 > /sys/devices/system/cpu/cpu14/online
> > cpu 0x0: Vector: 700 (Program Check) at [c0000003afa13480]
> > pc: c00000000009d960: .free_block+0x1b0/0x294
> > lr: c00000000009d95c: .free_block+0x1ac/0x294
> > sp: c0000003afa13700
> > msr: 8000000000021032
> > current = 0xc0000003afe04000
> > paca = 0xc0000000005d5000
> > pid = 10998, comm = bash
> > kernel BUG in free_block at mm/slab.c:2664!
> > enter ? for help
> >
> > 0:mon> t
> > [c0000003afa137e0] c00000000009d1dc .kfree+0x2b8/0x2d4
> > [c0000003afa138a0] c0000000000a1644 .cpuup_callback+0x144/0x618
> > [c0000003afa139b0] c0000000004a0780 .notifier_call_chain+0x68/0x9c
> > [c0000003afa13a40] c00000000007d608 .cpu_down+0x1fc/0x358
> > [c0000003afa13b30] c0000000002bb4ec .store_online+0x88/0xe8
> > [c0000003afa13bc0] c0000000002b5c14 .sysdev_store+0x4c/0x68
> > [c0000003afa13c40] c000000000119c6c .sysfs_write_file+0x118/0x1bc
> > [c0000003afa13cf0] c0000000000c6078 .vfs_write+0x100/0x200
> > [c0000003afa13d90] c0000000000c6288 .sys_write+0x54/0x9c
> > [c0000003afa13e30] c000000000008600 syscall_exit+0x0/0x18
> > --- Exception: c01 (System Call) at 000000000fe5ec10
> > SP (ff8b4560) is in userspace
> >
> > This one points to a double free somewhere

Hi, I think I've found the double free in the rc6 kernel + your patch,
starting on line 949 of the patched slab.c:

	if ((shared = l3->shared)) {
		free_block(cachep, l3->shared->entry,
			l3->shared->avail, node);
		kfree(l3->shared);
		l3->shared = NULL;
	}

	alien = l3->alien;
	l3->alien = NULL;

	spin_unlock(&l3->list_lock);

	kfree(nc);
	kfree(shared);

You conditionally free l3->shared after assigning it to the auto variable
"shared", then below that you call kfree() on "shared" again, which is a
double free.
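
For illustration only, the pattern the patch seems to be aiming for (detach the pointer under the lock, then free it exactly once after the lock is dropped) can be sketched in user-space C. This is not the kernel code: toy_lock, node_lists, checked_free and teardown_shared are hypothetical names, and the lock is a single-threaded stand-in that merely records whether it is held.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical single-threaded stand-in for a spinlock: it only tracks state. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical stand-in for the per-node lists (kmem_list3 in the thread). */
struct node_lists {
	struct toy_lock list_lock;
	void *shared;
};

static int free_calls;	/* counts non-NULL frees so a double free would show up */

/* Free wrapper that insists the list lock is not held by the caller. */
static void checked_free(struct node_lists *l3, void *p)
{
	assert(!l3->list_lock.held);
	if (p)
		free_calls++;
	free(p);
}

/* Detach the shared array under the lock, then free it once after unlocking. */
void teardown_shared(struct node_lists *l3)
{
	void *shared;

	toy_lock_acquire(&l3->list_lock);
	shared = l3->shared;	/* remember the pointer ... */
	l3->shared = NULL;	/* ... and clear the field while still locked */
	toy_lock_release(&l3->list_lock);

	checked_free(l3, shared);	/* the only free of this object */
}
```

Calling teardown_shared() a second time on the same node_lists frees nothing, because the field was already cleared under the lock.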

So, I got rid of the extra free. I don't know if this was correct but
I tried it anyway. Unfortunately this still does not work correctly.
The system hangs for a period of time and then drops into the debugger
again:

0:mon> t
[c00000000f71f890] c00000000049e5ec ._spin_lock+0x10/0x24
[c00000000f71f910] c00000000009d550 .kmem_cache_free+0x270/0x2a4
[c00000000f71f9d0] c0000000003f35e8 .kfree_skbmem+0xa0/0xfc
[c00000000f71fa50] c00000000044d01c .udp_rcv+0x7ac/0x818
[c00000000f71fb60] c000000000420b14 .ip_local_deliver+0xf8/0x3f0
[c00000000f71fbf0] c000000000420328 .ip_rcv+0x3a8/0x724
[c00000000f71fc90] c0000000003fa054 .netif_receive_skb+0x378/0x3d0
[c00000000f71fd30] c0000000003fa1c4 .process_backlog+0x118/0x254
[c00000000f71fe10] c0000000003f7d3c .net_rx_action+0x188/0x2b8
[c00000000f71fed0] c000000000060f18 .__do_softirq+0xd4/0x1b8
[c00000000f71ff90] c00000000002c78c .call_do_softirq+0x14/0x24
[c0000000005ab870] c00000000000bd30 .do_softirq+0x8c/0x9c
[c0000000005ab900] c00000000006143c .irq_exit+0x6c/0x84
[c0000000005ab980] c00000000000c060 .do_IRQ+0xe8/0x194
[c0000000005aba10] c000000000004134 hardware_interrupt_entry+0x8/0x54
--- Exception: 501 (Hardware Interrupt) at c000000000040670
.pseries_dedicated_idle+0x114/0x268
[c0000000005abde0] c000000000021048 .cpu_idle+0x4c/0x60
[c0000000005abe50] c0000000000091f4 .rest_init+0x44/0x5c
[c0000000005abed0] c00000000054e7f4 .start_kernel+0x29c/0x318
[c0000000005abf90] c000000000008494 .hmt_init+0x0/0x6c
0:mon>

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000000f71f580]
pc: c000000000238db4: ._raw_spin_lock+0x2c/0x1d0
lr: c00000000049e5ec: ._spin_lock+0x10/0x24
sp: c00000000f71f800
msr: 8000000000001032
dar: 4c
dsisr: 40000000
current = 0xc00000000061b2f0
paca = 0xc0000000005d5000
pid = 0, comm = swapper
0:mon>

2005-12-28 19:30:23

by Nathan Lynch

Subject: Re: cpu hotplug oops on 2.6.15-rc5

I wonder if this is related to the problem Sonny is seeing -- powerpc's
definitions of cpu_to_node et al. are not being used. The culprit is
some too-clever preprocessor usage in asm-generic/topology.h, for
example:

#ifndef cpu_to_node
#define cpu_to_node(cpu) (0)
#endif

But asm-powerpc/topology.h has cpu_to_node defined as a static inline
(which does not make it a preprocessor symbol), so we get the generic
- and incorrect - definition.

Does removing the #include of asm-generic/topology.h from the bottom
of asm-powerpc/topology.h have any effect?
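
The pitfall described above is easy to reproduce in isolation. In the sketch below, the first definition plays the role of an arch header and the #ifndef block plays the role of the generic fallback; GENERIC_FALLBACK_TAKEN, arch_node_of, node_of and the two-CPUs-per-node mapping are made up for illustration and are not taken from the kernel headers.

```c
#include <assert.h>

/* Stand-in for an arch header: a real function, invisible to the preprocessor. */
static inline int cpu_to_node(int cpu)
{
	return cpu / 2;	/* illustrative mapping: two CPUs per node */
}

/* Stand-in for the generic fallback header. */
#ifndef cpu_to_node
#define GENERIC_FALLBACK_TAKEN 1
#define cpu_to_node(cpu) (0)
#else
#define GENERIC_FALLBACK_TAKEN 0
#endif

/* Ordinary call syntax now expands the macro, so every CPU maps to node 0. */
int node_of(int cpu)
{
	return cpu_to_node(cpu);
}

/* Parenthesizing the name suppresses macro expansion and reaches the function. */
int arch_node_of(int cpu)
{
	return (cpu_to_node)(cpu);
}
```

A common general idiom for avoiding this is to follow the inline definition with "#define cpu_to_node cpu_to_node" so that #ifndef sees the name; whether that matches what was done in the kernel tree is not stated in this thread.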

2005-12-29 00:31:32

by Sonny Rao

Subject: Re: cpu hotplug oops on 2.6.15-rc5

On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> I wonder if this is related to the problem Sonny is seeing -- powerpc's
> definitions of cpu_to_node et al. are not being used. The culprit is
> some too-clever preprocessor usage in asm-generic/topology.h, for
> example:
>
>
> #ifndef cpu_to_node
> #define cpu_to_node(cpu) (0)
> #endif
>
> But asm-powerpc/topology.h has cpu_to_node defined as a static inline
> (which does not make it a preprocessor symbol), so we get the generic
> - and incorrect - definition.
>
> Does removing the #include of asm-generic/topology.h from the bottom
> of asm-powerpc/topology.h have any effect?

Hi, no it doesn't make a difference. That include is protected by
CONFIG_NUMA as well, so it never gets hit. At Anton's suggestion I
even put in an #error into asm-generic/topology.h to make sure it
wasn't an issue -- it didn't hit.

Sonny

2005-12-29 04:18:49

by Nathan Lynch

Subject: Re: cpu hotplug oops on 2.6.15-rc5

Sonny Rao wrote:
> On Wed, Dec 28, 2005 at 01:30:12PM -0600, Nathan Lynch wrote:
> >
> > Does removing the #include of asm-generic/topology.h from the bottom
> > of asm-powerpc/topology.h have any effect?
>
> Hi, no it doesn't make a difference. That include is protected by
> CONFIG_NUMA as well, so it never gets hit. At Anton's suggestion I
> even put in an #error into asm-generic/topology.h to make sure it
> wasn't an issue -- it didn't hit.

Gah, sorry, forgot Anton fixed this a while back.