2012-05-22 01:40:59

by Stephen Rothwell

Subject: linux-next: PowerPC boot failures in next-20120521

Hi all,

Last night's boot tests on various PowerPC systems failed like this:

calling .numa_group_init+0x0/0x3c @ 1
initcall .numa_group_init+0x0/0x3c returned 0 after 0 usecs
calling .numa_init+0x0/0x1dc @ 1
Unable to handle kernel paging request for data at address 0x00001688
Faulting instruction address: 0xc00000000016e154
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c00000000016e154 LR: c0000000001b9140 CTR: 0000000000000000
REGS: c0000003fc8c76d0 TRAP: 0300 Not tainted (3.4.0-autokern1)
MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24044022 XER: 00000003
SOFTE: 1
CFAR: 000000000000562c
DAR: 0000000000001688, DSISR: 40000000
TASK = c0000003fc8c8000[1] 'swapper/0' THREAD: c0000003fc8c4000 CPU: 0
GPR00: 0000000000000000 c0000003fc8c7950 c000000000d05b30 00000000000012d0
GPR04: 0000000000000000 0000000000001680 0000000000000000 c0000003fe032f60
GPR08: 0004005400000001 0000000000000000 ffffffffffffc980 c000000000d24fe0
GPR12: 0000000024044024 c00000000f33b000 0000000001a3fa78 00000000009bac00
GPR16: 0000000000e1f338 0000000002d513f0 0000000000001680 0000000000000000
GPR20: 0000000000000001 c0000003fc8c7c00 0000000000000000 0000000000000001
GPR24: 0000000000000001 c000000000d1b490 0000000000000000 0000000000001680
GPR28: 0000000000000000 0000000000000000 c000000000c7ce58 c0000003fe009200
NIP [c00000000016e154] .__alloc_pages_nodemask+0xc4/0x8f0
LR [c0000000001b9140] .new_slab+0xd0/0x3c0
Call Trace:
[c0000003fc8c7950] [2e6e756d615f696e] 0x2e6e756d615f696e (unreliable)
[c0000003fc8c7ae0] [c0000000001b9140] .new_slab+0xd0/0x3c0
[c0000003fc8c7b90] [c0000000001b9844] .__slab_alloc+0x254/0x5b0
[c0000003fc8c7cd0] [c0000000001bb7a4] .kmem_cache_alloc_node_trace+0x94/0x260
[c0000003fc8c7d80] [c000000000ba36d0] .numa_init+0x98/0x1dc
[c0000003fc8c7e10] [c00000000000ace4] .do_one_initcall+0x1a4/0x1e0
[c0000003fc8c7ed0] [c000000000b7b354] .kernel_init+0x124/0x2e0
[c0000003fc8c7f90] [c0000000000211c8] .kernel_thread+0x54/0x70
Instruction dump:
5400d97e 7b170020 0b000000 eb3e8000 3b800000 80190088 2f800000 40de0014
7860efe2 787c6fe2 78000fa4 7f9c0378 <e81b0008> 83f90000 2fa00000 7fff1838
---[ end trace 31fd0ba7d8756001 ]---

swapper/0 (1) used greatest stack depth: 10864 bytes left
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

I may be completely wrong, but I guess the obvious target would be the
sched/numa branch that came in via the tip tree.

Config file attached. I haven't had a chance to try to bisect this yet.

Anyone have any ideas?
--
Cheers,
Stephen Rothwell [email protected]


Attachments:
dotconfig.bz2 (15.06 kB)

2012-05-22 01:53:41

by David Rientjes

Subject: Re: linux-next: PowerPC boot failures in next-20120521

On Tue, 22 May 2012, Stephen Rothwell wrote:

> Unable to handle kernel paging request for data at address 0x00001688
> Faulting instruction address: 0xc00000000016e154
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=32 NUMA pSeries
> Modules linked in:
> NIP: c00000000016e154 LR: c0000000001b9140 CTR: 0000000000000000
> REGS: c0000003fc8c76d0 TRAP: 0300 Not tainted (3.4.0-autokern1)
> MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24044022 XER: 00000003
> SOFTE: 1
> CFAR: 000000000000562c
> DAR: 0000000000001688, DSISR: 40000000
> TASK = c0000003fc8c8000[1] 'swapper/0' THREAD: c0000003fc8c4000 CPU: 0
> GPR00: 0000000000000000 c0000003fc8c7950 c000000000d05b30 00000000000012d0
> GPR04: 0000000000000000 0000000000001680 0000000000000000 c0000003fe032f60
> GPR08: 0004005400000001 0000000000000000 ffffffffffffc980 c000000000d24fe0
> GPR12: 0000000024044024 c00000000f33b000 0000000001a3fa78 00000000009bac00
> GPR16: 0000000000e1f338 0000000002d513f0 0000000000001680 0000000000000000
> GPR20: 0000000000000001 c0000003fc8c7c00 0000000000000000 0000000000000001
> GPR24: 0000000000000001 c000000000d1b490 0000000000000000 0000000000001680
> GPR28: 0000000000000000 0000000000000000 c000000000c7ce58 c0000003fe009200
> NIP [c00000000016e154] .__alloc_pages_nodemask+0xc4/0x8f0
> LR [c0000000001b9140] .new_slab+0xd0/0x3c0
> Call Trace:
> [c0000003fc8c7950] [2e6e756d615f696e] 0x2e6e756d615f696e (unreliable)
> [c0000003fc8c7ae0] [c0000000001b9140] .new_slab+0xd0/0x3c0
> [c0000003fc8c7b90] [c0000000001b9844] .__slab_alloc+0x254/0x5b0
> [c0000003fc8c7cd0] [c0000000001bb7a4] .kmem_cache_alloc_node_trace+0x94/0x260
> [c0000003fc8c7d80] [c000000000ba36d0] .numa_init+0x98/0x1dc
> [c0000003fc8c7e10] [c00000000000ace4] .do_one_initcall+0x1a4/0x1e0
> [c0000003fc8c7ed0] [c000000000b7b354] .kernel_init+0x124/0x2e0
> [c0000003fc8c7f90] [c0000000000211c8] .kernel_thread+0x54/0x70
> Instruction dump:
> 5400d97e 7b170020 0b000000 eb3e8000 3b800000 80190088 2f800000 40de0014
> 7860efe2 787c6fe2 78000fa4 7f9c0378 <e81b0008> 83f90000 2fa00000 7fff1838
> ---[ end trace 31fd0ba7d8756001 ]---
>
> swapper/0 (1) used greatest stack depth: 10864 bytes left
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> I may be completely wrong, but I guess the obvious target would be the
> sched/numa branch that came in via the tip tree.
>
> Config file attached. I haven't had a chance to try to bisect this yet.
>
> Anyone have any ideas?

Yeah, it's sched/numa since that's what introduced numa_init(). It does
for_each_node() for each node and does a kmalloc_node() even though that
node may not be online. Slub ends up passing this node to the page
allocator through alloc_pages_exact_node(). CONFIG_DEBUG_VM would have
caught this and your config confirms it's not enabled.

sched/numa either needs a memory hotplug notifier or it needs to pass
NUMA_NO_NODE for nodes that aren't online. Until we get the former, the
following should fix it.


sched, numa: Allocate node_queue on any node for offline nodes

struct node_queue must be allocated with NUMA_NO_NODE for nodes that are
not (yet) online, otherwise the page allocator has a bad zonelist.

Signed-off-by: David Rientjes <[email protected]>
---
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -885,7 +885,8 @@ static __init int numa_init(void)

 	for_each_node(node) {
 		struct node_queue *nq = kmalloc_node(sizeof(*nq),
-				GFP_KERNEL | __GFP_ZERO, node);
+				GFP_KERNEL | __GFP_ZERO,
+				node_online(node) ? node : NUMA_NO_NODE);
 		BUG_ON(!nq);

 		spin_lock_init(&nq->lock);
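
The node-selection rule in the patch above can be modelled in user space. This is only a sketch, not kernel code: the two node maps, the small MAX_NUMNODES value and the helper names alloc_hint_for() and count_fallback_nodes() are hypothetical, chosen for illustration. It relies on the fact that for_each_node() walks possible nodes, which may include nodes that are not online.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 8        /* hypothetical: kept small for the sketch */
#define NUMA_NO_NODE (-1)     /* "no node preference", as in the kernel */

/* Hypothetical masks: nodes 0-3 are possible, only nodes 0 and 2 are online. */
static const bool possible_map[MAX_NUMNODES] = { true, true, true, true };
static const bool online_map[MAX_NUMNODES]   = { true, false, true, false };

static bool node_possible(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return possible_map[node];
}

static bool node_online(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return online_map[node];
}

/* Node hint to pass to the allocator: the node itself only if it is online. */
static int alloc_hint_for(int node)
{
	return node_online(node) ? node : NUMA_NO_NODE;
}

/* Number of possible nodes whose queue would be allocated on "any node". */
static int count_fallback_nodes(void)
{
	int count = 0;

	for (int node = 0; node < MAX_NUMNODES; node++) {
		if (!node_possible(node))
			continue;	/* mirrors for_each_node(): possible nodes only */
		if (alloc_hint_for(node) == NUMA_NO_NODE)
			count++;
	}
	return count;
}
```

With these example maps, nodes 1 and 3 are possible but offline, so their queues would be allocated without a node preference while nodes 0 and 2 keep their own node as the hint.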

2012-05-22 02:12:08

by Michael Neuling

Subject: Re: linux-next: PowerPC boot failures in next-20120521

> Hi all,
>
> Last night's boot tests on various PowerPC systems failed like this:
>
> calling .numa_group_init+0x0/0x3c @ 1
> initcall .numa_group_init+0x0/0x3c returned 0 after 0 usecs
> calling .numa_init+0x0/0x1dc @ 1
> Unable to handle kernel paging request for data at address 0x00001688
> Faulting instruction address: 0xc00000000016e154
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=32 NUMA pSeries
> Modules linked in:
> NIP: c00000000016e154 LR: c0000000001b9140 CTR: 0000000000000000
> REGS: c0000003fc8c76d0 TRAP: 0300 Not tainted (3.4.0-autokern1)
> MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24044022 XER: 00000003
> SOFTE: 1
> CFAR: 000000000000562c
> DAR: 0000000000001688, DSISR: 40000000
> TASK = c0000003fc8c8000[1] 'swapper/0' THREAD: c0000003fc8c4000 CPU: 0
> GPR00: 0000000000000000 c0000003fc8c7950 c000000000d05b30 00000000000012d0
> GPR04: 0000000000000000 0000000000001680 0000000000000000 c0000003fe032f60
> GPR08: 0004005400000001 0000000000000000 ffffffffffffc980 c000000000d24fe0
> GPR12: 0000000024044024 c00000000f33b000 0000000001a3fa78 00000000009bac00
> GPR16: 0000000000e1f338 0000000002d513f0 0000000000001680 0000000000000000
> GPR20: 0000000000000001 c0000003fc8c7c00 0000000000000000 0000000000000001
> GPR24: 0000000000000001 c000000000d1b490 0000000000000000 0000000000001680
> GPR28: 0000000000000000 0000000000000000 c000000000c7ce58 c0000003fe009200
> NIP [c00000000016e154] .__alloc_pages_nodemask+0xc4/0x8f0
> LR [c0000000001b9140] .new_slab+0xd0/0x3c0
> Call Trace:
> [c0000003fc8c7950] [2e6e756d615f696e] 0x2e6e756d615f696e (unreliable)
> [c0000003fc8c7ae0] [c0000000001b9140] .new_slab+0xd0/0x3c0
> [c0000003fc8c7b90] [c0000000001b9844] .__slab_alloc+0x254/0x5b0
> [c0000003fc8c7cd0] [c0000000001bb7a4] .kmem_cache_alloc_node_trace+0x94/0x260
> [c0000003fc8c7d80] [c000000000ba36d0] .numa_init+0x98/0x1dc
> [c0000003fc8c7e10] [c00000000000ace4] .do_one_initcall+0x1a4/0x1e0
> [c0000003fc8c7ed0] [c000000000b7b354] .kernel_init+0x124/0x2e0
> [c0000003fc8c7f90] [c0000000000211c8] .kernel_thread+0x54/0x70
> Instruction dump:
> 5400d97e 7b170020 0b000000 eb3e8000 3b800000 80190088 2f800000 40de0014
> 7860efe2 787c6fe2 78000fa4 7f9c0378 <e81b0008> 83f90000 2fa00000 7fff1838
> ---[ end trace 31fd0ba7d8756001 ]---
>
> swapper/0 (1) used greatest stack depth: 10864 bytes left
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> I may be completely wrong, but I guess the obvious target would be the
> sched/numa branch that came in via the tip tree.
>
> Config file attached. I haven't had a chance to try to bisect this yet.
>
> Anyone have any ideas?

I'm getting similar here:


console [tty0] enabled
console [hvc0] enabled
pid_max: default: 32768 minimum: 301
Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
Mount-cache hash table entries: 4096
Initializing cgroup subsys cpuacct
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
POWER7 performance monitor hardware support registered
Unable to handle kernel paging request for data at address 0x00001388
Faulting instruction address: 0xc00000000014a070
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in:
NIP: c00000000014a070 LR: c0000000001978cc CTR: c0000000000b6870
REGS: c00000007e5836b0 TRAP: 0300 Tainted: G W (3.4.0-rc6-mikey)
MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 28004022 XER: 02000000
SOFTE: 1
CFAR: 00000000000050fc
DAR: 0000000000001388, DSISR: 40000000
TASK = c00000007e560000[1] 'swapper/0' THREAD: c00000007e580000 CPU: 0
GPR00: 0000000000000000 c00000007e583930 c000000000c034d8 00000000000012d0
GPR04: 0000000000000000 0000000000001380 0000000000000000 0000000000000001
GPR08: c00000007e0dff60 0000000000000000 c000000000ca05a0 0000000000000000
GPR12: 0000000028004024 c00000000ff20000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000001380
GPR20: 0000000000000001 c000000000e14900 c000000000e148f0 0000000000000001
GPR24: c000000000c6f378 0000000000000000 0000000000001380 00000000000002aa
GPR28: 0000000000000000 0000000000000000 c000000000b576b0 c00000007e021200
NIP [c00000000014a070] .__alloc_pages_nodemask+0xd0/0x910
LR [c0000000001978cc] .new_slab+0xcc/0x3d0
Call Trace:
[c00000007e583930] [c00000007e5839c0] 0xc00000007e5839c0 (unreliable)
[c00000007e583ac0] [c0000000001978cc] .new_slab+0xcc/0x3d0
[c00000007e583b70] [c00000000072ae98] .__slab_alloc+0x38c/0x4f8
[c00000007e583cb0] [c000000000198190] .kmem_cache_alloc_node_trace+0x90/0x260
[c00000007e583d60] [c000000000a5a404] .numa_init+0x9c/0x188
[c00000007e583e00] [c00000000000aa30] .do_one_initcall+0x60/0x1e0
[c00000007e583ec0] [c000000000a40b60] .kernel_init+0x128/0x294
[c00000007e583f90] [c000000000020788] .kernel_thread+0x54/0x70
Instruction dump:
0b000000 eb1e8000 3b800000 801800a8 2f800000 409e001c 7860efe3 38000000
41820008 38000002 787c6fe2 7f9c0378 <e93a0008> 801800a4 3b600000 2fa90000
---[ end trace 31fd0ba7d8756002 ]---

Which seems to be this code in __alloc_pages_nodemask
---
	/*
	 * Check the zones suitable for the gfp_mask contain at least one
	 * valid zone. It's possible to have an empty zonelist as a result
	 * of GFP_THISNODE and a memoryless node
	 */
	if (unlikely(!zonelist->_zonerefs->zone))
c00000000014a070:	e9 3a 00 08	ld	r9,8(r26)
---

r26 is coming from r5 which is the struct zonelist *zonelist parameter
to __alloc_pages_nodemask. Having 0000000000001380 in there is clearly
a bogus pointer.
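
The link between the faulting instruction, the base register and the DAR value can be checked by hand. The sketch below decodes a PowerPC DS-form "ld" word and computes its effective address; the struct and function names are hypothetical and it handles only plain "ld", not the update or algebraic variants. Applied to the marked word e93a0008 with r26 = 0x1380 it gives 0x1388, matching the DAR above. The marked word in the first report, e81b0008, decodes as ld r0,8(r27), and with r27 = 0x1680 from that register dump it gives the reported 0x1688.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a PowerPC DS-form load doubleword: ld RT,DS(RA). */
struct ld_insn {
	unsigned rt;	/* destination register number */
	unsigned ra;	/* base register number */
	int32_t disp;	/* signed byte displacement */
};

/* Decode one instruction word; returns 0 if it is not a plain "ld". */
static int decode_ld(uint32_t insn, struct ld_insn *out)
{
	int32_t disp;

	assert(out != NULL);
	if ((insn >> 26) != 58 || (insn & 0x3) != 0)
		return 0;	/* primary opcode 58 with XO = 0 is ld */

	out->rt = (insn >> 21) & 0x1f;
	out->ra = (insn >> 16) & 0x1f;

	/* The low two bits of the 16-bit field are the XO, not displacement. */
	disp = (int32_t)(insn & 0xfffc);
	if (disp & 0x8000)
		disp -= 0x10000;	/* sign-extend from 16 bits */
	out->disp = disp;
	return 1;
}

/* Effective address of the load, given the value held in register RA. */
static uint64_t ld_effective_address(const struct ld_insn *insn, uint64_t ra_value)
{
	/* RA = 0 means a literal zero base rather than the contents of r0. */
	uint64_t base = insn->ra ? ra_value : 0;

	return base + (uint64_t)(int64_t)insn->disp;
}
```

Calling decode_ld(0xe93a0008, &insn) followed by ld_effective_address(&insn, 0x1380) reproduces the address from the oops, which supports reading the fault as a dereference of a small bogus zonelist pointer plus a fixed 8-byte member offset.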

Bisecting it points to b4cdf91668c27a5a6a5a3ed4234756c042dd8288
b4cdf91 sched/numa: Implement numa balancer

Trying David's patch just posted doesn't fix it.

Mikey

2012-05-22 02:25:06

by David Rientjes

Subject: Re: linux-next: PowerPC boot failures in next-20120521

On Tue, 22 May 2012, Michael Neuling wrote:

> console [tty0] enabled
> console [hvc0] enabled
> pid_max: default: 32768 minimum: 301
> Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
> Mount-cache hash table entries: 4096
> Initializing cgroup subsys cpuacct
> Initializing cgroup subsys devices
> Initializing cgroup subsys freezer
> POWER7 performance monitor hardware support registered
> Unable to handle kernel paging request for data at address 0x00001388
> Faulting instruction address: 0xc00000000014a070
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=1024 NUMA pSeries
> Modules linked in:
> NIP: c00000000014a070 LR: c0000000001978cc CTR: c0000000000b6870
> REGS: c00000007e5836b0 TRAP: 0300 Tainted: G W (3.4.0-rc6-mikey)
> MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 28004022 XER: 02000000
> SOFTE: 1
> CFAR: 00000000000050fc
> DAR: 0000000000001388, DSISR: 40000000
> TASK = c00000007e560000[1] 'swapper/0' THREAD: c00000007e580000 CPU: 0
> GPR00: 0000000000000000 c00000007e583930 c000000000c034d8 00000000000012d0
> GPR04: 0000000000000000 0000000000001380 0000000000000000 0000000000000001
> GPR08: c00000007e0dff60 0000000000000000 c000000000ca05a0 0000000000000000
> GPR12: 0000000028004024 c00000000ff20000 0000000000000000 0000000000000000
> GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000001380
> GPR20: 0000000000000001 c000000000e14900 c000000000e148f0 0000000000000001
> GPR24: c000000000c6f378 0000000000000000 0000000000001380 00000000000002aa
> GPR28: 0000000000000000 0000000000000000 c000000000b576b0 c00000007e021200
> NIP [c00000000014a070] .__alloc_pages_nodemask+0xd0/0x910
> LR [c0000000001978cc] .new_slab+0xcc/0x3d0
> Call Trace:
> [c00000007e583930] [c00000007e5839c0] 0xc00000007e5839c0 (unreliable)
> [c00000007e583ac0] [c0000000001978cc] .new_slab+0xcc/0x3d0
> [c00000007e583b70] [c00000000072ae98] .__slab_alloc+0x38c/0x4f8
> [c00000007e583cb0] [c000000000198190] .kmem_cache_alloc_node_trace+0x90/0x260
> [c00000007e583d60] [c000000000a5a404] .numa_init+0x9c/0x188
> [c00000007e583e00] [c00000000000aa30] .do_one_initcall+0x60/0x1e0
> [c00000007e583ec0] [c000000000a40b60] .kernel_init+0x128/0x294
> [c00000007e583f90] [c000000000020788] .kernel_thread+0x54/0x70
> Instruction dump:
> 0b000000 eb1e8000 3b800000 801800a8 2f800000 409e001c 7860efe3 38000000
> 41820008 38000002 787c6fe2 7f9c0378 <e93a0008> 801800a4 3b600000 2fa90000
> ---[ end trace 31fd0ba7d8756002 ]---
>
> Which seems to be this code in __alloc_pages_nodemask
> ---
> /*
> * Check the zones suitable for the gfp_mask contain at least one
> * valid zone. It's possible to have an empty zonelist as a result
> * of GFP_THISNODE and a memoryless node
> */
> if (unlikely(!zonelist->_zonerefs->zone))
> c00000000014a070: e9 3a 00 08 ld r9,8(r26)
> ---
>
> r26 is coming from r5 which is the struct zonelist *zonelist parameter
> to __alloc_pages_nodemask. Having 0000000000001380 in there is clearly
> a bogus pointer.
>
> Bisecting it points to b4cdf91668c27a5a6a5a3ed4234756c042dd8288
> b4cdf91 sched/numa: Implement numa balancer
>
> Trying David's patch just posted doesn't fix it.
>

Hmm, what does CONFIG_DEBUG_VM say?

2012-05-22 02:39:15

by Michael Neuling

Subject: Re: linux-next: PowerPC boot failures in next-20120521

> > Trying David's patch just posted doesn't fix it.
> >
>
> Hmm, what does CONFIG_DEBUG_VM say?

No set.

Mikey

2012-05-22 02:40:26

by Michael Neuling

Subject: Re: linux-next: PowerPC boot failures in next-20120521

Michael Neuling <[email protected]> wrote:

> > > Trying David's patch just posted doesn't fix it.
> > >
> >
> > Hmm, what does CONFIG_DEBUG_VM say?
>
> No set.

Sorry, should have read "Not set"

mikey

2012-05-22 02:44:44

by David Rientjes

Subject: Re: linux-next: PowerPC boot failures in next-20120521

On Tue, 22 May 2012, Michael Neuling wrote:

> > > > Trying David's patch just posted doesn't fix it.
> > > >
> > >
> > > Hmm, what does CONFIG_DEBUG_VM say?
> >
> > No set.
>
> Sorry, should have read "Not set"
>

I mean if it's set, what does it emit to the kernel log with my patch
applied?

I made CONFIG_DEBUG_VM catch !node_online(node) about six months ago, so I
was thinking it would have caught this if either you or Stephen had enabled it.

2012-05-22 02:51:46

by Michael Neuling

Subject: Re: linux-next: PowerPC boot failures in next-20120521

David Rientjes <[email protected]> wrote:

> On Tue, 22 May 2012, Michael Neuling wrote:
>
> > > > > Trying David's patch just posted doesn't fix it.
> > > > >
> > > >
> > > > Hmm, what does CONFIG_DEBUG_VM say?
> > >
> > > No set.
> >
> > Sorry, should have read "Not set"
> >
>
> I mean if it's set, what does it emit to the kernel log with my patch
> applied?
>
> I made CONFIG_DEBUG_VM catch !node_online(node) about six months ago, so I
> was thinking it would have caught this if either you or Stephen had enabled it.

Sorry, got it... CONFIG_DEBUG_VM enabled below...

pid_max: default: 32768 minimum: 301
Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
Mount-cache hash table entries: 4096
Initializing cgroup subsys cpuacct
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
POWER7 performance monitor hardware support registered
------------[ cut here ]------------
kernel BUG at /scratch/mikey/src/linux-next/include/linux/gfp.h:318!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in:
NIP: c000000000199164 LR: c0000000001993e0 CTR: c0000000000b6b70
REGS: c00000007e583830 TRAP: 0700 Tainted: G W (3.4.0-rc6-mikey)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28004028 XER: 02000000
SOFTE: 1
CFAR: c0000000001993c4
TASK = c00000007e560000[1] 'swapper/0' THREAD: c00000007e580000 CPU: 0
GPR00: 0000000000000001 c00000007e583ab0 c000000000c035a0 00000000000012d0
GPR04: 0000000000000000 0000000000000001 c000000000e14900 0005055500000001
GPR08: 0000000000000001 00000000000012d0 c000000000c6f398 0000000000000001
GPR12: 0000000028004022 c00000000ff20000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000001380 0000000000000000
GPR20: 0000000000000001 c000000000e14900 c000000000e148f0 0000000000210d00
GPR24: 0000000000000001 00000000000000d0 00000000000002aa 0000000000000000
GPR28: 00000000000000d0 0000000000000001 c000000000b58fc8 c00000007e021200
NIP [c000000000199164] .new_slab+0xb4/0x440
LR [c0000000001993e0] .new_slab+0x330/0x440
Call Trace:
[c00000007e583ab0] [c0000000001993e0] .new_slab+0x330/0x440 (unreliable)
[c00000007e583b60] [c00000000072ce84] .__slab_alloc+0x3bc/0x52c
[c00000007e583ca0] [c000000000199b08] .kmem_cache_alloc_node_trace+0x98/0x280
[c00000007e583d60] [c000000000a5a440] .numa_init+0x9c/0x188
[c00000007e583e00] [c00000000000aa30] .do_one_initcall+0x60/0x1e0
[c00000007e583ec0] [c000000000a40b60] .kernel_init+0x128/0x294
[c00000007e583f90] [c000000000020788] .kernel_thread+0x54/0x70
Instruction dump:
7b5b8402 7f6407b4 7c1ce378 7d29e038 7b990020 61291200 79230020 419202b8
2b9d00ff 78840020 38000001 409d0240 <0b000000> e95e8140 792977e2 7bab1f24
---[ end trace 31fd0ba7d8756002 ]---

2012-05-22 02:58:37

by David Rientjes

Subject: Re: linux-next: PowerPC boot failures in next-20120521

On Tue, 22 May 2012, Michael Neuling wrote:

> Sorry, got it... CONFIG_DEBUG_VM enabled below...
>
> pid_max: default: 32768 minimum: 301
> Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
> Mount-cache hash table entries: 4096
> Initializing cgroup subsys cpuacct
> Initializing cgroup subsys devices
> Initializing cgroup subsys freezer
> POWER7 performance monitor hardware support registered
> ------------[ cut here ]------------
> kernel BUG at /scratch/mikey/src/linux-next/include/linux/gfp.h:318!

Yeah, this is what I was expecting, it's tripping on

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

and slub won't pass nid < 0. You're sure my patch is applied? :)
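
Why a NUMA_NO_NODE hint avoids this check can be sketched in user space. The predicate below copies the VM_BUG_ON condition quoted above. The choose_path() helper is an assumption about how the slab side behaves, based only on the statement that slub won't pass nid < 0 to the exact-node path; the node map, enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 8        /* hypothetical: kept small for the sketch */
#define NUMA_NO_NODE (-1)     /* "no node preference", as in the kernel */

/* Hypothetical online map: only nodes 0 and 2 are online. */
static const bool online_map[MAX_NUMNODES] = { true, false, true, false };

static bool node_online(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return online_map[nid];
}

/* Same condition as the VM_BUG_ON quoted above; true means it would fire. */
static bool exact_node_check_fires(int nid)
{
	return nid < 0 || nid >= MAX_NUMNODES || !node_online(nid);
}

enum alloc_path { ALLOC_ANY_NODE, ALLOC_EXACT_NODE };

/* Assumed slab-side choice: no preference never takes the exact-node path. */
static enum alloc_path choose_path(int node)
{
	return node == NUMA_NO_NODE ? ALLOC_ANY_NODE : ALLOC_EXACT_NODE;
}

/* Would an allocation with this node hint trip the debug check? */
static bool would_trip(int node)
{
	return choose_path(node) == ALLOC_EXACT_NODE &&
	       exact_node_check_fires(node);
}
```

Under these assumptions an offline node id such as 1 trips the check, while NUMA_NO_NODE and any online node do not, which is consistent with the BUG still firing on a tree where the patch is not in effect.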

2012-05-22 03:04:19

by Stephen Rothwell

Subject: Re: linux-next: PowerPC boot failures in next-20120521

Hi David,

On Mon, 21 May 2012 18:53:37 -0700 (PDT) David Rientjes <[email protected]> wrote:
>
> Yeah, it's sched/numa since that's what introduced numa_init(). It does
> for_each_node() for each node and does a kmalloc_node() even though that
> node may not be online. Slub ends up passing this node to the page
> allocator through alloc_pages_exact_node(). CONFIG_DEBUG_VM would have
> caught this and your config confirms it's not enabled.
>
> sched/numa either needs a memory hotplug notifier or it needs to pass
> NUMA_NO_NODE for nodes that aren't online. Until we get the former, the
> following should fix it.
>
>
> sched, numa: Allocate node_queue on any node for offline nodes
>
> struct node_queue must be allocated with NUMA_NO_NODE for nodes that are
> not (yet) online, otherwise the page allocator has a bad zonelist.
>
> Signed-off-by: David Rientjes <[email protected]>

Thanks, that fixes it.

Tested-by: Stephen Rothwell <[email protected]>

--
Cheers,
Stephen Rothwell [email protected]



2012-05-22 03:12:19

by Michael Neuling

Subject: Re: linux-next: PowerPC boot failures in next-20120521

David Rientjes <[email protected]> wrote:

> On Tue, 22 May 2012, Michael Neuling wrote:
>
> > Sorry, got it... CONFIG_DEBUG_VM enabled below...
> >
> > pid_max: default: 32768 minimum: 301
> > Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
> > Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
> > Mount-cache hash table entries: 4096
> > Initializing cgroup subsys cpuacct
> > Initializing cgroup subsys devices
> > Initializing cgroup subsys freezer
> > POWER7 performance monitor hardware support registered
> > ------------[ cut here ]------------
> > kernel BUG at /scratch/mikey/src/linux-next/include/linux/gfp.h:318!
>
> Yeah, this is what I was expecting, it's tripping on
>
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> and slub won't pass nid < 0. You're sure my patch is applied? :)

I did have your patch applied but at "b4cdf91 sched/numa: Implement numa
balancer" (where git bisect spotted the fail).

If I apply your patch on the full next-20120521 it does fix the problem.

Sorry for the confusion.

Thanks!
Mikey

2012-05-22 03:25:18

by Stephen Rothwell

Subject: Re: linux-next: PowerPC boot failures in next-20120521

On Tue, 22 May 2012 13:03:54 +1000 Stephen Rothwell <[email protected]> wrote:
>
> On Mon, 21 May 2012 18:53:37 -0700 (PDT) David Rientjes <[email protected]> wrote:
> >
> > Yeah, it's sched/numa since that's what introduced numa_init(). It does
> > for_each_node() for each node and does a kmalloc_node() even though that
> > node may not be online. Slub ends up passing this node to the page
> > allocator through alloc_pages_exact_node(). CONFIG_DEBUG_VM would have
> > caught this and your config confirms it's not enabled.
> >
> > sched/numa either needs a memory hotplug notifier or it needs to pass
> > NUMA_NO_NODE for nodes that aren't online. Until we get the former, the
> > following should fix it.
> >
> >
> > sched, numa: Allocate node_queue on any node for offline nodes
> >
> > struct node_queue must be allocated with NUMA_NO_NODE for nodes that are
> > not (yet) online, otherwise the page allocator has a bad zonelist.
> >
> > Signed-off-by: David Rientjes <[email protected]>
>
> Thanks, that fixes it.
>
> Tested-by: Stephen Rothwell <[email protected]>

And I will put that patch in linux-next until it (or something better)
appears.

--
Cheers,
Stephen Rothwell [email protected]



2012-05-23 04:18:07

by David Rientjes

Subject: [patch] sched, numa: Allocate node_queue on any node for offline nodes

struct node_queue must be allocated with NUMA_NO_NODE for nodes that are
not (yet) online, otherwise the page allocator has a bad zonelist and
results in an early crash.

Tested-by: Stephen Rothwell <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
kernel/sched/numa.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -885,7 +885,8 @@ static __init int numa_init(void)

 	for_each_node(node) {
 		struct node_queue *nq = kmalloc_node(sizeof(*nq),
-				GFP_KERNEL | __GFP_ZERO, node);
+				GFP_KERNEL | __GFP_ZERO,
+				node_online(node) ? node : NUMA_NO_NODE);
 		BUG_ON(!nq);

 		spin_lock_init(&nq->lock);

2012-05-23 15:33:45

by David Rientjes

Subject: [tip:sched/numa] sched/numa: Allocate 'struct node_queue' on any node for offline nodes

Commit-ID: 183e4c2d3fa57d1d72d60fb1f0c2a0870d681f8d
Gitweb: http://git.kernel.org/tip/183e4c2d3fa57d1d72d60fb1f0c2a0870d681f8d
Author: David Rientjes <[email protected]>
AuthorDate: Tue, 22 May 2012 21:17:56 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 23 May 2012 16:08:45 +0200

sched/numa: Allocate 'struct node_queue' on any node for offline nodes

'struct node_queue' must be allocated with NUMA_NO_NODE for nodes
that are not (yet) online, otherwise the page allocator has a
bad zonelist and results in an early crash.

Tested-by: Stephen Rothwell <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: [email protected]
Cc: Lee Schermerhorn <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/numa.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
index 8eb92f7..fdf737d 100644
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -885,7 +885,8 @@ static __init int numa_init(void)

 	for_each_node(node) {
 		struct node_queue *nq = kmalloc_node(sizeof(*nq),
-				GFP_KERNEL | __GFP_ZERO, node);
+				GFP_KERNEL | __GFP_ZERO,
+				node_online(node) ? node : NUMA_NO_NODE);
 		BUG_ON(!nq);

 		spin_lock_init(&nq->lock);

2012-06-13 13:37:32

by David Rientjes

Subject: [tip:sched/numa] sched/numa: Allocate 'struct node_queue' on any node for offline nodes

Commit-ID: 1f49a99116069a8d9dc6027862277a766c2ef17e
Gitweb: http://git.kernel.org/tip/1f49a99116069a8d9dc6027862277a766c2ef17e
Author: David Rientjes <[email protected]>
AuthorDate: Tue, 22 May 2012 21:17:56 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 13 Jun 2012 15:25:37 +0200

sched/numa: Allocate 'struct node_queue' on any node for offline nodes

'struct node_queue' must be allocated with NUMA_NO_NODE for nodes
that are not (yet) online, otherwise the page allocator has a
bad zonelist and results in an early crash.

Tested-by: Stephen Rothwell <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: [email protected]
Cc: Lee Schermerhorn <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/numa.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
index 77fa7d4..002f71c 100644
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -829,7 +829,8 @@ static __init int numa_init(void)

 	for_each_node(node) {
 		struct node_queue *nq = kmalloc_node(sizeof(*nq),
-				GFP_KERNEL | __GFP_ZERO, node);
+				GFP_KERNEL | __GFP_ZERO,
+				node_online(node) ? node : NUMA_NO_NODE);
 		BUG_ON(!nq);

 		spin_lock_init(&nq->lock);