2008-01-15 15:10:00

by Olaf Hering

Subject: crash in kmem_cache_init


Current linus tree crashes in kmem_cache_init, as shown below. The
system is a 8cpu 2.2GHz POWER5 system, model 9117-570, with 4GB ram.
Firmware is 240_332, 2.6.23 boots ok with the same config.

There is a series of mm related patches in 2.6.24-rc1:
commit 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 seems to break it,

==> .git/BISECT_LOG <==
git-bisect start
# good: [0b8bc8b91cf6befea20fe78b90367ca7b61cfa0d] Linux 2.6.23
git-bisect good 0b8bc8b91cf6befea20fe78b90367ca7b61cfa0d
# bad: [cebdeed27b068dcc3e7c311d7ec0d9c33b5138c2] Linux 2.6.24-rc1
git-bisect bad cebdeed27b068dcc3e7c311d7ec0d9c33b5138c2
# good: [9ac52315d4cf5f561f36dabaf0720c00d3553162] sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
git-bisect good 9ac52315d4cf5f561f36dabaf0720c00d3553162
# bad: [b9ec0339d8e22cadf2d9d1b010b51dc53837dfb0] add consts where appropriate in fs/nls/Kconfig fs/nls/Makefile fs/nls/nls_ascii.c fs/nls/nls_base.c fs/nls/nls_cp1250.c fs/nls/nls_cp1251.c fs/nls/nls_cp1255.c fs/nls/nls_cp437.c fs/nls/nls_cp737.c fs/nls/nls_cp775.c fs/nls/nls_cp850.c fs/nls/nls_cp852.c fs/nls/nls_cp855.c fs/nls/nls_cp857.c fs/nls/nls_cp860.c fs/nls/nls_cp861.c fs/nls/nls_cp862.c fs/nls/nls_cp863.c fs/nls/nls_cp864.c fs/nls/nls_cp865.c fs/nls/nls_cp866.c fs/nls/nls_cp869.c fs/nls/nls_cp874.c fs/nls/nls_cp932.c fs/nls/nls_cp936.c fs/nls/nls_cp949.c fs/nls/nls_cp950.c fs/nls/nls_euc-jp.c fs/nls/nls_iso8859-1.c fs/nls/nls_iso8859-13.c fs/nls/nls_iso8859-14.c fs/nls/nls_iso8859-15.c fs/nls/nls_iso8859-2.c fs/nls/nls_iso8859-3.c fs/nls/nls_iso8859-4.c fs/nls/nls_iso8859-5.c fs/nls/nls_iso8859-6.c fs/nls/nls_iso8859-7.c fs/nls/nls_iso8859-9.c fs/nls/nls_koi8-r.c fs/nls/nls_koi8-ru.c fs/nls/nls_koi8-u.c fs/nls/nls_utf8.c
git-bisect bad b9ec0339d8e22cadf2d9d1b010b51dc53837dfb0
# bad: [78a26e25ce4837a03ac3b6c32cdae1958e547639] uml: separate timer initialization
git-bisect bad 78a26e25ce4837a03ac3b6c32cdae1958e547639
# good: [4acad72ded8e3f0211bd2a762e23c28229c61a51] [IPV6]: Consolidate the ip6_pol_route_(input|output) pair
git-bisect good 4acad72ded8e3f0211bd2a762e23c28229c61a51
# good: [64da82efae0d7b5f7c478021840fd329f76d965d] Add support for PCMCIA card Sierra WIreless AC850
git-bisect good 64da82efae0d7b5f7c478021840fd329f76d965d
# bad: [37b07e4163f7306aa735a6e250e8d22293e5b8de] memoryless nodes: fixup uses of node_online_map in generic code
git-bisect bad 37b07e4163f7306aa735a6e250e8d22293e5b8de
# good: [64649a58919e66ec21792dbb6c48cb3da22cbd7f] mm: trim more holes
git-bisect good 64649a58919e66ec21792dbb6c48cb3da22cbd7f
# good: [fb53b3094888be0cf8ddf052277654268904bdf5] smbfs: convert to new aops
git-bisect good fb53b3094888be0cf8ddf052277654268904bdf5
# good: [13808910713a98cc1159291e62cdfec92cc94d05] Memoryless nodes: Generic management of nodemasks for various purposes




.............
Please wait, loading kernel...
Allocated 00a00000 bytes for kernel @ 00200000
Elf64 kernel loaded...
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: panic=1 debug xmon=on
memory layout at init:
alloc_bottom : 0000000000ac1000
alloc_top : 0000000010000000
alloc_top_hi : 00000000da000000
rmo_top : 0000000010000000
ram_top : 00000000da000000
Looking for displays
found display : /pci@800000020000002/pci@2/pci@1/display@0, opening ... done
instantiating rtas at 0x000000000f6a1000 ... done
0000000000000000 : boot cpu 0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
0000000000000004 : starting cpu hw idx 0000000000000004... done
0000000000000006 : starting cpu hw idx 0000000000000006... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000000cc2000 -> 0x0000000000cc34e4
Device tree struct 0x0000000000cc4000 -> 0x0000000000cd6000
Calling quiesce ...
returning from prom_init
Partition configured for 8 cpus.
Starting Linux PPC64 #2 SMP Tue Jan 15 14:23:02 CET 2008
-----------------------------------------------------
ppc64_pft_size = 0x1c
physicalMemorySize = 0xda000000
htab_hash_mask = 0x1fffff
-----------------------------------------------------
Linux version 2.6.24-rc7-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #2 SMP Tue Jan 15 14:23:02 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
DMA 0 -> 892928
Normal 892928 -> 892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1: 0 -> 892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 880720
Policy zone: DMA
Kernel command line: panic=1 debug xmon=on
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.070000 MHz
time_init: processor frequency = 2197.800000 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
Unable to handle kernel paging request for data at address 0x00000040
Faulting instruction address: 0xc000000000437470
cpu 0x0: Vector: 300 (Data Access) at [c00000000075b830]
pc: c000000000437470: ._spin_lock+0x20/0x88
lr: c0000000000f78a8: .cache_grow+0x7c/0x338
sp: c00000000075bab0
msr: 8000000000009032
dar: 40
dsisr: 40000000
current = 0xc000000000665a50
paca = 0xc000000000666380
pid = 0, comm = swapper
enter ? for help
[c00000000075bb30] c0000000000f78a8 .cache_grow+0x7c/0x338
[c00000000075bbf0] c0000000000f7d04 .fallback_alloc+0x1a0/0x1f4
[c00000000075bca0] c0000000000f8544 .kmem_cache_alloc+0xec/0x150
[c00000000075bd40] c0000000000fb1c0 .kmem_cache_create+0x208/0x478
[c00000000075be20] c0000000005e670c .kmem_cache_init+0x218/0x4f4
[c00000000075bee0] c0000000005bf8ec .start_kernel+0x2f8/0x3fc
[c00000000075bf90] c000000000008590 .start_here_common+0x60/0xd0
0:mon>
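
A fault at data address 0x40 with the pc in _spin_lock, called from cache_grow, is consistent with taking a lock that is embedded in a structure reached through a NULL pointer: the faulting address is then simply the offset of the lock field. The following is a minimal user-space sketch, assuming a field order modelled on struct kmem_list3 in mm/slab.c of that era. The layout, the struct name kmem_list3_model and the fake_spinlock type are assumptions made for illustration, not taken from this report.

```c
#include <stddef.h>

/* User-space stand-ins; names and layout are assumptions for illustration. */
struct list_head { struct list_head *next, *prev; };
struct fake_spinlock { unsigned int slock; };

struct kmem_list3_model {
    struct list_head slabs_partial;
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;
    struct fake_spinlock list_lock;   /* the lock cache_grow() would take */
};

/* Address that would be touched when locking through a NULL l3 pointer. */
static size_t null_l3_lock_address(void)
{
    return offsetof(struct kmem_list3_model, list_lock);
}
```

With 8-byte pointers and longs, as on ppc64, null_l3_lock_address() evaluates to 0x40 for this assumed layout, which matches the dar value in the report.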


2008-01-15 15:58:32

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Tue, Jan 15, Olaf Hering wrote:

>
> Current linus tree crashes in kmem_cache_init, as shown below. The
> system is a 8cpu 2.2GHz POWER5 system, model 9117-570, with 4GB ram.
> Firmware is 240_332, 2.6.23 boots ok with the same config.
>
> There is a series of mm related patches in 2.6.24-rc1:
> commit 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 seems to break it,

2.6.24-rc6-mm1-ppc64 boots past this point, but crashes later.
Likely unrelated to the kmem_cache_init bug:

...
matroxfb: 640x480x8bpp (virtual: 640x26214)
matroxfb: framebuffer at 0x40178000000, mapped to 0xd000080080080000, size 33554432
Console: switching to colour frame buffer device 80x30
fb0: MATROX frame buffer device
matroxfb_crtc2: secondary head of fb0 was registered as fb1
vio_register_driver: driver hvc_console registering
HVSI: registered 0 devices
Generic RTC Driver v1.07
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
pmac_zilog: 0.6 (Benjamin Herrenschmidt <[email protected]>)
input: Macintosh mouse button emulation as /devices/virtual/input/input0
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ehci_hcd 0000:c8:01.2: EHCI Host Controller
ehci_hcd 0000:c8:01.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:c8:01.2: irq 85, io mem 0x400a0002000
ehci_hcd 0000:c8:01.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 5 ports detected
Unable to handle kernel paging request for data at address 0x00000050
Faulting instruction address: 0xc0000000000fa1c4
cpu 0x7: Vector: 300 (Data Access) at [c0000000d82e7a70]
pc: c0000000000fa1c4: .cache_reap+0x74/0x29c
lr: c0000000000fa198: .cache_reap+0x48/0x29c
sp: c0000000d82e7cf0
msr: 8000000000009032
dar: 50
dsisr: 40000000
current = 0xc0000000d82d85c0
paca = 0xc000000000668e00
pid = 27, comm = events/7
enter ? for help
[c0000000d82e7cf0] c00000000070be98 vmstat_update+0x0/0x18 (unreliable)
[c0000000d82e7da0] c000000000092994 .run_workqueue+0x120/0x210
[c0000000d82e7e40] c000000000093bb8 .worker_thread+0xcc/0xf0
[c0000000d82e7f00] c000000000097b70 .kthread+0x78/0xc4
[c0000000d82e7f90] c00000000002ab74 .kernel_thread+0x4c/0x68
7:mon>
...

2008-01-17 12:14:20

by Pekka Enberg

Subject: Re: crash in kmem_cache_init

Hi Olaf,

[Adding Christoph as cc.]

On Jan 15, 2008 5:09 PM, Olaf Hering <[email protected]> wrote:
> Current linus tree crashes in kmem_cache_init, as shown below. The
> system is a 8cpu 2.2GHz POWER5 system, model 9117-570, with 4GB ram.
> Firmware is 240_332, 2.6.23 boots ok with the same config.
>
> There is a series of mm related patches in 2.6.24-rc1:
> commit 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 seems to break it,

So that's the "Memoryless nodes: Slab support" patch that I think
caused a similar oops a while ago.

> Unable to handle kernel paging request for data at address 0x00000040
> Faulting instruction address: 0xc000000000437470
> cpu 0x0: Vector: 300 (Data Access) at [c00000000075b830]
> pc: c000000000437470: ._spin_lock+0x20/0x88
> lr: c0000000000f78a8: .cache_grow+0x7c/0x338
> sp: c00000000075bab0
> msr: 8000000000009032
> dar: 40
> dsisr: 40000000
> current = 0xc000000000665a50
> paca = 0xc000000000666380
> pid = 0, comm = swapper
> enter ? for help
> [c00000000075bb30] c0000000000f78a8 .cache_grow+0x7c/0x338
> [c00000000075bbf0] c0000000000f7d04 .fallback_alloc+0x1a0/0x1f4
> [c00000000075bca0] c0000000000f8544 .kmem_cache_alloc+0xec/0x150
> [c00000000075bd40] c0000000000fb1c0 .kmem_cache_create+0x208/0x478
> [c00000000075be20] c0000000005e670c .kmem_cache_init+0x218/0x4f4
> [c00000000075bee0] c0000000005bf8ec .start_kernel+0x2f8/0x3fc
> [c00000000075bf90] c000000000008590 .start_here_common+0x60/0xd0

Looks similar to the one discussed on linux-mm ("[BUG] at
mm/slab.c:3320" thread). Christoph?

2008-01-17 14:31:09

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Thu, 17 Jan 2008, Pekka Enberg wrote:

> Looks similar to the one discussed on linux-mm ("[BUG] at
> mm/slab.c:3320" thread). Christoph?

Right. Try the latest version of the patch to fix it:

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2008-01-03 12:26:42.000000000 -0800
+++ linux-2.6/mm/slab.c 2008-01-09 15:59:49.000000000 -0800
@@ -2977,7 +2977,10 @@ retry:
}
l3 = cachep->nodelists[node];

- BUG_ON(ac->avail > 0 || !l3);
+ if (!l3)
+ return NULL;
+
+ BUG_ON(ac->avail > 0);
spin_lock(&l3->list_lock);

/* See if we can refill from the shared array */
@@ -3224,7 +3227,7 @@ static void *alternate_node_alloc(struct
nid_alloc = cpuset_mem_spread_node();
else if (current->mempolicy)
nid_alloc = slab_node(current->mempolicy);
- if (nid_alloc != nid_here)
+ if (nid_alloc != nid_here && node_state(nid_alloc, N_NORMAL_MEMORY))
return ____cache_alloc_node(cachep, flags, nid_alloc);
return NULL;
}
@@ -3439,8 +3442,14 @@ __do_cache_alloc(struct kmem_cache *cach
* We may just have run out of memory on the local node.
* ____cache_alloc_node() knows how to locate memory on other nodes
*/
- if (!objp)
- objp = ____cache_alloc_node(cache, flags, numa_node_id());
+ if (!objp) {
+ int node_id = numa_node_id();
+ if (likely(cache->nodelists[node_id])) /* fast path */
+ objp = ____cache_alloc_node(cache, flags, node_id);
+ else /* this function can do good fallback */
+ objp = __cache_alloc_node(cache, flags, node_id,
+ __builtin_return_address(0));
+ }

out:
return objp;

2008-01-17 18:12:24

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Thu, Jan 17, Christoph Lameter wrote:

> On Thu, 17 Jan 2008, Pekka Enberg wrote:
>
> > Looks similar to the one discussed on linux-mm ("[BUG] at
> > mm/slab.c:3320" thread). Christoph?
>
> Right. Try the latest version of the patch to fix it:

The patch does not help.

> Index: linux-2.6/mm/slab.c
> ===================================================================
> --- linux-2.6.orig/mm/slab.c 2008-01-03 12:26:42.000000000 -0800
> +++ linux-2.6/mm/slab.c 2008-01-09 15:59:49.000000000 -0800
> @@ -2977,7 +2977,10 @@ retry:
> }
> l3 = cachep->nodelists[node];
>
> - BUG_ON(ac->avail > 0 || !l3);
> + if (!l3)
> + return NULL;
> +
> + BUG_ON(ac->avail > 0);
> spin_lock(&l3->list_lock);
>
> /* See if we can refill from the shared array */

Is this hunk supposed to go into cache_grow()? There is no NULL check
for l3.

But if I do that, it does not help:

freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
cache_grow(2781) swapper(0):c0,j4294937299 cp c0000000006a4fb8 !l3
Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-32'

Rebooting in 1 seconds..

2008-01-17 18:59:09

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Thu, 17 Jan 2008, Olaf Hering wrote:

> The patch does not help.

Duh. We need to know more about the problem.

> > --- linux-2.6.orig/mm/slab.c 2008-01-03 12:26:42.000000000 -0800
> > +++ linux-2.6/mm/slab.c 2008-01-09 15:59:49.000000000 -0800
> > @@ -2977,7 +2977,10 @@ retry:
> > }
> > l3 = cachep->nodelists[node];
> >
> > - BUG_ON(ac->avail > 0 || !l3);
> > + if (!l3)
> > + return NULL;
> > +
> > + BUG_ON(ac->avail > 0);
> > spin_lock(&l3->list_lock);
> >
> > /* See if we can refill from the shared array */
>
> Is this hunk supposed to go into cache_grow()? There is no NULL check
> for l3.

No, it's for cache_alloc_refill. cache_grow should only be called for
nodes that have memory. l3 is always used before cache_grow is called.
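
The intent of that hunk can be sketched in user space: the refill path bails out when the node has no per-node lists, so the lock is never taken through a NULL pointer and the caller is free to fall back to another node. This is a simplified model under stated assumptions, not the kernel code; cache_model, node_lists_model, refill_from_node and MODEL_MAX_NODES are hypothetical names.

```c
#include <stddef.h>

#define MODEL_MAX_NODES 4          /* assumption: small fixed node count */

struct node_lists_model {          /* stand-in for struct kmem_list3 */
    int locked;                    /* stand-in for list_lock */
    int free_objects;
};

struct cache_model {               /* stand-in for struct kmem_cache */
    struct node_lists_model *nodelists[MODEL_MAX_NODES];
};

/*
 * Try to take one object from the given node's lists.
 * Returns 1 on success and 0 if the node has nothing to give, which
 * includes the memoryless case where nodelists[node] is NULL.
 */
static int refill_from_node(struct cache_model *cachep, int node)
{
    struct node_lists_model *l3;
    int got = 0;

    if (node < 0 || node >= MODEL_MAX_NODES)
        return 0;

    l3 = cachep->nodelists[node];
    if (!l3)
        return 0;                  /* mirrors "if (!l3) return NULL;" */

    l3->locked = 1;                /* models spin_lock(&l3->list_lock) */
    if (l3->free_objects > 0) {
        l3->free_objects--;
        got = 1;
    }
    l3->locked = 0;                /* models spin_unlock(&l3->list_lock) */
    return got;
}
```

A caller that gets 0 back for its local node would then try other nodes instead of dereferencing a missing list.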

> freeing bootmem node 1
> Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
> cache_grow(2781) swapper(0):c0,j4294937299 cp c0000000006a4fb8 !l3

Is there more backtrace information? What function called cache_grow?

2008-01-17 19:03:42

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

Could you try Pekka's suggestion of reverting
04231b3002ac53f8a64a7bd142fde3fa4b6808c6 ?

2008-01-17 19:54:58

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Thu, Jan 17, Christoph Lameter wrote:

> > freeing bootmem node 1
> > Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
> > cache_grow(2781) swapper(0):c0,j4294937299 cp c0000000006a4fb8 !l3
>
> Is there more backtrace information? What function called cache_grow?

I just put a 'if (!l3) return 0;' into cache_grow, the backtrace is the
one from the initial report.
Reverting 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 does not change
anything.


Since -mm boots further, what patch should I try?

The kernel boots on a different p570.
See attached dmesg. huckleberry boots, cranberry crashes.


--- huckleberry.suse.de-2.6.16.57-0.5-ppc64.txt 2008-01-17 20:48:18.510309000 +0100
+++ cranberry.suse.de-2.6.16.57-0.5-ppc64.txt 2008-01-17 20:48:09.425402000 +0100
@@ -1,56 +1,55 @@
Page orders: linear mapping = 24, others = 12
-Found initrd at 0xc000000002700000:0xc000000002a93000
+Found initrd at 0xc000000001300000:0xc0000000016e6c1e
Partition configured for 8 cpus.
Starting Linux PPC64 #1 SMP Wed Dec 5 09:02:21 UTC 2007
-----------------------------------------------------
-ppc64_pft_size = 0x1b
+ppc64_pft_size = 0x1c
ppc64_interrupt_controller = 0x2
platform = 0x101
-physicalMemorySize = 0x158000000
+physicalMemorySize = 0xda000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
-htab_hash_mask = 0xfffff
+htab_hash_mask = 0x1fffff
-----------------------------------------------------
[boot]0100 MM Init
[boot]0100 MM Init Done
Linux version 2.6.16.57-0.5-ppc64 (geeko@buildhost) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #1 SMP Wed Dec 5 09:02:21 UTC 2007
[boot]0012 Setup Arch
-Node 0 Memory: 0x0-0xb0000000
-Node 1 Memory: 0xb0000000-0x158000000
+Node 0 Memory:
+Node 1 Memory: 0x0-0xda000000
EEH: PCI Enhanced I/O Error Handling Enabled
-PPC64 nvram contains 7168 bytes
+PPC64 nvram contains 8192 bytes
Using dedicated idle loop
-On node 0 totalpages: 720896
- DMA zone: 720896 pages, LIFO batch:31
+On node 0 totalpages: 0
+ DMA zone: 0 pages, LIFO batch:0
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
-On node 1 totalpages: 688128
- DMA zone: 688128 pages, LIFO batch:31
+On node 1 totalpages: 892928
+ DMA zone: 892928 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
[boot]0015 Setup Done
Built 2 zonelists
-Kernel command line: root=/dev/disk/by-id/scsi-SIBM_ST373453LC_3HW1CPW500007445Q010-part5 xmon=on sysrq=1 quiet
+Kernel command line: root=/dev/system/root xmon=on sysrq=1 quiet
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 131072 bytes)
-time_init: decrementer frequency = 207.052000 MHz
-time_init: processor frequency = 1654.344000 MHz
+time_init: decrementer frequency = 275.070000 MHz
+time_init: processor frequency = 2197.800000 MHz
Console: colour dummy device 80x25
-Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
-Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
-freeing bootmem node 0
+Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
+Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
-Memory: 5524952k/5636096k available (4464k kernel code, 111144k reserved, 1992k data, 836k bss, 264k init)
-Calibrating delay loop... 413.69 BogoMIPS (lpj=2068480)
+Memory: 3494648k/3571712k available (4464k kernel code, 77064k reserved, 1992k data, 836k bss, 264k init)
+Calibrating delay loop... 548.86 BogoMIPS (lpj=2744320)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
checking if image is initramfs... it is
-Freeing initrd memory: 3660k freed
+Freeing initrd memory: 3995k freed
Processor 1 found.
Processor 2 found.
Processor 3 found.
@@ -61,7 +60,7 @@ Processor 7 found.
Brought up 8 CPUs
Node 0 CPUs: 0-3
Node 1 CPUs: 4-7
-migration_cost=41,0,4308
+migration_cost=38,0,3225
NET: Registered protocol family 16
PCI: Probing PCI hardware
IOMMU table initialized, virtual merging enabled


Attachments:
(No filename) (4.27 kB)
huckleberry.suse.de-2.6.16.57-0.5-ppc64.txt (16.28 kB)

2008-01-17 20:20:26

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Thu, Jan 17, Olaf Hering wrote:

> Since -mm boots further, what patch should I try?

rc8-mm1 crashes as well, l3 passed to reap_alien() is NULL.

2008-01-17 21:15:19

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Thu, Jan 17, Christoph Lameter wrote:

> On Thu, 17 Jan 2008, Olaf Hering wrote:
>
> > The patch does not help.
>
> Duh. We need to know more about the problem.

cache_grow is called from 3 places. The third call has cleared l3 for
some reason.


....
Allocated 00a00000 bytes for kernel @ 00200000
Elf64 kernel loaded...
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: xmon=on sysrq=1 debug panic=1
memory layout at init:
alloc_bottom : 0000000000ac1000
alloc_top : 0000000010000000
alloc_top_hi : 00000000da000000
rmo_top : 0000000010000000
ram_top : 00000000da000000
Looking for displays
found display : /pci@800000020000002/pci@2/pci@1/display@0, opening ... done
instantiating rtas at 0x000000000f6a1000 ... done
0000000000000000 : boot cpu 0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
0000000000000004 : starting cpu hw idx 0000000000000004... done
0000000000000006 : starting cpu hw idx 0000000000000006... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000000cc2000 -> 0x0000000000cc34e4
Device tree struct 0x0000000000cc4000 -> 0x0000000000cd6000
Calling quiesce ...
returning from prom_init
Partition configured for 8 cpus.
Starting Linux PPC64 #34 SMP Thu Jan 17 22:06:41 CET 2008
-----------------------------------------------------
ppc64_pft_size = 0x1c
physicalMemorySize = 0xda000000
htab_hash_mask = 0x1fffff
-----------------------------------------------------
Linux version 2.6.24-rc8-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #34 SMP Thu Jan 17 22:06:41 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
DMA 0 -> 892928
Normal 892928 -> 892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1: 0 -> 892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 880720
Policy zone: DMA
Kernel command line: xmon=on sysrq=1 debug panic=1
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.070000 MHz
time_init: processor frequency = 2197.800000 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496633k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 0 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 1 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 2 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 3 l3 c0000000005fddf0
------------[ cut here ]------------
Badness at /home/olaf/kernel/git/linux-2.6.24-rc8/mm/slab.c:2779
NIP: c0000000000f78f4 LR: c0000000000f78e0 CTR: 80000000001af404
REGS: c00000000075b880 TRAP: 0700 Not tainted (2.6.24-rc8-ppc64)
MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24000022 XER: 00000001
TASK = c000000000665a50[0] 'swapper' THREAD: c000000000758000 CPU: 0
GPR00: 0000000000000004 c00000000075bb00 c0000000007544c0 0000000000000063
GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
GPR08: ffffffffffffffff c0000000006a19a0 c0000000007a84b0 c0000000007a84a8
GPR12: 0000000000004000 c000000000666380 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 4000000000200000
GPR20: 0000000000000000 00000000007fbd70 c00000000054f6c8 00000000000492d0
GPR24: 0000000000000000 c0000000006a4fb8 c0000000006a4fb8 c0000000005fdc80
GPR28: 0000000000000000 00000000000412d0 c0000000006e5b80 0000000000000004
NIP [c0000000000f78f4] .cache_grow+0xc8/0x39c
LR [c0000000000f78e0] .cache_grow+0xb4/0x39c
Call Trace:
[c00000000075bb00] [c0000000000f78e0] .cache_grow+0xb4/0x39c (unreliable)
[c00000000075bbd0] [c0000000000f82d0] .cache_alloc_refill+0x234/0x2c0
[c00000000075bc90] [c0000000000f842c] .kmem_cache_alloc+0xd0/0x294
[c00000000075bd40] [c0000000000fb4e8] .kmem_cache_create+0x208/0x478
[c00000000075be20] [c0000000005e670c] .kmem_cache_init+0x218/0x4f4
[c00000000075bee0] [c0000000005bf8ec] .start_kernel+0x2f8/0x3fc
[c00000000075bf90] [c000000000008590] .start_here_common+0x60/0xd0
Instruction dump:
e89e80e0 e92a0000 e80b0468 7f4ad378 fbe10070 f8010078 4bf85f01 60000000
381f0001 7c1f07b4 2f9f0004 409effac <0fe00000> 7b091f24 7d29d214 eb690468
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 0 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 1 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 2 l3 c0000000005fddf0
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 3 l3 c0000000005fddf0
------------[ cut here ]------------
Badness at /home/olaf/kernel/git/linux-2.6.24-rc8/mm/slab.c:2779
NIP: c0000000000f78f4 LR: c0000000000f78e0 CTR: 80000000001af404
REGS: c00000000075b890 TRAP: 0700 Not tainted (2.6.24-rc8-ppc64)
MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24000022 XER: 00000001
TASK = c000000000665a50[0] 'swapper' THREAD: c000000000758000 CPU: 0
GPR00: 0000000000000004 c00000000075bb10 c0000000007544c0 0000000000000063
GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
GPR08: ffffffffffffffff c0000000006a19a0 c0000000007a84b0 c0000000007a84a8
GPR12: 0000000000004000 c000000000666380 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 4000000000200000
GPR20: 0000000000000000 00000000007fbd70 c00000000054f6c8 00000000000492d0
GPR24: 0000000000000000 00000000000080d0 c0000000006a4fb8 c0000000006a4fb8
GPR28: 0000000000000000 00000000000412d0 c0000000006e5b80 0000000000000004
NIP [c0000000000f78f4] .cache_grow+0xc8/0x39c
LR [c0000000000f78e0] .cache_grow+0xb4/0x39c
Call Trace:
[c00000000075bb10] [c0000000000f78e0] .cache_grow+0xb4/0x39c (unreliable)
[c00000000075bbe0] [c0000000000f7f38] .____cache_alloc_node+0x17c/0x1e8
[c00000000075bc90] [c0000000000f846c] .kmem_cache_alloc+0x110/0x294
[c00000000075bd40] [c0000000000fb4e8] .kmem_cache_create+0x208/0x478
[c00000000075be20] [c0000000005e670c] .kmem_cache_init+0x218/0x4f4
[c00000000075bee0] [c0000000005bf8ec] .start_kernel+0x2f8/0x3fc
[c00000000075bf90] [c000000000008590] .start_here_common+0x60/0xd0
Instruction dump:
e89e80e0 e92a0000 e80b0468 7f4ad378 fbe10070 f8010078 4bf85f01 60000000
381f0001 7c1f07b4 2f9f0004 409effac <0fe00000> 7b091f24 7d29d214 eb690468
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 0 l3 0000000000000000
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 1 l3 0000000000000000
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 2 l3 0000000000000000
cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 3 l3 0000000000000000
------------[ cut here ]------------
Badness at /home/olaf/kernel/git/linux-2.6.24-rc8/mm/slab.c:2779
NIP: c0000000000f78f4 LR: c0000000000f78e0 CTR: 80000000001af404
REGS: c00000000075b890 TRAP: 0700 Not tainted (2.6.24-rc8-ppc64)
MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24000022 XER: 00000001
TASK = c000000000665a50[0] 'swapper' THREAD: c000000000758000 CPU: 0
GPR00: 0000000000000004 c00000000075bb10 c0000000007544c0 0000000000000063
GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
GPR08: ffffffffffffffff c0000000006a19a0 c0000000007a84b0 c0000000007a84a8
GPR12: 0000000000004000 c000000000666380 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 4000000000200000
GPR20: 0000000000000000 00000000007fbd70 c00000000054f6c8 00000000000080d0
GPR24: 0000000000000001 c0000000d9fe4b00 c0000000006a4fb8 0000000000000000
GPR28: c0000000d8000000 00000000000000d0 c0000000006e5b80 0000000000000004
NIP [c0000000000f78f4] .cache_grow+0xc8/0x39c
LR [c0000000000f78e0] .cache_grow+0xb4/0x39c
Call Trace:
[c00000000075bb10] [c0000000000f78e0] .cache_grow+0xb4/0x39c (unreliable)
[c00000000075bbe0] [c0000000000f7d68] .fallback_alloc+0x1a0/0x1f4
[c00000000075bc90] [c0000000000f846c] .kmem_cache_alloc+0x110/0x294
[c00000000075bd40] [c0000000000fb4e8] .kmem_cache_create+0x208/0x478
[c00000000075be20] [c0000000005e670c] .kmem_cache_init+0x218/0x4f4
[c00000000075bee0] [c0000000005bf8ec] .start_kernel+0x2f8/0x3fc
[c00000000075bf90] [c000000000008590] .start_here_common+0x60/0xd0
Instruction dump:
e89e80e0 e92a0000 e80b0468 7f4ad378 fbe10070 f8010078 4bf85f01 60000000
381f0001 7c1f07b4 2f9f0004 409effac <0fe00000> 7b091f24 7d29d214 eb690468
Unable to handle kernel paging request for data at address 0x00000040
Faulting instruction address: 0xc0000000004377b8
cpu 0x0: Vector: 300 (Data Access) at [c00000000075b810]
pc: c0000000004377b8: ._spin_lock+0x20/0x88
lr: c0000000000f790c: .cache_grow+0xe0/0x39c
sp: c00000000075ba90
msr: 8000000000009032
dar: 40
dsisr: 40000000
current = 0xc000000000665a50
paca = 0xc000000000666380
pid = 0, comm = swapper
enter ? for help
[c00000000075bb10] c0000000000f790c .cache_grow+0xe0/0x39c
[c00000000075bbe0] c0000000000f7d68 .fallback_alloc+0x1a0/0x1f4
[c00000000075bc90] c0000000000f846c .kmem_cache_alloc+0x110/0x294
[c00000000075bd40] c0000000000fb4e8 .kmem_cache_create+0x208/0x478
[c00000000075be20] c0000000005e670c .kmem_cache_init+0x218/0x4f4
[c00000000075bee0] c0000000005bf8ec .start_kernel+0x2f8/0x3fc
[c00000000075bf90] c000000000008590 .start_here_common+0x60/0xd0
0:mon>



--
Used patch:

Index: linux-2.6.24-rc8/include/linux/olh.h
===================================================================
--- /dev/null
+++ linux-2.6.24-rc8/include/linux/olh.h
@@ -0,0 +1,6 @@
+#ifndef __LINUX_OLH_H
+#define __LINUX_OLH_H
+#define olh(fmt,args ...) \
+ printk(KERN_DEBUG "%s(%u) %s(%u):c%u,j%lu " fmt "\n",__FUNCTION__,__LINE__,current->comm,current->pid,smp_processor_id(),jiffies,##args)
+#endif
+
Index: linux-2.6.24-rc8/mm/slab.c
===================================================================
--- linux-2.6.24-rc8.orig/mm/slab.c
+++ linux-2.6.24-rc8/mm/slab.c
@@ -110,6 +110,7 @@
#include <linux/fault-inject.h>
#include <linux/rtmutex.h>
#include <linux/reciprocal_div.h>
+#include <linux/olh.h>

#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
@@ -2764,6 +2765,7 @@ static int cache_grow(struct kmem_cache
size_t offset;
gfp_t local_flags;
struct kmem_list3 *l3;
+ int i;

/*
* Be lazy and only check for valid flags here, keeping it out of the
@@ -2772,6 +2774,9 @@ static int cache_grow(struct kmem_cache
BUG_ON(flags & GFP_SLAB_BUG_MASK);
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

+ for (i=0;i<4;i++)
+ olh("cachep %p nodeid %d l3 %p",cachep,i,cachep->nodelists[nodeid]);
+ WARN_ON(1);
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];

2008-01-18 06:56:20

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Thu, Jan 17, Olaf Hering wrote:

> On Thu, Jan 17, Christoph Lameter wrote:
>
> > On Thu, 17 Jan 2008, Olaf Hering wrote:
> >
> > > The patch does not help.
> >
> > Duh. We need to know more about the problem.
>
> cache_grow is called from 3 places. The third call has cleared l3 for
> some reason.

Typo in debug patch.
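
Presumably the typo is that the loop in the debug patch prints cachep->nodelists[nodeid] on every iteration instead of cachep->nodelists[i], which would explain why all four lines of each dump show the same l3 value. A user-space sketch of the dump as it was probably intended, using hypothetical stand-in names (cache_dump_model, dump_nodelists):

```c
#include <stdio.h>

#define DUMP_MAX_NODES 4           /* the debug patch loops over 4 node ids */

struct cache_dump_model {          /* stand-in for struct kmem_cache */
    void *nodelists[DUMP_MAX_NODES];
};

/*
 * Print one line per node and return how many nodes have a non-NULL
 * per-node list pointer.
 */
static int dump_nodelists(const struct cache_dump_model *cachep, FILE *out)
{
    int i, present = 0;

    for (i = 0; i < DUMP_MAX_NODES; i++) {
        /* index with the loop variable i, not the caller's nodeid */
        fprintf(out, "cachep %p nodeid %d l3 %p\n",
                (void *)cachep, i, cachep->nodelists[i]);
        if (cachep->nodelists[i])
            present++;
    }
    return present;
}
```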

calls cache_grow with nodeid 0
> [c00000000075bbd0] [c0000000000f82d0] .cache_alloc_refill+0x234/0x2c0
calls cache_grow with nodeid 0
> [c00000000075bbe0] [c0000000000f7f38] .____cache_alloc_node+0x17c/0x1e8

calls cache_grow with nodeid 1
> [c00000000075bbe0] [c0000000000f7d68] .fallback_alloc+0x1a0/0x1f4

2008-01-18 18:43:00

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Fri, 18 Jan 2008, Olaf Hering wrote:

> calls cache_grow with nodeid 0
> > [c00000000075bbd0] [c0000000000f82d0] .cache_alloc_refill+0x234/0x2c0
> calls cache_grow with nodeid 0
> > [c00000000075bbe0] [c0000000000f7f38] .____cache_alloc_node+0x17c/0x1e8
>
> calls cache_grow with nodeid 1
> > [c00000000075bbe0] [c0000000000f7d68] .fallback_alloc+0x1a0/0x1f4

Hmmm... fallback_alloc should not be called during bootstrap.

2008-01-18 18:47:46

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Thu, 17 Jan 2008, Olaf Hering wrote:

> early_node_map[1] active PFN ranges
> 1: 0 -> 892928
> Could not find start_pfn for node 0

Corrupted min_pfn?

2008-01-18 18:51:33

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Thu, 17 Jan 2008, Olaf Hering wrote:

> Normal 892928 -> 892928
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 1: 0 -> 892928
> Could not find start_pfn for node 0

We only have a single node that is node 1? And then we initialize nodes 0
to 3?

> Memory: 3496633k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
> cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 0 l3 c0000000005fddf0
> cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 1 l3 c0000000005fddf0
> cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 2 l3 c0000000005fddf0
> cache_grow(2778) swapper(0):c0,j4294937299 cachep c0000000006a4fb8 nodeid 3 l3 c0000000005fddf0

???

2008-01-18 21:30:26

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (18/01/08 10:47), Christoph Lameter didst pronounce:
> On Thu, 17 Jan 2008, Olaf Hering wrote:
>
> > early_node_map[1] active PFN ranges
> > 1: 0 -> 892928
> > Could not find start_pfn for node 0
>
> Corrupted min_pfn?
>

Doubtful. Node 0 has no memory but it is still being initialised.

Still, I looked closer at what is going on when that message gets
displayed and I see this in free_area_init_nodes()

for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
free_area_init_node(nid, pgdat, NULL,
find_min_pfn_for_node(nid), NULL);

/* Any memory on that node */
if (pgdat->node_present_pages)
node_set_state(nid, N_HIGH_MEMORY);
check_for_regular_memory(pgdat);
}

This "Any memory on that node" thing is new and it says if there is any
memory on the node, set N_HIGH_MEMORY. Fine I guess, I haven't tracked these
changes closely. It calls check_for_regular_memory() which looks like

static void check_for_regular_memory(pg_data_t *pgdat)
{
#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
struct zone *zone = &pgdat->node_zones[zone_type];
if (zone->present_pages)
node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
}
#endif
}

i.e. go through the other zones and if any of them have memory, set
N_NORMAL_MEMORY. But... it only does this on CONFIG_HIGHMEM which on
PPC64 is not going to be set so N_NORMAL_MEMORY never gets set on
POWER.... That sounds bad.

mel@arnold:~/git/linux-2.6/mm$ grep -n N_NORMAL_MEMORY slab.c
1593: for_each_node_state(nid, N_NORMAL_MEMORY) {
1971: for_each_node_state(node, N_NORMAL_MEMORY) {
2102: for_each_node_state(node, N_NORMAL_MEMORY) {
3818: for_each_node_state(node, N_NORMAL_MEMORY) {

and one of them is in kmem_cache_init(). That seems very significant.
Christoph, can you think of possibilities of where N_NORMAL_MEMORY not
being set would cause trouble for slab?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-18 21:43:55

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Fri, 18 Jan 2008, Mel Gorman wrote:

> static void check_for_regular_memory(pg_data_t *pgdat)
> {
> #ifdef CONFIG_HIGHMEM
> enum zone_type zone_type;
>
> for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
> struct zone *zone = &pgdat->node_zones[zone_type];
> if (zone->present_pages)
> node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
> }
> #endif
> }
>
> i.e. go through the other zones and if any of them have memory, set
> N_NORMAL_MEMORY. But... it only does this on CONFIG_HIGHMEM which on
> PPC64 is not going to be set so N_NORMAL_MEMORY never gets set on
> POWER.... That sounds bad.

Argh. We may need to do a

node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY) in the !HIGHMEM case.

> and one of them is in kmem_cache_init(). That seems very significant.
> Christoph, can you think of possibilities of where N_NORMAL_MEMORY not
> being set would cause trouble for slab?

Yes. That results in the per node structures not being created and thus l3
== NULL. Explains our failures.

2008-01-18 22:16:39

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

Could you try this patch?

Memoryless nodes: Set N_NORMAL_MEMORY for a node if we do not support HIGHMEM

It seems that we scan through zones to set N_NORMAL_MEMORY only if
CONFIG_HIGHMEM and CONFIG_NUMA are set. We need to set N_NORMAL_MEMORY
in the !CONFIG_HIGHMEM case.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2008-01-18 14:08:41.000000000 -0800
+++ linux-2.6/mm/page_alloc.c 2008-01-18 14:13:34.000000000 -0800
@@ -3812,7 +3812,6 @@ restart:
/* Any regular memory on that node ? */
static void check_for_regular_memory(pg_data_t *pgdat)
{
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
@@ -3820,7 +3819,6 @@ static void check_for_regular_memory(pg_
if (zone->present_pages)
node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
}
-#endif
}

/**

2008-01-18 22:20:16

by Nish Aravamudan

Subject: Re: crash in kmem_cache_init

On 1/18/08, Christoph Lameter <[email protected]> wrote:
> Could you try this patch?
>
> Memoryless nodes: Set N_NORMAL_MEMORY for a node if we do not support
> HIGHMEM
>
> It seems that we scan through zones to set N_NORMAL_MEMORY only if
> CONFIG_HIGHMEM and CONFIG_NUMA are set. We need to set
> N_NORMAL_MEMORY
> in the !CONFIG_HIGHMEM case.

I'm testing this exact patch right now on the machine Mel saw the issues with.

Thanks,
Nish

2008-01-18 22:38:17

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Fri, 18 Jan 2008, Christoph Lameter wrote:

> Memoryless nodes: Set N_NORMAL_MEMORY for a node if we do not support HIGHMEM

If !CONFIG_HIGHMEM then

enum node_states {
#ifdef CONFIG_HIGHMEM
N_HIGH_MEMORY, /* The node has regular or high memory */
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif

So
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
free_area_init_node(nid, pgdat, NULL,
find_min_pfn_for_node(nid), NULL);

/* Any memory on that node */
if (pgdat->node_present_pages)
node_set_state(nid, N_HIGH_MEMORY);
^^^ sets N_NORMAL_MEMORY
check_for_regular_memory(pgdat);
}
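
The aliasing described above can be checked in isolation. What follows is a minimal userspace sketch, not kernel code: the two memory states mirror the enum fragment quoted in this message, the other enum entries are assumed, and `mock_node_states`, `mock_node_set_state()` and `mock_node_state()` are hypothetical stand-ins for the kernel's nodemask helpers. Built without `CONFIG_HIGHMEM` defined, marking a node as `N_HIGH_MEMORY` marks it as `N_NORMAL_MEMORY` as well.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_NODES 4

/* The two memory states mirror the quoted enum; the rest is assumed. */
enum node_states {
	N_POSSIBLE,
	N_ONLINE,
	N_NORMAL_MEMORY,
#ifdef CONFIG_HIGHMEM
	N_HIGH_MEMORY,
#else
	N_HIGH_MEMORY = N_NORMAL_MEMORY,	/* alias when there is no highmem */
#endif
	NR_NODE_STATES
};

/* Hypothetical stand-in for the kernel's per-state nodemasks. */
static unsigned long mock_node_states[NR_NODE_STATES];

/* Mark node nid as being in the given state. */
static void mock_node_set_state(int nid, enum node_states state)
{
	assert(nid >= 0 && nid < MOCK_MAX_NODES);
	mock_node_states[state] |= 1UL << nid;
}

/* Report whether node nid is in the given state. */
static bool mock_node_state(int nid, enum node_states state)
{
	assert(nid >= 0 && nid < MOCK_MAX_NODES);
	return (mock_node_states[state] >> nid) & 1UL;
}
```

Calling `mock_node_set_state(1, N_HIGH_MEMORY)` and then querying `mock_node_state(1, N_NORMAL_MEMORY)` returns true in this model, which is the point of the `^^^ sets N_NORMAL_MEMORY` annotation.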

2008-01-18 22:57:17

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Fri, Jan 18, Christoph Lameter wrote:

> Could you try this patch?

Does not help, same crash.

2008-01-19 04:55:40

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Fri, 18 Jan 2008, Olaf Hering wrote:

> calls cache_grow with nodeid 0
> > [c00000000075bbd0] [c0000000000f82d0] .cache_alloc_refill+0x234/0x2c0
> calls cache_grow with nodeid 0
> > [c00000000075bbe0] [c0000000000f7f38] .____cache_alloc_node+0x17c/0x1e8
>
> calls cache_grow with nodeid 1
> > [c00000000075bbe0] [c0000000000f7d68] .fallback_alloc+0x1a0/0x1f4

Okay, that makes sense. You have no node 0 with normal memory but the node
assigned to the executing processor is zero (correct?). Thus it needs to
fall back to node 1 and that is not possible during bootstrap. You need to
run kmem_cache_init() on a cpu on a node with memory.

Or we need to revert the patch, which would again allocate control
structures for all online nodes regardless of whether they have memory.

Does reverting 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 change the
situation? (However, we tried this on the other thread without success).
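
The failure mode described in this message can be modelled outside the kernel. The sketch below is a hedged userspace illustration, not the slab code: `struct mock_cache`, `struct mock_l3`, `mock_setup_nodelists()` and the bitmask argument are invented for the purpose. It contrasts creating per-node structures only for nodes with normal memory against creating them for every online node, assuming the layout suggested by the log (nodes 0 and 1 online, memory only on node 1, boot CPU on node 0).

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_NODES 4

/* Hypothetical stand-in for struct kmem_list3. */
struct mock_l3 {
	int initialised;
};

/* Hypothetical stand-in for struct kmem_cache. */
struct mock_cache {
	struct mock_l3 *nodelists[MOCK_MAX_NODES];
};

static struct mock_l3 mock_l3_storage[MOCK_MAX_NODES];

/* Create a per-node structure for every node whose bit is set in mask. */
static void mock_setup_nodelists(struct mock_cache *cachep, unsigned long mask)
{
	int nid;

	assert(cachep != NULL);
	for (nid = 0; nid < MOCK_MAX_NODES; nid++) {
		cachep->nodelists[nid] = NULL;
		if (mask & (1UL << nid)) {
			mock_l3_storage[nid].initialised = 1;
			cachep->nodelists[nid] = &mock_l3_storage[nid];
		}
	}
}
```

With a mask holding only node 1 (the node with normal memory), `nodelists[0]` stays NULL, so code running on a CPU of node 0 has no local structure to use. With a mask holding both online nodes, `nodelists[0]` is populated, which corresponds to the revert option mentioned above.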

2008-01-19 04:56:26

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Thu, 17 Jan 2008, Olaf Hering wrote:

> On Thu, Jan 17, Olaf Hering wrote:
>
> > Since -mm boots further, what patch should I try?
>
> rc8-mm1 crashes as well, l3 passed to reap_alien() is NULL.

Sigh. It looks like we need alien cache structures in some cases for nodes
that have no memory. We must allocate structures for all nodes regardless
of whether they have allocatable memory.

2008-01-22 19:55:06

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (18/01/08 23:57), Olaf Hering didst pronounce:
> On Fri, Jan 18, Christoph Lameter wrote:
>
> > Could you try this patch?
>
> Does not help, same crash.
>

Hi Olaf,

It was suggested this problem was the same as another slab-related boot problem
that was fixed for 2.6.24 by reverting a change. This fix can be found at
http://www.csn.ul.ie/~mel/postings/slab-20080122/partial-revert-slab-changes.patch
. Can you please check on your machine if it fixes your problem?

I am 99.9999% sure it will *not* fix your problem because there were two bugs,
not one as previously believed. On two test machines here, this kmem_cache_init
problem still happens even with the revert which fixed a third machine. I
was delayed in testing because these boxen were unavailable from Friday until
yesterday evening (a stellar display of timing). It was missed on TKO because
it was SLAB-specific and those machines were testing SLUB. I found that the
patch below was necessary to fix the problem.

Olaf, please confirm whether you need the patch below as well as the
revert to make your machine boot.

Christoph/Pekka, this patch is papering over the problem and something
more fundamental may be going wrong. The crash occurs because l3 is NULL
and the cache is kmem_cache so this is early in the boot process. It is
selecting l3 based on node 2 which is correct in terms of available memory
but it initialises the lists on node 0 because that is the node the CPUs are
located on. Hence later it uses an uninitialised nodelists entry and BLAM.
The relevant parts of the log showing the memoryless nodes in relation to
CPUs are:

early_node_map[1] active PFN ranges
2: 0 -> 1048576
Processor 1 found.
clockevent: decrementer mult[3cf1] shift[16] cpu[2]
Processor 2 found.
clockevent: decrementer mult[3cf1] shift[16] cpu[3]
Processor 3 found.
Brought up 4 CPUs
Node 0 CPUs: 0-3
Node 2 CPUs:

Can you see a better solution than this?

====
Recent changes to how slab operates mean a situation can occur on systems
with memoryless nodes whereby the nodeid used when growing the slab does
not map to an initialised kmem_list3. The following patch checks the
preferred nodeid and, if it has no kmem_list3, uses numa_node_id() instead.

Signed-off-by: Mel Gorman <[email protected]>

---
mm/slab.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c
--- linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c 2008-01-22 17:46:32.000000000 +0000
+++ linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c 2008-01-22 18:42:53.000000000 +0000
@@ -2775,6 +2775,11 @@ static int cache_grow(struct kmem_cache
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
+ BUG_ON(!l3);
spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3322,10 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
BUG_ON(!l3);

retry:


--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
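
The node selection added by the patch above can be exercised on its own. This is a minimal userspace sketch under stated assumptions: `struct mock_cache`, `struct mock_l3` and `mock_pick_l3()` are hypothetical names, and `local_node` is passed in explicitly where the patch calls `numa_node_id()`.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_NODES 4

/* Hypothetical stand-in for struct kmem_list3. */
struct mock_l3 {
	int id;
};

/* Hypothetical stand-in for struct kmem_cache. */
struct mock_cache {
	struct mock_l3 *nodelists[MOCK_MAX_NODES];
};

/*
 * Pick the per-node lists for *nodeid; if none exist, retry with the
 * local node, as the patch does with numa_node_id(). Returns NULL when
 * both are missing (the patch would hit BUG_ON(!l3) at that point).
 */
static struct mock_l3 *mock_pick_l3(struct mock_cache *cachep, int *nodeid,
				    int local_node)
{
	struct mock_l3 *l3;

	assert(*nodeid >= 0 && *nodeid < MOCK_MAX_NODES);
	assert(local_node >= 0 && local_node < MOCK_MAX_NODES);

	l3 = cachep->nodelists[*nodeid];
	if (!l3) {
		*nodeid = local_node;	/* fall back to the CPU's own node */
		l3 = cachep->nodelists[*nodeid];
	}
	return l3;
}
```

In the scenario from the log (lists initialised on node 0, allocation directed at node 2), a call with `*nodeid == 2` and `local_node == 0` returns the node 0 lists and rewrites `*nodeid` to 0. The allocation then proceeds against a node other than the one originally requested, which is why the patch is described as papering over the problem.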

2008-01-22 20:11:26

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Mel Gorman wrote:

> Christoph/Pekka, this patch is papering over the problem and something
> more fundamental may be going wrong. The crash occurs because l3 is NULL
> and the cache is kmem_cache so this is early in the boot process. It is
> selecting l3 based on node 2 which is correct in terms of available memory
> but it initialises the lists on node 0 because that is the node the CPUs are
> located on. Hence later it uses an uninitialised nodelists entry and BLAM.
> The relevant parts of the log showing the memoryless nodes in relation to
> CPUs are:

Would it be possible to run the bootstrap on a cpu that has a
node with memory associated to it? I believe we had the same situation
last year when GFP_THISNODE was introduced?

After you reverted the slab memoryless node patch there should be per node
structures created for node 0 unless the node is marked offline. Is it? If
so then you are booting a cpu that is associated with an offline node.

> Can you see a better solution than this?

Well this means that bootstrap will work by introducing foreign objects
into the per cpu queue (should only hold per cpu objects). They will
later be consumed and then the queues will contain the right objects so
the effect of the patch is minimal.

I thought we fixed the similar situation last year by dropping
GFP_THISNODE for some allocations?

2008-01-22 21:27:20

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (22/01/08 12:11), Christoph Lameter didst pronounce:
> On Tue, 22 Jan 2008, Mel Gorman wrote:
>
> > Christoph/Pekka, this patch is papering over the problem and something
> > more fundamental may be going wrong. The crash occurs because l3 is NULL
> > and the cache is kmem_cache so this is early in the boot process. It is
> > selecting l3 based on node 2 which is correct in terms of available memory
> > but it initialises the lists on node 0 because that is the node the CPUs are
> > located on. Hence later it uses an uninitialised nodelists entry and BLAM.
> > The relevant parts of the log showing the memoryless nodes in relation to
> > CPUs are:
>
> Would it be possible to run the bootstrap on a cpu that has a
> node with memory associated to it?

Not in the way the machine is currently configured. All the CPUs appear to
be on a node with no memory. It's best to assume I cannot get the machine
reconfigured (which just hides the bug anyway). Physically, it's thousands
of miles away so I can't do the work. I can get lab support to do the job
but that will take a fair while and at the end of the day, it doesn't tell
us a lot. We know that other PPC64 machines work so it's not a general problem.

> I believe we had the same situation
> last year when GFP_THISNODE was introduced?
>

It feels vaguely familiar but I don't recall the details well enough
to recognise if this is the same problem or not.

> After you reverted the slab memoryless node patch there should be per node
> structures created for node 0 unless the node is marked offline. Is it? If
> so then you are booting a cpu that is associated with an offline node.
>

I'll roll a patch that prints out the online states before startup and
see what it looks like.

> > Can you see a better solution than this?
>
> Well this means that bootstrap will work by introducing foreign objects
> into the per cpu queue (should only hold per cpu objects). They will
> later be consumed and then the queues will contain the right objects so
> the effect of the patch is minimal.
>

By minimal, do you mean that you expect it to break in some other
respect later, or minimal as in "this is bad but should not have any
adverse impact"?

> I thought we fixed the similar situation last year by dropping
> GFP_THISNODE for some allocations?
>

Whether this was a problem fixed in the past or not, it's broken again now
:( . It's possible that there is a __GFP_THISNODE that can be dropped early
at boot-time that would also fix this problem in a way that doesn't
affect runtime (like altering cache_grow in my patch does).

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-22 21:34:26

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Mel Gorman wrote:

> > After you reverted the slab memoryless node patch there should be per node
> > structures created for node 0 unless the node is marked offline. Is it? If
> > so then you are booting a cpu that is associated with an offline node.
> >
>
> I'll roll a patch that prints out the online states before startup and
> see what it looks like.

Ok. Great.

>
> > > Can you see a better solution than this?
> >
> > Well this means that bootstrap will work by introducing foreign objects
> > into the per cpu queue (should only hold per cpu objects). They will
> > later be consumed and then the queues will contain the right objects so
> > the effect of the patch is minimal.
> >
>
> By minimal, do you mean that you expect it to break in some other
> respect later, or minimal as in "this is bad but should not have any
> adverse impact"?

Should not have any adverse impact after the objects from the cpu queue
have been consumed. If the cache_reaper tries to shift objects back
from the per cpu queue into slabs then BUG_ONs may be triggered. Make sure
you run the tests with full debugging please.

> Whether this was a problem fixed in the past or not, it's broken again now
> :( . It's possible that there is a __GFP_THISNODE that can be dropped early
> at boot-time that would also fix this problem in a way that doesn't
> affect runtime (like altering cache_grow in my patch does).

The dropping of GFP_THISNODE has the same effect as your patch.
Objects from another node get into the per cpu queue. And on free we
assume that per cpu queue objects are from the local node. If debug is on
then we check that with BUG_ONs.
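
The locality assumption described above can be illustrated with a small model. This is a hedged userspace sketch, not the slab implementation: `struct mock_obj`, `struct mock_array_cache`, `mock_queue_push()` and `mock_queue_is_local()` are hypothetical, and each object simply carries the node its memory came from. The check returns false where a debug kernel would trigger a BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_QUEUE_LEN 8

/* Hypothetical object that remembers which node its memory came from. */
struct mock_obj {
	int home_node;
};

/* Hypothetical stand-in for a per-CPU object queue. */
struct mock_array_cache {
	int avail;
	struct mock_obj *entry[MOCK_QUEUE_LEN];
};

/* Put an object on the per-CPU queue without checking where it came from. */
static void mock_queue_push(struct mock_array_cache *ac, struct mock_obj *obj)
{
	assert(ac->avail < MOCK_QUEUE_LEN);
	ac->entry[ac->avail++] = obj;
}

/*
 * Debug-style check: true only if every queued object belongs to
 * local_node. A debug kernel would BUG_ON() instead of returning false.
 */
static bool mock_queue_is_local(const struct mock_array_cache *ac,
				int local_node)
{
	int i;

	for (i = 0; i < ac->avail; i++)
		if (ac->entry[i]->home_node != local_node)
			return false;
	return true;
}
```

A queue on node 0 that holds only node 0 objects passes the check. Once an object from node 2 is pushed, as happens when an early allocation is satisfied from another node, the check fails until that object has been consumed.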

2008-01-22 21:44:52

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Tue, Jan 22, Mel Gorman wrote:

> http://www.csn.ul.ie/~mel/postings/slab-20080122/partial-revert-slab-changes.patch
> .. Can you please check on your machine if it fixes your problem?

It does not fix or change the nature of the crash.

> Olaf, please confirm whether you need the patch below as well as the
> revert to make your machine boot.

It crashes now in a different way if the patch below is applied:

Linux version 2.6.24-rc8-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #43 SMP Tue Jan 22 22:39:05 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
DMA 0 -> 892928
Normal 892928 -> 892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1: 0 -> 892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 880720
Policy zone: DMA
Kernel command line: debug xmon=on panic=1
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.070000 MHz
time_init: processor frequency = 2197.800000 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
Unable to handle kernel paging request for data at address 0x00000058
Faulting instruction address: 0xc0000000000fe018
cpu 0x0: Vector: 300 (Data Access) at [c00000000075bac0]
pc: c0000000000fe018: .setup_cpu_cache+0x184/0x1f4
lr: c0000000000fdfa8: .setup_cpu_cache+0x114/0x1f4
sp: c00000000075bd40
msr: 8000000000009032
dar: 58
dsisr: 42000000
current = 0xc000000000665a50
paca = 0xc000000000666380
pid = 0, comm = swapper
enter ? for help
[c00000000075bd40] c0000000000fb368 .kmem_cache_create+0x3c0/0x478 (unreliable)
[c00000000075be20] c0000000005e6780 .kmem_cache_init+0x284/0x4f4
[c00000000075bee0] c0000000005bf8ec .start_kernel+0x2f8/0x3fc
[c00000000075bf90] c000000000008590 .start_here_common+0x60/0xd0
0:mon>

0xc0000000000fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
2106 BUG_ON(!cachep->nodelists[node]);
2107 kmem_list3_init(cachep->nodelists[node]);
2108 }
2109 }
2110 }
2111 cachep->nodelists[numa_node_id()]->next_reap =
2112 jiffies + REAPTIMEOUT_LIST3 +
2113 ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
2114
2115 cpu_cache_get(cachep)->avail = 0;
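
A fault address as small as 0x58 is what one would expect when a structure field is accessed through a NULL pointer: the address is just the field's offset. The sketch below computes such offsets for a mock of the per-node structure. The layout is an assumption, not something shown in this thread: the field order is reconstructed from mm/slab.c of that era, with a 4-byte spinlock (no lock debugging) on an LP64 target, and all `mock_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types; sizes assume an LP64 target. */
struct mock_list_head {
	struct mock_list_head *next, *prev;
};

typedef struct {
	unsigned int slock;	/* assumed 4-byte lock, no debug fields */
} mock_spinlock_t;

/* Field order assumed from mm/slab.c of that era; not confirmed here. */
struct mock_kmem_list3 {
	struct mock_list_head slabs_partial;
	struct mock_list_head slabs_full;
	struct mock_list_head slabs_free;
	unsigned long free_objects;
	unsigned int free_limit;
	unsigned int colour_next;
	mock_spinlock_t list_lock;
	void *shared;
	void **alien;
	unsigned long next_reap;
	int free_touched;
};

/* Address touched by l3->next_reap when l3 == NULL. */
static size_t mock_next_reap_fault(void)
{
	assert(sizeof(void *) == 8 && sizeof(unsigned long) == 8);
	return offsetof(struct mock_kmem_list3, next_reap);
}

/* Address touched by spin_lock(&l3->list_lock) when l3 == NULL. */
static size_t mock_list_lock_fault(void)
{
	assert(sizeof(void *) == 8 && sizeof(unsigned long) == 8);
	return offsetof(struct mock_kmem_list3, list_lock);
}
```

Under these assumptions `next_reap` sits at offset 0x58, matching the `dar: 58` above, and `list_lock` sits at 0x40, matching the `dar: 40` fault under cache_grow reported elsewhere in this thread. Both are consistent with `cachep->nodelists[...]` being NULL for the node in question.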

2008-01-22 22:12:25

by Nish Aravamudan

Subject: Re: crash in kmem_cache_init

On 1/22/08, Olaf Hering <[email protected]> wrote:
> On Tue, Jan 22, Mel Gorman wrote:
>
> > http://www.csn.ul.ie/~mel/postings/slab-20080122/partial-revert-slab-changes.patch
> > .. Can you please check on your machine if it fixes your problem?
>
> It does not fix or change the nature of the crash.
>
> > Olaf, please confirm whether you need the patch below as well as the
> > revert to make your machine boot.
>
> It crashes now in a different way if the patch below is applied:

Was this with the revert Mel mentioned applied as well? I get the
feeling both patches are needed to fix up the memoryless SLAB issue.

> Linux version 2.6.24-rc8-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #43 SMP Tue Jan 22 22:39:05 CET 2008

<snip>

> early_node_map[1] active PFN ranges
> 1: 0 -> 892928

<snip>

> Unable to handle kernel paging request for data at address 0x00000058
> Faulting instruction address: 0xc0000000000fe018
> cpu 0x0: Vector: 300 (Data Access) at [c00000000075bac0]
> pc: c0000000000fe018: .setup_cpu_cache+0x184/0x1f4
> lr: c0000000000fdfa8: .setup_cpu_cache+0x114/0x1f4
> sp: c00000000075bd40
> msr: 8000000000009032
> dar: 58
> dsisr: 42000000
> current = 0xc000000000665a50
> paca = 0xc000000000666380
> pid = 0, comm = swapper
> enter ? for help
> [c00000000075bd40] c0000000000fb368 .kmem_cache_create+0x3c0/0x478 (unreliable)
> [c00000000075be20] c0000000005e6780 .kmem_cache_init+0x284/0x4f4
> [c00000000075bee0] c0000000005bf8ec .start_kernel+0x2f8/0x3fc
> [c00000000075bf90] c000000000008590 .start_here_common+0x60/0xd0
> 0:mon>
>
> 0xc0000000000fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> 2106 BUG_ON(!cachep->nodelists[node]);
> 2107 kmem_list3_init(cachep->nodelists[node]);

I might be barking up the wrong tree, but this block above is supposed
to set up the cachep->nodelists[*] that are used immediately below.
But if the loop wasn't changed from N_NORMAL_MEMORY to N_ONLINE or
whatever, you might get a bad access right below for node 0 that has
no memory, if that's the node we're running on...

> 2108 }
> 2109 }
> 2110 }
> 2111 cachep->nodelists[numa_node_id()]->next_reap =
> 2112 jiffies + REAPTIMEOUT_LIST3 +
> 2113 ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
> 2114
> 2115 cpu_cache_get(cachep)->avail = 0;

Thanks,
Nish

2008-01-22 22:23:30

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Olaf Hering wrote:

> It crashes now in a different way if the patch below is applied:

Yup, no l3 structure for the current node. We are early in bootstrap. You
could just check if the l3 is there and, if not, just skip starting the
reaper? This will be redone later anyway. Not sure if this will solve all
your issues though. An l3 for the current node that we are booting on
needs to be created early on for SLAB bootstrap to succeed. AFAICT SLUB
doesn't care and simply uses whatever the page allocator gives it for the
cpu slab. We may have gotten there because you only tested with SLUB
recently and thus changes got in that broke SLAB boot assumptions.


> 0xc0000000000fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> 2106 BUG_ON(!cachep->nodelists[node]);
> 2107 kmem_list3_init(cachep->nodelists[node]);
> 2108 }
> 2109 }
> 2110 }

if (!cachep->nodelists[numa_node_id()])
return;

> 2111 cachep->nodelists[numa_node_id()]->next_reap =
> 2112 jiffies + REAPTIMEOUT_LIST3 +
> 2113 ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
> 2114
> 2115 cpu_cache_get(cachep)->avail = 0;
>
>
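
The guard suggested in this message, skipping the reap setup when the local node has no per-node structure, might look like the following in isolation. This is a minimal sketch with hypothetical names (`struct mock_cache`, `struct mock_l3`, `mock_arm_reaper()`); `local_node` and `when` are plain parameters standing in for `numa_node_id()` and the jiffies-based timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_NODES 4

/* Hypothetical stand-in for struct kmem_list3. */
struct mock_l3 {
	unsigned long next_reap;
};

/* Hypothetical stand-in for struct kmem_cache. */
struct mock_cache {
	struct mock_l3 *nodelists[MOCK_MAX_NODES];
};

/*
 * Arm the reap timestamp for the local node only if its per-node
 * structure exists. Returns false when the step was skipped.
 */
static bool mock_arm_reaper(struct mock_cache *cachep, int local_node,
			    unsigned long when)
{
	struct mock_l3 *l3;

	assert(local_node >= 0 && local_node < MOCK_MAX_NODES);

	l3 = cachep->nodelists[local_node];
	if (!l3)
		return false;	/* too early in bootstrap; redone later */

	l3->next_reap = when;
	return true;
}
```

Note that an early `return` placed directly in setup_cpu_cache() would also skip whatever follows the `next_reap` assignment, such as the `cpu_cache_get(cachep)->avail = 0` line in the listing above, so guarding only the assignment may be the safer variant.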

2008-01-22 22:50:59

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (22/01/08 13:34), Christoph Lameter didst pronounce:
> On Tue, 22 Jan 2008, Mel Gorman wrote:
>
> > > After you reverted the slab memoryless node patch there should be per node
> > > structures created for node 0 unless the node is marked offline. Is it? If
> > > so then you are booting a cpu that is associated with an offline node.
> > >
> >
> > I'll roll a patch that prints out the online states before startup and
> > see what it looks like.
>
> Ok. Great.
>

The dmesg output is below.


> >
> > > > Can you see a better solution than this?
> > >
> > > Well this means that bootstrap will work by introducing foreign objects
> > > into the per cpu queue (should only hold per cpu objects). They will
> > > later be consumed and then the queues will contain the right objects so
> > > the effect of the patch is minimal.
> > >
> >
> > By minimal, do you mean that you expect it to break in some other
> > respect later, or minimal as in "this is bad but should not have any
> > adverse impact"?
>
> Should not have any adverse impact after the objects from the cpu queue
> have been consumed. If the cache_reaper tries to shift objects back
> from the per cpu queue into slabs then BUG_ONs may be triggered. Make sure
> you run the tests with full debugging please.
>

I am not running a full range of tests at the moment. Just getting boot
first. I'll queue up a range of tests to run with DEBUG on now but it'll
be the morning before I have the results.

> > Whether this was a problem fixed in the past or not, it's broken again now
> > :( . It's possible that there is a __GFP_THISNODE that can be dropped early
> > at boot-time that would also fix this problem in a way that doesn't
> > affect runtime (like altering cache_grow in my patch does).
>
> The dropping of GFP_THISNODE has the same effect as your patch.

The dropping of it totally? If so, this patch might fix a boot but it'll
potentially be a performance regression on NUMA machines that only have
nodes with memory, right?

> Objects from another node get into the per cpu queue. And on free we
> assume that per cpu queue objects are from the local node. If debug is on
> then we check that with BUG_ONs.
>

The interesting parts of the dmesg output are

Online nodes
o 0
o 2
Nodes with regular memory
o 2
Current running CPU 0 is associated with node 0
Current node is 0

So, at a glance, node 2 has regular memory but it's trying to use node 0.
I've attached the patch I used against 2.6.24-rc8. It includes the revert.

Here is the full output


Please wait, loading kernel...
Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded at 02400000, size: 1192 Kbytes
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1201041303 loglevel=8
memory layout at init:
alloc_bottom : 000000000252a000
alloc_top : 0000000008000000
alloc_top_hi : 0000000100000000
rmo_top : 0000000008000000
ram_top : 0000000100000000
Looking for displays
instantiating rtas at 0x00000000077d9000 ... done
0000000000000000 : boot cpu 0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x000000000262b000 -> 0x000000000262c1d3
Device tree struct 0x000000000262d000 -> 0x0000000002635000
Calling quiesce ...
returning from prom_init
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Jan 22 17:15:48 EST 2008
-----------------------------------------------------
ppc64_pft_size = 0x1a
physicalMemorySize = 0x100000000
htab_hash_mask = 0x7ffff
-----------------------------------------------------
Linux version 2.6.24-rc8-autokern1 ([email protected]) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Tue Jan 22 17:15:48 EST 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 1048576
Normal 1048576 -> 1048576
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
2: 0 -> 1048576
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 1034240
Policy zone: DMA
Kernel command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1201041303 loglevel=8
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 238.059000 MHz
time_init: processor frequency = 1904.472000 MHz
clocksource: timebase mult[10cd746] shift[22] registered
clockevent: decrementer mult[3cf1] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg0] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 2
Memory: 4105560k/4194304k available (5004k kernel code, 88744k reserved, 876k data, 559k bss, 272k init)
Online nodes
o 0
o 2
Nodes with regular memory
o 2
Current running CPU 0 is associated with node 0
Current node is 0
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 0
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 1
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 2
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 3
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 4
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 5
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 6
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 7
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 8
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 9
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 10
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 11
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 12
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 13
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 14
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 15
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 16
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 17
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 18
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 19
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 20
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 21
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 22
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 23
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 24
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 25
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 26
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 27
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 28
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 29
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 30
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 31
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 32
kmem_cache_init Setting kmem_cache initkmem_list3 0
Unable to handle kernel paging request for data at address 0x00000040
Faulting instruction address: 0xc0000000003c8c00
cpu 0x0: Vector: 300 (Data Access) at [c0000000005c3840]
pc: c0000000003c8c00: __lock_text_start+0x20/0x88
lr: c0000000000dadec: .cache_grow+0x7c/0x338
sp: c0000000005c3ac0
msr: 8000000000009032
dar: 40
dsisr: 40000000
current = 0xc000000000500f10
paca = 0xc000000000501b80
pid = 0, comm = swapper
enter ? for help
[c0000000005c3b40] c0000000000dadec .cache_grow+0x7c/0x338
[c0000000005c3c00] c0000000000db54c .fallback_alloc+0x1c0/0x224
[c0000000005c3cb0] c0000000000db958 .kmem_cache_alloc+0xe0/0x14c
[c0000000005c3d50] c0000000000dcccc .kmem_cache_create+0x230/0x4cc
[c0000000005c3e30] c0000000004c05f4 .kmem_cache_init+0x310/0x640
[c0000000005c3ee0] c00000000049f8d8 .start_kernel+0x304/0x3fc
[c0000000005c3f90] c000000000008594 .start_here_common+0x54/0xc0
0:mon>
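
Editor's note: the faulting address 0x40 (dar: 40) in the trace above is what a NULL kmem_list3 pointer would produce if list_lock sits 0x40 bytes into the structure. The user-space sketch below checks that arithmetic. The field order is an assumption based on the 2.6.24-era mm/slab.c and is not quoted in this thread; the *_model types are hypothetical stand-ins with fixed 64-bit widths to mirror ppc64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct list_head: two 64-bit pointers on ppc64. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/*
 * Assumed leading fields of struct kmem_list3 (2.6.24-era mm/slab.c).
 * Only the fields up to list_lock matter for the offset.
 */
struct kmem_list3_model {
	struct list_head_model slabs_partial;
	struct list_head_model slabs_full;
	struct list_head_model slabs_free;
	uint64_t free_objects;	/* unsigned long in the kernel */
	uint32_t free_limit;	/* unsigned int */
	uint32_t colour_next;	/* unsigned int */
	uint32_t list_lock;	/* placeholder for spinlock_t; its size does not change its offset */
};

/* Address touched by spin_lock(&l3->list_lock) when l3 is NULL. */
static size_t null_l3_lock_address(void)
{
	return offsetof(struct kmem_list3_model, list_lock);
}
```

Under these assumptions the three list heads take 48 bytes and the three counters another 16, so the lock lands at offset 0x40, matching the reported fault address.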


Attachments:
(No filename) (8.07 kB)
debug-slab-with-revert.diff (5.57 kB)

2008-01-22 22:57:39

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Mel Gorman wrote:

> > > Whether this was a problem fixed in the past or not, it's broken again now
> > > :( . It's possible that there is a __GFP_THISNODE that can be dropped early
> > > at boot-time that would also fix this problem in a way that doesn't
> > > affect runtime (like altering cache_grow in my patch does).
> >
> > The dropping of GFP_THISNODE has the same effect as your patch.
>
> The dropping of it totally? If so, this patch might fix a boot but it'll
> potentially be a performance regression on NUMA machines that only have
> nodes with memory, right?

No, the dropping during early allocations.

> o 0
> o 2
> Nodes with regular memory
> o 2
> Current running CPU 0 is associated with node 0
> Current node is 0
>
> So node 2 has regular memory but it's trying to use node 0 at a glance.
> I've attached the patch I used against 2.6.24-rc8. It includes the revert.

We need the current processor to be attached to a node that has
memory. We cannot fall back that early because the structures for the
other nodes do not exist yet.

> Online nodes
> o 0
> o 2
> Nodes with regular memory
> o 2
> Current running CPU 0 is associated with node 0
> Current node is 0
> o kmem_list3_init

This needs to be node 2.

> [c0000000005c3b40] c0000000000dadec .cache_grow+0x7c/0x338
> [c0000000005c3c00] c0000000000db54c .fallback_alloc+0x1c0/0x224

Fallback during bootstrap.

2008-01-22 23:02:30

by Pekka Enberg

Subject: Re: crash in kmem_cache_init

Hi,

Mel Gorman wrote:
> Faulting instruction address: 0xc0000000003c8c00
> cpu 0x0: Vector: 300 (Data Access) at [c0000000005c3840]
> pc: c0000000003c8c00: __lock_text_start+0x20/0x88
> lr: c0000000000dadec: .cache_grow+0x7c/0x338
> sp: c0000000005c3ac0
> msr: 8000000000009032
> dar: 40
> dsisr: 40000000
> current = 0xc000000000500f10
> paca = 0xc000000000501b80
> pid = 0, comm = swapper
> enter ? for help
> [c0000000005c3b40] c0000000000dadec .cache_grow+0x7c/0x338
> [c0000000005c3c00] c0000000000db54c .fallback_alloc+0x1c0/0x224
> [c0000000005c3cb0] c0000000000db958 .kmem_cache_alloc+0xe0/0x14c
> [c0000000005c3d50] c0000000000dcccc .kmem_cache_create+0x230/0x4cc
> [c0000000005c3e30] c0000000004c05f4 .kmem_cache_init+0x310/0x640
> [c0000000005c3ee0] c00000000049f8d8 .start_kernel+0x304/0x3fc
> [c0000000005c3f90] c000000000008594 .start_here_common+0x54/0xc0
> 0:mon>

I mentioned this already but received no response (maybe I am missing
something totally obvious here):

When we call fallback_alloc() because the current node has ->nodelists
set to NULL, we end up calling kmem_getpages() with -1 as the node id
which is then translated to numa_node_id() by alloc_pages_node. But the
reason we called fallback_alloc() in the first place is because
numa_node_id() doesn't have a ->nodelist which makes cache_grow() oops.
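
Editor's note: the circularity described above can be sketched in user space as follows. This is a model of the scenario as Pekka describes it, not the kernel code; the types and the helpers resolve_node and fallback_target are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

struct l3_model { int unused; };

struct cache_model {
	struct l3_model *nodelists[MAX_NODES];
};

/* Stand-in for the node id translation: -1 means "use the current node". */
static int resolve_node(int nodeid, int current_node)
{
	return nodeid == -1 ? current_node : nodeid;
}

/*
 * Returns the node whose list the fallback path would end up using
 * when it is entered because the current node has no list.
 */
static int fallback_target(const struct cache_model *c, int current_node)
{
	assert(current_node >= 0 && current_node < MAX_NODES);
	assert(c->nodelists[current_node] == NULL);	/* why we fell back */
	return resolve_node(-1, current_node);		/* back where we started */
}
```

In this model the fallback resolves to the same node that triggered it, so the list lookup that follows still finds NULL.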

Pekka

2008-01-22 23:11:20

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (22/01/08 14:57), Christoph Lameter didst pronounce:
> On Tue, 22 Jan 2008, Mel Gorman wrote:
>
> > > > Whether this was a problem fixed in the past or not, it's broken again now
> > > > :( . It's possible that there is a __GFP_THISNODE that can be dropped early
> > > > at boot-time that would also fix this problem in a way that doesn't
> > > > affect runtime (like altering cache_grow in my patch does).
> > >
> > > The dropping of GFP_THISNODE has the same effect as your patch.
> >
> > The dropping of it totally? If so, this patch might fix a boot but it'll
> > potentially be a performance regression on NUMA machines that only have
> > nodes with memory, right?
>
> No, the dropping during early allocations.
>

We can live with that if the machine otherwise survives during tests.
They are kicked off at the moment with CONFIG_SLAB_DEBUG set but the point
is moot if the patch doesn't work for Olaf. Am still waiting to hear if
the two patches in combination work for him.

> > o 0
> > o 2
> > Nodes with regular memory
> > o 2
> > Current running CPU 0 is associated with node 0
> > Current node is 0
> >
> > So node 2 has regular memory but it's trying to use node 0 at a glance.
> > I've attached the patch I used against 2.6.24-rc8. It includes the revert.
>
> We need the current processor to be attached to a node that has
> memory. We cannot fall back that early because the structures for the
> other nodes do not exist yet.
>

Or bodge it early in the boot process so that a node with memory is
always used.
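
Editor's note: one way to read "always use a node with memory" is sketched below in user space. The bitmask stands in for the N_NORMAL_MEMORY node state, and pick_bootstrap_node is a hypothetical helper, not a kernel function or anything proposed verbatim in this thread.

```c
#include <assert.h>

#define MAX_NODES 32

/* Bit n set means node n has regular memory (a stand-in for N_NORMAL_MEMORY). */
typedef unsigned int nodemask_model;

static int node_has_memory(nodemask_model mem, int node)
{
	return (mem >> node) & 1u;
}

/*
 * Hypothetical helper: keep the CPU's own node when it has memory,
 * otherwise use the lowest-numbered node that does. Returns -1 if no
 * node has memory at all.
 */
static int pick_bootstrap_node(nodemask_model mem, int cpu_node)
{
	int node;

	assert(cpu_node >= 0 && cpu_node < MAX_NODES);
	if (node_has_memory(mem, cpu_node))
		return cpu_node;
	for (node = 0; node < MAX_NODES; node++)
		if (node_has_memory(mem, node))
			return node;
	return -1;
}
```

For the layout quoted in this message (CPU 0 on node 0, regular memory only on node 2), the helper would select node 2.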

> > Online nodes
> > o 0
> > o 2
> > Nodes with regular memory
> > o 2
> > Current running CPU 0 is associated with node 0
> > Current node is 0
> > o kmem_list3_init
>
> This needs to be node 2.
>

Rather it should be 2. I'll admit the physical setup of this machine is
.... less than ideal but clearly it's something that can happen even if
it's a bad idea.

> > [c0000000005c3b40] c0000000000dadec .cache_grow+0x7c/0x338
> > [c0000000005c3c00] c0000000000db54c .fallback_alloc+0x1c0/0x224
>
> Fallback during bootstrap.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-22 23:12:34

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Wed, 23 Jan 2008, Pekka Enberg wrote:

> When we call fallback_alloc() because the current node has ->nodelists set to
> NULL, we end up calling kmem_getpages() with -1 as the node id which is then
> translated to numa_node_id() by alloc_pages_node. But the reason we called
> fallback_alloc() in the first place is because numa_node_id() doesn't have a
> ->nodelist which makes cache_grow() oops.

Right, if nodeid == -1 then we need to call alloc_pages...
Essentially a revert of 50c85a19e7b3928b5b5188524c44ffcbacdd4e35 from 2005.

But I doubt that this is it. The fallback logic was added later and it
worked fine.


---
mm/slab.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2008-01-22 15:05:26.185452369 -0800
+++ linux-2.6/mm/slab.c 2008-01-22 15:05:59.301637009 -0800
@@ -1668,7 +1668,11 @@ static void *kmem_getpages(struct kmem_c
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;

-	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	if (nodeid == -1)
+		page = alloc_pages(flags, cachep->gfporder);
+	else
+		page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+
 	if (!page)
 		return NULL;

2008-01-22 23:14:36

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Mel Gorman wrote:

> Rather it should be 2. I'll admit the physical setup of this machine is
> .... less than ideal but clearly it's something that can happen even if
> it's a bad idea.

Ok. Let's hope that Pekka's find does the trick. But this would mean that
fallback gets memory from node 2 for the page allocator. Then fallback
alloc is going to try to insert it into the l3 of node 2 which is not
there yet. So another oops. Sigh.

2008-01-22 23:19:03

by Christoph Lameter

Subject: Re: crash in kmem_cache_init

On Tue, 22 Jan 2008, Christoph Lameter wrote:

> But I doubt that this is it. The fallback logic was added later and it
> worked fine.

My patch is useless (fascinating history of the changelog there, though).
fallback_alloc calls kmem_getpages without GFP_THISNODE. This means that
alloc_pages_node() will try to allocate on the current node but fall back
to a neighboring node if nothing is there....
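
Editor's note: a minimal user-space sketch of that behaviour, assuming a simple round-robin neighbour order (the real allocator walks a zonelist instead). alloc_page_node and l3_for_node are hypothetical helpers, and the thisnode flag stands in for GFP_THISNODE.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

struct l3_model { int unused; };

/*
 * free_pages[n] is the number of free pages left on node n.
 * Returns the node the page came from, or -1 on failure.
 * With thisnode set, only the preferred node is tried.
 */
static int alloc_page_node(int free_pages[MAX_NODES], int preferred, int thisnode)
{
	int i;

	assert(preferred >= 0 && preferred < MAX_NODES);
	for (i = 0; i < MAX_NODES; i++) {
		int node = (preferred + i) % MAX_NODES;

		if (free_pages[node] > 0) {
			free_pages[node]--;
			return node;
		}
		if (thisnode)
			break;	/* no fallback allowed */
	}
	return -1;
}

/* The per-node list the slab code would then look up for that page. */
static struct l3_model *l3_for_node(struct l3_model *nodelists[MAX_NODES], int node)
{
	return (node >= 0 && node < MAX_NODES) ? nodelists[node] : NULL;
}
```

In this model, a CPU on a memoryless node gets its page from a neighbour, and the slab code then needs a per-node list for that neighbour, which may not exist yet during bootstrap.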

2008-01-23 07:58:17

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Tue, Jan 22, Christoph Lameter wrote:

> > 0xc0000000000fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> > 2106 BUG_ON(!cachep->nodelists[node]);
> > 2107 kmem_list3_init(cachep->nodelists[node]);
> > 2108 }
> > 2109 }
> > 2110 }
>
> if (cachep->nodelists[numa_node_id()])
> return;

Does not help.


Linux version 2.6.24-rc8-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #48 SMP Wed Jan 23 08:54:23 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
DMA 0 -> 892928
Normal 892928 -> 892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1: 0 -> 892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 880720
Policy zone: DMA
Kernel command line: debug xmon=on panic=1
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.070000 MHz
time_init: processor frequency = 2197.800000 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-32(DMA)'

Rebooting in 1 seconds..

---
mm/slab.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1590,7 +1590,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);

- for_each_node_state(nid, N_NORMAL_MEMORY) {
+ for_each_online_node(nid) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
&initkmem_list3[SIZE_AC + nid], nid);

@@ -1968,7 +1968,7 @@ static void __init set_up_list3s(struct
{
int node;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
@@ -2108,6 +2108,8 @@ static int __init_refok setup_cpu_cache(
}
}
}
+ if (!cachep->nodelists[numa_node_id()])
+ return -ENODEV;
cachep->nodelists[numa_node_id()]->next_reap =
jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;
@@ -2775,6 +2777,11 @@ static int cache_grow(struct kmem_cache
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
+ BUG_ON(!l3);
spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3324,10 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
BUG_ON(!l3);

retry:
@@ -3815,7 +3826,7 @@ static int alloc_kmemlist(struct kmem_ca
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {

if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit);
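
Editor's note: the patch above swaps for_each_node_state(..., N_NORMAL_MEMORY) for for_each_online_node() and adds a !l3 check. The user-space sketch below models why that matters when the boot CPU sits on a memoryless node. The types, the bitmask representation, and the helpers set_up_lists and lookup_l3 are hypothetical; this is a model, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

struct l3_model { int initialised; };

struct cache_model {
	struct l3_model *nodelists[MAX_NODES];
};

/* Give the cache a list for every node whose bit is set in mask. */
static void set_up_lists(struct cache_model *c, struct l3_model *pool,
			 unsigned int mask)
{
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		if (mask & (1u << node)) {
			pool[node].initialised = 1;
			c->nodelists[node] = &pool[node];
		} else {
			c->nodelists[node] = NULL;
		}
	}
}

/* Mirrors the added "!l3" check: retry with the current node's list. */
static struct l3_model *lookup_l3(const struct cache_model *c, int nodeid,
				  int current_node)
{
	struct l3_model *l3;

	assert(nodeid >= 0 && nodeid < MAX_NODES);
	assert(current_node >= 0 && current_node < MAX_NODES);
	l3 = c->nodelists[nodeid];
	if (!l3)
		l3 = c->nodelists[current_node];
	return l3;
}
```

With only the memory-bearing node in the mask, the boot CPU's node has no list and the lookup still returns NULL; with all online nodes in the mask, the current node always has a list to fall back to.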

2008-01-23 08:19:46

by Pekka Enberg

Subject: Re: crash in kmem_cache_init

Hi Christoph,

On Jan 23, 2008 1:18 AM, Christoph Lameter wrote:
> My patch is useless (fascinating history of the changelog there, though).
> fallback_alloc calls kmem_getpages without GFP_THISNODE. This means that
> alloc_pages_node() will try to allocate on the current node but fall back
> to a neighboring node if nothing is there....

Sure, but I was referring to the scenario where current node _has_
pages available but no ->nodelists. Olaf, did you try it?

Pekka

2008-01-23 08:40:07

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Wed, Jan 23, Pekka Enberg wrote:

> Hi Christoph,
>
> On Jan 23, 2008 1:18 AM, Christoph Lameter wrote:
> > My patch is useless (fascinating history of the changelog there, though).
> > fallback_alloc calls kmem_getpages without GFP_THISNODE. This means that
> > alloc_pages_node() will try to allocate on the current node but fall back
> > to a neighboring node if nothing is there....
>
> Sure, but I was referring to the scenario where current node _has_
> pages available but no ->nodelists. Olaf, did you try it?

Does not help.

2008-01-23 10:50:56

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (23/01/08 08:58), Olaf Hering didst pronounce:
> On Tue, Jan 22, Christoph Lameter wrote:
>
> > > 0xc0000000000fe018 is in setup_cpu_cache (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
> > > 2106 BUG_ON(!cachep->nodelists[node]);
> > > 2107 kmem_list3_init(cachep->nodelists[node]);
> > > 2108 }
> > > 2109 }
> > > 2110 }
> >
> > if (cachep->nodelists[numa_node_id()])
> > return;
>
> Does not help.
>

Sorry this is dragging out. Can you post the full dmesg with loglevel=8 of the
following patch against 2.6.24-rc8 please? It contains the debug information
that helped me figure out what was going wrong on the PPC64 machine here,
the revert and the !l3 checks (i.e. the two patches that made machines I
have access to work). Thanks

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc8-clean/mm/slab.c linux-2.6.24-rc8-015_debug_slab/mm/slab.c
--- linux-2.6.24-rc8-clean/mm/slab.c 2008-01-16 04:22:48.000000000 +0000
+++ linux-2.6.24-rc8-015_debug_slab/mm/slab.c 2008-01-23 10:44:36.000000000 +0000
@@ -348,6 +348,7 @@ static int slab_early_init = 1;

static void kmem_list3_init(struct kmem_list3 *parent)
{
+ printk(" o kmem_list3_init\n");
INIT_LIST_HEAD(&parent->slabs_full);
INIT_LIST_HEAD(&parent->slabs_partial);
INIT_LIST_HEAD(&parent->slabs_free);
@@ -1236,6 +1237,7 @@ static int __cpuinit cpuup_prepare(long
* kmem_list3 and not this cpu's kmem_list3
*/

+ printk("cpuup_prepare %ld\n", cpu);
list_for_each_entry(cachep, &cache_chain, next) {
/*
* Set up the size64 kmemlist for cpu before we can
@@ -1243,6 +1245,7 @@ static int __cpuinit cpuup_prepare(long
* node has not already allocated this
*/
if (!cachep->nodelists[node]) {
+ printk(" o allocing %s %d\n", cachep->name, node);
l3 = kmalloc_node(memsize, GFP_KERNEL, node);
if (!l3)
goto bad;
@@ -1256,6 +1259,7 @@ static int __cpuinit cpuup_prepare(long
* protection here.
*/
cachep->nodelists[node] = l3;
+ printk(" o l3 setup\n");
}

spin_lock_irq(&cachep->nodelists[node]->list_lock);
@@ -1320,6 +1324,7 @@ static int __cpuinit cpuup_prepare(long
}
return 0;
bad:
+ printk(" o bad\n");
cpuup_canceled(cpu);
return -ENOMEM;
}
@@ -1405,6 +1410,7 @@ static void init_list(struct kmem_cache
spin_lock_init(&ptr->list_lock);

MAKE_ALL_LISTS(cachep, ptr, nodeid);
+ printk("init_list RESETTING %s node %d\n", cachep->name, nodeid);
cachep->nodelists[nodeid] = ptr;
local_irq_enable();
}
@@ -1427,10 +1433,23 @@ void __init kmem_cache_init(void)
numa_platform = 0;
}

+ printk("Online nodes\n");
+ for_each_online_node(node)
+ printk("o %d\n", node);
+ printk("Nodes with regular memory\n");
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ printk("o %d\n", node);
+ printk("Current running CPU %d is associated with node %d\n",
+ smp_processor_id(),
+ cpu_to_node(smp_processor_id()));
+ printk("Current node is %d\n",
+ numa_node_id());
+
for (i = 0; i < NUM_INIT_LISTS; i++) {
kmem_list3_init(&initkmem_list3[i]);
if (i < MAX_NUMNODES)
cache_cache.nodelists[i] = NULL;
+ printk("kmem_cache_init Setting %s NULL %d\n", cache_cache.name, i);
}

/*
@@ -1468,6 +1487,8 @@ void __init kmem_cache_init(void)
cache_cache.colour_off = cache_line_size();
cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];
+ printk("kmem_cache_init Setting %s NULL %d\n", cache_cache.name, node);
+ printk("kmem_cache_init Setting %s initkmem_list3 %d\n", cache_cache.name, node);

/*
* struct kmem_cache size depends on nr_node_ids, which
@@ -1590,7 +1611,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);

- for_each_node_state(nid, N_NORMAL_MEMORY) {
+ for_each_online_node(nid) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
&initkmem_list3[SIZE_AC + nid], nid);

@@ -1968,11 +1989,13 @@ static void __init set_up_list3s(struct
{
int node;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ printk("set_up_list3s %s index %d\n", cachep->name, index);
+ for_each_online_node(node) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;
+ printk("set_up_list3s %s index %d\n", cachep->name, index);
}
}

@@ -2099,11 +2122,13 @@ static int __init_refok setup_cpu_cache(
g_cpucache_up = PARTIAL_L3;
} else {
int node;
- for_each_node_state(node, N_NORMAL_MEMORY) {
+ printk("setup_cpu_cache %s\n", cachep->name);
+ for_each_online_node(node) {
cachep->nodelists[node] =
kmalloc_node(sizeof(struct kmem_list3),
GFP_KERNEL, node);
BUG_ON(!cachep->nodelists[node]);
+ printk(" o allocated node %d\n", node);
kmem_list3_init(cachep->nodelists[node]);
}
}
@@ -2775,6 +2800,11 @@ static int cache_grow(struct kmem_cache
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
+ BUG_ON(!l3);
spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3347,10 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
BUG_ON(!l3);

retry:
@@ -3815,8 +3849,10 @@ static int alloc_kmemlist(struct kmem_ca
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ printk("alloc_kmemlist %s\n", cachep->name);
+ for_each_online_node(node) {

+ printk(" o node %d\n", node);
if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit);
if (!new_alien)
@@ -3837,6 +3873,7 @@ static int alloc_kmemlist(struct kmem_ca
l3 = cachep->nodelists[node];
if (l3) {
struct array_cache *shared = l3->shared;
+ printk(" o l3 exists\n");

spin_lock_irq(&l3->list_lock);

@@ -3856,10 +3893,12 @@ static int alloc_kmemlist(struct kmem_ca
free_alien_cache(new_alien);
continue;
}
+ printk(" o allocing l3\n");
l3 = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, node);
if (!l3) {
free_alien_cache(new_alien);
kfree(new_shared);
+ printk(" o allocing l3 failed\n");
goto fail;
}

@@ -3871,6 +3910,7 @@ static int alloc_kmemlist(struct kmem_ca
l3->free_limit = (1 + nr_cpus_node(node)) *
cachep->batchcount + cachep->num;
cachep->nodelists[node] = l3;
+ printk(" o setting node %d 0x%lX\n", node, (unsigned long)l3);
}
return 0;

@@ -3886,6 +3926,7 @@ fail:
free_alien_cache(l3->alien);
kfree(l3);
cachep->nodelists[node] = NULL;
+ printk(" o setting node %d FAIL NULL\n", node);
}
node--;
}

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-23 12:14:49

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Wed, Jan 23, Mel Gorman wrote:

> Sorry this is dragging out. Can you post the full dmesg with loglevel=8 of the
> following patch against 2.6.24-rc8 please? It contains the debug information
> that helped me figure out what was going wrong on the PPC64 machine here,
> the revert and the !l3 checks (i.e. the two patches that made machines I
> have access to work). Thanks

It boots with your change.


boot: x
Please wait, loading kernel...
Allocated 00a00000 bytes for kernel @ 00200000
Elf64 kernel loaded...
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: debug xmon=on panic=1 loglevel=8
memory layout at init:
alloc_bottom : 0000000000ac1000
alloc_top : 0000000010000000
alloc_top_hi : 00000000da000000
rmo_top : 0000000010000000
ram_top : 00000000da000000
Looking for displays
found display : /pci@800000020000002/pci@2/pci@1/display@0, opening ... done
instantiating rtas at 0x000000000f6a1000 ... done
0000000000000000 : boot cpu 0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
0000000000000004 : starting cpu hw idx 0000000000000004... done
0000000000000006 : starting cpu hw idx 0000000000000006... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000000cc2000 -> 0x0000000000cc34e4
Device tree struct 0x0000000000cc4000 -> 0x0000000000cd6000
Calling quiesce ...
returning from prom_init
Partition configured for 8 cpus.
Starting Linux PPC64 #52 SMP Wed Jan 23 13:05:38 CET 2008
-----------------------------------------------------
ppc64_pft_size = 0x1c
physicalMemorySize = 0xda000000
htab_hash_mask = 0x1fffff
-----------------------------------------------------
Linux version 2.6.24-rc8-ppc64 (olaf@lingonberry) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #52 SMP Wed Jan 23 13:05:38 CET 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 8192 bytes
Zone PFN ranges:
DMA 0 -> 892928
Normal 892928 -> 892928
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
1: 0 -> 892928
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 880720
Policy zone: DMA
Kernel command line: debug xmon=on panic=1 loglevel=8
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 275.070000 MHz
time_init: processor frequency = 2197.800000 MHz
clocksource: timebase mult[e8ab05] shift[22] registered
clockevent: decrementer mult[466a] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 1
Memory: 3496632k/3571712k available (6188k kernel code, 75080k reserved, 1324k data, 1220k bss, 304k init)
Online nodes
o 0
o 1
Nodes with regular memory
o 1
Current running CPU 0 is associated with node 0
Current node is 0
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 0
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 1
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 2
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 3
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 4
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 5
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 6
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 7
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 8
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 9
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 10
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 11
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 12
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 13
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 14
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 15
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 16
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 17
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 18
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 19
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 20
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 21
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 22
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 23
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 24
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 25
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 26
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 27
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 28
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 29
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 30
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 31
o kmem_list3_init
kmem_cache_init Setting kmem_cache NULL 32
kmem_cache_init Setting kmem_cache NULL 0
kmem_cache_init Setting kmem_cache initkmem_list3 0
set_up_list3s size-32 index 1
set_up_list3s size-32 index 1
set_up_list3s size-32 index 1
set_up_list3s size-128 index 17
set_up_list3s size-128 index 17
set_up_list3s size-128 index 17
setup_cpu_cache size-32(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-64
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-64(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-128(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-256
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-256(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-512
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-512(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-1024
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-1024(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-2048
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-2048(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-4096
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-4096(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-8192
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-8192(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-16384
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-16384(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-32768
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-32768(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-65536
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-65536(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-131072
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-131072(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-262144
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-262144(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-524288
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-524288(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-1048576
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-1048576(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-2097152
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-2097152(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-4194304
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-4194304(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-8388608
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-8388608(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-16777216
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
setup_cpu_cache size-16777216(DMA)
o allocated node 0
o kmem_list3_init
o allocated node 1
o kmem_list3_init
init_list RESETTING kmem_cache node 0
init_list RESETTING size-32 node 0
init_list RESETTING size-128 node 0
init_list RESETTING size-32 node 1
init_list RESETTING size-128 node 1
alloc_kmemlist size-16777216(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-16777216
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-8388608(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-8388608
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-4194304(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-4194304
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-2097152(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-2097152
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-1048576(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-1048576
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-524288(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-524288
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-262144(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-262144
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-131072(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-131072
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-65536(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-65536
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-32768(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-32768
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-16384(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-16384
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-8192(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-8192
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-4096(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-4096
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-2048(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-2048
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-1024(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-1024
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-512(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-512
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-256(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-256
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-128(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-64(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-64
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-32(DMA)
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-128
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist size-32
o node 0
o l3 exists
o node 1
o l3 exists
alloc_kmemlist kmem_cache
o node 0
o l3 exists
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D802FA00
alloc_kmemlist numa_policy
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D802FC00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D802FD80
alloc_kmemlist shared_policy_node
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D802FF00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D803E180
Calibrating delay loop... 548.86 BogoMIPS (lpj=2744320)
alloc_kmemlist pid_1
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D803E300
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D803E480
alloc_kmemlist pid_namespace
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D803E600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D803E780
alloc_kmemlist pgd_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D803E900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D803EA80
alloc_kmemlist pud_pmd_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D803EC00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D803ED80
alloc_kmemlist anon_vma
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D803EF00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D804C180
alloc_kmemlist task_struct
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D804C300
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D804C480
alloc_kmemlist sighand_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D804C600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D804C780
alloc_kmemlist signal_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D804C900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D804CA80
alloc_kmemlist files_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D804CC00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D804CD80
alloc_kmemlist fs_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D804CF00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8057180
alloc_kmemlist vm_area_struct
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8057300
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8057480
alloc_kmemlist mm_struct
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8057600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8057780
alloc_kmemlist buffer_head
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8057900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8057A80
alloc_kmemlist idr_layer_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8057C80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8057E00
alloc_kmemlist key_jar
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8057F80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8066200
Security Framework initialized
Capability LSM initialized
Failure registering Root Plug module with the kernel
Failure registering Root Plug module with primary security module.
alloc_kmemlist names_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8066380
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8066500
alloc_kmemlist filp
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8066680
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8066800
alloc_kmemlist dentry
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8066980
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8066B00
alloc_kmemlist inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8066C80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8066E00
alloc_kmemlist mnt_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8066F80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8074200
Mount-cache hash table entries: 256
alloc_kmemlist sysfs_dir_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8074380
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8074500
alloc_kmemlist bdev_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8074700
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8074880
alloc_kmemlist radix_tree_node
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8074A00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8074B80
alloc_kmemlist sigqueue
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8074D00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8074E80
alloc_kmemlist proc_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D808E100
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D808E280
alloc_kmemlist taskstats
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D808E400
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D808E580
alloc_kmemlist task_delay_info
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D808E700
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D808E880
cpuup_prepare 1
clockevent: decrementer mult[466a] shift[16] cpu[1]
Processor 1 found.
cpuup_prepare 2
clockevent: decrementer mult[466a] shift[16] cpu[2]
Processor 2 found.
cpuup_prepare 3
clockevent: decrementer mult[466a] shift[16] cpu[3]
Processor 3 found.
cpuup_prepare 4
clockevent: decrementer mult[466a] shift[16] cpu[4]
Processor 4 found.
cpuup_prepare 5
clockevent: decrementer mult[466a] shift[16] cpu[5]
Processor 5 found.
cpuup_prepare 6
clockevent: decrementer mult[466a] shift[16] cpu[6]
Processor 6 found.
cpuup_prepare 7
clockevent: decrementer mult[466a] shift[16] cpu[7]
Processor 7 found.
Brought up 8 CPUs
Node 0 CPUs: 0-3
Node 1 CPUs: 4-7
net_namespace: 120 bytes
alloc_kmemlist file_lock_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D82C6680
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D82C6800
alloc_kmemlist skbuff_head_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D82C6980
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D82C6B00
alloc_kmemlist skbuff_fclone_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D82C6D00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D82C6E80
alloc_kmemlist sock_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8372180
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8372300
NET: Registered protocol family 16
IBM eBus Device Driver
PCI: Probing PCI hardware
IOMMU table initialized, virtual merging enabled
PCI: Probing PCI hardware done
Registering pmac pic with sysfs...
alloc_kmemlist bio
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83E9580
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83E9700
alloc_kmemlist biovec-1
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83E9880
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83E9A00
alloc_kmemlist biovec-4
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83E9B80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83E9D00
alloc_kmemlist biovec-16
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83E9E80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83F4100
alloc_kmemlist biovec-64
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83F4300
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83F4480
alloc_kmemlist biovec-128
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83F4600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83F4780
alloc_kmemlist biovec-256
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D83F4900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D83F4A80
alloc_kmemlist blkdev_requests
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8401580
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8401700
alloc_kmemlist blkdev_queue
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8401880
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8401A00
alloc_kmemlist blkdev_ioc
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8401B80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8401D00
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
alloc_kmemlist eventpoll_epi
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8439A80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8439C00
alloc_kmemlist eventpoll_pwq
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8439D80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8439F00
alloc_kmemlist TCP
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8486380
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8486500
alloc_kmemlist request_sock_TCP
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8486680
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8486800
alloc_kmemlist tw_sock_TCP
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8486980
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8486B00
alloc_kmemlist UDP
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8486D00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D8486E80
alloc_kmemlist RAW
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D849D180
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D849D300
NET: Registered protocol family 2
Time: timebase clocksource has been installed.
Switched to high resolution mode on CPU 0
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 2
Switched to high resolution mode on CPU 3
Switched to high resolution mode on CPU 4
Switched to high resolution mode on CPU 5
Switched to high resolution mode on CPU 6
Switched to high resolution mode on CPU 7
alloc_kmemlist arp_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D849D480
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D849D600
alloc_kmemlist ip_dst_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D849DC80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D849DE00
IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
alloc_kmemlist xfrm_dst_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84A8300
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84A8480
alloc_kmemlist secpath_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84A8600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84A8780
alloc_kmemlist inet_peer_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84A8900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84A8A80
alloc_kmemlist tcp_bind_bucket
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84A8C00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84A8D80
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
alloc_kmemlist UDP-Lite
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84A8F80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84C6200
alloc_kmemlist ip_mrt_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84C6380
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84C6500
alloc_kmemlist rtas_flash_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D8294880
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84E5F80
alloc_kmemlist hugepte_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84EA800
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84EA680
alloc_kmemlist uid_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84EA400
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84EA280
alloc_kmemlist posix_timers_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D84EA100
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D84EAF00
alloc_kmemlist nsproxy
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85BB180
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85BB300
audit: initializing netlink socket (disabled)
audit(1201090162.460:1): initialized
RTAS daemon started
RTAS: event: 88, Type: Platform Error, Severity: 2
Total HugeTLB memory allocated, 0
alloc_kmemlist shmem_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85BB680
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85BB800
alloc_kmemlist fasync_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85BBA00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85BBB80
alloc_kmemlist kiocb
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85BBD00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85BBE80
alloc_kmemlist kioctx
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85EF180
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85EF300
alloc_kmemlist inotify_watch_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85EF880
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85EFA00
alloc_kmemlist inotify_event_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85EFB80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85EFD00
VFS: Disk quotas dquot_6.5.1
alloc_kmemlist dquot
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85EFE80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85FD100
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
alloc_kmemlist dnotify_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85FD280
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85FD400
alloc_kmemlist reiser_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85FD600
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85FD780
alloc_kmemlist ext3_xattr
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85FD980
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85FDB00
alloc_kmemlist ext3_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D85FDD00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D85FDE80
alloc_kmemlist revoke_record
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6036100
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6036280
alloc_kmemlist revoke_table
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6036400
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6036580
alloc_kmemlist journal_head
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6036700
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6036880
alloc_kmemlist journal_handle
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6036A00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6036B80
alloc_kmemlist ext2_xattr
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6036D80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6036F00
alloc_kmemlist ext2_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6046200
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6046380
alloc_kmemlist hugetlbfs_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6046580
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6046700
alloc_kmemlist fat_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6046900
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6046A80
alloc_kmemlist fat_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6046C80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6046E00
alloc_kmemlist isofs_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6052100
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6052280
alloc_kmemlist mqueue_inode_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6052500
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6052680
alloc_kmemlist bsg_cmd
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6052880
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6052A00
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
alloc_kmemlist cfq_queue
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6052C00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6052D80
alloc_kmemlist cfq_io_context
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D6052F00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D6070180
io scheduler cfq registered (default)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
rpaphp: Slot [0001:00:02.0](PCI location=U7879.001.DQD04M6-P1-C3) registered
rpaphp: Slot [0001:00:02.2](PCI location=U7879.001.DQD04M6-P1-C4) registered
rpaphp: Slot [0001:00:02.4](PCI location=U7879.001.DQD04M6-P1-C5) registered
rpaphp: Slot [0001:00:02.6](PCI location=U7879.001.DQD04M6-P1-C6) registered
rpaphp: Slot [0002:00:02.0](PCI location=U7879.001.DQD04M6-P1-C1) registered
rpaphp: Slot [0002:00:02.6](PCI location=U7879.001.DQD04M6-P1-C2) registered
matroxfb: Matrox G450 detected
PInS data found at offset 31168
PInS memtype = 5
matroxfb: 640x480x8bpp (virtual: 640x26214)
matroxfb: framebuffer at 0x40178000000, mapped to 0xd000080080080000, size 33554432
Console: switching to colour frame buffer device 80x30
fb0: MATROX frame buffer device
matroxfb_crtc2: secondary head of fb0 was registered as fb1
vio_register_driver: driver hvc_console registering
HVSI: registered 0 devices
Generic RTC Driver v1.07
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
pmac_zilog: 0.6 (Benjamin Herrenschmidt <[email protected]>)
input: Macintosh mouse button emulation as /devices/virtual/input/input0
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ehci_hcd 0000:c8:01.2: EHCI Host Controller
ehci_hcd 0000:c8:01.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:c8:01.2: irq 85, io mem 0x400a0002000
ehci_hcd 0000:c8:01.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 5 ports detected
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
ohci_hcd 0000:c8:01.0: OHCI Host Controller
ohci_hcd 0000:c8:01.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:c8:01.0: irq 85, io mem 0x400a0001000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:c8:01.1: OHCI Host Controller
ohci_hcd 0000:c8:01.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:c8:01.1: irq 85, io mem 0x400a0000000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
mice: PS/2 mouse device common for all mice
EDAC MC: Ver: 2.1.0 Jan 23 2008
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
/home/olaf/kernel/git/linux-2.6.24-rc8/drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
oprofile: using ppc64/power5+ performance monitoring.
alloc_kmemlist flow_cache
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D612FA80
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D612FC00
alloc_kmemlist UNIX
o node 0
o allocing l3
o kmem_list3_init
o setting node 0 0xC0000000D612FE00
o node 1
o allocing l3
o kmem_list3_init
o setting node 1 0xC0000000D612FF80
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 15
registered taskstats version 1
md: Autodetecting RAID arrays.
md: Scanned 0 and added 0 devices.
md: autorun ...
md: ... autorun DONE.
VFS: Cannot open root device "<NULL>" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
Rebooting in 1 seconds..

2008-01-23 12:52:19

by Olaf Hering

Subject: Re: crash in kmem_cache_init

On Wed, Jan 23, Olaf Hering wrote:

> On Wed, Jan 23, Mel Gorman wrote:
>
> > Sorry this is dragging out. Can you post the full dmesg with loglevel=8 of the
> > following patch against 2.6.24-rc8 please? It contains the debug information
> > that helped me figure out what was going wrong on the PPC64 machine here,
> > the revert and the !l3 checks (i.e. the two patches that made machines I
> > have access to work). Thanks
>
> It boots with your change.

This version of the patch boots ok for me:
Maybe I made a mistake with earlier patches, no idea.

---
mm/slab.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1590,7 +1590,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);

- for_each_node_state(nid, N_NORMAL_MEMORY) {
+ for_each_online_node(nid) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
&initkmem_list3[SIZE_AC + nid], nid);

@@ -1968,7 +1968,7 @@ static void __init set_up_list3s(struct
{
int node;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
@@ -2099,7 +2099,7 @@ static int __init_refok setup_cpu_cache(
g_cpucache_up = PARTIAL_L3;
} else {
int node;
- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {
cachep->nodelists[node] =
kmalloc_node(sizeof(struct kmem_list3),
GFP_KERNEL, node);
@@ -2775,6 +2775,11 @@ static int cache_grow(struct kmem_cache
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
+ BUG_ON(!l3);
spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3322,10 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
BUG_ON(!l3);

retry:
@@ -3815,7 +3824,7 @@ static int alloc_kmemlist(struct kmem_ca
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {

if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit);

2008-01-23 13:42:00

by Mel Gorman

Subject: Re: crash in kmem_cache_init

On (23/01/08 13:14), Olaf Hering didst pronounce:
> On Wed, Jan 23, Mel Gorman wrote:
>
> > Sorry this is dragging out. Can you post the full dmesg with loglevel=8 of the
> > following patch against 2.6.24-rc8 please? It contains the debug information
> > that helped me figure out what was going wrong on the PPC64 machine here,
> > the revert and the !l3 checks (i.e. the two patches that made machines I
> > have access to work). Thanks
>
> It boots with your change.
>

....... Nice one! As the only addition here is debugging output, I can
only assume that the two patches were being booted in isolation earlier
instead of in combination. The two threads have been a little confused with
hand waving, so that can easily happen.

Looking at your log:

> early_node_map[1] active PFN ranges
> 1: 0 -> 892928

All memory on node 1

> Online nodes
> o 0
> o 1
> Nodes with regular memory
> o 1
> Current running CPU 0 is associated with node 0
> Current node is 0

Running CPU associated with node 0 so other than being node 1 instead of
node 2, your machine is similar to the one I had the problem on in terms
of memoryless nodes and CPU configuration.

> VFS: Cannot open root device "<NULL>" or unknown-block(0,0)
> Please append a correct "root=" boot option; here are the available partitions:
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
> Rebooting in 1 seconds..
>

I see it failed to complete boot but I'm going to assume this is a relatively
normal command-line, .config or initrd problem and not a regression of
some type.

I'll post a patch suitable for pick-up shortly. The two patches, run in
combination with CONFIG_DEBUG_SLAB, got through compile-based stress tests
without difficulty, so hopefully there are no new surprises hiding in the
corners.

Thanks Olaf.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
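
For readers following along, the topology in the log quoted above (nodes 0 and 1 online, regular memory only on node 1, boot CPU on node 0) can be modelled in a few lines of user-space C. This is only a sketch: the mask constants and the helpers setup_lists() and boot_node_has_list() are made up for illustration and are not kernel interfaces. It shows why a set-up loop that walks only the nodes with normal memory leaves the boot CPU's node without a per-node list, while a loop over all online nodes does not.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 2

/* Illustrative stand-ins for the kernel's node state bitmaps (assumption).
 * Values follow the quoted log: nodes 0 and 1 online, memory only on node 1. */
enum { ONLINE_MASK = 0x3, NORMAL_MEMORY_MASK = 0x2 };

/* Record which nodes get a per-node list when the set-up loop walks `mask`.
 * Returns the number of lists that were set up. */
int setup_lists(unsigned int mask, bool has_list[MAX_NODES])
{
	int count = 0;

	/* The mask must not name nodes beyond MAX_NODES. */
	assert((mask >> MAX_NODES) == 0);

	for (int node = 0; node < MAX_NODES; node++) {
		has_list[node] = (mask >> node) & 1u;
		count += has_list[node];
	}
	return count;
}

/* The boot CPU sits on node 0: does that node end up with a list? */
bool boot_node_has_list(unsigned int mask)
{
	bool has_list[MAX_NODES];

	setup_lists(mask, has_list);
	return has_list[0];
}
```

Calling boot_node_has_list(NORMAL_MEMORY_MASK) models the for_each_node_state(node, N_NORMAL_MEMORY) loops, and boot_node_has_list(ONLINE_MASK) models the for_each_online_node(node) loops from the revert.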

2008-01-23 13:55:31

by Mel Gorman

Subject: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

This patch, in combination with a partial revert of commit
04231b3002ac53f8a64a7bd142fde3fa4b6808c6, fixes a regression between 2.6.23
and 2.6.24-rc8 where a PPC64 machine with all CPUs on a memoryless node fails
to boot. If approved by the SLAB maintainers, it should be merged for 2.6.24.

With memoryless-node configurations, it is possible that all the CPUs are
associated with a node with no memory. Early in the boot process, the
nodelists that fallback_alloc needs in order to work are not yet set up, so
an Oops occurs and the machine fails to boot.

This patch adds the necessary checks to make sure a kmem_list3 exists for
the preferred node used when growing the cache. If the preferred node has
no nodelist then the currently running node is used instead. This
problem only affects the SLAB allocator; SLUB appears to work fine.

Signed-off-by: Mel Gorman <[email protected]>

---
mm/slab.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c
--- linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c 2008-01-22 17:46:32.000000000 +0000
+++ linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c 2008-01-22 18:42:53.000000000 +0000
@@ -2775,6 +2775,11 @@ static int cache_grow(struct kmem_cache
/* Take the l3 list lock to change the colour_next on this node */
check_irq_off();
l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
+ BUG_ON(!l3);
spin_lock(&l3->list_lock);

/* Get colour for the slab, and cal the next value. */
@@ -3317,6 +3322,10 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
+ if (!l3) {
+ nodeid = numa_node_id();
+ l3 = cachep->nodelists[nodeid];
+ }
BUG_ON(!l3);

retry:

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
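
The check added by the patch above can be tried out in isolation with a small user-space model. This is a sketch under stated assumptions: struct list3 and struct cache are cut-down stand-ins for kmem_list3 and kmem_cache, pick_list() is a hypothetical helper, and the current_node argument plays the role of numa_node_id(). Where the patched kernel would hit BUG_ON(!l3), the model simply returns NULL so the case can be observed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 2

/* Cut-down stand-ins for struct kmem_list3 and struct kmem_cache. */
struct list3 { int node; };
struct cache { struct list3 *nodelists[MAX_NODES]; };

/* Return the per-node list for `nodeid`; if that node has none, fall back
 * to the list of `current_node` (the node of the running CPU).
 * A NULL result marks the case where the kernel code would BUG. */
struct list3 *pick_list(struct cache *c, int nodeid, int current_node)
{
	struct list3 *l3;

	assert(nodeid >= 0 && nodeid < MAX_NODES);
	assert(current_node >= 0 && current_node < MAX_NODES);

	l3 = c->nodelists[nodeid];
	if (!l3)
		l3 = c->nodelists[current_node];
	return l3;
}
```

With a cache whose nodelists are { &l0, NULL }, pick_list(&c, 1, 0) returns &l0. If both entries are NULL the result is NULL, which is the situation the fallback does not cover.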

2008-01-23 14:18:37

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

Hi Mel,

On Wed, 23 Jan 2008, Mel Gorman wrote:
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c
> --- linux-2.6.24-rc8-005-revert-memoryless-slab/mm/slab.c 2008-01-22 17:46:32.000000000 +0000
> +++ linux-2.6.24-rc8-010_handle_missing_l3/mm/slab.c 2008-01-22 18:42:53.000000000 +0000
> @@ -2775,6 +2775,11 @@ static int cache_grow(struct kmem_cache
> /* Take the l3 list lock to change the colour_next on this node */
> check_irq_off();
> l3 = cachep->nodelists[nodeid];
> + if (!l3) {
> + nodeid = numa_node_id();
> + l3 = cachep->nodelists[nodeid];
> + }
> + BUG_ON(!l3);
> spin_lock(&l3->list_lock);
>
> /* Get colour for the slab, and cal the next value. */
> @@ -3317,6 +3322,10 @@ static void *____cache_alloc_node(struct
> int x;
>
> l3 = cachep->nodelists[nodeid];
> + if (!l3) {
> + nodeid = numa_node_id();
> + l3 = cachep->nodelists[nodeid];
> + }

What guarantees that the current node's ->nodelists entry is never NULL?

I still think Christoph's kmem_getpages() patch is correct (to fix the
cache_grow() oops) but I overlooked the fact that none of the callers of
____cache_alloc_node() deal with bootstrapping (with the exception of
__cache_alloc_node(), which even has a comment about it).

But what I am really wondering about is, why wasn't the
N_NORMAL_MEMORY revert enough? I assume this used to work before so what
more do we need to revert for 2.6.24?

Pekka

2008-01-23 14:27:41

by Olaf Hering

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, Jan 23, Mel Gorman wrote:

> This patch in combination with a partial revert of commit
> 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 fixes a regression between 2.6.23
> and 2.6.24-rc8 where a PPC64 machine with all CPUs on a memoryless node fails
> to boot. If approved by the SLAB maintainers, it should be merged for 2.6.24.

This change alone does not help; it's not the version I tested.
Will all the changes below go into 2.6.24 as well, in a separate patch?

- for_each_node_state(node, N_NORMAL_MEMORY) {
+ for_each_online_node(node) {

2008-01-23 14:32:40

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka J Enberg wrote:
> I still think Christoph's kmem_getpages() patch is correct (to fix the
> cache_grow() oops) but I overlooked the fact that none of the callers of
> ____cache_alloc_node() deal with bootstrapping (with the exception of
> __cache_alloc_node(), which even has a comment about it).

So something like this (totally untested) patch on top of current git:

---
mm/slab.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -1668,7 +1668,11 @@ static void *kmem_getpages(struct kmem_c
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
flags |= __GFP_RECLAIMABLE;

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ if (nodeid == -1)
+ page = alloc_pages(flags, cachep->gfporder);
+ else
+ page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+
if (!page)
return NULL;

@@ -2976,8 +2980,9 @@ retry:
batchcount = BATCHREFILL_LIMIT;
}
l3 = cachep->nodelists[node];
+ if (!l3)
+ return NULL;

- BUG_ON(ac->avail > 0 || !l3);
spin_lock(&l3->list_lock);

/* See if we can refill from the shared array */
@@ -3317,7 +3322,8 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
- BUG_ON(!l3);
+ if (!l3)
+ return fallback_alloc(cachep, flags);

retry:
check_irq_off();
@@ -3394,12 +3400,6 @@ __cache_alloc_node(struct kmem_cache *ca
if (unlikely(nodeid == -1))
nodeid = numa_node_id();

- if (unlikely(!cachep->nodelists[nodeid])) {
- /* Node not bootstrapped yet */
- ptr = fallback_alloc(cachep, flags);
- goto out;
- }
-
if (nodeid == numa_node_id()) {
/*
* Use the locally cached objects if possible.

2008-01-23 14:42:34

by Mel Gorman

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On (23/01/08 15:27), Olaf Hering didst pronounce:
> On Wed, Jan 23, Mel Gorman wrote:
>
> > This patch in combination with a partial revert of commit
> > 04231b3002ac53f8a64a7bd142fde3fa4b6808c6 fixes a regression between 2.6.23
> > and 2.6.24-rc8 where a PPC64 machine with all CPUS on a memoryless node fails
> > to boot. If approved by the SLAB maintainers, it should be merged for 2.6.24.
>
> This change alone does not help, it's not the version I tested.
> Will all the changes below go into 2.6.24 as well, in a separate patch?
>
> - for_each_node_state(node, N_NORMAL_MEMORY) {
> + for_each_online_node(node) {

Those changes are already in a separate patch and have been sent. I don't
see it in git yet but it should be on the way.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-23 14:49:53

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

Hi,

On Wed, 23 Jan 2008, Pekka J Enberg wrote:
> > I still think Christoph's kmem_getpages() patch is correct (to fix
> > cache_grow() oops) but I overlooked the fact that none of the callers of
> > ____cache_alloc_node() deal with bootstrapping (with the exception of
> > __cache_alloc_node() that even has a comment about it).
>
> So something like this (totally untested) patch on top of current git:

Sorry, removed a BUG_ON() from cache_alloc_refill() by mistake, here's a
better one:

[PATCH] slab: fix allocation on memoryless nodes
From: Pekka Enberg <[email protected]>

As memoryless nodes do not have a nodelist, change cache_alloc_refill() to bail
out for those and let ____cache_alloc_node() always deal with that by resorting
to fallback_alloc().

Furthermore, don't let kmem_getpages() call alloc_pages_node() if nodeid passed
to it is -1 as the latter will always translate that to numa_node_id() which
might not have ->nodelist that caused the invocation of fallback_alloc() in the
first place (for example, during bootstrap).

Signed-off-by: Pekka Enberg <[email protected]>
---
mm/slab.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -1668,7 +1668,11 @@ static void *kmem_getpages(struct kmem_c
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
flags |= __GFP_RECLAIMABLE;

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ if (nodeid == -1)
+ page = alloc_pages(flags, cachep->gfporder);
+ else
+ page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+
if (!page)
return NULL;

@@ -2975,9 +2979,11 @@ retry:
*/
batchcount = BATCHREFILL_LIMIT;
}
+ BUG_ON(ac->avail > 0);
l3 = cachep->nodelists[node];
+ if (!l3)
+ return NULL;

- BUG_ON(ac->avail > 0 || !l3);
spin_lock(&l3->list_lock);

/* See if we can refill from the shared array */
@@ -3317,7 +3323,8 @@ static void *____cache_alloc_node(struct
int x;

l3 = cachep->nodelists[nodeid];
- BUG_ON(!l3);
+ if (!l3)
+ return fallback_alloc(cachep, flags);

retry:
check_irq_off();
@@ -3394,12 +3401,6 @@ __cache_alloc_node(struct kmem_cache *ca
if (unlikely(nodeid == -1))
nodeid = numa_node_id();

- if (unlikely(!cachep->nodelists[nodeid])) {
- /* Node not bootstrapped yet */
- ptr = fallback_alloc(cachep, flags);
- goto out;
- }
-
if (nodeid == numa_node_id()) {
/*
* Use the locally cached objects if possible.
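
The intent of the `____cache_alloc_node()` hunk above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel code: `struct model_cache`, `model_fallback_alloc()` and `model_cache_alloc_node()` are hypothetical names, cache growth is left out entirely, and the fallback simply scans nodes in index order. It only shows the control flow the patch aims for: a node without a list is routed to the fallback path instead of hitting `BUG_ON(!l3)`.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4 /* hypothetical stand-in for MAX_NUMNODES */

/* Model of a per-node list: only tracks how many objects are free. */
struct model_list3 {
    int free_objects;
};

/* Model of a cache: a NULL entry means the node has no list (yet). */
struct model_cache {
    struct model_list3 *nodelists[MODEL_MAX_NODES];
};

/* Model of the fallback path: serve from any node that has a list. */
static int model_fallback_alloc(struct model_cache *cachep)
{
    int node;

    for (node = 0; node < MODEL_MAX_NODES; node++) {
        struct model_list3 *l3 = cachep->nodelists[node];

        if (l3 && l3->free_objects > 0) {
            l3->free_objects--;
            return node;
        }
    }
    return -1; /* nothing available anywhere */
}

/* Returns the node that served the request, or -1 on failure. */
static int model_cache_alloc_node(struct model_cache *cachep, int nodeid)
{
    struct model_list3 *l3;

    assert(nodeid >= 0 && nodeid < MODEL_MAX_NODES);
    l3 = cachep->nodelists[nodeid];
    if (!l3 || l3->free_objects == 0) /* no list or empty: fall back */
        return model_fallback_alloc(cachep);
    l3->free_objects--;
    return nodeid;
}
```

In this model a request for node 0 on a cache whose only list belongs to node 2 is served from node 2. The model deliberately skips the cache-growing step, which is where the follow-up test report in this thread still oopses.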

2008-01-23 15:57:19

by Mel Gorman

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On (23/01/08 16:49), Pekka J Enberg didst pronounce:
> Hi,
>
> On Wed, 23 Jan 2008, Pekka J Enberg wrote:
> > > I still think Christoph's kmem_getpages() patch is correct (to fix
> > > cache_grow() oops) but I overlooked the fact that none of the callers of
> > > ____cache_alloc_node() deal with bootstrapping (with the exception of
> > > __cache_alloc_node() that even has a comment about it).
> >
> > So something like this (totally untested) patch on top of current git:
>
> Sorry, removed a BUG_ON() from cache_alloc_refill() by mistake, here's a
> better one:
>

Applied in combination with the N_NORMAL_MEMORY revert and it fails to
boot. Console is as follows;

Linux version 2.6.24-rc8-autokern1 ([email protected])
(gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #2 SMP Wed Jan 23
10:37:36 EST 2008
[boot]0012 Setup Arch
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 1048576
Normal 1048576 -> 1048576
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
2: 0 -> 1048576
Could not find start_pfn for node 0
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on. Total pages: 1034240
Policy zone: DMA
Kernel command line: ro console=hvc0 autobench_args: root=/dev/sda6
ABAT:1201101591 loglevel=8
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
time_init: decrementer frequency = 238.059000 MHz
time_init: processor frequency = 1904.472000 MHz
clocksource: timebase mult[10cd746] shift[22] registered
clockevent: decrementer mult[3cf1] shift[16] cpu[0]
Console: colour dummy device 80x25
console handover: boot [udbg0] -> real [hvc0]
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
freeing bootmem node 2
Memory: 4105560k/4194304k available (5004k kernel code, 88744k reserved,
876k data, 559k bss, 272k init)
Unable to handle kernel paging request for data at address 0x00000040
Faulting instruction address: 0xc0000000003c8ae8
cpu 0x0: Vector: 300 (Data Access) at [c0000000005c3840]
pc: c0000000003c8ae8: __lock_text_start+0x20/0x88
lr: c0000000000dadb4: .cache_grow+0x7c/0x338
sp: c0000000005c3ac0
msr: 8000000000009032
dar: 40
dsisr: 40000000
current = 0xc000000000500f10
paca = 0xc000000000501b80
pid = 0, comm = swapper
enter ? for help
[c0000000005c3b40] c0000000000dadb4 .cache_grow+0x7c/0x338
[c0000000005c3c00] c0000000000db518 .fallback_alloc+0x1c0/0x224
[c0000000005c3cb0] c0000000000db920 .kmem_cache_alloc+0xe0/0x14c
[c0000000005c3d50] c0000000000dcbd0 .kmem_cache_create+0x230/0x4cc
[c0000000005c3e30] c0000000004c049c .kmem_cache_init+0x1ec/0x51c
[c0000000005c3ee0] c00000000049f8d8 .start_kernel+0x304/0x3fc
[c0000000005c3f90] c000000000008594 .start_here_common+0x54/0xc0

0xc0000000000dadb4 is in cache_grow (mm/slab.c:2782).
2777 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
2778
2779 /* Take the l3 list lock to change the colour_next on this node */
2780 check_irq_off();
2781 l3 = cachep->nodelists[nodeid];
2782 spin_lock(&l3->list_lock);
2783
2784 /* Get colour for the slab, and cal the next value. */
2785 offset = l3->colour_next;
2786 l3->colour_next++;
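
The faulting data address 0x00000040 is consistent with `l3` being NULL at line 2782: taking `&l3->list_lock` then yields the offset of `list_lock` within the structure. The sketch below checks that arithmetic in user space. The field order is an assumption modelled on `struct kmem_list3` as I recall it from mm/slab.c of that era, pointers and `unsigned long` are modelled as 8 bytes for a 64-bit kernel, and the lock is a 4-byte placeholder; all `model_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a 64-bit struct list_head: two 8-byte pointers. */
struct model_list_head {
    uint64_t next;
    uint64_t prev;
};

/* Assumed leading fields of struct kmem_list3 on a 64-bit kernel. */
struct model_kmem_list3 {
    struct model_list_head slabs_partial;
    struct model_list_head slabs_full;
    struct model_list_head slabs_free;
    uint64_t free_objects; /* unsigned long in the kernel */
    uint32_t free_limit;
    uint32_t colour_next;
    uint32_t list_lock;    /* placeholder for spinlock_t */
};

/* Address touched by spin_lock(&l3->list_lock) for a given l3 value. */
static uintptr_t model_fault_address(uintptr_t l3_base)
{
    /* Layout sanity check for the model's list_head. */
    assert(sizeof(struct model_list_head) == 16);
    return l3_base + offsetof(struct model_kmem_list3, list_lock);
}
```

Under these assumptions `model_fault_address(0)` is 0x40 (3 * 16 + 8 + 4 + 4 bytes), matching the `dar: 40` value in the oops.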

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-01-23 19:52:51

by Nishanth Aravamudan

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On 23.01.2008 [19:29:15 +0200], Pekka J Enberg wrote:
> Hi,
>
> On Wed, 23 Jan 2008, Mel Gorman wrote:
> > Applied in combination with the N_NORMAL_MEMORY revert and it fails to
> > boot. Console is as follows;
>
> Thanks for testing!
>
> On Wed, 23 Jan 2008, Mel Gorman wrote:
> > [c0000000005c3b40] c0000000000dadb4 .cache_grow+0x7c/0x338
> > [c0000000005c3c00] c0000000000db518 .fallback_alloc+0x1c0/0x224
> > [c0000000005c3cb0] c0000000000db920 .kmem_cache_alloc+0xe0/0x14c
> > [c0000000005c3d50] c0000000000dcbd0 .kmem_cache_create+0x230/0x4cc
> > [c0000000005c3e30] c0000000004c049c .kmem_cache_init+0x1ec/0x51c
> > [c0000000005c3ee0] c00000000049f8d8 .start_kernel+0x304/0x3fc
> > [c0000000005c3f90] c000000000008594 .start_here_common+0x54/0xc0
> >
> > 0xc0000000000dadb4 is in cache_grow (mm/slab.c:2782).
> > 2777 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
> > 2778
> > 2779 /* Take the l3 list lock to change the colour_next on this node */
> > 2780 check_irq_off();
> > 2781 l3 = cachep->nodelists[nodeid];
> > 2782 spin_lock(&l3->list_lock);
> > 2783
> > 2784 /* Get colour for the slab, and cal the next value. */
> > 2785 offset = l3->colour_next;
> > 2786 l3->colour_next++;
>
> Ok, so it's too early to fallback_alloc() because in kmem_cache_init() we
> do:
>
> 	for (i = 0; i < NUM_INIT_LISTS; i++) {
> 		kmem_list3_init(&initkmem_list3[i]);
> 		if (i < MAX_NUMNODES)
> 			cache_cache.nodelists[i] = NULL;
> 	}
>
> Fine. But, why are we hitting fallback_alloc() in the first place? It's
> definitely not because of missing ->nodelists as we do:
>
> cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];
>
> before attempting to set up kmalloc caches. Now, if I understood
> correctly, we're booting off a memoryless node so kmem_getpages() will
> return NULL thus forcing us to fallback_alloc() which is unavailable at
> this point.
>
> As far as I can tell, there are two ways to fix this:
>
> (1) don't boot off a memoryless node (why are we doing this in the first
> place?)

On at least one of the machines in question, wasn't it the case that
node 0 had all the memory and node 1 had all the CPUs? In that case, you
would have to boot off a memoryless node? And as long as that is a
physically valid configuration, the kernel should handle it.

> (2) initialize cache_cache.nodelists with initkmem_list3 equivalents
> for *each node that has normal memory*
>
> I am still wondering why this worked before, though.

I bet we didn't notice this breaking because SLUB became the default and
SLAB isn't on in the test.kernel.org testing, for instance. Perhaps we
should add a second set of runs for some of the boxes there to run with
CONFIG_SLAB on?

I'm curious if we know, for sure, of a kernel with CONFIG_SLAB=y that
has booted all of the boxes reporting issues? That is, did they all work
with 2.6.23?

Thanks,
Nish

2008-01-23 18:36:35

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka J Enberg wrote:

> Furthermore, don't let kmem_getpages() call alloc_pages_node() if nodeid passed
> to it is -1 as the latter will always translate that to numa_node_id() which
> might not have ->nodelist that caused the invocation of fallback_alloc() in the
> first place (for example, during bootstrap).

kmem_getpages is called without GFP_THISNODE here, so
alloc_pages_node(numa_node_id(), ...) will fall back to the next node with
memory.
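
The difference a node-restricting flag makes can be sketched with a toy allocator. This is an illustration under assumptions, not the page allocator: `model_alloc_pages_node()` is a hypothetical function, the fallback order is a simple round-robin from the preferred node (the real zonelist ordering is more involved), and `thisnode` stands in for GFP_THISNODE.

```c
#include <assert.h>

/*
 * Returns the node that satisfies the request, or -1 if none can.
 * free_pages[n] is the number of free pages on node n.
 */
static int model_alloc_pages_node(const int *free_pages, int nr_nodes,
                                  int preferred, int thisnode)
{
    int i;

    assert(nr_nodes > 0 && preferred >= 0 && preferred < nr_nodes);
    for (i = 0; i < nr_nodes; i++) {
        /* Simplistic fallback order: an assumption of this model. */
        int node = (preferred + i) % nr_nodes;

        if (free_pages[node] > 0)
            return node;
        if (thisnode)
            break; /* restricted to the preferred node: no fallback */
    }
    return -1;
}
```

With all memory on node 2 and a preferred node of 0, the unrestricted call returns node 2, while the restricted call fails. The pages then belong to node 2, so the per-node structures consulted afterwards must exist for node 2 and not only for the node the CPU runs on.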

2008-01-23 18:42:16

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Mel Gorman wrote:

> This patch adds the necessary checks to make sure a kmem_list3 exists for
> the preferred node used when growing the cache. If the preferred node has
> no nodelist then the currently running node is used instead. This
> problem only affects the SLAB allocator, SLUB appears to work fine.

That is a dangerous thing to do. SLAB per cpu queues will contain foreign
objects which may cause troubles when pushing the objects back. I think we
may be lucky that these objects are consumed at boot. If all of the
foreign objects are consumed at boot then we are fine. At least an
explanation as to this issue should be added to the patch.

2008-01-23 18:35:44

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka J Enberg wrote:

> I still think Christoph's kmem_getpages() patch is correct (to fix
> cache_grow() oops) but I overlooked the fact that none of the callers of
> ____cache_alloc_node() deal with bootstrapping (with the exception of
> __cache_alloc_node() that even has a comment about it).

My patch is useless. kmem_getpages called with nodeid == -1 falls back
correctly to the available node. The problem is that the node structures
for the page do not exist.

> But what I am really wondering about is, why wasn't the
> N_NORMAL_MEMORY revert enough? I assume this used to work before so what
> more do we need to revert for 2.6.24?

I think that is because SLUB relaxed the requirements on having regular
memory on the boot node. Now the expectation is that SLAB can do the same.

2008-01-23 18:51:42

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka J Enberg wrote:

> Fine. But, why are we hitting fallback_alloc() in the first place? It's
> definitely not because of missing ->nodelists as we do:
>
> cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];
>
> before attempting to set up kmalloc caches. Now, if I understood
> correctly, we're booting off a memoryless node so kmem_getpages() will
> return NULL thus forcing us to fallback_alloc() which is unavailable at
> this point.
>
> As far as I can tell, there are two ways to fix this:
>
> (1) don't boot off a memoryless node (why are we doing this in the first
> place?)

Right. That is the solution that I would prefer.

> (2) initialize cache_cache.nodelists with initkmem_list3 equivalents
> for *each node that has normal memory*

Or simply do it for all. SLAB bootstrap is a very complex thing though.

>
> I am still wondering why this worked before, though.

I doubt it ever worked for SLAB.

2008-01-23 17:33:52

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

Hi,

On Wed, 23 Jan 2008, Mel Gorman wrote:
> Applied in combination with the N_NORMAL_MEMORY revert and it fails to
> boot. Console is as follows;

Thanks for testing!

On Wed, 23 Jan 2008, Mel Gorman wrote:
> [c0000000005c3b40] c0000000000dadb4 .cache_grow+0x7c/0x338
> [c0000000005c3c00] c0000000000db518 .fallback_alloc+0x1c0/0x224
> [c0000000005c3cb0] c0000000000db920 .kmem_cache_alloc+0xe0/0x14c
> [c0000000005c3d50] c0000000000dcbd0 .kmem_cache_create+0x230/0x4cc
> [c0000000005c3e30] c0000000004c049c .kmem_cache_init+0x1ec/0x51c
> [c0000000005c3ee0] c00000000049f8d8 .start_kernel+0x304/0x3fc
> [c0000000005c3f90] c000000000008594 .start_here_common+0x54/0xc0
>
> 0xc0000000000dadb4 is in cache_grow (mm/slab.c:2782).
> 2777 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
> 2778
> 2779 /* Take the l3 list lock to change the colour_next on this node */
> 2780 check_irq_off();
> 2781 l3 = cachep->nodelists[nodeid];
> 2782 spin_lock(&l3->list_lock);
> 2783
> 2784 /* Get colour for the slab, and cal the next value. */
> 2785 offset = l3->colour_next;
> 2786 l3->colour_next++;

Ok, so it's too early to fallback_alloc() because in kmem_cache_init() we
do:

	for (i = 0; i < NUM_INIT_LISTS; i++) {
		kmem_list3_init(&initkmem_list3[i]);
		if (i < MAX_NUMNODES)
			cache_cache.nodelists[i] = NULL;
	}

Fine. But, why are we hitting fallback_alloc() in the first place? It's
definitely not because of missing ->nodelists as we do:

cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];

before attempting to set up kmalloc caches. Now, if I understood
correctly, we're booting off a memoryless node so kmem_getpages() will
return NULL thus forcing us to fallback_alloc() which is unavailable at
this point.

As far as I can tell, there are two ways to fix this:

(1) don't boot off a memoryless node (why are we doing this in the first
place?)
(2) initialize cache_cache.nodelists with initkmem_list3 equivalents
for *each node that has normal memory*

I am still wondering why this worked before, though.

Pekka

2008-01-23 17:42:24

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka J Enberg wrote:
> As far as I can tell, there are two ways to fix this:

[snip]

> (2) initialize cache_cache.nodelists with initkmem_list3 equivalents
> for *each node that has normal memory*

An untested patch follows:

---
mm/slab.c | 39 ++++++++++++++++++++-------------------
1 file changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -304,11 +304,11 @@ struct kmem_list3 {
/*
* Need this for bootstrapping a per node allocator.
*/
-#define NUM_INIT_LISTS (2 * MAX_NUMNODES + 1)
+#define NUM_INIT_LISTS (3 * MAX_NUMNODES)
struct kmem_list3 __initdata initkmem_list3[NUM_INIT_LISTS];
#define CACHE_CACHE 0
-#define SIZE_AC 1
-#define SIZE_L3 (1 + MAX_NUMNODES)
+#define SIZE_AC MAX_NUMNODES
+#define SIZE_L3 (2 * MAX_NUMNODES)

static int drain_freelist(struct kmem_cache *cache,
struct kmem_list3 *l3, int tofree);
@@ -1410,6 +1410,22 @@ static void init_list(struct kmem_cache
}

/*
+ * For setting up all the kmem_list3s for cache whose buffer_size is same as
+ * size of kmem_list3.
+ */
+static void __init set_up_list3s(struct kmem_cache *cachep, int index)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ cachep->nodelists[node] = &initkmem_list3[index + node];
+ cachep->nodelists[node]->next_reap = jiffies +
+ REAPTIMEOUT_LIST3 +
+ ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
+ }
+}
+
+/*
* Initialisation. Called after the page allocator have been initialised and
* before smp_init().
*/
@@ -1432,6 +1448,7 @@ void __init kmem_cache_init(void)
if (i < MAX_NUMNODES)
cache_cache.nodelists[i] = NULL;
}
+ set_up_list3s(&cache_cache, CACHE_CACHE);

/*
* Fragmentation resistance on low memory - only use bigger
@@ -1964,22 +1981,6 @@ static void slab_destroy(struct kmem_cac
}
}

-/*
- * For setting up all the kmem_list3s for cache whose buffer_size is same as
- * size of kmem_list3.
- */
-static void __init set_up_list3s(struct kmem_cache *cachep, int index)
-{
- int node;
-
- for_each_node_state(node, N_NORMAL_MEMORY) {
- cachep->nodelists[node] = &initkmem_list3[index + node];
- cachep->nodelists[node]->next_reap = jiffies +
- REAPTIMEOUT_LIST3 +
- ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
- }
-}
-
static void __kmem_cache_destroy(struct kmem_cache *cachep)
{
int i;
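
The change to the index macros in the patch above follows from `set_up_list3s()` addressing `initkmem_list3[index + node]`: once cache_cache also gets one slot per node, the three regions must not overlap. The user-space sketch below checks that property for the old and new layouts. It is a model under assumptions: `MODEL_MAX_NUMNODES` is an arbitrary small value, and `struct model_layout` and `model_layout_is_disjoint()` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_NUMNODES 4 /* hypothetical stand-in for MAX_NUMNODES */
#define MODEL_MAX_SLOTS (3 * MODEL_MAX_NUMNODES + 1)

/* Base indices into the bootstrap array, plus the array's size. */
struct model_layout {
    int cache_cache;
    int size_ac;
    int size_l3;
    int num_init_lists;
};

/* Values before the patch: cache_cache has a single slot. */
static const struct model_layout model_old_layout = {
    0, 1, 1 + MODEL_MAX_NUMNODES, 2 * MODEL_MAX_NUMNODES + 1
};

/* Values after the patch: every cache has one slot per node. */
static const struct model_layout model_new_layout = {
    0, MODEL_MAX_NUMNODES, 2 * MODEL_MAX_NUMNODES, 3 * MODEL_MAX_NUMNODES
};

/*
 * Returns 1 if addressing slot (base + node) for all three caches and
 * all nodes gives distinct, in-range slots, else 0.
 */
static int model_layout_is_disjoint(const struct model_layout *l)
{
    unsigned char used[MODEL_MAX_SLOTS];
    int bases[3];
    int b, node;

    assert(l->num_init_lists <= MODEL_MAX_SLOTS);
    bases[0] = l->cache_cache;
    bases[1] = l->size_ac;
    bases[2] = l->size_l3;
    memset(used, 0, sizeof(used));

    for (b = 0; b < 3; b++) {
        for (node = 0; node < MODEL_MAX_NUMNODES; node++) {
            int idx = bases[b] + node;

            if (idx >= l->num_init_lists || used[idx])
                return 0; /* out of range or shared with another cache */
            used[idx] = 1;
        }
    }
    return 1;
}
```

In this model the old layout fails the check as soon as cache_cache uses per-node slots (its node 1 slot collides with `SIZE_AC`), while the new layout passes.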

2008-01-23 21:03:08

by Pekka Enberg

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

Hi,

On Jan 23, 2008 9:52 PM, Nishanth Aravamudan <[email protected]> wrote:
> On at least one of the machines in question, wasn't it the case that
> node 0 had all the memory and node 1 had all the CPUs? In that case, you
> would have to boot off a memoryless node? And as long as that is a
> physically valid configuration, the kernel should handle it.

Agreed. Here's the patch that should fix it:

http://lkml.org/lkml/2008/1/23/332

On Jan 23, 2008 9:52 PM, Nishanth Aravamudan <[email protected]> wrote:
> I bet we didn't notice this breaking because SLUB became the default and
> SLAB isn't on in the test.kernel.org testing, for instance. Perhaps we
> should add a second set of runs for some of the boxes there to run with
> CONFIG_SLAB on?

Sure.

On Jan 23, 2008 9:52 PM, Nishanth Aravamudan <[email protected]> wrote:
> I'm curious if we know, for sure, of a kernel with CONFIG_SLAB=y that
> has booted all of the boxes reporting issues? That is, did they all work
> with 2.6.23?

I think Mel said that their configuration did work with 2.6.23
although I also wonder how that's possible. AFAIK there have been some
changes in the page allocator that might explain this. That is, if
kmem_getpages() returned pages for a memoryless node before, bootstrap
would have worked.

Pekka

2008-01-23 21:14:38

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Pekka Enberg wrote:

> I think Mel said that their configuration did work with 2.6.23
> although I also wonder how that's possible. AFAIK there have been some
> changes in the page allocator that might explain this. That is, if
> kmem_getpages() returned pages for a memoryless node before, bootstrap
> would have worked.

Regular kmem_getpages is called with GFP_THISNODE set. There was some
breakage in 2.6.22 and before with GFP_THISNODE returning pages from the
wrong node if a node had no memory. So it may have worked accidentally and
in an unsafe manner because the pages would have been associated with the
wrong node which could trigger bug ons and locking troubles.

2008-01-23 21:36:51

by Nishanth Aravamudan

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On 23.01.2008 [13:14:26 -0800], Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Pekka Enberg wrote:
>
> > I think Mel said that their configuration did work with 2.6.23
> > although I also wonder how that's possible. AFAIK there have been some
> > changes in the page allocator that might explain this. That is, if
> > kmem_getpages() returned pages for a memoryless node before, bootstrap
> > would have worked.
>
> Regular kmem_getpages is called with GFP_THISNODE set. There was some
> breakage in 2.6.22 and before with GFP_THISNODE returning pages from
> the wrong node if a node had no memory. So it may have worked
> accidentally and in an unsafe manner because the pages would have been
> associated with the wrong node which could trigger bug ons and locking
> troubles.

Right, so it might have functioned before, but the correctness was
wobbly at best... Certainly the memoryless patch series has tightened
that up, but we missed these SLAB issues.

I see that your patch fixed Olaf's machine, Pekka. Nice work on
everyone's part tracking this stuff down.

Thanks,
Nish

2008-01-24 03:13:50

by Christoph Lameter

Subject: Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

On Wed, 23 Jan 2008, Nishanth Aravamudan wrote:

> Right, so it might have functioned before, but the correctness was
> wobbly at best... Certainly the memoryless patch series has tightened
> that up, but we missed these SLAB issues.
>
> I see that your patch fixed Olaf's machine, Pekka. Nice work on
> everyone's part tracking this stuff down.

Another important result is that I found that GFP_THISNODE is actually
required for proper SLAB operation and not only an optimization. Fallback
can lead to very bad results. I have two customer reported instances of
SLAB corruption here that can be explained now due to fallback to another
node. Foreign objects enter the per cpu queue. The wrong node lock is
taken during cache_flusharray(). Fields in the struct slab can become
corrupted. It typically hits the list field and the inuse field.
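
The failure mode described above can be illustrated with a small model. It assumes, as a simplification, that flushing a per-cpu queue takes only the lock of the CPU's local node and then updates the slab of every queued object; `struct model_object` and `model_count_foreign()` are hypothetical names. Any object whose slab lives on another node is then modified without that node's lock held.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a queued object: only records which node its slab lives on. */
struct model_object {
    int home_node;
};

/*
 * Counts the queued objects that would be freed under the wrong lock
 * if the flush takes only local_node's list lock.
 */
static size_t model_count_foreign(const struct model_object *queue,
                                  size_t n, int local_node)
{
    size_t i, foreign = 0;

    assert(queue != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (queue[i].home_node != local_node)
            foreign++;
    return foreign;
}
```

A non-zero result in this model corresponds to the case where slab fields on another node are updated while a different CPU may hold that node's lock.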