2006-10-13 18:41:44

by Will Schmidt

Subject: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Hi Folks,
Am seeing a crash on a power5 LPAR when booting the linux-2.6 git
tree. It's fairly early during boot, so I've included the whole log
below. This partition has 8 procs, (shared, including threads), and
512M RAM.

A bisect claims:
765c4507af71c39aba21006bbd3ec809fe9714ff is first bad commit
commit 765c4507af71c39aba21006bbd3ec809fe9714ff
Author: Christoph Lameter <[email protected]>
Date: Wed Sep 27 01:50:08 2006 -0700

[PATCH] GFP_THISNODE for the slab allocator

Am willing to dig deeper, but looking for pointers on what to poke next.

Thanks,
-Will

-----------------------------------------------------
ppc64_pft_size = 0x18
physicalMemorySize = 0x22000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0x1ffff
-----------------------------------------------------
Linux version 2.6.19-rc1-gb8a3ad5b (willschm@airbag2) (gcc version 4.1.0
(SUSE Linux)) #56 SMP Fri Oct 13 13:06:18 CDT 2006
[boot]0012 Setup Arch
No ramdisk, default root is /dev/sda2
EEH: No capable adapters found
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 139264
Normal 139264 -> 139264
early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 90112
1: 90112 -> 139264
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 136576
Kernel command line: root=/dev/sda3 xmon=on
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
freeing bootmem node 0
freeing bootmem node 1
Memory: 530256k/557056k available (5508k kernel code, 30468k reserved,
2224k data, 543k bss, 244k init)
kernel BUG in __cache_alloc_node
at /development/kernels/linux-2.6.git/mm/slab.c:3177!
cpu 0x0: Vector: 700 (Program Check) at [c0000000007938d0]
pc: c0000000000b3c78: .__cache_alloc_node+0x44/0x1e8
lr: c0000000000b3ec8: .fallback_alloc+0xac/0xf0
sp: c000000000793b50
msr: 8000000000021032
current = 0xc000000000583a90
paca = 0xc000000000584300
pid = 0, comm = swapper
kernel BUG in __cache_alloc_node
at /development/kernels/linux-2.6.git/mm/slab.c:3177!
enter ? for help
[c000000000793c00] c0000000000b3ec8 .fallback_alloc+0xac/0xf0
[c000000000793ca0] c0000000000b4478 .kmem_cache_zalloc+0xc8/0x11c
[c000000000793d40] c0000000000b6624 .kmem_cache_create+0x1e8/0x5e0
[c000000000793e30] c00000000053e834 .kmem_cache_init+0x1d8/0x4b0
[c000000000793ef0] c000000000524748 .start_kernel+0x244/0x328
[c000000000793f90] c0000000000084f8 .start_here_common+0x54/0x5c
0:mon>


2006-10-13 19:05:24

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 13 Oct 2006, Will Schmidt wrote:

> Am seeing a crash on a power5 LPAR when booting the linux-2.6 git
> tree. It's fairly early during boot, so I've included the whole log
> below. This partition has 8 procs, (shared, including threads), and
> 512M RAM.

This looks like slab bootstrap. Are you bootstrapping with zonelists built
from zones that will only be populated later? That will lead to incorrect
NUMA placement of lots of slab structures at bootup.

Check if the patch below may cure the oops. Your memory is likely
still placed on the wrong NUMA nodes, since we have to fall back from
the intended node.

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2006-10-13 11:59:55.000000000 -0700
+++ linux-2.6/mm/slab.c 2006-10-13 12:03:15.000000000 -0700
@@ -3154,7 +3154,8 @@ void *fallback_alloc(struct kmem_cache *

for (z = zonelist->zones; *z && !obj; z++)
if (zone_idx(*z) <= ZONE_NORMAL &&
- cpuset_zone_allowed(*z, flags))
+ cpuset_zone_allowed(*z, flags) &&
+ (*z)->free_pages)
obj = __cache_alloc_node(cache,
flags | __GFP_THISNODE,
zone_to_nid(*z));

2006-10-13 19:53:54

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 2006-13-10 at 12:05 -0700, Christoph Lameter wrote:
> On Fri, 13 Oct 2006, Will Schmidt wrote:
>
> > Am seeing a crash on a power5 LPAR when booting the linux-2.6 git
> > tree. It's fairly early during boot, so I've included the whole log
> > below. This partition has 8 procs, (shared, including threads), and
> > 512M RAM.
>
> This looks like slab bootstrap. You are bootstrapping while having
> zonelists build with zones that are only going to be populated later?
> This will lead to incorrect NUMA placement of lots of slab structures on
> bootup.

I don't think so, but it's not an area I'm very familiar with. One
of the other PPC folks might chime in with something here.

>
> Check if the patch below may cure the oops. Your memory is likely
> still placed on the wrong numa nodes since we have to fallback from
> the intended node.

Nope, no change with this patch.

>
> Index: linux-2.6/mm/slab.c
> ===================================================================
> --- linux-2.6.orig/mm/slab.c 2006-10-13 11:59:55.000000000 -0700
> +++ linux-2.6/mm/slab.c 2006-10-13 12:03:15.000000000 -0700
> @@ -3154,7 +3154,8 @@ void *fallback_alloc(struct kmem_cache *
>
> for (z = zonelist->zones; *z && !obj; z++)
> if (zone_idx(*z) <= ZONE_NORMAL &&
> - cpuset_zone_allowed(*z, flags))
> + cpuset_zone_allowed(*z, flags) &&
> + (*z)->free_pages)
> obj = __cache_alloc_node(cache,
> flags | __GFP_THISNODE,
> zone_to_nid(*z));

2006-10-13 20:57:28

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 2006-13-10 at 14:53 -0500, Will Schmidt wrote:
> On Fri, 2006-13-10 at 12:05 -0700, Christoph Lameter wrote:
> > On Fri, 13 Oct 2006, Will Schmidt wrote:
> >
> > > Am seeing a crash on a power5 LPAR when booting the linux-2.6 git
> > > tree. It's fairly early during boot, so I've included the whole log
> > > below. This partition has 8 procs, (shared, including threads), and
> > > 512M RAM.
> >
> > This looks like slab bootstrap. You are bootstrapping while having
> > zonelists build with zones that are only going to be populated later?
> > This will lead to incorrect NUMA placement of lots of slab structures on
> > bootup.
>
> I dont think so.. but it's not an area I'm very familiar with. one
> of the other PPC folks might chime in with something here.
>
> >
> > Check if the patch below may cure the oops. Your memory is likely
> > still placed on the wrong numa nodes since we have to fallback from
> > the intended node.
>
> Nope, no change with this patch.
>

Here is another boot log, with that patch applied, and with a numa=debug
parm.

-----------------------------------------------------
ppc64_pft_size = 0x18
physicalMemorySize = 0x22000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0x1ffff
-----------------------------------------------------
Linux version 2.6.19-rc1-gb8a3ad5b-dirty (willschm@airbag2) (gcc version
4.1.0 (SUSE Linux)) #60 SMP Fri Oct 13 14:48:20 CDT 2006
[boot]0012 Setup Arch
NUMA associativity depth for CPU/Memory: 3
adding cpu 0 to node 0
node 0
NODE_DATA() = c000000015ffee80
start_paddr = 8000000
end_paddr = 16000000
bootmap_paddr = 15ffc000
reserve_bootmem ffc0000 40000
reserve_bootmem 15ffc000 2000
reserve_bootmem 15ffee80 1180
node 1
NODE_DATA() = c000000021ff7c80
start_paddr = 0
end_paddr = 22000000
bootmap_paddr = 21ff2000
reserve_bootmem 0 847000
reserve_bootmem 264b000 9000
reserve_bootmem 77b2000 84e000
reserve_bootmem 21ff2000 5000
reserve_bootmem 21ff7c80 1180
reserve_bootmem 21ff8e58 71a4
No ramdisk, default root is /dev/sda2
EEH: No capable adapters found
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 139264
Normal 139264 -> 139264
early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 90112
1: 90112 -> 139264
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 136576
Kernel command line: root=/dev/sda3 xmon=on numa=debug
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
freeing bootmem node 0
freeing bootmem node 1
Memory: 530256k/557056k available (5508k kernel code, 30468k reserved,
2224k data, 543k bss, 244k init)
kernel BUG in __cache_alloc_node
at /development/kernels/linux-2.6.git/mm/slab.c:3178!
cpu 0x0: Vector: 700 (Program Check) at [c0000000007938d0]
pc: c0000000000b3c78: .__cache_alloc_node+0x44/0x1e8
lr: c0000000000b3ed4: .fallback_alloc+0xb8/0xfc
sp: c000000000793b50
msr: 8000000000021032
current = 0xc000000000583a90
paca = 0xc000000000584300
pid = 0, comm = swapper
kernel BUG in __cache_alloc_node
at /development/kernels/linux-2.6.git/mm/slab.c:3178!
enter ? for help
[c000000000793c00] c0000000000b3ed4 .fallback_alloc+0xb8/0xfc
[c000000000793ca0] c0000000000b4484 .kmem_cache_zalloc+0xc8/0x11c
[c000000000793d40] c0000000000b6630 .kmem_cache_create+0x1e8/0x5e0
[c000000000793e30] c00000000053e834 .kmem_cache_init+0x1d8/0x4b0
[c000000000793ef0] c000000000524748 .start_kernel+0x244/0x328
[c000000000793f90] c0000000000084f8 .start_here_common+0x54/0x5c
0:mon>


2006-10-13 21:22:10

by Nathan Lynch

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Will Schmidt wrote:
> On Fri, 2006-13-10 at 14:53 -0500, Will Schmidt wrote:
> > On Fri, 2006-13-10 at 12:05 -0700, Christoph Lameter wrote:
> > > On Fri, 13 Oct 2006, Will Schmidt wrote:
> > >
> > > > Am seeing a crash on a power5 LPAR when booting the linux-2.6 git
> > > > tree. It's fairly early during boot, so I've included the whole log
> > > > below. This partition has 8 procs, (shared, including threads), and
> > > > 512M RAM.
> > >
> > > This looks like slab bootstrap. You are bootstrapping while having
> > > zonelists build with zones that are only going to be populated later?
> > > This will lead to incorrect NUMA placement of lots of slab structures on
> > > bootup.
> >
> > I dont think so.. but it's not an area I'm very familiar with. one
> > of the other PPC folks might chime in with something here.
> >
> > >
> > > Check if the patch below may cure the oops. Your memory is likely
> > > still placed on the wrong numa nodes since we have to fallback from
> > > the intended node.
> >
> > Nope, no change with this patch.
> >
>
> Here is another boot log, with that patch applied, and with a numa=debug
> parm.
>
> -----------------------------------------------------
> ppc64_pft_size = 0x18
> physicalMemorySize = 0x22000000
> ppc64_caches.dcache_line_size = 0x80
> ppc64_caches.icache_line_size = 0x80
> htab_address = 0x0000000000000000
> htab_hash_mask = 0x1ffff
> -----------------------------------------------------
> Linux version 2.6.19-rc1-gb8a3ad5b-dirty (willschm@airbag2) (gcc version
> 4.1.0 (SUSE Linux)) #60 SMP Fri Oct 13 14:48:20 CDT 2006
> [boot]0012 Setup Arch
> NUMA associativity depth for CPU/Memory: 3
> adding cpu 0 to node 0
> node 0
> NODE_DATA() = c000000015ffee80
> start_paddr = 8000000
> end_paddr = 16000000
> bootmap_paddr = 15ffc000
> reserve_bootmem ffc0000 40000
> reserve_bootmem 15ffc000 2000
> reserve_bootmem 15ffee80 1180
> node 1
> NODE_DATA() = c000000021ff7c80
> start_paddr = 0
> end_paddr = 22000000

Strange, node 0 appears to be in the middle of node 1.

2006-10-13 21:35:44

by Anton Blanchard

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!


Hi,

> Strange, node 0 appears to be in the middle of node 1.

It's an odd setup and may be a firmware issue, but I've seen it a number of
times on POWER5 boxes.

Anton

2006-10-13 22:00:58

by Mike Kravetz

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, Oct 13, 2006 at 04:22:02PM -0500, Nathan Lynch wrote:
> Will Schmidt wrote:
> > NUMA associativity depth for CPU/Memory: 3
> > adding cpu 0 to node 0
> > node 0
> > NODE_DATA() = c000000015ffee80
> > start_paddr = 8000000
> > end_paddr = 16000000
> > bootmap_paddr = 15ffc000
> > reserve_bootmem ffc0000 40000
> > reserve_bootmem 15ffc000 2000
> > reserve_bootmem 15ffee80 1180
> > node 1
> > NODE_DATA() = c000000021ff7c80
> > start_paddr = 0
> > end_paddr = 22000000
>
> Strange, node 0 appears to be in the middle of node 1.

IIRC, this is fairly common. Or, it was on the system/LPAR I had access
to. I'd check again, but I lost easy access to that system. :(

--
Mike

2006-10-13 22:22:25

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Here is another fallback fix, checking whether the slab has already been set
up for this node. MPOL_INTERLEAVE could redirect the allocation.

Index: linux-2.6.19-rc1-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc1-mm1.orig/mm/slab.c 2006-10-10 21:47:12.949563383 -0500
+++ linux-2.6.19-rc1-mm1/mm/slab.c 2006-10-13 17:21:31.937863714 -0500
@@ -3158,12 +3158,15 @@ void *fallback_alloc(struct kmem_cache *
struct zone **z;
void *obj = NULL;

- for (z = zonelist->zones; *z && !obj; z++)
+ for (z = zonelist->zones; *z && !obj; z++) {
+ int nid = zone_to_nid(*z);
+
if (zone_idx(*z) <= ZONE_NORMAL &&
- cpuset_zone_allowed(*z, flags))
+ cpuset_zone_allowed(*z, flags) &&
+ cache->nodelists[nid])
obj = __cache_alloc_node(cache,
- flags | __GFP_THISNODE,
- zone_to_nid(*z));
+ flags | __GFP_THISNODE, nid);
+ }
return obj;
}


2006-10-16 16:01:09

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 2006-13-10 at 15:22 -0700, Christoph Lameter wrote:
> Here is another fall back fix checking if the slab has already been setup
> for this node. MPOL_INTERLEAVE could redirect the allocation.
>

With this patch applied, a different error in the same area:

freeing bootmem node 0
freeing bootmem node 1
Memory: 530256k/557056k available (5508k kernel code, 30468k reserved,
2224k data, 543k bss, 244k init)
Kernel panic - not syncing: kmem_cache_create(): failed to create slab
`size-32'



> Index: linux-2.6.19-rc1-mm1/mm/slab.c
> ===================================================================
> --- linux-2.6.19-rc1-mm1.orig/mm/slab.c 2006-10-10 21:47:12.949563383 -0500
> +++ linux-2.6.19-rc1-mm1/mm/slab.c 2006-10-13 17:21:31.937863714 -0500
> @@ -3158,12 +3158,15 @@ void *fallback_alloc(struct kmem_cache *
> struct zone **z;
> void *obj = NULL;
>
> - for (z = zonelist->zones; *z && !obj; z++)
> + for (z = zonelist->zones; *z && !obj; z++) {
> + int nid = zone_to_nid(*z);
> +
> if (zone_idx(*z) <= ZONE_NORMAL &&
> - cpuset_zone_allowed(*z, flags))
> + cpuset_zone_allowed(*z, flags) &&
> + cache->nodelists[nid])
> obj = __cache_alloc_node(cache,
> - flags | __GFP_THISNODE,
> - zone_to_nid(*z));
> + flags | __GFP_THISNODE, nid);
> + }
> return obj;
> }
>
>

2006-10-16 19:20:20

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!


Here is the content of /sys/devices/system/node/* and /proc/meminfo.
This is from the same partition, booted with a 2.6.16-ish distro
kernel.

Notice that the node1/meminfo MemUsed value seems just a little bit
elevated. MemFree being larger than MemTotal seems a bit wrong too.

14:07:43 0 willschm@airbag2:~> find /sys/devices/system/node -type f
-print -exec cat {} \;
/sys/devices/system/node/node1/distance
20 10
/sys/devices/system/node/node1/numastat
numa_hit 6279
numa_miss 141588
numa_foreign 0
interleave_hit 5218
local_node 0
other_node 147867
/sys/devices/system/node/node1/meminfo

Node 1 MemTotal: 327680 kB
Node 1 MemFree: 435704 kB
Node 1 MemUsed: 18446744073709443592 kB
Node 1 Active: 41412 kB
Node 1 Inactive: 19976 kB
Node 1 HighTotal: 0 kB
Node 1 HighFree: 0 kB
Node 1 LowTotal: 327680 kB
Node 1 LowFree: 435704 kB
Node 1 Dirty: 0 kB
Node 1 Writeback: 0 kB
Node 1 Mapped: 0 kB
Node 1 Slab: 0 kB
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
/sys/devices/system/node/node1/cpumap
00000000,00000000,00000000,00000000
/sys/devices/system/node/node0/distance
10 20
/sys/devices/system/node/node0/numastat
numa_hit 0
numa_miss 0
numa_foreign 141749
interleave_hit 0
local_node 0
other_node 0
/sys/devices/system/node/node0/meminfo

Node 0 MemTotal: 229376 kB
Node 0 MemFree: 0 kB
Node 0 MemUsed: 229376 kB
Node 0 Active: 0 kB
Node 0 Inactive: 0 kB
Node 0 HighTotal: 0 kB
Node 0 HighFree: 0 kB
Node 0 LowTotal: 229376 kB
Node 0 LowFree: 0 kB
Node 0 Dirty: 8 kB
Node 0 Writeback: 0 kB
Node 0 Mapped: 33940 kB
Node 0 Slab: 25500 kB
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
/sys/devices/system/node/node0/cpumap
00000000,00000000,00000000,000000ff

---
14:07:45 0 willschm@airbag2:~> cat /proc/meminfo
MemTotal: 531628 kB
MemFree: 436000 kB
Buffers: 2880 kB
Cached: 35156 kB
SwapCached: 0 kB
Active: 41364 kB
Inactive: 19976 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 531628 kB
LowFree: 436000 kB
SwapTotal: 803240 kB
SwapFree: 803240 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 33776 kB
Slab: 25332 kB
CommitLimit: 1069052 kB
Committed_AS: 81980 kB
PageTables: 1088 kB
VmallocTotal: 8589934592 kB
VmallocUsed: 2560 kB
VmallocChunk: 8589931608 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 16384 kB





2006-10-16 19:25:22

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Mon, 16 Oct 2006, Will Schmidt wrote:

> Node 1 MemTotal: 327680 kB
> Node 1 MemFree: 435704 kB

Too big.

> Node 1 MemUsed: 18446744073709443592 kB

MemUsed is going negative?

> Node 1 Active: 41412 kB
> Node 1 Inactive: 19976 kB
> Node 1 HighTotal: 0 kB
> Node 1 HighFree: 0 kB
> Node 1 LowTotal: 327680 kB
> Node 1 LowFree: 435704 kB
> Node 1 Dirty: 0 kB
> Node 1 Writeback: 0 kB
> Node 1 Mapped: 0 kB
> Node 1 Slab: 0 kB

Zero slab??? That cannot be; the slab allocator always allocates on each
node. Or is this <2.6.18 with the strange counters that we had before?


> Node 0 MemTotal: 229376 kB
> Node 0 MemFree: 0 kB
> Node 0 MemUsed: 229376 kB

Node 0 is filled up during bootup?

2006-10-16 20:50:33

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Mon, 2006-16-10 at 12:25 -0700, Christoph Lameter wrote:


> zero slab??? That cannot be. The slab allocator always allocs on each
> node. Or is this <2.6.18 with the strange counters that we had before?

This is output from 2.6.18-rc2. MemFree, MemTotal, MemUsed still
wrong. Node0 slab is still zero. I've also attached the numa=debug
boot log from this boot, in case it has any clues that were missing from
the other boot log.

15:40:53 0 willschm@airbag2:~> find /sys/devices/system/node/ -type f -exec cat {} \;
20 10
numa_hit 4952
numa_miss 152776
numa_foreign 0
interleave_hit 3176
local_node 0
other_node 157761

Node 1 MemTotal: 327680 kB
Node 1 MemFree: 441136 kB
Node 1 MemUsed: 18446744073709438160 kB
Node 1 Active: 39008 kB
Node 1 Inactive: 18040 kB
Node 1 HighTotal: 0 kB
Node 1 HighFree: 0 kB
Node 1 LowTotal: 327680 kB
Node 1 LowFree: 441136 kB
Node 1 Dirty: 0 kB
Node 1 Writeback: 0 kB
Node 1 FilePages: 39868 kB
Node 1 Mapped: 15080 kB
Node 1 AnonPages: 17172 kB
Node 1 PageTables: 956 kB
Node 1 NFS Unstable: 0 kB
Node 1 Bounce: 0 kB
Node 1 Slab: 26036 kB
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
00000000,00000000,00000000,00000000
10 20
numa_hit 0
numa_miss 0
numa_foreign 152941
interleave_hit 0
local_node 0
other_node 0

Node 0 MemTotal: 229376 kB
Node 0 MemFree: 0 kB
Node 0 MemUsed: 229376 kB
Node 0 Active: 0 kB
Node 0 Inactive: 0 kB
Node 0 HighTotal: 0 kB
Node 0 HighFree: 0 kB
Node 0 LowTotal: 229376 kB
Node 0 LowFree: 0 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 0 kB
Node 0 Mapped: 0 kB
Node 0 AnonPages: 0 kB
Node 0 PageTables: 0 kB
Node 0 NFS Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 Slab: 0 kB
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
00000000,00000000,00000000,000000ff
15:40:57 0 willschm@airbag2:~>

-
-----------------------------------------------------
ppc64_pft_size = 0x18
physicalMemorySize = 0x22000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0x1ffff
-----------------------------------------------------
Linux version 2.6.18-rc2 (willschm@airbag2) (gcc version 4.1.0 (SUSE
Linux)) #1 SMP Mon Oct 16 15:27:37 CDT 2006
[boot]0012 Setup Arch
NUMA associativity depth for CPU/Memory: 3
add_region nid 1 start_pfn 0x0 pages 0x8000
add_region nid 0 start_pfn 0x8000 pages 0x2000
add_region nid 0 start_pfn 0xa000 pages 0x2000
add_region nid 0 start_pfn 0xc000 pages 0x2000
add_region nid 0 start_pfn 0xe000 pages 0x2000
add_region nid 0 start_pfn 0x10000 pages 0x2000
add_region nid 0 start_pfn 0x12000 pages 0x2000
add_region nid 0 start_pfn 0x14000 pages 0x2000
add_region nid 1 start_pfn 0x16000 pages 0x2000
add_region nid 1 start_pfn 0x18000 pages 0x2000
add_region nid 1 start_pfn 0x1a000 pages 0x2000
add_region nid 1 start_pfn 0x1c000 pages 0x2000
add_region nid 1 start_pfn 0x1e000 pages 0x2000
add_region nid 1 start_pfn 0x20000 pages 0x2000
Node 0 Memory: 0x8000000-0x16000000
Node 1 Memory: 0x0-0x8000000 0x16000000-0x22000000
adding cpu 0 to node 0
node 0
NODE_DATA() = c000000015ffd780
start_paddr = 8000000
end_paddr = 16000000
bootmap_paddr = 15ffb000
free_bootmem 8000000 e000000
reserve_bootmem ffc0000 40000
reserve_bootmem 15ffb000 2000
reserve_bootmem 15ffd780 2880
node 1
NODE_DATA() = c000000021ff6580
start_paddr = 0
end_paddr = 22000000
bootmap_paddr = 21ff1000
free_bootmem 0 8000000
free_bootmem 16000000 c000000
reserve_bootmem 0 802000
reserve_bootmem 2606000 9000
reserve_bootmem 77b2000 84e000
reserve_bootmem 21ff1000 5000
reserve_bootmem 21ff6580 2880
reserve_bootmem 21ff8e58 71a4
No ramdisk, default root is /dev/sda2
EEH: No capable adapters found
PPC64 nvram contains 7168 bytes
Using shared processor idle loop
free_area_init node 0 e000 8000 (hole: 0)
On node 0 totalpages: 57344
DMA zone: 57344 pages, LIFO batch:15
free_area_init node 1 22000 0 (hole: e000)
On node 1 totalpages: 81920
DMA zone: 81920 pages, LIFO batch:15
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 139264


2006-10-16 23:37:38

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Mon, 16 Oct 2006, Will Schmidt wrote:

> This is output from 2.6.18-rc2. MemFree, MemTotal, MemUsed still
> wrong. Node0 slab is still zero. I've also attached the numa=debug
> boot log from this boot, in case it has any clues that were missing from
> the other boot log.

It looks as if node 0 is already full at bootup. The new code in 2.6.19
controls locality in the slab more strictly. 2.6.18 and earlier could
tolerate a request to the slab allocator for a page on node 0 returning
memory on node 1, even if node 1 had not been bootstrapped yet. But this
caused a problem in the slab, because the node lists dedicated to node 0
then contained memory from node 1, which led to latency problems: slab code
subsequently assumes that node-local memory is very fast, and with
corrupted per-node lists that is no longer true.

You must bootstrap on a node that has memory available. If you would
bootstrap the slab on node 1 that would work.

> Node 0 MemTotal: 229376 kB
> Node 0 MemUsed: 229376 kB

^^^^^ This node should not be full!!!

Increase memory on node 0 so that the slab can bootstrap.

2006-10-18 06:11:57

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph,

I also am hitting this BUG on a POWER5 partition. The relevant boot
messages are:

Zone PFN ranges:
DMA 0 -> 524288
Normal 524288 -> 524288
early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 278528
1: 278528 -> 524288
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 513760
Kernel command line: root=/dev/sdc3
[snip]
freeing bootmem node 0
freeing bootmem node 1
Memory: 2046852k/2097152k available (5512k kernel code, 65056k reserved, 2204k data, 554k bss, 256k init)
kernel BUG in __cache_alloc_node at /home/paulus/kernel/powerpc/mm/slab.c:3177!

Since this is a virtualized system there is every possibility that the
memory we get won't be divided into nodes in the nice neat manner you
seem to be expecting. It just depends on what memory the hypervisor
has free, and on what nodes, when the partition is booted.

In other words, the assumption that node pfn ranges won't overlap is
completely untenable for us.

Linus' tree is currently broken for us. Any suggestions for how to
fix it, since I am not very familiar with the NUMA code?

Paul.

2006-10-18 15:13:05

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Wed, 18 Oct 2006, Paul Mackerras wrote:

> Since this is a virtualized system there is every possibility that the
> memory we get won't be divided into nodes in the nice neat manner you
> seem to be expecting. It just depends on what memory the hypervisor
> has free, and on what nodes, when the partition is booted.

The only expectation is that memory is available on the node that you are
bootstrapping the slab allocator from.

> In other words, the assumption that node pfn ranges won't overlap is
> completely untenable for us.

That does not matter for this problem.

> Linus' tree is currently broken for us. Any suggestions for how to
> fix it, since I am not very familiar with the NUMA code?

Have memory available for slab bootstrap on node 0? Or modify the boot
code in such a way that it runs on node 1 or any other node that has
memory available.


2006-10-18 16:06:40

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Wed, 18 Oct 2006, Paul Mackerras wrote:

> Linus' tree is currently broken for us. Any suggestions for how to
> fix it, since I am not very familiar with the NUMA code?

I am not very familiar with the powerpc code, and what I have here is
conjecture pieced together from various messages. It would help to get some
clarification on what is going on with node 0 memory. Is there really no
memory available from node 0 at bootup? Why is this?

If this is the case then you already have had issues for long time with
per node memory lists being contaminated on bootup.

Why would you attempt to boot Linux on a node without memory?

2006-10-18 21:19:39

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter writes:

> > Linus' tree is currently broken for us. Any suggestions for how to
> > fix it, since I am not very familiar with the NUMA code?
>
> Have memory available for slab boot strap on node 0? Or modify the boot
> code in such a way that it runs on node 1 or any other node that has
> memory available.

OK, then I don't understand. There is about 1GB of memory on node 0,
which is about half of the partition's memory, and it is even in a
contiguous chunk, but it doesn't start at pfn 0:

early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 278528
1: 278528 -> 524288

So it's not that node 0 doesn't have any pages. Any other clues?

Thanks,
Paul.

2006-10-18 21:27:00

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Paul Mackerras wrote:

> > Have memory available for slab boot strap on node 0? Or modify the boot
> > code in such a way that it runs on node 1 or any other node that has
> > memory available.
>
> OK, then I don't understand. There is about 1GB of memory on node 0,
> which is about half of the partition's memory, and it is even in a
> contiguous chunk, but it doesn't start at pfn 0:

And the memory is available? Some messages showed that all of node 0's
memory was allocated at bootup! We end up in fallback_alloc, which means
that an allocation attempt failed to obtain memory. Could you figure out
what exactly we are trying to allocate? Add some printks? Why do we
fall back?

> So it's not that node 0 doesn't have any pages. Any other clues?

We are falling back. So something is going wrong. Either we request memory
from an overallocated node or the page allocator for some other reason is
not giving us the requested memory. If we figure out why then the fix is
probably very simple.

I have no way of investigating the issue except by conjecture and code
review since I have no ppc hardware.

2006-10-18 21:49:48

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Here is a patch to add some printks to try to figure out what is going on.
Run with this and send me the console output leading up to the failure.


Index: linux-2.6.19-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-17 18:43:47.000000000 -0500
+++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-18 16:47:42.904912835 -0500
@@ -2005,6 +2005,7 @@ static int setup_cpu_cache(struct kmem_c
return enable_cpucache(cachep);

if (g_cpucache_up == NONE) {
+ printk(KERN_CRIT "setup_cpu_cache: NONE\n");
/*
* Note: the first kmem_cache_create must create the cache
* that's used by kmalloc(24), otherwise the creation of
@@ -2023,6 +2024,7 @@ static int setup_cpu_cache(struct kmem_c
else
g_cpucache_up = PARTIAL_AC;
} else {
+ printk(KERN_CRIT "setup_cpu_cache: PARTIAL\n");
cachep->array[smp_processor_id()] =
kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);

@@ -2219,6 +2221,7 @@ kmem_cache_create (const char *name, siz
align = ralign;

/* Get cache's description obj. */
+ printk(KERN_CRIT "Get cache descritor\n");
cachep = kmem_cache_zalloc(&cache_cache, SLAB_KERNEL);
if (!cachep)
goto oops;
@@ -3082,6 +3085,7 @@ static inline void *____cache_alloc(stru
void *objp;
struct array_cache *ac;

+ printk(KERN_CRIT "__cache_alloc\n");
check_irq_off();
ac = cpu_cache_get(cachep);
if (likely(ac->avail)) {
@@ -3135,6 +3139,7 @@ static void *alternate_node_alloc(struct
{
int nid_alloc, nid_here;

+ printk(KERN_CRIT "alternate_node_alloc\n");
if (in_interrupt() || (flags & __GFP_THISNODE))
return NULL;
nid_alloc = nid_here = numa_node_id();
@@ -3160,6 +3165,7 @@ void *fallback_alloc(struct kmem_cache *
struct zone **z;
void *obj = NULL;

+ printk(KERN_CRIT "fallback_alloc\n");
for (z = zonelist->zones; *z && !obj; z++)
if (zone_idx(*z) <= ZONE_NORMAL &&
cpuset_zone_allowed(*z, flags))
@@ -3181,6 +3187,8 @@ static void *__cache_alloc_node(struct k
void *obj;
int x;

+ printk("__cache_alloc_node %d\n", nodeid);
+
l3 = cachep->nodelists[nodeid];
BUG_ON(!l3);

2006-10-19 05:04:19

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter writes:

> Here is patch to add some printk to try to figure out what is going on.
> Run with this and send me the console output leading up to the failure.

Here... Thanks for your help on this. I'll poke a bit further.

Linux version 2.6.19-rc2-test (paulus@drongo) (gcc version 4.1.2 20060928 (prerelease) (Debian 4.1.1-15)) #37 SMP Thu Oct 19 14:05:18 EST 2006
[boot]0012 Setup Arch
No ramdisk, default root is /dev/sda2
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 524288
Normal 524288 -> 524288
early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 278528
1: 278528 -> 524288
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 513760
Kernel command line: root=/dev/sdc3
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
freeing bootmem node 0
freeing bootmem node 1
Memory: 2046852k/2097152k available (5512k kernel code, 65056k reserved, 2204k data, 554k bss, 256k init)
Get cache descritor
__cache_alloc
__cache_alloc_node 0
fallback_alloc
__cache_alloc_node 0
__cache_alloc_node 1
kernel BUG in __cache_alloc_node at /home/paulus/kernel/powerpc/mm/slab.c:3185!

2006-10-19 16:16:42

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Paul Mackerras wrote:

> Get cache descritor

Attempt to allocate the first descriptor for the first cache.

> __cache_alloc

Attempt to allocate from the caches of node 0 (which are empty on
bootstrap). We try to replenish the caches of node 0 which should have
succeeded. I guess that this failed due to no pages available on
node 0. This should not happen!

It worked before 2.6.19 because the slab allocator allowed the page
allocator to fallback to node 1. However, we then put pages from node 1
on the per node lists for node 0. This was fixed in 2.6.19 using
GFP_THISNODE.

> __cache_alloc_node 0

Now we go to __cache_alloc_node because it knows how to get memory from
different nodes (we should not get here at all; there should be memory on
node 0!)

> fallback_alloc

We failed another attempt to get memory from node 0. Now we are going down
the zonelist.

> __cache_alloc_node 0

First attempt on node 0 (the head of the fallback list) which again has no
pages available.

> __cache_alloc_node 1

Attempt to allocate from node 1 (second zone on the fallback list)

> kernel BUG in __cache_alloc_node at /home/paulus/kernel/powerpc/mm/slab.c:3185!

Node 1 has not been setup yet since we have not completed bootstrap so we
BUG out.

Would you please make memory available on the node that you bootstrap
the slab allocator on? numa_node_id() must point to a node that has memory
available.

2006-10-19 16:33:10

by Anton Blanchard

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!


Hi,

> Would you please make memory available on the node that you bootstrap
> the slab allocator on? numa_node_id() must point to a node that has memory
> available.

So we've gone from something that worked around sub-optimal memory
layouts to something that panics. Sounds like a step backwards to me.

Anton

2006-10-19 16:49:14

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 20 Oct 2006, Anton Blanchard wrote:

> > Would you please make memory available on the node that you bootstrap
> > the slab allocator on? numa_node_id() must point to a node that has memory
> > available.
>
> So we've gone from something that worked around sub optimal memory
> layouts to something that panics. Sounds like a step backwards to me.

Could you confirm that there is indeed no memory on node 0?

The expectation to have memory available on the node that you
bootstrap on is not unrealistic.

2006-10-19 17:03:16

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

I would expect this patch to fix your issues. It will allow fallback
allocations to occur in the page allocator during slab bootstrap. This
means your per-node queues will be contaminated, as they were before. After
the slab allocator is fully booted, the per-node queues will gradually
become node clean.

I think it would be better if the PPC arch would fix this issue
by either making memory available on node 0 or setting up node 1 as
the boot node.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.19-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-19 11:54:24.000000000 -0500
+++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-19 11:59:24.208194796 -0500
@@ -1589,7 +1589,10 @@ static void *kmem_getpages(struct kmem_c
* the needed fallback ourselves since we want to serve from our
* per node object lists first for other nodes.
*/
- flags |= cachep->gfpflags | GFP_THISNODE;
+ if (g_cpucache_up != FULL)
+ flags |= cachep->gfpflags;
+ else
+ flags |= cachep->gfpflags | GFP_THISNODE;

page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)

2006-10-19 18:07:21

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Christoph Lameter wrote:

> I would expect this patch to fix your issues. This will allow fallback
> allocations to occur in the page allocator during slab bootstrap. This
> means your per node queues will be contaminated as they were before. After
> the slab allocator is fully booted, the per node queues will gradually
> become node clean.

Forgot to mention the results of this contamination: the bootstrap process
exercises fine control over data structures, placing them so that the
slab allocator can perform optimally. F.e. data structures are placed on
nodes in such a way that a kmalloc does not need a single off-node
reference.

The contamination will disrupt this placement. The slab believes that
memory is from a different node than where it actually came from. As a
result key data structures (such as cpucache descriptors) are placed
on the wrong node. kmalloc and other slab operations may require
off node allocations for every call. Depending on the NUMA factor this may
have a significant influence on overall system performance (We have
measured this effect to cause a drop of 20% in AIM7 performance!).

In addition to this stuff, I am right now dealing with huge page
fault serialization (introduced to safely support DB2) and sparsemem
continually causing nested table lookups in fundamental vm operations. All
work of IBM people. Not interested in performance at all?

2006-10-19 20:37:18

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 2006-19-10 at 10:03 -0700, Christoph Lameter wrote:
> I would expect this patch to fix your issues. This will allow fallback
> allocations to occur in the page allocator during slab bootstrap. This
> means your per node queues will be contaminated as they were before. After
> the slab allocator is fully booted, the per node queues will gradually
> become node clean.
>
> I think it would be better if the PPC arch would fix this issue
> by either making memory available on node 0 or setting up node 1 as
> the boot node.
>

This didn't fix the problem on my box. I tried it against both -mm and
linux-2.6.git.



> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6.19-rc2-mm1/mm/slab.c
> ===================================================================
> --- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-19 11:54:24.000000000 -0500
> +++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-19 11:59:24.208194796 -0500
> @@ -1589,7 +1589,10 @@ static void *kmem_getpages(struct kmem_c
> * the needed fallback ourselves since we want to serve from our
> * per node object lists first for other nodes.
> */
> - flags |= cachep->gfpflags | GFP_THISNODE;
> + if (g_cpucache_up != FULL)
> + flags |= cachep->gfpflags;
> + else
> + flags |= cachep->gfpflags | GFP_THISNODE;
>
> page = alloc_pages_node(nodeid, flags, cachep->gfporder);
> if (!page)

2006-10-19 20:38:34

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 2006-19-10 at 09:16 -0700, Christoph Lameter wrote:
> On Thu, 19 Oct 2006, Paul Mackerras wrote:
>
> > Get cache descritor
>
> Attempt to allocate the first descriptor for the first cache.
>
> > __cache_alloc
>
> Attempt to allocate from the caches of node 0 (which are empty on
> bootstrap). We try to replenish the caches of node 0 which should have
> succeeded. I guess that this failed due to no pages available on
> node 0. This should not happen!

Is there a hook where we can see what/where the memory is going? Does
it seem reasonable for all of the memory that is in node 0 to be
consumed?
Mine appears to have...
Node 0 MemTotal: 229376 kB
Node 0 MemFree: 0 kB
Node 0 MemUsed: 229376 kB

And one of Paul's earlier notes mentioned having about a gig of RAM on node 0.

-Will



2006-10-19 21:29:07

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Will Schmidt wrote:

> This didnt fix the problem on my box. I tried this both against mm and
> linux-2.6.git

Same failure condition? Would you also apply the printk patch and send
me the output?

2006-10-19 21:30:38

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Will Schmidt wrote:

> Is there a hook where we can see what/where the memory is going? Does
> it seem reasonable for all of the memory that is in node 0 to be
> consumed?
> Mine appears to have...
> Node 0 MemTotal: 229376 kB
> Node 0 MemFree: 0 kB
> Node 0 MemUsed: 229376 kB

The memory is likely consumed before the slab allocator bootstrap code is
reached.

> And one of Paul's earlier notes mentioned about a gig of ram on node0;

Yeah. I cannot make sense out of all of this. What is so special about
node 0?


2006-10-19 21:39:15

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Will Schmidt wrote:

> This didnt fix the problem on my box. I tried this both against mm and
> linux-2.6.git

GFP_THISNODE is also set at a higher level for fallback, but it should not
be set for the initial allocation. If you try this with the debug printks,
then please use this patch to make sure that all allocs fall back:

Index: linux-2.6.19-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-19 11:54:24.000000000 -0500
+++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-19 16:32:09.454825851 -0500
@@ -1589,7 +1589,10 @@ static void *kmem_getpages(struct kmem_c
* the needed fallback ourselves since we want to serve from our
* per node object lists first for other nodes.
*/
- flags |= cachep->gfpflags | GFP_THISNODE;
+ if (g_cpucache_up != FULL)
+ flags |= cachep->gfpflags & ~__GFP_THISNODE;
+ else
+ flags |= cachep->gfpflags | GFP_THISNODE;

page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)

2006-10-19 21:43:32

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 2006-19-10 at 14:28 -0700, Christoph Lameter wrote:
> On Thu, 19 Oct 2006, Will Schmidt wrote:
>
> > This didnt fix the problem on my box. I tried this both against mm and
> > linux-2.6.git
>
> Same failure condition? Would you also apply the printk patch and send
> me the output?

Yup, here it is:

-----------------------------------------------------
ppc64_pft_size = 0x18
physicalMemorySize = 0x22000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0x1ffff
-----------------------------------------------------
Linux version 2.6.19-rc2-mm1 (willschm@airbag2) (gcc version 4.1.0 (SUSE
Linux)) #2 SMP Thu Oct 19 16:37:26 CDT 2006
[boot]0012 Setup Arch
NUMA associativity depth for CPU/Memory: 3
adding cpu 0 to node 0
node 0
NODE_DATA() = c000000015ffed80
start_paddr = 8000000
end_paddr = 16000000
bootmap_paddr = 15ffc000
reserve_bootmem ffc0000 40000
reserve_bootmem 15ffc000 2000
reserve_bootmem 15ffed80 1280
node 1
NODE_DATA() = c000000021ff7b80
start_paddr = 0
end_paddr = 22000000
bootmap_paddr = 21ff2000
reserve_bootmem 0 851000
reserve_bootmem 2655000 9000
reserve_bootmem 77b2000 84e000
reserve_bootmem 21ff2000 5000
reserve_bootmem 21ff7b80 1280
reserve_bootmem 21ff8e58 71a4
No ramdisk, default root is /dev/sda2
EEH: No capable adapters found
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 139264
Normal 139264 -> 139264
early_node_map[3] active PFN ranges
1: 0 -> 32768
0: 32768 -> 90112
1: 90112 -> 139264
[boot]0015 Setup Done
Built 2 zonelists. Total pages: 136576
Kernel command line: root=/dev/sda3 xmon=on numa=debug
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
freeing bootmem node 0
freeing bootmem node 1
Memory: 530216k/557056k available (5544k kernel code, 30508k reserved,
2232k data, 548k bss, 248k init)
Get cache descritor
__cache_alloc
__cache_alloc_node 0
fallback_alloc
__cache_alloc_node 0
__cache_alloc_node 1
kernel BUG in __cache_alloc_node
at /development/kernels/2.6-mm/mm/slab.c:3193!
cpu 0x0: Vector: 700 (Program Check) at [c00000000079b8d0]
pc: c0000000000b70f8: .__cache_alloc_node+0x5c/0x208
lr: c0000000000b70e0: .__cache_alloc_node+0x44/0x208
sp: c00000000079bb50
msr: 8000000000021032
current = 0xc00000000058ca90
paca = 0xc00000000058d380
pid = 0, comm = swapper
kernel BUG in __cache_alloc_node
at /development/kernels/2.6-mm/mm/slab.c:3193!
enter ? for help
[c00000000079bc00] c0000000000b735c .fallback_alloc+0xb8/0xfc
[c00000000079bca0] c0000000000b7930 .kmem_cache_zalloc+0xd4/0x128
[c00000000079bd40] c0000000000b9af4 .kmem_cache_create+0x1f4/0x604
[c00000000079be30] c000000000546d98 .kmem_cache_init+0x1d8/0x4b0
[c00000000079bef0] c00000000052c748 .start_kernel+0x244/0x328
[c00000000079bf90] c0000000000084f8 .start_here_common+0x54/0x5c
0:mon>

2006-10-19 22:00:28

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Thu, 19 Oct 2006, Will Schmidt wrote:

> Get cache descritor
> __cache_alloc
> __cache_alloc_node 0

Hmmm... Still no fallback? Weird, would you apply the other patch that
filters the __GFP_THISNODE flag and try again? Could you try to add some
printk's to the page allocator to figure out what is going on there? Or is
it clear that the node is overallocated?

Could it be that the node online mask contains a node that has not been
bootstrapped yet?

Don't you have someone who can debug this? This is kind of an awkward back
and forth, with me guessing what the system does. Is there someone with
knowledge of the way NUMA is implemented in the arch code?



2006-10-19 22:23:59

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter writes:

> Could you confirm that there is indeed no memory on node 0?

There is about a gigabyte of memory on node 0.

> The expectation to have memory available on the node that you
> bootstrap on is not unrealistic.

What exactly does "available" mean in this context? The console log I
posted earlier showed node 0 as having an active PFN range of 32768 -
278528 (245760 pages, or 960MB), and then showed a "freeing bootmem
node 0" message, *before* we hit the BUG.

If "available" doesn't mean "there are active pages which have been
given to the VM system via free_all_bootmem_node()", what does it
mean?

Paul.

2006-10-19 22:31:26

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 20 Oct 2006, Paul Mackerras wrote:

> What exactly does "available" mean in this context? The console log I
> posted earlier showed node 0 as having an active PFN range of 32768 -
> 278528 (245760 pages, or 960MB), and then showed a "freeing bootmem
> node 0" message, *before* we hit the BUG.

Available in the sense that the page allocator can allocate from them.
Will's console output shows that all memory of node 0 is allocated and not
available.

> If "available" doesn't mean "there are active pages which have been
> given to the VM system via free_all_bootmem_node()", what does it
> mean?

The page allocator must be running and able to serve pages from the boot
node. This fails for some reason, so the slab cannot bootstrap. Memory not
being available is the first guess. Could you trace the allocation in the
page allocator (__alloc_pages) when the slab attempts to bootstrap and
figure out why exactly the allocation fails?


2006-10-20 07:18:55

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter writes:

> The page allocator must be running and able to serve pages from the boot
> node. This fails for some reason and the slab cannot bootstrap. The memory
> not available is the first guess. Could you trace the allocation in the
> page allocator (__alloc_pages) when the slab attempts to bootstrap and
> figure out why exactly the allocation fails?

What is happening is that all pages are getting their zone id field in
their page->flags set to point to zone for node 1 by memmap_init_zone
calling set_page_links (which does set_page_zone). Thus, when those
pages get freed by free_all_bootmem_node, they all end up in the zone
for node 1.

memmap_init_zone is called (as memmap_init, since we don't have
__HAVE_ARCH_MEMMAP_INIT defined) from init_currently_empty_zone, which
is called from free_area_init_core. Now the thing is that memmap_init
and init_currently_empty_zone are called with the node's start PFN and
size in pages, *including* holes. On the partition I'm using we have
these PFN ranges for the nodes:

1: 0 -> 32768
0: 32768 -> 278528
1: 278528 -> 524288

So node 0's start PFN is 32768 and its size is 245760 pages, and so we
correctly set pages 32768 to 278527 to be in the zone for node 0.
Then for node 1, we have the start PFN is 0 and the size is 524288, so
we then go through and set *all* pages of memory to be in the zone for
node 1, including the pages which are actually on node 0.

That's why we can't allocate any pages on node 0, and the kmem cache
bootstrapping blows up.

I don't know this code well enough to know what the correct fix is.
Clearly memmap_init_zone should only be touching the pages that are
actually present in the zone, but I don't know exactly what data
structures it should be using to know what those pages are.

Paul.

2006-10-20 14:20:07

by Andy Whitcroft

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Paul Mackerras wrote:
> Christoph Lameter writes:
>
>> The page allocator must be running and able to serve pages from the boot
>> node. This fails for some reason and the slab cannot bootstrap. The memory
>> not available is the first guess. Could you trace the allocation in the
>> page allocator (__alloc_pages) when the slab attempts to bootstrap and
>> figure out why exactly the allocation fails?
>
> What is happening is that all pages are getting their zone id field in
> their page->flags set to point to zone for node 1 by memmap_init_zone
> calling set_page_links (which does set_page_zone). Thus, when those
> pages get freed by free_all_bootmem_node, they all end up in the zone
> for node 1.
>
> memmap_init_zone is called (as memmap_init, since we don't have
> __HAVE_ARCH_MEMMAP_INIT defined) from init_currently_empty_zone, which
> is called from free_area_init_core. Now the thing is that memmap_init
> and init_currently_empty_zone are called with the node's start PFN and
> size in pages, *including* holes. On the partition I'm using we have
> these PFN ranges for the nodes:
>
> 1: 0 -> 32768
> 0: 32768 -> 278528
> 1: 278528 -> 524288
>
> So node 0's start PFN is 32768 and its size is 245760 pages, and so we
> correctly set pages 32786 to 278527 to be in the zone for node 0.
> Then for node 1, we have the start PFN is 0 and the size is 524288, so
> we then go through and set *all* pages of memory to be in the zone for
> node 1, including the pages which are actually on node 0.
>
> That's why we can't allocate any pages on node 0, and the kmem cache
> bootstrapping blows up.
>
> I don't know this code well enough to know what the correct fix is.
> Clearly memmap_init_zone should only be touching the pages that are
> actually present in the zone, but I don't know exactly what data
> structures it should be using to know what those pages are.

Mel Gorman and I have been poking at this from different ends: Mel from
the context of this thread, and myself trying to fix a machine that was
exhibiting this with 32MB of RAM in node 0 and the rest in node 1.

I remember that we used to have code to cope with this in the ppc64
architecture, indeed I remember reviewing it all that time ago. Looking
at the current state of the tree it was removed in the two patches below
in mainline:
"[PATCH] Remove SPAN_OTHER_NODES config definition"
"[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"

These commits:
f62859bb6871c5e4a8e591c60befc8caaf54db8c
a94b3ab7eab4edcc9b2cb474b188f774c331adf7

I'll follow up to this email with the reversion patch we used in
testing. It seems to sort this problem out at least, though now it's
blowing up in ibmveth, so I am retesting with yet another patch. The patch
reverts the two patches above and updates the commentary on the Kconfig
entry.

-apw

2006-10-20 14:59:05

by Mike Kravetz

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, Oct 20, 2006 at 03:18:52PM +0100, Andy Whitcroft wrote:
> I remember that we used to have code to cope with this in the ppc64
> architecture, indeed I remember reviewing it all that time ago. Looking
> at the current state of the tree it was removed in the two patches below
> in mainline:
> "[PATCH] Remove SPAN_OTHER_NODES config definition"
> "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"

That was me. I seem to remember some discussion that these were only
needed for DISCONTIGMEM, so I removed them when the DISCONTIGMEM option
for power went away. But that is clearly NOT the case. It appears that
SPARSEMEM and the old slab code covered up the issue. Sorry about that.

Thanks!
--
Mike

2006-10-20 15:19:36

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 2006-20-10 at 15:18 +0100, Andy Whitcroft wrote:
> Paul Mackerras wrote:
> > Christoph Lameter writes:
> >

I got dropped off the CC list somewhere... :-(

If something is bouncing, let me know; otherwise please don't do
that.


> Mel Gorman and I have been poking at this from different ends. Mel from
> the context of this thread and myself trying to fix a machine which was
> exhibiting on 32MB of ram in node 0 and the rest in node 1.
>
> I remember that we used to have code to cope with this in the ppc64
> architecture, indeed I remember reviewing it all that time ago. Looking
> at the current state of the tree it was removed in the two patches below
> in mainline:
> "[PATCH] Remove SPAN_OTHER_NODES config definition"
> "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
>
> These commits:
> f62859bb6871c5e4a8e591c60befc8caaf54db8c
> a94b3ab7eab4edcc9b2cb474b188f774c331adf7
>
> I'll follow up to this email with the reversion patch we used in
> testing. It seems to sort this problem out at least, though now its
> blam'ing in ibmveth, so am retesting with yet another patch. This patch
> reverts the two patches above and updates the commentry on the Kconfig
> entry.

I've got a couple LPARs that exhibit the problem, so can verify your
patch once I see it..

-Will


>
> -apw

2006-10-20 16:01:31

by Andy Whitcroft

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Andy Whitcroft wrote:
> Paul Mackerras wrote:
>> Christoph Lameter writes:
>>
>>> The page allocator must be running and able to serve pages from the boot
>>> node. This fails for some reason and the slab cannot bootstrap. The memory
>>> not available is the first guess. Could you trace the allocation in the
>>> page allocator (__alloc_pages) when the slab attempts to bootstrap and
>>> figure out why exactly the allocation fails?
>> What is happening is that all pages are getting their zone id field in
>> their page->flags set to point to zone for node 1 by memmap_init_zone
>> calling set_page_links (which does set_page_zone). Thus, when those
>> pages get freed by free_all_bootmem_node, they all end up in the zone
>> for node 1.
>>
>> memmap_init_zone is called (as memmap_init, since we don't have
>> __HAVE_ARCH_MEMMAP_INIT defined) from init_currently_empty_zone, which
>> is called from free_area_init_core. Now the thing is that memmap_init
>> and init_currently_empty_zone are called with the node's start PFN and
>> size in pages, *including* holes. On the partition I'm using we have
>> these PFN ranges for the nodes:
>>
>> 1: 0 -> 32768
>> 0: 32768 -> 278528
>> 1: 278528 -> 524288
>>
>> So node 0's start PFN is 32768 and its size is 245760 pages, and so we
>> correctly set pages 32786 to 278527 to be in the zone for node 0.
>> Then for node 1, we have the start PFN is 0 and the size is 524288, so
>> we then go through and set *all* pages of memory to be in the zone for
>> node 1, including the pages which are actually on node 0.
>>
>> That's why we can't allocate any pages on node 0, and the kmem cache
>> bootstrapping blows up.
>>
>> I don't know this code well enough to know what the correct fix is.
>> Clearly memmap_init_zone should only be touching the pages that are
>> actually present in the zone, but I don't know exactly what data
>> structures it should be using to know what those pages are.
>
> Mel Gorman and I have been poking at this from different ends. Mel from
> the context of this thread and myself trying to fix a machine which was
> exhibiting on 32MB of ram in node 0 and the rest in node 1.
>
> I remember that we used to have code to cope with this in the ppc64
> architecture, indeed I remember reviewing it all that time ago. Looking
> at the current state of the tree it was removed in the two patches below
> in mainline:
> "[PATCH] Remove SPAN_OTHER_NODES config definition"
> "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
>
> These commits:
> f62859bb6871c5e4a8e591c60befc8caaf54db8c
> a94b3ab7eab4edcc9b2cb474b188f774c331adf7
>
> I'll follow up to this email with the reversion patch we used in
> testing. It seems to sort this problem out at least, though now its
> blam'ing in ibmveth, so am retesting with yet another patch. This patch
> reverts the two patches above and updates the commentry on the Kconfig
> entry.

Ok, I've just gotten a successful boot on this box for the first time in
like 15 git releases. I needed the three patches below:

clameter-fallback_alloc_fix2 -- from earlier in this thread, under the
message ID below:
<[email protected]>

Reintroduce-NODES_SPAN_OTHER_NODES-for-powerpc -- the patch I just
submitted, under the message ID below:
<8a76dfd735e544016c5f04c98617b87d@pinky>

ibmveth-fix-index-increment-calculation -- this patch is already in -mm.

Feel free to take this as an ACK for the patches other than mine.

Acked-by: Andy Whitcroft <[email protected]>

-apw

2006-10-20 17:14:04

by Will Schmidt

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 2006-20-10 at 17:00 +0100, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > Paul Mackerras wrote:
> >> Christoph Lameter writes:

> Ok, I've just gotten a successful boot on this box for the first time in
> like 15 git releases. I needed the three patches below:
>
> clameter-fallback_alloc_fix2 -- from earlier in this thread, under the
> message ID below:
> <[email protected]>
>
> Reintroduce-NODES_SPAN_OTHER_NODES-for-powerpc -- the patch I just
> submitted, under the message ID below:
> <8a76dfd735e544016c5f04c98617b87d@pinky>
>
> ibmveth-fix-index-increment-calculation -- this patch is already in -mm.
>
> Feel free to take this as an ACK for the patches other than mine.
>
> Acked-by: Andy Whitcroft <[email protected]>
>
> -apw

I've applied these three blobs to the linux-2.6.git tree and verified
that they do fix the problem.
And a "Thanks!" to Christoph for being responsive, even when the
problem wasn't introduced by him. :)

Acked-by: Will Schmidt <[email protected]>




2006-10-20 17:13:47

by Andrew Morton

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 20 Oct 2006 17:00:34 +0100
Andy Whitcroft <[email protected]> wrote:

> > I'll follow up to this email with the reversion patch we used in
> > testing. It seems to sort this problem out at least, though now its
> > blam'ing in ibmveth, so am retesting with yet another patch. This patch
> > reverts the two patches above and updates the commentry on the Kconfig
> > entry.
>
> Ok, I've just gotten a successful boot on this box for the first time in
> like 15 git releases. I needed the three patches below:
>
> clameter-fallback_alloc_fix2 -- from earlier in this thread, under the
> message ID below:
> <[email protected]>

That's this:

Here is another fallback fix, checking whether the slab has already been
set up for this node. MPOL_INTERLEAVE could redirect the allocation.

Index: linux-2.6.19-rc1-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc1-mm1.orig/mm/slab.c 2006-10-10 21:47:12.949563383 -0500
+++ linux-2.6.19-rc1-mm1/mm/slab.c 2006-10-13 17:21:31.937863714 -0500
@@ -3158,12 +3158,15 @@ void *fallback_alloc(struct kmem_cache *
struct zone **z;
void *obj = NULL;

- for (z = zonelist->zones; *z && !obj; z++)
+ for (z = zonelist->zones; *z && !obj; z++) {
+ int nid = zone_to_nid(*z);
+
if (zone_idx(*z) <= ZONE_NORMAL &&
- cpuset_zone_allowed(*z, flags))
+ cpuset_zone_allowed(*z, flags) &&
+ cache->nodelists[nid])
obj = __cache_alloc_node(cache,
- flags | __GFP_THISNODE,
- zone_to_nid(*z));
+ flags | __GFP_THISNODE, nid);
+ }
return obj;
}


Christoph, can you please finalise and resend that?

> Reintroduce-NODES_SPAN_OTHER_NODES-for-powerpc -- the patch I just
> submitted, under the message ID below:
> <8a76dfd735e544016c5f04c98617b87d@pinky>

OK, I got that.

> ibmveth-fix-index-increment-calculation -- this patch is already in -mm.

Normally a Jeff thing, but small-and-simple. I'll send that in to Linus
today.

2006-10-20 17:34:56

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

On Fri, 20 Oct 2006, Paul Mackerras wrote:

> What is happening is that all pages are getting their zone id field in
> their page->flags set to point to zone for node 1 by memmap_init_zone
> calling set_page_links (which does set_page_zone). Thus, when those
> pages get freed by free_all_bootmem_node, they all end up in the zone
> for node 1.

Ok. So no memory on node 0? Then my patch to reenable fallback in the slab
should have worked, but it did not. Could you retest with the patch that
Will tried? If that is not working, then comment out the 3 lines with
__GFP_THISNODE in get_page_from_freelist. That will reenable fallback
globally. If that does not work either, then I doubt that this is my issue.

> memmap_init_zone is called (as memmap_init, since we don't have
> __HAVE_ARCH_MEMMAP_INIT defined) from init_currently_empty_zone, which
> is called from free_area_init_core. Now the thing is that memmap_init
> and init_currently_empty_zone are called with the node's start PFN and
> size in pages, *including* holes. On the partition I'm using we have
> these PFN ranges for the nodes:
>
> 1: 0 -> 32768
> 0: 32768 -> 278528
> 1: 278528 -> 524288
>
> So node 0's start PFN is 32768 and its size is 245760 pages, and so we
> correctly set pages 32768 to 278527 to be in the zone for node 0.
> Then for node 1, we have the start PFN is 0 and the size is 524288, so
> we then go through and set *all* pages of memory to be in the zone for
> node 1, including the pages which are actually on node 0.

I do not get it. You first mark all pages on node 0, then we run the bootup
code, and later we shift those pages into node 0? So the slab bootstrap is
running while all pages are marked as being part of node 1, and then later
we switch those pages under it to node 0?

> I don't know this code well enough to know what the correct fix is.
> Clearly memmap_init_zone should only be touching the pages that are
> actually present in the zone, but I don't know exactly what data
> structures it should be using to know what those pages are.

The fix that I posted yesterday should have reenabled fallback in the
slab during bootstrap and should have made the system work. Here it is
again:

Index: linux-2.6.19-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-19 11:54:24.000000000 -0500
+++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-19 16:32:09.454825851 -0500
@@ -1589,7 +1589,10 @@ static void *kmem_getpages(struct kmem_c
* the needed fallback ourselves since we want to serve from our
* per node object lists first for other nodes.
*/
- flags |= cachep->gfpflags | GFP_THISNODE;
+ if (g_cpucache_up != FULL)
+ flags |= cachep->gfpflags & ~__GFP_THISNODE;
+ else
+ flags |= cachep->gfpflags | GFP_THISNODE;

page = alloc_pages_node(nodeid, flags, cachep->gfporder);
if (!page)

2006-10-20 17:48:01

by Christoph Lameter

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Here is the patch:

Slab: Do not fallback to nodes that have not been bootstrapped yet

The zonelist may contain zones of nodes that have not been bootstrapped,
and we will oops if we try to allocate from those zones. So check whether
the node information for the slab and the node has been set up before
attempting an allocation. If it has not, skip that zone.

Usually we will not encounter this situation, since the slab bootstrap
code avoids falling back before we have set up the respective nodes, but
we seem to have special needs for ppc.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.19-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-20 12:39:02.000000000 -0500
+++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-20 12:41:04.137684581 -0500
@@ -3160,12 +3160,15 @@ void *fallback_alloc(struct kmem_cache *
struct zone **z;
void *obj = NULL;

- for (z = zonelist->zones; *z && !obj; z++)
+ for (z = zonelist->zones; *z && !obj; z++) {
+ int nid = zone_to_nid(*z);
+
if (zone_idx(*z) <= ZONE_NORMAL &&
- cpuset_zone_allowed(*z, flags))
+ cpuset_zone_allowed(*z, flags) &&
+ cache->nodelists[nid])
obj = __cache_alloc_node(cache,
- flags | __GFP_THISNODE,
- zone_to_nid(*z));
+ flags | __GFP_THISNODE, nid);
+ }
return obj;
}


2006-10-20 18:08:35

by Andy Whitcroft

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter wrote:
> Here is the patch:
>
> Slab: Do not fallback to nodes that have not been bootstrapped yet
>
> The zonelist may contain zones of nodes that have not been bootstrapped,
> and we will oops if we try to allocate from those zones. So check whether
> the node information for the slab and the node has been set up before
> attempting an allocation. If it has not, skip that zone.
>
> Usually we will not encounter this situation, since the slab bootstrap
> code avoids falling back before we have set up the respective nodes, but
> we seem to have special needs for ppc.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6.19-rc2-mm1/mm/slab.c
> ===================================================================
> --- linux-2.6.19-rc2-mm1.orig/mm/slab.c 2006-10-20 12:39:02.000000000 -0500
> +++ linux-2.6.19-rc2-mm1/mm/slab.c 2006-10-20 12:41:04.137684581 -0500
> @@ -3160,12 +3160,15 @@ void *fallback_alloc(struct kmem_cache *
> struct zone **z;
> void *obj = NULL;
>
> - for (z = zonelist->zones; *z && !obj; z++)
> + for (z = zonelist->zones; *z && !obj; z++) {
> + int nid = zone_to_nid(*z);
> +
> if (zone_idx(*z) <= ZONE_NORMAL &&
> - cpuset_zone_allowed(*z, flags))
> + cpuset_zone_allowed(*z, flags) &&
> + cache->nodelists[nid])
> obj = __cache_alloc_node(cache,
> - flags | __GFP_THISNODE,
> - zone_to_nid(*z));
> + flags | __GFP_THISNODE, nid);
> + }
> return obj;
> }
>
>

Applied this and the previous version; diff says they are identical, so
my previous testing applies.

Acked-by: Andy Whitcroft <[email protected]>

-apw

2006-10-20 22:54:38

by Paul Mackerras

Subject: Re: kernel BUG in __cache_alloc_node at linux-2.6.git/mm/slab.c:3177!

Christoph Lameter writes:

> I do not get it. You first mark all pages on node 0, then we run the bootup
> code, and later we shift those pages into node 0? So the slab bootstrap is
> running while all pages are marked as being part of node 1, and then later
> we switch those pages under it to node 0?

No, the bootmem code correctly marks all the pages in node 0 as being
in node 0. Then it goes through and marks *all* pages as being in
node 1, because it marks all pages between the first and last pages in
the node as being in the node. The first page in node 1 is before all
the pages in node 0, and the last page in node 1 is after all the
pages in node 0.

So we end up with the system thinking all the memory is in node 1,
although in fact half the memory is in node 0.

Anyway, it looks like this problem wasn't introduced by your patches,
and is solved by the patch Andy Whitcroft posted, so thanks for your
assistance with this.

Paul.