Greetings!
I got this on my dual-CPU PowerMac G4 DP shortly after boot. It does not happen at every bootup though:
[...]
kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[effe5cc0] [c0784b64] dump_stack_lvl+0x80/0xac (unreliable)
[effe5ce0] [c01aef80] warn_alloc+0x100/0x178
[effe5d40] [c01af34c] __alloc_pages+0x354/0x8ac
[effe5e00] [c01af988] page_frag_alloc_align+0x68/0x17c
[effe5e30] [c063b7b4] __netdev_alloc_skb+0xb4/0x17c
[effe5e60] [be99573c] setup_rx_descbuffer+0x40/0x144 [b43legacy]
[effe5e90] [be99687c] b43legacy_dma_rx+0x20c/0x278 [b43legacy]
[effe5ee0] [be9897f4] b43legacy_interrupt_tasklet+0x580/0x5a4 [b43legacy]
[effe5f50] [c0050b34] tasklet_action_common.isra.0+0xb0/0xe8
[effe5f80] [c07aaf5c] __do_softirq+0x1dc/0x218
[effe5ff0] [c0008820] do_softirq_own_stack+0x34/0x40
[c1c555f0] [c07ff5f0] 0xc07ff5f0
[c1c55610] [c0050384] __irq_exit_rcu+0x6c/0xbc
[c1c55620] [c00507f8] irq_exit+0x10/0x20
[c1c55630] [c00045b4] HardwareInterrupt_virt+0x108/0x10c
--- interrupt: 500 at BIT_flushBits+0x1c/0x58
NIP: c0458840 LR: c0458fdc CTR: 00000000
REGS: c1c55640 TRAP: 0500 Not tainted (6.8.9-gentoo-PMacG4)
MSR: 0220b032 <VEC,EE,FP,ME,IR,DR,RI> CR: 88082202 XER: 00000000
GPR00: c0458f98 c1c55700 c1c2b9a0 c1c5575c 00000000 00000000 00000000 00000000
GPR08: 00000000 00000000 f2c1d0d5 c1c55710 88084202 00000000 f2c1e000 f35f8150
GPR16: 00000000 f34bf238 f34bf30c 00000000 0000001b 00000002 00000000 00000000
GPR24: f35f8150 00000080 f35f7950 f35f7d50 00000000 c07ff5f0 00000052 f35f5be0
NIP [c0458840] BIT_flushBits+0x1c/0x58
LR [c0458fdc] ZSTD_encodeSequences+0x2ac/0x308
--- interrupt: 500
[c1c55700] [c07ff5f0] 0xc07ff5f0 (unreliable)
[c1c55710] [c0458f98] ZSTD_encodeSequences+0x268/0x308
[c1c557b0] [c0452eb0] ZSTD_entropyCompressSeqStore.constprop.0+0x1c4/0x2bc
[c1c55840] [c045357c] ZSTD_compressBlock_internal+0xac/0x144
[c1c55870] [c04543c8] ZSTD_compressContinue_internal+0x734/0x7c0
[c1c55920] [c04559d4] ZSTD_compressEnd+0x2c/0x13c
[c1c55950] [c04574b0] ZSTD_compressStream2+0x1b8/0x508
[c1c559a0] [c045788c] ZSTD_compressStream2_simpleArgs+0x48/0x60
[c1c559e0] [c0457904] ZSTD_compress2+0x60/0x90
[c1c55a10] [c03d0abc] __zstd_compress+0x54/0x78
[c1c55a70] [c03c3794] scomp_acomp_comp_decomp+0xe8/0x16c
[c1c55aa0] [c01c4c80] zswap_store+0x4b4/0x668
[c1c55b20] [c01bc404] swap_writepage+0x3c/0xa8
[c1c55b40] [c0172dd4] pageout+0x11c/0x1ec
[c1c55bc0] [c0174e2c] shrink_folio_list+0x6b0/0x878
[c1c55c40] [c01767e8] evict_folios+0x9d4/0xd08
[c1c55d40] [c0176cec] try_to_shrink_lruvec+0x1d0/0x210
[c1c55da0] [c0176dcc] shrink_one+0xa0/0x134
[c1c55dd0] [c01786a4] shrink_node+0x248/0x844
[c1c55e50] [c0178f68] balance_pgdat+0x2c8/0x614
[c1c55f50] [c01794cc] kswapd+0x218/0x24c
[c1c55fc0] [c006aae4] kthread+0xe4/0xe8
[c1c55ff0] [c0015304] start_kernel_thread+0x10/0x14
Mem-Info:
active_anon:247864 inactive_anon:215615 isolated_anon:0
active_file:9633 inactive_file:12323 isolated_file:0
unevictable:4 dirty:0 writeback:2
slab_reclaimable:1293 slab_unreclaimable:6904
mapped:16876 shmem:155 pagetables:899
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:2224 free_pcp:596 free_cma:0
Node 0 active_anon:991456kB inactive_anon:862460kB active_file:38532kB inactive_file:49292kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:67504kB dirty:0kB writeback:8kB shmem:620kB writeback_tmp:0kB kernel_stack:1536kB pagetables:3596kB sec_pagetables:0kB all_unreclaimable? no
DMA free:1680kB boost:4096kB min:7548kB low:8408kB high:9268kB reserved_highatomic:0KB active_anon:675928kB inactive_anon:5644kB active_file:124kB inactive_file:596kB unevictable:0kB writepending:0kB present:786432kB managed:746656kB mlocked:0kB bounce:0kB free_pcp:2384kB local_pcp:1004kB free_cma:0kB
lowmem_reserve[]: 0 0 1280 1280
DMA: 2*4kB (UM) 1*8kB (U) 1*16kB (U) 18*32kB (UME) 7*64kB (UM) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 1824kB
35907 total pagecache pages
13813 pages in swap cache
Free swap = 8270180kB
[...]
There was no crash though, the G4 was still usable via VNC.
Full dmesg and kernel .config attached.
Regards,
Erhard
On Wed, 8 May 2024 20:21:11 +0200
Erhard Furtner <[email protected]> wrote:
> Greetings!
>
> I got this on my dual-CPU PowerMac G4 DP shortly after boot. It does not happen at every bootup though:
>
> [...]
> kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
A very similar page allocation failure happens on the same machine with kernel 6.9.0 too. It seems it can easily be provoked by running a memory stressor, e.g. "stress-ng --vm 2 --vm-bytes 1930M --verify -v":
[...]
kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
[c1c65960] [c01b6234] warn_alloc+0x100/0x178
[c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
[c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
[c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
[c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
[c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
[c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
[c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
[c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
[c1c65d20] [c0180a40] shrink_slab+0x174/0x260
[c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
[c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
[c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
[c1c65f50] [c017f270] kswapd+0x228/0x25c
[c1c65fc0] [c006bbac] kthread+0xe4/0xe8
[c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 15, objs: 225, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
Mem-Info:
active_anon:340071 inactive_anon:139179 isolated_anon:0
active_file:8297 inactive_file:2506 isolated_file:0
unevictable:4 dirty:1 writeback:18
slab_reclaimable:1377 slab_unreclaimable:7426
mapped:6804 shmem:112 pagetables:946
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:1141 free_pcp:7 free_cma:0
Node 0 active_anon:1360284kB inactive_anon:556716kB active_file:33188kB inactive_file:10024kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:27216kB dirty:4kB writeback:72kB shmem:448kB writeback_tmp:0kB kernel_stack:1560kB pagetables:3784kB sec_pagetables:0kB all_unreclaimable? no
DMA free:56kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:635128kB inactive_anon:58260kB active_file:268kB inactive_file:3000kB unevictable:0kB writepending:324kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:28kB local_pcp:28kB free_cma:0kB
lowmem_reserve[]: 0 0 1280 1280
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
63943 total pagecache pages
53024 pages in swap cache
Free swap = 8057248kB
Total swap = 8388604kB
524288 pages RAM
327680 pages HighMem/MovableOnly
9947 pages reserved
warn_alloc: 396 callbacks suppressed
kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 1 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
[c1c65960] [c01b6234] warn_alloc+0x100/0x178
[c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
[c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
[c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
[c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
[c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
[c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
[c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
[c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
[c1c65d20] [c0180a40] shrink_slab+0x174/0x260
[c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
[c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
[c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
[c1c65f50] [c017f270] kswapd+0x228/0x25c
slab_out_of_memory: 53 callbacks suppressed
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
[c1c65fc0] [c006bbac] kthread+0xe4/0xe8
[c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
Mem-Info:
active_anon:351976 inactive_anon:123514 isolated_anon:0
active_file:4648 inactive_file:2081 isolated_file:0
unevictable:4 dirty:1 writeback:39
slab_reclaimable:918 slab_unreclaimable:7222
mapped:5359 shmem:21 pagetables:940
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:2563 free_pcp:142 free_cma:0
Node 0 active_anon:1407904kB inactive_anon:494056kB active_file:18592kB inactive_file:8324kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:21436kB dirty:4kB writeback:156kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3760kB sec_pagetables:0kB all_unreclaimable? no
DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:199336kB inactive_anon:491432kB active_file:4612kB inactive_file:5980kB unevictable:0kB writepending:660kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:568kB local_pcp:20kB free_cma:0kB
lowmem_reserve[]: 0 0 1280 1280
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
45961 total pagecache pages
39207 pages in swap cache
Free swap = 8093096kB
Total swap = 8388604kB
524288 pages RAM
327680 pages HighMem/MovableOnly
9947 pages reserved
warn_alloc: 343 callbacks suppressed
kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
[c1c65960] [c01b6234] warn_alloc+0x100/0x178
[c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
[c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
[c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
[c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
[c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
slab_out_of_memory: 59 callbacks suppressed
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
node 0: slabs: 18, objs: 270, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
node 0: slabs: 33, objs: 165, free: 0
[c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
[c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
[c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
[c1c65d20] [c0180a40] shrink_slab+0x174/0x260
[c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
[c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
[c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
[c1c65f50] [c017f270] kswapd+0x228/0x25c
[c1c65fc0] [c006bbac] kthread+0xe4/0xe8
[c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
Mem-Info:
active_anon:235002 inactive_anon:240975 isolated_anon:0
active_file:4356 inactive_file:2551 isolated_file:0
unevictable:4 dirty:7 writeback:19
slab_reclaimable:1008 slab_unreclaimable:7218
mapped:5601 shmem:21 pagetables:939
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:1340 free_pcp:23 free_cma:0
Node 0 active_anon:940008kB inactive_anon:963900kB active_file:17424kB inactive_file:10204kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:22404kB dirty:28kB writeback:76kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3756kB sec_pagetables:0kB all_unreclaimable? no
DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:644060kB inactive_anon:36332kB active_file:5276kB inactive_file:5516kB unevictable:0kB writepending:348kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:92kB local_pcp:92kB free_cma:0kB
lowmem_reserve[]: 0 0 1280 1280
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
116345 total pagecache pages
109413 pages in swap cache
Free swap = 7819300kB
Total swap = 8388604kB
524288 pages RAM
327680 pages HighMem/MovableOnly
9947 pages reserved
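For reference, this is roughly how I trigger it. Only a sketch of my setup; the sysfs paths are the standard zswap module parameters and assume zswap is built into the kernel:

  # confirm swap and zswap are active before starting the stressor
  swapon --show
  cat /sys/module/zswap/parameters/enabled
  cat /sys/module/zswap/parameters/compressor
  cat /sys/module/zswap/parameters/zpool

  # two VM workers dirtying ~1930M each, with verification enabled
  stress-ng --vm 2 --vm-bytes 1930M --verify -v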
I switched the zswap default compressor from zstd to lzo, so zstd does not show up in the dmesg. But the rest looks pretty similar.
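(In case anyone wants to reproduce that part of the setup: the compressor can also be switched at runtime via the zswap module parameter, assuming lzo support is available in the kernel; selecting CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO in the .config or passing zswap.compressor=lzo on the kernel command line has the same effect.)

  # check the active zswap compressor and switch it to lzo on the fly
  cat /sys/module/zswap/parameters/compressor
  echo lzo > /sys/module/zswap/parameters/compressor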
Full dmesg and kernel .config attached.
Regards,
Erhard
On Wed, May 15, 2024 at 2:45 PM Erhard Furtner <[email protected]> wrote:
>
> On Wed, 8 May 2024 20:21:11 +0200
> Erhard Furtner <[email protected]> wrote:
>
> > Greetings!
> >
> > I got this on my dual-CPU PowerMac G4 DP shortly after boot. It does not happen at every bootup though:
> >
> > [...]
> > kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>
> A very similar page allocation failure happens on the same machine with kernel 6.9.0 too. It seems it can easily be provoked by running a memory stressor, e.g. "stress-ng --vm 2 --vm-bytes 1930M --verify -v":
>
> [...]
> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> [c1c65f50] [c017f270] kswapd+0x228/0x25c
> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> Mem-Info:
> active_anon:340071 inactive_anon:139179 isolated_anon:0
> active_file:8297 inactive_file:2506 isolated_file:0
> unevictable:4 dirty:1 writeback:18
> slab_reclaimable:1377 slab_unreclaimable:7426
> mapped:6804 shmem:112 pagetables:946
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:1141 free_pcp:7 free_cma:0
> Node 0 active_anon:1360284kB inactive_anon:556716kB active_file:33188kB inactive_file:10024kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:27216kB dirty:4kB writeback:72kB shmem:448kB writeback_tmp:0kB kernel_stack:1560kB pagetables:3784kB sec_pagetables:0kB all_unreclaimable? no
> DMA free:56kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:635128kB inactive_anon:58260kB active_file:268kB inactive_file:3000kB unevictable:0kB writepending:324kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:28kB local_pcp:28kB free_cma:0kB
> lowmem_reserve[]: 0 0 1280 1280
> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> 63943 total pagecache pages
> 53024 pages in swap cache
> Free swap = 8057248kB
> Total swap = 8388604kB
> 524288 pages RAM
> 327680 pages HighMem/MovableOnly
> 9947 pages reserved
> warn_alloc: 396 callbacks suppressed
> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 1 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> [c1c65f50] [c017f270] kswapd+0x228/0x25c
> slab_out_of_memory: 53 callbacks suppressed
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> Mem-Info:
> active_anon:351976 inactive_anon:123514 isolated_anon:0
> active_file:4648 inactive_file:2081 isolated_file:0
> unevictable:4 dirty:1 writeback:39
> slab_reclaimable:918 slab_unreclaimable:7222
> mapped:5359 shmem:21 pagetables:940
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:2563 free_pcp:142 free_cma:0
> Node 0 active_anon:1407904kB inactive_anon:494056kB active_file:18592kB inactive_file:8324kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:21436kB dirty:4kB writeback:156kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3760kB sec_pagetables:0kB all_unreclaimable? no
> DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:199336kB inactive_anon:491432kB active_file:4612kB inactive_file:5980kB unevictable:0kB writepending:660kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:568kB local_pcp:20kB free_cma:0kB
> lowmem_reserve[]: 0 0 1280 1280
> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> 45961 total pagecache pages
> 39207 pages in swap cache
> Free swap = 8093096kB
> Total swap = 8388604kB
> 524288 pages RAM
> 327680 pages HighMem/MovableOnly
> 9947 pages reserved
> warn_alloc: 343 callbacks suppressed
> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> slab_out_of_memory: 59 callbacks suppressed
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> node 0: slabs: 18, objs: 270, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> node 0: slabs: 33, objs: 165, free: 0
> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> [c1c65f50] [c017f270] kswapd+0x228/0x25c
> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> Mem-Info:
> active_anon:235002 inactive_anon:240975 isolated_anon:0
> active_file:4356 inactive_file:2551 isolated_file:0
> unevictable:4 dirty:7 writeback:19
> slab_reclaimable:1008 slab_unreclaimable:7218
> mapped:5601 shmem:21 pagetables:939
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:1340 free_pcp:23 free_cma:0
> Node 0 active_anon:940008kB inactive_anon:963900kB active_file:17424kB inactive_file:10204kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:22404kB dirty:28kB writeback:76kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3756kB sec_pagetables:0kB all_unreclaimable? no
> DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:644060kB inactive_anon:36332kB active_file:5276kB inactive_file:5516kB unevictable:0kB writepending:348kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:92kB local_pcp:92kB free_cma:0kB
> lowmem_reserve[]: 0 0 1280 1280
> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> 116345 total pagecache pages
> 109413 pages in swap cache
> Free swap = 7819300kB
> Total swap = 8388604kB
> 524288 pages RAM
> 327680 pages HighMem/MovableOnly
> 9947 pages reserved
>
>
> > I switched the zswap default compressor from zstd to lzo, so zstd does not show up in the dmesg. But the rest looks pretty similar.
>
> Full dmesg and kernel .config attached.
>
> Regards,
> Erhard
Hi Erhard,
Thanks for the reports. I'll take a look at them and get back to you
in a few days.
On Wed, May 15, 2024 at 4:06 PM Yu Zhao <[email protected]> wrote:
>
> On Wed, May 15, 2024 at 2:45 PM Erhard Furtner <[email protected]> wrote:
> >
> > On Wed, 8 May 2024 20:21:11 +0200
> > Erhard Furtner <[email protected]> wrote:
> >
> > > Greetings!
> > >
> > > I got this on my dual-CPU PowerMac G4 DP shortly after boot. It does not happen at every bootup though:
> > >
> > > [...]
> > > kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> > > CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
> > > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> >
> > A very similar page allocation failure happens on the same machine with kernel 6.9.0 too. It seems it can easily be provoked by running a memory stressor, e.g. "stress-ng --vm 2 --vm-bytes 1930M --verify -v":
> >
> > [...]
> > kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> > Call Trace:
> > [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> > [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> > [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> > [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> > [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> > [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> > [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> > [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> > [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> > [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> > [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> > [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> > [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> > [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> > [c1c65f50] [c017f270] kswapd+0x228/0x25c
> > [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> > [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > Mem-Info:
> > active_anon:340071 inactive_anon:139179 isolated_anon:0
> > active_file:8297 inactive_file:2506 isolated_file:0
> > unevictable:4 dirty:1 writeback:18
> > slab_reclaimable:1377 slab_unreclaimable:7426
> > mapped:6804 shmem:112 pagetables:946
> > sec_pagetables:0 bounce:0
> > kernel_misc_reclaimable:0
> > free:1141 free_pcp:7 free_cma:0
> > Node 0 active_anon:1360284kB inactive_anon:556716kB active_file:33188kB inactive_file:10024kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:27216kB dirty:4kB writeback:72kB shmem:448kB writeback_tmp:0kB kernel_stack:1560kB pagetables:3784kB sec_pagetables:0kB all_unreclaimable? no
> > DMA free:56kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:635128kB inactive_anon:58260kB active_file:268kB inactive_file:3000kB unevictable:0kB writepending:324kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:28kB local_pcp:28kB free_cma:0kB
> > lowmem_reserve[]: 0 0 1280 1280
> > DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> > 63943 total pagecache pages
> > 53024 pages in swap cache
> > Free swap = 8057248kB
> > Total swap = 8388604kB
> > 524288 pages RAM
> > 327680 pages HighMem/MovableOnly
> > 9947 pages reserved
> > warn_alloc: 396 callbacks suppressed
> > kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 1 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> > Call Trace:
> > [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> > [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> > [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> > [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> > [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> > [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> > [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> > [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> > [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> > [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> > [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> > [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> > [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> > [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> > [c1c65f50] [c017f270] kswapd+0x228/0x25c
> > slab_out_of_memory: 53 callbacks suppressed
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> > [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> > Mem-Info:
> > active_anon:351976 inactive_anon:123514 isolated_anon:0
> > active_file:4648 inactive_file:2081 isolated_file:0
> > unevictable:4 dirty:1 writeback:39
> > slab_reclaimable:918 slab_unreclaimable:7222
> > mapped:5359 shmem:21 pagetables:940
> > sec_pagetables:0 bounce:0
> > kernel_misc_reclaimable:0
> > free:2563 free_pcp:142 free_cma:0
> > Node 0 active_anon:1407904kB inactive_anon:494056kB active_file:18592kB inactive_file:8324kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:21436kB dirty:4kB writeback:156kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3760kB sec_pagetables:0kB all_unreclaimable? no
> > DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:199336kB inactive_anon:491432kB active_file:4612kB inactive_file:5980kB unevictable:0kB writepending:660kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:568kB local_pcp:20kB free_cma:0kB
> > lowmem_reserve[]: 0 0 1280 1280
> > DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> > 45961 total pagecache pages
> > 39207 pages in swap cache
> > Free swap = 8093096kB
> > Total swap = 8388604kB
> > 524288 pages RAM
> > 327680 pages HighMem/MovableOnly
> > 9947 pages reserved
> > warn_alloc: 343 callbacks suppressed
> > kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> > Call Trace:
> > [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> > [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> > [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> > [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> > [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> > [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> > [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> > slab_out_of_memory: 59 callbacks suppressed
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
> > node 0: slabs: 18, objs: 270, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> > cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
> > kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
> > node 0: slabs: 33, objs: 165, free: 0
> > [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> > [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> > [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> > [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> > [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> > [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> > [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> > [c1c65f50] [c017f270] kswapd+0x228/0x25c
> > [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> > [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> > Mem-Info:
> > active_anon:235002 inactive_anon:240975 isolated_anon:0
> > active_file:4356 inactive_file:2551 isolated_file:0
> > unevictable:4 dirty:7 writeback:19
> > slab_reclaimable:1008 slab_unreclaimable:7218
> > mapped:5601 shmem:21 pagetables:939
> > sec_pagetables:0 bounce:0
> > kernel_misc_reclaimable:0
> > free:1340 free_pcp:23 free_cma:0
> > Node 0 active_anon:940008kB inactive_anon:963900kB active_file:17424kB inactive_file:10204kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:22404kB dirty:28kB writeback:76kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3756kB sec_pagetables:0kB all_unreclaimable? no
> > DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:644060kB inactive_anon:36332kB active_file:5276kB inactive_file:5516kB unevictable:0kB writepending:348kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:92kB local_pcp:92kB free_cma:0kB
> > lowmem_reserve[]: 0 0 1280 1280
> > DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> > 116345 total pagecache pages
> > 109413 pages in swap cache
> > Free swap = 7819300kB
> > Total swap = 8388604kB
> > 524288 pages RAM
> > 327680 pages HighMem/MovableOnly
> > 9947 pages reserved
> >
> >
> > I switched the zswap default compressor from zstd to lzo, so zstd does not show up in the dmesg. But the rest looks pretty similar.
> >
> > Full dmesg and kernel .config attached.
> >
> > Regards,
> > Erhard
>
> Hi Erhard,
>
> Thanks for the reports. I'll take a look at them and get back to you
> in a few days.
Hi Erhard,
The OOM kills on both kernel versions seem to be reasonable to me.
Your system has 2GB memory and it uses zswap with zsmalloc (which is
good since it can allocate from the highmem zone) and zstd/lzo (which
doesn't matter much). Somehow -- I couldn't figure out why -- it
splits the 2GB into a 0.75GB DMA zone and a 1.25GB highmem zone:
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
[ 0.000000] Normal empty
[ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
The kernel can't allocate from the highmem zone -- only userspace and
zsmalloc can. The OOM kills were due to the low memory conditions in the
DMA zone, the only zone the kernel itself can allocate from.
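To double check on your side: the per-zone free pages and watermarks are
exported in /proc/zoneinfo, so something like the one-liner below (just a
sketch) should show the DMA zone sitting at or below its min watermark
while HighMem still has plenty of free pages:

  # print free pages plus min/low/high watermarks per zone (values are 4k pages)
  awk '/^Node/ {z=$NF} /pages free/ || /^ +(min|low|high) / {print z, $0}' /proc/zoneinfo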
Do you know a kernel version that doesn't have OOM kills while running
the same workload? If so, could you send that .config to me? If not,
could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
of ideas at the moment.)
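Something along these lines should turn it off in your existing .config
before rebuilding (untested on my side; olddefconfig will pick defaults for
anything that depended on it):

  # run from the kernel source tree, starting from your current .config
  ./scripts/config --disable HIGHMEM
  make olddefconfig
  grep HIGHMEM .config   # expect "# CONFIG_HIGHMEM is not set"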
Thanks!
On 01.06.24 08:01, Yu Zhao wrote:
> On Wed, May 15, 2024 at 4:06 PM Yu Zhao <[email protected]> wrote:
>>
>> On Wed, May 15, 2024 at 2:45 PM Erhard Furtner <[email protected]> wrote:
>>>
>>> On Wed, 8 May 2024 20:21:11 +0200
>>> Erhard Furtner <[email protected]> wrote:
>>>
>>>> Greetings!
>>>>
>>>> I got this on my dual-CPU PowerMac G4 DP shortly after boot. It does not happen at every bootup though:
>>>>
>>>> [...]
>>>> kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
>>>> CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
>>>> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>>>
>>> A very similar page allocation failure happens on the same machine with kernel 6.9.0 too. It seems it can easily be provoked by running a memory stressor, e.g. "stress-ng --vm 2 --vm-bytes 1930M --verify -v":
>>>
>>> [...]
>>> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
>>> CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
>>> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>>> Call Trace:
>>> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
>>> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
>>> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
>>> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
>>> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
>>> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
>>> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
>>> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
>>> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
>>> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
>>> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
>>> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
>>> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
>>> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
>>> [c1c65f50] [c017f270] kswapd+0x228/0x25c
>>> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
>>> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 15, objs: 225, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> Mem-Info:
>>> active_anon:340071 inactive_anon:139179 isolated_anon:0
>>> active_file:8297 inactive_file:2506 isolated_file:0
>>> unevictable:4 dirty:1 writeback:18
>>> slab_reclaimable:1377 slab_unreclaimable:7426
>>> mapped:6804 shmem:112 pagetables:946
>>> sec_pagetables:0 bounce:0
>>> kernel_misc_reclaimable:0
>>> free:1141 free_pcp:7 free_cma:0
>>> Node 0 active_anon:1360284kB inactive_anon:556716kB active_file:33188kB inactive_file:10024kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:27216kB dirty:4kB writeback:72kB shmem:448kB writeback_tmp:0kB kernel_stack:1560kB pagetables:3784kB sec_pagetables:0kB all_unreclaimable? no
>>> DMA free:56kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:635128kB inactive_anon:58260kB active_file:268kB inactive_file:3000kB unevictable:0kB writepending:324kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:28kB local_pcp:28kB free_cma:0kB
>>> lowmem_reserve[]: 0 0 1280 1280
>>> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>>> 63943 total pagecache pages
>>> 53024 pages in swap cache
>>> Free swap = 8057248kB
>>> Total swap = 8388604kB
>>> 524288 pages RAM
>>> 327680 pages HighMem/MovableOnly
>>> 9947 pages reserved
>>> warn_alloc: 396 callbacks suppressed
>>> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
>>> CPU: 1 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
>>> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>>> Call Trace:
>>> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
>>> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
>>> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
>>> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
>>> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
>>> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
>>> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
>>> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
>>> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
>>> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
>>> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
>>> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
>>> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
>>> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
>>> [c1c65f50] [c017f270] kswapd+0x228/0x25c
>>> slab_out_of_memory: 53 callbacks suppressed
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
>>> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
>>> Mem-Info:
>>> active_anon:351976 inactive_anon:123514 isolated_anon:0
>>> active_file:4648 inactive_file:2081 isolated_file:0
>>> unevictable:4 dirty:1 writeback:39
>>> slab_reclaimable:918 slab_unreclaimable:7222
>>> mapped:5359 shmem:21 pagetables:940
>>> sec_pagetables:0 bounce:0
>>> kernel_misc_reclaimable:0
>>> free:2563 free_pcp:142 free_cma:0
>>> Node 0 active_anon:1407904kB inactive_anon:494056kB active_file:18592kB inactive_file:8324kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:21436kB dirty:4kB writeback:156kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3760kB sec_pagetables:0kB all_unreclaimable? no
>>> DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:199336kB inactive_anon:491432kB active_file:4612kB inactive_file:5980kB unevictable:0kB writepending:660kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:568kB local_pcp:20kB free_cma:0kB
>>> lowmem_reserve[]: 0 0 1280 1280
>>> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>>> 45961 total pagecache pages
>>> 39207 pages in swap cache
>>> Free swap = 8093096kB
>>> Total swap = 8388604kB
>>> 524288 pages RAM
>>> 327680 pages HighMem/MovableOnly
>>> 9947 pages reserved
>>> warn_alloc: 343 callbacks suppressed
>>> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
>>> CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
>>> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>>> Call Trace:
>>> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
>>> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
>>> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
>>> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
>>> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
>>> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
>>> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
>>> slab_out_of_memory: 59 callbacks suppressed
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: skbuff_small_head, object size: 480, buffer size: 544, default order: 1, min order: 0
>>> node 0: slabs: 18, objs: 270, free: 0
>>> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>>> cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default order: 3, min order: 1
>>> kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to disable.
>>> node 0: slabs: 33, objs: 165, free: 0
>>> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
>>> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
>>> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
>>> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
>>> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
>>> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
>>> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
>>> [c1c65f50] [c017f270] kswapd+0x228/0x25c
>>> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
>>> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
>>> Mem-Info:
>>> active_anon:235002 inactive_anon:240975 isolated_anon:0
>>> active_file:4356 inactive_file:2551 isolated_file:0
>>> unevictable:4 dirty:7 writeback:19
>>> slab_reclaimable:1008 slab_unreclaimable:7218
>>> mapped:5601 shmem:21 pagetables:939
>>> sec_pagetables:0 bounce:0
>>> kernel_misc_reclaimable:0
>>> free:1340 free_pcp:23 free_cma:0
>>> Node 0 active_anon:940008kB inactive_anon:963900kB active_file:17424kB inactive_file:10204kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:22404kB dirty:28kB writeback:76kB shmem:84kB writeback_tmp:0kB kernel_stack:1552kB pagetables:3756kB sec_pagetables:0kB all_unreclaimable? no
>>> DMA free:0kB boost:7756kB min:11208kB low:12068kB high:12928kB reserved_highatomic:0KB active_anon:644060kB inactive_anon:36332kB active_file:5276kB inactive_file:5516kB unevictable:0kB writepending:348kB present:786432kB managed:746644kB mlocked:0kB bounce:0kB free_pcp:92kB local_pcp:92kB free_cma:0kB
>>> lowmem_reserve[]: 0 0 1280 1280
>>> DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>>> 116345 total pagecache pages
>>> 109413 pages in swap cache
>>> Free swap = 7819300kB
>>> Total swap = 8388604kB
>>> 524288 pages RAM
>>> 327680 pages HighMem/MovableOnly
>>> 9947 pages reserved
>>>
>>>
>>> I switched from zstd to lzo as the zswap default compressor, so zstd does not show up in the dmesg. But the rest looks pretty similar.
>>>
>>> Full dmesg and kernel .config attached.
>>>
>>> Regards,
>>> Erhard
>>
>> Hi Erhard,
>>
>> Thanks for the reports. I'll take a look at them and get back to you
>> in a few days.
>
> Hi Erhard,
>
> The OOM kills on both kernel versions seem to be reasonable to me.
>
> Your system has 2GB memory and it uses zswap with zsmalloc (which is
> good since it can allocate from the highmem zone) and zstd/lzo (which
> doesn't matter much). Somehow -- I couldn't figure out why -- it
> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
>
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> [ 0.000000] Normal empty
> [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
That's really odd. But we are messing with "PowerMac3,6", so I don't
really know what's right or wrong ...
--
Cheers,
David / dhildenb
On Sat, 1 Jun 2024 00:01:48 -0600
Yu Zhao <[email protected]> wrote:
> Hi Erhard,
>
> The OOM kills on both kernel versions seem to be reasonable to me.
>
> Your system has 2GB memory and it uses zswap with zsmalloc (which is
> good since it can allocate from the highmem zone) and zstd/lzo (which
> doesn't matter much). Somehow -- I couldn't figure out why -- it
> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
>
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> [ 0.000000] Normal empty
> [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
>
> The kernel can't allocate from the highmem zone -- only userspace and
> zsmalloc can. OOM kills were due to the low memory conditions in the
> DMA zone where the kernel itself failed to allocate from.
>
> Do you know a kernel version that doesn't have OOM kills while running
> the same workload? If so, could you send that .config to me? If not,
> could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> of ideas at the moment.)
>
> Thanks!
Hi Yu!
Thanks for looking into this.
The reason for this 0.25GB DMA / 1.75GB highmem split is beyond my knowledge. All I can say is that it has been like this at least since kernel v4.14.x (see the dmesg in an old bug report of mine at https://bugzilla.kernel.org/show_bug.cgi?id=201723); I suspect earlier kernel versions behave the same.
Without CONFIG_HIGHMEM the memory layout looks like this:
Total memory = 768MB; using 2048kB for hash table
[...]
Top of RAM: 0x30000000, Total RAM: 0x30000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x000000002fffffff]
Normal empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000000000000-0x000000002fffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
percpu: Embedded 29 pages/cpu s28448 r8192 d82144 u118784
pcpu-alloc: s28448 r8192 d82144 u118784 alloc=29*4096
pcpu-alloc: [0] 0 [0] 1
Kernel command line: ro root=/dev/sda5 slub_debug=FZP page_poison=1 [email protected]/eth0,[email protected]/A8:A1:59:16:4F:EA debug
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Built 1 zonelists, mobility grouping on. Total pages: 194880
mem auto-init: stack:all(pattern), heap alloc:off, heap free:off
Kernel virtual memory layout:
* 0xffbdf000..0xfffff000 : fixmap
* 0xff8f4000..0xffbdf000 : early ioremap
* 0xf1000000..0xff8f4000 : vmalloc & ioremap
* 0xb0000000..0xc0000000 : modules
Memory: 761868K/786432K available (7760K kernel code, 524K rwdata, 4528K rodata, 1100K init, 253K bss, 24564K reserved, 0K cma-reserved)
[...]
With only 768 MB RAM and a 2048K hash table I get pretty much the same "kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL),nodemask=(null),cpuset=/,mems_allowed=0" as with the HIGHMEM-enabled kernel when running "stress-ng --vm 2 --vm-bytes 1930M --verify -v".
I tried the workload on v6.6.32 LTS, where the issue shows up too. But v6.1.92 LTS seems OK! I triple-checked v6.1.92 to be sure.
Attached please find kernel v6.9.3 dmesg (without HIGHMEM) and kernel v6.1.92 .config.
Regards,
Erhard
On Sun, Jun 2, 2024 at 12:03 PM Erhard Furtner <[email protected]> wrote:
>
> On Sat, 1 Jun 2024 00:01:48 -0600
> Yu Zhao <[email protected]> wrote:
>
> > Hi Erhard,
> >
> > The OOM kills on both kernel versions seem to be reasonable to me.
> >
> > Your system has 2GB memory and it uses zswap with zsmalloc (which is
> > good since it can allocate from the highmem zone) and zstd/lzo (which
> > doesn't matter much). Somehow -- I couldn't figure out why -- it
> > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> > [ 0.000000] Normal empty
> > [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
> >
> > The kernel can't allocate from the highmem zone -- only userspace and
> > zsmalloc can. OOM kills were due to the low memory conditions in the
> > DMA zone where the kernel itself failed to allocate from.
> >
> > Do you know a kernel version that doesn't have OOM kills while running
> > the same workload? If so, could you send that .config to me? If not,
> > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> > of ideas at the moment.)
> >
> > Thanks!
>
> Hi Yu!
>
> Thanks for looking into this.
>
> The reason for this 0.25GB DMA / 1.75GB highmem split is beyond my knowledge. All I can say is that it has been like this at least since kernel v4.14.x (see the dmesg in an old bug report of mine at https://bugzilla.kernel.org/show_bug.cgi?id=201723); I suspect earlier kernel versions behave the same.
>
> Without CONFIG_HIGHMEM the memory layout looks like this:
>
> Total memory = 768MB; using 2048kB for hash table
> [...]
> Top of RAM: 0x30000000, Total RAM: 0x30000000
> Memory hole size: 0MB
> Zone ranges:
> DMA [mem 0x0000000000000000-0x000000002fffffff]
> Normal empty
> Movable zone start for each node
> Early memory node ranges
> node 0: [mem 0x0000000000000000-0x000000002fffffff]
> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> percpu: Embedded 29 pages/cpu s28448 r8192 d82144 u118784
> pcpu-alloc: s28448 r8192 d82144 u118784 alloc=29*4096
> pcpu-alloc: [0] 0 [0] 1
> Kernel command line: ro root=/dev/sda5 slub_debug=FZP page_poison=1 [email protected]/eth0,[email protected]/A8:A1:59:16:4F:EA debug
> Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
> Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> Built 1 zonelists, mobility grouping on. Total pages: 194880
> mem auto-init: stack:all(pattern), heap alloc:off, heap free:off
> Kernel virtual memory layout:
> * 0xffbdf000..0xfffff000 : fixmap
> * 0xff8f4000..0xffbdf000 : early ioremap
> * 0xf1000000..0xff8f4000 : vmalloc & ioremap
> * 0xb0000000..0xc0000000 : modules
> Memory: 761868K/786432K available (7760K kernel code, 524K rwdata, 4528K rodata, 1100K init, 253K bss, 24564K reserved, 0K cma-reserved)
> [...]
>
> With only 768 MB RAM and a 2048K hash table I get pretty much the same "kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL),nodemask=(null),cpuset=/,mems_allowed=0" as with the HIGHMEM-enabled kernel when running "stress-ng --vm 2 --vm-bytes 1930M --verify -v".
>
> I tried the workload on v6.6.32 LTS, where the issue shows up too. But v6.1.92 LTS seems OK! I triple-checked v6.1.92 to be sure.
>
> Attached please find kernel v6.9.3 dmesg (without HIGHMEM) and kernel v6.1.92 .config.
Thanks.
I compared the .config between v6.8.9 (you attached previously) and
v6.1.92 -- I didn't see any major differences (both have ZONE_DMA,
HIGHMEM, MGLRU and zswap/zsmalloc). Either there is something broken
between v6.1.92 and v6.6.32 (as you mentioned above), or it's just a
kernel allocation bloat which puts the DMA zone (0.25GB) under too
heavy pressure. The latter isn't uncommon when upgrading to a newer
version of the kernel.
Could you please attach the dmesg from v6.1.92? I want to compare the
dmesgs between the two kernel versions as well -- that might provide
some hints.
On Sun, 2 Jun 2024 14:38:18 -0600
Yu Zhao <[email protected]> wrote:
> I compared the .config between v6.8.9 (you attached previously) and
> v6.1.92 -- I didn't see any major differences (both have ZONE_DMA,
> HIGHMEM, MGLRU and zswap/zsmalloc). Either there is something broken
> between v6.1.92 and v6.6.32 (as you mentioned above), or it's just a
> kernel allocation bloat which puts the DMA zone (0.25GB) under too
> heavy pressure. The latter isn't uncommon when upgrading to a newer
> version of the kernel.
>
> Could you please attach the dmesg from v6.1.92? I want to compare the
> dmesgs between the two kernel versions as well -- that might provide
> some hints.
No problem, attached please find the v6.1.92 dmesg and also the v6.9.3 .config and dmesg.
The v6.9.3 .config is a slightly stripped-down version of my originally posted v6.8.9 .config (to ease 'make oldconfig' for older kernels), and I used it as a starting point for v6.6.32 and v6.1.92.
I also started a git bisect. Let's see if I get something meaningful out of this...
Regards,
Erhard
On Sun, 2 Jun 2024 20:03:32 +0200
Erhard Furtner <[email protected]> wrote:
> On Sat, 1 Jun 2024 00:01:48 -0600
> Yu Zhao <[email protected]> wrote:
>
> > The OOM kills on both kernel versions seem to be reasonable to me.
> >
> > Your system has 2GB memory and it uses zswap with zsmalloc (which is
> > good since it can allocate from the highmem zone) and zstd/lzo (which
> > doesn't matter much). Somehow -- I couldn't figure out why -- it
> > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> > [ 0.000000] Normal empty
> > [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
> >
> > The kernel can't allocate from the highmem zone -- only userspace and
> > zsmalloc can. OOM kills were due to the low memory conditions in the
> > DMA zone where the kernel itself failed to allocate from.
> >
> > Do you know a kernel version that doesn't have OOM kills while running
> > the same workload? If so, could you send that .config to me? If not,
> > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> > of ideas at the moment.)
Ok, the bisect I did actually revealed something meaningful:
# git bisect good
b8cf32dc6e8c75b712cbf638e0fd210101c22f17 is the first bad commit
commit b8cf32dc6e8c75b712cbf638e0fd210101c22f17
Author: Yosry Ahmed <[email protected]>
Date: Tue Jun 20 19:46:44 2023 +0000
mm: zswap: multiple zpools support
Support using multiple zpools of the same type in zswap, for concurrency
purposes. A fixed number of 32 zpools is suggested by this commit, which
was determined empirically. It can be later changed or made into a config
option if needed.
On a setup with zswap and zsmalloc, comparing a single zpool to 32 zpools
shows improvements in the zsmalloc lock contention, especially on the swap
out path.
The following shows the perf analysis of the swapout path when 10
workloads are simultaneously reclaiming and refaulting tmpfs pages. There
are some improvements on the swap in path as well, but less significant.
1 zpool:
|--28.99%--zswap_frontswap_store
|
<snip>
|
|--8.98%--zpool_map_handle
| |
| --8.98%--zs_zpool_map
| |
| --8.95%--zs_map_object
| |
| --8.38%--_raw_spin_lock
| |
| --7.39%--queued_spin_lock_slowpath
|
|--8.82%--zpool_malloc
| |
| --8.82%--zs_zpool_malloc
| |
| --8.80%--zs_malloc
| |
| |--7.21%--_raw_spin_lock
| | |
| | --6.81%--queued_spin_lock_slowpath
<snip>
32 zpools:
|--16.73%--zswap_frontswap_store
|
<snip>
|
|--1.81%--zpool_malloc
| |
| --1.81%--zs_zpool_malloc
| |
| --1.79%--zs_malloc
| |
| --0.73%--obj_malloc
|
|--1.06%--zswap_update_total_size
|
|--0.59%--zpool_map_handle
| |
| --0.59%--zs_zpool_map
| |
| --0.57%--zs_map_object
| |
| --0.51%--_raw_spin_lock
<snip>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Yu Zhao <[email protected]>
Acked-by: Chris Li (Google) <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Tested-by: Nhat Pham <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mm/zswap.c | 81 +++++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 54 insertions(+), 27 deletions(-)
'Bad' bisect steps were the ones where the "kswapd0: page allocation failure:" showed up when running the workload; 'good' steps were the ones where I only got the kernel's OOM reaper killing the workload. In the good cases the machine stayed usable via VNC; in the bad cases the machine crashed and rebooted >80% of the time shortly after showing the issue in dmesg (via netconsole). I triple-checked the good cases to be sure that only the OOM reaper showed up and not the kswapd0: page allocation failure.
Bisect.log attached.
Regards,
Erhard
On Mon, Jun 3, 2024 at 3:13 PM Erhard Furtner <[email protected]> wrote:
>
> On Sun, 2 Jun 2024 20:03:32 +0200
> Erhard Furtner <[email protected]> wrote:
>
> > On Sat, 1 Jun 2024 00:01:48 -0600
> > Yu Zhao <[email protected]> wrote:
> >
> > > The OOM kills on both kernel versions seem to be reasonable to me.
> > >
> > > Your system has 2GB memory and it uses zswap with zsmalloc (which is
> > > good since it can allocate from the highmem zone) and zstd/lzo (which
> > > doesn't matter much). Somehow -- I couldn't figure out why -- it
> > > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> > >
> > > [ 0.000000] Zone ranges:
> > > [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> > > [ 0.000000] Normal empty
> > > [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
> > >
> > > The kernel can't allocate from the highmem zone -- only userspace and
> > > zsmalloc can. OOM kills were due to the low memory conditions in the
> > > DMA zone where the kernel itself failed to allocate from.
> > >
> > > Do you know a kernel version that doesn't have OOM kills while running
> > > the same workload? If so, could you send that .config to me? If not,
> > > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> > > of ideas at the moment.)
>
> Ok, the bisect I did actually revealed something meaningful:
>
> # git bisect good
> b8cf32dc6e8c75b712cbf638e0fd210101c22f17 is the first bad commit
> commit b8cf32dc6e8c75b712cbf638e0fd210101c22f17
> Author: Yosry Ahmed <[email protected]>
> Date: Tue Jun 20 19:46:44 2023 +0000
>
> mm: zswap: multiple zpools support
Thanks for bisecting. Taking a look at the thread, it seems like you
have a very limited area of memory to allocate kernel memory from. One
possible reason why that commit can cause an issue is because we will
have multiple instances of the zsmalloc slab caches 'zspage' and
'zs_handle', which may contribute to fragmentation in slab memory.
Do you have /proc/slabinfo from a good and a bad run by any chance?
Also, could you check if the attached patch helps? It makes sure that
even when we use multiple zsmalloc zpools, we will use a single slab
cache of each type.
On Mon, 3 Jun 2024 16:24:02 -0700
Yosry Ahmed <[email protected]> wrote:
> Thanks for bisecting. Taking a look at the thread, it seems like you
> have a very limited area of memory to allocate kernel memory from. One
> possible reason why that commit can cause an issue is because we will
> have multiple instances of the zsmalloc slab caches 'zspage' and
> 'zs_handle', which may contribute to fragmentation in slab memory.
>
> Do you have /proc/slabinfo from a good and a bad run by any chance?
>
> Also, could you check if the attached patch helps? It makes sure that
> even when we use multiple zsmalloc zpools, we will use a single slab
> cache of each type.
Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17, both of which I got from my bisect.log. I got the slabinfo shortly after boot and a second time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB of RAM were exhausted and got the slabinfo then.
The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
Regards,
Erhard
On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <[email protected]> wrote:
>
> On Mon, 3 Jun 2024 16:24:02 -0700
> Yosry Ahmed <[email protected]> wrote:
>
> > Thanks for bisecting. Taking a look at the thread, it seems like you
> > have a very limited area of memory to allocate kernel memory from. One
> > possible reason why that commit can cause an issue is because we will
> > have multiple instances of the zsmalloc slab caches 'zspage' and
> > 'zs_handle', which may contribute to fragmentation in slab memory.
> >
> > Do you have /proc/slabinfo from a good and a bad run by any chance?
> >
> > Also, could you check if the attached patch helps? It makes sure that
> > even when we use multiple zsmalloc zpools, we will use a single slab
> > cache of each type.
>
> Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
>
> Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
>
> The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
Thanks for trying this out. The patch reduces the amount of wasted
memory due to the 'zs_handle' and 'zspage' caches by an order of
magnitude, but it was a small number to begin with (~250K).
I cannot think of other reasons why having multiple zsmalloc pools
will end up using more memory in the 0.25GB zone that the kernel
allocations can be made from.
The number of zpools can be made configurable or determined at runtime
by the size of the machine, but I don't want to do this without
understanding the problem here first. Adding other zswap and zsmalloc
folks in case they have any ideas.
>
> Regards,
> Erhard
On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <[email protected]> wrote:
> >
> > On Mon, 3 Jun 2024 16:24:02 -0700
> > Yosry Ahmed <[email protected]> wrote:
> >
> > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > have a very limited area of memory to allocate kernel memory from. One
> > > possible reason why that commit can cause an issue is because we will
> > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > >
> > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > >
> > > Also, could you check if the attached patch helps? It makes sure that
> > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > cache of each type.
> >
> > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> >
> > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
> >
> > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
>
> Thanks for trying this out. The patch reduces the amount of wasted
> memory due to the 'zs_handle' and 'zspage' caches by an order of
> magnitude, but it was a small number to begin with (~250K).
>
> I cannot think of other reasons why having multiple zsmalloc pools
> will end up using more memory in the 0.25GB zone that the kernel
> allocations can be made from.
>
> The number of zpools can be made configurable or determined at runtime
> by the size of the machine, but I don't want to do this without
> understanding the problem here first. Adding other zswap and zsmalloc
> folks in case they have any ideas.
Hi Erhard,
If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
on kernels before and after the bad commit? It'd be great if you could
run the grep command right before the OOM kills.
The overall internal fragmentation of multiple zsmalloc pools might be
higher than a single one. I suspect this might be the cause.
Thank you.
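
To make the fragmentation suspicion concrete, here is a toy back-of-the-envelope model in userspace C (purely illustrative, not zsmalloc code): assume each pool may keep one partially filled zspage per size class, wasting about half a zspage on average. The size-class count and zspage size below are assumed round numbers, not zsmalloc's exact values.

/* Toy estimate of how zsmalloc internal fragmentation could scale with
 * the number of zpools. All constants are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
        const unsigned long size_classes = 200;  /* assumed class count */
        const unsigned long zspage_kib = 16;     /* assumed: up to 4 pages of 4 KiB */
        const unsigned long pools[] = { 1, 2, 32 };

        for (unsigned int i = 0; i < sizeof(pools) / sizeof(pools[0]); i++) {
                /* one half-filled zspage per class per pool, on average */
                unsigned long waste_kib = pools[i] * size_classes * zspage_kib / 2;

                printf("%2lu zpool(s): ~%lu KiB tied up in partially filled zspages\n",
                       pools[i], waste_kib);
        }
        return 0;
}

Under these assumed numbers a single pool wastes on the order of a couple of MiB, while 32 pools could tie up tens of MiB -- which matters on a machine where the non-highmem zone is only 0.25GB.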
On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <[email protected]> wrote:
> > >
> > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > Yosry Ahmed <[email protected]> wrote:
> > >
> > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > have a very limited area of memory to allocate kernel memory from. One
> > > > possible reason why that commit can cause an issue is because we will
> > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > >
> > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > >
> > > > Also, could you check if the attached patch helps? It makes sure that
> > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > cache of each type.
> > >
> > > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> > >
> > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
> > >
> > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
> >
> > Thanks for trying this out. The patch reduces the amount of wasted
> > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > magnitude, but it was a small number to begin with (~250K).
> >
> > I cannot think of other reasons why having multiple zsmalloc pools
> > will end up using more memory in the 0.25GB zone that the kernel
> > allocations can be made from.
> >
> > The number of zpools can be made configurable or determined at runtime
> > by the size of the machine, but I don't want to do this without
> > understanding the problem here first. Adding other zswap and zsmalloc
> > folks in case they have any ideas.
>
> Hi Erhard,
>
> If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> on kernels before and after the bad commit? It'd be great if you could
> run the grep command right before the OOM kills.
>
> The overall internal fragmentation of multiple zsmalloc pools might be
> higher than a single one. I suspect this might be the cause.
I thought about the internal fragmentation of pools, but zsmalloc
should have access to highmem, and if I understand correctly the
problem here is that we are running out of space in the DMA zone when
making kernel allocations.
Do you suspect zsmalloc is allocating memory from the DMA zone
initially, even though it has access to highmem?
>
> Thank you.
On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <[email protected]> wrote:
> > > >
> > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > possible reason why that commit can cause an issue is because we will
> > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > >
> > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > >
> > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > cache of each type.
> > > >
> > > > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> > > >
> > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
> > > >
> > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
> > >
> > > Thanks for trying this out. The patch reduces the amount of wasted
> > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > magnitude, but it was a small number to begin with (~250K).
> > >
> > > I cannot think of other reasons why having multiple zsmalloc pools
> > > will end up using more memory in the 0.25GB zone that the kernel
> > > allocations can be made from.
> > >
> > > The number of zpools can be made configurable or determined at runtime
> > > by the size of the machine, but I don't want to do this without
> > > understanding the problem here first. Adding other zswap and zsmalloc
> > > folks in case they have any ideas.
> >
> > Hi Erhard,
> >
> > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > on kernels before and after the bad commit? It'd be great if you could
> > run the grep command right before the OOM kills.
> >
> > The overall internal fragmentation of multiple zsmalloc pools might be
> > higher than a single one. I suspect this might be the cause.
>
> I thought about the internal fragmentation of pools, but zsmalloc
> should have access to highmem, and if I understand correctly the
> problem here is that we are running out of space in the DMA zone when
> making kernel allocations.
>
> Do you suspect zsmalloc is allocating memory from the DMA zone
> initially, even though it has access to highmem?
There was a lot of user memory in the DMA zone. So at a point the
highmem zone was full and allocation fallback happened.
The problem with zone fallback is that recent allocations go into
lower zones, meaning they are further back on the LRU list. This
applies to both user memory and zsmalloc memory -- the latter has a
writeback LRU. On top of this, neither the zswap shrinker nor the
zsmalloc shrinker (compaction) is zone aware. So page reclaim might
have trouble hitting the right target zone.
We can't really tell how zspages are distributed across zones, but the
overall number might be helpful. It'd be great if someone could make
nr_zspages per zone :)
On Tue, Jun 4, 2024 at 10:54 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao <[email protected]> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner <[email protected]> wrote:
> > > > >
> > > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > > Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > > possible reason why that commit can cause an issue is because we will
> > > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > > >
> > > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > > >
> > > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > > cache of each type.
> > > > >
> > > > > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> > > > >
> > > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly before the OOM or the kswapd0: page allocation failure happens. I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
> > > > >
> > > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 unfortunately didn't make a difference, I got the kswapd0: page allocation failure nevertheless.
> > > >
> > > > Thanks for trying this out. The patch reduces the amount of wasted
> > > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > > magnitude, but it was a small number to begin with (~250K).
> > > >
> > > > I cannot think of other reasons why having multiple zsmalloc pools
> > > > will end up using more memory in the 0.25GB zone that the kernel
> > > > allocations can be made from.
> > > >
> > > > The number of zpools can be made configurable or determined at runtime
> > > > by the size of the machine, but I don't want to do this without
> > > > understanding the problem here first. Adding other zswap and zsmalloc
> > > > folks in case they have any ideas.
> > >
> > > Hi Erhard,
> > >
> > > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > > on kernels before and after the bad commit? It'd be great if you could
> > > run the grep command right before the OOM kills.
> > >
> > > The overall internal fragmentation of multiple zsmalloc pools might be
> > > higher than a single one. I suspect this might be the cause.
> >
> > I thought about the internal fragmentation of pools, but zsmalloc
> > should have access to highmem, and if I understand correctly the
> > problem here is that we are running out of space in the DMA zone when
> > making kernel allocations.
> >
> > Do you suspect zsmalloc is allocating memory from the DMA zone
> > initially, even though it has access to highmem?
>
> There was a lot of user memory in the DMA zone. So at a point the
> highmem zone was full and allocation fallback happened.
>
> The problem with zone fallback is that recent allocations go into
> lower zones, meaning they are further back on the LRU list. This
> applies to both user memory and zsmalloc memory -- the latter has a
> writeback LRU. On top of this, neither the zswap shrinker nor the
> zsmalloc shrinker (compaction) is zone aware. So page reclaim might
> have trouble hitting the right target zone.
I see what you mean. In this case, yeah I think the internal
fragmentation in the zsmalloc pools may be the reason behind the
problem.
How many CPUs does this machine have? I am wondering if 32 can be an
overkill for small machines, perhaps the number of pools should be
min(nr_cpus, 32)?
Alternatively, the number of pools should scale with the memory size
in some way, such that we only increase fragmentation when it's
tolerable.
>
> We can't really tell how zspages are distributed across zones, but the
> overall number might be helpful. It'd be great if someone could make
> nr_zspages per zone :)
On 6/4/24 1:24 AM, Yosry Ahmed wrote:
> On Mon, Jun 3, 2024 at 3:13 PM Erhard Furtner <[email protected]> wrote:
>>
>> On Sun, 2 Jun 2024 20:03:32 +0200
>> Erhard Furtner <[email protected]> wrote:
>>
>> > On Sat, 1 Jun 2024 00:01:48 -0600
>> > Yu Zhao <[email protected]> wrote:
>> >
>> > > The OOM kills on both kernel versions seem to be reasonable to me.
>> > >
>> > > Your system has 2GB memory and it uses zswap with zsmalloc (which is
>> > > good since it can allocate from the highmem zone) and zstd/lzo (which
>> > > doesn't matter much). Somehow -- I couldn't figure out why -- it
>> > > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
>> > >
>> > > [ 0.000000] Zone ranges:
>> > > [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
>> > > [ 0.000000] Normal empty
>> > > [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
>> > >
>> > > The kernel can't allocate from the highmem zone -- only userspace and
>> > > zsmalloc can. OOM kills were due to the low memory conditions in the
>> > > DMA zone where the kernel itself failed to allocate from.
>> > >
>> > > Do you know a kernel version that doesn't have OOM kills while running
>> > > the same workload? If so, could you send that .config to me? If not,
>> > > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
>> > > of ideas at the moment.)
>>
>> Ok, the bisect I did actually revealed something meaningful:
>>
>> # git bisect good
>> b8cf32dc6e8c75b712cbf638e0fd210101c22f17 is the first bad commit
>> commit b8cf32dc6e8c75b712cbf638e0fd210101c22f17
>> Author: Yosry Ahmed <[email protected]>
>> Date: Tue Jun 20 19:46:44 2023 +0000
>>
>> mm: zswap: multiple zpools support
>
> Thanks for bisecting. Taking a look at the thread, it seems like you
> have a very limited area of memory to allocate kernel memory from. One
> possible reason why that commit can cause an issue is because we will
> have multiple instances of the zsmalloc slab caches 'zspage' and
> 'zs_handle', which may contribute to fragmentation in slab memory.
>
> Do you have /proc/slabinfo from a good and a bad run by any chance?
>
> Also, could you check if the attached patch helps? It makes sure that
> even when we use multiple zsmalloc zpools, we will use a single slab
> cache of each type.
As for reducing slab fragmentation/footprint, I would also recommend these
changes to .config:
CONFIG_SLAB_MERGE_DEFAULT=y - this will unify the separate zpool caches as well (but the patch still makes sense), and many other caches too
CONFIG_RANDOM_KMALLOC_CACHES=n - no 16 separate copies of kmalloc caches
although the slabinfo output doesn't seem to show
CONFIG_RANDOM_KMALLOC_CACHES in action, weirdly. It was enabled in the
config attached to the first mail.
Both these changes mean giving up some mitigation against potential vulnerabilities. But it's not perfect anyway and the memory seems really tight here.
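
In a generated .config those two changes would look roughly like this (note that a disabled option shows up as "is not set" rather than "=n"):

CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_RANDOM_KMALLOC_CACHES is not set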
On Tue, Jun 4, 2024 at 1:52 PM Vlastimil Babka (SUSE) <[email protected]> wrote:
>
> On 6/4/24 1:24 AM, Yosry Ahmed wrote:
> > On Mon, Jun 3, 2024 at 3:13 PM Erhard Furtner <[email protected]> wrote:
> >>
> >> On Sun, 2 Jun 2024 20:03:32 +0200
> >> Erhard Furtner <[email protected]> wrote:
> >>
> >> > On Sat, 1 Jun 2024 00:01:48 -0600
> >> > Yu Zhao <[email protected]> wrote:
> >> >
> >> > > The OOM kills on both kernel versions seem to be reasonable to me.
> >> > >
> >> > > Your system has 2GB memory and it uses zswap with zsmalloc (which is
> >> > > good since it can allocate from the highmem zone) and zstd/lzo (which
> >> > > doesn't matter much). Somehow -- I couldn't figure out why -- it
> >> > > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >> > >
> >> > > [ 0.000000] Zone ranges:
> >> > > [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> >> > > [ 0.000000] Normal empty
> >> > > [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
> >> > >
> >> > > The kernel can't allocate from the highmem zone -- only userspace and
> >> > > zsmalloc can. OOM kills were due to the low memory conditions in the
> >> > > DMA zone where the kernel itself failed to allocate from.
> >> > >
> >> > > Do you know a kernel version that doesn't have OOM kills while running
> >> > > the same workload? If so, could you send that .config to me? If not,
> >> > > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> >> > > of ideas at the moment.)
> >>
> >> Ok, the bisect I did actually revealed something meaningful:
> >>
> >> # git bisect good
> >> b8cf32dc6e8c75b712cbf638e0fd210101c22f17 is the first bad commit
> >> commit b8cf32dc6e8c75b712cbf638e0fd210101c22f17
> >> Author: Yosry Ahmed <[email protected]>
> >> Date: Tue Jun 20 19:46:44 2023 +0000
> >>
> >> mm: zswap: multiple zpools support
> >
> > Thanks for bisecting. Taking a look at the thread, it seems like you
> > have a very limited area of memory to allocate kernel memory from. One
> > possible reason why that commit can cause an issue is because we will
> > have multiple instances of the zsmalloc slab caches 'zspage' and
> > 'zs_handle', which may contribute to fragmentation in slab memory.
> >
> > Do you have /proc/slabinfo from a good and a bad run by any chance?
> >
> > Also, could you check if the attached patch helps? It makes sure that
> > even when we use multiple zsmalloc zpools, we will use a single slab
> > cache of each type.
>
> As for reducing slab fragmentation/footprint, I would also recommend these
> changes to .config:
>
> CONFIG_SLAB_MERGE_DEFAULT=y - this will unify the separate zpool caches as well (but the patch still makes sense), and many other caches too
> CONFIG_RANDOM_KMALLOC_CACHES=n - no 16 separate copies of kmalloc caches
Yeah, I did send that patch separately, but I think the problem here
is probably fragmentation in the zsmalloc pools themselves, not the
slab caches used by them.
>
> although the slabinfo output doesn't seem to show
> CONFIG_RANDOM_KMALLOC_CACHES in action, weirdly. It was enabled in the
> config attached to the first mail.
>
> Both these changes mean giving up some mitigation against potential vulnerabilities. But it's not perfect anyway and the memory seems really tight here.
I think we may be able to fix the problem here if we address the zsmalloc fragmentation. As for slab caches, the patch proposed above should avoid the replication without enabling slab cache merging in general.
Thanks for chiming in!
On 6/4/24 8:01 PM, Yosry Ahmed wrote:
> On Tue, Jun 4, 2024 at 10:54 AM Yu Zhao <[email protected]> wrote:
>> There was a lot of user memory in the DMA zone. So at a point the
>> highmem zone was full and allocation fallback happened.
>>
>> The problem with zone fallback is that recent allocations go into
>> lower zones, meaning they are further back on the LRU list. This
>> applies to both user memory and zsmalloc memory -- the latter has a
>> writeback LRU. On top of this, neither the zswap shrinker nor the
>> zsmalloc shrinker (compaction) is zone aware. So page reclaim might
>> have trouble hitting the right target zone.
>
> I see what you mean. In this case, yeah I think the internal
> fragmentation in the zsmalloc pools may be the reason behind the
> problem.
>
> How many CPUs does this machine have? I am wondering if 32 can be an
> overkill for small machines, perhaps the number of pools should be
> min(nr_cpus, 32)?
>
> Alternatively, the number of pools should scale with the memory size
> in some way, such that we only increase fragmentation when it's
> tolerable.
Sounds like a good idea to me, maybe a combination of both. No point in trying to scale if there's no benefit and only the downside of more memory consumption.
On Tue, 4 Jun 2024 11:01:39 -0700
Yosry Ahmed <[email protected]> wrote:
> How many CPUs does this machine have? I am wondering if 32 can be an
> overkill for small machines, perhaps the number of pools should be
> min(nr_cpus, 32)?
This PowerMac G4 DP has 2 CPUs. Not much for a desktop machine by today's standards, but some SoCs have less. ;)
# lscpu
Architecture: ppc
CPU op-mode(s): 32-bit
Byte Order: Big Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Model name: 7455, altivec supported
Model: 3.3 (pvr 8001 0303)
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
BogoMIPS: 83.78
Caches (sum of all):
L1d: 64 KiB (2 instances)
L1i: 64 KiB (2 instances)
L2: 512 KiB (2 instances)
L3: 4 MiB (2 instances)
Regards,
Erhard
On Tue, 4 Jun 2024 11:18:25 -0600
Yu Zhao <[email protected]> wrote:
> Hi Erhard,
>
> If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> on kernels before and after the bad commit? It'd be great if you could
> run the grep command right before the OOM kills.
>
> The overall internal fragmentation of multiple zsmalloc pools might be
> higher than a single one. I suspect this might be the cause.
>
> Thank you.
I used watch -n1 'grep nr_zspages /proc/vmstat' to get the readings and repeated this 3 times to check whether the reported values differ much.
The bad commit was b8cf32dc6e8c75b712cbf638e0fd210101c22f17 "mm: zswap: multiple zpools support"; the next commit in the git log after the bad one would be 42c06a0e8ebe95b81e5fb41c6556ff22d9255b0c "mm: kill frontswap".
With this kernel I got 2440, 2337, 3245 as nr_zspages and the kswapd0: page allocation failure.
The commit in the git log before the bad one would be bfaa4a0ce1bbc1b2b67de7e4c2e1679495f7b905 "scsi: gvp11: Remove unused gvp11_setup() function".
With this kernel I got 25537, 11321, 16087 as nr_zspages and the OOM reaper.
Tomorrow I could also check the .config changes Vlastimil suggested (CONFIG_SLAB_MERGE_DEFAULT=y, no CONFIG_RANDOM_KMALLOC_CACHES) and report back if that's of interest.
Regards,
Erhard
On Tue, Jun 4, 2024 at 2:10 PM Erhard Furtner <[email protected]> wrote:
>
> On Tue, 4 Jun 2024 11:01:39 -0700
> Yosry Ahmed <[email protected]> wrote:
>
> > How many CPUs does this machine have? I am wondering if 32 can be an
> > overkill for small machines, perhaps the number of pools should be
> > min(nr_cpus, 32)?
>
> This PowerMac G4 DP has 2 CPUs. Not much for a desktop machine by today's standards, but some SoCs have less. ;)
>
> # lscpu
> Architecture: ppc
> CPU op-mode(s): 32-bit
> Byte Order: Big Endian
> CPU(s): 2
> On-line CPU(s) list: 0,1
> Model name: 7455, altivec supported
> Model: 3.3 (pvr 8001 0303)
> Thread(s) per core: 1
> Core(s) per socket: 1
> Socket(s): 2
> BogoMIPS: 83.78
> Caches (sum of all):
> L1d: 64 KiB (2 instances)
> L1i: 64 KiB (2 instances)
> L2: 512 KiB (2 instances)
> L3: 4 MiB (2 instances)
>
> Regards,
> Erhard
Could you check if the attached patch helps? It basically changes the
number of zpools from 32 to min(32, nr_cpus).
On Tue, 4 Jun 2024 20:03:27 -0700
Yosry Ahmed <[email protected]> wrote:
> Could you check if the attached patch helps? It basically changes the
> number of zpools from 32 to min(32, nr_cpus).
Thanks! The patch does not fix the issue but it helps.
That means I still see the 'kswapd0: page allocation failure' in dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 failure after that, etc., _but_ the machine keeps running the workload, stays usable via VNC, and I no longer get a hard crash.
Without the patch I got the kswapd0 failure and a hard crash (requiring a power-cycle) in <3min. With the patch I get several kswapd0 failures, but it has been running for 2 hrs now. I double-checked this to be sure.
The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. dmesg of the current v6.10-rc2 run attached.
Regards,
Erhard
On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
>
> On Tue, 4 Jun 2024 20:03:27 -0700
> Yosry Ahmed <[email protected]> wrote:
>
> > Could you check if the attached patch helps? It basically changes the
> > number of zpools from 32 to min(32, nr_cpus).
>
> Thanks! The patch does not fix the issue but it helps.
>
> Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
>
> Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
Thanks for trying this out. This is interesting, so even two zpools is
too much fragmentation for your use case.
I think there are multiple ways to go forward here:
(a) Make the number of zpools a config option, leave the default as
32, but allow special use cases to set it to 1 or similar. This is
probably not preferable because it is not clear to users how to set
it, but the idea is that no one will have to set it except special use
cases such as Erhard's (who will want to set it to 1 in this case).
(b) Make the number of zpools scale linearly with the number of CPUs.
Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
approach is that with a large number of CPUs, too many zpools will
start having diminishing returns. Fragmentation will keep increasing,
while the scalability/concurrency gains will diminish.
(c) Make the number of zpools scale logarithmically with the number of
CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
of zpools from increasing too much and close to the status quo. The
problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
will actually give a nr_zpools > nr_cpus. So we will need to come up
with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
(d) Make the number of zpools scale linearly with memory. This makes
more sense than scaling with CPUs because increasing the number of
zpools increases fragmentation, so it makes sense to limit it by the
available memory. This is also more consistent with other magic
numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
The problem is that unlike zswap trees, the zswap pool is not
connected to the swapfile size, so we don't have an indication for how
much memory will be in the zswap pool. We can scale the number of
zpools with the entire memory on the machine during boot, but this
seems like it would be difficult to figure out, and will not take into
consideration memory hotplugging and the zswap global limit changing.
(e) A creative mix of the above.
(f) Something else (probably simpler).
I am personally leaning toward (c), but I want to hear the opinions of
other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
In the long-term, I think we may want to address the lock contention
in zsmalloc itself instead of zswap spawning multiple zpools.
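
For a feel of how the candidate formulas above behave at different machine sizes, here is a quick userspace comparison (this is only an illustration, not kernel code; clamping every result to at least one zpool is an assumption made for the table):

/* Compare candidate zpool-count formulas from the discussion above. */
#include <stdio.h>

static unsigned int ilog2_u(unsigned int x)     /* floor(log2(x)), x >= 1 */
{
        unsigned int r = 0;

        while (x >>= 1)
                r++;
        return r;
}

static unsigned int at_least_one(unsigned int v)
{
        return v ? v : 1;
}

int main(void)
{
        const unsigned int cpus[] = { 2, 4, 16, 64, 256 };

        printf("%8s %12s %14s %16s %14s\n", "nr_cpus", "nr_cpus/4",
               "4*log2(cpus)", "4*log2(cpus/4)", "min(32,cpus)");
        for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
                unsigned int n = cpus[i];

                printf("%8u %12u %14u %16u %14u\n", n,
                       at_least_one(n / 4),
                       at_least_one(4 * ilog2_u(n)),
                       at_least_one(n >= 4 ? 4 * ilog2_u(n / 4) : 0),
                       n < 32 ? n : 32);
        }
        return 0;
}

At 2 CPUs the plain 4*log2(nr_cpus) variant indeed gives more zpools than CPUs (4 vs 2), which is the small-machine concern noted under (c); the nr_cpus/4 and min(32, nr_cpus) variants collapse to 1 and 2 respectively.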
>
> The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. dmesg of the current v6.10-rc2 run attached.
>
> Regards,
> Erhard
On Wed, Jun 5, 2024 at 4:53 PM Yu Zhao <[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
> > >
> > > On Tue, 4 Jun 2024 20:03:27 -0700
> > > Yosry Ahmed <[email protected]> wrote:
> > >
> > > > Could you check if the attached patch helps? It basically changes the
> > > > number of zpools from 32 to min(32, nr_cpus).
> > >
> > > Thanks! The patch does not fix the issue but it helps.
> > >
> > > Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
> > >
> > > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
> >
> > Thanks for trying this out. This is interesting, so even two zpools is
> > too much fragmentation for your use case.
>
> Now I'm a little bit skeptical that the problem is due to fragmentation.
>
> > I think there are multiple ways to go forward here:
> > (a) Make the number of zpools a config option, leave the default as
> > 32, but allow special use cases to set it to 1 or similar. This is
> > probably not preferable because it is not clear to users how to set
> > it, but the idea is that no one will have to set it except special use
> > cases such as Erhard's (who will want to set it to 1 in this case).
> >
> > (b) Make the number of zpools scale linearly with the number of CPUs.
> > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > approach is that with a large number of CPUs, too many zpools will
> > start having diminishing returns. Fragmentation will keep increasing,
> > while the scalability/concurrency gains will diminish.
> >
> > (c) Make the number of zpools scale logarithmically with the number of
> > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > of zpools from increasing too much and close to the status quo. The
> > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> >
> > (d) Make the number of zpools scale linearly with memory. This makes
> > more sense than scaling with CPUs because increasing the number of
> > zpools increases fragmentation, so it makes sense to limit it by the
> > available memory. This is also more consistent with other magic
> > numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
> >
> > The problem is that unlike zswap trees, the zswap pool is not
> > connected to the swapfile size, so we don't have an indication for how
> > much memory will be in the zswap pool. We can scale the number of
> > zpools with the entire memory on the machine during boot, but this
> > seems like it would be difficult to figure out, and will not take into
> > consideration memory hotplugging and the zswap global limit changing.
> >
> > (e) A creative mix of the above.
> >
> > (f) Something else (probably simpler).
> >
> > I am personally leaning toward (c), but I want to hear the opinions of
> > other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
>
> I double checked that commit and didn't find anything wrong. If we are
> all in the mood of getting to the bottom, can we try using only 1
> zpool while there are 2 available? I.e.,
Erhard, do you mind checking if Yu's diff below to use a single zpool
fixes the problem completely? There is also an attached patch that
does the same thing if this is easier to apply for you.
>
> static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
> {
> - return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
> + return entry->pool->zpools[0];
> }
>
> > In the long-term, I think we may want to address the lock contention
> > in zsmalloc itself instead of zswap spawning multiple zpools.
> >
> > >
> > > The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. dmesg of the current v6.10-rc2 run attached.
> > >
> > > Regards,
> > > Erhard
On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
> >
> > On Tue, 4 Jun 2024 20:03:27 -0700
> > Yosry Ahmed <[email protected]> wrote:
> >
> > > Could you check if the attached patch helps? It basically changes the
> > > number of zpools from 32 to min(32, nr_cpus).
> >
> > Thanks! The patch does not fix the issue but it helps.
> >
> > Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
> >
> > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
>
> Thanks for trying this out. This is interesting, so even two zpools is
> too much fragmentation for your use case.
Now I'm a little bit skeptical that the problem is due to fragmentation.
> I think there are multiple ways to go forward here:
> (a) Make the number of zpools a config option, leave the default as
> 32, but allow special use cases to set it to 1 or similar. This is
> probably not preferable because it is not clear to users how to set
> it, but the idea is that no one will have to set it except special use
> cases such as Erhard's (who will want to set it to 1 in this case).
>
> (b) Make the number of zpools scale linearly with the number of CPUs.
> Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> approach is that with a large number of CPUs, too many zpools will
> start having diminishing returns. Fragmentation will keep increasing,
> while the scalability/concurrency gains will diminish.
>
> (c) Make the number of zpools scale logarithmically with the number of
> CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> of zpools from increasing too much and close to the status quo. The
> problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> will actually give a nr_zpools > nr_cpus. So we will need to come up
> with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
>
> (d) Make the number of zpools scale linearly with memory. This makes
> more sense than scaling with CPUs because increasing the number of
> zpools increases fragmentation, so it makes sense to limit it by the
> available memory. This is also more consistent with other magic
> numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
>
> The problem is that unlike zswap trees, the zswap pool is not
> connected to the swapfile size, so we don't have an indication for how
> much memory will be in the zswap pool. We can scale the number of
> zpools with the entire memory on the machine during boot, but this
> seems like it would be difficult to figure out, and will not take into
> consideration memory hotplugging and the zswap global limit changing.
>
> (e) A creative mix of the above.
>
> (f) Something else (probably simpler).
>
> I am personally leaning toward (c), but I want to hear the opinions of
> other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
I double checked that commit and didn't find anything wrong. If we are
all in the mood of getting to the bottom, can we try using only 1
zpool while there are 2 available? I.e.,
static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
{
- return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
+ return entry->pool->zpools[0];
}
> In the long-term, I think we may want to address the lock contention
> in zsmalloc itself instead of zswap spawning multiple zpools.
>
> >
> > The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. dmesg of the current v6.10-rc2 run attached.
> >
> > Regards,
> > Erhard
On 2024/6/6 07:41, Yosry Ahmed wrote:
> On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
>>
>> On Tue, 4 Jun 2024 20:03:27 -0700
>> Yosry Ahmed <[email protected]> wrote:
>>
>>> Could you check if the attached patch helps? It basically changes the
>>> number of zpools from 32 to min(32, nr_cpus).
>>
>> Thanks! The patch does not fix the issue but it helps.
>>
>> Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
>>
>> Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
>
> Thanks for trying this out. This is interesting, so even two zpools is
> too much fragmentation for your use case.
>
> I think there are multiple ways to go forward here:
> (a) Make the number of zpools a config option, leave the default as
> 32, but allow special use cases to set it to 1 or similar. This is
> probably not preferable because it is not clear to users how to set
> it, but the idea is that no one will have to set it except special use
> cases such as Erhard's (who will want to set it to 1 in this case).
>
> (b) Make the number of zpools scale linearly with the number of CPUs.
> Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> approach is that with a large number of CPUs, too many zpools will
> start having diminishing returns. Fragmentation will keep increasing,
> while the scalability/concurrency gains will diminish.
>
> (c) Make the number of zpools scale logarithmically with the number of
> CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> of zpools from increasing too much and close to the status quo. The
> problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> will actually give a nr_zpools > nr_cpus. So we will need to come up
> with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
>
> (d) Make the number of zpools scale linearly with memory. This makes
> more sense than scaling with CPUs because increasing the number of
> zpools increases fragmentation, so it makes sense to limit it by the
> available memory. This is also more consistent with other magic
> numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
>
> The problem is that unlike zswap trees, the zswap pool is not
> connected to the swapfile size, so we don't have an indication for how
> much memory will be in the zswap pool. We can scale the number of
> zpools with the entire memory on the machine during boot, but this
> seems like it would be difficult to figure out, and will not take into
> consideration memory hotplugging and the zswap global limit changing.
>
> (e) A creative mix of the above.
>
> (f) Something else (probably simpler).
>
> I am personally leaning toward (c), but I want to hear the opinions of
> other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
>
> In the long-term, I think we may want to address the lock contention
> in zsmalloc itself instead of zswap spawning multiple zpools.
>
Agree, I think we should try to improve locking scalability of zsmalloc.
I have some thoughts to share, no code or test data yet:
1. First, we can change the pool global lock to a per-class lock, which
is more fine-grained (a rough sketch follows below).
2. Actually, we only need to take the per-zspage lock on malloc/free, and
only need to take the class lock when its fullness group changes.
3. If this is not enough, we can have fewer fullness groups, so the
class lock needs to be taken less often. (will need some test data)
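A rough sketch of what point 1 could look like (types and field names
are simplified/illustrative, not the actual zsmalloc code):

#include <linux/spinlock.h>
#include <linux/list.h>

/*
 * Illustrative only: give each size_class its own lock again, so that
 * malloc/free hitting different classes do not serialize on a single
 * pool-wide lock.
 */
struct size_class {
	spinlock_t lock;			/* protects this class's zspage lists */
	struct list_head fullness_list[4];	/* fullness groups (simplified) */
	int size;				/* object size served by this class */
};

/*
 * Callers would then take the class lock instead of pool->lock, e.g.:
 *
 *	spin_lock(&class->lock);
 *	... link/unlink the zspage, update its fullness group ...
 *	spin_unlock(&class->lock);
 */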
More comments are welcome. Thanks!
David Hildenbrand <[email protected]> writes:
> On 01.06.24 08:01, Yu Zhao wrote:
>> On Wed, May 15, 2024 at 4:06 PM Yu Zhao <[email protected]> wrote:
...
>>
>> Your system has 2GB memory and it uses zswap with zsmalloc (which is
>> good since it can allocate from the highmem zone) and zstd/lzo (which
>> doesn't matter much). Somehow -- I couldn't figure out why -- it
>> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
>>
>> [ 0.000000] Zone ranges:
>> [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
>> [ 0.000000] Normal empty
>> [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
>
> That's really odd. But we are messing with "PowerMac3,6", so I don't
> really know what's right or wrong ...
The DMA zone exists because 9739ab7eda45 ("powerpc: enable a 30-bit
ZONE_DMA for 32-bit pmac") selects it.
It's 768MB (not 0.25GB) because it's clamped at max_low_pfn:
#ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
1UL << (zone_dma_bits - PAGE_SHIFT));
#endif
Which comes eventually from CONFIG_LOWMEM_SIZE, which defaults to 768MB.
I think it's 768MB because the user:kernel split is 3G:1G, and then the
kernel needs some of that 1G virtual space for vmalloc/ioremap/highmem,
so it splits it 768M:256M.
Then ZONE_NORMAL is empty because it is also limited to max_low_pfn:
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
The rest of RAM is highmem.
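Putting rough numbers on it (a back-of-the-envelope check, assuming
PAGE_SHIFT=12 and zone_dma_bits=30 from the 30-bit pmac ZONE_DMA):

  max_low_pfn                 = 768MB >> 12 = 0x30000 pfns
  1UL << (zone_dma_bits - 12) = 1GB   >> 12 = 0x40000 pfns
  ZONE_DMA  = min(0x30000, 0x40000) pfns    = 768MB -> 0x00000000-0x2fffffff
  ZONE_NORMAL: limited to max_low_pfn, already covered by ZONE_DMA -> empty
  HighMem: the remaining 1.25GB             -> 0x30000000-0x7fffffff

which matches the zone ranges in Erhard's dmesg.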
So I think that's all behaving as expected, but I don't know 32-bit /
highmem stuff that well so I could be wrong.
cheers
On Wed, Jun 5, 2024 at 9:12 PM Michael Ellerman <[email protected]> wrote:
>
> David Hildenbrand <[email protected]> writes:
> > On 01.06.24 08:01, Yu Zhao wrote:
> >> On Wed, May 15, 2024 at 4:06 PM Yu Zhao <[email protected]> wrote:
> ...
> >>
> >> Your system has 2GB memory and it uses zswap with zsmalloc (which is
> >> good since it can allocate from the highmem zone) and zstd/lzo (which
> >> doesn't matter much). Somehow -- I couldn't figure out why -- it
> >> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >>
> >> [ 0.000000] Zone ranges:
> >> [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
> >> [ 0.000000] Normal empty
> >> [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
> >
> > That's really odd. But we are messing with "PowerMac3,6", so I don't
> > really know what's right or wrong ...
>
> The DMA zone exists because 9739ab7eda45 ("powerpc: enable a 30-bit
> ZONE_DMA for 32-bit pmac") selects it.
>
> It's 768MB (not 0.25GB) because it's clamped at max_low_pfn:
Right. (I meant 0.75GB.)
> #ifdef CONFIG_ZONE_DMA
> max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
> 1UL << (zone_dma_bits - PAGE_SHIFT));
> #endif
>
> Which comes eventually from CONFIG_LOWMEM_SIZE, which defaults to 768MB.
I see. I grep'ed VMSPLIT which is used on x86 and arm but apparently
not on powerpc.
> I think it's 768MB because the user:kernel split is 3G:1G, and then the
> kernel needs some of that 1G virtual space for vmalloc/ioremap/highmem,
> so it splits it 768M:256M.
>
> Then ZONE_NORMAL is empty because it is also limited to max_low_pfn:
>
> max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>
> The rest of RAM is highmem.
>
> So I think that's all behaving as expected, but I don't know 32-bit /
> highmem stuff that well so I could be wrong.
Yes, the three zones work as intended.
Erhard,
Since your system only has 2GB memory, I'd try the 2G:2G split, which
would in theory allow both the kernel and userspace to access all memory.
CONFIG_LOWMEM_SIZE_BOOL=y
CONFIG_LOWMEM_SIZE=0x7000000
(Michael, please correct me if the above wouldn't work.)
On (24/06/06 10:49), Chengming Zhou wrote:
> > Thanks for trying this out. This is interesting, so even two zpools is
> > too much fragmentation for your use case.
> >
> > I think there are multiple ways to go forward here:
> > (a) Make the number of zpools a config option, leave the default as
> > 32, but allow special use cases to set it to 1 or similar. This is
> > probably not preferable because it is not clear to users how to set
> > it, but the idea is that no one will have to set it except special use
> > cases such as Erhard's (who will want to set it to 1 in this case).
> >
> > (b) Make the number of zpools scale linearly with the number of CPUs.
> > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > approach is that with a large number of CPUs, too many zpools will
> > start having diminishing returns. Fragmentation will keep increasing,
> > while the scalability/concurrency gains will diminish.
> >
> > (c) Make the number of zpools scale logarithmically with the number of
> > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > of zpools from increasing too much and close to the status quo. The
> > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> >
> > (d) Make the number of zpools scale linearly with memory. This makes
> > more sense than scaling with CPUs because increasing the number of
> > zpools increases fragmentation, so it makes sense to limit it by the
> > available memory. This is also more consistent with other magic
> > numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
> >
> > The problem is that unlike zswap trees, the zswap pool is not
> > connected to the swapfile size, so we don't have an indication for how
> > much memory will be in the zswap pool. We can scale the number of
> > zpools with the entire memory on the machine during boot, but this
> > seems like it would be difficult to figure out, and will not take into
> > consideration memory hotplugging and the zswap global limit changing.
> >
> > (e) A creative mix of the above.
> >
> > (f) Something else (probably simpler).
> >
> > I am personally leaning toward (c), but I want to hear the opinions of
> > other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
> >
> > In the long-term, I think we may want to address the lock contention
> > in zsmalloc itself instead of zswap spawning multiple zpools.
Sorry, I'm sure I'm not following this discussion closely enough,
has the lock contention been demonstrated/proved somehow? lock-stats?
> Agree, I think we should try to improve locking scalability of zsmalloc.
> I have some thoughts to share, no code or test data yet:
>
> 1. First, we can change the pool global lock to per-class lock, which
> is more fine-grained.
Commit c0547d0b6a4b6 "zsmalloc: consolidate zs_pool's migrate_lock
and size_class's locks" [1] claimed no significant difference
between class->lock and pool->lock.
[1] https://lkml.kernel.org/r/[email protected]
On 2024/6/6 12:31, Sergey Senozhatsky wrote:
> On (24/06/06 10:49), Chengming Zhou wrote:
>>> Thanks for trying this out. This is interesting, so even two zpools is
>>> too much fragmentation for your use case.
>>>
>>> I think there are multiple ways to go forward here:
>>> (a) Make the number of zpools a config option, leave the default as
>>> 32, but allow special use cases to set it to 1 or similar. This is
>>> probably not preferable because it is not clear to users how to set
>>> it, but the idea is that no one will have to set it except special use
>>> cases such as Erhard's (who will want to set it to 1 in this case).
>>>
>>> (b) Make the number of zpools scale linearly with the number of CPUs.
>>> Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
>>> approach is that with a large number of CPUs, too many zpools will
>>> start having diminishing returns. Fragmentation will keep increasing,
>>> while the scalability/concurrency gains will diminish.
>>>
>>> (c) Make the number of zpools scale logarithmically with the number of
>>> CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
>>> of zpools from increasing too much and close to the status quo. The
>>> problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
>>> will actually give a nr_zpools > nr_cpus. So we will need to come up
>>> with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
>>>
>>> (d) Make the number of zpools scale linearly with memory. This makes
>>> more sense than scaling with CPUs because increasing the number of
>>> zpools increases fragmentation, so it makes sense to limit it by the
>>> available memory. This is also more consistent with other magic
>>> numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
>>>
>>> The problem is that unlike zswap trees, the zswap pool is not
>>> connected to the swapfile size, so we don't have an indication for how
>>> much memory will be in the zswap pool. We can scale the number of
>>> zpools with the entire memory on the machine during boot, but this
>>> seems like it would be difficult to figure out, and will not take into
>>> consideration memory hotplugging and the zswap global limit changing.
>>>
>>> (e) A creative mix of the above.
>>>
>>> (f) Something else (probably simpler).
>>>
>>> I am personally leaning toward (c), but I want to hear the opinions of
>>> other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
>>>
>>> In the long-term, I think we may want to address the lock contention
>>> in zsmalloc itself instead of zswap spawning multiple zpools.
>
> Sorry, I'm sure I'm not following this discussion closely enough,
> has the lock contention been demonstrated/proved somehow? lock-stats?
Yosry has some stats in his commit b8cf32dc6e8c ("mm: zswap: multiple zpools support"),
and I have also seen some locking contention when using zram to test kernel building,
since zram still has only one pool.
>
>> Agree, I think we should try to improve locking scalability of zsmalloc.
>> I have some thoughts to share, no code or test data yet:
>>
>> 1. First, we can change the pool global lock to per-class lock, which
>> is more fine-grained.
>
> Commit c0547d0b6a4b6 "zsmalloc: consolidate zs_pool's migrate_lock
> and size_class's locks" [1] claimed no significant difference
> between class->lock and pool->lock.
Ok, I haven't looked into the history much; that seems to have been
preparation for introducing reclaim into zsmalloc? Not sure. But now that
the reclaim code in zsmalloc is gone, should we change back to the
per-class lock, which is obviously more fine-grained than the pool lock?
Actually, I have just done it and will test to get some data later.
Thanks.
>
> [1] https://lkml.kernel.org/r/[email protected]
On (24/06/06 12:46), Chengming Zhou wrote:
> >> Agree, I think we should try to improve locking scalability of zsmalloc.
> >> I have some thoughts to share, no code or test data yet:
> >>
> >> 1. First, we can change the pool global lock to per-class lock, which
> >> is more fine-grained.
> >
> > Commit c0547d0b6a4b6 "zsmalloc: consolidate zs_pool's migrate_lock
> > and size_class's locks" [1] claimed no significant difference
> > between class->lock and pool->lock.
>
> Ok, I haven't looked into the history much, that seems preparation of trying
> to introduce reclaim in the zsmalloc? Not sure. But now with the reclaim code
> in zsmalloc has gone, should we change back to the per-class lock? Which is
Well, the point that commit made was that Nhat (and Johannes?) were
unable to detect any impact of pool->lock on a variety of cases. So
we went on with code simplification.
> obviously more fine-grained than the pool lock. Actually, I have just done it,
> will test to get some data later.
Thanks, we'll need data on this. I'm happy to take the patch, but
jumping back and forth between class->lock and pool->lock merely
"for obvious reasons" is not what I'm extremely excited about.
On 2024/6/6 13:43, Sergey Senozhatsky wrote:
> On (24/06/06 12:46), Chengming Zhou wrote:
>>>> Agree, I think we should try to improve locking scalability of zsmalloc.
>>>> I have some thoughts to share, no code or test data yet:
>>>>
>>>> 1. First, we can change the pool global lock to per-class lock, which
>>>> is more fine-grained.
>>>
>>> Commit c0547d0b6a4b6 "zsmalloc: consolidate zs_pool's migrate_lock
>>> and size_class's locks" [1] claimed no significant difference
>>> between class->lock and pool->lock.
>>
>> Ok, I haven't looked into the history much, that seems preparation of trying
>> to introduce reclaim in the zsmalloc? Not sure. But now with the reclaim code
>> in zsmalloc has gone, should we change back to the per-class lock? Which is
>
> Well, the point that commit made was that Nhat (and Johannes?) were
> unable to detect any impact of pool->lock on a variety of cases. So
> we went on with code simplification.
Right, the code is simpler.
>
>> obviously more fine-grained than the pool lock. Actually, I have just done it,
>> will test to get some data later.
>
> Thanks, we'll need data on this. I'm happy to take the patch, but
> jumping back and forth between class->lock and pool->lock merely
> "for obvious reasons" is not what I'm extremely excited about.
Yeah, agree, we need test data.
On 6/6/24 1:41 AM, Yosry Ahmed wrote:
> On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
>
> I am personally leaning toward (c), but I want to hear the opinions of
> other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
Besides the zpool commit, which might have just pushed the machine over the
edge (it was probably close to it already), I've noticed a more general
problem: there are GFP_KERNEL allocations failing from kswapd. Those
could probably use __GFP_NOMEMALLOC (or a scoped variant, is there one?)
since it's the case of "allocating memory to free memory". Or use mempools
if the progress (success will lead to freeing memory) is really guaranteed.
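For instance (illustrative only, not a concrete patch), such an allocation
could opt out of the emergency reserves with something like:

	page = alloc_page(GFP_KERNEL | __GFP_NOMEMALLOC);

so it fails cleanly rather than draining the reserves that reclaim itself
depends on.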
Another interesting data point could be to see if traditional reclaim
behaves any better on this machine than MGLRU. I saw in the config:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
So disabling at least the second one would revert to the traditional reclaim
and we could see if it handles such a constrained system better or not.
> In the long-term, I think we may want to address the lock contention
> in zsmalloc itself instead of zswap spawning multiple zpools.
>
>>
>> The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. dmesg of the current v6.10-rc2 run attached.
>>
>> Regards,
>> Erhard
>
Yu Zhao <[email protected]> writes:
> On Wed, Jun 5, 2024 at 9:12 PM Michael Ellerman <[email protected]> wrote:
>>
>> David Hildenbrand <[email protected]> writes:
>> > On 01.06.24 08:01, Yu Zhao wrote:
>> >> On Wed, May 15, 2024 at 4:06 PM Yu Zhao <[email protected]> wrote:
>> ...
>> >>
>> >> Your system has 2GB memory and it uses zswap with zsmalloc (which is
>> >> good since it can allocate from the highmem zone) and zstd/lzo (which
>> >> doesn't matter much). Somehow -- I couldn't figure out why -- it
>> >> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
>> >>
>> >> [ 0.000000] Zone ranges:
>> >> [ 0.000000] DMA [mem 0x0000000000000000-0x000000002fffffff]
>> >> [ 0.000000] Normal empty
>> >> [ 0.000000] HighMem [mem 0x0000000030000000-0x000000007fffffff]
>> >
>> > That's really odd. But we are messing with "PowerMac3,6", so I don't
>> > really know what's right or wrong ...
>>
>> The DMA zone exists because 9739ab7eda45 ("powerpc: enable a 30-bit
>> ZONE_DMA for 32-bit pmac") selects it.
>>
>> It's 768MB (not 0.25GB) because it's clamped at max_low_pfn:
>
> Right. (I meant 0.75GB.)
>
>> #ifdef CONFIG_ZONE_DMA
>> max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
>> 1UL << (zone_dma_bits - PAGE_SHIFT));
>> #endif
>>
>> Which comes eventually from CONFIG_LOWMEM_SIZE, which defaults to 768MB.
>
> I see. I grep'ed VMSPLIT which is used on x86 and arm but apparently
> not on powerpc.
Those VMSPLIT configs are nice, on powerpc it's all done manually :}
>> I think it's 768MB because the user:kernel split is 3G:1G, and then the
>> kernel needs some of that 1G virtual space for vmalloc/ioremap/highmem,
>> so it splits it 768M:256M.
>>
>> Then ZONE_NORMAL is empty because it is also limited to max_low_pfn:
>>
>> max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>>
>> The rest of RAM is highmem.
>>
>> So I think that's all behaving as expected, but I don't know 32-bit /
>> highmem stuff that well so I could be wrong.
>
> Yes, the three zones work as intended.
>
> Erhard,
>
> Since your system only has 2GB memory, I'd try the 2G:2G split, which
> would in theory allow both the kernel and userspace to all memory.
>
> CONFIG_LOWMEM_SIZE_BOOL=y
> CONFIG_LOWMEM_SIZE=0x7000000
>
> (Michael, please correct me if the above wouldn't work.)
It's a bit more complicated; in order to increase LOWMEM_SIZE you need
to adjust all the other variables to make space.
To get 2G of user virtual space I think you need:
CONFIG_ADVANCED_OPTIONS=y
CONFIG_LOWMEM_SIZE_BOOL=y
CONFIG_LOWMEM_SIZE=0x60000000
CONFIG_PAGE_OFFSET_BOOL=y
CONFIG_PAGE_OFFSET=0x90000000
CONFIG_KERNEL_START_BOOL=y
CONFIG_KERNEL_START=0x90000000
CONFIG_PHYSICAL_START=0x00000000
CONFIG_TASK_SIZE_BOOL=y
CONFIG_TASK_SIZE=0x80000000
Which results in 1.5GB of lowmem.
Or if you want to map all 2G of RAM directly in the kernel without
highmem, but limit user virtual space to 1.5G:
CONFIG_ADVANCED_OPTIONS=y
CONFIG_LOWMEM_SIZE_BOOL=y
CONFIG_LOWMEM_SIZE=0x80000000
CONFIG_PAGE_OFFSET_BOOL=y
CONFIG_PAGE_OFFSET=0x70000000
CONFIG_KERNEL_START_BOOL=y
CONFIG_KERNEL_START=0x70000000
CONFIG_PHYSICAL_START=0x00000000
CONFIG_TASK_SIZE_BOOL=y
CONFIG_TASK_SIZE=0x60000000
You can also reclaim another 256MB of virtual space if you disable
CONFIG_MODULES.
Those configs do boot on qemu. But I don't have easy access to my 32-bit
machine to test if they boot on actual hardware.
cheers
On Thu, 6 Jun 2024 09:24:56 +0200
"Vlastimil Babka (SUSE)" <[email protected]> wrote:
> Besides the zpool commit which might have just pushed the machine over the
> edge, but it was probably close to it already. I've noticed a more general
> problem that there are GFP_KERNEL allocations failing from kswapd. Those
> could probably use be __GFP_NOMEMALLOC (or scoped variant, is there one?)
> since it's the case of "allocating memory to free memory". Or use mempools
> if the progress (success will lead to freeing memory) is really guaranteed.
>
> Another interesting data point could be to see if traditional reclaim
> behaves any better on this machine than MGLRU. I saw in the config:
>
> CONFIG_LRU_GEN=y
> CONFIG_LRU_GEN_ENABLED=y
>
> So disabling at least the second one would revert to the traditional reclaim
> and we could see if it handles such a constrained system better or not.
I set RANDOM_KMALLOC_CACHES=n and LRU_GEN_ENABLED=n but still hit the issue.
dmesg looks a bit different (unpatched v6.10-rc2).
Regards,
Erhard
On Wed, 5 Jun 2024 16:58:11 -0700
Yosry Ahmed <[email protected]> wrote:
> On Wed, Jun 5, 2024 at 4:53 PM Yu Zhao <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
> > > >
> > > > On Tue, 4 Jun 2024 20:03:27 -0700
> > > > Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > > Could you check if the attached patch helps? It basically changes the
> > > > > number of zpools from 32 to min(32, nr_cpus).
> > > >
> > > > Thanks! The patch does not fix the issue but it helps.
> > > >
> > > > Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
> > > >
> > > > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
> > >
> > > Thanks for trying this out. This is interesting, so even two zpools is
> > > too much fragmentation for your use case.
> >
> > Now I'm a little bit skeptical that the problem is due to fragmentation.
> >
> > > I think there are multiple ways to go forward here:
> > > (a) Make the number of zpools a config option, leave the default as
> > > 32, but allow special use cases to set it to 1 or similar. This is
> > > probably not preferable because it is not clear to users how to set
> > > it, but the idea is that no one will have to set it except special use
> > > cases such as Erhard's (who will want to set it to 1 in this case).
> > >
> > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > approach is that with a large number of CPUs, too many zpools will
> > > start having diminishing returns. Fragmentation will keep increasing,
> > > while the scalability/concurrency gains will diminish.
> > >
> > > (c) Make the number of zpools scale logarithmically with the number of
> > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > of zpools from increasing too much and close to the status quo. The
> > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > >
> > > (d) Make the number of zpools scale linearly with memory. This makes
> > > more sense than scaling with CPUs because increasing the number of
> > > zpools increases fragmentation, so it makes sense to limit it by the
> > > available memory. This is also more consistent with other magic
> > > numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
> > >
> > > The problem is that unlike zswap trees, the zswap pool is not
> > > connected to the swapfile size, so we don't have an indication for how
> > > much memory will be in the zswap pool. We can scale the number of
> > > zpools with the entire memory on the machine during boot, but this
> > > seems like it would be difficult to figure out, and will not take into
> > > consideration memory hotplugging and the zswap global limit changing.
> > >
> > > (e) A creative mix of the above.
> > >
> > > (f) Something else (probably simpler).
> > >
> > > I am personally leaning toward (c), but I want to hear the opinions of
> > > other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
> >
> > I double checked that commit and didn't find anything wrong. If we are
> > all in the mood of getting to the bottom, can we try using only 1
> > zpool while there are 2 available? I.e.,
>
> Erhard, do you mind checking if Yu's diff below to use a single zpool
> fixes the problem completely? There is also an attached patch that
> does the same thing if this is easier to apply for you.
No, setting ZSWAP_NR_ZPOOLS to 1 does not fix the problem unfortunately (that being the only patch applied on v6.10-rc2).
Trying to alter the lowmem and virtual mem limits next as Michael suggested.
Regards,
Erhard
On Thu, 06 Jun 2024 22:08:40 +1000
Michael Ellerman <[email protected]> wrote:
> It's a bit more complicated, in order to increase LOWMEM_SIZE you need
> to adjust all the other variables to make space.
>
> To get 2G of user virtual space I think you need:
>
> CONFIG_ADVANCED_OPTIONS=y
> CONFIG_LOWMEM_SIZE_BOOL=y
> CONFIG_LOWMEM_SIZE=0x60000000
> CONFIG_PAGE_OFFSET_BOOL=y
> CONFIG_PAGE_OFFSET=0x90000000
> CONFIG_KERNEL_START_BOOL=y
> CONFIG_KERNEL_START=0x90000000
> CONFIG_PHYSICAL_START=0x00000000
> CONFIG_TASK_SIZE_BOOL=y
> CONFIG_TASK_SIZE=0x80000000
>
> Which results in 1.5GB of lowmem.
Booting this config on the G4 worked, but the issue showed up anyhow.
> Or if you want to map all 2G of RAM directly in the kernel without
> highmem, but limit user virtual space to 1.5G:
>
> CONFIG_ADVANCED_OPTIONS=y
> CONFIG_LOWMEM_SIZE_BOOL=y
> CONFIG_LOWMEM_SIZE=0x80000000
> CONFIG_PAGE_OFFSET_BOOL=y
> CONFIG_PAGE_OFFSET=0x70000000
> CONFIG_KERNEL_START_BOOL=y
> CONFIG_KERNEL_START=0x70000000
> CONFIG_PHYSICAL_START=0x00000000
> CONFIG_TASK_SIZE_BOOL=y
> CONFIG_TASK_SIZE=0x60000000
This actually did the trick!
Also I disabled highmem via HIGHMEM=n. With this config the machine did run the "stress-ng --vm 2 --vm-bytes 1930M --verify -v" load for about 2 hrs without hitting the issue.
> You can also reclaim another 256MB of virtual space if you disable
> CONFIG_MODULES.
Did not try that 'cause the 2nd config worked.
Working 2G_no-highmem .config attached and the dmesg of both configs attached.
Regards,
Erhard
On Thu, Jun 6, 2024 at 6:28 AM Erhard Furtner <[email protected]> wrote:
>
> On Wed, 5 Jun 2024 16:58:11 -0700
> Yosry Ahmed <[email protected]> wrote:
>
> > On Wed, Jun 5, 2024 at 4:53 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner <[email protected]> wrote:
> > > > >
> > > > > On Tue, 4 Jun 2024 20:03:27 -0700
> > > > > Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > > > Could you check if the attached patch helps? It basically changes the
> > > > > > number of zpools from 32 to min(32, nr_cpus).
> > > > >
> > > > > Thanks! The patch does not fix the issue but it helps.
> > > > >
> > > > > Means I still get to see the 'kswapd0: page allocation failure' in the dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 error later on, etc. _but_ the machine keeps running the workload, stays usable via VNC and I get no hard crash any longer.
> > > > >
> > > > > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. With patch several kswapd0 errors but running for 2 hrs now. I double checked this to be sure.
> > > >
> > > > Thanks for trying this out. This is interesting, so even two zpools is
> > > > too much fragmentation for your use case.
> > >
> > > Now I'm a little bit skeptical that the problem is due to fragmentation.
> > >
> > > > I think there are multiple ways to go forward here:
> > > > (a) Make the number of zpools a config option, leave the default as
> > > > 32, but allow special use cases to set it to 1 or similar. This is
> > > > probably not preferable because it is not clear to users how to set
> > > > it, but the idea is that no one will have to set it except special use
> > > > cases such as Erhard's (who will want to set it to 1 in this case).
> > > >
> > > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > > approach is that with a large number of CPUs, too many zpools will
> > > > start having diminishing returns. Fragmentation will keep increasing,
> > > > while the scalability/concurrency gains will diminish.
> > > >
> > > > (c) Make the number of zpools scale logarithmically with the number of
> > > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > > of zpools from increasing too much and close to the status quo. The
> > > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > > >
> > > > (d) Make the number of zpools scale linearly with memory. This makes
> > > > more sense than scaling with CPUs because increasing the number of
> > > > zpools increases fragmentation, so it makes sense to limit it by the
> > > > available memory. This is also more consistent with other magic
> > > > numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
> > > >
> > > > The problem is that unlike zswap trees, the zswap pool is not
> > > > connected to the swapfile size, so we don't have an indication for how
> > > > much memory will be in the zswap pool. We can scale the number of
> > > > zpools with the entire memory on the machine during boot, but this
> > > > seems like it would be difficult to figure out, and will not take into
> > > > consideration memory hotplugging and the zswap global limit changing.
> > > >
> > > > (e) A creative mix of the above.
> > > >
> > > > (f) Something else (probably simpler).
> > > >
> > > > I am personally leaning toward (c), but I want to hear the opinions of
> > > > other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?
> > >
> > > I double checked that commit and didn't find anything wrong. If we are
> > > all in the mood of getting to the bottom, can we try using only 1
> > > zpool while there are 2 available? I.e.,
> >
> > Erhard, do you mind checking if Yu's diff below to use a single zpool
> > fixes the problem completely? There is also an attached patch that
> > does the same thing if this is easier to apply for you.
>
> No, setting ZSWAP_NR_ZPOOLS to 1 does not fix the problem unfortunately (that being the only patch applied on v6.10-rc2).
This confirms Yu's theory that the zpools fragmentation is not the
main reason for the problem. As Vlastimil said, the setup is already
tight on memory and that commit may have just pushed it over the edge.
Since setting ZSWAP_NR_ZPOOLS to 1 (which effectively reverts the
commit) does not help in v6.10-rc2, something else that came after the
commit would have pushed it over the edge anyway.
>
> Trying to alter the lowmem and virtual mem limits next as Michael suggested.
I saw that this worked. So it seems like we don't need to worry about
the number of zpools, for now at least :)
Thanks for helping with the testing, and thanks to everyone else who
helped on this thread.
>
> Regards,
> Erhard
On 6/6/24 3:32 PM, Erhard Furtner wrote:
> On Thu, 6 Jun 2024 09:24:56 +0200
> "Vlastimil Babka (SUSE)" <[email protected]> wrote:
>
>> Besides the zpool commit which might have just pushed the machine over the
>> edge, but it was probably close to it already. I've noticed a more general
>> problem that there are GFP_KERNEL allocations failing from kswapd. Those
>> could probably use be __GFP_NOMEMALLOC (or scoped variant, is there one?)
>> since it's the case of "allocating memory to free memory". Or use mempools
>> if the progress (success will lead to freeing memory) is really guaranteed.
>>
>> Another interesting data point could be to see if traditional reclaim
>> behaves any better on this machine than MGLRU. I saw in the config:
>>
>> CONFIG_LRU_GEN=y
>> CONFIG_LRU_GEN_ENABLED=y
>>
>> So disabling at least the second one would revert to the traditional reclaim
>> and we could see if it handles such a constrained system better or not.
>
> I set RANDOM_KMALLOC_CACHES=n and LRU_GEN_ENABLED=n but still hit the issue.
>
> dmesg looks a bit different (unpatched v6.10-rc2).
What caught my eye, though it's also in some of the previous dmesgs with
MGLRU, is that in one case there's:
DMA free:0kB
That means many allocations went through that are allowed to just ignore all
reserves, and depleted everything. That would mean __GFP_MEMALLOC or
PF_MEMALLOC, which I suggested earlier for the GFP_KERNEL failure, is being
used somewhere, but not leading to the expected memory freeing.
> Regards,
> Erhard
2024年6月6日(木) 8:42 Yosry Ahmed <[email protected]>:
> I think there are multiple ways to go forward here:
> (a) Make the number of zpools a config option, leave the default as
> 32, but allow special use cases to set it to 1 or similar. This is
> probably not preferable because it is not clear to users how to set
> it, but the idea is that no one will have to set it except special use
> cases such as Erhard's (who will want to set it to 1 in this case).
>
> (b) Make the number of zpools scale linearly with the number of CPUs.
> Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> approach is that with a large number of CPUs, too many zpools will
> start having diminishing returns. Fragmentation will keep increasing,
> while the scalability/concurrency gains will diminish.
>
> (c) Make the number of zpools scale logarithmically with the number of
> CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> of zpools from increasing too much and close to the status quo. The
> problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> will actually give a nr_zpools > nr_cpus. So we will need to come up
> with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
>
I just posted a patch to limit the number of zpools, with some
theoretical background explained in the code comments. I believe that
scaling linearly at 2 * nr_cpus is sufficient to reduce contention, but
the scale could be reduced further; it is unlikely that all CPUs are
allocating/freeing in zswap at the same time.
How many concurrent accesses were the original 32 zpools supposed to
handle? I think it was for 16 CPUs or more, or would nr_cpus/4 be
enough?
--
<[email protected]>
On Thu, Jun 6, 2024 at 10:14 AM Takero Funaki <[email protected]> wrote:
>
> 2024年6月6日(木) 8:42 Yosry Ahmed <[email protected]>:
>
> > I think there are multiple ways to go forward here:
> > (a) Make the number of zpools a config option, leave the default as
> > 32, but allow special use cases to set it to 1 or similar. This is
> > probably not preferable because it is not clear to users how to set
> > it, but the idea is that no one will have to set it except special use
> > cases such as Erhard's (who will want to set it to 1 in this case).
> >
> > (b) Make the number of zpools scale linearly with the number of CPUs.
> > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > approach is that with a large number of CPUs, too many zpools will
> > start having diminishing returns. Fragmentation will keep increasing,
> > while the scalability/concurrency gains will diminish.
> >
> > (c) Make the number of zpools scale logarithmically with the number of
> > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > of zpools from increasing too much and close to the status quo. The
> > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> >
>
> I just posted a patch to limit the number of zpools, with some
> theoretical background explained in the code comments. I believe that
> 2 * CPU linearly is sufficient to reduce contention, but the scale can
> be reduced further. All CPUs are trying to allocate/free zswap is
> unlikely to happen.
> How many concurrent accesses were the original 32 zpools supposed to
> handle? I think it was for 16 cpu or more. or nr_cpus/4 would be
> enough?
We use 32 zpools on machines with 100s of CPUs. Two zpools per CPU is
overkill imo.
I have further comments that I will leave on the patch, but I mainly
think this should be driven by real data, not theoretical possibility
of lock contention.
>
> --
>
> <[email protected]>
On Thu, Jun 6, 2024 at 11:42 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Jun 6, 2024 at 10:14 AM Takero Funaki <[email protected]> wrote:
> >
> > 2024年6月6日(木) 8:42 Yosry Ahmed <[email protected]>:
> >
> > > I think there are multiple ways to go forward here:
> > > (a) Make the number of zpools a config option, leave the default as
> > > 32, but allow special use cases to set it to 1 or similar. This is
> > > probably not preferable because it is not clear to users how to set
> > > it, but the idea is that no one will have to set it except special use
> > > cases such as Erhard's (who will want to set it to 1 in this case).
> > >
> > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > approach is that with a large number of CPUs, too many zpools will
> > > start having diminishing returns. Fragmentation will keep increasing,
> > > while the scalability/concurrency gains will diminish.
> > >
> > > (c) Make the number of zpools scale logarithmically with the number of
> > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > of zpools from increasing too much and close to the status quo. The
> > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > >
> >
> > I just posted a patch to limit the number of zpools, with some
> > theoretical background explained in the code comments. I believe that
> > 2 * CPU linearly is sufficient to reduce contention, but the scale can
> > be reduced further. All CPUs are trying to allocate/free zswap is
> > unlikely to happen.
> > How many concurrent accesses were the original 32 zpools supposed to
> > handle? I think it was for 16 cpu or more. or nr_cpus/4 would be
> > enough?
>
> We use 32 zpools on machines with 100s of CPUs. Two zpools per CPU is
> an overkill imo.
Not to choose a camp; just a friendly note on why I strongly disagree
with the N zpools per CPU approach:
1. It is fundamentally flawed to assume the system is linear;
2. Nonlinear systems usually have diminishing returns.
For Google data centers, using nr_cpus as the scaling factor had long
passed the acceptable ROI threshold. Per-CPU data, especially when
compounded per memcg or even per process, is probably the number-one
overhead in terms of DRAM efficiency.
> I have further comments that I will leave on the patch, but I mainly
> think this should be driven by real data, not theoretical possibility
> of lock contention.
On Thu, Jun 6, 2024 at 10:55 AM Yu Zhao <[email protected]> wrote:
>
> On Thu, Jun 6, 2024 at 11:42 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Jun 6, 2024 at 10:14 AM Takero Funaki <[email protected]> wrote:
> > >
> > > 2024年6月6日(木) 8:42 Yosry Ahmed <[email protected]>:
> > >
> > > > I think there are multiple ways to go forward here:
> > > > (a) Make the number of zpools a config option, leave the default as
> > > > 32, but allow special use cases to set it to 1 or similar. This is
> > > > probably not preferable because it is not clear to users how to set
> > > > it, but the idea is that no one will have to set it except special use
> > > > cases such as Erhard's (who will want to set it to 1 in this case).
> > > >
> > > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > > approach is that with a large number of CPUs, too many zpools will
> > > > start having diminishing returns. Fragmentation will keep increasing,
> > > > while the scalability/concurrency gains will diminish.
> > > >
> > > > (c) Make the number of zpools scale logarithmically with the number of
> > > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > > of zpools from increasing too much and close to the status quo. The
> > > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > > >
> > >
> > > I just posted a patch to limit the number of zpools, with some
> > > theoretical background explained in the code comments. I believe that
> > > 2 * CPU linearly is sufficient to reduce contention, but the scale can
> > > be reduced further. All CPUs are trying to allocate/free zswap is
> > > unlikely to happen.
> > > How many concurrent accesses were the original 32 zpools supposed to
> > > handle? I think it was for 16 cpu or more. or nr_cpus/4 would be
> > > enough?
> >
> > We use 32 zpools on machines with 100s of CPUs. Two zpools per CPU is
> > an overkill imo.
>
> Not to choose a camp; just a friendly note on why I strongly disagree
> with the N zpools per CPU approach:
> 1. It is fundamentally flawed to assume the system is linear;
> 2. Nonlinear systems usually have diminishing returns.
>
> For Google data centers, using nr_cpus as the scaling factor had long
> passed the acceptable ROI threshold. Per-CPU data, especially when
> compounded per memcg or even per process, is probably the number-one
> overhead in terms of DRAM efficiency.
100% agreed. If you look at option (b) above, I specifically called
out that scaling the number of zpools linearly with the number of
CPUs has diminishing returns :)
On Thu, Jun 6, 2024 at 6:43 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> On (24/06/06 12:46), Chengming Zhou wrote:
> > >> Agree, I think we should try to improve locking scalability of zsmalloc.
> > >> I have some thoughts to share, no code or test data yet:
> > >>
> > >> 1. First, we can change the pool global lock to per-class lock, which
> > >> is more fine-grained.
> > >
> > > Commit c0547d0b6a4b6 "zsmalloc: consolidate zs_pool's migrate_lock
> > > and size_class's locks" [1] claimed no significant difference
> > > between class->lock and pool->lock.
> >
> > Ok, I haven't looked into the history much, that seems preparation of trying
> > to introduce reclaim in the zsmalloc? Not sure. But now with the reclaim code
> > in zsmalloc has gone, should we change back to the per-class lock? Which is
>
> Well, the point that commit made was that Nhat (and Johannes?) were
> unable to detect any impact of pool->lock on a variety of cases. So
> we went on with code simplification.
Yeah, we benchmarked it before zsmalloc writeback was introduced (the
patch to remove class lock was a prep patch of the series). We weren't
able to detect any regression at the time with just using a global
pool lock.
>
> > obviously more fine-grained than the pool lock. Actually, I have just done it,
> > will test to get some data later.
>
> Thanks, we'll need data on this. I'm happy to take the patch, but
> jumping back and forth between class->lock and pool->lock merely
> "for obvious reasons" is not what I'm extremely excited about.
FWIW, I do think it'd be nice if we can make the locking more granular
- the pool lock now is essentially a global lock, and we're just
getting around that by replicating the (z)pools themselves.
Personally, I'm not super convinced about class locks. We're
essentially relying on the post-compression size of the data to
load-balance the queries - I can imagine a scenario where a workload
has a concentrated distribution of post-compression data (i.e. its
pages are compressed to similar-ish sizes), and we're once again
contending for a (few) lock(s).
That said, I'll let the data tell the story :) We don't need a perfect
solution, just a good enough solution for now.
On (24/06/07 10:40), Nhat Pham wrote:
> Personally, I'm not super convinced about class locks. We're
> essentially relying on the post-compression size of the data to
> load-balance the queries - I can imagine a scenario where a workload
> has a concentrated distribution of post-compression data (i.e its
> pages are compressed to similar-ish sizes), and we're once again
> contending for a (few) lock(s) again.
>
> That said, I'll let the data tell the story :) We don't need a perfect
> solution, just a good enough solution for now.
Speaking of size class locks:
One thing to mention is that zsmalloc merges size classes; we have never
documented/claimed 256 size classes, the actual number is always much,
much lower. Each such "cluster" (merged size classes) holds a range of
object sizes (e.g. 3504-3584 bytes). The wider the cluster's size range,
the more likely (size class) lock contention becomes.
Setting CONFIG_ZSMALLOC_CHAIN_SIZE to 10 or higher configures the
zsmalloc pool with more size class clusters (which means that clusters
hold narrower size intervals).
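E.g. (illustrative config fragment):

	CONFIG_ZSMALLOC_CHAIN_SIZE=10

gives more clusters, each holding a narrower range of object sizes.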