2016-12-15 23:06:45

by Nils Holland

Subject: OOM: Better, but still there on 4.9

Hi folks,

I've been reading quite a bit about OOM-related issues in recent
kernels, and as I've been experiencing some of these myself for quite
a while, I thought I'd send in my report in the hope that it is
useful. Of course, if there's ever anything to test, like some patches
or something, I'd be glad to help!

Now, my situation: I have two different x86 machines, both equipped
with 4 GB of RAM, running 32-bit kernels. I had never really observed
any OOM issues until kernel 4.8, but with that kernel, it was enough
to unpack a bigger source tarball (like the firefox sources) on a
freshly booted system and subsequently compile it: with very high
certainty, the OOM killer would kick in during the compile, killing a
whole lot of processes, with the machine then becoming unresponsive
and finally ending in a kernel panic.

With kernel 4.9, these OOM events seem to be somewhat harder to
trigger, but in most cases, unpacking some larger tarballs and then
launching a build process on a freshly booted system without many
other processes (not even X) running seems to do the trick. However,
the consequences don't seem to be as severe as they were in 4.8: The
machines did, in fact, become unresponsive in the sense that logging
in locally (after I've been thrown out when my bash gets killed by the
OOM reaper) is no longer possible, and sshing into the machine also
doesn't work anymore (most likely because sshd has also been killed),
but in all cases the machine was still pingable and the magic SysRq
key combo was still working. I've not yet seen a single real kernel
panic as I did with 4.8; still, the only way to get the machine back
into action was a hard reboot via the magic SysRq commands or a power
cycle.

For reference, I'm attaching an OOM report I've observed under 4.9 at
the end of this message. This one happened after I had been using the
machine in a normal fashion for a short time, and then had portage,
Gentoo's build system, unpack the firefox sources - compiling hadn't
even started yet at this point. Oh yes, I'm using btrfs, in case that
might make a difference - at least I've seen some references to it in
other similar reports I've found on the web.

Of course, none of these are workloads that are new / special in any
way - prior to 4.8, I never experienced any issues doing the exact
same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel: [<c110414d>] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel: [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel: [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel: [<c132a57f>] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel: [<c130f7c3>] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel: [<c130fe1a>] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel: [<c13132f1>] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel: [<c132fcdd>] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel: [<c1354fe6>] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [<c133dbb5>] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel: [<c1150597>] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel: [<c1345005>] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel: [<c13459fc>] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel: [<c1361795>] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel: [<c1346d51>] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel: [<c13626e4>] writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel: [<c1364548>] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel: [<c1365c4c>] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel: [<c1342cd0>] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel: [<c133ff0f>] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel: [<c110ff85>] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel: [<c1190a95>] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel: [<c119112e>] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel: [<c119145a>] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel: [<c1191de8>] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel: [<c104fd34>] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel: [<c1050b4f>] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel: [<c1050b20>] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel: [<c10552e7>] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel: [<c1055250>] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel: [<c19b5cb7>] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
Dec 15 19:02:18 teela kernel: Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
Dec 15 19:02:18 teela kernel: HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 19:02:18 teela kernel: DMA: 0*4kB 2*8kB (ME) 5*16kB (UME) 13*32kB (UM) 11*64kB (UME) 3*128kB (UM) 1*256kB (M) 2*512kB (E) 1*1024kB (M) 0*2048kB 0*4096kB = 3904kB
Dec 15 19:02:18 teela kernel: Normal: 27*4kB (ME) 25*8kB (UME) 442*16kB (UME) 189*32kB (UME) 411*64kB (UME) 13*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 41396kB
Dec 15 19:02:18 teela kernel: HighMem: 1*4kB (M) 11*8kB (U) 2*16kB (U) 3*32kB (UM) 16*64kB (U) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 190*4096kB (UM) = 781660kB
Dec 15 19:02:18 teela kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 15 19:02:18 teela kernel: 556515 total pagecache pages
Dec 15 19:02:18 teela kernel: 0 pages in swap cache
Dec 15 19:02:18 teela kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 19:02:18 teela kernel: Free swap = 8191996kB
Dec 15 19:02:18 teela kernel: Total swap = 8191996kB
Dec 15 19:02:18 teela kernel: 909598 pages RAM
Dec 15 19:02:18 teela kernel: 681346 pages HighMem/MovableOnly
Dec 15 19:02:18 teela kernel: 15211 pages reserved
Dec 15 19:02:18 teela kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 15 19:02:18 teela kernel: [ 1888] 0 1888 6161 1112 10 3 0 0 systemd-journal
Dec 15 19:02:18 teela kernel: [ 2508] 0 2508 2959 945 6 3 0 -1000 systemd-udevd
Dec 15 19:02:18 teela kernel: [ 2610] 105 2610 3870 899 8 3 0 0 systemd-timesyn
Dec 15 19:02:18 teela kernel: [ 2613] 0 2613 6300 948 10 3 0 0 rsyslogd
Dec 15 19:02:18 teela kernel: [ 2615] 88 2615 1158 568 6 3 0 0 nullmailer-send
Dec 15 19:02:18 teela kernel: [ 2618] 0 2618 1514 1027 7 3 0 0 systemd-logind
Dec 15 19:02:18 teela kernel: [ 2619] 101 2619 1266 847 6 3 0 -900 dbus-daemon
Dec 15 19:02:18 teela kernel: [ 2620] 0 2620 622 300 5 3 0 0 atd
Dec 15 19:02:18 teela kernel: [ 2628] 0 2628 26097 3193 27 3 0 0 NetworkManager
Dec 15 19:02:18 teela kernel: [ 2647] 0 2647 1511 458 5 3 0 0 fcron
Dec 15 19:02:18 teela kernel: [ 2673] 0 2673 750 543 6 3 0 0 dhcpcd
Dec 15 19:02:18 teela kernel: [ 2676] 0 2676 638 447 5 3 0 0 vnstatd
Dec 15 19:02:18 teela kernel: [ 2690] 0 2690 1457 1061 6 3 0 -1000 sshd
Dec 15 19:02:18 teela kernel: [ 2716] 106 2716 16384 4239 20 3 0 0 polkitd
Dec 15 19:02:18 teela kernel: [ 2717] 0 2717 2145 1360 7 3 0 0 wpa_supplicant
Dec 15 19:02:18 teela kernel: [ 2947] 0 2947 1794 775 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 2950] 0 2950 1831 913 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 2954] 0 2954 37411 36100 79 3 0 0 emerge
Dec 15 19:02:18 teela kernel: [ 2970] 0 2970 1152 507 6 3 0 0 agetty
Dec 15 19:02:18 teela kernel: [ 3897] 250 3897 548 358 5 3 0 0 sandbox
Dec 15 19:02:18 teela kernel: [ 3906] 250 3906 2625 1584 8 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3926] 250 3926 2657 1370 7 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3935] 250 3935 17160 16891 37 3 0 0 xz
Dec 15 19:02:18 teela kernel: [ 3936] 250 3936 799 510 5 3 0 0 tar
Dec 15 19:02:18 teela kernel: [ 4117] 0 4117 2598 1389 9 3 0 0 sshd
Dec 15 19:02:18 teela kernel: [ 4119] 0 4119 1964 1243 7 3 0 0 systemd
Dec 15 19:02:18 teela kernel: [ 4144] 0 4144 6645 632 10 3 0 0 (sd-pam)
Dec 15 19:02:18 teela kernel: [ 4163] 0 4163 1830 909 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 4182] 0 4182 1695 684 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 4221] 0 4221 1831 893 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: Out of memory: Kill process 2954 (emerge) score 11 or sacrifice child
Dec 15 19:02:18 teela kernel: Killed process 3897 (sandbox) total-vm:2192kB, anon-rss:128kB, file-rss:1304kB, shmem-rss:0kB
Dec 15 19:02:18 teela kernel: bash invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
Dec 15 19:02:18 teela kernel: bash cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 0 PID: 4221 Comm: bash Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: c5187d68 c142bcce c5187e98 00000000 c5187d98 c1163332 00000000 00200282
Dec 15 19:02:18 teela kernel: c5187d98 c1431876 c5187d9c e7fb0b00 e7fa2900 e7fa2900 c1b58785 c5187e98
Dec 15 19:02:18 teela kernel: c5187ddc c110795f c1043895 c5187dc8 c11075c7 00000007 00000000 00000000
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [<c110b989>] __alloc_pages_nodemask+0xb79/0xc80
Dec 15 19:02:18 teela kernel: [<c1038a05>] copy_process.part.51+0xe5/0x1420
Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [<c1039ee7>] _do_fork+0xc7/0x360
Dec 15 19:02:18 teela kernel: [<c1182a4b>] ? fd_install+0x1b/0x20
Dec 15 19:02:18 teela kernel: [<c103a247>] SyS_clone+0x27/0x30
Dec 15 19:02:18 teela kernel: [<c10018bc>] do_fast_syscall_32+0x7c/0x130
Dec 15 19:02:18 teela kernel: [<c19b5d2b>] sysenter_past_esp+0x40/0x6a
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:57050 inactive_anon:90 isolated_anon:0
active_file:274371 inactive_file:281954 isolated_file:0
unevictable:0 dirty:616 writeback:0 unstable:0
slab_reclaimable:40669 slab_unreclaimable:17758
mapped:7370 shmem:202 pagetables:346 bounce:0
free:208199 free_pcp:501 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:228200kB inactive_anon:360kB active_file:1097484kB inactive_file:1127816kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29480kB dirty:2464kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3904kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7356kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3208kB slab_unreclaimable:1448kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
Dec 15 19:02:18 teela kernel: Normal free:41280kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532796kB inactive_file:44kB unevictable:0kB writepending:144kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159468kB slab_unreclaimable:69584kB kernel_stack:1104kB pagetables:1384kB bounce:0kB free_pcp:628kB local_pcp:188kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
Dec 15 19:02:18 teela kernel: HighMem free:787612kB min:512kB low:34356kB high:68200kB active_anon:228200kB inactive_anon:360kB active_file:557332kB inactive_file:1127772kB unevictable:0kB writepending:2320kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1376kB local_pcp:648kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 19:02:18 teela kernel: DMA: 0*4kB 2*8kB (ME) 5*16kB (UME) 13*32kB (UM) 11*64kB (UME) 3*128kB (UM) 1*256kB (M) 2*512kB (E) 1*1024kB (M) 0*2048kB 0*4096kB = 3904kB
Dec 15 19:02:18 teela kernel: Normal: 26*4kB (M) 25*8kB (UM) 441*16kB (UM) 188*32kB (UM) 410*64kB (UME) 13*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 41280kB
Dec 15 19:02:18 teela kernel: HighMem: 37*4kB (UM) 21*8kB (UM) 6*16kB (UM) 8*32kB (UM) 22*64kB (UM) 5*128kB (M) 4*256kB (M) 3*512kB (M) 2*1024kB (M) 1*2048kB (M) 190*4096kB (UM) = 787612kB
Dec 15 19:02:18 teela kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 15 19:02:18 teela kernel: 556527 total pagecache pages
Dec 15 19:02:18 teela kernel: 0 pages in swap cache
Dec 15 19:02:18 teela kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 19:02:18 teela kernel: Free swap = 8191996kB
Dec 15 19:02:18 teela kernel: Total swap = 8191996kB
Dec 15 19:02:18 teela kernel: 909598 pages RAM
Dec 15 19:02:18 teela kernel: 681346 pages HighMem/MovableOnly
Dec 15 19:02:18 teela kernel: 15211 pages reserved
Dec 15 19:02:18 teela kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 15 19:02:18 teela kernel: [ 1888] 0 1888 6161 1112 10 3 0 0 systemd-journal
Dec 15 19:02:18 teela kernel: [ 2508] 0 2508 2959 945 6 3 0 -1000 systemd-udevd
Dec 15 19:02:18 teela kernel: [ 2610] 105 2610 3870 899 8 3 0 0 systemd-timesyn
Dec 15 19:02:18 teela kernel: [ 2613] 0 2613 6300 951 10 3 0 0 rsyslogd
Dec 15 19:02:18 teela kernel: [ 2615] 88 2615 1158 568 6 3 0 0 nullmailer-send
Dec 15 19:02:18 teela kernel: [ 2618] 0 2618 1514 1027 7 3 0 0 systemd-logind
Dec 15 19:02:18 teela kernel: [ 2619] 101 2619 1266 847 6 3 0 -900 dbus-daemon
Dec 15 19:02:18 teela kernel: [ 2620] 0 2620 622 300 5 3 0 0 atd
Dec 15 19:02:18 teela kernel: [ 2628] 0 2628 26097 3193 27 3 0 0 NetworkManager
Dec 15 19:02:18 teela kernel: [ 2647] 0 2647 1511 458 5 3 0 0 fcron
Dec 15 19:02:18 teela kernel: [ 2673] 0 2673 750 543 6 3 0 0 dhcpcd
Dec 15 19:02:18 teela kernel: [ 2676] 0 2676 638 447 5 3 0 0 vnstatd
Dec 15 19:02:18 teela kernel: [ 2690] 0 2690 1457 1061 6 3 0 -1000 sshd
Dec 15 19:02:18 teela kernel: [ 2716] 106 2716 16384 4239 20 3 0 0 polkitd
Dec 15 19:02:18 teela kernel: [ 2717] 0 2717 2145 1360 7 3 0 0 wpa_supplicant
Dec 15 19:02:18 teela kernel: [ 2947] 0 2947 1794 775 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 2950] 0 2950 1831 913 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 2954] 0 2954 37411 36100 79 3 0 0 emerge
Dec 15 19:02:18 teela kernel: [ 2970] 0 2970 1152 507 6 3 0 0 agetty
Dec 15 19:02:18 teela kernel: [ 3906] 250 3906 2625 1584 8 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3926] 250 3926 2657 1370 7 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3935] 250 3935 17160 16891 37 3 0 0 xz
Dec 15 19:02:18 teela kernel: [ 3936] 250 3936 799 510 5 3 0 0 tar
Dec 15 19:02:18 teela kernel: [ 4117] 0 4117 2598 1389 9 3 0 0 sshd
Dec 15 19:02:18 teela kernel: [ 4119] 0 4119 1964 1243 7 3 0 0 systemd
Dec 15 19:02:18 teela kernel: [ 4144] 0 4144 6645 632 10 3 0 0 (sd-pam)
Dec 15 19:02:18 teela kernel: [ 4163] 0 4163 1830 909 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 4182] 0 4182 1695 684 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 4221] 0 4221 1831 893 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: Out of memory: Kill process 2954 (emerge) score 11 or sacrifice child
Dec 15 19:02:18 teela kernel: Killed process 2954 (emerge) total-vm:149644kB, anon-rss:137136kB, file-rss:7264kB, shmem-rss:0kB
Dec 15 19:02:18 teela kernel: bash invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
Dec 15 19:02:18 teela kernel: bash cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 0 PID: 4221 Comm: bash Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: c5187d68 c142bcce c5187e98 00000000 c5187d98 c1163332 00000000 00000282
Dec 15 19:02:18 teela kernel: c5187d98 c1431876 c5187d9c e7f9e100 e7f947c0 e7f947c0 c1b58785 c5187e98
Dec 15 19:02:18 teela kernel: c5187ddc c110795f c1043895 c5187dc8 c11075c7 00000007 00000000 00000000
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [<c110b989>] __alloc_pages_nodemask+0xb79/0xc80
Dec 15 19:02:18 teela kernel: [<c1038a05>] copy_process.part.51+0xe5/0x1420
Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [<c1039ee7>] _do_fork+0xc7/0x360
Dec 15 19:02:18 teela kernel: [<c1182a4b>] ? fd_install+0x1b/0x20
Dec 15 19:02:18 teela kernel: [<c103a247>] SyS_clone+0x27/0x30
Dec 15 19:02:18 teela kernel: [<c10018bc>] do_fast_syscall_32+0x7c/0x130
Dec 15 19:02:18 teela kernel: [<c19b5d2b>] sysenter_past_esp+0x40/0x6a
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:22769 inactive_anon:90 isolated_anon:0
active_file:274396 inactive_file:281929 isolated_file:0
unevictable:0 dirty:616 writeback:0 unstable:0
slab_reclaimable:40669 slab_unreclaimable:17741
mapped:6595 shmem:202 pagetables:271 bounce:0
free:242474 free_pcp:608 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:91076kB inactive_anon:360kB active_file:1097584kB inactive_file:1127716kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:26380kB dirty:2464kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 108544kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3904kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7356kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3208kB slab_unreclaimable:1448kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
Dec 15 19:02:18 teela kernel: Normal free:41280kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532796kB inactive_file:44kB unevictable:0kB writepending:144kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159468kB slab_unreclaimable:69516kB kernel_stack:1104kB pagetables:1084kB bounce:0kB free_pcp:1048kB local_pcp:608kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
Dec 15 19:02:18 teela kernel: HighMem free:924712kB min:512kB low:34356kB high:68200kB active_anon:91076kB inactive_anon:360kB active_file:557432kB inactive_file:1127672kB unevictable:0kB writepending:2320kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1384kB local_pcp:656kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 19:02:18 teela kernel: DMA: 0*4kB 2*8kB (ME) 5*16kB (UME) 13*32kB (UM) 11*64kB (UME) 3*128kB (UM) 1*256kB (M) 2*512kB (E) 1*1024kB (M) 0*2048kB 0*4096kB = 3904kB
Dec 15 19:02:18 teela kernel: Normal: 26*4kB (M) 26*8kB (UM) 441*16kB (UM) 188*32kB (UM) 410*64kB (UME) 13*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 41288kB
Dec 15 19:02:18 teela kernel: HighMem: 1518*4kB (UM) 608*8kB (UM) 155*16kB (UM) 67*32kB (UM) 34*64kB (UM) 6*128kB (M) 2*256kB (M) 3*512kB (M) 1*1024kB (M) 43*2048kB (M) 199*4096kB (UM) = 924744kB
Dec 15 19:02:18 teela kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 15 19:02:18 teela kernel: 556527 total pagecache pages
Dec 15 19:02:18 teela kernel: 0 pages in swap cache
Dec 15 19:02:18 teela kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 19:02:18 teela kernel: Free swap = 8191996kB
Dec 15 19:02:18 teela kernel: Total swap = 8191996kB
Dec 15 19:02:18 teela kernel: 909598 pages RAM
Dec 15 19:02:18 teela kernel: 681346 pages HighMem/MovableOnly
Dec 15 19:02:18 teela kernel: 15211 pages reserved
Dec 15 19:02:18 teela kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 15 19:02:18 teela kernel: [ 1888] 0 1888 6161 1112 10 3 0 0 systemd-journal
Dec 15 19:02:18 teela kernel: [ 2508] 0 2508 2959 945 6 3 0 -1000 systemd-udevd
Dec 15 19:02:18 teela kernel: [ 2610] 105 2610 3870 899 8 3 0 0 systemd-timesyn
Dec 15 19:02:18 teela kernel: [ 2613] 0 2613 6300 951 10 3 0 0 rsyslogd
Dec 15 19:02:18 teela kernel: [ 2615] 88 2615 1158 568 6 3 0 0 nullmailer-send
Dec 15 19:02:18 teela kernel: [ 2618] 0 2618 1514 1027 7 3 0 0 systemd-logind
Dec 15 19:02:18 teela kernel: [ 2619] 101 2619 1266 847 6 3 0 -900 dbus-daemon
Dec 15 19:02:18 teela kernel: [ 2620] 0 2620 622 300 5 3 0 0 atd
Dec 15 19:02:18 teela kernel: [ 2628] 0 2628 26097 3193 27 3 0 0 NetworkManager
Dec 15 19:02:18 teela kernel: [ 2647] 0 2647 1511 458 5 3 0 0 fcron
Dec 15 19:02:18 teela kernel: [ 2673] 0 2673 750 543 6 3 0 0 dhcpcd
Dec 15 19:02:18 teela kernel: [ 2676] 0 2676 638 447 5 3 0 0 vnstatd
Dec 15 19:02:18 teela kernel: [ 2690] 0 2690 1457 1061 6 3 0 -1000 sshd
Dec 15 19:02:18 teela kernel: [ 2716] 106 2716 16384 4239 20 3 0 0 polkitd
Dec 15 19:02:18 teela kernel: [ 2717] 0 2717 2145 1360 7 3 0 0 wpa_supplicant
Dec 15 19:02:18 teela kernel: [ 2947] 0 2947 1794 775 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 2950] 0 2950 1831 915 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 2970] 0 2970 1152 507 6 3 0 0 agetty
Dec 15 19:02:18 teela kernel: [ 3906] 250 3906 2625 1584 8 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3926] 250 3926 2657 1370 7 3 0 0 ebuild.sh
Dec 15 19:02:18 teela kernel: [ 3935] 250 3935 17160 16891 37 3 0 0 xz
Dec 15 19:02:18 teela kernel: [ 3936] 250 3936 799 510 5 3 0 0 tar
Dec 15 19:02:18 teela kernel: [ 4117] 0 4117 2598 1389 9 3 0 0 sshd
Dec 15 19:02:18 teela kernel: [ 4119] 0 4119 1964 1243 7 3 0 0 systemd
Dec 15 19:02:18 teela kernel: [ 4144] 0 4144 6645 632 10 3 0 0 (sd-pam)
Dec 15 19:02:18 teela kernel: [ 4163] 0 4163 1830 909 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: [ 4182] 0 4182 1695 684 7 3 0 0 screen
Dec 15 19:02:18 teela kernel: [ 4221] 0 4221 1831 893 7 3 0 0 bash
Dec 15 19:02:18 teela kernel: Out of memory: Kill process 3935 (xz) score 5 or sacrifice child
Dec 15 19:02:18 teela kernel: Killed process 3935 (xz) total-vm:68640kB, anon-rss:65928kB, file-rss:1636kB, shmem-rss:0kB
Dec 15 19:02:18 teela kernel: ebuild.sh invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
Dec 15 19:02:18 teela kernel: ebuild.sh cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 0 PID: 3926 Comm: ebuild.sh Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: d3473d68 c142bcce d3473e98 00000000 d3473d98 c1163332 00000000 00000282
Dec 15 19:02:18 teela kernel: d3473d98 c1431876 d3473d9c f12463c0 f25ea900 f25ea900 c1b58785 d3473e98
Dec 15 19:02:18 teela kernel: d3473ddc c110795f c1043895 d3473dc8 c11075c7 00000006 00000000 00000000
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [<c110b989>] __alloc_pages_nodemask+0xb79/0xc80
Dec 15 19:02:18 teela kernel: [<c1151a00>] ? __kmem_cache_shutdown+0x220/0x290
Dec 15 19:02:18 teela kernel: [<c1038a05>] copy_process.part.51+0xe5/0x1420
Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [<c1039ee7>] _do_fork+0xc7/0x360
Dec 15 19:02:18 teela kernel: [<c1438c64>] ? _copy_to_user+0x44/0x60
Dec 15 19:02:18 teela kernel: [<c103a247>] SyS_clone+0x27/0x30
Dec 15 19:02:18 teela kernel: [<c10018bc>] do_fast_syscall_32+0x7c/0x130
Dec 15 19:02:18 teela kernel: [<c19b5d2b>] sysenter_past_esp+0x40/0x6a
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:6238 inactive_anon:90 isolated_anon:0
active_file:274469 inactive_file:281903 isolated_file:0
unevictable:0 dirty:557 writeback:255 unstable:0
slab_reclaimable:40673 slab_unreclaimable:17738
mapped:6479 shmem:202 pagetables:238 bounce:0
free:258997 free_pcp:617 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:24952kB inactive_anon:360kB active_file:1097876kB inactive_file:1127612kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:25916kB dirty:2228kB writeback:1020kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 6144kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3904kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7356kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3208kB slab_unreclaimable:1448kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
Dec 15 19:02:18 teela kernel: Normal free:41272kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532844kB inactive_file:48kB unevictable:0kB writepending:400kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159484kB slab_unreclaimable:69504kB kernel_stack:1096kB pagetables:952kB bounce:0kB free_pcp:1128kB local_pcp:588kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
Dec 15 19:02:18 teela kernel: HighMem free:990812kB min:512kB low:34356kB high:68200kB active_anon:24952kB inactive_anon:360kB active_file:557676kB inactive_file:1127564kB unevictable:0kB writepending:2848kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1340kB local_pcp:688kB free_cma:0kB
Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 19:02:18 teela kernel: DMA: 0*4kB 2*8kB (ME) 5*16kB (UME) 13*32kB (UM) 11*64kB (UME) 3*128kB (UM) 1*256kB (M) 2*512kB (E) 1*1024kB (M) 0*2048kB 0*4096kB = 3904kB
Dec 15 19:02:18 teela kernel: Normal: 30*4kB (UME) 31*8kB (UM) 437*16kB (UME) 188*32kB (UM) 410*64kB (UME) 13*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 41280kB
Dec 15 19:02:18 teela kernel: HighMem: 1621*4kB (UM) 660*8kB (UM) 184*16kB (UM) 90*32kB (UM) 41*64kB (UM) 7*128kB (M) 2*256kB (M) 3*512kB (M) 1*1024kB (M) 50*2048kB (M) 211*4096kB (UM) = 990836kB
Dec 15 19:02:18 teela kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 15 19:02:18 teela kernel: 556574 total pagecache pages
Dec 15 19:02:18 teela kernel: 0 pages in swap cache
Dec 15 19:02:18 teela kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 19:02:18 teela kernel: Free swap = 8191996kB
Dec 15 19:02:18 teela kernel: Total swap = 8191996kB
Dec 15 19:02:18 teela kernel: 909598 pages RAM
Dec 15 19:02:18 teela kernel: 681346 pages HighMem/MovableOnly
Dec 15 19:02:18 teela kernel: 15211 pages reserved
Dec 15 19:02:18 teela kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 15 19:02:18 teela kernel: [ 1888] 0 1888 6161 1143 10 3 0 0 systemd-journal
Dec 15 19:02:18 teela kernel: [ 2508] 0 2508 2959 945 6 3 0 -1000 systemd-udevd
Dec 15 19:02:18 teela kernel: [ 2610] 105 2610 3870 899 8 3 0 0 systemd-timesyn
Dec 15 19:02:18 teela kernel: [ 2613] 0 2613 6300 956 10 3 0 0 rsyslogd
Dec 15 19:02:18 teela kernel: [ 2615] 88 2615 1158 568 6 3 0 0 nullmailer-send
Dec 15 19:02:18 teela kernel: [ 2618] 0 2618 1514 1027 7 3 0 0 systemd-logind
Dec 15 19:02:18 teela kernel: [ 2619] 101 2619 1266 847 6 3 0 -900 dbus-daemon
Dec 15 19:02:18 teela kernel: [ 2620] 0 2620 622 300 5 3 0 0 atd
Dec 15 19:02:26 teela kernel: [ 2628] 0 2628 26097 3193 27 3 0 0 NetworkManager
Dec 15 19:02:26 teela kernel: [ 2647] 0 2647 1511 458 5 3 0 0 fcron
Dec 15 19:02:26 teela kernel: [ 2673] 0 2673 750 543 6 3 0 0 dhcpcd
Dec 15 19:02:26 teela kernel: [ 2676] 0 2676 638 447 5 3 0 0 vnstatd
Dec 15 19:02:26 teela kernel: [ 2690] 0 2690 1457 1061 6 3 0 -1000 sshd
Dec 15 19:02:26 teela kernel: [ 2716] 106 2716 16384 4239 20 3 0 0 polkitd
Dec 15 19:02:26 teela kernel: [ 2717] 0 2717 2145 1360 7 3 0 0 wpa_supplicant
Dec 15 19:02:26 teela kernel: [ 2947] 0 2947 1794 775 7 3 0 0 screen
Dec 15 19:02:26 teela kernel: [ 2950] 0 2950 1831 915 7 3 0 0 bash
Dec 15 19:02:26 teela kernel: [ 2970] 0 2970 1152 507 6 3 0 0 agetty
Dec 15 19:02:26 teela kernel: [ 3906] 250 3906 2625 1584 8 3 0 0 ebuild.sh
Dec 15 19:02:26 teela kernel: [ 3926] 250 3926 2657 1377 7 3 0 0 ebuild.sh
Dec 15 19:02:26 teela kernel: [ 4117] 0 4117 2598 1389 9 3 0 0 sshd
Dec 15 19:02:26 teela kernel: [ 4119] 0 4119 1964 1243 7 3 0 0 systemd
Dec 15 19:02:26 teela kernel: [ 4144] 0 4144 6645 632 10 3 0 0 (sd-pam)
Dec 15 19:02:26 teela kernel: [ 4163] 0 4163 1830 909 7 3 0 0 bash
Dec 15 19:02:26 teela kernel: [ 4182] 0 4182 1695 684 7 3 0 0 screen
Dec 15 19:02:26 teela kernel: [ 4221] 0 4221 1831 893 7 3 0 0 bash
Dec 15 19:02:26 teela kernel: [ 4225] 0 4225 1831 400 6 3 0 0 bash

Greetings
Nils


2016-12-16 07:39:55

by Michal Hocko

Subject: Re: OOM: Better, but still there on 4.9

[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]
> Of course, none of these are workloads that are new / special in any
> way - prior to 4.8, I never experienced any issues doing the exact
> same things.
>
> Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
> Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
> Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
> Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
> Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
> Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
> Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
> Dec 15 19:02:18 teela kernel: Call Trace:
> Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
> Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
> Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
> Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
> Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
> Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
> Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
> Dec 15 19:02:18 teela kernel: [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
> Dec 15 19:02:18 teela kernel: [<c110414d>] pagecache_get_page+0xad/0x270
> Dec 15 19:02:18 teela kernel: [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
> Dec 15 19:02:18 teela kernel: [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
> Dec 15 19:02:18 teela kernel: [<c132a57f>] btrfs_alloc_tree_block+0x1ef/0x5f0
> Dec 15 19:02:18 teela kernel: [<c130f7c3>] __btrfs_cow_block+0x143/0x5f0
> Dec 15 19:02:18 teela kernel: [<c130fe1a>] btrfs_cow_block+0x13a/0x220
> Dec 15 19:02:18 teela kernel: [<c13132f1>] btrfs_search_slot+0x1d1/0x870
> Dec 15 19:02:18 teela kernel: [<c132fcdd>] btrfs_lookup_file_extent+0x4d/0x60
> Dec 15 19:02:18 teela kernel: [<c1354fe6>] __btrfs_drop_extents+0x176/0x1070
> Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
> Dec 15 19:02:18 teela kernel: [<c133dbb5>] ? start_transaction+0x65/0x4b0
> Dec 15 19:02:18 teela kernel: [<c1150597>] ? __kmalloc+0x147/0x1e0
> Dec 15 19:02:18 teela kernel: [<c1345005>] cow_file_range_inline+0x215/0x6b0
> Dec 15 19:02:18 teela kernel: [<c13459fc>] cow_file_range.isra.49+0x55c/0x6d0
> Dec 15 19:02:18 teela kernel: [<c1361795>] ? lock_extent_bits+0x75/0x1e0
> Dec 15 19:02:18 teela kernel: [<c1346d51>] run_delalloc_range+0x441/0x470
> Dec 15 19:02:18 teela kernel: [<c13626e4>] writepage_delalloc.isra.47+0x144/0x1e0
> Dec 15 19:02:18 teela kernel: [<c1364548>] __extent_writepage+0xd8/0x2b0
> Dec 15 19:02:18 teela kernel: [<c1365c4c>] extent_writepages+0x25c/0x380
> Dec 15 19:02:18 teela kernel: [<c1342cd0>] ? btrfs_real_readdir+0x610/0x610
> Dec 15 19:02:18 teela kernel: [<c133ff0f>] btrfs_writepages+0x1f/0x30
> Dec 15 19:02:18 teela kernel: [<c110ff85>] do_writepages+0x15/0x40
> Dec 15 19:02:18 teela kernel: [<c1190a95>] __writeback_single_inode+0x35/0x2f0
> Dec 15 19:02:18 teela kernel: [<c119112e>] writeback_sb_inodes+0x16e/0x340
> Dec 15 19:02:18 teela kernel: [<c119145a>] wb_writeback+0xaa/0x280
> Dec 15 19:02:18 teela kernel: [<c1191de8>] wb_workfn+0xd8/0x3e0
> Dec 15 19:02:18 teela kernel: [<c104fd34>] process_one_work+0x114/0x3e0
> Dec 15 19:02:18 teela kernel: [<c1050b4f>] worker_thread+0x2f/0x4b0
> Dec 15 19:02:18 teela kernel: [<c1050b20>] ? create_worker+0x180/0x180
> Dec 15 19:02:18 teela kernel: [<c10552e7>] kthread+0x97/0xb0
> Dec 15 19:02:18 teela kernel: [<c1055250>] ? __kthread_parkme+0x60/0x60
> Dec 15 19:02:18 teela kernel: [<c19b5cb7>] ret_from_fork+0x1b/0x28
> Dec 15 19:02:18 teela kernel: Mem-Info:
> Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
> active_file:274324 inactive_file:281962 isolated_file:0

OK, so there is still some anonymous memory that could be swapped out
and quite a lot of page cache. This might be harder to reclaim because
the allocation is a GFP_NOFS request which is limited in its reclaim
capabilities. It might be possible that those pagecache pages are pinned
in some way by the filesystem.
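
(For reference, the gfp flags involved, simplified from what
include/linux/gfp.h looks like around this time - an illustrative
sketch rather than the complete definitions:

	/* GFP_KERNEL: reclaim may do IO and may call back into the fs */
	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	/* GFP_NOFS: same, minus __GFP_FS - reclaim must not recurse
	 * into filesystem code, so fs-pinned page cache is off limits */
	#define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
	/* __GFP_NOFAIL: the allocator must retry forever, the caller
	 * cannot cope with failure */

so the direct reclaim done on behalf of this request may not touch
anything that would require calling back into the filesystem.)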

> unevictable:0 dirty:649 writeback:0 unstable:0
> slab_reclaimable:40662 slab_unreclaimable:17754
> mapped:7382 shmem:202 pagetables:351 bounce:0
> free:206736 free_pcp:332 free_cma:0
> Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
> Dec 15 19:02:18 teela kernel: Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB

And this shows that there is no anonymous memory in the lowmem zone.
Note that this request cannot use the highmem zone, so no swapout
would help. So if we are not able to reclaim those pages on the file
LRU, then we are out of luck.
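
(Again a simplified sketch, not the kernel's actual gfp_zone(): the
highest zone a request may be served from follows from its gfp flags,
and GFP_NOFS carries neither __GFP_HIGHMEM nor __GFP_MOVABLE:

	static enum zone_type highest_usable_zone(gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_MOVABLE)
			return ZONE_MOVABLE;	/* typical user memory */
		if (gfp_mask & __GFP_HIGHMEM)
			return ZONE_HIGHMEM;
		return ZONE_NORMAL;		/* GFP_KERNEL, GFP_NOFS, ... */
	}

so on this 32-bit machine the ~780MB sitting free in HighMem is simply
out of reach for the allocation that triggered the OOM.)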

> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
> Dec 15 19:02:18 teela kernel: HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

That being said, the OOM killer invocation is clearly pointless and
premature. We normally do not invoke it for GFP_NOFS requests exactly
for these reasons. But this is GFP_NOFS|__GFP_NOFAIL, which behaves
differently. I am about to change that, but my last attempt [1] has to
be rethought.

Now another thing is that the __GFP_NOFAIL which has this nasty side
effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from
early transaction abort") in 4.3, so I am quite surprised that this
has shown up only in 4.8. Anyway, there might be some other changes in
btrfs which could make it more subtle.
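
(The call site in question is the one visible in the trace above:
pagecache_get_page() called from alloc_extent_buffer() with
GFP_NOFS|__GFP_NOFAIL. Schematically - from memory, not the exact
btrfs source - the request boils down to something like

	/* sketch of the tree block page allocation in the trace */
	struct page *page = find_or_create_page(mapping, index,
						GFP_NOFS | __GFP_NOFAIL);

i.e. a page cache page that must not fail while a transaction is
already open.)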

I believe the right way to go about this is to pursue what I've
started in [1]. I will try to prepare something for testing for you
today. Stay tuned. But I would be really happy if somebody from the
btrfs camp could check the NOFS aspect of this allocation. We have
already seen allocation stalls from this path quite recently [2].

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2016-12-16 15:58:29

by Michal Hocko

Subject: Re: OOM: Better, but still there on 4.9

On Fri 16-12-16 08:39:41, Michal Hocko wrote:
[...]
> That being said, the OOM killer invocation is clearly pointless and
> premature. We normally do not invoke it for GFP_NOFS requests exactly
> for these reasons. But this is GFP_NOFS|__GFP_NOFAIL, which behaves
> differently. I am about to change that, but my last attempt [1] has to
> be rethought.
>
> Now another thing is that the __GFP_NOFAIL which has this nasty side
> effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from
> early transaction abort") in 4.3, so I am quite surprised that this
> has shown up only in 4.8. Anyway, there might be some other changes in
> btrfs which could make it more subtle.
>
> I believe the right way to go about this is to pursue what I've
> started in [1]. I will try to prepare something for testing for you
> today. Stay tuned. But I would be really happy if somebody from the
> btrfs camp could check the NOFS aspect of this allocation. We have
> already seen allocation stalls from this path quite recently [2].

Could you try to run with the two following patches?

2016-12-16 15:58:43

by Michal Hocko

Subject: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

From: Michal Hocko <[email protected]>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be
quite surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some
of the existing open coded loops around the allocator to a nofail
request (and we have done that in the past), then such a change would
have a non-trivial side effect which is not obvious. Note that the
primary motivation for skipping the OOM killer is to prevent premature
invocation.
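
To illustrate the conversion in question - a schematic example rather
than any particular call site:

	struct page *page;

	/* open coded "must not fail" loop: each attempt is a plain
	 * GFP_NOFS request and therefore never invokes the OOM killer */
	do {
		page = alloc_page(GFP_NOFS);
	} while (!page);

	/* the seemingly equivalent nofail request, which with the
	 * current exception suddenly may invoke the OOM killer */
	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);

The two are supposed to behave the same, which is what makes the
hidden OOM side effect so easy to miss.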

The exception was added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the oom killer has
to be invoked, otherwise the request would loop forever. But this
argument is rather weak, because the OOM killer doesn't really
guarantee any forward progress for those exceptional cases:
- it will hardly help to form a costly order page, and it can in
turn result in a system panic because there is no oom-killable
task left in the end - I believe we certainly do not want to put
the system down just because there is a nasty driver asking for
an order-9 page with GFP_NOFAIL without realizing all the
consequences. It is much better for this request to loop forever
than to cause massive system disruption
- lowmem is also highly unlikely to be freed by the OOM killer
- a GFP_NOFS request could trigger the OOM killer while there is
still a lot of memory pinned by filesystems.

A premature OOM killer invocation is a real issue, as reported by Nils Holland:
kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Workqueue: writeback wb_workfn (flush-btrfs-1)
eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
Call Trace:
[<c142bcce>] dump_stack+0x47/0x69
[<c1163332>] dump_header+0x60/0x178
[<c1431876>] ? ___ratelimit+0x86/0xe0
[<c110795f>] oom_kill_process+0x20f/0x3d0
[<c1043895>] ? has_capability_noaudit+0x15/0x20
[<c11075c7>] ? oom_badness.part.13+0xb7/0x130
[<c1107df9>] out_of_memory+0xd9/0x260
[<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
[<c110414d>] pagecache_get_page+0xad/0x270
[<c13664a6>] alloc_extent_buffer+0x116/0x3e0
[<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
[...]
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

This is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the Normal zone, while
there is a lot of page cache which is most probably pinned by the fs
but which GFP_NOFS cannot reclaim.

This patch simply removes the __GFP_NOFAIL special case in order to
have clearer semantics without surprising side effects. Instead we do
allow nofail requests to access memory reserves to move forward, both
when the OOM killer is invoked and when it should be suppressed. In
the latter case we are more careful and only allow partial access,
because we do not want to risk depleting the whole reserves. There
are users doing GFP_NOFS|__GFP_NOFAIL heavily (e.g. __getblk_gfp ->
grow_dev_page).
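
(For context, the kind of call site meant here, paraphrased from
memory of fs/buffer.c of this era - details may differ:

	/* grow_dev_page(): the caller cannot handle failure, so the
	 * page cache page is allocated with __GFP_NOFAIL while
	 * __GFP_FS is masked off via the mapping's gfp constraint */
	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
	gfp_mask |= __GFP_NOFAIL;
	page = find_or_create_page(inode->i_mapping, index, gfp_mask);

so NOFS|NOFAIL requests are frequent enough that letting each of them
drain the full reserves would be a real risk.)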

Introduce the __alloc_pages_cpuset_fallback helper, which allows
bypassing allocation constraints for the given gfp mask while still
enforcing cpusets whenever possible.

Reported-by: Nils Holland <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 2 +-
mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++---------------------
2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
* make sure exclude 0 mask - all other users should have at least
* ___GFP_DIRECT_RECLAIM to get here.
*/
- if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+ if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
return true;

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 095e2fa286de..d6bc3e4f1a0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
}

static inline struct page *
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+ unsigned int alloc_flags,
+ const struct alloc_context *ac)
+{
+ struct page *page;
+
+ page = get_page_from_freelist(gfp_mask, order,
+ alloc_flags|ALLOC_CPUSET, ac);
+ /*
+ * fallback to ignore cpuset restriction if our nodes
+ * are depleted
+ */
+ if (!page)
+ page = get_page_from_freelist(gfp_mask, order,
+ alloc_flags, ac);
+
+ return page;
+}
+
+static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
const struct alloc_context *ac, unsigned long *did_some_progress)
{
@@ -3091,47 +3111,42 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
if (page)
goto out;

- if (!(gfp_mask & __GFP_NOFAIL)) {
- /* Coredumps can quickly deplete all memory reserves */
- if (current->flags & PF_DUMPCORE)
- goto out;
- /* The OOM killer will not help higher order allocs */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
- goto out;
- /* The OOM killer does not needlessly kill tasks for lowmem */
- if (ac->high_zoneidx < ZONE_NORMAL)
- goto out;
- if (pm_suspended_storage())
- goto out;
- /*
- * XXX: GFP_NOFS allocations should rather fail than rely on
- * other request to make a forward progress.
- * We are in an unfortunate situation where out_of_memory cannot
- * do much for this context but let's try it to at least get
- * access to memory reserved if the current task is killed (see
- * out_of_memory). Once filesystems are ready to handle allocation
- * failures more gracefully we should just bail out here.
- */
+ /* Coredumps can quickly deplete all memory reserves */
+ if (current->flags & PF_DUMPCORE)
+ goto out;
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+ /* The OOM killer does not needlessly kill tasks for lowmem */
+ if (ac->high_zoneidx < ZONE_NORMAL)
+ goto out;
+ if (pm_suspended_storage())
+ goto out;
+ /*
+ * XXX: GFP_NOFS allocations should rather fail than rely on
+ * other request to make a forward progress.
+ * We are in an unfortunate situation where out_of_memory cannot
+ * do much for this context but let's try it to at least get
+ * access to memory reserved if the current task is killed (see
+ * out_of_memory). Once filesystems are ready to handle allocation
+ * failures more gracefully we should just bail out here.
+ */
+
+ /* The OOM killer may not free memory on a specific node */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;

- /* The OOM killer may not free memory on a specific node */
- if (gfp_mask & __GFP_THISNODE)
- goto out;
- }
/* Exhausted what can be done so it's blamo time */
- if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+ if (out_of_memory(&oc)) {
*did_some_progress = 1;

- if (gfp_mask & __GFP_NOFAIL) {
- page = get_page_from_freelist(gfp_mask, order,
- ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
- /*
- * fallback to ignore cpuset restriction if our nodes
- * are depleted
- */
- if (!page)
- page = get_page_from_freelist(gfp_mask, order,
+ /*
+ * Help non-failing allocations by giving them access to memory
+ * reserves
+ */
+ if (gfp_mask & __GFP_NOFAIL)
+ page = __alloc_pages_cpuset_fallback(gfp_mask, order,
ALLOC_NO_WATERMARKS, ac);
- }
}
out:
mutex_unlock(&oom_lock);
@@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

+ /*
+ * Help non-failing allocations by giving them access to memory
+ * reserves but do not use ALLOC_NO_WATERMARKS because this
+ * could deplete whole memory reserves which would just make
+ * the situation worse
+ */
+ page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+ if (page)
+ goto got_pg;
+
cond_resched();
goto retry;
}
--
2.10.2

2016-12-16 15:58:53

by Michal Hocko

Subject: [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath

From: Michal Hocko <[email protected]>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high order
requests with __GFP_NOFAIL and without __GFP_REPEAT, and those can
fail right now. My code inspection didn't reveal any such users in the
tree, but it is true that this might lead to unexpected allocation
failures and subsequent oopses.

The handling of GFP_NOFAIL in __alloc_pages_slowpath is currently
hard to follow. There are a few special cases, but we are lacking a
catch-all place to be sure we will not miss any case where a
non-failing allocation might fail. This patch reorganizes the code a
bit and puts all those special cases under the nopage label, which is
the generic go-to-fail path. Non-failing allocations are retried,
while those that cannot retry, like non-sleeping allocations, go to
the failure point directly. This should make the code flow much easier
to follow and less error prone for future changes.

While we are there we have to move the stall check up to catch
potentially looping non-failing allocations.

Changes since v1
- do not skip direct reclaim for TIF_MEMDIE && GFP_NOFAIL as per Hillf
- do not skip __alloc_pages_may_oom for TIF_MEMDIE && GFP_NOFAIL as
per Tetsuo

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f2c9e535f7f..095e2fa286de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3640,35 +3640,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto got_pg;

/* Caller is not willing to reclaim, we can't balance anything */
- if (!can_direct_reclaim) {
- /*
- * All existing users of the __GFP_NOFAIL are blockable, so warn
- * of any new users that actually allow this type of allocation
- * to fail.
- */
- WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+ if (!can_direct_reclaim)
goto nopage;
- }

- /* Avoid recursion of direct reclaim */
- if (current->flags & PF_MEMALLOC) {
- /*
- * __GFP_NOFAIL request from this context is rather bizarre
- * because we cannot reclaim anything and only can loop waiting
- * for somebody to do a work for us.
- */
- if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
- cond_resched();
- goto retry;
- }
- goto nopage;
+ /* Make sure we know about allocations which stall for too long */
+ if (time_after(jiffies, alloc_start + stall_timeout)) {
+ warn_alloc(gfp_mask,
+ "page alloction stalls for %ums, order:%u",
+ jiffies_to_msecs(jiffies-alloc_start), order);
+ stall_timeout += 10 * HZ;
}

- /* Avoid allocations with no watermarks from looping endlessly */
- if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+ /* Avoid recursion of direct reclaim */
+ if (current->flags & PF_MEMALLOC)
goto nopage;

-
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
&did_some_progress);
@@ -3692,14 +3678,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto nopage;

- /* Make sure we know about allocations which stall for too long */
- if (time_after(jiffies, alloc_start + stall_timeout)) {
- warn_alloc(gfp_mask,
- "page allocation stalls for %ums, order:%u",
- jiffies_to_msecs(jiffies-alloc_start), order);
- stall_timeout += 10 * HZ;
- }
-
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;
@@ -3721,6 +3699,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;

+ /* Avoid allocations with no watermarks from looping endlessly */
+ if (test_thread_flag(TIF_MEMDIE))
+ goto nopage;
+
/* Retry as long as the OOM killer is making progress */
if (did_some_progress) {
no_progress_loops = 0;
@@ -3728,6 +3710,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}

nopage:
+ /*
+ * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+ * we always retry
+ */
+ if (gfp_mask & __GFP_NOFAIL) {
+ /*
+ * All existing users of the __GFP_NOFAIL are blockable, so warn
+ * of any new users that actually require GFP_NOWAIT
+ */
+ if (WARN_ON_ONCE(!can_direct_reclaim))
+ goto fail;
+
+ /*
+ * PF_MEMALLOC request from this context is rather bizarre
+ * because we cannot reclaim anything and only can loop waiting
+ * for somebody to do a work for us
+ */
+ WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+ /*
+ * non failing costly orders are a hard requirement which we
+ * are not prepared for much so let's warn about these users
+ * so that we can identify them and convert them to something
+ * else.
+ */
+ WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+ cond_resched();
+ goto retry;
+ }
+fail:
warn_alloc(gfp_mask,
"page allocation failure: order:%u", order);
got_pg:
--
2.10.2

2016-12-16 17:37:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> * make sure exclude 0 mask - all other users should have at least
> * ___GFP_DIRECT_RECLAIM to get here.
> */
> - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> return true;

This makes sense; we should go back to what we had here. Because it's
not that the reported OOMs are premature - there is genuinely no more
memory reclaimable from the allocating context - but that this class
of allocations should never invoke the OOM killer in the first place.
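
To make the intended division of labour explicit, here is a rough
sketch of my own (not the exact mm/ code; attempt_allocation() is just
a stand-in for the real slowpath): out_of_memory() bails out early for
!__GFP_FS requests, and a __GFP_NOFAIL caller simply keeps retrying in
the allocator instead of shooting other tasks.

static bool oom_is_pointless(gfp_t gfp_mask)
{
        /*
         * Without __GFP_FS the request could not use the full reclaim
         * machinery, so killing a task would not prove that we are
         * genuinely out of memory.
         */
        return gfp_mask && !(gfp_mask & __GFP_FS);
}

static struct page *nofail_slowpath_sketch(gfp_t gfp_mask, unsigned int order)
{
        struct page *page;

        for (;;) {
                /* attempt_allocation() is a placeholder, not a real API */
                page = attempt_allocation(gfp_mask, order);
                if (page || !(gfp_mask & __GFP_NOFAIL))
                        return page;
                if (oom_is_pointless(gfp_mask)) {
                        /* no OOM kill, just wait for kswapd and retry */
                        cond_resched();
                        continue;
                }
                /* __GFP_FS requests may still end up in the OOM killer */
        }
}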

> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> */
> WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>
> + /*
> + * Help non-failing allocations by giving them access to memory
> + * reserves but do not use ALLOC_NO_WATERMARKS because this
> + * could deplete whole memory reserves which would just make
> + * the situation worse
> + */
> + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> + if (page)
> + goto got_pg;
> +

But this should be a separate patch, IMO.

Do we observe GFP_NOFS lockups when we don't do this? Don't we risk
premature exhaustion of the memory reserves, and it's better to wait
for other reclaimers to make some progress instead? Should we give
reserve access to all GFP_NOFS allocations, or just the ones from a
reclaim/cleaning context? All that should go into the changelog of a
separate allocation booster patch, I think.

2016-12-16 18:15:43

by Chris Mason

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On 12/16/2016 02:39 AM, Michal Hocko wrote:
> [CC linux-mm and btrfs guys]
>
> On Thu 15-12-16 23:57:04, Nils Holland wrote:
> [...]
>> Of course, none of this are workloads that are new / special in any
>> way - prior to 4.8, I never experienced any issues doing the exact
>> same things.
>>
>> Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
>> Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
>> Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
>> Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
>> Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
>> Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
>> Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
>> Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
>> Dec 15 19:02:18 teela kernel: Call Trace:
>> Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69
>> Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178
>> Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0
>> Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0
>> Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20
>> Dec 15 19:02:18 teela kernel: [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
>> Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260
>> Dec 15 19:02:18 teela kernel: [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
>> Dec 15 19:02:18 teela kernel: [<c110414d>] pagecache_get_page+0xad/0x270
>> Dec 15 19:02:18 teela kernel: [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
>> Dec 15 19:02:18 teela kernel: [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
>> Dec 15 19:02:18 teela kernel: [<c132a57f>] btrfs_alloc_tree_block+0x1ef/0x5f0
>> Dec 15 19:02:18 teela kernel: [<c130f7c3>] __btrfs_cow_block+0x143/0x5f0
>> Dec 15 19:02:18 teela kernel: [<c130fe1a>] btrfs_cow_block+0x13a/0x220
>> Dec 15 19:02:18 teela kernel: [<c13132f1>] btrfs_search_slot+0x1d1/0x870
>> Dec 15 19:02:18 teela kernel: [<c132fcdd>] btrfs_lookup_file_extent+0x4d/0x60
>> Dec 15 19:02:18 teela kernel: [<c1354fe6>] __btrfs_drop_extents+0x176/0x1070
>> Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190
>> Dec 15 19:02:18 teela kernel: [<c133dbb5>] ? start_transaction+0x65/0x4b0
>> Dec 15 19:02:18 teela kernel: [<c1150597>] ? __kmalloc+0x147/0x1e0
>> Dec 15 19:02:18 teela kernel: [<c1345005>] cow_file_range_inline+0x215/0x6b0
>> Dec 15 19:02:18 teela kernel: [<c13459fc>] cow_file_range.isra.49+0x55c/0x6d0
>> Dec 15 19:02:18 teela kernel: [<c1361795>] ? lock_extent_bits+0x75/0x1e0
>> Dec 15 19:02:18 teela kernel: [<c1346d51>] run_delalloc_range+0x441/0x470
>> Dec 15 19:02:18 teela kernel: [<c13626e4>] writepage_delalloc.isra.47+0x144/0x1e0
>> Dec 15 19:02:18 teela kernel: [<c1364548>] __extent_writepage+0xd8/0x2b0
>> Dec 15 19:02:18 teela kernel: [<c1365c4c>] extent_writepages+0x25c/0x380
>> Dec 15 19:02:18 teela kernel: [<c1342cd0>] ? btrfs_real_readdir+0x610/0x610
>> Dec 15 19:02:18 teela kernel: [<c133ff0f>] btrfs_writepages+0x1f/0x30
>> Dec 15 19:02:18 teela kernel: [<c110ff85>] do_writepages+0x15/0x40
>> Dec 15 19:02:18 teela kernel: [<c1190a95>] __writeback_single_inode+0x35/0x2f0
>> Dec 15 19:02:18 teela kernel: [<c119112e>] writeback_sb_inodes+0x16e/0x340
>> Dec 15 19:02:18 teela kernel: [<c119145a>] wb_writeback+0xaa/0x280
>> Dec 15 19:02:18 teela kernel: [<c1191de8>] wb_workfn+0xd8/0x3e0
>> Dec 15 19:02:18 teela kernel: [<c104fd34>] process_one_work+0x114/0x3e0
>> Dec 15 19:02:18 teela kernel: [<c1050b4f>] worker_thread+0x2f/0x4b0
>> Dec 15 19:02:18 teela kernel: [<c1050b20>] ? create_worker+0x180/0x180
>> Dec 15 19:02:18 teela kernel: [<c10552e7>] kthread+0x97/0xb0
>> Dec 15 19:02:18 teela kernel: [<c1055250>] ? __kthread_parkme+0x60/0x60
>> Dec 15 19:02:18 teela kernel: [<c19b5cb7>] ret_from_fork+0x1b/0x28
>> Dec 15 19:02:18 teela kernel: Mem-Info:
>> Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
>> active_file:274324 inactive_file:281962 isolated_file:0
>
> OK, so there is still some anonymous memory that could be swapped out
> and quite a lot of page cache. This might be harder to reclaim because
> the allocation is a GFP_NOFS request which is limited in its reclaim
> capabilities. It might be possible that those pagecache pages are pinned
> in some way by the filesystem.
>
>> unevictable:0 dirty:649 writeback:0 unstable:0
>> slab_reclaimable:40662 slab_unreclaimable:17754
>> mapped:7382 shmem:202 pagetables:351 bounce:0
>> free:206736 free_pcp:332 free_cma:0
>> Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
>> Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
>> Dec 15 19:02:18 teela kernel: Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
>
> And this shows that there is no anonymous memory in the lowmem zone.
> Note that this request cannot use the highmem zone so no swap out would
> help. So if we are not able to reclaim those pages on the file LRU then
> we are out of luck
>
>> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
>> Dec 15 19:02:18 teela kernel: HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
>
> That being said, the OOM killer invocation is clearly pointless and
> premature. We normally do not invoke it for GFP_NOFS requests
> exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which
> behaves differently. I am about to change that but my last attempt [1]
> has to be rethought.
>
> Now another thing is that the __GFP_NOFAIL which has this nasty side
> effect has been introduced by me in d1b5c5671d01 ("btrfs: Prevent from
> early transaction abort") in 4.3 so I am quite surprised that this has
> shown up only in 4.8. Anyway there might be some other changes in the
> btrfs which could make it more subtle.
>
> I believe the right way to go around this is to pursue what I've started
> in [1]. I will try to prepare something for testing today for you. Stay
> tuned. But I would be really happy if somebody from the btrfs camp could
> check the NOFS aspect of this allocation. We have already seen
> allocation stalls from this path quite recently

Just double checking, are you asking why we're using GFP_NOFS to avoid
going into btrfs from the btrfs writepages call, or are you asking why
we aren't allowing highmem?

For why we're not using highmem, it goes back to 2011:

commit a65917156e345946dbde3d7effd28124c6d6a8c2
Btrfs: stop using highmem for extent_buffers

The short answer is that kmap + shared caching pointer between threads
made it hugely complex. I gave up and dropped the highmem part.
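
Roughly, the pain looks like this; a minimal illustration of the
general highmem problem, not actual btrfs code:

#include <linux/highmem.h>

/*
 * Illustration only: a highmem page is mapped just for the duration of
 * the kmap()/kunmap() window, so a kernel pointer into it cannot be
 * cached in a structure that other threads also use. With lowmem
 * pages, page_address(page) is stable and can be cached freely, which
 * is why dropping highmem made the extent_buffer code much simpler.
 */
static u32 sum_page(struct page *page)
{
        u8 *kaddr = kmap(page);         /* may sleep, maps the page */
        u32 sum = 0;
        int i;

        for (i = 0; i < PAGE_SIZE; i++)
                sum += kaddr[i];

        kunmap(page);                   /* kaddr is stale from here on */
        return sum;
}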

-chris

2016-12-16 18:47:24

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On Fri, Dec 16, 2016 at 04:58:06PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 08:39:41, Michal Hocko wrote:
> [...]
> > That being said, the OOM killer invocation is clearly pointless and
> > premature. We normally do not invoke it for GFP_NOFS requests
> > exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which
> > behaves differently. I am about to change that but my last attempt [1]
> > has to be rethought.
> >
> > Now another thing is that the __GFP_NOFAIL which has this nasty side
> > effect has been introduced by me in d1b5c5671d01 ("btrfs: Prevent from
> > early transaction abort") in 4.3 so I am quite surprised that this has
> > shown up only in 4.8. Anyway there might be some other changes in the
> > btrfs which could make it more subtle.
> >
> > I believe the right way to go around this is to pursue what I've started
> > in [1]. I will try to prepare something for testing today for you. Stay
> > tuned. But I would be really happy if somebody from the btrfs camp could
> > check the NOFS aspect of this allocation. We have already seen
> > allocation stalls from this path quite recently
>
> Could you try to run with the two following patches?

I tried the two patches you sent, and ... well, things are different
now, but probably still a bit problematic. ;-)

Once again, I freshly booted both of my machines and told Gentoo's
portage to unpack and build the firefox sources. The first machine,
the one from which yesterday's OOM report came, became unresponsive
during the tarball unpack phase and had to be power cycled.
Unfortunately, there's nothing concerning its OOMs in the logs. :-(

The second machine actually finished the unpack phase successfully and
started the build process (which, every now and then, had also worked
with previous problematic kernels). However, after it had been
building for a while and I decided to increase the stress level by
starting X, firefox, and a terminal, and by unpacking a kernel
source tarball in the latter, it also started OOMing, this time
once more with a genuine kernel panic. Luckily, this machine also
caught something in
the logs, which I'm including below.

Despite the fact that I'm no expert, I can see that there's no more
GFP_NOFS being logged, which seems to be what the patches tried to
achieve. What the still present OOMs mean remains up for
interpretation by the experts; all I can say is that in the (pre-4.8?)
past, doing all of the things I just did would probably slow down my
machine quite a bit, but I can't remember ever having seen it OOM or
even crash completely.

Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 10219 pages still pinned.
Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
Dec 16 18:56:29 boerne.fritz.box kernel: CPU: 1 PID: 2 Comm: kthreadd Not tainted 4.9.0-gentoo #3
Dec 16 18:56:29 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite L500/KSWAA, BIOS V1.80 10/28/2009
Dec 16 18:56:29 boerne.fritz.box kernel: f4105d6c c1433406 f4105e9c c6611280 f4105d9c c1170011 f4105df0 00200296
Dec 16 18:56:29 boerne.fritz.box kernel: f4105d9c c1438fff f4105da0 edc1bc80 ee32ce00 c6611280 c1ad1899 f4105e9c
Dec 16 18:56:29 boerne.fritz.box kernel: f4105de0 c1114407 c10513a5 f4105dcc c11140a1 00000001 00000000 00000000
Dec 16 18:56:29 boerne.fritz.box kernel: Call Trace:
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1433406>] dump_stack+0x47/0x61
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1170011>] dump_header+0x5f/0x175
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1438fff>] ? ___ratelimit+0x7f/0xe0
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1114407>] oom_kill_process+0x207/0x3c0
Dec 16 18:56:29 boerne.fritz.box kernel: [<c10513a5>] ? has_capability_noaudit+0x15/0x20
Dec 16 18:56:29 boerne.fritz.box kernel: [<c11140a1>] ? oom_badness.part.13+0xb1/0x120
Dec 16 18:56:29 boerne.fritz.box kernel: [<c11148c4>] out_of_memory+0xd4/0x270
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1118615>] __alloc_pages_nodemask+0xcf5/0xd60
Dec 16 18:56:29 boerne.fritz.box kernel: [<c10464f5>] copy_process.part.52+0xd5/0x1410
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1080779>] ? pick_next_task_fair+0x479/0x510
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1062ba0>] ? __kthread_parkme+0x60/0x60
Dec 16 18:56:29 boerne.fritz.box kernel: [<c10479d7>] _do_fork+0xc7/0x360
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1062ba0>] ? __kthread_parkme+0x60/0x60
Dec 16 18:56:29 boerne.fritz.box kernel: [<c1047ca0>] kernel_thread+0x30/0x40
Dec 16 18:56:29 boerne.fritz.box kernel: [<c10637c6>] kthreadd+0x106/0x150
Dec 16 18:56:29 boerne.fritz.box kernel: [<c10636c0>] ? kthread_park+0x50/0x50
Dec 16 18:56:29 boerne.fritz.box kernel: [<c19422b7>] ret_from_fork+0x1b/0x28
Dec 16 18:56:29 boerne.fritz.box kernel: Mem-Info:
Dec 16 18:56:29 boerne.fritz.box kernel: active_anon:132176 inactive_anon:11640 isolated_anon:0
active_file:295257 inactive_file:389350 isolated_file:20
unevictable:0 dirty:3956 writeback:0 unstable:0
slab_reclaimable:54632 slab_unreclaimable:21963
mapped:36724 shmem:11853 pagetables:914 bounce:0
free:77600 free_pcp:327 free_cma:0
Dec 16 18:56:29 boerne.fritz.box kernel: Node 0 active_anon:528704kB inactive_anon:46560kB active_file:1181028kB inactive_file:1557400kB unevictable:0kB isolated(anon):0kB isolated(file):80kB mapped:146896kB dirty:15824kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 172032kB anon_thp: 47412kB writeback_tmp:0kB unstable:0kB pages_scanned:15066965 all_unreclaimable? yes
Dec 16 18:56:29 boerne.fritz.box kernel: DMA free:3976kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:4788kB inactive_file:0kB unevictable:0kB writepending:160kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:5356kB slab_unreclaimable:1616kB kernel_stack:32kB pagetables:84kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 16 18:56:29 boerne.fritz.box kernel: lowmem_reserve[]: 0 808 3849 3849
Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:470556kB inactive_file:148kB unevictable:0kB writepending:1616kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213172kB slab_unreclaimable:86236kB kernel_stack:1864kB pagetables:3572kB bounce:0kB free_pcp:532kB local_pcp:456kB free_cma:0kB
Dec 16 18:56:29 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 24330 24330
Dec 16 18:56:29 boerne.fritz.box kernel: HighMem free:265416kB min:512kB low:39184kB high:77856kB active_anon:528704kB inactive_anon:46560kB active_file:705684kB inactive_file:1557292kB unevictable:0kB writepending:14048kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:776kB local_pcp:660kB free_cma:0kB
Dec 16 18:56:29 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 0 0
Dec 16 18:56:29 boerne.fritz.box kernel: DMA: 2*4kB (UE) 2*8kB (U) 1*16kB (E) 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (E) 1*512kB (E) 1*1024kB (U) 1*2048kB (M) 0*4096kB = 3976kB
Dec 16 18:56:29 boerne.fritz.box kernel: Normal: 32*4kB (ME) 28*8kB (UM) 15*16kB (UM) 141*32kB (UME) 141*64kB (UM) 80*128kB (UM) 19*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 2*2048kB (ME) 1*4096kB (M) = 41008kB
Dec 16 18:56:29 boerne.fritz.box kernel: HighMem: 340*4kB (UME) 339*8kB (UME) 258*16kB (UME) 192*32kB (UME) 69*64kB (UME) 15*128kB (UME) 6*256kB (ME) 5*512kB (UME) 7*1024kB (UME) 4*2048kB (UE) 55*4096kB (UM) = 265416kB
Dec 16 18:56:29 boerne.fritz.box kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 16 18:56:29 boerne.fritz.box kernel: 696480 total pagecache pages
Dec 16 18:56:29 boerne.fritz.box kernel: 0 pages in swap cache
Dec 16 18:56:29 boerne.fritz.box kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 16 18:56:29 boerne.fritz.box kernel: Free swap = 3781628kB
Dec 16 18:56:29 boerne.fritz.box kernel: Total swap = 3781628kB
Dec 16 18:56:29 boerne.fritz.box kernel: 1006816 pages RAM
Dec 16 18:56:29 boerne.fritz.box kernel: 778564 pages HighMem/MovableOnly
Dec 16 18:56:29 boerne.fritz.box kernel: 16403 pages reserved
Dec 16 18:56:29 boerne.fritz.box kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 18:56:29 boerne.fritz.box kernel: [ 1874] 0 1874 6166 987 9 3 0 0 systemd-journal
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2497] 0 2497 2965 911 8 3 0 -1000 systemd-udevd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2582] 107 2582 3874 958 8 3 0 0 systemd-timesyn
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2585] 108 2585 1269 883 6 3 0 -900 dbus-daemon
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2586] 0 2586 22054 3277 20 3 0 0 NetworkManager
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2587] 0 2587 1521 972 7 3 0 0 systemd-logind
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2589] 88 2589 1158 627 6 3 0 0 nullmailer-send
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2612] 0 2612 1510 460 5 3 0 0 fcron
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2665] 0 2665 768 580 5 3 0 0 dhcpcd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2668] 0 2668 639 408 5 3 0 0 vnstatd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2669] 0 2669 1460 1063 6 3 0 -1000 sshd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2670] 0 2670 1235 838 6 3 0 0 login
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2672] 0 2672 1972 1267 7 3 0 0 systemd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2700] 0 2700 2279 586 7 3 0 0 (sd-pam)
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2733] 0 2733 1836 890 7 3 0 0 bash
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2753] 109 2753 16724 3089 19 3 0 0 polkitd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2776] 0 2776 2153 1349 7 3 0 0 wpa_supplicant
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2941] 0 2941 16268 15095 36 3 0 0 emerge
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2942] 0 2942 1235 833 5 3 0 0 login
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2949] 1000 2949 2033 1378 7 3 0 0 systemd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2973] 1000 2973 2279 589 7 3 0 0 (sd-pam)
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2989] 1000 2989 1836 907 7 3 0 0 bash
Dec 16 18:56:29 boerne.fritz.box kernel: [ 2997] 1000 2997 25339 2169 17 3 0 0 pulseaudio
Dec 16 18:56:29 boerne.fritz.box kernel: [ 3000] 111 3000 5763 655 9 3 0 0 rtkit-daemon
Dec 16 18:56:29 boerne.fritz.box kernel: [ 3019] 1000 3019 3575 1403 11 3 0 0 gconf-helper
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5626] 1000 5626 1743 709 8 3 0 0 startx
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5647] 1000 5647 1001 579 6 3 0 0 xinit
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5648] 1000 5648 22873 7477 43 3 0 0 X
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5674] 1000 5674 10584 4543 21 3 0 0 awesome
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5718] 1000 5718 1571 610 7 3 0 0 dbus-launch
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5720] 1000 5720 1238 645 6 3 0 0 dbus-daemon
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5725] 1000 5725 1571 634 7 3 0 0 dbus-launch
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5726] 1000 5726 1238 649 6 3 0 0 dbus-daemon
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5823] 1000 5823 35683 8366 42 3 0 0 nm-applet
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5825] 1000 5825 21454 7358 31 3 0 0 xfce4-terminal
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5827] 1000 5827 11257 1911 14 3 0 0 at-spi-bus-laun
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5832] 1000 5832 1238 831 6 3 0 0 dbus-daemon
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5838] 1000 5838 7480 2110 12 3 0 0 at-spi2-registr
Dec 16 18:56:29 boerne.fritz.box kernel: [ 5840] 1000 5840 10179 1459 13 3 0 0 gvfsd
Dec 16 18:56:29 boerne.fritz.box kernel: [ 6181] 1000 6181 1836 883 7 3 0 0 bash
Dec 16 18:56:29 boerne.fritz.box kernel: [ 7874] 1000 7874 2246 1185 8 3 0 0 ssh
Dec 16 18:56:29 boerne.fritz.box kernel: [12950] 1000 12950 197232 73307 252 3 0 0 firefox
Dec 16 18:56:29 boerne.fritz.box kernel: [13020] 250 13020 549 377 4 3 0 0 sandbox
Dec 16 18:56:29 boerne.fritz.box kernel: [13022] 250 13022 2629 1567 8 3 0 0 ebuild.sh
Dec 16 18:56:29 boerne.fritz.box kernel: [13040] 1000 13040 1836 933 7 3 0 0 bash
Dec 16 18:56:29 boerne.fritz.box kernel: [13048] 250 13048 3002 1718 8 3 0 0 ebuild.sh
Dec 16 18:56:29 boerne.fritz.box kernel: [13052] 250 13052 1122 732 5 3 0 0 emake
Dec 16 18:56:29 boerne.fritz.box kernel: [13054] 250 13054 921 697 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13118] 250 13118 1048 783 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13181] 250 13181 1043 789 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13208] 250 13208 1095 855 6 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13255] 250 13255 772 555 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13299] 250 13299 913 689 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13493] 250 13493 876 619 5 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13494] 250 13494 15191 14639 34 3 0 0 python
Dec 16 18:56:29 boerne.fritz.box kernel: [13532] 250 13532 808 594 4 3 0 0 make
Dec 16 18:56:29 boerne.fritz.box kernel: [13593] 1000 13593 1533 624 7 3 0 0 tar
Dec 16 18:56:29 boerne.fritz.box kernel: [13594] 1000 13594 17834 16906 38 3 0 0 xz
Dec 16 18:56:29 boerne.fritz.box kernel: [13604] 250 13604 12439 11843 27 3 0 0 python
Dec 16 18:56:29 boerne.fritz.box kernel: [13651] 250 13651 253 5 1 3 0 0 sh
Dec 16 18:56:29 boerne.fritz.box kernel: Out of memory: Kill process 12950 (firefox) score 38 or sacrifice child
Dec 16 18:56:29 boerne.fritz.box kernel: Killed process 12950 (firefox) total-vm:788928kB, anon-rss:192656kB, file-rss:100548kB, shmem-rss:24kB
Dec 16 18:56:29 boerne.fritz.box kernel: oom_reaper: reaped process 12950 (firefox), now anon-rss:0kB, file-rss:96kB, shmem-rss:24kB
Dec 16 18:56:31 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Dec 16 18:56:31 boerne.fritz.box kernel: xfce4-terminal cpuset=/ mems_allowed=0
Dec 16 18:56:31 boerne.fritz.box kernel: CPU: 0 PID: 5825 Comm: xfce4-terminal Not tainted 4.9.0-gentoo #3
Dec 16 18:56:31 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite L500/KSWAA, BIOS V1.80 10/28/2009
Dec 16 18:56:31 boerne.fritz.box kernel: c6941c18 c1433406 c6941d48 c5972500 c6941c48 c1170011 c6941c9c 00200286
Dec 16 18:56:31 boerne.fritz.box kernel: c6941c48 c1438fff c6941c4c edc1a940 ee32d400 c5972500 c1ad1899 c6941d48
Dec 16 18:56:31 boerne.fritz.box kernel: c6941c8c c1114407 c10513a5 c6941c78 c11140a1 00000006 00000000 00000000
Dec 16 18:56:31 boerne.fritz.box kernel: Call Trace:
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1433406>] dump_stack+0x47/0x61
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1170011>] dump_header+0x5f/0x175
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1438fff>] ? ___ratelimit+0x7f/0xe0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1114407>] oom_kill_process+0x207/0x3c0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c10513a5>] ? has_capability_noaudit+0x15/0x20
Dec 16 18:56:31 boerne.fritz.box kernel: [<c11140a1>] ? oom_badness.part.13+0xb1/0x120
Dec 16 18:56:31 boerne.fritz.box kernel: [<c11148c4>] out_of_memory+0xd4/0x270
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1118615>] __alloc_pages_nodemask+0xcf5/0xd60
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1758900>] ? skb_queue_purge+0x30/0x30
Dec 16 18:56:31 boerne.fritz.box kernel: [<c175dcde>] alloc_skb_with_frags+0xee/0x1a0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1753dba>] sock_alloc_send_pskb+0x19a/0x1c0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1186120>] ? poll_select_copy_remaining+0x120/0x120
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1825880>] ? wait_for_unix_gc+0x20/0x90
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1823fc0>] unix_stream_sendmsg+0x2a0/0x350
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1750b3d>] sock_sendmsg+0x2d/0x40
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1750bb7>] sock_write_iter+0x67/0xc0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1172c42>] do_readv_writev+0x1e2/0x380
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1750b50>] ? sock_sendmsg+0x40/0x40
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1033763>] ? lapic_next_event+0x13/0x20
Dec 16 18:56:31 boerne.fritz.box kernel: [<c10ae675>] ? clockevents_program_event+0x95/0x190
Dec 16 18:56:31 boerne.fritz.box kernel: [<c10a074a>] ? __hrtimer_run_queues+0x20a/0x280
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1173d16>] vfs_writev+0x36/0x60
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1173d85>] do_writev+0x45/0xc0
Dec 16 18:56:31 boerne.fritz.box kernel: [<c1173efb>] SyS_writev+0x1b/0x20
Dec 16 18:56:31 boerne.fritz.box kernel: [<c10018ec>] do_fast_syscall_32+0x7c/0x130
Dec 16 18:56:31 boerne.fritz.box kernel: [<c194232b>] sysenter_past_esp+0x40/0x6a
Dec 16 18:56:31 boerne.fritz.box kernel: Mem-Info:
Dec 16 18:56:31 boerne.fritz.box kernel: active_anon:72795 inactive_anon:7267 isolated_anon:0
active_file:297627 inactive_file:387672 isolated_file:0
unevictable:0 dirty:77 writeback:18 unstable:0
slab_reclaimable:54648 slab_unreclaimable:21983
mapped:17819 shmem:8215 pagetables:662 bounce:8
free:141692 free_pcp:107 free_cma:0
Dec 16 18:56:31 boerne.fritz.box kernel: Node 0 active_anon:291180kB inactive_anon:29068kB active_file:1190508kB inactive_file:1550688kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:71276kB dirty:308kB writeback:72kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 122880kB anon_thp: 32860kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 16 18:56:31 boerne.fritz.box kernel: DMA free:4020kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:4804kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:5356kB slab_unreclaimable:1572kB kernel_stack:32kB pagetables:84kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 808 3849 3849
Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:41028kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:472164kB inactive_file:108kB unevictable:0kB writepending:112kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2564kB bounce:32kB free_pcp:180kB local_pcp:24kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 24330 24330
Dec 16 18:56:32 boerne.fritz.box kernel: HighMem free:521720kB min:512kB low:39184kB high:77856kB active_anon:291180kB inactive_anon:29068kB active_file:713448kB inactive_file:1550556kB unevictable:0kB writepending:76kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:248kB local_pcp:156kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 0 0
Dec 16 18:56:32 boerne.fritz.box kernel: DMA: 13*4kB (UE) 2*8kB (U) 1*16kB (E) 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (E) 1*512kB (E) 1*1024kB (U) 1*2048kB (M) 0*4096kB = 4020kB
Dec 16 18:56:32 boerne.fritz.box kernel: Normal: 37*4kB (UME) 24*8kB (ME) 17*16kB (UME) 137*32kB (UME) 143*64kB (UME) 82*128kB (UM) 18*256kB (UM) 3*512kB (UME) 2*1024kB (ME) 2*2048kB (ME) 1*4096kB (M) = 41028kB
Dec 16 18:56:32 boerne.fritz.box kernel: HighMem: 3230*4kB (ME) 1616*8kB (M) 680*16kB (UM) 398*32kB (UME) 145*64kB (UM) 59*128kB (UM) 25*256kB (ME) 19*512kB (UME) 9*1024kB (UME) 36*2048kB (UME) 87*4096kB (UME) = 521720kB
Dec 16 18:56:32 boerne.fritz.box kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 16 18:56:32 boerne.fritz.box kernel: 693537 total pagecache pages
Dec 16 18:56:32 boerne.fritz.box kernel: 0 pages in swap cache
Dec 16 18:56:32 boerne.fritz.box kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 16 18:56:32 boerne.fritz.box kernel: Free swap = 3781628kB
Dec 16 18:56:32 boerne.fritz.box kernel: Total swap = 3781628kB
Dec 16 18:56:32 boerne.fritz.box kernel: 1006816 pages RAM
Dec 16 18:56:32 boerne.fritz.box kernel: 778564 pages HighMem/MovableOnly
Dec 16 18:56:32 boerne.fritz.box kernel: 16403 pages reserved
Dec 16 18:56:32 boerne.fritz.box kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 18:56:32 boerne.fritz.box kernel: [ 1874] 0 1874 6166 1007 9 3 0 0 systemd-journal
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2497] 0 2497 2965 911 8 3 0 -1000 systemd-udevd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2582] 107 2582 3874 958 8 3 0 0 systemd-timesyn
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2585] 108 2585 1301 885 6 3 0 -900 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2586] 0 2586 22054 3277 20 3 0 0 NetworkManager
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2587] 0 2587 1521 972 7 3 0 0 systemd-logind
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2589] 88 2589 1158 627 6 3 0 0 nullmailer-send
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2612] 0 2612 1510 460 5 3 0 0 fcron
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2665] 0 2665 768 580 5 3 0 0 dhcpcd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2668] 0 2668 639 408 5 3 0 0 vnstatd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2669] 0 2669 1460 1063 6 3 0 -1000 sshd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2670] 0 2670 1235 838 6 3 0 0 login
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2672] 0 2672 1972 1267 7 3 0 0 systemd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2700] 0 2700 2279 586 7 3 0 0 (sd-pam)
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2733] 0 2733 1836 890 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2753] 109 2753 16724 3089 19 3 0 0 polkitd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2776] 0 2776 2153 1349 7 3 0 0 wpa_supplicant
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2941] 0 2941 16268 15095 36 3 0 0 emerge
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2942] 0 2942 1235 833 5 3 0 0 login
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2949] 1000 2949 2033 1378 7 3 0 0 systemd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2973] 1000 2973 2279 589 7 3 0 0 (sd-pam)
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2989] 1000 2989 1836 907 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2997] 1000 2997 25339 2169 17 3 0 0 pulseaudio
Dec 16 18:56:32 boerne.fritz.box kernel: [ 3000] 111 3000 5763 655 9 3 0 0 rtkit-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 3019] 1000 3019 3575 1403 11 3 0 0 gconf-helper
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5626] 1000 5626 1743 709 8 3 0 0 startx
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5647] 1000 5647 1001 579 6 3 0 0 xinit
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5648] 1000 5648 22392 7078 41 3 0 0 X
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5674] 1000 5674 10584 4543 21 3 0 0 awesome
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5718] 1000 5718 1571 610 7 3 0 0 dbus-launch
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5720] 1000 5720 1238 645 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5725] 1000 5725 1571 634 7 3 0 0 dbus-launch
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5726] 1000 5726 1238 649 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5823] 1000 5823 35683 8366 42 3 0 0 nm-applet
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5825] 1000 5825 21454 7358 31 3 0 0 xfce4-terminal
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5827] 1000 5827 11257 1911 14 3 0 0 at-spi-bus-laun
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5832] 1000 5832 1238 831 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5838] 1000 5838 7480 2110 12 3 0 0 at-spi2-registr
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5840] 1000 5840 10179 1459 13 3 0 0 gvfsd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 6181] 1000 6181 1836 883 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 7874] 1000 7874 2246 1185 8 3 0 0 ssh
Dec 16 18:56:32 boerne.fritz.box kernel: [13020] 250 13020 549 377 4 3 0 0 sandbox
Dec 16 18:56:32 boerne.fritz.box kernel: [13022] 250 13022 2629 1567 8 3 0 0 ebuild.sh
Dec 16 18:56:32 boerne.fritz.box kernel: [13040] 1000 13040 1836 933 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [13048] 250 13048 3002 1718 8 3 0 0 ebuild.sh
Dec 16 18:56:32 boerne.fritz.box kernel: [13052] 250 13052 1122 732 5 3 0 0 emake
Dec 16 18:56:32 boerne.fritz.box kernel: [13054] 250 13054 921 697 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13118] 250 13118 1048 783 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13181] 250 13181 1043 789 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13208] 250 13208 1095 855 6 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13255] 250 13255 772 555 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13299] 250 13299 913 689 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13493] 250 13493 876 619 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13494] 250 13494 15321 14729 34 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: [13532] 250 13532 808 594 4 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13593] 1000 13593 1533 624 7 3 0 0 tar
Dec 16 18:56:32 boerne.fritz.box kernel: [13594] 1000 13594 17834 16906 38 3 0 0 xz
Dec 16 18:56:32 boerne.fritz.box kernel: [13604] 250 13604 12599 12029 28 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: [13658] 250 13658 1549 1104 6 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: Out of memory: Kill process 13594 (xz) score 8 or sacrifice child
Dec 16 18:56:32 boerne.fritz.box kernel: Killed process 13594 (xz) total-vm:71336kB, anon-rss:65668kB, file-rss:1956kB, shmem-rss:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Dec 16 18:56:32 boerne.fritz.box kernel: xfce4-terminal cpuset=/ mems_allowed=0
Dec 16 18:56:32 boerne.fritz.box kernel: CPU: 1 PID: 5825 Comm: xfce4-terminal Not tainted 4.9.0-gentoo #3
Dec 16 18:56:32 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite L500/KSWAA, BIOS V1.80 10/28/2009
Dec 16 18:56:32 boerne.fritz.box kernel: c6941c18 c1433406 c6941d48 ef25ef00 c6941c48 c1170011 c6941c9c 00200286
Dec 16 18:56:32 boerne.fritz.box kernel: c6941c48 c1438fff c6941c4c ef267c80 ef233a00 ef25ef00 c1ad1899 c6941d48
Dec 16 18:56:32 boerne.fritz.box kernel: c6941c8c c1114407 c10513a5 c6941c78 c11140a1 00000006 00000000 00000000
Dec 16 18:56:32 boerne.fritz.box kernel: Call Trace:
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1433406>] dump_stack+0x47/0x61
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1170011>] dump_header+0x5f/0x175
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1438fff>] ? ___ratelimit+0x7f/0xe0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1114407>] oom_kill_process+0x207/0x3c0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c10513a5>] ? has_capability_noaudit+0x15/0x20
Dec 16 18:56:32 boerne.fritz.box kernel: [<c11140a1>] ? oom_badness.part.13+0xb1/0x120
Dec 16 18:56:32 boerne.fritz.box kernel: [<c11148c4>] out_of_memory+0xd4/0x270
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1118615>] __alloc_pages_nodemask+0xcf5/0xd60
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1758900>] ? skb_queue_purge+0x30/0x30
Dec 16 18:56:32 boerne.fritz.box kernel: [<c175dcde>] alloc_skb_with_frags+0xee/0x1a0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1753dba>] sock_alloc_send_pskb+0x19a/0x1c0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1186120>] ? poll_select_copy_remaining+0x120/0x120
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1825880>] ? wait_for_unix_gc+0x20/0x90
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1823fc0>] unix_stream_sendmsg+0x2a0/0x350
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1750b3d>] sock_sendmsg+0x2d/0x40
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1750bb7>] sock_write_iter+0x67/0xc0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1172c42>] do_readv_writev+0x1e2/0x380
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1750b50>] ? sock_sendmsg+0x40/0x40
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1033763>] ? lapic_next_event+0x13/0x20
Dec 16 18:56:32 boerne.fritz.box kernel: [<c10ae675>] ? clockevents_program_event+0x95/0x190
Dec 16 18:56:32 boerne.fritz.box kernel: [<c10a074a>] ? __hrtimer_run_queues+0x20a/0x280
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1173d16>] vfs_writev+0x36/0x60
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1173d85>] do_writev+0x45/0xc0
Dec 16 18:56:32 boerne.fritz.box kernel: [<c1173efb>] SyS_writev+0x1b/0x20
Dec 16 18:56:32 boerne.fritz.box kernel: [<c10018ec>] do_fast_syscall_32+0x7c/0x130
Dec 16 18:56:32 boerne.fritz.box kernel: [<c194232b>] sysenter_past_esp+0x40/0x6a
Dec 16 18:56:32 boerne.fritz.box kernel: Mem-Info:
Dec 16 18:56:32 boerne.fritz.box kernel: active_anon:56747 inactive_anon:7267 isolated_anon:0
active_file:297677 inactive_file:387697 isolated_file:0
unevictable:0 dirty:151 writeback:18 unstable:0
slab_reclaimable:54648 slab_unreclaimable:21983
mapped:17769 shmem:8215 pagetables:637 bounce:8
free:157498 free_pcp:299 free_cma:0
Dec 16 18:56:32 boerne.fritz.box kernel: Node 0 active_anon:226988kB inactive_anon:29068kB active_file:1190708kB inactive_file:1550788kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:71076kB dirty:604kB writeback:72kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 47104kB anon_thp: 32860kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 16 18:56:32 boerne.fritz.box kernel: DMA free:4020kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:4804kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:5356kB slab_unreclaimable:1572kB kernel_stack:32kB pagetables:84kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 808 3849 3849
Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:40988kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:472436kB inactive_file:144kB unevictable:0kB writepending:312kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2464kB bounce:32kB free_pcp:116kB local_pcp:0kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 24330 24330
Dec 16 18:56:32 boerne.fritz.box kernel: HighMem free:584984kB min:512kB low:39184kB high:77856kB active_anon:226988kB inactive_anon:29068kB active_file:713448kB inactive_file:1550556kB unevictable:0kB writepending:224kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1080kB local_pcp:400kB free_cma:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 0 0
Dec 16 18:56:32 boerne.fritz.box kernel: DMA: 13*4kB (UE) 2*8kB (U) 1*16kB (E) 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (E) 1*512kB (E) 1*1024kB (U) 1*2048kB (M) 0*4096kB = 4020kB
Dec 16 18:56:32 boerne.fritz.box kernel: Normal: 36*4kB (ME) 24*8kB (ME) 16*16kB (ME) 138*32kB (UME) 143*64kB (UME) 82*128kB (UM) 18*256kB (UM) 3*512kB (UME) 2*1024kB (ME) 2*2048kB (ME) 1*4096kB (M) = 41040kB
Dec 16 18:56:32 boerne.fritz.box kernel: HighMem: 3430*4kB (UME) 1795*8kB (UME) 750*16kB (UM) 401*32kB (UM) 148*64kB (UME) 56*128kB (UM) 28*256kB (UME) 19*512kB (UME) 9*1024kB (UME) 55*2048kB (UME) 92*4096kB (UME) = 585136kB
Dec 16 18:56:32 boerne.fritz.box kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 16 18:56:32 boerne.fritz.box kernel: 693648 total pagecache pages
Dec 16 18:56:32 boerne.fritz.box kernel: 0 pages in swap cache
Dec 16 18:56:32 boerne.fritz.box kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 16 18:56:32 boerne.fritz.box kernel: Free swap = 3781628kB
Dec 16 18:56:32 boerne.fritz.box kernel: Total swap = 3781628kB
Dec 16 18:56:32 boerne.fritz.box kernel: 1006816 pages RAM
Dec 16 18:56:32 boerne.fritz.box kernel: 778564 pages HighMem/MovableOnly
Dec 16 18:56:32 boerne.fritz.box kernel: 16403 pages reserved
Dec 16 18:56:32 boerne.fritz.box kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 18:56:32 boerne.fritz.box kernel: [ 1874] 0 1874 6166 1011 9 3 0 0 systemd-journal
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2497] 0 2497 2965 911 8 3 0 -1000 systemd-udevd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2582] 107 2582 3874 958 8 3 0 0 systemd-timesyn
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2585] 108 2585 1301 885 6 3 0 -900 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2586] 0 2586 22054 3277 20 3 0 0 NetworkManager
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2587] 0 2587 1521 972 7 3 0 0 systemd-logind
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2589] 88 2589 1158 627 6 3 0 0 nullmailer-send
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2612] 0 2612 1510 460 5 3 0 0 fcron
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2665] 0 2665 768 580 5 3 0 0 dhcpcd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2668] 0 2668 639 408 5 3 0 0 vnstatd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2669] 0 2669 1460 1063 6 3 0 -1000 sshd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2670] 0 2670 1235 838 6 3 0 0 login
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2672] 0 2672 1972 1267 7 3 0 0 systemd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2700] 0 2700 2279 586 7 3 0 0 (sd-pam)
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2733] 0 2733 1836 890 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2753] 109 2753 16724 3089 19 3 0 0 polkitd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2776] 0 2776 2153 1349 7 3 0 0 wpa_supplicant
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2941] 0 2941 16268 15095 36 3 0 0 emerge
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2942] 0 2942 1235 833 5 3 0 0 login
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2949] 1000 2949 2033 1378 7 3 0 0 systemd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2973] 1000 2973 2279 589 7 3 0 0 (sd-pam)
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2989] 1000 2989 1836 907 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 2997] 1000 2997 25339 2169 17 3 0 0 pulseaudio
Dec 16 18:56:32 boerne.fritz.box kernel: [ 3000] 111 3000 5763 655 9 3 0 0 rtkit-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 3019] 1000 3019 3575 1403 11 3 0 0 gconf-helper
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5626] 1000 5626 1743 709 8 3 0 0 startx
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5647] 1000 5647 1001 579 6 3 0 0 xinit
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5648] 1000 5648 22392 7078 41 3 0 0 X
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5674] 1000 5674 10584 4543 21 3 0 0 awesome
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5718] 1000 5718 1571 610 7 3 0 0 dbus-launch
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5720] 1000 5720 1238 645 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5725] 1000 5725 1571 634 7 3 0 0 dbus-launch
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5726] 1000 5726 1238 649 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5823] 1000 5823 35683 8366 42 3 0 0 nm-applet
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5825] 1000 5825 21454 7358 31 3 0 0 xfce4-terminal
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5827] 1000 5827 11257 1911 14 3 0 0 at-spi-bus-laun
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5832] 1000 5832 1238 831 6 3 0 0 dbus-daemon
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5838] 1000 5838 7480 2110 12 3 0 0 at-spi2-registr
Dec 16 18:56:32 boerne.fritz.box kernel: [ 5840] 1000 5840 10179 1459 13 3 0 0 gvfsd
Dec 16 18:56:32 boerne.fritz.box kernel: [ 6181] 1000 6181 1836 883 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [ 7874] 1000 7874 2246 1185 8 3 0 0 ssh
Dec 16 18:56:32 boerne.fritz.box kernel: [13020] 250 13020 549 377 4 3 0 0 sandbox
Dec 16 18:56:32 boerne.fritz.box kernel: [13022] 250 13022 2629 1567 8 3 0 0 ebuild.sh
Dec 16 18:56:32 boerne.fritz.box kernel: [13040] 1000 13040 1836 933 7 3 0 0 bash
Dec 16 18:56:32 boerne.fritz.box kernel: [13048] 250 13048 3002 1718 8 3 0 0 ebuild.sh
Dec 16 18:56:32 boerne.fritz.box kernel: [13052] 250 13052 1122 732 5 3 0 0 emake
Dec 16 18:56:32 boerne.fritz.box kernel: [13054] 250 13054 921 697 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13118] 250 13118 1048 783 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13181] 250 13181 1043 789 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13208] 250 13208 1095 855 6 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13255] 250 13255 772 555 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13299] 250 13299 913 689 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13493] 250 13493 876 619 5 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13494] 250 13494 15321 14775 34 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: [13532] 250 13532 808 594 4 3 0 0 make
Dec 16 18:56:32 boerne.fritz.box kernel: [13593] 1000 13593 1533 643 7 3 0 0 tar
Dec 16 18:56:32 boerne.fritz.box kernel: [13604] 250 13604 12760 12198 28 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: [13658] 250 13658 1687 1280 6 3 0 0 python
Dec 16 18:56:32 boerne.fritz.box kernel: Out of memory: Kill process 13494 (python) score 7 or sacrifice child
Dec 16 18:56:32 boerne.fritz.box kernel: Killed process 13494 (python) total-vm:61284kB, anon-rss:54128kB, file-rss:4972kB, shmem-rss:0kB
Dec 16 18:56:32 boerne.fritz.box kernel: oom_reaper: reaped process 13494 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Greetings
Nils

2016-12-16 19:50:36

by Chris Mason

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On 12/16/2016 02:39 AM, Michal Hocko wrote:
> [CC linux-mm and btrfs guys]
>
> On Thu 15-12-16 23:57:04, Nils Holland wrote:
> [...]
>> Of course, none of this are workloads that are new / special in any
>> way - prior to 4.8, I never experienced any issues doing the exact
>> same things.
>>
>> Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
>> [...]
>> Dec 15 19:02:18 teela kernel: Mem-Info:
>> Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
>> active_file:274324 inactive_file:281962 isolated_file:0
>
> OK, so there is still some anonymous memory that could be swapped out
> and quite a lot of page cache. This might be harder to reclaim because
> the allocation is a GFP_NOFS request which is limited in its reclaim
> capabilities. It might be possible that those pagecache pages are pinned
> in some way by the filesystem.

Reading harder, it's possible those pagecache pages are all from the
btree inode. They shouldn't be pinned by btrfs; kswapd should be able
to wander in and free a good chunk. What btrfs wants to happen is for
this allocation to sit and wait for kswapd to make progress.
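
For reference, the allocation site in question is roughly the
following (simplified and quoted from memory, so treat it as an
approximation; the wrapper name is made up):

#include <linux/pagemap.h>

/* Sketch of what alloc_extent_buffer() does for each btree page */
static struct page *grab_metadata_page(struct address_space *mapping,
                                       pgoff_t index)
{
        /*
         * GFP_NOFS keeps reclaim from recursing back into btrfs, and
         * __GFP_NOFAIL (added by d1b5c5671d01) avoids aborting the
         * transaction on allocation failure. The combination is what
         * drags the OOM killer into this path on current kernels.
         */
        return find_or_create_page(mapping, index, GFP_NOFS | __GFP_NOFAIL);
}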

-chris

2016-12-16 22:12:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
> On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > * make sure exclude 0 mask - all other users should have at least
> > * ___GFP_DIRECT_RECLAIM to get here.
> > */
> > - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > return true;
>
> This makes sense, we should go back to what we had here. Because it's
> not that the reported OOMs are premature - there is genuinely no more
> memory reclaimable from the allocating context - but that this class
> of allocations should never invoke the OOM killer in the first place.

agreed, at least not with the current implementation. If we had a proper
accounting where we know that the memory pinned by the fs is not really
there then we could invoke the oom killer and be safe

> > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > */
> > WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> >
> > + /*
> > + * Help non-failing allocations by giving them access to memory
> > + * reserves but do not use ALLOC_NO_WATERMARKS because this
> > + * could deplete whole memory reserves which would just make
> > + * the situation worse
> > + */
> > + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> > + if (page)
> > + goto got_pg;
> > +
>
> But this should be a separate patch, IMO.
>
> Do we observe GFP_NOFS lockups when we don't do this?

this is hard to tell, but considering users like grow_dev_page I
believe we can get stuck with very slow progress. Those allocations
could use some help.

> Don't we risk
> premature exhaustion of the memory reserves, and it's better to wait
> for other reclaimers to make some progress instead?

waiting for other reclaimers would be preferable but we should at least
give these some priority, which is what ALLOC_HARDER should help with.

> Should we give
> reserve access to all GFP_NOFS allocations, or just the ones from a
> reclaim/cleaning context?

I would focus only on those which are important enough. Which ones
those are is a harder question. But certainly those with GFP_NOFAIL
are important
enough.

> All that should go into the changelog of a separate allocation booster
> patch, I think.

The reason I did both in the same patch is to address the concern about
potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
ALLOC_HARDER to give the minimum portion of the reserves, so that we
do not risk blocking out other high-priority users, but still help at
least a bit and prevent starvation when other reclaimers are faster
at consuming the reclaimed memory.
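
To put a number on "the minimum portion of the reserves": the
watermark check roughly lowers min by a quarter for ALLOC_HARDER and
by half for ALLOC_HIGH, while ALLOC_NO_WATERMARKS skips the check
entirely. A simplified sketch (not the exact __zone_watermark_ok(),
which also deals with lowmem_reserve and highatomic reserves), using
the Normal zone from Nils' report (min:41100kB) as an example:

/*
 * Illustration only. With min = 41100kB:
 *   no flags      -> needs more than 41100kB free
 *   ALLOC_HARDER  -> needs more than 41100 - 41100/4 = 30825kB
 *   ALLOC_HIGH    -> needs more than 41100 - 41100/2 = 20550kB
 * so ALLOC_HARDER can eat at most a quarter of the min watermark.
 */
static bool watermark_ok_sketch(unsigned long free_kb, unsigned long min_kb,
                                unsigned int alloc_flags)
{
        unsigned long min = min_kb;

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;

        return free_kb > min;
}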

I can extend the changelog of course but I believe that having both
changes together makes some sense. NOFS|NOFAIL allocations are not all
that rare and sometimes we really depend on them making further
progress.

--
Michal Hocko
SUSE Labs

2016-12-16 22:14:34

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On Fri 16-12-16 13:15:18, Chris Mason wrote:
> On 12/16/2016 02:39 AM, Michal Hocko wrote:
[...]
> > I believe the right way to go around this is to pursue what I've started
> > in [1]. I will try to prepare something for testing today for you. Stay
> > tuned. But I would be really happy if somebody from the btrfs camp could
> > check the NOFS aspect of this allocation. We have already seen
> > allocation stalls from this path quite recently
>
> Just double checking, are you asking why we're using GFP_NOFS to avoid going
> into btrfs from the btrfs writepages call, or are you asking why we aren't
> allowing highmem?

I am more interested in the NOFS part. Why cannot this be a full
GFP_KERNEL context? What kind of locks would we lock up on when recursing
to the fs via slab shrinkers?
--
Michal Hocko
SUSE Labs

2016-12-16 22:47:44

by Chris Mason

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On 12/16/2016 05:14 PM, Michal Hocko wrote:
> On Fri 16-12-16 13:15:18, Chris Mason wrote:
>> On 12/16/2016 02:39 AM, Michal Hocko wrote:
> [...]
>>> I believe the right way to go around this is to pursue what I've started
>>> in [1]. I will try to prepare something for testing today for you. Stay
>>> tuned. But I would be really happy if somebody from the btrfs camp could
>>> check the NOFS aspect of this allocation. We have already seen
>>> allocation stalls from this path quite recently
>>
>> Just double checking, are you asking why we're using GFP_NOFS to avoid going
>> into btrfs from the btrfs writepages call, or are you asking why we aren't
>> allowing highmem?
>
> I am more interested in the NOFS part. Why cannot this be a full
> GFP_KERNEL context? What kind of locks we would lock up when recursing
> to the fs via slab shrinkers?
>

Since this is our writepages call, any jump into direct reclaim would go
to writepage, which would end up calling the same set of code to read
metadata blocks, which would do a GFP_KERNEL allocation and end up back
in writepage again.

We'd also have issues with blowing through transaction reservations
since the writepage recursion would have to nest into the running
transaction.
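For illustration, the cycle looks roughly like this (a made-up sketch,
not actual btrfs code; metadata_alloc() and my_writepages() are
hypothetical names):

/* Hypothetical sketch of the recursion described above. */
static void *metadata_alloc(gfp_t gfp)
{
	/*
	 * With GFP_KERNEL this allocation may enter direct reclaim, and
	 * reclaim may call back into this filesystem's writeback path,
	 * i.e. right back into the caller below.
	 */
	return kmalloc(4096, gfp);
}

static int my_writepages(void)
{
	/*
	 * GFP_NOFS clears __GFP_FS, so reclaim triggered by this
	 * allocation will not recurse into filesystem writeback or fs
	 * shrinkers, and we do not nest into the running transaction.
	 */
	void *md = metadata_alloc(GFP_NOFS);

	if (!md)
		return -ENOMEM;
	/* ... write out dirty pages using md ... */
	kfree(md);
	return 0;
}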

-chris

2016-12-16 23:31:26

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on 4.9

On Fri 16-12-16 17:47:25, Chris Mason wrote:
> On 12/16/2016 05:14 PM, Michal Hocko wrote:
> > On Fri 16-12-16 13:15:18, Chris Mason wrote:
> > > On 12/16/2016 02:39 AM, Michal Hocko wrote:
> > [...]
> > > > I believe the right way to go around this is to pursue what I've started
> > > > in [1]. I will try to prepare something for testing today for you. Stay
> > > > tuned. But I would be really happy if somebody from the btrfs camp could
> > > > check the NOFS aspect of this allocation. We have already seen
> > > > allocation stalls from this path quite recently
> > >
> > > Just double checking, are you asking why we're using GFP_NOFS to avoid going
> > > into btrfs from the btrfs writepages call, or are you asking why we aren't
> > > allowing highmem?
> >
> > I am more interested in the NOFS part. Why cannot this be a full
> > GFP_KERNEL context? What kind of locks we would lock up when recursing
> > to the fs via slab shrinkers?
> >
>
> Since this is our writepages call, any jump into direct reclaim would go to
> writepage, which would end up calling the same set of code to read metadata
> blocks, which would do a GFP_KERNEL allocation and end up back in writepage
> again.

But we are not doing pageout on the page cache from the direct reclaim
for a long time. So basically the only way to recurse back to the fs
code is via slab ([di]cache) shrinkers. Are those a problem as well?
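For context, the pattern I have in mind is a shrinker bailing out for
!__GFP_FS callers, along the lines of this sketch (the my_cache_*
helpers are hypothetical, but the bail-out matches what e.g. the
superblock shrinker does):

static unsigned long my_cache_scan(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	/* GFP_NOFS callers must not re-enter filesystem code. */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	return my_cache_prune(sc->nr_to_scan);	/* hypothetical helper */
}

static unsigned long my_cache_count(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	return my_cache_nr_objects();		/* hypothetical helper */
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};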

--
Michal Hocko
SUSE Labs

2016-12-17 00:02:11

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Fri 16-12-16 19:47:00, Nils Holland wrote:
[...]
> Despite the fact that I'm no expert, I can see that there's no more
> GFP_NOFS being logged, which seems to be what the patches tried to
> achieve. What the still present OOMs mean remains up for
> interpretation by the experts, all I can say is that in the (pre-4.8?)
> past, doing all of the things I just did would probably slow down my
> machine quite a bit, but I can't remember to have ever seen it OOM or
> even crash completely.
>
> Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 10219 pages still pinned.
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
[...]
> Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:470556kB inactive_file:148kB unevictable:0kB writepending:1616kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213172kB slab_unreclaimable:86236kB kernel_stack:1864kB pagetables:3572kB bounce:0kB free_pcp:532kB local_pcp:456kB free_cma:0kB

this is a GFP_KERNEL allocation so it cannot use the highmem zone again.
There is no anonymous memory in this zone but the allocation
context implies the full reclaim context so the file LRU should be
reclaimable. For some reason ~470MB of the active file LRU is still
there. This is quite unexpected. It is harder to tell more without
further data. It would be great if you could enable reclaim related
tracepoints:

mount -t tracefs none /debug/trace
echo 1 > /debug/trace/events/vmscan/enable
cat /debug/trace/trace_pipe > trace.log

should help
[...]

> Dec 16 18:56:31 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0

another allocation in a short time. Killing the task obviously didn't
help because the lowmem memory pressure hasn't been relieved.

[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:41028kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:472164kB inactive_file:108kB unevictable:0kB writepending:112kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2564kB bounce:32kB free_pcp:180kB local_pcp:24kB free_cma:0kB

in fact we have even more pages on the file LRUs.

[...]

> Dec 16 18:56:32 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:40988kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:472436kB inactive_file:144kB unevictable:0kB writepending:312kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2464kB bounce:32kB free_pcp:116kB local_pcp:0kB free_cma:0kB

same here. All that suggests that the page cache cannot be reclaimed for
some reason. It is hard to tell why but there is definitely something
bad going on.
--
Michal Hocko
SUSE Labs

2016-12-17 11:17:15

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically

Michal Hocko wrote:
> On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
>>> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>> */
>>> WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>>>
>>> + /*
>>> + * Help non-failing allocations by giving them access to memory
>>> + * reserves but do not use ALLOC_NO_WATERMARKS because this
>>> + * could deplete whole memory reserves which would just make
>>> + * the situation worse
>>> + */
>>> + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
>>> + if (page)
>>> + goto got_pg;
>>> +
>>
>> But this should be a separate patch, IMO.
>>
>> Do we observe GFP_NOFS lockups when we don't do this?
>
> this is hard to tell but considering users like grow_dev_page we can get
> stuck with a very slow progress I believe. Those allocations could see
> some help.
>
>> Don't we risk
>> premature exhaustion of the memory reserves, and it's better to wait
>> for other reclaimers to make some progress instead?
>
> waiting for other reclaimers would be preferable but we should at least
> give these some priority, which is what ALLOC_HARDER should help with.
>
>> Should we give
>> reserve access to all GFP_NOFS allocations, or just the ones from a
>> reclaim/cleaning context?
>
> I would focus only for those which are important enough. Which are those
> is a harder question. But certainly those with GFP_NOFAIL are important
> enough.
>
>> All that should go into the changelog of a separate allocation booster
>> patch, I think.
>
> The reason I did both in the same patch is to address the concern about
> potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
> ALLOC_HARDER to give the minimum portion of the reserves so that we do
> not risk other high priority users to be blocked out but still help a
> bit at least and prevent from starvation when other reclaimers are
> faster to consume the reclaimed memory.
>
> I can extend the changelog of course but I believe that having both
> changes together makes some sense. NOFS|NOFAIL allocations are not all
> that rare and sometimes we really depend on them making a further
> progress.
>

I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
make sense. My understanding is that the actual I/O operations triggered by I/O
requests from filesystem code are processed by other threads. Even if we grant
access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
I think it is possible that memory allocations by the underlying bio code
fail to make further progress unless memory reserves are granted as well.

Below is a typical trace which I observe under an OOM lockup situation (though
this trace is from an OOM stress test using XFS).

----------------------------------------
[ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
[ 1845.187248] kworker/2:1 D12712 14498 2 0x00000080
[ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
[ 1845.187252] Call Trace:
[ 1845.187253] ? __schedule+0x23f/0xba0
[ 1845.187254] schedule+0x38/0x90
[ 1845.187255] schedule_timeout+0x205/0x4a0
[ 1845.187256] ? del_timer_sync+0xd0/0xd0
[ 1845.187257] schedule_timeout_uninterruptible+0x25/0x30
[ 1845.187258] __alloc_pages_nodemask+0x1035/0x10e0
[ 1845.187259] ? alloc_request_struct+0x14/0x20
[ 1845.187261] alloc_pages_current+0x96/0x1b0
[ 1845.187262] ? bio_alloc_bioset+0x20f/0x2e0
[ 1845.187264] bio_copy_kern+0xc4/0x180
[ 1845.187265] blk_rq_map_kern+0x6f/0x120
[ 1845.187268] __scsi_execute.isra.23+0x12f/0x160
[ 1845.187270] scsi_execute_req_flags+0x8f/0x100
[ 1845.187271] sr_check_events+0xba/0x2b0 [sr_mod]
[ 1845.187274] cdrom_check_events+0x13/0x30 [cdrom]
[ 1845.187275] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1845.187276] disk_check_events+0x5b/0x150
[ 1845.187277] disk_events_workfn+0x17/0x20
[ 1845.187278] process_one_work+0x1fc/0x750
[ 1845.187279] ? process_one_work+0x167/0x750
[ 1845.187279] worker_thread+0x126/0x4a0
[ 1845.187280] kthread+0x10a/0x140
[ 1845.187281] ? process_one_work+0x750/0x750
[ 1845.187282] ? kthread_create_on_node+0x60/0x60
[ 1845.187283] ret_from_fork+0x2a/0x40
----------------------------------------

I think that this GFP_NOIO allocation request needs to consume memory reserves
even more than a GFP_NOFS allocation request does in order to make progress.
Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request so that it
is allowed access to memory reserves as well, like a GFP_NOFS | __GFP_NOFAIL
allocation request?

2016-12-17 13:00:03

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> On Fri 16-12-16 19:47:00, Nils Holland wrote:
> >
> > Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 10219 pages still pinned.
> > Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
> > Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
> [...]
> > Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:470556kB inactive_file:148kB unevictable:0kB writepending:1616kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213172kB slab_unreclaimable:86236kB kernel_stack:1864kB pagetables:3572kB bounce:0kB free_pcp:532kB local_pcp:456kB free_cma:0kB
>
> this is a GFP_KERNEL allocation so it cannot use the highmem zone again.
> There is no anonymous memory in this zone but the allocation
> context implies the full reclaim context so the file LRU should be
> reclaimable. For some reason ~470MB of the active file LRU is still
> there. This is quite unexpected. It is harder to tell more without
> further data. It would be great if you could enable reclaim related
> tracepoints:
>
> mount -t tracefs none /debug/trace
> echo 1 > /debug/trace/events/vmscan/enable
> cat /debug/trace/trace_pipe > trace.log
>
> should help
> [...]

No problem! I enabled writing the trace data to a file and then tried
to trigger another OOM situation. That worked, this time without a
complete kernel panic, but with only my processes being killed and the
system becoming unresponsive. When that happened, I let it run for
another minute or two so that in case it was still logging something
to the trace file, it could continue to do so some time longer. Then I
rebooted with the only thing that still worked, i.e. by means of magic
SysRequest.

The trace file has actually become rather big (around 21 MB). I didn't
dare to cut anything from it because I didn't want to risk deleting
something that might turn out important. So, due to the size, I'm not
attaching the trace file to this message, but it's up compressed
(about 536 KB) to be grabbed at:

http://ftp.tisys.org/pub/misc/trace.log.xz

For reference, here's the OOM report that goes along with this
incident and the trace file:

Dec 17 13:31:06 boerne.fritz.box kernel: Purging GPU memory, 145 pages freed, 10287 pages still pinned.
Dec 17 13:31:07 boerne.fritz.box kernel: awesome invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Dec 17 13:31:07 boerne.fritz.box kernel: awesome cpuset=/ mems_allowed=0
Dec 17 13:31:07 boerne.fritz.box kernel: CPU: 1 PID: 5599 Comm: awesome Not tainted 4.9.0-gentoo #3
Dec 17 13:31:07 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite L500/KSWAA, BIOS V1.80 10/28/2009
Dec 17 13:31:07 boerne.fritz.box kernel: c5a37c18 c1433406 c5a37d48 c5319280 c5a37c48 c1170011 c5a37c9c 00200286
Dec 17 13:31:07 boerne.fritz.box kernel: c5a37c48 c1438fff c5a37c4c c72479c0 c60dd200 c5319280 c1ad1899 c5a37d48
Dec 17 13:31:07 boerne.fritz.box kernel: c5a37c8c c1114407 c10513a5 c5a37c78 c11140a1 00000005 00000000 00000000
Dec 17 13:31:07 boerne.fritz.box kernel: Call Trace:
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1433406>] dump_stack+0x47/0x61
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1170011>] dump_header+0x5f/0x175
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1438fff>] ? ___ratelimit+0x7f/0xe0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1114407>] oom_kill_process+0x207/0x3c0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c10513a5>] ? has_capability_noaudit+0x15/0x20
Dec 17 13:31:07 boerne.fritz.box kernel: [<c11140a1>] ? oom_badness.part.13+0xb1/0x120
Dec 17 13:31:07 boerne.fritz.box kernel: [<c11148c4>] out_of_memory+0xd4/0x270
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1118615>] __alloc_pages_nodemask+0xcf5/0xd60
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1758900>] ? skb_queue_purge+0x30/0x30
Dec 17 13:31:07 boerne.fritz.box kernel: [<c175dcde>] alloc_skb_with_frags+0xee/0x1a0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1753dba>] sock_alloc_send_pskb+0x19a/0x1c0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1825880>] ? wait_for_unix_gc+0x20/0x90
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1823fc0>] unix_stream_sendmsg+0x2a0/0x350
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1750b3d>] sock_sendmsg+0x2d/0x40
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1750bb7>] sock_write_iter+0x67/0xc0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1172c42>] do_readv_writev+0x1e2/0x380
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1750b50>] ? sock_sendmsg+0x40/0x40
Dec 17 13:31:07 boerne.fritz.box kernel: [<c10806f2>] ? pick_next_task_fair+0x3f2/0x510
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1033763>] ? lapic_next_event+0x13/0x20
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1173d16>] vfs_writev+0x36/0x60
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1173d85>] do_writev+0x45/0xc0
Dec 17 13:31:07 boerne.fritz.box kernel: [<c1173efb>] SyS_writev+0x1b/0x20
Dec 17 13:31:07 boerne.fritz.box kernel: [<c10018ec>] do_fast_syscall_32+0x7c/0x130
Dec 17 13:31:07 boerne.fritz.box kernel: [<c194232b>] sysenter_past_esp+0x40/0x6a
Dec 17 13:31:07 boerne.fritz.box kernel: Mem-Info:
Dec 17 13:31:07 boerne.fritz.box kernel: active_anon:99962 inactive_anon:10651 isolated_anon:0
active_file:305350 inactive_file:411946 isolated_file:36
unevictable:0 dirty:5961 writeback:0 unstable:0
slab_reclaimable:50496 slab_unreclaimable:21852
mapped:36866 shmem:10990 pagetables:973 bounce:0
free:82280 free_pcp:103 free_cma:0
Dec 17 13:31:07 boerne.fritz.box kernel: Node 0 active_anon:399848kB inactive_anon:42604kB active_file:1221400kB inactive_file:1647784kB unevictable:0kB isolated(anon):0kB isolated(file):144kB mapped:147464kB dirty:23844kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 165888kB anon_thp: 43960kB writeback_tmp:0kB unstable:0kB pages_scanned:56194255 all_unreclaimable? yes
Dec 17 13:31:07 boerne.fritz.box kernel: DMA free:3944kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:6504kB inactive_file:0kB unevictable:0kB writepending:120kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:2712kB slab_unreclaimable:1016kB kernel_stack:360kB pagetables:1132kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 17 13:31:07 boerne.fritz.box kernel: lowmem_reserve[]: 0 808 3849 3849
Dec 17 13:31:07 boerne.fritz.box kernel: Normal free:41056kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:483028kB inactive_file:4kB unevictable:0kB writepending:2056kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:199272kB slab_unreclaimable:86392kB kernel_stack:1656kB pagetables:2760kB bounce:0kB free_pcp:252kB local_pcp:144kB free_cma:0kB
Dec 17 13:31:07 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 24330 24330
Dec 17 13:31:07 boerne.fritz.box kernel: HighMem free:284120kB min:512kB low:39184kB high:77856kB active_anon:399848kB inactive_anon:42604kB active_file:731868kB inactive_file:1647684kB unevictable:0kB writepending:21668kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:160kB local_pcp:84kB free_cma:0kB
Dec 17 13:31:07 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 0 0
Dec 17 13:31:07 boerne.fritz.box kernel: DMA: 4*4kB (U) 1*8kB (E) 1*16kB (U) 8*32kB (UE) 3*64kB (UE) 1*128kB (U) 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (M) 0*4096kB = 3944kB
Dec 17 13:31:07 boerne.fritz.box kernel: Normal: 40*4kB (UM) 28*8kB (UME) 22*16kB (UME) 20*32kB (M) 92*64kB (UM) 76*128kB (UME) 20*256kB (UME) 3*512kB (UM) 1*1024kB (E) 2*2048kB (UM) 3*4096kB (M) = 41056kB
Dec 17 13:31:07 boerne.fritz.box kernel: HighMem: 1452*4kB (UME) 1347*8kB (UME) 903*16kB (UME) 443*32kB (UME) 135*64kB (UME) 33*128kB (UME) 11*256kB (ME) 10*512kB (UME) 7*1024kB (UME) 3*2048kB (UE) 50*4096kB (UM) = 284120kB
Dec 17 13:31:07 boerne.fritz.box kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 17 13:31:07 boerne.fritz.box kernel: 728298 total pagecache pages
Dec 17 13:31:07 boerne.fritz.box kernel: 0 pages in swap cache
Dec 17 13:31:07 boerne.fritz.box kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 17 13:31:07 boerne.fritz.box kernel: Free swap = 3781628kB
Dec 17 13:31:07 boerne.fritz.box kernel: Total swap = 3781628kB
Dec 17 13:31:07 boerne.fritz.box kernel: 1006816 pages RAM
Dec 17 13:31:07 boerne.fritz.box kernel: 778564 pages HighMem/MovableOnly
Dec 17 13:31:07 boerne.fritz.box kernel: 16403 pages reserved
Dec 17 13:31:07 boerne.fritz.box kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 17 13:31:07 boerne.fritz.box kernel: [ 1876] 0 1876 6165 985 10 3 0 0 systemd-journal
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2497] 0 2497 2965 915 6 3 0 -1000 systemd-udevd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2582] 107 2582 3874 902 8 3 0 0 systemd-timesyn
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2585] 88 2585 1158 567 6 3 0 0 nullmailer-send
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2588] 108 2588 1271 848 7 3 0 -900 dbus-daemon
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2590] 0 2590 1510 459 5 3 0 0 fcron
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2594] 0 2594 1521 994 6 3 0 0 systemd-logind
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2595] 0 2595 22001 3143 21 3 0 0 NetworkManager
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2649] 0 2649 768 579 5 3 0 0 dhcpcd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2655] 0 2655 639 416 5 3 0 0 vnstatd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2656] 0 2656 1235 843 6 3 0 0 login
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2657] 0 2657 1460 1047 6 3 0 -1000 sshd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2684] 0 2684 1972 1291 7 3 0 0 systemd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2713] 0 2713 2279 569 7 3 0 0 (sd-pam)
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2728] 0 2728 1836 914 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2768] 109 2768 16725 3172 19 3 0 0 polkitd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2798] 0 2798 2157 1375 7 3 0 0 wpa_supplicant
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2864] 0 2864 1743 703 7 3 0 0 start_trace
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2866] 0 2866 1395 390 7 3 0 0 cat
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2867] 0 2867 1370 422 6 3 0 0 tail
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2916] 0 2916 1235 845 6 3 0 0 login
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2917] 0 2917 1836 870 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2956] 0 2956 16257 14998 36 3 0 0 emerge
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2963] 0 2963 1235 846 6 3 0 0 login
Dec 17 13:31:07 boerne.fritz.box kernel: [ 2972] 0 2972 1836 906 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 3021] 0 3021 6058 1745 15 3 0 0 journalctl
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5253] 250 5253 549 356 5 3 0 0 sandbox
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5255] 250 5255 2629 1567 8 3 0 0 ebuild.sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5272] 250 5272 2995 1763 8 3 0 0 ebuild.sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5335] 0 5335 1235 843 6 3 0 0 login
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5343] 250 5343 1123 724 5 3 0 0 emake
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5345] 250 5345 909 661 6 3 0 0 make
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5467] 1000 5467 2033 1374 7 3 0 0 systemd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5483] 1000 5483 6633 597 10 3 0 0 (sd-pam)
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5506] 1000 5506 1836 887 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5530] 250 5530 1057 674 4 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5531] 250 5531 3204 2648 10 3 0 0 python2.7
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5536] 1000 5536 25339 2203 18 3 0 0 pulseaudio
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5537] 111 5537 5763 643 9 3 0 0 rtkit-daemon
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5560] 1000 5560 3575 1420 10 3 0 0 gconf-helper
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5567] 1000 5567 1743 709 7 3 0 0 startx
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5588] 1000 5588 1001 579 5 3 0 0 xinit
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5589] 1000 5589 23142 6927 42 3 0 0 X
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5599] 1000 5599 10592 4532 21 3 0 0 awesome
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5625] 1000 5625 1571 616 7 3 0 0 dbus-launch
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5626] 1000 5626 1238 636 6 3 0 0 dbus-daemon
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5631] 1000 5631 1571 621 7 3 0 0 dbus-launch
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5632] 1000 5632 1238 703 6 3 0 0 dbus-daemon
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5659] 250 5659 3749 3243 11 3 0 0 python
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5671] 1000 5671 31584 7782 39 3 0 0 nm-applet
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5707] 1000 5707 11224 1897 14 3 0 0 at-spi-bus-laun
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5718] 1000 5718 1238 806 6 3 0 0 dbus-daemon
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5725] 1000 5725 7480 2144 12 3 0 0 at-spi2-registr
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5732] 1000 5732 10179 1469 14 3 0 0 gvfsd
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5765] 1000 5765 194951 71017 247 3 0 0 firefox
Dec 17 13:31:07 boerne.fritz.box kernel: [ 5825] 250 5825 1209 839 5 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 7253] 1000 7253 21521 7455 32 3 0 0 xfce4-terminal
Dec 17 13:31:07 boerne.fritz.box kernel: [ 7359] 1000 7359 1836 891 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 8641] 1000 8641 1533 593 6 3 0 0 tar
Dec 17 13:31:07 boerne.fritz.box kernel: [ 8642] 1000 8642 17834 16879 38 3 0 0 xz
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9059] 250 9059 10070 2536 13 3 0 0 python
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9063] 250 9063 3155 1923 10 3 0 0 python
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9064] 250 9064 3155 1926 10 3 0 0 python
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9068] 250 9068 1211 826 5 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9075] 250 9075 3847 3307 11 3 0 0 python
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9417] 1000 9417 1829 901 7 3 0 0 bash
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9459] 1000 9459 2246 1206 9 3 0 0 ssh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9499] 250 9499 1087 710 5 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9567] 250 9567 1087 532 5 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: [ 9570] 250 9570 1088 618 5 3 0 0 sh
Dec 17 13:31:07 boerne.fritz.box kernel: Out of memory: Kill process 5765 (firefox) score 36 or sacrifice child
Dec 17 13:31:07 boerne.fritz.box kernel: Killed process 5765 (firefox) total-vm:779804kB, anon-rss:183712kB, file-rss:100332kB, shmem-rss:24kB
Dec 17 13:31:08 boerne.fritz.box kernel: awesome invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Dec 17 13:31:08 boerne.fritz.box kernel: awesome cpuset=/ mems_allowed=0
Dec 17 13:31:08 boerne.fritz.box kernel: CPU: 0 PID: 5599 Comm: awesome Not tainted 4.9.0-gentoo #3
Dec 17 13:31:08 boerne.fritz.box kernel: Hardware name: TOSHIBA Satellite L500/KSWAA, BIOS V1.80 10/28/2009
Dec 17 13:31:08 boerne.fritz.box kernel: c5a37c18 c1433406 c5a37d48 c531ca00 c5a37c48 c1170011 c5a37c9c 00000286
Dec 17 13:31:08 boerne.fritz.box kernel: c5a37c48 c1438fff c5a37c4c c7246c00 e737e800 c531ca00 c1ad1899 c5a37d48
Dec 17 13:31:08 boerne.fritz.box kernel: c5a37c8c c1114407 001d89cc c5a37c78 c1114000 00000005 00000000 00000000
Dec 17 13:31:08 boerne.fritz.box kernel: Call Trace:
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1433406>] dump_stack+0x47/0x61
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1170011>] dump_header+0x5f/0x175
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1438fff>] ? ___ratelimit+0x7f/0xe0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1114407>] oom_kill_process+0x207/0x3c0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1114000>] ? oom_badness.part.13+0x10/0x120
Dec 17 13:31:08 boerne.fritz.box kernel: [<c11148c4>] out_of_memory+0xd4/0x270
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1118615>] __alloc_pages_nodemask+0xcf5/0xd60
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1758900>] ? skb_queue_purge+0x30/0x30
Dec 17 13:31:08 boerne.fritz.box kernel: [<c175dcde>] alloc_skb_with_frags+0xee/0x1a0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1753dba>] sock_alloc_send_pskb+0x19a/0x1c0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1825880>] ? wait_for_unix_gc+0x20/0x90
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1823fc0>] unix_stream_sendmsg+0x2a0/0x350
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1750b3d>] sock_sendmsg+0x2d/0x40
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1750bb7>] sock_write_iter+0x67/0xc0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1172c42>] do_readv_writev+0x1e2/0x380
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1750b50>] ? sock_sendmsg+0x40/0x40
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1033763>] ? lapic_next_event+0x13/0x20
Dec 17 13:31:08 boerne.fritz.box kernel: [<c10ae675>] ? clockevents_program_event+0x95/0x190
Dec 17 13:31:08 boerne.fritz.box kernel: [<c10a074a>] ? __hrtimer_run_queues+0x20a/0x280
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1173d16>] vfs_writev+0x36/0x60
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1173d85>] do_writev+0x45/0xc0
Dec 17 13:31:08 boerne.fritz.box kernel: [<c1173efb>] SyS_writev+0x1b/0x20
Dec 17 13:31:08 boerne.fritz.box kernel: [<c10018ec>] do_fast_syscall_32+0x7c/0x130
Dec 17 13:31:08 boerne.fritz.box kernel: [<c194232b>] sysenter_past_esp+0x40/0x6a
Dec 17 13:31:08 boerne.fritz.box kernel: Mem-Info:
Dec 17 13:31:08 boerne.fritz.box kernel: active_anon:53993 inactive_anon:7042 isolated_anon:0
active_file:310474 inactive_file:411136 isolated_file:0
unevictable:0 dirty:9093 writeback:0 unstable:0
slab_reclaimable:50588 slab_unreclaimable:21858
mapped:18104 shmem:7404 pagetables:732 bounce:0
free:127428 free_pcp:488 free_cma:0
Dec 17 13:31:08 boerne.fritz.box kernel: Node 0 active_anon:215972kB inactive_anon:28168kB active_file:1241896kB inactive_file:1644544kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:72416kB dirty:36372kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 112640kB anon_thp: 29616kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 17 13:31:08 boerne.fritz.box kernel: DMA free:3928kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:6964kB inactive_file:44kB unevictable:0kB writepending:596kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3016kB slab_unreclaimable:1176kB kernel_stack:96kB pagetables:388kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 17 13:31:08 boerne.fritz.box kernel: lowmem_reserve[]: 0 808 3849 3849
Dec 17 13:31:08 boerne.fritz.box kernel: Normal free:40944kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:483096kB inactive_file:80kB unevictable:0kB writepending:2060kB present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:199336kB slab_unreclaimable:86256kB kernel_stack:1632kB pagetables:2540kB bounce:0kB free_pcp:692kB local_pcp:396kB free_cma:0kB
Dec 17 13:31:08 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 24330 24330
Dec 17 13:31:08 boerne.fritz.box kernel: HighMem free:464840kB min:512kB low:39184kB high:77856kB active_anon:215972kB inactive_anon:28168kB active_file:751836kB inactive_file:1644320kB unevictable:0kB writepending:33716kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1260kB local_pcp:628kB free_cma:0kB
Dec 17 13:31:08 boerne.fritz.box kernel: lowmem_reserve[]: 0 0 0 0
Dec 17 13:31:08 boerne.fritz.box kernel: DMA: 6*4kB (U) 14*8kB (U) 15*16kB (U) 7*32kB (U) 2*64kB (U) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (E) 1*2048kB (M) 0*4096kB = 3928kB
Dec 17 13:31:08 boerne.fritz.box kernel: Normal: 40*4kB (UM) 30*8kB (UM) 22*16kB (UME) 24*32kB (UM) 92*64kB (UM) 76*128kB (UME) 19*256kB (UM) 3*512kB (UM) 1*1024kB (E) 2*2048kB (UM) 3*4096kB (M) = 40944kB
Dec 17 13:31:08 boerne.fritz.box kernel: HighMem: 14*4kB (UE) 1256*8kB (ME) 869*16kB (UME) 520*32kB (UME) 210*64kB (UME) 93*128kB (UME) 42*256kB (ME) 22*512kB (UME) 12*1024kB (UME) 30*2048kB (UME) 74*4096kB (UM) = 464840kB
Dec 17 13:31:08 boerne.fritz.box kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 17 13:31:08 boerne.fritz.box kernel: 729003 total pagecache pages
Dec 17 13:31:08 boerne.fritz.box kernel: 0 pages in swap cache
Dec 17 13:31:08 boerne.fritz.box kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 17 13:31:08 boerne.fritz.box kernel: Free swap = 3781628kB
Dec 17 13:31:08 boerne.fritz.box kernel: Total swap = 3781628kB
Dec 17 13:31:08 boerne.fritz.box kernel: 1006816 pages RAM
Dec 17 13:31:08 boerne.fritz.box kernel: 778564 pages HighMem/MovableOnly
Dec 17 13:31:08 boerne.fritz.box kernel: 16403 pages reserved
Dec 17 13:31:08 boerne.fritz.box kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 17 13:31:08 boerne.fritz.box kernel: [ 1876] 0 1876 6165 1016 10 3 0 0 systemd-journal
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2497] 0 2497 2965 915 6 3 0 -1000 systemd-udevd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2582] 107 2582 3874 902 8 3 0 0 systemd-timesyn
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2585] 88 2585 1158 567 6 3 0 0 nullmailer-send
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2588] 108 2588 1271 848 7 3 0 -900 dbus-daemon
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2590] 0 2590 1510 459 5 3 0 0 fcron
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2594] 0 2594 1521 994 6 3 0 0 systemd-logind
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2595] 0 2595 22001 3143 21 3 0 0 NetworkManager
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2649] 0 2649 768 579 5 3 0 0 dhcpcd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2655] 0 2655 639 416 5 3 0 0 vnstatd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2656] 0 2656 1235 843 6 3 0 0 login
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2657] 0 2657 1460 1047 6 3 0 -1000 sshd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2684] 0 2684 1972 1291 7 3 0 0 systemd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2713] 0 2713 2279 569 7 3 0 0 (sd-pam)
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2728] 0 2728 1836 914 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2768] 109 2768 16725 3172 19 3 0 0 polkitd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2798] 0 2798 2157 1375 7 3 0 0 wpa_supplicant
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2864] 0 2864 1743 703 7 3 0 0 start_trace
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2866] 0 2866 1395 390 7 3 0 0 cat
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2867] 0 2867 1370 422 6 3 0 0 tail
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2916] 0 2916 1235 845 6 3 0 0 login
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2917] 0 2917 1836 870 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2956] 0 2956 16257 14998 36 3 0 0 emerge
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2963] 0 2963 1235 846 6 3 0 0 login
Dec 17 13:31:08 boerne.fritz.box kernel: [ 2972] 0 2972 1836 906 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 3021] 0 3021 6058 1761 15 3 0 0 journalctl
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5253] 250 5253 549 356 5 3 0 0 sandbox
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5255] 250 5255 2629 1567 8 3 0 0 ebuild.sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5272] 250 5272 2995 1763 8 3 0 0 ebuild.sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5335] 0 5335 1235 843 6 3 0 0 login
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5343] 250 5343 1123 724 5 3 0 0 emake
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5345] 250 5345 909 661 6 3 0 0 make
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5467] 1000 5467 2033 1374 7 3 0 0 systemd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5483] 1000 5483 6633 597 10 3 0 0 (sd-pam)
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5506] 1000 5506 1836 887 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5530] 250 5530 1057 674 4 3 0 0 sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5531] 250 5531 3204 2648 10 3 0 0 python2.7
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5536] 1000 5536 25339 2203 18 3 0 0 pulseaudio
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5537] 111 5537 5763 643 9 3 0 0 rtkit-daemon
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5560] 1000 5560 3575 1420 10 3 0 0 gconf-helper
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5567] 1000 5567 1743 709 7 3 0 0 startx
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5588] 1000 5588 1001 579 5 3 0 0 xinit
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5589] 1000 5589 23069 6556 42 3 0 0 X
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5599] 1000 5599 10592 4532 21 3 0 0 awesome
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5625] 1000 5625 1571 616 7 3 0 0 dbus-launch
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5626] 1000 5626 1238 636 6 3 0 0 dbus-daemon
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5631] 1000 5631 1571 621 7 3 0 0 dbus-launch
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5632] 1000 5632 1238 703 6 3 0 0 dbus-daemon
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5659] 250 5659 3749 3243 11 3 0 0 python
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5671] 1000 5671 31584 7782 39 3 0 0 nm-applet
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5707] 1000 5707 11224 1897 14 3 0 0 at-spi-bus-laun
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5718] 1000 5718 1238 806 6 3 0 0 dbus-daemon
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5725] 1000 5725 7480 2144 12 3 0 0 at-spi2-registr
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5732] 1000 5732 10179 1469 14 3 0 0 gvfsd
Dec 17 13:31:08 boerne.fritz.box kernel: [ 5825] 250 5825 1209 839 5 3 0 0 sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 7253] 1000 7253 21521 7455 32 3 0 0 xfce4-terminal
Dec 17 13:31:08 boerne.fritz.box kernel: [ 7359] 1000 7359 1836 891 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 8641] 1000 8641 1533 593 6 3 0 0 tar
Dec 17 13:31:08 boerne.fritz.box kernel: [ 8642] 1000 8642 17834 16879 38 3 0 0 xz
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9059] 250 9059 10070 2536 13 3 0 0 python
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9063] 250 9063 3155 1923 10 3 0 0 python
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9064] 250 9064 3155 1926 10 3 0 0 python
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9068] 250 9068 1211 826 5 3 0 0 sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9075] 250 9075 3847 3307 11 3 0 0 python
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9417] 1000 9417 1829 901 7 3 0 0 bash
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9459] 1000 9459 2246 1206 9 3 0 0 ssh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9499] 250 9499 1087 711 5 3 0 0 sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9607] 250 9607 1211 755 5 3 0 0 sh
Dec 17 13:31:08 boerne.fritz.box kernel: [ 9608] 250 9608 1087 533 5 3 0 0 sh

Greetings
Nils

2016-12-17 14:45:17

by Tetsuo Handa

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On 2016/12/17 21:59, Nils Holland wrote:
> On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
>> mount -t tracefs none /debug/trace
>> echo 1 > /debug/trace/events/vmscan/enable
>> cat /debug/trace/trace_pipe > trace.log
>>
>> should help
>> [...]
>
> No problem! I enabled writing the trace data to a file and then tried
> to trigger another OOM situation. That worked, this time without a
> complete kernel panic, but with only my processes being killed and the
> system becoming unresponsive. When that happened, I let it run for
> another minute or two so that in case it was still logging something
> to the trace file, it could continue to do so some time longer. Then I
> rebooted with the only thing that still worked, i.e. by means of magic
> SysRequest.

Under an OOM situation, writing to a file on disk is unlikely to work. Maybe
logging via the network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
if you are using bash) works better. (I wish we could do it from the kernel
so that /bin/cat is not disturbed by delays due to page faults.)

If you can configure netconsole for logging OOM killer messages and a
UDP socket for logging trace_pipe messages, udplogger at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
might fit for logging both outputs with timestamps into a single file.

2016-12-17 17:11:22

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> On 2016/12/17 21:59, Nils Holland wrote:
> > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> >> mount -t tracefs none /debug/trace
> >> echo 1 > /debug/trace/events/vmscan/enable
> >> cat /debug/trace/trace_pipe > trace.log
> >>
> >> should help
> >> [...]
> >
> > No problem! I enabled writing the trace data to a file and then tried
> > to trigger another OOM situation. That worked, this time without a
> > complete kernel panic, but with only my processes being killed and the
> > system becoming unresponsive.
> > [...]
>
> Under OOM situation, writing to a file on disk unlikely works. Maybe
> logging via network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> if your are using bash) works better. (I wish we can do it from kernel
> so that /bin/cat is not disturbed by delays due to page fault.)
>
> If you can configure netconsole for logging OOM killer messages and
> UDP socket for logging trace_pipe messages, udplogger at
> https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> might fit for logging both output with timestamp into a single file.

Thanks for the hint, sounds very sane! I'll try to go that route for
the next log / trace I produce. Of course, if Michal says that the
trace file I've already posted, which was logged to a file, is useless
and it would have been better to log to a different machine via the
network instead, I can repeat the current experiment and produce a new
file at any time. :-)

Greetings
Nils

2016-12-17 21:06:58

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> On 2016/12/17 21:59, Nils Holland wrote:
> > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> >> mount -t tracefs none /debug/trace
> >> echo 1 > /debug/trace/events/vmscan/enable
> >> cat /debug/trace/trace_pipe > trace.log
> >>
> >> should help
> >> [...]
> >
> > No problem! I enabled writing the trace data to a file and then tried
> > to trigger another OOM situation. That worked, this time without a
> > complete kernel panic, but with only my processes being killed and the
> > system becoming unresponsive.
>
> Under OOM situation, writing to a file on disk unlikely works. Maybe
> logging via network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> if your are using bash) works better. (I wish we can do it from kernel
> so that /bin/cat is not disturbed by delays due to page fault.)
>
> If you can configure netconsole for logging OOM killer messages and
> UDP socket for logging trace_pipe messages, udplogger at
> https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> might fit for logging both output with timestamp into a single file.

Actually, I decided to give this a try once more on machine #2, i.e.
not the one that produced the previous trace, but the other one.

I logged via netconsole as well as 'cat /debug/trace/trace_pipe' via
the network to another machine running udplogger. After the machine
had been freshly booted and I had set up the logging, unpacking of the
firefox source tarball started. After it had been unpacking for a
while, the first load of trace messages started to appear. Some time
later, OOMs started to appear - I've got quite a lot of them in my
capture file this time.

Unfortunately, the reclaim trace messages stopped a while after the first
OOM messages showed up - most likely my "cat" had been killed at that
point or had become unresponsive. :-/

In the end, the machine didn't completely panic, but after nothing new
showed up being logged via the network, I walked up to the
machine and found it in a state where I couldn't really log in to it
anymore; all that still worked was, as always, a magic SysRequest reboot.

The complete log, from machine boot right up to the point where it
wouldn't really do anything anymore, is up again on my web server (~42
MB, 928 KB packed):

http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz

Greetings
Nils

2016-12-18 00:35:07

by Xin Zhou

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

Hi,
The system is supposed to have a special memory reservation for coredump and other debug info when encountering a panic;
the size seems to be configurable.
Thanks,
Xin
 
 

Sent: Saturday, December 17, 2016 at 6:44 AM
From: "Tetsuo Handa" <[email protected]>
To: "Nils Holland" <[email protected]>, "Michal Hocko" <[email protected]>
Cc: [email protected], [email protected], "Chris Mason" <[email protected]>, "David Sterba" <[email protected]>, [email protected]
Subject: Re: OOM: Better, but still there on
On 2016/12/17 21:59, Nils Holland wrote:
> On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
>> mount -t tracefs none /debug/trace
>> echo 1 > /debug/trace/events/vmscan/enable
>> cat /debug/trace/trace_pipe > trace.log
>>
>> should help
>> [...]
>
> No problem! I enabled writing the trace data to a file and then tried
> to trigger another OOM situation. That worked, this time without a
> complete kernel panic, but with only my processes being killed and the
> system becoming unresponsive. When that happened, I let it run for
> another minute or two so that in case it was still logging something
> to the trace file, it could continue to do so some time longer. Then I
> rebooted with the only thing that still worked, i.e. by means of magic
> SysRequest.

Under OOM situation, writing to a file on disk unlikely works. Maybe
logging via network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
if your are using bash) works better. (I wish we can do it from kernel
so that /bin/cat is not disturbed by delays due to page fault.)

If you can configure netconsole for logging OOM killer messages and
UDP socket for logging trace_pipe messages, udplogger at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
might fit for logging both output with timestamp into a single file.

2016-12-18 05:15:24

by Tetsuo Handa

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

Nils Holland wrote:
> On Sat, Dec 17, 2016 at 11:44:45PM +0900, Tetsuo Handa wrote:
> > On 2016/12/17 21:59, Nils Holland wrote:
> > > On Sat, Dec 17, 2016 at 01:02:03AM +0100, Michal Hocko wrote:
> > >> mount -t tracefs none /debug/trace
> > >> echo 1 > /debug/trace/events/vmscan/enable
> > >> cat /debug/trace/trace_pipe > trace.log
> > >>
> > >> should help
> > >> [...]
> > >
> > > No problem! I enabled writing the trace data to a file and then tried
> > > to trigger another OOM situation. That worked, this time without a
> > > complete kernel panic, but with only my processes being killed and the
> > > system becoming unresponsive.
> >
> > Under OOM situation, writing to a file on disk unlikely works. Maybe
> > logging via network ( "cat /debug/trace/trace_pipe > /dev/udp/$ip/$port"
> > if your are using bash) works better. (I wish we can do it from kernel
> > so that /bin/cat is not disturbed by delays due to page fault.)
> >
> > If you can configure netconsole for logging OOM killer messages and
> > UDP socket for logging trace_pipe messages, udplogger at
> > https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
> > might fit for logging both output with timestamp into a single file.
>
> Actually, I decided to give this a try once more on machine #2, i.e.
> not the one that produced the previous trace, but the other one.
>
> I logged via netconsole as well as 'cat /debug/trace/trace_pipe' via
> the network to another machine running udplogger. After the machine
> had been frehsly booted and I had set up the logging, unpacking of the
> firefox source tarball started. After it had been unpacking for a
> while, the first load of trace messages started to appear. Some time
> later, OOMs started to appear - I've got quite a lot of them in my
> capture file this time.

Thank you for capturing. I think it worked well. Let's wait for Michal.

The first OOM killer invocation was

2016-12-17 21:36:56 192.168.17.23:6665 [ 1276.828639] Killed process 3894 (xz) total-vm:68640kB, anon-rss:65920kB, file-rss:1696kB, shmem-rss:0kB

and the last OOM killer invocation was

2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800677] Killed process 3070 (screen) total-vm:7440kB, anon-rss:960kB, file-rss:2360kB, shmem-rss:0kB

and trace output was sent until

2016-12-17 21:37:07 192.168.17.23:48468 kworker/u4:4-3896 [000] .... 1287.202958: mm_shrink_slab_start: super_cache_scan+0x0/0x170 f4436ed4: nid: 0 objects to shrink 86 gfp_flags GFP_NOFS|__GFP_NOFAIL pgs_scanned 32 lru_pgs 406078 cache items 412 delta 0 total_scan 86

which (I hope) should be sufficient for analysis.

>
> Unfortunately, the reclaim trace messages stopped a while after the first
> OOM messages show up - most likely my "cat" had been killed at that
> point or became unresponsive. :-/
>
> In the end, the machine didn't completely panic, but after nothing new
> showed up being logged via the network, I walked up to the
> machine and found it in a state where I couldn't really log in to it
> anymore, but all that worked was, as always, a magic SysRequest reboot.

There is a known issue (since Linux 2.6.32) where all memory allocation requests
get stuck due to a kswapd vs. shrink_inactive_list() livelock which occurs under
an almost-OOM situation ( http://lkml.kernel.org/r/20160211225929.GU14668@dastard ).
If we hit it, even the "page allocation stalls for " messages do not show up.
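The loop in question looks roughly like this (simplified from that era's
shrink_inactive_list() in mm/vmscan.c; not a verbatim quote):

	/*
	 * Direct reclaimers throttle here when too many pages are
	 * isolated; under a near-OOM load nothing makes the isolated
	 * counters drop, so every allocating task can end up sleeping
	 * in this loop and no progress is made anywhere.
	 */
	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}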

Even if we didn't hit it, although agetty and sshd were still alive

2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800614] [ 2800] 0 2800 1152 494 6 3 0 0 agetty
2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800618] [ 2802] 0 2802 1457 1055 6 3 0 -1000 sshd

memory allocation was being delayed far too much

2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034624] btrfs-transacti: page alloction stalls for 93995ms, order:0, mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034628] CPU: 1 PID: 1949 Comm: btrfs-transacti Not tainted 4.9.0-gentoo #3
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034630] Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034638] f162f94c c142bd8e 00000001 00000000 f162f970 c110ad7e c1b58833 02400840
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034645] f162f978 f162f980 c1b55814 f162f960 00000160 f162fa38 c110b78c 02400840
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034652] c1b55814 00016f2b 00000000 00400000 00000000 f21d0000 f21d0000 00000001
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034653] Call Trace:
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034660] [<c142bd8e>] dump_stack+0x47/0x69
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034666] [<c110ad7e>] warn_alloc+0xce/0xf0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034671] [<c110b78c>] __alloc_pages_nodemask+0x97c/0xd30
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034678] [<c1103fbd>] ? find_get_entry+0x1d/0x100
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034681] [<c1102fc1>] ? add_to_page_cache_lru+0x61/0xc0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034685] [<c110414d>] pagecache_get_page+0xad/0x270
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034692] [<c1366556>] alloc_extent_buffer+0x116/0x3e0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034699] [<c1334ade>] btrfs_find_create_tree_block+0xe/0x10
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034704] [<c132a62f>] btrfs_alloc_tree_block+0x1ef/0x5f0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034710] [<c1079050>] ? autoremove_wake_function+0x40/0x40
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034716] [<c130f873>] __btrfs_cow_block+0x143/0x5f0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034723] [<c130feca>] btrfs_cow_block+0x13a/0x220
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034727] [<c13133a1>] btrfs_search_slot+0x1d1/0x870
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034731] [<c131a74a>] lookup_inline_extent_backref+0x10a/0x6d0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034736] [<c19b656c>] ? common_interrupt+0x2c/0x34
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034742] [<c131c959>] __btrfs_free_extent+0x129/0xe80
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034750] [<c1322160>] __btrfs_run_delayed_refs+0xaf0/0x13e0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034754] [<c106f759>] ? set_next_entity+0x659/0xec0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034757] [<c106c351>] ? put_prev_entity+0x21/0xcf0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034801] [<fa83b2da>] ? xfs_attr3_leaf_add_work+0x25a/0x420 [xfs]
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034808] [<c13259f1>] btrfs_run_delayed_refs+0x71/0x260
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034813] [<c10903ef>] ? lock_timer_base+0x5f/0x80
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034818] [<c133cefb>] btrfs_commit_transaction+0x2b/0xd30
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034821] [<c133dc65>] ? start_transaction+0x65/0x4b0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034826] [<c1337f65>] transaction_kthread+0x1b5/0x1d0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034830] [<c1337db0>] ? btrfs_cleanup_transaction+0x490/0x490
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034833] [<c10552e7>] kthread+0x97/0xb0
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034837] [<c1055250>] ? __kthread_parkme+0x60/0x60
2016-12-17 21:41:03 192.168.17.23:6665 [ 1521.034842] [<c19b5d77>] ret_from_fork+0x1b/0x28

and therefore the memory allocations triggered by page faults while trying to log in were stalling for too long to complete.

>
> The complete log, from machine boot right up to the point where it
> wouldn't really do anything anymore, is up again on my web server (~42
> MB, 928 KB packed):
>
> http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz
>
> Greetings
> Nils
>

It might be pointless to check, but is your 4.9.0-gentoo kernel using 4.9.0 final source?
The typo "page alloction stalls" was fixed in v4.9-rc5. Maybe some last minute changes are
missing...
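
A quick way to check is whether the tree the kernel was built from already
contains the corrected spelling, e.g. (run from the kernel source directory):

$ grep -n "allocation stalls" mm/page_alloc.c

No match would suggest a pre-rc5 base.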

2016-12-18 16:37:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically

On Sat 17-12-16 20:17:07, Tetsuo Handa wrote:
[...]
> I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
> make sense. My understanding is that the actual I/O operations triggered by I/O
> requests from filesystem code are processed by other threads. Even if we grant
> access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
> I think it is possible that memory allocations by the underlying bio code
> fail to make further progress unless memory reserves are granted as well.

The IO layer should rely on mempools to guarantee forward progress.

> Below is a typical trace which I observe under an OOM lockup situation (though
> this trace is from an OOM stress test using XFS).
>
> ----------------------------------------
> [ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
> [ 1845.187248] kworker/2:1 D12712 14498 2 0x00000080
> [ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1845.187252] Call Trace:
> [ 1845.187253] ? __schedule+0x23f/0xba0
> [ 1845.187254] schedule+0x38/0x90
> [ 1845.187255] schedule_timeout+0x205/0x4a0
> [ 1845.187256] ? del_timer_sync+0xd0/0xd0
> [ 1845.187257] schedule_timeout_uninterruptible+0x25/0x30
> [ 1845.187258] __alloc_pages_nodemask+0x1035/0x10e0
> [ 1845.187259] ? alloc_request_struct+0x14/0x20
> [ 1845.187261] alloc_pages_current+0x96/0x1b0
> [ 1845.187262] ? bio_alloc_bioset+0x20f/0x2e0
> [ 1845.187264] bio_copy_kern+0xc4/0x180
> [ 1845.187265] blk_rq_map_kern+0x6f/0x120
> [ 1845.187268] __scsi_execute.isra.23+0x12f/0x160
> [ 1845.187270] scsi_execute_req_flags+0x8f/0x100
> [ 1845.187271] sr_check_events+0xba/0x2b0 [sr_mod]
> [ 1845.187274] cdrom_check_events+0x13/0x30 [cdrom]
> [ 1845.187275] sr_block_check_events+0x25/0x30 [sr_mod]
> [ 1845.187276] disk_check_events+0x5b/0x150
> [ 1845.187277] disk_events_workfn+0x17/0x20
> [ 1845.187278] process_one_work+0x1fc/0x750
> [ 1845.187279] ? process_one_work+0x167/0x750
> [ 1845.187279] worker_thread+0x126/0x4a0
> [ 1845.187280] kthread+0x10a/0x140
> [ 1845.187281] ? process_one_work+0x750/0x750
> [ 1845.187282] ? kthread_create_on_node+0x60/0x60
> [ 1845.187283] ret_from_fork+0x2a/0x40
> ----------------------------------------
>
> I think that this GFP_NOIO allocation request needs to consume more memory reserves
> than GFP_NOFS allocation request to make progress.

AFAIU, this is an allocation path which doesn't block forward progress
of regular IO. It is merely a check for whether there is a new medium in
the CDROM (aka regular polling of the device). I really fail to see any
reason why this one should get any access to memory reserves at all.

I actually do not see any reason why it should be NOIO in the first
place, but I am not very familiar with this code, so there might be some
reason for that. The fact that it might stall under heavy memory
pressure is sad, but who actually cares?

> Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request
> in order to allow access to memory reserves as well as GFP_NOFS |
> __GFP_NOFAIL allocation request?

Why?

--
Michal Hocko
SUSE Labs

2016-12-19 13:45:42

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Sat 17-12-16 22:06:47, Nils Holland wrote:
[...]
> Unfortunately, the reclaim trace messages stopped a while after the first
> OOM messages show up - most likely my "cat" had been killed at that
> point or became unresponsive. :-/

The latter is more probable because I do not see the OOM killer killing
any cat process, and the first bash was killed 10s after the first
OOM.

2016-12-17 21:36:56 192.168.17.23:6665 [ 1276.828639] Killed process 3894 (xz) total-vm:68640kB, anon-rss:65920kB, file-rss:1696kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1277.598271] Killed process 3864 (sandbox) total-vm:2192kB, anon-rss:128kB, file-rss:1400kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1278.222416] Killed process 3086 (emerge) total-vm:65064kB, anon-rss:52768kB, file-rss:7216kB, shmem-rss:0kB
2016-12-17 21:36:58 192.168.17.23:6665 [ 1278.846902] Killed process 2705 (NetworkManager) total-vm:104376kB, anon-rss:4172kB, file-rss:10516kB, shmem-rss:0kB
2016-12-17 21:36:59 192.168.17.23:6665 [ 1279.862150] Killed process 2823 (polkitd) total-vm:65536kB, anon-rss:2192kB, file-rss:8656kB, shmem-rss:0kB
2016-12-17 21:37:00 192.168.17.23:6665 [ 1280.496988] Killed process 3885 (ebuild.sh) total-vm:10640kB, anon-rss:3340kB, file-rss:2244kB, shmem-rss:0kB
2016-12-17 21:37:04 192.168.17.23:6665 [ 1285.126052] Killed process 2824 (wpa_supplicant) total-vm:8580kB, anon-rss:540kB, file-rss:5092kB, shmem-rss:0kB
2016-12-17 21:37:05 192.168.17.23:6665 [ 1286.124687] Killed process 2943 (bash) total-vm:7320kB, anon-rss:368kB, file-rss:3240kB, shmem-rss:0kB
2016-12-17 21:37:07 192.168.17.23:6665 [ 1287.974353] Killed process 2878 (sshd) total-vm:10524kB, anon-rss:700kB, file-rss:4908kB, shmem-rss:4kB
2016-12-17 21:37:16 192.168.17.23:6665 [ 1296.953350] Killed process 4048 (ebuild.sh) total-vm:10640kB, anon-rss:3352kB, file-rss:1892kB, shmem-rss:0kB
2016-12-17 21:37:24 192.168.17.23:6665 [ 1304.398944] Killed process 1980 (systemd-journal) total-vm:24640kB, anon-rss:332kB, file-rss:4608kB, shmem-rss:4kB
2016-12-17 21:37:25 192.168.17.23:6665 [ 1305.934472] Killed process 2918 ((sd-pam)) total-vm:9152kB, anon-rss:964kB, file-rss:1536kB, shmem-rss:0kB
2016-12-17 21:37:28 192.168.17.23:6665 [ 1308.878775] Killed process 2888 (systemd) total-vm:7856kB, anon-rss:528kB, file-rss:4388kB, shmem-rss:0kB
2016-12-17 21:37:34 192.168.17.23:6665 [ 1314.268177] Killed process 2711 (rsyslogd) total-vm:25200kB, anon-rss:1084kB, file-rss:2908kB, shmem-rss:0kB
2016-12-17 21:37:39 192.168.17.23:6665 [ 1319.634561] Killed process 2704 (systemd-logind) total-vm:5980kB, anon-rss:340kB, file-rss:3568kB, shmem-rss:0kB
2016-12-17 21:37:43 192.168.17.23:6665 [ 1323.488894] Killed process 3103 (htop) total-vm:7532kB, anon-rss:1024kB, file-rss:2872kB, shmem-rss:0kB
2016-12-17 21:38:42 192.168.17.23:6665 [ 1379.556282] Killed process 2701 (systemd-timesyn) total-vm:15480kB, anon-rss:356kB, file-rss:3292kB, shmem-rss:0kB
2016-12-17 21:39:05 192.168.17.23:6665 [ 1403.130435] Killed process 3082 (bash) total-vm:7324kB, anon-rss:380kB, file-rss:3324kB, shmem-rss:0kB
2016-12-17 21:39:17 192.168.17.23:6665 [ 1417.600367] Killed process 3077 (start_trace) total-vm:6948kB, anon-rss:184kB, file-rss:2524kB, shmem-rss:0kB
2016-12-17 21:39:24 192.168.17.23:6665 [ 1423.955452] Killed process 3073 (bash) total-vm:7324kB, anon-rss:380kB, file-rss:3284kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1425.338670] Killed process 3099 (bash) total-vm:7324kB, anon-rss:376kB, file-rss:3176kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800677] Killed process 3070 (screen) total-vm:7440kB, anon-rss:960kB, file-rss:2360kB, shmem-rss:0kB

> In the end, the machine didn't completely panic, but after nothing new
> showed up being logged via the network, I walked up to the
> machine and found it in a state where I couldn't really log in to it
> anymore, but all that worked was, as always, a magic SysRequest reboot.
>
> The complete log, from machine boot right up to the point where it
> wouldn't really do anything anymore, is up again on my web server (~42
> MB, 928 KB packed):
>
> http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz

$ xzgrep invoked teela_2016-12-17.log.xz | sed 's@.*gfp_mask=0x[0-9a-f]*(\(.*\)), .*@\1@' | sort | uniq -c
2 GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK
1 GFP_KERNEL|__GFP_NOTRACK
6 GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
1 GFP_KERNEL|__GFP_NOWARN|__GFP_REPEAT|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
2 GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK
2 GFP_TEMPORARY
5 GFP_TEMPORARY|__GFP_NOTRACK
3 GFP_USER|__GFP_COLD

so all of them are lowmem requests which is in line with your previous
report. This basically means that only zone Normal is usable as I've
already mentioned before. In general lowmem problems are inherent to the
32b kernels but in this case we still have a _lot of_ page cache to
reclaim so we shouldn't really blow up.

Normal free:41260kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532676kB inactive_file:100kB unevictable:0kB writepending:124kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:157428kB slab_unreclaimable:68940kB kernel_stack:1160kB pagetables:1336kB bounce:0kB free_pcp:484kB local_pcp:240kB free_cma:0kB

and this looks very similar to your previous report as well. No
anonymous pages and the whole file LRU sitting in the active list, so
there is nothing immediately reclaimable. This is very weird because
we should rotate the active list to the inactive one if the latter is low,
which it obviously is here, and the same seems to be the case at other
points in the log as well (inactive_is_low.sh is a simple and dirty script to
subtract the Highmem active/inactive counters from the node-wide ones).
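Roughly, the script boils down to something like the following (a sketch, not
the original; it expects the node-wide and HighMem meminfo lines from the log
on stdin, all values in kB):

awk '
/Node [0-9]+ active_anon:/ {
	for (i = 1; i <= NF; i++) {
		if ($i ~ /^active_file:/)   { gsub(/[^0-9]/, "", $i); ta = $i }
		if ($i ~ /^inactive_file:/) { gsub(/[^0-9]/, "", $i); ti = $i }
	}
}
/HighMem active_anon:/ {
	for (i = 1; i <= NF; i++) {
		if ($i ~ /^active_file:/)   { gsub(/[^0-9]/, "", $i); ha = $i }
		if ($i ~ /^inactive_file:/) { gsub(/[^0-9]/, "", $i); hi = $i }
	}
	# subtract the HighMem part and apply the (here 1:1) inactive ratio
	act = ta - ha; inact = ti - hi; ratio = 1
	printf "total_active %d active %d total_inactive %d inactive %d ratio %d low %d\n",
	       ta, act, ti, inact, ratio, (inact * ratio < act) ? 1 : 0
}'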

$ xzgrep -f zones teela_2016-12-17.log.xz | sh inactive_is_low.sh
total_active 1094600 active 541424 total_inactive 1117512 inactive 104 ratio 1 low 1
total_active 1094744 active 541568 total_inactive 1117524 inactive 116 ratio 1 low 1
total_active 1094864 active 541564 total_inactive 1117512 inactive 108 ratio 1 low 1
total_active 1095188 active 541564 total_inactive 1117220 inactive 116 ratio 1 low 1
total_active 1097520 active 541596 total_inactive 1115048 inactive 120 ratio 1 low 1
total_active 1097836 active 541612 total_inactive 1114764 inactive 136 ratio 1 low 1
total_active 1098692 active 542384 total_inactive 1114688 inactive 100 ratio 1 low 1
total_active 1098964 active 542504 total_inactive 1114480 inactive 24 ratio 1 low 1
total_active 1099108 active 542620 total_inactive 1114544 inactive 92 ratio 1 low 1
total_active 1099180 active 542548 total_inactive 1114564 inactive 236 ratio 1 low 1
[...]

Unfortunately shrink_active_list doesn't have any tracepoint so we do
not know whether we managed to rotate those pages. If they are referenced
quickly enough we might just keep refaulting them... Could you try to apply
the following diff on top of what you have currently. It should add some more
tracepoint data which might tell us more. We can reduce the amount of
tracing data by enabling only mm_vmscan_lru_isolate,
mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
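
For enabling just those three events, something along these lines should do
(assuming debugfs/tracefs is mounted at the usual place):

# cd /sys/kernel/debug/tracing
# echo 0 > events/enable
# echo 1 > events/vmscan/mm_vmscan_lru_isolate/enable
# echo 1 > events/vmscan/mm_vmscan_lru_shrink_inactive/enable
# echo 1 > events/vmscan/mm_vmscan_lru_shrink_active/enable
# cat trace_pipe > trace.log    # and ship it off the box, e.g. over the network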
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bfe53d95c25b..2ba3e6dea6ef 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -519,7 +519,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, bool cold);
-extern void free_hot_cold_page_list(struct list_head *list, bool cold);
+extern int free_hot_cold_page_list(struct list_head *list, bool cold);

struct page_frag_cache;
extern void __page_frag_drain(struct page *page, unsigned int order,
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c88fd0934e7e..7966915cf663 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -365,14 +365,27 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,

TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
+ unsigned long nr_dirty, unsigned long nr_writeback,
+ unsigned long nr_congested, unsigned long nr_immediate,
+ unsigned long nr_activate, unsigned long nr_ref_keep,
+ unsigned long nr_unmap_fail,
int priority, int file),

- TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
+ TP_ARGS(nid, nr_scanned, nr_reclaimed, nr_dirty, nr_writeback,
+ nr_congested, nr_immediate, nr_activate, nr_ref_keep,
+ nr_unmap_fail, priority, file),

TP_STRUCT__entry(
__field(int, nid)
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_reclaimed)
+ __field(unsigned long, nr_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_congested)
+ __field(unsigned long, nr_immediate)
+ __field(unsigned long, nr_activate)
+ __field(unsigned long, nr_ref_keep)
+ __field(unsigned long, nr_unmap_fail)
__field(int, priority)
__field(int, reclaim_flags)
),
@@ -381,17 +394,63 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
__entry->nid = nid;
__entry->nr_scanned = nr_scanned;
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->nr_dirty = nr_dirty;
+ __entry->nr_writeback = nr_writeback;
+ __entry->nr_congested = nr_congested;
+ __entry->nr_immediate = nr_immediate;
+ __entry->nr_activate = nr_activate;
+ __entry->nr_ref_keep = nr_ref_keep;
__entry->priority = priority;
__entry->reclaim_flags = trace_shrink_flags(file);
),

- TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+ TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate=%ld nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
__entry->nid,
__entry->nr_scanned, __entry->nr_reclaimed,
- __entry->priority,
+ __entry->nr_dirty, __entry->nr_writeback,
+ __entry->nr_congested, __entry->nr_immediate,
+ __entry->nr_activate, __entry->nr_ref_keep,
+ __entry->nr_unmap_fail, __entry->priority,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_vmscan_lru_shrink_active,
+
+ TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
+ unsigned long nr_unevictable, unsigned long nr_deactivated,
+ unsigned long nr_rotated, int priority, int file),
+
+ TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_freed)
+ __field(unsigned long, nr_unevictable)
+ __field(unsigned long, nr_deactivated)
+ __field(unsigned long, nr_rotated)
+ __field(int, priority)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_freed = nr_freed;
+ __entry->nr_unevictable = nr_unevictable;
+ __entry->nr_deactivated = nr_deactivated;
+ __entry->nr_rotated = nr_rotated;
+ __entry->priority = priority;
+ __entry->reclaim_flags = trace_shrink_flags(file);
+ ),
+
+ TP_printk("nid=%d nr_scanned=%ld nr_freed=%ld nr_unevictable=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
+ __entry->nid,
+ __entry->nr_scanned, __entry->nr_freed, __entry->nr_unevictable,
+ __entry->nr_deactivated, __entry->nr_rotated,
+ __entry->priority,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e701be6b930a..a8a103a5f7f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2490,14 +2490,18 @@ void free_hot_cold_page(struct page *page, bool cold)
/*
* Free a list of 0-order pages
*/
-void free_hot_cold_page_list(struct list_head *list, bool cold)
+int free_hot_cold_page_list(struct list_head *list, bool cold)
{
struct page *page, *next;
+ int ret = 0;

list_for_each_entry_safe(page, next, list, lru) {
trace_mm_page_free_batched(page, cold);
free_hot_cold_page(page, cold);
+ ret++;
}
+
+ return ret;
}

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ea6b610f20e..4d7febde9e72 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -902,6 +902,17 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

+struct reclaim_stat {
+ unsigned nr_dirty;
+ unsigned nr_unqueued_dirty;
+ unsigned nr_congested;
+ unsigned nr_writeback;
+ unsigned nr_immediate;
+ unsigned nr_activate;
+ unsigned nr_ref_keep;
+ unsigned nr_unmap_fail;
+};
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -909,22 +920,21 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct pglist_data *pgdat,
struct scan_control *sc,
enum ttu_flags ttu_flags,
- unsigned long *ret_nr_dirty,
- unsigned long *ret_nr_unqueued_dirty,
- unsigned long *ret_nr_congested,
- unsigned long *ret_nr_writeback,
- unsigned long *ret_nr_immediate,
+ struct reclaim_stat *stat,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_reclaimed = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ unsigned nr_unqueued_dirty = 0;
+ unsigned nr_dirty = 0;
+ unsigned nr_congested = 0;
+ unsigned nr_reclaimed = 0;
+ unsigned nr_writeback = 0;
+ unsigned nr_immediate = 0;
+ unsigned nr_activate = 0;
+ unsigned nr_ref_keep = 0;
+ unsigned nr_unmap_fail = 0;

cond_resched();

@@ -1063,6 +1073,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case PAGEREF_ACTIVATE:
goto activate_locked;
case PAGEREF_KEEP:
+ nr_ref_keep++;
goto keep_locked;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
@@ -1100,6 +1111,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
(ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
+ nr_unmap_fail++;
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
@@ -1252,6 +1264,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
VM_BUG_ON_PAGE(PageActive(page), page);
SetPageActive(page);
pgactivate++;
+ nr_activate++;
keep_locked:
unlock_page(page);
keep:
@@ -1266,11 +1279,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

- *ret_nr_dirty += nr_dirty;
- *ret_nr_congested += nr_congested;
- *ret_nr_unqueued_dirty += nr_unqueued_dirty;
- *ret_nr_writeback += nr_writeback;
- *ret_nr_immediate += nr_immediate;
+ if (stat) {
+ stat->nr_dirty = nr_dirty;
+ stat->nr_congested = nr_congested;
+ stat->nr_unqueued_dirty = nr_unqueued_dirty;
+ stat->nr_writeback = nr_writeback;
+ stat->nr_immediate = nr_immediate;
+ stat->nr_activate = nr_activate;
+ stat->nr_ref_keep = nr_ref_keep;
+ stat->nr_unmap_fail = nr_unmap_fail;
+ }
return nr_reclaimed;
}

@@ -1282,7 +1300,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
.priority = DEF_PRIORITY,
.may_unmap = 1,
};
- unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
+ unsigned long ret;
struct page *page, *next;
LIST_HEAD(clean_pages);

@@ -1295,8 +1313,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
}

ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
- TTU_UNMAP|TTU_IGNORE_ACCESS,
- &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+ TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
list_splice(&clean_pages, page_list);
mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
@@ -1696,11 +1713,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ struct reclaim_stat stat = {};
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1745,9 +1758,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;

nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
- &nr_dirty, &nr_unqueued_dirty, &nr_congested,
- &nr_writeback, &nr_immediate,
- false);
+ &stat, false);

spin_lock_irq(&pgdat->lru_lock);

@@ -1781,7 +1792,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
- if (nr_writeback && nr_writeback == nr_taken)
+ if (stat.nr_writeback && stat.nr_writeback == nr_taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
@@ -1793,7 +1804,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
- if (nr_dirty && nr_dirty == nr_congested)
+ if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);

/*
@@ -1802,7 +1813,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* the pgdat PGDAT_DIRTY and kswapd will start writing pages from
* reclaim context.
*/
- if (nr_unqueued_dirty == nr_taken)
+ if (stat.nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
@@ -1811,7 +1822,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_immediate && current_may_throttle())
+ if (stat.nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

@@ -1826,6 +1837,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
+ stat.nr_dirty, stat.nr_writeback,
+ stat.nr_congested, stat.nr_immediate,
+ stat.nr_activate, stat.nr_ref_keep, stat.nr_unmap_fail,
sc->priority, file);
return nr_reclaimed;
}
@@ -1846,9 +1860,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
+ *
+ * Returns the number of pages moved to the given lru.
*/

-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static int move_active_pages_to_lru(struct lruvec *lruvec,
struct list_head *list,
struct list_head *pages_to_free,
enum lru_list lru)
@@ -1857,6 +1873,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;
+ int nr_moved = 0;

while (!list_empty(list)) {
page = lru_to_page(list);
@@ -1882,11 +1899,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, pages_to_free);
+ } else {
+ nr_moved++;
}
}

if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
+
+ return nr_moved;
}

static void shrink_active_list(unsigned long nr_to_scan,
@@ -1902,7 +1923,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_inactive);
struct page *page;
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- unsigned long nr_rotated = 0;
+ unsigned long nr_rotated = 0, nr_unevictable = 0;
+ unsigned long nr_freed, nr_deactivate, nr_activate;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1935,6 +1957,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

if (unlikely(!page_evictable(page))) {
putback_lru_page(page);
+ nr_unevictable++;
continue;
}

@@ -1980,13 +2003,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
*/
reclaim_stat->recent_rotated[file] += nr_rotated;

- move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
- move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
+ nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
+ nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
- free_hot_cold_page_list(&l_hold, true);
+ nr_freed = free_hot_cold_page_list(&l_hold, true);
+ trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_freed,
+ nr_unevictable, nr_deactivate, nr_rotated,
+ sc->priority, file);
}

/*
--
Michal Hocko
SUSE Labs

2016-12-20 02:08:41

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:

> Unfortunately shrink_active_list doesn't have any tracepoint so we do
> not know whether we managed to rotate those pages. If they are referenced
> quickly enough we might just keep refaulting them... Could you try to apply
> the following diff on top of what you have currently. It should add some more
> tracepoint data which might tell us more. We can reduce the amount of
> tracing data by enabling only mm_vmscan_lru_isolate,
> mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.

So, the results are in! I applied your patch and rebuilt the kernel,
then I rebooted the machine, set up tracing so that only the three
events you mentioned were being traced, and captured the output over
the network.

Things went a bit differently this time: the trace events started to
appear after a while and a whole lot of them were generated, but
suddenly they stopped. A short while later, we get

[ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)

along with a backtrace and memory information, and then there was
silence. When I walked up to the machine, it had completely died; it
wouldn't turn on its screen on key press any more, blindly trying to
reboot via SysRequest had no effect, but the caps lock LED also wasn't
blinking, like it normally does when a kernel panic occurs. Good
question what state it was in. The OOM reaper didn't really seem to
kick in and kill processes this time.

The complete capture is up at:

http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

Greetings
Nils

2016-12-21 07:37:07

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

TL;DR
there is another version of the debugging patch. Just revert the
previous one and apply this one instead. It's still not clear what
is going on but I suspect either some misaccounting or unexpected
pages on the LRU lists. I have added one more tracepoint, so please
also enable mm_vmscan_inactive_list_is_low.

Hopefully the additional data will tell us more.

On Tue 20-12-16 03:08:29, Nils Holland wrote:
> On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
>
> > Unfortunately shrink_active_list doesn't have any tracepoint so we do
> > not know whether we managed to rotate those pages. If they are referenced
> > quickly enough we might just keep refaulting them... Could you try to apply
> > the following diff on top of what you have currently. It should add some more
> > tracepoint data which might tell us more. We can reduce the amount of
> > tracing data by enabling only mm_vmscan_lru_isolate,
> > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
>
> So, the results are in! I applied your patch and rebuild the kernel,
> then I rebooted the machine, set up tracing so that only the three
> events you mentioned were being traced, and captured the output over
> the network.
>
> Things went a bit different this time: The trace events started to
> appear after a while and a whole lot of them were generated, but
> suddenly they stopped. A short while later, we get

It is possible that you are hitting multiple issues so it would be
great to focus on one at a time. The underlying problem might be
the same or similar in the end but this is hard to tell now. Could you try to
reproduce and provide data for the OOM killer situation as well?

> [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
>
> along with a backtrace and memory information, and then there was
> silence.

> When I walked up to the machine, it had completely died; it
> wouldn't turn on its screen on key press any more, blindly trying to
> reboot via SysRequest had no effect, but the caps lock LED also wasn't
> blinking, like it normally does when a kernel panic occurs. Good
> question what state it was in. The OOM reaper didn't really seem to
> kick in and kill processes this time, it seems.
>
> The complete capture is up at:
>
> http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

This is the stall report:
[ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
[ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 4.9.0-gentoo #4

pid 1950 is trying to allocate for a _long_ time. Considering that this
is the only stall report, it means that reclaim took so long that we did not
get back to the page allocator for all that time. It sounds really crazy!

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_shrink_inactive | sed 's@.*nr_reclaimed=\([0-9\]*\).*@\1@' | sort | uniq -c
509 0
1 1
1 10
5 11
1 12
1 14
1 16
2 19
5 2
1 22
2 23
1 25
3 28
2 3
1 4
4 5

It barely managed to reclaim anything, even though it tried a lot. It
had a hard time actually isolating anything:

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_isolate: | sed 's@.*nr_taken=@@' | sort | uniq -c
8284 0 file=1
8 11 file=1
4 14 file=1
1 1 file=1
7 23 file=1
1 25 file=1
9 2 file=1
501 32 file=1
1 3 file=1
7 5 file=1
1 6 file=1

a typical mm_vmscan_lru_isolate looks as follows

btrfs-transacti-1950 [001] d... 1368.508008: mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=266727 nr_taken=0 file=1

so the whole inactive lru has been scanned it seems. But we couldn't
isolate a single page. There are two possibilities here. Either we skip
them all because they are from the highmem zone or we fail to
__isolate_lru_page them. Counters will not tell us because nr_scanned
includes skipped pages. I have updated the debugging patch to make this
distinction. I suspect we are skipping all of them...
The latter option would be really surprising because the only way to fail
__isolate_lru_page with the 0 isolate_mode is if get_page_unless_zero(page)
fails which would mean we would have pages with 0 reference count on the
LRU list.
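
Once there is a new capture with the updated patch applied, the skipped vs.
taken split should fall out of a one-liner in the style of the ones above
(pid and log name below are just placeholders):

$ xzgrep -w <pid> <log>.xz | grep mm_vmscan_lru_isolate: | sed 's@.*nr_skipped=\([0-9]*\).*@\1@' | sort | uniq -c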

The stall message is from a later time so the situation might have
changed but
[ 1661.490170] Node 0 active_anon:139296kB inactive_anon:432kB active_file:1088996kB inactive_file:1114524kB
[ 1661.490745] DMA active_anon:0kB inactive_anon:0kB active_file:9540kB inactive_file:0kB
[ 1661.491528] Normal active_anon:0kB inactive_anon:0kB active_file:530560kB inactive_file:452kB
[ 1661.513077] HighMem active_anon:139296kB inactive_anon:432kB active_file:548896kB inactive_file:1114068kB

suggests our inactive file LRU is low:
file total_active 1088996 active 540100 total_inactive 1114524 inactive 456 ratio 1 low 1

and we should be rotating active pages. But

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_shrink_active
$

Now inactive_list_is_low is racy, but I doubt we can consistently see it
racing and giving us a wrong answer. I also do not see how it would miss lowmem
zones that are imbalanced but hidden by highmem zones (assuming those counters
are OK).

That being said, numbers do not make much sense to me, to be honest.
Could you try with the updated tracing patch please?
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..61aa9b49e86d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -503,7 +503,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, bool cold);
-extern void free_hot_cold_page_list(struct list_head *list, bool cold);
+extern int free_hot_cold_page_list(struct list_head *list, bool cold);

struct page_frag_cache;
extern void __page_frag_drain(struct page *page, unsigned int order,
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c88fd0934e7e..cbd2fff521f0 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -275,20 +275,22 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
int order,
unsigned long nr_requested,
unsigned long nr_scanned,
+ unsigned long nr_skipped,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ int lru),

- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+ TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),

TP_STRUCT__entry(
__field(int, classzone_idx)
__field(int, order)
__field(unsigned long, nr_requested)
__field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_skipped)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
- __field(int, file)
+ __field(int, lru)
),

TP_fast_assign(
@@ -296,19 +298,21 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->order = order;
__entry->nr_requested = nr_requested;
__entry->nr_scanned = nr_scanned;
+ __entry->nr_skipped = nr_skipped;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
- __entry->file = file;
+ __entry->lru = lru;
),

- TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%d",
__entry->isolate_mode,
__entry->classzone_idx,
__entry->order,
__entry->nr_requested,
__entry->nr_scanned,
+ __entry->nr_skipped,
__entry->nr_taken,
- __entry->file)
+ __entry->lru)
);

DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
@@ -317,11 +321,12 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
int order,
unsigned long nr_requested,
unsigned long nr_scanned,
+ unsigned long nr_skipped,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ int lru),

- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru)

);

@@ -331,11 +336,12 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
int order,
unsigned long nr_requested,
unsigned long nr_scanned,
+ unsigned long nr_skipped,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ int lru),

- TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru)

);

@@ -365,14 +371,27 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,

TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
+ unsigned long nr_dirty, unsigned long nr_writeback,
+ unsigned long nr_congested, unsigned long nr_immediate,
+ unsigned long nr_activate, unsigned long nr_ref_keep,
+ unsigned long nr_unmap_fail,
int priority, int file),

- TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
+ TP_ARGS(nid, nr_scanned, nr_reclaimed, nr_dirty, nr_writeback,
+ nr_congested, nr_immediate, nr_activate, nr_ref_keep,
+ nr_unmap_fail, priority, file),

TP_STRUCT__entry(
__field(int, nid)
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_reclaimed)
+ __field(unsigned long, nr_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_congested)
+ __field(unsigned long, nr_immediate)
+ __field(unsigned long, nr_activate)
+ __field(unsigned long, nr_ref_keep)
+ __field(unsigned long, nr_unmap_fail)
__field(int, priority)
__field(int, reclaim_flags)
),
@@ -381,17 +400,100 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
__entry->nid = nid;
__entry->nr_scanned = nr_scanned;
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->nr_dirty = nr_dirty;
+ __entry->nr_writeback = nr_writeback;
+ __entry->nr_congested = nr_congested;
+ __entry->nr_immediate = nr_immediate;
+ __entry->nr_activate = nr_activate;
+ __entry->nr_ref_keep = nr_ref_keep;
+ __entry->nr_unmap_fail = nr_unmap_fail;
__entry->priority = priority;
__entry->reclaim_flags = trace_shrink_flags(file);
),

- TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+ TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate=%ld nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
__entry->nid,
__entry->nr_scanned, __entry->nr_reclaimed,
+ __entry->nr_dirty, __entry->nr_writeback,
+ __entry->nr_congested, __entry->nr_immediate,
+ __entry->nr_activate, __entry->nr_ref_keep,
+ __entry->nr_unmap_fail, __entry->priority,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
+
+TRACE_EVENT(mm_vmscan_lru_shrink_active,
+
+ TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_freed,
+ unsigned long nr_unevictable, unsigned long nr_deactivated,
+ unsigned long nr_rotated, int priority, int file),
+
+ TP_ARGS(nid, nr_scanned, nr_freed, nr_unevictable, nr_deactivated, nr_rotated, priority, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_freed)
+ __field(unsigned long, nr_unevictable)
+ __field(unsigned long, nr_deactivated)
+ __field(unsigned long, nr_rotated)
+ __field(int, priority)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_freed = nr_freed;
+ __entry->nr_unevictable = nr_unevictable;
+ __entry->nr_deactivated = nr_deactivated;
+ __entry->nr_rotated = nr_rotated;
+ __entry->priority = priority;
+ __entry->reclaim_flags = trace_shrink_flags(file);
+ ),
+
+ TP_printk("nid=%d nr_scanned=%ld nr_freed=%ld nr_unevictable=%ld nr_deactivated=%ld nr_rotated=%ld priority=%d flags=%s",
+ __entry->nid,
+ __entry->nr_scanned, __entry->nr_freed, __entry->nr_unevictable,
+ __entry->nr_deactivated, __entry->nr_rotated,
__entry->priority,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_vmscan_inactive_list_is_low,
+
+ TP_PROTO(int nid, unsigned long total_inactive, unsigned long inactive,
+ unsigned long total_active, unsigned long active,
+ unsigned long ratio, int file),
+
+ TP_ARGS(nid, total_inactive, inactive, total_active, active, ratio, file),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(unsigned long, total_inactive)
+ __field(unsigned long, inactive)
+ __field(unsigned long, total_active)
+ __field(unsigned long, active)
+ __field(unsigned long, ratio)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->total_inactive = total_inactive;
+ __entry->inactive = inactive;
+ __entry->total_active = total_active;
+ __entry->active = active;
+ __entry->ratio = ratio;
+ __entry->reclaim_flags = trace_shrink_flags(file);
+ ),
+
+ TP_printk("nid=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
+ __entry->nid,
+ __entry->total_inactive, __entry->inactive,
+ __entry->total_active, __entry->active,
+ __entry->ratio,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c24112308d6..77d204660857 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2487,14 +2487,18 @@ void free_hot_cold_page(struct page *page, bool cold)
/*
* Free a list of 0-order pages
*/
-void free_hot_cold_page_list(struct list_head *list, bool cold)
+int free_hot_cold_page_list(struct list_head *list, bool cold)
{
struct page *page, *next;
+ int ret = 0;

list_for_each_entry_safe(page, next, list, lru) {
trace_mm_page_free_batched(page, cold);
free_hot_cold_page(page, cold);
+ ret++;
}
+
+ return ret;
}

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4abf08861d2..0c4707571762 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -902,6 +902,17 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}

+struct reclaim_stat {
+ unsigned nr_dirty;
+ unsigned nr_unqueued_dirty;
+ unsigned nr_congested;
+ unsigned nr_writeback;
+ unsigned nr_immediate;
+ unsigned nr_activate;
+ unsigned nr_ref_keep;
+ unsigned nr_unmap_fail;
+};
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -909,22 +920,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct pglist_data *pgdat,
struct scan_control *sc,
enum ttu_flags ttu_flags,
- unsigned long *ret_nr_dirty,
- unsigned long *ret_nr_unqueued_dirty,
- unsigned long *ret_nr_congested,
- unsigned long *ret_nr_writeback,
- unsigned long *ret_nr_immediate,
+ struct reclaim_stat *stat,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_reclaimed = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ unsigned nr_unqueued_dirty = 0;
+ unsigned nr_dirty = 0;
+ unsigned nr_congested = 0;
+ unsigned nr_reclaimed = 0;
+ unsigned nr_writeback = 0;
+ unsigned nr_immediate = 0;
+ unsigned nr_ref_keep = 0;
+ unsigned nr_unmap_fail = 0;

cond_resched();

@@ -1063,6 +1072,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case PAGEREF_ACTIVATE:
goto activate_locked;
case PAGEREF_KEEP:
+ nr_ref_keep++;
goto keep_locked;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
@@ -1100,6 +1110,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
(ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
+ nr_unmap_fail++;
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
@@ -1266,11 +1277,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

- *ret_nr_dirty += nr_dirty;
- *ret_nr_congested += nr_congested;
- *ret_nr_unqueued_dirty += nr_unqueued_dirty;
- *ret_nr_writeback += nr_writeback;
- *ret_nr_immediate += nr_immediate;
+ if (stat) {
+ stat->nr_dirty = nr_dirty;
+ stat->nr_congested = nr_congested;
+ stat->nr_unqueued_dirty = nr_unqueued_dirty;
+ stat->nr_writeback = nr_writeback;
+ stat->nr_immediate = nr_immediate;
+ stat->nr_activate = pgactivate;
+ stat->nr_ref_keep = nr_ref_keep;
+ stat->nr_unmap_fail = nr_unmap_fail;
+ }
return nr_reclaimed;
}

@@ -1282,7 +1298,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
.priority = DEF_PRIORITY,
.may_unmap = 1,
};
- unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
+ unsigned long ret;
struct page *page, *next;
LIST_HEAD(clean_pages);

@@ -1295,8 +1311,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
}

ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
- TTU_UNMAP|TTU_IGNORE_ACCESS,
- &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+ TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
list_splice(&clean_pages, page_list);
mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
@@ -1428,6 +1443,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long nr_taken = 0;
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
+ unsigned long skipped = 0, total_skipped = 0;
unsigned long scan, nr_pages;
LIST_HEAD(pages_skipped);

@@ -1479,14 +1495,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*/
if (!list_empty(&pages_skipped)) {
int zid;
- unsigned long total_skipped = 0;

for (zid = 0; zid < MAX_NR_ZONES; zid++) {
if (!nr_skipped[zid])
continue;

__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
- total_skipped += nr_skipped[zid];
+ skipped += nr_skipped[zid];
}

/*
@@ -1494,13 +1509,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* close to unreclaimable. If the LRU list is empty, account
* skipped pages as a full scan.
*/
- scan += list_empty(src) ? total_skipped : total_skipped >> 2;
+ total_skipped = list_empty(src) ? skipped : skipped >> 2;

list_splice(&pages_skipped, src);
}
- *nr_scanned = scan;
+ *nr_scanned = scan + total_skipped;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
- nr_taken, mode, is_file_lru(lru));
+ skipped, nr_taken, mode, is_file_lru(lru));
update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
return nr_taken;
}
@@ -1696,11 +1711,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
- unsigned long nr_dirty = 0;
- unsigned long nr_congested = 0;
- unsigned long nr_unqueued_dirty = 0;
- unsigned long nr_writeback = 0;
- unsigned long nr_immediate = 0;
+ struct reclaim_stat stat = {};
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1745,9 +1756,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;

nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
- &nr_dirty, &nr_unqueued_dirty, &nr_congested,
- &nr_writeback, &nr_immediate,
- false);
+ &stat, false);

spin_lock_irq(&pgdat->lru_lock);

@@ -1781,7 +1790,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
- if (nr_writeback && nr_writeback == nr_taken)
+ if (stat.nr_writeback && stat.nr_writeback == nr_taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
@@ -1793,7 +1802,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
- if (nr_dirty && nr_dirty == nr_congested)
+ if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);

/*
@@ -1802,7 +1811,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* the pgdat PGDAT_DIRTY and kswapd will start writing pages from
* reclaim context.
*/
- if (nr_unqueued_dirty == nr_taken)
+ if (stat.nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
@@ -1811,7 +1820,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_immediate && current_may_throttle())
+ if (stat.nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

@@ -1826,6 +1835,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
+ stat.nr_dirty, stat.nr_writeback,
+ stat.nr_congested, stat.nr_immediate,
+ stat.nr_activate, stat.nr_ref_keep, stat.nr_unmap_fail,
sc->priority, file);
return nr_reclaimed;
}
@@ -1846,9 +1858,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
+ *
+ * Returns the number of pages moved to the given lru.
*/

-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static int move_active_pages_to_lru(struct lruvec *lruvec,
struct list_head *list,
struct list_head *pages_to_free,
enum lru_list lru)
@@ -1857,6 +1871,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;
+ int nr_moved = 0;

while (!list_empty(list)) {
page = lru_to_page(list);
@@ -1882,11 +1897,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, pages_to_free);
+ } else {
+ nr_moved++;
}
}

if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
+
+ return nr_moved;
}

static void shrink_active_list(unsigned long nr_to_scan,
@@ -1902,7 +1921,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_inactive);
struct page *page;
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- unsigned long nr_rotated = 0;
+ unsigned long nr_rotated = 0, nr_unevictable = 0;
+ unsigned long nr_freed, nr_deactivate, nr_activate;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1935,6 +1955,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

if (unlikely(!page_evictable(page))) {
putback_lru_page(page);
+ nr_unevictable++;
continue;
}

@@ -1980,13 +2001,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
*/
reclaim_stat->recent_rotated[file] += nr_rotated;

- move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
- move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
+ nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
+ nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
- free_hot_cold_page_list(&l_hold, true);
+ nr_freed = free_hot_cold_page_list(&l_hold, true);
+ trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_freed,
+ nr_unevictable, nr_deactivate, nr_rotated,
+ sc->priority, file);
}

/*
@@ -2019,8 +2043,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
struct scan_control *sc)
{
unsigned long inactive_ratio;
- unsigned long inactive;
- unsigned long active;
+ unsigned long total_inactive, inactive;
+ unsigned long total_active, active;
unsigned long gb;
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int zid;
@@ -2032,8 +2056,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;

- inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
- active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
+ total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+ total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

/*
* For zone-constrained allocations, it is necessary to check if
@@ -2062,6 +2086,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
else
inactive_ratio = 1;

+ trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+ total_inactive, inactive,
+ total_active, active, inactive_ratio, file);
return inactive * inactive_ratio < active;
}

--
Michal Hocko
SUSE Labs

2016-12-21 11:01:11

by Tetsuo Handa

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpected
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.
>
> Hopefully the additional data will tell us more.
>
> On Tue 20-12-16 03:08:29, Nils Holland wrote:
> > On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
> >
> > > Unfortunately shrink_active_list doesn't have any tracepoint so we do
> > > not know whether we managed to rotate those pages. If they are referenced
> > > quickly enough we might just keep refaulting them... Could you try to apply
> > > the following diff on top of what you have currently. It should add some more
> > > tracepoint data which might tell us more. We can reduce the amount of
> > > tracing data by enabling only mm_vmscan_lru_isolate,
> > > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
> >
> > So, the results are in! I applied your patch and rebuild the kernel,
> > then I rebooted the machine, set up tracing so that only the three
> > events you mentioned were being traced, and captured the output over
> > the network.
> >
> > Things went a bit different this time: The trace events started to
> > appear after a while and a whole lot of them were generated, but
> > suddenly they stopped. A short while later, we get

"cat /debug/trace/trace_pipe > /dev/udp/$ip/$port" stops reporting if
/bin/cat is disturbed by page fault and/or memory allocation needed for
sending UDP packets. Since netconsole can send UDP packets without involving
memory allocation, printk() is preferable than tracing under OOM.
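
For reference, pointing netconsole at the same log host is a single modprobe;
the interface, target port, IP and MAC below are placeholders:

# modprobe netconsole netconsole=@/eth0,6666@<loghost-ip>/<loghost-mac>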

>
> It is possible that you are hitting multiple issues so it would be
> great to focus at one at the time. The underlying problem might be
> same/similar in the end but this is hard to tell now. Could you try to
> reproduce and provide data for the OOM killer situation as well?
>
> > [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> >
> > along with a backtrace and memory information, and then there was
> > silence.
>
> > When I walked up to the machine, it had completely died; it
> > wouldn't turn on its screen on key press any more, blindly trying to
> > reboot via SysRequest had no effect, but the caps lock LED also wasn't
> > blinking, like it normally does when a kernel panic occurs. Good
> > question what state it was in. The OOM reaper didn't really seem to
> > kick in and kill processes this time, it seems.
> >
> > The complete capture is up at:
> >
> > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
>
> This is the stall report:
> [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 4.9.0-gentoo #4
>
> pid 1950 is trying to allocate for a _long_ time. Considering that this
> is the only stall report, this means that reclaim took really long so we
> didn't get to the page allocator for that long. It sounds really crazy!

warn_alloc() reports only if !__GFP_NOWARN.

We can report where they were looping using kmallocwd at
http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
(and extend it, e.g. via SystemTap, to printk() the values which your
trace hooks would report, but only while memory allocations are stalling, and
without the delays caused by page faults and/or memory allocations needed for
sending UDP packets).

But if trying to reboot via SysRq-b did not work, I think that the system
was in a hard lockup state. That would be a different problem.

By the way, Michal, it feels strange to me that your analysis does not refer
to the implications of running an "x86_32 kernel". Maybe
you were already referring to x86_32 with "they are from the highmem zone", though.

2016-12-21 11:17:03

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> >
> > Hopefully the additional data will tell us more.
> >
> > On Tue 20-12-16 03:08:29, Nils Holland wrote:
[...]
> > > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> >
> > This is the stall report:
> > [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 4.9.0-gentoo #4
> >
> > pid 1950 is trying to allocate for a _long_ time. Considering that this
> > is the only stall report, this means that reclaim took really long so we
> > didn't get to the page allocator for that long. It sounds really crazy!
>
> warn_alloc() reports only if !__GFP_NOWARN.

yes, and the above allocation clearly is a !__GFP_NOWARN allocation which is
reported after 611s! If there are no prior/lost warn_alloc() reports then it
implies we have spent _that_ much time in reclaim. Considering the
tracing data we cannot really rule that out. All the reclaimers would
fight over the lru_lock, and considering we are scanning the whole LRU,
this will take some time.

[...]

> By the way, Michal, I find it strange that your analysis does not refer to
> the implications of this being an x86_32 kernel. Maybe you already covered
> x86_32 with "they are from the highmem zone", though.

Yes, Highmem as well as all of those scanning anomalies are specific to the
32b kernel. I believe I have already mentioned that the 32b kernel suffers
from some inherent issues, but I would like to understand what is going on
here before blaming the 32b side of things.

One thing to note here: when we are talking about the 32b kernel, things
have changed in 4.8 when we moved from zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing, but it
might also be related to something else. So I am rather not trying to
blame 32b just yet...

--
Michal Hocko
SUSE Labs

2016-12-21 14:04:14

by Chris Mason

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:
>On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
>
>One thing to note here: when we are talking about the 32b kernel, things
>have changed in 4.8 when we moved from zone based to node based
>reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
>per-node basis") and associated patches). It is possible that the
>reporter is hitting some pathological path which needs fixing, but it
>might also be related to something else. So I am rather not trying to
>blame 32b just yet...

It might be interesting to put tracing on releasepage and see if btrfs
is pinning pages around. I can't see how 32bit kernels would be
different, but maybe we're hitting a weird corner.
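
One way to do that (just a sketch; it assumes btrfs_releasepage is visible
in kallsyms and that the ftrace kprobe interface is available) would be
something like:

# cd /sys/kernel/debug/tracing
# echo 'p:btrfs_rp btrfs_releasepage' > kprobe_events
# echo 'r:btrfs_rp_ret btrfs_releasepage $retval' >> kprobe_events
# echo 1 > events/kprobes/enable
# cat trace_pipe > /tmp/releasepage.log

A releasepage that keeps returning 0 while reclaim is struggling would be a
hint that btrfs really is pinning those pages.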

-chris

2016-12-22 10:10:36

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpeted
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.

Right, I did just that and can provide a new log. I was also able, in
this case, to reproduce the OOM issues again and not just the "page
allocation stalls" that were the only thing visible in the previous
log. However, the log comes from machine #2 again today, as I'm
unfortunately forced to try this via VPN from work to home today, so I
have exactly one attempt per machine before it goes down and locks up
(and I can only restart it later tonight). Machine #1 failed to
produce good-looking results during its one attempt, but what machine #2
produced seems to be exactly what we've been trying to track down, and so
its log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

Greetings
Nils

2016-12-22 10:27:31

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Thu 22-12-16 11:10:29, Nils Holland wrote:
> On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpeted
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
>
> Right, I did just that and can provide a new log. I was also able, in
> this case, to reproduce the OOM issues again and not just the "page
> allocation stalls" that were the only thing visible in the previous
> log.

Thanks a lot for testing! I will have a look later today.

> However, the log comes from machine #2 again today, as I'm
> unfortunately forced to try this via VPN from work to home today, so I
> have exactly one attempt per machine before it goes down and locks up
> (and I can only restart it later tonight).

This is really surprising to me. Are you sure that you have sysrq
configured properly? At least sysrq+b shouldn't depend on any memory
allocations and should allow you to reboot immediately. A sysrq+m right
before the reboot might turn out to be helpful as well.
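
For reference, checking and exercising this does not need anything special,
just the standard procfs knobs:

# cat /proc/sys/kernel/sysrq        # 1 means all sysrq functions are enabled
# echo 1 > /proc/sys/kernel/sysrq
# echo m > /proc/sysrq-trigger      # dump memory info to the kernel log
# echo b > /proc/sysrq-trigger      # immediate reboot, no sync/unmount

The last two also work over ssh as long as the shell still gets scheduled.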
--
Michal Hocko
SUSE Labs

2016-12-22 10:35:31

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Thu, Dec 22, 2016 at 11:27:25AM +0100, Michal Hocko wrote:
> On Thu 22-12-16 11:10:29, Nils Holland wrote:
>
> > However, the log comes from machine #2 again today, as I'm
> > unfortunately forced to try this via VPN from work to home today, so I
> > have exactly one attempt per machine before it goes down and locks up
> > (and I can only restart it later tonight).
>
> This is really surprising to me. Are you sure that you have sysrq
> configured properly? At least sysrq+b shouldn't depend on any memory
> allocations and should allow you to reboot immediately. A sysrq+m right
> before the reboot might turn out to be helpful as well.

Well, the issue is that I could only do everything via ssh today and
don't have any physical access to the machines. In fact, both seem to
have suffered a genuine kernel panic, which is also visible in the
last few lines of the log I provided today. So, basically, both
machines are now sitting at my home in panic state and I'll only be
able to resurrect them when I'm physically there again tonight. But
that was expected; I could have waited to run the test until I'm at
home, which would have made things easier, but I thought the sooner I can
provide a log for you to look at, the better. ;-)

Greetings
Nils

2016-12-22 10:46:36

by Tetsuo Handa

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

Nils Holland wrote:
> Well, the issue is that I could only do everything via ssh today and
> don't have any physical access to the machines. In fact, both seem to
> have suffered a genuine kernel panic, which is also visible in the
> last few lines of the log I provided today. So, basically, both
> machines are now sitting at my home in panic state and I'll only be
> able to resurrect them when I'm physically there again tonight.

# echo 10 > /proc/sys/kernel/panic
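
(That makes the kernel reboot on its own 10 seconds after a panic, so the
machines would not stay stuck until you are physically there. A sketch for
making it persistent -- the exact sysctl file name is just an example:

# echo 'kernel.panic = 10' >> /etc/sysctl.d/99-panic.conf
# sysctl --system

or pass panic=10 on the kernel command line.)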

2016-12-22 19:17:28

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

TL;DR I still do not see what is going on here and it still smells like
multiple issues. Please apply the patch below on _top_ of what you had.

On Thu 22-12-16 11:10:29, Nils Holland wrote:
[...]
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

It took me a while to realize that the tracepoint and printk messages are
not sorted by timestamp. Some massaging has fixed that:
$ xzcat boerne_2016-12-22.log.xz | sed -e 's@.*192.168.17.32:6665 \[[[:space:]]*\([0-9\.]\+\)\] @\1 @' -e 's@.*192.168.17.32:53062[[:space:]]*\([^[:space:]]\+\)[[:space:]].*[[:space:]]\([0-9\.]\+\):@\2 \1@' | sort -k1 -n -s

461.757468 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757501 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757504 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=11852 inactive=0 total_active=118195 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757508 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757535 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757537 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=11820 inactive=0 total_active=118195 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757543 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757584 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757588 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=11788 inactive=0 total_active=118195 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
[...]
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 inactive=0 total_active=120208 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 inactive=0 total_active=120208 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=89 inactive=0 total_active=1301 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722385 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722386 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 inactive=0 total_active=21 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=131 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 inactive=0 total_active=21 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=131 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722401 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450730 inactive=0 total_active=206026 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
484.144971 collect2 invoked oom-killer: gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, order=0, oom_score_adj=0
[...]
484.146871 Node 0 active_anon:100688kB inactive_anon:380kB active_file:1296560kB inactive_file:1848044kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:32180kB dirty:20896kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 40960kB anon_thp: 776kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
484.147097 DMA free:4004kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:8016kB inactive_file:12kB unevictable:0kB writepending:68kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:2652kB slab_unreclaimable:1224kB kernel_stack:8kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
484.147319 lowmem_reserve[]: 0 808 3849 3849
484.147387 Normal free:41016kB min:41100kB low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB active_file:464688kB inactive_file:48kB unevictable:0kB writepending:2684kB present:897016kB managed:831472kB mlocked:0kB slab_reclaimable:215812kB slab_unreclaimable:90092kB kernel_stack:1336kB pagetables:1436kB bounce:0kB free_pcp:372kB local_pcp:176kB free_cma:0kB
484.149971 lowmem_reserve[]: 0 0 24330 24330
484.152390 HighMem free:332648kB min:512kB low:39184kB high:77856kB active_anon:100688kB inactive_anon:380kB active_file:823856kB inactive_file:1847984kB unevictable:0kB writepending:18144kB present:3114256kB managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:836kB local_pcp:156kB free_cma:0kB

Unfortunately the LOST EVENTS markers are not logged with a timestamp, but
there are many lost events between 10:55:31-33, which corresponds to the
above range of timestamps:
$ xzgrep "10:55:3[1-3].*LOST" boerne_2016-12-22.log.xz | awk '{sum+=$6}END{print sum}'
5616415

So we do not have a good picture again :/ One thing is highly suspicious,
though: I really doubt the _whole_ pagecache went down to zero and then back
up in such a short time:
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=89 inactive=0 total_active=1301 active=0 ratio=1 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 inactive=0 total_active=21 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722401 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450730 inactive=0 total_active=206026 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

The file inactive count of 450730 and the active count of 206026 roughly
match the global counters in the OOM report, so I would trust those to be
more realistic. I simply do not see any large source of LRU isolation. Maybe
those pages have been truncated and new ones allocated. The time window is
really short, but who knows...

Another possibility would be misaccounting, but I do not see anything that
uses __mod_zone_page_state and __mod_node_page_state on the LRU pages in a
way that would handle node vs. zone counters inconsistently. Everything
seems to go via __update_lru_size.

Another thing to check would be the per-cpu counter usage. The following
patch should use the more precise numbers. I am also not sure about the
lockless nature of inactive_list_is_low, so the patch below adds the
lru_lock there.

The only clear thing is that mm_vmscan_lru_isolate indeed skipped
through the whole list without finding a single suitable page
when it couldn't isolate any pages. So the failure is not due to
get_page_unless_zero.
$ xzgrep "mm_vmscan_lru_isolate.*nr_taken=0" boerne_2016-12-22.log.xz | sed 's@.*nr_scanned=\([0-9]*\).*@\1@' | sort | uniq -c
7941 0

I am not able to draw any conclusion now. I am suspecting get_scan_count
as well. Let's see whether the patch below makes any difference, and if
not I will dig into g_s_c some more. I will keep thinking about it in the
meantime; maybe somebody else will notice something, so I am sending out
this half-baked analysis.

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb82913b62bb..8727b68a8e70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -239,7 +239,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
if (!mem_cgroup_disabled())
return mem_cgroup_get_lru_size(lruvec, lru);

- return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
+ return node_page_state_snapshot(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
}

/*
@@ -2056,6 +2056,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;

+ spin_lock_irq(&pgdat->lru_lock);
total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

@@ -2071,14 +2072,15 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!managed_zone(zone))
continue;

- inactive_zone = zone_page_state(zone,
+ inactive_zone = zone_page_state_snapshot(zone,
NR_ZONE_LRU_BASE + (file * LRU_FILE));
- active_zone = zone_page_state(zone,
+ active_zone = zone_page_state_snapshot(zone,
NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);

inactive -= min(inactive, inactive_zone);
active -= min(active, active_zone);
}
+ spin_unlock_irq(&pgdat->lru_lock);

gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
--
Michal Hocko
SUSE Labs

2016-12-22 21:46:36

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> TL;DR I still do not see what is going on here and it still smells like
> multiple issues. Please apply the patch below on _top_ of what you had.

I've run the usual procedure again with the new patch on top and the
log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

As a little side note: It is likely, but I cannot completely say for
sure yet, that this issue is rather easy to reproduce. When I had some
time today at work, I set up a fresh Debian Sid installation in a VM
(32 bit PAE kernel, 4 GB RAM, btrfs as root fs). I used some late 4.9rc(8?)
kernel supplied by Debian - they don't seem to have 4.9 final yet and I
didn't get around to building and using a custom 4.9 final kernel, possibly
even one with your patches. But the 4.9rc kernel there seemed to behave very much
the same as the 4.9 kernel on my real 32 bit machines does: All I had
to do was unpack a few big tarballs - firefox, libreoffice and the
kernel are my favorites - and the machine would start OOMing.

This might suggest - although I have to admit, again, that this is
inconclusive, as I've not used a final 4.9 kernel - that you could
very easily reproduce the issue yourself by just setting up a 32 bit
system with a btrfs filesystem and then unpacking a few huge tarballs.
Of course, I'm more than happy to continue giving any patches sent to
me a spin, but I thought I'd still mention this in case it makes
things easier for you. :-)

Greetings
Nils

2016-12-23 10:52:04

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

TL;DR
drop the last patch, check whether memory cgroup is enabled and retest
with cgroup_disable=memory to see whether this is memcg related and if
it is _not_ then try to test with the patch below

On Thu 22-12-16 22:46:11, Nils Holland wrote:
> On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> > TL;DR I still do not see what is going on here and it still smells like
> > multiple issues. Please apply the patch below on _top_ of what you had.
>
> I've run the usual procedure again with the new patch on top and the
> log is now up at:
>
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

OK, so there are still large page cache fluctuations even with the
locking applied:
472.042409 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450451 inactive=0 total_active=210056 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042442 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042451 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=12 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042484 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=11944 inactive=0 total_active=117286 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB

One thing that didn't occur to me previously is that this might be an
effect of the memory cgroups. Do you have memory cgroups enabled? If so,
then rerunning with cgroup_disable=memory would be interesting
as well.
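
A quick way to check (a sketch; /proc/config.gz is only there if the kernel
exposes its config) would be:

$ grep -w memory /proc/cgroups          # the last column is the "enabled" flag
$ zgrep CONFIG_MEMCG /proc/config.gz

and then boot with cgroup_disable=memory appended to the kernel command
line; the "enabled" column for the memory controller should read 0
afterwards.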

Anyway, now I am looking at get_scan_count, which determines how many pages
we should scan on each LRU list. The problem I can see there is that it
doesn't reflect eligible zones (or at least it doesn't do that
consistently). So it might happen that we simply decide to scan the whole
LRU list (when we get down to prio 0 because we cannot make any progress)
and then _slowly_ scan through it in SWAP_CLUSTER_MAX chunks each time.
This can take a lot of time, and who knows what might happen if there are
many such reclaimers in parallel.
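
To put a rough number on it (taking the ~450k inactive file pages from the
trace above and SWAP_CLUSTER_MAX = 32):

$ echo $((450451 / 32))
14076

so a single full-list scan would mean on the order of fourteen thousand
isolate/reclaim rounds, and that is per reclaimer.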

[...]

> This might suggest - although I have to admit, again, that this is
> inconclusive, as I've not used a final 4.9 kernel - that you could
> very easily reproduce the issue yourself by just setting up a 32 bit
> system with a btrfs filesystem and then unpacking a few huge tarballs.
> Of course, I'm more than happy to continue giving any patches sent to
> me a spin, but I thought I'd still mention this in case it makes
> things easier for you. :-)

I would appreciate sticking with your setup so as not to pull new unknowns
into the picture.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb82913b62bb..533bb591b0be 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -243,6 +243,35 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
}

/*
+ * Return the number of pages on the given lru which are eligibne for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+ enum lru_list lru, int zone_idx)
+{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ unsigned long lru_size;
+ int zid;
+
+ if (!mem_cgroup_disabled())
+ return mem_cgroup_get_lru_size(lruvec, lru);
+
+ lru_size = lruvec_lru_size(lruvec, lru);
+ for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+ unsigned long size;
+
+ if (!managed_zone(zone))
+ continue;
+
+ size = zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
+ lru_size -= min(size, lru_size);
+ }
+
+ return lru_size;
+}
+
+/*
* Add a shrinker callback to be called from the vm.
*/
int register_shrinker(struct shrinker *shrinker)
@@ -2228,7 +2257,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* system is under heavy pressure.
*/
if (!inactive_list_is_low(lruvec, true, sc) &&
- lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+ lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2295,7 +2324,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
unsigned long size;
unsigned long scan;

- size = lruvec_lru_size(lruvec, lru);
+ size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
scan = size >> sc->priority;

if (!scan && pass && force_scan)
--
Michal Hocko
SUSE Labs

2016-12-23 12:18:58

by Nils Holland

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> TL;DR
> drop the last patch, check whether memory cgroup is enabled and retest
> with cgroup_disable=memory to see whether this is memcg related and if
> it is _not_ then try to test with the patch below

Right, it seems we might be looking in the right direction! So I
removed the previous patch from my kernel and verified if memory
cgroup was enabled, and indeed, it was. So I booted with
cgroup_disable=memory and ran my ordinary test again ... and in fact,
no ooms! I could have the firefox sources building and unpack half a
dozen big tarballs, which would previously with 99% certainty already
trigger an OOM upon unpacking the first tarball. Also, the system
seemed to run noticeably "nicer", in the sense that the other processes
I had running (like htop) would not get delayed / hung. The new patch
you sent has, as per your instructions, NOT been applied.

I've provided a log of this run, it's available at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-23.log.xz

As no OOMs or other bad situations occurred, no memory information was
forcibly logged. However, about three times I triggered a memory info
manually via SysReq, because I guess that might be interesting for you
to look at.

I'd like to run the same test on my second machine as well just to
make sure that cgroup_disable=memory has an effect there too. I
should be able to do that later tonight and will report back as soon
as I know more!

> I would appreciate sticking with your setup so as not to pull new unknowns
> into the picture.

No problem! It's just likely that I won't be able to test during the
following days until Dec 27th, but after that I should be back to
normal and thus be able to run further tests in a timely fashion. :-)

Greetings
Nils

2016-12-23 12:57:40

by Michal Hocko

[permalink] [raw]
Subject: Re: OOM: Better, but still there on

On Fri 23-12-16 13:18:51, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > TL;DR
> > drop the last patch, check whether memory cgroup is enabled and retest
> > with cgroup_disable=memory to see whether this is memcg related and if
> > it is _not_ then try to test with the patch below
>
> Right, it seems we might be looking in the right direction! So I
> removed the previous patch from my kernel and verified if memory
> cgroup was enabled, and indeed, it was. So I booted with
> cgroup_disable=memory and ran my ordinary test again ... and in fact,
> no ooms!

OK, thanks for the confirmation. I could have figured that out earlier. The
pagecache differences in such a short time should have raised a red flag
and pointed towards memcgs...

[...]
> > I would appreciate sticking with your setup so as not to pull new unknowns
> > into the picture.
>
> No problem! It's just likely that I won't be able to test during the
> following days until Dec 27th, but after that I should be back to
> normal and thus be able to run further tests in a timely fashion. :-)

no problem at all. I will try to cook up a patch in the mean time.
--
Michal Hocko
SUSE Labs

2016-12-23 14:47:47

by Michal Hocko

[permalink] [raw]
Subject: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

[Adding Mel, Johannes and Vladimir - the email thread started here:
http://lkml.kernel.org/r/[email protected]
Long story short, the zone->node reclaim change has broken active
list aging for lowmem requests when memory cgroups are enabled. More
details below.]

On Fri 23-12-16 13:57:28, Michal Hocko wrote:
> On Fri 23-12-16 13:18:51, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > > TL;DR
> > > drop the last patch, check whether memory cgroup is enabled and retest
> > > with cgroup_disable=memory to see whether this is memcg related and if
> > > it is _not_ then try to test with the patch below
> >
> > Right, it seems we might be looking in the right direction! So I
> > removed the previous patch from my kernel and verified if memory
> > cgroup was enabled, and indeed, it was. So I booted with
> > cgroup_disable=memory and ran my ordinary test again ... and in fact,
> > no ooms!
>
> OK, thanks for the confirmation. I could have figured that out earlier. The
> pagecache differences in such a short time should have raised a red flag
> and pointed towards memcgs...
>
> [...]
> > > I would appreciate sticking with your setup so as not to pull new unknowns
> > > into the picture.
> >
> > No problem! It's just likely that I won't be able to test during the
> > following days until Dec 27th, but after that I should be back to
> > normal and thus be able to run further tests in a timely fashion. :-)
>
> no problem at all. I will try to cook up a patch in the mean time.

So here is my attempt. It is only compile tested, so be careful; it might
eat your kittens or do more harm. I would appreciate it if others could
have a look to see whether this is sane. There are probably other places
which would need some tweaks. I think that get_scan_count needs some tweaks
as well, because we should only consider eligible zones when counting the
number of pages to scan. That would be for a separate patch, which I will
send later. I just want to fix this one first.

Nils, even though this is still highly experimental, could you give it a
try please?
---
From a66fd89d43e9fd8ca9afa7e6c7252ab73d22b686 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
memcg is enabled

Nils Holland has reported unexpected OOM killer invocations with 32b
kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

The OOM killer is clearly premature because there is still a lot of page
cache in the zone Normal which should satisfy this lowmem request. Further
debugging has shown that the reclaim cannot make any forward progress
because the page cache is hidden in the active list, which doesn't get
rotated because inactive_list_is_low is not memcg aware.
It simply subtracts per-zone highmem counters from the respective
memcg's lru sizes, which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts as 0 and return
false. This issue is not limited to 32b kernels, but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocation requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Cc: stable # 4.8+
Reported-by: Nils Holland <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 26 +++++++++++++++++++++++---
include/linux/mm_inline.h | 2 +-
mm/memcontrol.c | 11 ++++++-----
mm/vmscan.c | 26 ++++++++++++++++----------
4 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c17f3b7..002cb08b0f3e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,7 +120,7 @@ struct mem_cgroup_reclaim_iter {
*/
struct mem_cgroup_per_node {
struct lruvec lruvec;
- unsigned long lru_size[NR_LRU_LISTS];
+ unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];

@@ -432,7 +432,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int nr_pages);
+ int zid, int nr_pages);

unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask);
@@ -441,9 +441,23 @@ static inline
unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
struct mem_cgroup_per_node *mz;
+ unsigned long nr_pages = 0;
+ int zid;

mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return mz->lru_size[lru];
+ for (zid = 0; zid < MAX_NR_ZONES; zid++)
+ nr_pages += mz->lru_zone_size[zid][lru];
+ return nr_pages;
+}
+
+static inline
+unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
+ int zone_idx)
+{
+ struct mem_cgroup_per_node *mz;
+
+ mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ return mz->lru_zone_size[zone_idx][lru];
}

void mem_cgroup_handle_over_high(void);
@@ -671,6 +685,12 @@ mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
return 0;
}
+static inline
+unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
+ int zone_idx)
+{
+ return 0;
+}

static inline unsigned long
mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 71613e8a720f..41d376e7116d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -39,7 +39,7 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
{
__update_lru_size(lruvec, lru, zid, nr_pages);
#ifdef CONFIG_MEMCG
- mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+ mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91dfc7c5ce8f..f4e9c4d49df3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -625,8 +625,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask)
{
+ struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
unsigned long nr = 0;
- struct mem_cgroup_per_node *mz;
enum lru_list lru;

VM_BUG_ON((unsigned)nid >= nr_node_ids);
@@ -634,8 +634,7 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
for_each_lru(lru) {
if (!(BIT(lru) & lru_mask))
continue;
- mz = mem_cgroup_nodeinfo(memcg, nid);
- nr += mz->lru_size[lru];
+ nr += mem_cgroup_get_lru_size(lruvec, lru);
}
return nr;
}
@@ -1002,6 +1001,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
* @lru: index of lru list the page is sitting on
+ * @zid: zone id of the accounted pages
* @nr_pages: positive when adding or negative when removing
*
* This function must be called under lru_lock, just before a page is added
@@ -1009,7 +1009,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
* so as to allow it to check that lru_size 0 is consistent with list_empty).
*/
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int nr_pages)
+ int zid, int nr_pages)
{
struct mem_cgroup_per_node *mz;
unsigned long *lru_size;
@@ -1020,7 +1020,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
return;

mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- lru_size = mz->lru_size + lru;
+ lru_size = &mz->lru_zone_size[zid][lru];
empty = list_empty(lruvec->lists + lru);

if (nr_pages < 0)
@@ -1036,6 +1036,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,

if (nr_pages > 0)
*lru_size += nr_pages;
+ mz->lru_zone_size[zid][lru] += nr_pages;
}

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4abf08861d2..c98b1a585992 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -242,6 +242,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
}

+unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
+{
+ if (!mem_cgroup_disabled())
+ return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);
+
+ return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
+ NR_ZONE_LRU_BASE + lru);
+}
+
/*
* Add a shrinker callback to be called from the vm.
*/
@@ -1382,8 +1391,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
* be complete before mem_cgroup_update_lru_size due to a santity check.
*/
static __always_inline void update_lru_sizes(struct lruvec *lruvec,
- enum lru_list lru, unsigned long *nr_zone_taken,
- unsigned long nr_taken)
+ enum lru_list lru, unsigned long *nr_zone_taken)
{
int zid;

@@ -1392,11 +1400,11 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
continue;

__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
- }
-
#ifdef CONFIG_MEMCG
- mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
+ mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
#endif
+ }
+
}

/*
@@ -1501,7 +1509,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
nr_taken, mode, is_file_lru(lru));
- update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
+ update_lru_sizes(lruvec, lru, nr_zone_taken);
return nr_taken;
}

@@ -2047,10 +2055,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!managed_zone(zone))
continue;

- inactive_zone = zone_page_state(zone,
- NR_ZONE_LRU_BASE + (file * LRU_FILE));
- active_zone = zone_page_state(zone,
- NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
+ inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
+ active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);

inactive -= min(inactive, inactive_zone);
active -= min(active, active_zone);
--
2.10.2


--
Michal Hocko
SUSE Labs

2016-12-23 22:26:10

by Nils Holland

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
>
> Nils, even though this is still highly experimental, could you give it a
> try please?

Yes, no problem! So I kept the very first patch you sent but had to
revert the latest version of the debugging patch (the one in
which you added the "mm_vmscan_inactive_list_is_low" event) because
otherwise the patch you just sent wouldn't apply. Then I rebooted with
memory cgroups enabled again, and the first thing that strikes the eye
is that I get this during boot:

[ 1.568174] ------------[ cut here ]------------
[ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
[ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
[ 1.568754] Modules linked in:
[ 1.568922] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-gentoo #6
[ 1.569052] Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
[ 1.571750] f44e5b84 c142bdee f44e5bc8 c1b5ade0 f44e5bb4 c103ab1d c1b583e4 f44e5be4
[ 1.572262] 00000001 c1b5ade0 00000408 c11603d8 00000408 00000000 c1b5af73 00000001
[ 1.572774] f44e5bd0 c103ab76 00000009 00000000 f44e5bc8 c1b583e4 f44e5be4 f44e5c18
[ 1.573285] Call Trace:
[ 1.573419] [<c142bdee>] dump_stack+0x47/0x69
[ 1.573551] [<c103ab1d>] __warn+0xed/0x110
[ 1.573681] [<c11603d8>] ? mem_cgroup_update_lru_size+0x118/0x130
[ 1.573812] [<c103ab76>] warn_slowpath_fmt+0x36/0x40
[ 1.573942] [<c11603d8>] mem_cgroup_update_lru_size+0x118/0x130
[ 1.574076] [<c1111467>] __pagevec_lru_add_fn+0xd7/0x1b0
[ 1.574206] [<c1111390>] ? perf_trace_mm_lru_insertion+0x150/0x150
[ 1.574336] [<c111239d>] pagevec_lru_move_fn+0x4d/0x80
[ 1.574465] [<c1111390>] ? perf_trace_mm_lru_insertion+0x150/0x150
[ 1.574595] [<c11127e5>] __lru_cache_add+0x45/0x60
[ 1.574724] [<c1112848>] lru_cache_add+0x8/0x10
[ 1.574852] [<c1102fc1>] add_to_page_cache_lru+0x61/0xc0
[ 1.574982] [<c110418e>] pagecache_get_page+0xee/0x270
[ 1.575111] [<c11060f0>] grab_cache_page_write_begin+0x20/0x40
[ 1.575243] [<c118b955>] simple_write_begin+0x25/0xd0
[ 1.575372] [<c11061b8>] generic_perform_write+0xa8/0x1a0
[ 1.575503] [<c1106447>] __generic_file_write_iter+0x197/0x1f0
[ 1.575634] [<c110663f>] generic_file_write_iter+0x19f/0x2b0
[ 1.575766] [<c11669c1>] __vfs_write+0xd1/0x140
[ 1.575897] [<c1166bc5>] vfs_write+0x95/0x1b0
[ 1.576026] [<c1166daf>] SyS_write+0x3f/0x90
[ 1.576157] [<c1ce4474>] xwrite+0x1c/0x4b
[ 1.576285] [<c1ce44c5>] do_copy+0x22/0xac
[ 1.576413] [<c1ce42c3>] write_buffer+0x1d/0x2c
[ 1.576540] [<c1ce42f0>] flush_buffer+0x1e/0x70
[ 1.576670] [<c1d0eae8>] unxz+0x149/0x211
[ 1.576798] [<c1d0e99f>] ? unlzo+0x359/0x359
[ 1.576926] [<c1ce4946>] unpack_to_rootfs+0x14f/0x246
[ 1.577054] [<c1ce42d2>] ? write_buffer+0x2c/0x2c
[ 1.577183] [<c1ce4216>] ? initrd_load+0x3b/0x3b
[ 1.577312] [<c1ce4b20>] ? maybe_link.part.3+0xe3/0xe3
[ 1.577443] [<c1ce4b67>] populate_rootfs+0x47/0x8f
[ 1.577573] [<c1000456>] do_one_initcall+0x36/0x150
[ 1.577701] [<c1ce351e>] ? repair_env_string+0x12/0x54
[ 1.577832] [<c1054ded>] ? parse_args+0x25d/0x400
[ 1.577962] [<c1ce3baf>] ? kernel_init_freeable+0x101/0x19e
[ 1.578092] [<c1ce3bcf>] kernel_init_freeable+0x121/0x19e
[ 1.578222] [<c19b0700>] ? rest_init+0x60/0x60
[ 1.578350] [<c19b070b>] kernel_init+0xb/0x100
[ 1.578480] [<c1060c7c>] ? schedule_tail+0xc/0x50
[ 1.578608] [<c19b0700>] ? rest_init+0x60/0x60
[ 1.578737] [<c19b5db7>] ret_from_fork+0x1b/0x28
[ 1.578871] ---[ end trace cf6f1adac9dfe60e ]---

The machine then continued to boot just normally, however, so I
started my ordinary tests. And in fact, they were working just fine,
i.e. no OOMing anymore, even during heavy tarball unpacking.

Would it make sense to capture more trace data for you at this point?
As I'm on the go, I don't currently have a second machine for
capturing over the network, but since we're not having OOMs or other
issues now, capturing to file should probably work just fine.
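
If you want that, something along these lines (a sketch; it assumes debugfs
is mounted under /sys/kernel/debug and that the vmscan events from your
debugging patches are present) should do:

# cd /sys/kernel/debug/tracing
# echo 1 > events/vmscan/enable          # or just the individual mm_vmscan_* events
# cat trace_pipe > /var/tmp/vmscan-trace.log &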

I'll keep the patch applied and see if I notice anything else that
doesn't look normal during day to day usage, especially during my
ordinary Gentoo updates, which consist of a lot of fetching /
unpacking / building, and in the recent past had been very problematic
(in fact, that was where the problem first struck me and the "heavy
tarball unpacking" test was then just what I distilled it down to
in order to manually reproduce this with the least time and effort
possible).

Greetings
Nils

2016-12-26 06:26:33

by kernel test robot

[permalink] [raw]
Subject: [lkp-developer] [mm, memcg] d18e2b2aca: WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size


FYI, we noticed the following commit:

commit: d18e2b2aca0396849f588241e134787a829c707d ("mm, memcg: fix (Re: OOM: Better, but still there on)")
url: https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-Re-OOM-Better-but-still-there-on/20161223-225057
base: git://git.cmpxchg.org/linux-mmotm.git master

in testcase: boot

on test machine: qemu-system-i386 -enable-kvm -m 360M

caused below changes:


+--------------------------------------------------------+------------+------------+
| | c7d85b880b | d18e2b2aca |
+--------------------------------------------------------+------------+------------+
| boot_successes | 8 | 0 |
| boot_failures | 0 | 2 |
| WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size | 0 | 2 |
| kernel_BUG_at_mm/memcontrol.c | 0 | 2 |
| invalid_opcode:#[##]DEBUG_PAGEALLOC | 0 | 2 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 2 |
+--------------------------------------------------------+------------+------------+



[ 95.226364] init: tty6 main process (990) killed by TERM signal
[ 95.314020] init: plymouth-upstart-bridge main process (1039) terminated with status 1
[ 97.588568] ------------[ cut here ]------------
[ 97.594364] WARNING: CPU: 0 PID: 1055 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0xdd/0x12b
[ 97.606654] mem_cgroup_update_lru_size(40297f00, 0, -1): lru_size 1 but empty
[ 97.615140] Modules linked in:
[ 97.618834] CPU: 0 PID: 1055 Comm: killall5 Not tainted 4.9.0-mm1-00095-gd18e2b2 #82
[ 97.628008] Call Trace:
[ 97.631025] dump_stack+0x16/0x18
[ 97.635107] __warn+0xaf/0xc6
[ 97.638729] ? mem_cgroup_update_lru_size+0xdd/0x12b


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong


Attachments:
(No filename) (2.11 kB)
config-4.9.0-mm1-00095-gd18e2b2 (83.51 kB)
job-script (3.86 kB)
dmesg.xz (14.47 kB)

2016-12-26 12:26:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [lkp-developer] [mm, memcg] d18e2b2aca: WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size

On Mon 26-12-16 06:25:56, kernel test robot wrote:
>
> FYI, we noticed the following commit:
>
> commit: d18e2b2aca0396849f588241e134787a829c707d ("mm, memcg: fix (Re: OOM: Better, but still there on)")
> url: https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-Re-OOM-Better-but-still-there-on/20161223-225057
> base: git://git.cmpxchg.org/linux-mmotm.git master
>
> in testcase: boot
>
> on test machine: qemu-system-i386 -enable-kvm -m 360M
>
> caused below changes:
>
>
> +--------------------------------------------------------+------------+------------+
> | | c7d85b880b | d18e2b2aca |
> +--------------------------------------------------------+------------+------------+
> | boot_successes | 8 | 0 |
> | boot_failures | 0 | 2 |
> | WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size | 0 | 2 |
> | kernel_BUG_at_mm/memcontrol.c | 0 | 2 |
> | invalid_opcode:#[##]DEBUG_PAGEALLOC | 0 | 2 |
> | Kernel_panic-not_syncing:Fatal_exception | 0 | 2 |
> +--------------------------------------------------------+------------+------------+
>
>
>
> [ 95.226364] init: tty6 main process (990) killed by TERM signal
> [ 95.314020] init: plymouth-upstart-bridge main process (1039) terminated with status 1
> [ 97.588568] ------------[ cut here ]------------
> [ 97.594364] WARNING: CPU: 0 PID: 1055 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0xdd/0x12b
> [ 97.606654] mem_cgroup_update_lru_size(40297f00, 0, -1): lru_size 1 but empty
> [ 97.615140] Modules linked in:
> [ 97.618834] CPU: 0 PID: 1055 Comm: killall5 Not tainted 4.9.0-mm1-00095-gd18e2b2 #82
> [ 97.628008] Call Trace:
> [ 97.631025] dump_stack+0x16/0x18
> [ 97.635107] __warn+0xaf/0xc6
> [ 97.638729] ? mem_cgroup_update_lru_size+0xdd/0x12b

Do you have the full backtrace?
--
Michal Hocko
SUSE Labs

2016-12-26 12:48:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Fri 23-12-16 23:26:00, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> >
> > Nils, even though this is still highly experimental, could you give it a
> > try please?
>
> Yes, no problem! So I kept the very first patch you sent but had to
> revert the latest version of the debugging patch (the one in
> which you added the "mm_vmscan_inactive_list_is_low" event) because
> otherwise the patch you just sent wouldn't apply. Then I rebooted with
> memory cgroups enabled again, and the first thing that strikes the eye
> is that I get this during boot:
>
> [ 1.568174] ------------[ cut here ]------------
> [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty

Ohh, I can see what is wrong! a) there is a bug in the accounting in my
patch (I double account) and b) the empty-list detection cannot work after
my change because the per-zone counters will no longer match the per-node
list state. The updated patch is below. So I hope my brain is working again
after having been mostly off for the last few days...
---
From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
memcg is enabled

Nils Holland has reported unexpected OOM killer invocations with 32b
kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
active_file:274324 inactive_file:281962 isolated_file:0
unevictable:0 dirty:649 writeback:0 unstable:0
slab_reclaimable:40662 slab_unreclaimable:17754
mapped:7382 shmem:202 pagetables:351 bounce:0
free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

The OOM killer is clearly premature because there is still a lot of page
cache in the zone Normal which should satisfy this lowmem request. Further
debugging has shown that the reclaim cannot make any forward progress
because the page cache is hidden in the active list, which doesn't get
rotated because inactive_list_is_low is not memcg aware.
It simply subtracts per-zone highmem counters from the respective
memcg's lru sizes, which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts as 0 and return
false. This issue is not limited to 32b kernels, but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocation requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing the "empty LRU but non-zero lru_size" detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
Cc: stable # 4.8+
Reported-by: Nils Holland <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 26 +++++++++++++++++++++++---
include/linux/mm_inline.h | 2 +-
mm/memcontrol.c | 18 ++++++++----------
mm/vmscan.c | 26 ++++++++++++++++----------
4 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c17f3b7..002cb08b0f3e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,7 +120,7 @@ struct mem_cgroup_reclaim_iter {
*/
struct mem_cgroup_per_node {
struct lruvec lruvec;
- unsigned long lru_size[NR_LRU_LISTS];
+ unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];

@@ -432,7 +432,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int nr_pages);
+ int zid, int nr_pages);

unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask);
@@ -441,9 +441,23 @@ static inline
unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
struct mem_cgroup_per_node *mz;
+ unsigned long nr_pages = 0;
+ int zid;

mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return mz->lru_size[lru];
+ for (zid = 0; zid < MAX_NR_ZONES; zid++)
+ nr_pages += mz->lru_zone_size[zid][lru];
+ return nr_pages;
+}
+
+static inline
+unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
+ int zone_idx)
+{
+ struct mem_cgroup_per_node *mz;
+
+ mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ return mz->lru_zone_size[zone_idx][lru];
}

void mem_cgroup_handle_over_high(void);
@@ -671,6 +685,12 @@ mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
return 0;
}
+static inline
+unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
+ int zone_idx)
+{
+ return 0;
+}

static inline unsigned long
mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 71613e8a720f..41d376e7116d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -39,7 +39,7 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
{
__update_lru_size(lruvec, lru, zid, nr_pages);
#ifdef CONFIG_MEMCG
- mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+ mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91dfc7c5ce8f..b59676026272 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -625,8 +625,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask)
{
+ struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
unsigned long nr = 0;
- struct mem_cgroup_per_node *mz;
enum lru_list lru;

VM_BUG_ON((unsigned)nid >= nr_node_ids);
@@ -634,8 +634,7 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
for_each_lru(lru) {
if (!(BIT(lru) & lru_mask))
continue;
- mz = mem_cgroup_nodeinfo(memcg, nid);
- nr += mz->lru_size[lru];
+ nr += mem_cgroup_get_lru_size(lruvec, lru);
}
return nr;
}
@@ -1002,6 +1001,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
* @lru: index of lru list the page is sitting on
+ * @zid: zone id of the accounted pages
* @nr_pages: positive when adding or negative when removing
*
* This function must be called under lru_lock, just before a page is added
@@ -1009,27 +1009,25 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
* so as to allow it to check that lru_size 0 is consistent with list_empty).
*/
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int nr_pages)
+ int zid, int nr_pages)
{
struct mem_cgroup_per_node *mz;
unsigned long *lru_size;
long size;
- bool empty;

if (mem_cgroup_disabled())
return;

mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- lru_size = mz->lru_size + lru;
- empty = list_empty(lruvec->lists + lru);
+ lru_size = &mz->lru_zone_size[zid][lru];

if (nr_pages < 0)
*lru_size += nr_pages;

size = *lru_size;
- if (WARN_ONCE(size < 0 || empty != !size,
- "%s(%p, %d, %d): lru_size %ld but %sempty\n",
- __func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
+ if (WARN_ONCE(size < 0,
+ "%s(%p, %d, %d): lru_size %ld\n",
+ __func__, lruvec, lru, nr_pages, size)) {
VM_BUG_ON(1);
*lru_size = 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4abf08861d2..c98b1a585992 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -242,6 +242,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
}

+unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
+{
+ if (!mem_cgroup_disabled())
+ return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);
+
+ return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
+ NR_ZONE_LRU_BASE + lru);
+}
+
/*
* Add a shrinker callback to be called from the vm.
*/
@@ -1382,8 +1391,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
* be complete before mem_cgroup_update_lru_size due to a sanity check.
*/
static __always_inline void update_lru_sizes(struct lruvec *lruvec,
- enum lru_list lru, unsigned long *nr_zone_taken,
- unsigned long nr_taken)
+ enum lru_list lru, unsigned long *nr_zone_taken)
{
int zid;

@@ -1392,11 +1400,11 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
continue;

__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
- }
-
#ifdef CONFIG_MEMCG
- mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
+ mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
#endif
+ }
+
}

/*
@@ -1501,7 +1509,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
nr_taken, mode, is_file_lru(lru));
- update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
+ update_lru_sizes(lruvec, lru, nr_zone_taken);
return nr_taken;
}

@@ -2047,10 +2055,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!managed_zone(zone))
continue;

- inactive_zone = zone_page_state(zone,
- NR_ZONE_LRU_BASE + (file * LRU_FILE));
- active_zone = zone_page_state(zone,
- NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
+ inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
+ active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);

inactive -= min(inactive, inactive_zone);
active -= min(active, active_zone);
--
2.10.2


--
Michal Hocko
SUSE Labs

2016-12-26 12:51:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [lkp-developer] [mm, memcg] d18e2b2aca: WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size

On Mon 26-12-16 13:26:51, Michal Hocko wrote:
> On Mon 26-12-16 06:25:56, kernel test robot wrote:
[...]
> > [ 95.226364] init: tty6 main process (990) killed by TERM signal
> > [ 95.314020] init: plymouth-upstart-bridge main process (1039) terminated with status 1
> > [ 97.588568] ------------[ cut here ]------------
> > [ 97.594364] WARNING: CPU: 0 PID: 1055 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0xdd/0x12b
> > [ 97.606654] mem_cgroup_update_lru_size(40297f00, 0, -1): lru_size 1 but empty
> > [ 97.615140] Modules linked in:
> > [ 97.618834] CPU: 0 PID: 1055 Comm: killall5 Not tainted 4.9.0-mm1-00095-gd18e2b2 #82
> > [ 97.628008] Call Trace:
> > [ 97.631025] dump_stack+0x16/0x18
> > [ 97.635107] __warn+0xaf/0xc6
> > [ 97.638729] ? mem_cgroup_update_lru_size+0xdd/0x12b
>
> Do you have the full backtrace?

It's not needed. I found the bug in my patch and it should be fixed by
the updated patch http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2016-12-26 18:57:14

by Nils Holland

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > >
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> >
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> >
> > [ 1.568174] ------------[ cut here ]------------
> > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
>
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per node zone will not match per
> zone statistics. The updated patch is below. So I hope my brain already
> works after it's been mostly off last few days...

I tried the updated patch, and I can confirm that the warning during
boot is gone. Also, I've tried my ordinary procedure to reproduce my
testcase, and I can say that a kernel with this new patch also works
fine and doesn't produce OOMs or similar issues.

I had the previous version of the patch in use on a machine non-stop
for the last few days during normal day-to-day workloads and didn't
notice any issues. Now I'll keep a machine running during the next few
days with this patch, and in case I notice something that doesn't look
normal, I'll of course report back!

Greetings
Nils

2016-12-27 08:08:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Mon 26-12-16 19:57:03, Nils Holland wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > >
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > >
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > >
> > > [ 1.568174] ------------[ cut here ]------------
> > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> >
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
>
> I tried the updated patch, and I can confirm that the warning during
> boot is gone. Also, I've tried my ordinary procedure to reproduce my
> testcase, and I can say that a kernel with this new patch also works
> fine and doesn't produce OOMs or similar issues.
>
> I had the previous version of the patch in use on a machine non-stop
> for the last few days during normal day-to-day workloads and didn't
> notice any issues. Now I'll keep a machine running during the next few
> days with this patch, and in case I notice something that doesn't look
> normal, I'll of course report back!

Thanks for your testing! Can I add your
Tested-by: Nils Holland <[email protected]>
?
--
Michal Hocko
SUSE Labs

2016-12-27 11:23:23

by Nils Holland

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Tue, Dec 27, 2016 at 09:08:38AM +0100, Michal Hocko wrote:
> On Mon 26-12-16 19:57:03, Nils Holland wrote:
> > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > >
> > > > > Nils, even though this is still highly experimental, could you give it a
> > > > > try please?
> > > >
> > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > revert the latest version of the debugging patch (the one in
> > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > is that I get this during boot:
> > > >
> > > > [ 1.568174] ------------[ cut here ]------------
> > > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> > >
> > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > my patch (I double account) and b) the detection for the empty list
> > > cannot work after my change because per node zone will not match per
> > > zone statistics. The updated patch is below. So I hope my brain already
> > > works after it's been mostly off last few days...
> >
> > I tried the updated patch, and I can confirm that the warning during
> > boot is gone. Also, I've tried my ordinary procedure to reproduce my
> > testcase, and I can say that a kernel with this new patch also works
> > fine and doesn't produce OOMs or similar issues.
> >
> > I had the previous version of the patch in use on a machine non-stop
> > for the last few days during normal day-to-day workloads and didn't
> > notice any issues. Now I'll keep a machine running during the next few
> > days with this patch, and in case I notice something that doesn't look
> > normal, I'll of course report back!
>
> Thanks for your testing! Can I add your
> Tested-by: Nils Holland <[email protected]>

Yes, I think so! The patch has now been running for 16 hours on my two
machines, and that's an uptime that was hard to achieve since 4.8 for
me. ;-) So my tests clearly suggest that the patch is good! :-)

Greetings
Nils

2016-12-27 11:28:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Tue 27-12-16 12:23:13, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 09:08:38AM +0100, Michal Hocko wrote:
> > On Mon 26-12-16 19:57:03, Nils Holland wrote:
> > > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > > >
> > > > > > Nils, even though this is still highly experimental, could you give it a
> > > > > > try please?
> > > > >
> > > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > > revert the latest version of the debugging patch (the one in
> > > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > > is that I get this during boot:
> > > > >
> > > > > [ 1.568174] ------------[ cut here ]------------
> > > > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> > > >
> > > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > > my patch (I double account) and b) the detection for the empty list
> > > > cannot work after my change because per node zone will not match per
> > > > zone statistics. The updated patch is below. So I hope my brain already
> > > > works after it's been mostly off last few days...
> > >
> > > I tried the updated patch, and I can confirm that the warning during
> > > boot is gone. Also, I've tried my ordinary procedure to reproduce my
> > > testcase, and I can say that a kernel with this new patch also works
> > > fine and doesn't produce OOMs or similar issues.
> > >
> > > I had the previous version of the patch in use on a machine non-stop
> > > for the last few days during normal day-to-day workloads and didn't
> > > notice any issues. Now I'll keep a machine running during the next few
> > > days with this patch, and in case I notice something that doesn't look
> > > normal, I'll of course report back!
> >
> > Thanks for your testing! Can I add your
> > Tested-by: Nils Holland <[email protected]>
>
> Yes, I think so! The patch has now been running for 16 hours on my two
> machines, and that's an uptime that was hard to achieve since 4.8 for
> me. ;-) So my tests clearly suggest that the patch is good! :-)

OK, thanks a lot for your testing! I will wait a few more days before I
send it to Andrew.

--
Michal Hocko
SUSE Labs

2016-12-27 15:55:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

Hi,
could you try to run with the following patch on top of the previous
one? I do not think it will make a large change in your workload, but
I think we need something like that, so some testing under a workload
which is known to generate high lowmem pressure would be really
appreciated. If you have more time to play with it, then running with
and without the patch with the mm_vmscan_direct_reclaim_{start,end}
tracepoints enabled could tell us whether it makes any difference at all.

I would also appreciate it if Mel and Johannes had a look at it. I am
not yet sure whether we need the same thing for anon/file balancing in
get_scan_count. I suspect we do, but I need to think more about that.

Thanks a lot again!
---
>From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 27 Dec 2016 16:28:44 +0100
Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

get_scan_count considers the whole node LRU size when
- doing SCAN_FILE due to many page cache inactive pages
- calculating the number of pages to scan

In both cases this might lead to unexpected behavior, especially on 32b
systems where we can expect lowmem pressure very often.

A large highmem zone can easily distort the SCAN_FILE heuristic because
there might be only a few file pages from the eligible zones on the node
lru and we would still enforce file lru scanning, which can lead to
thrashing while we could still scan anonymous pages.

The latter use of lruvec_lru_size can be problematic as well, especially
when there are not many pages from the eligible zones. We would have to
skip over many pages to find anything to reclaim, but shrink_node_memcg
would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
at maximum. Therefore we can end up going over a large LRU many times
without actually having a chance to reclaim much, if anything at all.
The closer the lowmem zone is to running out of memory, the worse the
problem will be.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c98b1a585992..785b4d7fb8a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int
}

/*
+ * Return the number of pages on the given lru which are eligibne for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+ enum lru_list lru, int zone_idx)
+{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ unsigned long lru_size;
+ int zid;
+
+ lru_size = lruvec_lru_size(lruvec, lru);
+ for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+ unsigned long size;
+
+ if (!managed_zone(zone))
+ continue;
+
+ size = lruvec_zone_lru_size(lruvec, lru, zid);
+ lru_size -= min(size, lru_size);
+ }
+
+ return lru_size;
+}
+
+/*
* Add a shrinker callback to be called from the vm.
*/
int register_shrinker(struct shrinker *shrinker)
@@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* system is under heavy pressure.
*/
if (!inactive_list_is_low(lruvec, true, sc) &&
- lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+ lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
unsigned long size;
unsigned long scan;

- size = lruvec_lru_size(lruvec, lru);
+ size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
scan = size >> sc->priority;

if (!scan && pass && force_scan)
--
2.10.2

--
Michal Hocko
SUSE Labs

2016-12-27 16:28:56

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

Hi Michal,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-vmscan-consider-eligible-zones-in-get_scan_count/20161228-000917
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

mm/vmscan.c: In function 'lruvec_lru_size_zone_idx':
>> mm/vmscan.c:264:10: error: implicit declaration of function 'lruvec_zone_lru_size' [-Werror=implicit-function-declaration]
size = lruvec_zone_lru_size(lruvec, lru, zid);
^~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/lruvec_zone_lru_size +264 mm/vmscan.c

258 struct zone *zone = &pgdat->node_zones[zid];
259 unsigned long size;
260
261 if (!managed_zone(zone))
262 continue;
263
> 264 size = lruvec_zone_lru_size(lruvec, lru, zid);
265 lru_size -= min(size, lru_size);
266 }
267

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.37 kB)
.config.gz (6.27 kB)

2016-12-27 19:33:27

by Nils Holland

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload but
> I think we need something like that so some testing under which is known
> to make a high lowmem pressure would be really appreciated. If you have
> more time to play with it then running with and without the patch with
> mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> whether it make any difference at all.

Of course, no problem!

First, about the events to trace: mm_vmscan_direct_reclaim_start
doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
sure that's what you meant and so I took that one instead.

Then I have to admit that in both cases (once without the latest patch,
once with) very little trace data was actually produced. In the case
without the patch, the reclaim was started more often and reclaimed a
smaller number of pages each time; in the case with the patch it was
invoked less often, and the last time it was invoked it reclaimed a
rather big number of pages. I have no clue, however, whether that
happened "by chance" or whether it was actually caused by the patch and
is thus an expected change.

In both cases, my test case was: Reboot, set up logging, do "emerge
firefox" (which unpacks and builds the firefox sources), then, when
the emerge had progressed far enough that the unpacking was done and the
building had started, switch to another console and untar the latest
kernel, libreoffice and (once more) firefox sources there. After that
had completed, I aborted the emerge build process and stopped tracing.

Here's the trace data captured without the latest patch applied:

khugepaged-22 [000] .... 566.123383: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [000] .N.. 566.165520: mm_vmscan_direct_reclaim_end: nr_reclaimed=1100
khugepaged-22 [001] .... 587.515424: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [000] .... 587.596035: mm_vmscan_direct_reclaim_end: nr_reclaimed=1029
khugepaged-22 [001] .... 599.879536: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [000] .... 601.000812: mm_vmscan_direct_reclaim_end: nr_reclaimed=1100
khugepaged-22 [001] .... 601.228137: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 601.309952: mm_vmscan_direct_reclaim_end: nr_reclaimed=1081
khugepaged-22 [001] .... 694.935267: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .N.. 695.081943: mm_vmscan_direct_reclaim_end: nr_reclaimed=1071
khugepaged-22 [001] .... 701.370707: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 701.372798: mm_vmscan_direct_reclaim_end: nr_reclaimed=1089
khugepaged-22 [001] .... 764.752036: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [000] .... 771.047905: mm_vmscan_direct_reclaim_end: nr_reclaimed=1039
khugepaged-22 [000] .... 781.760515: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 781.826543: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
khugepaged-22 [001] .... 782.595575: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [000] .... 782.638591: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
khugepaged-22 [001] .... 782.930455: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 782.993608: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
khugepaged-22 [001] .... 783.330378: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 783.369653: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040

And this is the same with the patch applied:

khugepaged-22 [001] .... 523.599997: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 523.683110: mm_vmscan_direct_reclaim_end: nr_reclaimed=1092
khugepaged-22 [001] .... 535.345477: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 535.401189: mm_vmscan_direct_reclaim_end: nr_reclaimed=1078
khugepaged-22 [000] .... 692.876716: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
khugepaged-22 [001] .... 703.312399: mm_vmscan_direct_reclaim_end: nr_reclaimed=197759

If my test case and thus the results don't sound good, I could of
course try some other test cases ... like capturing for a longer
period of time or trying to produce more memory pressure by running
more processes at the same time, or something like that.

Besides that I can say that the patch hasn't produced any warnings or
other issues so far, so at first glance, it doesn't seem to hurt
anything.

Greetings
Nils

2016-12-28 08:51:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

On Wed 28-12-16 00:28:38, kbuild test robot wrote:
> Hi Michal,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.10-rc1 next-20161224]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-vmscan-consider-eligible-zones-in-get_scan_count/20161228-000917
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
> mm/vmscan.c: In function 'lruvec_lru_size_zone_idx':
> >> mm/vmscan.c:264:10: error: implicit declaration of function 'lruvec_zone_lru_size' [-Werror=implicit-function-declaration]
> size = lruvec_zone_lru_size(lruvec, lru, zid);

this patch depends on the previous one
http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2016-12-28 08:58:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Tue 27-12-16 20:33:09, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload but
> > I think we need something like that so some testing under which is known
> > to make a high lowmem pressure would be really appreciated. If you have
> > more time to play with it then running with and without the patch with
> > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > whether it make any difference at all.
>
> Of course, no problem!
>
> First, about the events to trace: mm_vmscan_direct_reclaim_start
> doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
> sure that's what you meant and so I took that one instead.

yes, sorry about the confusion

> Then I have to admit in both cases (once without the latest patch,
> once with) very little trace data was actually produced. In the case
> without the patch, the reclaim was started more often and reclaimed a
> smaller number of pages each time, in the case with the patch it was
> invoked less often, and with the last time it was invoked it reclaimed
> a rather big number of pages. I have no clue, however, if that
> happened "by chance" or if it was actually causes by the patch and
> thus an expected change.

yes, that seems to be a variation of the workload, I would say, because
if anything the patch should reduce the number of scanned pages.

> In both cases, my test case was: Reboot, setup logging, do "emerge
> firefox" (which unpacks and builds the firefox sources), then, when
> the emerge had come so far that the unpacking was done and the
> building had started, switch to another console and untar the latest
> kernel, libreoffice and (once more) firefox sources there. After that
> had completed, I aborted the emerge build process and stopped tracing.
>
> Here's the trace data captured without the latest patch applied:
>
> khugepaged-22 [000] .... 566.123383: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [000] .N.. 566.165520: mm_vmscan_direct_reclaim_end: nr_reclaimed=1100
> khugepaged-22 [001] .... 587.515424: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [000] .... 587.596035: mm_vmscan_direct_reclaim_end: nr_reclaimed=1029
> khugepaged-22 [001] .... 599.879536: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [000] .... 601.000812: mm_vmscan_direct_reclaim_end: nr_reclaimed=1100
> khugepaged-22 [001] .... 601.228137: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 601.309952: mm_vmscan_direct_reclaim_end: nr_reclaimed=1081
> khugepaged-22 [001] .... 694.935267: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .N.. 695.081943: mm_vmscan_direct_reclaim_end: nr_reclaimed=1071
> khugepaged-22 [001] .... 701.370707: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 701.372798: mm_vmscan_direct_reclaim_end: nr_reclaimed=1089
> khugepaged-22 [001] .... 764.752036: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [000] .... 771.047905: mm_vmscan_direct_reclaim_end: nr_reclaimed=1039
> khugepaged-22 [000] .... 781.760515: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 781.826543: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
> khugepaged-22 [001] .... 782.595575: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [000] .... 782.638591: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
> khugepaged-22 [001] .... 782.930455: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 782.993608: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
> khugepaged-22 [001] .... 783.330378: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 783.369653: mm_vmscan_direct_reclaim_end: nr_reclaimed=1040
>
> And this is the same with the patch applied:
>
> khugepaged-22 [001] .... 523.599997: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 523.683110: mm_vmscan_direct_reclaim_end: nr_reclaimed=1092
> khugepaged-22 [001] .... 535.345477: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 535.401189: mm_vmscan_direct_reclaim_end: nr_reclaimed=1078
> khugepaged-22 [000] .... 692.876716: mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22 [001] .... 703.312399: mm_vmscan_direct_reclaim_end: nr_reclaimed=197759

In these cases there is no real difference, because this is not lowmem
pressure; those requests can go to the highmem zone.

> If my test case and thus the results don't sound good, I could of
> course try some other test cases ... like capturing for a longer
> period of time or trying to produce more memory pressure by running
> more processes at the same time, or something like that.

yes, a stronger memory pressure would be needed. I suspect that your
original issue was more about active list aging than really strong
memory pressure. So it might be possible that your workload will not
notice. If you can collect those two tracepoints over a longer time it
can still tell us something, but I do not want you to burn a lot of time
on this. The main issue seems to be fixed and the follow-up fix can wait
for a thorough review after both Mel and Johannes are back from
holiday.

> Besides that I can say that the patch hasn't produced any warnings or
> other issues so far, so at first glance, it doesn't seem to hurt
> anything.

Thanks!
--
Michal Hocko
SUSE Labs

2016-12-29 00:33:10

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > >
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> >
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> >
> > [ 1.568174] ------------[ cut here ]------------
> > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
>
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per node zone will not match per
> zone statistics. The updated patch is below. So I hope my brain already
> works after it's been mostly off last few days...
> ---
> From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 23 Dec 2016 15:11:54 +0100
> Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> memcg is enabled
>
> Nils Holland has reported unexpected OOM killer invocations with 32b
> kernel starting with 4.8 kernels
>
> kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
> kworker/u4:5 cpuset=/ mems_allowed=0
> CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> [...]
> Mem-Info:
> active_anon:58685 inactive_anon:90 isolated_anon:0
> active_file:274324 inactive_file:281962 isolated_file:0
> unevictable:0 dirty:649 writeback:0 unstable:0
> slab_reclaimable:40662 slab_unreclaimable:17754
> mapped:7382 shmem:202 pagetables:351 bounce:0
> free:206736 free_pcp:332 free_cma:0
> Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 813 3474 3474
> Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
> lowmem_reserve[]: 0 0 21292 21292
> HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
>
> the oom killer is clearly pre-mature because there there is still a
> lot of page cache in the zone Normal which should satisfy this lowmem
> request. Further debugging has shown that the reclaim cannot make any
> forward progress because the page cache is hidden in the active list
> which doesn't get rotated because inactive_list_is_low is not memcg
> aware.
> It simply subtracts per-zone highmem counters from the respective
> memcg's lru sizes which doesn't make any sense. We can simply end up
> always seeing the resulting active and inactive counts 0 and return
> false. This issue is not limited to 32b kernels but in practice the
> effect on systems without CONFIG_HIGHMEM would be much harder to notice
> because we do not invoke the OOM killer for allocations requests
> targeting < ZONE_NORMAL.
>
> Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled. Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
>
> We are loosing empty LRU but non-zero lru size detection introduced by
> ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
> of the inherent zone vs. node discrepancy.
>
> Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
> Cc: stable # 4.8+
> Reported-by: Nils Holland <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Minchan Kim <[email protected]>

2016-12-29 01:03:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > >
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > >
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > >
> > > [ 1.568174] ------------[ cut here ]------------
> > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> >
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> > memcg is enabled
> >
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> >
> > kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
> > kworker/u4:5 cpuset=/ mems_allowed=0
> > CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> > [...]
> > Mem-Info:
> > active_anon:58685 inactive_anon:90 isolated_anon:0
> > active_file:274324 inactive_file:281962 isolated_file:0
> > unevictable:0 dirty:649 writeback:0 unstable:0
> > slab_reclaimable:40662 slab_unreclaimable:17754
> > mapped:7382 shmem:202 pagetables:351 bounce:0
> > free:206736 free_pcp:332 free_cma:0
> > Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> > DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 813 3474 3474
> > Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
> > lowmem_reserve[]: 0 0 21292 21292
> > HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
> >
> > the oom killer is clearly pre-mature because there there is still a
> > lot of page cache in the zone Normal which should satisfy this lowmem
> > request. Further debugging has shown that the reclaim cannot make any
> > forward progress because the page cache is hidden in the active list
> > which doesn't get rotated because inactive_list_is_low is not memcg
> > aware.
> > It simply subtracts per-zone highmem counters from the respective
> > memcg's lru sizes which doesn't make any sense. We can simply end up
> > always seeing the resulting active and inactive counts 0 and return
> > false. This issue is not limited to 32b kernels but in practice the
> > effect on systems without CONFIG_HIGHMEM would be much harder to notice
> > because we do not invoke the OOM killer for allocations requests
> > targeting < ZONE_NORMAL.
> >
> > Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> > and subtract per-memcg highmem counts when memcg is enabled. Introduce
> > helper lruvec_zone_lru_size which redirects to either zone counters or
> > mem_cgroup_get_zone_lru_size when appropriate.
> >
> > We are loosing empty LRU but non-zero lru size detection introduced by
> > ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
> > of the inherent zone vs. node discrepancy.
> >
> > Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
> > Cc: stable # 4.8+
> > Reported-by: Nils Holland <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> Acked-by: Minchan Kim <[email protected]>

Nit:

WARNING: line over 80 characters
#53: FILE: include/linux/memcontrol.h:689:
+unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,

WARNING: line over 80 characters
#147: FILE: mm/vmscan.c:248:
+unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)

WARNING: line over 80 characters
#177: FILE: mm/vmscan.c:1446:
+ mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);

WARNING: line over 80 characters
#201: FILE: mm/vmscan.c:2099:
+ inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);

WARNING: line over 80 characters
#202: FILE: mm/vmscan.c:2100:
+ active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);

2016-12-29 01:20:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload but
> I think we need something like that so some testing under which is known
> to make a high lowmem pressure would be really appreciated. If you have
> more time to play with it then running with and without the patch with
> mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> whether it make any difference at all.
>
> I would also appreciate if Mel and Johannes had a look at it. I am not
> yet sure whether we need the same thing for anon/file balancing in
> get_scan_count. I suspect we need but need to think more about that.
>
> Thanks a lot again!
> ---
> From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 27 Dec 2016 16:28:44 +0100
> Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
>
> get_scan_count considers the whole node LRU size when
> - doing SCAN_FILE due to many page cache inactive pages
> - calculating the number of pages to scan
>
> in both cases this might lead to unexpected behavior especially on 32b
> systems where we can expect lowmem memory pressure very often.
>
> A large highmem zone can easily distort SCAN_FILE heuristic because
> there might be only few file pages from the eligible zones on the node
> lru and we would still enforce file lru scanning which can lead to
> trashing while we could still scan anonymous pages.

Nit:
It doesn't cause thrashing because isolate_lru_pages filters them out,
but I agree it causes pointless CPU burning to find eligible pages.
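
For context, the filtering referred to here is roughly the following check
in isolate_lru_pages(), paraphrased from the 4.9-era mm/vmscan.c rather
than quoted verbatim:

	page = lru_to_page(src);

	/*
	 * Pages sitting in zones above the requested reclaim_idx cannot
	 * be reclaimed by this call; they are only moved aside and put
	 * back afterwards, so walking over them burns CPU without
	 * freeing anything.
	 */
	if (page_zonenum(page) > sc->reclaim_idx) {
		list_move(&page->lru, &pages_skipped);
		nr_skipped[page_zonenum(page)]++;
		continue;
	}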

>
> The later use of lruvec_lru_size can be problematic as well. Especially
> when there are not many pages from the eligible zones. We would have to
> skip over many pages to find anything to reclaim but shrink_node_memcg
> would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
> at maximum. Therefore we can end up going over a large LRU many times
> without actually having chance to reclaim much if anything at all. The
> closer we are out of memory on lowmem zone the worse the problem will
> be.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c98b1a585992..785b4d7fb8a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int
> }
>
> /*
> + * Return the number of pages on the given lru which are eligibne for the
eligible
> + * given zone_idx
> + */
> +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> + enum lru_list lru, int zone_idx)

Nit:

Although there is a comment, the function name is rather confusing when
I compare it with lruvec_zone_lru_size.

lruvec_eligible_zones_lru_size is better?


> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + unsigned long lru_size;
> + int zid;
> +
> + lru_size = lruvec_lru_size(lruvec, lru);
> + for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> + struct zone *zone = &pgdat->node_zones[zid];
> + unsigned long size;
> +
> + if (!managed_zone(zone))
> + continue;
> +
> + size = lruvec_zone_lru_size(lruvec, lru, zid);
> + lru_size -= min(size, lru_size);
> + }
> +
> + return lru_size;
> +}
> +
> +/*
> * Add a shrinker callback to be called from the vm.
> */
> int register_shrinker(struct shrinker *shrinker)
> @@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> * system is under heavy pressure.
> */
> if (!inactive_list_is_low(lruvec, true, sc) &&
> - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
> + lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> unsigned long size;
> unsigned long scan;
>
> - size = lruvec_lru_size(lruvec, lru);
> + size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
> scan = size >> sc->priority;
>
> if (!scan && pass && force_scan)
> --
> 2.10.2

Nit:

With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
rather than its own custom calculation to filter out non-eligible pages.

Anyway, I think this patch does the right things, so I support it.

Acked-by: Minchan Kim <[email protected]>

2016-12-29 08:53:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Thu 29-12-16 09:48:24, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
[...]
> > Acked-by: Minchan Kim <[email protected]>

Thanks!

> Nit:
>
> WARNING: line over 80 characters
> #53: FILE: include/linux/memcontrol.h:689:
> +unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
>
> WARNING: line over 80 characters
> #147: FILE: mm/vmscan.c:248:
> +unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>
> WARNING: line over 80 characters
> #177: FILE: mm/vmscan.c:1446:
> + mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);

fixed

> WARNING: line over 80 characters
> #201: FILE: mm/vmscan.c:2099:
> + inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
>
> WARNING: line over 80 characters
> #202: FILE: mm/vmscan.c:2100:
> + active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);

I would prefer to have those on the same line though. It will make them
easier to follow.

--
Michal Hocko
SUSE Labs

2016-12-29 09:05:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload but
> > I think we need something like that so some testing under which is known
> > to make a high lowmem pressure would be really appreciated. If you have
> > more time to play with it then running with and without the patch with
> > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > whether it make any difference at all.
> >
> > I would also appreciate if Mel and Johannes had a look at it. I am not
> > yet sure whether we need the same thing for anon/file balancing in
> > get_scan_count. I suspect we need but need to think more about that.
> >
> > Thanks a lot again!
> > ---
> > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> >
> > get_scan_count considers the whole node LRU size when
> > - doing SCAN_FILE due to many page cache inactive pages
> > - calculating the number of pages to scan
> >
> > in both cases this might lead to unexpected behavior especially on 32b
> > systems where we can expect lowmem memory pressure very often.
> >
> > A large highmem zone can easily distort SCAN_FILE heuristic because
> > there might be only few file pages from the eligible zones on the node
> > lru and we would still enforce file lru scanning which can lead to
> > trashing while we could still scan anonymous pages.
>
> Nit:
> It doesn't make thrashing because isolate_lru_pages filter out them
> but I agree it makes pointless CPU burning to find eligible pages.

This is not about isolate_lru_pages. The thrashing could happen if we
had a lowmem pagecache user which would constantly reclaim recently
faulted in pages while there is anonymous memory in lowmem which could
be reclaimed instead.

[...]
> > /*
> > + * Return the number of pages on the given lru which are eligibne for the
> eligible

fixed

> > + * given zone_idx
> > + */
> > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > + enum lru_list lru, int zone_idx)
>
> Nit:
>
> Although there is a comment, function name is rather confusing when I compared
> it with lruvec_zone_lru_size.

I am all for a better name.

> lruvec_eligible_zones_lru_size is better?

this would be too easy to confuse with lruvec_eligible_zone_lru_size.
What about lruvec_lru_size_eligible_zones?

> Nit:
>
> With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx rather than
> own custom calculation to filter out non-eligible pages.

Yes, that would be possible and I was considering that. But then I found
it useful to see both the total and the reduced numbers in the tracepoint
http://lkml.kernel.org/r/[email protected]
and didn't want to call lruvec_lru_size twice. But if you insist then
I can just do that.

> Anyway, I think this patch does right things so I suppose this.
>
> Acked-by: Minchan Kim <[email protected]>

Thanks for the review!

--
Michal Hocko
SUSE Labs

2016-12-30 02:05:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > > Hi,
> > > could you try to run with the following patch on top of the previous
> > > one? I do not think it will make a large change in your workload but
> > > I think we need something like that so some testing under which is known
> > > to make a high lowmem pressure would be really appreciated. If you have
> > > more time to play with it then running with and without the patch with
> > > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > > whether it make any difference at all.
> > >
> > > I would also appreciate if Mel and Johannes had a look at it. I am not
> > > yet sure whether we need the same thing for anon/file balancing in
> > > get_scan_count. I suspect we need but need to think more about that.
> > >
> > > Thanks a lot again!
> > > ---
> > > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > >
> > > get_scan_count considers the whole node LRU size when
> > > - doing SCAN_FILE due to many page cache inactive pages
> > > - calculating the number of pages to scan
> > >
> > > in both cases this might lead to unexpected behavior especially on 32b
> > > systems where we can expect lowmem memory pressure very often.
> > >
> > > A large highmem zone can easily distort SCAN_FILE heuristic because
> > > there might be only few file pages from the eligible zones on the node
> > > lru and we would still enforce file lru scanning which can lead to
> > > trashing while we could still scan anonymous pages.
> >
> > Nit:
> > It doesn't make thrashing because isolate_lru_pages filter out them
> > but I agree it makes pointless CPU burning to find eligible pages.
>
> This is not about isolate_lru_pages. The trashing could happen if we had
> lowmem pagecache user which would constantly reclaim recently faulted
> in pages while there is anonymous memory in the lowmem which could be
> reclaimed instead.
>
> [...]
> > > /*
> > > + * Return the number of pages on the given lru which are eligibne for the
> > eligible
>
> fixed
>
> > > + * given zone_idx
> > > + */
> > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > + enum lru_list lru, int zone_idx)
> >
> > Nit:
> >
> > Although there is a comment, function name is rather confusing when I compared
> > it with lruvec_zone_lru_size.
>
> I am all for a better name.
>
> > lruvec_eligible_zones_lru_size is better?
>
> this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> What about lruvec_lru_size_eligible_zones?

Don't mind.

>
> > Nit:
> >
> > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx rather than
> > own custom calculation to filter out non-eligible pages.
>
> Yes, that would be possible and I was considering that. But then I found
> useful to see total and reduced numbers in the tracepoint
> http://lkml.kernel.org/r/[email protected]
> and didn't want to call lruvec_lru_size 2 times. But if you insist then
> I can just do that.

I don't mind either but I think we need to describe the reason if you want to
go with your open-coded version. Otherwise, someone will try to fix it.

2016-12-30 10:19:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > >
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> >
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> >
> > [ 1.568174] ------------[ cut here ]------------
> > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
>
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per node zone will not match per
> zone statistics. The updated patch is below. So I hope my brain already
> works after it's been mostly off last few days...
> ---
> From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 23 Dec 2016 15:11:54 +0100
> Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> memcg is enabled
>
> Nils Holland has reported unexpected OOM killer invocations with 32b
> kernel starting with 4.8 kernels
>

I think it's unfortunate that per-zone stats are reintroduced to the
memcg structure. I can't help but think that it would have also worked
to always rotate a small number of pages if !inactive_list_is_low and
reclaiming for memcg even if it distorted page aging. However, given
that such an approach would be less robust and this has been heavily
tested;

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2016-12-30 10:40:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Fri 30-12-16 11:05:22, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> > On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
[...]
> > > > + * given zone_idx
> > > > + */
> > > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > > + enum lru_list lru, int zone_idx)
> > >
> > > Nit:
> > >
> > > Although there is a comment, the function name is rather confusing when
> > > compared with lruvec_zone_lru_size.
> >
> > I am all for a better name.
> >
> > > lruvec_eligible_zones_lru_size is better?
> >
> > this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> > What about lruvec_lru_size_eligible_zones?
>
> I don't mind.

I will go with lruvec_lru_size_eligible_zones then.

> > > Nit:
> > >
> > > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx rather than
> > > its own custom calculation to filter out non-eligible pages.
> >
> > Yes, that would be possible and I was considering it. But then I found it
> > useful to see both the total and the reduced numbers in the tracepoint
> > http://lkml.kernel.org/r/[email protected]
> > and didn't want to call lruvec_lru_size twice. But if you insist then
> > I can just do that.
>
> I don't mind either but I think we need to describe the reason if you want to
> go with your open-coded version. Otherwise, someone will try to fix it.

OK, I will go with the follow up patch on top of the tracepoints series.
I was hoping that the way how tracing is full of macros would allow us
to evaluate arguments only when the tracepoint is enabled but this
doesn't seem to be the case. Let's CC Steven. Would it be possible to
define a tracepoint in such a way that all given arguments are evaluated
only when the tracepoint is enabled?
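
(One existing option, for what it's worth: every tracepoint also gets a
generated trace_<event>_enabled() helper backed by a static key, so the call
site can skip computing expensive arguments when the event is off. A minimal
sketch against the mm_vmscan_inactive_list_is_low event from the series
linked above, illustration only:)

	if (trace && trace_mm_vmscan_inactive_list_is_low_enabled())
		trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
				sc->reclaim_idx,
				lruvec_lru_size(lruvec, inactive_lru), inactive,
				lruvec_lru_size(lruvec, active_lru), active,
				inactive_ratio, file);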
---
From 9a561d652f91f3557db22161600f10ca2462c74f Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 30 Dec 2016 11:28:20 +0100
Subject: [PATCH] mm, vmscan: clean up inactive_list_is_low

inactive_list_is_low is effectively duplicating logic implemented by
lruvec_lru_size_eligible_zones. Let's use the dedicated function to
get the number of eligible pages on the lru list and only use
lruvec_lru_size to get the total LRU size when the tracing really
requests it. We are still iterating over all LRUs twice in that
case, but a) inactive_list_is_low is not a hot path and b) this can be
addressed at the tracing layer so that arguments are evaluated only when
the tracing is enabled in the future, if that ever matters.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 38 ++++++++++----------------------------
1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 137bc85067d3..a9c881f06c0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2054,11 +2054,10 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
struct scan_control *sc, bool trace)
{
unsigned long inactive_ratio;
- unsigned long total_inactive, inactive;
- unsigned long total_active, active;
+ unsigned long inactive, active;
+ enum lru_list inactive_lru = file * LRU_FILE;
+ enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
unsigned long gb;
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- int zid;

/*
* If we don't have swap space, anonymous page deactivation
@@ -2067,27 +2066,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;

- total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
- total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
-
- /*
- * For zone-constrained allocations, it is necessary to check if
- * deactivations are required for lowmem to be reclaimed. This
- * calculates the inactive/active pages available in eligible zones.
- */
- for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
- struct zone *zone = &pgdat->node_zones[zid];
- unsigned long inactive_zone, active_zone;
-
- if (!managed_zone(zone))
- continue;
-
- inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
- active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);
-
- inactive -= min(inactive, inactive_zone);
- active -= min(active, active_zone);
- }
+ inactive = lruvec_lru_size_eligible_zones(lruvec, inactive_lru, sc->reclaim_idx);
+ active = lruvec_lru_size_eligible_zones(lruvec, active_lru, sc->reclaim_idx);

gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -2096,10 +2076,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
inactive_ratio = 1;

if (trace)
- trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+ trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
sc->reclaim_idx,
- total_inactive, inactive,
- total_active, active, inactive_ratio, file);
+ lruvec_lru_size(lruvec, inactive_lru), inactive,
+ lruvec_lru_size(lruvec, active_lru), active,
+ inactive_ratio, file);
+
return inactive * inactive_ratio < active;
}

--
2.10.2
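
(A small standalone illustration of the inactive_ratio heuristic that the
hunk above leaves intact; plain userspace C for readability, with isqrt
standing in for the kernel's int_sqrt:)

#include <stdio.h>

static unsigned long isqrt(unsigned long x)	/* stand-in for int_sqrt() */
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

int main(void)
{
	const unsigned long page_shift = 12;			/* 4KB pages */
	unsigned long lru_pages = 4UL << (30 - page_shift);	/* 4GB of LRU pages */
	unsigned long gb = lru_pages >> (30 - page_shift);	/* = 4 */
	unsigned long ratio = gb ? isqrt(10 * gb) : 1;		/* int_sqrt(40) = 6 */

	/* inactive_list_is_low() returns true once inactive * ratio < active,
	 * i.e. once the inactive list falls below ~1/(ratio + 1) of the LRU. */
	printf("gb=%lu inactive_ratio=%lu\n", gb, ratio);
	return 0;
}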

--
Michal Hocko
SUSE Labs

2016-12-30 11:05:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Fri 30-12-16 10:19:26, Mel Gorman wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > >
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > >
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > >
> > > [ 1.568174] ------------[ cut here ]------------
> > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> >
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection of the empty list
> > cannot work after my change because the per-node counters will not match
> > the per-zone statistics. The updated patch is below. So I hope my brain
> > works again after having been mostly off for the last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> > memcg is enabled
> >
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> >
>
> I think it's unfortunate that per-zone stats are reintroduced to the
> memcg structure.

the original patch I had didn't add per-zone stats but rather added a
nr_highmem counter to mem_cgroup_per_node (inside #ifdef CONFIG_HIGHMEM).
This would help for this particular case but it wouldn't work for other
lowmem requests (e.g. GFP_DMA32), and with the kmem accounting this might
become a problem in the future. So I've decided to go with a more generic
approach which requires per-zone tracking. I can't say I'm overly happy
about it, though.
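
(For context, a rough sketch of that narrower variant; the field layout is
abbreviated and the nr_highmem counter is hypothetical, illustrating the idea
rather than any posted code:)

struct mem_cgroup_per_node {
	struct lruvec		lruvec;
	unsigned long		lru_size[NR_LRU_LISTS];
#ifdef CONFIG_HIGHMEM
	/* Hypothetical: LRU pages of this memcg on this node that sit in
	 * ZONE_HIGHMEM, enough for GFP_KERNEL lowmem decisions but not
	 * for e.g. GFP_DMA32 or kmem accounting. */
	unsigned long		nr_highmem;
#endif
	/* ... remaining fields unchanged ... */
};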

> I can't help but think that it would have also worked
> to always rotate a small number of pages if !inactive_list_is_low and
> reclaiming for memcg even if it distorted page aging.

I am not really sure how that would work. Do you mean something like the
following?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa30010a5277..563ada3c02ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2044,6 +2044,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

+ if (!mem_cgroup_disabled())
+ goto out;
+
/*
* For zone-constrained allocations, it is necessary to check if
* deactivations are required for lowmem to be reclaimed. This
@@ -2063,6 +2066,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
active -= min(active, active_zone);
}

+out:
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
inactive_ratio = int_sqrt(10 * gb);

The problem I see with such an approach is that chances are that this
would reintroduce what f8d1a31163fc ("mm: consider whether to decivate
based on eligible zones inactive ratio") tried to fix. But maybe I have
missed your point.

> However, given that such an approach would be less robust and this has
> been heavily tested;
>
> Acked-by: Mel Gorman <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs

2016-12-30 12:43:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

On Fri, Dec 30, 2016 at 12:05:45PM +0100, Michal Hocko wrote:
> On Fri 30-12-16 10:19:26, Mel Gorman wrote:
> > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > >
> > > > > Nils, even though this is still highly experimental, could you give it a
> > > > > try please?
> > > >
> > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > revert the latest version of the debugging patch (the one in
> > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > is that I get this during boot:
> > > >
> > > > [ 1.568174] ------------[ cut here ]------------
> > > > [ 1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 mem_cgroup_update_lru_size+0x118/0x130
> > > > [ 1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not empty
> > >
> > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > my patch (I double account) and b) the detection of the empty list
> > > cannot work after my change because the per-node counters will not match
> > > the per-zone statistics. The updated patch is below. So I hope my brain
> > > works again after having been mostly off for the last few days...
> > > ---
> > > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
> > > memcg is enabled
> > >
> > > Nils Holland has reported unexpected OOM killer invocations with 32b
> > > kernel starting with 4.8 kernels
> > >
> >
> > I think it's unfortunate that per-zone stats are reintroduced to the
> > memcg structure.
>
> the original patch I had didn't add per-zone stats but rather added a
> nr_highmem counter to mem_cgroup_per_node (inside #ifdef CONFIG_HIGHMEM).
> This would help for this particular case but it wouldn't work for other
> lowmem requests (e.g. GFP_DMA32), and with the kmem accounting this might
> become a problem in the future.

That did occur to me.

> So I've decided to go with a more generic
> approach which requires per-zone tracking. I can't say I'm overly happy
> about it, though.
>
> > I can't help but think that it would have also worked
> > to always rotate a small number of pages if !inactive_list_is_low and
> > reclaiming for memcg even if it distorted page aging.
>
> I am not really sure how that would work. Do you mean something like the
> following?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa30010a5277..563ada3c02ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2044,6 +2044,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>
> + if (!mem_cgroup_disabled())
> + goto out;
> +
> /*
> * For zone-constrained allocations, it is necessary to check if
> * deactivations are required for lowmem to be reclaimed. This
> @@ -2063,6 +2066,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> active -= min(active, active_zone);
> }
>
> +out:
> gb = (inactive + active) >> (30 - PAGE_SHIFT);
> if (gb)
> inactive_ratio = int_sqrt(10 * gb);
>
> The problem I see with such an approach is that chances are that this
> would reintroduce what f8d1a31163fc ("mm: consider whether to decivate
> based on eligible zones inactive ratio") tried to fix. But maybe I have
> missed your point.
>

No, you didn't miss the point. Something like that was what I had in mind,
but as I thought about it, I could see some cases where it might not work
and still cause a premature OOM. The per-zone accounting is unfortunate,
but it's robust, hence the Ack.

--
Mel Gorman
SUSE Labs