Hello,
syzbot found the following crash on:
HEAD commit: 570b7bdeaf18 Add linux-next specific files for 20181009
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=13eeb685400000
kernel config: https://syzkaller.appspot.com/x/.config?x=9b5a60e1381390c4
dashboard link: https://syzkaller.appspot.com/bug?extid=77e6b28a7a7106ad0def
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000457579
RDX: 0000000000000002 RSI: 0000000000b36000 RDI: 0000000020000000
RBP: 000000000072bf00 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000008031 R11: 0000000000000246 R12: 00007f9315bfc6d4
R13: 00000000004c284a R14: 00000000004d3bd0 R15: 00000000ffffffff
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (1 GPs behind) idle=cb6/1/0x4000000000000002
softirq=64368/64369 fqs=750
rcu: (t=10505 jiffies g=81341 q=1698)
NMI backtrace for cpu 0
CPU: 0 PID: 2050 Comm: syz-executor0 Not tainted 4.19.0-rc7-next-20181009+
#90
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x3ab lib/dump_stack.c:113
nmi_cpu_backtrace.cold.2+0x5c/0xa1 lib/nmi_backtrace.c:101
nmi_trigger_cpumask_backtrace+0x1e8/0x22a lib/nmi_backtrace.c:62
arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
trigger_single_cpu_backtrace include/linux/nmi.h:162 [inline]
rcu_dump_cpu_stacks+0x16f/0x1bc kernel/rcu/tree.c:1195
print_cpu_stall.cold.67+0x1f3/0x3c7 kernel/rcu/tree.c:1334
check_cpu_stall kernel/rcu/tree.c:1408 [inline]
rcu_pending kernel/rcu/tree.c:2961 [inline]
rcu_check_callbacks+0xf38/0x13f0 kernel/rcu/tree.c:2506
update_process_times+0x2d/0x70 kernel/time/timer.c:1636
tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
__run_hrtimer kernel/time/hrtimer.c:1398 [inline]
__hrtimer_run_queues+0x412/0x10c0 kernel/time/hrtimer.c:1460
hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1034 [inline]
smp_apic_timer_interrupt+0x1a1/0x750 arch/x86/kernel/apic/apic.c:1059
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:804
</IRQ>
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:761
[inline]
RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
dump_header+0x27b/0xf72 mm/oom_kill.c:441
out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
mem_cgroup_oom mm/memcontrol.c:1701 [inline]
try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
shmem_fault+0x25f/0x960 mm/shmem.c:1982
__do_fault+0x100/0x6b0 mm/memory.c:2996
do_read_fault mm/memory.c:3408 [inline]
do_fault mm/memory.c:3531 [inline]
handle_pte_fault mm/memory.c:3762 [inline]
__handle_mm_fault+0x3d40/0x5a40 mm/memory.c:3886
handle_mm_fault+0x54f/0xc70 mm/memory.c:3923
faultin_page mm/gup.c:518 [inline]
__get_user_pages+0x806/0x1b30 mm/gup.c:718
populate_vma_page_range+0x2db/0x3d0 mm/gup.c:1222
__mm_populate+0x286/0x4d0 mm/gup.c:1270
mm_populate include/linux/mm.h:2311 [inline]
vm_mmap_pgoff+0x27f/0x2c0 mm/util.c:362
ksys_mmap_pgoff+0xf1/0x660 mm/mmap.c:1606
__do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
__se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
__x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457579
Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f9315bfbc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000457579
RDX: 0000000000000002 RSI: 0000000000b36000 RDI: 0000000020000000
RBP: 000000000072bf00 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000008031 R11: 0000000000000246 R12: 00007f9315bfc6d4
R13: 00000000004c284a R14: 00000000004d3bd0 R15: 00000000ffffffff
Memory limit reached of cgroup /syz0
memory: usage 205164kB, limit 204800kB, failcnt 6901
memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /syz0: cache:680KB rss:176336KB rss_huge:163840KB
shmem:740KB mapped_file:660KB dirty:0KB writeback:0KB swap:0KB
inactive_anon:708KB active_anon:176448KB inactive_file:4KB active_file:0KB
unevictable:0KB
Out of memory and no killable processes...
(...the same memcg OOM report from syz-executor0 (PID 2050) repeats many more
times, with only the usage/failcnt counters slowly advancing
(usage 205168kB..205196kB, failcnt 6909..6965); the repeated reports are
snipped here...)
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot is hitting RCU stall due to memcg-OOM event.
https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
What should we do if memcg-OOM found no killable task because the allocating task
was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
(which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
OOM header when no eligible victim left") because syzbot was terminating the test
upon WARN(1) removed by that commit) is not a good behavior.
syz-executor0 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=-1000
syz-executor0 cpuset=syz0 mems_allowed=0
CPU: 0 PID: 2050 Comm: syz-executor0 Not tainted 4.19.0-rc7-next-20181009+ #90
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
(...snipped...)
Memory limit reached of cgroup /syz0
memory: usage 205168kB, limit 204800kB, failcnt 6909
memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /syz0: cache:680KB rss:176336KB rss_huge:163840KB shmem:740KB mapped_file:660KB dirty:0KB writeback:0KB swap:0KB inactive_anon:712KB active_anon:176448KB inactive_file:0KB active_file:4KB unevictable:0KB
Out of memory and no killable processes...
On Wed, 10 Oct 2018, Tetsuo Handa wrote:
> syzbot is hitting RCU stall due to memcg-OOM event.
> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
>
> What should we do if memcg-OOM found no killable task because the allocating task
> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
> OOM header when no eligible victim left") because syzbot was terminating the test
> upon WARN(1) removed by that commit) is not a good behavior.
>
Not printing anything would be the obvious solution but the ideal solution
would probably involve
- adding feedback to the memcg oom killer that there are no killable
processes,
- adding complete coverage for memcg_oom_recover() in all uncharge paths
where the oom memcg's page_counter is decremented, and
- having all processes stall until memcg_oom_recover() is called so
looping back into try_charge() has a reasonable expectation to succeed.
On Wed, Oct 10, 2018 at 6:11 AM, 'David Rientjes' via syzkaller-bugs
<[email protected]> wrote:
> On Wed, 10 Oct 2018, Tetsuo Handa wrote:
>
>> syzbot is hitting RCU stall due to memcg-OOM event.
>> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
>>
>> What should we do if memcg-OOM found no killable task because the allocating task
>> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
>> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
>> OOM header when no eligible victim left") because syzbot was terminating the test
>> upon WARN(1) removed by that commit) is not a good behavior.
You want to say that most of the recent hangs and stalls are actually
caused by our attempt to sandbox test processes with memory cgroup?
The process with oom_score_adj == -1000 is not supposed to consume any
significant memory; we have another (test) process with oom_score_adj
== 0 that's actually consuming memory.
But should we refrain from using -1000? Perhaps it would be better to
use -500/500 for control/test process, or -999/1000?
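For context, a minimal userspace sketch of the sandboxing setup being
discussed (illustrative only, not the actual syzkaller executor code; the
helper name and the exact adj values are assumptions):

#include <stdio.h>
#include <stdlib.h>

/*
 * Illustrative sketch only (not the actual syzkaller executor code):
 * the supervisor/control process marks itself OOM-unkillable via
 * /proc/self/oom_score_adj, while the test child keeps (or restores)
 * a killable score such as 0 or +500.
 */
static void set_oom_score_adj(int adj)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f) {
		perror("open /proc/self/oom_score_adj");
		exit(1);
	}
	fprintf(f, "%d", adj);	/* -1000 disables OOM killing for this task */
	fclose(f);
}

int main(void)
{
	set_oom_score_adj(-1000);	/* or -500/-999, as discussed above */
	/*
	 * fork() the test process here; it inherits the value and can then
	 * raise it back (e.g. to 0 or +500) before running the tests.
	 */
	return 0;
}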
> Not printing anything would be the obvious solution but the ideal solution
> would probably involve
>
> - adding feedback to the memcg oom killer that there are no killable
> processes,
>
> - adding complete coverage for memcg_oom_recover() in all uncharge paths
> where the oom memcg's page_counter is decremented, and
>
> - having all processes stall until memcg_oom_recover() is called so
> looping back into try_charge() has a reasonable expectation to succeed.
>
On Wed 10-10-18 09:12:45, Tetsuo Handa wrote:
> syzbot is hitting RCU stall due to memcg-OOM event.
> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
This is really interesting. If we do not have any eligible oom victim we
simply force the charge (allow to proceed and go over the hard limit)
and break the isolation. That means that the caller gets back to running
and releases all locks taken on the way. I am wondering how come we are
seeing the RCU stall. Who is holding the rcu lock? Certainly not the
charge path and neither should the caller because you have to be in a
sleepable context to trigger the OOM killer. So there must be something
more going on.
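A simplified sketch of the charge-path behavior described above (paraphrased
for illustration, not verbatim 4.19 mm/memcontrol.c): when the memcg OOM
killer cannot find an eligible victim, try_charge() falls through to a
force-charge that lets usage exceed the hard limit so the caller can return
and drop its locks.

	/* Simplified paraphrase of the 4.19-era try_charge() tail, for
	 * illustration only (not verbatim kernel code): */
	switch (mem_cgroup_oom(mem_over_limit, gfp_mask,
			       get_order(nr_pages * PAGE_SIZE))) {
	case OOM_SUCCESS:	/* a victim was killed; retry the charge */
		goto retry;
	case OOM_FAILED:	/* no eligible victim; bypass the limit */
		goto force;
	default:
		goto nomem;
	}
	...
force:
	/* Allow usage to go over the hard limit temporarily so the caller
	 * can make progress and release any locks it holds. */
	page_counter_charge(&memcg->memory, nr_pages);
	return 0;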
> What should we do if memcg-OOM found no killable task because the allocating task
> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
> OOM header when no eligible victim left") because syzbot was terminating the test
> upon WARN(1) removed by that commit) is not a good behavior.
We definitely want to inform about an ineligible oom victim. We might
consider some rate limiting for the memcg state but that is valuable
information to see under normal situations (when you do not have floods
of these situations).
--
Michal Hocko
SUSE Labs
On Tue 09-10-18 21:11:48, David Rientjes wrote:
> On Wed, 10 Oct 2018, Tetsuo Handa wrote:
>
> > syzbot is hitting RCU stall due to memcg-OOM event.
> > https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
> >
> > What should we do if memcg-OOM found no killable task because the allocating task
> > was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
> > (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
> > OOM header when no eligible victim left") because syzbot was terminating the test
> > upon WARN(1) removed by that commit) is not a good behavior.
> >
>
> Not printing anything would be the obvious solution but the ideal solution
> would probably involve
>
> - adding feedback to the memcg oom killer that there are no killable
> processes,
We already have that - out_of_memory() == false
> - adding complete coverage for memcg_oom_recover() in all uncharge paths
> where the oom memcg's page_counter is decremented, and
Could you elaborate?
> - having all processes stall until memcg_oom_recover() is called so
> looping back into try_charge() has a reasonable expectation to succeed.
You cannot stall in the charge path waiting for others to make forward
progress because we would be back to oom deadlocks when nobody can make
forward progress due to lock dependencies.
Right now we simply force the charge and allow for further progress when
situations like this happen because this shouldn't happen unless the
memcg is misconfigured badly.
--
Michal Hocko
SUSE Labs
On Wed 10-10-18 09:55:57, Dmitry Vyukov wrote:
> On Wed, Oct 10, 2018 at 6:11 AM, 'David Rientjes' via syzkaller-bugs
> <[email protected]> wrote:
> > On Wed, 10 Oct 2018, Tetsuo Handa wrote:
> >
> >> syzbot is hitting RCU stall due to memcg-OOM event.
> >> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
> >>
> >> What should we do if memcg-OOM found no killable task because the allocating task
> >> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
> >> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
> >> OOM header when no eligible victim left") because syzbot was terminating the test
> >> upon WARN(1) removed by that commit) is not a good behavior.
>
>
> You want to say that most of the recent hangs and stalls are actually
> caused by our attempt to sandbox test processes with memory cgroup?
> The process with oom_score_adj == -1000 is not supposed to consume any
> significant memory; we have another (test) process with oom_score_adj
> == 0 that's actually consuming memory.
> But should we refrain from using -1000? Perhaps it would be better to
> use -500/500 for control/test process, or -999/1000?
oom disable on a task (especially when this is the only task in the
memcg) is tricky. Look at the memcg report
[ 935.562389] Memory limit reached of cgroup /syz0
[ 935.567398] memory: usage 204808kB, limit 204800kB, failcnt 6081
[ 935.573768] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 935.580650] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 935.586923] Memory cgroup stats for /syz0: cache:152KB rss:176336KB rss_huge:163840KB shmem:344KB mapped_file:264KB dirty:0KB writeback:0KB swap:0KB inactive_anon:260KB active_anon:176448KB inactive_file:4KB active_file:0KB
There is still somebody holding anonymous (THP) memory. If there is no
other eligible oom victim then it must be some of the oom disabled ones.
You have suppressed the task list information so we do not know who that
might be though.
So it looks like there is some misconfiguration or a bug in the oom
victim selection.
--
Michal Hocko
SUSE Labs
On Wed, Oct 10, 2018 at 11:13 AM, Michal Hocko <[email protected]> wrote:
> On Wed 10-10-18 09:55:57, Dmitry Vyukov wrote:
>> On Wed, Oct 10, 2018 at 6:11 AM, 'David Rientjes' via syzkaller-bugs
>> <[email protected]> wrote:
>> > On Wed, 10 Oct 2018, Tetsuo Handa wrote:
>> >
>> >> syzbot is hitting RCU stall due to memcg-OOM event.
>> >> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
>> >>
>> >> What should we do if memcg-OOM found no killable task because the allocating task
>> >> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
>> >> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
>> >> OOM header when no eligible victim left") because syzbot was terminating the test
>> >> upon WARN(1) removed by that commit) is not a good behavior.
>>
>>
>> You want to say that most of the recent hangs and stalls are actually
>> caused by our attempt to sandbox test processes with memory cgroup?
>> The process with oom_score_adj == -1000 is not supposed to consume any
>> significant memory; we have another (test) process with oom_score_adj
>> == 0 that's actually consuming memory.
>> But should we refrain from using -1000? Perhaps it would be better to
>> use -500/500 for control/test process, or -999/1000?
>
> oom disable on a task (especially when this is the only task in the
> memcg) is tricky. Look at the memcg report
> [ 935.562389] Memory limit reached of cgroup /syz0
> [ 935.567398] memory: usage 204808kB, limit 204800kB, failcnt 6081
> [ 935.573768] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 935.580650] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 935.586923] Memory cgroup stats for /syz0: cache:152KB rss:176336KB rss_huge:163840KB shmem:344KB mapped_file:264KB dirty:0KB writeback:0KB swap:0KB inactive_anon:260KB active_anon:176448KB inactive_file:4KB active_file:0KB
>
> There is still somebody holding anonymous (THP) memory. If there is no
> other eligible oom victim then it must be some of the oom disabled ones.
> You have suppressed the task list information so we do not know who that
> might be though.
>
> So it looks like there is some misconfiguration or a bug in the oom
> victim selection.
I'm afraid KASAN can interfere with memory accounting/OOM killing too.
KASAN quarantines up to 1/32 of physical memory (in our case
7.5GB/32 = 230MB) that is already freed by the task, but as far as I
understand it is still accounted against the memcg. So maybe making the
cgroup limit >> quarantine size will help to resolve this too.
But of course there can be a plain memory leak too.
On 2018/10/10 17:59, Michal Hocko wrote:
> On Wed 10-10-18 09:12:45, Tetsuo Handa wrote:
>> syzbot is hitting RCU stall due to memcg-OOM event.
>> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
>
> This is really interesting. If we do not have any eligible oom victim we
> simply force the charge (allow to proceed and go over the hard limit)
> and break the isolation. That means that the caller gets back to running
> and realease all locks take on the way.
What happens if the caller keeps trying to allocate more memory
because the caller cannot be sent SIGKILL by the OOM killer?
> I am wondering how come we are
> seeing the RCU stall. Whole is holding the rcu lock? Certainly not the
> charge patch and neither should the caller because you have to be in a
> sleepable context to trigger the OOM killer. So there must be something
> more going on.
Just flooding out of memory messages can trigger RCU stall problems.
For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
INFO: rcu detected stall in filemap_fault
https://syzkaller.appspot.com/bug?id=8e7f5412a78197a2e0f848fa513c2e7f0071ffa2
INFO: rcu detected stall in show_free_areas
https://syzkaller.appspot.com/bug?id=b2cc06dd0a76e7ca92aa8d13ef4227cb7fd0d217
INFO: rcu detected stall in proc_reg_read
https://syzkaller.appspot.com/bug?id=0d6a21d39c8ef7072c695dea11095df6c07c79af
INFO: rcu detected stall in call_timer_fn
https://syzkaller.appspot.com/bug?id=88a07e525266567efe221f7a4a05511c032e5822
INFO: rcu detected stall in br_multicast_port_group_expired (2)
https://syzkaller.appspot.com/bug?id=15c7ad8cf35a07059e8a697a22527e11d294bc94
INFO: rcu detected stall in tun_chr_close
https://syzkaller.appspot.com/bug?id=6c50618bde03e5a2eefdd0269cf9739c5ebb8270
INFO: rcu detected stall in discover_timer
https://syzkaller.appspot.com/bug?id=55da031ddb910e58ab9c6853a5784efd94f03b54
INFO: rcu detected stall in ret_from_fork (2)
https://syzkaller.appspot.com/bug?id=c83129a6683b44b39f5b8864a1325893c9218363
INFO: rcu detected stall in addrconf_rs_timer
https://syzkaller.appspot.com/bug?id=21c029af65f81488edbc07a10ed20792444711b6
INFO: rcu detected stall in kthread (2)
https://syzkaller.appspot.com/bug?id=6accd1ed11c31110fed1982f6ad38cc9676477d2
INFO: rcu detected stall in ext4_filemap_fault
https://syzkaller.appspot.com/bug?id=817e38d20e9ee53390ac361bf0fd2007eaf188af
INFO: rcu detected stall in run_timer_softirq (2)
https://syzkaller.appspot.com/bug?id=f5a230a3ff7822f8d39fddf8485931bd06ae47fe
INFO: rcu detected stall in bpf_prog_ADDR
https://syzkaller.appspot.com/bug?id=fb4911fd0e861171cc55124e209f810a0dd68744
INFO: rcu detected stall in __run_timers (2)
https://syzkaller.appspot.com/bug?id=65416569ddc8d2feb8f19066aa761f5a47f7451a
reports.
>
>> What should we do if memcg-OOM found no killable task because the allocating task
>> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
>> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
>> OOM header when no eligible victim left") because syzbot was terminating the test
>> upon WARN(1) removed by that commit) is not a good behavior.
>
> We definitely want to inform about ineligible oom victim. We might
> consider some rate limiting for the memcg state but that is a valuable
> information to see under normal situation (when you do not have floods
> of these situations).
>
But if the caller cannot be sent SIGKILL by the OOM killer,
allowing the caller to trigger the OOM killer again and again (until
the global OOM killer triggers) is bad.
On Wed 10-10-18 19:43:38, Tetsuo Handa wrote:
> On 2018/10/10 17:59, Michal Hocko wrote:
> > On Wed 10-10-18 09:12:45, Tetsuo Handa wrote:
> >> syzbot is hitting RCU stall due to memcg-OOM event.
> >> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
> >
> > This is really interesting. If we do not have any eligible oom victim we
> > simply force the charge (allow to proceed and go over the hard limit)
> > and break the isolation. That means that the caller gets back to running
> > and realease all locks take on the way.
>
> What happens if the caller continued trying to allocate more memory
> because the caller cannot be noticed by SIGKILL from the OOM killer?
It could eventually trigger the global OOM.
> > I am wondering how come we are
> > seeing the RCU stall. Whole is holding the rcu lock? Certainly not the
> > charge patch and neither should the caller because you have to be in a
> > sleepable context to trigger the OOM killer. So there must be something
> > more going on.
>
> Just flooding out of memory messages can trigger RCU stall problems.
> For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
[...]
Quite some of them, indeed! I guess we want to rate limit the output.
What about the following?
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..4ee393c85e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -430,6 +430,9 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
static void dump_header(struct oom_control *oc, struct task_struct *p)
{
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
current->comm, oc->gfp_mask, &oc->gfp_mask,
nodemask_pr_args(oc->nodemask), oc->order,
@@ -437,6 +440,9 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
pr_warn("COMPACTION is disabled!!!\n");
+ if (!__ratelimit(&oom_rs))
+ return;
+
cpuset_print_current_mems_allowed();
dump_stack();
if (is_memcg_oom(oc))
@@ -931,8 +937,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
struct task_struct *t;
struct mem_cgroup *oom_group;
unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -949,8 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
task_unlock(p);
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
+ dump_header(oc, p);
pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
> >> What should we do if memcg-OOM found no killable task because the allocating task
> >> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
> >> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
> >> OOM header when no eligible victim left") because syzbot was terminating the test
> >> upon WARN(1) removed by that commit) is not a good behavior.
> >
> > We definitely want to inform about ineligible oom victim. We might
> > consider some rate limiting for the memcg state but that is a valuable
> > information to see under normal situation (when you do not have floods
> > of these situations).
> >
>
> But if the caller cannot be noticed by SIGKILL from the OOM killer,
> allowing the caller to trigger the OOM killer again and again (until
> global OOM killer triggers) is bad.
There is simply no other option. Well, except for failing the charge
which has been considered and refused because it could trigger
unexpected error paths and that breaking the isolation in rare cases
of misconfiguration is acceptable. We can reconsider that
but you should bring really good arguments on the table. I was not very
successful doing that.
--
Michal Hocko
SUSE Labs
On (10/10/18 13:35), Michal Hocko wrote:
> > Just flooding out of memory messages can trigger RCU stall problems.
> > For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
>
> [...]
>
> Quite some of them, indeed! I guess we want to rate limit the output.
> What about the following?
A bit unrelated, but while we are at it:
I like it when we rate-limit printk-s that lock up the system.
But it seems that default rate-limit values are not always good enough,
DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
verbose. For instance, when we have a very slow IPMI emulated serial
console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
than we evict them.
Does it sound reasonable enough to use larger than default rate-limits
for printk-s in OOM print-outs? OOM reports tend to be somewhat large
and the reported numbers are not always *very* unique.
What do you think?
-ss
On Wed 10-10-18 20:48:33, Sergey Senozhatsky wrote:
> On (10/10/18 13:35), Michal Hocko wrote:
> > > Just flooding out of memory messages can trigger RCU stall problems.
> > > For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
> >
> > [...]
> >
> > Quite some of them, indeed! I guess we want to rate limit the output.
> > What about the following?
>
> A bit unrelated, but while we are at it:
>
> I like it when we rate-limit printk-s that lookup the system.
> But it seems that default rate-limit values are not always good enough,
> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
> verbose. For instance, when we have a very slow IPMI emulated serial
> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
> than we evict them.
>
> Does it sound reasonable enough to use larger than default rate-limits
> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
> and the reported numbers are not always *very* unique.
>
> What do you think?
I do not really care about the current interval/burst values. This change
should be done separately and ideally with some numbers.
--
Michal Hocko
SUSE Labs
On Wed, Oct 10, 2018 at 2:25 PM, Michal Hocko <[email protected]> wrote:
> On Wed 10-10-18 20:48:33, Sergey Senozhatsky wrote:
>> On (10/10/18 13:35), Michal Hocko wrote:
>> > > Just flooding out of memory messages can trigger RCU stall problems.
>> > > For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
>> >
>> > [...]
>> >
>> > Quite some of them, indeed! I guess we want to rate limit the output.
>> > What about the following?
>>
>> A bit unrelated, but while we are at it:
>>
>> I like it when we rate-limit printk-s that lookup the system.
>> But it seems that default rate-limit values are not always good enough,
>> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
>> verbose. For instance, when we have a very slow IPMI emulated serial
>> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
>> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
>> than we evict them.
>>
>> Does it sound reasonable enough to use larger than default rate-limits
>> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
>> and the reported numbers are not always *very* unique.
>>
>> What do you think?
>
> I do not really care about the current inerval/burst values. This change
> should be done seprately and ideally with some numbers.
I think Sergey meant that this place may need to use
larger-than-default values because it prints lots of output per
instance (whereas the default limit is more tuned for cases that print
just 1 line).
I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
Probably we need something similar here.
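Concretely, a larger-than-default rate limit in the dump_header() patch above
would presumably just mean a longer interval in the ratelimit state, along
the lines of this untested sketch:

	/* Hypothetical variant of the ratelimit state added to dump_header()
	 * in the patch above, with a 10x longer interval as in the btrfs
	 * example linked above: */
	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL * 10,
				      DEFAULT_RATELIMIT_BURST);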
On Wed, Oct 10, 2018 at 2:29 PM, Dmitry Vyukov <[email protected]> wrote:
> On Wed, Oct 10, 2018 at 2:25 PM, Michal Hocko <[email protected]> wrote:
>> On Wed 10-10-18 20:48:33, Sergey Senozhatsky wrote:
>>> On (10/10/18 13:35), Michal Hocko wrote:
>>> > > Just flooding out of memory messages can trigger RCU stall problems.
>>> > > For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
>>> >
>>> > [...]
>>> >
>>> > Quite some of them, indeed! I guess we want to rate limit the output.
>>> > What about the following?
>>>
>>> A bit unrelated, but while we are at it:
>>>
>>> I like it when we rate-limit printk-s that lookup the system.
>>> But it seems that default rate-limit values are not always good enough,
>>> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
>>> verbose. For instance, when we have a very slow IPMI emulated serial
>>> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
>>> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
>>> than we evict them.
>>>
>>> Does it sound reasonable enough to use larger than default rate-limits
>>> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
>>> and the reported numbers are not always *very* unique.
>>>
>>> What do you think?
>>
>> I do not really care about the current inerval/burst values. This change
>> should be done seprately and ideally with some numbers.
>
> I think Sergey meant that this place may need to use
> larger-than-default values because it prints lots of output per
> instance (whereas the default limit is more tuned for cases that print
> just 1 line).
>
> I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
> https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
> Probably we need something similar here.
In parallel with the kernel changes I've also made a change to
syzkaller that (1) makes it not use oom_score_adj=-1000, as this hard
no-kill setting looks like quite a risky thing, and (2) increases the memcg
size beyond the expected KASAN quarantine size:
https://github.com/google/syzkaller/commit/adedaf77a18f3d03d695723c86fc083c3551ff5b
If this will stop the flow of hang/stall reports, then we can just
close all old reports as invalid.
On 2018/10/10 21:36, Dmitry Vyukov wrote:
> On Wed, Oct 10, 2018 at 2:29 PM, Dmitry Vyukov <[email protected]> wrote:
>> On Wed, Oct 10, 2018 at 2:25 PM, Michal Hocko <[email protected]> wrote:
>>> On Wed 10-10-18 20:48:33, Sergey Senozhatsky wrote:
>>>> On (10/10/18 13:35), Michal Hocko wrote:
>>>>>> Just flooding out of memory messages can trigger RCU stall problems.
>>>>>> For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
>>>>>
>>>>> [...]
>>>>>
>>>>> Quite some of them, indeed! I guess we want to rate limit the output.
>>>>> What about the following?
>>>>
>>>> A bit unrelated, but while we are at it:
>>>>
>>>> I like it when we rate-limit printk-s that lookup the system.
>>>> But it seems that default rate-limit values are not always good enough,
>>>> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
>>>> verbose. For instance, when we have a very slow IPMI emulated serial
>>>> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
>>>> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
>>>> than we evict them.
>>>>
>>>> Does it sound reasonable enough to use larger than default rate-limits
>>>> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
>>>> and the reported numbers are not always *very* unique.
>>>>
>>>> What do you think?
>>>
>>> I do not really care about the current inerval/burst values. This change
>>> should be done seprately and ideally with some numbers.
>>
>> I think Sergey meant that this place may need to use
>> larger-than-default values because it prints lots of output per
>> instance (whereas the default limit is more tuned for cases that print
>> just 1 line).
Yes. The OOM killer tends to print a lot of messages (and I suspect that
mutex_trylock(&oom_lock) contention wastes even more CPU time through
preemption).
>>
>> I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
>> https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
>> Probably we need something similar here.
Since printk() is a significantly CPU-consuming operation, I think what we
need to guarantee is that the interval between the end of one OOM killer
report and the beginning of the next OOM killer report is large enough.
For example, set up a timer with a 5-second timeout at the end of an OOM
killer report, and at the beginning of the next OOM killer report check
whether that timer has already fired.
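A minimal sketch of that end-to-start interval idea (the names below are made
up purely for illustration; this is not a posted patch):

#include <linux/jiffies.h>

/* jiffies value recorded after the last line of the previous OOM report;
 * 0 means no report has finished yet.
 */
static unsigned long oom_report_end_stamp;

/* Allow a new report only once 5 seconds have passed since the *end* of
 * the previous one, so a report that itself takes seconds to reach a slow
 * console cannot eat up the whole interval.
 */
static bool oom_report_interval_elapsed(void)
{
	return !oom_report_end_stamp ||
	       time_after(jiffies, oom_report_end_stamp + 5 * HZ);
}

/* To be called right after the last printk() of a report (never store 0). */
static void oom_report_finished(void)
{
	oom_report_end_stamp = jiffies ?: 1;
}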
>
>
> In parallel with the kernel changes I've also made a change to
> syzkaller that (1) makes it not use oom_score_adj=-1000, this hard
> killing limit looks like quite risky thing, (2) increase memcg size
> beyond expected KASAN quarantine size:
> https://github.com/google/syzkaller/commit/adedaf77a18f3d03d695723c86fc083c3551ff5b
> If this will stop the flow of hang/stall reports, then we can just
> close all old reports as invalid.
I don't think so. Only this report was different from the others, because the
printk() flood in this report came from memcg OOM events without an eligible
task, whereas the printk() floods in the others came from global OOM events
triggered by a severe slab memory leak.
On Wed, Oct 10, 2018 at 3:10 PM, Tetsuo Handa
<[email protected]> wrote:
>>>>>>> Just flooding out of memory messages can trigger RCU stall problems.
>>>>>>> For example, a severe skbuff_head_cache or kmalloc-512 leak bug is causing
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> Quite some of them, indeed! I guess we want to rate limit the output.
>>>>>> What about the following?
>>>>>
>>>>> A bit unrelated, but while we are at it:
>>>>>
>>>>> I like it when we rate-limit printk-s that lookup the system.
>>>>> But it seems that default rate-limit values are not always good enough,
>>>>> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
>>>>> verbose. For instance, when we have a very slow IPMI emulated serial
>>>>> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
>>>>> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
>>>>> than we evict them.
>>>>>
>>>>> Does it sound reasonable enough to use larger than default rate-limits
>>>>> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
>>>>> and the reported numbers are not always *very* unique.
>>>>>
>>>>> What do you think?
>>>>
>>>> I do not really care about the current interval/burst values. This change
>>>> should be done separately and ideally with some numbers.
>>>
>>> I think Sergey meant that this place may need to use
>>> larger-than-default values because it prints lots of output per
>>> instance (whereas the default limit is more tuned for cases that print
>>> just 1 line).
>
> Yes. The OOM killer tends to print a lot of messages (and I estimate that
> mutex_trylock(&oom_lock) accelerates wasting more CPU consumption by
> preemption).
>
>>>
>>> I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
>>> https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
>>> Probably we need something similar here.
>
> Since printk() is a significantly CPU consuming operation, I think that what
> we need to guarantee is interval between the end of an OOM killer messages
> and the beginning of next OOM killer messages is large enough. For example,
> setup a timer with 5 seconds timeout upon the end of an OOM killer messages
> and check whether the timer already fired upon the beginning of next OOM killer
> messages.
>
>>
>>
>> In parallel with the kernel changes I've also made a change to
>> syzkaller that (1) makes it not use oom_score_adj=-1000, this hard
>> killing limit looks like quite risky thing, (2) increase memcg size
>> beyond expected KASAN quarantine size:
>> https://github.com/google/syzkaller/commit/adedaf77a18f3d03d695723c86fc083c3551ff5b
>> If this will stop the flow of hang/stall reports, then we can just
>> close all old reports as invalid.
>
> I don't think so. Only this report was different from others because printk()
> in this report was from memcg OOM events without eligible tasks whereas printk()
> in others are from global OOM events triggered by severe slab memory leak.
Ack.
I guess I just hoped deep down that we somehow magically get rid of
all these reports with some simple change like this :)
On 2018/10/10 20:35, Michal Hocko wrote:
>>>> What should we do if memcg-OOM found no killable task because the allocating task
>>>> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
>>>> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
>>>> OOM header when no eligible victim left") because syzbot was terminating the test
>>>> upon WARN(1) removed by that commit) is not a good behavior.
>>>
>>> We definitely want to inform about ineligible oom victim. We might
>>> consider some rate limiting for the memcg state but that is a valuable
>>> information to see under normal situation (when you do not have floods
>>> of these situations).
>>>
>>
>> But if the caller cannot be noticed by SIGKILL from the OOM killer,
>> allowing the caller to trigger the OOM killer again and again (until
>> global OOM killer triggers) is bad.
>
> There is simply no other option. Well, except for failing the charge
> which has been considered and refused because it could trigger
> unexpected error paths, and because breaking the isolation in the rare
> case of such a misconfiguration is acceptable. We can reconsider that
> but you should bring really good arguments on the table. I was very
> successful doing that.
>
By the way, how do we avoid this flooding? Something like this?
include/linux/sched.h | 1 +
mm/oom_kill.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..58eff50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -723,6 +723,7 @@ struct task_struct {
#endif
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
+ unsigned memcg_oom_no_eligible_warned:1;
#ifdef CONFIG_MEMCG_KMEM
unsigned memcg_kmem_skip_account:1;
#endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..ff0fa65 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,13 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! */
if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+ if (is_memcg_oom(oc)) {
+ if (current->memcg_oom_no_eligible_warned)
+ return false;
+ current->memcg_oom_no_eligible_warned = 1;
+ }
+#endif
dump_header(oc, NULL);
pr_warn("Out of memory and no killable processes...\n");
/*
@@ -1115,6 +1122,10 @@ bool out_of_memory(struct oom_control *oc)
*/
if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+ } else if (is_memcg_oom(oc)) {
+ current->memcg_oom_no_eligible_warned = 0;
+#endif
}
if (oc->chosen && oc->chosen != (void *)-1UL)
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
--
1.8.3.1
From: Michal Hocko <[email protected]>
syzbot has noticed that it can trigger RCU stalls from the memcg oom
path:
RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
dump_header+0x27b/0xf72 mm/oom_kill.c:441
out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
mem_cgroup_oom mm/memcontrol.c:1701 [inline]
try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
shmem_fault+0x25f/0x960 mm/shmem.c:1982
__do_fault+0x100/0x6b0 mm/memory.c:2996
do_read_fault mm/memory.c:3408 [inline]
do_fault mm/memory.c:3531 [inline]
The primary reason for the stall lies in the expensive printk handling
of the oom report flood: a misconfiguration on the syzbot side meant that
there was simply no eligible task, because all tasks had
OOM_SCORE_ADJ_MIN set. This generates an oom report for each allocation
from the memcg context.
While normal workloads should be much more careful about potential heavy
memory consumers that are OOM disabled, it makes some sense to rate limit
the potentially expensive oom reports for cases when no eligible victim
is found. Do that by moving the rate limit logic inside dump_header.
We no longer rely on the caller to do that. It was only oom_kill_process
which has been throttling. The other two call sites simply didn't have to
care, because one just panics on the OOM when configured that way, and
the no-eligible-task case would panic for the global case as well. Memcg
changed the picture because we do not panic and we might have multiple
sources of the same event.
Once we are here, make sure that the reason to trigger the OOM is
printed without ratelimiting because this is really valuable to
debug what happened.
Reported-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..4ee393c85e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -430,6 +430,9 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
static void dump_header(struct oom_control *oc, struct task_struct *p)
{
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
current->comm, oc->gfp_mask, &oc->gfp_mask,
nodemask_pr_args(oc->nodemask), oc->order,
@@ -437,6 +440,9 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
pr_warn("COMPACTION is disabled!!!\n");
+ if (!__ratelimit(&oom_rs))
+ return;
+
cpuset_print_current_mems_allowed();
dump_stack();
if (is_memcg_oom(oc))
@@ -931,8 +937,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
struct task_struct *t;
struct mem_cgroup *oom_group;
unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -949,8 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
task_unlock(p);
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
+ dump_header(oc, p);
pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
--
2.19.0
On (10/10/18 14:29), Dmitry Vyukov wrote:
> >> A bit unrelated, but while we are at it:
> >>
> >> I like it when we rate-limit printk-s that lookup the system.
> >> But it seems that default rate-limit values are not always good enough,
> >> DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST can still be too
> >> verbose. For instance, when we have a very slow IPMI emulated serial
> >> console -- e.g. baud rate at 57600. DEFAULT_RATELIMIT_INTERVAL and
> >> DEFAULT_RATELIMIT_BURST can add new OOM headers and backtraces faster
> >> than we evict them.
> >>
> >> Does it sound reasonable enough to use larger than default rate-limits
> >> for printk-s in OOM print-outs? OOM reports tend to be somewhat large
> >> and the reported numbers are not always *very* unique.
> >>
> >> What do you think?
> >
> > I do not really care about the current interval/burst values. This change
> > should be done separately and ideally with some numbers.
>
> I think Sergey meant that this place may need to use
> larger-than-default values because it prints lots of output per
> instance (whereas the default limit is more tuned for cases that print
> just 1 line).
>
> I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
> https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
> Probably we need something similar here.
Yes, Dmitry, that's what I meant - to use something like
DEFAULT_RATELIMIT_INTERVAL * 10 in OOM. I didn't mean to change
the default values system wide.
---
We are not rate-limiting a single annoying printk() in OOM, but
functions that do a whole bunch of printk()s - OOM header, backtraces, etc.
Thus an OOM report can be, I don't know, 50 or 70 or 100 lines (who knows).
That's why the rate-limit in OOM is more permissive in terms of the number
of printed lines. When we rate-limit a single printk() we let through 10
printk()s /* 10 lines */ max every 5 seconds, while in OOM this transforms
into 10 dump_header() + 10 oom_kill_process() calls every 5 seconds. That
can still be too many printk()s - enough to lock up the system.
-ss
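For reference, DEFAULT_RATELIMIT_INTERVAL is 5 * HZ and DEFAULT_RATELIMIT_BURST
is 10 (include/linux/ratelimit.h), so the arithmetic above is at most 10 reports
per 5 seconds. A minimal sketch of the more permissive, OOM-specific limit being
suggested here, along the lines of the btrfs call site referenced above (the
helper and state names are made up for illustration; this is not a posted patch):

#include <linux/ratelimit.h>

/* Hypothetical helper: keep the burst of 10 reports, but allow them only
 * once per 50-second window instead of once per 5-second window.
 */
static bool oom_report_allowed(void)
{
	static DEFINE_RATELIMIT_STATE(oom_report_rs,
				      DEFAULT_RATELIMIT_INTERVAL * 10,
				      DEFAULT_RATELIMIT_BURST);

	return __ratelimit(&oom_report_rs);
}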
On (10/10/18 22:10), Tetsuo Handa wrote:
> >> I've found at least 1 place that uses DEFAULT_RATELIMIT_INTERVAL*10:
> >> https://elixir.bootlin.com/linux/latest/source/fs/btrfs/extent-tree.c#L8365
> >> Probably we need something similar here.
>
> Since printk() is a significantly CPU consuming operation, I think that what
> we need to guarantee is interval between the end of an OOM killer messages
> and the beginning of next OOM killer messages is large enough. For example,
> setup a timer with 5 seconds timeout upon the end of an OOM killer messages
> and check whether the timer already fired upon the beginning of next OOM killer
> messages.
Hmm, there is no way to make sure that the previous OOM report made it to
the consoles. So maybe the timer approach will be just as good as rate-limiting.
-ss
Michal Hocko wrote:
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.
Here is my version.
From 0c9ab34fd01837d4c85794042ecb9e922c9eed5a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Thu, 11 Oct 2018 15:27:42 +0900
Subject: [PATCH] mm: memcontrol: Don't flood OOM messages with no eligible task.
syzbot is hitting an RCU stall at shmem_fault() [1].
This is because memcg OOM events with no eligible task (the current thread
is marked as OOM-unkillable) keep calling dump_header() from
out_of_memory(), which was enabled by commit 3100dab2aa09dc6e ("mm:
memcontrol: print proper OOM header when no eligible victim left").
Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message if
the current thread has already reported it. This change allows the current
thread to complete the memory allocation requests needed for its recovery
operation, and also allows us to know whether the current thread is doing
a recovery operation.
[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/sched.h | 1 +
mm/oom_kill.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..701cf3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -723,6 +723,7 @@ struct task_struct {
#endif
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
+ unsigned memcg_oom_no_eligible_warned:1;
#ifdef CONFIG_MEMCG_KMEM
unsigned memcg_kmem_skip_account:1;
#endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..f954d99 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! */
if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+ /*
+ * Don't flood with dump_header() if already reported, in case
+ * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+ * this variable should be per "struct mem_cgroup". But since
+ * we can't prove that multiple concurrent memcg OOM without
+ * eligible task won't cause flooding, choose global variable
+ * for safety.
+ */
+ static u64 last_warned;
+
+ if (is_memcg_oom(oc) && current->memcg_oom_no_eligible_warned
+ && time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+ return false;
+#endif
dump_header(oc, NULL);
pr_warn("Out of memory and no killable processes...\n");
/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
*/
if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+ if (is_memcg_oom(oc)) {
+ last_warned = get_jiffies_64();
+ current->memcg_oom_no_eligible_warned = 1;
+ }
+ } else if (is_memcg_oom(oc)) {
+ current->memcg_oom_no_eligible_warned = 0;
+#endif
}
if (oc->chosen && oc->chosen != (void *)-1UL)
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
--
1.8.3.1
On 2018/10/11 15:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> Once we are here, make sure that the reason to trigger the OOM is
>> printed without ratelimiting because this is really valuable to
>> debug what happened.
>
> Here is my version.
>
Hmm, a per-mem_cgroup flag would be better than a per-task_struct flag.
From 5e80805e4988f23372a3b352fd4185e6bad53f58 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Fri, 12 Oct 2018 16:24:49 +0900
Subject: [PATCH v2] mm: memcontrol: Don't flood OOM messages with no eligible task.
syzbot is hitting an RCU stall at shmem_fault() [1].
This is because memcg OOM events with no eligible task (the current thread
is marked as OOM-unkillable) keep calling dump_header() from
out_of_memory(), which was enabled by commit 3100dab2aa09dc6e ("mm:
memcontrol: print proper OOM header when no eligible victim left").
Let's make sure that the next dump_header() waits for at least 60 seconds
after the previous "Out of memory and no killable processes..." message if
the requested memcg has already reported it. This change allows threads in
the requested memcg to complete the memory allocation requests needed for
their recovery operation, and also allows us to know whether threads in the
requested memcg are doing a recovery operation.
[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 3 +++
mm/oom_kill.c | 23 +++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602..8da2340 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,9 @@ struct mem_cgroup {
*/
bool oom_group;
+ /* Previous OOM-kill request failed due to no eligible task? */
+ bool oom_no_eligible_warned;
+
/* protected by memcg_oom_lock */
bool oom_lock;
int under_oom;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..52a8319 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! */
if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+ /*
+ * Don't flood with dump_header() if already reported, in case
+ * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+ * this variable should be per "struct mem_cgroup". But since
+ * we can't prove that multiple concurrent memcg OOM without
+ * eligible task won't cause flooding, choose global variable
+ * for safety.
+ */
+ static u64 last_warned;
+
+ if (is_memcg_oom(oc) && oc->memcg->oom_no_eligible_warned &&
+ time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+ return false;
+#endif
dump_header(oc, NULL);
pr_warn("Out of memory and no killable processes...\n");
/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
*/
if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+ if (is_memcg_oom(oc)) {
+ oc->memcg->oom_no_eligible_warned = true;
+ last_warned = get_jiffies_64();
+ }
+ } else if (is_memcg_oom(oc)) {
+ oc->memcg->oom_no_eligible_warned = false;
+#endif
}
if (oc->chosen && oc->chosen != (void *)-1UL)
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
--
1.8.3.1
On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> syzbot has noticed that it can trigger RCU stalls from the memcg oom
> path:
> RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
> RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
> RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
> R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
> R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
> dump_header+0x27b/0xf72 mm/oom_kill.c:441
> out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
> mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
> mem_cgroup_oom mm/memcontrol.c:1701 [inline]
> try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
> mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
> mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
> shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
> shmem_fault+0x25f/0x960 mm/shmem.c:1982
> __do_fault+0x100/0x6b0 mm/memory.c:2996
> do_read_fault mm/memory.c:3408 [inline]
> do_fault mm/memory.c:3531 [inline]
>
> The primary reason of the stall lies in an expensive printk handling
> of oom report flood because a misconfiguration on the syzbot side
> caused that there is simply no eligible task because they have
> OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
> from the memcg context.
>
> While normal workloads should be much more careful about potential heavy
> memory consumers that are OOM disabled it makes some sense to rate limit
> a potentially expensive oom reports for cases when there is no eligible
> victim found. Do that by moving the rate limit logic inside dump_header.
> We no longer rely on the caller to do that. It was only oom_kill_process
> which has been throttling. Other two call sites simply didn't have to
> care because one just paniced on the OOM when configured that way and
> no eligible task would panic for the global case as well. Memcg changed
> the picture because we do not panic and we might have multiple sources
> of the same event.
>
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.
>
> Reported-by: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Michal Hocko <[email protected]>
So not more than 10 dumps in each 5s interval. That looks reasonable
to me. By the time it starts dropping data you have more than enough
information to go on already.
Acked-by: Johannes Weiner <[email protected]>
On 2018/10/12 21:08, Michal Hocko wrote:
>> So not more than 10 dumps in each 5s interval. That looks reasonable
>> to me. By the time it starts dropping data you have more than enough
>> information to go on already.
>
> Yeah. Unless we have a storm coming from many different cgroups in
> parallel. But even then we have the allocation context for each OOM so
> we are not losing everything. Should we ever tune this, it can be done
> later with some explicit examples.
>
>> Acked-by: Johannes Weiner <[email protected]>
>
> Thanks! I will post the patch to Andrew early next week.
>
How do you handle environments where one dump takes e.g. 3 seconds?
Counting the delay between the first message of the previous dump and the
first message of the next dump is not safe. Unless we count the delay
between the last message of the previous dump and the first message of the
next dump, we cannot guarantee that the system won't lock up due to
printk() flooding.
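As a rough back-of-the-envelope illustration of that worry, using the 57600
baud console and the ~70-line report size mentioned earlier in the thread
(all other figures are assumed, purely for scale):

#include <stdio.h>

int main(void)
{
	/* Assumed figures for illustration only, not measurements. */
	const double console_bps = 57600.0 / 10;  /* ~5760 bytes/s at 57600 baud, 8N1 */
	const double report_bytes = 70 * 80;      /* ~70 lines of ~80 bytes per OOM report */
	const double burst = 10;                  /* DEFAULT_RATELIMIT_BURST */
	const double window_sec = 5;              /* DEFAULT_RATELIMIT_INTERVAL */

	printf("~%.1f seconds of console output allowed per %.0f second window\n",
	       burst * report_bytes / console_bps, window_sec);
	return 0;
}

With these assumptions the limiter admits roughly 10 seconds worth of console
output for every 5-second window, so such a console keeps falling further behind.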
On Fri 12-10-18 07:20:08, Johannes Weiner wrote:
> On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > syzbot has noticed that it can trigger RCU stalls from the memcg oom
> > path:
> > RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> > Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> > 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> > e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> > RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
> > RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
> > RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
> > R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
> > R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
> > dump_header+0x27b/0xf72 mm/oom_kill.c:441
> > out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
> > mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
> > mem_cgroup_oom mm/memcontrol.c:1701 [inline]
> > try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
> > mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
> > mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
> > shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
> > shmem_fault+0x25f/0x960 mm/shmem.c:1982
> > __do_fault+0x100/0x6b0 mm/memory.c:2996
> > do_read_fault mm/memory.c:3408 [inline]
> > do_fault mm/memory.c:3531 [inline]
> >
> > The primary reason of the stall lies in an expensive printk handling
> > of oom report flood because a misconfiguration on the syzbot side
> > caused that there is simply no eligible task because they have
> > OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
> > from the memcg context.
> >
> > While normal workloads should be much more careful about potential heavy
> > memory consumers that are OOM disabled it makes some sense to rate limit
> > a potentially expensive oom reports for cases when there is no eligible
> > victim found. Do that by moving the rate limit logic inside dump_header.
> > We no longer rely on the caller to do that. It was only oom_kill_process
> > which has been throttling. Other two call sites simply didn't have to
> > care because one just paniced on the OOM when configured that way and
> > no eligible task would panic for the global case as well. Memcg changed
> > the picture because we do not panic and we might have multiple sources
> > of the same event.
> >
> > Once we are here, make sure that the reason to trigger the OOM is
> > printed without ratelimiting because this is really valuable to
> > debug what happened.
> >
> > Reported-by: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Michal Hocko <[email protected]>
>
> So not more than 10 dumps in each 5s interval. That looks reasonable
> to me. By the time it starts dropping data you have more than enough
> information to go on already.
Yeah. Unless we have a storm coming from many different cgroups in
parallel. But even then we have the allocation context for each OOM so
we are not losing everything. Should we ever tune this, it can be done
later with some explicit examples.
> Acked-by: Johannes Weiner <[email protected]>
Thanks! I will post the patch to Andrew early next week.
--
Michal Hocko
SUSE Labs
On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
> On 2018/10/12 21:08, Michal Hocko wrote:
> >> So not more than 10 dumps in each 5s interval. That looks reasonable
> >> to me. By the time it starts dropping data you have more than enough
> >> information to go on already.
> >
> > Yeah. Unless we have a storm coming from many different cgroups in
> > parallel. But even then we have the allocation context for each OOM so
> > we are not losing everything. Should we ever tune this, it can be done
> > later with some explicit examples.
> >
> >> Acked-by: Johannes Weiner <[email protected]>
> >
> > Thanks! I will post the patch to Andrew early next week.
> >
>
> How do you handle environments where one dump takes e.g. 3 seconds?
> Counting delay between first message in previous dump and first message
> in next dump is not safe. Unless we count delay between last message
> in previous dump and first message in next dump, we cannot guarantee
> that the system won't lockup due to printk() flooding.
How is that different from any other printk ratelimiting? If a dump
takes 3 seconds you need to fix your console. It doesn't make sense to
design KERN_INFO messages for the slowest serial consoles out there.
That's what we did, btw. We used to patch out the OOM header because
our serial console was so bad, but obviously that's not a generic
upstream solution. We've since changed the loglevel on the serial and
use netconsole[1] for the chattier loglevels.
[1] https://github.com/facebook/fbkutils/tree/master/netconsd
Calling printk() people. ;-)
On 2018/10/12 21:41, Johannes Weiner wrote:
> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>> to me. By the time it starts dropping data you have more than enough
>>>> information to go on already.
>>>
>>> Yeah. Unless we have a storm coming from many different cgroups in
>>> parallel. But even then we have the allocation context for each OOM so
>>> we are not losing everything. Should we ever tune this, it can be done
>>> later with some explicit examples.
>>>
>>>> Acked-by: Johannes Weiner <[email protected]>
>>>
>>> Thanks! I will post the patch to Andrew early next week.
>>>
>>
>> How do you handle environments where one dump takes e.g. 3 seconds?
>> Counting delay between first message in previous dump and first message
>> in next dump is not safe. Unless we count delay between last message
>> in previous dump and first message in next dump, we cannot guarantee
>> that the system won't lockup due to printk() flooding.
>
> How is that different from any other printk ratelimiting? If a dump
> takes 3 seconds you need to fix your console. It doesn't make sense to
> design KERN_INFO messages for the slowest serial consoles out there.
You can't fix the console. It is a hardware limitation.
>
> That's what we did, btw. We used to patch out the OOM header because
> our serial console was so bad, but obviously that's not a generic
> upstream solution. We've since changed the loglevel on the serial and
> use netconsole[1] for the chattier loglevels.
>
> [1] https://github.com/facebook/fbkutils/tree/master/netconsd
>
On 2018/10/12 21:58, Tetsuo Handa wrote:
> On 2018/10/12 21:41, Johannes Weiner wrote:
>> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>>> to me. By the time it starts dropping data you have more than enough
>>>>> information to go on already.
Not reasonable at all.
>>>>
>>>> Yeah. Unless we have a storm coming from many different cgroups in
>>>> parallel. But even then we have the allocation context for each OOM so
>>>> we are not losing everything. Should we ever tune this, it can be done
>>>> later with some explicit examples.
>>>>
>>>>> Acked-by: Johannes Weiner <[email protected]>
>>>>
>>>> Thanks! I will post the patch to Andrew early next week.
>>>>
One thread from one cgroup is sufficient. I don't think that Michal's patch
is an appropriate mitigation. It still needlessly floods the kernel log buffer
and significantly defers the recovery operation.
Nacked-by: Tetsuo Handa <[email protected]>
---------- Testcase ----------
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	char *buf = malloc(size);

	/* Create a memcg whose limit is half of the 200MB we will touch. */
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	/* Move this process into that memcg. */
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	/* Make this process OOM-unkillable, so the memcg OOM killer finds no victim. */
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "-1000\n");
	fclose(fp);
	/* Fault in the whole buffer; page faults beyond the limit keep hitting
	 * the no-eligible-task OOM path, emitting another report each time. */
	fp = fopen("/dev/zero", "r");
	fread(buf, 1, size, fp);
	fclose(fp);
	return 0;
}
---------- Michal's patch ----------
73133 lines (5.79MB) of kernel messages per one run
[root@ccsecurity ~]# time ./a.out
real 3m44.389s
user 0m0.000s
sys 3m42.334s
[root@ccsecurity ~]# time ./a.out
real 3m41.767s
user 0m0.004s
sys 3m39.779s
---------- My v2 patch ----------
50 lines (3.40 KB) of kernel messages per one run
[root@ccsecurity ~]# time ./a.out
real 0m5.227s
user 0m0.000s
sys 0m4.950s
[root@ccsecurity ~]# time ./a.out
real 0m5.249s
user 0m0.000s
sys 0m4.956s
On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> ---------- Michal's patch ----------
>
> 73133 lines (5.79MB) of kernel messages per one run
>
> [root@ccsecurity ~]# time ./a.out
>
> real 3m44.389s
> user 0m0.000s
> sys 3m42.334s
>
> [root@ccsecurity ~]# time ./a.out
>
> real 3m41.767s
> user 0m0.004s
> sys 3m39.779s
>
> ---------- My v2 patch ----------
>
> 50 lines (3.40 KB) of kernel messages per one run
>
> [root@ccsecurity ~]# time ./a.out
>
> real 0m5.227s
> user 0m0.000s
> sys 0m4.950s
>
> [root@ccsecurity ~]# time ./a.out
>
> real 0m5.249s
> user 0m0.000s
> sys 0m4.956s
Your patch is suppressing information that I want to have and my
console can handle, just because your console is slow, even though
there is no need to use that console at that log level.
NAK to your patch. I think you're looking at this from the wrong
angle. A console that takes almost 4 minutes to print 70k lines
shouldn't be the baseline for how verbose KERN_INFO is.
On 2018/10/13 20:22, Johannes Weiner wrote:
> On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
>> ---------- Michal's patch ----------
>>
>> 73133 lines (5.79MB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real 3m44.389s
>> user 0m0.000s
>> sys 3m42.334s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real 3m41.767s
>> user 0m0.004s
>> sys 3m39.779s
>>
>> ---------- My v2 patch ----------
>>
>> 50 lines (3.40 KB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real 0m5.227s
>> user 0m0.000s
>> sys 0m4.950s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real 0m5.249s
>> user 0m0.000s
>> sys 0m4.956s
>
> Your patch is suppressing information that I want to have and my
> console can handle, just because your console is slow, even though
> there is no need to use that console at that log level.
My patch is not suppressing information you want to have.
My patch is mainly suppressing
[ 52.393146] Out of memory and no killable processes...
[ 52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.398623] Out of memory and no killable processes...
[ 52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.404356] Out of memory and no killable processes...
[ 52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.409595] Out of memory and no killable processes...
[ 52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.415588] Out of memory and no killable processes...
[ 52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 52.421904] Out of memory and no killable processes...
[ 52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
lines which Michal's patch cannot suppress.
Also, my console is console=ttyS0,115200n8. It is not slow at all.
>
> NAK to your patch. I think you're looking at this from the wrong
> angle. A console that takes almost 4 minutes to print 70k lines
> shouldn't be the baseline for how verbose KERN_INFO is.
>
Run the testcase in your environment.
On Sat 13-10-18 20:28:38, Tetsuo Handa wrote:
> On 2018/10/13 20:22, Johannes Weiner wrote:
> > On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> >> ---------- Michal's patch ----------
> >>
> >> 73133 lines (5.79MB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real 3m44.389s
> >> user 0m0.000s
> >> sys 3m42.334s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real 3m41.767s
> >> user 0m0.004s
> >> sys 3m39.779s
> >>
> >> ---------- My v2 patch ----------
> >>
> >> 50 lines (3.40 KB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real 0m5.227s
> >> user 0m0.000s
> >> sys 0m4.950s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real 0m5.249s
> >> user 0m0.000s
> >> sys 0m4.956s
> >
> > Your patch is suppressing information that I want to have and my
> > console can handle, just because your console is slow, even though
> > there is no need to use that console at that log level.
>
> My patch is not suppressing information you want to have.
> My patch is mainly suppressing
>
> [ 52.393146] Out of memory and no killable processes...
> [ 52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [ 52.398623] Out of memory and no killable processes...
> [ 52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [ 52.404356] Out of memory and no killable processes...
> [ 52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [ 52.409595] Out of memory and no killable processes...
> [ 52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [ 52.415588] Out of memory and no killable processes...
> [ 52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [ 52.421904] Out of memory and no killable processes...
> [ 52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
>
> lines which Michal's patch cannot suppress.
This was a deliberate decision, because the allocation failure context is
usually useful information to have. If this is killing a reasonably
configured machine then we can move the ratelimit up and suppress that
information. This will always be a cost vs. benefit decision, and as such
it should be argued in the changelog.
As so many dozens of times before, I will point you to the incremental
nature of changes we really prefer in mm land. We are also after a
simplicity which your proposal lacks in many aspects. You seem to ignore
that general approach, and I have a hard time considering your NAK
relevant feedback. Going to an extreme and basing a complex solution on
it is not going to fly. Having no killable process should be a rare event
which requires a seriously misconfigured memcg to happen this wildly. If
you can trigger it with normal user privileges then it would be a clear
bug to address rather than something to work around with printk throttling.
--
Michal Hocko
SUSE Labs
On 2018/10/15 17:19, Michal Hocko wrote:
> As so many dozens of times before, I will point you to an incremental
> nature of changes we really prefer in the mm land. We are also after a
> simplicity which your proposal lacks in many aspects. You seem to ignore
> that general approach and I have hard time to consider your NAK as a
> relevant feedback. Going to an extreme and basing a complex solution on
> it is not going to fly. No killable process should be a rare event which
> requires a seriously misconfigured memcg to happen so wildly. If you can
> trigger it with a normal user privileges then it would be a clear bug to
> address rather than work around with printk throttling.
>
I can trigger 200+ occurrences / 900+ lines / 69KB+ of needless OOM messages
with normal user privileges. This is a lot of needless noise/delay.
Having no killable process is not a rare event, even without root privileges.
[root@ccsecurity kumaneko]# time ./a.out
Killed
real 0m2.396s
user 0m0.000s
sys 0m2.970s
[root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
202
[root@ccsecurity ~]# dmesg | wc
942 7335 70716
[root@ccsecurity ~]#
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
int fd = open("/dev/zero", O_RDONLY);
char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
read(pipe_fd[0], buf, 1);
read(fd, buf, MMAPSIZE);
pause();
return 0;
}
int main(int argc, char *argv[])
{
int i;
char *stack;
FILE *fp;
const unsigned long size = 1048576 * 200;
mkdir("/sys/fs/cgroup/memory/test1", 0755);
fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
fprintf(fp, "%lu\n", size);
fclose(fp);
fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
fprintf(fp, "%u\n", getpid());
fclose(fp);
if (setgid(-2) || setuid(-2))
return 1;
stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
for (i = 0; i < NUMTHREADS; i++)
if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
CLONE_SIGHAND | CLONE_THREAD | CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
break;
sleep(1);
close(pipe_fd[1]);
pause();
return 0;
}
----------------------------------------
On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> On 2018/10/15 17:19, Michal Hocko wrote:
> > As so many dozens of times before, I will point you to an incremental
> > nature of changes we really prefer in the mm land. We are also after a
> > simplicity which your proposal lacks in many aspects. You seem to ignore
> > that general approach and I have hard time to consider your NAK as a
> > relevant feedback. Going to an extreme and basing a complex solution on
> > it is not going to fly. No killable process should be a rare event which
> > requires a seriously misconfigured memcg to happen so wildly. If you can
> > trigger it with a normal user privileges then it would be a clear bug to
> > address rather than work around with printk throttling.
> >
>
> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> with a normal user privileges. This is a lot of needless noise/delay.
I am pretty sure you have understood the part of my message you have
chosen not to quote, where I have said that the specific rate limiting
decisions can be changed based on reasonable configurations. There is
absolutely zero reason to NAK a natural decision to unify the throttling
and to cook up a per-memcg mechanism for a very specific path instead.
> No killable process is not a rare event, even without root privileges.
>
> [root@ccsecurity kumaneko]# time ./a.out
> Killed
>
> real 0m2.396s
> user 0m0.000s
> sys 0m2.970s
> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> 202
> [root@ccsecurity ~]# dmesg | wc
> 942 7335 70716
OK, so this is 70kB worth of data pushed through the console. Is this
really killing any machine?
--
Michal Hocko
SUSE Labs
On 2018/10/15 20:24, Michal Hocko wrote:
> On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
>> On 2018/10/15 17:19, Michal Hocko wrote:
>>> As so many dozens of times before, I will point you to an incremental
>>> nature of changes we really prefer in the mm land. We are also after a
>>> simplicity which your proposal lacks in many aspects. You seem to ignore
>>> that general approach and I have hard time to consider your NAK as a
>>> relevant feedback. Going to an extreme and basing a complex solution on
>>> it is not going to fly. No killable process should be a rare event which
>>> requires a seriously misconfigured memcg to happen so wildly. If you can
>>> trigger it with a normal user privileges then it would be a clear bug to
>>> address rather than work around with printk throttling.
>>>
>>
>> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
>> with a normal user privileges. This is a lot of needless noise/delay.
>
> I am pretty sure you have understood the part of my message you have
> chosen to not quote where I have said that the specific rate limitting
> decisions can be changed based on reasonable configurations. There is
> absolutely zero reason to NAK a natural decision to unify the throttling
> and cook a per-memcg way for a very specific path instead.
>
>> No killable process is not a rare event, even without root privileges.
>>
>> [root@ccsecurity kumaneko]# time ./a.out
>> Killed
>>
>> real 0m2.396s
>> user 0m0.000s
>> sys 0m2.970s
>> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
>> 202
>> [root@ccsecurity ~]# dmesg | wc
>> 942 7335 70716
>
> OK, so this is 70kB worth of data pushed throug the console. Is this
> really killing any machine?
>
Nobody can prove that it never kills some machine. This is just one example result of
one example stress test tried in my environment. Since I am a secure-programming person
from the security subsystem, I really hate your "Can you trigger it?" resistance. Since
this is an OOM path that nobody tests, starting out prepared for the worst case keeps
things simple.
On Mon 15-10-18 21:47:08, Tetsuo Handa wrote:
> On 2018/10/15 20:24, Michal Hocko wrote:
> > On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> >> On 2018/10/15 17:19, Michal Hocko wrote:
> >>> As so many dozens of times before, I will point you to an incremental
> >>> nature of changes we really prefer in the mm land. We are also after a
> >>> simplicity which your proposal lacks in many aspects. You seem to ignore
> >>> that general approach and I have hard time to consider your NAK as a
> >>> relevant feedback. Going to an extreme and basing a complex solution on
> >>> it is not going to fly. No killable process should be a rare event which
> >>> requires a seriously misconfigured memcg to happen so wildly. If you can
> >>> trigger it with a normal user privileges then it would be a clear bug to
> >>> address rather than work around with printk throttling.
> >>>
> >>
> >> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> >> with a normal user privileges. This is a lot of needless noise/delay.
> >
> > I am pretty sure you have understood the part of my message you have
> > chosen to not quote where I have said that the specific rate limitting
> > decisions can be changed based on reasonable configurations. There is
> > absolutely zero reason to NAK a natural decision to unify the throttling
> > and cook a per-memcg way for a very specific path instead.
> >
> >> No killable process is not a rare event, even without root privileges.
> >>
> >> [root@ccsecurity kumaneko]# time ./a.out
> >> Killed
> >>
> >> real 0m2.396s
> >> user 0m0.000s
> >> sys 0m2.970s
> >> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> >> 202
> >> [root@ccsecurity ~]# dmesg | wc
> >> 942 7335 70716
> >
> > OK, so this is 70kB worth of data pushed throug the console. Is this
> > really killing any machine?
> >
>
> Nobody can prove that it never kills some machine. This is just one example result of
> one example stress tried in my environment. Since I am secure programming man from security
> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
> where nobody tests, starting from being prepared for the worst case keeps things simple.
There is simply no way to be generally safe in this kind of situation. As
soon as your console is so slow that you cannot push the oom report
through, there is only one single option left, and that is to disable the
oom report altogether. And that might be a viable option. But fiddling
with a per-memcg limit is not going to fly. Just realize what will happen
if you have hundreds of different memcgs triggering this path around the
same time.
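As a rough illustration of that concern (all numbers assumed, purely for
scale), a strict per-memcg throttle does not bound the aggregate rate once
many memcgs hit the no-victim path at the same time:

#include <stdio.h>

int main(void)
{
	/* Assumed figures for illustration only. */
	const int memcgs = 500;           /* memcgs hitting the no-victim path at once */
	const int report_lines = 50;      /* roughly one dump_header() report */
	const int reports_per_minute = 1; /* per memcg, under a 60s per-memcg throttle */

	printf("%d lines per minute in aggregate\n",
	       memcgs * report_lines * reports_per_minute);
	return 0;
}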
So can you start being reasonable and try to look at a wider picture
finally please?
--
Michal Hocko
SUSE Labs
On 2018/10/15 22:35, Michal Hocko wrote:
>> Nobody can prove that it never kills some machine. This is just one example result of
>> one example stress tried in my environment. Since I am secure programming man from security
>> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
>> where nobody tests, starting from being prepared for the worst case keeps things simple.
>
> There is simply no way to be generally safe this kind of situation. As
> soon as your console is so slow that you cannot push the oom report
> through there is only one single option left and that is to disable the
> oom report altogether. And that might be a viable option.
There is a way to be safe in this kind of situation. The way is to make sure that printk()
is called with a large enough interval. That is, count the interval between the end of the
previous printk() messages and the beginning of the next printk() messages.
And you are misunderstanding my patch. My patch does not ratelimit the first
occurrence of a memcg OOM in each memcg domain (because the first
dump_header(oc, NULL);
pr_warn("Out of memory and no killable processes...\n");
output is usually useful information to get), which is serialized by the oom_lock mutex.
Even so, my patch cannot cause a lockup, because it ratelimits the subsequent outputs in
any memcg domain. That is, while my patch might cause
"** %u printk messages dropped **\n"
when we have hundreds of different memcgs triggering this path around the same time,
my patch refrains from "keep disturbing the administrator's manual recovery operation on
the console by printing
"%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
"Out of memory and no killable processes...\n"
on each page fault event from hundreds of different memcgs triggering this path".
There is no need to print
"%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
"Out of memory and no killable processes...\n"
lines on every page fault event. A kernel which consumes multiple milliseconds on each page
fault event (due to printk() messages from the dysfunctional OOM killer) is stupid.
> But fiddling
> with per memcg limit is not going to fly. Just realize what will happen
> if you have hundreds of different memcgs triggering this path around the
> same time.
You have just said that "No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly." and now you
refer to a very bad case "Just realize what will happen if you have hundreds
of different memcgs triggering this path around the same time." which makes
your former comment suspicious.
>
> So can you start being reasonable and try to look at a wider picture
> finally please?
>
Honestly, I can't look at a wider picture because I have never been shown a picture by you.
What we are doing is an endless loop of "let's do ... because ..." and "hmm, our assumption
was wrong because ..."; that is, making changes without first considering the worst case.
For example, the OOM victim problem which David Rientjes is complaining about is that our
assumption that "__oom_reap_task_mm() can reclaim the majority of memory" was wrong. (And
your proposal to hand over is getting no response.) For another example, the __set_oom_adj()
path which Yong-Taek Lee is trying to optimize is one where our assumption that "we already
succeeded in enforcing the same oom_score_adj among multiple thread groups" was wrong. For
yet another example, CVE-2018-1000200 and CVE-2016-10723 were caused by ignoring my concern.
And the funny thing is that you negated the rationale of "mm, oom: remove sleep from under
oom_lock" with "mm, oom: remove oom_lock from oom_reaper" after only 4 days...
Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with the simplified
one shown below (because you don't like a per-memcg limit).
---
mm/oom_kill.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! */
if (!oc->chosen) {
+ static unsigned long last_warned;
+
+ if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+ time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+ return false;
dump_header(oc, NULL);
pr_warn("Out of memory and no killable processes...\n");
/*
@@ -1115,6 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
*/
if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
panic("System is deadlocked on memory\n");
+ last_warned = jiffies;
}
if (oc->chosen && oc->chosen != (void *)-1UL)
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
--
1.8.3.1
On Tue 16-10-18 09:55:06, Tetsuo Handa wrote:
> On 2018/10/15 22:35, Michal Hocko wrote:
> >> Nobody can prove that it never kills some machine. This is just one example result of
> >> one example stress tried in my environment. Since I am secure programming man from security
> >> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
> >> where nobody tests, starting from being prepared for the worst case keeps things simple.
> >
> > There is simply no way to be generally safe this kind of situation. As
> > soon as your console is so slow that you cannot push the oom report
> > through there is only one single option left and that is to disable the
> > oom report altogether. And that might be a viable option.
>
> There is a way to be safe this kind of situation. The way is to make sure that printk()
> is called with enough interval. That is, count the interval between the end of previous
> printk() messages and the beginning of next printk() messages.
You are simply wrong, because any interval is meaningless without
knowing the printk throughput.
[...]
> lines on evey page fault event. A kernel which consumes multiple milliseconds on each page
> fault event (due to printk() messages from the defunctional OOM killer) is stupid.
Not if it represents an unusual situation where there is no eligible
task available, because this is an exceptional case where the cost of
the printk is simply not relevant.
[...]
I am sorry to skip a large part of your message, but this discussion, like
many others, doesn't lead anywhere. You simply refuse to understand
some of the core assumptions in this area.
> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
> one shown below (because you don't like per memcg limit).
My patch is adding a rate-limit! I really fail to see why we need yet
another one on top of it. This is just ridiculous. I can see reasons to
tune that rate limit, but adding 2 different mechanisms is just wrong.
If your NAK on unifying the ratelimit for dump_header for all paths
still holds then I do not care too much about pushing it forward. But I
find this way of giving review feedback counterproductive.
--
Michal Hocko
SUSE Labs
On 2018/10/16 18:20, Michal Hocko wrote:
>> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
>> one shown below (because you don't like per memcg limit).
>
> My patch is adding a rate-limit! I really fail to see why we need yet
> another one on top of it. This is just ridiculous. I can see reasons to
> tune that rate limit but adding 2 different mechanisms is just wrong.
>
> If your NAK to unify the ratelimit for dump_header for all paths
> still holds then I do not care too much to push it forward. But I find
> thiis way of the review feedback counter productive.
>
Your patch is _NOT_ adding a rate-limit for
"%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
"Out of memory and no killable processes...\n"
lines!
[ 97.519229] Out of memory and no killable processes...
[ 97.522060] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.525507] Out of memory and no killable processes...
[ 97.528817] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.532345] Out of memory and no killable processes...
[ 97.534813] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.538270] Out of memory and no killable processes...
[ 97.541449] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.546268] Out of memory and no killable processes...
[ 97.548823] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.552399] Out of memory and no killable processes...
[ 97.554939] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.558616] Out of memory and no killable processes...
[ 97.562257] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.565998] Out of memory and no killable processes...
[ 97.568642] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.572169] Out of memory and no killable processes...
[ 97.575200] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.579357] Out of memory and no killable processes...
[ 97.581912] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.585414] Out of memory and no killable processes...
[ 97.589191] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.593586] Out of memory and no killable processes...
[ 97.596527] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.600118] Out of memory and no killable processes...
[ 97.603237] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.606837] Out of memory and no killable processes...
[ 97.611550] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.615244] Out of memory and no killable processes...
[ 97.617859] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.621634] Out of memory and no killable processes...
[ 97.624884] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.629256] Out of memory and no killable processes...
[ 97.631885] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.635367] Out of memory and no killable processes...
[ 97.638033] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.641827] Out of memory and no killable processes...
[ 97.641993] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.648453] Out of memory and no killable processes...
[ 97.651481] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.655082] Out of memory and no killable processes...
[ 97.657941] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.663036] Out of memory and no killable processes...
[ 97.665890] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[ 97.669473] Out of memory and no killable processes...
We don't need to print these lines every few milliseconds. Even in an exceptional case,
this is a DoS for console users. Printing once (or a few times) per minute would be
enough. Otherwise, users cannot see what they are typing or what is being printed.
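A minimal sketch of what "once (or a few times) per minute" could mean for just these two lines is shown below; the 60 * HZ interval, the oom_no_victim_rs name and the report_no_killable() helper are illustrative assumptions, not code from either posted patch.

#include <linux/oom.h>
#include <linux/ratelimit.h>
#include <linux/sched.h>

/* illustrative: allow at most one "no killable processes" report per minute */
static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);

static void report_no_killable(struct oom_control *oc)
{
	if (!__ratelimit(&oom_no_victim_rs))
		return;

	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
		current->comm, oc->gfp_mask, &oc->gfp_mask,
		nodemask_pr_args(oc->nodemask), oc->order,
		current->signal->oom_score_adj);
	pr_warn("Out of memory and no killable processes...\n");
}

A ratelimit state with a burst of 1 and a one-minute interval drops all but one such report per minute, which matches the frequency suggested above.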
On Tue 16-10-18 20:05:47, Tetsuo Handa wrote:
> On 2018/10/16 18:20, Michal Hocko wrote:
> >> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with the simplified
> >> one shown below (because you don't like a per-memcg limit).
> >
> > My patch is adding a rate-limit! I really fail to see why we need yet
> > another one on top of it. This is just ridiculous. I can see reasons to
> > tune that rate limit, but adding two different mechanisms is just wrong.
> >
> > If your NAK on unifying the ratelimit for dump_header for all paths
> > still holds, then I do not care too much about pushing it forward. But I find
> > this way of giving review feedback counterproductive.
> >
>
> Your patch is _NOT_ adding a rate-limit for
>
> "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> "Out of memory and no killable processes...\n"
>
> lines!
And I've said I do not have objections to an _incremental_ patch that
moves the ratelimit up, with a clear cost/benefit evaluation.
--
Michal Hocko
SUSE Labs