2018-08-04 13:34:05

by syzbot

[permalink] [raw]
Subject: WARNING in try_charge

Hello,

syzbot found the following crash on:

HEAD commit: d1e0b8e0cb7a Add linux-next specific files for 20180725
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15a1c770400000
kernel config: https://syzkaller.appspot.com/x/.config?x=eef3552c897e4d33
dashboard link: https://syzkaller.appspot.com/bug?extid=bab151e82a4e973fa325
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB,
file-rss:32768kB, shmem-rss:0kB
oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB,
file-rss:32000kB, shmem-rss:0kB
------------[ cut here ]------------
Memory cgroup charge failed because of no reclaimable memory! This looks
like a misconfiguration or a kernel bug.
WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 mem_cgroup_oom
mm/memcontrol.c:1709 [inline]
WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 try_charge+0x734/0x1680
mm/memcontrol.c:2211
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 23767 Comm: syz-executor2 Not tainted 4.18.0-rc6-next-20180725+
#18
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
panic+0x238/0x4e7 kernel/panic.c:184


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.


2018-08-04 13:46:56

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

syzbot is hitting WARN(1) because of mem_cgroup_out_of_memory() == false.
At first I suspected that syzbot is hitting

static bool oom_kill_memcg_victim(struct oom_control *oc)
{
if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
return oc->chosen_memcg;

case because

/* We have one or more terminating processes at this point. */
oc->chosen_task = INFLIGHT_VICTIM;

is not called. But since that patch was dropped from next-20180803, syzbot
seems to be hitting a different race condition
( https://syzkaller.appspot.com/text?tag=CrashLog&x=12071654400000 ).

Therefore, next culprit I suspect is

mm, oom: remove oom_lock from oom_reaper

oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
close race with exiting task"). We do not really need the lock anymore
though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") has removed serialization with the exit path based on the
mm reference count and so we do not really rely on the oom_lock anymore.

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim. Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
now.

which is in next-20180803, and my "mm, oom: Fix unnecessary killing of additional processes."
( https://marc.info/?i=1533389386-3501-4-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )
could mitigate it. Michal and David, please respond.


2018-08-05 08:15:52

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

syzbot has found a reproducer for the following crash on:

HEAD commit: 116b181bb646 Add linux-next specific files for 20180803
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15f87bfc400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b4f38be7c2c519d5
dashboard link: https://syzkaller.appspot.com/bug?extid=bab151e82a4e973fa325
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=10e553e0400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

Killed process 6527 (syz-executor2) total-vm:37704kB, anon-rss:2140kB,
file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 6527 (syz-executor2), now anon-rss:0kB,
file-rss:0kB, shmem-rss:0kB
------------[ cut here ]------------
oom_reaper: reaped process 6532 (syz-executor3), now anon-rss:0kB,
file-rss:4kB, shmem-rss:0kB
Memory cgroup charge failed because of no reclaimable memory! This looks
like a misconfiguration or a kernel bug.
WARNING: CPU: 1 PID: 6536 at mm/memcontrol.c:1705 mem_cgroup_oom
mm/memcontrol.c:1704 [inline]
WARNING: CPU: 1 PID: 6536 at mm/memcontrol.c:1705 try_charge+0x734/0x1680
mm/memcontrol.c:2262
oom_reaper: reaped process 6518 (syz-executor0), now anon-rss:0kB,
file-rss:4kB, shmem-rss:0kB
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 6536 Comm: syz-executor2 Not tainted 4.18.0-rc7-next-20180803+
#31
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
panic+0x238/0x4e7 kernel/panic.c:184
__warn.cold.8+0x163/0x1ba kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:996
RIP: 0010:mem_cgroup_oom mm/memcontrol.c:1704 [inline]
RIP: 0010:try_charge+0x734/0x1680 mm/memcontrol.c:2262
Code: 85 b8 04 00 00 8b b5 c8 fd ff ff 44 89 f2 4c 89 ff e8 f0 51 ff ff 84
c0 0f 85 31 08 00 00 48 c7 c7 60 17 13 87 e8 7c fe 85 ff <0f> 0b 48 8d 95
f8 fd ff ff 48 8b b5 c0 fd ff ff 48 b8 00 00 00 00
RSP: 0018:ffff8801a8637620 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801cbb5c980 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816366f1 RDI: ffff8801a8637310
RBP: ffff8801a86378b0 R08: ffff8801b23001c0 R09: fffffbfff0ff11fc
R10: fffffbfff0ff11fc R11: ffffffff87f88fe3 R12: dffffc0000000000
R13: ffff8801a8637888 R14: 0000000000000000 R15: ffff8801cbb5c980
memcg_kmem_charge_memcg+0x7c/0x120 mm/memcontrol.c:2592
memcg_charge_slab mm/slab.h:283 [inline]
kmem_getpages mm/slab.c:1415 [inline]
cache_grow_begin+0x207/0x710 mm/slab.c:2677
fallback_alloc+0x203/0x2c0 mm/slab.c:3219
____cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
__do_cache_alloc mm/slab.c:3356 [inline]
slab_alloc mm/slab.c:3384 [inline]
kmem_cache_alloc+0x1e5/0x760 mm/slab.c:3552
shmem_alloc_inode+0x1b/0x40 mm/shmem.c:3515
alloc_inode+0x63/0x190 fs/inode.c:210
new_inode_pseudo+0x71/0x1a0 fs/inode.c:903
new_inode+0x1c/0x40 fs/inode.c:932
shmem_get_inode+0xf1/0x910 mm/shmem.c:2142
__shmem_file_setup.part.48+0x83/0x2a0 mm/shmem.c:3871
__shmem_file_setup mm/shmem.c:3865 [inline]
shmem_file_setup+0x65/0x90 mm/shmem.c:3912
__do_sys_memfd_create mm/memfd.c:304 [inline]
__se_sys_memfd_create mm/memfd.c:247 [inline]
__x64_sys_memfd_create+0x2af/0x4f0 mm/memfd.c:247
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x456b29
Code: Bad RIP value.
RSP: 002b:00007f91f0f04a88 EFLAGS: 00000246 ORIG_RAX: 000000000000013f
RAX: ffffffffffffffda RBX: 0000000020000740 RCX: 0000000000456b29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000004c1c8d
RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000020000740 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004d5bb8 R14: 00000000004c9491 R15: 0000000000000000
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


2018-08-05 11:34:20

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/04 22:45, Tetsuo Handa wrote:
> syzbot is hitting WARN(1) because of mem_cgroup_out_of_memory() == false.

Since syzbot found a syz reproducer, I asked syzbot to try two patches.

Setting MMF_OOM_SKIP under oom_lock to prevent from races
( https://syzkaller.appspot.com/x/patch.diff?x=10fb3fd0400000 ) was not sufficient.

Waiting until __mmput() completes (with timeout using OOM score feedback)
( https://syzkaller.appspot.com/x/patch.diff?x=101e449c400000 ) solved this race.

2018-08-06 09:17:01

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Sat 04-08-18 06:33:02, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: d1e0b8e0cb7a Add linux-next specific files for 20180725
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15a1c770400000
> kernel config: https://syzkaller.appspot.com/x/.config?x=eef3552c897e4d33
> dashboard link: https://syzkaller.appspot.com/bug?extid=bab151e82a4e973fa325
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: [email protected]
>
> Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB,
> file-rss:32768kB, shmem-rss:0kB
> oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB,
> file-rss:32000kB, shmem-rss:0kB

More interesting stuff is higher in the kernel log
: [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
: [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605

Are you sure you want to have hard limit set to 0?

: [ 366.454963] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 366.461787] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 366.467946] Memory cgroup stats for /ile0: cache:12KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB

There are only 3 pages charged to this memcg!

: [ 366.487490] Tasks state (memory values in pages):
: [ 366.492349] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
: [ 366.501237] [ 23766] 0 23766 17620 8221 126976 0 0 syz-executor3
: [ 366.510367] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
: [ 366.519409] Memory cgroup out of memory: Kill process 23766 (syz-executor3) score 8252000 or sacrifice child
: [ 366.529422] Killed process 23766 (syz-executor3) total-vm:70480kB, anon-rss:116kB, file-rss:32768kB, shmem-rss:0kB
: [ 366.540456] oom_reaper: reaped process 23766 (syz-executor3), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB

The oom reaper cannot reclaim file backed memory from a large part. I
assume this is are shared mappings which are living outside of memcg
because of the counter.

: [...]
: [ 367.085870] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor2,pid=23767,uid=0
: [ 367.100073] memory: usage 112kB, limit 0kB, failcnt 1615
: [ 367.105549] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 367.112428] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 367.118593] Memory cgroup stats for /ile0: cache:12KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
: [ 367.138136] Tasks state (memory values in pages):
: [ 367.142986] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
: [ 367.151889] [ 23766] 0 23766 17620 8002 126976 0 0 syz-executor3
: [ 367.160946] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
: [ 367.169994] Memory cgroup out of memory: Kill process 23767 (syz-executor2) score 8249000 or sacrifice child
: [ 367.180119] Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB, file-rss:32768kB, shmem-rss:0kB
: [ 367.192101] oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
: [ 367.202986] ------------[ cut here ]------------
: [ 367.207845] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
: [ 367.207965] WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 try_charge+0x734/0x1680
: [ 367.227540] Kernel panic - not syncing: panic_on_warn set ...

This is unexpected though. We have killed a task (23767) which is trying
to charge the memory which means it should
trigger the charge retry and that one should force the charge

/*
* Unlike in global OOM situations, memcg is not in a physical
* memory shortage. Allow dying and OOM-killed tasks to
* bypass the last charges so that they can exit quickly and
* free their memory.
*/
if (unlikely(tsk_is_oom_victim(current) ||
fatal_signal_pending(current) ||
current->flags & PF_EXITING))
goto force;

There doesn't seem to be any other sign of OOM killer invocation which
could then indeed lead to the warning as there is no other task to kill
(both syz-executor[23] have been killed and oom_reaped already). So I
would be curious what happened between 367.180119 which was the last
successful oom invocation and 367.207845. An additional printk in
mem_cgroup_out_of_memory might tell us more.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..852cd3dbdcd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool ret;

mutex_lock(&oom_lock);
+ pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+ current->comm, current->pid, tsk_is_oom_victim(current));
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;

Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
and basically no memory charged by existing tasks is not going to fly
and the warning is exactly to call that out.

If this is some sort of race and we warn too eagerly is still to be
checked and potentially fixed of course.
--
Michal Hocko
SUSE Labs

2018-08-06 09:32:37

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 11:15 AM, Michal Hocko <[email protected]> wrote:
> On Sat 04-08-18 06:33:02, syzbot wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit: d1e0b8e0cb7a Add linux-next specific files for 20180725
>> git tree: linux-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=15a1c770400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=eef3552c897e4d33
>> dashboard link: https://syzkaller.appspot.com/bug?extid=bab151e82a4e973fa325
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: [email protected]
>>
>> Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB,
>> file-rss:32768kB, shmem-rss:0kB
>> oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB,
>> file-rss:32000kB, shmem-rss:0kB
>
> More interesting stuff is higher in the kernel log
> : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
> : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
>
> Are you sure you want to have hard limit set to 0?

syzkaller really does not mind to have it.

> : [ 366.454963] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 366.461787] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 366.467946] Memory cgroup stats for /ile0: cache:12KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
>
> There are only 3 pages charged to this memcg!
>
> : [ 366.487490] Tasks state (memory values in pages):
> : [ 366.492349] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 366.501237] [ 23766] 0 23766 17620 8221 126976 0 0 syz-executor3
> : [ 366.510367] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
> : [ 366.519409] Memory cgroup out of memory: Kill process 23766 (syz-executor3) score 8252000 or sacrifice child
> : [ 366.529422] Killed process 23766 (syz-executor3) total-vm:70480kB, anon-rss:116kB, file-rss:32768kB, shmem-rss:0kB
> : [ 366.540456] oom_reaper: reaped process 23766 (syz-executor3), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
>
> The oom reaper cannot reclaim file backed memory from a large part. I
> assume this is are shared mappings which are living outside of memcg
> because of the counter.
>
> : [...]
> : [ 367.085870] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor2,pid=23767,uid=0
> : [ 367.100073] memory: usage 112kB, limit 0kB, failcnt 1615
> : [ 367.105549] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 367.112428] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 367.118593] Memory cgroup stats for /ile0: cache:12KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
> : [ 367.138136] Tasks state (memory values in pages):
> : [ 367.142986] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 367.151889] [ 23766] 0 23766 17620 8002 126976 0 0 syz-executor3
> : [ 367.160946] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
> : [ 367.169994] Memory cgroup out of memory: Kill process 23767 (syz-executor2) score 8249000 or sacrifice child
> : [ 367.180119] Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB, file-rss:32768kB, shmem-rss:0kB
> : [ 367.192101] oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
> : [ 367.202986] ------------[ cut here ]------------
> : [ 367.207845] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
> : [ 367.207965] WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 try_charge+0x734/0x1680
> : [ 367.227540] Kernel panic - not syncing: panic_on_warn set ...
>
> This is unexpected though. We have killed a task (23767) which is trying
> to charge the memory which means it should
> trigger the charge retry and that one should force the charge
>
> /*
> * Unlike in global OOM situations, memcg is not in a physical
> * memory shortage. Allow dying and OOM-killed tasks to
> * bypass the last charges so that they can exit quickly and
> * free their memory.
> */
> if (unlikely(tsk_is_oom_victim(current) ||
> fatal_signal_pending(current) ||
> current->flags & PF_EXITING))
> goto force;
>
> There doesn't seem to be any other sign of OOM killer invocation which
> could then indeed lead to the warning as there is no other task to kill
> (both syz-executor[23] have been killed and oom_reaped already). So I
> would be curious what happened between 367.180119 which was the last
> successful oom invocation and 367.207845. An additional printk in
> mem_cgroup_out_of_memory might tell us more.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4603ad75c9a9..852cd3dbdcd9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> bool ret;
>
> mutex_lock(&oom_lock);
> + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
> + current->comm, current->pid, tsk_is_oom_victim(current));
> ret = out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> return ret;
>
> Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
> and basically no memory charged by existing tasks is not going to fly
> and the warning is exactly to call that out.


Please-please-please do not mix kernel bugs and notices to user into
the same bucket:

https://lore.kernel.org/patchwork/patch/949071/

2018-08-06 10:46:48

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 11:30:37, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 11:15 AM, Michal Hocko <[email protected]> wrote:
[...]
> > More interesting stuff is higher in the kernel log
> > : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
> > : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
> >
> > Are you sure you want to have hard limit set to 0?
>
> syzkaller really does not mind to have it.

So what do you use it for? What do you actually test by this setting?

[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4603ad75c9a9..852cd3dbdcd9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > bool ret;
> >
> > mutex_lock(&oom_lock);
> > + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
> > + current->comm, current->pid, tsk_is_oom_victim(current));
> > ret = out_of_memory(&oc);
> > mutex_unlock(&oom_lock);
> > return ret;
> >
> > Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
> > and basically no memory charged by existing tasks is not going to fly
> > and the warning is exactly to call that out.
>
>
> Please-please-please do not mix kernel bugs and notices to user into
> the same bucket:

Well, WARN_ON used to be a standard way to make user aware of a
misbehavior. In this case it warns about a pottential runaway when memcg
is misconfigured. I do not insist on using WARN_ON here of course. If
there is a general agreement that such a condition is better handled by
pr_err then I am fine with it. Users tend to be more sensitive on
WARN_ONs though.

Btw. running with the above diff on top might help us to ideantify
whether this is a pre-mature warning or a valid one. Still useful to
find out.
--
Michal Hocko
SUSE Labs

2018-08-06 10:53:38

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 11:30:37, Dmitry Vyukov wrote:
>> On Mon, Aug 6, 2018 at 11:15 AM, Michal Hocko <[email protected]> wrote:
> [...]
>> > More interesting stuff is higher in the kernel log
>> > : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
>> > : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
>> >
>> > Are you sure you want to have hard limit set to 0?
>>
>> syzkaller really does not mind to have it.
>
> So what do you use it for? What do you actually test by this setting?

syzkaller is kernel fuzzer, it finds kernel bugs by doing whatever is
doable from user-space. Some of that may not make sense, but it does
not matter because kernel should still stand still.

> [...]
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 4603ad75c9a9..852cd3dbdcd9 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> > bool ret;
>> >
>> > mutex_lock(&oom_lock);
>> > + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
>> > + current->comm, current->pid, tsk_is_oom_victim(current));
>> > ret = out_of_memory(&oc);
>> > mutex_unlock(&oom_lock);
>> > return ret;
>> >
>> > Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
>> > and basically no memory charged by existing tasks is not going to fly
>> > and the warning is exactly to call that out.
>>
>>
>> Please-please-please do not mix kernel bugs and notices to user into
>> the same bucket:
>
> Well, WARN_ON used to be a standard way to make user aware of a
> misbehavior. In this case it warns about a pottential runaway when memcg
> is misconfigured. I do not insist on using WARN_ON here of course. If
> there is a general agreement that such a condition is better handled by
> pr_err then I am fine with it. Users tend to be more sensitive on
> WARN_ONs though.

The docs change was acked by Greg, and Andrew took it into mm, Linus
was CCed too. It missed the release because I guess it's comments only
change, but otherwise it should reach upstream tree on the next merge
window.

WARN is _not_ a common way to notify users today. syzbot reports _all_
WARN occurrences and you can see there are not many of them now
(probably 1 another now, +dtor for that one):
https://syzkaller.appspot.com#upstream
There is probably some long tail that we need to fix. We really do
want systematic testing capability. You do not want every of 2 billion
linux users to come to you with this kernel splat, just so that you
can explain to them that it's some programs of their machines doing
something wrong, right?

WARN is really a bad way to inform a user about something. Consider a
non-kernel developer, perhaps even non-programmer. What they see is
"WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710
try_charge+0x734/0x1680" followed by some obscure things and hex
numbers. File:line reference is pointless, they don't what what/where
it is. This one is slightly better because it prints "Memory cgroup
charge failed because of no reclaimable memory! This looks like a
misconfiguration or a kernel bug." before the warning. But still it
says "or a kernel bug", which means that they will come to you. A much
friendlier for user way to say this would be print a message at the
point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
trace (does not give any useful info for user). And return EINVAL if
it can't fly at all? And then leave the "or a kernel bug" part for the
WARNING each occurrence of which we do want to be reported to kernel
developers.

2018-08-06 10:53:49

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 11:30:37, Dmitry Vyukov wrote:
>> On Mon, Aug 6, 2018 at 11:15 AM, Michal Hocko <[email protected]> wrote:
> [...]
>> > More interesting stuff is higher in the kernel log
>> > : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
>> > : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
>> >
>> > Are you sure you want to have hard limit set to 0?
>>
>> syzkaller really does not mind to have it.
>
> So what do you use it for? What do you actually test by this setting?
>
> [...]
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 4603ad75c9a9..852cd3dbdcd9 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> > bool ret;
>> >
>> > mutex_lock(&oom_lock);
>> > + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
>> > + current->comm, current->pid, tsk_is_oom_victim(current));
>> > ret = out_of_memory(&oc);
>> > mutex_unlock(&oom_lock);
>> > return ret;
>> >
>> > Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
>> > and basically no memory charged by existing tasks is not going to fly
>> > and the warning is exactly to call that out.
>>
>>
>> Please-please-please do not mix kernel bugs and notices to user into
>> the same bucket:
>
> Well, WARN_ON used to be a standard way to make user aware of a
> misbehavior. In this case it warns about a pottential runaway when memcg
> is misconfigured. I do not insist on using WARN_ON here of course. If
> there is a general agreement that such a condition is better handled by
> pr_err then I am fine with it. Users tend to be more sensitive on
> WARN_ONs though.
>
> Btw. running with the above diff on top might help us to ideantify
> whether this is a pre-mature warning or a valid one. Still useful to
> find out.

The bug report has a reproducer, so you can run it with the patch. Or
ask syzbot to test your patch:
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches
Which basically boils down to saying:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
master

Note that a text patch without a base tree/commit can be useless, I
used torvalds/linux.git but I don't know if it will apply there or
not. Let's see.


Attachments:
cgroup.patch (473.00 B)

2018-08-06 11:01:47

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
[email protected]

Tested on:

commit: 1ffaddd029c8 Linux 4.18-rc8
git tree: upstream
kernel config: https://syzkaller.appspot.com/x/.config?x=3bdb367561cb7285
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=146f9830400000

Note: testing is done by a robot and is best-effort only.

2018-08-06 11:03:56

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 12:34:30, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko <[email protected]> wrote:
> > On Mon 06-08-18 11:30:37, Dmitry Vyukov wrote:
> >> On Mon, Aug 6, 2018 at 11:15 AM, Michal Hocko <[email protected]> wrote:
> > [...]
> >> > More interesting stuff is higher in the kernel log
> >> > : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
> >> > : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
> >> >
> >> > Are you sure you want to have hard limit set to 0?
> >>
> >> syzkaller really does not mind to have it.
> >
> > So what do you use it for? What do you actually test by this setting?
>
> syzkaller is kernel fuzzer, it finds kernel bugs by doing whatever is
> doable from user-space. Some of that may not make sense, but it does
> not matter because kernel should still stand still.

I am not questioning that. What I am saying is that the configuration
doesn't make much sense and the kernel warns about it.

> > [...]
> >> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> > index 4603ad75c9a9..852cd3dbdcd9 100644
> >> > --- a/mm/memcontrol.c
> >> > +++ b/mm/memcontrol.c
> >> > @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >> > bool ret;
> >> >
> >> > mutex_lock(&oom_lock);
> >> > + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
> >> > + current->comm, current->pid, tsk_is_oom_victim(current));
> >> > ret = out_of_memory(&oc);
> >> > mutex_unlock(&oom_lock);
> >> > return ret;
> >> >
> >> > Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
> >> > and basically no memory charged by existing tasks is not going to fly
> >> > and the warning is exactly to call that out.
> >>
> >>
> >> Please-please-please do not mix kernel bugs and notices to user into
> >> the same bucket:
> >
> > Well, WARN_ON used to be a standard way to make user aware of a
> > misbehavior. In this case it warns about a pottential runaway when memcg
> > is misconfigured. I do not insist on using WARN_ON here of course. If
> > there is a general agreement that such a condition is better handled by
> > pr_err then I am fine with it. Users tend to be more sensitive on
> > WARN_ONs though.
>
> The docs change was acked by Greg, and Andrew took it into mm, Linus
> was CCed too. It missed the release because I guess it's comments only
> change, but otherwise it should reach upstream tree on the next merge
> window.
>
> WARN is _not_ a common way to notify users today. syzbot reports _all_
> WARN occurrences and you can see there are not many of them now
> (probably 1 another now, +dtor for that one):
> https://syzkaller.appspot.com#upstream
> There is probably some long tail that we need to fix. We really do
> want systematic testing capability. You do not want every of 2 billion
> linux users to come to you with this kernel splat, just so that you
> can explain to them that it's some programs of their machines doing
> something wrong, right?

[This is an orthogonal discussion I believe]

How does it differ from pr_err though?

> WARN is really a bad way to inform a user about something. Consider a
> non-kernel developer, perhaps even non-programmer. What they see is
> "WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710
> try_charge+0x734/0x1680" followed by some obscure things and hex
> numbers. File:line reference is pointless, they don't what what/where
> it is.

Well, you get a precise location where the problem happens and the
backtrace to see how it happened. This is much more information than,
pr_err without dump_stack.

> This one is slightly better because it prints "Memory cgroup
> charge failed because of no reclaimable memory! This looks like a
> misconfiguration or a kernel bug." before the warning. But still it
> says "or a kernel bug", which means that they will come to you.

Yeah, and that was the purpose of the thing.

> A much
> friendlier for user way to say this would be print a message at the
> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
> trace (does not give any useful info for user). And return EINVAL if
> it can't fly at all? And then leave the "or a kernel bug" part for the
> WARNING each occurrence of which we do want to be reported to kernel
> developers.

But this is not applicable here. Your misconfiguration is quite obvious
because you simply set the hard limit to 0. This is not the only
situation when this can happen. There is no clear point to tell, you are
doing this wrong. If it was we would do it at that point obviously.

If you have a strong reason to believe that this is an abuse of WARN I
am all happy to change that. But I haven't heard any yet, to be honest.
--
Michal Hocko
SUSE Labs

2018-08-06 11:23:01

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 19:47:00, Tetsuo Handa wrote:
> On 2018/08/06 19:39, Dmitry Vyukov wrote:
> > On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko <[email protected]> wrote:
> >> Btw. running with the above diff on top might help us to ideantify
> >> whether this is a pre-mature warning or a valid one. Still useful to
> >> find out.
>
> Since syzbot already found a syz reproducer, you can ask syzbot to test it.
>
> >
> > The bug report has a reproducer, so you can run it with the patch. Or
> > ask syzbot to test your patch:
> > https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches
> > Which basically boils down to saying:
> >
> > #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > master
>
> Excuse me, but this is linux-next only problem. Therefore,

If this really is a linux-next only problem then please retest with the
current linux-next which has dropped the and replaced the group oom
code.

> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4603ad75c9a9..852cd3dbdcd9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> bool ret;
>
> mutex_lock(&oom_lock);
> + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
> + current->comm, current->pid, tsk_is_oom_victim(current));
> ret = out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> return ret;
>
> F.Y.I. Waiting until __mmput() completes (with timeout using OOM score feedback)
> ( https://syzkaller.appspot.com/x/patch.diff?x=101e449c400000 ) solves this race.

Which just means that something else is broken. Seriously, timout is not
going to fix anything. It merely changes the picture.

--
Michal Hocko
SUSE Labs

2018-08-06 11:30:25

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
[email protected]

Tested on:

commit: 8c8399e0a3fb Add linux-next specific files for 20180806
git tree: linux-next
kernel config: https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14fe18e2400000

Note: testing is done by a robot and is best-effort only.

2018-08-06 11:34:22

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 04:27:02, syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch and the reproducer did not trigger
> crash:
>
> Reported-and-tested-by:
> [email protected]
>
> Tested on:
>
> commit: 8c8399e0a3fb Add linux-next specific files for 20180806
> git tree: linux-next
> kernel config: https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> patch: https://syzkaller.appspot.com/x/patch.diff?x=14fe18e2400000
>
> Note: testing is done by a robot and is best-effort only.

OK, so this smells like a problem in the previous group oom changes. Or
maybe it is not very easy to reproduce?

--
Michal Hocko
SUSE Labs

2018-08-06 11:59:58

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
>> >> > More interesting stuff is higher in the kernel log
>> >> > : [ 366.435015] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/ile0,task_memcg=/ile0,task=syz-executor3,pid=23766,uid=0
>> >> > : [ 366.449416] memory: usage 112kB, limit 0kB, failcnt 1605
>> >> >
>> >> > Are you sure you want to have hard limit set to 0?
>> >>
>> >> syzkaller really does not mind to have it.
>> >
>> > So what do you use it for? What do you actually test by this setting?
>>
>> syzkaller is kernel fuzzer, it finds kernel bugs by doing whatever is
>> doable from user-space. Some of that may not make sense, but it does
>> not matter because kernel should still stand still.
>
> I am not questioning that. What I am saying is that the configuration
> doesn't make much sense and the kernel warns about it.
>
>> > [...]
>> >> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> > index 4603ad75c9a9..852cd3dbdcd9 100644
>> >> > --- a/mm/memcontrol.c
>> >> > +++ b/mm/memcontrol.c
>> >> > @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> >> > bool ret;
>> >> >
>> >> > mutex_lock(&oom_lock);
>> >> > + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
>> >> > + current->comm, current->pid, tsk_is_oom_victim(current));
>> >> > ret = out_of_memory(&oc);
>> >> > mutex_unlock(&oom_lock);
>> >> > return ret;
>> >> >
>> >> > Anyway your memcg setup is indeed misconfigured. Memcg with 0 hard limit
>> >> > and basically no memory charged by existing tasks is not going to fly
>> >> > and the warning is exactly to call that out.
>> >>
>> >>
>> >> Please-please-please do not mix kernel bugs and notices to user into
>> >> the same bucket:
>> >
>> > Well, WARN_ON used to be a standard way to make user aware of a
>> > misbehavior. In this case it warns about a pottential runaway when memcg
>> > is misconfigured. I do not insist on using WARN_ON here of course. If
>> > there is a general agreement that such a condition is better handled by
>> > pr_err then I am fine with it. Users tend to be more sensitive on
>> > WARN_ONs though.
>>
>> The docs change was acked by Greg, and Andrew took it into mm, Linus
>> was CCed too. It missed the release because I guess it's comments only
>> change, but otherwise it should reach upstream tree on the next merge
>> window.
>>
>> WARN is _not_ a common way to notify users today. syzbot reports _all_
>> WARN occurrences and you can see there are not many of them now
>> (probably 1 another now, +dtor for that one):
>> https://syzkaller.appspot.com#upstream
>> There is probably some long tail that we need to fix. We really do
>> want systematic testing capability. You do not want every of 2 billion
>> linux users to come to you with this kernel splat, just so that you
>> can explain to them that it's some programs of their machines doing
>> something wrong, right?
>
> [This is an orthogonal discussion I believe]
>
> How does it differ from pr_err though?

pr_err output looks very different, says that it is the user program
that does something wrong, explains what exactly is done wrong and
does not contain traces, offsets of hex numbers that scare end users.
WARN for kernel bugs/pr_err for user separation allows to easily
understand (including by computer programs) if it is a kernel bugs
(something to notify kernel developers about) or not. In particular if
it would be the case here, we would not have this bug report and would
not spent time here.

>> WARN is really a bad way to inform a user about something. Consider a
>> non-kernel developer, perhaps even non-programmer. What they see is
>> "WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710
>> try_charge+0x734/0x1680" followed by some obscure things and hex
>> numbers. File:line reference is pointless, they don't what what/where
>> it is.
>
> Well, you get a precise location where the problem happens and the
> backtrace to see how it happened. This is much more information than,
> pr_err without dump_stack.

This information is important for kernel bugs, but not for invalid
values passed by user programs. For the latter the exact location
where the problem happened is not there, it's in some user program.
For the latter it is more important to explain in readable language
what exactly arguments to what call were wrong.
Say, if you enter a wrong pin in ATM, it says "you entered wrong pin"
and not dumps some internal state in hex.

>> This one is slightly better because it prints "Memory cgroup
>> charge failed because of no reclaimable memory! This looks like a
>> misconfiguration or a kernel bug." before the warning. But still it
>> says "or a kernel bug", which means that they will come to you.
>
> Yeah, and that was the purpose of the thing.

Why? Was it clear how exactly if can happen? Can we refine it now?

>> A much
>> friendlier for user way to say this would be print a message at the
>> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
>> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
>> trace (does not give any useful info for user). And return EINVAL if
>> it can't fly at all? And then leave the "or a kernel bug" part for the
>> WARNING each occurrence of which we do want to be reported to kernel
>> developers.
>
> But this is not applicable here. Your misconfiguration is quite obvious
> because you simply set the hard limit to 0. This is not the only
> situation when this can happen. There is no clear point to tell, you are
> doing this wrong. If it was we would do it at that point obviously.

But, isn't there a point were hard limit is set to 0? I would expect
there is a something like cgroup file write handler with a value of 0
or something.

> If you have a strong reason to believe that this is an abuse of WARN I
> am all happy to change that. But I haven't heard any yet, to be honest.

WARN must not be used for anything that is not kernel bugs. If this is
not kernel bug, WARN must not be used here.

2018-08-06 12:00:08

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 1:32 PM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 04:27:02, syzbot wrote:
>> Hello,
>>
>> syzbot has tested the proposed patch and the reproducer did not trigger
>> crash:
>>
>> Reported-and-tested-by:
>> [email protected]
>>
>> Tested on:
>>
>> commit: 8c8399e0a3fb Add linux-next specific files for 20180806
>> git tree: linux-next
>> kernel config: https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> patch: https://syzkaller.appspot.com/x/patch.diff?x=14fe18e2400000
>>
>> Note: testing is done by a robot and is best-effort only.
>
> OK, so this smells like a problem in the previous group oom changes. Or
> maybe it is not very easy to reproduce?

It's possible to ask syzbot to test on a particular tree/commit too.

2018-08-06 12:53:43

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/06 19:39, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko <[email protected]> wrote:
>> Btw. running with the above diff on top might help us to ideantify
>> whether this is a pre-mature warning or a valid one. Still useful to
>> find out.

Since syzbot already found a syz reproducer, you can ask syzbot to test it.

>
> The bug report has a reproducer, so you can run it with the patch. Or
> ask syzbot to test your patch:
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches
> Which basically boils down to saying:
>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> master

Excuse me, but this is linux-next only problem. Therefore,

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..852cd3dbdcd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool ret;

mutex_lock(&oom_lock);
+ pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+ current->comm, current->pid, tsk_is_oom_victim(current));
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;

F.Y.I. Waiting until __mmput() completes (with timeout using OOM score feedback)
( https://syzkaller.appspot.com/x/patch.diff?x=101e449c400000 ) solves this race.

2018-08-06 14:22:49

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 13:57:38, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
[...]
> >> A much
> >> friendlier for user way to say this would be print a message at the
> >> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
> >> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
> >> trace (does not give any useful info for user). And return EINVAL if
> >> it can't fly at all? And then leave the "or a kernel bug" part for the
> >> WARNING each occurrence of which we do want to be reported to kernel
> >> developers.
> >
> > But this is not applicable here. Your misconfiguration is quite obvious
> > because you simply set the hard limit to 0. This is not the only
> > situation when this can happen. There is no clear point to tell, you are
> > doing this wrong. If it was we would do it at that point obviously.
>
> But, isn't there a point were hard limit is set to 0? I would expect
> there is a something like cgroup file write handler with a value of 0
> or something.

Yeah, but this is only one instance of the problem. Other is that the
memcg is not reclaimable for any other reasons. And we do not know what
those might be

>
> > If you have a strong reason to believe that this is an abuse of WARN I
> > am all happy to change that. But I haven't heard any yet, to be honest.
>
> WARN must not be used for anything that is not kernel bugs. If this is
> not kernel bug, WARN must not be used here.

This is rather strong wording without any backing arguments. I strongly
doubt 90% of existing WARN* match this expectation. WARN* has
traditionally been a way to tell that something suspicious is going on.
Those situation are mostly likely not fatal but it is good to know they
are happening.

Sure there is that panic_on_warn thingy which you seem to be using and I
suspect it is a reason why you are so careful about warnings in general
but my experience tells me that this configuration is barely usable
except for testing (which is your case).

But as I've said, I do not insist on WARN here. All I care about is to
warn user that something might go south and this may be either due to
misconfiguration or a subtly wrong memcg reclaim/OOM handler behavior.
--
Michal Hocko
SUSE Labs

2018-08-06 14:43:25

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

+David Howells

On 2018/08/06 20:32, Michal Hocko wrote:
> On Mon 06-08-18 04:27:02, syzbot wrote:
>> Hello,
>>
>> syzbot has tested the proposed patch and the reproducer did not trigger
>> crash:
>>
>> Reported-and-tested-by:
>> [email protected]
>>
>> Tested on:
>>
>> commit: 8c8399e0a3fb Add linux-next specific files for 20180806
>> git tree: linux-next
>> kernel config: https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> patch: https://syzkaller.appspot.com/x/patch.diff?x=14fe18e2400000
>>
>> Note: testing is done by a robot and is best-effort only.
>
> OK, so this smells like a problem in the previous group oom changes. Or
> maybe it is not very easy to reproduce?
>

Since I can't find mm related changes between next-20180803 (syzbot can reproduce) and
next-20180806 (syzbot has not reproduced), I can't guess what makes this problem go away.

But since this problem did not occur for 3.5 hours on next-20180806 (when this problem
was occurring once per 60-90 minutes), the reproducer might not be working as intended
due to "kernfs, sysfs, cgroup, intel_rdt: Support fs_context" or something...

./kernel/cgroup/cgroup-internal.h | 3
./kernel/cgroup/cgroup-v1.c | 211
./kernel/cgroup/cgroup.c | 81

2018-08-06 15:10:37

by David Howells

[permalink] [raw]
Subject: Re: WARNING in try_charge

Do you have a link to the problem?

David

2018-08-06 15:12:46

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 23:41:22, Tetsuo Handa wrote:
> +David Howells
>
> On 2018/08/06 20:32, Michal Hocko wrote:
> > On Mon 06-08-18 04:27:02, syzbot wrote:
> >> Hello,
> >>
> >> syzbot has tested the proposed patch and the reproducer did not trigger
> >> crash:
> >>
> >> Reported-and-tested-by:
> >> [email protected]
> >>
> >> Tested on:
> >>
> >> commit: 8c8399e0a3fb Add linux-next specific files for 20180806
> >> git tree: linux-next
> >> kernel config: https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
> >> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> >> patch: https://syzkaller.appspot.com/x/patch.diff?x=14fe18e2400000
> >>
> >> Note: testing is done by a robot and is best-effort only.
> >
> > OK, so this smells like a problem in the previous group oom changes. Or
> > maybe it is not very easy to reproduce?
> >
>
> Since I can't find mm related changes between next-20180803 (syzbot can reproduce) and
> next-20180806 (syzbot has not reproduced), I can't guess what makes this problem go away.

Hmm, but original report was against 4.18.0-rc6-next-20180725+ kernel.
And that one had the old group oom code. /me confused.
--
Michal Hocko
SUSE Labs

2018-08-06 15:14:10

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/06 23:54, David Howells wrote:
> Do you have a link to the problem?
>
> David
>

https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ

syzbot found a reproducer, and the reproducer was working until next-20180803.
But the reproducer is failing to reproduce this problem in next-20180806 despite
there is no mm related change between next-20180803 and next-20180806.

Therefore, I suspect that the reproducer is no longer working as intended. And
there was parser change (your patch) between next-20180803 and next-20180806.


2018-08-06 15:29:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 06, 2018 at 04:21:24PM +0200, Michal Hocko wrote:
> On Mon 06-08-18 13:57:38, Dmitry Vyukov wrote:
> > On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
> > > If you have a strong reason to believe that this is an abuse of WARN I
> > > am all happy to change that. But I haven't heard any yet, to be honest.
> >
> > WARN must not be used for anything that is not kernel bugs. If this is
> > not kernel bug, WARN must not be used here.
>
> This is rather strong wording without any backing arguments. I strongly
> doubt 90% of existing WARN* match this expectation. WARN* has
> traditionally been a way to tell that something suspicious is going on.
> Those situation are mostly likely not fatal but it is good to know they
> are happening.

I have to agree with Dmitry here. WARN should indicate a real kernel
issue, not user input that knowingly triggers undesirable behavior in
the kernel. It's our assert() for states we don't think are possible.

I would wager that MOST developers and users understand it that way.

2018-08-06 15:34:04

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

Now, let's test next-20180803 .

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 116b181bb646afedd770985de20a68721bdb2648

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..852cd3dbdcd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool ret;

mutex_lock(&oom_lock);
+ pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+ current->comm, current->pid, tsk_is_oom_victim(current));
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;


2018-08-06 15:43:38

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
WARNING in try_charge

Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB,
file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB,
file-rss:0kB, shmem-rss:0kB
task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
------------[ cut here ]------------
Memory cgroup charge failed because of no reclaimable memory! This looks
like a misconfiguration or a kernel bug.
WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 mem_cgroup_oom
mm/memcontrol.c:1706 [inline]
WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680
mm/memcontrol.c:2264
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 6410 Comm: syz-executor5 Not tainted 4.18.0-rc7-next-20180803+
#1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
panic+0x238/0x4e7 kernel/panic.c:184
__warn.cold.8+0x163/0x1ba kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:996
RIP: 0010:mem_cgroup_oom mm/memcontrol.c:1706 [inline]
RIP: 0010:try_charge+0x734/0x1680 mm/memcontrol.c:2264
Code: 85 b8 04 00 00 8b b5 c8 fd ff ff 44 89 f2 4c 89 ff e8 00 51 ff ff 84
c0 0f 85 31 08 00 00 48 c7 c7 c0 17 13 87 e8 8c fd 85 ff <0f> 0b 48 8d 95
f8 fd ff ff 48 8b b5 c0 fd ff ff 48 b8 00 00 00 00
RSP: 0018:ffff8801be6ef580 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801afab4c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816366f1 RDI: ffff8801be6ef270
RBP: ffff8801be6ef810 R08: ffff8801bcf48140 R09: fffffbfff0ff1238
R10: fffffbfff0ff1238 R11: ffffffff87f891c3 R12: dffffc0000000000
R13: ffff8801be6ef7e8 R14: 0000000000000000 R15: ffff8801afab4c00
mem_cgroup_try_charge+0x4ff/0xa70 mm/memcontrol.c:5916
mem_cgroup_try_charge_delay+0x1d/0x90 mm/memcontrol.c:5931
do_anonymous_page mm/memory.c:3166 [inline]
handle_pte_fault mm/memory.c:3971 [inline]
__handle_mm_fault+0x25be/0x4470 mm/memory.c:4097
handle_mm_fault+0x53e/0xc80 mm/memory.c:4134
__do_page_fault+0x620/0xe50 arch/x86/mm/fault.c:1395
do_page_fault+0xf6/0x8c0 arch/x86/mm/fault.c:1470
page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1164
RIP: 0033:0x40e33f
Code: Bad RIP value.
RSP: 002b:00007ffe221246e0 EFLAGS: 00010206
RAX: 00007fb83d11b000 RBX: 0000000000020000 RCX: 0000000000456b7a
RDX: 0000000000021000 RSI: 0000000000021000 RDI: 0000000000000000
RBP: 00007ffe221247c0 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe221248b0
R13: 00007fb83d13b700 R14: 0000000000000005 R15: 0000000000000001
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


Tested on:

commit: 116b181bb646 Add linux-next specific files for 20180803
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=12447864400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b4f38be7c2c519d5
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14c5b68c400000


2018-08-06 16:04:55

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/07 0:42, syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch but the reproducer still triggered crash:
> WARNING in try_charge
>
> Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB, file-rss:0kB, shmem-rss:0kB
> oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
> ------------[ cut here ]------------
> Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
> WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 mem_cgroup_oom mm/memcontrol.c:1706 [inline]
> WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680 mm/memcontrol.c:2264
> Kernel panic - not syncing: panic_on_warn set ...

Michal, this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .

2018-08-06 16:42:20

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 4:21 PM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 13:57:38, Dmitry Vyukov wrote:
>> On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
> [...]
>> >> A much
>> >> friendlier for user way to say this would be print a message at the
>> >> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
>> >> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
>> >> trace (does not give any useful info for user). And return EINVAL if
>> >> it can't fly at all? And then leave the "or a kernel bug" part for the
>> >> WARNING each occurrence of which we do want to be reported to kernel
>> >> developers.
>> >
>> > But this is not applicable here. Your misconfiguration is quite obvious
>> > because you simply set the hard limit to 0. This is not the only
>> > situation when this can happen. There is no clear point to tell, you are
>> > doing this wrong. If it was we would do it at that point obviously.
>>
>> But, isn't there a point were hard limit is set to 0? I would expect
>> there is a something like cgroup file write handler with a value of 0
>> or something.
>
> Yeah, but this is only one instance of the problem. Other is that the
> memcg is not reclaimable for any other reasons. And we do not know what
> those might be
>
>>
>> > If you have a strong reason to believe that this is an abuse of WARN I
>> > am all happy to change that. But I haven't heard any yet, to be honest.
>>
>> WARN must not be used for anything that is not kernel bugs. If this is
>> not kernel bug, WARN must not be used here.
>
> This is rather strong wording without any backing arguments. I strongly
> doubt 90% of existing WARN* match this expectation. WARN* has
> traditionally been a way to tell that something suspicious is going on.
> Those situation are mostly likely not fatal but it is good to know they
> are happening.

Today syzbot covers about 1M lines of kernel code, and we fuzz for
several years with panic_on_warn=1 and each unique crash is recorded
and reported. Over several thousands bugs that we reported, there were
maybe 2 dozens of such cases (WARN on invalid user inputs, ENOMEM,
etc). The solution always was to remove the WARNING on covert to
pr_err. As of now, I see only 2 such cases open: this one and WARN on
ENOMEM in input subsystem.

Either way, we do badly need this separation. If there are deviations
we need to continue fixing them.

2018-08-06 17:25:54

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 4:21 PM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 13:57:38, Dmitry Vyukov wrote:
>> On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
> [...]
>> >> A much
>> >> friendlier for user way to say this would be print a message at the
>> >> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
>> >> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
>> >> trace (does not give any useful info for user). And return EINVAL if
>> >> it can't fly at all? And then leave the "or a kernel bug" part for the
>> >> WARNING each occurrence of which we do want to be reported to kernel
>> >> developers.
>> >
>> > But this is not applicable here. Your misconfiguration is quite obvious
>> > because you simply set the hard limit to 0. This is not the only
>> > situation when this can happen. There is no clear point to tell, you are
>> > doing this wrong. If it was we would do it at that point obviously.
>>
>> But, isn't there a point were hard limit is set to 0? I would expect
>> there is a something like cgroup file write handler with a value of 0
>> or something.
>
> Yeah, but this is only one instance of the problem. Other is that the
> memcg is not reclaimable for any other reasons. And we do not know what
> those might be
>
>>
>> > If you have a strong reason to believe that this is an abuse of WARN I
>> > am all happy to change that. But I haven't heard any yet, to be honest.
>>
>> WARN must not be used for anything that is not kernel bugs. If this is
>> not kernel bug, WARN must not be used here.
>
> This is rather strong wording without any backing arguments. I strongly
> doubt 90% of existing WARN* match this expectation. WARN* has
> traditionally been a way to tell that something suspicious is going on.
> Those situation are mostly likely not fatal but it is good to know they
> are happening.
>
> Sure there is that panic_on_warn thingy which you seem to be using and I
> suspect it is a reason why you are so careful about warnings in general
> but my experience tells me that this configuration is barely usable
> except for testing (which is your case).
>
> But as I've said, I do not insist on WARN here. All I care about is to
> warn user that something might go south and this may be either due to
> misconfiguration or a subtly wrong memcg reclaim/OOM handler behavior.

I am a bit lost. Can limit=0 legally lead to the warnings? Or there is
also a kernel bug on top of that and it's actually a kernel bug that
provokes the warning?
If it's a kernel bug, then I propose to stop arguing about
configuration and concentrate on the bug.
If it's just the misconfiguration that triggers the warning, then can
we separate the 2 causes of the warning (user misconfiguration and
kernel bugs)? Say, return EINVAL when mem limit is set to 0 (and print
a line to console if necessary)? Or if the limit=0 is somehow not
possible/desirable to detect right away, check limit=0 at the point of
the warning and don't want?

2018-08-06 17:40:16

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/06 23:58, Michal Hocko wrote:
>> Since I can't find mm related changes between next-20180803 (syzbot can reproduce) and
>> next-20180806 (syzbot has not reproduced), I can't guess what makes this problem go away.
>
> Hmm, but original report was against 4.18.0-rc6-next-20180725+ kernel.
> And that one had the old group oom code. /me confused.
>

Yes. But I confirmed that syzbot can reproduce this problem with next-20180803
which already dropped the old group oom code. Therefore, I think that syzbot is
hitting a problem other than the old group oom code.

2018-08-06 18:22:55

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 16:58:01, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 4:21 PM, Michal Hocko <[email protected]> wrote:
> > On Mon 06-08-18 13:57:38, Dmitry Vyukov wrote:
> >> On Mon, Aug 6, 2018 at 1:02 PM, Michal Hocko <[email protected]> wrote:
> > [...]
> >> >> A much
> >> >> friendlier for user way to say this would be print a message at the
> >> >> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
> >> >> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
> >> >> trace (does not give any useful info for user). And return EINVAL if
> >> >> it can't fly at all? And then leave the "or a kernel bug" part for the
> >> >> WARNING each occurrence of which we do want to be reported to kernel
> >> >> developers.
> >> >
> >> > But this is not applicable here. Your misconfiguration is quite obvious
> >> > because you simply set the hard limit to 0. This is not the only
> >> > situation when this can happen. There is no clear point to tell, you are
> >> > doing this wrong. If it was we would do it at that point obviously.
> >>
> >> But, isn't there a point were hard limit is set to 0? I would expect
> >> there is a something like cgroup file write handler with a value of 0
> >> or something.
> >
> > Yeah, but this is only one instance of the problem. Other is that the
> > memcg is not reclaimable for any other reasons. And we do not know what
> > those might be
> >
> >>
> >> > If you have a strong reason to believe that this is an abuse of WARN I
> >> > am all happy to change that. But I haven't heard any yet, to be honest.
> >>
> >> WARN must not be used for anything that is not kernel bugs. If this is
> >> not kernel bug, WARN must not be used here.
> >
> > This is rather strong wording without any backing arguments. I strongly
> > doubt 90% of existing WARN* match this expectation. WARN* has
> > traditionally been a way to tell that something suspicious is going on.
> > Those situation are mostly likely not fatal but it is good to know they
> > are happening.
> >
> > Sure there is that panic_on_warn thingy which you seem to be using and I
> > suspect it is a reason why you are so careful about warnings in general
> > but my experience tells me that this configuration is barely usable
> > except for testing (which is your case).
> >
> > But as I've said, I do not insist on WARN here. All I care about is to
> > warn user that something might go south and this may be either due to
> > misconfiguration or a subtly wrong memcg reclaim/OOM handler behavior.
>
> I am a bit lost. Can limit=0 legally lead to the warnings? Or there is
> also a kernel bug on top of that and it's actually a kernel bug that
> provokes the warning?

As I've tried to tell already. I cannot tell for sure. It is the killed
oom victim which triggered thw warning and that shouldn't really
happen. Considering this doesn't reproduce with the current linux next
nor linus tree and the oom code has changed since the version you have
tested then I would suspect there was something wrong with the memcg oom
code. But maybe the test doesn't really reproduce reliably.

> If it's a kernel bug, then I propose to stop arguing about
> configuration and concentrate on the bug.
> If it's just the misconfiguration that triggers the warning, then can
> we separate the 2 causes of the warning (user misconfiguration and
> kernel bugs)? Say, return EINVAL when mem limit is set to 0 (and print
> a line to console if necessary)? Or if the limit=0 is somehow not
> possible/desirable to detect right away, check limit=0 at the point of
> the warning and don't want?

No we simply cannot. There is numerous situations when this can trigger.
Say you set the hard limit to N and then try to fault in shmem file with
the size >= N. No oom killer will help to reclaim memory. Or say you
migrate the all tasks away from the memcg and then somebody triggers the
memcg OOM in that group. There is simply nobody to kill. See the point?
There is simply no direct contection between the configuration and
actual problem. Too many things might happen between those two points.
Let me repeat. We do warn because we want to hear if this happens. WARN
tends to be a good way to get that attention. If you strongly believe
this is an abuse I won't mind seeing a patch to turn it into something
different.

--
Michal Hocko
SUSE Labs

2018-08-06 18:28:19

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 7:44 PM, Michal Hocko <[email protected]> wrote:
> On Mon 06-08-18 08:42:02, syzbot wrote:
>> Hello,
>>
>> syzbot has tested the proposed patch but the reproducer still triggered
>> crash:
>> WARNING in try_charge
>>
>> Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB,
>> file-rss:0kB, shmem-rss:0kB
>> oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB,
>> file-rss:0kB, shmem-rss:0kB
>> task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
>
> Thank you. This is useful. The full oom picture is this
> : [ 65.363983] task=syz-executor5 pid=6415 invoked memcg oom killer. oom_victim=0
> [...]
> : [ 65.920355] Task in /ile0 killed as a result of limit of /ile0
> : [ 65.926389] memory: usage 0kB, limit 0kB, failcnt 20
> : [ 65.931518] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 65.938296] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 65.944467] Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
> : [ 65.963878] Tasks state (memory values in pages):
> : [ 65.968743] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 65.977615] [ 6410] 0 6410 9427 532 61440 0 0 syz-executor5
> : [ 65.986647] Memory cgroup out of memory: Kill process 6410 (syz-executor5) score 547000 or sacrifice child
> : [ 65.996474] Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB, file-rss:0kB, shmem-rss:0kB
> : [ 66.007471] oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> : [ 66.017652] task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
> : [ 66.025137] ------------[ cut here ]------------
> : [ 66.029927] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
> : [ 66.030061] WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680
>
> So we have only a single task in the memcg and it is this task which
> triggers the OOM. It gets killed and oom_reaped. This means that
> out_of_memory should return with true and so we should retry and force
> the charge as I've already mentioned. For some reason this task has
> triggered the oom killer path again and then we haven't found any
> eligible task and resulted in the warning. This shouldn't happen.
>
> I will stare to the code some more to see how the heck we get there
> without passing
> if (unlikely(tsk_is_oom_victim(current) ||
> fatal_signal_pending(current) ||
> current->flags & PF_EXITING))
> goto force;


This is the syzkaller reproducer if it will give you any hints:

mkdir(&(0x7f0000000340)='./file0\x00', 0x0)
mount(&(0x7f00000000c0)='./file0//ile0\x00',
&(0x7f0000000080)='./file0\x00', &(0x7f0000000200)='cgroup2\x00', 0x0,
0x0)
r0 = open(&(0x7f0000000040)='./file0//ile0\x00', 0x0, 0x0)
r1 = openat$cgroup_procs(r0, &(0x7f00000001c0)='cgroup.procs\x00', 0x2, 0x0)
write$cgroup_pid(r1, &(0x7f0000000100), 0x12)
syz_mount_image$xfs(&(0x7f0000000280)='xfs\x00',
&(0x7f00000002c0)='./file0//ile0\x00', 0x0, 0x0, &(0x7f0000000740),
0x0, &(0x7f0000000800))
getsockopt$sock_cred(r0, 0x1, 0x11, &(0x7f0000000180), &(0x7f0000000240)=0xc)
r2 = open(&(0x7f0000000040)='./file0//ile0\x00', 0x0, 0x0)
write$FUSE_STATFS(r2, &(0x7f0000000280)={0x60, 0xfffffffffffffffe,
0x7, {{0x401, 0x7, 0x7d, 0x1c, 0xd8, 0xb7, 0x2, 0x3}}}, 0x60)
r3 = openat$cgroup_int(r2, &(0x7f00000001c0)='memory.max\x00', 0x2, 0x0)
write$cgroup_int(r3, &(0x7f00000000c0), 0x12)

2018-08-06 18:28:43

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 7:30 PM, Michal Hocko <[email protected]> wrote:
>> >> >> A much
>> >> >> friendlier for user way to say this would be print a message at the
>> >> >> point of misconfiguration saying what exactly is wrong, e.g. "pid $PID
>> >> >> misconfigures cgroup /cgroup/path with mem.limit=0" without a stack
>> >> >> trace (does not give any useful info for user). And return EINVAL if
>> >> >> it can't fly at all? And then leave the "or a kernel bug" part for the
>> >> >> WARNING each occurrence of which we do want to be reported to kernel
>> >> >> developers.
>> >> >
>> >> > But this is not applicable here. Your misconfiguration is quite obvious
>> >> > because you simply set the hard limit to 0. This is not the only
>> >> > situation when this can happen. There is no clear point to tell, you are
>> >> > doing this wrong. If it was we would do it at that point obviously.
>> >>
>> >> But, isn't there a point were hard limit is set to 0? I would expect
>> >> there is a something like cgroup file write handler with a value of 0
>> >> or something.
>> >
>> > Yeah, but this is only one instance of the problem. Other is that the
>> > memcg is not reclaimable for any other reasons. And we do not know what
>> > those might be
>> >
>> >>
>> >> > If you have a strong reason to believe that this is an abuse of WARN I
>> >> > am all happy to change that. But I haven't heard any yet, to be honest.
>> >>
>> >> WARN must not be used for anything that is not kernel bugs. If this is
>> >> not kernel bug, WARN must not be used here.
>> >
>> > This is rather strong wording without any backing arguments. I strongly
>> > doubt 90% of existing WARN* match this expectation. WARN* has
>> > traditionally been a way to tell that something suspicious is going on.
>> > Those situation are mostly likely not fatal but it is good to know they
>> > are happening.
>> >
>> > Sure there is that panic_on_warn thingy which you seem to be using and I
>> > suspect it is a reason why you are so careful about warnings in general
>> > but my experience tells me that this configuration is barely usable
>> > except for testing (which is your case).
>> >
>> > But as I've said, I do not insist on WARN here. All I care about is to
>> > warn user that something might go south and this may be either due to
>> > misconfiguration or a subtly wrong memcg reclaim/OOM handler behavior.
>>
>> I am a bit lost. Can limit=0 legally lead to the warnings? Or there is
>> also a kernel bug on top of that and it's actually a kernel bug that
>> provokes the warning?
>
> As I've tried to tell already. I cannot tell for sure. It is the killed
> oom victim which triggered thw warning and that shouldn't really
> happen. Considering this doesn't reproduce with the current linux next
> nor linus tree and the oom code has changed since the version you have
> tested then I would suspect there was something wrong with the memcg oom
> code. But maybe the test doesn't really reproduce reliably.
>
>> If it's a kernel bug, then I propose to stop arguing about
>> configuration and concentrate on the bug.
>> If it's just the misconfiguration that triggers the warning, then can
>> we separate the 2 causes of the warning (user misconfiguration and
>> kernel bugs)? Say, return EINVAL when mem limit is set to 0 (and print
>> a line to console if necessary)? Or if the limit=0 is somehow not
>> possible/desirable to detect right away, check limit=0 at the point of
>> the warning and don't want?
>
> No we simply cannot. There is numerous situations when this can trigger.
> Say you set the hard limit to N and then try to fault in shmem file with
> the size >= N. No oom killer will help to reclaim memory. Or say you
> migrate the all tasks away from the memcg and then somebody triggers the
> memcg OOM in that group. There is simply nobody to kill. See the point?
> There is simply no direct contection between the configuration and
> actual problem. Too many things might happen between those two points.
> Let me repeat. We do warn because we want to hear if this happens. WARN
> tends to be a good way to get that attention. If you strongly believe
> this is an abuse I won't mind seeing a patch to turn it into something
> different.

I don't believe it is an abuse, I don't know this code well. Let's
assume the misconfiguration is a red-herring for now then.

2018-08-06 18:37:15

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

I simply do not see how this is possible. Let's try with the following
extended debugging patch.

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 116b181bb646afedd770985de20a68721bdb2648

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..e2dfdf4361ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool ret;

mutex_lock(&oom_lock);
+ pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+ current->comm, current->pid, tsk_is_oom_victim(current));
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
@@ -2108,6 +2110,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,

if (mem_cgroup_is_root(memcg))
return 0;
+ if (tsk_is_oom_victim(current))
+ pr_info("task=%s pid=%d charge for nr_pages=%d\n",
+ current->comm, current->pid, nr_pages);
retry:
if (consume_stock(memcg, nr_pages))
return 0;
@@ -2137,8 +2142,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
if (unlikely(tsk_is_oom_victim(current) ||
fatal_signal_pending(current) ||
- current->flags & PF_EXITING))
+ current->flags & PF_EXITING)) {
+ pr_info("task=%s pid=%d charge bypass\n",
+ current->comm, current->pid);
goto force;
+ }

/*
* Prevent unbounded recursion when reclaim operations need to
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..7d9adcde8cf6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -692,6 +692,8 @@ static void mark_oom_victim(struct task_struct *tsk)
__thaw_task(tsk);
atomic_inc(&oom_victims);
trace_mark_victim(tsk->pid);
+ pr_info("task=%s pid=%d is oom victim now\n",
+ current->comm, current->pid);
}

/**
--
Michal Hocko
SUSE Labs

2018-08-06 18:57:02

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

The debugging patch was wrong but I guess I see it finally.
It's a race

: [ 72.901666] Memory cgroup out of memory: Kill process 6584 (syz-executor1) score 550000 or sacrifice child
: [ 72.917037] Killed process 6584 (syz-executor1) total-vm:37704kB, anon-rss:2140kB, file-rss:0kB, shmem-rss:0kB
: [ 72.927256] task=syz-executor5 pid=6581 charge bypass
: [ 72.928046] oom_reaper: reaped process 6584 (syz-executor1), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
: [ 72.932818] task=syz-executor6 pid=6576 invoked memcg oom killer. oom_victim=1
: [ 72.942790] task=syz-executor5 pid=6581 charge for nr_pages=1
: [ 72.949769] syz-executor6 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
: [ 72.955606] task=syz-executor5 pid=6581 charge bypass
: [ 72.967394] syz-executor6 cpuset=/ mems_allowed=0
: [ 72.973175] task=syz-executor5 pid=6581 charge for nr_pages=1
: [...]
: [ 73.534865] Task in /ile0 killed as a result of limit of /ile0
: [ 73.540865] memory: usage 76kB, limit 0kB, failcnt 260
: [ 73.546142] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 73.552898] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 73.559051] Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
: [ 73.578533] Tasks state (memory values in pages):
: [ 73.583404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
: [ 73.592277] [ 6569] 0 6562 9427 1 53248 0 0 syz-executor0
: [ 73.601299] [ 6576] 0 6576 9426 0 61440 0 0 syz-executor6
: [ 73.610333] [ 6578] 0 6578 9426 534 61440 0 0 syz-executor4
: [ 73.619381] [ 6579] 0 6579 9426 0 57344 0 0 syz-executor5
: [ 73.628414] [ 6582] 0 6582 9426 0 61440 0 0 syz-executor7
: [ 73.637441] [ 6584] 0 6584 9426 0 57344 0 0 syz-executor1
: [ 73.646464] Memory cgroup out of memory: Kill process 6578 (syz-executor4) score 549000 or sacrifice child
: [ 73.656295] task=syz-executor6 pid=6576 is oom victim now

This should be 6578 but we at least know that we are running in 6576
context so the we are setting the state from a remote context which
itself has been killed already

: [ 73.661841] Killed process 6578 (syz-executor4) total-vm:37704kB, anon-rss:2136kB, file-rss:0kB, shmem-rss:0kB
: [ 73.672035] task=syz-executor6 pid=6576 charge bypass
: [ 73.672801] oom_reaper: reaped process 6578 (syz-executor4), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
: [ 73.678829] task=syz-executor4 pid=6578 invoked memcg oom killer. oom_victim=1

and here the victim finally reached the oom path finally.

: [ 73.687453] task=syz-executor6 pid=6576 charge for nr_pages=1
: [ 73.694534] ------------[ cut here ]------------
: [ 73.700424] task=syz-executor6 pid=6576 charge bypass
: [ 73.705175] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
: [ 73.705321] WARNING: CPU: 1 PID: 6578 at mm/memcontrol.c:1707 try_charge+0xafa/0x1710

But there is nobody killable. So the oom kill happened _after_ our force
charge path. Therefore we should do the following regardless whether we
make tis warn or pr_$foo

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 116b181bb646afedd770985de20a68721bdb2648

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..1b6eed1bc404 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
return OOM_ASYNC;
}

- if (mem_cgroup_out_of_memory(memcg, mask, order))
+ if (mem_cgroup_out_of_memory(memcg, mask, order) ||
+ tsk_is_oom_victim(current))
return OOM_SUCCESS;

WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
--
Michal Hocko
SUSE Labs

2018-08-06 19:32:32

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 19:44:10, Michal Hocko wrote:
> On Mon 06-08-18 08:42:02, syzbot wrote:
> > Hello,
> >
> > syzbot has tested the proposed patch but the reproducer still triggered
> > crash:
> > WARNING in try_charge
> >
> > Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB,
> > file-rss:0kB, shmem-rss:0kB
> > oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB,
> > file-rss:0kB, shmem-rss:0kB
> > task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
>
> Thank you. This is useful. The full oom picture is this
> : [ 65.363983] task=syz-executor5 pid=6415 invoked memcg oom killer. oom_victim=0
> [...]
> : [ 65.920355] Task in /ile0 killed as a result of limit of /ile0
> : [ 65.926389] memory: usage 0kB, limit 0kB, failcnt 20
> : [ 65.931518] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 65.938296] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 65.944467] Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
> : [ 65.963878] Tasks state (memory values in pages):
> : [ 65.968743] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 65.977615] [ 6410] 0 6410 9427 532 61440 0 0 syz-executor5
> : [ 65.986647] Memory cgroup out of memory: Kill process 6410 (syz-executor5) score 547000 or sacrifice child
> : [ 65.996474] Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB, file-rss:0kB, shmem-rss:0kB
> : [ 66.007471] oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> : [ 66.017652] task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
> : [ 66.025137] ------------[ cut here ]------------
> : [ 66.029927] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
> : [ 66.030061] WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680
>
> So we have only a single task in the memcg and it is this task which
> triggers the OOM. It gets killed and oom_reaped. This means that
> out_of_memory should return with true and so we should retry and force
> the charge as I've already mentioned. For some reason this task has
> triggered the oom killer path again and then we haven't found any
> eligible task and resulted in the warning. This shouldn't happen.
>
> I will stare to the code some more to see how the heck we get there
> without passing
> if (unlikely(tsk_is_oom_victim(current) ||
> fatal_signal_pending(current) ||
> current->flags & PF_EXITING))
> goto force;

Hmm, so while the OOM killer was invoked from
[ 65.405905] Call Trace:
[ 65.408498] dump_stack+0x1c9/0x2b4
[ 65.421606] dump_header+0x27b/0xf70
[ 65.545094] oom_kill_process.cold.28+0x10/0x95a
[ 65.605696] out_of_memory+0xa8a/0x14d0
[ 65.627227] mem_cgroup_out_of_memory+0x213/0x300
[ 65.641293] try_charge+0x720/0x1680
[ 65.674806] memcg_kmem_charge_memcg+0x7c/0x120
[ 65.687939] cache_grow_begin+0x207/0x710
[ 65.696553] fallback_alloc+0x203/0x2c0
[ 65.700519] ____cache_alloc_node+0x1c7/0x1e0
[ 65.704999] kmem_cache_alloc+0x1e5/0x760
[ 65.717947] shmem_alloc_inode+0x1b/0x40
[ 65.722003] alloc_inode+0x63/0x190
[ 65.725642] new_inode_pseudo+0x71/0x1a0
[ 65.738077] new_inode+0x1c/0x40
[ 65.741432] shmem_get_inode+0xf1/0x910
[ 65.771550] __shmem_file_setup.part.48+0x83/0x2a0
[ 65.776482] shmem_file_setup+0x65/0x90
[ 65.780444] __x64_sys_memfd_create+0x2af/0x4f0

The warning happened from a different path
[ 66.151455] RIP: 0010:try_charge+0x734/0x1680
[...]
[ 66.270886] mem_cgroup_try_charge+0x4ff/0xa70
[ 66.305602] mem_cgroup_try_charge_delay+0x1d/0x90
[ 66.310514] __handle_mm_fault+0x25be/0x4470
[ 66.366608] handle_mm_fault+0x53e/0xc80
[ 66.402384] do_page_fault+0xf6/0x8c0
[ 66.451629] page_fault+0x1e/0x30

So the oom victim indeed passed the above force path after the oom
invocation. But later on hit the page fault path and that behaved
differently and for some reason the force path hasn't triggered. I am
wondering how could we hit the page fault path in the first place. The
task is already killed! So what the hell is going on here.

I must be missing something obvious here.
--
Michal Hocko
SUSE Labs

2018-08-06 19:35:00

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
WARNING in try_charge

task=syz-executor4 pid=6578 invoked memcg oom killer. oom_victim=1
task=syz-executor6 pid=6576 charge for nr_pages=1
------------[ cut here ]------------
task=syz-executor6 pid=6576 charge bypass
Memory cgroup charge failed because of no reclaimable memory! This looks
like a misconfiguration or a kernel bug.
WARNING: CPU: 1 PID: 6578 at mm/memcontrol.c:1707 mem_cgroup_oom
mm/memcontrol.c:1706 [inline]
WARNING: CPU: 1 PID: 6578 at mm/memcontrol.c:1707 try_charge+0xafa/0x1710
mm/memcontrol.c:2270
task=syz-executor6 pid=6576 charge for nr_pages=1
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 6578 Comm: syz-executor4 Not tainted 4.18.0-rc7-next-20180803+
#1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
task=syz-executor6 pid=6576 charge bypass
panic+0x238/0x4e7 kernel/panic.c:184
task=syz-executor6 pid=6576 charge for nr_pages=1
__warn.cold.8+0x163/0x1ba kernel/panic.c:536
task=syz-executor6 pid=6576 charge bypass
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
task=syz-executor6 pid=6576 charge for nr_pages=1
task=syz-executor6 pid=6576 charge bypass
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:996
RIP: 0010:mem_cgroup_oom mm/memcontrol.c:1706 [inline]
RIP: 0010:try_charge+0xafa/0x1710 mm/memcontrol.c:2270
task=syz-executor6 pid=6576 charge for nr_pages=1
Code: 85 4a 01 00 00 8b b5 bc fd ff ff 44 89 f2 4c 89 ff e8 3a 4d ff ff 84
c0 0f 85 9b 04 00 00 48 c7 c7 a0 18 13 87 e8 86 f9 85 ff <0f> 0b 48 8b 95
d0 fd ff ff 48 8b b5 c0 fd ff ff 48 b8 00 00 00 00
RSP: 0018:ffff8801af5bf458 EFLAGS: 00010282
RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816366f1 RDI: ffff8801af5bf148
task=syz-executor6 pid=6576 charge bypass
RBP: ffff8801af5bf6f0 R08: ffff8801aeae2540 R09: fffffbfff0ff11fc
R10: fffffbfff0ff11fc R11: ffffffff87f88fe3 R12: ffff8801b6bd4800
R13: ffff8801af5bf6c8 R14: 0000000000000000 R15: ffff8801b6bd4800
task=syz-executor5 pid=6579 invoked memcg oom killer. oom_victim=1
------------[ cut here ]------------
Memory cgroup charge failed because of no reclaimable memory! This looks
like a misconfiguration or a kernel bug.
memcg_kmem_charge_memcg+0x7c/0x120 mm/memcontrol.c:2600
WARNING: CPU: 0 PID: 6579 at mm/memcontrol.c:1707 mem_cgroup_oom
mm/memcontrol.c:1706 [inline]
WARNING: CPU: 0 PID: 6579 at mm/memcontrol.c:1707 try_charge+0xafa/0x1710
mm/memcontrol.c:2270
memcg_charge_slab mm/slab.h:283 [inline]
kmem_getpages mm/slab.c:1415 [inline]
cache_grow_begin+0x207/0x710 mm/slab.c:2677
Modules linked in:
fallback_alloc+0x203/0x2c0 mm/slab.c:3219
CPU: 0 PID: 6579 Comm: syz-executor5 Not tainted 4.18.0-rc7-next-20180803+
#1
____cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
__do_cache_alloc mm/slab.c:3356 [inline]
slab_alloc mm/slab.c:3384 [inline]
kmem_cache_alloc+0x1e5/0x760 mm/slab.c:3552
RIP: 0010:mem_cgroup_oom mm/memcontrol.c:1706 [inline]
RIP: 0010:try_charge+0xafa/0x1710 mm/memcontrol.c:2270
anon_vma_chain_alloc mm/rmap.c:129 [inline]
__anon_vma_prepare+0xc4/0x720 mm/rmap.c:183
Code: 85 4a 01 00 00 8b b5 bc fd ff ff 44 89 f2 4c 89 ff e8 3a 4d ff ff 84
c0 0f 85 9b 04 00 00 48 c7 c7 a0 18 13 87 e8 86 f9 85 ff <0f> 0b 48 8b 95
d0 fd ff ff 48 8b b5 c0 fd ff ff 48 b8 00 00 00 00
RSP: 0018:ffff8801c82f7458 EFLAGS: 00010282
RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816366f1 RDI: ffff8801c82f7148
RBP: ffff8801c82f76f0 R08: ffff8801ae6905c0 R09: fffffbfff0ff11fc
R10: fffffbfff0ff11fc R11: ffffffff87f88fe3 R12: ffff8801b6bd4800
anon_vma_prepare include/linux/rmap.h:153 [inline]
do_anonymous_page mm/memory.c:3160 [inline]
handle_pte_fault mm/memory.c:3971 [inline]
__handle_mm_fault+0x3556/0x4470 mm/memory.c:4097
R13: ffff8801c82f76c8 R14: 0000000000000000 R15: ffff8801b6bd4800
FS: 0000000000b19940(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001dc8978 CR3: 00000001c83bc000 CR4: 00000000001406f0
Call Trace:
memcg_kmem_charge_memcg+0x7c/0x120 mm/memcontrol.c:2600
memcg_charge_slab mm/slab.h:283 [inline]
kmem_getpages mm/slab.c:1415 [inline]
cache_grow_begin+0x207/0x710 mm/slab.c:2677
handle_mm_fault+0x53e/0xc80 mm/memory.c:4134
fallback_alloc+0x203/0x2c0 mm/slab.c:3219
____cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
__do_page_fault+0x620/0xe50 arch/x86/mm/fault.c:1395
__do_cache_alloc mm/slab.c:3356 [inline]
slab_alloc mm/slab.c:3384 [inline]
kmem_cache_alloc+0x1e5/0x760 mm/slab.c:3552
anon_vma_chain_alloc mm/rmap.c:129 [inline]
__anon_vma_prepare+0xc4/0x720 mm/rmap.c:183
do_page_fault+0xf6/0x8c0 arch/x86/mm/fault.c:1470
anon_vma_prepare include/linux/rmap.h:153 [inline]
do_anonymous_page mm/memory.c:3160 [inline]
handle_pte_fault mm/memory.c:3971 [inline]
__handle_mm_fault+0x3556/0x4470 mm/memory.c:4097
page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1164
RIP: 0033:0x40e33f
Code: Bad RIP value.
RSP: 002b:00007ffc7e6c6590 EFLAGS: 00010206
RAX: 00007ff400550000 RBX: 0000000000020000 RCX: 0000000000456b7a
RDX: 0000000000021000 RSI: 0000000000021000 RDI: 0000000000000000
handle_mm_fault+0x53e/0xc80 mm/memory.c:4134
RBP: 00007ffc7e6c6670 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc7e6c6760
R13: 00007ff400570700 R14: 0000000000000005 R15: 0000000000000001
__do_page_fault+0x620/0xe50 arch/x86/mm/fault.c:1395
do_page_fault+0xf6/0x8c0 arch/x86/mm/fault.c:1470
page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1164
RIP: 0033:0x40e33f
Code: Bad RIP value.
RSP: 002b:00007ffc29506db0 EFLAGS: 00010206
RAX: 00007f755194e000 RBX: 0000000000020000 RCX: 0000000000456b7a
RDX: 0000000000021000 RSI: 0000000000021000 RDI: 0000000000000000
RBP: 00007ffc29506e90 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc29506f80
R13: 00007f755196e700 R14: 0000000000000005 R15: 0000000000000001
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] (null)
hardirqs last disabled at (0): [<ffffffff8146af61>]
copy_process.part.37+0x1911/0x7240 kernel/fork.c:1781
softirqs last enabled at (0): [<ffffffff8146b002>]
copy_process.part.37+0x19b2/0x7240 kernel/fork.c:1784
softirqs last disabled at (0): [<0000000000000000>] (null)
---[ end trace 0466ae1ce671f8c4 ]---
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


Tested on:

commit: 116b181bb646 Add linux-next specific files for 20180803
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=14dfe2ac400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b4f38be7c2c519d5
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=10fccee8400000


2018-08-06 19:37:26

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 20:13:39, Michal Hocko wrote:
> I simply do not see how this is possible. Let's try with the following
> extended debugging patch.
>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 116b181bb646afedd770985de20a68721bdb2648
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4603ad75c9a9..e2dfdf4361ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> bool ret;
>
> mutex_lock(&oom_lock);
> + pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
> + current->comm, current->pid, tsk_is_oom_victim(current));
> ret = out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> return ret;
> @@ -2108,6 +2110,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> if (mem_cgroup_is_root(memcg))
> return 0;
> + if (tsk_is_oom_victim(current))
> + pr_info("task=%s pid=%d charge for nr_pages=%d\n",
> + current->comm, current->pid, nr_pages);
> retry:
> if (consume_stock(memcg, nr_pages))
> return 0;
> @@ -2137,8 +2142,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> if (unlikely(tsk_is_oom_victim(current) ||
> fatal_signal_pending(current) ||
> - current->flags & PF_EXITING))
> + current->flags & PF_EXITING)) {
> + pr_info("task=%s pid=%d charge bypass\n",
> + current->comm, current->pid);
> goto force;
> + }
>
> /*
> * Prevent unbounded recursion when reclaim operations need to
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..7d9adcde8cf6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -692,6 +692,8 @@ static void mark_oom_victim(struct task_struct *tsk)
> __thaw_task(tsk);
> atomic_inc(&oom_victims);
> trace_mark_victim(tsk->pid);
> + pr_info("task=%s pid=%d is oom victim now\n",
> + current->comm, current->pid);
> }

Ble this should be s@current@tsk@g. But it didn't really help. I will
have to think about this more.

--
Michal Hocko
SUSE Labs

2018-08-06 19:51:20

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 08:42:02, syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch but the reproducer still triggered
> crash:
> WARNING in try_charge
>
> Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB,
> file-rss:0kB, shmem-rss:0kB
> oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB,
> file-rss:0kB, shmem-rss:0kB
> task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1

Thank you. This is useful. The full oom picture is this
: [ 65.363983] task=syz-executor5 pid=6415 invoked memcg oom killer. oom_victim=0
[...]
: [ 65.920355] Task in /ile0 killed as a result of limit of /ile0
: [ 65.926389] memory: usage 0kB, limit 0kB, failcnt 20
: [ 65.931518] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 65.938296] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
: [ 65.944467] Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
: [ 65.963878] Tasks state (memory values in pages):
: [ 65.968743] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
: [ 65.977615] [ 6410] 0 6410 9427 532 61440 0 0 syz-executor5
: [ 65.986647] Memory cgroup out of memory: Kill process 6410 (syz-executor5) score 547000 or sacrifice child
: [ 65.996474] Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB, file-rss:0kB, shmem-rss:0kB
: [ 66.007471] oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
: [ 66.017652] task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
: [ 66.025137] ------------[ cut here ]------------
: [ 66.029927] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
: [ 66.030061] WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680

So we have only a single task in the memcg and it is this task which
triggers the OOM. It gets killed and oom_reaped. This means that
out_of_memory should return with true and so we should retry and force
the charge as I've already mentioned. For some reason this task has
triggered the oom killer path again and then we haven't found any
eligible task and resulted in the warning. This shouldn't happen.

I will stare to the code some more to see how the heck we get there
without passing
if (unlikely(tsk_is_oom_victim(current) ||
fatal_signal_pending(current) ||
current->flags & PF_EXITING))
goto force;
--
Michal Hocko
SUSE Labs

2018-08-06 20:38:43

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

[CCing Greg - the email thread starts here
http://lkml.kernel.org/r/[email protected]]

On Mon 06-08-18 12:12:02, syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch and the reproducer did not trigger
> crash:

OK, this is reassuring. Btw Greg has pointed out this potential case
http://lkml.kernel.org/r/[email protected]
but I simply didn't get what he meant. He was suggesting MMF_OOM_SKIP
but I didn't get why that matters. I didn't think about a race.

So how about this patch:
From 74d980f8d066d06ada657ebf9b586dbf5668ed26 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 6 Aug 2018 21:21:24 +0200
Subject: [PATCH] memcg, oom: be careful about races when warning about no
reclaimable task

"memcg, oom: move out_of_memory back to the charge path" has added a
warning triggered when the oom killer cannot find any eligible task
and so there is no way to reclaim the oom memcg under its hard limit.
Further charges for such a memcg are forced and therefore the hard limit
isolation is weakened.

The current warning is however too eager to trigger even when we are not
really hitting the above condition. Syzbot and Greg Thelen have noticed
that we can hit this condition even when there is still oom victim
pending. E.g. the following race is possible:

memcg has two tasks taskA, taskB.

CPU1 (taskA) CPU2 CPU3 (taskB)
try_charge
mem_cgroup_out_of_memory try_charge
select_bad_process(taskB)
oom_kill_process oom_reap_task
# No real memory reaped
mem_cgroup_out_of_memory
# set taskB -> MMF_OOM_SKIP
# retry charge
mem_cgroup_out_of_memory
oom_lock oom_lock
select_bad_process(self)
oom_kill_process(self)
oom_unlock
# no eligible task

In fact syzbot test triggered this situation by placing multiple tasks
into a memcg with hard limit set to 0. So no task really had any memory
charged to the memcg

: Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
: Tasks state (memory values in pages):
: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
: [ 6569] 0 6562 9427 1 53248 0 0 syz-executor0
: [ 6576] 0 6576 9426 0 61440 0 0 syz-executor6
: [ 6578] 0 6578 9426 534 61440 0 0 syz-executor4
: [ 6579] 0 6579 9426 0 57344 0 0 syz-executor5
: [ 6582] 0 6582 9426 0 61440 0 0 syz-executor7
: [ 6584] 0 6584 9426 0 57344 0 0 syz-executor1

so in principle there is indeed nothing reclaimable in this memcg and
this looks like a misconfiguration. On the other hand we can clearly
kill all those tasks so it is a bit early to warn and scare users. Do
that by checking that the current is the oom victim and bypass the
warning then. The victim is allowed to force charge and terminate to
release its temporal charge along the way.

Fixes: "memcg, oom: move out_of_memory back to the charge path"
Noticed-by: Greg Thelen <[email protected]>
Reported-and-tested-by: [email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..1b6eed1bc404 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
return OOM_ASYNC;
}

- if (mem_cgroup_out_of_memory(memcg, mask, order))
+ if (mem_cgroup_out_of_memory(memcg, mask, order) ||
+ tsk_is_oom_victim(current))
return OOM_SUCCESS;

WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
--
2.18.0

--
Michal Hocko
SUSE Labs

2018-08-06 20:39:33

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon 06-08-18 21:45:53, Michal Hocko wrote:
> [CCing Greg - the email thread starts here
> http://lkml.kernel.org/r/[email protected]]

now for real

>
> On Mon 06-08-18 12:12:02, syzbot wrote:
> > Hello,
> >
> > syzbot has tested the proposed patch and the reproducer did not trigger
> > crash:
>
> OK, this is reassuring. Btw Greg has pointed out this potential case
> http://lkml.kernel.org/r/[email protected]
> but I simply didn't get what he meant. He was suggesting MMF_OOM_SKIP
> but I didn't get why that matters. I didn't think about a race.
>
> So how about this patch:
> From 74d980f8d066d06ada657ebf9b586dbf5668ed26 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 6 Aug 2018 21:21:24 +0200
> Subject: [PATCH] memcg, oom: be careful about races when warning about no
> reclaimable task
>
> "memcg, oom: move out_of_memory back to the charge path" has added a
> warning triggered when the oom killer cannot find any eligible task
> and so there is no way to reclaim the oom memcg under its hard limit.
> Further charges for such a memcg are forced and therefore the hard limit
> isolation is weakened.
>
> The current warning is however too eager to trigger even when we are not
> really hitting the above condition. Syzbot and Greg Thelen have noticed
> that we can hit this condition even when there is still oom victim
> pending. E.g. the following race is possible:
>
> memcg has two tasks taskA, taskB.
>
> CPU1 (taskA) CPU2 CPU3 (taskB)
> try_charge
> mem_cgroup_out_of_memory try_charge
> select_bad_process(taskB)
> oom_kill_process oom_reap_task
> # No real memory reaped
> mem_cgroup_out_of_memory
> # set taskB -> MMF_OOM_SKIP
> # retry charge
> mem_cgroup_out_of_memory
> oom_lock oom_lock
> select_bad_process(self)
> oom_kill_process(self)
> oom_unlock
> # no eligible task
>
> In fact syzbot test triggered this situation by placing multiple tasks
> into a memcg with hard limit set to 0. So no task really had any memory
> charged to the memcg
>
> : Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
> : Tasks state (memory values in pages):
> : [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 6569] 0 6562 9427 1 53248 0 0 syz-executor0
> : [ 6576] 0 6576 9426 0 61440 0 0 syz-executor6
> : [ 6578] 0 6578 9426 534 61440 0 0 syz-executor4
> : [ 6579] 0 6579 9426 0 57344 0 0 syz-executor5
> : [ 6582] 0 6582 9426 0 61440 0 0 syz-executor7
> : [ 6584] 0 6584 9426 0 57344 0 0 syz-executor1
>
> so in principle there is indeed nothing reclaimable in this memcg and
> this looks like a misconfiguration. On the other hand we can clearly
> kill all those tasks so it is a bit early to warn and scare users. Do
> that by checking that the current is the oom victim and bypass the
> warning then. The victim is allowed to force charge and terminate to
> release its temporal charge along the way.
>
> Fixes: "memcg, oom: move out_of_memory back to the charge path"
> Noticed-by: Greg Thelen <[email protected]>
> Reported-and-tested-by: [email protected]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4603ad75c9a9..1b6eed1bc404 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> return OOM_ASYNC;
> }
>
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order) ||
> + tsk_is_oom_victim(current))
> return OOM_SUCCESS;
>
> WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
> --
> 2.18.0
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2018-08-06 20:49:31

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/07 2:56, Michal Hocko wrote:
> So the oom victim indeed passed the above force path after the oom
> invocation. But later on hit the page fault path and that behaved
> differently and for some reason the force path hasn't triggered. I am
> wondering how could we hit the page fault path in the first place. The
> task is already killed! So what the hell is going on here.
>
> I must be missing something obvious here.
>
YOU ARE OBVIOUSLY MISSING MY MAIL!

I already said this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .
And you again ignored my mail. Very sad...

2018-08-06 20:51:53

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
> On 2018/08/07 2:56, Michal Hocko wrote:
> > So the oom victim indeed passed the above force path after the oom
> > invocation. But later on hit the page fault path and that behaved
> > differently and for some reason the force path hasn't triggered. I am
> > wondering how could we hit the page fault path in the first place. The
> > task is already killed! So what the hell is going on here.
> >
> > I must be missing something obvious here.
> >
> YOU ARE OBVIOUSLY MISSING MY MAIL!
>
> I already said this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
> problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .
> And you again ignored my mail. Very sad...

Your suggestion simply didn't make much sense. There is nothing like
first check is different from the rest.
--
Michal Hocko
SUSE Labs

2018-08-06 20:54:14

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/07 5:34, Michal Hocko wrote:
> On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
>> On 2018/08/07 2:56, Michal Hocko wrote:
>>> So the oom victim indeed passed the above force path after the oom
>>> invocation. But later on hit the page fault path and that behaved
>>> differently and for some reason the force path hasn't triggered. I am
>>> wondering how could we hit the page fault path in the first place. The
>>> task is already killed! So what the hell is going on here.
>>>
>>> I must be missing something obvious here.
>>>
>> YOU ARE OBVIOUSLY MISSING MY MAIL!
>>
>> I already said this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
>> problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .
>> And you again ignored my mail. Very sad...
>
> Your suggestion simply didn't make much sense. There is nothing like
> first check is different from the rest.
>

I don't think your patch is appropriate. It avoids hitting WARN(1) but does not avoid
unnecessary killing of OOM victims.

If you look at https://syzkaller.appspot.com/text?tag=CrashLog&x=15a1c770400000 , you will
notice that both 23766 and 23767 are killed due to task_will_free_mem(current) == false.
This is "unnecessary killing of additional processes".

[ 365.869417] syz-executor2 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
[ 365.878899] CPU: 0 PID: 23767 Comm: syz-executor2 Not tainted 4.18.0-rc6-next-20180725+ #18
(...snipped...)
[ 366.487490] Tasks state (memory values in pages):
[ 366.492349] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 366.501237] [ 23766] 0 23766 17620 8221 126976 0 0 syz-executor3
[ 366.510367] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
[ 366.519409] Memory cgroup out of memory: Kill process 23766 (syz-executor3) score 8252000 or sacrifice child
[ 366.529422] Killed process 23766 (syz-executor3) total-vm:70480kB, anon-rss:116kB, file-rss:32768kB, shmem-rss:0kB
[ 366.540456] oom_reaper: reaped process 23766 (syz-executor3), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[ 366.550949] syz-executor3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
[ 366.560374] CPU: 1 PID: 23766 Comm: syz-executor3 Not tainted 4.18.0-rc6-next-20180725+ #18
(...snipped...)
[ 367.138136] Tasks state (memory values in pages):
[ 367.142986] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 367.151889] [ 23766] 0 23766 17620 8002 126976 0 0 syz-executor3
[ 367.160946] [ 23767] 0 23767 17618 8218 126976 0 0 syz-executor2
[ 367.169994] Memory cgroup out of memory: Kill process 23767 (syz-executor2) score 8249000 or sacrifice child
[ 367.180119] Killed process 23767 (syz-executor2) total-vm:70472kB, anon-rss:104kB, file-rss:32768kB, shmem-rss:0kB
[ 367.192101] oom_reaper: reaped process 23767 (syz-executor2), now anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[ 367.202986] ------------[ cut here ]------------
[ 367.207845] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
[ 367.207965] WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 try_charge+0x734/0x1680
[ 367.227540] Kernel panic - not syncing: panic_on_warn set ...


2018-08-06 20:57:30

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Tue 07-08-18 05:46:04, Tetsuo Handa wrote:
> On 2018/08/07 5:34, Michal Hocko wrote:
> > On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
> >> On 2018/08/07 2:56, Michal Hocko wrote:
> >>> So the oom victim indeed passed the above force path after the oom
> >>> invocation. But later on hit the page fault path and that behaved
> >>> differently and for some reason the force path hasn't triggered. I am
> >>> wondering how could we hit the page fault path in the first place. The
> >>> task is already killed! So what the hell is going on here.
> >>>
> >>> I must be missing something obvious here.
> >>>
> >> YOU ARE OBVIOUSLY MISSING MY MAIL!
> >>
> >> I already said this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
> >> problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .
> >> And you again ignored my mail. Very sad...
> >
> > Your suggestion simply didn't make much sense. There is nothing like
> > first check is different from the rest.
> >
>
> I don't think your patch is appropriate. It avoids hitting WARN(1) but does not avoid
> unnecessary killing of OOM victims.
>
> If you look at https://syzkaller.appspot.com/text?tag=CrashLog&x=15a1c770400000 , you will
> notice that both 23766 and 23767 are killed due to task_will_free_mem(current) == false.
> This is "unnecessary killing of additional processes".

Have you noticed the mere detail that the memcg has to kill any task
attempting the charge because the hard limit is 0? There is simply no
other way around. You cannot charge. There is no unnecessary killing.
Full stop. We do allow temporary breach of the hard limit just to let
the task die and uncharge on the way out.
--
Michal Hocko
SUSE Labs

2018-08-06 21:09:52

by syzbot

[permalink] [raw]
Subject: Re: WARNING in try_charge

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
[email protected]

Tested on:

commit: 116b181bb646 Add linux-next specific files for 20180803
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=b4f38be7c2c519d5
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=147b3a72400000

Note: testing is done by a robot and is best-effort only.

2018-08-06 21:55:46

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/07 5:55, Michal Hocko wrote:
> On Tue 07-08-18 05:46:04, Tetsuo Handa wrote:
>> On 2018/08/07 5:34, Michal Hocko wrote:
>>> On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
>>>> On 2018/08/07 2:56, Michal Hocko wrote:
>>>>> So the oom victim indeed passed the above force path after the oom
>>>>> invocation. But later on hit the page fault path and that behaved
>>>>> differently and for some reason the force path hasn't triggered. I am
>>>>> wondering how could we hit the page fault path in the first place. The
>>>>> task is already killed! So what the hell is going on here.
>>>>>
>>>>> I must be missing something obvious here.
>>>>>
>>>> YOU ARE OBVIOUSLY MISSING MY MAIL!
>>>>
>>>> I already said this is "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
>>>> problem which you are refusing at https://www.spinics.net/lists/linux-mm/msg133774.html .
>>>> And you again ignored my mail. Very sad...
>>>
>>> Your suggestion simply didn't make much sense. There is nothing like
>>> first check is different from the rest.
>>>
>>
>> I don't think your patch is appropriate. It avoids hitting WARN(1) but does not avoid
>> unnecessary killing of OOM victims.
>>
>> If you look at https://syzkaller.appspot.com/text?tag=CrashLog&x=15a1c770400000 , you will
>> notice that both 23766 and 23767 are killed due to task_will_free_mem(current) == false.
>> This is "unnecessary killing of additional processes".
>
> Have you noticed the mere detail that the memcg has to kill any task
> attempting the charge because the hard limit is 0? There is simply no
> other way around. You cannot charge. There is no unnecessary killing.
> Full stop. We do allow temporary breach of the hard limit just to let
> the task die and uncharge on the way out.
>

select_bad_process() is called just because
task_will_free_mem("already killed current thread which has not completed __mmput()") == false
is a bug. I'm saying that the OOM killer should not give up as soon as MMF_OOM_SKIP is set.

static bool oom_has_pending_victims(struct oom_control *oc)
{
struct task_struct *p, *tmp;
bool ret = false;
bool gaveup = false;

if (is_sysrq_oom(oc))
return false;
/*
* Wait for pending victims until __mmput() completes or stalled
* too long.
*/
list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
struct mm_struct *mm = p->signal->oom_mm;

if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
continue;
ret = true;
+ /*
+ * Since memcg OOM allows forced charge, we can safely wait
+ * until __mmput() completes.
+ */
+ if (is_memcg_oom(oc))
+ return true;
#ifdef CONFIG_MMU
/*
* Since the OOM reaper exists, we can safely wait until
* MMF_OOM_SKIP is set.
*/
if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
if (!oom_reap_target) {
get_task_struct(p);
oom_reap_target = p;
trace_wake_reaper(p->pid);
wake_up(&oom_reaper_wait);
}
#endif
continue;
}
#endif
/* We can wait as long as OOM score is decreasing over time. */
if (!victim_mm_stalling(p, mm))
continue;
gaveup = true;
list_del(&p->oom_victim_list);
/* Drop a reference taken by mark_oom_victim(). */
put_task_struct(p);
}
if (gaveup)
debug_show_all_locks();

return ret;
}


2018-08-07 10:27:40

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/07 6:50, Tetsuo Handa wrote:
> list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
> struct mm_struct *mm = p->signal->oom_mm;
>
> if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
> continue;
> ret = true;
> + /*
> + * Since memcg OOM allows forced charge, we can safely wait
> + * until __mmput() completes.
> + */
> + if (is_memcg_oom(oc))
> + return true;

Oops. If this OOM victim was blocked on some lock which current thread is
holding, waiting forever unconditionally is not safe.

> #ifdef CONFIG_MMU
> /*
> * Since the OOM reaper exists, we can safely wait until
> * MMF_OOM_SKIP is set.
> */
> if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
> if (!oom_reap_target) {
> get_task_struct(p);
> oom_reap_target = p;
> trace_wake_reaper(p->pid);
> wake_up(&oom_reaper_wait);
> }
> #endif
> continue;
> }
> #endif
> /* We can wait as long as OOM score is decreasing over time. */
> if (!victim_mm_stalling(p, mm))
> continue;
> gaveup = true;
> list_del(&p->oom_victim_list);
> /* Drop a reference taken by mark_oom_victim(). */
> put_task_struct(p);
> }
> if (gaveup)
> debug_show_all_locks();
>
> return ret;
> }
>

2018-08-07 11:32:05

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Mon, Aug 6, 2018 at 8:55 PM, Michal Hocko <[email protected]> wrote:
> The debugging patch was wrong but I guess I see it finally.
> It's a race
>
> : [ 72.901666] Memory cgroup out of memory: Kill process 6584 (syz-executor1) score 550000 or sacrifice child
> : [ 72.917037] Killed process 6584 (syz-executor1) total-vm:37704kB, anon-rss:2140kB, file-rss:0kB, shmem-rss:0kB
> : [ 72.927256] task=syz-executor5 pid=6581 charge bypass
> : [ 72.928046] oom_reaper: reaped process 6584 (syz-executor1), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> : [ 72.932818] task=syz-executor6 pid=6576 invoked memcg oom killer. oom_victim=1
> : [ 72.942790] task=syz-executor5 pid=6581 charge for nr_pages=1
> : [ 72.949769] syz-executor6 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
> : [ 72.955606] task=syz-executor5 pid=6581 charge bypass
> : [ 72.967394] syz-executor6 cpuset=/ mems_allowed=0
> : [ 72.973175] task=syz-executor5 pid=6581 charge for nr_pages=1
> : [...]
> : [ 73.534865] Task in /ile0 killed as a result of limit of /ile0
> : [ 73.540865] memory: usage 76kB, limit 0kB, failcnt 260
> : [ 73.546142] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 73.552898] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> : [ 73.559051] Memory cgroup stats for /ile0: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
> : [ 73.578533] Tasks state (memory values in pages):
> : [ 73.583404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> : [ 73.592277] [ 6569] 0 6562 9427 1 53248 0 0 syz-executor0
> : [ 73.601299] [ 6576] 0 6576 9426 0 61440 0 0 syz-executor6
> : [ 73.610333] [ 6578] 0 6578 9426 534 61440 0 0 syz-executor4
> : [ 73.619381] [ 6579] 0 6579 9426 0 57344 0 0 syz-executor5
> : [ 73.628414] [ 6582] 0 6582 9426 0 61440 0 0 syz-executor7
> : [ 73.637441] [ 6584] 0 6584 9426 0 57344 0 0 syz-executor1
> : [ 73.646464] Memory cgroup out of memory: Kill process 6578 (syz-executor4) score 549000 or sacrifice child
> : [ 73.656295] task=syz-executor6 pid=6576 is oom victim now
>
> This should be 6578 but we at least know that we are running in 6576
> context so the we are setting the state from a remote context which
> itself has been killed already
>
> : [ 73.661841] Killed process 6578 (syz-executor4) total-vm:37704kB, anon-rss:2136kB, file-rss:0kB, shmem-rss:0kB
> : [ 73.672035] task=syz-executor6 pid=6576 charge bypass
> : [ 73.672801] oom_reaper: reaped process 6578 (syz-executor4), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> : [ 73.678829] task=syz-executor4 pid=6578 invoked memcg oom killer. oom_victim=1
>
> and here the victim finally reached the oom path finally.
>
> : [ 73.687453] task=syz-executor6 pid=6576 charge for nr_pages=1
> : [ 73.694534] ------------[ cut here ]------------
> : [ 73.700424] task=syz-executor6 pid=6576 charge bypass
> : [ 73.705175] Memory cgroup charge failed because of no reclaimable memory! This looks like a misconfiguration or a kernel bug.
> : [ 73.705321] WARNING: CPU: 1 PID: 6578 at mm/memcontrol.c:1707 try_charge+0xafa/0x1710
>
> But there is nobody killable. So the oom kill happened _after_ our force
> charge path. Therefore we should do the following regardless whether we
> make tis warn or pr_$foo

Great we are making progress here!

So if it's something to fix in kernel we just leave WARN alone. It
served its intended purpose of notifying kernel developers about
something to fix in kernel. And as you noted 0 is not actually special
in this context anyway. I misunderstood how exactly misconfiguration
is involved here.


> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 116b181bb646afedd770985de20a68721bdb2648
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4603ad75c9a9..1b6eed1bc404 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> return OOM_ASYNC;
> }
>
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order) ||
> + tsk_is_oom_victim(current))
> return OOM_SUCCESS;
>
> WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
> --
> Michal Hocko
> SUSE Labs

2018-08-07 12:16:30

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Tue 07-08-18 13:18:00, Dmitry Vyukov wrote:
[...]
> Great we are making progress here!
>
> So if it's something to fix in kernel we just leave WARN alone. It
> served its intended purpose of notifying kernel developers about
> something to fix in kernel.

Yes, agreed! And the way how your syzbot automation works made it so
much easier so thanks a lot!

The patch has been posted http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2018-08-09 13:59:19

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Thu, 9 Aug 2018 22:49:40 +0900
Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
memory reserve fails

Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed to select next OOM victim as soon as
MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
caused by this race window [1].

Since memcg OOM case uses forced charge if current thread is killed,
out_of_memory() can return true without selecting next OOM victim.
Therefore, this patch changes task_will_free_mem(current) to ignore
MMF_OOM_SKIP unless ALLOC_OOM allocation failed.

[1] https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258

Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/oom.h | 3 +++
mm/oom_kill.c | 8 ++++----
mm/page_alloc.c | 7 +++++--
3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a5..b5abacd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -38,6 +38,9 @@ struct oom_control {
*/
const int order;

+ /* Did we already try ALLOC_OOM allocation? i*/
+ const bool reserve_tried;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e10b86..95453e8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -782,7 +782,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
* Caller has to make sure that task->mm is stable (hold task_lock or
* it operates on the current).
*/
-static bool task_will_free_mem(struct task_struct *task)
+static bool task_will_free_mem(struct task_struct *task, bool select_new)
{
struct mm_struct *mm = task->mm;
struct task_struct *p;
@@ -803,7 +803,7 @@ static bool task_will_free_mem(struct task_struct *task)
* This task has already been drained by the oom reaper so there are
* only small chances it will free some more
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ if (test_bit(MMF_OOM_SKIP, &mm->flags) && select_new)
return false;

if (atomic_read(&mm->mm_users) <= 1)
@@ -939,7 +939,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
* so it can die quickly
*/
task_lock(p);
- if (task_will_free_mem(p)) {
+ if (task_will_free_mem(p, true)) {
mark_oom_victim(p);
wake_oom_reaper(p);
task_unlock(p);
@@ -1069,7 +1069,7 @@ bool out_of_memory(struct oom_control *oc)
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*/
- if (task_will_free_mem(current)) {
+ if (task_will_free_mem(current, oc->reserve_tried)) {
mark_oom_victim(current);
wake_oom_reaper(current);
return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 879b861..03ca29a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3455,7 +3455,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
}

static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, bool reserve_tried,
const struct alloc_context *ac, unsigned long *did_some_progress)
{
struct oom_control oc = {
@@ -3464,6 +3464,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
.memcg = NULL,
.gfp_mask = gfp_mask,
.order = order,
+ .reserve_tried = reserve_tried,
};
struct page *page;

@@ -4239,7 +4240,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
goto retry_cpuset;

/* Reclaim has failed us, start killing things */
- page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM
+ || (gfp_mask & __GFP_NOMEMALLOC), ac,
+ &did_some_progress);
if (page)
goto got_pg;

--
1.8.3.1



2018-08-09 15:09:32

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Thu 09-08-18 22:57:43, Tetsuo Handa wrote:
> >From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Thu, 9 Aug 2018 22:49:40 +0900
> Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
> memory reserve fails
>
> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed to select next OOM victim as soon as
> MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
> long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
> caused by this race window [1].

It is not because the syzbot was exercising a completely different code
path (memcg charge rather than the page allocator).

> Since memcg OOM case uses forced charge if current thread is killed,
> out_of_memory() can return true without selecting next OOM victim.
> Therefore, this patch changes task_will_free_mem(current) to ignore
> MMF_OOM_SKIP unless ALLOC_OOM allocation failed.

And the patch is simply wrong for memcg.
--
Michal Hocko
SUSE Labs

2018-08-09 15:34:05

by Johannes Weiner

[permalink] [raw]
Subject: Re: WARNING in try_charge

On Thu, Aug 09, 2018 at 10:57:43PM +0900, Tetsuo Handa wrote:
> From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Thu, 9 Aug 2018 22:49:40 +0900
> Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
> memory reserve fails
>
> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed to select next OOM victim as soon as
> MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
> long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
> caused by this race window [1].

Huh? That's the memcg path, it has nothing to do with ALLOC_OOM.

> Since memcg OOM case uses forced charge if current thread is killed,
> out_of_memory() can return true without selecting next OOM victim.
> Therefore, this patch changes task_will_free_mem(current) to ignore
> MMF_OOM_SKIP unless ALLOC_OOM allocation failed.

I have no idea how the first and the second half of this paragraph go
together. They're completely independent code paths.

2018-08-09 21:07:22

by Tetsuo Handa

[permalink] [raw]
Subject: Re: WARNING in try_charge

On 2018/08/10 0:07, Michal Hocko wrote:
> On Thu 09-08-18 22:57:43, Tetsuo Handa wrote:
>> >From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <[email protected]>
>> Date: Thu, 9 Aug 2018 22:49:40 +0900
>> Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
>> memory reserve fails
>>
>> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
>> oom_reaped tasks") changed to select next OOM victim as soon as
>> MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
>> long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
>> caused by this race window [1].
>
> It is not because the syzbot was exercising a completely different code
> path (memcg charge rather than the page allocator).

I know syzbot is hitting memcg charge path.

>
>> Since memcg OOM case uses forced charge if current thread is killed,
>> out_of_memory() can return true without selecting next OOM victim.
>> Therefore, this patch changes task_will_free_mem(current) to ignore
>> MMF_OOM_SKIP unless ALLOC_OOM allocation failed.
>
> And the patch is simply wrong for memcg.
>

Why? I think I should have done

-+ page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM
-+ || (gfp_mask & __GFP_NOMEMALLOC), ac,
-+ &did_some_progress);
++ page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM,
++ ac, &did_some_progress);

because nobody will use __GFP_NOMEMALLOC | __GFP_NOFAIL. But for memcg charge
path, task_will_free_mem(current, false) == true and out_of_memory() will return
true, which avoids unnecessary OOM killing.

Of course, this patch cannot avoid unnecessary OOM killing if out_of_memory()
is called by not yet killed process. But to mitigate it, what can we do other
than defer setting MMF_OOM_SKIP using a timeout based mechanism? Making
the OOM reaper unconditionally reclaim all memory is not a valid answer.