2021-10-18 08:16:49

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

While checking the patches that fixed broken memcg accounting in vmalloc,
I found another issue: a false global OOM triggered by a memcg-limited
user-space task.

I executed a vmalloc-eater inside a memcg-limited LXC container in a loop and
checked that it did not consume host memory beyond the assigned limit, triggered
the memcg OOM, and generated "Memory cgroup out of memory" messages. Everything
was as expected.

However, I was surprised to also find quite rare global OOM messages.
I set sysctl vm.panic_on_oom to 1, repeated the test, and successfully
crashed the node.

Dmesg showed that the global OOM was detected on a 16 GB node with ~10 GB of free memory.

syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
dump_header+0x4a/0x2c1
out_of_memory.cold+0xa/0x7e
pagefault_out_of_memory+0x46/0x60
exc_page_fault+0x79/0x2b0
asm_exc_page_fault+0x1e/0x30
...
Mem-Info:
Node 0 DMA: 0*4kB 0*8kB <...> = 13296kB
Node 0 DMA32: 705*4kB (UM) <...> = 2586964kB
Node 0 Normal: 2743*4kB (UME) <...> = 6904828kB
...
4095866 pages RAM
...
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

The full dmesg can be found in the attached file.

How could this happen?

A user-space task inside the memcg-limited container generated a page fault.
Its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
allocate the page because the memcg limit was exceeded and returned
VM_FAULT_OOM. Then do_user_addr_fault() called pagefault_out_of_memory(),
which executed out_of_memory() without a memcg set.

This problem partially depends on one of my recent patches, which disabled
unlimited memory allocation for dying tasks. However, I think the problem
can happen for non-killed tasks too, for example because of the kmem limit.

At present do_user_addr_fault() does not know why the page allocation failed,
i.e. whether it was a global or a memcg OOM. I propose to save this information
in a new flag on task_struct. It can be set in case of memcg restrictions in
try_charge_memcg() (for the memory controller) and in obj_cgroup_charge_pages()
(for the kmem limit). Then it can be used in mem_cgroup_oom_synchronize(),
called inside pagefault_out_of_memory():
in case of memcg-related restrictions it will not trigger a fake global OOM but
will return to user space, which will retry the fault or kill the process if it
got a fatal signal.

Thank you,
Vasily Averin

Vasily Averin (1):
memcg: prevent false global OOM triggered by memcg limited task.

include/linux/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)

--
2.32.0


Attachments:
dmesg-oom.txt (25.05 kB)

2021-10-18 09:05:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
>
> User-space task inside the memcg-limited container generated a page fault,
> its handler do_user_addr_fault() called handle_mm_fault which could not
> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> out_of_memory() without set of memcg.
>
> Partially this problem depends on one of my recent patches, disabled unlimited
> memory allocation for dying tasks. However I think the problem can happen
> on non-killed tasks too, for example because of kmem limit.

Could you be more specific on how this can happen without your patch? I
have to say I had not realized this side effect when we discussed it.

I will be honest: I am not really happy about pagefault_out_of_memory.
I have tried to remove it in the past, without much success back then,
unfortunately [1].
Maybe we should finally get rid of it. The OOM is always triggered from
inside the allocator, where we have much more information about the
allocation context. A first step would be to skip pagefault_out_of_memory
for killed or exiting processes.

[1] I do not have msg-id so I cannot provide a lore link but google
pointed me to https://www.mail-archive.com/[email protected]/msg1400402.html
--
Michal Hocko
SUSE Labs

2021-10-18 10:08:31

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> User-space task inside the memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>> out_of_memory() without set of memcg.
>>
>> Partially this problem depends on one of my recent patches, disabled unlimited
>> memory allocation for dying tasks. However I think the problem can happen
>> on non-killed tasks too, for example because of kmem limit.
>
> Could you be more specific on how this can happen without your patch? I
> have to say I haven't realized this side effect when discussing it.

We can reach obj_cgroup_charge_pages() for example via

do_user_addr_fault
handle_mm_fault
__handle_mm_fault
p4d_alloc
__p4d_alloc
p4d_alloc_one
get_zeroed_page
__get_free_pages
alloc_pages
__alloc_pages
__memcg_kmem_charge_page
obj_cgroup_charge_pages

Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and the allocation fails.

If required, I can try to find out how try_charge_memcg() can reject the
page allocation of a non-dying task too.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
> Maybe we should get rid of it finally. The OOM is always triggered from
> inside the allocator where we have much more infromation about the
> allocation context. A first step would be to skip pagefault_out_of_memory
> for killed or exiting processes.

I like this idea; however, it may not be enough, at least in the scenario described above.

> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/[email protected]/msg1400402.html

Thank you, I'll read this discussion.
Vasily Averin

2021-10-18 10:16:55

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> Partially this problem depends on one of my recent patches, disabled unlimited
>>> memory allocation for dying tasks. However I think the problem can happen
>>> on non-killed tasks too, for example because of kmem limit.
>>
>> Could you be more specific on how this can happen without your patch? I
>> have to say I haven't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
>
> Here we call try_charge_memcg() that return success and approve the allocation,
> however then we hit into kmem limit and fail the allocation.

Btw, in OpenVZ kernels we try to clean up memory when a task hits the kmem
limit; that is why we moved the kmem limit check into try_charge_memcg().

Would any improvements to the kmem controller be interesting for upstream?

Thank you,
Vasily Averin

2021-10-18 11:58:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> User-space task inside the memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> >> out_of_memory() without set of memcg.
> >>
> >> Partially this problem depends on one of my recent patches, disabled unlimited
> >> memory allocation for dying tasks. However I think the problem can happen
> >> on non-killed tasks too, for example because of kmem limit.
> >
> > Could you be more specific on how this can happen without your patch? I
> > have to say I haven't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
>
> Here we call try_charge_memcg() that return success and approve the allocation,
> however then we hit into kmem limit and fail the allocation.

Just to make sure I understand this would be for the v1 kmem explicit
limit, correct?

> If required I can try to search how try_charge_memcg() can reject page allocation
> of non-dying task too.

Yes.

> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> > Maybe we should get rid of it finally. The OOM is always triggered from
> > inside the allocator where we have much more infromation about the
> > allocation context. A first step would be to skip pagefault_out_of_memory
> > for killed or exiting processes.
>
> I like this idea, however it may be not enough, at least in scenario described above.

My original patch removed the oom killer completely.

--
Michal Hocko
SUSE Labs

2021-10-18 12:31:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

[restore the cc list]

On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg() that return success and approve the allocation,
> >> however then we hit into kmem limit and fail the allocation.
> >
> > Just to make sure I understand this would be for the v1 kmem explicit
> > limit, correct?
>
> yes, I mean this limit.

OK, thanks for the clarification. This is a known problem. Have a look at
0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"); I think we
consider that one deprecated. We have been reporting the deprecated and
to-be-removed status since 2019 without any actual report prompted by the
kernel message. Maybe we should try to remove it and see whether that
prompts some pushback.

--
Michal Hocko
SUSE Labs

2021-10-18 15:10:36

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
>
> [restore the cc list]
>
> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > On 18.10.2021 14:53, Michal Hocko wrote:
> > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > >> however then we hit into kmem limit and fail the allocation.
> > >
> > > Just to make sure I understand this would be for the v1 kmem explicit
> > > limit, correct?
> >
> > yes, I mean this limit.
>
> OK, thanks for the clarification. This is a known problem. Have a look
> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> status since 2019 without any actual report sugested by the kernel
> message. Maybe we should try and remove it and see whether that prompts
> some pushback.
>

Yes, I think now should be the right time to take the next step for
deprecation of kmem limits:
https://lore.kernel.org/all/[email protected]/

2021-10-18 16:53:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
> >
> > [restore the cc list]
> >
> > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > > >> however then we hit into kmem limit and fail the allocation.
> > > >
> > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > limit, correct?
> > >
> > > yes, I mean this limit.
> >
> > OK, thanks for the clarification. This is a known problem. Have a look
> > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> > status since 2019 without any actual report sugested by the kernel
> > message. Maybe we should try and remove it and see whether that prompts
> > some pushback.
> >
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/[email protected]/

I completely forgot about your patch. Anyway, it usually takes us years
to deprecate something so let's stick with it and consider 2 years as
years ;)

--
Michal Hocko
SUSE Labs

2021-10-18 17:15:35

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
> > >
> > > [restore the cc list]
> > >
> > > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > > > >> however then we hit into kmem limit and fail the allocation.
> > > > >
> > > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > > limit, correct?
> > > >
> > > > yes, I mean this limit.
> > >
> > > OK, thanks for the clarification. This is a known problem. Have a look
> > > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> > > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> > > status since 2019 without any actual report sugested by the kernel
> > > message. Maybe we should try and remove it and see whether that prompts
> > > some pushback.
> > >
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/[email protected]/
>
> I completely forgot about your patch. Anyway, it usually takes us years
> to deprecate something so let's stick with it and consider 2 years as
> years ;)
>

Sure, I will rebase and resend.

2021-10-18 18:54:01

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 18:07, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
>>
>> [restore the cc list]
>>
>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
>>>>> however then we hit into kmem limit and fail the allocation.
>>>>
>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>> limit, correct?
>>>
>>> yes, I mean this limit.
>>
>> OK, thanks for the clarification. This is a known problem. Have a look
>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
>> status since 2019 without any actual report sugested by the kernel
>> message. Maybe we should try and remove it and see whether that prompts
>> some pushback.
>>
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/[email protected]/

Are you going to push it to stable kernels too?

Thank you,
Vasily Averin

2021-10-18 19:21:56

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 21:52, Vasily Averin wrote:
> On 18.10.2021 18:07, Shakeel Butt wrote:
>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
>>>
>>> [restore the cc list]
>>>
>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
>>>>>> however then we hit into kmem limit and fail the allocation.
>>>>>
>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>> limit, correct?
>>>>
>>>> yes, I mean this limit.
>>>
>>> OK, thanks for the clarification. This is a known problem. Have a look
>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
>>> status since 2019 without any actual report sugested by the kernel
>>> message. Maybe we should try and remove it and see whether that prompts
>>> some pushback.
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/[email protected]/
>
> Are you going to push it to stable kernels too?

Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu 20.04 LTS kernel 5.11.0-37.


2021-10-19 05:35:34

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <[email protected]> wrote:
>
> On 18.10.2021 21:52, Vasily Averin wrote:
> > On 18.10.2021 18:07, Shakeel Butt wrote:
> >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
> >>>
> >>> [restore the cc list]
> >>>
> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
> >>>>>> however then we hit into kmem limit and fail the allocation.
> >>>>>
> >>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>> limit, correct?
> >>>>
> >>>> yes, I mean this limit.
> >>>
> >>> OK, thanks for the clarification. This is a known problem. Have a look
> >>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> >>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> >>> status since 2019 without any actual report sugested by the kernel
> >>> message. Maybe we should try and remove it and see whether that prompts
> >>> some pushback.
> >>>
> >>
> >> Yes, I think now should be the right time to take the next step for
> >> deprecation of kmem limits:
> >> https://lore.kernel.org/all/[email protected]/
> >
> > Are you going to push it to stable kernels too?
>
> Btw CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in ubuntu 20.04 LTS kernel 5.11.0-37.
>

CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not
disabling the kmem accounting.

2021-10-19 05:35:49

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <[email protected]> wrote:
>
> On 18.10.2021 18:07, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
> >>
> >> [restore the cc list]
> >>
> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>> Here we call try_charge_memcg() that return success and approve the allocation,
> >>>>> however then we hit into kmem limit and fail the allocation.
> >>>>
> >>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>> limit, correct?
> >>>
> >>> yes, I mean this limit.
> >>
> >> OK, thanks for the clarification. This is a known problem. Have a look
> >> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> >> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> >> status since 2019 without any actual report sugested by the kernel
> >> message. Maybe we should try and remove it and see whether that prompts
> >> some pushback.
> >>
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/[email protected]/
>
> Are you going to push it to stable kernels too?
>

Not really. Is there a reason I should? More exposure to catch breakage?

2021-10-19 06:33:26

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 14:53, Michal Hocko wrote:
> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> User-space task inside the memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>>>> out_of_memory() without set of memcg.
>>>>
>>>> Partially this problem depends on one of my recent patches, disabled unlimited
>>>> memory allocation for dying tasks. However I think the problem can happen
>>>> on non-killed tasks too, for example because of kmem limit.
>>>
>>> Could you be more specific on how this can happen without your patch? I
>>> have to say I haven't realized this side effect when discussing it.

>> If required I can try to search how try_charge_memcg() can reject page allocation
>> of non-dying task too.
>
> Yes.

Now I think that such a failure was very unlikely (without my patch and the
kmem limit). I cannot exclude it completely, because I have not finished this
review and perhaps missed something, but I checked most of the code and found
nothing.

With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)

To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

However, all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task",
and I think it is a better solution.

Thank you,
Vasily Averin

2021-10-19 06:46:17

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 19.10.2021 08:33, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <[email protected]> wrote:
>>
>> On 18.10.2021 18:07, Shakeel Butt wrote:
>>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
>>>>
>>>> [restore the cc list]
>>>>
>>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
>>>>>>> however then we hit into kmem limit and fail the allocation.
>>>>>>
>>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>>> limit, correct?
>>>>>
>>>>> yes, I mean this limit.
>>>>
>>>> OK, thanks for the clarification. This is a known problem. Have a look
>>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
>>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
>>>> status since 2019 without any actual report sugested by the kernel
>>>> message. Maybe we should try and remove it and see whether that prompts
>>>> some pushback.
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/[email protected]/
>>
>> Are you going to push it to stable kernels too?
>
> Not really. Is there a reason I should? More exposure to catch breakage?

There is a problem: the kmem limit can trigger a fake global OOM.
To fix it upstream, you can remove the kmem controller.

However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I am afraid nobody will backport it if the patch is not approved
upstream.

Thank you,
Vasily Averin

2021-10-19 08:49:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 09:42:38, Vasily Averin wrote:
> On 19.10.2021 08:33, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <[email protected]> wrote:
> >>
> >> On 18.10.2021 18:07, Shakeel Butt wrote:
> >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <[email protected]> wrote:
> >>>>
> >>>> [restore the cc list]
> >>>>
> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
> >>>>>>> however then we hit into kmem limit and fail the allocation.
> >>>>>>
> >>>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>>> limit, correct?
> >>>>>
> >>>>> yes, I mean this limit.
> >>>>
> >>>> OK, thanks for the clarification. This is a known problem. Have a look
> >>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> >>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> >>>> status since 2019 without any actual report sugested by the kernel
> >>>> message. Maybe we should try and remove it and see whether that prompts
> >>>> some pushback.
> >>>
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/[email protected]/
> >>
> >> Are you going to push it to stable kernels too?
> >
> > Not really. Is there a reason I should? More exposure to catch breakage?
>
> There is a problem: kmem limit can trigger fake global OOM.
> To fix it in upstream you can remove kmem controller.
>
> However how to handle this problem in stable and LTS kernels?

I do not see any bug reports coming in, and I strongly suspect this is
because the functionality is simply not used widely enough, or not in the
mode where it would matter (U != 0, K < U from our documentation).
If there are relevant use cases for this setup then we really want to
hear about them, because we do not want to break userspace. Handling
that setup would be far from trivial, and the global oom killer is not
the only problem of that kind.

> My current patch resolves the problem too, and it can be backported.
> However I afraid nobody will do it if teh patch will not be approved in upsteam.

I do not think your current approach is the right one. We do not really
want yet another flag to tell us we are in a memcg OOM. We already have
one.

A proper way is to deal with the pagefault oom killer trigger but that
might be just too risky for stable kernels.
--
Michal Hocko
SUSE Labs

2021-10-19 08:52:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 09:30:18, Vasily Averin wrote:
[...]
> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> a) due to fatal signal
> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>
> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.

How is b) possible without current being killed? Do we allow remote
charging?

> However all these cases can be successfully handled by my new patch
> "memcg: prevent false global OOM triggered by memcg limited task"
> and I think it is better solution.

I have already replied to your approach in other email. Sorry our
replies have crossed.
--
Michal Hocko
SUSE Labs

2021-10-19 10:33:13

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>> a) due to fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>
>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.

> How is b) possible without current being killed? Do we allow remote
> charging?

out_of_memory for memcg_oom
select_bad_process
mem_cgroup_scan_tasks
oom_evaluate_task
oom_badness

	/*
	 * Do not even consider tasks which are explicitly marked oom
	 * unkillable or have been already oom reaped or the are in
	 * the middle of vfork
	 */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
			in_vfork(p)) {
		task_unlock(p);
		return LONG_MIN;
	}

This time we are handling a user-space page fault, so we cannot be a kernel
thread and cannot be in_vfork().
However, the task can be marked as oom-unkillable,
i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.

It can be set via oom_score_adj_write().

Thank you,
Vasily Averin

2021-10-19 11:56:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> On 19.10.2021 11:49, Michal Hocko wrote:
> > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > [...]
> >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >> a) due to fatal signal
> >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> >>
> >> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
> >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> > How is b) possible without current being killed? Do we allow remote
> > charging?
>
> out_of_memory for memcg_oom
> select_bad_process
> mem_cgroup_scan_tasks
> oom_evaluate_task
> oom_badness
>
> /*
> * Do not even consider tasks which are explicitly marked oom
> * unkillable or have been already oom reaped or the are in
> * the middle of vfork
> */
> adj = (long)p->signal->oom_score_adj;
> if (adj == OOM_SCORE_ADJ_MIN ||
> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> in_vfork(p)) {
> task_unlock(p);
> return LONG_MIN;
> }
>
> This time we handle a userspace page fault, so we cannot be a kernel thread,
> and cannot be in_vfork().
> However task can be marked as oom unkillable,
> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

You are right. I am not sure there is a way out of this though. The task
can only retry for ever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.

--
Michal Hocko
SUSE Labs

2021-10-19 12:08:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> > On 19.10.2021 11:49, Michal Hocko wrote:
> > > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > > [...]
> > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> > >> a) due to fatal signal
> > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> > >>
> > >> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
> > >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >
> > > How is b) possible without current being killed? Do we allow remote
> > > charging?
> >
> > out_of_memory for memcg_oom
> > select_bad_process
> > mem_cgroup_scan_tasks
> > oom_evaluate_task
> > oom_badness
> >
> > /*
> > * Do not even consider tasks which are explicitly marked oom
> > * unkillable or have been already oom reaped or the are in
> > * the middle of vfork
> > */
> > adj = (long)p->signal->oom_score_adj;
> > if (adj == OOM_SCORE_ADJ_MIN ||
> > test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> > in_vfork(p)) {
> > task_unlock(p);
> > return LONG_MIN;
> > }
> >
> > This time we handle a userspace page fault, so we cannot be a kernel thread,
> > and cannot be in_vfork().
> > However task can be marked as oom unkillable,
> > i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>
> You are right. I am not sure there is a way out of this though. The task
> can only retry for ever in this case. There is nothing actionable here.
> We cannot kill the task and there is no other way to release the memory.

Btw. don't we force the charge in that case?
--
Michal Hocko
SUSE Labs

2021-10-19 13:31:00

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 19.10.2021 15:04, Michal Hocko wrote:
> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>> [...]
>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>> a) due to fatal signal
>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>
>>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>
>>>> How is b) possible without current being killed? Do we allow remote
>>>> charging?
>>>
>>> out_of_memory for memcg_oom
>>> select_bad_process
>>> mem_cgroup_scan_tasks
>>> oom_evaluate_task
>>> oom_badness
>>>
>>> /*
>>> * Do not even consider tasks which are explicitly marked oom
>>> * unkillable or have been already oom reaped or the are in
>>> * the middle of vfork
>>> */
>>> adj = (long)p->signal->oom_score_adj;
>>> if (adj == OOM_SCORE_ADJ_MIN ||
>>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>> in_vfork(p)) {
>>> task_unlock(p);
>>> return LONG_MIN;
>>> }
>>>
>>> This time we handle a userspace page fault, so we cannot be a kernel thread,
>>> and cannot be in_vfork().
>>> However task can be marked as oom unkillable,
>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>
>> You are right. I am not sure there is a way out of this though. The task
>> can only retry for ever in this case. There is nothing actionable here.
>> We cannot kill the task and there is no other way to release the memory.
>
> Btw. don't we force the charge in that case?

We should force charge for allocation from inside page fault handler,
to prevent endless cycle in retried page faults.
However we should not do it for allocations from task context,
to prevent memcg-limited vmalloc-eaters from consuming all host memory.

Also I would like to return to the following hunk.
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);

unlock:
mutex_unlock(&oom_lock);

Now I think it's better to keep the task_is_dying() check here.
If the task is dying, it is not necessary to push another task to free the memory.
We broke the vmalloc cycle already, so it looks like nothing should prevent us
from returning to userspace, handling the fatal signal, exiting and freeing the memory.

Thank you,
Vasily Averin

2021-10-19 14:18:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 16:26:50, Vasily Averin wrote:
> On 19.10.2021 15:04, Michal Hocko wrote:
> > On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> >> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> >>> On 19.10.2021 11:49, Michal Hocko wrote:
> >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> >>>> [...]
> >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >>>>> a) due to fatal signal
> >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> >>>>>
> >>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
> >>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >>>
> >>>> How is b) possible without current being killed? Do we allow remote
> >>>> charging?
> >>>
> >>> out_of_memory for memcg_oom
> >>> select_bad_process
> >>> mem_cgroup_scan_tasks
> >>> oom_evaluate_task
> >>> oom_badness
> >>>
> >>> /*
> >>> * Do not even consider tasks which are explicitly marked oom
> >>> * unkillable or have been already oom reaped or the are in
> >>> * the middle of vfork
> >>> */
> >>> adj = (long)p->signal->oom_score_adj;
> >>> if (adj == OOM_SCORE_ADJ_MIN ||
> >>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >>> in_vfork(p)) {
> >>> task_unlock(p);
> >>> return LONG_MIN;
> >>> }
> >>>
> >>> This time we handle a userspace page fault, so we cannot be a kernel thread,
> >>> and cannot be in_vfork().
> >>> However task can be marked as oom unkillable,
> >>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> >>
> >> You are right. I am not sure there is a way out of this though. The task
> >> can only retry for ever in this case. There is nothing actionable here.
> >> We cannot kill the task and there is no other way to release the memory.
> >
> > Btw. don't we force the charge in that case?
>
> We should force charge for allocation from inside page fault handler,
> to prevent endless cycle in retried page faults.
> However we should not do it for allocations from task context,
> to prevent memcg-limited vmalloc-eaters from consuming all host memory.

I don't see a big difference between those two. Because the #PF could
result in the very same situation, depleting all the memory by
overcharging. A different behavior just leads to confusion and
unexpected behavior. E.g. in the past we only triggered the memcg OOM killer
from the #PF path and failed the charge otherwise. That is something
different but it shows problems we haven't anticipated and had user
visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
to the charge path").

> Also I would like to return to the following hunk.
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = should_force_charge() || out_of_memory(&oc);
> + ret = task_is_dying() || out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
>
> Now I think it's better to keep the task_is_dying() check here.
> If the task is dying, it is not necessary to push another task to free the memory.
> We broke the vmalloc cycle already, so it looks like nothing should prevent us
> from returning to userspace, handling the fatal signal, exiting and freeing the memory.

That patch has to be discuss in its full length. There were other
details I have brought up AFAIU.
--
Michal Hocko
SUSE Labs

2021-10-19 14:22:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Tue 19-10-21 16:13:19, Michal Hocko wrote:
[...]
> That patch has to be discuss in its full length. There were other
> details I have brought up AFAIU.

And btw. thanks for walking through that maze! It is far from trivial
with side effects all over the place which are far from obvious and heavily
underdocumented.

--
Michal Hocko
SUSE Labs

2021-10-19 19:12:22

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>>>> a) due to fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>> select_bad_process
>>>>> mem_cgroup_scan_tasks
>>>>> oom_evaluate_task
>>>>> oom_badness
>>>>>
>>>>> /*
>>>>> * Do not even consider tasks which are explicitly marked oom
>>>>> * unkillable or have been already oom reaped or the are in
>>>>> * the middle of vfork
>>>>> */
>>>>> adj = (long)p->signal->oom_score_adj;
>>>>> if (adj == OOM_SCORE_ADJ_MIN ||
>>>>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>> in_vfork(p)) {
>>>>> task_unlock(p);
>>>>> return LONG_MIN;
>>>>> }
>>>>>
>>>>> This time we handle a userspace page fault, so we cannot be a kernel thread,
>>>>> and cannot be in_vfork().
>>>>> However task can be marked as oom unkillable,
>>>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry for ever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force charge for allocation from inside page fault handler,
>> to prevent endless cycle in retried page faults.
>> However we should not do it for allocations from task context,
>> to prevent memcg-limited vmalloc-eaters from consuming all host memory.
>
> I don't see a big difference between those two. Because the #PF could
> result in the very same situation, depleting all the memory by
> overcharging. A different behavior just leads to confusion and
> unexpected behavior. E.g. in the past we only triggered the memcg OOM killer
> from the #PF path and failed the charge otherwise. That is something
> different but it shows problems we haven't anticipated and had user
> visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
> to the charge path").

In this case I think we should fail this allocation.
It's better not to allow overcharge, neither in #PF nor in regular allocations.

However this failure will trigger a false global OOM in pagefault_out_of_memory(),
and we need to find some way to prevent it.

Thank you,
Vasily Averin

2021-10-20 08:08:58

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This is resembling a heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level so the memcg heuristic is more relieved.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.

It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.

In addition, this patch renames should_force_charge() helper
to task_is_dying() because now its use is not associated with forced
charging.

If try_charge_memcg() is called from #PF, its new failures can force
pagefault_out_of_memory() to execute the global OOM. To prevent it
pagefault_out_of_memory() was updated to properly handle memcg-related
restrictions.

Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Vasily Averin <[email protected]>
---
v4: updated pagefault_out_of_memory() to properly handle new memcg-related
restrictions and not allow false global OOM
v3: no functional changes, just improved patch description
v2: switched to patch version proposed by mhocko@

mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
mm/oom_kill.c | 3 +++
2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))

-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);

unlock:
mutex_unlock(&oom_lock);
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
mem_cgroup_oom_notify(memcg);

mem_cgroup_unmark_under_oom(memcg);
- if (mem_cgroup_out_of_memory(memcg, mask, order))
+ if (mem_cgroup_out_of_memory(memcg, mask, order)) {
ret = OOM_SUCCESS;
- else
+ } else {
ret = OOM_FAILED;
-
+ /*
+ * In some rare cases mem_cgroup_out_of_memory() can return false.
+ * If it was called from #PF it forces handle_mm_fault()
+ * return VM_FAULT_OOM and executes pagefault_out_of_memory().
+ * memcg_in_oom is set here to notify pagefault_out_of_memory()
+ * that it was a memcg-related failure and not allow to run
+ * global OOM.
+ */
+ if (current->in_user_fault)
+ current->memcg_in_oom = (struct mem_cgroup *)ret;
+ }
if (locked)
mem_cgroup_oom_unlock(memcg);

@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
if (!memcg)
return false;

+ /* OOM is memcg, however out_of_memory() found no victim */
+ if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+ /*
+ * Should be called from pagefault_out_of_memory() only,
+ * where it is used to prevent false global OOM.
+ */
+ current->memcg_in_oom = NULL;
+ return true;
+ }
if (!handle)
goto cleanup;

@@ -2530,6 +2549,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2564,15 +2584,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_ATOMIC)
goto force;

- /*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2641,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;

- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;

/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;

+ if (fatal_signal_pending(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
--
2.32.0

2021-10-20 08:44:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks

On Wed 20-10-21 11:07:02, Vasily Averin wrote:
[...]
I haven't read through the changelog and only focused on the patch this
time.

[...]
> @@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> mem_cgroup_oom_notify(memcg);
>
> mem_cgroup_unmark_under_oom(memcg);
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order)) {
> ret = OOM_SUCCESS;
> - else
> + } else {
> ret = OOM_FAILED;
> -
> + /*
> + * In some rare cases mem_cgroup_out_of_memory() can return false.
> + * If it was called from #PF it forces handle_mm_fault()
> + * return VM_FAULT_OOM and executes pagefault_out_of_memory().
> + * memcg_in_oom is set here to notify pagefault_out_of_memory()
> + * that it was a memcg-related failure and not allow to run
> + * global OOM.
> + */
> + if (current->in_user_fault)
> + current->memcg_in_oom = (struct mem_cgroup *)ret;
> + }
> if (locked)
> mem_cgroup_oom_unlock(memcg);
>
> @@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
> if (!memcg)
> return false;
>
> + /* OOM is memcg, however out_of_memory() found no victim */
> + if (memcg == (struct mem_cgroup *)OOM_FAILED) {
> + /*
> + * Should be called from pagefault_out_of_memory() only,
> + * where it is used to prevent false global OOM.
> + */
> + current->memcg_in_oom = NULL;
> + return true;
> + }
> if (!handle)
> goto cleanup;

I have to say I am not a great fan of this but this belongs to a
separate patch on top of all the previous ones.

[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
> if (mem_cgroup_oom_synchronize(true))
> return;
>
> + if (fatal_signal_pending(current))
> + return;
> +

This belongs to its own patch as well.

All that being said I would go with pagefault_out_of_memory as the first
patch because it is necessary to handle charge failures. Then go with a
patch to remove charge forcing when OOM killer succeeds but the retry
still fails and finally go with one that tries to handle oom failures.
--
Michal Hocko
SUSE Labs

2021-10-20 12:14:58

by Vasily Averin

[permalink] [raw]
Subject: [PATCH memcg RFC 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks

Dear Michal,
as you requested, I split the v4 patch into 3 separate parts.
Let's discuss each of them.

Open questions/ToDo:

- Update patch descriptions (and add some comments)

- in part 2 aka "memcg: remove charge forcing for dying tasks":
should we keep task_is_dying() in mem_cgroup_out_of_memory()?

- in part 3 aka "memcg: handle memcg oom failures"
what is the best way to notify pagefault_out_of_memory() about
mem_cgroup_out_of_memory() failure?

- what is the best way to handle a memcg failure due to the kmem limit,
which can trigger a false global OOM

Vasily Averin (3):
mm: do not force global OOM from inside dying tasks
memcg: remove charge forcing for dying tasks
memcg: handle memcg oom failures

mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
mm/oom_kill.c | 3 +++
2 files changed, 33 insertions(+), 22 deletions(-)

--
2.32.0

2021-10-21 08:07:59

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> User-space task inside the memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>> out_of_memory() without a memcg being set.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
>
> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/[email protected]/msg1400402.html

I re-read this discussion and in general I support your position.
As far as I understand your opponents cannot explain why "random kill" is mandatory here,
they are just afraid that it might be useful here and do not want to remove it completely.

Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior.

However I would like to have some choice in this point.

In general we can:
- continue to use "random kill" and rely on the wisdom of the ancestors.
- do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
- add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.
- mark the current task as cycling in #PF and somehow use this mark in the allocator
- make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.
- implement any better ideas,
- use any combination of previous points

We can select required strategy for example via sysctl.

For me "random kill" is the worst choice.
Why can't we just kill the looping process instead?
It can be marked as oom-unkillable, so the OOM killer was unable to select it.
However I doubt it means "never kill it"; for me it is something like a "last possible victim" priority.

Thank you,
Vasily Averin

2021-10-21 11:53:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On Thu 21-10-21 11:03:43, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> User-space task inside the memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> >> out_of_memory() without a memcg being set.
>
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> >
> > [1] I do not have msg-id so I cannot provide a lore link but google
> > pointed me to https://www.mail-archive.com/[email protected]/msg1400402.html
>
> I re-read this discussion and in general I support your position.
> As far as I understand your opponents cannot explain why "random kill" is mandatory here,
> they are just afraid that it might be useful here and do not want to remove it completely.

That aligns with my recollection.

> Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior.
>
> However I would like to have some choice in this point.
>
> In general we can:
> - continue to use "random kill" and rely on the wisdom of the ancestors.

I do not follow. Does that mean to preserve existing oom killer from
#PF?

> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.

Again, not really sure what you mean

> - mark the current task as cycling in #PF and somehow use this mark in the allocator

How?

> - make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.

No! We cannot really kill the task; if we could, we would have done it by
the oom killer already.

> - implement any better ideas,
> - use any combination of previous points
>
> We can select required strategy for example via sysctl.

Absolutely no! How can admin know any better than the kernel?

> For me "random kill" is the worst choice.
> Why can't we just kill the looping process instead?

See above.

> It can be marked as oom-unkillable, so the OOM killer was unable to select it.
> However I doubt it means "never kill it"; for me it is something like a "last possible victim" priority.

It means never kill it because of OOM. If it is retrying because of OOM
then it is effectively the same thing.

The oom killer from the #PF doesn't really provide any clear advantage
these days AFAIK. On the other hand it allows for a very disruptive
behavior. In a worst case it can lead to a system panic if the
VM_FAULT_OOM is not really caused by a memory shortage but rather a
wrong error handling. If a task is looping there without any progress
then it is still killable which is a much saner behavior IMHO.
--
Michal Hocko
SUSE Labs

2021-10-21 13:28:14

by Vasily Averin

[permalink] [raw]
Subject: Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task

On 21.10.2021 14:49, Michal Hocko wrote:
> On Thu 21-10-21 11:03:43, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> User-space task inside the memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>>>> out_of_memory() without a memcg being set.
>>
>>> I will be honest that I am not really happy about pagefault_out_of_memory.
>>> I have tried to remove it in the past. Without much success back then,
>>> unfortunately[1].
>>>
>>> [1] I do not have msg-id so I cannot provide a lore link but google
>>> pointed me to https://www.mail-archive.com/[email protected]/msg1400402.html
>>
>> I re-read this discussion and in general I support your position.
>> As far as I understand your opponents cannot explain why "random kill" is mandatory here,
>> they are just afraid that it might be useful here and do not want to remove it completely.
>
> That aligns with my recollection.
>
>> Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior.
>>
>> However I would like to have some choice in this point.
>>
>> In general we can:
>> - continue to use "random kill" and rely on the wisdom of the ancestors.
>
> I do not follow. Does that mean to preserve existing oom killer from
> #PF?
>
>> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
>> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.
>
> Again, not really sure what you mean
>
>> - mark the current task as cycling in #PF and somehow use this mark in the allocator
>
> How?
>
>> - make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.
>
> No! We cannot really kill the task; if we could, we would have done it by
> the oom killer already.
>
>> - implement any better ideas,
>> - use any combination of previous points
>>
>> We can select required strategy for example via sysctl.
>
> Absolutely no! How can admin know any better than the kernel?
>
>> For me "random kill" is the worst choice.
>> Why can't we just kill the looping process instead?
>
> See above.
>
>> It can be marked as oom-unkillable, so the OOM killer was unable to select it.
>> However I doubt it means "never kill it"; for me it is something like a "last possible victim" priority.
>
> It means never kill it because of OOM. If it is retrying because of OOM
> then it is effectively the same thing.
>
> The oom killer from the #PF doesn't really provide any clear advantage
> these days AFAIK. On the other hand it allows for a very disruptive
> behavior. In a worst case it can lead to a system panic if the
> VM_FAULT_OOM is not really caused by a memory shortage but rather a
> wrong error handling. If a task is looping there without any progress
> then it is still killable which is a much saner behavior IMHO.

Let's continue this discussion in the "Re: [PATCH memcg 3/3] memcg: handle memcg oom failures" thread.