When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing ends up actually yielding the processor back
to the victim on UP systems with preemption disabled. Instead, the
charging process simply loops in memcg reclaim and eventually soft
locks up.
Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
shrink_node+0x40d/0x7d0
do_try_to_free_pages+0x13f/0x470
try_to_free_mem_cgroup_pages+0x16d/0x230
try_charge+0x247/0xac0
mem_cgroup_try_charge+0x10a/0x220
mem_cgroup_try_charge_delay+0x1e/0x40
handle_mm_fault+0xdf2/0x15f0
do_user_addr_fault+0x21f/0x420
page_fault+0x2f/0x40
Make sure that something ends up actually yielding the processor back to
the victim to allow for memory freeing. The most appropriate place appears
to be shrink_node_memcgs(), where the iteration over all descendant memcgs
could be particularly lengthy.
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Signed-off-by: David Rientjes <[email protected]>
---
mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
+ cond_resched();
+
switch (mem_cgroup_protected(target_memcg, memcg)) {
case MEMCG_PROT_MIN:
/*
On 2020/03/11 6:39, David Rientjes wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> unsigned long reclaimed;
> unsigned long scanned;
>
> + cond_resched();
> +
Is this safe for the CONFIG_PREEMPTION case? If the current thread has realtime priority,
can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
than the OOM victim?) gets scheduled?
> switch (mem_cgroup_protected(target_memcg, memcg)) {
> case MEMCG_PROT_MIN:
> /*
>
On Tue 10-03-20 14:39:48, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled. Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
>
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> RIP: 0010:shrink_lruvec+0x4e9/0xa40
> ...
> Call Trace:
> shrink_node+0x40d/0x7d0
> do_try_to_free_pages+0x13f/0x470
> try_to_free_mem_cgroup_pages+0x16d/0x230
> try_charge+0x247/0xac0
> mem_cgroup_try_charge+0x10a/0x220
> mem_cgroup_try_charge_delay+0x1e/0x40
> handle_mm_fault+0xdf2/0x15f0
> do_user_addr_fault+0x21f/0x420
> page_fault+0x2f/0x40
>
> Make sure that something ends up actually yielding the processor back to
> the victim to allow for memory freeing. Most appropriate place appears to
> be shrink_node_memcgs() where the iteration of all decendant memcgs could
> be particularly lengthy.
There is a cond_resched in shrink_lruvec and another one in
shrink_page_list. Why doesn't either of them get hit? Is it because there are
no pages on the LRU list? The rss data suggests there should be
enough pages to go down that path. Or maybe it is the shrink_slab path that
takes too long?
The patch itself makes sense to me, but I would like to see more
explanation of how that happens.
Thanks.
> Cc: Vlastimil Babka <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/vmscan.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> unsigned long reclaimed;
> unsigned long scanned;
>
> + cond_resched();
> +
> switch (mem_cgroup_protected(target_memcg, memcg)) {
> case MEMCG_PROT_MIN:
> /*
--
Michal Hocko
SUSE Labs
On Wed, 11 Mar 2020, Tetsuo Handa wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > unsigned long reclaimed;
> > unsigned long scanned;
> >
> > + cond_resched();
> > +
>
> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
> than the OOM victim ?) gets scheduled?
>
I think it's the best we can do that immediately solves the issue unless
you have another idea in mind?
> > switch (mem_cgroup_protected(target_memcg, memcg)) {
> > case MEMCG_PROT_MIN:
> > /*
> >
>
On Tue, 10 Mar 2020, Michal Hocko wrote:
> > When a process is oom killed as a result of memcg limits and the victim
> > is waiting to exit, nothing ends up actually yielding the processor back
> > to the victim on UP systems with preemption disabled. Instead, the
> > charging process simply loops in memcg reclaim and eventually soft
> > lockups.
> >
> > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > ...
> > Call Trace:
> > shrink_node+0x40d/0x7d0
> > do_try_to_free_pages+0x13f/0x470
> > try_to_free_mem_cgroup_pages+0x16d/0x230
> > try_charge+0x247/0xac0
> > mem_cgroup_try_charge+0x10a/0x220
> > mem_cgroup_try_charge_delay+0x1e/0x40
> > handle_mm_fault+0xdf2/0x15f0
> > do_user_addr_fault+0x21f/0x420
> > page_fault+0x2f/0x40
> >
> > Make sure that something ends up actually yielding the processor back to
> > the victim to allow for memory freeing. Most appropriate place appears to
> > be shrink_node_memcgs() where the iteration of all decendant memcgs could
> > be particularly lengthy.
>
> There is a cond_resched in shrink_lruvec and another one in
> shrink_page_list. Why doesn't any of them hit? Is it because there are
> no pages on the LRU list? Because rss data suggests there should be
> enough pages to go that path. Or maybe it is shrink_slab path that takes
> too long?
>
I think it can be a number of cases, most notably the mem_cgroup_protected()
checks, which is why the cond_resched() is added above them. Rather than adding
cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW cases, the
cond_resched() is added above the switch statement because the iteration
itself may be very lengthy.
We could also do it in shrink_zones() or the priority-based
do_try_to_free_pages() loop, but I'd be nervous about the lengthy memcg
iteration in shrink_node_memcgs() independent of this.
Any other ideas on how to ensure we actually try to resched for the
benefit of an oom victim to prevent this soft lockup?
> The patch itself makes sense to me but I would like to see more
> explanation on how that happens.
>
> Thanks.
>
> > Cc: Vlastimil Babka <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: David Rientjes <[email protected]>
> > ---
> > mm/vmscan.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > unsigned long reclaimed;
> > unsigned long scanned;
> >
> > + cond_resched();
> > +
> > switch (mem_cgroup_protected(target_memcg, memcg)) {
> > case MEMCG_PROT_MIN:
> > /*
>
> --
> Michal Hocko
> SUSE Labs
>
On Tue, 10 Mar 2020 14:39:48 -0700 (PDT) David Rientjes <[email protected]> wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled. Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
>
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> RIP: 0010:shrink_lruvec+0x4e9/0xa40
> ...
> Call Trace:
> shrink_node+0x40d/0x7d0
> do_try_to_free_pages+0x13f/0x470
> try_to_free_mem_cgroup_pages+0x16d/0x230
> try_charge+0x247/0xac0
> mem_cgroup_try_charge+0x10a/0x220
> mem_cgroup_try_charge_delay+0x1e/0x40
> handle_mm_fault+0xdf2/0x15f0
> do_user_addr_fault+0x21f/0x420
> page_fault+0x2f/0x40
>
> Make sure that something ends up actually yielding the processor back to
> the victim to allow for memory freeing. Most appropriate place appears to
> be shrink_node_memcgs() where the iteration of all decendant memcgs could
> be particularly lengthy.
>
That's a bit sad.
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> unsigned long reclaimed;
> unsigned long scanned;
>
> + cond_resched();
> +
> switch (mem_cgroup_protected(target_memcg, memcg)) {
> case MEMCG_PROT_MIN:
> /*
Obviously better, but this will still spin its wheels until this task's
timeslice expires, and we might want to do something to help ensure
that the victim runs next (or soon)?
(And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
On Tue, 10 Mar 2020, Andrew Morton wrote:
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > unsigned long reclaimed;
> > unsigned long scanned;
> >
> > + cond_resched();
> > +
> > switch (mem_cgroup_protected(target_memcg, memcg)) {
> > case MEMCG_PROT_MIN:
> > /*
>
>
> Obviously better, but this will still spin wheels until this tasks's
> timeslice expires, and we might want to do something to help ensure
> that the victim runs next (or soon)?
>
We used to have a schedule_timeout_killable(1) to address exactly that
scenario but it was removed in 4.19:
commit 9bfe5ded054b8e28a94c78580f233d6879a00146
Author: Michal Hocko <[email protected]>
Date: Fri Aug 17 15:49:04 2018 -0700
mm, oom: remove sleep from under oom_lock
This is why we don't see this issue on 4.14 guests but we do on 4.19. I
had assumed the issue Tetsuo reported that resulted in that patch was
still an issue, and I preferred to fix the weird UP issue by adding a
cond_resched() that is likely needed for the iteration in
shrink_node_memcgs() anyway. Do we care to optimize for UP systems
encountering memcg oom kills? Eh, maybe, but I'm not very interested in
opening up a centithread about this.
> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
>
This guest does have CONFIG_MEMCG enabled; it's a memcg oom condition.
But unrelated to this patch, I think it's just a weird name for it: the
do-while loop in shrink_node_memcgs() actually uses memcg = NULL for the
non-memcg case and is responsible for calling into page and slab reclaim.
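For readers who haven't looked at the function recently, here is a rough
sketch of the loop being discussed, reconstructed only from the hunks and
symbols quoted in this thread (an approximation, not a verbatim copy of
mm/vmscan.c):

	static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
	{
		struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
		struct mem_cgroup *memcg;

		/* With CONFIG_MEMCG=n this iterates exactly once with memcg == NULL. */
		memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
		do {
			unsigned long reclaimed;
			unsigned long scanned;

			cond_resched();	/* the yield point added by this patch */

			switch (mem_cgroup_protected(target_memcg, memcg)) {
			case MEMCG_PROT_MIN:
				/* hard protection: nothing to reclaim here, skip it */
				continue;
			/* ... MEMCG_PROT_LOW / MEMCG_PROT_NONE handling elided ... */
			}

			/* per-memcg LRU (shrink_lruvec) and slab (shrink_slab) reclaim */
		} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
	}

Even when most of the children are skipped via MEMCG_PROT_MIN, the
mem_cgroup_iter() walk itself still visits every descendant, which is why
the added cond_resched() sits before the switch.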
On Tue 10-03-20 16:02:23, David Rientjes wrote:
> On Tue, 10 Mar 2020, Michal Hocko wrote:
>
> > > When a process is oom killed as a result of memcg limits and the victim
> > > is waiting to exit, nothing ends up actually yielding the processor back
> > > to the victim on UP systems with preemption disabled. Instead, the
> > > charging process simply loops in memcg reclaim and eventually soft
> > > lockups.
> > >
> > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > ...
> > > Call Trace:
> > > shrink_node+0x40d/0x7d0
> > > do_try_to_free_pages+0x13f/0x470
> > > try_to_free_mem_cgroup_pages+0x16d/0x230
> > > try_charge+0x247/0xac0
> > > mem_cgroup_try_charge+0x10a/0x220
> > > mem_cgroup_try_charge_delay+0x1e/0x40
> > > handle_mm_fault+0xdf2/0x15f0
> > > do_user_addr_fault+0x21f/0x420
> > > page_fault+0x2f/0x40
> > >
> > > Make sure that something ends up actually yielding the processor back to
> > > the victim to allow for memory freeing. Most appropriate place appears to
> > > be shrink_node_memcgs() where the iteration of all decendant memcgs could
> > > be particularly lengthy.
> >
> > There is a cond_resched in shrink_lruvec and another one in
> > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > no pages on the LRU list? Because rss data suggests there should be
> > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > too long?
> >
>
> I think it can be a number of cases, most notably mem_cgroup_protected()
> checks which is why the cond_resched() is added above it. Rather than add
> cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the
> cond_resched() is added above the switch clause because the iteration
> itself may be potentially very lengthy.
Was any of the above the case for your soft lockup? How have you
managed to trigger it? As I've said, I am not against the patch, but I
would really like to see an actual explanation of what happened rather than
speculation about what might have happened, if for nothing else then for
future reference.
If this is really about the whole hierarchy being MEMCG_PROT_MIN protected,
resulting in a very expensive and pointless reclaim walk that can
trigger a soft lockup, then that should be explicitly mentioned in the
changelog.
--
Michal Hocko
SUSE Labs
On Tue 10-03-20 17:18:02, Andrew Morton wrote:
[...]
> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
Because there won't be anything memcg specific with the config disabled.
mem_cgroup_iter will expand to a NULL memcg, the mem_cgroup_protected
switch is compiled out, mem_cgroup_lruvec will return the lruvec abstraction
which resolves to the pgdat, and the rest is not memcg dependent. We could
have split the reclaim protection or the loop out of line, but I believe
it is better for it to be clearly visible.
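To illustrate the point, the CONFIG_MEMCG=n stubs look roughly like the
following (paraphrased from memory of include/linux/memcontrol.h around
v5.6, so treat the exact form as an approximation):

	static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
					struct mem_cgroup *prev,
					struct mem_cgroup_reclaim_cookie *reclaim)
	{
		return NULL;	/* the do-while loop in shrink_node_memcgs() runs once */
	}

	static inline enum mem_cgroup_protection mem_cgroup_protected(
					struct mem_cgroup *root, struct mem_cgroup *memcg)
	{
		return MEMCG_PROT_NONE;	/* the protection switch is optimized away */
	}

With those stubs the whole memcg iteration collapses to a single pass over
the node's lruvec.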
--
Michal Hocko
SUSE Labs
On 2020/03/11 7:55, David Rientjes wrote:
> On Wed, 11 Mar 2020, Tetsuo Handa wrote:
>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>> unsigned long reclaimed;
>>> unsigned long scanned;
>>>
>>> + cond_resched();
>>> +
>>
>> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
>> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
>> than the OOM victim ?) gets scheduled?
>>
>
> I think it's the best we can do that immediately solves the issue unless
> you have another idea in mind?
"schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
reaping so that allocating threads guarantee that some memory is reclaimed".
>
>>> switch (mem_cgroup_protected(target_memcg, memcg)) {
>>> case MEMCG_PROT_MIN:
>>> /*
>>>
>>
On Wed, 11 Mar 2020, Tetsuo Handa wrote:
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >>> unsigned long reclaimed;
> >>> unsigned long scanned;
> >>>
> >>> + cond_resched();
> >>> +
> >>
> >> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
> >> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
> >> than the OOM victim ?) gets scheduled?
> >>
> >
> > I think it's the best we can do that immediately solves the issue unless
> > you have another idea in mind?
>
> "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
> so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
> reaping so that allocating threads guarantee that some memory is reclaimed".
>
The cond_resched() here is needed anyway if the iteration is lengthy,
depending on the number of descendant memcgs.
schedule_timeout_killable(1) does not make any guarantees that current
will be scheduled after the victim or oom_reaper on UP systems.
If you have an alternate patch to try, we can test it. But since this
cond_resched() is needed anyway, I'm not sure it will change the result.
> >
> >>> switch (mem_cgroup_protected(target_memcg, memcg)) {
> >>> case MEMCG_PROT_MIN:
> >>> /*
> >>>
> >>
>
>
On Wed, 11 Mar 2020, Michal Hocko wrote:
> > > > When a process is oom killed as a result of memcg limits and the victim
> > > > is waiting to exit, nothing ends up actually yielding the processor back
> > > > to the victim on UP systems with preemption disabled. Instead, the
> > > > charging process simply loops in memcg reclaim and eventually soft
> > > > lockups.
> > > >
> > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > > ...
> > > > Call Trace:
> > > > shrink_node+0x40d/0x7d0
> > > > do_try_to_free_pages+0x13f/0x470
> > > > try_to_free_mem_cgroup_pages+0x16d/0x230
> > > > try_charge+0x247/0xac0
> > > > mem_cgroup_try_charge+0x10a/0x220
> > > > mem_cgroup_try_charge_delay+0x1e/0x40
> > > > handle_mm_fault+0xdf2/0x15f0
> > > > do_user_addr_fault+0x21f/0x420
> > > > page_fault+0x2f/0x40
> > > >
> > > > Make sure that something ends up actually yielding the processor back to
> > > > the victim to allow for memory freeing. Most appropriate place appears to
> > > > be shrink_node_memcgs() where the iteration of all decendant memcgs could
> > > > be particularly lengthy.
> > >
> > > There is a cond_resched in shrink_lruvec and another one in
> > > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > > no pages on the LRU list? Because rss data suggests there should be
> > > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > > too long?
> > >
> >
> > I think it can be a number of cases, most notably mem_cgroup_protected()
> > checks which is why the cond_resched() is added above it. Rather than add
> > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the
> > cond_resched() is added above the switch clause because the iteration
> > itself may be potentially very lengthy.
>
> Was any of the above the case for your soft lockup case? How have you
> managed to trigger it? As I've said I am not against the patch but I
> would really like to see an actual explanation what happened rather than
> speculations of what might have happened. If for nothing else then for
> the future reference.
>
Yes, this is how it was triggered in my own testing.
> If this is really about all the hierarchy being MEMCG_PROT_MIN protected
> and that results in a very expensive and pointless reclaim walk that can
> trigger soft lockup then it should be explicitly mentioned in the
> changelog.
I think the changelog clearly states that we need to guarantee that a
reclaimer will yield the processor back to allow a victim to exit. This
is where we make the guarantee. If it helps for the specific reason it
triggered in my testing, we could add:
"For example, mem_cgroup_protected() can prohibit reclaim and thus any
yielding in page reclaim would not address the issue."
On 2020/03/12 4:38, David Rientjes wrote:
> On Wed, 11 Mar 2020, Tetsuo Handa wrote:
>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>>>> unsigned long reclaimed;
>>>>> unsigned long scanned;
>>>>>
>>>>> + cond_resched();
>>>>> +
>>>>
>>>> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
>>>> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
>>>> than the OOM victim ?) gets scheduled?
>>>>
>>>
>>> I think it's the best we can do that immediately solves the issue unless
>>> you have another idea in mind?
>>
>> "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
>> so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
>> reaping so that allocating threads guarantee that some memory is reclaimed".
>>
>
> The cond_resched() here is needed if the iteration is lengthy depending on
> the number of descendant memcgs already.
No. The cond_resched() here will become a no-op if CONFIG_PREEMPTION=y and
the current thread has realtime priority.
>
> schedule_timeout_killable(1) does not make any guarantees that current
> will be scheduled after the victim or oom_reaper on UP systems.
The point of schedule_timeout_*(1) is to guarantee that the current thread
will yield the CPU to other threads even in the case where CONFIG_PREEMPTION=y
and the current thread has realtime priority. The fact that there is no
guarantee that the current thread will be rescheduled immediately after the
sleep is irrelevant.
>
> If you have an alternate patch to try, we can test it. But since this
> cond_resched() is needed anyway, I'm not sure it will change the result.
schedule_timeout_killable(1) is an alternate patch to try; I don't think
that this cond_resched() is needed anyway.
>
>>>
>>>>> switch (mem_cgroup_protected(target_memcg, memcg)) {
>>>>> case MEMCG_PROT_MIN:
>>>>> /*
>>>>>
>>>>
>>
>>
On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > The cond_resched() here is needed if the iteration is lengthy depending on
> > the number of descendant memcgs already.
>
> No. cond_resched() here will become no-op if CONFIG_PREEMPTION=y and current
> thread has realtime priority.
>
It's helpful without CONFIG_PREEMPTION for excessively long memcg
iterations to avoid starving need_resched.
> > schedule_timeout_killable(1) does not make any guarantees that current
> > will be scheduled after the victim or oom_reaper on UP systems.
>
> The point of schedule_timeout_*(1) is to guarantee that current thread
> will yield CPU to other threads even if CONFIG_PREEMPTION=y and current
> thread has realtime priority case. There is no guarantee that current
> thread will be rescheduled immediately after a sleep is irrelevant.
>
> >
> > If you have an alternate patch to try, we can test it. But since this
> > cond_resched() is needed anyway, I'm not sure it will change the result.
>
> schedule_timeout_killable(1) is an alternate patch to try; I don't think
> that this cond_resched() is needed anyway.
>
You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > If you have an alternate patch to try, we can test it. But since this
> > > cond_resched() is needed anyway, I'm not sure it will change the result.
> >
> > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > that this cond_resched() is needed anyway.
> >
>
> You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
>
Andrew Morton also questioned whether cond_resched() in shrink_node_memcgs()
is enough. But as you mentioned,
David Rientjes wrote:
> On Tue, 10 Mar 2020, Andrew Morton wrote:
>
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > unsigned long reclaimed;
> > > unsigned long scanned;
> > >
> > > + cond_resched();
> > > +
> > > switch (mem_cgroup_protected(target_memcg, memcg)) {
> > > case MEMCG_PROT_MIN:
> > > /*
> >
> >
> > Obviously better, but this will still spin wheels until this tasks's
> > timeslice expires, and we might want to do something to help ensure
> > that the victim runs next (or soon)?
> >
>
> We used to have a schedule_timeout_killable(1) to address exactly that
> scenario but it was removed in 4.19:
>
> commit 9bfe5ded054b8e28a94c78580f233d6879a00146
> Author: Michal Hocko <[email protected]>
> Date: Fri Aug 17 15:49:04 2018 -0700
>
> mm, oom: remove sleep from under oom_lock
you can try re-adding the sleep outside of oom_lock:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d09776cd6e10..3aee7e0eca4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
+ schedule_timeout_killable(1);
return ret;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e80158049651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
- schedule_timeout_uninterruptible(1);
return NULL;
}
@@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/* Retry as long as the OOM killer is making progress */
if (did_some_progress) {
+ schedule_timeout_uninterruptible(1);
no_progress_loops = 0;
goto retry;
}
By the way, will you share the reproducer (and how to use it)?
> > We used to have a schedule_timeout_killable(1) to address exactly that
> > scenario but it was removed in 4.19:
> >
> > commit 9bfe5ded054b8e28a94c78580f233d6879a00146
> > Author: Michal Hocko <[email protected]>
> > Date: Fri Aug 17 15:49:04 2018 -0700
> >
> > mm, oom: remove sleep from under oom_lock
>
For the record: if I revert that commit, I can observe the current
thread sleeping for more than one minute with oom_lock held.
That is, I don't want to revert that commit.
[ 1629.930998][T14021] idle-priority invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[ 1629.934944][T14021] CPU: 0 PID: 14021 Comm: idle-priority Kdump: loaded Not tainted 5.6.0-rc5+ #968
[ 1629.938352][T14021] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[ 1629.942211][T14021] Call Trace:
[ 1629.943956][T14021] dump_stack+0x71/0xa5
[ 1629.946041][T14021] dump_header+0x4d/0x3e0
[ 1629.948097][T14021] oom_kill_process+0x179/0x210
[ 1629.950191][T14021] out_of_memory+0x109/0x3d0
[ 1629.952154][T14021] ? out_of_memory+0x1ea/0x3d0
[ 1629.954349][T14021] __alloc_pages_slowpath+0x934/0xb89
[ 1629.956959][T14021] __alloc_pages_nodemask+0x333/0x370
[ 1629.959121][T14021] handle_pte_fault+0x4ce/0xb40
[ 1629.961274][T14021] __handle_mm_fault+0x2b2/0x5e0
[ 1629.963363][T14021] handle_mm_fault+0x177/0x360
[ 1629.965404][T14021] ? handle_mm_fault+0x46/0x360
[ 1629.967451][T14021] do_page_fault+0x2d5/0x680
[ 1629.969351][T14021] page_fault+0x34/0x40
[ 1629.971290][T14021] RIP: 0010:__clear_user+0x36/0x70
[ 1629.973621][T14021] Code: 0c a7 e2 95 53 48 89 f3 be 14 00 00 00 e8 82 87 b4 ff 0f 01 cb 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 48 85 c9 74 0f <48> c7 07 00 00 00 00 48 83 c7 08 ff c9 75 f1 48 89 c1 85 c9 74 0a
[ 1629.980104][T14021] RSP: 0000:ffffa06e0742bd40 EFLAGS: 00050202
[ 1629.982543][T14021] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 1629.985322][T14021] RDX: 0000000000000000 RSI: 00000000a8057a95 RDI: 00007f0ca8605000
[ 1629.999883][T14021] RBP: ffffa06e0742bd50 R08: 0000000000000001 R09: 0000000000000000
[ 1630.018164][T14021] R10: 0000000000000000 R11: ffff953843a80978 R12: 00007f0ca8604010
[ 1630.021491][T14021] R13: 000000000c988000 R14: ffffa06e0742be08 R15: 0000000000001000
[ 1630.024881][T14021] clear_user+0x47/0x60
[ 1630.026651][T14021] iov_iter_zero+0x95/0x3d0
[ 1630.028455][T14021] read_iter_zero+0x38/0xb0
[ 1630.030234][T14021] new_sync_read+0x112/0x1b0
[ 1630.032025][T14021] __vfs_read+0x27/0x40
[ 1630.033639][T14021] vfs_read+0xaf/0x160
[ 1630.035243][T14021] ksys_read+0x62/0xe0
[ 1630.036869][T14021] __x64_sys_read+0x15/0x20
[ 1630.038827][T14021] do_syscall_64+0x4a/0x1e0
[ 1630.040515][T14021] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1630.042598][T14021] RIP: 0033:0x7f109bb78950
[ 1630.044208][T14021] Code: Bad RIP value.
[ 1630.045798][T14021] RSP: 002b:00007fff499db458 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1630.048558][T14021] RAX: ffffffffffffffda RBX: 0000000200000000 RCX: 00007f109bb78950
[ 1630.051117][T14021] RDX: 0000000200000000 RSI: 00007f0c9bc7c010 RDI: 0000000000000003
[ 1630.053715][T14021] RBP: 00007f0c9bc7c010 R08: 0000000000000000 R09: 0000000000021000
[ 1630.056488][T14021] R10: 00007fff499daea0 R11: 0000000000000246 R12: 00007f0c9bc7c010
[ 1630.059124][T14021] R13: 0000000000000005 R14: 0000000000000003 R15: 0000000000000000
[ 1630.061935][T14021] Mem-Info:
[ 1630.063226][T14021] active_anon:1314192 inactive_anon:5198 isolated_anon:0
[ 1630.063226][T14021] active_file:9 inactive_file:0 isolated_file:0
[ 1630.063226][T14021] unevictable:0 dirty:0 writeback:0 unstable:0
[ 1630.063226][T14021] slab_reclaimable:6649 slab_unreclaimable:268898
[ 1630.063226][T14021] mapped:1094 shmem:10475 pagetables:127946 bounce:0
[ 1630.063226][T14021] free:25585 free_pcp:62 free_cma:0
[ 1630.075882][T14021] Node 0 active_anon:5256768kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4376kB dirty:0kB writeback:0kB shmem:41900kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3158016kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1630.084961][T14021] DMA free:15360kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1630.094169][T14021] lowmem_reserve[]: 0 2717 7644 7644
[ 1630.096174][T14021] DMA32 free:43680kB min:23972kB low:29964kB high:35956kB reserved_highatomic:0KB active_anon:2733524kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2782468kB mlocked:0kB kernel_stack:0kB pagetables:4764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1630.106132][T14021] lowmem_reserve[]: 0 0 4927 4927
[ 1630.108104][T14021] Normal free:43300kB min:43476kB low:54344kB high:65212kB reserved_highatomic:4096KB active_anon:2523244kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:0kB present:5242880kB managed:5045744kB mlocked:0kB kernel_stack:313872kB pagetables:507020kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB
[ 1630.118473][T14021] lowmem_reserve[]: 0 0 0 0
[ 1630.120348][T14021] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
[ 1630.124316][T14021] DMA32: 0*4kB 0*8kB 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 0*128kB 2*256kB (UM) 2*512kB (UM) 1*1024kB (M) 2*2048kB (UM) 9*4096kB (M) = 43680kB
[ 1630.128773][T14021] Normal: 1231*4kB (UE) 991*8kB (UE) 343*16kB (UME) 230*32kB (UE) 117*64kB (UME) 65*128kB (UME) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43300kB
[ 1630.134207][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1630.137435][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1630.140770][T14021] 10478 total pagecache pages
[ 1630.142789][T14021] 0 pages in swap cache
[ 1630.144572][T14021] Swap cache stats: add 0, delete 0, find 0/0
[ 1630.146921][T14021] Free swap = 0kB
[ 1630.148627][T14021] Total swap = 0kB
[ 1630.150298][T14021] 2096998 pages RAM
[ 1630.152078][T14021] 0 pages HighMem/MovableOnly
[ 1630.154073][T14021] 136105 pages reserved
[ 1630.156165][T14021] 0 pages cma reserved
[ 1630.158028][T14021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=normal-priority,pid=23173,uid=0
[ 1630.161867][T14021] Out of memory: Killed process 23173 (normal-priority) total-vm:4228kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:1000
[ 1630.171681][ T18] oom_reaper: reaped process 23173 (normal-priority), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[ 1800.548694][ C0] idle-priority R running task 12416 14021 1183 0x80004080
[ 1800.550942][ C0] Call Trace:
[ 1800.552116][ C0] __schedule+0x28c/0x860
[ 1800.553531][ C0] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 1800.555293][ C0] schedule+0x5f/0xd0
[ 1800.556618][ C0] schedule_timeout+0x1ab/0x340
[ 1800.558151][ C0] ? __next_timer_interrupt+0xd0/0xd0
[ 1800.560069][ C0] schedule_timeout_killable+0x25/0x30
[ 1800.561836][ C0] out_of_memory+0x113/0x3d0
[ 1800.563519][ C0] ? out_of_memory+0x1ea/0x3d0
[ 1800.565040][ C0] __alloc_pages_slowpath+0x934/0xb89
[ 1800.566727][ C0] __alloc_pages_nodemask+0x333/0x370
[ 1800.568397][ C0] handle_pte_fault+0x4ce/0xb40
[ 1800.569944][ C0] __handle_mm_fault+0x2b2/0x5e0
[ 1800.571527][ C0] handle_mm_fault+0x177/0x360
[ 1800.573046][ C0] ? handle_mm_fault+0x46/0x360
[ 1800.574592][ C0] do_page_fault+0x2d5/0x680
[ 1800.576306][ C0] page_fault+0x34/0x40
[ 1800.578413][ C0] RIP: 0010:__clear_user+0x36/0x70
On Wed 11-03-20 12:45:40, David Rientjes wrote:
> On Wed, 11 Mar 2020, Michal Hocko wrote:
>
> > > > > When a process is oom killed as a result of memcg limits and the victim
> > > > > is waiting to exit, nothing ends up actually yielding the processor back
> > > > > to the victim on UP systems with preemption disabled. Instead, the
> > > > > charging process simply loops in memcg reclaim and eventually soft
> > > > > lockups.
> > > > >
> > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > > > ...
> > > > > Call Trace:
> > > > > shrink_node+0x40d/0x7d0
> > > > > do_try_to_free_pages+0x13f/0x470
> > > > > try_to_free_mem_cgroup_pages+0x16d/0x230
> > > > > try_charge+0x247/0xac0
> > > > > mem_cgroup_try_charge+0x10a/0x220
> > > > > mem_cgroup_try_charge_delay+0x1e/0x40
> > > > > handle_mm_fault+0xdf2/0x15f0
> > > > > do_user_addr_fault+0x21f/0x420
> > > > > page_fault+0x2f/0x40
> > > > >
> > > > > Make sure that something ends up actually yielding the processor back to
> > > > > the victim to allow for memory freeing. Most appropriate place appears to
> > > > > be shrink_node_memcgs() where the iteration of all decendant memcgs could
> > > > > be particularly lengthy.
> > > >
> > > > There is a cond_resched in shrink_lruvec and another one in
> > > > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > > > no pages on the LRU list? Because rss data suggests there should be
> > > > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > > > too long?
> > > >
> > >
> > > I think it can be a number of cases, most notably mem_cgroup_protected()
> > > checks which is why the cond_resched() is added above it. Rather than add
> > > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the
> > > cond_resched() is added above the switch clause because the iteration
> > > itself may be potentially very lengthy.
> >
> > Was any of the above the case for your soft lockup case? How have you
> > managed to trigger it? As I've said I am not against the patch but I
> > would really like to see an actual explanation what happened rather than
> > speculations of what might have happened. If for nothing else then for
> > the future reference.
> >
>
> Yes, this is how it was triggered in my own testing.
>
> > If this is really about all the hierarchy being MEMCG_PROT_MIN protected
> > and that results in a very expensive and pointless reclaim walk that can
> > trigger soft lockup then it should be explicitly mentioned in the
> > changelog.
>
> I think the changelog clearly states that we need to guarantee that a
> reclaimer will yield the processor back to allow a victim to exit. This
> is where we make the guarantee. If it helps for the specific reason it
> triggered in my testing, we could add:
>
> "For example, mem_cgroup_protected() can prohibit reclaim and thus any
> yielding in page reclaim would not address the issue."
I would suggest something like the following:
"
The reclaim path (including the OOM path) relies on explicit scheduling
points to hand over execution to tasks which could help with the reclaim
process. Currently it is mostly shrink_page_list which yields the CPU for
each reclaimed page. This might be insufficient, though, in some
configurations. E.g. when a memcg OOM path is triggered in a hierarchy
which doesn't have any reclaimable memory because of memory reclaim
protection (MEMCG_PROT_MIN), it is possible to trigger a soft lockup
during an out of memory situation on non-preemptible kernels
<PUT YOUR SOFT LOCKUP SPLAT HERE>
Fix this by adding a cond_resched() up in the reclaim path and make sure
there is a yield point regardless of the reclaimability of the target
hierarchy.
"
--
Michal Hocko
SUSE Labs
On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > If you have an alternate patch to try, we can test it. But since this
> > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > >
> > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > that this cond_resched() is needed anyway.
> > >
> >
> > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> >
>
> Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> is enough. But like you mentioned,
>
It passes our testing because this is where the allocator is looping while
the victim is trying to exit, if only it could be scheduled.
> you can try re-adding sleep outside of oom_lock:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d09776cd6e10..3aee7e0eca4e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> ret = should_force_charge() || out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> + schedule_timeout_killable(1);
> return ret;
> }
>
If current was the process chosen for oom kill, this would actually induce
the problem, not fix it.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e80158049651 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> */
> if (!mutex_trylock(&oom_lock)) {
> *did_some_progress = 1;
> - schedule_timeout_uninterruptible(1);
> return NULL;
> }
>
> @@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
> /* Retry as long as the OOM killer is making progress */
> if (did_some_progress) {
> + schedule_timeout_uninterruptible(1);
> no_progress_loops = 0;
> goto retry;
> }
>
> By the way, will you share the reproducer (and how to use the reproducer) ?
>
On a UP kernel with swap disabled, you limit a memcg to 100MB and start
three processes attached to it that each fault 40MB. Same reproducer as
the "mm, oom: make a last minute check to prevent unnecessary memcg oom
kills" patch, except in that case there are two cores.
On Thu, 12 Mar 2020, Michal Hocko wrote:
> > I think the changelog clearly states that we need to guarantee that a
> > reclaimer will yield the processor back to allow a victim to exit. This
> > is where we make the guarantee. If it helps for the specific reason it
> > triggered in my testing, we could add:
> >
> > "For example, mem_cgroup_protected() can prohibit reclaim and thus any
> > yielding in page reclaim would not address the issue."
>
> I would suggest something like the following:
> "
> The reclaim path (including the OOM) relies on explicit scheduling
> points to hand over execution to tasks which could help with the reclaim
> process.
Are there other examples where yielding in the reclaim path would "help
with the reclaim process" other than oom victims? This sentence seems
vague.
> Currently it is mostly shrink_page_list which yields CPU for
> each reclaimed page. This might be insuficient though in some
> configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> which doesn't have any reclaimable memory because of memory reclaim
> protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> lockup during an out of memory situation on non preemptible kernels
> <PUT YOUR SOFT LOCKUP SPLAT HERE>
>
> Fix this by adding a cond_resched up in the reclaim path and make sure
> there is a yield point regardless of reclaimability of the target
> hierarchy.
> "
>
On Thu 12-03-20 11:20:33, David Rientjes wrote:
> On Thu, 12 Mar 2020, Michal Hocko wrote:
>
> > > I think the changelog clearly states that we need to guarantee that a
> > > reclaimer will yield the processor back to allow a victim to exit. This
> > > is where we make the guarantee. If it helps for the specific reason it
> > > triggered in my testing, we could add:
> > >
> > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any
> > > yielding in page reclaim would not address the issue."
> >
> > I would suggest something like the following:
> > "
> > The reclaim path (including the OOM) relies on explicit scheduling
> > points to hand over execution to tasks which could help with the reclaim
> > process.
>
> Are there other examples where yielding in the reclaim path would "help
> with the reclaim process" other than oom victims? This sentence seems
> vague.
In the context of UP and !PREEMPT this also includes IO flushers and
filesystems that rely on workers, and there are things I am very likely not
aware of. If you think this is vague then feel free to reformulate.
All I really care about is what the next paragraph is explaining.
> > Currently it is mostly shrink_page_list which yields CPU for
> > each reclaimed page. This might be insuficient though in some
> > configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> > which doesn't have any reclaimable memory because of memory reclaim
> > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> > lockup during an out of memory situation on non preemptible kernels
> > <PUT YOUR SOFT LOCKUP SPLAT HERE>
> >
> > Fix this by adding a cond_resched up in the reclaim path and make sure
> > there is a yield point regardless of reclaimability of the target
> > hierarchy.
> > "
> >
--
Michal Hocko
SUSE Labs
On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <[email protected]> wrote:
> On Thu, 12 Mar 2020, Tetsuo Handa wrote:
>
> > > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > > If you have an alternate patch to try, we can test it. But since this
> > > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > > >
> > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > > that this cond_resched() is needed anyway.
> > > >
> > >
> > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> > >
> >
> > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> > is enough. But like you mentioned,
> >
>
> It passes our testing because this is where the allocator is looping while
> the victim is trying to exit if only it could be scheduled.
What happens if the allocator has SCHED_FIFO?
David Rientjes wrote:
> > By the way, will you share the reproducer (and how to use the reproducer) ?
> >
>
> On an UP kernel with swap disabled, you limit a memcg to 100MB and start
> three processes that each fault 40MB attached to it. Same reproducer as
> the "mm, oom: make a last minute check to prevent unnecessary memcg oom
> kills" patch except in that case there are two cores.
>
I'm not a heavy memcg user. Please provide steps for reproducing your problem
in a "copy and pastable" way (e.g. bash script, C program).
> > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > */
> > ret = should_force_charge() || out_of_memory(&oc);
> > mutex_unlock(&oom_lock);
> > + schedule_timeout_killable(1);
> > return ret;
> > }
> >
>
> If current was process chosen for oom kill, this would actually induce the
> problem, not fix it.
>
Why? The memcg OOM path allows using the forced charge path if should_force_charge() == true.
Since your lockup report
Call Trace:
shrink_node+0x40d/0x7d0
do_try_to_free_pages+0x13f/0x470
try_to_free_mem_cgroup_pages+0x16d/0x230
try_charge+0x247/0xac0
mem_cgroup_try_charge+0x10a/0x220
mem_cgroup_try_charge_delay+0x1e/0x40
handle_mm_fault+0xdf2/0x15f0
do_user_addr_fault+0x21f/0x420
page_fault+0x2f/0x40
says that the allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(),
the allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom()
from try_charge(). And actually
Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
says that the allocating thread did reach mem_cgroup_out_of_memory(). Then, the allocating
thread must be able to sleep in mem_cgroup_out_of_memory() if schedule_timeout_killable(1)
is added to mem_cgroup_out_of_memory().
Also, if the current process was chosen for OOM-kill, the current process will be able to
leave try_charge() due to should_force_charge() == true, won't it?
Thus, how can "this would actually induce the problem, not fix it" happen?
If your problem is that something keeps allocating threads from reaching the
should_force_charge() check, please explain the mechanism. If that is explained,
I would agree that schedule_timeout_killable(1) in mem_cgroup_out_of_memory()
won't help.
On Fri, 13 Mar 2020, Tetsuo Handa wrote:
> > On an UP kernel with swap disabled, you limit a memcg to 100MB and start
> > three processes that each fault 40MB attached to it. Same reproducer as
> > the "mm, oom: make a last minute check to prevent unnecessary memcg oom
> > kills" patch except in that case there are two cores.
> >
>
> I'm not a heavy memcg user. Please provide steps for reproducing your problem
> in a "copy and pastable" way (e.g. bash script, C program).
>
Use Documentation/admin-guide/cgroup-v1/memory.rst or
Documentation/admin-guide/cgroup-v2.rst to set up a memcg depending on
which cgroup version you use, limit it to 100MB, and attach your process
to it.
Run three programs that each fault 40MB. To do that, you need to use mmap:
	(void)mmap(NULL, 40 << 20, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, 0, 0);
Have each stall after populating the memory:
	for (;;);
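Putting those steps together, a minimal standalone program along these lines
would be (a sketch based on the description above, not the original "repro"
binary from the report; it assumes three copies are run concurrently, attached
to a memcg limited to 100MB, on a UP kernel with swap disabled):

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* Fault in 40MB of anonymous memory up front via MAP_POPULATE. */
		void *p = mmap(NULL, 40 << 20, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Stall after populating the memory so the charges are never freed. */
		for (;;)
			;
	}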
> > > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > */
> > > ret = should_force_charge() || out_of_memory(&oc);
> > > mutex_unlock(&oom_lock);
> > > + schedule_timeout_killable(1);
> > > return ret;
> > > }
> > >
> >
> > If current was process chosen for oom kill, this would actually induce the
> > problem, not fix it.
> >
>
> Why? Memcg OOM path allows using forced charge path if should_force_charge() == true.
>
> Since your lockup report
>
> Call Trace:
> shrink_node+0x40d/0x7d0
> do_try_to_free_pages+0x13f/0x470
> try_to_free_mem_cgroup_pages+0x16d/0x230
> try_charge+0x247/0xac0
> mem_cgroup_try_charge+0x10a/0x220
> mem_cgroup_try_charge_delay+0x1e/0x40
> handle_mm_fault+0xdf2/0x15f0
> do_user_addr_fault+0x21f/0x420
> page_fault+0x2f/0x40
>
> says that allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(),
> allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom()
> from try_charge(). And actually
>
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
>
> says that allocating thread did reach mem_cgroup_out_of_memory(). Then, allocating thread
> must be able to sleep at mem_cgroup_out_of_memory() if schedule_timeout_killable(1) is
> mem_cgroup_out_of_memory().
>
> Also, if current process was chosen for OOM-kill, current process will be able to leave
> try_charge() due to should_force_charge() == true, won't it?
>
> Thus, how can "this would actually induce the problem, not fix it." happen?
The entire issue is that the victim never gets a chance to run because the
allocator doesn't give it one on a UP system. Your patch is
broken because if the victim is current, you've lost your golden
opportunity to actually exit and have ceded control to the allocator, which
will now starve the victim.
On 2020/03/14 7:01, David Rientjes wrote:
> The entire issue is that the victim never gets a chance to run because the
> allocator doesn't give it a chance to run on an UP system. Your patch is
> broken because if the victim is current, you've lost your golden
> opportunity to actually exit and ceded control to the allocator that will
> now starve the victim.
>
I still don't understand. There is no need to give CPU time to OOM victims.
We just need to give CPU time to the OOM reaper kernel thread until it sets
MMF_OOM_SKIP on the OOM victims. If the current thread is
an OOM victim, schedule_timeout_killable(1) will give other threads (including
the OOM reaper kernel thread) CPU time to run. That is similar to your
cond_resched() patch (except that cond_resched() might fail to give other
threads CPU time to run if the current thread has realtime priority), isn't it?
So, please explain the mechanism by which cond_resched() works but
schedule_timeout_killable(1) cannot.
On 2020/03/14 8:15, Tetsuo Handa wrote:
> If current thread is
> an OOM victim, schedule_timeout_killable(1) will give other threads (including
> the OOM reaper kernel thread) CPU time to run.
If the current thread is an OOM victim, schedule_timeout_killable(1) will give other
threads (including the OOM reaper kernel thread) CPU time to run, by leaving the
try_charge() path due to should_force_charge() == true and reaching the do_exit() path
instead of returning to the userspace code doing "for (;;);".
Unless the problem is that the current thread cannot reach the should_force_charge()
check, schedule_timeout_killable(1) should work.
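For reference, should_force_charge() in kernels of that era was roughly the
following (paraphrased from memory of mm/memcontrol.c, so treat the exact
form as an approximation):

	static bool should_force_charge(void)
	{
		/* Bypass the charge (and the memcg OOM path) for tasks already dying. */
		return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
			(current->flags & PF_EXITING);
	}

which is why an OOM victim coming back through try_charge() is expected to
take the forced-charge path rather than loop in reclaim.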
On Thu 12-03-20 15:32:38, Andrew Morton wrote:
> On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <[email protected]> wrote:
>
> > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> >
> > > > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > > > If you have an alternate patch to try, we can test it. But since this
> > > > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > > > >
> > > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > > > that this cond_resched() is needed anyway.
> > > > >
> > > >
> > > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> > > >
> > >
> > > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> > > is enough. But like you mentioned,
> > >
> >
> > It passes our testing because this is where the allocator is looping while
> > the victim is trying to exit if only it could be scheduled.
>
> What happens if the allocator has SCHED_FIFO?
The same thing as a SCHED_FIFO task running in a tight loop in userspace.
As long as a high-priority context depends on a resource held by a low-priority
task, we have a priority inversion problem, and the page allocator is no real
exception here. But I do not see that the allocator is much different from any
other code in the kernel. We do not add random sleeps here and there to push
high-priority FIFO or RT tasks out of the execution context. We do
cond_resched() to help !PREEMPT kernels, but priority-related issues are really
out of scope for that facility.
--
Michal Hocko
SUSE Labs
On Thu 12-03-20 21:16:27, Michal Hocko wrote:
> On Thu 12-03-20 11:20:33, David Rientjes wrote:
> > On Thu, 12 Mar 2020, Michal Hocko wrote:
> >
> > > > I think the changelog clearly states that we need to guarantee that a
> > > > reclaimer will yield the processor back to allow a victim to exit. This
> > > > is where we make the guarantee. If it helps for the specific reason it
> > > > triggered in my testing, we could add:
> > > >
> > > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any
> > > > yielding in page reclaim would not address the issue."
> > >
> > > I would suggest something like the following:
> > > "
> > > The reclaim path (including the OOM) relies on explicit scheduling
> > > points to hand over execution to tasks which could help with the reclaim
> > > process.
> >
> > Are there other examples where yielding in the reclaim path would "help
> > with the reclaim process" other than oom victims? This sentence seems
> > vague.
>
> In the context of UP and !PREEMPT this also includes IO flushers,
> filesystems rely on workers and there are things I am very likely not
> aware of. If you think this is vaague then feel free to reformulate.
> All I really do care about is what the next paragraph is explaining.
Btw. do you plan to send a patch with an updated changelog?
> > > Currently it is mostly shrink_page_list which yields CPU for
> > > each reclaimed page. This might be insuficient though in some
> > > configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> > > which doesn't have any reclaimable memory because of memory reclaim
> > > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> > > lockup during an out of memory situation on non preemptible kernels
> > > <PUT YOUR SOFT LOCKUP SPLAT HERE>
> > >
> > > Fix this by adding a cond_resched up in the reclaim path and make sure
> > > there is a yield point regardless of reclaimability of the target
> > > hierarchy.
> > > "
> > >
>
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On 2020/03/16 18:31, Michal Hocko wrote:
>> What happens if the allocator has SCHED_FIFO?
>
> The same thing as a SCHED_FIFO running in a tight loop in the userspace.
>
> As long as a high priority context depends on a resource held by a low
> priority task then we have a priority inversion problem and the page
> allocator is no real exception here. But I do not see the allocator
> is much different from any other code in the kernel. We do not add
> random sleeps here and there to push a high priority FIFO or RT tasks
> out of the execution context. We do cond_resched to help !PREEMPT
> kernels but priority related issues are really out of scope of that
> facility.
>
Spinning with realtime priority in userspace is a userspace bug.
Spinning with realtime priority in kernel space until the watchdog fires is
a kernel bug. We are not responsible for userspace bugs, and I'm
asking whether the memory allocator kernel code can give enough CPU
time to other threads even if the current thread has realtime priority.
On Mon 16-03-20 19:04:44, Tetsuo Handa wrote:
> On 2020/03/16 18:31, Michal Hocko wrote:
> >> What happens if the allocator has SCHED_FIFO?
> >
> > The same thing as a SCHED_FIFO running in a tight loop in the userspace.
> >
> > As long as a high priority context depends on a resource held by a low
> > priority task then we have a priority inversion problem and the page
> > allocator is no real exception here. But I do not see the allocator
> > is much different from any other code in the kernel. We do not add
> > random sleeps here and there to push a high priority FIFO or RT tasks
> > out of the execution context. We do cond_resched to help !PREEMPT
> > kernels but priority related issues are really out of scope of that
> > facility.
> >
>
> Spinning with realtime priority in userspace is a userspace's bug.
> Spinning with realtime priority in kernelspace until watchdog fires is
> a kernel's bug. We are not responsible for userspace's bug, and I'm
> asking whether the memory allocator kernel code can give enough CPU
> time to other threads even if current thread has realtime priority.
We've been through that discussion many times and the core point is that
this requires major surgery to work properly. It is not enough to just add
a sleep into the page allocator and be done with it. The page allocator
cannot really do much on its own. It relies on many other contexts to
make forward progress. What you really demand is far from trivial.
Maybe you are looking for something much closer to the RT kernel than what
other preemption modes can offer currently.
Right now, you really have to be careful when running FIFO/RT processes
and plan their resources very carefully. Is that ideal? Not really, but
considering that this has been the status quo for many years, it seems
that the use cases tend to find their way around that restriction.
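As an aside, the scenario being debated here is easy to picture with a
minimal userspace sketch (entirely illustrative; the priority value and the
busy loop are assumptions, not taken from the reproducer discussed in this
thread). Once a task does this on a UP machine, only kernel-side yields,
blocking, or RT throttling let anything else run:

#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 50 };	/* illustrative */

	/*
	 * Needs CAP_SYS_NICE. After this call, SCHED_OTHER tasks on this CPU
	 * only run when this task blocks, is preempted by a higher-priority
	 * RT task, or RT throttling kicks in.
	 */
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");

	for (;;)
		;	/* the "tight loop in userspace" case mentioned above */
	return 0;
}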
--
Michal Hocko
SUSE Labs
On Sat, 14 Mar 2020, Tetsuo Handa wrote:
> > If current thread is
> > an OOM victim, schedule_timeout_killable(1) will give other threads (including
> > the OOM reaper kernel thread) CPU time to run.
>
> If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> try_charge() path due to should_force_charge() == true and reaching do_exit() path
> instead of returning to userspace code doing "for (;;);".
>
> Unless the problem is that current thread cannot reach should_force_charge() check,
> schedule_timeout_killable(1) should work.
>
There is no need to yield if current is the oom victim: allowing the oom
reaper to run when it may not actually be able to free memory is not
required. It increases the likelihood that some other process schedules
and is unable to yield back due to the memcg oom condition, such that the
victim doesn't get a chance to run again.
This happens because the victim is allowed to overcharge but other
processes attached to an oom memcg hierarchy simply fail the charge. We
are then reliant on all memory chargers in the kernel to yield if their
charges fail due to oom. It's the only way to allow the victim to
eventually run.
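To make the "victim overcharges, everyone else fails" behaviour concrete,
here is a heavily condensed sketch of the charge-side logic being described
(this is not the literal mm/memcontrol.c code; charge_below_limit() and
try_reclaim_and_oom() are hypothetical helpers standing in for the real
charge/reclaim/oom sequence):

/* OOM victims and exiting/killed tasks are allowed to overcharge. */
static bool should_force_charge(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
	       (current->flags & PF_EXITING);
}

static int try_charge_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	while (!charge_below_limit(memcg, nr_pages)) {
		if (should_force_charge())
			return 0;	/* victim: overcharge, head to do_exit() */
		if (!try_reclaim_and_oom(memcg))
			return -ENOMEM;	/* everyone else: the charge simply fails */
		/*
		 * Nothing in this loop guarantees the victim gets the CPU on
		 * a UP !PREEMPT kernel.
		 */
	}
	return 0;
}

A sleep right after the oom kill, as proposed below, removes the need for
every caller that sees the -ENOMEM above to yield on its own.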
So the only change that I would make to your patch is to do this in
mem_cgroup_out_of_memory() instead:
if (!fatal_signal_pending(current))
schedule_timeout_killable(1);
So we don't have this reliance on all other memory chargers to yield when
their charge fails and there is no delay for victims themselves.
[ I'll still propose my change that adds cond_resched() to
shrink_node_memcgs() because we can see need_resched set for a
prolonged period of time without scheduling. ]
If you agree, I'll propose your patch with a changelog that indicates it
can fix the soft lockup issue for UP and can likely get a tested-by for
it.
David Rientjes wrote:
> On Sat, 14 Mar 2020, Tetsuo Handa wrote:
> > If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> > threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> > try_charge() path due to should_force_charge() == true and reaching do_exit() path
> > instead of returning to userspace code doing "for (;;);".
> >
> > Unless the problem is that current thread cannot reach should_force_charge() check,
> > schedule_timeout_killable(1) should work.
> >
>
> There is no need to yield if current is the oom victim: allowing the oom
> reaper to run when it may not actually be able to free memory is not
> required. It increases the likelihood that some other process schedules
> and is unable to yield back due to the memcg oom condition, such that the
> victim doesn't get a chance to run again.
>
> This happens because the victim is allowed to overcharge but other
> processes attached to an oom memcg hierarchy simply fail the charge. We
> are then reliant on all memory chargers in the kernel to yield if their
> charges fail due to oom. It's the only way to allow the victim to
> eventually run.
>
> So the only change that I would make to your patch is to do this in
> mem_cgroup_out_of_memory() instead:
>
> if (!fatal_signal_pending(current))
> schedule_timeout_killable(1);
>
> So we don't have this reliance on all other memory chargers to yield when
> their charge fails and there is no delay for victims themselves.
I see. You want functions like the ones below for environments where the
current thread can fail to resume execution for a long time once it
reschedules (e.g. a UP kernel with many threads contending on one CPU).
/*
 * Give other threads CPU time, unless current thread was already killed.
 * Used when we prefer killed threads to continue execution (in a hope that
 * killed threads terminate quickly) over giving other threads CPU time.
 */
signed long __sched schedule_timeout_killable_expedited(signed long timeout)
{
	if (unlikely(fatal_signal_pending(current)))
		return timeout;
	return schedule_timeout_killable(timeout);
}

/*
 * Latency reduction via explicit rescheduling in places that are safe,
 * but becomes a no-op if current thread was already killed. Used when we
 * prefer killed threads to continue execution (in a hope that killed
 * threads terminate quickly) over giving other threads CPU time.
 */
int cond_resched_expedited(void)
{
	if (unlikely(fatal_signal_pending(current)))
		return 0;
	return cond_resched();
}
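If helpers like these were ever introduced (they are a proposal in this
thread, not existing kernel API), the call site under discussion would
presumably reduce to something like the following sketch:

	/* in mem_cgroup_out_of_memory(), after the oom handling: */
	ret = should_force_charge() || out_of_memory(&oc);
	mutex_unlock(&oom_lock);
	/* Sleep outside oom_lock; no-op for tasks that were already killed. */
	schedule_timeout_killable_expedited(1);
	return ret;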
>
> [ I'll still propose my change that adds cond_resched() to
> shrink_node_memcgs() because we can see need_resched set for a
> prolonged period of time without scheduling. ]
As long as there is schedule_timeout_killable(), I'm fine with adding
cond_resched() in other places.
>
> If you agree, I'll propose your patch with a changelog that indicates it
> can fix the soft lockup issue for UP and can likely get a tested-by for
> it.
>
Please go ahead.
On Tue, 17 Mar 2020, Tetsuo Handa wrote:
> > if (!fatal_signal_pending(current))
> > schedule_timeout_killable(1);
> >
> > So we don't have this reliance on all other memory chargers to yield when
> > their charge fails and there is no delay for victims themselves.
>
> I see. You want functions like the ones below for environments where the
> current thread can fail to resume execution for a long time once it
> reschedules (e.g. a UP kernel with many threads contending on one CPU).
>
> /*
> * Give other threads CPU time, unless current thread was already killed.
> * Used when we prefer killed threads to continue execution (in a hope that
> * killed threads terminate quickly) over giving other threads CPU time.
> */
> signed long __sched schedule_timeout_killable_expedited(signed long timeout)
> {
> if (unlikely(fatal_signal_pending(current)))
> return timeout;
> return schedule_timeout_killable(timeout);
> }
>
I simply want the
if (!fatal_signal_pending(current))
schedule_timeout_killable(1);
after dropping oom_lock because I don't know that a generic function would
be useful outside of this single place. If it becomes a regular pattern,
for whatever reason, I think we can consider a new schedule_timeout
variant.
> /*
> * Latency reduction via explicit rescheduling in places that are safe,
> * but becomes no-op if current thread was already killed. Used when we
> * prefer killed threads to continue execution (in a hope that killed
> * threads terminate quickly) over giving other threads CPU time.
> */
> int cond_resched_expedited(void)
> {
> if (unlikely(fatal_signal_pending(current)))
> return 0;
> return cond_resched();
> }
>
> >
> > [ I'll still propose my change that adds cond_resched() to
> > shrink_node_memcgs() because we can see need_resched set for a
> > prolonged period of time without scheduling. ]
>
> As long as there is schedule_timeout_killable(), I'm fine with adding
> cond_resched() in other places.
>
Sounds good, thanks Tetsuo.
When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing ends up actually yielding the processor back
to the victim on UP systems with preemption disabled. Instead, the
charging process simply loops in memcg reclaim and eventually soft
lockups.
Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB,
anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB
oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
shrink_node+0x40d/0x7d0
do_try_to_free_pages+0x13f/0x470
try_to_free_mem_cgroup_pages+0x16d/0x230
try_charge+0x247/0xac0
mem_cgroup_try_charge+0x10a/0x220
mem_cgroup_try_charge_delay+0x1e/0x40
handle_mm_fault+0xdf2/0x15f0
do_user_addr_fault+0x21f/0x420
page_fault+0x2f/0x40
Make sure that once the oom killer has been called, we forcibly yield if
current is not the chosen victim, regardless of priority, to allow for
memory freeing. The same situation can theoretically occur in the page
allocator, so do this after dropping oom_lock there as well.
Suggested-by: Tetsuo Handa <[email protected]>
Tested-by: Robert Kolchmeyer <[email protected]>
Cc: [email protected]
Signed-off-by: David Rientjes <[email protected]>
---
mm/memcontrol.c | 2 ++
mm/page_alloc.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
+ if (!fatal_signal_pending(current))
+ schedule_timeout_killable(1);
return ret;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3861,6 +3861,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
}
out:
mutex_unlock(&oom_lock);
+ if (!fatal_signal_pending(current))
+ schedule_timeout_killable(1);
return page;
}
On Tue 17-03-20 17:55:04, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled. Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
It seems that my request to describe the setup got ignored. Sigh.
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB,
> anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB
> oom_score_adj:0
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> RIP: 0010:shrink_lruvec+0x4e9/0xa40
> ...
> Call Trace:
> shrink_node+0x40d/0x7d0
> do_try_to_free_pages+0x13f/0x470
> try_to_free_mem_cgroup_pages+0x16d/0x230
> try_charge+0x247/0xac0
> mem_cgroup_try_charge+0x10a/0x220
> mem_cgroup_try_charge_delay+0x1e/0x40
> handle_mm_fault+0xdf2/0x15f0
> do_user_addr_fault+0x21f/0x420
> page_fault+0x2f/0x40
>
> Make sure that once the oom killer has been called, we forcibly yield if
> current is not the chosen victim, regardless of priority, to allow for
> memory freeing. The same situation can theoretically occur in the page
> allocator, so do this after dropping oom_lock there as well.
I would have preferred the cond_resched solution proposed previously but
I can live with this as well. I would just ask to add more information
to the changelog. E.g.
"
We used to have a short sleep after the oom handling but 9bfe5ded054b
("mm, oom: remove sleep from under oom_lock") removed it because
sleeping inside the oom_lock is dangerous. This patch restores the sleep
outside of the lock.
"
> Suggested-by: Tetsuo Handa <[email protected]>
> Tested-by: Robert Kolchmeyer <[email protected]>
> Cc: [email protected]
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/memcontrol.c | 2 ++
> mm/page_alloc.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> ret = should_force_charge() || out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> + if (!fatal_signal_pending(current))
> + schedule_timeout_killable(1);
Check for fatal_signal_pending is redundant.
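For context on why the check is redundant: schedule_timeout_killable()
sleeps in TASK_KILLABLE state, and a task with a fatal signal already
pending is not actually taken off the runqueue in that state, so the
explicit check buys nothing. Roughly, assuming the usual
kernel/time/timer.c definition:

signed long __sched schedule_timeout_killable(signed long timeout)
{
	__set_current_state(TASK_KILLABLE);
	return schedule_timeout(timeout);
}

/*
 * With a fatal signal pending, signal_pending_state(TASK_KILLABLE, ...)
 * is true, so the task stays runnable and the "sleep" costs it at most a
 * reschedule rather than the full timeout.
 */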
--
Michal Hocko
SUSE Labs
On Wed, 18 Mar 2020, Michal Hocko wrote:
> > When a process is oom killed as a result of memcg limits and the victim
> > is waiting to exit, nothing ends up actually yielding the processor back
> > to the victim on UP systems with preemption disabled. Instead, the
> > charging process simply loops in memcg reclaim and eventually soft
> > lockups.
>
> It seems that my request to describe the setup got ignored. Sigh.
>
> > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB,
> > anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB
> > oom_score_adj:0
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > ...
> > Call Trace:
> > shrink_node+0x40d/0x7d0
> > do_try_to_free_pages+0x13f/0x470
> > try_to_free_mem_cgroup_pages+0x16d/0x230
> > try_charge+0x247/0xac0
> > mem_cgroup_try_charge+0x10a/0x220
> > mem_cgroup_try_charge_delay+0x1e/0x40
> > handle_mm_fault+0xdf2/0x15f0
> > do_user_addr_fault+0x21f/0x420
> > page_fault+0x2f/0x40
> >
> > Make sure that once the oom killer has been called, we forcibly yield if
> > current is not the chosen victim, regardless of priority, to allow for
> > memory freeing. The same situation can theoretically occur in the page
> > allocator, so do this after dropping oom_lock there as well.
>
> I would have preferred the cond_resched solution proposed previously but
> I can live with this as well. I would just ask to add more information
> to the changelog. E.g.
I'm still planning on sending the cond_resched() change as well, but not
advertised as fixing this particular issue, per Tetsuo's feedback. I think
the reported issue showed it's possible to loop excessively in reclaim
without a conditional yield, depending on various memcg configs; the
shrink_node_memcgs() cond_resched() is still appropriate, both for
interactivity and because the iteration over memcgs can be particularly long.
> "
> We used to have a short sleep after the oom handling but 9bfe5ded054b
> ("mm, oom: remove sleep from under oom_lock") removed it because
> sleeping inside the oom_lock is dangerous. This patch restores the sleep
> outside of the lock.
Will do.
> "
> > Suggested-by: Tetsuo Handa <[email protected]>
> > Tested-by: Robert Kolchmeyer <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: David Rientjes <[email protected]>
> > ---
> > mm/memcontrol.c | 2 ++
> > mm/page_alloc.c | 2 ++
> > 2 files changed, 4 insertions(+)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1576,6 +1576,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > */
> > ret = should_force_charge() || out_of_memory(&oc);
> > mutex_unlock(&oom_lock);
> > + if (!fatal_signal_pending(current))
> > + schedule_timeout_killable(1);
>
> Check for fatal_signal_pending is redundant.
>
> --
> Michal Hocko
> SUSE Labs
>
When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing ends up actually yielding the processor back
to the victim on UP systems with preemption disabled. Instead, the
charging process simply loops in memcg reclaim and eventually soft
lockups.
For example, on a UP system with a memcg limited to 100MB, if three
processes each charge 40MB of heap with swap disabled, one of the charging
processes can loop endlessly trying to charge memory, which starves the oom
victim.
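A minimal reproducer along these lines might look like the sketch below
(illustrative only: the process count and allocation size mirror the
description above, and it assumes the tasks were already attached to a
memcg with a 100MB limit and swap disabled; it is not the actual "repro"
binary from the log):

#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK (40UL << 20)	/* 40MB per process, three processes, 100MB limit */

int main(void)
{
	int i;

	for (i = 0; i < 3; i++) {
		if (fork() == 0) {
			char *p = malloc(CHUNK);

			if (p)
				memset(p, 0xff, CHUNK);	/* fault in 40MB of anon memory */
			for (;;)
				;		/* hold the charge, never exit */
		}
	}
	for (i = 0; i < 3; i++)
		wait(NULL);
	return 0;
}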
Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB,
anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB
oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
shrink_node+0x40d/0x7d0
do_try_to_free_pages+0x13f/0x470
try_to_free_mem_cgroup_pages+0x16d/0x230
try_charge+0x247/0xac0
mem_cgroup_try_charge+0x10a/0x220
mem_cgroup_try_charge_delay+0x1e/0x40
handle_mm_fault+0xdf2/0x15f0
do_user_addr_fault+0x21f/0x420
page_fault+0x2f/0x40
Make sure that once the oom killer has been called, we forcibly yield if
current is not the chosen victim, regardless of priority, to allow for
memory freeing. The same situation can theoretically occur in the page
allocator, so do this after dropping oom_lock there as well.
We used to have a short sleep after oom killing, but commit 9bfe5ded054b
("mm, oom: remove sleep from under oom_lock") removed it because sleeping
inside the oom_lock is dangerous. This patch restores the sleep outside of
the lock.
Suggested-by: Tetsuo Handa <[email protected]>
Tested-by: Robert Kolchmeyer <[email protected]>
Cc: [email protected]
Signed-off-by: David Rientjes <[email protected]>
---
mm/memcontrol.c | 6 ++++++
mm/page_alloc.c | 6 ++++++
2 files changed, 12 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
+ /*
+ * Give a killed process a good chance to exit before trying to
+ * charge memory again.
+ */
+ if (ret)
+ schedule_timeout_killable(1);
return ret;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3861,6 +3861,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
}
out:
mutex_unlock(&oom_lock);
+ /*
+ * Give a killed process a good chance to exit before trying to
+ * allocate memory again.
+ */
+ if (*did_some_progress)
+ schedule_timeout_killable(1);
return page;
}
On Wed 18-03-20 15:03:52, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled. Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
>
> For example, on a UP system with a memcg limited to 100MB, if three
> processes each charge 40MB of heap with swap disabled, one of the charging
> processes can loop endlessly trying to charge memory, which starves the oom
> victim.
This only happens if there is no reclaimable memory in the hierarchy.
That is a very specific condition. I do not see any other way than
having a misconfigured system with min protection preventing any
reclaim. Otherwise we have cond_resched() both in the slab shrinking code
(do_shrink_slab) and in the LRU shrinking code (shrink_lruvec). If I am
wrong and those are insufficient, then please be explicit about the
scenario. This is very important information to have in the changelog!
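As an illustration of the kind of misconfiguration being referred to,
protecting all of a child's memory with memory.min leaves memcg reclaim in
the parent's hierarchy with nothing it is allowed to touch (the
MEMCG_PROT_MIN case mentioned earlier). A hypothetical cgroup v2 setup,
expressed as a small C helper (paths and values are assumptions for
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumes cgroup v2 is mounted at /sys/fs/cgroup and "test/child" exists. */
static int cg_write(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0) {
		perror(path);
		return -1;
	}
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* Parent limited to 100M ... */
	cg_write("/sys/fs/cgroup/test/memory.max", "100M");
	/* ... while the child protects the same amount, so reclaim on behalf
	 * of the parent's limit finds nothing reclaimable. */
	cg_write("/sys/fs/cgroup/test/child/memory.min", "100M");
	return 0;
}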
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> ret = should_force_charge() || out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> + /*
> + * Give a killed process a good chance to exit before trying to
> + * charge memory again.
> + */
> + if (ret)
> + schedule_timeout_killable(1);
Why are you making this conditional? Say that there is no victim to
kill. The charge path would simply bail out and it would really depend
on the call chain whether there is a scheduling point or not. Isn't it
simply safer to call schedule_timeout_killable unconditionally at this
stage?
> return ret;
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3861,6 +3861,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> }
> out:
> mutex_unlock(&oom_lock);
> + /*
> + * Give a killed process a good chance to exit before trying to
> + * allocate memory again.
> + */
> + if (*did_some_progress)
> + schedule_timeout_killable(1);
This doesn't make much sense either. Please remember that the primary
reason you are adding this schedule_timeout_killable in this path is
that you want to somehow reduce the priority inversion problem
mentioned by Tetsuo. The page allocator path does not lack regular
scheduling points: compaction, reclaim and should_reclaim_retry etc.
have them.
> return page;
> }
>
--
Michal Hocko
SUSE Labs