Hi,
this is a follow-up to [1], which has been nacked mostly because Tetsuo
was able to find a simple workload that can trigger a race where a
no-eligible-task failure is reported without a good reason. I believe
patch 2 addresses that issue and we do not have to play dirty games with
throttling just because of the race. I still believe the patch proposed
in [1] is a useful one, but that can be addressed later.

This series comprises two patches. The first one is something I meant to
do a long time ago; I just never had the time to do it. We need it here
to handle the CLONE_VM without CLONE_SIGHAND case. The second patch
closes the race.

I didn't get to test this thoroughly, so it is posted as an RFC.
Feedback is appreciated of course.
[1] http://lkml.kernel.org/r/[email protected]
From: Michal Hocko <[email protected]>
Tetsuo has reported [1] that a single process group memcg might easily
swamp the log with no-eligible oom victim reports due to a race between
the memcg charge and the oom_reaper:

Thread 1                  Thread2                   oom_reaper
try_charge                try_charge
  mem_cgroup_out_of_memory
    mutex_lock(oom_lock)
                            mem_cgroup_out_of_memory
                              mutex_lock(oom_lock)
    out_of_memory
      select_bad_process
      oom_kill_process(current)
        wake_oom_reaper
                                                    oom_reap_task
                                                    MMF_OOM_SKIP->victim
    mutex_unlock(oom_lock)
                              out_of_memory
                                select_bad_process # no task

If Thread2 hadn't raced it would have bailed out from try_charge and
forced the charge. We can achieve the same by checking tsk_is_oom_victim
inside the oom_lock and therefore close the race.

[1] http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..a9dfed29967b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
.gfp_mask = gfp_mask,
.order = order,
};
- bool ret;
+ bool ret = true;
mutex_lock(&oom_lock);
+
+ /*
+ * multi-threaded tasks might race with oom_reaper and gain
+ * MMF_OOM_SKIP before reaching out_of_memory which can lead
+ * to out_of_memory failure if the task is the last one in
+ * memcg which would be a false positive failure reported
+ */
+ if (tsk_is_oom_victim(current))
+ goto unlock;
+
ret = out_of_memory(&oc);
+
+unlock:
mutex_unlock(&oom_lock);
return ret;
}
--
2.19.1
From: Michal Hocko <[email protected]>
Historically we have called mark_oom_victim only on the main task
selected as the oom victim because oom victims have access to memory
reserves and granting that access to all killed tasks could deplete
memory reserves very quickly and cause even larger problems.

Since only partial access to memory reserves is allowed there is no
longer such a risk, so all tasks killed along with the oom victim
can be considered as well.

The primary motivation for that is that process groups which do not
share signals would behave more like standard thread groups wrt oom
handling (i.e. tsk_is_oom_victim will work the same way for them).

Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..188ae490cf3e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
if (unlikely(p->flags & PF_KTHREAD))
continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+ mark_oom_victim(p);
}
rcu_read_unlock();
--
2.19.1
Michal Hocko wrote:
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
> if (unlikely(p->flags & PF_KTHREAD))
> continue;
> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
> + mark_oom_victim(p);
> }
> rcu_read_unlock();
>
> --
Wrong. Either
---
mm/oom_kill.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..99b36ff 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -879,6 +879,8 @@ static void __oom_kill_process(struct task_struct *victim)
*/
rcu_read_lock();
for_each_process(p) {
+ struct task_struct *t;
+
if (!process_shares_mm(p, mm))
continue;
if (same_thread_group(p, victim))
@@ -898,6 +900,11 @@ static void __oom_kill_process(struct task_struct *victim)
if (unlikely(p->flags & PF_KTHREAD))
continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+ t = find_lock_task_mm(p);
+ if (!t)
+ continue;
+ mark_oom_victim(t);
+ task_unlock(t);
}
rcu_read_unlock();
--
1.8.3.1
or
---
mm/oom_kill.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..7fa9b7c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -854,13 +854,6 @@ static void __oom_kill_process(struct task_struct *victim)
count_vm_event(OOM_KILL);
memcg_memory_event_mm(mm, MEMCG_OOM_KILL);
- /*
- * We should send SIGKILL before granting access to memory reserves
- * in order to prevent the OOM victim from depleting the memory
- * reserves from the user space under its control.
- */
- do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, PIDTYPE_TGID);
- mark_oom_victim(victim);
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -879,11 +872,23 @@ static void __oom_kill_process(struct task_struct *victim)
*/
rcu_read_lock();
for_each_process(p) {
- if (!process_shares_mm(p, mm))
+ struct task_struct *t;
+
+ /*
+ * No use_mm() user needs to read from the userspace so we are
+ * ok to reap it.
+ */
+ if (unlikely(p->flags & PF_KTHREAD))
+ continue;
+ t = find_lock_task_mm(p);
+ if (!t)
continue;
- if (same_thread_group(p, victim))
+ if (likely(t->mm != mm)) {
+ task_unlock(t);
continue;
+ }
if (is_global_init(p)) {
+ task_unlock(t);
can_oom_reap = false;
set_bit(MMF_OOM_SKIP, &mm->flags);
pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
@@ -892,12 +897,13 @@ static void __oom_kill_process(struct task_struct *victim)
continue;
}
/*
- * No use_mm() user needs to read from the userspace so we are
- * ok to reap it.
+ * We should send SIGKILL before granting access to memory
+ * reserves in order to prevent the OOM victim from depleting
+ * the memory reserves from the user space under its control.
*/
- if (unlikely(p->flags & PF_KTHREAD))
- continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+ mark_oom_victim(t);
+ task_unlock(t);
}
rcu_read_unlock();
--
1.8.3.1
will be needed.
On Mon 22-10-18 16:58:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
> > if (unlikely(p->flags & PF_KTHREAD))
> > continue;
> > do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
> > + mark_oom_victim(p);
> > }
> > rcu_read_unlock();
> >
> > --
>
> Wrong. Either
You are right. The mm might go away between process_shares_mm and here.
While your find_lock_task_mm would be correct I believe we can do better
by using the existing mm that we already have. I will make it a separate
patch for clarity.
Thanks for pointing this out.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 188ae490cf3e..4c205061ed67 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -663,6 +663,7 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
+ * @mm: mm associated with the task
*
* Has to be called with oom_lock held and never after
* oom has been disabled already.
@@ -670,10 +671,8 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
* tsk->mm has to be non NULL and caller has to guarantee it is stable (either
* under task_lock or operate on the current).
*/
-static void mark_oom_victim(struct task_struct *tsk)
+static void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
{
- struct mm_struct *mm = tsk->mm;
-
WARN_ON(oom_killer_disabled);
/* OOM killer might race with memcg OOM */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
@@ -860,7 +859,7 @@ static void __oom_kill_process(struct task_struct *victim)
* reserves from the user space under its control.
*/
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, PIDTYPE_TGID);
- mark_oom_victim(victim);
+ mark_oom_victim(victim, mm);
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -898,7 +897,7 @@ static void __oom_kill_process(struct task_struct *victim)
if (unlikely(p->flags & PF_KTHREAD))
continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
- mark_oom_victim(p);
+ mark_oom_victim(p, mm);
}
rcu_read_unlock();
@@ -942,7 +941,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
*/
task_lock(p);
if (task_will_free_mem(p)) {
- mark_oom_victim(p);
+ mark_oom_victim(p, p->mm);
wake_oom_reaper(p);
task_unlock(p);
put_task_struct(p);
@@ -1072,7 +1071,7 @@ bool out_of_memory(struct oom_control *oc)
* quickly exit and free its memory.
*/
if (task_will_free_mem(current)) {
- mark_oom_victim(current);
+ mark_oom_victim(current, current->mm);
wake_oom_reaper(current);
return true;
}
--
Michal Hocko
SUSE Labs
On 2018/10/22 17:48, Michal Hocko wrote:
> On Mon 22-10-18 16:58:50, Tetsuo Handa wrote:
>> Michal Hocko wrote:
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
>>> if (unlikely(p->flags & PF_KTHREAD))
>>> continue;
>>> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
>>> + mark_oom_victim(p);
>>> }
>>> rcu_read_unlock();
>>>
>>> --
>>
>> Wrong. Either
>
> You are right. The mm might go away between process_shares_mm and here.
> While your find_lock_task_mm would be correct I believe we can do better
> by using the existing mm that we already have. I will make it a separate
> patch to clarity.
Still wrong. p->mm == NULL means that we are too late to set TIF_MEMDIE
on that thread. Passing non-NULL mm to mark_oom_victim() won't help.
> @@ -898,7 +897,7 @@ static void __oom_kill_process(struct task_struct *victim)
> if (unlikely(p->flags & PF_KTHREAD))
> continue;
> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
> - mark_oom_victim(p);
> + mark_oom_victim(p, mm);
> }
> rcu_read_unlock();
>
On Mon 22-10-18 18:42:30, Tetsuo Handa wrote:
> On 2018/10/22 17:48, Michal Hocko wrote:
> > On Mon 22-10-18 16:58:50, Tetsuo Handa wrote:
> >> Michal Hocko wrote:
> >>> --- a/mm/oom_kill.c
> >>> +++ b/mm/oom_kill.c
> >>> @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
> >>> if (unlikely(p->flags & PF_KTHREAD))
> >>> continue;
> >>> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
> >>> + mark_oom_victim(p);
> >>> }
> >>> rcu_read_unlock();
> >>>
> >>> --
> >>
> >> Wrong. Either
> >
> > You are right. The mm might go away between process_shares_mm and here.
> > While your find_lock_task_mm would be correct I believe we can do better
> > by using the existing mm that we already have. I will make it a separate
> > patch to clarity.
>
> Still wrong. p->mm == NULL means that we are too late to set TIF_MEMDIE
> on that thread. Passing non-NULL mm to mark_oom_victim() won't help.
Why would it be too late? Or in other words why would this be harmful?
--
Michal Hocko
SUSE Labs
On 2018/10/22 19:43, Michal Hocko wrote:
> On Mon 22-10-18 18:42:30, Tetsuo Handa wrote:
>> On 2018/10/22 17:48, Michal Hocko wrote:
>>> On Mon 22-10-18 16:58:50, Tetsuo Handa wrote:
>>>> Michal Hocko wrote:
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
>>>>> if (unlikely(p->flags & PF_KTHREAD))
>>>>> continue;
>>>>> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
>>>>> + mark_oom_victim(p);
>>>>> }
>>>>> rcu_read_unlock();
>>>>>
>>>>> --
>>>>
>>>> Wrong. Either
>>>
>>> You are right. The mm might go away between process_shares_mm and here.
>>> While your find_lock_task_mm would be correct I believe we can do better
>>> by using the existing mm that we already have. I will make it a separate
>>> patch to clarity.
>>
>> Still wrong. p->mm == NULL means that we are too late to set TIF_MEMDIE
>> on that thread. Passing non-NULL mm to mark_oom_victim() won't help.
>
> Why would it be too late? Or in other words why would this be harmful?
>
Setting TIF_MEMDIE after exit_mm() completed is too late.
static void exit_mm(void)
{
	(...snipped...)
	task_lock(current);
	current->mm = NULL;
	up_read(&mm->mmap_sem);
	enter_lazy_tlb(mm, current);
	task_unlock(current);
	mm_update_next_owner(mm);
	mmput(mm);
	if (test_thread_flag(TIF_MEMDIE))
		exit_oom_victim();
}
On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> On 2018/10/22 16:13, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Tetsuo has reported [1] that a single process group memcg might easily
> > swamp the log with no-eligible oom victim reports due to race between
> > the memcg charge and oom_reaper
> >
> > Thread 1 Thread2 oom_reaper
> > try_charge try_charge
> > mem_cgroup_out_of_memory
> > mutex_lock(oom_lock)
> > mem_cgroup_out_of_memory
> > mutex_lock(oom_lock)
> > out_of_memory
> > select_bad_process
> > oom_kill_process(current)
> > wake_oom_reaper
> > oom_reap_task
> > MMF_OOM_SKIP->victim
> > mutex_unlock(oom_lock)
> > out_of_memory
> > select_bad_process # no task
> >
> > If Thread1 didn't race it would bail out from try_charge and force the
> > charge. We can achieve the same by checking tsk_is_oom_victim inside
> > the oom_lock and therefore close the race.
> >
> > [1] http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/memcontrol.c | 14 +++++++++++++-
> > 1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e79cb59552d9..a9dfed29967b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > .gfp_mask = gfp_mask,
> > .order = order,
> > };
> > - bool ret;
> > + bool ret = true;
> >
> > mutex_lock(&oom_lock);
> > +
> > + /*
> > + * multi-threaded tasks might race with oom_reaper and gain
> > + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > + * to out_of_memory failure if the task is the last one in
> > + * memcg which would be a false possitive failure reported
> > + */
> > + if (tsk_is_oom_victim(current))
> > + goto unlock;
> > +
>
> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> so that any killed threads no longer wait for oom_lock.
tsk_is_oom_victim is stronger because it doesn't depend on
fatal_signal_pending which might be cleared throughout the exit process.
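
For reference, tsk_is_oom_victim() keys off signal->oom_mm, which
mark_oom_victim() sets once and which stays set for the rest of the task's
lifetime. A minimal sketch of the 4.19-era definition (abbreviated):

	/* include/linux/oom.h */
	static inline bool tsk_is_oom_victim(struct task_struct *tsk)
	{
		return tsk->signal->oom_mm;
	}

fatal_signal_pending(), on the other hand, only reflects a pending SIGKILL,
which the exit path clears.
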
> Also, closing this race for only memcg OOM path is strange. Global OOM path
> (which are CLONE_VM without CLONE_THREAD) is still suffering this race
> (though frequency is lower than memcg OOM due to use of mutex_trylock()). Either
> checking before calling out_of_memory() or checking task_will_free_mem(current)
> inside out_of_memory() will close this race for both paths.
The global case is much more complicated. In the memcg case we know that
the charge might be bypassed, so we do not have to care about a potential
endless loop like in the page allocator path. Moreover I am not even sure
the race is all that interesting in the global case. I have never heard
about a premature panic due to no killable task. The racing oom task
would have to be the last eligible process in the system and that is
quite unlikely. We can think about a more involved solution for that if
we ever hear about this being a real problem.
So a simple memcg specific fix sounds like a reasonable way forward.
--
Michal Hocko
SUSE Labs
On Mon 22-10-18 19:56:49, Tetsuo Handa wrote:
> On 2018/10/22 19:43, Michal Hocko wrote:
> > On Mon 22-10-18 18:42:30, Tetsuo Handa wrote:
> >> On 2018/10/22 17:48, Michal Hocko wrote:
> >>> On Mon 22-10-18 16:58:50, Tetsuo Handa wrote:
> >>>> Michal Hocko wrote:
> >>>>> --- a/mm/oom_kill.c
> >>>>> +++ b/mm/oom_kill.c
> >>>>> @@ -898,6 +898,7 @@ static void __oom_kill_process(struct task_struct *victim)
> >>>>> if (unlikely(p->flags & PF_KTHREAD))
> >>>>> continue;
> >>>>> do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
> >>>>> + mark_oom_victim(p);
> >>>>> }
> >>>>> rcu_read_unlock();
> >>>>>
> >>>>> --
> >>>>
> >>>> Wrong. Either
> >>>
> >>> You are right. The mm might go away between process_shares_mm and here.
> >>> While your find_lock_task_mm would be correct I believe we can do better
> >>> by using the existing mm that we already have. I will make it a separate
> >>> patch to clarity.
> >>
> >> Still wrong. p->mm == NULL means that we are too late to set TIF_MEMDIE
> >> on that thread. Passing non-NULL mm to mark_oom_victim() won't help.
> >
> > Why would it be too late? Or in other words why would this be harmful?
> >
>
> Setting TIF_MEMDIE after exit_mm() completed is too late.
You are right and I am obviously dense today. I will go with
find_lock_task_mm for now and push the "get rid of TIF_MEMDIE" up in the
todo list. I hope I will get to it some day.
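
For reference, a sketch of the long-standing mm/oom_kill.c helper being
adopted here (abbreviated): it returns a thread of the given process whose
->mm is still non-NULL, with task_lock held so the mm cannot go away under
the caller, or NULL when every thread has already passed exit_mm().

	struct task_struct *find_lock_task_mm(struct task_struct *p)
	{
		struct task_struct *t;

		rcu_read_lock();

		for_each_thread(p, t) {
			task_lock(t);
			if (likely(t->mm))
				goto found;
			task_unlock(t);
		}
		t = NULL;
	found:
		rcu_read_unlock();

		return t;
	}
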
--
Michal Hocko
SUSE Labs
Updated version
From fdd46ec5dbc0cf571c5a2b9e23a021e0c46b4641 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 22 Oct 2018 08:52:33 +0200
Subject: [PATCH 1/2] mm, oom: mark all killed tasks as oom victims
Historically we have called mark_oom_victim only on the main task
selected as the oom victim because oom victims have access to memory
reserves and granting that access to all killed tasks could deplete
memory reserves very quickly and cause even larger problems.

Since only partial access to memory reserves is allowed there is no
longer such a risk, so all tasks killed along with the oom victim
can be considered as well.

The primary motivation for that is that process groups which do not
share signals would behave more like standard thread groups wrt oom
handling (i.e. tsk_is_oom_victim will work the same way for them).

- Use find_lock_task_mm to stabilize mm as suggested by Tetsuo
Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..6be22a0b0ff2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -879,6 +879,7 @@ static void __oom_kill_process(struct task_struct *victim)
*/
rcu_read_lock();
for_each_process(p) {
+ struct task_struct *t;
if (!process_shares_mm(p, mm))
continue;
if (same_thread_group(p, victim))
@@ -898,6 +899,11 @@ static void __oom_kill_process(struct task_struct *victim)
if (unlikely(p->flags & PF_KTHREAD))
continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+ t = find_lock_task_mm(p);
+ if (!t)
+ continue;
+ mark_oom_victim(t);
+ task_unlock(t);
}
rcu_read_unlock();
--
2.19.1
On 2018/10/22 16:13, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
>
> Thread 1 Thread2 oom_reaper
> try_charge try_charge
> mem_cgroup_out_of_memory
> mutex_lock(oom_lock)
> mem_cgroup_out_of_memory
> mutex_lock(oom_lock)
> out_of_memory
> select_bad_process
> oom_kill_process(current)
> wake_oom_reaper
> oom_reap_task
> MMF_OOM_SKIP->victim
> mutex_unlock(oom_lock)
> out_of_memory
> select_bad_process # no task
>
> If Thread1 didn't race it would bail out from try_charge and force the
> charge. We can achieve the same by checking tsk_is_oom_victim inside
> the oom_lock and therefore close the race.
>
> [1] http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e79cb59552d9..a9dfed29967b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> .gfp_mask = gfp_mask,
> .order = order,
> };
> - bool ret;
> + bool ret = true;
>
> mutex_lock(&oom_lock);
> +
> + /*
> + * multi-threaded tasks might race with oom_reaper and gain
> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> + * to out_of_memory failure if the task is the last one in
> + * memcg which would be a false possitive failure reported
> + */
> + if (tsk_is_oom_victim(current))
> + goto unlock;
> +
This is not wrong, but it is strange. We can use mutex_lock_killable(&oom_lock)
so that any killed threads no longer wait for the oom_lock.

Also, closing this race for only the memcg OOM path is strange. The global
OOM path (i.e. CLONE_VM without CLONE_THREAD tasks) still suffers from this
race (though the frequency is lower than for memcg OOM due to the use of
mutex_trylock()). Either checking before calling out_of_memory() or checking
task_will_free_mem(current) inside out_of_memory() will close this race for
both paths.
> ret = out_of_memory(&oc);
> +
> +unlock:
> mutex_unlock(&oom_lock);
> return ret;
> }
>
On 2018/10/22 21:03, Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
>> On 2018/10/22 16:13, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> Tetsuo has reported [1] that a single process group memcg might easily
>>> swamp the log with no-eligible oom victim reports due to race between
>>> the memcg charge and oom_reaper
>>>
>>> Thread 1 Thread2 oom_reaper
>>> try_charge try_charge
>>> mem_cgroup_out_of_memory
>>> mutex_lock(oom_lock)
>>> mem_cgroup_out_of_memory
>>> mutex_lock(oom_lock)
>>> out_of_memory
>>> select_bad_process
>>> oom_kill_process(current)
>>> wake_oom_reaper
>>> oom_reap_task
>>> MMF_OOM_SKIP->victim
>>> mutex_unlock(oom_lock)
>>> out_of_memory
>>> select_bad_process # no task
>>>
>>> If Thread1 didn't race it would bail out from try_charge and force the
>>> charge. We can achieve the same by checking tsk_is_oom_victim inside
>>> the oom_lock and therefore close the race.
>>>
>>> [1] http://lkml.kernel.org/r/[email protected]
>>> Signed-off-by: Michal Hocko <[email protected]>
>>> ---
>>> mm/memcontrol.c | 14 +++++++++++++-
>>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index e79cb59552d9..a9dfed29967b 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>> .gfp_mask = gfp_mask,
>>> .order = order,
>>> };
>>> - bool ret;
>>> + bool ret = true;
>>>
>>> mutex_lock(&oom_lock);
>>> +
>>> + /*
>>> + * multi-threaded tasks might race with oom_reaper and gain
>>> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
>>> + * to out_of_memory failure if the task is the last one in
>>> + * memcg which would be a false possitive failure reported
>>> + */
>>> + if (tsk_is_oom_victim(current))
>>> + goto unlock;
>>> +
>>
>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>> so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending which might be cleared throughout the exit process.
I mean:
mm/memcontrol.c | 3 +-
mm/oom_kill.c | 111 +++++---------------------------------------------------
2 files changed, 12 insertions(+), 102 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
};
bool ret;
- mutex_lock(&oom_lock);
+ if (mutex_lock_killable(&oom_lock))
+ return true;
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..22fd6b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -755,81 +755,6 @@ bool oom_killer_disable(signed long timeout)
return true;
}
-static inline bool __task_will_free_mem(struct task_struct *task)
-{
- struct signal_struct *sig = task->signal;
-
- /*
- * A coredumping process may sleep for an extended period in exit_mm(),
- * so the oom killer cannot assume that the process will promptly exit
- * and release memory.
- */
- if (sig->flags & SIGNAL_GROUP_COREDUMP)
- return false;
-
- if (sig->flags & SIGNAL_GROUP_EXIT)
- return true;
-
- if (thread_group_empty(task) && (task->flags & PF_EXITING))
- return true;
-
- return false;
-}
-
-/*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
- * sharing the same mm have to be killed or exiting.
- * Caller has to make sure that task->mm is stable (hold task_lock or
- * it operates on the current).
- */
-static bool task_will_free_mem(struct task_struct *task)
-{
- struct mm_struct *mm = task->mm;
- struct task_struct *p;
- bool ret = true;
-
- /*
- * Skip tasks without mm because it might have passed its exit_mm and
- * exit_oom_victim. oom_reaper could have rescued that but do not rely
- * on that for now. We can consider find_lock_task_mm in future.
- */
- if (!mm)
- return false;
-
- if (!__task_will_free_mem(task))
- return false;
-
- /*
- * This task has already been drained by the oom reaper so there are
- * only small chances it will free some more
- */
- if (test_bit(MMF_OOM_SKIP, &mm->flags))
- return false;
-
- if (atomic_read(&mm->mm_users) <= 1)
- return true;
-
- /*
- * Make sure that all tasks which share the mm with the given tasks
- * are dying as well to make sure that a) nobody pins its mm and
- * b) the task is also reapable by the oom reaper.
- */
- rcu_read_lock();
- for_each_process(p) {
- if (!process_shares_mm(p, mm))
- continue;
- if (same_thread_group(task, p))
- continue;
- ret = __task_will_free_mem(p);
- if (!ret)
- break;
- }
- rcu_read_unlock();
-
- return ret;
-}
-
static void __oom_kill_process(struct task_struct *victim)
{
struct task_struct *p;
@@ -879,6 +804,8 @@ static void __oom_kill_process(struct task_struct *victim)
*/
rcu_read_lock();
for_each_process(p) {
+ struct task_struct *t;
+
if (!process_shares_mm(p, mm))
continue;
if (same_thread_group(p, victim))
@@ -898,6 +825,11 @@ static void __oom_kill_process(struct task_struct *victim)
if (unlikely(p->flags & PF_KTHREAD))
continue;
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+ t = find_lock_task_mm(p);
+ if (!t)
+ continue;
+ mark_oom_victim(t);
+ task_unlock(t);
}
rcu_read_unlock();
@@ -934,21 +866,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
if (__ratelimit(&oom_rs))
dump_header(oc, p);
@@ -1055,6 +972,9 @@ bool out_of_memory(struct oom_control *oc)
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+ return true;
+
if (oom_killer_disabled)
return false;
@@ -1066,17 +986,6 @@ bool out_of_memory(struct oom_control *oc)
}
/*
- * If current has a pending SIGKILL or is exiting, then automatically
- * select it. The goal is to allow it to allocate so that it may
- * quickly exit and free its memory.
- */
- if (task_will_free_mem(current)) {
- mark_oom_victim(current);
- wake_oom_reaper(current);
- return true;
- }
-
- /*
* The OOM killer does not compensate for IO-less reclaim.
* pagefault_out_of_memory lost its gfp context so we have to
* make sure exclude 0 mask - all other users should have at least
--
1.8.3.1
>
>> Also, closing this race for only memcg OOM path is strange. Global OOM path
>> (which are CLONE_VM without CLONE_THREAD) is still suffering this race
>> (though frequency is lower than memcg OOM due to use of mutex_trylock()). Either
>> checking before calling out_of_memory() or checking task_will_free_mem(current)
>> inside out_of_memory() will close this race for both paths.
>
> The global case is much more complicated because we know that memcg
> might bypass the charge so we do not have to care about the potential
> endless loop like in page allocator path.
We know that the page allocator path does not loop endlessly unless
__GFP_NOFAIL is used.
> Moreover I am not even sure
> the race is all that interesting in the global case. I have never heard
> about a pre-mature panic due to no killable task.
I don't think this race results in

  panic("System is deadlocked on memory\n");

But I think a few unnecessary OOM kills are possible (especially when the
above patch, which "eliminates the shortcuts which call mark_oom_victim()
on only one thread group", is not applied).
> The racing oom task
> would have to be the last eligible process in the system and that is
> quite unlikely. We can think about a more involved solution for that if
> we ever hear about this to be a real problem.
>
> So a simple memcg specific fix sounds like a reasonable way forward.
>
On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
> On 2018/10/22 21:03, Michal Hocko wrote:
> > On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> >> On 2018/10/22 16:13, Michal Hocko wrote:
> >>> From: Michal Hocko <[email protected]>
> >>>
> >>> Tetsuo has reported [1] that a single process group memcg might easily
> >>> swamp the log with no-eligible oom victim reports due to race between
> >>> the memcg charge and oom_reaper
> >>>
> >>> Thread 1 Thread2 oom_reaper
> >>> try_charge try_charge
> >>> mem_cgroup_out_of_memory
> >>> mutex_lock(oom_lock)
> >>> mem_cgroup_out_of_memory
> >>> mutex_lock(oom_lock)
> >>> out_of_memory
> >>> select_bad_process
> >>> oom_kill_process(current)
> >>> wake_oom_reaper
> >>> oom_reap_task
> >>> MMF_OOM_SKIP->victim
> >>> mutex_unlock(oom_lock)
> >>> out_of_memory
> >>> select_bad_process # no task
> >>>
> >>> If Thread1 didn't race it would bail out from try_charge and force the
> >>> charge. We can achieve the same by checking tsk_is_oom_victim inside
> >>> the oom_lock and therefore close the race.
> >>>
> >>> [1] http://lkml.kernel.org/r/[email protected]
> >>> Signed-off-by: Michal Hocko <[email protected]>
> >>> ---
> >>> mm/memcontrol.c | 14 +++++++++++++-
> >>> 1 file changed, 13 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index e79cb59552d9..a9dfed29967b 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>> .gfp_mask = gfp_mask,
> >>> .order = order,
> >>> };
> >>> - bool ret;
> >>> + bool ret = true;
> >>>
> >>> mutex_lock(&oom_lock);
> >>> +
> >>> + /*
> >>> + * multi-threaded tasks might race with oom_reaper and gain
> >>> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> >>> + * to out_of_memory failure if the task is the last one in
> >>> + * memcg which would be a false possitive failure reported
> >>> + */
> >>> + if (tsk_is_oom_victim(current))
> >>> + goto unlock;
> >>> +
> >>
> >> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> >> so that any killed threads no longer wait for oom_lock.
> >
> > tsk_is_oom_victim is stronger because it doesn't depend on
> > fatal_signal_pending which might be cleared throughout the exit process.
>
> I mean:
>
> mm/memcontrol.c | 3 +-
> mm/oom_kill.c | 111 +++++---------------------------------------------------
> 2 files changed, 12 insertions(+), 102 deletions(-)
This is a much larger change than I feel comfortable with to plug this
specific issue. A simple and easy to understand fix which doesn't add
maintenance burden should be preferred in general.

The code reduction looks attractive, but considering it is based on
removing one of the heuristics that prevent OOM reports in some cases, it
should be done on its own with a careful and thorough justification.
E.g. how often is the heuristic really helpful?

In principle I do not oppose removing the shortcut once all due
diligence is done, because this particular one has given us quite a lot
of headaches in the past.
--
Michal Hocko
SUSE Labs
On 2018/10/22 22:43, Michal Hocko wrote:
> On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
>> I mean:
>>
>> mm/memcontrol.c | 3 +-
>> mm/oom_kill.c | 111 +++++---------------------------------------------------
>> 2 files changed, 12 insertions(+), 102 deletions(-)
>
> This is much larger change than I feel comfortable with to plug this
> specific issue. A simple and easy to understand fix which doesn't add
> maintenance burden should be preferred in general.
>
> The code reduction looks attractive but considering it is based on
> removing one of the heuristics to prevent OOM reports in some case it
> should be done on its own with a careful and throughout justification.
> E.g. how often is the heuristic really helpful.
I think the heuristic is hardly helpful.

Regarding the task_will_free_mem(current) condition in out_of_memory(),
it served two purposes. One is the case where mark_oom_victim() has not
yet been called on the current thread group although it has already been
called on other thread groups. But that situation disappears once the
task_will_free_mem() shortcuts are removed and the for_each_process(p)
loop in __oom_kill_process() is always taken.

The other is the case where mark_oom_victim() has not yet been called on
any thread group although all thread groups are exiting. In that case we
will fail to wait for the current thread group to release its mm... But
it is unlikely that out_of_memory() can only be called by threads for
which task_will_free_mem(current) returns true (note that
task_will_free_mem(p) returns false if p->mm == NULL).

I think it is highly unlikely to hit the task_will_free_mem(p) condition
in oom_kill_process(). To hit it, the candidate that was chosen for being
the largest memory user has to be already exiting. However, if it is
already exiting, the candidate has likely already released its mm (and
hence is no longer the largest memory user). I can't say such a race
never happens, but I think it is unlikely. Also, since
task_will_free_mem(p) returns false if the thread group leader's mm is
NULL, whereas oom_badness() called from select_bad_process() evaluates
any mm in that thread group and returns the thread group leader, this
heuristic is incomplete after all.
>
> In principle I do not oppose to remove the shortcut after all due
> diligence is done because this particular one had given us quite a lot
> headaches in the past.
>
Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e79cb59552d9..a9dfed29967b 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > .gfp_mask = gfp_mask,
> > > .order = order,
> > > };
> > > - bool ret;
> > > + bool ret = true;
> > >
> > > mutex_lock(&oom_lock);
> > > +
> > > + /*
> > > + * multi-threaded tasks might race with oom_reaper and gain
> > > + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > > + * to out_of_memory failure if the task is the last one in
> > > + * memcg which would be a false possitive failure reported
> > > + */
> > > + if (tsk_is_oom_victim(current))
> > > + goto unlock;
> > > +
> >
> > This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> > so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending which might be cleared throughout the exit process.
>
I still want to propose this. No need to be memcg OOM specific.
mm/memcontrol.c | 3 ++-
mm/oom_kill.c | 10 ++++++++++
2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
};
bool ret;
- mutex_lock(&oom_lock);
+ if (mutex_lock_killable(&oom_lock))
+ return true;
ret = out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..e453bad 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1055,6 +1055,16 @@ bool out_of_memory(struct oom_control *oc)
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ /*
+ * It is possible that multi-threaded OOM victims get
+ * task_will_free_mem(current) == false when the OOM reaper quickly
+ * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+ * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+ * request), we don't need to select next OOM victim.
+ */
+ if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+ return true;
+
if (oom_killer_disabled)
return false;
--
1.8.3.1
On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e79cb59552d9..a9dfed29967b 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > .gfp_mask = gfp_mask,
> > > > .order = order,
> > > > };
> > > > - bool ret;
> > > > + bool ret = true;
> > > >
> > > > mutex_lock(&oom_lock);
> > > > +
> > > > + /*
> > > > + * multi-threaded tasks might race with oom_reaper and gain
> > > > + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > > > + * to out_of_memory failure if the task is the last one in
> > > > + * memcg which would be a false possitive failure reported
> > > > + */
> > > > + if (tsk_is_oom_victim(current))
> > > > + goto unlock;
> > > > +
> > >
> > > This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> > > so that any killed threads no longer wait for oom_lock.
> >
> > tsk_is_oom_victim is stronger because it doesn't depend on
> > fatal_signal_pending which might be cleared throughout the exit process.
> >
>
> I still want to propose this. No need to be memcg OOM specific.
Well, I maintain what I've said [1] about simplicity and a specific fix
for a specific issue. Especially in tricky code like this, where all
the consequences are far more subtle than they seem to be.

This is obviously a matter of taste, but I don't see much point in
discussing this back and forth forever. If there is a general agreement
that the above is less appropriate then I am willing to consider a
different change, but I simply do not have the energy to nitpick forever.
[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs
On Tue 23-10-18 13:42:46, Michal Hocko wrote:
> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index e79cb59552d9..a9dfed29967b 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > > .gfp_mask = gfp_mask,
> > > > > .order = order,
> > > > > };
> > > > > - bool ret;
> > > > > + bool ret = true;
> > > > >
> > > > > mutex_lock(&oom_lock);
> > > > > +
> > > > > + /*
> > > > > + * multi-threaded tasks might race with oom_reaper and gain
> > > > > + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > > > > + * to out_of_memory failure if the task is the last one in
> > > > > + * memcg which would be a false possitive failure reported
> > > > > + */
> > > > > + if (tsk_is_oom_victim(current))
> > > > > + goto unlock;
> > > > > +
> > > >
> > > > This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> > > > so that any killed threads no longer wait for oom_lock.
> > >
> > > tsk_is_oom_victim is stronger because it doesn't depend on
> > > fatal_signal_pending which might be cleared throughout the exit process.
> > >
> >
> > I still want to propose this. No need to be memcg OOM specific.
>
> Well, I maintain what I've said [1] about simplicity and specific fix
> for a specific issue. Especially in the tricky code like this where all
> the consequences are far more subtle than they seem to be.
>
> This is obviously a matter of taste but I don't see much point discussing
> this back and forth for ever. Unless there is a general agreement that
> the above is less appropriate then I am willing to consider a different
> change but I simply do not have energy to nit pick for ever.
>
> [1] http://lkml.kernel.org/r/[email protected]
In other words. Having a memcg specific fix means, well, a memcg
maintenance burden. Like any other memcg specific oom decisions we
already have. So are you OK with that Johannes or you would like to see
a more generic fix which might turn out to be more complex?
--
Michal Hocko
SUSE Labs
On 2018/10/23 21:10, Michal Hocko wrote:
> On Tue 23-10-18 13:42:46, Michal Hocko wrote:
>> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>> index e79cb59552d9..a9dfed29967b 100644
>>>>>> --- a/mm/memcontrol.c
>>>>>> +++ b/mm/memcontrol.c
>>>>>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>>>> .gfp_mask = gfp_mask,
>>>>>> .order = order,
>>>>>> };
>>>>>> - bool ret;
>>>>>> + bool ret = true;
>>>>>>
>>>>>> mutex_lock(&oom_lock);
>>>>>> +
>>>>>> + /*
>>>>>> + * multi-threaded tasks might race with oom_reaper and gain
>>>>>> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
>>>>>> + * to out_of_memory failure if the task is the last one in
>>>>>> + * memcg which would be a false possitive failure reported
>>>>>> + */
>>>>>> + if (tsk_is_oom_victim(current))
>>>>>> + goto unlock;
>>>>>> +
>>>>>
>>>>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>>>>> so that any killed threads no longer wait for oom_lock.
>>>>
>>>> tsk_is_oom_victim is stronger because it doesn't depend on
>>>> fatal_signal_pending which might be cleared throughout the exit process.
>>>>
>>>
>>> I still want to propose this. No need to be memcg OOM specific.
>>
>> Well, I maintain what I've said [1] about simplicity and specific fix
>> for a specific issue. Especially in the tricky code like this where all
>> the consequences are far more subtle than they seem to be.
>>
>> This is obviously a matter of taste but I don't see much point discussing
>> this back and forth for ever. Unless there is a general agreement that
>> the above is less appropriate then I am willing to consider a different
>> change but I simply do not have energy to nit pick for ever.
>>
>> [1] http://lkml.kernel.org/r/[email protected]
>
> In other words. Having a memcg specific fix means, well, a memcg
> maintenance burden. Like any other memcg specific oom decisions we
> already have. So are you OK with that Johannes or you would like to see
> a more generic fix which might turn out to be more complex?
>
I don't know what "that Johannes" refers to.
If you don't want to affect SysRq-OOM and pagefault-OOM cases,
are you OK with having a global-OOM specific fix?
mm/page_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c1..f59f029 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,6 +3518,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
if (gfp_mask & __GFP_THISNODE)
goto out;
+ /*
+ * It is possible that multi-threaded OOM victims get
+ * task_will_free_mem(current) == false when the OOM reaper quickly
+ * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+ * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+ * request), we don't need to select next OOM victim.
+ */
+ if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL)) {
+ *did_some_progress = 1;
+ goto out;
+ }
/* Exhausted what can be done so it's blame time */
if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
*did_some_progress = 1;
--
1.8.3.1
On Tue 23-10-18 21:33:43, Tetsuo Handa wrote:
> On 2018/10/23 21:10, Michal Hocko wrote:
> > On Tue 23-10-18 13:42:46, Michal Hocko wrote:
> >> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
> >>> Michal Hocko wrote:
> >>>> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>>> index e79cb59552d9..a9dfed29967b 100644
> >>>>>> --- a/mm/memcontrol.c
> >>>>>> +++ b/mm/memcontrol.c
> >>>>>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>>>> .gfp_mask = gfp_mask,
> >>>>>> .order = order,
> >>>>>> };
> >>>>>> - bool ret;
> >>>>>> + bool ret = true;
> >>>>>>
> >>>>>> mutex_lock(&oom_lock);
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * multi-threaded tasks might race with oom_reaper and gain
> >>>>>> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> >>>>>> + * to out_of_memory failure if the task is the last one in
> >>>>>> + * memcg which would be a false possitive failure reported
> >>>>>> + */
> >>>>>> + if (tsk_is_oom_victim(current))
> >>>>>> + goto unlock;
> >>>>>> +
> >>>>>
> >>>>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> >>>>> so that any killed threads no longer wait for oom_lock.
> >>>>
> >>>> tsk_is_oom_victim is stronger because it doesn't depend on
> >>>> fatal_signal_pending which might be cleared throughout the exit process.
> >>>>
> >>>
> >>> I still want to propose this. No need to be memcg OOM specific.
> >>
> >> Well, I maintain what I've said [1] about simplicity and specific fix
> >> for a specific issue. Especially in the tricky code like this where all
> >> the consequences are far more subtle than they seem to be.
> >>
> >> This is obviously a matter of taste but I don't see much point discussing
> >> this back and forth for ever. Unless there is a general agreement that
> >> the above is less appropriate then I am willing to consider a different
> >> change but I simply do not have energy to nit pick for ever.
> >>
> >> [1] http://lkml.kernel.org/r/[email protected]
> >
> > In other words. Having a memcg specific fix means, well, a memcg
> > maintenance burden. Like any other memcg specific oom decisions we
> > already have. So are you OK with that Johannes or you would like to see
> > a more generic fix which might turn out to be more complex?
> >
>
> I don't know what "that Johannes" refers to.
let me rephrase
Johannes, are you OK with that (memcg specific fix) or would you like to
see a more generic fix which might turn out to be more complex?
--
Michal Hocko
SUSE Labs
On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
>
> Thread 1 Thread2 oom_reaper
> try_charge try_charge
> mem_cgroup_out_of_memory
> mutex_lock(oom_lock)
> mem_cgroup_out_of_memory
> mutex_lock(oom_lock)
> out_of_memory
> select_bad_process
> oom_kill_process(current)
> wake_oom_reaper
> oom_reap_task
> MMF_OOM_SKIP->victim
> mutex_unlock(oom_lock)
> out_of_memory
> select_bad_process # no task
>
> If Thread1 didn't race it would bail out from try_charge and force the
> charge. We can achieve the same by checking tsk_is_oom_victim inside
> the oom_lock and therefore close the race.
>
> [1] http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e79cb59552d9..a9dfed29967b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> .gfp_mask = gfp_mask,
> .order = order,
> };
> - bool ret;
> + bool ret = true;
>
> mutex_lock(&oom_lock);
> +
> + /*
> + * multi-threaded tasks might race with oom_reaper and gain
> + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> + * to out_of_memory failure if the task is the last one in
> + * memcg which would be a false possitive failure reported
> + */
> + if (tsk_is_oom_victim(current))
> + goto unlock;
> +
> ret = out_of_memory(&oc);
We already check tsk_is_oom_victim(current) in try_charge() before
looping on the OOM killer, so at most we'd have a single "no eligible
tasks" message from such a race before we force the charge - right?
While that's not perfect, I don't think it warrants complicating this
code even more. I honestly find it near-impossible to follow the code
and the possible scenarios at this point.
out_of_memory() bails on task_will_free_mem(current), which
specifically *excludes* already reaped tasks. Why are we then adding a
separate check before that to bail on already reaped victims?
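
(The exclusion referred to is the MMF_OOM_SKIP test inside
task_will_free_mem(), quoted in the diff earlier in this thread:

	/*
	 * This task has already been drained by the oom reaper so there are
	 * only small chances it will free some more
	 */
	if (test_bit(MMF_OOM_SKIP, &mm->flags))
		return false;

so a reaped victim falls through to victim selection instead of returning
early.)
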
Do we want to bail if current is a reaped victim or not?
I don't see how we could skip it safely in general: the current task
might have been killed and reaped and gotten access to the memory
reserve and still fail to allocate on its way out. It needs to kill
the next task if there is one, or warn if there isn't another
one. Because we're genuinely oom without reclaimable tasks.
There is of course the scenario brought forward in this thread, where
multiple threads of a process race and the second one enters oom even
though it doesn't need to anymore. What the global case does to catch
this is to grab the oom lock and do one last alloc attempt. Should
memcg lock the oom_lock and try one more time to charge the memcg?
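
(The global-case behaviour referred to is roughly this fragment of
__alloc_pages_may_oom() in 4.19, abbreviated:

	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
	if (page)
		goto out;

The mem_cgroup_margin() check proposed in the reply below mimics this
last-moment attempt for the memcg case.)
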
Some simplification in this area would really be great. I'm reluctant
to ack patches like the above, even if they have some optical benefits
for the user, because the code is already too tricky for what it does.
On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
> On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Tetsuo has reported [1] that a single process group memcg might easily
> > swamp the log with no-eligible oom victim reports due to race between
> > the memcg charge and oom_reaper
> >
> > Thread 1 Thread2 oom_reaper
> > try_charge try_charge
> > mem_cgroup_out_of_memory
> > mutex_lock(oom_lock)
> > mem_cgroup_out_of_memory
> > mutex_lock(oom_lock)
> > out_of_memory
> > select_bad_process
> > oom_kill_process(current)
> > wake_oom_reaper
> > oom_reap_task
> > MMF_OOM_SKIP->victim
> > mutex_unlock(oom_lock)
> > out_of_memory
> > select_bad_process # no task
> >
> > If Thread1 didn't race it would bail out from try_charge and force the
> > charge. We can achieve the same by checking tsk_is_oom_victim inside
> > the oom_lock and therefore close the race.
> >
> > [1] http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/memcontrol.c | 14 +++++++++++++-
> > 1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e79cb59552d9..a9dfed29967b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > .gfp_mask = gfp_mask,
> > .order = order,
> > };
> > - bool ret;
> > + bool ret = true;
> >
> > mutex_lock(&oom_lock);
> > +
> > + /*
> > + * multi-threaded tasks might race with oom_reaper and gain
> > + * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > + * to out_of_memory failure if the task is the last one in
> > + * memcg which would be a false possitive failure reported
> > + */
> > + if (tsk_is_oom_victim(current))
> > + goto unlock;
> > +
> > ret = out_of_memory(&oc);
>
> We already check tsk_is_oom_victim(current) in try_charge() before
> looping on the OOM killer, so at most we'd have a single "no eligible
> tasks" message from such a race before we force the charge - right?
Not really. You can have many threads blocked on the oom_lock and being
reaped while they are waiting. So the check without the lock will always
be racy. This is what Tetsuo's test case actually triggers I believe.
> While that's not perfect, I don't think it warrants complicating this
> code even more. I honestly find it near-impossible to follow the code
> and the possible scenarios at this point.
I do agree that the code is quite far from easy to follow. The set of
events that might happen in a different context is not trivial.
> out_of_memory() bails on task_will_free_mem(current), which
> specifically *excludes* already reaped tasks. Why are we then adding a
> separate check before that to bail on already reaped victims?
696453e66630a has introduced the bail out.
> Do we want to bail if current is a reaped victim or not?
>
> I don't see how we could skip it safely in general: the current task
> might have been killed and reaped and gotten access to the memory
> reserve and still fail to allocate on its way out. It needs to kill
> the next task if there is one, or warn if there isn't another
> one. Because we're genuinely oom without reclaimable tasks.
Yes, this would be the case for the global path, which is a real OOM
situation. Memcg oom is somewhat more relaxed because the oom is local.
> There is of course the scenario brought forward in this thread, where
> multiple threads of a process race and the second one enters oom even
> though it doesn't need to anymore. What the global case does to catch
> this is to grab the oom lock and do one last alloc attempt. Should
> memcg lock the oom_lock and try one more time to charge the memcg?
That would be another option. I agree that making it more towards the
global case makes it more attractive. My tsk_is_oom_victim is more
towards "plug this particular case".
So does this look better to you?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..4abb66efe806 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
.gfp_mask = gfp_mask,
.order = order,
};
- bool ret;
+ bool ret = true;
mutex_lock(&oom_lock);
+
+ /*
+ * Make the last moment check while we were waiting for the oom_lock
+ * just in case the oom_reaper could have freed some memory in the
+ * meantime. This mimics the last moment allocation in
+ * __alloc_pages_may_oom
+ */
+ if (mem_cgroup_margin(memcg) >= 1 << order)
+ goto unlock;
+
ret = out_of_memory(&oc);
+
+unlock:
mutex_unlock(&oom_lock);
return ret;
}
> Some simplification in this area would really be great. I'm reluctant
> to ack patches like the above, even if they have some optical benefits
> for the user, because the code is already too tricky for what it does.
I am open to different ideas, unless they are crazy timeouts here and
timeouts there wrapped in duct tape.
--
Michal Hocko
SUSE Labs
On Fri 26-10-18 21:25:51, Michal Hocko wrote:
> On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
[...]
> > There is of course the scenario brought forward in this thread, where
> > multiple threads of a process race and the second one enters oom even
> > though it doesn't need to anymore. What the global case does to catch
> > this is to grab the oom lock and do one last alloc attempt. Should
> > memcg lock the oom_lock and try one more time to charge the memcg?
>
> That would be another option. I agree that making it more towards the
> global case makes it more attractive. My tsk_is_oom_victim is more
> towards "plug this particular case".
Nevertheless let me emphasise that tsk_is_oom_victim will close the race
completely, while mem_cgroup_margin will always be racy. So the question
is whether we want to close the race, because it is just too easy for
userspace to hit it, or to keep the global and memcg oom handling as close
as possible.
--
Michal Hocko
SUSE Labs
On 2018/10/27 4:25, Michal Hocko wrote:
>> out_of_memory() bails on task_will_free_mem(current), which
>> specifically *excludes* already reaped tasks. Why are we then adding a
>> separate check before that to bail on already reaped victims?
>
> 696453e66630a has introduced the bail out.
>
>> Do we want to bail if current is a reaped victim or not?
>>
>> I don't see how we could skip it safely in general: the current task
>> might have been killed and reaped and gotten access to the memory
>> reserve and still fail to allocate on its way out. It needs to kill
>> the next task if there is one, or warn if there isn't another
>> one. Because we're genuinely oom without reclaimable tasks.
>
> Yes, this would be true for the global case, which is a real OOM
> situation. Memcg OOM is somewhat more relaxed because the OOM is local.
We can handle the possibility of a genuine OOM without reclaimable tasks.
Only a __GFP_NOFAIL OOM has to select the next OOM victim; there is no need
to select one otherwise. Commit 696453e66630ad45
("mm, oom: task_will_free_mem should skip oom_reaped tasks") was too simplistic.
On 2018/10/27 4:33, Michal Hocko wrote:
> On Fri 26-10-18 21:25:51, Michal Hocko wrote:
>> On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
> [...]
>>> There is of course the scenario brought forward in this thread, where
>>> multiple threads of a process race and the second one enters oom even
>>> though it doesn't need to anymore. What the global case does to catch
>>> this is to grab the oom lock and do one last alloc attempt. Should
>>> memcg lock the oom_lock and try one more time to charge the memcg?
>>
>> That would be another option. I agree that making it more towards the
>> global case makes it more attractive. My tsk_is_oom_victim is more
>> towards "plug this particular case".
>
> Nevertheless let me emphasise that tsk_is_oom_victim will close the race
> completely, while mem_cgroup_margin will always be racy. So the question
> is whether we want to close the race because it is just too easy for
> userspace to hit it or keep the global and memcg oom handling as close
> as possible.
>
Yes, adding tsk_is_oom_victim(current) before calling out_of_memory() in
both the global OOM and memcg OOM paths can close the race completely. (But
note that the tsk_is_oom_victim(current) check in the global OOM path needs
to also check for __GFP_NOFAIL in order to handle the genuine OOM case.)
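To illustrate the idea (this is not part of any posted patch; the helper name
is only for illustration):

/*
 * Sketch only: an already selected (and possibly reaped) OOM victim does
 * not need to pick another victim unless the allocation is __GFP_NOFAIL
 * and therefore has to make progress at any cost.
 */
static bool oom_victim_may_bail(struct oom_control *oc)
{
	return tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL);
}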
From dc0d9ec3205a28680dcf2213c0ffe8820e8b5913 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Tue, 6 Nov 2018 12:27:36 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer
It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread was OOM-killed,
due to a race between the memcg charge and the OOM reaper [1].
Thread-1                 Thread-2                  OOM reaper
try_charge()
  mem_cgroup_out_of_memory()
    mutex_lock(oom_lock)
                         try_charge()
                           mem_cgroup_out_of_memory()
                             mutex_lock(oom_lock)
    out_of_memory()
      select_bad_process()
      oom_kill_process(current)
      wake_oom_reaper()
                                                   oom_reap_task()
                                                     # sets MMF_OOM_SKIP
    mutex_unlock(oom_lock)
                             out_of_memory()
                               select_bad_process() # no task
                             mutex_unlock(oom_lock)
We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, because mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.
[1] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
---
mm/memcontrol.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b..a97648a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
};
bool ret;
- mutex_lock(&oom_lock);
- ret = out_of_memory(&oc);
+ if (mutex_lock_killable(&oom_lock))
+ return true;
+ /*
+ * A few threads which were not waiting at mutex_lock_killable() can
+ * fail to bail out. Therefore, check again after holding oom_lock.
+ */
+ ret = fatal_signal_pending(current) || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
}
--
1.8.3.1
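For context, the forward-progress bypass in try_charge() that the changelog
refers to is roughly the condition below (paraphrased from memory; in the real
code it simply forces the charge when the condition holds, and the helper name
is only for illustration):

/*
 * Roughly the condition under which try_charge() lets a task bypass the
 * charge so it can exit quickly and free its memory.
 */
static bool charge_bypass_allowed(void)
{
	return tsk_is_oom_victim(current) ||
	       fatal_signal_pending(current) ||
	       (current->flags & PF_EXITING);
}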
On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6e1469b..a97648a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> };
> bool ret;
>
> - mutex_lock(&oom_lock);
> - ret = out_of_memory(&oc);
> + if (mutex_lock_killable(&oom_lock))
> + return true;
> + /*
> + * A few threads which were not waiting at mutex_lock_killable() can
> + * fail to bail out. Therefore, check again after holding oom_lock.
> + */
> + ret = fatal_signal_pending(current) || out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> return ret;
> }
If we are going with a memcg specific thingy then I really prefer the
tsk_is_oom_victim approach. Or is there any reason why this is not
suitable?
--
Michal Hocko
SUSE Labs
On 2018/11/06 21:42, Michal Hocko wrote:
> On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
> [...]
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6e1469b..a97648a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> };
>> bool ret;
>>
>> - mutex_lock(&oom_lock);
>> - ret = out_of_memory(&oc);
>> + if (mutex_lock_killable(&oom_lock))
>> + return true;
>> + /*
>> + * A few threads which were not waiting at mutex_lock_killable() can
>> + * fail to bail out. Therefore, check again after holding oom_lock.
>> + */
>> + ret = fatal_signal_pending(current) || out_of_memory(&oc);
>> mutex_unlock(&oom_lock);
>> return ret;
>> }
>
> If we are going with a memcg specific thingy then I really prefer the
> tsk_is_oom_victim approach. Or is there any reason why this is not
> suitable?
>
Why do we need to wait for mark_oom_victim(), which is called only after
slow printk() messages?
If the current thread got Ctrl-C and thus can terminate, what is the benefit
of waiting for the OOM killer? What if there are several OOM events in
multiple memcg domains, all waiting for completion of printk() messages? I
don't see the point of waiting for oom_lock, since try_charge() already
allows the current thread to terminate due to the fatal_signal_pending() test.
On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
> On 2018/11/06 21:42, Michal Hocko wrote:
> > On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
> > [...]
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6e1469b..a97648a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >> };
> >> bool ret;
> >>
> >> - mutex_lock(&oom_lock);
> >> - ret = out_of_memory(&oc);
> >> + if (mutex_lock_killable(&oom_lock))
> >> + return true;
> >> + /*
> >> + * A few threads which were not waiting at mutex_lock_killable() can
> >> + * fail to bail out. Therefore, check again after holding oom_lock.
> >> + */
> >> + ret = fatal_signal_pending(current) || out_of_memory(&oc);
> >> mutex_unlock(&oom_lock);
> >> return ret;
> >> }
> >
> > If we are going with a memcg specific thingy then I really prefer the
> > tsk_is_oom_victim approach. Or is there any reason why this is not
> > suitable?
> >
>
> Why do we need to wait for mark_oom_victim(), which is called only after
> slow printk() messages?
>
> If the current thread got Ctrl-C and thus can terminate, what is the benefit
> of waiting for the OOM killer? What if there are several OOM events in
> multiple memcg domains, all waiting for completion of printk() messages? I
> don't see the point of waiting for oom_lock, since try_charge() already
> allows the current thread to terminate due to the fatal_signal_pending() test.
mutex_lock_killable would already take care of an exiting task. I would
then still prefer to check for mark_oom_victim because that is not racy
with the exit path clearing signals. I can update my patch to use the
_killable lock variant if we are really going with the memcg specific
fix.
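Concretely, combining the killable lock with the oom-victim check might look
like the sketch below (untested, for illustration only):

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
				     int order)
{
	struct oom_control oc = {
		.memcg = memcg,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	bool ret = true;

	/* A pending SIGKILL lets us bail out before even taking the lock. */
	if (mutex_lock_killable(&oom_lock))
		return true;

	/*
	 * An already selected (and possibly reaped) victim does not need to
	 * invoke the OOM killer again; try_charge() will force the charge.
	 */
	if (tsk_is_oom_victim(current))
		goto unlock;

	ret = out_of_memory(&oc);
unlock:
	mutex_unlock(&oom_lock);
	return ret;
}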
Johannes?
--
Michal Hocko
SUSE Labs
On 2018/11/07 19:08, Michal Hocko wrote:
> On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
>> On 2018/11/06 21:42, Michal Hocko wrote:
>>> On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
>>> [...]
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 6e1469b..a97648a 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>> };
>>>> bool ret;
>>>>
>>>> - mutex_lock(&oom_lock);
>>>> - ret = out_of_memory(&oc);
>>>> + if (mutex_lock_killable(&oom_lock))
>>>> + return true;
>>>> + /*
>>>> + * A few threads which were not waiting at mutex_lock_killable() can
>>>> + * fail to bail out. Therefore, check again after holding oom_lock.
>>>> + */
>>>> + ret = fatal_signal_pending(current) || out_of_memory(&oc);
>>>> mutex_unlock(&oom_lock);
>>>> return ret;
>>>> }
>>>
>>> If we are going with a memcg specific thingy then I really prefer the
>>> tsk_is_oom_victim approach. Or is there any reason why this is not
>>> suitable?
>>>
>>
>> Why do we need to wait for mark_oom_victim(), which is called only after
>> slow printk() messages?
>>
>> If the current thread got Ctrl-C and thus can terminate, what is the benefit
>> of waiting for the OOM killer? What if there are several OOM events in
>> multiple memcg domains, all waiting for completion of printk() messages? I
>> don't see the point of waiting for oom_lock, since try_charge() already
>> allows the current thread to terminate due to the fatal_signal_pending() test.
>
> mutex_lock_killable would already take care of an exiting task. I would
> then still prefer to check for mark_oom_victim because that is not racy
> with the exit path clearing signals. I can update my patch to use the
> _killable lock variant if we are really going with the memcg specific
> fix.
>
> Johannes?
>
No response for one month. When can we move on to the RCU stall problem syzbot reported?
On 2018/12/07 21:43, Tetsuo Handa wrote:
> No response for one month. When can we move on to the RCU stall problem syzbot reported?
Why not apply this patch first and then think about how to address
https://lore.kernel.org/lkml/[email protected]/ ?
From 0fb58415770a83d6c40d471e1840f8bc4a35ca83 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 12 Dec 2018 19:15:51 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer
It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread was OOM-killed,
due to a race between the memcg charge and the OOM reaper [1].
Thread-1                 Thread-2                  OOM reaper
try_charge()
  mem_cgroup_out_of_memory()
    mutex_lock(oom_lock)
                         try_charge()
                           mem_cgroup_out_of_memory()
                             mutex_lock(oom_lock)
    out_of_memory()
      select_bad_process()
      oom_kill_process(current)
      wake_oom_reaper()
                                                   oom_reap_task()
                                                     # sets MMF_OOM_SKIP
    mutex_unlock(oom_lock)
                             out_of_memory()
                               select_bad_process() # no task
                             mutex_unlock(oom_lock)
We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, because mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.
Michal has a plan to use tsk_is_oom_victim() by calling mark_oom_victim()
on all thread groups sharing the victim's mm. But the fatal_signal_pending()
check in this patch helps regardless of Michal's plan, because it avoids
needlessly calling out_of_memory() when the current thread is already
terminating (e.g. it got SIGINT after passing the fatal_signal_pending()
check in try_charge() and mutex_lock_killable() did not block).
[1] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
---
mm/memcontrol.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f7..b0d3bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
};
bool ret;
- mutex_lock(&oom_lock);
- ret = out_of_memory(&oc);
+ if (mutex_lock_killable(&oom_lock))
+ return true;
+ /*
+ * A few threads which were not waiting at mutex_lock_killable() can
+ * fail to bail out. Therefore, check again after holding oom_lock.
+ */
+ ret = fatal_signal_pending(current) || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
}
--
1.8.3.1