oom_lock isn't needed for __oom_reap_task_mm(). If MMF_UNSTABLE is
already set for the mm, we can simply back out immediately since oom
reaping is already in progress (or done).
Signed-off-by: David Rientjes <[email protected]>
---
mm/mmap.c | 2 --
mm/oom_kill.c | 6 ++++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index cd2431f46188..7f918eb725f6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3072,9 +3072,7 @@ void exit_mmap(struct mm_struct *mm)
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
* __oom_reap_task_mm() will not block.
*/
- mutex_lock(&oom_lock);
__oom_reap_task_mm(mm);
- mutex_unlock(&oom_lock);
/*
* Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0fe4087d5151..e6328cef090f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
* Tell all users of get_user/copy_from_user etc... that the content
* is no longer stable. No barriers really needed because unmapping
* should imply barriers already and the reader would hit a page fault
- * if it stumbled over a reaped memory.
+ * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
+ * reaping has already occurred so there is nothing left to do.
*/
- set_bit(MMF_UNSTABLE, &mm->flags);
+ if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
+ return;
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
if (!can_madv_dontneed_vma(vma))
What a simplified description of oom_lock... Here is a fuller picture.
Positive effects
(1) Serialize "setting TIF_MEMDIE and calling __thaw_task()/atomic_inc() from
mark_oom_victim()" and "setting oom_killer_disabled = true from
oom_killer_disable()".
(2) Serialize all printk() messages from out_of_memory().
(3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
which the current thread should wait for.
(4) Serialize blocking_notifier_call_chain() from out_of_memory(), because some
of the callbacks might not be thread-safe and/or because unserialized calls
might release more memory than needed.
Negative effects
(A) Threads which called mutex_lock(&oom_lock) before calling out_of_memory()
are blocked waiting for "__oom_reap_task_mm() from exit_mmap()" and/or
"__oom_reap_task_mm() from oom_reap_task_mm()".
(B) Threads which do not call out_of_memory() because mutex_trylock(&oom_lock)
failed continue consuming CPU resources pointlessly (see the sketch just below).
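(For reference, the trylock behaviour behind (2), (A) and (B) lives in the
allocator's OOM entry point. Below is a simplified sketch modeled on
__alloc_pages_may_oom() of this era -- not the verbatim source; the retry
plumbing and watermark details are trimmed.)

static struct page *__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                                          const struct alloc_context *ac)
{
        struct oom_control oc = {
                .zonelist = ac->zonelist,
                .nodemask = ac->nodemask,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        struct page *page = NULL;

        /*
         * Only one task runs out_of_memory() at a time.  A task that
         * fails the trylock returns empty-handed to the allocator's
         * retry loop -- the pointless CPU consumption described in (B).
         */
        if (!mutex_trylock(&oom_lock))
                return NULL;

        /* Last chance: the oom_lock holder may have freed memory already. */
        page = get_page_from_freelist(gfp_mask, order,
                                      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
        if (!page)
                out_of_memory(&oc);     /* printk()s serialized: effect (2) */

        mutex_unlock(&oom_lock);
        return page;
}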
Regarding (A), we can reduce the range oom_lock serializes from
"__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", since oom_lock is useful for (3).
Therefore, we can apply the change below on top of your patch. But I don't like
sharing MMF_UNSTABLE for two purposes (the reason is explained below).
Regarding (B), we can do direct OOM reaping (like my proposal does).
---
kernel/fork.c | 5 +++++
mm/mmap.c | 21 +++++++++------------
mm/oom_kill.c | 57 ++++++++++++++++++++++-----------------------------------
3 files changed, 36 insertions(+), 47 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6747298..f37d481 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -984,6 +984,11 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ if (unlikely(mm_is_oom_victim(mm))) {
+ mutex_lock(&oom_lock);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ mutex_unlock(&oom_lock);
+ }
mmdrop(mm);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 7f918eb..203061f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3075,19 +3075,17 @@ void exit_mmap(struct mm_struct *mm)
__oom_reap_task_mm(mm);
/*
- * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
- * This needs to be done before calling munlock_vma_pages_all(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test for it. If the oom reaper races with
- * munlock_vma_pages_all(), this can result in a kernel oops if
- * a pmd is zapped, for example, after follow_page_mask() has
- * checked pmd_none().
+ * Wait for the oom reaper to complete. This needs to be done
+ * before calling munlock_vma_pages_all(), which clears
+ * VM_LOCKED, otherwise the oom reaper cannot reliably test for
+ * it. If the oom reaper races with munlock_vma_pages_all(),
+ * this can result in a kernel oops if a pmd is zapped, for
+ * example, after follow_page_mask() has checked pmd_none().
*
- * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
- * guarantee that the oom reaper will not run on this mm again
- * after mmap_sem is dropped.
+ * Taking mm->mmap_sem for write will guarantee that the oom
+ * reaper will not run on this mm again after mmap_sem is
+ * dropped.
*/
- set_bit(MMF_UNSTABLE, &mm->flags);
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
@@ -3115,7 +3113,6 @@ void exit_mmap(struct mm_struct *mm)
unmap_vmas(&tlb, vma, 0, -1);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
- set_bit(MMF_OOM_SKIP, &mm->flags);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e6328ce..7ed4ed0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,11 +488,9 @@ void __oom_reap_task_mm(struct mm_struct *mm)
* Tell all users of get_user/copy_from_user etc... that the content
* is no longer stable. No barriers really needed because unmapping
* should imply barriers already and the reader would hit a page fault
- * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
- * reaping has already occurred so there is nothing left to do.
+ * if it stumbled over a reaped memory.
*/
- if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
- return;
+ set_bit(MMF_UNSTABLE, &mm->flags);
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
if (!can_madv_dontneed_vma(vma))
@@ -524,25 +522,9 @@ void __oom_reap_task_mm(struct mm_struct *mm)
static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
- /*
- * We have to make sure to not race with the victim exit path
- * and cause premature new oom victim selection:
- * oom_reap_task_mm              exit_mm
- *   mmget_not_zero
- *                                 mmput
- *                                   atomic_dec_and_test
- *                                 exit_oom_victim
- *                               [...]
- *                               out_of_memory
- *                                 select_bad_process
- *                                   # no TIF_MEMDIE task selects new victim
- *  unmap_page_range # frees some memory
- */
- mutex_lock(&oom_lock);
-
if (!down_read_trylock(&mm->mmap_sem)) {
trace_skip_task_reaping(tsk->pid);
- goto out_oom;
+ return;
}
/*
@@ -555,10 +537,18 @@ static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
goto out_mm;
/*
- * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_UNSTABLE must run
- * under mmap_sem for reading because it serializes against the
- * down_write();up_write() cycle in exit_mmap().
+ * MMF_UNSTABLE is set by the time exit_mmap() calls
+ * munlock_vma_pages_all() in order to avoid a race condition. The check
+ * for MMF_UNSTABLE must run under mmap_sem for reading because it
+ * serializes against the down_write();up_write() cycle in exit_mmap().
+ *
+ * However, since MMF_UNSTABLE is set by __oom_reap_task_mm() from
+ * exit_mmap() before it starts reaping (because the purpose of
+ * MMF_UNSTABLE is to "tell all users of get_user/copy_from_user etc...
+ * that the content is no longer stable"), it cannot be used as a flag
+ * indicating that the OOM reaper can't work on the mm anymore.
+ * The OOM reaper will give up after (by default) 1 second even if
+ * exit_mmap() is doing __oom_reap_task_mm().
*/
if (test_bit(MMF_UNSTABLE, &mm->flags)) {
trace_skip_task_reaping(tsk->pid);
@@ -576,8 +566,6 @@ static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_SHMEMPAGES)));
out_mm:
up_read(&mm->mmap_sem);
-out_oom:
- mutex_unlock(&oom_lock);
}
static void oom_reap_task(struct task_struct *tsk)
@@ -591,12 +579,7 @@ static void oom_reap_task(struct task_struct *tsk)
if (test_bit(MMF_OOM_SKIP, &mm->flags))
goto drop;
- /*
- * If this mm has already been reaped, doing so again will not likely
- * free additional memory.
- */
- if (!test_bit(MMF_UNSTABLE, &mm->flags))
- oom_reap_task_mm(tsk, mm);
+ oom_reap_task_mm(tsk, mm);
if (time_after_eq(jiffies, mm->oom_free_expire)) {
if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
@@ -658,12 +641,16 @@ static int oom_reaper(void *unused)
static u64 oom_free_timeout_ms = 1000;
static void wake_oom_reaper(struct task_struct *tsk)
{
+ unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+ /* expire must not be 0 in order to avoid double list_add(). */
+ if (!expire)
+ expire++;
/*
* Set the reap timeout; if it's already set, the mm is enqueued and
* this tsk can be ignored.
*/
- if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
- jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
+ if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
return;
get_task_struct(tsk);
--
1.8.3.1
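(A note on the wake_oom_reaper() hunk above: oom_free_expire == 0 doubles as
the "not yet queued" marker, and since jiffies wraps around, the computed
deadline can itself be 0 and defeat the cmpxchg() guard -- hence the increment.
A minimal sketch of the enqueue-once idiom follows; queue_reap_once() is a
hypothetical name, not a kernel function, and oom_free_expire is the field
added by the linux-next patch under discussion.)

/*
 * Hypothetical illustration: mm->oom_free_expire == 0 means "not
 * queued", and the first caller to install a non-zero deadline wins
 * and enqueues while later callers back off.  Because
 * jiffies + msecs_to_jiffies() can wrap to exactly 0, that value is
 * bumped so it cannot be mistaken for "not queued" and cause a
 * double list_add().
 */
static bool queue_reap_once(struct mm_struct *mm, unsigned long timeout_ms)
{
        unsigned long expire = jiffies + msecs_to_jiffies(timeout_ms);

        if (!expire)            /* 0 is reserved for "not queued" */
                expire = 1;

        /* cmpxchg() returns the old value; 0 means we won the race. */
        return cmpxchg(&mm->oom_free_expire, 0UL, expire) == 0;
}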
On Thu 12-07-18 14:34:00, David Rientjes wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0fe4087d5151..e6328cef090f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
> * Tell all users of get_user/copy_from_user etc... that the content
> * is no longer stable. No barriers really needed because unmapping
> * should imply barriers already and the reader would hit a page fault
> - * if it stumbled over a reaped memory.
> + * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> + * reaping has already occurred so there is nothing left to do.
> */
> - set_bit(MMF_UNSTABLE, &mm->flags);
> + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> + return;
This could lead to premature oom victim selection
oom_reaper                              exiting victim
oom_reap_task                           exit_mmap
  __oom_reap_task_mm                      __oom_reap_task_mm
    test_and_set_bit(MMF_UNSTABLE) # wins the race
                                          test_and_set_bit(MMF_UNSTABLE)
                                        set_bit(MMF_OOM_SKIP) # new victim can be selected now.
Besides that, why should we back off in the first place? We can
race the two without any problems AFAICS. We already do have proper
synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
diff --git a/mm/mmap.c b/mm/mmap.c
index fc41c0543d7f..4642964f7741 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
* which clears VM_LOCKED, otherwise the oom reaper cannot
* reliably test it.
*/
- mutex_lock(&oom_lock);
__oom_reap_task_mm(mm);
- mutex_unlock(&oom_lock);
set_bit(MMF_OOM_SKIP, &mm->flags);
down_write(&mm->mmap_sem);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32e6f7becb40..f11108af122d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -529,28 +529,9 @@ void __oom_reap_task_mm(struct mm_struct *mm)
static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
- bool ret = true;
-
- /*
- * We have to make sure to not race with the victim exit path
- * and cause premature new oom victim selection:
- * oom_reap_task_mm              exit_mm
- *   mmget_not_zero
- *                                 mmput
- *                                   atomic_dec_and_test
- *                                 exit_oom_victim
- *                               [...]
- *                               out_of_memory
- *                                 select_bad_process
- *                                   # no TIF_MEMDIE task selects new victim
- *  unmap_page_range # frees some memory
- */
- mutex_lock(&oom_lock);
-
if (!down_read_trylock(&mm->mmap_sem)) {
- ret = false;
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ return false;
}
/*
@@ -562,7 +543,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
if (mm_has_blockable_invalidate_notifiers(mm)) {
up_read(&mm->mmap_sem);
schedule_timeout_idle(HZ);
- goto unlock_oom;
+ return true;
}
/*
@@ -589,9 +570,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
up_read(&mm->mmap_sem);
trace_finish_task_reaping(tsk->pid);
-unlock_oom:
- mutex_unlock(&oom_lock);
- return ret;
+ return true;
}
#define MAX_OOM_REAP_RETRIES 10
--
Michal Hocko
SUSE Labs
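(For context on the bool return value restored above: the oom_reaper kthread
retries the reap a bounded number of times while mmap_sem contention makes
oom_reap_task_mm() fail, and then sets MMF_OOM_SKIP unconditionally. A trimmed
sketch of the caller, modeled on oom_reap_task() in linux.git of this era;
the pr_info() diagnostics are omitted.)

static void oom_reap_task(struct task_struct *tsk)
{
        int attempts = 0;
        struct mm_struct *mm = tsk->signal->oom_mm;

        /* Retry the down_read_trylock(mmap_sem) a few times. */
        while (attempts++ < MAX_OOM_REAP_RETRIES &&
               !oom_reap_task_mm(tsk, mm))
                schedule_timeout_idle(HZ / 10);

        /*
         * Hide this mm from the OOM killer: it has either been reaped
         * or somebody is holding mmap_sem for write too long, and in
         * both cases a new victim may be selected from now on.
         */
        set_bit(MMF_OOM_SKIP, &mm->flags);

        /* Drop the reference taken by wake_oom_reaper(). */
        put_task_struct(tsk);
}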
On 2018/07/13 23:26, Michal Hocko wrote:
> On Thu 12-07-18 14:34:00, David Rientjes wrote:
> [...]
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 0fe4087d5151..e6328cef090f 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>> * Tell all users of get_user/copy_from_user etc... that the content
>> * is no longer stable. No barriers really needed because unmapping
>> * should imply barriers already and the reader would hit a page fault
>> - * if it stumbled over a reaped memory.
>> + * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
>> + * reaping has already occurred so there is nothing left to do.
>> */
>> - set_bit(MMF_UNSTABLE, &mm->flags);
>> + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
>> + return;
>
> This could lead to premature oom victim selection
> oom_reaper                              exiting victim
> oom_reap_task                           exit_mmap
>   __oom_reap_task_mm                      __oom_reap_task_mm
>     test_and_set_bit(MMF_UNSTABLE) # wins the race
>                                           test_and_set_bit(MMF_UNSTABLE)
>                                         set_bit(MMF_OOM_SKIP) # new victim can be selected now.
>
> Besides that, why should we back off in the first place? We can
> race the two without any problems AFAICS. We already do have proper
> synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index fc41c0543d7f..4642964f7741 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
> * which clears VM_LOCKED, otherwise the oom reaper cannot
> * reliably test it.
> */
> - mutex_lock(&oom_lock);
> __oom_reap_task_mm(mm);
> - mutex_unlock(&oom_lock);
>
> set_bit(MMF_OOM_SKIP, &mm->flags);
David and Michal are using different versions as a baseline here.
David is making changes using timeout based back off (in linux-next.git)
which is inappropriately trying to use MMF_UNSTABLE for two purposes.
Michal is making changes using current code (in linux.git) which does not
address David's concern.
My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) is
making changes using current code which also provides oom-badness
based back off in order to address David's concern.
> down_write(&mm->mmap_sem);
Anyway, I suggest doing
mutex_lock(&oom_lock);
set_bit(MMF_OOM_SKIP, &mm->flags);
mutex_unlock(&oom_lock);
like I mentioned at
http://lkml.kernel.org/r/[email protected]
even if we make changes on top of linux-next's timeout based back off.
On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
> On 2018/07/13 23:26, Michal Hocko wrote:
> > On Thu 12-07-18 14:34:00, David Rientjes wrote:
> > [...]
> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >> index 0fe4087d5151..e6328cef090f 100644
> >> --- a/mm/oom_kill.c
> >> +++ b/mm/oom_kill.c
> >> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
> >> * Tell all users of get_user/copy_from_user etc... that the content
> >> * is no longer stable. No barriers really needed because unmapping
> >> * should imply barriers already and the reader would hit a page fault
> >> - * if it stumbled over a reaped memory.
> >> + * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> >> + * reaping has already occurred so there is nothing left to do.
> >> */
> >> - set_bit(MMF_UNSTABLE, &mm->flags);
> >> + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> >> + return;
> >
> > This could lead to premature oom victim selection
> > oom_reaper                              exiting victim
> > oom_reap_task                           exit_mmap
> >   __oom_reap_task_mm                      __oom_reap_task_mm
> >     test_and_set_bit(MMF_UNSTABLE) # wins the race
> >                                           test_and_set_bit(MMF_UNSTABLE)
> >                                         set_bit(MMF_OOM_SKIP) # new victim can be selected now.
> >
> > Besides that, why should we back off in the first place? We can
> > race the two without any problems AFAICS. We already do have proper
> > synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index fc41c0543d7f..4642964f7741 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
> > * which clears VM_LOCKED, otherwise the oom reaper cannot
> > * reliably test it.
> > */
> > - mutex_lock(&oom_lock);
> > __oom_reap_task_mm(mm);
> > - mutex_unlock(&oom_lock);
> >
> > set_bit(MMF_OOM_SKIP, &mm->flags);
>
> David and Michal are using different versions as a baseline here.
> David is making changes using timeout based back off (in linux-next.git)
> which is inappropriately trying to use MMF_UNSTABLE for two purposes.
>
> Michal is making changes using current code (in linux.git) which does not
> address David's concern.
Yes, I have based it on top of the Linus tree because the point of this patch
is to get rid of the locking which is no longer needed. I do not see
what concern you are talking about.
>
> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) is
> making changes using current code which also provides oom-badness
> based back off in order to address David's concern.
>
> > down_write(&mm->mmap_sem);
>
> Anyway, I suggest doing
>
> mutex_lock(&oom_lock);
> set_bit(MMF_OOM_SKIP, &mm->flags);
> mutex_unlock(&oom_lock);
Why do we need it?
> like I mentioned at
> http://lkml.kernel.org/r/[email protected]
> even if we make changes on top of linux-next's timeout based back off.
The mail linked above says
: (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
: which the current thread should wait for.
[...]
: Regarding (A), we can reduce the range oom_lock serializes from
: "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", for oom_lock is useful for (3).
But why is a lock needed for this? This doesn't make much sense to
me. If we do not have MMF_OOM_SKIP set, we should still have mm_is_oom_victim,
so no new task should be selected. If we race with the oom reaper then
OK, we would just not select a new victim and retry later.
--
Michal Hocko
SUSE Labs
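(The "no new task should be selected" claim refers to the per-task check that
select_bad_process() runs under oom_lock. A simplified sketch, modeled on
oom_evaluate_task() in mm/oom_kill.c of this era, with the ordinary badness
scoring trimmed.)

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
        struct oom_control *oc = arg;

        /*
         * An existing victim whose mm still lacks MMF_OOM_SKIP aborts
         * the scan: we wait for it rather than kill another task.
         * Once MMF_OOM_SKIP is set, the victim is skipped and a new
         * victim may be chosen.
         */
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        return 0;               /* skip, keep scanning */
                oc->chosen = (void *)-1UL;      /* abort the whole scan */
                return 1;
        }

        /* (badness scoring of candidate tasks trimmed) */
        return 0;
}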
On 2018/07/16 15:13, Michal Hocko wrote:
> On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
>>> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
>>> * which clears VM_LOCKED, otherwise the oom reaper cannot
>>> * reliably test it.
>>> */
>>> - mutex_lock(&oom_lock);
>>> __oom_reap_task_mm(mm);
>>> - mutex_unlock(&oom_lock);
>>>
>>> set_bit(MMF_OOM_SKIP, &mm->flags);
>>
>> David and Michal are using different versions as a baseline here.
>> David is making changes using timeout based back off (in linux-next.git)
>> which is inappropriately trying to use MMF_UNSTABLE for two purposes.
>>
>> Michal is making changes using current code (in linux.git) which does not
>> address David's concern.
>
> Yes, I have based it on top of the Linus tree because the point of this patch
> is to get rid of the locking which is no longer needed. I do not see
> what concern you are talking about.
I'm saying that your patch does not apply to linux-next.git
because David's patch already did s/MMF_OOM_SKIP/MMF_UNSTABLE/.
>>
>> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) is
>> making changes using current code which also provides oom-badness
>> based back off in order to address David's concern.
>>
>>> down_write(&mm->mmap_sem);
>>
>> Anyway, I suggest doing
>>
>> mutex_lock(&oom_lock);
>> set_bit(MMF_OOM_SKIP, &mm->flags);
>> mutex_unlock(&oom_lock);
>
> Why do we need it?
>
>> like I mentioned at
>> http://lkml.kernel.org/r/[email protected]
>> even if we make changes on top of linux-next's timeout based back off.
>
> The mail linked above says
> : (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
> : which the current thread should wait for.
> [...]
> : Regarding (A), we can reduce the range oom_lock serializes from
> : "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", for oom_lock is useful for (3).
>
> But why is a lock needed for this? This doesn't make much sense to
> me. If we do not have MMF_OOM_SKIP set, we should still have mm_is_oom_victim,
> so no new task should be selected. If we race with the oom reaper then
> OK, we would just not select a new victim and retry later.
>
How does mm_is_oom_victim() help? mm_is_oom_victim() is used by exit_mmap() to
decide whether the current thread should call __oom_reap_task_mm().

I'm talking about the sequence below (i.e. after returning from __oom_reap_task_mm()).

    CPU 0                                          CPU 1

  mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
  get_page_from_freelist() fails.
  Enters out_of_memory().

                                                   __oom_reap_task_mm() reclaims some memory.
                                                   Sets MMF_OOM_SKIP.

  select_bad_process() selects new victim because MMF_OOM_SKIP is already set.
  Kills a new OOM victim without retrying the last second allocation attempt.
  Leaves out_of_memory().
  mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.

If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce a last second
allocation attempt like below.

    CPU 0                                          CPU 1

  mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
  get_page_from_freelist() fails.
  Enters out_of_memory().

                                                   __oom_reap_task_mm() reclaims some memory.
                                                   mutex_lock(&oom_lock);

  select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
  Leaves out_of_memory().
  mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.

                                                   Sets MMF_OOM_SKIP.
                                                   mutex_unlock(&oom_lock);

  get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
  Saved one OOM victim from being needlessly killed.

That is, guarding setting MMF_OOM_SKIP works like synchronize_rcu(): it waits for
anybody who already acquired (or started waiting for) oom_lock to release oom_lock,
in order to prevent select_bad_process() from needlessly selecting a new OOM victim.
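(A sketch to make the synchronize_rcu()-like effect concrete; oom_skip_publish()
is a hypothetical helper name, not an existing function. Every caller of
out_of_memory() already holds oom_lock, so the mutex_lock() below cannot return
while a select_bad_process() scan is in flight.)

/*
 * Hypothetical helper for the exit path.  Because out_of_memory() is
 * always entered with oom_lock held, mutex_lock() here waits for any
 * in-flight OOM scan to leave out_of_memory() before MMF_OOM_SKIP is
 * published, so a scan that started earlier can never observe the bit
 * mid-walk.
 */
static void oom_skip_publish(struct mm_struct *mm)
{
        mutex_lock(&oom_lock);
        set_bit(MMF_OOM_SKIP, &mm->flags);
        mutex_unlock(&oom_lock);
}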
On Mon 16-07-18 16:04:26, Tetsuo Handa wrote:
> On 2018/07/16 15:13, Michal Hocko wrote:
> > On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
> >>> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
> >>> * which clears VM_LOCKED, otherwise the oom reaper cannot
> >>> * reliably test it.
> >>> */
> >>> - mutex_lock(&oom_lock);
> >>> __oom_reap_task_mm(mm);
> >>> - mutex_unlock(&oom_lock);
> >>>
> >>> set_bit(MMF_OOM_SKIP, &mm->flags);
> >>
> >> David and Michal are using different versions as a baseline here.
> >> David is making changes using timeout based back off (in linux-next.git)
> >> which is inappropriately trying to use MMF_UNSTABLE for two purposes.
> >>
> >> Michal is making changes using current code (in linux.git) which does not
> >> address David's concern.
> >
> > Yes, I have based it on top of the Linus tree because the point of this patch
> > is to get rid of the locking which is no longer needed. I do not see
> > what concern you are talking about.
>
> I'm saying that your patch does not apply to linux-next.git
> because David's patch already did s/MMF_OOM_SKIP/MMF_UNSTABLE/.
This patch has been nacked by me AFAIR so I assume it should be dropped
from the mmotm tree.
> >> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) is
> >> making changes using current code which also provides oom-badness
> >> based back off in order to address David's concern.
> >>
> >>> down_write(&mm->mmap_sem);
> >>
> >> Anyway, I suggest doing
> >>
> >> mutex_lock(&oom_lock);
> >> set_bit(MMF_OOM_SKIP, &mm->flags);
> >> mutex_unlock(&oom_lock);
> >
> > Why do we need it?
> >
> >> like I mentioned at
> >> http://lkml.kernel.org/r/[email protected]
> >> even if we make changes on top of linux-next's timeout based back off.
> >
> > The mail linked above says
> > : (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
> > : which the current thread should wait for.
> > [...]
> > : Regarding (A), we can reduce the range oom_lock serializes from
> > : "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", for oom_lock is useful for (3).
> >
> > But why is a lock needed for this? This doesn't make much sense to
> > me. If we do not have MMF_OOM_SKIP set, we should still have mm_is_oom_victim,
> > so no new task should be selected. If we race with the oom reaper then
> > OK, we would just not select a new victim and retry later.
> >
>
> How does mm_is_oom_victim() help? mm_is_oom_victim() is used by exit_mmap() to
> decide whether the current thread should call __oom_reap_task_mm().
>
> I'm talking about the sequence below (i.e. after returning from __oom_reap_task_mm()).
>
>     CPU 0                                          CPU 1
>
>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>   get_page_from_freelist() fails.
>   Enters out_of_memory().
>
>                                                    __oom_reap_task_mm() reclaims some memory.
>                                                    Sets MMF_OOM_SKIP.
>
>   select_bad_process() selects new victim because MMF_OOM_SKIP is already set.
>   Kills a new OOM victim without retrying the last second allocation attempt.
>   Leaves out_of_memory().
>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
OK, that wasn't clear from your wording above, as you explicitly
mentioned an !MMF_OOM_SKIP mm.
> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce a last second
> allocation attempt like below.
>
>     CPU 0                                          CPU 1
>
>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>   get_page_from_freelist() fails.
>   Enters out_of_memory().
>
>                                                    __oom_reap_task_mm() reclaims some memory.
>                                                    mutex_lock(&oom_lock);
>
>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
>   Leaves out_of_memory().
>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
>
>                                                    Sets MMF_OOM_SKIP.
>                                                    mutex_unlock(&oom_lock);
>
>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
>   Saved one OOM victim from being needlessly killed.
>
> That is, guarding setting MMF_OOM_SKIP works like synchronize_rcu(): it waits for
> anybody who already acquired (or started waiting for) oom_lock to release oom_lock,
> in order to prevent select_bad_process() from needlessly selecting a new OOM victim.
Hmm, is this a practical problem though? Do we really need to have a
broader locking context just to defeat this race? How about making this
a separate patch with some data justifying it?
--
Michal Hocko
SUSE Labs
On 2018/07/16 16:44, Michal Hocko wrote:
>> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce a last second
>> allocation attempt like below.
>>
>>     CPU 0                                          CPU 1
>>
>>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>>   get_page_from_freelist() fails.
>>   Enters out_of_memory().
>>
>>                                                    __oom_reap_task_mm() reclaims some memory.
>>                                                    mutex_lock(&oom_lock);
>>
>>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
>>   Leaves out_of_memory().
>>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
>>
>>                                                    Sets MMF_OOM_SKIP.
>>                                                    mutex_unlock(&oom_lock);
>>
>>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
>>   Saved one OOM victim from being needlessly killed.
>>
>> That is, guarding setting MMF_OOM_SKIP works like synchronize_rcu(): it waits for
>> anybody who already acquired (or started waiting for) oom_lock to release oom_lock,
>> in order to prevent select_bad_process() from needlessly selecting a new OOM victim.
>
> Hmm, is this a practical problem though? Do we really need to have a
> broader locking context just to defeat this race?
Yes, because you think that select_bad_process() might take a long time. It is
possible that MMF_OOM_SKIP is set while the owner of oom_lock is preempted, and
the window in which select_bad_process() finds an mm which got MMF_OOM_SKIP
immediately before examining that mm is not that small.
> How about making this
> a separate patch with some data justifying it?
>
No. We won't be able to get data until we let people test using released
kernels. I don't want to again get reports like
http://lkml.kernel.org/r/[email protected]
caused by not guarding MMF_OOM_SKIP.
On Mon 16-07-18 19:38:21, Tetsuo Handa wrote:
> On 2018/07/16 16:44, Michal Hocko wrote:
> >> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce a last second
> >> allocation attempt like below.
> >>
> >>     CPU 0                                          CPU 1
> >>
> >>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
> >>   get_page_from_freelist() fails.
> >>   Enters out_of_memory().
> >>
> >>                                                    __oom_reap_task_mm() reclaims some memory.
> >>                                                    mutex_lock(&oom_lock);
> >>
> >>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
> >>   Leaves out_of_memory().
> >>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
> >>
> >>                                                    Sets MMF_OOM_SKIP.
> >>                                                    mutex_unlock(&oom_lock);
> >>
> >>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
> >>   Saved one OOM victim from being needlessly killed.
> >>
> >> That is, guarding setting MMF_OOM_SKIP works like synchronize_rcu(): it waits for
> >> anybody who already acquired (or started waiting for) oom_lock to release oom_lock,
> >> in order to prevent select_bad_process() from needlessly selecting a new OOM victim.
> >
> > Hmm, is this a practical problem though? Do we really need to have a
> > broader locking context just to defeat this race?
>
> Yes, because you think that select_bad_process() might take a long time. It is
> possible that MMF_OOM_SKIP is set while the owner of oom_lock is preempted, and
> the window in which select_bad_process() finds an mm which got MMF_OOM_SKIP
> immediately before examining that mm is not that small.
I only care whether the race is practical to hit. And that is why I would
like a simplification first (so drop the oom_lock in the oom_reaper
path) and then follow up with some decent justification on top.
--
Michal Hocko
SUSE Labs
On Fri, 13 Jul 2018, Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0fe4087d5151..e6328cef090f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
> > * Tell all users of get_user/copy_from_user etc... that the content
> > * is no longer stable. No barriers really needed because unmapping
> > * should imply barriers already and the reader would hit a page fault
> > - * if it stumbled over a reaped memory.
> > + * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> > + * reaping has already occurred so there is nothing left to do.
> > */
> > - set_bit(MMF_UNSTABLE, &mm->flags);
> > + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> > + return;
>
> This could lead to premature oom victim selection
> oom_reaper                              exiting victim
> oom_reap_task                           exit_mmap
>   __oom_reap_task_mm                      __oom_reap_task_mm
>     test_and_set_bit(MMF_UNSTABLE) # wins the race
>                                           test_and_set_bit(MMF_UNSTABLE)
>                                         set_bit(MMF_OOM_SKIP) # new victim can be selected now.
>
This is not the current state of the code in the -mm tree: MMF_OOM_SKIP
only gets set by the oom reaper when the timeout has expired and the
victim has failed to free memory in the exit path.
> Besides that, why should we back off in the first place? We can
> race the two without any problems AFAICS. We already do have proper
> synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
>
test_and_set_bit() here is not strictly required; I thought it was better
since any unmapping done in this context is going to be handled by
whichever thread set MMF_UNSTABLE.
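(For reference: set_bit() is fire-and-forget, while test_and_set_bit() is an
atomic read-modify-write that also returns the bit's previous value -- which is
what gives the loser of the race a way to back out.)

        /*
         * Both exit_mmap() and the oom reaper may get here concurrently;
         * the atomic RMW guarantees exactly one of them observes the old
         * value 0 and proceeds to unmap, while the other backs out.
         */
        if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
                return;         /* the other thread owns the unmapping */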
On Sat, 14 Jul 2018, Tetsuo Handa wrote:
> David is making changes using timeout based back off (in linux-next.git)
> which is inappropriately trying to use MMF_UNSTABLE for two purposes.
>
If you believe there is a problem with the use of MMF_UNSTABLE as it sits
in -mm, please follow up directly in the thread that proposed the patch.
I have seen two replies to that thread from you: one that incorporates it
into your work, and one that links to a version of my patch in your
patchset. I haven't seen a concern raised about the use of MMF_UNSTABLE,
but perhaps it's somewhere in the 10,000 other emails about the oom
killer.