From: Michal Hocko <[email protected]>
David has noticed that the oom killer might kill additional tasks while
the existing victim hasn't terminated yet, because the oom_reaper marks
the current victim MMF_OOM_SKIP too early, once mm->mm_users has dropped
down to 0. The race is as follows:
oom_reap_task                   do_exit
                                  exit_mm
  __oom_reap_task_mm
                                    mmput
                                      __mmput
    mmget_not_zero # fails
                                        exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)
Currently we try to reduce the risk of this race by taking oom_lock and
waiting for out_of_memory's sleep, which is done while holding the lock,
to give the victim some time to exit. This is a rather suboptimal approach
because there is no guarantee the victim (especially a large one) will manage
to unmap its address space and free enough memory in the particular oom
domain which needs memory (e.g. a specific NUMA node).
Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem for read).
The only tricky part is that we have to exclude page table teardown and all
operations which modify the address space in the __mmput path. exit_mmap
doesn't expect any other users, so it doesn't use any locking. Nothing
really forbids us from taking mmap_sem for write, though. In fact we are
already relying on this lock earlier in the __mmput path to synchronize
with ksm and khugepaged.
Take the exclusive mmap_sem when calling free_pgtables and destroying
vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
other operations can safely race with the parallel unmap.
Reported-by: David Rientjes <[email protected]>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
I am sending this as an RFC because I am not yet sure I haven't missed
something subtle here, but the approach should work in principle. I have
run it through some of my OOM stress tests to see if anything blows up
and it all went smoothly.
The issue has been brought up by David [1]. There were some attempts to
address it in the oom proper [2][3], but the first one would cause problems
of its own [4] while the latter is just too hairy.
Thoughts, objections, alternatives?
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]
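To make the end result easier to review, here is a rough sketch of what
exit_mmap() looks like with the patch applied (heavily condensed; the mmu
notifier, munlock and arch_exit_mmap handling is unchanged and elided here):

	void exit_mmap(struct mm_struct *mm)
	{
		struct mmu_gather tlb;
		struct vm_area_struct *vma = mm->mmap;
		unsigned long nr_accounted = 0;

		/* ... mmu_notifier_release, munlock, arch_exit_mmap elided ... */

		tlb_gather_mmu(&tlb, mm, 0, -1);
		/* this part is allowed to race with __oom_reap_task_mm (read lock) */
		unmap_vmas(&tlb, vma, 0, -1);

		/*
		 * oom reaper might race with exit_mmap so make sure we won't free
		 * page tables or unmap VMAs under its feet
		 */
		down_write(&mm->mmap_sem);
		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
		tlb_finish_mmu(&tlb, 0, -1);

		while (vma) {
			if (vma->vm_flags & VM_ACCOUNT)
				nr_accounted += vma_pages(vma);
			vma = remove_vma(vma);
		}
		mm->mmap = NULL;	/* nothing left for a later reaper pass to walk */
		vm_unacct_memory(nr_accounted);
		up_write(&mm->mmap_sem);
	}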
mm/mmap.c | 7 +++++++
mm/oom_kill.c | 40 ++--------------------------------------
2 files changed, 9 insertions(+), 38 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..253808e716dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
+ /*
+ * oom reaper might race with exit_mmap so make sure we won't free
+ * page tables or unmap VMAs under its feet
+ */
+ down_write(&mm->mmap_sem);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
@@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
nr_accounted += vma_pages(vma);
vma = remove_vma(vma);
}
+ mm->mmap = NULL;
vm_unacct_memory(nr_accounted);
+ up_write(&mm->mmap_sem);
}
/* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..5dc0ff22d567 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
struct vm_area_struct *vma;
bool ret = true;
- /*
- * We have to make sure to not race with the victim exit path
- * and cause premature new oom victim selection:
- * __oom_reap_task_mm exit_mm
- * mmget_not_zero
- * mmput
- * atomic_dec_and_test
- * exit_oom_victim
- * [...]
- * out_of_memory
- * select_bad_process
- * # no TIF_MEMDIE task selects new victim
- * unmap_page_range # frees some memory
- */
- mutex_lock(&oom_lock);
-
- if (!down_read_trylock(&mm->mmap_sem)) {
- ret = false;
- goto unlock_oom;
- }
-
- /*
- * increase mm_users only after we know we will reap something so
- * that the mmput_async is called only when we have reaped something
- * and delayed __mmput doesn't matter that much
- */
- if (!mmget_not_zero(mm)) {
- up_read(&mm->mmap_sem);
- goto unlock_oom;
- }
+ if (!down_read_trylock(&mm->mmap_sem))
+ return false;
/*
* Tell all users of get_user/copy_from_user etc... that the content
@@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_SHMEMPAGES)));
up_read(&mm->mmap_sem);
- /*
- * Drop our reference but make sure the mmput slow path is called from a
- * different context because we shouldn't risk we get stuck there and
- * put the oom_reaper out of the way.
- */
- mmput_async(mm);
-unlock_oom:
- mutex_unlock(&oom_lock);
return ret;
}
--
2.11.0
Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David has noticed that the oom killer might kill additional tasks while
> the existing victim hasn't terminated yet because the oom_reaper marks
> the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> to 0. The race is as follows
>
> oom_reap_task do_exit
> exit_mm
> __oom_reap_task_mm
> mmput
> __mmput
> mmget_not_zero # fails
> exit_mmap # frees memory
> set_bit(MMF_OOM_SKIP)
>
> Currently we are try to reduce a risk of this race by taking oom_lock
> and wait for out_of_memory sleep while holding the lock to give the
> victim some time to exit. This is quite suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory to the particular oom
> domain which needs a memory (e.g. a specific NUMA node).
>
> Fix this problem by allowing __oom_reap_task_mm and __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> The only tricky part is we have to exclude page tables tear down and all
> operations which modify the address space in the __mmput path. exit_mmap
> doesn't expect any other users so it doesn't use any locking. Nothing
> really forbids us to use mmap_sem for write, though. In fact we are
> already relying on this lock earlier in the __mmput path to synchronize
> with ksm and khugepaged.
>
> Take the exclusive mmap_sem when calling free_pgtables and destroying
> vmas to sync with __oom_reap_task_mm which take the lock for read. All
> other operations can safely race with the parallel unmap.
>
> Reported-by: David Rientjes <[email protected]>
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi,
> I am sending this as an RFC because I am not yet sure I haven't missed
> something subtle here but the appoach should work in principle. I have
> run it through some of my OOM stress tests to see if anything blows up
> and it all went smoothly.
>
> The issue has been brought up by David [1]. There were some attempts to
> address it in oom proper [2][3] but the first one would cause problems
> on their own [4] while the later is just too hairy.
>
> Thoughts, objections, alternatives?
I wonder why you prefer a timeout based approach. Your patch will after all
set MMF_OOM_SKIP if the operations between down_write() and up_write() take
more than one second. lock_anon_vma_root() from unlink_anon_vmas() from
free_pgtables(), for example, calls down_write()/up_write(). unlink_file_vma()
from free_pgtables(), for another example, calls down_write()/up_write().
This means that exit_mmap() might take more than one second
with mm->mmap_sem held for write, doesn't it?
The worst situation is that no memory is released by uprobe_clear_state(), exit_aio(),
ksm_exit(), khugepaged_exit() and the operations before down_write(&mm->mmap_sem), and then
one second elapses before any memory is released after down_write(&mm->mmap_sem).
In that case, the down_write()/up_write() in your patch helps nothing.
A less bad situation is that no memory is released by uprobe_clear_state(), exit_aio(),
ksm_exit(), khugepaged_exit() and the operations before down_write(&mm->mmap_sem), and then
only some memory is released after down_write(&mm->mmap_sem) before one second elapses.
Someone might think that this is still premature.
The more likely situation is that down_read_trylock(&mm->mmap_sem) in __oom_reap_task_mm()
succeeds before exit_mmap() calls down_write(&mm->mmap_sem) (especially if we remove
mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch merely gives
uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit() and the operations before
down_write(&mm->mmap_sem) some time to release memory, for your patch will after all set
MMF_OOM_SKIP immediately after __oom_reap_task_mm() has called up_read(&mm->mmap_sem). If we
assume that the majority of memory is released by the operations between
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm(), this is not a preferable
behavior.
My patch [3] cannot give uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit()
and exit_mm() some time to release memory. But [3] can guarantee that all memory which
the OOM reaper can reclaim is reclaimed before setting MMF_OOM_SKIP.
If we waited for another second after setting MMF_OOM_SKIP, we could give the operations between
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm() (in your patch) or __mmput()
(in my patch) some more chance to reclaim memory before the next OOM victim is selected.
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> [3] http://lkml.kernel.org/r/[email protected]
> [4] http://lkml.kernel.org/r/[email protected]
>
> mm/mmap.c | 7 +++++++
> mm/oom_kill.c | 40 ++--------------------------------------
> 2 files changed, 9 insertions(+), 38 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3bd5ecd20d4d..253808e716dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
>
> + /*
> + * oom reaper might race with exit_mmap so make sure we won't free
> + * page tables or unmap VMAs under its feet
> + */
> + down_write(&mm->mmap_sem);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb, 0, -1);
>
> @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> }
> + mm->mmap = NULL;
> vm_unacct_memory(nr_accounted);
> + up_write(&mm->mmap_sem);
> }
>
> /* Insert vm structure into process list sorted by address
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0e2c925e7826..5dc0ff22d567 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct vm_area_struct *vma;
> bool ret = true;
This "ret" is redundant.
>
> - /*
> - * We have to make sure to not race with the victim exit path
> - * and cause premature new oom victim selection:
> - * __oom_reap_task_mm exit_mm
> - * mmget_not_zero
> - * mmput
> - * atomic_dec_and_test
> - * exit_oom_victim
> - * [...]
> - * out_of_memory
> - * select_bad_process
> - * # no TIF_MEMDIE task selects new victim
> - * unmap_page_range # frees some memory
> - */
> - mutex_lock(&oom_lock);
You can remove mutex_lock(&oom_lock) here, but you should take mutex_lock(&oom_lock)
when setting MMF_OOM_SKIP, for the comment below from [2] will still be valid.
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * Serialize setting of MMF_OOM_SKIP using oom_lock in order to
+	 * avoid race with select_bad_process() which causes premature
+	 * new oom victim selection.
+	 *
+	 * The OOM reaper:            An allocating task:
+	 *                              Failed get_page_from_freelist().
+	 *                              Enters into out_of_memory().
+	 *   Reaped memory enough to make get_page_from_freelist() succeed.
+	 *   Sets MMF_OOM_SKIP to mm.
+	 *                              Enters into select_bad_process().
+	 *                              # MMF_OOM_SKIP mm selects new victim.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
Ideally, we should also take mutex_lock(&oom_lock) when setting MMF_OOM_SKIP from
__mmput(), for an allocating task does not call get_page_from_freelist() after
confirming that there is no !MMF_OOM_SKIP mm. Or, it would be possible to
let select_bad_process() abort once on an MMF_OOM_SKIP mm, using another bit.
> -
> - if (!down_read_trylock(&mm->mmap_sem)) {
> - ret = false;
> - goto unlock_oom;
> - }
> -
> - /*
> - * increase mm_users only after we know we will reap something so
> - * that the mmput_async is called only when we have reaped something
> - * and delayed __mmput doesn't matter that much
> - */
> - if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> - goto unlock_oom;
> - }
> + if (!down_read_trylock(&mm->mmap_sem))
> + return false;
>
> /*
> * Tell all users of get_user/copy_from_user etc... that the content
> @@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> K(get_mm_counter(mm, MM_SHMEMPAGES)));
> up_read(&mm->mmap_sem);
>
> - /*
> - * Drop our reference but make sure the mmput slow path is called from a
> - * different context because we shouldn't risk we get stuck there and
> - * put the oom_reaper out of the way.
> - */
> - mmput_async(mm);
> -unlock_oom:
> - mutex_unlock(&oom_lock);
> return ret;
This is "return true;".
> }
>
> --
> 2.11.0
On Tue 27-06-17 19:52:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > David has noticed that the oom killer might kill additional tasks while
> > the existing victim hasn't terminated yet because the oom_reaper marks
> > the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> >
> > oom_reap_task do_exit
> > exit_mm
> > __oom_reap_task_mm
> > mmput
> > __mmput
> > mmget_not_zero # fails
> > exit_mmap # frees memory
> > set_bit(MMF_OOM_SKIP)
> >
> > Currently we are try to reduce a risk of this race by taking oom_lock
> > and wait for out_of_memory sleep while holding the lock to give the
> > victim some time to exit. This is quite suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory to the particular oom
> > domain which needs a memory (e.g. a specific NUMA node).
> >
> > Fix this problem by allowing __oom_reap_task_mm and __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > The only tricky part is we have to exclude page tables tear down and all
> > operations which modify the address space in the __mmput path. exit_mmap
> > doesn't expect any other users so it doesn't use any locking. Nothing
> > really forbids us to use mmap_sem for write, though. In fact we are
> > already relying on this lock earlier in the __mmput path to synchronize
> > with ksm and khugepaged.
> >
> > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > vmas to sync with __oom_reap_task_mm which take the lock for read. All
> > other operations can safely race with the parallel unmap.
> >
> > Reported-by: David Rientjes <[email protected]>
> > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> >
> > Hi,
> > I am sending this as an RFC because I am not yet sure I haven't missed
> > something subtle here but the appoach should work in principle. I have
> > run it through some of my OOM stress tests to see if anything blows up
> > and it all went smoothly.
> >
> > The issue has been brought up by David [1]. There were some attempts to
> > address it in oom proper [2][3] but the first one would cause problems
> > on their own [4] while the later is just too hairy.
> >
> > Thoughts, objections, alternatives?
>
> I wonder why you prefer timeout based approach. Your patch will after all
> set MMF_OOM_SKIP if operations between down_write() and up_write() took
> more than one second.
if we reach down_write then we have unmapped the address space in
exit_mmap and oom reaper cannot do much more.
> lock_anon_vma_root() from unlink_anon_vmas() from
> free_pgtables() for example calls down_write()/up_write(). unlink_file_vma()
> from free_pgtables() for another example calls down_write()/up_write().
> This means that it might happen that exit_mmap() takes more than one second
> with mm->mmap_sem held for write, doesn't this?
>
> The worst situation is that no memory is released by uprobe_clear_state(), exit_aio(),
> ksm_exit(), khugepaged_exit() and operations before down_write(&mm->mmap_sem), and then
> one second elapses before some memory is released after down_write(&mm->mmap_sem).
> In that case, down_write()/up_write() in your patch helps nothing.
>
> Less worst situation is that no memory is released by uprobe_clear_state(), exit_aio(),
> ksm_exit(), khugepaged_exit() and operations before down_write(&mm->mmap_sem), and then
> only some memory is released after down_write(&mm->mmap_sem) before one second elapses.
> Someone might think that this is still premature.
This would basically mean that the oom victim had all its memory in
page tables and vma structures with basically nothing mapped. While this
is possible this is something oom reaper cannot really help with until
we start reclaiming page tables as well. I have had a plan for that but
never got to implement it so this is still on my todo list.
> More likely situation is that down_read_trylock(&mm->mmap_sem) in __oom_reap_task_mm()
> succeeds before exit_mmap() calls down_write(&mm->mmap_sem) (especially true if we remove
> mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch merely gives
> uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit() and operations before
> down_write(&mm->mmap_sem) some time to release memory, for your patch will after all set
> MMF_OOM_SKIP immediately after __oom_reap_task_mm() called up_read(&mm->mmap_sem). If we
> assume that majority of memory is released by operations between
> down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm(), this is not a preferable
> behavior.
>
> My patch [3] cannot give uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit()
> and exit_mm() some time to release memory. But [3] can guarantee that all memory which
> the OOM reaper can reclaim is reclaimed before setting MMF_OOM_SKIP.
This should be the case with this patch as well. We simply do not set
MMF_OOM_SKIP if there is something to unmap.
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> > I wonder why you prefer timeout based approach. Your patch will after all
> > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > more than one second.
>
> if we reach down_write then we have unmapped the address space in
> exit_mmap and oom reaper cannot do much more.
So, by the time down_write() is called, the majority of memory is already released, isn't it?
On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I wonder why you prefer timeout based approach. Your patch will after all
> > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > more than one second.
> >
> > if we reach down_write then we have unmapped the address space in
> > exit_mmap and oom reaper cannot do much more.
>
> So, by the time down_write() is called, majority of memory is already released, isn't it?
In most cases yes. To put it in other words: by the time exit_mmap
takes down_write there is nothing more the oom reaper could reclaim.
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > I wonder why you prefer timeout based approach. Your patch will after all
> > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > more than one second.
> > >
> > > if we reach down_write then we have unmapped the address space in
> > > exit_mmap and oom reaper cannot do much more.
> >
> > So, by the time down_write() is called, majority of memory is already released, isn't it?
>
> In most cases yes. To be put it in other words. By the time exit_mmap
> takes down_write there is nothing more oom reaper could reclaim.
>
Then aren't there two exceptions which your patch cannot cover: the
down_write(&mm->mmap_sem) calls in __ksm_exit() and __khugepaged_exit()?
Since for some reason exit_mmap() cannot be moved before the
ksm_exit(mm)/khugepaged_exit(mm) calls,
	ksm_exit(mm);
	khugepaged_exit(mm); /* must run before exit_mmap */
	exit_mmap(mm);
shouldn't we try __oom_reap_task_mm() before calling those down_write()s
if the mm is the OOM victim's?
On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > I wonder why you prefer timeout based approach. Your patch will after all
> > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > > more than one second.
> > > >
> > > > if we reach down_write then we have unmapped the address space in
> > > > exit_mmap and oom reaper cannot do much more.
> > >
> > > So, by the time down_write() is called, majority of memory is already released, isn't it?
> >
> > In most cases yes. To be put it in other words. By the time exit_mmap
> > takes down_write there is nothing more oom reaper could reclaim.
> >
> Then, aren't there two exceptions which your patch cannot guarantee;
> down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit() ?
Yes, it cannot. Those would be quite rare situations. Somebody holding
the mmap sem would have to block them for too long (and that "too long"
might actually be forever if we are livelocked). We cannot rule
that out of course, and I would argue that it would be more appropriate
to simply go after another task in those rare cases. There is not much
we can really do. At some point the oom reaper has to give up and move
on, otherwise we are back to square one where the OOM killer could deadlock...
Maybe we can actually get rid of this down_write but I would go that way
only when it proves to be a real issue.
> Since for some reason exit_mmap() cannot be brought to before
> ksm_exit(mm)/khugepaged_exit(mm) calls,
9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
ordering and the motivation.
>
> ksm_exit(mm);
> khugepaged_exit(mm); /* must run before exit_mmap */
> exit_mmap(mm);
>
> shouldn't we try __oom_reap_task_mm() before calling these down_write()
> if mm is OOM victim's?
This is what we try. We simply try to get mmap_sem for read and do our
work as soon as possible with the proposed patch. This is already an
improvement, no?
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > > I wonder why you prefer timeout based approach. Your patch will after all
> > > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > > > more than one second.
> > > > >
> > > > > if we reach down_write then we have unmapped the address space in
> > > > > exit_mmap and oom reaper cannot do much more.
> > > >
> > > > So, by the time down_write() is called, majority of memory is already released, isn't it?
> > >
> > > In most cases yes. To be put it in other words. By the time exit_mmap
> > > takes down_write there is nothing more oom reaper could reclaim.
> > >
> > Then, aren't there two exceptions which your patch cannot guarantee;
> > down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit() ?
>
> yes it cannot. Those would be quite rare situations. Somebody holding
> the mmap sem would have to block those to wait for too long (that too
> long might be for ever actually if we are livelocked). We cannot rule
> that out of course and I would argue that it would be more appropriate
> to simply go after another task in those rare cases. There is not much
> we can really do. At some point the oom reaper has to give up and move
> on otherwise we are back to square one when OOM could deadlock...
>
> Maybe we can actually get rid of this down_write but I would go that way
> only when it proves to be a real issue.
>
> > Since for some reason exit_mmap() cannot be brought to before
> > ksm_exit(mm)/khugepaged_exit(mm) calls,
>
> 9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
> ordering and the motivation.
I don't understand ksm or khugepaged. But that commit actually called
ksm_exit() just before free_pgtables() in exit_mmap(). It is ba76149f47d8c939
("thp: khugepaged") which added the /* must run before exit_mmap */ comment.
>
> >
> > ksm_exit(mm);
> > khugepaged_exit(mm); /* must run before exit_mmap */
> > exit_mmap(mm);
> >
> > shouldn't we try __oom_reap_task_mm() before calling these down_write()
> > if mm is OOM victim's?
>
> This is what we try. We simply try to get mmap_sem for read and do our
> work as soon as possible with the proposed patch. This is already an
> improvement, no?
We can ask the OOM reaper kernel thread to try to reap before the OOM killer
releases the oom_lock mutex. But that is not guaranteed. It is possible that
the OOM victim thread runs as far as the down_write() in __ksm_exit() or
__khugepaged_exit() before the OOM reaper kernel thread starts calling
down_read_trylock().
On Tue 27-06-17 23:26:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
[...]
> > > shouldn't we try __oom_reap_task_mm() before calling these down_write()
> > > if mm is OOM victim's?
> >
> > This is what we try. We simply try to get mmap_sem for read and do our
> > work as soon as possible with the proposed patch. This is already an
> > improvement, no?
>
> We can ask the OOM reaper kernel thread try to reap before the OOM killer
> releases oom_lock mutex. But that is not guaranteed. It is possible that
> the OOM victim thread is executed until down_write() in __ksm_exit() or
> __khugepaged_exit() and then the OOM reaper kernel thread starts calling
> down_read_trylock().
I strongly suspect we are going off on a tangent here. While I see your concern,
and yes, the approach can probably be improved, can we focus on one thing
at a time? I would like to fix the original problem first and only
then go deeper down the rat hole of other subtle details. Do you have
any fundamental objection to the suggested approach, or do you see any issues
with it?
--
Michal Hocko
SUSE Labs
Forgot to CC Hugh.
Hugh, Andrew, do you see this could cause any problem wrt.
ksm/khugepaged exit path?
On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David has noticed that the oom killer might kill additional tasks while
> the existing victim hasn't terminated yet because the oom_reaper marks
> the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> to 0. The race is as follows
>
> oom_reap_task do_exit
> exit_mm
> __oom_reap_task_mm
> mmput
> __mmput
> mmget_not_zero # fails
> exit_mmap # frees memory
> set_bit(MMF_OOM_SKIP)
>
> Currently we are try to reduce a risk of this race by taking oom_lock
> and wait for out_of_memory sleep while holding the lock to give the
> victim some time to exit. This is quite suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory to the particular oom
> domain which needs a memory (e.g. a specific NUMA node).
>
> Fix this problem by allowing __oom_reap_task_mm and __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> The only tricky part is we have to exclude page tables tear down and all
> operations which modify the address space in the __mmput path. exit_mmap
> doesn't expect any other users so it doesn't use any locking. Nothing
> really forbids us to use mmap_sem for write, though. In fact we are
> already relying on this lock earlier in the __mmput path to synchronize
> with ksm and khugepaged.
>
> Take the exclusive mmap_sem when calling free_pgtables and destroying
> vmas to sync with __oom_reap_task_mm which take the lock for read. All
> other operations can safely race with the parallel unmap.
>
> Reported-by: David Rientjes <[email protected]>
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi,
> I am sending this as an RFC because I am not yet sure I haven't missed
> something subtle here but the appoach should work in principle. I have
> run it through some of my OOM stress tests to see if anything blows up
> and it all went smoothly.
>
> The issue has been brought up by David [1]. There were some attempts to
> address it in oom proper [2][3] but the first one would cause problems
> on their own [4] while the later is just too hairy.
>
> Thoughts, objections, alternatives?
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> [3] http://lkml.kernel.org/r/[email protected]
> [4] http://lkml.kernel.org/r/[email protected]
>
> mm/mmap.c | 7 +++++++
> mm/oom_kill.c | 40 ++--------------------------------------
> 2 files changed, 9 insertions(+), 38 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3bd5ecd20d4d..253808e716dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
>
> + /*
> + * oom reaper might race with exit_mmap so make sure we won't free
> + * page tables or unmap VMAs under its feet
> + */
> + down_write(&mm->mmap_sem);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb, 0, -1);
>
> @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> }
> + mm->mmap = NULL;
> vm_unacct_memory(nr_accounted);
> + up_write(&mm->mmap_sem);
> }
>
> /* Insert vm structure into process list sorted by address
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0e2c925e7826..5dc0ff22d567 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct vm_area_struct *vma;
> bool ret = true;
>
> - /*
> - * We have to make sure to not race with the victim exit path
> - * and cause premature new oom victim selection:
> - * __oom_reap_task_mm exit_mm
> - * mmget_not_zero
> - * mmput
> - * atomic_dec_and_test
> - * exit_oom_victim
> - * [...]
> - * out_of_memory
> - * select_bad_process
> - * # no TIF_MEMDIE task selects new victim
> - * unmap_page_range # frees some memory
> - */
> - mutex_lock(&oom_lock);
> -
> - if (!down_read_trylock(&mm->mmap_sem)) {
> - ret = false;
> - goto unlock_oom;
> - }
> -
> - /*
> - * increase mm_users only after we know we will reap something so
> - * that the mmput_async is called only when we have reaped something
> - * and delayed __mmput doesn't matter that much
> - */
> - if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> - goto unlock_oom;
> - }
> + if (!down_read_trylock(&mm->mmap_sem))
> + return false;
>
> /*
> * Tell all users of get_user/copy_from_user etc... that the content
> @@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> K(get_mm_counter(mm, MM_SHMEMPAGES)));
> up_read(&mm->mmap_sem);
>
> - /*
> - * Drop our reference but make sure the mmput slow path is called from a
> - * different context because we shouldn't risk we get stuck there and
> - * put the oom_reaper out of the way.
> - */
> - mmput_async(mm);
> -unlock_oom:
> - mutex_unlock(&oom_lock);
> return ret;
> }
>
> --
> 2.11.0
>
--
Michal Hocko
SUSE Labs
On Mon, 26 Jun 2017, Michal Hocko wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3bd5ecd20d4d..253808e716dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
>
> + /*
> + * oom reaper might race with exit_mmap so make sure we won't free
> + * page tables or unmap VMAs under its feet
> + */
> + down_write(&mm->mmap_sem);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb, 0, -1);
>
> @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> }
> + mm->mmap = NULL;
> vm_unacct_memory(nr_accounted);
> + up_write(&mm->mmap_sem);
> }
>
> /* Insert vm structure into process list sorted by address
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0e2c925e7826..5dc0ff22d567 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct vm_area_struct *vma;
> bool ret = true;
>
> - /*
> - * We have to make sure to not race with the victim exit path
> - * and cause premature new oom victim selection:
> - * __oom_reap_task_mm exit_mm
> - * mmget_not_zero
> - * mmput
> - * atomic_dec_and_test
> - * exit_oom_victim
> - * [...]
> - * out_of_memory
> - * select_bad_process
> - * # no TIF_MEMDIE task selects new victim
> - * unmap_page_range # frees some memory
> - */
> - mutex_lock(&oom_lock);
> -
> - if (!down_read_trylock(&mm->mmap_sem)) {
> - ret = false;
> - goto unlock_oom;
> - }
> -
> - /*
> - * increase mm_users only after we know we will reap something so
> - * that the mmput_async is called only when we have reaped something
> - * and delayed __mmput doesn't matter that much
> - */
> - if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> - goto unlock_oom;
> - }
> + if (!down_read_trylock(&mm->mmap_sem))
> + return false;
I think this should return true if mm->mmap == NULL here.
On Tue, 27 Jun 2017, Tetsuo Handa wrote:
> I wonder why you prefer timeout based approach. Your patch will after all
> set MMF_OOM_SKIP if operations between down_write() and up_write() took
> more than one second. lock_anon_vma_root() from unlink_anon_vmas() from
> free_pgtables() for example calls down_write()/up_write(). unlink_file_vma()
> from free_pgtables() for another example calls down_write()/up_write().
> This means that it might happen that exit_mmap() takes more than one second
> with mm->mmap_sem held for write, doesn't this?
>
I certainly have no objection to increasing the timeout period or
increasing MAX_OOM_REAP_RETRIES to be substantially higher. All threads
holding mm->mmap_sem should be oom killed and be able to access memory
reserves to make forward progress if they fail to reclaim. If we are
truly blocked on mm->mmap_sem, waiting longer than one second to declare
that seems justifiable to prevent the exact situation you describe.
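For context, the retry logic in question is roughly the following (a
simplified sketch of the oom_reap_task() loop; the function name and the
exact details here are only illustrative):

	#define MAX_OOM_REAP_RETRIES 10

	/*
	 * Simplified sketch of the oom_reaper retry loop being discussed
	 * (mm/oom_kill.c; diagnostics and task bookkeeping omitted).
	 */
	static void oom_reap_task_sketch(struct task_struct *tsk, struct mm_struct *mm)
	{
		int attempts = 0;

		/* __oom_reap_task_mm() fails when it cannot take mmap_sem for read */
		while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ / 10);	/* the ~1s total is 10 * HZ/10 */

		/* reaped or given up: hide this mm from further victim selection */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}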
On Mon 10-07-17 16:55:22, David Rientjes wrote:
> On Mon, 26 Jun 2017, Michal Hocko wrote:
>
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 3bd5ecd20d4d..253808e716dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > unmap_vmas(&tlb, vma, 0, -1);
> >
> > + /*
> > + * oom reaper might race with exit_mmap so make sure we won't free
> > + * page tables or unmap VMAs under its feet
> > + */
> > + down_write(&mm->mmap_sem);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb, 0, -1);
> >
> > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > }
> > + mm->mmap = NULL;
> > vm_unacct_memory(nr_accounted);
> > + up_write(&mm->mmap_sem);
> > }
> >
> > /* Insert vm structure into process list sorted by address
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0e2c925e7826..5dc0ff22d567 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > struct vm_area_struct *vma;
> > bool ret = true;
> >
> > - /*
> > - * We have to make sure to not race with the victim exit path
> > - * and cause premature new oom victim selection:
> > - * __oom_reap_task_mm exit_mm
> > - * mmget_not_zero
> > - * mmput
> > - * atomic_dec_and_test
> > - * exit_oom_victim
> > - * [...]
> > - * out_of_memory
> > - * select_bad_process
> > - * # no TIF_MEMDIE task selects new victim
> > - * unmap_page_range # frees some memory
> > - */
> > - mutex_lock(&oom_lock);
> > -
> > - if (!down_read_trylock(&mm->mmap_sem)) {
> > - ret = false;
> > - goto unlock_oom;
> > - }
> > -
> > - /*
> > - * increase mm_users only after we know we will reap something so
> > - * that the mmput_async is called only when we have reaped something
> > - * and delayed __mmput doesn't matter that much
> > - */
> > - if (!mmget_not_zero(mm)) {
> > - up_read(&mm->mmap_sem);
> > - goto unlock_oom;
> > - }
> > + if (!down_read_trylock(&mm->mmap_sem))
> > + return false;
>
> I think this should return true if mm->mmap == NULL here.
This?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5dc0ff22d567..e155d1d8064f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
struct mmu_gather tlb;
struct vm_area_struct *vma;
- bool ret = true;
if (!down_read_trylock(&mm->mmap_sem))
return false;
+ /* There is nothing to reap so bail out without signs in the log */
+ if (!mm->mmap)
+ goto unlock;
+
/*
* Tell all users of get_user/copy_from_user etc... that the content
* is no longer stable. No barriers really needed because unmapping
@@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
+unlock:
up_read(&mm->mmap_sem);
- return ret;
+ return true;
}
#define MAX_OOM_REAP_RETRIES 10
--
Michal Hocko
SUSE Labs
On Tue, 11 Jul 2017, Michal Hocko wrote:
> This?
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5dc0ff22d567..e155d1d8064f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> {
> struct mmu_gather tlb;
> struct vm_area_struct *vma;
> - bool ret = true;
>
> if (!down_read_trylock(&mm->mmap_sem))
> return false;
>
> + /* There is nothing to reap so bail out without signs in the log */
> + if (!mm->mmap)
> + goto unlock;
> +
> /*
> * Tell all users of get_user/copy_from_user etc... that the content
> * is no longer stable. No barriers really needed because unmapping
> @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> K(get_mm_counter(mm, MM_ANONPAGES)),
> K(get_mm_counter(mm, MM_FILEPAGES)),
> K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +unlock:
> up_read(&mm->mmap_sem);
>
> - return ret;
> + return true;
> }
>
> #define MAX_OOM_REAP_RETRIES 10
Yes, this folded in with the original RFC patch appears to work better
with light testing.
However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to
be increased as well to address the issue that Tetsuo pointed out. The
oom reaper shouldn't be required to do any work unless it is resolving a
livelock, and that scenario should be relatively rare. The oom killer
being a natural ultra slow path, I think it would be justifiable to wait
longer or retry more times than simply 1 second before declaring that
reaping is not possible. It reduces the likelihood of additional oom
killing.
On Tue 11-07-17 13:40:04, David Rientjes wrote:
> On Tue, 11 Jul 2017, Michal Hocko wrote:
>
> > This?
> > ---
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5dc0ff22d567..e155d1d8064f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > {
> > struct mmu_gather tlb;
> > struct vm_area_struct *vma;
> > - bool ret = true;
> >
> > if (!down_read_trylock(&mm->mmap_sem))
> > return false;
> >
> > + /* There is nothing to reap so bail out without signs in the log */
> > + if (!mm->mmap)
> > + goto unlock;
> > +
> > /*
> > * Tell all users of get_user/copy_from_user etc... that the content
> > * is no longer stable. No barriers really needed because unmapping
> > @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > K(get_mm_counter(mm, MM_ANONPAGES)),
> > K(get_mm_counter(mm, MM_FILEPAGES)),
> > K(get_mm_counter(mm, MM_SHMEMPAGES)));
> > +unlock:
> > up_read(&mm->mmap_sem);
> >
> > - return ret;
> > + return true;
> > }
> >
> > #define MAX_OOM_REAP_RETRIES 10
>
> Yes, this folded in with the original RFC patch appears to work better
> with light testing.
Yes folding it into the original patch was the plan. I would really
appreciate some Tested-by here.
> However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to
> be increased as well to address the issue that Tetsuo pointed out. The
> oom reaper shouldn't be required to do any work unless it is resolving a
> livelock, and that scenario should be relatively rare. The oom killer
> being a natural ultra slow path, I think it would be justifiable to wait
> longer or retry more times than simply 1 second before declaring that
> reaping is not possible. It reduces the likelihood of additional oom
> killing.
I believe that this is an independent issue and as such it should be
addressed separately along with some data backing up that decision. I am
not against improving the waiting logic. We would need some requeuing
when we cannot reap the victim because we cannot really wait too much
time on a single oom victim considering there might be many victims
queued (because of memcg ooms). This would obviously need some more code
and I am willing to implement that but I would like to see that this is
something that is a real problem first.
Thanks!
--
Michal Hocko
SUSE Labs
On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> Forgot to CC Hugh.
>
> Hugh, Andrew, do you see this could cause any problem wrt.
> ksm/khugepaged exit path?
ping. I would really appreciate some help here. I would like to resend
the patch soon.
> On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > David has noticed that the oom killer might kill additional tasks while
> > the existing victim hasn't terminated yet because the oom_reaper marks
> > the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> >
> > oom_reap_task do_exit
> > exit_mm
> > __oom_reap_task_mm
> > mmput
> > __mmput
> > mmget_not_zero # fails
> > exit_mmap # frees memory
> > set_bit(MMF_OOM_SKIP)
> >
> > Currently we are try to reduce a risk of this race by taking oom_lock
> > and wait for out_of_memory sleep while holding the lock to give the
> > victim some time to exit. This is quite suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory to the particular oom
> > domain which needs a memory (e.g. a specific NUMA node).
> >
> > Fix this problem by allowing __oom_reap_task_mm and __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > The only tricky part is we have to exclude page tables tear down and all
> > operations which modify the address space in the __mmput path. exit_mmap
> > doesn't expect any other users so it doesn't use any locking. Nothing
> > really forbids us to use mmap_sem for write, though. In fact we are
> > already relying on this lock earlier in the __mmput path to synchronize
> > with ksm and khugepaged.
> >
> > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > vmas to sync with __oom_reap_task_mm which take the lock for read. All
> > other operations can safely race with the parallel unmap.
> >
> > Reported-by: David Rientjes <[email protected]>
> > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> >
> > Hi,
> > I am sending this as an RFC because I am not yet sure I haven't missed
> > something subtle here but the appoach should work in principle. I have
> > run it through some of my OOM stress tests to see if anything blows up
> > and it all went smoothly.
> >
> > The issue has been brought up by David [1]. There were some attempts to
> > address it in oom proper [2][3] but the first one would cause problems
> > on their own [4] while the later is just too hairy.
> >
> > Thoughts, objections, alternatives?
> >
> > [1] http://lkml.kernel.org/r/[email protected]
> > [2] http://lkml.kernel.org/r/[email protected]
> > [3] http://lkml.kernel.org/r/[email protected]
> > [4] http://lkml.kernel.org/r/[email protected]
> >
> > mm/mmap.c | 7 +++++++
> > mm/oom_kill.c | 40 ++--------------------------------------
> > 2 files changed, 9 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 3bd5ecd20d4d..253808e716dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > unmap_vmas(&tlb, vma, 0, -1);
> >
> > + /*
> > + * oom reaper might race with exit_mmap so make sure we won't free
> > + * page tables or unmap VMAs under its feet
> > + */
> > + down_write(&mm->mmap_sem);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb, 0, -1);
> >
> > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > }
> > + mm->mmap = NULL;
> > vm_unacct_memory(nr_accounted);
> > + up_write(&mm->mmap_sem);
> > }
> >
> > /* Insert vm structure into process list sorted by address
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0e2c925e7826..5dc0ff22d567 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > struct vm_area_struct *vma;
> > bool ret = true;
> >
> > - /*
> > - * We have to make sure to not race with the victim exit path
> > - * and cause premature new oom victim selection:
> > - * __oom_reap_task_mm exit_mm
> > - * mmget_not_zero
> > - * mmput
> > - * atomic_dec_and_test
> > - * exit_oom_victim
> > - * [...]
> > - * out_of_memory
> > - * select_bad_process
> > - * # no TIF_MEMDIE task selects new victim
> > - * unmap_page_range # frees some memory
> > - */
> > - mutex_lock(&oom_lock);
> > -
> > - if (!down_read_trylock(&mm->mmap_sem)) {
> > - ret = false;
> > - goto unlock_oom;
> > - }
> > -
> > - /*
> > - * increase mm_users only after we know we will reap something so
> > - * that the mmput_async is called only when we have reaped something
> > - * and delayed __mmput doesn't matter that much
> > - */
> > - if (!mmget_not_zero(mm)) {
> > - up_read(&mm->mmap_sem);
> > - goto unlock_oom;
> > - }
> > + if (!down_read_trylock(&mm->mmap_sem))
> > + return false;
> >
> > /*
> > * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > K(get_mm_counter(mm, MM_SHMEMPAGES)));
> > up_read(&mm->mmap_sem);
> >
> > - /*
> > - * Drop our reference but make sure the mmput slow path is called from a
> > - * different context because we shouldn't risk we get stuck there and
> > - * put the oom_reaper out of the way.
> > - */
> > - mmput_async(mm);
> > -unlock_oom:
> > - mutex_unlock(&oom_lock);
> > return ret;
> > }
> >
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On Wed, 19 Jul 2017, Michal Hocko wrote:
> On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> > Forgot to CC Hugh.
> >
> > Hugh, Andrew, do you see this could cause any problem wrt.
> > ksm/khugepaged exit path?
>
> ping. I would really appreciate some help here. I would like to resend
> the patch soon.
Sorry, Michal, I've been hiding from everyone.
No, I don't think your patch will cause any trouble for the ksm or
khugepaged exit path; but we'll find out for sure when akpm puts it
in mmotm - I doubt I'll get to trying it out in advance of that.
On the contrary, I think it will allow us to remove the peculiar
"down_write(mmap_sem); up_write(mmap_sem);" from those exit paths:
which were there to serialize, precisely because exit_mmap() did
not otherwise take mmap_sem; but you're now changing it to do so.
You could add a patch to remove those yourself, or any of us add
that on afterwards.
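For reference, the pattern in question is roughly the tail of
__ksm_exit()/__khugepaged_exit(), something like this (simplified; the
helper name here is only illustrative):

	/*
	 * An "empty" write-lock section: its only purpose is to wait until
	 * ksmd / khugepaged work still running on this mm under mmap_sem has
	 * finished before exit_mmap() tears things down, because exit_mmap()
	 * itself has not been taking mmap_sem.
	 */
	static void serialize_against_exit_mmap(struct mm_struct *mm)
	{
		down_write(&mm->mmap_sem);	/* waits for current holders to drop the lock */
		up_write(&mm->mmap_sem);	/* nothing to protect; the wait was the point */
	}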
But I don't entirely agree (or disagree) with your placement:
see comment below.
>
> > On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > David has noticed that the oom killer might kill additional tasks while
> > > the existing victim hasn't terminated yet because the oom_reaper marks
> > > the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > > to 0. The race is as follows
> > >
> > > oom_reap_task do_exit
> > > exit_mm
> > > __oom_reap_task_mm
> > > mmput
> > > __mmput
> > > mmget_not_zero # fails
> > > exit_mmap # frees memory
> > > set_bit(MMF_OOM_SKIP)
> > >
> > > Currently we are try to reduce a risk of this race by taking oom_lock
> > > and wait for out_of_memory sleep while holding the lock to give the
> > > victim some time to exit. This is quite suboptimal approach because
> > > there is no guarantee the victim (especially a large one) will manage
> > > to unmap its address space and free enough memory to the particular oom
> > > domain which needs a memory (e.g. a specific NUMA node).
> > >
> > > Fix this problem by allowing __oom_reap_task_mm and __mmput path to
> > > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > > to run in parallel with other unmappers (hence the mmap_sem for read).
> > > The only tricky part is we have to exclude page tables tear down and all
> > > operations which modify the address space in the __mmput path. exit_mmap
> > > doesn't expect any other users so it doesn't use any locking. Nothing
> > > really forbids us to use mmap_sem for write, though. In fact we are
> > > already relying on this lock earlier in the __mmput path to synchronize
> > > with ksm and khugepaged.
> > >
> > > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > > vmas to sync with __oom_reap_task_mm which take the lock for read. All
> > > other operations can safely race with the parallel unmap.
> > >
> > > Reported-by: David Rientjes <[email protected]>
> > > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > >
> > > Hi,
> > > I am sending this as an RFC because I am not yet sure I haven't missed
> > > something subtle here but the appoach should work in principle. I have
> > > run it through some of my OOM stress tests to see if anything blows up
> > > and it all went smoothly.
> > >
> > > The issue has been brought up by David [1]. There were some attempts to
> > > address it in oom proper [2][3] but the first one would cause problems
> > > on their own [4] while the later is just too hairy.
> > >
> > > Thoughts, objections, alternatives?
> > >
> > > [1] http://lkml.kernel.org/r/[email protected]
> > > [2] http://lkml.kernel.org/r/[email protected]
> > > [3] http://lkml.kernel.org/r/[email protected]
> > > [4] http://lkml.kernel.org/r/[email protected]
> > >
> > > mm/mmap.c | 7 +++++++
> > > mm/oom_kill.c | 40 ++--------------------------------------
> > > 2 files changed, 9 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 3bd5ecd20d4d..253808e716dc 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > unmap_vmas(&tlb, vma, 0, -1);
> > >
> > > + /*
> > > + * oom reaper might race with exit_mmap so make sure we won't free
> > > + * page tables or unmap VMAs under its feet
> > > + */
> > > + down_write(&mm->mmap_sem);
Hmm. I'm conflicted about this. From a design point of view, I would
very much prefer you to take the mmap_sem higher up, maybe just before
or after the mmu_notifier_release() or arch_exit_mmap() (depends on
what those actually do): anyway before the unmap_vmas().
Because the things which go on in exit_mmap() are things which we expect
mmap_sem to be held across, and we get caught out when it is not: it's
awkward and error-prone enough that MADV_DONTNEED and MADV_FREE (for
very good reason) do things with only down_read(mmap_sem). But there's
a number of times (ksm exit being only one of them) when I've found it
a nuisance that we had no proper way of serializing against exit_mmap().
I'm conflicted because, on the other hand, I'm staunchly against adding
obstructions ("robust" futexes? gah!) into the exit path, or widening
the use of locks that are not strictly needed. But wouldn't it be the
case here, that most contenders on the mmap_sem must hold a reference
to mm_users, and that prevents any possibility of racing exit_mmap();
only ksm and khugepaged, and any others who already need such mmap_sem
tricks to serialize against exit_mmap(), could offer any contention.
But I haven't looked at the oom_kill or oom_reaper end of it at all,
perhaps you have an overriding argument on the placement from that end.
Hugh
[Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct;
but again I might not appreciate the full picture.]
> > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > tlb_finish_mmu(&tlb, 0, -1);
> > >
> > > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> > > nr_accounted += vma_pages(vma);
> > > vma = remove_vma(vma);
> > > }
> > > + mm->mmap = NULL;
> > > vm_unacct_memory(nr_accounted);
> > > + up_write(&mm->mmap_sem);
> > > }
> > >
> > > /* Insert vm structure into process list sorted by address
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 0e2c925e7826..5dc0ff22d567 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > > struct vm_area_struct *vma;
> > > bool ret = true;
> > >
> > > - /*
> > > - * We have to make sure to not race with the victim exit path
> > > - * and cause premature new oom victim selection:
> > > - * __oom_reap_task_mm exit_mm
> > > - * mmget_not_zero
> > > - * mmput
> > > - * atomic_dec_and_test
> > > - * exit_oom_victim
> > > - * [...]
> > > - * out_of_memory
> > > - * select_bad_process
> > > - * # no TIF_MEMDIE task selects new victim
> > > - * unmap_page_range # frees some memory
> > > - */
> > > - mutex_lock(&oom_lock);
> > > -
> > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > - ret = false;
> > > - goto unlock_oom;
> > > - }
> > > -
> > > - /*
> > > - * increase mm_users only after we know we will reap something so
> > > - * that the mmput_async is called only when we have reaped something
> > > - * and delayed __mmput doesn't matter that much
> > > - */
> > > - if (!mmget_not_zero(mm)) {
> > > - up_read(&mm->mmap_sem);
> > > - goto unlock_oom;
> > > - }
> > > + if (!down_read_trylock(&mm->mmap_sem))
> > > + return false;
> > >
> > > /*
> > > * Tell all users of get_user/copy_from_user etc... that the content
> > > @@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > > K(get_mm_counter(mm, MM_SHMEMPAGES)));
> > > up_read(&mm->mmap_sem);
> > >
> > > - /*
> > > - * Drop our reference but make sure the mmput slow path is called from a
> > > - * different context because we shouldn't risk we get stuck there and
> > > - * put the oom_reaper out of the way.
> > > - */
> > > - mmput_async(mm);
> > > -unlock_oom:
> > > - mutex_unlock(&oom_lock);
> > > return ret;
> > > }
> > >
> > > --
> > > 2.11.0
> > >
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs
On Wed 19-07-17 18:18:27, Hugh Dickins wrote:
> On Wed, 19 Jul 2017, Michal Hocko wrote:
> > On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> > > Forgot to CC Hugh.
> > >
> > > Hugh, Andrew, do you see this could cause any problem wrt.
> > > ksm/khugepaged exit path?
> >
> > ping. I would really appreciate some help here. I would like to resend
> > the patch soon.
>
> Sorry, Michal, I've been hiding from everyone.
>
> No, I don't think your patch will cause any trouble for the ksm or
> khugepaged exit path; but we'll find out for sure when akpm puts it
> in mmotm - I doubt I'll get to trying it out in advance of that.
>
> On the contrary, I think it will allow us to remove the peculiar
> "down_write(mmap_sem); up_write(mmap_sem);" from those exit paths:
> which were there to serialize, precisely because exit_mmap() did
> not otherwise take mmap_sem; but you're now changing it to do so.
I actually suspected this could be done but didn't get to study the
code enough to be sure; your words are certainly encouraging...
> You could add a patch to remove those yourself, or any of us add
> that on afterwards.
I will add it on my todo list and let's see when I get there.
> But I don't entirely agree (or disagree) with your placement:
> see comment below.
[...]
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 3bd5ecd20d4d..253808e716dc 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > > > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > > unmap_vmas(&tlb, vma, 0, -1);
> > > >
> > > > + /*
> > > > + * oom reaper might race with exit_mmap so make sure we won't free
> > > > + * page tables or unmap VMAs under its feet
> > > > + */
> > > > + down_write(&mm->mmap_sem);
>
> Hmm. I'm conflicted about this. From a design point of view, I would
> very much prefer you to take the mmap_sem higher up, maybe just before
> or after the mmu_notifier_release() or arch_exit_mmap() (depends on
> what those actually do): anyway before the unmap_vmas().
The thing is that I _want_ unmap_vmas to race with the oom reaper, so I
cannot take the write lock before unmap_vmas... If this whole area should
be covered by the write lock then I would need a handshake mechanism
between the oom reaper and the final unmap_vmas to know that oom reaper
won't set MMF_OOM_SKIP prematurely (see more on that below).
> Because the things which go on in exit_mmap() are things which we expect
> mmap_sem to be held across, and we get caught out when it is not: it's
> awkard and error-prone enough that MADV_DONTNEED and MADV_FREE (for
> very good reason) do things with only down_read(mmap_sem). But there's
> a number of times (ksm exit being only one of them) when I've found it
> a nuisance that we had no proper way of serializing against exit_mmap().
>
> I'm conflicted because, on the other hand, I'm staunchly against adding
> obstructions ("robust" futexes? gah!) into the exit patch, or widening
> the use of locks that are not strictly needed. But wouldn't it be the
> case here, that most contenders on the mmap_sem must hold a reference
> to mm_users, and that prevents any possibility of racing exit_mmap();
> only ksm and khugepaged, and any others who already need such mmap_sem
> tricks to serialize against exit_mmap(), could offer any contention.
>
> But I haven't looked at the oom_kill or oom_reaper end of it at all,
> perhaps you have an overriding argument on the placement from that end.
Well, the main problem here is that the oom_reaper tries to
MADV_DONTNEED the oom victim and then hide it from the oom killer (by
setting MMF_OOM_SKIP) to guarantee forward progress. In order to do
that it needs mmap_sem for read. Currently we try to avoid races with
the exit path by checking mm->mm_users, and that can lead to a premature
MMF_OOM_SKIP, which in turn leads to additional oom victim(s) being selected
while the current one is still tearing the address space down.
One way around that is to allow the final unmap to race with the oom_reaper
tear down.
I hope this clarifies the motivation.
> Hugh
>
> [Not strictly relevant here, but a related note: I was very surprised
> to discover, only quite recently, how handle_mm_fault() may be called
> without down_read(mmap_sem) - when core dumping. That seems a
> misguided optimization to me, which would also be nice to correct;
> but again I might not appreciate the full picture.]
shrug
--
Michal Hocko
SUSE Labs
On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:18:27, Hugh Dickins wrote:
> >
> > But I haven't looked at the oom_kill or oom_reaper end of it at all,
> > perhaps you have an overriding argument on the placement from that end.
>
> Well, the main problem here is that the oom_reaper tries to
> MADV_DONTNEED the oom victim and then hide it from the oom killer (by
> setting MMF_OOM_SKIP) to guarantee a forward progress. In order to do
> that it needs mmap_sem for read. Currently we try to avoid races with
> the eixt path by checking mm->mm_users and that can lead to premature
> MMF_OOM_SKIP and that in turn to additional oom victim(s) selection
> while the current one is still tearing the address space down.
>
> One way around that is to allow final unmap race with the oom_reaper
> tear down.
>
> I hope this clarify the motivation
Thanks, yes, if you have a good reason of that kind, then I agree that
it's appropriate to leave the down_write(mmap_sem) until reaching the
free_pgtables() stage.
Hugh