With the oom-killer being able to operate on locked pages, exit_mmap
does not need to ensure that oom_reap_task_mm is done before it can
proceed. Instead it can rely on mmap_lock write lock to prevent
oom-killer from operating on the vma tree while it's freeing page
tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
and then take mmap_lock write lock before freeing page tables.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/oom.h | 2 --
mm/mmap.c | 25 ++++++-------------------
mm/oom_kill.c | 2 +-
3 files changed, 7 insertions(+), 22 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..6cdf0772dbae 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
return 0;
}
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
long oom_badness(struct task_struct *p,
unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
index 313b57d55a63..feaa840fb95d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_lock for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_lock is
- * dropped.
- *
- * Nothing can be holding mm->mmap_lock here and the above call
- * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
- * __oom_reap_task_mm() will not block.
- */
- (void)__oom_reap_task_mm(mm);
- set_bit(MMF_OOM_SKIP, &mm->flags);
- }
-
- mmap_write_lock(mm);
+ mmap_read_lock(mm);
arch_exit_mmap(mm);
vma = mm->mmap;
if (!vma) {
/* Can happen if dup_mmap() received an OOM */
- mmap_write_unlock(mm);
+ mmap_read_unlock(mm);
return;
}
@@ -3138,6 +3121,10 @@ void exit_mmap(struct mm_struct *mm)
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
+ mmap_read_unlock(mm);
+ /* Set MMF_OOM_SKIP to disregard this mm from further consideration.*/
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ mmap_write_lock(mm);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 49d7df39b02d..36355b162727 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
static struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
bool ret = true;
--
2.36.0.512.ge40c2bad7a-goog
On Mon 09-05-22 20:00:13, Suren Baghdasaryan wrote:
> With the oom-killer being able to operate on locked pages, exit_mmap
> does not need to ensure that oom_reap_task_mm is done before it can
> proceed. Instead it can rely on mmap_lock write lock to prevent
> oom-killer from operating on the vma tree while it's freeing page
> tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> and then take mmap_lock write lock before freeing page tables.
The changelog is rather light on nasty details which might be good but
for the sake of our future us let's be more verbose so that we do not
have to reinvent the prior history each time we are looking into this
code. I would go with something like this instead:
"
The primary reason to invoke the oom reaper from the exit_mmap path used
to be a prevention of an excessive oom killing if the oom victim exit
races with the oom reaper (see 212925802454 ("mm: oom: let oom_reap_task
and exit_mmap run concurrently") for more details). The invocation has
moved around since then because of the interaction with the munlock
logic but the underlying reason has remained the same (see 27ae357fa82b
("mm, oom: fix concurrent munlock and oom reaper unmap, v3")).

Munlock code is no longer a problem since a213e5cf71cb ("mm/munlock:
delete munlock_vma_pages_all(), allow oomreap") and there shouldn't be
any blocking operation before the memory is unmapped by exit_mmap so
the oom reaper invocation can be dropped. The unmapping part can be done
with the non-exclusive mmap_sem and the exclusive one is only required
when page tables are freed.

Remove the oom_reaper from exit_mmap which will make the code easier to
read. This is really unlikely to make any observable difference although
some microbenchmarks could benefit from one less branch that needs to be
evaluated even though it almost never is true.
"
One minor comment below. Other than that \o/ this is finally going away.
I strongly suspect that the history of this code is a nice example of how
over-optimizing code can cause more harm than good.
Acked-by: Michal Hocko <[email protected]>
Thanks!
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/oom.h | 2 --
> mm/mmap.c | 25 ++++++-------------------
> mm/oom_kill.c | 2 +-
> 3 files changed, 7 insertions(+), 22 deletions(-)
>
[...]
> @@ -3138,6 +3121,10 @@ void exit_mmap(struct mm_struct *mm)
> /* update_hiwater_rss(mm) here? but nobody should be looking */
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
> + mmap_read_unlock(mm);
> + /* Set MMF_OOM_SKIP to disregard this mm from further consideration.*/
> + set_bit(MMF_OOM_SKIP, &mm->flags);
I think that it would be slightly more readable to add an empty line
above and below this. Also the comment would be more helpful if it
explained what the further consideration actually means. I would go with
/*
* Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
* because the memory has been already freed. Do not bother
* checking mm_is_oom_victim because setting a bit
* unconditionally is just cheaper.
*/
> + mmap_write_lock(mm);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb);
--
Michal Hocko
SUSE Labs
On 5/9/22 9:00 PM, Suren Baghdasaryan wrote:
> With the oom-killer being able to operate on locked pages, exit_mmap
> does not need to ensure that oom_reap_task_mm is done before it can
> proceed. Instead it can rely on mmap_lock write lock to prevent
> oom-killer from operating on the vma tree while it's freeing page
> tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> and then take mmap_lock write lock before freeing page tables.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/oom.h | 2 --
> mm/mmap.c | 25 ++++++-------------------
> mm/oom_kill.c | 2 +-
> 3 files changed, 7 insertions(+), 22 deletions(-)
>
How does this improve the test? Include the information on why this
change is needed as opposed to describing what this does?
thanks,
-- Shuah
On Tue, May 10, 2022 at 8:46 AM Shuah Khan <[email protected]> wrote:
>
> On 5/9/22 9:00 PM, Suren Baghdasaryan wrote:
> > With the oom-killer being able to operate on locked pages, exit_mmap
> > does not need to ensure that oom_reap_task_mm is done before it can
> > proceed. Instead it can rely on mmap_lock write lock to prevent
> > oom-killer from operating on the vma tree while it's freeing page
> > tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> > and then take mmap_lock write lock before freeing page tables.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/oom.h | 2 --
> > mm/mmap.c | 25 ++++++-------------------
> > mm/oom_kill.c | 2 +-
> > 3 files changed, 7 insertions(+), 22 deletions(-)
> >
>
> How does this improve the test? Include the information on why this
> change is needed as opposed to describing what this does?
It doesn't improve the test. I used the test to verify this change and
wanted to keep them together so that others have an easy way to
exercise the same code path. That's the only relation between the test
and this cleanup. I'll split them into separate patchsets to avoid
further confusion.
>
> thanks,
> -- Shuah
On Tue, May 10, 2022 at 6:06 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-05-22 20:00:13, Suren Baghdasaryan wrote:
> > With the oom-killer being able to operate on locked pages, exit_mmap
> > does not need to ensure that oom_reap_task_mm is done before it can
> > proceed. Instead it can rely on mmap_lock write lock to prevent
> > oom-killer from operating on the vma tree while it's freeing page
> > tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> > and then take mmap_lock write lock before freeing page tables.
>
> The changelog is rather light on nasty details which might be good but
> for the sake of our future us let's be more verbose so that we do not
> have to reinvent the prior history each time we are looking into this
> code. I would go with something like this instead:
> "
> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see 212925802454 ("mm: oom: let oom_reap_task
> and exit_mmap run concurrently") for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see 27ae357fa82b
> ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")).
>
> Munlock code is no longer a problem since a213e5cf71cb ("mm/munlock:
> delete munlock_vma_pages_all(), allow oomreap") and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
>
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> "
Looks great! Thanks for collecting all the history. Will update the description.
>
> One minor comment below. Other than that \o/ this is finally going away.
> I strongly suspect that the history of this code is a nice example of how
> over-optimizing code can cause more harm than good.
>
> Acked-by: Michal Hocko <[email protected]>
Thanks.
>
> Thanks!
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/oom.h | 2 --
> > mm/mmap.c | 25 ++++++-------------------
> > mm/oom_kill.c | 2 +-
> > 3 files changed, 7 insertions(+), 22 deletions(-)
> >
> [...]
> > @@ -3138,6 +3121,10 @@ void exit_mmap(struct mm_struct *mm)
> > /* update_hiwater_rss(mm) here? but nobody should be looking */
> > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > unmap_vmas(&tlb, vma, 0, -1);
> > + mmap_read_unlock(mm);
> > + /* Set MMF_OOM_SKIP to disregard this mm from further consideration.*/
> > + set_bit(MMF_OOM_SKIP, &mm->flags);
>
> I think that it would be slightly more readable to add an empty line
> above and below this. Also the comment would be more helpful if it
> explained what the further consideration actually means. I would go with
>
> /*
> * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> * because the memory has been already freed. Do not bother
> * checking mm_is_oom_victim because setting a bit
> * unconditionally is just cheaper.
> */
>
Ack.
> > + mmap_write_lock(mm);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb);
>
> --
> Michal Hocko
> SUSE Labs
On Tue 10-05-22 09:31:50, Suren Baghdasaryan wrote:
> On Tue, May 10, 2022 at 6:06 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-05-22 20:00:13, Suren Baghdasaryan wrote:
> > > With the oom-killer being able to operate on locked pages, exit_mmap
> > > does not need to ensure that oom_reap_task_mm is done before it can
> > > proceed. Instead it can rely on mmap_lock write lock to prevent
> > > oom-killer from operating on the vma tree while it's freeing page
> > > tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> > > and then take mmap_lock write lock before freeing page tables.
> >
> > The changelog is rather light on nasty details which might be good but
> > for the sake of our future us let's be more verbose so that we do not
> > have to reinvent the prior history each time we are looking into this
> > code. I would go with something like this instead:
> > "
> > The primary reason to invoke the oom reaper from the exit_mmap path used
> > to be a prevention of an excessive oom killing if the oom victim exit
> > races with the oom reaper (see 212925802454 ("mm: oom: let oom_reap_task
> > and exit_mmap run concurrently") for more details). The invocation has
> > moved around since then because of the interaction with the munlock
> > logic but the underlying reason has remained the same (see 27ae357fa82b
> > ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")).
> >
> > Munlock code is no longer a problem since a213e5cf71cb ("mm/munlock:
> > delete munlock_vma_pages_all(), allow oomreap") and there shouldn't be
> > any blocking operation before the memory is unmapped by exit_mmap so
> > the oom reaper invocation can be dropped. The unmapping part can be done
> > with the non-exclusive mmap_sem and the exclusive one is only required
> > when page tables are freed.
> >
> > Remove the oom_reaper from exit_mmap which will make the code easier to
> > read. This is really unlikely to make any observable difference although
> > some microbenchmarks could benefit from one less branch that needs to be
> > evaluated even though it almost never is true.
> > "
>
> Looks great! Thanks for collecting all the history. Will update the description.
Please make sure you double check the story. This is mostly my
recollection and brief reading through the said commits. I might
misremember here and there.
--
Michal Hocko
SUSE Labs
On Tue, May 10, 2022 at 1:53 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 10-05-22 09:31:50, Suren Baghdasaryan wrote:
> > On Tue, May 10, 2022 at 6:06 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-05-22 20:00:13, Suren Baghdasaryan wrote:
> > > > With the oom-killer being able to operate on locked pages, exit_mmap
> > > > does not need to ensure that oom_reap_task_mm is done before it can
> > > > proceed. Instead it can rely on mmap_lock write lock to prevent
> > > > oom-killer from operating on the vma tree while it's freeing page
> > > > tables. exit_mmap can hold mmap_lock read lock when unmapping vmas
> > > > and then take mmap_lock write lock before freeing page tables.
> > >
> > > The changelog is rather light on nasty details which might be good but
> > > for the sake of our future us let's be more verbose so that we do not
> > > have to reinvent the prior history each time we are looking into this
> > > code. I would go with something like this instead:
> > > "
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see 212925802454 ("mm: oom: let oom_reap_task
> > > and exit_mmap run concurrently") for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see 27ae357fa82b
> > > ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")).
> > >
> > > Munlock code is no longer a problem since a213e5cf71cb ("mm/munlock:
> > > delete munlock_vma_pages_all(), allow oomreap") and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > > "
> >
> > Looks great! Thanks for collecting all the history. Will update the description.
>
> Please make sure you double check the story. This is mostly my
> recollection and brief reading through the said commits. I might
> misremember here and there.
Will do. Thanks!
> --
> Michal Hocko
> SUSE Labs