2021-12-08 21:22:21

by Suren Baghdasaryan

Subject: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

The oom reaper and the process_mrelease system call should protect
against races with exit_mmap, which can destroy page tables while they
walk the VMA tree. The oom reaper protects against that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
before taking and releasing mmap_write_lock. process_mrelease has
to elevate mm->mm_users to prevent such a race. Both the oom reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree.
The locking rules and mechanisms could be simpler if exit_mmap takes
mmap_write_lock while executing destructive operations such as
free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables and remove_vma. Operations like unmap_vmas and
unlock_range are not destructive and could run under mmap_read_lock,
but for simplicity we take one mmap_write_lock during almost the entire
operation. Note also that because the oom reaper checks the VM_LOCKED
flag, unlock_range() should not be allowed to race with it.
Before this patch, remove_vma used to be called with no locks held;
however, with fput being executed asynchronously and vm_ops->close
not being allowed to hold mmap_lock (it is called from __split_vma
with mmap_lock held for write), changing that should be fine.
In most cases this lock should be uncontended. Previously, Kirill
reported a ~4% regression caused by a similar change [1]. We reran the
same test and, although the individual results are quite noisy, the
percentiles show a lower regression, with 1.6% being the worst case [2].
The change allows the oom reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
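
For illustration, a simplified sketch (not the actual oom_kill.c code)
of the reader-side pattern this change makes safe:

	mmap_read_lock(mm);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/*
		 * Page tables cannot be freed from under us here:
		 * exit_mmap now holds mmap_write_lock around
		 * free_pgtables and remove_vma.
		 */
		...
	}
	mmap_read_unlock(mm);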

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
changes in v4
- Separated comments describing vm_operations_struct::close locking
requirements into a separate patch, per Matthew Wilcox

mm/mmap.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..f4e09d390a07 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
* __oom_reap_task_mm() will not block.
*
- * This needs to be done before calling munlock_vma_pages_all(),
+ * This needs to be done before calling unlock_range(),
* which clears VM_LOCKED, otherwise the oom reaper cannot
* reliably test it.
*/
(void)__oom_reap_task_mm(mm);

set_bit(MMF_OOM_SKIP, &mm->flags);
- mmap_write_lock(mm);
- mmap_write_unlock(mm);
}

+ mmap_write_lock(mm);
if (mm->locked_vm)
unlock_range(mm->mmap, ULONG_MAX);

arch_exit_mmap(mm);

vma = mm->mmap;
- if (!vma) /* Can happen if dup_mmap() received an OOM */
+ if (!vma) {
+ /* Can happen if dup_mmap() received an OOM */
+ mmap_write_unlock(mm);
return;
+ }

lru_add_drain();
flush_cache_mm(mm);
@@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);

- /*
- * Walk the list again, actually closing and freeing it,
- * with preemption enabled, without holding any MM locks.
- */
+ /* Walk the list again, actually closing and freeing it. */
while (vma) {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
vma = remove_vma(vma);
cond_resched();
}
+ mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
}

--
2.34.1.400.ga245620fadb-goog



2021-12-08 21:22:23

by Suren Baghdasaryan

Subject: [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close

Add comments for vm_operations_struct::close documenting locking
requirements for this callback and its callers.
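
For illustration, a hypothetical driver callback (example_ctx and the
function name are invented, not part of this patch) that obeys these
rules: it may sleep, but must not take mmap_lock, since callers such as
__split_vma and exit_mmap already hold it:

	static void example_vma_close(struct vm_area_struct *vma)
	{
		struct example_ctx *ctx = vma->vm_private_data;

		/* Sleeping is fine here. */
		mutex_lock(&ctx->lock);
		ctx->nr_vmas--;
		mutex_unlock(&ctx->lock);
		/* Taking mmap_lock here would deadlock: the caller holds it. */
	}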

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..b9b88ba7564b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -577,6 +577,10 @@ enum page_entry_size {
*/
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
+ /**
+ * @close: Called when the VMA is being removed from the MM.
+ * Context: User context. May sleep. Caller holds mmap_lock.
+ */
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
--
2.34.1.400.ga245620fadb-goog


2021-12-08 21:22:25

by Suren Baghdasaryan

Subject: [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection

With exit_mmap holding mmap_write_lock during the free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destroying page tables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which can lead to waiting for I/O completion
in exit_aio.
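
For illustration, a simplified sketch of the refcounting rule this
relies on (not the exact kernel definitions): mmgrab()/mmdrop() pin only
the mm_struct itself, while mmget()/mmput() pin the whole address space,
so the last mmput() tears it down via exit_mmap()/exit_aio():

	mm = p->mm;
	mmgrab(mm);		/* mm_struct stays valid; exit_mmap may still run */
	mmap_read_lock(mm);	/* safe: free_pgtables now needs the write lock */
	/* ... reap ... */
	mmap_read_unlock(mm);
	mmdrop(mm);		/* never ends up in exit_aio, unlike a last mmput */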

Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/oom_kill.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
goto put_task;
}

- if (mmget_not_zero(p->mm)) {
- mm = p->mm;
- if (task_will_free_mem(p))
- reap = true;
- else {
- /* Error only if the work has not been done already */
- if (!test_bit(MMF_OOM_SKIP, &mm->flags))
- ret = -EINVAL;
- }
+ mm = p->mm;
+ mmgrab(mm);
+
+ if (task_will_free_mem(p))
+ reap = true;
+ else {
+ /* Error only if the work has not been done already */
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+ ret = -EINVAL;
}
task_unlock(p);

@@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
ret = -EINTR;
goto drop_mm;
}
- if (!__oom_reap_task_mm(mm))
+ /*
+ * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+ * possible change in exit_mmap is seen
+ */
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
ret = -EAGAIN;
mmap_read_unlock(mm);

drop_mm:
- if (mm)
- mmput(mm);
+ mmdrop(mm);
put_task:
put_task_struct(task);
return ret;
--
2.34.1.400.ga245620fadb-goog


2021-12-09 08:55:33

by Michal Hocko

Subject: Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> The oom reaper and the process_mrelease system call should protect
> against races with exit_mmap, which can destroy page tables while they
> walk the VMA tree. The oom reaper protects against that race by setting
> MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> before taking and releasing mmap_write_lock. process_mrelease has
> to elevate mm->mm_users to prevent such a race. Both the oom reaper and
> process_mrelease hold mmap_read_lock when walking the VMA tree.
> The locking rules and mechanisms could be simpler if exit_mmap takes
> mmap_write_lock while executing destructive operations such as
> free_pgtables.
> Change exit_mmap to hold the mmap_write_lock when calling
> free_pgtables and remove_vma. Operations like unmap_vmas and
> unlock_range are not destructive and could run under mmap_read_lock,
> but for simplicity we take one mmap_write_lock during almost the entire
> operation.

unlock_range is not safe to be called under read lock. See 27ae357fa82b
("mm, oom: fix concurrent munlock and oom reaper unmap, v3").

> Note also that because the oom reaper checks the VM_LOCKED flag,
> unlock_range() should not be allowed to race with it.
> Before this patch, remove_vma used to be called with no locks held;
> however, with fput being executed asynchronously and vm_ops->close
> not being allowed to hold mmap_lock (it is called from __split_vma
> with mmap_lock held for write), changing that should be fine.
> In most cases this lock should be uncontended. Previously, Kirill
> reported a ~4% regression caused by a similar change [1]. We reran the
> same test and, although the individual results are quite noisy, the
> percentiles show a lower regression, with 1.6% being the worst case [2].
> The change allows the oom reaper and process_mrelease to execute safely
> under mmap_read_lock without worries that exit_mmap might destroy page
> tables from under them.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

The patch looks good otherwise. Btw. when I was trying to do something
similar in the past, Hugh noted that we can get rid of the same
lock&&unlock trick in ksm. Maybe you want to have a look at that as well
;)

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> changes in v4
> - Separated comments describing vm_operations_struct::close locking
> requirements into a separate patch, per Matthew Wilcox
>
> mm/mmap.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bfb0ea164a90..f4e09d390a07 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
> * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> * __oom_reap_task_mm() will not block.
> *
> - * This needs to be done before calling munlock_vma_pages_all(),
> + * This needs to be done before calling unlock_range(),
> * which clears VM_LOCKED, otherwise the oom reaper cannot
> * reliably test it.
> */
> (void)__oom_reap_task_mm(mm);
>
> set_bit(MMF_OOM_SKIP, &mm->flags);
> - mmap_write_lock(mm);
> - mmap_write_unlock(mm);
> }
>
> + mmap_write_lock(mm);
> if (mm->locked_vm)
> unlock_range(mm->mmap, ULONG_MAX);
>
> arch_exit_mmap(mm);
>
> vma = mm->mmap;
> - if (!vma) /* Can happen if dup_mmap() received an OOM */
> + if (!vma) {
> + /* Can happen if dup_mmap() received an OOM */
> + mmap_write_unlock(mm);
> return;
> + }
>
> lru_add_drain();
> flush_cache_mm(mm);
> @@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb);
>
> - /*
> - * Walk the list again, actually closing and freeing it,
> - * with preemption enabled, without holding any MM locks.
> - */
> + /* Walk the list again, actually closing and freeing it. */
> while (vma) {
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> cond_resched();
> }
> + mmap_write_unlock(mm);
> vm_unacct_memory(nr_accounted);
> }
>
> --
> 2.34.1.400.ga245620fadb-goog

--
Michal Hocko
SUSE Labs

2021-12-09 08:56:03

by Michal Hocko

Subject: Re: [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close

On Wed 08-12-21 13:22:10, Suren Baghdasaryan wrote:
> Add comments for vm_operations_struct::close documenting locking
> requirements for this callback and its callers.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/mm.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7e4a9e7d807..b9b88ba7564b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -577,6 +577,10 @@ enum page_entry_size {
> */
> struct vm_operations_struct {
> void (*open)(struct vm_area_struct * area);
> + /**
> + * @close: Called when the VMA is being removed from the MM.
> + * Context: User context. May sleep. Caller holds mmap_lock.
> + */
> void (*close)(struct vm_area_struct * area);
> /* Called any time before splitting to check if it's allowed */
> int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> --
> 2.34.1.400.ga245620fadb-goog

--
Michal Hocko
SUSE Labs

2021-12-09 08:59:07

by Michal Hocko

Subject: Re: [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection

On Wed 08-12-21 13:22:11, Suren Baghdasaryan wrote:
> With exit_mmap holding mmap_write_lock during the free_pgtables call,
> process_mrelease does not need to elevate mm->mm_users in order to
> prevent exit_mmap from destroying page tables while __oom_reap_task_mm
> is walking the VMA tree. The change prevents process_mrelease from
> calling the last mmput, which can lead to waiting for I/O completion
> in exit_aio.
>
> Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")

I am not sure I have brought this up already but I do not think Fixes
tag is a good fit. 337546e83fc7 is a correct way to handle the race. It
is just slightly less optimal than this fix.

> Signed-off-by: Suren Baghdasaryan <[email protected]>

Acked-by: Michal Hocko <[email protected]>

Thanks!
> ---
> mm/oom_kill.c | 27 +++++++++++++++------------
> 1 file changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ddabefcfb5a..67780386f478 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> goto put_task;
> }
>
> - if (mmget_not_zero(p->mm)) {
> - mm = p->mm;
> - if (task_will_free_mem(p))
> - reap = true;
> - else {
> - /* Error only if the work has not been done already */
> - if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> - ret = -EINVAL;
> - }
> + mm = p->mm;
> + mmgrab(mm);
> +
> + if (task_will_free_mem(p))
> + reap = true;
> + else {
> + /* Error only if the work has not been done already */
> + if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> + ret = -EINVAL;
> }
> task_unlock(p);
>
> @@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> ret = -EINTR;
> goto drop_mm;
> }
> - if (!__oom_reap_task_mm(mm))
> + /*
> + * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
> + * possible change in exit_mmap is seen
> + */
> + if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
> ret = -EAGAIN;
> mmap_read_unlock(mm);
>
> drop_mm:
> - if (mm)
> - mmput(mm);
> + mmdrop(mm);
> put_task:
> put_task_struct(task);
> return ret;
> --
> 2.34.1.400.ga245620fadb-goog

--
Michal Hocko
SUSE Labs

2021-12-09 09:12:14

by Michal Hocko

Subject: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

Do we want this on top?
----
From 58b04ae6dc97b0105ea2651daca55cf2386f69b4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 9 Dec 2021 10:07:51 +0100
Subject: [PATCH] mm: drop MMF_OOM_SKIP from exit_mmap

MMF_OOM_SKIP used to play a synchronization role between exit_mmap and
the oom reaper in the past. Since the exclusive mmap_lock is held in
exit_mmap to cover all destructive operations, the flag synchronization
is not needed anymore and we can safely drop it. Just make sure that
mm->mmap is set to NULL so that nobody will access the freed vma list.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/mmap.c | 23 +----------------------
1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f4e09d390a07..0d6af9d89aa8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3129,28 +3129,6 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);

- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_lock for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_lock is
- * dropped.
- *
- * Nothing can be holding mm->mmap_lock here and the above call
- * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
- * __oom_reap_task_mm() will not block.
- *
- * This needs to be done before calling unlock_range(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test it.
- */
- (void)__oom_reap_task_mm(mm);
-
- set_bit(MMF_OOM_SKIP, &mm->flags);
- }
-
mmap_write_lock(mm);
if (mm->locked_vm)
unlock_range(mm->mmap, ULONG_MAX);
@@ -3180,6 +3158,7 @@ void exit_mmap(struct mm_struct *mm)
vma = remove_vma(vma);
cond_resched();
}
+ mm->mmap = NULL;
mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
}
--
2.30.2

--
Michal Hocko
SUSE Labs

2021-12-09 16:24:20

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
>
> Do we want this on top?

As we discussed in this thread
https://lore.kernel.org/all/[email protected],
__oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
unmap pages in parallel with exit_mmap without blocking each other.
Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
and has a negative impact on performance. So the conclusion of that
thread I thought was to keep that part. My understanding is that we
also wanted to remove MMF_OOM_SKIP as a follow-up patch but
__oom_reap_task_mm would stay.


> ----
> From 58b04ae6dc97b0105ea2651daca55cf2386f69b4 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 9 Dec 2021 10:07:51 +0100
> Subject: [PATCH] mm: drop MMF_OOM_SKIP from exit_mmap
>
> MMF_OOM_SKIP used to play a synchronization role between exit_mmap and
> the oom reaper in the past. Since the exclusive mmap_lock is held in
> exit_mmap to cover all destructive operations, the flag synchronization
> is not needed anymore and we can safely drop it. Just make sure that
> mm->mmap is set to NULL so that nobody will access the freed vma list.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/mmap.c | 23 +----------------------
> 1 file changed, 1 insertion(+), 22 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f4e09d390a07..0d6af9d89aa8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3129,28 +3129,6 @@ void exit_mmap(struct mm_struct *mm)
> /* mm's last user has gone, and its about to be pulled down */
> mmu_notifier_release(mm);
>
> - if (unlikely(mm_is_oom_victim(mm))) {
> - /*
> - * Manually reap the mm to free as much memory as possible.
> - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> - * this mm from further consideration. Taking mm->mmap_lock for
> - * write after setting MMF_OOM_SKIP will guarantee that the oom
> - * reaper will not run on this mm again after mmap_lock is
> - * dropped.
> - *
> - * Nothing can be holding mm->mmap_lock here and the above call
> - * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> - * __oom_reap_task_mm() will not block.
> - *
> - * This needs to be done before calling unlock_range(),
> - * which clears VM_LOCKED, otherwise the oom reaper cannot
> - * reliably test it.
> - */
> - (void)__oom_reap_task_mm(mm);
> -
> - set_bit(MMF_OOM_SKIP, &mm->flags);
> - }
> -
> mmap_write_lock(mm);
> if (mm->locked_vm)
> unlock_range(mm->mmap, ULONG_MAX);
> @@ -3180,6 +3158,7 @@ void exit_mmap(struct mm_struct *mm)
> vma = remove_vma(vma);
> cond_resched();
> }
> + mm->mmap = NULL;
> mmap_write_unlock(mm);
> vm_unacct_memory(nr_accounted);
> }
> --
> 2.30.2
>
> --
> Michal Hocko
> SUSE Labs

2021-12-09 16:48:02

by Michal Hocko

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> >
> > Do we want this on top?
>
> As we discussed in this thread
> https://lore.kernel.org/all/[email protected],
> __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> unmap pages in parallel with exit_mmap without blocking each other.
> Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> and has a negative impact on performance. So the conclusion of that
> thread I thought was to keep that part. My understanding is that we
> also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> __oom_reap_task_mm would stay.

OK, then we were talking past each other, I am afraid. I really wanted
to get rid of this oom-specific stuff from exit_mmap. It was there out
of necessity. With proper locking we can finally get rid of the crud.
As I've said previously, oom reaping has never been a hot path.

If we really want to optimize this path then I would much rather see a
generic solution which would allow moving the write lock down after
unmap_vmas. That would require the oom reaper to be able to handle
mlocked memory.
--
Michal Hocko
SUSE Labs

2021-12-09 17:07:12

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > >
> > > Do we want this on top?
> >
> > As we discussed in this thread
> > https://lore.kernel.org/all/[email protected],
> > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > unmap pages in parallel with exit_mmap without blocking each other.
> > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > and has a negative impact on performance. So the conclusion of that
> > thread I thought was to keep that part. My understanding is that we
> > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > __oom_reap_task_mm would stay.
>
> OK, then we were talking past each other, I am afraid. I really wanted
> to get rid of this oom-specific stuff from exit_mmap. It was there out
> of necessity. With proper locking we can finally get rid of the crud.
> As I've said previously, oom reaping has never been a hot path.
>
> If we really want to optimize this path then I would much rather see a
> generic solution which would allow moving the write lock down after
> unmap_vmas. That would require the oom reaper to be able to handle
> mlocked memory.

Ok, let's work on that and when that's done we can get rid of the oom
stuff in exit_mmap. I'll look into this over the weekend and will
likely be back with questions.
Thanks!

> --
> Michal Hocko
> SUSE Labs

2021-12-09 19:03:28

by Suren Baghdasaryan

Subject: Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

On Thu, Dec 9, 2021 at 12:55 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> > The oom reaper and the process_mrelease system call should protect
> > against races with exit_mmap, which can destroy page tables while they
> > walk the VMA tree. The oom reaper protects against that race by setting
> > MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> > before taking and releasing mmap_write_lock. process_mrelease has
> > to elevate mm->mm_users to prevent such a race. Both the oom reaper and
> > process_mrelease hold mmap_read_lock when walking the VMA tree.
> > The locking rules and mechanisms could be simpler if exit_mmap takes
> > mmap_write_lock while executing destructive operations such as
> > free_pgtables.
> > Change exit_mmap to hold the mmap_write_lock when calling
> > free_pgtables and remove_vma. Operations like unmap_vmas and
> > unlock_range are not destructive and could run under mmap_read_lock,
> > but for simplicity we take one mmap_write_lock during almost the entire
> > operation.
>
> unlock_range is not safe to be called under read lock. See 27ae357fa82b
> ("mm, oom: fix concurrent munlock and oom reaper unmap, v3").

Ok, I'll remove the sentence above.
Is my understanding correct that it is unsafe only because oom-reaper
can't deal with VM_LOCKED, otherwise it would be fine?

>
> > Note also that because the oom reaper checks the VM_LOCKED flag,
> > unlock_range() should not be allowed to race with it.
> > Before this patch, remove_vma used to be called with no locks held;
> > however, with fput being executed asynchronously and vm_ops->close
> > not being allowed to hold mmap_lock (it is called from __split_vma
> > with mmap_lock held for write), changing that should be fine.
> > In most cases this lock should be uncontended. Previously, Kirill
> > reported a ~4% regression caused by a similar change [1]. We reran the
> > same test and, although the individual results are quite noisy, the
> > percentiles show a lower regression, with 1.6% being the worst case [2].
> > The change allows the oom reaper and process_mrelease to execute safely
> > under mmap_read_lock without worries that exit_mmap might destroy page
> > tables from under them.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> The patch looks good otherwise. Btw. when I was trying to do something
> similar in the past, Hugh noted that we can get rid of the same
> lock&&unlock trick in ksm. Maybe you want to have a look at that as well
> ;)

I'll take a look after we cleanup this path completely (oom pieces included).

>
> Acked-by: Michal Hocko <[email protected]>

Thanks!

>
> Thanks!
>
> > ---
> > changes in v4
> > - Separated comments describing vm_operations_struct::close locking
> > requirements into a separate patch, per Matthew Wilcox
> >
> > mm/mmap.c | 16 ++++++++--------
> > 1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index bfb0ea164a90..f4e09d390a07 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
> > * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > * __oom_reap_task_mm() will not block.
> > *
> > - * This needs to be done before calling munlock_vma_pages_all(),
> > + * This needs to be done before calling unlock_range(),
> > * which clears VM_LOCKED, otherwise the oom reaper cannot
> > * reliably test it.
> > */
> > (void)__oom_reap_task_mm(mm);
> >
> > set_bit(MMF_OOM_SKIP, &mm->flags);
> > - mmap_write_lock(mm);
> > - mmap_write_unlock(mm);
> > }
> >
> > + mmap_write_lock(mm);
> > if (mm->locked_vm)
> > unlock_range(mm->mmap, ULONG_MAX);
> >
> > arch_exit_mmap(mm);
> >
> > vma = mm->mmap;
> > - if (!vma) /* Can happen if dup_mmap() received an OOM */
> > + if (!vma) {
> > + /* Can happen if dup_mmap() received an OOM */
> > + mmap_write_unlock(mm);
> > return;
> > + }
> >
> > lru_add_drain();
> > flush_cache_mm(mm);
> > @@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb);
> >
> > - /*
> > - * Walk the list again, actually closing and freeing it,
> > - * with preemption enabled, without holding any MM locks.
> > - */
> > + /* Walk the list again, actually closing and freeing it. */
> > while (vma) {
> > if (vma->vm_flags & VM_ACCOUNT)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > cond_resched();
> > }
> > + mmap_write_unlock(mm);
> > vm_unacct_memory(nr_accounted);
> > }
> >
> > --
> > 2.34.1.400.ga245620fadb-goog
>
> --
> Michal Hocko
> SUSE Labs

2021-12-09 19:04:14

by Suren Baghdasaryan

Subject: Re: [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection

On Thu, Dec 9, 2021 at 12:59 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 08-12-21 13:22:11, Suren Baghdasaryan wrote:
> > With exit_mmap holding mmap_write_lock during the free_pgtables call,
> > process_mrelease does not need to elevate mm->mm_users in order to
> > prevent exit_mmap from destroying page tables while __oom_reap_task_mm
> > is walking the VMA tree. The change prevents process_mrelease from
> > calling the last mmput, which can lead to waiting for I/O completion
> > in exit_aio.
> >
> > Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
>
> I am not sure I have brought this up already but I do not think Fixes
> tag is a good fit. 337546e83fc7 is a correct way to handle the race. It
> is just slightly less optimal than this fix.

Will post v5 without it. Thanks!

>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> Acked-by: Michal Hocko <[email protected]>
>
> Thanks!
> > ---
> > mm/oom_kill.c | 27 +++++++++++++++------------
> > 1 file changed, 15 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 1ddabefcfb5a..67780386f478 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > goto put_task;
> > }
> >
> > - if (mmget_not_zero(p->mm)) {
> > - mm = p->mm;
> > - if (task_will_free_mem(p))
> > - reap = true;
> > - else {
> > - /* Error only if the work has not been done already */
> > - if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > - ret = -EINVAL;
> > - }
> > + mm = p->mm;
> > + mmgrab(mm);
> > +
> > + if (task_will_free_mem(p))
> > + reap = true;
> > + else {
> > + /* Error only if the work has not been done already */
> > + if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > + ret = -EINVAL;
> > }
> > task_unlock(p);
> >
> > @@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > ret = -EINTR;
> > goto drop_mm;
> > }
> > - if (!__oom_reap_task_mm(mm))
> > + /*
> > + * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
> > + * possible change in exit_mmap is seen
> > + */
> > + if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
> > ret = -EAGAIN;
> > mmap_read_unlock(mm);
> >
> > drop_mm:
> > - if (mm)
> > - mmput(mm);
> > + mmdrop(mm);
> > put_task:
> > put_task_struct(task);
> > return ret;
> > --
> > 2.34.1.400.ga245620fadb-goog
>
> --
> Michal Hocko
> SUSE Labs

2021-12-10 09:20:11

by Michal Hocko

Subject: Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

On Thu 09-12-21 11:03:11, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 12:55 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> > > The oom reaper and the process_mrelease system call should protect
> > > against races with exit_mmap, which can destroy page tables while they
> > > walk the VMA tree. The oom reaper protects against that race by setting
> > > MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> > > before taking and releasing mmap_write_lock. process_mrelease has
> > > to elevate mm->mm_users to prevent such a race. Both the oom reaper and
> > > process_mrelease hold mmap_read_lock when walking the VMA tree.
> > > The locking rules and mechanisms could be simpler if exit_mmap takes
> > > mmap_write_lock while executing destructive operations such as
> > > free_pgtables.
> > > Change exit_mmap to hold the mmap_write_lock when calling
> > > free_pgtables and remove_vma. Operations like unmap_vmas and
> > > unlock_range are not destructive and could run under mmap_read_lock,
> > > but for simplicity we take one mmap_write_lock during almost the entire
> > > operation.
> >
> > unlock_range is not safe to be called under read lock. See 27ae357fa82b
> > ("mm, oom: fix concurrent munlock and oom reaper unmap, v3").
>
> Ok, I'll remove the sentence above.
> Is my understanding correct that it is unsafe only because oom-reaper
> can't deal with VM_LOCKED, otherwise it would be fine?

The commit message (27ae357fa82b) goes into details that I have forgotten already.
--
Michal Hocko
SUSE Labs

2021-12-16 02:26:26

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > Do we want this on top?
> > >
> > > As we discussed in this thread
> > > https://lore.kernel.org/all/[email protected],
> > > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > > unmap pages in parallel with exit_mmap without blocking each other.
> > > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > > and has a negative impact on performance. So the conclusion of that
> > > thread I thought was to keep that part. My understanding is that we
> > > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > > __oom_reap_task_mm would stay.
> >
> > OK, then we were talking past each other, I am afraid. I really wanted
> > to get rid of this oom-specific stuff from exit_mmap. It was there out
> > of necessity. With proper locking we can finally get rid of the crud.
> > As I've said previously, oom reaping has never been a hot path.
> >
> > If we really want to optimize this path then I would much rather see a
> > generic solution which would allow moving the write lock down after
> > unmap_vmas. That would require the oom reaper to be able to handle
> > mlocked memory.
>
> Ok, let's work on that and when that's done we can get rid of the oom
> stuff in exit_mmap. I'll look into this over the weekend and will
> likely be back with questions.

As promised, I have a question:
Any particular reason why munlock_vma_pages_range clears VM_LOCKED
before unlocking pages and not after (see:
https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
to me if VM_LOCKED was reset at the end (with proper ordering) then
__oom_reap_task_mm would correctly skip VM_LOCKED vmas.
https://lore.kernel.org/lkml/[email protected]/
has this explanation:

"Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
vm_flags before actually doing the munlock to determine if any other
vmas are locking the same memory, the check for VM_LOCKED in the oom
reaper is racy."

but "to determine if any other vmas are locking the same memory"
explanation eludes me... Any insights?
Thanks,
Suren.

> Thanks!
>
> > --
> > Michal Hocko
> > SUSE Labs

2021-12-16 11:49:51

by Johannes Weiner

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Wed, Dec 15, 2021 at 06:26:11PM -0800, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > > > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > Do we want this on top?
> > > >
> > > > As we discussed in this thread
> > > > https://lore.kernel.org/all/[email protected],
> > > > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > > > unmap pages in parallel with exit_mmap without blocking each other.
> > > > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > > > and has a negative impact on performance. So the conclusion of that
> > > > thread I thought was to keep that part. My understanding is that we
> > > > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > > > __oom_reap_task_mm would stay.
> > >
> > > OK, then we were talking past each other, I am afraid. I really wanted
> > > to get rid of this oom-specific stuff from exit_mmap. It was there out
> > > of necessity. With proper locking we can finally get rid of the crud.
> > > As I've said previously, oom reaping has never been a hot path.
> > >
> > > If we really want to optimize this path then I would much rather see a
> > > generic solution which would allow moving the write lock down after
> > > unmap_vmas. That would require the oom reaper to be able to handle
> > > mlocked memory.
> >
> > Ok, let's work on that and when that's done we can get rid of the oom
> > stuff in exit_mmap. I'll look into this over the weekend and will
> > likely be back with questions.
>
> As promised, I have a question:
> Any particular reason why munlock_vma_pages_range clears VM_LOCKED
> before unlocking pages and not after (see:
> https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
> to me if VM_LOCKED was reset at the end (with proper ordering) then
> __oom_reap_task_mm would correctly skip VM_LOCKED vmas.
> https://lore.kernel.org/lkml/[email protected]/
> has this explanation:
>
> "Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
> vm_flags before actually doing the munlock to determine if any other
> vmas are locking the same memory, the check for VM_LOCKED in the oom
> reaper is racy."
>
> but "to determine if any other vmas are locking the same memory"
> explanation eludes me... Any insights?

A page's mlock state is determined by whether any of the vmas that map
it are mlocked. The munlock code does:

vma->vm_flags &= VM_LOCKED_CLEAR_MASK
TestClearPageMlocked()
isolate_lru_page()
__munlock_isolated_page()
        page_mlock()
                rmap_walk() # for_each_vma()
                        page_mlock_one()
                                (vma->vm_flags & VM_LOCKED) && TestSetPageMlocked()

If we didn't clear the VM_LOCKED flag first, racing threads could
re-lock pages under us because they see that flag and think our vma
wants those pages mlocked when we're in the process of munlocking.
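
To make the race concrete, a simplified (hand-drawn, not from any
kernel trace) interleaving with the flag cleared last:

	munlock thread				racing page_mlock()
	--------------				-------------------
	TestClearPageMlocked(page)
						sees vma->vm_flags & VM_LOCKED
						TestSetPageMlocked(page) /* re-locked */
	vma->vm_flags &= VM_LOCKED_CLEAR_MASK	/* too late */

Clearing VM_LOCKED before touching the pages closes that window.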


2021-12-16 17:23:26

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 16, 2021 at 3:49 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Dec 15, 2021 at 06:26:11PM -0800, Suren Baghdasaryan wrote:
> > On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > > > > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > Do we want this on top?
> > > > >
> > > > > As we discussed in this thread
> > > > > https://lore.kernel.org/all/[email protected],
> > > > > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > > > > unmap pages in parallel with exit_mmap without blocking each other.
> > > > > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > > > > and has a negative impact on performance. So the conclusion of that
> > > > > thread I thought was to keep that part. My understanding is that we
> > > > > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > > > > __oom_reap_task_mm would stay.
> > > >
> > > > OK, then we were talking past each other, I am afraid. I really wanted
> > > > to get rid of this oom-specific stuff from exit_mmap. It was there out
> > > > of necessity. With proper locking we can finally get rid of the crud.
> > > > As I've said previously, oom reaping has never been a hot path.
> > > >
> > > > If we really want to optimize this path then I would much rather see a
> > > > generic solution which would allow moving the write lock down after
> > > > unmap_vmas. That would require the oom reaper to be able to handle
> > > > mlocked memory.
> > >
> > > Ok, let's work on that and when that's done we can get rid of the oom
> > > stuff in exit_mmap. I'll look into this over the weekend and will
> > > likely be back with questions.
> >
> > As promised, I have a question:
> > Any particular reason why munlock_vma_pages_range clears VM_LOCKED
> > before unlocking pages and not after (see:
> > https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
> > to me if VM_LOCKED was reset at the end (with proper ordering) then
> > __oom_reap_task_mm would correctly skip VM_LOCKED vmas.
> > https://lore.kernel.org/lkml/[email protected]/
> > has this explanation:
> >
> > "Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
> > vm_flags before actually doing the munlock to determine if any other
> > vmas are locking the same memory, the check for VM_LOCKED in the oom
> > reaper is racy."
> >
> > but "to determine if any other vmas are locking the same memory"
> > explanation eludes me... Any insights?
>
> A page's mlock state is determined by whether any of the vmas that map
> it are mlocked. The munlock code does:
>
> vma->vm_flags &= VM_LOCKED_CLEAR_MASK
> TestClearPageMlocked()
> isolate_lru_page()
> __munlock_isolated_page()
>         page_mlock()
>                 rmap_walk() # for_each_vma()
>                         page_mlock_one()
>                                 (vma->vm_flags & VM_LOCKED) && TestSetPageMlocked()
>
> If we didn't clear the VM_LOCKED flag first, racing threads could
> re-lock pages under us because they see that flag and think our vma
> wants those pages mlocked when we're in the process of munlocking.

Thanks for the explanation Johannes!
So far I haven't found an easy way to let __oom_reap_task_mm() run
concurrently with unlock_range(). Will keep exploring.

>

2021-12-30 06:00:11

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 16, 2021 at 9:23 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Dec 16, 2021 at 3:49 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Dec 15, 2021 at 06:26:11PM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > > > > > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > Do we want this on top?
> > > > > >
> > > > > > As we discussed in this thread
> > > > > > https://lore.kernel.org/all/[email protected],
> > > > > > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > > > > > unmap pages in parallel with exit_mmap without blocking each other.
> > > > > > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > > > > > and has a negative impact on performance. So the conclusion of that
> > > > > > thread I thought was to keep that part. My understanding is that we
> > > > > > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > > > > > __oom_reap_task_mm would stay.
> > > > >
> > > > > OK, then we were talking past each other, I am afraid. I really wanted
> > > > > to get rid of this oom specific stuff from exit_mmap. It was there out
> > > > > of necessity. With a proper locking we can finally get rid of the crud.
> > > > > As I've said previously oom reaping has never been a hot path.
> > > > >
> > > > > If we really want to optimize this path then I would much rather see a
> > > > > generic solution which would allow to move the write lock down after
> > > > > unmap_vmas. That would require oom reaper to be able to handle mlocked
> > > > > memory.
> > > >
> > > > Ok, let's work on that and when that's done we can get rid of the oom
> > > > stuff in exit_mmap. I'll look into this over the weekend and will
> > > > likely be back with questions.
> > >
> > > As promised, I have a question:
> > > Any particular reason why munlock_vma_pages_range clears VM_LOCKED
> > > before unlocking pages and not after (see:
> > > https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? Seems
> > > to me if VM_LOCKED was reset at the end (with proper ordering) then
> > > __oom_reap_task_mm would correctly skip VM_LOCKED vmas.
> > > https://lore.kernel.org/lkml/[email protected]/
> > > has this explanation:
> > >
> > > "Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
> > > vm_flags before actually doing the munlock to determine if any other
> > > vmas are locking the same memory, the check for VM_LOCKED in the oom
> > > reaper is racy."
> > >
> > > but "to determine if any other vmas are locking the same memory"
> > > explanation eludes me... Any insights?
> >
> > A page's mlock state is determined by whether any of the vmas that map
> > it are mlocked. The munlock code does:
> >
> > vma->vm_flags &= VM_LOCKED_CLEAR_MASK
> > TestClearPageMlocked()
> > isolate_lru_page()
> > __munlock_isolated_page()
> >         page_mlock()
> >                 rmap_walk() # for_each_vma()
> >                         page_mlock_one()
> >                                 (vma->vm_flags & VM_LOCKED) && TestSetPageMlocked()
> >
> > If we didn't clear the VM_LOCKED flag first, racing threads could
> > re-lock pages under us because they see that flag and think our vma
> > wants those pages mlocked when we're in the process of munlocking.
>
> Thanks for the explanation Johannes!
> So far I haven't found an easy way to let __oom_reap_task_mm() run
> concurrently with unlock_range(). Will keep exploring.

After some more digging I think there are two acceptable options:

1. Call unlock_range() under mmap_write_lock and then downgrade it to
read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
vmas in parallel like this:

if (mm->locked_vm) {
        mmap_write_lock(mm);
        unlock_range(mm->mmap, ULONG_MAX);
        mmap_write_downgrade(mm);
} else
        mmap_read_lock(mm);
...
unmap_vmas(&tlb, vma, 0, -1);
mmap_read_unlock(mm);
mmap_write_lock(mm);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
...
mm->mmap = NULL;
mmap_write_unlock(mm);

This way exit_mmap() might block __oom_reap_task_mm() but for a much
shorter time during unlock_range() call.

2. Introduce another vm_flag mask similar to VM_LOCKED which is set
before munlock_vma_pages_range() clears VM_LOCKED so that
__oom_reap_task_mm() can identify vmas being unlocked and skip them.

Option 1 seems cleaner to me because it keeps the locking pattern
around unlock_range() in exit_mmap() consistent with all other places
it is used (in mremap() and munmap()) with mmap_write_lock taken.
WDYT?
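
For concreteness, a hypothetical sketch of option 2 (the flag name
VM_LOCKED_CLEARING and its placement are invented here, not existing
kernel code; proper ordering/barriers omitted):

	/* munlock side: mark the vma before VM_LOCKED gets cleared */
	vma->vm_flags |= VM_LOCKED_CLEARING;
	munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end); /* clears VM_LOCKED */
	vma->vm_flags &= ~VM_LOCKED_CLEARING;

	/* __oom_reap_task_mm() side: skip vmas that are or were mlocked */
	if (vma->vm_flags & (VM_LOCKED | VM_LOCKED_CLEARING))
		continue;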

> >

2021-12-30 08:24:09

by Michal Hocko

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
[...]
> After some more digging I think there are two acceptable options:
>
> 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> vmas in parallel like this:
>
> if (mm->locked_vm) {
>         mmap_write_lock(mm);
>         unlock_range(mm->mmap, ULONG_MAX);
>         mmap_write_downgrade(mm);
> } else
>         mmap_read_lock(mm);
> ...
> unmap_vmas(&tlb, vma, 0, -1);
> mmap_read_unlock(mm);
> mmap_write_lock(mm);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> ...
> mm->mmap = NULL;
> mmap_write_unlock(mm);
>
> This way exit_mmap() might block __oom_reap_task_mm() but for a much
> shorter time during unlock_range() call.

IIRC unlock_range depends on the page lock at some stage, and that can
mean this will block for a long time or forever when the holder of the
lock depends on a memory allocation. This was the primary reason why the
oom reaper skips over mlocked vmas.

> 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> before munlock_vma_pages_range() clears VM_LOCKED so that
> __oom_reap_task_mm() can identify vmas being unlocked and skip them.
>
> Option 1 seems cleaner to me because it keeps the locking pattern
> around unlock_range() in exit_mmap() consistent with all other places
> it is used (in mremap() and munmap()) with mmap_write_lock taken.
> WDYT?

It would be really great to make unlock_range oom reaper aware IMHO.

You do not quote your change in full, so it is not really clear
whether you are planning to drop __oom_reap_task_mm from exit_mmap
as well. If yes, then 1) could push the oom reaper to its timeout while
unlock_range is blocked on something, so that wouldn't be an
improvement. 2) sounds like a workaround to me as it doesn't really
address the underlying problem.

I have to say that I am not really a great fan of __oom_reap_task_mm in
exit_mmap, but I would rather see it in place than make the surrounding
code more complex/tricky.

--
Michal Hocko
SUSE Labs

2021-12-30 17:29:54

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
> [...]
> > After some more digging I think there are two acceptable options:
> >
> > 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> > read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> > vmas in parallel like this:
> >
> > if (mm->locked_vm) {
> >         mmap_write_lock(mm);
> >         unlock_range(mm->mmap, ULONG_MAX);
> >         mmap_write_downgrade(mm);
> > } else
> >         mmap_read_lock(mm);
> > ...
> > unmap_vmas(&tlb, vma, 0, -1);
> > mmap_read_unlock(mm);
> > mmap_write_lock(mm);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > ...
> > mm->mmap = NULL;
> > mmap_write_unlock(mm);
> >
> > This way exit_mmap() might block __oom_reap_task_mm() but for a much
> > shorter time during unlock_range() call.
>
> IIRC unlock_range depends on the page lock at some stage, and that can
> mean this will block for a long time or forever when the holder of the
> lock depends on a memory allocation. This was the primary reason why the
> oom reaper skips over mlocked vmas.

Oh, I missed that detail. I thought __oom_reap_task_mm() skips locked
vmas only to avoid destroying pgds from under follow_page().

>
> > 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> > before munlock_vma_pages_range() clears VM_LOCKED so that
> > __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> >
> > Option 1 seems cleaner to me because it keeps the locking pattern
> > around unlock_range() in exit_mmap() consistent with all other places
> > it is used (in mremap() and munmap()) with mmap_write_lock taken.
> > WDYT?
>
> It would be really great to make unlock_range oom reaper aware IMHO.

What exactly do you envision? Say unlock_range() knows that it's
racing with __oom_reap_task_mm() and that calling follow_page() is
unsafe without locking, what should it do?

>
> You do not quote your change in full, so it is not really clear
> whether you are planning to drop __oom_reap_task_mm from exit_mmap
> as well.

Yes, that was the plan.

> If yes, then 1) could push the oom reaper to its timeout while
> unlock_range is blocked on something, so that wouldn't be an
> improvement. 2) sounds like a workaround to me as it doesn't really
> address the underlying problem.

With (1) potentially blocking due to allocation I can see why this is a problem.
Agree about (2).

>
> I have to say that I am not really a great fan of __oom_reap_task_mm in
> exit_mmap, but I would rather see it in place than make the surrounding
> code more complex/tricky.

Agree. So far I could not find a cleaner solution. I thought (1) would
be a good one but the point you made renders it invalid. If you
clarify your comment about making unlock_range oom reaper aware maybe
that will open a new line of investigation?
Thanks,
Suren.

>
> --
> Michal Hocko
> SUSE Labs

2022-01-03 12:11:44

by Michal Hocko

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
> > [...]
> > > After some more digging I think there are two acceptable options:
> > >
> > > 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> > > read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> > > vmas in parallel like this:
> > >
> > > if (mm->locked_vm) {
> > >         mmap_write_lock(mm);
> > >         unlock_range(mm->mmap, ULONG_MAX);
> > >         mmap_write_downgrade(mm);
> > > } else
> > >         mmap_read_lock(mm);
> > > ...
> > > unmap_vmas(&tlb, vma, 0, -1);
> > > mmap_read_unlock(mm);
> > > mmap_write_lock(mm);
> > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > ...
> > > mm->mmap = NULL;
> > > mmap_write_unlock(mm);
> > >
> > > This way exit_mmap() might block __oom_reap_task_mm() but for a much
> > > shorter time during unlock_range() call.
> >
> > IIRC unlock_range depends on the page lock at some stage, and that can
> > mean this will block for a long time or forever when the holder of the
> > lock depends on a memory allocation. This was the primary reason why the
> > oom reaper skips over mlocked vmas.
>
> Oh, I missed that detail. I thought __oom_reap_task_mm() skips locked
> vmas only to avoid destroying pgds from under follow_page().
>
> >
> > > 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> > > before munlock_vma_pages_range() clears VM_LOCKED so that
> > > __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> > >
> > > Option 1 seems cleaner to me because it keeps the locking pattern
> > > around unlock_range() in exit_mmap() consistent with all other places
> > > it is used (in mremap() and munmap()) with mmap_write_lock taken.
> > > WDYT?
> >
> > It would be really great to make unlock_range oom reaper aware IMHO.
>
> What exactly do you envision? Say unlock_range() knows that it's
> racing with __oom_reap_task_mm() and that calling follow_page() is
> unsafe without locking, what should it do?

My original plan was to make the page lock conditional and use
trylocking from the oom reaper (aka lockless context). It is OK to
simply bail out and leave some mlocked memory behind if there is
contention on a specific page. The overall objective is to free as much
memory as possible, not all of it.
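
For concreteness, a hypothetical sketch of that plan (the helper name is
invented, not actual kernel code):

	static bool munlock_lock_page(struct page *page, bool sleepable)
	{
		if (sleepable) {
			lock_page(page);
			return true;
		}
		/* oom reaper: bail out on contention, leave the page mlocked */
		return trylock_page(page);
	}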

IIRC Hugh was not a fan of this approach, and he mentioned that the
lock might not really be needed and that the area would benefit
from a cleanup rather than oom-reaper-specific hacks. I do tend to
agree with that. I just never managed to find any spare time for that,
though, and heavily mlocked oom victims tend to be really rare.
--
Michal Hocko
SUSE Labs

2022-01-03 21:16:42

by Hugh Dickins

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Mon, 3 Jan 2022, Michal Hocko wrote:
> On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> > On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <[email protected]> wrote:
> > >
> > > It would be really great to make unlock_range oom reaper aware IMHO.
> >
> > What exactly do you envision? Say unlock_range() knows that it's
> > racing with __oom_reap_task_mm() and that calling follow_page() is
> > unsafe without locking, what should it do?
>
> My original plan was to make the page lock conditional and use
> trylocking from the oom reaper (aka lockless context). It is OK to
> simply bail out and leave some mlocked memory behind if there is
> contention on a specific page. The overall objective is to free as much
> memory as possible, not all of it.
>
> IIRC Hugh was not a fan of this approach, and he mentioned that the
> lock might not really be needed and that the area would benefit
> from a cleanup rather than oom-reaper-specific hacks. I do tend to
> agree with that. I just never managed to find any spare time for that,
> though, and heavily mlocked oom victims tend to be really rare.

I forget when that was, and what I had in mind at that time.
But yes, by now I am very sure that munlocking needs a cleanup.

And I do have that cleanup (against a much older tree), but never
the time to rebase or post or shepherd it through N revisions.

It was 22 files changed, 464 insertions, 706 deletions:
which is too much to help with this immediate oom reaper question.

I'd better not drive this discussion further off-course; but it pains
me to see munlock_vma_pages obstructing, knowing there's a better way.

I wonder: what if I were to steal time and promise to post a
rebased series against 5.17-rc1 or rc2: not support it thereafter,
but might there be someone to pick it up and shepherd it through?
But there's no answer to that, without you seeing what it's like.

Hugh

2022-01-04 22:25:05

by Suren Baghdasaryan

Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Mon, Jan 3, 2022 at 1:16 PM Hugh Dickins <[email protected]> wrote:
>
> On Mon, 3 Jan 2022, Michal Hocko wrote:
> > On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> > > On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > It would be really great to make unlock_range oom reaper aware IMHO.
> > >
> > > What exactly do you envision? Say unlock_range() knows that it's
> > > racing with __oom_reap_task_mm() and that calling follow_page() is
> > > unsafe without locking, what should it do?
> >
> > My original plan was to make the page lock conditional and use
> > trylocking from the oom reaper (aka lockless context). It is OK to
> > simply bail out and leave some mlocked memory behind if there is
> > contention on a specific page. The overall objective is to free as much
> > memory as possible, not all of it.
> >
> > IIRC Hugh was not a fan of this approach, and he mentioned that the
> > lock might not really be needed and that the area would benefit
> > from a cleanup rather than oom-reaper-specific hacks. I do tend to
> > agree with that. I just never managed to find any spare time for that,
> > though, and heavily mlocked oom victims tend to be really rare.
>
> I forget when that was, and what I had in mind at that time.
> But yes, by now I am very sure that munlocking needs a cleanup.
>
> And I do have that cleanup (against a much older tree), but never
> the time to rebase or post or shepherd it through N revisions.

How old was that tree?

>
> It was 22 files changed, 464 insertions, 706 deletions:
> which is too much to help with this immediate oom reaper question.
>
> I'd better not drive this discussion further off-course; but it pains
> me to see munlock_vma_pages obstructing, knowing there's a better way.
>
> I wonder: what if I were to steal time and promise to post a
> rebased series against 5.17-rc1 or rc2: not support it thereafter,
> but might there be someone to pick it up and shepherd it through?
> But there's no answer to that, without you seeing what it's like.

I would be interested in taking a look and seeing if it can be upstreamed
and supported without bugging you too much.
Thanks,
Suren.

>
> Hugh