2018-03-20 21:33:53

by Yang Shi

Subject: [RFC PATCH 0/8] Drop mmap_sem during unmapping large map


Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping a large
memory space; please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

akpm then suggested unmapping large mappings section by section, dropping
mmap_sem between sections, to mitigate the issue (see
https://lkml.org/lkml/2018/3/6/784). So this series of patches aims to solve
the mmap_sem issue by adopting akpm's suggestion.


Approach:
A couple of approaches were explored.
#1. Unmap the large mapping by section in vm_munmap(). It works, but only
sys_munmap() benefits from this change.

#2. Do the unmapping deeper in the call chain, i.e. in zap_pmd_range(). That
way I don't have to define a magic size for unmapping. But there are two
major issues:
* mmap_sem may be acquired by either down_write() or down_read() depending
on the call path, so every call path has to be checked to determine which
variant, _write or _read, to use. This increases the complexity
significantly.
* The below race condition might be introduced:
     CPU A                        CPU B
   ----------                   ----------
   do_munmap
     zap_pmd_range
       up_write                 do_munmap
                                  down_write
                                  ......
                                  remove_vma_list
                                  up_write
       down_write
   access vmas  <-- use-after-free bug

In addition, unmapping by section requires splitting the vma, so the code has
to deal with partially unmapped vmas, which also increases the complexity
significantly.

#3. Do it in do_munmap(). This keeps the split vma / unmap region / free
page tables / free vmas sequence atomic for every section. Not only
sys_munmap() benefits, but also mremap and SysV shm. The only problem is
that some call paths may not want mmap_sem to be dropped, so an extra
parameter, called "atomic", is introduced to do_munmap(). The caller passes
"true" or "false" to tell do_munmap() whether dropping mmap_sem is
acceptable: "true" means do not drop it, "false" means it may be dropped.
Since all callers of do_munmap() acquire mmap_sem with the _write variant,
only one variant needs to be handled. When re-acquiring mmap_sem, plain
down_write() is used for now since dealing with the return value of
down_write_killable() sounds unnecessary.

Other than this, a magic section size has to be defined explicitly;
HPAGE_PUD_SIZE is used for now. According to my tests, HPAGE_PUD_SIZE is
good enough; smaller sizes appear to have too much overhead. This is also
why down_write() is used for re-acquiring mmap_sem instead of
down_write_killable().
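
For illustration, a minimal caller-side sketch of the resulting interface,
modeled on the vm_munmap() change in patch 2/8 ("false" allows do_munmap()
to drop and re-acquire mmap_sem between HPAGE_PUD_SIZE chunks, "true"
forbids it):

	LIST_HEAD(uf);
	int ret;

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	/* "false": this caller tolerates the temporary unlock/relock. */
	ret = do_munmap(mm, start, len, &uf, false);
	up_write(&mm->mmap_sem);
	userfaultfd_unmap_complete(mm, &uf);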

Regression and performance data:
Tests were run on a machine with 32 E5-2680 cores @ 2.70GHz and 384GB of memory.

The full LTP test was run; no regressions were found.

Measurement of SyS_munmap() execution time:
  size       pristine        patched        delta
  80GB     5008377 us     4905841 us          -2%
 160GB     9129243 us     9145306 us       +0.18%
 320GB    17915310 us    17990174 us       +0.42%

Throughput of page faults (#/s) with vm-scalability:
                     pristine     patched       delta
mmap-pread-seq         554894      563517       +1.6%
mmap-pread-seq-mt      581232      580772     -0.079%
mmap-xread-seq-mt       99182      105400       +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
  stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf \
            --timeout 600s

  pristine     patched      delta
    100165      108396      +8.2%


There are 8 patches in this series.
1/8:
Introduce the "atomic" parameter and define do_munmap_range(); modify
do_munmap() to call do_munmap_range() to unmap memory section by section
2/8 - 6/8:
Modify the do_munmap() call sites in mm/mmap.c, mm/mremap.c, mm/nommu.c,
ipc/shm.c and fs/proc/vmcore.c to adopt the "atomic" parameter
7/8 - 8/8:
Modify the do_munmap() call sites in arch/x86 to adopt the "atomic" parameter


Yang Shi (8):
mm: mmap: unmap large mapping by section
mm: mmap: pass atomic parameter to do_munmap() call sites
mm: mremap: pass atomic parameter to do_munmap()
mm: nommu: add atomic parameter to do_munmap()
ipc: shm: pass atomic parameter to do_munmap()
fs: proc/vmcore: pass atomic parameter to do_munmap()
x86: mpx: pass atomic parameter to do_munmap()
x86: vma: pass atomic parameter to do_munmap()

arch/x86/entry/vdso/vma.c | 2 +-
arch/x86/mm/mpx.c | 2 +-
fs/proc/vmcore.c | 4 ++--
include/linux/mm.h | 2 +-
ipc/shm.c | 9 ++++++---
mm/mmap.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
mm/mremap.c | 10 ++++++----
mm/nommu.c | 5 +++--
8 files changed, 62 insertions(+), 20 deletions(-)


2018-03-20 21:33:12

by Yang Shi

Subject: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem from the very beginning all the way to
the end and doesn't release it in the middle. Unmapping a large mapping may
take a long time (~18 seconds to unmap a 320GB mapping with every single
page mapped, on an idle machine).

Since unmapping doesn't require full atomicity, unmap large mappings
(> HPAGE_PUD_SIZE) section by section, and release mmap_sem after every
HPAGE_PUD_SIZE chunk if mmap_sem is contended and the call path is fine with
being interrupted. This is controlled by "atomic", a newly added parameter
to do_munmap(): "false" means it is fine to unlock/relock mmap_sem in the
middle.

Not only munmap benefits from this change, but also mremap and SysV shm,
since they all call do_munmap() to do the real work.

Below is some regression and performance data collected on a machine with
32 E5-2680 cores @ 2.70GHz and 384GB of memory.

Measurement of SyS_munmap() execution time:
  size       pristine        patched        delta
  80GB     5008377 us     4905841 us          -2%
 160GB     9129243 us     9145306 us       +0.18%
 320GB    17915310 us    17990174 us       +0.42%

Throughput of page faults (#/s) with vm-scalability:
                     pristine     patched       delta
mmap-pread-seq         554894      563517       +1.6%
mmap-pread-seq-mt      581232      580772     -0.079%
mmap-xread-seq-mt       99182      105400       +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
  stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf \
            --timeout 600s

  pristine     patched      delta
    100165      108396      +8.2%

Signed-off-by: Yang Shi <[email protected]>
---
include/linux/mm.h | 2 +-
mm/mmap.c | 40 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..2e447d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2212,7 +2212,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
struct list_head *uf);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
- struct list_head *uf);
+ struct list_head *uf, bool atomic);

static inline unsigned long
do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021..ad6ae7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2632,8 +2632,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
* work. This now handles partial unmappings.
* Jeremy Fitzhardinge <[email protected]>
*/
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
- struct list_head *uf)
+static int do_munmap_range(struct mm_struct *mm, unsigned long start,
+ size_t len, struct list_head *uf)
{
unsigned long end;
struct vm_area_struct *vma, *prev, *last;
@@ -2733,6 +2733,42 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return 0;
}

+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+ struct list_head *uf, bool atomic)
+{
+ int ret = 0;
+ size_t step = HPAGE_PUD_SIZE;
+
+ /*
+ * unmap large mapping (> huge pud size) section by section
+ * in order to give mmap_sem waiters a chance to acquire it.
+ */
+ if (len <= step)
+ ret = do_munmap_range(mm, start, len, uf);
+ else {
+ do {
+ ret = do_munmap_range(mm, start, step, uf);
+ if (ret < 0)
+ break;
+
+ if (rwsem_is_contended(&mm->mmap_sem) && !atomic &&
+ need_resched()) {
+ VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+ up_write(&mm->mmap_sem);
+ cond_resched();
+ down_write(&mm->mmap_sem);
+ }
+
+ start += step;
+ len -= step;
+ if (len <= step)
+ step = len;
+ } while (len > 0);
+ }
+
+ return ret;
+}
+
int vm_munmap(unsigned long start, size_t len)
{
int ret;
--
1.8.3.1


2018-03-20 21:33:17

by Yang Shi

Subject: [RFC PATCH 8/8] x86: vma: pass atomic parameter to do_munmap()

It sounds unnecessary to release mmap_sem in the middle of unmapping the
vdso map.

This is an API change only.

Signed-off-by: Yang Shi <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
---
arch/x86/entry/vdso/vma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556..6359ae5 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -192,7 +192,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)

if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
- do_munmap(mm, text_start, image->size, NULL);
+ do_munmap(mm, text_start, image->size, NULL, true);
} else {
current->mm->context.vdso = (void __user *)text_start;
current->mm->context.vdso_image = image;
--
1.8.3.1


2018-03-20 21:33:18

by Yang Shi

Subject: [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites

It looks safe to release mmap_sem in the middle for vm_munmap() and brk, so
pass "false" to those do_munmap() calls. However, it does not sound safe for
mmap_region(), which is called by SyS_mmap(), so pass "true" there.

Signed-off-by: Yang Shi <[email protected]>
---
mm/mmap.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ad6ae7a..374e4ec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -225,7 +225,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)

/* Always allow shrinking brk. */
if (brk <= mm->brk) {
- if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf))
+ if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf, false))
goto set_brk;
goto out;
}
@@ -1643,7 +1643,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Clear old maps */
while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
&rb_parent)) {
- if (do_munmap(mm, addr, len, uf))
+ if (do_munmap(mm, addr, len, uf, true))
return -ENOMEM;
}

@@ -2778,7 +2778,7 @@ int vm_munmap(unsigned long start, size_t len)
if (down_write_killable(&mm->mmap_sem))
return -EINTR;

- ret = do_munmap(mm, start, len, &uf);
+ ret = do_munmap(mm, start, len, &uf, false);
up_write(&mm->mmap_sem);
userfaultfd_unmap_complete(mm, &uf);
return ret;
@@ -2945,7 +2945,7 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
*/
while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
&rb_parent)) {
- if (do_munmap(mm, addr, len, uf))
+ if (do_munmap(mm, addr, len, uf, false))
return -ENOMEM;
}

--
1.8.3.1


2018-03-20 21:33:22

by Yang Shi

Subject: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()

Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
manipulating mpx map.

This is an API change only.

Signed-off-by: Yang Shi <[email protected]>
Cc: Ricardo Neri <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
---
arch/x86/mm/mpx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e500949..a180e94 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
* avoid recursion, do_munmap() will check whether it comes
* from one bounds table through VM_MPX flag.
*/
- return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
+ return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
}

static int try_unmap_single_bt(struct mm_struct *mm,
--
1.8.3.1


2018-03-20 21:34:01

by Yang Shi

Subject: [RFC PATCH 5/8] ipc: shm: pass atomic parameter to do_munmap()

It looks safe to unlock/relock mmap_sem in the middle of shmat(), so pass
"false" here.

Signed-off-by: Yang Shi <[email protected]>
---
ipc/shm.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 4643865..1617523 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1537,7 +1537,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
*/
file = vma->vm_file;
size = i_size_read(file_inode(vma->vm_file));
- do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+ do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+ NULL, false);
/*
* We discovered the size of the shm segment, so
* break out of here and fall through to the next
@@ -1564,7 +1565,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
if ((vma->vm_ops == &shm_vm_ops) &&
((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
(vma->vm_file == file))
- do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+ do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+ NULL, false);
vma = next;
}

@@ -1573,7 +1575,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
* given
*/
if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
- do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+ do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+ NULL, false);
retval = 0;
}

--
1.8.3.1


2018-03-20 21:34:15

by Yang Shi

Subject: [RFC PATCH 6/8] fs: proc/vmcore: pass atomic parameter to do_munmap()

Just pass "true" here since vmcore map is not a hot path there is not
too much gain to release mmap_sem in the middle.

Signed-off-by: Yang Shi <[email protected]>
---
fs/proc/vmcore.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af..02683eb 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -388,7 +388,7 @@ static int remap_oldmem_pfn_checked(struct vm_area_struct *vma,
}
return 0;
fail:
- do_munmap(vma->vm_mm, from, len, NULL);
+ do_munmap(vma->vm_mm, from, len, NULL, true);
return -EAGAIN;
}

@@ -481,7 +481,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)

return 0;
fail:
- do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
+ do_munmap(vma->vm_mm, vma->vm_start, len, NULL, true);
return -EAGAIN;
}
#else
--
1.8.3.1


2018-03-20 21:35:39

by Yang Shi

Subject: [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap()

It sounds safe to unlock/relock mmap_sem in mremap, so pass "false" here.

Signed-off-by: Yang Shi <[email protected]>
---
mm/mremap.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470a..5f8fca4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -353,7 +353,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (unlikely(vma->vm_flags & VM_PFNMAP))
untrack_pfn_moved(vma);

- if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
+ if (do_munmap(mm, old_addr, old_len, uf_unmap, false) < 0) {
/* OOM: unable to split vma, just get accounts right */
vm_unacct_memory(excess >> PAGE_SHIFT);
excess = 0;
@@ -462,12 +462,13 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
if (addr + old_len > new_addr && new_addr + new_len > addr)
goto out;

- ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
+ ret = do_munmap(mm, new_addr, new_len, uf_unmap_early, false);
if (ret)
goto out;

if (old_len >= new_len) {
- ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
+ ret = do_munmap(mm, addr+new_len, old_len - new_len,
+ uf_unmap, false);
if (ret && old_len != new_len)
goto out;
old_len = new_len;
@@ -568,7 +569,8 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
* do_munmap does all the needed commit accounting
*/
if (old_len >= new_len) {
- ret = do_munmap(mm, addr+new_len, old_len - new_len, &uf_unmap);
+ ret = do_munmap(mm, addr+new_len, old_len - new_len,
+ &uf_unmap, false);
if (ret && old_len != new_len)
goto out;
ret = addr;
--
1.8.3.1


2018-03-20 21:36:10

by Yang Shi

Subject: [RFC PATCH 4/8] mm: nommu: add atomic parameter to do_munmap()

Just add the atomic parameter to stay consistent with the API change and
pass "true" at the call site. The nommu code doesn't do the mmap_sem
unlock/relock.

Signed-off-by: Yang Shi <[email protected]>
---
mm/nommu.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index ebb6e61..5954c08 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1578,7 +1578,8 @@ static int shrink_vma(struct mm_struct *mm,
* - under NOMMU conditions the chunk to be unmapped must be backed by a single
* VMA, though it need not cover the whole VMA
*/
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf)
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+ struct list_head *uf, bool atomic)
{
struct vm_area_struct *vma;
unsigned long end;
@@ -1644,7 +1645,7 @@ int vm_munmap(unsigned long addr, size_t len)
int ret;

down_write(&mm->mmap_sem);
- ret = do_munmap(mm, addr, len, NULL);
+ ret = do_munmap(mm, addr, len, NULL, true);
up_write(&mm->mmap_sem);
return ret;
}
--
1.8.3.1


2018-03-20 22:37:34

by Thomas Gleixner

Subject: Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()

On Wed, 21 Mar 2018, Yang Shi wrote:

Please CC everyone involved on the full patch set next time. I had to dig
the rest out from my lkml archive to get the context.

> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
> manipulating mpx map.

> This is API change only.

This is wrong. You cannot change the function in one patch and then clean
up the users. That breaks bisectability.

Depending on the number of callers this wants to be a single patch changing
both the function and the callers or you need to create a new function
which has the extra argument and switch all users over to it and then
remove the old function.

> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
> * avoid recursion, do_munmap() will check whether it comes
> * from one bounds table through VM_MPX flag.
> */
> - return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
> + return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);

But looking at the full context this is the wrong approach.

First of all the name of that parameter 'atomic' is completely
misleading. It suggests that this happens in fully atomic context, which is
not the case.

Secondly, conditional locking is frowned upon in general and rightfully so.

So the right thing to do is to leave do_munmap() alone and add a new
function do_munmap_huge() or whatever sensible name you come up with. Then
convert the places which are considered to be safe one by one with a proper
changelog which explains WHY this is safe.

That way you avoid the chasing game of all existing do_munmap() callers and
just use the new 'free in chunks' approach where it is appropriate and
safe. No surprises, no bisectability issues....

While at it please add proper kernel doc documentation to both do_munmap()
and the new function which explains the intricacies.

Thanks,

tglx
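
A rough sketch of the separate helper Thomas suggests might look like the
following (the name do_munmap_huge() and all of the details below are
assumptions for illustration, not code from this series):

	/*
	 * Sketch only: chunked unmap built on top of an unchanged
	 * do_munmap(). Callers opt in to the unlock/relock explicitly
	 * by calling this helper instead of do_munmap().
	 */
	int do_munmap_huge(struct mm_struct *mm, unsigned long start,
			   size_t len, struct list_head *uf)
	{
		size_t step = HPAGE_PUD_SIZE;
		int ret = 0;

		while (len > 0) {
			size_t chunk = min(len, step);

			ret = do_munmap(mm, start, chunk, uf);
			if (ret)
				break;

			start += chunk;
			len -= chunk;

			/* Give mmap_sem waiters a chance between chunks. */
			if (len && rwsem_is_contended(&mm->mmap_sem) &&
			    need_resched()) {
				up_write(&mm->mmap_sem);
				cond_resched();
				down_write(&mm->mmap_sem);
			}
		}
		return ret;
	}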

2018-03-21 13:09:59

by Michal Hocko

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> [<ffffffff81241d87>] __vfs_read+0x37/0x150
> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> [<ffffffff81242266>] vfs_read+0x96/0x130
> [<ffffffff812437b5>] SyS_read+0x55/0xc0
> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).

Yes, this definitely sucks. One way to work that around is to split the
unmap to two phases. One to drop all the pages. That would only need
mmap_sem for read and then tear down the mapping with the mmap_sem for
write. This wouldn't help for parallel mmap_sem writers but those really
need a different approach (e.g. the range locking).
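
A rough sketch of that two-phase shape (only to illustrate the locking; vma
lookup across multiple vmas, concurrent fault interactions and error
handling are all glossed over here):

	struct vm_area_struct *vma;
	LIST_HEAD(uf);
	int ret;

	/* Phase 1: drop the pages; this only needs mmap_sem for read. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, start);
	zap_page_range(vma, start, len);
	up_read(&mm->mmap_sem);

	/* Phase 2: tear down the mapping itself under the write lock. */
	down_write(&mm->mmap_sem);
	ret = do_munmap(mm, start, len, &uf);
	up_write(&mm->mmap_sem);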

> Since unmapping does't require any atomicity, so here unmap large

How come? Could you be more specific why? Once you drop the lock the
address space might change under your feet and you might be unmapping a
completely different vma. That would require userspace doing nasty
things of course (e.g. MAP_FIXED) but I am worried that userspace really
depends on mmap/munmap atomicity these days.
--
Michal Hocko
SUSE Labs

2018-03-21 13:16:13

by Michal Hocko

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0

Slightly off-topic:
Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
of
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;

change after exec or while the pid is already visible in proc? If yes
maybe we can use a dedicated lock.
--
Michal Hocko
SUSE Labs

2018-03-21 16:33:35

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 6:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>> [<ffffffff817159e6>] schedule+0x36/0x80
>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff81717db0>] down_read+0x20/0x40
>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>> [<ffffffff81241d87>] __vfs_read+0x37/0x150
>> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>> [<ffffffff81242266>] vfs_read+0x96/0x130
>> [<ffffffff812437b5>] SyS_read+0x55/0xc0
>> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
> Yes, this definitely sucks. One way to work that around is to split the
> unmap to two phases. One to drop all the pages. That would only need
> mmap_sem for read and then tear down the mapping with the mmap_sem for
> write. This wouldn't help for parallel mmap_sem writers but those really
> need a different approach (e.g. the range locking).

A page fault might sneak in to map a page which has been unmapped before?

Range locking should help a lot when manipulating small sections of a large
mapping in parallel, or multiple small mappings. It may not achieve much for
a single large mapping.

>
>> Since unmapping does't require any atomicity, so here unmap large
> How come? Could you be more specific why? Once you drop the lock the
> address space might change under your feet and you might be unmapping a
> completely different vma. That would require userspace doing nasty
> things of course (e.g. MAP_FIXED) but I am worried that userspace really
> depends on mmap/munmap atomicity these days.

Sorry for the ambiguity. The statement does look misleading. munmap does
need certain atomicity, particularly for the below sequence:

splitting vma
unmap region
free pagetables
free vmas

Otherwise it may run into the below race condition:

     CPU A                        CPU B
   ----------                   ----------
   do_munmap
     zap_pmd_range
       up_write                 do_munmap
                                  down_write
                                  ......
                                  remove_vma_list
                                  up_write
       down_write
   access vmas  <-- use-after-free bug

This is why I do the range unmap in do_munmap() rather than in a deeper
location, i.e. zap_pmd_range(). I elaborated on this in the cover letter.

Thanks,
Yang



2018-03-21 16:52:32

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 6:14 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>> [<ffffffff817159e6>] schedule+0x36/0x80
>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff81717db0>] down_read+0x20/0x40
>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> Slightly off-topic:
> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> of
> arg_start = mm->arg_start;
> arg_end = mm->arg_end;
> env_start = mm->env_start;
> env_end = mm->env_end;
>
> change after exec or while the pid is already visible in proc? If yes
> maybe we can use a dedicated lock.

Actually, Alexey Dobriyan had the same comment when he reviewed my very
first patch (which changed down_read to down_read_killable at that place).

Those 4 values might be changed by prctl_set_mm() and prctl_set_mm_map()
concurrently. They used to use down_read() to protect the change, but that
looks not good enough to protect against concurrent writing. So Mateusz
Guzik's commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
sem for writing to protect against others") changed it to down_write().

It seems mmap_sem can be replaced with a dedicated lock. How about defining
an rwlock in mm_struct to protect that data? I will come up with an RFC
patch for this.
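
A minimal sketch of that idea (the lock name and placement are made up here
just for illustration; prctl_set_mm()/prctl_set_mm_map() would take the
write side instead of mmap_sem):

	/* in struct mm_struct: */
	rwlock_t arg_lock;	/* protects arg_start/arg_end/env_start/env_end */

	/* reader side, e.g. proc_pid_cmdline_read(): */
	read_lock(&mm->arg_lock);
	arg_start = mm->arg_start;
	arg_end   = mm->arg_end;
	env_start = mm->env_start;
	env_end   = mm->env_end;
	read_unlock(&mm->arg_lock);

	/* writer side, e.g. prctl_set_mm(): */
	write_lock(&mm->arg_lock);
	mm->arg_start = new_arg_start;	/* new_* values come from the prctl caller */
	mm->arg_end   = new_arg_end;
	write_unlock(&mm->arg_lock);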

However, this dedicated lock can only work around this specific case. I
believe solving the mmap_sem scalability issue targeted by this patch series
is still our consensus.

Thanks,
Yang





2018-03-21 16:55:16

by Yang Shi

Subject: Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()



On 3/20/18 3:35 PM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Yang Shi wrote:
>
> Please CC everyone involved on the full patch set next time. I had to dig
> the rest out from my lkml archive to get the context.

Sorry for the inconvenience. Will pay attention to it next time.

>
>> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
>> manipulating mpx map.
>> This is API change only.
> This is wrong. You cannot change the function in one patch and then clean
> up the users. That breaks bisectability.
>
> Depending on the number of callers this wants to be a single patch changing
> both the function and the callers or you need to create a new function
> which has the extra argument and switch all users over to it and then
> remove the old function.
>
>> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
>> * avoid recursion, do_munmap() will check whether it comes
>> * from one bounds table through VM_MPX flag.
>> */
>> - return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
>> + return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
> But looking at the full context this is the wrong approach.
>
> First of all the name of that parameter 'atomic' is completely
> misleading. It suggests that this happens in fully atomic context, which is
> not the case.
>
> Secondly, conditional locking is frowned upon in general and rightfully so.
>
> So the right thing to do is to leave do_munmap() alone and add a new
> function do_munmap_huge() or whatever sensible name you come up with. Then
> convert the places which are considered to be safe one by one with a proper
> changelog which explains WHY this is safe.
>
> That way you avoid the chasing game of all existing do_munmap() callers and
> just use the new 'free in chunks' approach where it is appropriate and
> safe. No suprises, no bisectability issues....
>
> While at it please add proper kernel doc documentation to both do_munmap()
> and the new function which explains the intricacies.

Thanks a lot for the suggestion. Absolutely agree. Will fix the problems
in a newer version.

Yang

>
> Thanks,
>
> tglx


2018-03-21 17:18:09

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 9:50 AM, Yang Shi wrote:
>
>
> On 3/21/18 6:14 AM, Michal Hocko wrote:
>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>> 300GB), the below hung task issue may happen occasionally.
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>> ps D 0 14018 1 0x00000004
>>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>> Call Trace:
>>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>> [<ffffffff817159e6>] schedule+0x36/0x80
>>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>> [<ffffffff81717db0>] down_read+0x20/0x40
>>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> Slightly off-topic:
>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>> of
>> arg_start = mm->arg_start;
>> arg_end = mm->arg_end;
>> env_start = mm->env_start;
>> env_end = mm->env_end;
>>
>> change after exec or while the pid is already visible in proc? If yes
>> maybe we can use a dedicated lock.

BTW, this is not the only place that acquires mmap_sem in
proc_pid_cmdline_read(); it also calls access_remote_vm(), which needs to
acquire mmap_sem, so the mmap_sem scalability issue will be hit sooner or
later.

Yang

>
> Actually, Alexey Dobriyan had the same comment when he reviewed my
> very first patch (which changes down_read to down_read_killable at
> that place).
>
> Those 4 values might be changed by prctl_set_mm() and
> prctl_set_mm_map() concurrently. They used to use down_read() to
> protect the change, but it looks not good enough to protect concurrent
> writing. So, Mateusz Guzik's commit
> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for
> writing to protect against others") change it to down_write().
>
> It seems mmap_sem can be replaced to a dedicated lock. How about
> defining a rwlock in mm_struct to protect those data? I will come up
> with a RFC patch for this.
>
> However, this dedicated lock just can work around this specific case.
> I believe solving mmap_sem scalability issue aimed by the patch series
> is still our consensus.
>
> Thanks,
> Yang
>
>
>
>


2018-03-21 17:31:12

by Matthew Wilcox

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> On 3/21/18 6:08 AM, Michal Hocko wrote:
> > Yes, this definitely sucks. One way to work that around is to split the
> > unmap to two phases. One to drop all the pages. That would only need
> > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > write. This wouldn't help for parallel mmap_sem writers but those really
> > need a different approach (e.g. the range locking).
>
> page fault might sneak in to map a page which has been unmapped before?
>
> range locking should help a lot on manipulating small sections of a large
> mapping in parallel or multiple small mappings. It may not achieve too much
> for single large mapping.

I don't think we need range locking. What if we do munmap this way:

Take the mmap_sem for write
Find the VMA
If the VMA is large(*)
    Mark the VMA as deleted
    Drop the mmap_sem
    zap all of the entries
    Take the mmap_sem
Else
    zap all of the entries
Continue finding VMAs
Drop the mmap_sem

Now we need to change everywhere which looks up a VMA to see if it needs
to care that the VMA is deleted (page faults, eg will need to SIGBUS; mmap
does not care; munmap will need to wait for the existing munmap operation
to complete), but it gives us the atomicity, at least on a per-VMA basis.

We could also do:

Take the mmap_sem for write
Mark all VMAs in the range as deleted & modify any partial VMAs
Drop mmap_sem
zap pages from deleted VMAs

That would give us the same atomicity that we have today.

Deleted VMAs would need a pointer to a completion, so operations that
need to wait can queue themselves up. I'd recommend we use the low bit
of vm_file and treat it as a pointer to a struct completion if set.
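
A sketch of that encoding (all helper names below are invented for
illustration; the lifetime of the completion and of the stashed vm_file
reference is deliberately ignored):

	static inline bool vma_deleted(struct vm_area_struct *vma)
	{
		/* struct file is slab-allocated, so bit 0 is normally clear. */
		return (unsigned long)vma->vm_file & 1UL;
	}

	static inline struct completion *vma_completion(struct vm_area_struct *vma)
	{
		return (struct completion *)((unsigned long)vma->vm_file & ~1UL);
	}

	static void mark_vma_deleted(struct vm_area_struct *vma,
				     struct completion *done)
	{
		/* Called with mmap_sem held for write; the real vm_file
		 * reference must be stashed elsewhere before the pointer
		 * is reused. */
		vma->vm_file = (struct file *)((unsigned long)done | 1UL);
	}

	/* A waiter (e.g. munmap or MAP_FIXED hitting a deleted vma) would do: */
	if (vma_deleted(vma))
		wait_for_completion(vma_completion(vma));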


2018-03-21 21:25:24

by Michal Hocko

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed 21-03-18 10:16:41, Yang Shi wrote:
>
>
> On 3/21/18 9:50 AM, Yang Shi wrote:
> >
> >
> > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > 300GB), the below hung task issue may happen occasionally.
> > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > message.
> > > > ps D 0 14018 1 0x00000004
> > > > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > > Call Trace:
> > > > [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > > [<ffffffff817159e6>] schedule+0x36/0x80
> > > > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > > [<ffffffff81717db0>] down_read+0x20/0x40
> > > > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > Slightly off-topic:
> > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > of
> > > arg_start = mm->arg_start;
> > > arg_end = mm->arg_end;
> > > env_start = mm->env_start;
> > > env_end = mm->env_end;
> > >
> > > change after exec or while the pid is already visible in proc? If yes
> > > maybe we can use a dedicated lock.
>
> BTW, this is not the only place to acquire mmap_sem in
> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.

Ohh, absolutely. mmap_sem is unfortunately abused and it would be great
to remove that. munmap should perform much better. How to do that safely
is a different question. I am not yet convinced that tearing down a vma
in batches is safe. The vast majority of time is spent on tearing down
pages and that is quite easy to move out of the write lock. That would
be an improvement already and it should be risk safe. If even that is
not sufficient then using range locking should help a lot. There
shouldn't be really any other address space operations within the range
most of the time so this would be basically non-contended access.
--
Michal Hocko
SUSE Labs

2018-03-21 21:47:28

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking. What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
> If the VMA is large(*)
> Mark the VMA as deleted
> Drop the mmap_sem
> zap all of the entries
> Take the mmap_sem
> Else
> zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap

Marking the vma as deleted sounds good. The problem with my current approach
is that a concurrent page fault may succeed if it accesses a not yet
unmapped section. Marking the vma deleted could tell the page fault that the
vma is not valid anymore, and then return SIGSEGV.

> does not care; munmap will need to wait for the existing munmap operation

Why doesn't mmap care? What about MAP_FIXED? It may fail unexpectedly, right?

Thanks,
Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up. I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.


2018-03-21 22:16:28

by Matthew Wilcox

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > > Yes, this definitely sucks. One way to work that around is to split the
> > > > unmap to two phases. One to drop all the pages. That would only need
> > > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > > need a different approach (e.g. the range locking).
> > > page fault might sneak in to map a page which has been unmapped before?
> > >
> > > range locking should help a lot on manipulating small sections of a large
> > > mapping in parallel or multiple small mappings. It may not achieve too much
> > > for single large mapping.
> > I don't think we need range locking. What if we do munmap this way:
> >
> > Take the mmap_sem for write
> > Find the VMA
> > If the VMA is large(*)
> > Mark the VMA as deleted
> > Drop the mmap_sem
> > zap all of the entries
> > Take the mmap_sem
> > Else
> > zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> >
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
>
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
>
> > does not care; munmap will need to wait for the existing munmap operation
>
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

Oh, I forgot about MAP_FIXED. Yes, MAP_FIXED should wait for the munmap
to finish. But a regular mmap can just pretend that it happened before
the munmap call and avoid the deleted VMAs.


2018-03-21 22:38:42

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 2:23 PM, Michal Hocko wrote:
> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>
>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>
>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>> message.
>>>>> ps D 0 14018 1 0x00000004
>>>>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>> Call Trace:
>>>>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>> [<ffffffff817159e6>] schedule+0x36/0x80
>>>>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>> [<ffffffff81717db0>] down_read+0x20/0x40
>>>>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>> Slightly off-topic:
>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>> of
>>>> arg_start = mm->arg_start;
>>>> arg_end = mm->arg_end;
>>>> env_start = mm->env_start;
>>>> env_end = mm->env_end;
>>>>
>>>> change after exec or while the pid is already visible in proc? If yes
>>>> maybe we can use a dedicated lock.
>> BTW, this is not the only place to acquire mmap_sem in
>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> to remove that. munmap should perform much better. How to do that safely

Yes, agree. We are on the same page.

> is a different question. I am not yet convinced that tearing down a vma
> in batches is safe. The vast majority of time is spent on tearing down

You can try my patches. I ran the full LTP test and multiple kernel builds
in parallel. It survives.

> pages and that is quite easy to move out of the write lock. That would
> be an improvement already and it should be risk safe. If even that is
> not sufficient then using range locking should help a lot. There
> shouldn't be really any other address space operations within the range
> most of the time so this would be basically non-contended access.

It might depend on how the range is defined. Too big a range may lead to
surprisingly more contention, but too small a range may bring too many
lock/unlock operations.

Thanks,
Yang



2018-03-21 22:41:59

by Yang Shi

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 3:15 PM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>>>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>>>> Yes, this definitely sucks. One way to work that around is to split the
>>>>> unmap to two phases. One to drop all the pages. That would only need
>>>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>>>> need a different approach (e.g. the range locking).
>>>> page fault might sneak in to map a page which has been unmapped before?
>>>>
>>>> range locking should help a lot on manipulating small sections of a large
>>>> mapping in parallel or multiple small mappings. It may not achieve too much
>>>> for single large mapping.
>>> I don't think we need range locking. What if we do munmap this way:
>>>
>>> Take the mmap_sem for write
>>> Find the VMA
>>> If the VMA is large(*)
>>> Mark the VMA as deleted
>>> Drop the mmap_sem
>>> zap all of the entries
>>> Take the mmap_sem
>>> Else
>>> zap all of the entries
>>> Continue finding VMAs
>>> Drop the mmap_sem
>>>
>>> Now we need to change everywhere which looks up a VMA to see if it needs
>>> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> Oh, I forgot about MAP_FIXED. Yes, MAP_FIXED should wait for the munmap
> to finish. But a regular mmap can just pretend that it happened before
> the munmap call and avoid the deleted VMAs.

But my tests show a race condition for a reduced-size mmap, which calls
do_munmap(). It may need to wait for the munmap to finish too.

So, in my patches, I just make the do_munmap() called from mmap() hold
mmap_sem the whole time.

Thanks,
Yang



2018-03-21 22:47:58

by Matthew Wilcox

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
>
> > does not care; munmap will need to wait for the existing munmap operation
>
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

The other thing about MAP_FIXED that we'll need to handle is unmapping
conflicts atomically. Say a program has a 200GB mapping and then
mmap(MAP_FIXED) another 200GB region on top of it. So I think page faults
are also going to have to wait for deleted vmas (then retry the fault)
rather than immediately raising SIGSEGV.

2018-03-22 09:12:28

by Michal Hocko

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed 21-03-18 15:36:12, Yang Shi wrote:
>
>
> On 3/21/18 2:23 PM, Michal Hocko wrote:
> > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > >
> > > On 3/21/18 9:50 AM, Yang Shi wrote:
> > > >
> > > > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > > > 300GB), the below hung task issue may happen occasionally.
> > > > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > > > > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > > > message.
> > > > > > ps D 0 14018 1 0x00000004
> > > > > > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > > > > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > > > > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > > > > Call Trace:
> > > > > > [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > > > > [<ffffffff817159e6>] schedule+0x36/0x80
> > > > > > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > > > > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > > > > [<ffffffff81717db0>] down_read+0x20/0x40
> > > > > > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > > > Slightly off-topic:
> > > > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > > > of
> > > > > arg_start = mm->arg_start;
> > > > > arg_end = mm->arg_end;
> > > > > env_start = mm->env_start;
> > > > > env_end = mm->env_end;
> > > > >
> > > > > change after exec or while the pid is already visible in proc? If yes
> > > > > maybe we can use a dedicated lock.
> > > BTW, this is not the only place to acquire mmap_sem in
> > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > to remove that. munmap should perform much better. How to do that safely
>
> Yes, agree. We are on the same page.
>
> > is a different question. I am not yet convinced that tearing down a vma
> > in batches is safe. The vast majority of time is spent on tearing down
>
> You can try my patches. I did full LTP test and running multiple kernel
> build in parallel. It survives.

Which doesn't really mean anything. Those tests are likely to not hit
corner cases where an application silently depends on the mmap locking
and unmap atomicity.

> > pages and that is quite easy to move out of the write lock. That would
> > be an improvement already and it should be risk safe. If even that is
> > not sufficient then using range locking should help a lot. There
> > shouldn't be really any other address space operations within the range
> > most of the time so this would be basically non-contended access.
>
> It might depend on how the range is defined. Too big range may lead to
> surprisingly more contention, but too small range may bring in too much
> lock/unlock operations.

The full vma will have to be range locked. So there is nothing small or large.
--
Michal Hocko
SUSE Labs

2018-03-22 15:44:21

by Laurent Dufour

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 21/03/2018 23:46, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>>
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
>
> The other thing about MAP_FIXED that we'll need to handle is unmapping
> conflicts atomically. Say a program has a 200GB mapping and then
> mmap(MAP_FIXED) another 200GB region on top of it. So I think page faults
> are also going to have to wait for deleted vmas (then retry the fault)
> rather than immediately raising SIGSEGV.

Regarding the page fault, why not rely on the PTE locking?

When munmap() unsets the PTE it will have to hold the PTE lock, so this
will serialize the access.
If the page fault occurs before the mmap(MAP_FIXED), the mapped page will be
removed when mmap(MAP_FIXED) does the cleanup. Fair enough.


2018-03-22 15:44:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> On 21/03/2018 23:46, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> >> Marking vma as deleted sounds good. The problem for my current approach is
> >> the concurrent page fault may succeed if it access the not yet unmapped
> >> section. Marking deleted vma could tell page fault the vma is not valid
> >> anymore, then return SIGSEGV.
> >>
> >>> does not care; munmap will need to wait for the existing munmap operation
> >>
> >> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> >
> > The other thing about MAP_FIXED that we'll need to handle is unmapping
> > conflicts atomically. Say a program has a 200GB mapping and then
> > mmap(MAP_FIXED) another 200GB region on top of it. So I think page faults
> > are also going to have to wait for deleted vmas (then retry the fault)
> > rather than immediately raising SIGSEGV.
>
> Regarding the page fault, why not relying on the PTE locking ?
>
> When munmap() will unset the PTE it will have to held the PTE lock, so this
> will serialize the access.
> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.

The page fault handler will walk the VMA tree to find the correct
VMA and then find that the VMA is marked as deleted. If it assumes
that the VMA has been deleted because of munmap(), then it can raise
SIGSEGV immediately. But if the VMA is marked as deleted because of
mmap(MAP_FIXED), it must wait until the new VMA is in place.

I think I was wrong to describe VMAs as being *deleted*. I think we
instead need the concept of a *locked* VMA that page faults will block on.
Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
an rwsem since the only reason to write-lock the VMA is because it is
being deleted.

2018-03-22 15:56:17

by Laurent Dufour

Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 22/03/2018 16:40, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>> On 21/03/2018 23:46, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>>>> Marking vma as deleted sounds good. The problem for my current approach is
>>>> the concurrent page fault may succeed if it access the not yet unmapped
>>>> section. Marking deleted vma could tell page fault the vma is not valid
>>>> anymore, then return SIGSEGV.
>>>>
>>>>> does not care; munmap will need to wait for the existing munmap operation
>>>>
>>>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
>>>
>>> The other thing about MAP_FIXED that we'll need to handle is unmapping
>>> conflicts atomically. Say a program has a 200GB mapping and then
>>> mmap(MAP_FIXED) another 200GB region on top of it. So I think page faults
>>> are also going to have to wait for deleted vmas (then retry the fault)
>>> rather than immediately raising SIGSEGV.
>>
>> Regarding the page fault, why not relying on the PTE locking ?
>>
>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>> will serialize the access.
>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>
> The page fault handler will walk the VMA tree to find the correct
> VMA and then find that the VMA is marked as deleted. If it assumes
> that the VMA has been deleted because of munmap(), then it can raise
> SIGSEGV immediately. But if the VMA is marked as deleted because of
> mmap(MAP_FIXED), it must wait until the new VMA is in place.

I'm wondering if such complexity is required.
If the user space process tries to access a page being overwritten through
mmap(MAP_FIXED) by another thread, there is no guarantee whether it will
see the *old* page or the *new* one.
I'd think it is up to the user process to handle that concurrency.
What needs to be guaranteed is that once mmap(MAP_FIXED) returns, the old pages
are no longer there, which is ensured through the mmap_sem and PTE locking.

> I think I was wrong to describe VMAs as being *deleted*. I think we
> instead need the concept of a *locked* VMA that page faults will block on.
> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> an rwsem since the only reason to write-lock the VMA is because it is
> being deleted.

Such a lock would only make sense in the case of mmap(MAP_FIXED), since when
the VMA is removed there is no need to wait. Isn't it?


2018-03-22 16:07:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
> On 22/03/2018 16:40, Matthew Wilcox wrote:
> > On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> >> Regarding the page fault, why not relying on the PTE locking ?
> >>
> >> When munmap() will unset the PTE it will have to held the PTE lock, so this
> >> will serialize the access.
> >> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> >> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
> >
> > The page fault handler will walk the VMA tree to find the correct
> > VMA and then find that the VMA is marked as deleted. If it assumes
> > that the VMA has been deleted because of munmap(), then it can raise
> > SIGSEGV immediately. But if the VMA is marked as deleted because of
> > mmap(MAP_FIXED), it must wait until the new VMA is in place.
>
> I'm wondering if such a complexity is required.
> If the user space process try to access the page being overwritten through
> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
> manipulate the *old* page or *new* one.

Right; but it must return one or the other, it can't segfault.

> I'd think this is up to the user process to handle that concurrency.
> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
> are no more there, which is done through the mmap_sem and PTE locking.

Yes, and allowing the fault handler to return the *old* page risks the
old page being reinserted into the page tables after the unmapping task
has done its work.

It's *really* rare to page-fault on a VMA which is in the middle of
being replaced. Why are you trying to optimise it?

> > I think I was wrong to describe VMAs as being *deleted*. I think we
> > instead need the concept of a *locked* VMA that page faults will block on.
> > Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> > an rwsem since the only reason to write-lock the VMA is because it is
> > being deleted.
>
> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
> the VMA is removed there is no need to wait. Isn't it ?

I can't think of another reason. I suppose we could mark the VMA as
locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
early. But I'm not sure that optimising for SIGSEGVs is a worthwhile
use of our time. Just always have the pagefault sleep for a deleted VMA.

2018-03-22 16:08:12

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/22/18 2:10 AM, Michal Hocko wrote:
> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>
>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>>>> message.
>>>>>>> ps D 0 14018 1 0x00000004
>>>>>>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>>>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>>>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>>>> Call Trace:
>>>>>>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>>>> [<ffffffff817159e6>] schedule+0x36/0x80
>>>>>>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>>>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>>>> [<ffffffff81717db0>] down_read+0x20/0x40
>>>>>>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>>>> Slightly off-topic:
>>>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>>>> of
>>>>>> arg_start = mm->arg_start;
>>>>>> arg_end = mm->arg_end;
>>>>>> env_start = mm->env_start;
>>>>>> env_end = mm->env_end;
>>>>>>
>>>>>> change after exec or while the pid is already visible in proc? If yes
>>>>>> maybe we can use a dedicated lock.
>>>> BTW, this is not the only place to acquire mmap_sem in
>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>> to remove that. munmap should perform much better. How to do that safely
>> Yes, agree. We are on the same page.
>>
>>> is a different question. I am not yet convinced that tearing down a vma
>>> in batches is safe. The vast majority of time is spent on tearing down
>> You can try my patches. I did full LTP test and running multiple kernel
>> build in parallel. It survives.
> Which doesn't really mean anything. Those tests are likely to not hit
> corner cases where an application silently depends on the mmap locking
> and unmap atomicity.

They definitely can't cover all corner cases. But they do give us
some confidence that most of it works. The mmap stress tests in
LTP did discover some race conditions when I tried different approaches.

>
>>> pages and that is quite easy to move out of the write lock. That would
>>> be an improvement already and it should be risk safe. If even that is
>>> not sufficient then using range locking should help a lot. There
>>> shouldn't be really any other address space operations within the range
>>> most of the time so this would be basically non-contended access.
>> It might depend on how the range is defined. Too big range may lead to
>> surprisingly more contention, but too small range may bring in too much
>> lock/unlock operations.
> The full vma will have to be range locked. So there is nothing small or large.

It doesn't sound helpful for the single large vma case: with just one range
lock covering the whole vma, it is effectively equivalent to mmap_sem.

Yang



2018-03-22 16:14:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu 22-03-18 09:06:14, Yang Shi wrote:
>
>
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > >
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
[...]
> > > > pages and that is quite easy to move out of the write lock. That would
> > > > be an improvement already and it should be risk safe. If even that is
> > > > not sufficient then using range locking should help a lot. There
> > > > shouldn't be really any other address space operations within the range
> > > > most of the time so this would be basically non-contended access.
> > > It might depend on how the range is defined. Too big range may lead to
> > > surprisingly more contention, but too small range may bring in too much
> > > lock/unlock operations.
> > The full vma will have to be range locked. So there is nothing small or large.
>
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

This is not how the range locking works. If we have a range lock per mm
then exclusive ranges are not contending. So if you are unmapping one
vma and want to create a new mapping or fault into a different range
then you are basically lockless.

--
Michal Hocko
SUSE Labs
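
A minimal userspace model of such a per-mm range lock is sketched below; the
structures and function names are invented for illustration. Lockers of
non-overlapping [start, end) ranges go straight through and only overlapping
lockers serialize, so an munmap of one vma and a fault in a disjoint range
could proceed in parallel.

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct locked_range {                   /* one currently locked [start, end) */
        unsigned long start, end;
        struct locked_range *next;
};

struct range_lock {                     /* toy per-"mm" range lock */
        pthread_mutex_t lock;
        pthread_cond_t released;
        struct locked_range *held;
};

static struct range_lock mm_range_lock = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL
};

static bool overlaps(struct locked_range *r, unsigned long s, unsigned long e)
{
        return r->start < e && s < r->end;
}

static void range_lock(struct range_lock *rl, unsigned long s, unsigned long e)
{
        struct locked_range *r = malloc(sizeof(*r));

        if (!r)
                abort();
        pthread_mutex_lock(&rl->lock);
again:
        for (struct locked_range *cur = rl->held; cur; cur = cur->next) {
                if (overlaps(cur, s, e)) {
                        /* Conflicting range: sleep until something is released. */
                        pthread_cond_wait(&rl->released, &rl->lock);
                        goto again;
                }
        }
        r->start = s;
        r->end = e;
        r->next = rl->held;
        rl->held = r;
        pthread_mutex_unlock(&rl->lock);
}

static void range_unlock(struct range_lock *rl, unsigned long s, unsigned long e)
{
        pthread_mutex_lock(&rl->lock);
        for (struct locked_range **p = &rl->held; *p; p = &(*p)->next) {
                if ((*p)->start == s && (*p)->end == e) {
                        struct locked_range *victim = *p;

                        *p = victim->next;
                        free(victim);
                        break;
                }
        }
        pthread_cond_broadcast(&rl->released);
        pthread_mutex_unlock(&rl->lock);
}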

2018-03-22 16:15:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
> > > > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > > > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > > > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > > > to remove that. munmap should perform much better. How to do that safely
> > The full vma will have to be range locked. So there is nothing small or large.
>
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

But splitting mmap_sem into pieces is beneficial for this case. Imagine
we have a spinlock / rwlock to protect the rbtree / arg_start / arg_end
/ ... and then each VMA has a rwsem (or equivalent). access_remote_vm()
would walk the tree and grab the VMA's rwsem for read while reading
out the arguments. The munmap code would have a completely different
VMA write-locked.
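
The lookup pattern described above can be modelled in userspace with a pthread
rwlock standing in for the per-VMA rwsem and a plain mutex standing in for the
tree lock; all names below are illustrative, and a real implementation would
not want to sleep on a VMA lock while still holding the tree lock.

#include <pthread.h>
#include <stddef.h>

struct vma {                            /* toy VMA with its own rwlock */
        unsigned long start, end;
        pthread_rwlock_t lock;
        struct vma *next;               /* stand-in for the rbtree */
};

struct mm {
        pthread_mutex_t tree_lock;      /* protects the VMA list itself */
        struct vma *vmas;
};

/* Find the VMA covering addr and return it read-locked, tree lock dropped. */
static struct vma *find_vma_read_locked(struct mm *mm, unsigned long addr)
{
        struct vma *v;

        pthread_mutex_lock(&mm->tree_lock);
        for (v = mm->vmas; v; v = v->next)
                if (v->start <= addr && addr < v->end)
                        break;
        if (v)                          /* e.g. the access_remote_vm() path */
                pthread_rwlock_rdlock(&v->lock);
        pthread_mutex_unlock(&mm->tree_lock);
        return v;
}

/* An unmapper write-locks only its own VMA; readers of other VMAs never wait. */
static void unmap_vma(struct vma *v)
{
        pthread_rwlock_wrlock(&v->lock);
        /* ... tear down the mapping, then unlink it from the tree ... */
        pthread_rwlock_unlock(&v->lock);
}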


2018-03-22 16:20:26

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 22/03/2018 17:05, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted. If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately. But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
>
> Right; but it must return one or the other, it can't segfault.

Good point, I missed that...

>
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
>
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.

The PTE locking should prevent that.

> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced. Why are you trying to optimise it?

I was not trying to optimize it, but to avoid waiting in the page fault handler.
This could become tricky in the case where the VMA is removed once mmap(MAP_FIXED)
is done and before the waiting page fault is woken up. This means that the
removed VMA structure will have to remain until all the waiters are woken up,
which implies a ref_count or similar.

>
>>> I think I was wrong to describe VMAs as being *deleted*. I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>>
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
>
> I can't think of another reason. I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early. But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time. Just always have the pagefault sleep for a deleted VMA.


2018-03-22 16:30:49

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On 22/03/2018 17:13, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
>> On 3/22/18 2:10 AM, Michal Hocko wrote:
>>> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>>>> to remove that. munmap should perform much better. How to do that safely
>>> The full vma will have to be range locked. So there is nothing small or large.
>>
>> It sounds not helpful to a single large vma case since just one range lock
>> for the vma, it sounds equal to mmap_sem.
>
> But splitting mmap_sem into pieces is beneficial for this case. Imagine
> we have a spinlock / rwlock to protect the rbtree

Which is more or less what I'm proposing in the speculative page fault series:
https://lkml.org/lkml/2018/3/13/1158

This being said, having a per-VMA lock could lead to tricky deadlock cases
when merging of multiple VMAs happens in parallel, since multiple VMAs will have
to be locked at the same time; grabbing those locks in a consistent order will
be required.

> ... / arg_start / arg_end
> / ... and then each VMA has a rwsem (or equivalent). access_remote_vm()
> would walk the tree and grab the VMA's rwsem for read while reading
> out the arguments. The munmap code would have a completely different
> VMA write-locked.
>


2018-03-22 16:36:57

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

From: Laurent Dufour
> Sent: 22 March 2018 16:29
...
> This being said, having a per VMA lock could lead to tricky dead lock case,
> when merging multiple VMA happens in parallel since multiple VMA will have to
> be locked at the same time, grabbing those lock in a fine order will be required.

You could have a global lock and per-VMA locks.
Anything that only accesses one VMA could release the global lock after
acquiring the per-VMA lock.
If code needs multiple VMAs 'locked' it can lock and unlock each VMA
in turn, then keep the global lock held.

David
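
A sketch of that locking pattern with invented userspace names: the single-VMA
path hands over from the global lock to the per-VMA lock, while the multi-VMA
path (e.g. a merge) drains each per-VMA lock in turn and then works under the
global lock alone, so it never holds several VMA locks at once.

#include <pthread.h>

struct vma {
        pthread_mutex_t lock;
        /* ... */
};

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fast path: operations that touch a single VMA. */
static void single_vma_op(struct vma *v)
{
        pthread_mutex_lock(&global_lock);
        pthread_mutex_lock(&v->lock);           /* pin this VMA */
        pthread_mutex_unlock(&global_lock);     /* hand-over: others proceed */

        /* ... fault, access_remote_vm(), etc. against this VMA only ... */

        pthread_mutex_unlock(&v->lock);
}

/* Slow path: operations spanning several VMAs, e.g. merging. */
static void multi_vma_op(struct vma **vmas, int n)
{
        pthread_mutex_lock(&global_lock);
        /* Drain each per-VMA lock in turn; no need to hold them all at once. */
        for (int i = 0; i < n; i++) {
                pthread_mutex_lock(&vmas[i]->lock);
                pthread_mutex_unlock(&vmas[i]->lock);
        }
        /* ... modify all the VMAs while still holding the global lock ... */
        pthread_mutex_unlock(&global_lock);
}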

2018-03-22 16:48:45

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/22/18 9:18 AM, Laurent Dufour wrote:
>
> On 22/03/2018 17:05, Matthew Wilcox wrote:
>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>
>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>> will serialize the access.
>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>> The page fault handler will walk the VMA tree to find the correct
>>>> VMA and then find that the VMA is marked as deleted. If it assumes
>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>> SIGSEGV immediately. But if the VMA is marked as deleted because of
>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>> I'm wondering if such a complexity is required.
>>> If the user space process try to access the page being overwritten through
>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>> manipulate the *old* page or *new* one.
>> Right; but it must return one or the other, it can't segfault.
> Good point, I missed that...
>
>>> I'd think this is up to the user process to handle that concurrency.
>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>> are no more there, which is done through the mmap_sem and PTE locking.
>> Yes, and allowing the fault handler to return the *old* page risks the
>> old page being reinserted into the page tables after the unmapping task
>> has done its work.
> The PTE locking should prevent that.
>
>> It's *really* rare to page-fault on a VMA which is in the middle of
>> being replaced. Why are you trying to optimise it?
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

We may not need ref_count. After removing "locked-for-deletion" vmas
when mmap(MAP_FIXED) is done, just wake up the page fault to re-look up the
vma; it will then find the new vma installed by mmap(MAP_FIXED), right?

I'm not sure whether a completion can do this since I'm not quite
familiar with it :-(

Yang

>
>>>> I think I was wrong to describe VMAs as being *deleted*. I think we
>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>> being deleted.
>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>> the VMA is removed there is no need to wait. Isn't it ?
>> I can't think of another reason. I suppose we could mark the VMA as
>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>> early. But I'm not sure that optimising for SIGSEGVs is a worthwhile
>> use of our time. Just always have the pagefault sleep for a deleted VMA.





2018-03-22 16:50:48

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/22/18 9:05 AM, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted. If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately. But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
> Right; but it must return one or the other, it can't segfault.
>
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.
>
> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced. Why are you trying to optimise it?
>
>>> I think I was wrong to describe VMAs as being *deleted*. I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
> I can't think of another reason. I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early. But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time. Just always have the pagefault sleep for a deleted VMA.

It sounds worthwhile to me. If we have every page fault sleep until the vma
deletion is done, that sounds equal to waiting for mmap_sem all the time, right?

Yang



2018-03-22 16:52:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu, Mar 22, 2018 at 05:18:55PM +0100, Laurent Dufour wrote:
> > It's *really* rare to page-fault on a VMA which is in the middle of
> > being replaced. Why are you trying to optimise it?
>
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

Yes, that's why we don't want an actual rwsem. What I had in mind was
a struct completion on the stack of the caller of munmap(), and a pointer
to it from the vma. The page fault handler grabs the VMA tree lock, walks
the VMA tree and finds a VMA. If the VMA is marked as locked, it waits
for the completion. Upon wakeup *it does not look at the VMA*, instead it
restarts the page fault.
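
Below is a userspace model of that scheme. To keep the object-lifetime question
raised earlier in the thread out of the example, it uses one mm-wide condition
variable instead of a per-munmap completion on the caller's stack; the structures
and names are otherwise invented. The key property is preserved: a faulting
thread that finds a locked VMA sleeps, and after waking it redoes the lookup
instead of ever touching the stale VMA again.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct vma {
        unsigned long start, end;
        bool locked;                    /* being unmapped or replaced */
        struct vma *next;
};

struct mm {
        pthread_mutex_t tree_lock;
        pthread_cond_t vma_unlocked;
        struct vma *vmas;
};

static struct vma *find_vma(struct mm *mm, unsigned long addr)
{
        for (struct vma *v = mm->vmas; v; v = v->next)
                if (v->start <= addr && addr < v->end)
                        return v;
        return NULL;
}

/* Fault path: never look at a locked VMA after waking up; just retry. */
static struct vma *fault_lookup(struct mm *mm, unsigned long addr)
{
        pthread_mutex_lock(&mm->tree_lock);
        for (;;) {
                struct vma *v = find_vma(mm, addr);

                if (!v || !v->locked) {
                        pthread_mutex_unlock(&mm->tree_lock);
                        return v;       /* NULL means a genuine SIGSEGV */
                }
                /* VMA is being torn down: sleep, then redo the lookup. */
                pthread_cond_wait(&mm->vma_unlocked, &mm->tree_lock);
        }
}

/* Unmap path: mark locked, zap outside the lock, then remove and wake. */
static void unmap_vma(struct mm *mm, struct vma *v)
{
        pthread_mutex_lock(&mm->tree_lock);
        v->locked = true;
        pthread_mutex_unlock(&mm->tree_lock);

        /* ... zap page tables and free pages: the expensive part ... */

        pthread_mutex_lock(&mm->tree_lock);
        for (struct vma **p = &mm->vmas; *p; p = &(*p)->next) {
                if (*p == v) {
                        *p = v->next;
                        break;
                }
        }
        pthread_cond_broadcast(&mm->vma_unlocked);
        pthread_mutex_unlock(&mm->tree_lock);
        /* v can now be freed; no waiter will dereference it again. */
}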


2018-03-22 17:35:39

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking. What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
> If the VMA is large(*)
>     Mark the VMA as deleted
>     Drop the mmap_sem
>     zap all of the entries
>     Take the mmap_sem
> Else
>     zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation

The other question is why munmap needs to wait. If the other parallel munmap
finds the vma has been marked as "deleted", it just needs to return 0, as it
doesn't find the vma.

Currently do_munmap() does the below logic:
    vma = find_vma(mm, start);
    if (!vma)
        return 0;

Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up. I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.


2018-03-22 18:49:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Thu, Mar 22, 2018 at 10:34:08AM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > Take the mmap_sem for write
> > Find the VMA
> > If the VMA is large(*)
> >     Mark the VMA as deleted
> >     Drop the mmap_sem
> >     zap all of the entries
> >     Take the mmap_sem
> > Else
> >     zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> >
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> > does not care; munmap will need to wait for the existing munmap operation
>
> The other question is why munmap need wait? If the other parallel munmap
> finds the vma has been marked as "deleted", it just need return 0 as it
> doesn't find vma.
>
> Currently do_munmap() does the below logic:
> vma = find_vma(mm, start);
> if (!vma)
> return 0;

At the point a munmap() returns, the area should be available for reuse.
If another thread is still unmapping, it won't actually be available yet.
We should wait for the other thread to finish before returning.

There may be some other corner cases; like what to do if there's a partial
unmap of a VMA, or a MAP_FIXED over part of an existing VMA. It's going
to be safer to just wait for any conflicts to die down. It's not like
real programs call munmap() on conflicting areas at the same time.

2018-03-23 13:06:28

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On 22/03/2018 17:46, Yang Shi wrote:
>
>
> On 3/22/18 9:18 AM, Laurent Dufour wrote:
>>
>> On 22/03/2018 17:05, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>>
>>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>>> will serialize the access.
>>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>>> The page fault handler will walk the VMA tree to find the correct
>>>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>>> I'm wondering if such a complexity is required.
>>>> If the user space process try to access the page being overwritten through
>>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>>> manipulate the *old* page or *new* one.
>>> Right; but it must return one or the other, it can't segfault.
>> Good point, I missed that...
>>
>>>> I'd think this is up to the user process to handle that concurrency.
>>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>>> are no more there, which is done through the mmap_sem and PTE locking.
>>> Yes, and allowing the fault handler to return the *old* page risks the
>>> old page being reinserted into the page tables after the unmapping task
>>> has done its work.
>> The PTE locking should prevent that.
>>
>>> It's *really* rare to page-fault on a VMA which is in the middle of
>>> being replaced.  Why are you trying to optimise it?
>> I was not trying to optimize it, but to not wait in the page fault handler.
>> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
>> done and before the waiting page fault got woken up. This means that the
>> removed VMA structure will have to remain until all the waiters are woken up
>> which implies ref_count or similar.
>
> We may not need ref_count. After removing "locked-for-deletion" vmas when
> mmap(MAP_FIXED) is done, just wake up page fault to re-lookup vma, then it will
> find the new vma installed by mmap(MAP_FIXED), right?

I do agree, as long as waking up does not require access to the VMA.

> I'm not sure if completion can do this or not since I'm not quite familiar with
> it :-(

I don't know either :/

Laurent.

> Yang
>
>>
>>>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>>> being deleted.
>>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>>> the VMA is removed there is no need to wait. Isn't it ?
>>> I can't think of another reason.  I suppose we could mark the VMA as
>>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>>> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
>>> use of our time.  Just always have the pagefault sleep for a deleted VMA.
>
>
>
>


2018-03-24 18:27:16

by Jerome Glisse

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

On Wed, Mar 21, 2018 at 10:29:32AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > Yes, this definitely sucks. One way to work that around is to split the
> > > unmap to two phases. One to drop all the pages. That would only need
> > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > need a different approach (e.g. the range locking).
> >
> > page fault might sneak in to map a page which has been unmapped before?
> >
> > range locking should help a lot on manipulating small sections of a large
> > mapping in parallel or multiple small mappings. It may not achieve too much
> > for single large mapping.
>
> I don't think we need range locking. What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
> If the VMA is large(*)
>     Mark the VMA as deleted
>     Drop the mmap_sem
>     zap all of the entries
>     Take the mmap_sem
> Else
>     zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation
> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>

What about something that should fix all issues:
    struct list_head to_free_puds;
    ...
    down_write(&mm->mmap_sem);
    ...
    unmap_vmas(&tlb, vma, start, end, &to_free_puds);
    arch_unmap(mm, vma, start, end);
    /* Fix up all other VM information */
    remove_vma_list(mm, vma);
    ...
    up_write(&mm->mmap_sem);
    ...
    zap_pud_list(rss_update_info, to_free_puds);
    update_rss(rss_update_info)

We collect the PUDs that need to be freed/zapped: we update each page table
PUD entry to pud_none under the write sem and the CPU page table lock, and
add the PUD to the list that needs zapping. We only collect fully covered
PUDs, and do the usual business for partially covered PUDs.

Everything behaves as today except that we do not free memory. Care
must be taken with the anon vma, and we should probably not free the
vma struct either before the zap, but all the other mm structures can be
updated. The rss counter would also need to be updated after zapping the PUDs.

We would need special code to zap the PUD list; there is no need to take locks
or do special arch TLB flushing, ptep_get_clear, ... when walking down
those PUDs. So it should scale a lot better too.

Did I miss something?

Cheers,
Jérôme
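
The core trick here, detaching everything while the lock is held and doing the
expensive freeing only after dropping it, can be modelled in userspace with an
ordinary list; the names below are invented, and real PUD zapping, TLB flushing
and rss accounting are of course far more involved.

#include <pthread.h>
#include <stdlib.h>

struct chunk {                          /* stand-in for everything under one PUD */
        struct chunk *next;
        /* ... pages to free ... */
};

struct as_model {
        pthread_mutex_t sem;            /* stand-in for mmap_sem held for write */
        struct chunk *chunks;
};

/*
 * Under the lock: only detach the fully covered chunks onto a private
 * to_free list (the analogue of clearing the PUD entries to pud_none and
 * remembering them).  Nothing expensive happens while the lock is held.
 */
static struct chunk *detach_all(struct as_model *as)
{
        struct chunk *to_free;

        pthread_mutex_lock(&as->sem);
        to_free = as->chunks;
        as->chunks = NULL;
        pthread_mutex_unlock(&as->sem);
        return to_free;
}

/* After the lock is dropped: walk the detached list and free it at leisure. */
static void zap_chunk_list(struct chunk *to_free)
{
        while (to_free) {
                struct chunk *next = to_free->next;

                free(to_free);
                to_free = next;
        }
}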