Linus recently pointed out[1] some of the unnecessary work
being done with the mmap_sem held. This patchset is a very initial
approach to reducing some of the contention on this lock and to moving
work outside of the critical region.
Patch 1 adds a simple helper function.
Patch 2 moves out some trivial setup logic in mlock related calls.
Patch 3 allows managing new vmas for vdsos without requiring the
mmap_sem. While it's true that there are many other scenarios where
this can be done, few are as straightforward as this one in the
sense that we *always* end up allocating memory anyway, so there's really
no tradeoff. For this reason I wanted to get this patch out in the open.
There are a few points to consider when preallocating vmas at the start
of system calls, such as how many new vmas are needed (ie: callers of
split_vma can end up calling it twice, depending on the mm state at that
point) or the probability that we end up merging an existing vma instead
of having to create a new one, as in the case of brk or copy_vma. In both
cases the overhead of allocating and freeing memory at every syscall's
invocation might outweigh what we gain by not holding the sem.
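For reference, the per-architecture pattern that patch 3 introduces boils
down to roughly the following (a simplified sketch only; 'flags' and 'pages'
stand in for whatever each caller actually passes, and the error handling
is condensed):

        struct vm_area_struct *vma;

        /* allocate the new vma before taking the lock */
        vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
        if (unlikely(!vma))
                return -ENOMEM;

        down_write(&mm->mmap_sem);
        addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
        if (IS_ERR_VALUE(addr)) {
                ret = addr;
                goto up_fail;
        }

        ret = install_special_mapping(mm, addr, PAGE_SIZE, flags, pages, &vma);
        if (ret)
                goto up_fail;

        up_write(&mm->mmap_sem);
        return 0;

up_fail:
        up_write(&mm->mmap_sem);
        /* the vma was never inserted, so it is ours to free */
        kmem_cache_free(vm_area_cachep, vma);
        return ret;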
[1] https://lkml.org/lkml/2013/10/9/665
Thanks!
Davidlohr Bueso (3):
mm: add mlock_future_check helper
mm/mlock: prepare params outside critical region
vdso: preallocate new vmas
arch/arm/kernel/process.c | 22 +++++++++----
arch/arm64/kernel/vdso.c | 21 ++++++++++---
arch/hexagon/kernel/vdso.c | 16 +++++++---
arch/mips/kernel/vdso.c | 10 +++++-
arch/powerpc/kernel/vdso.c | 11 +++++--
arch/s390/kernel/vdso.c | 19 +++++++++---
arch/sh/kernel/vsyscall/vsyscall.c | 11 ++++++-
arch/tile/kernel/vdso.c | 13 ++++++--
arch/um/kernel/skas/mmu.c | 16 +++++++---
arch/unicore32/kernel/process.c | 17 +++++++---
arch/x86/um/vdso/vma.c | 18 ++++++++---
arch/x86/vdso/vdso32-setup.c | 16 +++++++++-
arch/x86/vdso/vma.c | 10 +++++-
include/linux/mm.h | 3 +-
kernel/events/uprobes.c | 14 +++++++--
mm/mlock.c | 18 ++++++-----
mm/mmap.c | 63 ++++++++++++++++++--------------------
17 files changed, 213 insertions(+), 85 deletions(-)
--
1.8.1.4
Both do_brk and do_mmap_pgoff verify that we are actually
capable of locking future pages if the corresponding
VM_LOCKED flags are used. Encapsulate this logic into
a single mlock_future_check() helper function.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Michel Lespinasse <[email protected]>
---
mm/mmap.c | 45 +++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 22 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d54851..6a7824d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1192,6 +1192,24 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
return hint;
}
+static inline int mlock_future_check(struct mm_struct *mm,
+ unsigned long flags,
+ unsigned long len)
+{
+ unsigned long locked, lock_limit;
+
+ /* mlock MCL_FUTURE? */
+ if (flags & VM_LOCKED) {
+ locked = len >> PAGE_SHIFT;
+ locked += mm->locked_vm;
+ lock_limit = rlimit(RLIMIT_MEMLOCK);
+ lock_limit >>= PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ return -EAGAIN;
+ }
+ return 0;
+}
+
/*
* The caller must hold down_write(&current->mm->mmap_sem).
*/
@@ -1253,16 +1271,8 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (!can_do_mlock())
return -EPERM;
- /* mlock MCL_FUTURE? */
- if (vm_flags & VM_LOCKED) {
- unsigned long locked, lock_limit;
- locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
- lock_limit = rlimit(RLIMIT_MEMLOCK);
- lock_limit >>= PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- return -EAGAIN;
- }
+ if (mlock_future_check(mm, vm_flags, len))
+ return -EAGAIN;
if (file) {
struct inode *inode = file_inode(file);
@@ -2593,18 +2603,9 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
if (error & ~PAGE_MASK)
return error;
- /*
- * mlock MCL_FUTURE?
- */
- if (mm->def_flags & VM_LOCKED) {
- unsigned long locked, lock_limit;
- locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
- lock_limit = rlimit(RLIMIT_MEMLOCK);
- lock_limit >>= PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- return -EAGAIN;
- }
+ error = mlock_future_check(mm, mm->def_flags, len);
+ if (error)
+ return error;
/*
* mm->mmap_sem is required to protect against another thread
--
1.8.1.4
All mlock related syscalls prepare lock limits, lengths and
start parameters with the mmap_sem held. Move this logic
outside of the critical region. For the case of mlock, the amount
already locked (mm->locked_vm) is still added to the total with
the rwsem taken.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Vlastimil Babka <[email protected]>
---
mm/mlock.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index d480cd6..aa7de13 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -689,19 +689,21 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
lru_add_drain_all(); /* flush pagevec */
- down_write(&current->mm->mmap_sem);
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
- locked = len >> PAGE_SHIFT;
- locked += current->mm->locked_vm;
-
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
+ locked = len >> PAGE_SHIFT;
+
+ down_write(&current->mm->mmap_sem);
+
+ locked += current->mm->locked_vm;
/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
error = do_mlock(start, len, 1);
+
up_write(&current->mm->mmap_sem);
if (!error)
error = __mm_populate(start, len, 0);
@@ -712,11 +714,13 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;
- down_write(&current->mm->mmap_sem);
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
+
+ down_write(&current->mm->mmap_sem);
ret = do_mlock(start, len, 0);
up_write(&current->mm->mmap_sem);
+
return ret;
}
@@ -761,12 +765,12 @@ SYSCALL_DEFINE1(mlockall, int, flags)
if (flags & MCL_CURRENT)
lru_add_drain_all(); /* flush pagevec */
- down_write(&current->mm->mmap_sem);
-
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
ret = -ENOMEM;
+ down_write(&current->mm->mmap_sem);
+
if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
capable(CAP_IPC_LOCK))
ret = do_mlockall(flags);
--
1.8.1.4
With the exception of um and tile, architectures that use
the install_special_mapping() function, when setting up a
new vma at program startup, do so with the mmap_sem lock
held for writing. Unless there's an error, this process
ends up allocating a new vma through kmem_cache_zalloc,
and inserting it in the task's address space.
This patch moves the vma's space allocation outside of
install_special_mapping(), and leaves it to the callers to do so
explicitly, without depending on mmap_sem. The same goes for
freeing: if the new vma isn't used (and thus the process fails
at some point), it's the caller's responsibility to free it -
currently this is done inside install_special_mapping.
Furthermore, uprobes behaves exactly the same and thus now the
xol_add_vma() function also preallocates the new vma.
While the changes to x86 vdso handling have been tested on both
large and small 64-bit systems, the rest of the architectures
are totally *untested*. Note that all changes are quite similar
from architecture to architecture.
This patch, when tested on a 64-core, 256 GB NUMA server, benefited
several aim7 workloads: long +27% throughput with over 1500 users;
compute +6.5% with over 1000 users; fserver +83% for small numbers
of users (10-100 range) and +9% beyond that; and new_fserver, showing
similar behavior, got a +67% boost with 100 users and an avg of +8%
when more users were added.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/arm/kernel/process.c | 22 ++++++++++++++++------
arch/arm64/kernel/vdso.c | 21 +++++++++++++++++----
arch/hexagon/kernel/vdso.c | 16 ++++++++++++----
arch/mips/kernel/vdso.c | 10 +++++++++-
arch/powerpc/kernel/vdso.c | 11 ++++++++---
arch/s390/kernel/vdso.c | 19 +++++++++++++++----
arch/sh/kernel/vsyscall/vsyscall.c | 11 ++++++++++-
arch/tile/kernel/vdso.c | 13 ++++++++++---
arch/um/kernel/skas/mmu.c | 16 +++++++++++-----
arch/unicore32/kernel/process.c | 17 ++++++++++++-----
arch/x86/um/vdso/vma.c | 18 ++++++++++++++----
arch/x86/vdso/vdso32-setup.c | 16 +++++++++++++++-
arch/x86/vdso/vma.c | 10 +++++++++-
include/linux/mm.h | 3 ++-
kernel/events/uprobes.c | 14 ++++++++++++--
mm/mmap.c | 18 +++++++-----------
16 files changed, 179 insertions(+), 56 deletions(-)
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 94f6b05..5637c92 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -13,6 +13,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
@@ -480,6 +481,7 @@ extern struct page *get_signal_page(void);
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
@@ -488,6 +490,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (!signal_page)
return -ENOMEM;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
@@ -496,14 +502,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
}
ret = install_special_mapping(mm, addr, PAGE_SIZE,
- VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- &signal_page);
-
- if (ret == 0)
- mm->context.sigpage = addr;
+ VM_READ | VM_EXEC | VM_MAYREAD |
+ VM_MAYWRITE | VM_MAYEXEC,
+ &signal_page, &vma);
+ if (ret)
+ goto up_fail;
- up_fail:
+ mm->context.sigpage = addr;
+ up_write(&mm->mmap_sem);
+ return 0;
+up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
#endif
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 6a389dc..519a44c 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -83,20 +83,26 @@ arch_initcall(alloc_vectors_page);
int aarch32_setup_vectors_page(struct linux_binprm *bprm, int uses_interp)
{
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long addr = AARCH32_VECTORS_BASE;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
current->mm->context.vdso = (void *)addr;
/* Map vectors page at the high address. */
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC,
- vectors_page);
+ vectors_page, &vma);
up_write(&mm->mmap_sem);
-
+ if (ret)
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
#endif /* CONFIG_COMPAT */
@@ -152,10 +158,15 @@ arch_initcall(vdso_init);
int arch_setup_additional_pages(struct linux_binprm *bprm,
int uses_interp)
{
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long vdso_base, vdso_mapping_len;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
/* Be sure to map the data page */
vdso_mapping_len = (vdso_pages + 1) << PAGE_SHIFT;
@@ -170,15 +181,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
ret = install_special_mapping(mm, vdso_base, vdso_mapping_len,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
+ vdso_pagelist, &vma);
if (ret) {
mm->context.vdso = NULL;
goto up_fail;
}
+ up_write(&mm->mmap_sem);
+ return ret;
up_fail:
up_write(&mm->mmap_sem);
-
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/hexagon/kernel/vdso.c b/arch/hexagon/kernel/vdso.c
index 0bf5a87..188c5bd 100644
--- a/arch/hexagon/kernel/vdso.c
+++ b/arch/hexagon/kernel/vdso.c
@@ -19,6 +19,7 @@
*/
#include <linux/err.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/binfmts.h>
@@ -63,8 +64,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int ret;
unsigned long vdso_base;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
/* Try to get it loaded right near ld.so/glibc. */
@@ -78,17 +84,19 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* MAYWRITE to allow gdb to COW and set breakpoints. */
ret = install_special_mapping(mm, vdso_base, PAGE_SIZE,
- VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- &vdso_page);
-
+ VM_READ|VM_EXEC|
+ VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ &vdso_page, &vma);
if (ret)
goto up_fail;
mm->context.vdso = (void *)vdso_base;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..cfc2c7b 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -74,8 +74,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int ret;
unsigned long addr;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = vdso_addr(mm->start_stack);
@@ -89,15 +94,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- &vdso_page);
+ &vdso_page, &vma);
if (ret)
goto up_fail;
mm->context.vdso = (void *)addr;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 1d9c926..a23fb5f 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -193,6 +193,7 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
struct page **vdso_pagelist;
unsigned long vdso_pages;
unsigned long vdso_base;
@@ -232,6 +233,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* Add a page to the vdso size for the data page */
vdso_pages ++;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
/*
* pick a base address for the vDSO in process space. We try to put it
* at vdso_base which is the "natural" base for it, but we might fail
@@ -271,7 +276,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
+ vdso_pagelist, &vma);
if (rc) {
current->mm->context.vdso_base = 0;
goto fail_mmapsem;
@@ -279,9 +284,9 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
up_write(&mm->mmap_sem);
return 0;
-
- fail_mmapsem:
+fail_mmapsem:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return rc;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 05d75c4..4e00e11 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -180,6 +180,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
+ struct vm_area_struct *vma;
unsigned long vdso_pages;
unsigned long vdso_base;
int rc;
@@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (vdso_pages == 0)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
current->mm->context.vdso_base = 0;
/*
@@ -224,7 +229,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_base = get_unmapped_area(NULL, 0, vdso_pages << PAGE_SHIFT, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
rc = vdso_base;
- goto out_up;
+ goto out_err;
}
/*
@@ -247,11 +252,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
- if (rc)
+ vdso_pagelist, &vma);
+ if (rc) {
current->mm->context.vdso_base = 0;
-out_up:
+ goto out_err;
+ }
+
+ up_write(&mm->mmap_sem);
+ return 0;
+out_err:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return rc;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 5ca5797..49d3834 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -10,6 +10,7 @@
* License. See the file "COPYING" in the main directory of this archive
* for more details.
*/
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/kernel.h>
#include <linux/init.h>
@@ -61,9 +62,14 @@ int __init vsyscall_init(void)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
@@ -74,14 +80,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ | VM_EXEC |
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- syscall_pages);
+ syscall_pages, &vma);
if (unlikely(ret))
goto up_fail;
current->mm->context.vdso = (void *)addr;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/tile/kernel/vdso.c b/arch/tile/kernel/vdso.c
index 1533af2..cf93e62 100644
--- a/arch/tile/kernel/vdso.c
+++ b/arch/tile/kernel/vdso.c
@@ -15,6 +15,7 @@
#include <linux/binfmts.h>
#include <linux/compat.h>
#include <linux/elf.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -140,6 +141,7 @@ int setup_vdso_pages(void)
{
struct page **pagelist;
unsigned long pages;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long vdso_base = 0;
int retval = 0;
@@ -147,6 +149,10 @@ int setup_vdso_pages(void)
if (!vdso_ready)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
mm->context.vdso_base = 0;
pagelist = vdso_pagelist;
@@ -198,10 +204,11 @@ int setup_vdso_pages(void)
pages << PAGE_SHIFT,
VM_READ|VM_EXEC |
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- pagelist);
- if (retval)
+ pagelist, &vma);
+ if (retval) {
mm->context.vdso_base = 0;
-
+ kmem_cache_free(vm_area_cachep, vma);
+ }
return retval;
}
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 007d550..a6c3190 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -104,18 +104,23 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
void uml_setup_stubs(struct mm_struct *mm)
{
int err, ret;
+ struct vm_area_struct *vma;
if (!skas_needs_stub)
return;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
ret = init_stub_pte(mm, STUB_CODE,
(unsigned long) &__syscall_stub_start);
if (ret)
- goto out;
+ goto err;
ret = init_stub_pte(mm, STUB_DATA, mm->context.id.stack);
if (ret)
- goto out;
+ goto err;
mm->context.stub_pages[0] = virt_to_page(&__syscall_stub_start);
mm->context.stub_pages[1] = virt_to_page(mm->context.id.stack);
@@ -124,14 +129,15 @@ void uml_setup_stubs(struct mm_struct *mm)
err = install_special_mapping(mm, STUB_START, STUB_END - STUB_START,
VM_READ | VM_MAYREAD | VM_EXEC |
VM_MAYEXEC | VM_DONTCOPY | VM_PFNMAP,
- mm->context.stub_pages);
+ mm->context.stub_pages, &vma);
if (err) {
printk(KERN_ERR "install_special_mapping returned %d\n", err);
- goto out;
+ goto err;
}
return;
-out:
+err:
+ kmem_cache_free(vm_area_cachep, vma);
force_sigsegv(SIGSEGV, current);
}
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index 778ebba..c18b0e4 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
@@ -313,12 +314,18 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
int vectors_user_mapping(void)
{
+ int ret = 0;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
- return install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
- VM_READ | VM_EXEC |
- VM_MAYREAD | VM_MAYEXEC |
- VM_DONTEXPAND | VM_DONTDUMP,
- NULL);
+
+ ret = install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
+ VM_READ | VM_EXEC |
+ VM_MAYREAD | VM_MAYEXEC |
+ VM_DONTEXPAND | VM_DONTDUMP,
+ NULL, &vma);
+ if (ret)
+ kmem_cache_free(vm_area_cachep, vma);
+ return ret;
}
const char *arch_vma_name(struct vm_area_struct *vma)
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index af91901..888d856 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -55,19 +55,29 @@ subsys_initcall(init_vdso);
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int err;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
if (!vdso_enabled)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
err = install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
- VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdsop);
+ VM_READ|VM_EXEC|
+ VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ vdsop, &vma);
+ if (err)
+ goto out_err;
up_write(&mm->mmap_sem);
-
+ return err;
+out_err:
+ up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return err;
}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index d6bfb87..debb339 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -13,6 +13,7 @@
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/elf.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/err.h>
#include <linux/module.h>
@@ -307,6 +308,7 @@ int __init sysenter_setup(void)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret = 0;
bool compat;
@@ -319,6 +321,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (vdso_enabled == VDSO_DISABLED)
return 0;
+ if (compat_uses_vma || !compat) {
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+ }
+
down_write(&mm->mmap_sem);
/* Test compat mode once here, in case someone
@@ -346,7 +354,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso32_pages);
+ vdso32_pages, &vma);
if (ret)
goto up_fail;
@@ -355,12 +363,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
current_thread_info()->sysenter_return =
VDSO32_SYMBOL(addr, SYSENTER_RETURN);
+ up_write(&mm->mmap_sem);
+
+ return ret;
+
up_fail:
if (ret)
current->mm->context.vdso = NULL;
up_write(&mm->mmap_sem);
+ if (ret && (compat_uses_vma || !compat))
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..82c6b87 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
unsigned size)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
if (!vdso_enabled)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = vdso_addr(mm->start_stack, size);
addr = get_unmapped_area(NULL, addr, size, 0, 0);
@@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
ret = install_special_mapping(mm, addr, size,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- pages);
+ pages, &vma);
if (ret) {
current->mm->context.vdso = NULL;
goto up_fail;
}
+ up_write(&mm->mmap_sem);
+ return ret;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..4984fff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1515,7 +1515,8 @@ extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
extern int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
- unsigned long flags, struct page **pages);
+ unsigned long flags, struct page **pages,
+ struct vm_area_struct **vma_prealloc);
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ad8e1bd..18abeaa 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1099,8 +1099,14 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
static int xol_add_vma(struct xol_area *area)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
int ret = -EALREADY;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
if (mm->uprobes_state.xol_area)
goto fail;
@@ -1114,16 +1120,20 @@ static int xol_add_vma(struct xol_area *area)
}
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
- VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page);
+ VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+ &area->page, &vma);
if (ret)
goto fail;
smp_wmb(); /* pairs with get_xol_area() */
mm->uprobes_state.xol_area = area;
ret = 0;
+
+ up_write(&mm->mmap_sem);
+ return 0;
fail:
up_write(&mm->mmap_sem);
-
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 6a7824d..6e238a3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2909,17 +2909,17 @@ static const struct vm_operations_struct special_mapping_vmops = {
* The region past the last page supplied will always produce SIGBUS.
* The array pointer and the pages it points to are assumed to stay alive
* for as long as this mapping might exist.
+ *
+ * The caller has the responsibility of allocating the new vma, and freeing
+ * it if it was unused (when insert_vm_struct() fails).
*/
int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
- unsigned long vm_flags, struct page **pages)
+ unsigned long vm_flags, struct page **pages,
+ struct vm_area_struct **vma_prealloc)
{
- int ret;
- struct vm_area_struct *vma;
-
- vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
- if (unlikely(vma == NULL))
- return -ENOMEM;
+ int ret = 0;
+ struct vm_area_struct *vma = *vma_prealloc;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;
@@ -2939,11 +2939,7 @@ int install_special_mapping(struct mm_struct *mm,
mm->total_vm += len >> PAGE_SHIFT;
perf_event_mmap(vma);
-
- return 0;
-
out:
- kmem_cache_free(vm_area_cachep, vma);
return ret;
}
--
1.8.1.4
This seems somewhat insane:
int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
+ unsigned long vm_flags, struct page **pages,
+ struct vm_area_struct **vma_prealloc)
{
+ int ret = 0;
+ struct vm_area_struct *vma = *vma_prealloc;
(removed the "old" lines to make it more readable).
Why pass in "struct vm_area_struct **vma_prealloc" when you could just
pass in a plain and more readable "struct vm_area_struct *vma"?
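IOW, something along the lines of (sketch only):

        int install_special_mapping(struct mm_struct *mm,
                                    unsigned long addr, unsigned long len,
                                    unsigned long vm_flags, struct page **pages,
                                    struct vm_area_struct *vma);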
My *guess* is that you originally cleared the vma_prealloc thing if
you used it, but in the patch you sent out you definitely don't (the
_only_ use of that "vma_prealloc" is the line that loads the content
into "vma", so this interface looks like it is some remnant of an
earlier and more complicated patch?
Linus
On 18.10.2013 02:50, Davidlohr Bueso wrote:
> With the exception of um and tile, architectures that use
> the install_special_mapping() function, when setting up a
> new vma at program startup, do so with the mmap_sem lock
> held for writing. Unless there's an error, this process
> ends up allocating a new vma through kmem_cache_zalloc,
> and inserting it in the task's address space.
>
> This patch moves the vma's space allocation outside of
> install_special_mapping(), and leaves it to the callers to do so
> explicitly, without depending on mmap_sem. The same goes for
> freeing: if the new vma isn't used (and thus the process fails
> at some point), it's the caller's responsibility to free it -
> currently this is done inside install_special_mapping.
>
> Furthermore, uprobes behaves exactly the same and thus now the
> xol_add_vma() function also preallocates the new vma.
>
> While the changes to x86 vdso handling have been tested on both
> large and small 64-bit systems, the rest of the architectures
> are totally *untested*. Note that all changes are quite similar
> from architecture to architecture.
>
> This patch, when tested on a 64-core, 256 GB NUMA server, benefited
> several aim7 workloads: long +27% throughput with over 1500 users;
> compute +6.5% with over 1000 users; fserver +83% for small numbers
> of users (10-100 range) and +9% beyond that; and new_fserver, showing
> similar behavior, got a +67% boost with 100 users and an avg of +8%
> when more users were added.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Richard Kuo <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Chris Metcalf <[email protected]>
> Cc: Jeff Dike <[email protected]>
> Cc: Richard Weinberger <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> arch/arm/kernel/process.c | 22 ++++++++++++++++------
> arch/arm64/kernel/vdso.c | 21 +++++++++++++++++----
> arch/hexagon/kernel/vdso.c | 16 ++++++++++++----
> arch/mips/kernel/vdso.c | 10 +++++++++-
> arch/powerpc/kernel/vdso.c | 11 ++++++++---
> arch/s390/kernel/vdso.c | 19 +++++++++++++++----
> arch/sh/kernel/vsyscall/vsyscall.c | 11 ++++++++++-
> arch/tile/kernel/vdso.c | 13 ++++++++++---
> arch/um/kernel/skas/mmu.c | 16 +++++++++++-----
> arch/unicore32/kernel/process.c | 17 ++++++++++++-----
> arch/x86/um/vdso/vma.c | 18 ++++++++++++++----
> arch/x86/vdso/vdso32-setup.c | 16 +++++++++++++++-
> arch/x86/vdso/vma.c | 10 +++++++++-
> include/linux/mm.h | 3 ++-
> kernel/events/uprobes.c | 14 ++++++++++++--
> mm/mmap.c | 18 +++++++-----------
> 16 files changed, 179 insertions(+), 56 deletions(-)
>
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index 94f6b05..5637c92 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -13,6 +13,7 @@
> #include <linux/export.h>
> #include <linux/sched.h>
> #include <linux/kernel.h>
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/stddef.h>
> #include <linux/unistd.h>
> @@ -480,6 +481,7 @@ extern struct page *get_signal_page(void);
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> unsigned long addr;
> int ret;
>
> @@ -488,6 +490,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> if (!signal_page)
> return -ENOMEM;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
> if (IS_ERR_VALUE(addr)) {
> @@ -496,14 +502,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> }
>
> ret = install_special_mapping(mm, addr, PAGE_SIZE,
> - VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
> - &signal_page);
> -
> - if (ret == 0)
> - mm->context.sigpage = addr;
> + VM_READ | VM_EXEC | VM_MAYREAD |
> + VM_MAYWRITE | VM_MAYEXEC,
> + &signal_page, &vma);
> + if (ret)
> + goto up_fail;
>
> - up_fail:
> + mm->context.sigpage = addr;
> + up_write(&mm->mmap_sem);
> + return 0;
> +up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
> #endif
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 6a389dc..519a44c 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -83,20 +83,26 @@ arch_initcall(alloc_vectors_page);
>
> int aarch32_setup_vectors_page(struct linux_binprm *bprm, int uses_interp)
> {
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
> unsigned long addr = AARCH32_VECTORS_BASE;
> int ret;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> current->mm->context.vdso = (void *)addr;
>
> /* Map vectors page at the high address. */
> ret = install_special_mapping(mm, addr, PAGE_SIZE,
> VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC,
> - vectors_page);
> + vectors_page, &vma);
>
> up_write(&mm->mmap_sem);
> -
> + if (ret)
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
> #endif /* CONFIG_COMPAT */
> @@ -152,10 +158,15 @@ arch_initcall(vdso_init);
> int arch_setup_additional_pages(struct linux_binprm *bprm,
> int uses_interp)
> {
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
> unsigned long vdso_base, vdso_mapping_len;
> int ret;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> /* Be sure to map the data page */
> vdso_mapping_len = (vdso_pages + 1) << PAGE_SHIFT;
>
> @@ -170,15 +181,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
> ret = install_special_mapping(mm, vdso_base, vdso_mapping_len,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - vdso_pagelist);
> + vdso_pagelist, &vma);
> if (ret) {
> mm->context.vdso = NULL;
> goto up_fail;
> }
>
> + up_write(&mm->mmap_sem);
> + return ret;
> up_fail:
> up_write(&mm->mmap_sem);
> -
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/arch/hexagon/kernel/vdso.c b/arch/hexagon/kernel/vdso.c
> index 0bf5a87..188c5bd 100644
> --- a/arch/hexagon/kernel/vdso.c
> +++ b/arch/hexagon/kernel/vdso.c
> @@ -19,6 +19,7 @@
> */
>
> #include <linux/err.h>
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/vmalloc.h>
> #include <linux/binfmts.h>
> @@ -63,8 +64,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> int ret;
> unsigned long vdso_base;
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
>
> /* Try to get it loaded right near ld.so/glibc. */
> @@ -78,17 +84,19 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>
> /* MAYWRITE to allow gdb to COW and set breakpoints. */
> ret = install_special_mapping(mm, vdso_base, PAGE_SIZE,
> - VM_READ|VM_EXEC|
> - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - &vdso_page);
> -
> + VM_READ|VM_EXEC|
> + VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> + &vdso_page, &vma);
> if (ret)
> goto up_fail;
>
> mm->context.vdso = (void *)vdso_base;
>
> + up_write(&mm->mmap_sem);
> + return 0;
> up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> index 0f1af58..cfc2c7b 100644
> --- a/arch/mips/kernel/vdso.c
> +++ b/arch/mips/kernel/vdso.c
> @@ -74,8 +74,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> int ret;
> unsigned long addr;
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
>
> addr = vdso_addr(mm->start_stack);
> @@ -89,15 +94,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> ret = install_special_mapping(mm, addr, PAGE_SIZE,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - &vdso_page);
> + &vdso_page, &vma);
>
> if (ret)
> goto up_fail;
>
> mm->context.vdso = (void *)addr;
>
> + up_write(&mm->mmap_sem);
> + return 0;
> up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
> index 1d9c926..a23fb5f 100644
> --- a/arch/powerpc/kernel/vdso.c
> +++ b/arch/powerpc/kernel/vdso.c
> @@ -193,6 +193,7 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> struct page **vdso_pagelist;
> unsigned long vdso_pages;
> unsigned long vdso_base;
> @@ -232,6 +233,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> /* Add a page to the vdso size for the data page */
> vdso_pages ++;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> /*
> * pick a base address for the vDSO in process space. We try to put it
> * at vdso_base which is the "natural" base for it, but we might fail
> @@ -271,7 +276,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - vdso_pagelist);
> + vdso_pagelist, &vma);
> if (rc) {
> current->mm->context.vdso_base = 0;
> goto fail_mmapsem;
> @@ -279,9 +284,9 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>
> up_write(&mm->mmap_sem);
> return 0;
> -
> - fail_mmapsem:
> +fail_mmapsem:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return rc;
> }
>
> diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
> index 05d75c4..4e00e11 100644
> --- a/arch/s390/kernel/vdso.c
> +++ b/arch/s390/kernel/vdso.c
> @@ -180,6 +180,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> struct mm_struct *mm = current->mm;
> struct page **vdso_pagelist;
> + struct vm_area_struct *vma;
> unsigned long vdso_pages;
> unsigned long vdso_base;
> int rc;
> @@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> if (vdso_pages == 0)
> return 0;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> current->mm->context.vdso_base = 0;
>
> /*
> @@ -224,7 +229,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> vdso_base = get_unmapped_area(NULL, 0, vdso_pages << PAGE_SHIFT, 0, 0);
> if (IS_ERR_VALUE(vdso_base)) {
> rc = vdso_base;
> - goto out_up;
> + goto out_err;
> }
>
> /*
> @@ -247,11 +252,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - vdso_pagelist);
> - if (rc)
> + vdso_pagelist, &vma);
> + if (rc) {
> current->mm->context.vdso_base = 0;
> -out_up:
> + goto out_err;
> + }
> +
> + up_write(&mm->mmap_sem);
> + return 0;
> +out_err:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return rc;
> }
>
> diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
> index 5ca5797..49d3834 100644
> --- a/arch/sh/kernel/vsyscall/vsyscall.c
> +++ b/arch/sh/kernel/vsyscall/vsyscall.c
> @@ -10,6 +10,7 @@
> * License. See the file "COPYING" in the main directory of this archive
> * for more details.
> */
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/kernel.h>
> #include <linux/init.h>
> @@ -61,9 +62,14 @@ int __init vsyscall_init(void)
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> unsigned long addr;
> int ret;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
> if (IS_ERR_VALUE(addr)) {
> @@ -74,14 +80,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> ret = install_special_mapping(mm, addr, PAGE_SIZE,
> VM_READ | VM_EXEC |
> VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
> - syscall_pages);
> + syscall_pages, &vma);
> if (unlikely(ret))
> goto up_fail;
>
> current->mm->context.vdso = (void *)addr;
>
> + up_write(&mm->mmap_sem);
> + return 0;
> up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/arch/tile/kernel/vdso.c b/arch/tile/kernel/vdso.c
> index 1533af2..cf93e62 100644
> --- a/arch/tile/kernel/vdso.c
> +++ b/arch/tile/kernel/vdso.c
> @@ -15,6 +15,7 @@
> #include <linux/binfmts.h>
> #include <linux/compat.h>
> #include <linux/elf.h>
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
>
> @@ -140,6 +141,7 @@ int setup_vdso_pages(void)
> {
> struct page **pagelist;
> unsigned long pages;
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
> unsigned long vdso_base = 0;
> int retval = 0;
> @@ -147,6 +149,10 @@ int setup_vdso_pages(void)
> if (!vdso_ready)
> return 0;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> mm->context.vdso_base = 0;
>
> pagelist = vdso_pagelist;
> @@ -198,10 +204,11 @@ int setup_vdso_pages(void)
> pages << PAGE_SHIFT,
> VM_READ|VM_EXEC |
> VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
> - pagelist);
> - if (retval)
> + pagelist, &vma);
> + if (retval) {
> mm->context.vdso_base = 0;
> -
> + kmem_cache_free(vm_area_cachep, vma);
> + }
> return retval;
> }
>
> diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
> index 007d550..a6c3190 100644
> --- a/arch/um/kernel/skas/mmu.c
> +++ b/arch/um/kernel/skas/mmu.c
> @@ -104,18 +104,23 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
> void uml_setup_stubs(struct mm_struct *mm)
> {
> int err, ret;
> + struct vm_area_struct *vma;
>
> if (!skas_needs_stub)
> return;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
This function has return type void.
Use goto err please.
> +
> ret = init_stub_pte(mm, STUB_CODE,
> (unsigned long) &__syscall_stub_start);
> if (ret)
> - goto out;
> + goto err;
>
> ret = init_stub_pte(mm, STUB_DATA, mm->context.id.stack);
> if (ret)
> - goto out;
> + goto err;
>
> mm->context.stub_pages[0] = virt_to_page(&__syscall_stub_start);
> mm->context.stub_pages[1] = virt_to_page(mm->context.id.stack);
> @@ -124,14 +129,15 @@ void uml_setup_stubs(struct mm_struct *mm)
> err = install_special_mapping(mm, STUB_START, STUB_END - STUB_START,
> VM_READ | VM_MAYREAD | VM_EXEC |
> VM_MAYEXEC | VM_DONTCOPY | VM_PFNMAP,
> - mm->context.stub_pages);
> + mm->context.stub_pages, &vma);
> if (err) {
> printk(KERN_ERR "install_special_mapping returned %d\n", err);
> - goto out;
> + goto err;
> }
> return;
>
> -out:
> +err:
> + kmem_cache_free(vm_area_cachep, vma);
> force_sigsegv(SIGSEGV, current);
> }
>
> diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
> index 778ebba..c18b0e4 100644
> --- a/arch/unicore32/kernel/process.c
> +++ b/arch/unicore32/kernel/process.c
> @@ -14,6 +14,7 @@
> #include <linux/module.h>
> #include <linux/sched.h>
> #include <linux/kernel.h>
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/stddef.h>
> #include <linux/unistd.h>
> @@ -313,12 +314,18 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
>
> int vectors_user_mapping(void)
> {
> + int ret = 0;
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
> - return install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
> - VM_READ | VM_EXEC |
> - VM_MAYREAD | VM_MAYEXEC |
> - VM_DONTEXPAND | VM_DONTDUMP,
> - NULL);
> +
> + ret = install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
> + VM_READ | VM_EXEC |
> + VM_MAYREAD | VM_MAYEXEC |
> + VM_DONTEXPAND | VM_DONTDUMP,
> + NULL, &vma);
> + if (ret)
> + kmem_cache_free(vm_area_cachep, vma);
> + return ret;
> }
>
> const char *arch_vma_name(struct vm_area_struct *vma)
> diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> index af91901..888d856 100644
> --- a/arch/x86/um/vdso/vma.c
> +++ b/arch/x86/um/vdso/vma.c
> @@ -55,19 +55,29 @@ subsys_initcall(init_vdso);
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> int err;
> + struct vm_area_struct *vma;
> struct mm_struct *mm = current->mm;
>
> if (!vdso_enabled)
> return 0;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
>
> err = install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
> - VM_READ|VM_EXEC|
> - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - vdsop);
> + VM_READ|VM_EXEC|
> + VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> + vdsop, &vma);
> + if (err)
> + goto out_err;
>
> up_write(&mm->mmap_sem);
> -
> + return err;
> +out_err:
> + up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return err;
> }
> diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
> index d6bfb87..debb339 100644
> --- a/arch/x86/vdso/vdso32-setup.c
> +++ b/arch/x86/vdso/vdso32-setup.c
> @@ -13,6 +13,7 @@
> #include <linux/gfp.h>
> #include <linux/string.h>
> #include <linux/elf.h>
> +#include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/err.h>
> #include <linux/module.h>
> @@ -307,6 +308,7 @@ int __init sysenter_setup(void)
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> unsigned long addr;
> int ret = 0;
> bool compat;
> @@ -319,6 +321,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> if (vdso_enabled == VDSO_DISABLED)
> return 0;
>
> + if (compat_uses_vma || !compat) {
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> + }
> +
> down_write(&mm->mmap_sem);
>
> /* Test compat mode once here, in case someone
> @@ -346,7 +354,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> ret = install_special_mapping(mm, addr, PAGE_SIZE,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - vdso32_pages);
> + vdso32_pages, &vma);
>
> if (ret)
> goto up_fail;
> @@ -355,12 +363,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> current_thread_info()->sysenter_return =
> VDSO32_SYMBOL(addr, SYSENTER_RETURN);
>
> + up_write(&mm->mmap_sem);
> +
> + return ret;
> +
> up_fail:
> if (ret)
> current->mm->context.vdso = NULL;
>
> up_write(&mm->mmap_sem);
>
> + if (ret && (compat_uses_vma || !compat))
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 431e875..82c6b87 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> unsigned size)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> unsigned long addr;
> int ret;
>
> if (!vdso_enabled)
> return 0;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> addr = vdso_addr(mm->start_stack, size);
> addr = get_unmapped_area(NULL, addr, size, 0, 0);
> @@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> ret = install_special_mapping(mm, addr, size,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - pages);
> + pages, &vma);
> if (ret) {
> current->mm->context.vdso = NULL;
> goto up_fail;
> }
>
> + up_write(&mm->mmap_sem);
> + return ret;
> up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b6e55e..4984fff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1515,7 +1515,8 @@ extern struct file *get_mm_exe_file(struct mm_struct *mm);
> extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
> extern int install_special_mapping(struct mm_struct *mm,
> unsigned long addr, unsigned long len,
> - unsigned long flags, struct page **pages);
> + unsigned long flags, struct page **pages,
> + struct vm_area_struct **vma_prealloc);
>
> extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index ad8e1bd..18abeaa 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1099,8 +1099,14 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
> static int xol_add_vma(struct xol_area *area)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> +
> int ret = -EALREADY;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> if (mm->uprobes_state.xol_area)
> goto fail;
> @@ -1114,16 +1120,20 @@ static int xol_add_vma(struct xol_area *area)
> }
>
> ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page);
> + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
> + &area->page, &vma);
> if (ret)
> goto fail;
>
> smp_wmb(); /* pairs with get_xol_area() */
> mm->uprobes_state.xol_area = area;
> ret = 0;
> +
> + up_write(&mm->mmap_sem);
> + return 0;
> fail:
> up_write(&mm->mmap_sem);
> -
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 6a7824d..6e238a3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2909,17 +2909,17 @@ static const struct vm_operations_struct special_mapping_vmops = {
> * The region past the last page supplied will always produce SIGBUS.
> * The array pointer and the pages it points to are assumed to stay alive
> * for as long as this mapping might exist.
> + *
> + * The caller has the responsibility of allocating the new vma, and freeing
> + * it if it was unused (when insert_vm_struct() fails).
> */
> int install_special_mapping(struct mm_struct *mm,
> unsigned long addr, unsigned long len,
> - unsigned long vm_flags, struct page **pages)
> + unsigned long vm_flags, struct page **pages,
> + struct vm_area_struct **vma_prealloc)
> {
> - int ret;
> - struct vm_area_struct *vma;
> -
> - vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> - if (unlikely(vma == NULL))
> - return -ENOMEM;
> + int ret = 0;
> + struct vm_area_struct *vma = *vma_prealloc;
>
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> vma->vm_mm = mm;
> @@ -2939,11 +2939,7 @@ int install_special_mapping(struct mm_struct *mm,
> mm->total_vm += len >> PAGE_SHIFT;
>
> perf_event_mmap(vma);
> -
> - return 0;
> -
> out:
> - kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
>
* Davidlohr Bueso <[email protected]> wrote:
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> unsigned size)
> {
> struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> unsigned long addr;
> int ret;
>
> if (!vdso_enabled)
> return 0;
>
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> + if (unlikely(!vma))
> + return -ENOMEM;
> +
> down_write(&mm->mmap_sem);
> addr = vdso_addr(mm->start_stack, size);
> addr = get_unmapped_area(NULL, addr, size, 0, 0);
> @@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> ret = install_special_mapping(mm, addr, size,
> VM_READ|VM_EXEC|
> VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> - pages);
> + pages, &vma);
> if (ret) {
> current->mm->context.vdso = NULL;
> goto up_fail;
> }
>
> + up_write(&mm->mmap_sem);
> + return ret;
> up_fail:
> up_write(&mm->mmap_sem);
> + kmem_cache_free(vm_area_cachep, vma);
> return ret;
> }
>
1)
Beyond the simplification that Linus suggested, why not introduce a new
function, named 'install_special_mapping_vma()' or so, and convert
architectures one by one, without pressure to get it all done (and all
correct) in a single patch?
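The new variant would take the preallocated vma, while the old interface
could stay around as a wrapper during the conversion. A rough sketch of
what that could look like (name and details just for illustration):

        int install_special_mapping_vma(struct mm_struct *mm,
                                        unsigned long addr, unsigned long len,
                                        unsigned long vm_flags, struct page **pages,
                                        struct vm_area_struct *vma);

        /* unconverted architectures keep using the old interface */
        int install_special_mapping(struct mm_struct *mm,
                                    unsigned long addr, unsigned long len,
                                    unsigned long vm_flags, struct page **pages)
        {
                struct vm_area_struct *vma;
                int ret;

                vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
                if (unlikely(!vma))
                        return -ENOMEM;

                ret = install_special_mapping_vma(mm, addr, len, vm_flags,
                                                  pages, vma);
                if (ret)
                        kmem_cache_free(vm_area_cachep, vma);
                return ret;
        }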
2)
I don't see the justification: this code gets executed in exec() where a
new mm has just been allocated. There's only a single user of the mm and
thus the critical section width of mmap_sem is more or less irrelevant.
mmap_sem critical section size only matters for codepaths that threaded
programs can hit.
3)
But, if we do all that, a couple of other (micro-)optimizations are
possible in setup_additional_pages() as well:
- vdso_addr(), which is actually much _more_ expensive than kmalloc()
because on most distros it will call into the RNG, can also be done
outside the mmap_sem.
- the error paths can all be merged and the common case can be made
fall-through.
- use 'mm' consistently instead of repeating 'current->mm'
- set 'mm->context.vdso' only once we know it's all a success, and do it
outside the lock
- add a few comments about which operations are locked, which unlocked,
and why. Please double check the assumptions I documented there.
See the diff attached below. (Totally untested and all that.)
Also note that I think, in theory, if exec() guaranteed the privacy and
single-threadedness of the new mm, we could probably do _all_ of this
unlocked. Right now I don't think this is guaranteed: ptrace() users might
look up the new PID and might interfere with the MM via
PTRACE_PEEK*/PTRACE_POKE*.
( Furthermore, /proc/ might also give early access to aspects of the mm -
although no manipulation of the mm is possible there. )
If such privacy of the new mm was guaranteed then that would also remove
the need to move the allocation out of install_special_mapping().
But, I don't think it all matters, due to #2 - and your changes actively
complicate setup_pages(), which makes this security-sensitive code a bit
more fragile. We can still do it out of sheer principle, I just don't see
where it's supposed to help scale better.
Thanks,
Ingo
arch/x86/vdso/vma.c | 40 +++++++++++++++++++++++++++-------------
1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..c590387 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -157,30 +157,44 @@ static int setup_additional_pages(struct linux_binprm *bprm,
unsigned long addr;
int ret;
- if (!vdso_enabled)
+ if (unlikely(!vdso_enabled))
return 0;
- down_write(&mm->mmap_sem);
+ /*
+ * Do this outside the MM lock - we are in exec() with a new MM,
+ * nobody else can use these fields of the mm:
+ */
addr = vdso_addr(mm->start_stack, size);
- addr = get_unmapped_area(NULL, addr, size, 0, 0);
- if (IS_ERR_VALUE(addr)) {
- ret = addr;
- goto up_fail;
- }
- current->mm->context.vdso = (void *)addr;
+ /*
+ * This must be done under the MM lock - there might be parallel
+ * accesses to this mm, such as ptrace().
+ *
+ * [ This could be further optimized if exec() reliably inhibited
+ * all early access to the mm. ]
+ */
+ down_write(&mm->mmap_sem);
+ addr = get_unmapped_area(NULL, addr, size, 0, 0);
+ if (IS_ERR_VALUE(addr))
+ goto up_fail_addr;
ret = install_special_mapping(mm, addr, size,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
pages);
- if (ret) {
- current->mm->context.vdso = NULL;
- goto up_fail;
- }
+ up_write(&mm->mmap_sem);
+ if (ret)
+ goto fail;
-up_fail:
+ mm->context.vdso = (void *)addr;
+ return ret;
+
+up_fail_addr:
+ ret = addr;
up_write(&mm->mmap_sem);
+fail:
+ mm->context.vdso = NULL;
+
return ret;
}
From: Davidlohr Bueso <[email protected]>
Subject: [PATCH v2 3/3] vdso: preallocate new vmas
With the exception of um and tile, architectures that use
the install_special_mapping() function, when setting up a
new vma at program startup, do so with the mmap_sem lock
held for writing. Unless there's an error, this process
ends up allocating a new vma through kmem_cache_zalloc,
and inserting it in the task's address space.
This patch moves the vma's space allocation outside of
install_special_mapping(), and leaves the callers to do so
explicitly, without depending on mmap_sem. The same goes for
freeing: if the new vma isn't used (and thus the process fails
at some point), it is the caller's responsibility to free it -
currently this is done inside install_special_mapping().
Furthermore, uprobes behaves exactly the same and thus now the
xol_add_vma() function also preallocates the new vma.
While the changes to x86 vdso handling have been tested on both
large and small 64-bit systems, the rest of the architectures
are totally *untested*. Note that all changes are quite similar
from architecture to architecture.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
v2:
- Simplify install_special_mapping interface (Linus Torvalds)
- Fix return for uml_setup_stubs when mem allocation fails (Richard Weinberger)
arch/arm/kernel/process.c | 22 ++++++++++++++++------
arch/arm64/kernel/vdso.c | 21 +++++++++++++++++----
arch/hexagon/kernel/vdso.c | 16 ++++++++++++----
arch/mips/kernel/vdso.c | 10 +++++++++-
arch/powerpc/kernel/vdso.c | 11 ++++++++---
arch/s390/kernel/vdso.c | 19 +++++++++++++++----
arch/sh/kernel/vsyscall/vsyscall.c | 11 ++++++++++-
arch/tile/kernel/vdso.c | 13 ++++++++++---
arch/um/kernel/skas/mmu.c | 16 +++++++++++-----
arch/unicore32/kernel/process.c | 17 ++++++++++++-----
arch/x86/um/vdso/vma.c | 18 ++++++++++++++----
arch/x86/vdso/vdso32-setup.c | 16 +++++++++++++++-
arch/x86/vdso/vma.c | 10 +++++++++-
include/linux/mm.h | 3 ++-
kernel/events/uprobes.c | 14 ++++++++++++--
mm/mmap.c | 17 ++++++-----------
16 files changed, 178 insertions(+), 56 deletions(-)
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 94f6b05..d1eb115 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -13,6 +13,7 @@
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
@@ -480,6 +481,7 @@ extern struct page *get_signal_page(void);
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
@@ -488,6 +490,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (!signal_page)
return -ENOMEM;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
@@ -496,14 +502,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
}
ret = install_special_mapping(mm, addr, PAGE_SIZE,
- VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- &signal_page);
-
- if (ret == 0)
- mm->context.sigpage = addr;
+ VM_READ | VM_EXEC | VM_MAYREAD |
+ VM_MAYWRITE | VM_MAYEXEC,
+ &signal_page, vma);
+ if (ret)
+ goto up_fail;
- up_fail:
+ mm->context.sigpage = addr;
+ up_write(&mm->mmap_sem);
+ return 0;
+up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
#endif
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 6a389dc..06a01ea 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -83,20 +83,26 @@ arch_initcall(alloc_vectors_page);
int aarch32_setup_vectors_page(struct linux_binprm *bprm, int uses_interp)
{
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long addr = AARCH32_VECTORS_BASE;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
current->mm->context.vdso = (void *)addr;
/* Map vectors page at the high address. */
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYEXEC,
- vectors_page);
+ vectors_page, vma);
up_write(&mm->mmap_sem);
-
+ if (ret)
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
#endif /* CONFIG_COMPAT */
@@ -152,10 +158,15 @@ arch_initcall(vdso_init);
int arch_setup_additional_pages(struct linux_binprm *bprm,
int uses_interp)
{
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long vdso_base, vdso_mapping_len;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
/* Be sure to map the data page */
vdso_mapping_len = (vdso_pages + 1) << PAGE_SHIFT;
@@ -170,15 +181,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
ret = install_special_mapping(mm, vdso_base, vdso_mapping_len,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
+ vdso_pagelist, vma);
if (ret) {
mm->context.vdso = NULL;
goto up_fail;
}
+ up_write(&mm->mmap_sem);
+ return ret;
up_fail:
up_write(&mm->mmap_sem);
-
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/hexagon/kernel/vdso.c b/arch/hexagon/kernel/vdso.c
index 0bf5a87..418a896 100644
--- a/arch/hexagon/kernel/vdso.c
+++ b/arch/hexagon/kernel/vdso.c
@@ -19,6 +19,7 @@
*/
#include <linux/err.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/binfmts.h>
@@ -63,8 +64,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int ret;
unsigned long vdso_base;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
/* Try to get it loaded right near ld.so/glibc. */
@@ -78,17 +84,19 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* MAYWRITE to allow gdb to COW and set breakpoints. */
ret = install_special_mapping(mm, vdso_base, PAGE_SIZE,
- VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- &vdso_page);
-
+ VM_READ|VM_EXEC|
+ VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ &vdso_page, vma);
if (ret)
goto up_fail;
mm->context.vdso = (void *)vdso_base;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..fb44fc9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -74,8 +74,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int ret;
unsigned long addr;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = vdso_addr(mm->start_stack);
@@ -89,15 +94,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- &vdso_page);
+ &vdso_page, vma);
if (ret)
goto up_fail;
mm->context.vdso = (void *)addr;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 1d9c926..ed339de 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -193,6 +193,7 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
struct page **vdso_pagelist;
unsigned long vdso_pages;
unsigned long vdso_base;
@@ -232,6 +233,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* Add a page to the vdso size for the data page */
vdso_pages ++;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
/*
* pick a base address for the vDSO in process space. We try to put it
* at vdso_base which is the "natural" base for it, but we might fail
@@ -271,7 +276,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
+ vdso_pagelist, vma);
if (rc) {
current->mm->context.vdso_base = 0;
goto fail_mmapsem;
@@ -279,9 +284,9 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
up_write(&mm->mmap_sem);
return 0;
-
- fail_mmapsem:
+fail_mmapsem:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return rc;
}
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 05d75c4..e2a707d 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -180,6 +180,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
+ struct vm_area_struct *vma;
unsigned long vdso_pages;
unsigned long vdso_base;
int rc;
@@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (vdso_pages == 0)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
current->mm->context.vdso_base = 0;
/*
@@ -224,7 +229,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_base = get_unmapped_area(NULL, 0, vdso_pages << PAGE_SHIFT, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
rc = vdso_base;
- goto out_up;
+ goto out_err;
}
/*
@@ -247,11 +252,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso_pagelist);
- if (rc)
+ vdso_pagelist, vma);
+ if (rc) {
current->mm->context.vdso_base = 0;
-out_up:
+ goto out_err;
+ }
+
+ up_write(&mm->mmap_sem);
+ return 0;
+out_err:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return rc;
}
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 5ca5797..f2431da 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -10,6 +10,7 @@
* License. See the file "COPYING" in the main directory of this archive
* for more details.
*/
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/kernel.h>
#include <linux/init.h>
@@ -61,9 +62,14 @@ int __init vsyscall_init(void)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
@@ -74,14 +80,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ | VM_EXEC |
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- syscall_pages);
+ syscall_pages, vma);
if (unlikely(ret))
goto up_fail;
current->mm->context.vdso = (void *)addr;
+ up_write(&mm->mmap_sem);
+ return 0;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/tile/kernel/vdso.c b/arch/tile/kernel/vdso.c
index 1533af2..e691c0b 100644
--- a/arch/tile/kernel/vdso.c
+++ b/arch/tile/kernel/vdso.c
@@ -15,6 +15,7 @@
#include <linux/binfmts.h>
#include <linux/compat.h>
#include <linux/elf.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -140,6 +141,7 @@ int setup_vdso_pages(void)
{
struct page **pagelist;
unsigned long pages;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long vdso_base = 0;
int retval = 0;
@@ -147,6 +149,10 @@ int setup_vdso_pages(void)
if (!vdso_ready)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
mm->context.vdso_base = 0;
pagelist = vdso_pagelist;
@@ -198,10 +204,11 @@ int setup_vdso_pages(void)
pages << PAGE_SHIFT,
VM_READ|VM_EXEC |
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
- pagelist);
- if (retval)
+ pagelist, vma);
+ if (retval) {
mm->context.vdso_base = 0;
-
+ kmem_cache_free(vm_area_cachep, vma);
+ }
return retval;
}
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 007d550..f08cd6c 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -104,18 +104,23 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
void uml_setup_stubs(struct mm_struct *mm)
{
int err, ret;
+ struct vm_area_struct *vma;
if (!skas_needs_stub)
return;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return;
+
ret = init_stub_pte(mm, STUB_CODE,
(unsigned long) &__syscall_stub_start);
if (ret)
- goto out;
+ goto err;
ret = init_stub_pte(mm, STUB_DATA, mm->context.id.stack);
if (ret)
- goto out;
+ goto err;
mm->context.stub_pages[0] = virt_to_page(&__syscall_stub_start);
mm->context.stub_pages[1] = virt_to_page(mm->context.id.stack);
@@ -124,14 +129,15 @@ void uml_setup_stubs(struct mm_struct *mm)
err = install_special_mapping(mm, STUB_START, STUB_END - STUB_START,
VM_READ | VM_MAYREAD | VM_EXEC |
VM_MAYEXEC | VM_DONTCOPY | VM_PFNMAP,
- mm->context.stub_pages);
+ mm->context.stub_pages, vma);
if (err) {
printk(KERN_ERR "install_special_mapping returned %d\n", err);
- goto out;
+ goto err;
}
return;
-out:
+err:
+ kmem_cache_free(vm_area_cachep, vma);
force_sigsegv(SIGSEGV, current);
}
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index 778ebba..d23adef 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
@@ -313,12 +314,18 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
int vectors_user_mapping(void)
{
+ int ret = 0;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
- return install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
- VM_READ | VM_EXEC |
- VM_MAYREAD | VM_MAYEXEC |
- VM_DONTEXPAND | VM_DONTDUMP,
- NULL);
+
+ ret = install_special_mapping(mm, 0xffff0000, PAGE_SIZE,
+ VM_READ | VM_EXEC |
+ VM_MAYREAD | VM_MAYEXEC |
+ VM_DONTEXPAND | VM_DONTDUMP,
+ NULL, vma);
+ if (ret)
+ kmem_cache_free(vm_area_cachep, vma);
+ return ret;
}
const char *arch_vma_name(struct vm_area_struct *vma)
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index af91901..a380b13 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -55,19 +55,29 @@ subsys_initcall(init_vdso);
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
int err;
+ struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
if (!vdso_enabled)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
err = install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
- VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdsop);
+ VM_READ|VM_EXEC|
+ VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ vdsop, vma);
+ if (err)
+ goto out_err;
up_write(&mm->mmap_sem);
-
+ return err;
+out_err:
+ up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return err;
}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index d6bfb87..efa791a 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -13,6 +13,7 @@
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/elf.h>
+#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/err.h>
#include <linux/module.h>
@@ -307,6 +308,7 @@ int __init sysenter_setup(void)
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret = 0;
bool compat;
@@ -319,6 +321,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (vdso_enabled == VDSO_DISABLED)
return 0;
+ if (compat_uses_vma || !compat) {
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+ }
+
down_write(&mm->mmap_sem);
/* Test compat mode once here, in case someone
@@ -346,7 +354,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso32_pages);
+ vdso32_pages, vma);
if (ret)
goto up_fail;
@@ -355,12 +363,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
current_thread_info()->sysenter_return =
VDSO32_SYMBOL(addr, SYSENTER_RETURN);
+ up_write(&mm->mmap_sem);
+
+ return ret;
+
up_fail:
if (ret)
current->mm->context.vdso = NULL;
up_write(&mm->mmap_sem);
+ if (ret && (compat_uses_vma || !compat))
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..fc189de 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
unsigned size)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
unsigned long addr;
int ret;
if (!vdso_enabled)
return 0;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
addr = vdso_addr(mm->start_stack, size);
addr = get_unmapped_area(NULL, addr, size, 0, 0);
@@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
ret = install_special_mapping(mm, addr, size,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- pages);
+ pages, vma);
if (ret) {
current->mm->context.vdso = NULL;
goto up_fail;
}
+ up_write(&mm->mmap_sem);
+ return ret;
up_fail:
up_write(&mm->mmap_sem);
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..ade2bd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1515,7 +1515,8 @@ extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
extern int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
- unsigned long flags, struct page **pages);
+ unsigned long flags, struct page **pages,
+ struct vm_area_struct *vma);
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ad8e1bd..3a99f4b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1099,8 +1099,14 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
static int xol_add_vma(struct xol_area *area)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
int ret = -EALREADY;
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+ if (unlikely(!vma))
+ return -ENOMEM;
+
down_write(&mm->mmap_sem);
if (mm->uprobes_state.xol_area)
goto fail;
@@ -1114,16 +1120,20 @@ static int xol_add_vma(struct xol_area *area)
}
ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
- VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page);
+ VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+ &area->page, vma);
if (ret)
goto fail;
smp_wmb(); /* pairs with get_xol_area() */
mm->uprobes_state.xol_area = area;
ret = 0;
+
+ up_write(&mm->mmap_sem);
+ return 0;
fail:
up_write(&mm->mmap_sem);
-
+ kmem_cache_free(vm_area_cachep, vma);
return ret;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 6a7824d..6a6ef0a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2909,17 +2909,16 @@ static const struct vm_operations_struct special_mapping_vmops = {
* The region past the last page supplied will always produce SIGBUS.
* The array pointer and the pages it points to are assumed to stay alive
* for as long as this mapping might exist.
+ *
+ * The caller has the responsibility of allocating the new vma, and freeing
+ * it if it was unused (when insert_vm_struct() fails).
*/
int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
- unsigned long vm_flags, struct page **pages)
+ unsigned long vm_flags, struct page **pages,
+ struct vm_area_struct *vma)
{
- int ret;
- struct vm_area_struct *vma;
-
- vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
- if (unlikely(vma == NULL))
- return -ENOMEM;
+ int ret = 0;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;
@@ -2939,11 +2938,7 @@ int install_special_mapping(struct mm_struct *mm,
mm->total_vm += len >> PAGE_SHIFT;
perf_event_mmap(vma);
-
- return 0;
-
out:
- kmem_cache_free(vm_area_cachep, vma);
return ret;
}
--
1.8.1.4
Hi Ingo,
On Fri, 2013-10-18 at 08:05 +0200, Ingo Molnar wrote:
> * Davidlohr Bueso <[email protected]> wrote:
>
> > --- a/arch/x86/vdso/vma.c
> > +++ b/arch/x86/vdso/vma.c
> > @@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> > unsigned size)
> > {
> > struct mm_struct *mm = current->mm;
> > + struct vm_area_struct *vma;
> > unsigned long addr;
> > int ret;
> >
> > if (!vdso_enabled)
> > return 0;
> >
> > + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> > + if (unlikely(!vma))
> > + return -ENOMEM;
> > +
> > down_write(&mm->mmap_sem);
> > addr = vdso_addr(mm->start_stack, size);
> > addr = get_unmapped_area(NULL, addr, size, 0, 0);
> > @@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
> > ret = install_special_mapping(mm, addr, size,
> > VM_READ|VM_EXEC|
> > VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> > - pages);
> > + pages, &vma);
> > if (ret) {
> > current->mm->context.vdso = NULL;
> > goto up_fail;
> > }
> >
> > + up_write(&mm->mmap_sem);
> > + return ret;
> > up_fail:
> > up_write(&mm->mmap_sem);
> > + kmem_cache_free(vm_area_cachep, vma);
> > return ret;
> > }
> >
>
> 1)
>
> Beyond the simplification that Linus suggested, why not introduce a new
> function, named 'install_special_mapping_vma()' or so, and convert
> architectures one by one, without pressure to get it all done (and all
> correct) in a single patch?
>
Hmm, I'd normally take this approach, but updating the callers from all
architectures was so straightforward and monotonous that I think it's
easier to just do it all at once. Andrew had suggested using linux-next
for testing, so if some arch breaks (in the not-compiling sense), the
maintainer or I can easily address the issue.
> 2)
>
> I don't see the justification: this code gets executed in exec() where a
> new mm has just been allocated. There's only a single user of the mm and
> thus the critical section width of mmap_sem is more or less irrelevant.
>
> mmap_sem critical section size only matters for codepaths that threaded
> programs can hit.
>
Yes, I was surprised by the performance boost I noticed when running
this patch. This weekend I re-ran the tests (including your 4/3 patch)
and noticed that while we're still getting some benefits (more like in
the +5% throughput range), it's not as good as I originally reported. I
believe the reason is that I had run the tests on the vanilla kernel
without the max clock frequency, so the comparison was obviously not
fair. That said, I still think it's worth adding this patch, as it does
help at a micro-optimization level, and it's one less mmap_sem user we
have to worry about.
> 3)
>
> But, if we do all that, a couple of other (micro-)optimizations are
> possible in setup_additional_pages() as well:
I've rebased your patch on top of mine and things ran fine over the
weekend, didn't notice anything unusual - see below. I've *not* added
your SoB tag, so if you think it's good to go, please go ahead and add
it.
>
> - vdso_addr(), which is actually much _more_ expensive than kmalloc()
> because on most distros it will call into the RNG, can also be done
> outside the mmap_sem.
>
> - the error paths can all be merged and the common case can be made
> fall-through.
>
> - use 'mm' consistently instead of repeating 'current->mm'
>
> - set 'mm->context.vdso' only once we know it's all a success, and do it
> outside the lock
>
> - add a few comments about which operations are locked, which unlocked,
> and why. Please double check the assumptions I documented there.
>
> See the diff attached below. (Totally untested and all that.)
>
> Also note that I think, in theory, if exec() guaranteed the privacy and
> single threadedness of the new mm, we could probably do _all_ of this
> unlocked. Right now I don't think this is guaranteed: ptrace() users might
> look up the new PID and might interfere on the MM via
> PTRACE_PEEK*/PTRACE_POKE*.
>
> ( Furthermore, /proc/ might also give early access to aspects of the mm -
> although no manipulation of the mm is possible there. )
I was not aware of this, thanks for going into more details.
>
> If such privacy of the new mm was guaranteed then that would also remove
> the need to move the allocation out of install_special_mapping().
>
> But, I don't think it all matters, due to #2 - and your changes actively
> complicate setup_pages(), which makes this security sensitive code a bit
> more fragile. We can still do it out of sheer principle, I just don't see
> where it's supposed to help scale better.
I think that trying to guarantee new mm privacy would actually
complicate things much more than this patch, which is quite simple. And,
as you imply, it's not worthwhile for these code paths.
Thanks,
Davidlohr
8<------------------------------------------
From: Ingo Molnar <[email protected]>
Subject: [PATCH 4/3] x86/vdso: Optimize setup_additional_pages()
(micro-)optimizations are possible in setup_additional_pages() as well:
- vdso_addr(), which is actually much _more_ expensive than kmalloc()
because on most distros it will call into the RNG, can also be done
outside the mmap_sem.
- the error paths can all be merged and the common case can be made
fall-through.
- use 'mm' consistently instead of repeating 'current->mm'
- set 'mm->context.vdso' only once we know it's all a success, and do it
outside the lock
- add a few comments about which operations are locked, which unlocked,
and why. Please double check the assumptions I documented there.
[rebased on top of "vdso: preallocate new vmas" patch]
Signed-off-by: Davidlohr Bueso <[email protected]>
---
arch/x86/vdso/vma.c | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index fc189de..5af8597 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -158,36 +158,47 @@ static int setup_additional_pages(struct linux_binprm *bprm,
unsigned long addr;
int ret;
- if (!vdso_enabled)
+ if (unlikely(!vdso_enabled))
return 0;
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (unlikely(!vma))
return -ENOMEM;
- down_write(&mm->mmap_sem);
+ /*
+ * Do this outside the MM lock - we are in exec() with a new MM,
+ * nobody else can use these fields of the mm:
+ */
addr = vdso_addr(mm->start_stack, size);
- addr = get_unmapped_area(NULL, addr, size, 0, 0);
- if (IS_ERR_VALUE(addr)) {
- ret = addr;
- goto up_fail;
- }
- current->mm->context.vdso = (void *)addr;
+ /*
+ * This must be done under the MM lock - there might be parallel
+ * accesses to this mm, such as ptrace().
+ *
+ * [ This could be further optimized if exec() reliably inhibited
+ * all early access to the mm. ]
+ */
+ down_write(&mm->mmap_sem);
+ addr = get_unmapped_area(NULL, addr, size, 0, 0);
+ if (IS_ERR_VALUE(addr))
+ goto up_fail_addr;
ret = install_special_mapping(mm, addr, size,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
pages, vma);
- if (ret) {
- current->mm->context.vdso = NULL;
- goto up_fail;
- }
-
up_write(&mm->mmap_sem);
+ if (ret)
+ goto fail;
+
+ mm->context.vdso = (void *)addr;
return ret;
-up_fail:
+
+up_fail_addr:
+ ret = addr;
up_write(&mm->mmap_sem);
+fail:
+ mm->context.vdso = NULL;
kmem_cache_free(vm_area_cachep, vma);
return ret;
}
--
1.8.1.4
* Davidlohr Bueso <[email protected]> wrote:
> > 2)
> >
> > I don't see the justification: this code gets executed in exec() where
> > a new mm has just been allocated. There's only a single user of the mm
> > and thus the critical section width of mmap_sem is more or less
> > irrelevant.
> >
> > mmap_sem critical section size only matters for codepaths that
> > threaded programs can hit.
>
> Yes, I was surprised by the performance boost I noticed when running
> this patch. This weekend I re-ran the tests (including your 4/3 patch)
> and noticed that while we're still getting some benefits (more like in
> the +5% throughput range), it's not as good as I originally reported. I
> believe the reason is because I had ran the tests on the vanilla kernel
> without the max clock frequency, so the comparison was obviously not
> fair. That said, I still think it's worth adding this patch, as it does
> help at a micro-optimization level, and it's one less mmap_sem user we
> have to worry about.
But it's a mmap_sem user that is essentially _guaranteed_ to have only a
single user at that point, in the exec() path!
So I don't see how this can show _any_ measurable speedup, let alone a 5%
speedup in a macro test. If our understanding is correct then the patch
does nothing but shuffle around a flag setting operation. (the mmap_sem is
equivalent to setting a single flag, in the single-user case.)
Now, if our understanding is incorrect then we need to improve our
understanding.
Thanks,
Ingo
On Thu, Oct 17, 2013 at 05:50:35PM -0700, Davidlohr Bueso wrote:
> Linus recently pointed out[1] some of the amount of unnecessary work
> being done with the mmap_sem held. This patchset is a very initial
> approach on reducing some of the contention on this lock, and moving
> work outside of the critical region.
>
> Patch 1 adds a simple helper function.
>
> Patch 2 moves out some trivial setup logic in mlock related calls.
>
> Patch 3 allows managing new vmas without requiring the mmap_sem for
> vdsos. While it's true that there are many other scenarios where
> this can be done, few are actually as straightforward as this in the
> sense that we *always* end up allocating memory anyways, so there's really
> no tradeoffs. For this reason I wanted to get this patch out in the open.
>
> There are a few points to consider when preallocating vmas at the start
> of system calls, such as how many new vmas (ie: callers of split_vma can
> end up calling twice, depending on the mm state at that point) or the probability
> that we end up merging the vma instead of having to create a new one, like the
> case of brk or copy_vma. In both cases the overhead of creating and freeing
> memory at every syscall's invocation might outweigh what we gain in not holding
> the sem.
Hi Davidlohr,
I had a quick look at the patches and I don't see anything wrong with them.
However, I must also say that I have 99 problems with mmap_sem and the one
you're solving doesn't seem to be one of them, so I would be interested to
see performance numbers showing how much difference these changes make.
Generally the problems I see with mmap_sem are related to long latency
operations. Specifically, the mmap_sem write side is currently held
during the entire munmap operation, which iterates over user pages to
free them, and can take hundreds of milliseconds for large VMAs. Also,
the mmap_sem read side is held during user page faults - well, the
VM_FAULT_RETRY mechanism allows us to drop mmap_sem during major page
faults, but it is still held while allocating user pages or page tables,
and while going through FS code for asynchronous readahead, which turns
out not to be as asynchronous as you'd think as it can still block for
reading extents etc... So, generally the main issues I am seeing with
mmap_sem are latency related, while your changes only help for throughput
for workloads that don't hit the above latency issues. I think that's a
valid thing to do but I'm not sure if common workloads hit these throughput
issues today ?
> [1] https://lkml.org/lkml/2013/10/9/665
Eh, that post really makes it look easy doesn't it :)
There are a few complications with mmap_sem as it turns out to protect
more than just the VMA structures. For example, mmap_sem plays a role
in preventing page tables from being unmapped while follow_page_mask()
reads through them (there are other arch specific ways to do that,
like disabling local interrupts on x86 to prevent TLB shootdown, but
none that are currently available in generic code). This isn't an
issue with your current proposed patches but is something you need to
be aware of if you're going to do more work around the mmap_sem issues
(which I would encourage you to BTW - there are a lot of issues around
mmap_sem, so it definitely helps to have more people looking at this :)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Tue, Oct 22, 2013 at 4:48 PM, <[email protected]> wrote:
>
> Generally the problems I see with mmap_sem are related to long latency
> operations. Specifically, the mmap_sem write side is currently held
> during the entire munmap operation, which iterates over user pages to
> free them, and can take hundreds of milliseconds for large VMAs.
So this would be the *perfect* place to just downgrade the semaphore
from a write to a read.
Do the vma ops under the write semaphore, then downgrade it to a
read-sem, and do the page teardown with just mmap_sem held for
reading..
Comments? Anybody want to try that? It should be fairly
straightforward, and we had a somewhat similar issue when it came to
mmap() having to populate the mapping for mlock. For that case, it was
sufficient to just move the "populate" phase outside the lock entirely
(for that case, we actually drop the write lock and then take the
read-lock and re-lookup the vma, for unmap we'd have to do a proper
downgrade so that there is no window where the virtual address area
could be re-allocated)
The big issue is that we'd have to split up do_munmap() into those two
phases, since right now callers take the write semaphore before
calling it, and drop it afterwards. And some callers do it in a loop.
But we should be fairly easily able to make the *common* case (ie
normal "munmap()") do something like
down_write(&mm->mmap_sem);
phase1_munmap(..);
downgrade_write(&mm->mmap_sem);
phase2_munmap(..);
up_read(&mm->mmap_sem);
instead of what it does now (which is to just do
down_write()/up_write() around do_munmap()).
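Spelled out a bit more - purely a sketch, with phase1_munmap()/phase2_munmap()
being the hypothetical halves of do_munmap() from above, nothing else assumed -
the common path could end up looking like:

        int vm_munmap(unsigned long start, size_t len)
        {
                struct mm_struct *mm = current->mm;
                int ret;

                down_write(&mm->mmap_sem);
                /* phase 1: detach the vma(s) from the mm - needs the write side */
                ret = phase1_munmap(mm, start, len);
                if (ret) {
                        up_write(&mm->mmap_sem);
                        return ret;
                }
                /*
                 * Nobody can re-allocate the virtual address range from here
                 * on; the page and page table teardown only needs to keep
                 * writers out, so faults on other vmas can run concurrently.
                 */
                downgrade_write(&mm->mmap_sem);
                /* phase 2: free the pages, tear down page tables, TLB flush */
                phase2_munmap(mm, start, len);
                up_read(&mm->mmap_sem);

                return 0;
        }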
I don't see any fundamental problems, but maybe there's some really
annoying detail that makes this nasty (right now we do
"remove_vma_list() -> remove_vma()" *after* tearing down the page
tables, and since that calls the ->close function, I think it has to
be done that way. I'm wondering if any of that code relies on the
mmap_sem being held exclusively for writing. I don't see why it
possibly could, but..
So maybe I'm being overly optimistic and it's not as easy as just
splitting do_mmap() into two phases, but it really *looks* like it
might be just a ten-liner or so.. And if a real munmap() is the common
case (as opposed to a do_munmap() that gets triggered by somebody
doing a "mmap()" on top of an old mapping), then we'd at least allow
page faults from other threads to be done concurrently with tearing
down the page tables for the unmapped vma..
Linus
On Tue, Oct 22, 2013 at 9:20 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Oct 22, 2013 at 4:48 PM, <[email protected]> wrote:
>> Generally the problems I see with mmap_sem are related to long latency
>> operations. Specifically, the mmap_sem write side is currently held
>> during the entire munmap operation, which iterates over user pages to
>> free them, and can take hundreds of milliseconds for large VMAs.
>
> So this would be the *perfect* place to just downgrade the semaphore
> from a write to a read.
It's not as simple as that, because we currently rely on mmap_sem
write side being held during page table teardown in order to exclude
things like follow_page() which may otherwise access page tables while
we are potentially freeing them.
I do think it's solvable, but it gets complicated fast. Hugh & I have
been talking about it; the approach I'm looking at would involve
unwiring the page tables first (under protection of the mmap_sem write
lock) and then iterating on the unwired page tables to free the data
pages, issue TLB shootdowns and free the actual page tables (we
probably don't need even the mmap_sem read side at that point). But,
that's nowhere near a 10-line change anymore at that point...
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On 10/22/2013 08:48 AM, [email protected] wrote:
> On Thu, Oct 17, 2013 at 05:50:35PM -0700, Davidlohr Bueso wrote:
>> Linus recently pointed out[1] some of the amount of unnecessary work
>> being done with the mmap_sem held. This patchset is a very initial
>> approach on reducing some of the contention on this lock, and moving
>> work outside of the critical region.
>>
>> Patch 1 adds a simple helper function.
>>
>> Patch 2 moves out some trivial setup logic in mlock related calls.
>>
>> Patch 3 allows managing new vmas without requiring the mmap_sem for
>> vdsos. While it's true that there are many other scenarios where
>> this can be done, few are actually as straightforward as this in the
>> sense that we *always* end up allocating memory anyways, so there's really
>> no tradeoffs. For this reason I wanted to get this patch out in the open.
>>
>> There are a few points to consider when preallocating vmas at the start
>> of system calls, such as how many new vmas (ie: callers of split_vma can
>> end up calling twice, depending on the mm state at that point) or the probability
>> that we end up merging the vma instead of having to create a new one, like the
>> case of brk or copy_vma. In both cases the overhead of creating and freeing
>> memory at every syscall's invocation might outweigh what we gain in not holding
>> the sem.
>
> Hi Davidlohr,
>
> I had a quick look at the patches and I don't see anything wrong with them.
> However, I must also say that I have 99 problems with mmap_sem and the one
> you're solving doesn't seem to be one of them, so I would be interested to
> see performance numbers showing how much difference these changes make.
>
> Generally the problems I see with mmap_sem are related to long latency
> operations. Specifically, the mmap_sem write side is currently held
> during the entire munmap operation, which iterates over user pages to
> free them, and can take hundreds of milliseconds for large VMAs.
This is the leading cause of my "egads, something that should have been
fast got delayed for several ms" detector firing. I've been wondering:
Could we replace mmap_sem with some kind of efficient range lock? The
operations would be:
- mm_lock_all_write (drop-in replacement for down_write(&...->mmap_sem))
- mm_lock_all_read (same for down_read)
- mm_lock_write_range(mm, start, end)
- mm_lock_read_range(mm, start, end)
and corresponding unlock functions (that maybe take a cookie that the
lock functions return or that take a pointer to some small on-stack data
structure).
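Rough sketch of what I mean - all of the names and the cookie layout below
are made up, nothing like this exists today:

        struct mm_range_cookie {
                unsigned long start, end;       /* range this lock covers */
                bool write;
        };

        /* whole-address-space variants, drop-in for today's mmap_sem */
        void mm_lock_all_write(struct mm_struct *mm);
        void mm_unlock_all_write(struct mm_struct *mm);
        void mm_lock_all_read(struct mm_struct *mm);
        void mm_unlock_all_read(struct mm_struct *mm);

        /* range variants, tracked via a small on-stack cookie */
        void mm_lock_write_range(struct mm_struct *mm, unsigned long start,
                                 unsigned long end, struct mm_range_cookie *c);
        void mm_lock_read_range(struct mm_struct *mm, unsigned long start,
                                unsigned long end, struct mm_range_cookie *c);
        void mm_unlock_range(struct mm_struct *mm, struct mm_range_cookie *c);

so that e.g. a munmap()-style path would only exclude its own range:

        void example_unmap(struct mm_struct *mm, unsigned long start,
                           unsigned long end)
        {
                struct mm_range_cookie cookie;

                mm_lock_write_range(mm, start, end, &cookie);
                /* ... detach vmas and tear down pages in [start, end) ... */
                mm_unlock_range(mm, &cookie);
        }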
I think that all the mm functions except get_unmapped_area could use an
interface like this, possibly with considerably less code than they use
now. get_unmapped_area is the main exception -- it would want trylock
operations or something so it didn't get stuck.
The easiest way to implement this that I can think of is a doubly-linked
list or even just an array, which should be fine for a handful of
threads. Beyond that, I don't really know. Creating a whole trie for
these things would be expensive, and fine-grained locking on rbtree-like
things isn't so easy.
This could be a huge win: operations on non-overlapping addresses
wouldn't get in each other's way, except for TLB shootdown interrupts.
(Hey, CPU vendors: give us a real remote TLB shootdown mechanism!)
--Andy
On Tue, 2013-10-22 at 08:48 -0700, [email protected] wrote:
> On Thu, Oct 17, 2013 at 05:50:35PM -0700, Davidlohr Bueso wrote:
> > Linus recently pointed out[1] some of the amount of unnecessary work
> > being done with the mmap_sem held. This patchset is a very initial
> > approach on reducing some of the contention on this lock, and moving
> > work outside of the critical region.
> >
> > Patch 1 adds a simple helper function.
> >
> > Patch 2 moves out some trivial setup logic in mlock related calls.
> >
> > Patch 3 allows managing new vmas without requiring the mmap_sem for
> > vdsos. While it's true that there are many other scenarios where
> > this can be done, few are actually as straightforward as this in the
> > sense that we *always* end up allocating memory anyways, so there's really
> > no tradeoffs. For this reason I wanted to get this patch out in the open.
> >
> > There are a few points to consider when preallocating vmas at the start
> > of system calls, such as how many new vmas (ie: callers of split_vma can
> > end up calling twice, depending on the mm state at that point) or the probability
> > that we end up merging the vma instead of having to create a new one, like the
> > case of brk or copy_vma. In both cases the overhead of creating and freeing
> > memory at every syscall's invocation might outweigh what we gain in not holding
> > the sem.
>
> Hi Davidlohr,
>
> I had a quick look at the patches and I don't see anything wrong with them.
> However, I must also say that I have 99 problems with mmap_sem and the one
> you're solving doesn't seem to be one of them, so I would be interested to
> see performance numbers showing how much difference these changes make.
>
Thanks for taking a look - could I get your ack? Unless folks have any
problems with the patchset, I do think it's worth picking up...
As discussed with Ingo, the numbers aren't too exciting, mainly because
of the nature of vdsos: with all patches combined I'm getting roughly
+5% throughput boost on some aim7 workloads. This was a first attempt at
dealing with the new vma allocation we do with the mmap_sem held. I've
recently explored some of the options I mentioned above, like for
brk(2). Unfortunately I haven't gotten good results; for instance we get
a -15% hit in ops/sec with aim9's brk_test program. So we end up using
already existing vmas a lot more than creating new ones, which makes a
lot of sense anyway. So all in all I don't think that preallocating vmas
outside the lock is a path worth pursuing, even at a micro-optimization level,
except for vdsos.
> Generally the problems I see with mmap_sem are related to long latency
> operations. Specifically, the mmap_sem write side is currently held
> during the entire munmap operation, which iterates over user pages to
> free them, and can take hundreds of milliseconds for large VMAs. Also,
> the mmap_sem read side is held during user page fauls - well, the
> VM_FAULT_RETRY mechanism allows us to drop mmap_sem during major page
> faults, but it is still held while allocating user pages or page tables,
> and while going through FS code for asynchronous readahead, which turns
> out not to be as asynchronous as you'd think as it can still block for
> reading extends etc... So, generally the main issues I am seeing with
> mmap_sem are latency related, while your changes only help for throughput
> for workloads that don't hit the above latency issues. I think that's a
> valid thing to do but I'm not sure if common workloads hit these throughput
> issues today ?
IMHO munmap is just one more example where having one big fat lock for
serializing an entire address space is something that must be addressed
(and, as you mention, mmap_sem doesn't even limit itself only to vmas -
heck... we even got it protecting superpage faults). For instance, it
would be really nice if we could somehow handle rbtrees with RCU, and
avoid taking any locks for things like lookups (ie: find_vma), of course
we need to guarantee that the vma won't be free'd underneath us once it's
found :) Or even allow concurrent updates and rebalancing - nonetheless
I don't know how realistic such things can be in the kernel.
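As a pure illustration of what I mean - both the RCU-safe tree walk and the
per-vma refcount below are assumptions, neither exists today:

        struct vm_area_struct *find_vma_get(struct mm_struct *mm,
                                            unsigned long addr)
        {
                struct vm_area_struct *vma;

                rcu_read_lock();
                /* hypothetical RCU-safe walk - today's find_vma() is not */
                vma = find_vma_rcu_walk(mm, addr);
                /* hypothetical per-vma refcount that drops to 0 at teardown */
                if (vma && !atomic_inc_not_zero(&vma->vm_refcnt))
                        vma = NULL;     /* lost the race against unmap */
                rcu_read_unlock();

                return vma;     /* caller pairs this with a (hypothetical) vma_put() */
        }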
>
> > [1] https://lkml.org/lkml/2013/10/9/665
>
> Eh, that post really makes it look easy doesn't it :)
>
> There are a few complications with mmap_sem as it turns out to protect
> more than just the VMA structures. For example, mmap_sem plays a role
> in preventing page tables from being unmapped while follow_page_mask()
> reads through them (there are other arch specific ways to do that,
> like disabling local interrupts on x86 to prevent TLB shootdown, but
> none that are currently available in generic code). This isn't an
> issue with your current proposed patches but is something you need to
> be aware of if you're going to do more work around the mmap_sem issues
> (which I would encourage you to BTW - there are a lot of issues around
> mmap_sem, so it definitely helps to have more people looking at this :)
Thanks for shedding some light on the topic!
Davidlohr
On Thu, Oct 17, 2013 at 05:50:36PM -0700, Davidlohr Bueso wrote:
> Both do_brk and do_mmap_pgoff verify that we actually
> capable of locking future pages if the corresponding
> VM_LOCKED flags are used. Encapsulate this logic into
> a single mlock_future_check() helper function.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Thu, Oct 17, 2013 at 05:50:37PM -0700, Davidlohr Bueso wrote:
> All mlock related syscalls prepare lock limits, lengths and
> start parameters with the mmap_sem held. Move this logic
> outside of the critical region. For the case of mlock, continue
> incrementing the amount already locked by mm->locked_vm with
> the rwsem taken.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
Reviewed-by: Michel Lespinasse <[email protected]>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On 10/18/2013 02:50 AM, Davidlohr Bueso wrote:
> All mlock related syscalls prepare lock limits, lengths and
> start parameters with the mmap_sem held. Move this logic
> outside of the critical region. For the case of mlock, continue
> incrementing the amount already locked by mm->locked_vm with
> the rwsem taken.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/mlock.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index d480cd6..aa7de13 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -689,19 +689,21 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
>
> lru_add_drain_all(); /* flush pagevec */
>
> - down_write(&current->mm->mmap_sem);
> len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
> start &= PAGE_MASK;
>
> - locked = len >> PAGE_SHIFT;
> - locked += current->mm->locked_vm;
> -
> lock_limit = rlimit(RLIMIT_MEMLOCK);
> lock_limit >>= PAGE_SHIFT;
> + locked = len >> PAGE_SHIFT;
> +
> + down_write(&current->mm->mmap_sem);
> +
> + locked += current->mm->locked_vm;
>
> /* check against resource limits */
> if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
> error = do_mlock(start, len, 1);
> +
> up_write(&current->mm->mmap_sem);
> if (!error)
> error = __mm_populate(start, len, 0);
> @@ -712,11 +714,13 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
> {
> int ret;
>
> - down_write(&current->mm->mmap_sem);
> len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
> start &= PAGE_MASK;
> +
> + down_write(&current->mm->mmap_sem);
> ret = do_mlock(start, len, 0);
> up_write(&current->mm->mmap_sem);
> +
> return ret;
> }
>
> @@ -761,12 +765,12 @@ SYSCALL_DEFINE1(mlockall, int, flags)
> if (flags & MCL_CURRENT)
> lru_add_drain_all(); /* flush pagevec */
>
> - down_write(&current->mm->mmap_sem);
> -
> lock_limit = rlimit(RLIMIT_MEMLOCK);
> lock_limit >>= PAGE_SHIFT;
>
> ret = -ENOMEM;
> + down_write(&current->mm->mmap_sem);
> +
> if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
> capable(CAP_IPC_LOCK))
> ret = do_mlockall(flags);
>
On Sun, Oct 20, 2013 at 08:26:15PM -0700, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <[email protected]>
> Subject: [PATCH v2 3/3] vdso: preallocate new vmas
>
> With the exception of um and tile, architectures that use
> the install_special_mapping() function, when setting up a
> new vma at program startup, do so with the mmap_sem lock
> held for writing. Unless there's an error, this process
> ends up allocating a new vma through kmem_cache_zalloc,
> and inserting it in the task's address space.
>
> This patch moves the vma's space allocation outside of
> install_special_mapping(), and leaves the callers to do so
> explicitly, without depending on mmap_sem. The same goes for
> freeing: if the new vma isn't used (and thus the process fails
> at some point), it's caller's responsibility to free it -
> currently this is done inside install_special_mapping.
>
> Furthermore, uprobes behaves exactly the same and thus now the
> xol_add_vma() function also preallocates the new vma.
>
> While the changes to x86 vdso handling have been tested on both
> large and small 64-bit systems, the rest of the architectures
> are totally *untested*. Note that all changes are quite similar
> from architecture to architecture.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Richard Kuo <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Chris Metcalf <[email protected]>
> Cc: Jeff Dike <[email protected]>
> Cc: Richard Weinberger <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> v2:
> - Simplify install_special_mapping interface (Linus Torvalds)
> - Fix return for uml_setup_stubs when mem allocation fails (Richard Weinberger)
I'm still confused as to why you're seeing any gains with this
one. This code runs during exec when mm isn't shared with any other
threads yet, so why does it matter how long the mmap_sem is held since
nobody else can contend on it ? (well, except for accesses from
/fs/proc/base.c, but I don't see why these would matter in your
benchmarks either).
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Tue, Oct 22, 2013 at 10:54 AM, Andy Lutomirski <[email protected]> wrote:
> On 10/22/2013 08:48 AM, [email protected] wrote:
>> Generally the problems I see with mmap_sem are related to long latency
>> operations. Specifically, the mmap_sem write side is currently held
>> during the entire munmap operation, which iterates over user pages to
>> free them, and can take hundreds of milliseconds for large VMAs.
>
> This is the leading cause of my "egads, something that should have been
> fast got delayed for several ms" detector firing.
Yes, I'm seeing such issues relatively frequently as well.
> I've been wondering:
>
> Could we replace mmap_sem with some kind of efficient range lock? The
> operations would be:
>
> - mm_lock_all_write (drop-in replacement for down_write(&...->mmap_sem))
> - mm_lock_all_read (same for down_read)
> - mm_lock_write_range(mm, start, end)
> - mm_lock_read_range(mm, start_end)
>
> and corresponding unlock functions (that maybe take a cookie that the
> lock functions return or that take a pointer to some small on-stack data
> structure).
That seems doable; however, I believe we can get rid of the latencies
in the first place, which seems to be a better direction. As I briefly
mentioned, I would like to tackle the munmap problem sometime; Jan
Kara also has a project to remove places where blocking FS functions
are called with mmap_sem held (he's doing it for lock ordering
purposes, so that FS can call in to MM functions that take mmap_sem,
but there are latency benefits as well if we can avoid blocking in FS
with mmap_sem held).
> The easiest way to implement this that I can think of is a doubly-linked
> list or even just an array, which should be fine for a handful of
> threads. Beyond that, I don't really know. Creating a whole trie for
> these things would be expensive, and fine-grained locking on rbtree-like
> things isn't so easy.
Jan also had an implementation of range locks using interval trees. To
take a range lock, you'd add the range you want to the interval tree,
count the conflicting range lock requests that were there before you,
and (if nonzero) block until that count goes to 0. When releasing the
range lock, you look for any conflicting requests in the interval tree
and decrement their conflict count, waking them up if the count goes
to 0.
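Roughly like the following - this is only a sketch of the idea from memory,
not Jan's actual code, and all the names are made up:

        #include <linux/interval_tree.h>
        #include <linux/spinlock.h>
        #include <linux/sched.h>

        struct range_lock {
                struct interval_tree_node node; /* [start, last] being locked */
                unsigned int blocking_ranges;   /* conflicting lockers queued before us */
                struct task_struct *task;
        };

        struct range_lock_tree {
                struct rb_root root;
                spinlock_t lock;
        };

        void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
        {
                struct interval_tree_node *node;

                spin_lock(&tree->lock);
                lock->blocking_ranges = 0;
                lock->task = current;
                /* every overlapping range already queued has to finish first */
                for (node = interval_tree_iter_first(&tree->root, lock->node.start,
                                                     lock->node.last);
                     node;
                     node = interval_tree_iter_next(node, lock->node.start,
                                                    lock->node.last))
                        lock->blocking_ranges++;
                interval_tree_insert(&lock->node, &tree->root);

                /* sleep until all conflicting lockers queued before us unlock */
                while (lock->blocking_ranges) {
                        __set_current_state(TASK_UNINTERRUPTIBLE);
                        spin_unlock(&tree->lock);
                        schedule();
                        spin_lock(&tree->lock);
                }
                spin_unlock(&tree->lock);
        }

        void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
        {
                struct interval_tree_node *node;

                spin_lock(&tree->lock);
                interval_tree_remove(&lock->node, &tree->root);
                /* whoever still overlaps us was queued after us and counted us */
                for (node = interval_tree_iter_first(&tree->root, lock->node.start,
                                                     lock->node.last);
                     node;
                     node = interval_tree_iter_next(node, lock->node.start,
                                                    lock->node.last)) {
                        struct range_lock *blocked =
                                container_of(node, struct range_lock, node);

                        if (!--blocked->blocking_ranges)
                                wake_up_process(blocked->task);
                }
                spin_unlock(&tree->lock);
        }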
But as I said earlier, I would prefer if we could avoid holding
mmap_sem during long-latency operations rather than working around
this issue with range locks.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Wed, Oct 23, 2013 at 3:13 AM, Michel Lespinasse <[email protected]> wrote:
> On Tue, Oct 22, 2013 at 10:54 AM, Andy Lutomirski <[email protected]> wrote:
>> On 10/22/2013 08:48 AM, [email protected] wrote:
>>> Generally the problems I see with mmap_sem are related to long latency
>>> operations. Specifically, the mmap_sem write side is currently held
>>> during the entire munmap operation, which iterates over user pages to
>>> free them, and can take hundreds of milliseconds for large VMAs.
>>
>> This is the leading cause of my "egads, something that should have been
>> fast got delayed for several ms" detector firing.
>
> Yes, I'm seeing such issues relatively frequently as well.
>
>> I've been wondering:
>>
>> Could we replace mmap_sem with some kind of efficient range lock? The
>> operations would be:
>>
>> - mm_lock_all_write (drop-in replacement for down_write(&...->mmap_sem))
>> - mm_lock_all_read (same for down_read)
>> - mm_lock_write_range(mm, start, end)
>> - mm_lock_read_range(mm, start_end)
>>
>> and corresponding unlock functions (that maybe take a cookie that the
>> lock functions return or that take a pointer to some small on-stack data
>> structure).
>
> That seems doable, however I believe we can get rid of the latencies
> in the first place which seems to be a better direction. As I briefly
> mentioned, I would like to tackle the munmap problem sometime; Jan
> Kara also has a project to remove places where blocking FS functions
> are called with mmap_sem held (he's doing it for lock ordering
> purposes, so that FS can call in to MM functions that take mmap_sem,
> but there are latency benefits as well if we can avoid blocking in FS
> with mmap_sem held).
There will still be scalability issues if there are enough threads,
but maybe this isn't so bad. (My workload may also have priority
inversion problems -- there's a thread that runs on its own core and
needs the mmap_sem read lock and a thread that runs on a highly
contended core that needs the write lock.)
>
>> The easiest way to implement this that I can think of is a doubly-linked
>> list or even just an array, which should be fine for a handful of
>> threads. Beyond that, I don't really know. Creating a whole trie for
>> these things would be expensive, and fine-grained locking on rbtree-like
>> things isn't so easy.
>
> Jan also had an implementation of range locks using interval trees. To
> take a range lock, you'd add the range you want to the interval tree,
> count the conflicting range lock requests that were there before you,
> and (if nonzero) block until that count goes to 0. When releasing the
> range lock, you look for any conflicting requests in the interval tree
> and decrement their conflict count, waking them up if the count goes
> to 0.
Yuck. Now we're taking a per-mm lock on the rbtree, doing some
cacheline-bouncing rbtree operations, and dropping the lock to
serialize access to something that probably only has a small handful
of accessors at a time. I bet that an O(num locks) array or linked
list will end up being faster in practice.
I think the ideal solution would be to shove these things into the page
tables somehow, but that seems impossibly complicated.
--Andy
>
> But as I said earlier, I would prefer if we could avoid holding
> mmap_sem during long-latency operations rather than working around
> this issue with range locks.
>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.
--
Andy Lutomirski
AMA Capital Management, LLC
On Wed, 2013-10-23 at 02:53 -0700, [email protected] wrote:
> On Sun, Oct 20, 2013 at 08:26:15PM -0700, Davidlohr Bueso wrote:
> > From: Davidlohr Bueso <[email protected]>
> > Subject: [PATCH v2 3/3] vdso: preallocate new vmas
> >
> > With the exception of um and tile, architectures that use
> > install_special_mapping() to set up a new vma at program
> > startup do so with the mmap_sem lock held for writing.
> > Unless there's an error, this process
> > ends up allocating a new vma through kmem_cache_zalloc,
> > and inserting it in the task's address space.
> >
> > This patch moves the vma's space allocation outside of
> > install_special_mapping(), and leaves the callers to do so
> > explicitly, without depending on mmap_sem. The same goes for
> > freeing: if the new vma ends up unused (because the setup fails
> > at some point), it is the caller's responsibility to free it -
> > currently this is done inside install_special_mapping.
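To illustrate the shape of the change (this is not the patched code: the
exact arguments of the reworked install_special_mapping() are not
reproduced here, and addr, len, vm_flags and pages just stand in for
whatever each architecture passes), a typical caller ends up looking
roughly like:

	struct vm_area_struct *vma;
	int ret;

	/* allocate before taking mmap_sem, outside the critical region */
	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
	if (!vma)
		return -ENOMEM;

	down_write(&mm->mmap_sem);
	/* ... compute addr etc. as before ... */
	/* assumed to take the preallocated vma; see the patch for the
	   real signature */
	ret = install_special_mapping(mm, vma, addr, len, vm_flags, pages);
	up_write(&mm->mmap_sem);

	/* on failure the vma was never inserted, so the caller frees it */
	if (ret)
		kmem_cache_free(vm_area_cachep, vma);
	return ret;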
> >
> > Furthermore, uprobes behaves in exactly the same way, so the
> > xol_add_vma() function now also preallocates the new vma.
> >
> > While the changes to x86 vdso handling have been tested on both
> > large and small 64-bit systems, the rest of the architectures
> > are totally *untested*. Note that all changes are quite similar
> > from architecture to architecture.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > Cc: Russell King <[email protected]>
> > Cc: Catalin Marinas <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > Cc: Richard Kuo <[email protected]>
> > Cc: Ralf Baechle <[email protected]>
> > Cc: Benjamin Herrenschmidt <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: Martin Schwidefsky <[email protected]>
> > Cc: Heiko Carstens <[email protected]>
> > Cc: Paul Mundt <[email protected]>
> > Cc: Chris Metcalf <[email protected]>
> > Cc: Jeff Dike <[email protected]>
> > Cc: Richard Weinberger <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > ---
> > v2:
> > - Simplify install_special_mapping interface (Linus Torvalds)
> > - Fix return for uml_setup_stubs when mem allocation fails (Richard Weinberger)
>
> I'm still confused as to why you're seeing any gains with this
> one. This code runs during exec when mm isn't shared with any other
> threads yet, so why does it matter how long the mmap_sem is held since
> nobody else can contend on it? (well, except for accesses from
> /fs/proc/base.c, but I don't see why these would matter in your
> benchmarks either).
Yeah, that's why I dropped the performance numbers from the changelog in
v2; of course, any differences are within the noise range. When I did the
initial runs I was scratching my head as to why I was seeing benefits,
but it was most likely a matter of clock frequency differences, and I no
longer see such boosts.
Thanks,
Davidlohr