2015-07-21 20:00:46

by Eric B Munson

Subject: [PATCH V4 0/6] Allow user to request memory to be locked on page fault

mlock() allows a user to control page out of program memory, but this
comes at the cost of faulting in the entire mapping when it is
allocated. For large mappings where the entire area is not necessary
this is not ideal. Instead of forcing all locked pages to be present
when they are allocated, this set creates a middle ground. Pages are
marked to be placed on the unevictable LRU (locked) when they are first
used, but they are not faulted in by the mlock call.

This series introduces a new mlock() system call that takes a flags
argument along with the start address and size. This flags argument
gives the caller the ability to request memory be locked in the
traditional way, or to be locked after the page is faulted in. New
calls are added for munlock() and munlockall() which give the caller a
way to specify which flags are supposed to be cleared. A new MCL flag
is added to mirror the lock on fault behavior from mlock() in
mlockall(). Finally, a flag for mmap() is added that allows a user to
specify that the covered area should not be paged out, but only after
the memory has been used the first time.

There are two main use cases that this set covers. The first is the
security focussed mlock case. A buffer is needed that cannot be written
to swap. The maximum size is known, but on average the memory used is
significantly less than this maximum. With lock on fault, the buffer
is guaranteed to never be paged out without consuming the maximum size
every time such a buffer is created.

The second use case is focussed on performance. Portions of a large
file are needed and we want to keep the used portions in memory once
accessed. This is the case for large graphical models where the path
through the graph is not known until run time. The entire graph is
unlikely to be used in a given invocation, but once a node has been
used it needs to stay resident for further processing. Given these
constraints we have a number of options. We can potentially waste a
large amount of memory by mlocking the entire region (this can also
cause a significant stall at startup as the entire file is read in).
We can mlock every page as we access them without tracking if the page
is already resident but this introduces large overhead for each access.
The third option is mapping the entire region with PROT_NONE and using
a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the
needed page. Doing this page at a time adds a significant performance
penalty. Batching can be used to mitigate this overhead, but in order
to safely avoid trying to mprotect pages outside of the mapping, the
boundaries of each mapping to be used in this way must be tracked and
available to the signal handler. This is precisely what the mm system
in the kernel should already be doing.

For mlock(MLOCK_ONFAULT) and mmap(MAP_LOCKONFAULT) the user is charged
against RLIMIT_MEMLOCK as if mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) had
been used, that is, when the VMA is created, not when the pages are
faulted in. For mlockall(MCL_ONFAULT) the user is charged as if
MCL_FUTURE was used.
This decision was made to keep the accounting checks out of the page
fault path.
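
Because the charge happens up front, a caller can estimate in advance
whether a lock-on-fault request for a given length would be refused. The
helper below is a rough user-space sketch under stated assumptions:
fits_memlock_limit() is a hypothetical name, the caller supplies its own
count of already locked bytes, the comparison is done in bytes rather
than pages, and the CAP_IPC_LOCK override that the kernel honours is not
modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

/*
 * Return 1 if locking `len` more bytes on top of `already_locked` stays
 * within the soft RLIMIT_MEMLOCK, 0 if it would not, and -1 if the limit
 * cannot be read. Hypothetical helper, byte granularity only.
 */
int fits_memlock_limit(size_t already_locked, size_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	/* Reject requests whose total would overflow size_t. */
	if (len > SIZE_MAX - already_locked)
		return 0;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return (rlim_t)(already_locked + len) <= rl.rlim_cur;
}
```

With lock on fault the full length of the range counts against the
limit, even if only a few pages are ever touched.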

To illustrate the benefit of this set I wrote a test program that mmaps
a 5 GB file filled with random data and then makes 15,000,000 accesses
to random addresses in that mapping. The test program was run 20 times
for each setup. Results are reported for two program portions, setup
and execution. The setup phase is calling mmap and optionally mlock on
the entire region. For most experiments this is trivial, but it
highlights the cost of faulting in the entire region. Results are
averages across the 20 runs in milliseconds.

mmap with mlock(MLOCK_LOCKED) on entire range:
Setup avg: 8228.666
Processing avg: 8274.257

mmap with mlock(MLOCK_LOCKED) before each access:
Setup avg: 0.113
Processing avg: 90993.552

mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value in max_map_count, this gets ENOMEM as I attempt
to change the permissions, after upping the sysctl significantly I get:
Setup avg: 0.058
Processing avg: 69488.073

mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg: 0.068
Processing avg: 38204.116

mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg: 0.044
Processing avg: 29671.180

mmap with mlock(MLOCK_ONFAULT) on entire range:
Setup avg: 0.189
Processing avg: 17904.899

The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping. The
first step covers the page that caused the fault as we know that it will
be possible to lock. The second step speculatively tries to mlock and
mprotect the batch size - 1 pages that follow. There may be a clever
way to avoid this without having the program track each mapping to be
covered by this handler in a globally accessible structure, but I could
not find it. It should be noted that with a large enough batch size
this two step fault handler can still cause the program to crash if it
reaches far beyond the end of the mapping.
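
The address arithmetic behind the two steps can be sketched like this.
The struct fault_ranges type and split_fault() are hypothetical names
used only for illustration; the handler would mprotect() and mlock() the
first range unconditionally and then attempt the second range, ignoring
failures.

```c
#include <stddef.h>
#include <stdint.h>

struct fault_ranges {
	uintptr_t first_start;	/* page containing the faulting address */
	size_t first_len;	/* always exactly one page */
	uintptr_t second_start;	/* page right after the faulting page */
	size_t second_len;	/* (batch_pages - 1) pages, may be 0 */
};

/* page_size must be a power of two, batch_pages is the batch size. */
struct fault_ranges split_fault(uintptr_t fault_addr, size_t page_size,
				size_t batch_pages)
{
	struct fault_ranges r;

	r.first_start = fault_addr & ~((uintptr_t)page_size - 1);
	r.first_len = page_size;
	r.second_start = r.first_start + page_size;
	r.second_len = batch_pages > 1 ? (batch_pages - 1) * page_size : 0;
	return r;
}
```

Nothing in this calculation knows where the mapping ends, which is why
the second range is speculative.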

These results show that if the developer knows that a majority of the
mapping will be used, it is better to fault it all in at once;
otherwise lock on fault is significantly faster.

The performance cost of these patches is minimal on the two benchmarks
I have tested (stream and kernbench). The following are the average
values across 20 runs of stream and 10 runs of kernbench after a warmup
run whose results were discarded.

Avg throughput in MB/s from stream using 1000000 element arrays
Test      4.2-rc1     4.2-rc1+lock-on-fault
Copy:     10,566.5    10,421
Scale:    10,685      10,503.5
Add:      12,044.1    11,814.2
Triad:    12,064.8    11,846.3

Kernbench optimal load
                  4.2-rc1     4.2-rc1+lock-on-fault
Elapsed Time      78.453      78.991
User Time         64.2395     65.2355
System Time       9.7335      9.7085
Context Switches  22211.5     22412.1
Sleeps            14965.3     14956.1

---
Changes from V3:
Ensure that pages present when mlock2(MLOCK_ONFAULT) is called are locked
Ensure that VM_LOCKONFAULT is handled in cases that used to only check VM_LOCKED
Add tests for new system calls
Add missing syscall entries, fix NR_syscalls on multiple architectures
Add missing MAP_LOCKONFAULT for tile

Changes from V2:
Added new system calls for mlock, munlock, and munlockall with added
flags arguments for controlling how memory is locked or unlocked.

Eric B Munson (6):
mm: mlock: Refactor mlock, munlock, and munlockall code
mm: mlock: Add new mlock, munlock, and munlockall system calls
mm: gup: Add mm_lock_present()
mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it
mm: mmap: Add mmap flag to request VM_LOCKONFAULT
selftests: vm: Add tests for lock on fault

arch/alpha/include/asm/unistd.h | 2 +-
arch/alpha/include/uapi/asm/mman.h | 5 +
arch/alpha/include/uapi/asm/unistd.h | 3 +
arch/alpha/kernel/systbls.S | 3 +
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 3 +
arch/arm/kernel/calls.S | 3 +
arch/arm64/include/asm/unistd32.h | 6 +
arch/avr32/include/uapi/asm/unistd.h | 3 +
arch/avr32/kernel/syscall_table.S | 3 +
arch/blackfin/include/uapi/asm/unistd.h | 3 +
arch/blackfin/mach-common/entry.S | 3 +
arch/cris/arch-v10/kernel/entry.S | 3 +
arch/cris/arch-v32/kernel/entry.S | 3 +
arch/frv/kernel/entry.S | 3 +
arch/ia64/include/asm/unistd.h | 2 +-
arch/ia64/include/uapi/asm/unistd.h | 3 +
arch/ia64/kernel/entry.S | 3 +
arch/m32r/kernel/entry.S | 3 +
arch/m32r/kernel/syscall_table.S | 3 +
arch/m68k/include/asm/unistd.h | 2 +-
arch/m68k/include/uapi/asm/unistd.h | 3 +
arch/m68k/kernel/syscalltable.S | 3 +
arch/microblaze/include/uapi/asm/unistd.h | 3 +
arch/microblaze/kernel/syscall_table.S | 3 +
arch/mips/include/uapi/asm/mman.h | 8 +
arch/mips/include/uapi/asm/unistd.h | 21 +-
arch/mips/kernel/scall32-o32.S | 3 +
arch/mips/kernel/scall64-64.S | 3 +
arch/mips/kernel/scall64-n32.S | 3 +
arch/mips/kernel/scall64-o32.S | 3 +
arch/mn10300/kernel/entry.S | 3 +
arch/parisc/include/uapi/asm/mman.h | 5 +
arch/parisc/include/uapi/asm/unistd.h | 5 +-
arch/powerpc/include/uapi/asm/mman.h | 5 +
arch/powerpc/include/uapi/asm/unistd.h | 3 +
arch/s390/include/uapi/asm/unistd.h | 5 +-
arch/s390/kernel/compat_wrapper.c | 3 +
arch/s390/kernel/syscalls.S | 3 +
arch/sh/kernel/syscalls_32.S | 3 +
arch/sparc/include/uapi/asm/mman.h | 5 +
arch/sparc/include/uapi/asm/unistd.h | 5 +-
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 +-
arch/tile/include/uapi/asm/mman.h | 9 +
arch/x86/entry/syscalls/syscall_32.tbl | 3 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
arch/xtensa/include/uapi/asm/mman.h | 8 +
arch/xtensa/include/uapi/asm/unistd.h | 10 +-
drivers/gpu/drm/drm_vm.c | 8 +-
fs/proc/task_mmu.c | 3 +-
include/linux/mm.h | 2 +
include/linux/mman.h | 3 +-
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/mman.h | 5 +
include/uapi/asm-generic/unistd.h | 8 +-
kernel/events/core.c | 2 +
kernel/events/uprobes.c | 2 +-
kernel/fork.c | 2 +-
kernel/sys_ni.c | 3 +
mm/debug.c | 1 +
mm/gup.c | 175 +++++++-
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 4 +-
mm/internal.h | 5 +-
mm/ksm.c | 2 +-
mm/madvise.c | 4 +-
mm/memory.c | 5 +-
mm/mlock.c | 159 +++++--
mm/mmap.c | 32 +-
mm/mremap.c | 6 +-
mm/msync.c | 2 +-
mm/rmap.c | 12 +-
mm/shmem.c | 2 +-
mm/swap.c | 3 +-
mm/vmscan.c | 2 +-
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/lock-on-fault.c | 344 +++++++++++++++
tools/testing/selftests/vm/mlock2-tests.c | 621 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++
tools/testing/selftests/vm/run_vmtests | 33 ++
81 files changed, 1604 insertions(+), 104 deletions(-)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/mlock2-tests.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c

Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>

--
1.9.1


2015-07-21 19:59:48

by Eric B Munson

Subject: [PATCH V4 1/6] mm: mlock: Refactor mlock, munlock, and munlockall code

With the exception of mlockall(), none of the mlock family of system
calls takes a flags argument, so they are not extensible. A later patch
in this set will extend the mlock family to support a middle ground
between pages that are locked and faulted in immediately and unlocked
pages. To pave the way for the new system calls, the code needs some
reorganization so that all the actual entry points do is check input
and translate it to VMA flags.

This patch mostly moves code around with the exception of
do_munlockall(). All three functions are changed to support a follow on
patch which introduces new system calls that allow the user to specify
flags for these calls.
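
The core of the refactoring is that one helper either sets or clears an
arbitrary set of flag bits, instead of hard-coding VM_LOCKED. A
user-space illustration of that logic, with vm_flags_demo_t and
apply_flags_demo() as hypothetical stand-ins for the kernel's vm_flags_t
and the per-VMA step inside apply_vma_flags():

```c
#include <stdbool.h>

/* Stand-in for the kernel's vm_flags_t; not the real type. */
typedef unsigned long vm_flags_demo_t;

/* Set the bits in `flags` when add_flags is true, clear them otherwise. */
vm_flags_demo_t apply_flags_demo(vm_flags_demo_t cur, vm_flags_demo_t flags,
				 bool add_flags)
{
	vm_flags_demo_t newflags = cur;

	if (add_flags)
		newflags |= flags;
	else
		newflags &= ~flags;
	return newflags;
}
```

Bits outside `flags` are left untouched in both directions, which is
what lets a later patch pass a different lock flag through the same
path.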

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
---
mm/mlock.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..8e52c23 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -553,7 +553,8 @@ out:
return ret;
}

-static int do_mlock(unsigned long start, size_t len, int on)
+static int apply_vma_flags(unsigned long start, size_t len,
+ vm_flags_t flags, bool add_flags)
{
unsigned long nstart, end, tmp;
struct vm_area_struct * vma, * prev;
@@ -579,9 +580,11 @@ static int do_mlock(unsigned long start, size_t len, int on)

/* Here we know that vma->vm_start <= nstart < vma->vm_end. */

- newflags = vma->vm_flags & ~VM_LOCKED;
- if (on)
- newflags |= VM_LOCKED;
+ newflags = vma->vm_flags;
+ if (add_flags)
+ newflags |= flags;
+ else
+ newflags &= ~flags;

tmp = vma->vm_end;
if (tmp > end)
@@ -604,7 +607,7 @@ static int do_mlock(unsigned long start, size_t len, int on)
return error;
}

-SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
unsigned long locked;
unsigned long lock_limit;
@@ -628,7 +631,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)

/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = do_mlock(start, len, 1);
+ error = apply_vma_flags(start, len, flags, true);

up_write(&current->mm->mmap_sem);
if (error)
@@ -640,7 +643,12 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return 0;
}

-SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+{
+ return do_mlock(start, len, VM_LOCKED);
+}
+
+static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
{
int ret;

@@ -648,20 +656,23 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
start &= PAGE_MASK;

down_write(&current->mm->mmap_sem);
- ret = do_mlock(start, len, 0);
+ ret = apply_vma_flags(start, len, flags, false);
up_write(&current->mm->mmap_sem);

return ret;
}

+SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+{
+ return do_munlock(start, len, VM_LOCKED);
+}
+
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;

if (flags & MCL_FUTURE)
current->mm->def_flags |= VM_LOCKED;
- else
- current->mm->def_flags &= ~VM_LOCKED;
if (flags == MCL_FUTURE)
goto out;

@@ -711,12 +722,36 @@ out:
return ret;
}

+static int do_munlockall(int flags)
+{
+ struct vm_area_struct * vma, * prev = NULL;
+
+ if (flags & MCL_FUTURE)
+ current->mm->def_flags &= ~VM_LOCKED;
+ if (flags == MCL_FUTURE)
+ goto out;
+
+ for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+ vm_flags_t newflags;
+
+ newflags = vma->vm_flags;
+ if (flags & MCL_CURRENT)
+ newflags &= ~VM_LOCKED;
+
+ /* Ignore errors */
+ mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
+ cond_resched_rcu_qs();
+ }
+out:
+ return 0;
+}
+
SYSCALL_DEFINE0(munlockall)
{
int ret;

down_write(&current->mm->mmap_sem);
- ret = do_mlockall(0);
+ ret = do_munlockall(MCL_CURRENT | MCL_FUTURE);
up_write(&current->mm->mmap_sem);
return ret;
}
--
1.9.1

2015-07-21 20:00:36

by Eric B Munson

Subject: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

With the refactored mlock code, introduce new system calls for mlock,
munlock, and munlockall. The new calls will allow the user to specify
what lock states are being added or cleared. mlock2 and munlock2 are
trivial at the moment, but a follow on patch will add a new mlock state
making them useful.

munlockall2 addresses a limitation of the current implementation. If a
user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
that MCL_FUTURE should be removed, they would have to call munlockall()
followed by mlockall(MCL_CURRENT) which could potentially be very
expensive. The new munlockall2 system call allows a user to simply
clear the MCL_FUTURE flag.
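
For reference, the existing two-call workaround described above looks
like this with the system calls available today. The function name
drop_mcl_future_workaround() is hypothetical; munlockall() and
mlockall() are the standard interfaces.

```c
#include <sys/mman.h>

/*
 * Drop MCL_FUTURE while keeping current mappings locked, using only the
 * existing calls. Returns 0 on success, -1 with errno set on failure.
 */
int drop_mcl_future_workaround(void)
{
	/* Step 1: unlock everything, which also clears MCL_FUTURE. */
	if (munlockall() != 0)
		return -1;

	/*
	 * Step 2: lock what is mapped right now. This walks every mapping
	 * again, and pages are unlocked between the two calls.
	 */
	if (mlockall(MCL_CURRENT) != 0)
		return -1;

	return 0;
}
```

The second call can also fail, for example when RLIMIT_MEMLOCK is too
low, leaving the process with nothing locked.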

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Guenter Roeck <[email protected]>
---
Changes from V3:
* Do a (hopefully) complete job of adding the new system calls

arch/alpha/include/asm/unistd.h | 2 +-
arch/alpha/include/uapi/asm/mman.h | 2 ++
arch/alpha/include/uapi/asm/unistd.h | 3 +++
arch/alpha/kernel/systbls.S | 3 +++
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 3 +++
arch/arm/kernel/calls.S | 3 +++
arch/arm64/include/asm/unistd32.h | 6 ++++++
arch/avr32/include/uapi/asm/unistd.h | 3 +++
arch/avr32/kernel/syscall_table.S | 3 +++
arch/blackfin/include/uapi/asm/unistd.h | 3 +++
arch/blackfin/mach-common/entry.S | 3 +++
arch/cris/arch-v10/kernel/entry.S | 3 +++
arch/cris/arch-v32/kernel/entry.S | 3 +++
arch/frv/kernel/entry.S | 3 +++
arch/ia64/include/asm/unistd.h | 2 +-
arch/ia64/include/uapi/asm/unistd.h | 3 +++
arch/ia64/kernel/entry.S | 3 +++
arch/m32r/kernel/entry.S | 3 +++
arch/m32r/kernel/syscall_table.S | 3 +++
arch/m68k/include/asm/unistd.h | 2 +-
arch/m68k/include/uapi/asm/unistd.h | 3 +++
arch/m68k/kernel/syscalltable.S | 3 +++
arch/microblaze/include/uapi/asm/unistd.h | 3 +++
arch/microblaze/kernel/syscall_table.S | 3 +++
arch/mips/include/uapi/asm/mman.h | 5 +++++
arch/mips/include/uapi/asm/unistd.h | 21 +++++++++++++++------
arch/mips/kernel/scall32-o32.S | 3 +++
arch/mips/kernel/scall64-64.S | 3 +++
arch/mips/kernel/scall64-n32.S | 3 +++
arch/mips/kernel/scall64-o32.S | 3 +++
arch/mn10300/kernel/entry.S | 3 +++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/parisc/include/uapi/asm/unistd.h | 5 ++++-
arch/powerpc/include/uapi/asm/mman.h | 2 ++
arch/powerpc/include/uapi/asm/unistd.h | 3 +++
arch/s390/include/uapi/asm/unistd.h | 5 ++++-
arch/s390/kernel/compat_wrapper.c | 3 +++
arch/s390/kernel/syscalls.S | 3 +++
arch/sh/kernel/syscalls_32.S | 3 +++
arch/sparc/include/uapi/asm/mman.h | 2 ++
arch/sparc/include/uapi/asm/unistd.h | 5 ++++-
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 ++--
arch/tile/include/uapi/asm/mman.h | 5 +++++
arch/x86/entry/syscalls/syscall_32.tbl | 3 +++
arch/x86/entry/syscalls/syscall_64.tbl | 3 +++
arch/xtensa/include/uapi/asm/mman.h | 5 +++++
arch/xtensa/include/uapi/asm/unistd.h | 10 ++++++++--
include/linux/syscalls.h | 4 ++++
include/uapi/asm-generic/mman.h | 2 ++
include/uapi/asm-generic/unistd.h | 8 +++++++-
kernel/sys_ni.c | 3 +++
mm/mlock.c | 28 ++++++++++++++++++++++++++++
54 files changed, 205 insertions(+), 19 deletions(-)

diff --git a/arch/alpha/include/asm/unistd.h b/arch/alpha/include/asm/unistd.h
index a56e608..1d09392 100644
--- a/arch/alpha/include/asm/unistd.h
+++ b/arch/alpha/include/asm/unistd.h
@@ -3,7 +3,7 @@

#include <uapi/asm/unistd.h>

-#define NR_SYSCALLS 514
+#define NR_SYSCALLS 517

#define __ARCH_WANT_OLD_READDIR
#define __ARCH_WANT_STAT64
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..ec72436 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -38,6 +38,8 @@
#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */

+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/alpha/include/uapi/asm/unistd.h b/arch/alpha/include/uapi/asm/unistd.h
index aa33bf5..29141d6 100644
--- a/arch/alpha/include/uapi/asm/unistd.h
+++ b/arch/alpha/include/uapi/asm/unistd.h
@@ -475,5 +475,8 @@
#define __NR_getrandom 511
#define __NR_memfd_create 512
#define __NR_execveat 513
+#define __NR_mlock2 514
+#define __NR_munlock2 515
+#define __NR_munlockall2 516

#endif /* _UAPI_ALPHA_UNISTD_H */
diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 9b62e3f..04d1cce 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -532,6 +532,9 @@ sys_call_table:
.quad sys_getrandom
.quad sys_memfd_create
.quad sys_execveat
+ .quad sys_mlock2
+ .quad sys_munlock2 /* 515 */
+ .quad sys_munlockall2

.size sys_call_table, . - sys_call_table
.type sys_call_table, @object
diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 32640c4..7cba573 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -19,7 +19,7 @@
* This may need to be greater than __NR_last_syscall+1 in order to
* account for the padding in the syscall table
*/
-#define __NR_syscalls (388)
+#define __NR_syscalls (392)

/*
* *NOTE*: This is a ghost syscall private to the kernel. Only the
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 0c3f5a0..46eaf405 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -414,6 +414,9 @@
#define __NR_memfd_create (__NR_SYSCALL_BASE+385)
#define __NR_bpf (__NR_SYSCALL_BASE+386)
#define __NR_execveat (__NR_SYSCALL_BASE+387)
+#define __NR_mlock2 (__NR_SYSCALL_BASE+388)
+#define __NR_munlock2 (__NR_SYSCALL_BASE+389)
+#define __NR_munlockall2 (__NR_SYSCALL_BASE+390)

/*
* The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 05745eb..8880822 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -397,6 +397,9 @@
/* 385 */ CALL(sys_memfd_create)
CALL(sys_bpf)
CALL(sys_execveat)
+ CALL(sys_mlock2)
+ CALL(sys_munlock2)
+/* 390 */ CALL(sys_munlockall2)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cef934a..318072aa 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -797,3 +797,9 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 387
__SYSCALL(__NR_execveat, compat_sys_execveat)
+#define __NR_mlock2 388
+__SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_munlock2 389
+__SYSCALL(__NR_munlock2, sys_munlock2)
+#define __NR_munlockall2 390
+__SYSCALL(__NR_munlockall2, sys_munlockall2)
diff --git a/arch/avr32/include/uapi/asm/unistd.h b/arch/avr32/include/uapi/asm/unistd.h
index bbe2fba..e6a1681 100644
--- a/arch/avr32/include/uapi/asm/unistd.h
+++ b/arch/avr32/include/uapi/asm/unistd.h
@@ -333,5 +333,8 @@
#define __NR_memfd_create 318
#define __NR_bpf 319
#define __NR_execveat 320
+#define __NR_mlock2 321
+#define __NR_munlock2 322
+#define __NR_munlockall2 323

#endif /* _UAPI__ASM_AVR32_UNISTD_H */
diff --git a/arch/avr32/kernel/syscall_table.S b/arch/avr32/kernel/syscall_table.S
index c3b593b..83928ab 100644
--- a/arch/avr32/kernel/syscall_table.S
+++ b/arch/avr32/kernel/syscall_table.S
@@ -334,4 +334,7 @@ sys_call_table:
.long sys_memfd_create
.long sys_bpf
.long sys_execveat /* 320 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
.long sys_ni_syscall /* r8 is saturated at nr_syscalls */
diff --git a/arch/blackfin/include/uapi/asm/unistd.h b/arch/blackfin/include/uapi/asm/unistd.h
index 0cb9078..37c0362 100644
--- a/arch/blackfin/include/uapi/asm/unistd.h
+++ b/arch/blackfin/include/uapi/asm/unistd.h
@@ -433,6 +433,9 @@
#define __IGNORE_munlock
#define __IGNORE_mlockall
#define __IGNORE_munlockall
+#define __IGNORE_mlock2
+#define __IGNORE_munlock2
+#define __IGNORE_munlockall2
#define __IGNORE_mincore
#define __IGNORE_madvise
#define __IGNORE_remap_file_pages
diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
index 8d9431e..5d83587 100644
--- a/arch/blackfin/mach-common/entry.S
+++ b/arch/blackfin/mach-common/entry.S
@@ -1704,6 +1704,9 @@ ENTRY(_sys_call_table)
.long _sys_memfd_create /* 390 */
.long _sys_bpf
.long _sys_execveat
+ .long _sys_mlock2
+ .long _sys_munlock2
+ .long _sys_munlockall2 /* 395 */

.rept NR_syscalls-(.-_sys_call_table)/4
.long _sys_ni_syscall
diff --git a/arch/cris/arch-v10/kernel/entry.S b/arch/cris/arch-v10/kernel/entry.S
index 81570fc..d0ce531 100644
--- a/arch/cris/arch-v10/kernel/entry.S
+++ b/arch/cris/arch-v10/kernel/entry.S
@@ -955,6 +955,9 @@ sys_call_table:
.long sys_process_vm_writev
.long sys_kcmp /* 350 */
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2

/*
* NOTE!! This doesn't have to be exact - we just have
diff --git a/arch/cris/arch-v32/kernel/entry.S b/arch/cris/arch-v32/kernel/entry.S
index 026a0b2..7f50a0b 100644
--- a/arch/cris/arch-v32/kernel/entry.S
+++ b/arch/cris/arch-v32/kernel/entry.S
@@ -875,6 +875,9 @@ sys_call_table:
.long sys_process_vm_writev
.long sys_kcmp /* 350 */
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2

/*
* NOTE!! This doesn't have to be exact - we just have
diff --git a/arch/frv/kernel/entry.S b/arch/frv/kernel/entry.S
index dfcd263..ee605a0 100644
--- a/arch/frv/kernel/entry.S
+++ b/arch/frv/kernel/entry.S
@@ -1515,5 +1515,8 @@ sys_call_table:
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_setns
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2 /* 340 */

syscall_table_size = (. - sys_call_table)
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index 95c39b9..db73390 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -11,7 +11,7 @@



-#define NR_syscalls 319 /* length of syscall table */
+#define NR_syscalls 322 /* length of syscall table */

/*
* The following defines stop scripts/checksyscalls.sh from complaining about
diff --git a/arch/ia64/include/uapi/asm/unistd.h b/arch/ia64/include/uapi/asm/unistd.h
index 4610795..5f485cc 100644
--- a/arch/ia64/include/uapi/asm/unistd.h
+++ b/arch/ia64/include/uapi/asm/unistd.h
@@ -332,5 +332,8 @@
#define __NR_memfd_create 1340
#define __NR_bpf 1341
#define __NR_execveat 1342
+#define __NR_mlock2 1343
+#define __NR_munlock2 1344
+#define __NR_munlockall2 1345

#endif /* _UAPI_ASM_IA64_UNISTD_H */
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index ae0de7b..3ef4457 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1768,5 +1768,8 @@ sys_call_table:
data8 sys_memfd_create // 1340
data8 sys_bpf
data8 sys_execveat
+ data8 sys_mlock2
+ data8 sys_munlock2
+ data8 sys_munlockall2 // 1345

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
diff --git a/arch/m32r/kernel/entry.S b/arch/m32r/kernel/entry.S
index c639bfa..4f7f2e2 100644
--- a/arch/m32r/kernel/entry.S
+++ b/arch/m32r/kernel/entry.S
@@ -76,6 +76,9 @@
#define sys_munlock sys_ni_syscall
#define sys_mlockall sys_ni_syscall
#define sys_munlockall sys_ni_syscall
+#define sys_mlock2 sys_ni_syscall
+#define sys_munlock2 sys_ni_syscall
+#define sys_munlockall2 sys_ni_syscall
#define sys_mremap sys_ni_syscall
#define sys_mincore sys_ni_syscall
#define sys_remap_file_pages sys_ni_syscall
diff --git a/arch/m32r/kernel/syscall_table.S b/arch/m32r/kernel/syscall_table.S
index f365c19..9918c3e 100644
--- a/arch/m32r/kernel/syscall_table.S
+++ b/arch/m32r/kernel/syscall_table.S
@@ -325,3 +325,6 @@ ENTRY(sys_call_table)
.long sys_eventfd
.long sys_fallocate
.long sys_setns /* 325 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
diff --git a/arch/m68k/include/asm/unistd.h b/arch/m68k/include/asm/unistd.h
index 244e0db..b18f3da 100644
--- a/arch/m68k/include/asm/unistd.h
+++ b/arch/m68k/include/asm/unistd.h
@@ -4,7 +4,7 @@
#include <uapi/asm/unistd.h>


-#define NR_syscalls 356
+#define NR_syscalls 359

#define __ARCH_WANT_OLD_READDIR
#define __ARCH_WANT_OLD_STAT
diff --git a/arch/m68k/include/uapi/asm/unistd.h b/arch/m68k/include/uapi/asm/unistd.h
index 61fb6cb..1405c3f 100644
--- a/arch/m68k/include/uapi/asm/unistd.h
+++ b/arch/m68k/include/uapi/asm/unistd.h
@@ -361,5 +361,8 @@
#define __NR_memfd_create 353
#define __NR_bpf 354
#define __NR_execveat 355
+#define __NR_mlock2 356
+#define __NR_munlock2 357
+#define __NR_munlockall2 358

#endif /* _UAPI_ASM_M68K_UNISTD_H_ */
diff --git a/arch/m68k/kernel/syscalltable.S b/arch/m68k/kernel/syscalltable.S
index a0ec430..7963c03 100644
--- a/arch/m68k/kernel/syscalltable.S
+++ b/arch/m68k/kernel/syscalltable.S
@@ -376,4 +376,7 @@ ENTRY(sys_call_table)
.long sys_memfd_create
.long sys_bpf
.long sys_execveat /* 355 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2

diff --git a/arch/microblaze/include/uapi/asm/unistd.h b/arch/microblaze/include/uapi/asm/unistd.h
index 32850c7..59b06b0 100644
--- a/arch/microblaze/include/uapi/asm/unistd.h
+++ b/arch/microblaze/include/uapi/asm/unistd.h
@@ -404,5 +404,8 @@
#define __NR_memfd_create 386
#define __NR_bpf 387
#define __NR_execveat 388
+#define __NR_mlock2 389 /* ok - nommu or mmu */
+#define __NR_munlock2 390 /* ok - nommu or mmu */
+#define __NR_munlockall2 391 /* ok - nommu or mmu */

#endif /* _UAPI_ASM_MICROBLAZE_UNISTD_H */
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 29c8568..6e4b0fe 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -389,3 +389,6 @@ ENTRY(sys_call_table)
.long sys_memfd_create
.long sys_bpf
.long sys_execveat
+ .long sys_mlock2
+ .long sys_munlock2 /* 390 */
+ .long sys_munlockall2
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..67c1cdf 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -62,6 +62,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index c03088f..101b884 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -377,16 +377,19 @@
#define __NR_memfd_create (__NR_Linux + 354)
#define __NR_bpf (__NR_Linux + 355)
#define __NR_execveat (__NR_Linux + 356)
+#define __NR_mlock2 (__NR_Linux + 357)
+#define __NR_munlock2 (__NR_Linux + 358)
+#define __NR_munlockall2 (__NR_Linux + 359)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 356
+#define __NR_Linux_syscalls 359

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 356
+#define __NR_O32_Linux_syscalls 359

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -711,16 +714,19 @@
#define __NR_memfd_create (__NR_Linux + 314)
#define __NR_bpf (__NR_Linux + 315)
#define __NR_execveat (__NR_Linux + 316)
+#define __NR_mlock2 (__NR_Linux + 317)
+#define __NR_munlock2 (__NR_Linux + 318)
+#define __NR_munlockall2 (__NR_Linux + 319)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 316
+#define __NR_Linux_syscalls 319

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 316
+#define __NR_64_Linux_syscalls 319

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -1049,15 +1055,18 @@
#define __NR_memfd_create (__NR_Linux + 318)
#define __NR_bpf (__NR_Linux + 319)
#define __NR_execveat (__NR_Linux + 320)
+#define __NR_mlock2 (__NR_Linux + 321)
+#define __NR_munlock2 (__NR_Linux + 322)
+#define __NR_munlockall2 (__NR_Linux + 323)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 320
+#define __NR_Linux_syscalls 323

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 320
+#define __NR_N32_Linux_syscalls 323

#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 4cc1350..c409d53 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -599,3 +599,6 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index ad4d4463..0aa2742 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -436,4 +436,7 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 5315 */
PTR sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 446cc65..eb21955 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -429,4 +429,7 @@ EXPORT(sysn32_call_table)
PTR sys_memfd_create
PTR sys_bpf
PTR compat_sys_execveat /* 6320 */
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index f543ff4..f45049c 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -584,4 +584,7 @@ EXPORT(sys32_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR compat_sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sys32_call_table,.-sys32_call_table
diff --git a/arch/mn10300/kernel/entry.S b/arch/mn10300/kernel/entry.S
index 177d61d..d34adf5 100644
--- a/arch/mn10300/kernel/entry.S
+++ b/arch/mn10300/kernel/entry.S
@@ -767,6 +767,9 @@ ENTRY(sys_call_table)
.long sys_perf_event_open
.long sys_recvmmsg
.long sys_setns
+ .long sys_mlock2 /* 340 */
+ .long sys_munlock2
+ .long sys_munlockall2


nr_syscalls=(.-sys_call_table)/4
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..daab994 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -32,6 +32,8 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/parisc/include/uapi/asm/unistd.h b/arch/parisc/include/uapi/asm/unistd.h
index 2e639d7..455c8a3 100644
--- a/arch/parisc/include/uapi/asm/unistd.h
+++ b/arch/parisc/include/uapi/asm/unistd.h
@@ -358,8 +358,11 @@
#define __NR_memfd_create (__NR_Linux + 340)
#define __NR_bpf (__NR_Linux + 341)
#define __NR_execveat (__NR_Linux + 342)
+#define __NR_mlock2 (__NR_Linux + 343)
+#define __NR_munlock2 (__NR_Linux + 344)
+#define __NR_munlockall2 (__NR_Linux + 345)

-#define __NR_Linux_syscalls (__NR_execveat + 1)
+#define __NR_Linux_syscalls (__NR_munlockall2 + 1)


#define __IGNORE_select /* newselect */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..189e85f 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -23,6 +23,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */

+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index e4aa173..c9901e7 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -386,5 +386,8 @@
#define __NR_bpf 361
#define __NR_execveat 362
#define __NR_switch_endian 363
+#define __NR_mlock2 364
+#define __NR_munlock2 365
+#define __NR_munlockall2 366

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/s390/include/uapi/asm/unistd.h b/arch/s390/include/uapi/asm/unistd.h
index 67878af..d1c5b1f 100644
--- a/arch/s390/include/uapi/asm/unistd.h
+++ b/arch/s390/include/uapi/asm/unistd.h
@@ -290,7 +290,10 @@
#define __NR_s390_pci_mmio_write 352
#define __NR_s390_pci_mmio_read 353
#define __NR_execveat 354
-#define NR_syscalls 355
+#define __NR_mlock2 355
+#define __NR_munlock2 356
+#define __NR_munlockall2 357
+#define NR_syscalls 358

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.c b/arch/s390/kernel/compat_wrapper.c
index f8498dd..58339e2 100644
--- a/arch/s390/kernel/compat_wrapper.c
+++ b/arch/s390/kernel/compat_wrapper.c
@@ -220,3 +220,6 @@ COMPAT_SYSCALL_WRAP2(memfd_create, const char __user *, uname, unsigned int, fla
COMPAT_SYSCALL_WRAP3(bpf, int, cmd, union bpf_attr *, attr, unsigned int, size);
COMPAT_SYSCALL_WRAP3(s390_pci_mmio_write, const unsigned long, mmio_addr, const void __user *, user_buffer, const size_t, length);
COMPAT_SYSCALL_WRAP3(s390_pci_mmio_read, const unsigned long, mmio_addr, void __user *, user_buffer, const size_t, length);
+COMPAT_SYSCALL_WRAP3(mlock2, unsigned long, start, size_t, len, int, flags);
+COMPAT_SYSCALL_WRAP3(munlock2, unsigned long, start, size_t, len, int, flags);
+COMPAT_SYSCALL_WRAP1(munlockall2, int, flags);
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 1acad02..f6d81d6 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -363,3 +363,6 @@ SYSCALL(sys_bpf,compat_sys_bpf)
SYSCALL(sys_s390_pci_mmio_write,compat_sys_s390_pci_mmio_write)
SYSCALL(sys_s390_pci_mmio_read,compat_sys_s390_pci_mmio_read)
SYSCALL(sys_execveat,compat_sys_execveat)
+SYSCALL(sys_mlock2,compat_sys_mlock2) /* 355 */
+SYSCALL(sys_munlock2,compat_sys_munlock2)
+SYSCALL(sys_munlockall2,compat_sys_munlockall2)
diff --git a/arch/sh/kernel/syscalls_32.S b/arch/sh/kernel/syscalls_32.S
index 734234b..6d07867 100644
--- a/arch/sh/kernel/syscalls_32.S
+++ b/arch/sh/kernel/syscalls_32.S
@@ -386,3 +386,6 @@ ENTRY(sys_call_table)
.long sys_process_vm_writev
.long sys_kcmp
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2 /* 370 */
+ .long sys_munlockall2
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..13d51be 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -18,6 +18,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */

+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/sparc/include/uapi/asm/unistd.h b/arch/sparc/include/uapi/asm/unistd.h
index 6f35f4d..c25bbb1 100644
--- a/arch/sparc/include/uapi/asm/unistd.h
+++ b/arch/sparc/include/uapi/asm/unistd.h
@@ -416,8 +416,11 @@
#define __NR_memfd_create 348
#define __NR_bpf 349
#define __NR_execveat 350
+#define __NR_mlock2 351
+#define __NR_munlock2 352
+#define __NR_munlockall2 353

-#define NR_syscalls 351
+#define NR_syscalls 354

/* Bitmask values returned from kern_features system call. */
#define KERN_FEATURE_MIXED_MODE_STACK 0x00000001
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index e31a905..72b68d4 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -87,4 +87,4 @@ sys_call_table:
/*335*/ .long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .long sys_execveat
+/*350*/ .long sys_execveat, sys_mlock2, sys_munlock2, sys_munlockall2
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index d72f76a..a96bfea 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -88,7 +88,7 @@ sys_call_table32:
.word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
.word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .word sys32_execveat
+/*350*/ .word sys32_execveat, sys_mlock2, sys_munlock2, sys_munlockall2

#endif /* CONFIG_COMPAT */

@@ -168,4 +168,4 @@ sys_call_table:
.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
.word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .word sys64_execveat
+/*350*/ .word sys64_execveat, sys_mlock2, sys_munlock2, sys_munlockall2
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..f69ce48 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -37,5 +37,10 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+

#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ef8187f..13ce950 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -365,3 +365,6 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 mlock2 sys_mlock2
+360 i386 munlock2 sys_munlock2
+361 i386 munlockall2 sys_munlockall2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9ef32d5..13b3cb1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -329,6 +329,9 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common mlock2 sys_mlock2
+324 common munlock2 sys_munlock2
+325 common munlockall2 sys_munlockall2

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..11f354f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -75,6 +75,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/xtensa/include/uapi/asm/unistd.h b/arch/xtensa/include/uapi/asm/unistd.h
index b95c305..fbd0876 100644
--- a/arch/xtensa/include/uapi/asm/unistd.h
+++ b/arch/xtensa/include/uapi/asm/unistd.h
@@ -753,8 +753,14 @@ __SYSCALL(339, sys_memfd_create, 2)
__SYSCALL(340, sys_bpf, 3)
#define __NR_execveat 341
__SYSCALL(341, sys_execveat, 5)
-
-#define __NR_syscall_count 342
+#define __NR_mlock2 342
+__SYSCALL(342, sys_mlock2, 3)
+#define __NR_munlock2 343
+__SYSCALL(343, sys_munlock2, 3)
+#define __NR_munlockall2 344
+__SYSCALL(344, sys_munlockall2, 1)
+
+#define __NR_syscall_count 345

/*
* sysxtensa syscall handler
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b45c45b..aecab5d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);

+asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
+asmlinkage long sys_munlock2(unsigned long start, size_t len, int flags);
+asmlinkage long sys_munlockall2(int flags);
+
#endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..242436b 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -18,4 +18,6 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..e759fa2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,15 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_mlock2 282
+__SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_munlock2 283
+__SYSCALL(__NR_munlock2, sys_munlock2)
+#define __NR_munlockall2 284
+__SYSCALL(__NR_munlockall2, sys_munlockall2)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 285

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..63529b7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,9 @@ cond_syscall(sys_mlock);
cond_syscall(sys_munlock);
cond_syscall(sys_mlockall);
cond_syscall(sys_munlockall);
+cond_syscall(sys_mlock2);
+cond_syscall(sys_munlock2);
+cond_syscall(sys_munlockall2);
cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8e52c23..d6e61d6 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -648,6 +648,14 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return do_mlock(start, len, VM_LOCKED);
}

+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+ if (!flags || flags & ~MLOCK_LOCKED)
+ return -EINVAL;
+
+ return do_mlock(start, len, VM_LOCKED);
+}
+
static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
{
int ret;
@@ -667,6 +675,13 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
return do_munlock(start, len, VM_LOCKED);
}

+SYSCALL_DEFINE3(munlock2, unsigned long, start, size_t, len, int, flags)
+{
+ if (!flags || flags & ~MLOCK_LOCKED)
+ return -EINVAL;
+ return do_munlock(start, len, VM_LOCKED);
+}
+
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
@@ -756,6 +771,19 @@ SYSCALL_DEFINE0(munlockall)
return ret;
}

+SYSCALL_DEFINE1(munlockall2, int, flags)
+{
+ int ret = -EINVAL;
+
+ if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE))
+ return ret;
+
+ down_write(&current->mm->mmap_sem);
+ ret = do_munlockall(flags);
+ up_write(&current->mm->mmap_sem);
+ return ret;
+}
+
/*
* Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
* shm segments) get accounted against the user_struct instead.
--
1.9.1
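
The flag checks in the new system calls above can be exercised in isolation. The following is a minimal user-space sketch, not kernel code: the helper names mlock2_check_flags() and munlockall2_check_flags() are hypothetical, and the MCL_* values are the asm-generic ones (several architectures in the diff use different values). It only mirrors the `!flags || flags & ~MASK` tests shown in mm/mlock.c at this point in the series, where MLOCK_LOCKED is the only mlock2() flag.

```c
#include <assert.h>
#include <errno.h>

/* Value taken from the patch above. */
#define MLOCK_LOCKED 0x01

/* asm-generic values; alpha, powerpc and sparc use different ones. */
#define MCL_CURRENT 1
#define MCL_FUTURE  2

/*
 * Hypothetical helper modelling the check in the proposed mlock2() and
 * munlock2(): an empty flag word or any unknown bit is rejected.
 * Returns 0 if the flags would be accepted, -EINVAL otherwise.
 */
static int mlock2_check_flags(int flags)
{
    if (!flags || (flags & ~MLOCK_LOCKED))
        return -EINVAL;
    return 0;
}

/* Same model for munlockall2(), which takes MCL_* bits instead. */
static int munlockall2_check_flags(int flags)
{
    if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
        return -EINVAL;
    return 0;
}
```

Note that, unlike plain mlock(), a flags value of 0 is an error here, so callers always have to state which kind of lock they mean.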

2015-07-21 19:59:51

by Eric B Munson

Subject: [PATCH V4 3/6] mm: gup: Add mm_lock_present()

The upcoming mlock2(MLOCK_ONFAULT) implementation will need a way to
request that all present pages in a range are locked without faulting in
pages that are not present. This logic is very close to what
__mm_populate() does, minus the faulting, so this patch pulls out the
pieces that can be shared and adds mm_lock_present() to gup.c. The
following patch will call it from do_mlock() when MLOCK_ONFAULT is
specified.
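
The piece shared with populate_vma_page_range() is the selection of GUP flags from the VMA flags (get_gup_flags() in the diff that follows). Below is a user-space sketch of that selection logic. The function name model_gup_flags() is hypothetical and the bit values are illustrative placeholders for this sketch only, not a statement of the kernel's definitions.

```c
#include <assert.h>

/* Illustrative bit values for this sketch only. */
enum { VM_READ = 0x1, VM_WRITE = 0x2, VM_EXEC = 0x4, VM_SHARED = 0x8 };
enum { FOLL_WRITE = 0x01, FOLL_TOUCH = 0x02, FOLL_FORCE = 0x10,
       FOLL_POPULATE = 0x40 };

/* Hypothetical model of the flag selection done by get_gup_flags(). */
static int model_gup_flags(unsigned long vm_flags)
{
    int gup_flags = FOLL_TOUCH | FOLL_POPULATE;

    /* Private writable mapping: use a write access so COW is broken. */
    if ((vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
        gup_flags |= FOLL_WRITE;

    /* Any permission other than PROT_NONE: force, so mlock can succeed. */
    if (vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
        gup_flags |= FOLL_FORCE;

    return gup_flags;
}
```

In this model a shared writable mapping does not get FOLL_WRITE, which matches the comment in the patch about not dirtying shared pages for nothing.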

Signed-off-by: Eric B Munson <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/gup.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 157 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6297f6b..233ef17 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -818,6 +818,30 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
}
EXPORT_SYMBOL(get_user_pages);

+/*
+ * Helper function used by both populate_vma_page_range() and pin_user_pages
+ */
+static int get_gup_flags(vm_flags_t vm_flags)
+{
+ int gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+ /*
+ * We want to touch writable mappings with a write fault in order
+ * to break COW, except for shared mappings because these don't COW
+ * and we would not want to dirty them for nothing.
+ */
+ if ((vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
+ gup_flags |= FOLL_WRITE;
+
+ /*
+ * We want mlock to succeed for regions that have any permissions
+ * other than PROT_NONE.
+ */
+ if (vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
+ gup_flags |= FOLL_FORCE;
+
+ return gup_flags;
+}
+
/**
* populate_vma_page_range() - populate a range of pages in the vma.
* @vma: target vma
@@ -850,21 +874,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_VMA(end > vma->vm_end, vma);
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

- gup_flags = FOLL_TOUCH | FOLL_POPULATE;
- /*
- * We want to touch writable mappings with a write fault in order
- * to break COW, except for shared mappings because these don't COW
- * and we would not want to dirty them for nothing.
- */
- if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
- gup_flags |= FOLL_WRITE;
-
- /*
- * We want mlock to succeed for regions that have any permissions
- * other than PROT_NONE.
- */
- if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
- gup_flags |= FOLL_FORCE;
+ gup_flags = get_gup_flags(vma->vm_flags);

/*
* We made sure addr is within a VMA, so the following will
@@ -874,6 +884,138 @@ long populate_vma_page_range(struct vm_area_struct *vma,
NULL, NULL, nonblocking);
}

+static long pin_user_pages(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, int *nonblocking)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long nr_pages = (end - start) / PAGE_SIZE;
+ int gup_flags;
+ long i = 0;
+ unsigned int page_mask;
+
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(end & ~PAGE_MASK);
+ VM_BUG_ON_VMA(start < vma->vm_start, vma);
+ VM_BUG_ON_VMA(end > vma->vm_end, vma);
+ VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
+
+ if (!nr_pages)
+ return 0;
+
+ gup_flags = get_gup_flags(vma->vm_flags);
+
+ /*
+ * If FOLL_FORCE is set then do not force a full fault as the hinting
+ * fault information is unrelated to the reference behaviour of a task
+ * using the address space
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
+ vma = NULL;
+
+ do {
+ struct page *page;
+ unsigned int foll_flags = gup_flags;
+ unsigned int page_increm;
+
+ /* first iteration or cross vma bound */
+ if (!vma || start >= vma->vm_end) {
+ vma = find_extend_vma(mm, start);
+ if (!vma && in_gate_area(mm, start)) {
+ int ret;
+ ret = get_gate_page(mm, start & PAGE_MASK,
+ gup_flags, &vma, NULL);
+ if (ret)
+ return i ? : ret;
+ page_mask = 0;
+ goto next_page;
+ }
+
+ if (!vma)
+ return i ? : -EFAULT;
+ if (is_vm_hugetlb_page(vma)) {
+ i = follow_hugetlb_page(mm, vma, NULL, NULL,
+ &start, &nr_pages, i,
+ gup_flags);
+ continue;
+ }
+ }
+
+ /*
+ * If we have a pending SIGKILL, don't keep pinning pages
+ */
+ if (unlikely(fatal_signal_pending(current)))
+ return i ? i : -ERESTARTSYS;
+ cond_resched();
+ page = follow_page_mask(vma, start, foll_flags, &page_mask);
+ if (!page)
+ goto next_page;
+ if (IS_ERR(page))
+ return i ? i : PTR_ERR(page);
+next_page:
+ page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+ if (page_increm > nr_pages)
+ page_increm = nr_pages;
+ i += page_increm;
+ start += page_increm * PAGE_SIZE;
+ nr_pages -= page_increm;
+ } while (nr_pages);
+ return i;
+}
+
+/*
+ * mm_lock_present - lock present pages within a range of address space.
+ *
+ * This is used to implement mlock2(MLOCK_ONFAULT). VMAs must be already
+ * marked with the desired vm_flags, and mmap_sem must not be held.
+ */
+int mm_lock_present(unsigned long start, unsigned long len)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end, nstart, nend;
+ struct vm_area_struct *vma = NULL;
+ int locked = 0;
+ long ret = 0;
+
+ VM_BUG_ON(start & ~PAGE_MASK);
+ VM_BUG_ON(len != PAGE_ALIGN(len));
+ end = start + len;
+
+ for (nstart = start; nstart < end; nstart = nend) {
+ /*
+ * We want to lock present pages in the [nstart; end) address range.
+ * Find first corresponding VMA.
+ */
+ if (!locked) {
+ locked = 1;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, nstart);
+ } else if (nstart >= vma->vm_end)
+ vma = vma->vm_next;
+ if (!vma || vma->vm_start >= end)
+ break;
+ /*
+ * Set [nstart; nend) to intersection of desired address
+ * range with the first VMA. Also, skip undesirable VMA types.
+ */
+ nend = min(end, vma->vm_end);
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ continue;
+ if (nstart < vma->vm_start)
+ nstart = vma->vm_start;
+
+ ret = pin_user_pages(vma, nstart, nend, &locked);
+ if (ret < 0)
+ break;
+ nend = nstart + ret * PAGE_SIZE;
+ ret = 0;
+ }
+ if (locked)
+ up_read(&mm->mmap_sem);
+ return ret; /* 0 or negative error code */
+}
+
/*
* __mm_populate - populate and/or mlock pages within a range of address space.
*
--
1.9.1
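
The step size computed at the next_page label in pin_user_pages() above is easy to misread, so here is the arithmetic worked through in user space. This is a sketch under assumptions: page_increment() is a hypothetical name, 4 KiB pages are assumed, and a page_mask of 511 stands for a huge page made of 512 base pages.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB base pages for this sketch */

/*
 * Hypothetical helper reproducing the step computation:
 *   page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
 * clamped to the number of pages still to be walked.
 */
static unsigned long page_increment(unsigned long start,
                                    unsigned int page_mask,
                                    unsigned long nr_pages)
{
    unsigned long page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);

    if (page_increm > nr_pages)
        page_increm = nr_pages;
    return page_increm;
}
```

With page_mask 0 the walk advances one base page at a time. With page_mask 511 and a start address three base pages into a huge page, the result is 1 + (~3 & 511) = 509, which lands exactly on the next huge page boundary.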

2015-07-21 20:00:39

by Eric B Munson

Subject: [PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it

The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.

For the example of a large file, this is the usage pattern for a large
statistical language model (it probably applies to other statistical or
graphical models as well). The security example covers any application
transacting in data that cannot be swapped out (credit card data, medical
records, etc.).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. This can be done one area at a time via
mlock2(MLOCK_ONFAULT) or process-wide via mlockall(MCL_ONFAULT). These
calls can be undone via munlock2(MLOCK_ONFAULT) or
munlockall2(MCL_ONFAULT).
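
To make the difference between the two mlock2() flags concrete, here is a toy user-space model. It does not call into the kernel; struct toy_mapping and the toy_*() functions are hypothetical names, the mapping size is arbitrary, and the model accepts exactly one flag per call (whether the real call accepts combinations is not assumed here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MLOCK_LOCKED  0x01   /* values from the patch */
#define MLOCK_ONFAULT 0x02
#define TOY_PAGES 8          /* hypothetical mapping size, in pages */

struct toy_mapping {
    bool present[TOY_PAGES];
    bool locked[TOY_PAGES];
    bool lock_new_faults;    /* stands in for the VMA being marked */
};

/* Model of a first touch: the page is faulted in, and locked if marked. */
static void toy_touch(struct toy_mapping *m, size_t page)
{
    assert(page < TOY_PAGES);
    if (!m->present[page]) {
        m->present[page] = true;
        if (m->lock_new_faults)
            m->locked[page] = true;
    }
}

/* Model of mlock2(): LOCKED prefaults everything, ONFAULT does not. */
static void toy_mlock2(struct toy_mapping *m, int flags)
{
    assert(flags == MLOCK_LOCKED || flags == MLOCK_ONFAULT);
    for (size_t i = 0; i < TOY_PAGES; i++) {
        if (flags == MLOCK_LOCKED)
            m->present[i] = true;   /* prefault the whole range */
        if (m->present[i])
            m->locked[i] = true;    /* lock whatever is present now */
    }
    m->lock_new_faults = true;
}

static size_t toy_count_locked(const struct toy_mapping *m)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_PAGES; i++)
        n += m->locked[i];
    return n;
}
```

In the model, a mapping with one page already touched ends up with one locked page after toy_mlock2(MLOCK_ONFAULT), and gains one more each time a new page is touched, while MLOCK_LOCKED locks all pages immediately.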

Applying the VM_LOCKONFAULT flag to a mapping with pages that are
already present required the addition of a function in gup.c to pin all
pages which are present in an address range. It borrows heavily from
__mm_populate().

To keep accounting checks out of the page fault path, users are billed
for the entire locked mapping, as if MLOCK_LOCKED had been used.
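
A consequence of billing up front is that the limit check depends only on the length of the range, never on how many pages are resident. The sketch below illustrates that shape of check in user space. toy_lock_allowed() is a hypothetical helper with 4 KiB pages assumed, and it deliberately ignores privilege overrides such as CAP_IPC_LOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12   /* assumed 4 KiB pages for this sketch */

/*
 * Hypothetical model of an up-front lock limit check: the whole range is
 * charged against the limit at lock time. Neither the lock flag nor the
 * number of resident pages is an input.
 */
static bool toy_lock_allowed(size_t already_locked_pages, size_t len_bytes,
                             size_t limit_bytes)
{
    size_t wanted = already_locked_pages + (len_bytes >> TOY_PAGE_SHIFT);

    return wanted <= (limit_bytes >> TOY_PAGE_SHIFT);
}
```

Under this model a 64 KiB lock-on-fault range consumes 16 pages of a 64 KiB limit immediately, even if the caller never touches most of it.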

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from V3:
Do extensive search for VM_LOCKED and ensure that VM_LOCKONFAULT is also handled
where appropriate

arch/alpha/include/uapi/asm/mman.h | 2 +
arch/mips/include/uapi/asm/mman.h | 2 +
arch/parisc/include/uapi/asm/mman.h | 2 +
arch/powerpc/include/uapi/asm/mman.h | 2 +
arch/sparc/include/uapi/asm/mman.h | 2 +
arch/tile/include/uapi/asm/mman.h | 3 ++
arch/xtensa/include/uapi/asm/mman.h | 2 +
drivers/gpu/drm/drm_vm.c | 8 ++-
fs/proc/task_mmu.c | 3 +-
include/linux/mm.h | 2 +
include/uapi/asm-generic/mman.h | 2 +
kernel/events/uprobes.c | 2 +-
kernel/fork.c | 2 +-
mm/debug.c | 1 +
mm/gup.c | 3 +-
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 4 +-
mm/internal.h | 5 +-
mm/ksm.c | 2 +-
mm/madvise.c | 4 +-
mm/memory.c | 5 +-
mm/mlock.c | 98 +++++++++++++++++++++++++-----------
mm/mmap.c | 28 +++++++----
mm/mremap.c | 6 +--
mm/msync.c | 2 +-
mm/rmap.c | 12 ++---
mm/shmem.c | 2 +-
mm/swap.c | 3 +-
mm/vmscan.c | 2 +-
29 files changed, 145 insertions(+), 69 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index ec72436..77ae8db 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -37,8 +37,10 @@

#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MCL_ONFAULT 32768 /* lock all pages that are faulted in */

#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 67c1cdf..71ed81d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -61,11 +61,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index daab994..c0871ce 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -31,8 +31,10 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 189e85f..f93f7eb 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -22,8 +22,10 @@

#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */

#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 13d51be..8cd2ebc 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -17,8 +17,10 @@

#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */

#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index f69ce48..acdd013 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -36,11 +36,14 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+

/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */


#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 11f354f..5725a15 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -74,11 +74,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
index aab49ee..dfbcfc2 100644
--- a/drivers/gpu/drm/drm_vm.c
+++ b/drivers/gpu/drm/drm_vm.c
@@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data)
(void *)(unsigned long)virt_to_phys(high_memory));

list_for_each_entry(pt, &dev->vmalist, head) {
+ char lock_flag = '-';
+
vma = pt->vma;
if (!vma)
continue;
+ if (vma->vm_flags & VM_LOCKED)
+ lock_flag = 'l';
+ else if (vma->vm_flags & VM_LOCKONFAULT)
+ lock_flag = 'f';
seq_printf(m,
"\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000",
pt->pid,
@@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data)
vma->vm_flags & VM_WRITE ? 'w' : '-',
vma->vm_flags & VM_EXEC ? 'x' : '-',
vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
- vma->vm_flags & VM_LOCKED ? 'l' : '-',
+ lock_flag,
vma->vm_flags & VM_IO ? 'i' : '-',
vma->vm_pgoff);

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091..2c435a7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_X86_INTEL_MPX
[ilog2(VM_MPX)] = "mp",
#endif
+ [ilog2(VM_LOCKONFAULT)] = "lf",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
@@ -654,7 +655,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.swap >> 10,
vma_kernel_pagesize(vma) >> 10,
vma_mmu_pagesize(vma) >> 10,
- (vma->vm_flags & VM_LOCKED) ?
+ (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

show_smap_vma_flags(m, vma);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..e78544f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -127,6 +127,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */

+#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */

@@ -1865,6 +1866,7 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
/* Ignore errors */
(void) __mm_populate(addr, len, 1);
}
+extern int mm_lock_present(unsigned long start, unsigned long len);
#else
static inline void mm_populate(unsigned long addr, unsigned long len) {}
#endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 242436b..555aab0 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -17,7 +17,9 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */

#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f2..882c9f6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -201,7 +201,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
try_to_free_swap(page);
pte_unmap_unlock(ptep, ptl);

- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
munlock_vma_page(page);
put_page(page);

diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d..a949228 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -454,7 +454,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
tmp->vm_mm = mm;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
- tmp->vm_flags &= ~VM_LOCKED;
+ tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
tmp->vm_next = tmp->vm_prev = NULL;
file = tmp->vm_file;
if (file) {
diff --git a/mm/debug.c b/mm/debug.c
index 76089dd..25176bb 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -121,6 +121,7 @@ static const struct trace_print_flags vmaflags_names[] = {
{VM_GROWSDOWN, "growsdown" },
{VM_PFNMAP, "pfnmap" },
{VM_DENYWRITE, "denywrite" },
+ {VM_LOCKONFAULT, "lockonfault" },
{VM_LOCKED, "locked" },
{VM_IO, "io" },
{VM_SEQ_READ, "seqread" },
diff --git a/mm/gup.c b/mm/gup.c
index 233ef17..097a22a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -92,7 +92,8 @@ retry:
*/
mark_page_accessed(page);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_POPULATE) &&
+ (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))) {
/*
* The preliminary mapping check is mainly to avoid the
* pointless overhead of lock_page on the ZERO_PAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c107094..7985e35 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1238,7 +1238,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
pmd, _pmd, 1))
update_mmu_cache_pmd(vma, addr, pmd);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_POPULATE) &&
+ (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))) {
if (page->mapping && trylock_page(page)) {
lru_add_drain();
if (page->mapping)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..82caa48 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3764,8 +3764,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;

/* Allow segments to share if only one is marked locked */
- unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
- unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;
+ unsigned long vm_flags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
+ unsigned long svm_flags = svma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);

/*
* match the virtual addresses, permission and the alignment of the
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1..53e140e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -246,10 +246,11 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking);
extern void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end, vm_flags_t to_drop);
static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
{
- munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+ munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end,
+ VM_LOCKED | VM_LOCKONFAULT);
}

/*
diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101e..5d91b7d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1058,7 +1058,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
err = replace_page(vma, page, kpage, orig_pte);
}

- if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
+ if ((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) && kpage && !err) {
munlock_vma_page(page);
if (!PageMlocked(kpage)) {
unlock_page(page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 64bb8a2..c9d9296 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -279,7 +279,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
*prev = vma;
- if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ if (vma->vm_flags & (VM_LOCKED|VM_LOCKONFAULT|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;

zap_page_range(vma, start, end - start, NULL);
@@ -300,7 +300,7 @@ static long madvise_remove(struct vm_area_struct *vma,

*prev = NULL; /* tell sys_madvise we drop mmap_sem */

- if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT | VM_HUGETLB))
return -EINVAL;

f = vma->vm_file;
diff --git a/mm/memory.c b/mm/memory.c
index 388dcf9..2b19e0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2165,7 +2165,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
*/
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+ if (page_copied && (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))) {
lock_page(old_page); /* LRU manipulation */
munlock_vma_page(old_page);
unlock_page(old_page);
@@ -2577,7 +2577,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

swap_free(entry);
- if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+ if (vm_swap_full() || (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) ||
+ PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache) {
diff --git a/mm/mlock.c b/mm/mlock.c
index d6e61d6..8b45be1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -406,23 +406,22 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
* @vma - vma containing range to be munlock()ed.
* @start - start address in @vma of the range
* @end - end of range in @vma.
+ * @to_drop - the VMA flags we want to drop from the specified range
*
- * For mremap(), munmap() and exit().
+ * For mremap(), munmap(), munlock(), and exit().
*
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared. Callers must be prepared to
+ * Returns with specified flags cleared. Callers must be prepared to
* deal with this.
*
- * We don't save and restore VM_LOCKED here because pages are
+ * We don't save and restore specified flags here because pages are
* still on lru. In unmap path, pages might be scanned by reclaim
* and re-mlocked by try_to_{munlock|unmap} before we unmap and
* free them. This will result in freeing mlocked pages.
*/
-void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, vm_flags_t to_drop)
{
- vma->vm_flags &= ~VM_LOCKED;
+ vma->vm_flags &= ~to_drop;

while (start < end) {
struct page *page = NULL;
@@ -502,11 +501,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff_t pgoff;
int nr_pages;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);
+ int lock = !!(newflags & (VM_LOCKED | VM_LOCKONFAULT));

if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
- goto out; /* don't set VM_LOCKED, don't count */
+ /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
+ goto out;

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
@@ -546,7 +546,11 @@ success:
if (lock)
vma->vm_flags = newflags;
else
- munlock_vma_pages_range(vma, start, end);
+ /*
+ * We need to tell which VM_LOCK* flag(s) we are clearing here
+ */
+ munlock_vma_pages_range(vma, start, end,
+ (vma->vm_flags & ~(newflags)));

out:
*prev = vma;
@@ -581,10 +585,12 @@ static int apply_vma_flags(unsigned long start, size_t len,
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */

newflags = vma->vm_flags;
- if (add_flags)
+ if (add_flags) {
+ newflags &= ~(VM_LOCKED | VM_LOCKONFAULT);
newflags |= flags;
- else
+ } else {
newflags &= ~flags;
+ }

tmp = vma->vm_end;
if (tmp > end)
@@ -637,9 +643,15 @@ static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
if (error)
return error;

- error = __mm_populate(start, len, 0);
- if (error)
- return __mlock_posix_error_return(error);
+ if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
+ if (flags & VM_LOCKED)
+ error = __mm_populate(start, len, 0);
+ else
+ error = mm_lock_present(start, len);
+ if (error)
+ return __mlock_posix_error_return(error);
+ }
+
return 0;
}

@@ -650,10 +662,14 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)

SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
- if (!flags || flags & ~MLOCK_LOCKED)
+ if (!flags || (flags & ~(MLOCK_LOCKED | MLOCK_ONFAULT)) ||
+ flags == (MLOCK_LOCKED | MLOCK_ONFAULT))
return -EINVAL;

- return do_mlock(start, len, VM_LOCKED);
+ if (flags & MLOCK_LOCKED)
+ return do_mlock(start, len, VM_LOCKED);
+
+ return do_mlock(start, len, VM_LOCKONFAULT);
}

static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
@@ -672,31 +688,46 @@ static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)

SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
- return do_munlock(start, len, VM_LOCKED);
+ return do_munlock(start, len, VM_LOCKED | VM_LOCKONFAULT);
}

SYSCALL_DEFINE3(munlock2, unsigned long, start, size_t, len, int, flags)
{
- if (!flags || flags & ~MLOCK_LOCKED)
+ vm_flags_t to_clear = 0;
+
+ if (!flags || flags & ~(MLOCK_LOCKED | MLOCK_ONFAULT))
return -EINVAL;
- return do_munlock(start, len, VM_LOCKED);
+
+ if (flags & MLOCK_LOCKED)
+ to_clear |= VM_LOCKED;
+ if (flags & MLOCK_ONFAULT)
+ to_clear |= VM_LOCKONFAULT;
+
+ return do_munlock(start, len, to_clear);
}

static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ vm_flags_t to_add;

if (flags & MCL_FUTURE)
current->mm->def_flags |= VM_LOCKED;
if (flags == MCL_FUTURE)
goto out;

+ if (flags & MCL_ONFAULT) {
+ current->mm->def_flags |= VM_LOCKONFAULT;
+ to_add = VM_LOCKONFAULT;
+ } else {
+ to_add = VM_LOCKED;
+ }
+
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;

- newflags = vma->vm_flags & ~VM_LOCKED;
- if (flags & MCL_CURRENT)
- newflags |= VM_LOCKED;
+ newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
+ newflags |= to_add;

/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
@@ -711,7 +742,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
unsigned long lock_limit;
int ret = -EINVAL;

- if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
+ if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
+ (flags & (MCL_FUTURE | MCL_ONFAULT)) == (MCL_FUTURE | MCL_ONFAULT))
goto out;

ret = -EPERM;
@@ -740,18 +772,24 @@ out:
static int do_munlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ vm_flags_t to_clear = 0;

if (flags & MCL_FUTURE)
current->mm->def_flags &= ~VM_LOCKED;
+ if (flags & MCL_ONFAULT)
+ current->mm->def_flags &= ~VM_LOCKONFAULT;
if (flags == MCL_FUTURE)
goto out;

+ if (flags & MCL_CURRENT)
+ to_clear |= VM_LOCKED;
+ if (flags & MCL_ONFAULT)
+ to_clear |= VM_LOCKONFAULT;
+
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;

- newflags = vma->vm_flags;
- if (flags & MCL_CURRENT)
- newflags &= ~VM_LOCKED;
+ newflags = vma->vm_flags & ~to_clear;

/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
@@ -766,7 +804,7 @@ SYSCALL_DEFINE0(munlockall)
int ret;

down_write(&current->mm->mmap_sem);
- ret = do_munlockall(MCL_CURRENT | MCL_FUTURE);
+ ret = do_munlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
up_write(&current->mm->mmap_sem);
return ret;
}
@@ -775,7 +813,7 @@ SYSCALL_DEFINE1(munlockall2, int, flags)
{
int ret = -EINVAL;

- if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE))
+ if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT))
return ret;

down_write(&current->mm->mmap_sem);
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ad..de89be4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1232,8 +1232,8 @@ static inline int mlock_future_check(struct mm_struct *mm,
{
unsigned long locked, lock_limit;

- /* mlock MCL_FUTURE? */
- if (flags & VM_LOCKED) {
+ /* mlock MCL_FUTURE or MCL_ONFAULT? */
+ if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
locked = len >> PAGE_SHIFT;
locked += mm->locked_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -1646,12 +1646,12 @@ out:
perf_event_mmap(vma);

vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
- if (vm_flags & VM_LOCKED) {
+ if (vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
- vma->vm_flags &= ~VM_LOCKED;
+ vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
}

if (file)
@@ -2104,7 +2104,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
return -ENOMEM;

/* mlock limit tests */
- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
unsigned long locked;
unsigned long limit;
locked = mm->locked_vm + grow;
@@ -2128,7 +2128,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
return -ENOMEM;

/* Ok, everything looks good - let it rip */
- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
return 0;
@@ -2583,7 +2583,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
if (mm->locked_vm) {
struct vm_area_struct *tmp = vma;
while (tmp && tmp->vm_start < end) {
- if (tmp->vm_flags & VM_LOCKED) {
+ if (tmp->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
mm->locked_vm -= vma_pages(tmp);
munlock_vma_pages_all(tmp);
}
@@ -2636,6 +2636,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
unsigned long populate = 0;
unsigned long ret = -EINVAL;
struct file *file;
+ vm_flags_t drop_lock_flag = 0;

pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
"See Documentation/vm/remap_file_pages.txt.\n",
@@ -2675,10 +2676,15 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
if (vma->vm_flags & VM_LOCKED) {
flags |= MAP_LOCKED;
- /* drop PG_Mlocked flag for over-mapped range */
- munlock_vma_pages_range(vma, start, start + size);
+ drop_lock_flag = VM_LOCKED;
+ } else if (vma->vm_flags & VM_LOCKONFAULT) {
+ drop_lock_flag = VM_LOCKONFAULT;
}

+ if (drop_lock_flag)
+ /* drop PG_Mlocked flag for over-mapped range */
+ munlock_vma_pages_range(vma, start, start + size, VM_LOCKED);
+
file = get_file(vma->vm_file);
ret = do_mmap_pgoff(vma->vm_file, start, size,
prot, flags, pgoff, &populate);
@@ -2781,7 +2787,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
out:
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
- if (flags & VM_LOCKED)
+ if (flags & (VM_LOCKED | VM_LOCKONFAULT))
mm->locked_vm += (len >> PAGE_SHIFT);
vma->vm_flags |= VM_SOFTDIRTY;
return addr;
@@ -2816,7 +2822,7 @@ void exit_mmap(struct mm_struct *mm)
if (mm->locked_vm) {
vma = mm->mmap;
while (vma) {
- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
munlock_vma_pages_all(vma);
vma = vma->vm_next;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index a7c93ec..44d4c44 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -335,7 +335,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
vma->vm_next->vm_flags |= VM_ACCOUNT;
}

- if (vm_flags & VM_LOCKED) {
+ if (vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
mm->locked_vm += new_len >> PAGE_SHIFT;
*locked = true;
}
@@ -371,7 +371,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
return ERR_PTR(-EINVAL);
}

- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
unsigned long locked, lock_limit;
locked = mm->locked_vm << PAGE_SHIFT;
lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -548,7 +548,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
}

vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
mm->locked_vm += pages;
locked = true;
new_addr = addr;
diff --git a/mm/msync.c b/mm/msync.c
index bb04d53..1183183 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -73,7 +73,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
}
/* Here vma->vm_start <= start < vma->vm_end. */
if ((flags & MS_INVALIDATE) &&
- (vma->vm_flags & VM_LOCKED)) {
+ (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))) {
error = -EBUSY;
goto out_unlock;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b687..3e91372 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -742,9 +742,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (!pmd)
return SWAP_AGAIN;

- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
spin_unlock(ptl);
- pra->vm_flags |= VM_LOCKED;
+ pra->vm_flags |= (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
return SWAP_FAIL; /* To break the loop */
}

@@ -763,9 +763,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (!pte)
return SWAP_AGAIN;

- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
pte_unmap_unlock(pte, ptl);
- pra->vm_flags |= VM_LOCKED;
+ pra->vm_flags |= (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
return SWAP_FAIL; /* To break the loop */
}

@@ -1205,7 +1205,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* skipped over this mm) then we should reactivate it.
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
goto out_mlock;

if (flags & TTU_MUNLOCK)
@@ -1315,7 +1315,7 @@ out_mlock:
* page is actually mlocked.
*/
if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) {
mlock_vma_page(page);
ret = SWAP_MLOCK;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 4caf8ed..9ddf2ca 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -754,7 +754,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
index = page->index;
inode = mapping->host;
info = SHMEM_I(inode);
- if (info->flags & VM_LOCKED)
+ if (info->flags & (VM_LOCKED | VM_LOCKONFAULT))
goto redirty;
if (!total_swap_pages)
goto redirty;
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f..3580a21 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -710,7 +710,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
{
VM_BUG_ON_PAGE(PageLRU(page), page);

- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+ if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
+ (vma->vm_flags & VM_SPECIAL)) {
SetPageActive(page);
lru_cache_add(page);
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e61445d..019d306 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -804,7 +804,7 @@ static enum page_references page_check_references(struct page *page,
* Mlock lost the isolation race with us. Let try_to_unmap()
* move the page to the unevictable list.
*/
- if (vm_flags & VM_LOCKED)
+ if (vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
return PAGEREF_RECLAIM;

if (referenced_ptes) {
--
1.9.1

2015-07-21 20:00:48

by Eric B Munson

Subject: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used, locking the whole range up front incurs a high, unnecessary
penalty.

Now that we have the new VMA flag for the locked but not present state,
expose it as an mmap option, MAP_LOCKONFAULT, which maps to
VM_LOCKONFAULT the same way MAP_LOCKED maps to VM_LOCKED.
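
The flag translation described above can be modelled in user space. The
following is only a sketch: the SKETCH_* names and both helper functions
are made up for illustration, the lock-on-fault constants are the
asm-generic values proposed in this series, and the bit-moving helper
imitates the idiom behind the kernel's _calc_vm_trans() rather than
reproducing it.

```c
#include <assert.h>

/* Hypothetical names; values are the asm-generic ones used by this series. */
#define SKETCH_MAP_LOCKED      0x02000UL /* existing mmap() flag */
#define SKETCH_MAP_LOCKONFAULT 0x80000UL /* proposed mmap() flag */
#define SKETCH_VM_LOCKONFAULT  0x01000UL /* proposed VMA flag */
#define SKETCH_VM_LOCKED       0x02000UL /* existing VMA flag */

/*
 * Move the single bit "from" in x to the position of the single bit "to".
 * Both masks are assumed to be non-zero powers of two, so the multiply or
 * divide acts as a shift by a constant amount.
 */
static unsigned long sketch_vm_trans(unsigned long x, unsigned long from,
				     unsigned long to)
{
	assert(from != 0 && to != 0);
	if (from <= to)
		return (x & from) * (to / from);
	return (x & from) / (from / to);
}

/* Translate the two locking-related mmap() flags into VMA flags. */
static unsigned long sketch_calc_vm_flag_bits(unsigned long map_flags)
{
	return sketch_vm_trans(map_flags, SKETCH_MAP_LOCKED,
			       SKETCH_VM_LOCKED) |
	       sketch_vm_trans(map_flags, SKETCH_MAP_LOCKONFAULT,
			       SKETCH_VM_LOCKONFAULT);
}
```

Passing SKETCH_MAP_LOCKONFAULT alone yields SKETCH_VM_LOCKONFAULT, and
unrelated mmap() flags contribute nothing to the result.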

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from V3:
Add missing MAP_LOCKONFAULT to tile

arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/uapi/asm/mman.h | 1 +
arch/sparc/include/uapi/asm/mman.h | 1 +
arch/tile/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/mman.h | 3 ++-
include/uapi/asm-generic/mman.h | 1 +
kernel/events/core.c | 2 ++
mm/mmap.c | 6 ++++--
11 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 77ae8db..3f80ca4 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 71ed81d..905c1ea 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */

/*
* Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index c0871ce..c4695f6 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */

#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index f93f7eb..40a3fda 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -31,5 +31,6 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */

#endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 8cd2ebc..3d74ab7 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -26,6 +26,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x8000 /* Lock pages after they are faulted in, do not prefault */


#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index acdd013..800e5c3 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_HUGETLB 0x4000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */


/*
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 5725a15..689e1f2 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..437264b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
}

unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 555aab0..007b784 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */

/* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */

diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..53f312c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5816,6 +5816,8 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
flags |= MAP_EXECUTABLE;
if (vma->vm_flags & VM_LOCKED)
flags |= MAP_LOCKED;
+ if (vma->vm_flags & VM_LOCKONFAULT)
+ flags |= MAP_LOCKONFAULT;
if (vma->vm_flags & VM_HUGETLB)
flags |= MAP_HUGETLB;

diff --git a/mm/mmap.c b/mm/mmap.c
index de89be4..54715b6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

- if (flags & MAP_LOCKED)
+ if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
if (!can_do_mlock())
return -EPERM;

@@ -2678,12 +2678,14 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
flags |= MAP_LOCKED;
drop_lock_flag = VM_LOCKED;
} else if (vma->vm_flags & VM_LOCKONFAULT) {
+ flags |= MAP_LOCKONFAULT;
drop_lock_flag = VM_LOCKONFAULT;
}

+
if (drop_lock_flag)
/* drop PG_Mlocked flag for over-mapped range */
- munlock_vma_pages_range(vma, start, start + size, VM_LOCKED);
+ munlock_vma_pages_range(vma, start, start + size, drop_lock_flag);

file = get_file(vma->vm_file);
ret = do_mmap_pgoff(vma->vm_file, start, size,
--
1.9.1

2015-07-21 20:00:50

by Eric B Munson

Subject: [PATCH V4 6/6] selftests: vm: Add tests for lock on fault

Test the mmap() flag, and the mlockall() flag. These tests ensure that
pages are not faulted in until they are accessed, that the pages are
unevictable once faulted in, and that VMA splitting and merging works
with the new VM flag. The second test ensures that mlock limits are
respected. Note that the limit test needs to be run as a normal user.

Also add tests to use the new mlock2 family of system calls.

Signed-off-by: Eric B Munson <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from V3:
Add tests for new mlock2 family of system calls

tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/lock-on-fault.c | 344 ++++++++++++++++
tools/testing/selftests/vm/mlock2-tests.c | 617 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++
tools/testing/selftests/vm/run_vmtests | 33 ++
5 files changed, 1044 insertions(+)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/mlock2-tests.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 231b9a0..0fe6524 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -5,7 +5,10 @@ BINARIES = compaction_test
BINARIES += hugepage-mmap
BINARIES += hugepage-shm
BINARIES += hugetlbfstest
+BINARIES += lock-on-fault
BINARIES += map_hugetlb
+BINARIES += mlock2-tests
+BINARIES += on-fault-limit
BINARIES += thuge-gen
BINARIES += transhuge-stress

diff --git a/tools/testing/selftests/vm/lock-on-fault.c b/tools/testing/selftests/vm/lock-on-fault.c
new file mode 100644
index 0000000..f02c9fb
--- /dev/null
+++ b/tools/testing/selftests/vm/lock-on-fault.c
@@ -0,0 +1,344 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <errno.h>
+
+struct vm_boundaries {
+ unsigned long start;
+ unsigned long end;
+};
+
+static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
+{
+ FILE *file;
+ int ret = 1;
+ char line[1024] = {0};
+ char *end_addr;
+ char *stop;
+ unsigned long start;
+ unsigned long end;
+
+ if (!area)
+ return ret;
+
+ file = fopen("/proc/self/maps", "r");
+ if (!file) {
+ perror("fopen");
+ return ret;
+ }
+
+ memset(area, 0, sizeof(struct vm_boundaries));
+
+ while(fgets(line, 1024, file)) {
+ end_addr = strchr(line, '-');
+ if (!end_addr) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ *end_addr = '\0';
+ end_addr++;
+ stop = strchr(end_addr, ' ');
+ if (!stop) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ *stop = '\0';
+
+ sscanf(line, "%lx", &start);
+ sscanf(end_addr, "%lx", &end);
+
+ if (start <= addr && end > addr) {
+ area->start = start;
+ area->end = end;
+ ret = 0;
+ goto out;
+ }
+ }
+out:
+ fclose(file);
+ return ret;
+}
+
+static unsigned long get_pageflags(unsigned long addr)
+{
+ FILE *file;
+ unsigned long pfn;
+ unsigned long offset;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ offset = addr / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset, SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&pfn, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return pfn;
+}
+
+static unsigned long get_kpageflags(unsigned long pfn)
+{
+ unsigned long flags;
+ FILE *file;
+
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ if (fseek(file, pfn * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&flags, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return flags;
+}
+
+#define PRESENT_BIT 0x8000000000000000
+#define PFN_MASK 0x007FFFFFFFFFFFFF
+#define UNEVICTABLE_BIT (1UL << 18)
+
+static int test_mmap(int flags)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ void *map;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return 1;
+ }
+
+ /* Write something into the first page to ensure it is present */
+ *(char *)map = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* page2_flags should not be present */
+ if (page2_flags & PRESENT_BIT) {
+ printf("page map says 0x%lx\n", page2_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ /* page1_flags should be present */
+ if ((page1_flags & PRESENT_BIT) == 0) {
+ printf("page map says 0x%lx\n", page1_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ /* page1_flags now contains the entry from kpageflags for the first
+ * page, the unevictable bit should be set */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("kpageflags says 0x%lx\n", page1_flags);
+ printf("unevictable is 0x%lx\n", UNEVICTABLE_BIT);
+ return 1;
+ }
+
+ munmap(map, 2 * page_size);
+ return 0;
+}
+
+static int test_munlock(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page3_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ /* No pages should be present */
+ if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT) ||
+ (page3_flags & PRESENT_BIT)) {
+ printf("Page was made present by munlock()\n");
+ goto out;
+ }
+
+ /* Write something to each page so that they are faulted in */
+ *(char*)map = 1;
+ *(char*)(map + page_size) = 1;
+ *(char*)(map + page_size * 2) = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+ page3_flags = get_kpageflags(page3_flags & PFN_MASK);
+
+ /* Pages 1 and 3 should be unevictable */
+ if (!(page1_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page1\n");
+ goto out;
+ }
+ if (!(page3_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page3\n");
+ goto out;
+ }
+
+ /* Page 2 should not be unevictable */
+ if (page2_flags & UNEVICTABLE_BIT) {
+ printf("Unlocked page is still marked unevictable\n");
+ goto out;
+ }
+
+ ret = 0;
+
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+static int test_vma_management(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page_size = getpagesize();
+ struct vm_boundaries page1;
+ struct vm_boundaries page2;
+ struct vm_boundaries page3;
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /*
+ * Before we unlock a portion, we need to check that all three pages are
+ * in the same VMA. If they are not, we abort this test (note that this
+ * is not a failure).
+ */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("VMAs are not merged to start, aborting test\n");
+ ret = 0;
+ goto out;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* All three VMAs should be different */
+ if (page1.start == page2.start || page2.start == page3.start) {
+ printf("failed to split VMA for munlock\n");
+ goto out;
+ }
+
+ /* Now unlock the first and third page and check the VMAs again */
+ if (munlock(map, page_size * 3)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* Now all three VMAs should be the same */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("failed to merge VMAs after munlock\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_mlockall(int (test_function)(int flags))
+{
+ int ret = 1;
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ ret = test_function(MAP_PRIVATE | MAP_ANONYMOUS);
+ munlockall();
+ return ret;
+}
+
+#ifndef MAP_LOCKONFAULT
+#define MAP_LOCKONFAULT (MAP_HUGETLB << 1)
+#endif
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+ ret += test_mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_mmap);
+ ret += test_munlock(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_munlock);
+ ret += test_vma_management(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_vma_management);
+ return ret;
+}
+
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c
new file mode 100644
index 0000000..22ab749
--- /dev/null
+++ b/tools/testing/selftests/vm/mlock2-tests.c
@@ -0,0 +1,617 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <errno.h>
+#include <stdbool.h>
+
+#ifndef MLOCK_LOCK
+#define MLOCK_LOCK 1
+#endif
+
+#ifndef MLOCK_ONFAULT
+#define MLOCK_ONFAULT 2
+#endif
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int mlock2_(void *start, size_t len, int flags)
+{
+#ifdef __NR_mlock2
+ return syscall(__NR_mlock2, start, len, flags);
+#else
+ errno = ENOSYS;
+ return -1;
+#endif
+}
+
+static int munlock2_(void *start, size_t len, int flags)
+{
+#ifdef __NR_munlock2
+ return syscall(__NR_munlock2, start, len, flags);
+#else
+ errno = ENOSYS;
+ return -1;
+#endif
+}
+
+static int munlockall2_(int flags)
+{
+#ifdef __NR_munlockall2
+ return syscall(__NR_munlockall2, flags);
+#else
+ errno = ENOSYS;
+ return -1;
+#endif
+}
+
+static unsigned long get_pageflags(unsigned long addr)
+{
+ FILE *file;
+ unsigned long pfn;
+ unsigned long offset;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen pagemap");
+ _exit(1);
+ }
+
+ offset = addr / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset, SEEK_SET)) {
+ perror("fseek pagemap");
+ _exit(1);
+ }
+
+ if (fread(&pfn, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread pagemap");
+ _exit(1);
+ }
+
+ fclose(file);
+ return pfn;
+}
+
+static unsigned long get_kpageflags(unsigned long pfn)
+{
+ unsigned long flags;
+ FILE *file;
+
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen kpageflags");
+ _exit(1);
+ }
+
+ if (fseek(file, pfn * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek kpageflags");
+ _exit(1);
+ }
+
+ if (fread(&flags, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread kpageflags");
+ _exit(1);
+ }
+
+ fclose(file);
+ return flags;
+}
+
+#define VMFLAGS "VmFlags:"
+
+static bool find_flag(FILE *file, const char *vmflag)
+{
+ char *line = NULL;
+ char *flags;
+ size_t size = 0;
+ bool ret = false;
+
+ while (getline(&line, &size, file) > 0) {
+ if (!strstr(line, VMFLAGS)) {
+ free(line);
+ line = NULL;
+ size = 0;
+ continue;
+ }
+
+ flags = line + strlen(VMFLAGS);
+ ret = (strstr(flags, vmflag) != NULL);
+ goto out;
+ }
+
+out:
+ free(line);
+ return ret;
+}
+
+static bool is_vmflag_set(unsigned long addr, const char *vmflag)
+{
+ FILE *file;
+ char *line = NULL;
+ size_t size = 0;
+ bool ret = false;
+ unsigned long start, end;
+ char perms[5];
+ unsigned long offset;
+ char dev[32];
+ unsigned long inode;
+ char path[BUFSIZ];
+
+ file = fopen("/proc/self/smaps", "r");
+ if (!file) {
+ perror("fopen smaps");
+ _exit(1);
+ }
+
+ while (getline(&line, &size, file) > 0) {
+ if (sscanf(line, "%lx-%lx %s %lx %s %lu %s\n",
+ &start, &end, perms, &offset, dev, &inode, path) < 6)
+ goto next;
+
+ if (start <= addr && addr < end) {
+ ret = find_flag(file, vmflag);
+ goto out;
+ }
+
+next:
+ free(line);
+ line = NULL;
+ size = 0;
+ }
+
+out:
+ free(line);
+ fclose(file);
+ return ret;
+}
+
+#define PRESENT_BIT 0x8000000000000000
+#define PFN_MASK 0x007FFFFFFFFFFFFF
+#define UNEVICTABLE_BIT (1UL << 18)
+
+#define LOCKED "lo"
+#define LOCKEDONFAULT "lf"
+
+static int lock_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Both pages should be present */
+ if (((page1_flags & PRESENT_BIT) == 0) ||
+ ((page2_flags & PRESENT_BIT) == 0)) {
+ printf("Failed to make both pages present\n");
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ /* Both pages should be unevictable */
+ if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
+ ((page2_flags & UNEVICTABLE_BIT) == 0)) {
+ printf("Failed to make both pages unevictable\n");
+ return 1;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKED) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is missing\n", LOCKED);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int unlock_lock_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
+ printf("A page is still marked unevictable after unlock\n");
+ return 1;
+ }
+
+ if (is_vmflag_set((unsigned long)map, LOCKED) ||
+ is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is still set after unlock\n", LOCKED);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int test_mlock_lock()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ if (mlock2_(map, 2 * page_size, MLOCK_LOCK)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_LOCK)");
+ goto unmap;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ /* Now clear the MLOCK_LOCK flag and recheck attributes */
+ if (munlock2_(map, 2 * page_size, MLOCK_LOCK)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("munlock2(MLOCK_LOCK)");
+ goto unmap;
+ }
+
+ ret = unlock_lock_check(map);
+
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int onfault_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Neither page should be present */
+ if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
+ printf("Pages were made present by MLOCK_ONFAULT\n");
+ return 1;
+ }
+
+ *map = 'a';
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Only page 1 should be present */
+ if ((page1_flags & PRESENT_BIT) == 0) {
+ printf("Page 1 is not present after fault\n");
+ return 1;
+ } else if (page2_flags & PRESENT_BIT) {
+ printf("Page 2 was made present\n");
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ /* Page 1 should be unevictable */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("Failed to make faulted page unevictable\n");
+ return 1;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is missing\n", LOCKEDONFAULT);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int unlock_onfault_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ if (page1_flags & UNEVICTABLE_BIT) {
+ printf("Page 1 is still marked unevictable after unlock\n");
+ return 1;
+ }
+
+ if (is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is still set after unlock\n", LOCKEDONFAULT);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int test_mlock_onfault()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ if (onfault_check(map))
+ goto unmap;
+
+ /* Now clear the MLOCK_ONFAULT flag and recheck attributes */
+ if (munlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("munlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ ret = unlock_onfault_check(map);
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int test_lock_onfault_of_present()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ *map = 'a';
+
+ if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ /* Page 1 should be unevictable */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("Failed to make present page unevictable\n");
+ goto unmap;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is missing for one of the pages\n", LOCKEDONFAULT);
+ goto unmap;
+ }
+ ret = 0;
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int test_munlock_mismatch()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ if (mlock2_(map, 2 * page_size, MLOCK_LOCK)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_LOCK)");
+ goto unmap;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Both pages should be present */
+ if (((page1_flags & PRESENT_BIT) == 0) ||
+ ((page2_flags & PRESENT_BIT) == 0)) {
+ printf("Failed to make both pages present\n");
+ goto unmap;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ /* Both pages should be unevictable */
+ if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
+ ((page2_flags & UNEVICTABLE_BIT) == 0)) {
+ printf("Failed to make both pages unevictable\n");
+ goto unmap;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKED) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is missing\n", LOCKED);
+ goto unmap;
+ }
+
+ /* Now clear the MLOCK_ONFAULT flag and recheck attributes */
+ if (munlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("munlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ if ((page1_flags & UNEVICTABLE_BIT) == 0 ||
+ (page2_flags & UNEVICTABLE_BIT) == 0) {
+ printf("Both pages should still be unevictable but are not\n");
+ goto unmap;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKED) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is not set after unlock\n", LOCKED);
+ goto unmap;
+ }
+
+ ret = 0;
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+
+}
+
+static int test_munlockall()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+
+ if (map == MAP_FAILED) {
+ perror("test_munlockall mmap");
+ goto out;
+ }
+
+ if (mlockall(MCL_CURRENT)) {
+ perror("mlockall(MCL_CURRENT)");
+ goto unmap;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ if (munlockall2_(MCL_CURRENT)) {
+ perror("munlockall2(MCL_CURRENT)");
+ goto unmap;
+ }
+
+ if (unlock_lock_check(map))
+ goto unmap;
+
+ munmap(map, 2 * page_size);
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+
+ if (map == MAP_FAILED) {
+ perror("test_munlockall second mmap");
+ goto out;
+ }
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall(MCL_ONFAULT)");
+ goto unmap;
+ }
+
+ if (onfault_check(map))
+ goto unmap;
+
+ if (munlockall2_(MCL_ONFAULT)) {
+ perror("munlockall2(MCL_ONFAULT)");
+ goto unmap;
+ }
+
+ if (unlock_onfault_check(map))
+ goto unmap;
+
+ if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
+ perror("mlockall(MCL_CURRENT | MCL_FUTURE)");
+ goto unmap;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ if (munlockall2_(MCL_FUTURE | MCL_ONFAULT)) {
+ perror("munlockall2(MCL_FUTURE | MCL_ONFAULT)");
+ goto unmap;
+ }
+
+ ret = lock_check(map);
+
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ munlockall2_(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+ ret += test_mlock_lock();
+ ret += test_mlock_onfault();
+ ret += test_munlockall();
+ ret += test_munlock_mismatch();
+ ret += test_lock_onfault_of_present();
+ return ret;
+}
+
diff --git a/tools/testing/selftests/vm/on-fault-limit.c b/tools/testing/selftests/vm/on-fault-limit.c
new file mode 100644
index 0000000..ed2a109
--- /dev/null
+++ b/tools/testing/selftests/vm/on-fault-limit.c
@@ -0,0 +1,47 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_limit(void)
+{
+ int ret = 1;
+ struct rlimit lims;
+ void *map;
+
+ if (getrlimit(RLIMIT_MEMLOCK, &lims)) {
+ perror("getrlimit");
+ return ret;
+ }
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);
+ if (map != MAP_FAILED) {
+ printf("mmap should have failed, but didn't\n");
+ munmap(map, 2 * lims.rlim_max);
+ } else {
+ ret = 0;
+ }
+
+ munlockall();
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+
+ ret += test_limit();
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 49ece11..990a61f 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -102,4 +102,37 @@ else
echo "[PASS]"
fi

+echo "--------------------"
+echo "running lock-on-fault"
+echo "--------------------"
+./lock-on-fault
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running on-fault-limit"
+echo "--------------------"
+sudo -u nobody ./on-fault-limit
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running mlock2-tests"
+echo "--------------------"
+./mlock2-tests
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
exit $exitcode
--
1.9.1

2015-07-21 20:44:47

by Andrew Morton

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Tue, 21 Jul 2015 15:59:37 -0400 Eric B Munson <[email protected]> wrote:

> With the refactored mlock code, introduce new system calls for mlock,
> munlock, and munlockall. The new calls will allow the user to specify
> what lock states are being added or cleared. mlock2 and munlock2 are
> trivial at the moment, but a follow on patch will add a new mlock state
> making them useful.
>
> munlock2 addresses a limitation of the current implementation. If a
> user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
> that MCL_FUTURE should be removed, they would have to call munlockall()
> followed by mlockall(MCL_CURRENT) which could potentially be very
> expensive. The new munlockall2 system call allows a user to simply
> clear the MCL_FUTURE flag.

This is hard. Maybe we shouldn't have wired up anything other than
x86. That's what we usually do with new syscalls.

You appear to have missed
mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix.patch:

--- a/arch/arm64/include/asm/unistd.h~mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix
+++ a/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
#define __ARM_NR_compat_cacheflush (__ARM_NR_COMPAT_BASE+2)
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE+5)

-#define __NR_compat_syscalls 388
+#define __NR_compat_syscalls 391
#endif

#define __ARCH_WANT_SYS_CLONE


And mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix-2.patch:


From: Heiko Carstens <[email protected]>
Subject: mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix-2

can we just remove the s390 bits which cause the breakage?
I will wire up the syscalls as soon as the patch set gets merged.

Heiko Carstens <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/s390/kernel/syscalls.S | 3 ---
1 file changed, 3 deletions(-)

diff -puN arch/s390/kernel/syscalls.S~mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix-2 arch/s390/kernel/syscalls.S
--- a/arch/s390/kernel/syscalls.S~mm-mlock-add-new-mlock-munlock-and-munlockall-system-calls-fix-2
+++ a/arch/s390/kernel/syscalls.S
@@ -363,6 +363,3 @@ SYSCALL(sys_bpf,compat_sys_bpf)
SYSCALL(sys_s390_pci_mmio_write,compat_sys_s390_pci_mmio_write)
SYSCALL(sys_s390_pci_mmio_read,compat_sys_s390_pci_mmio_read)
SYSCALL(sys_execveat,compat_sys_execveat)
-SYSCALL(sys_mlock2,compat_sys_mlock2) /* 355 */
-SYSCALL(sys_munlock2,compat_sys_munlock2)
-SYSCALL(sys_munlockall2,compat_sys_munlockall2)

2015-07-22 01:25:22

by Michael Ellerman

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Tue, 2015-07-21 at 13:44 -0700, Andrew Morton wrote:
> On Tue, 21 Jul 2015 15:59:37 -0400 Eric B Munson <[email protected]> wrote:
>
> > With the refactored mlock code, introduce new system calls for mlock,
> > munlock, and munlockall. The new calls will allow the user to specify
> > what lock states are being added or cleared. mlock2 and munlock2 are
> > trivial at the moment, but a follow on patch will add a new mlock state
> > making them useful.
> >
> > munlock2 addresses a limitation of the current implementation. If a
> > user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
> > that MCL_FUTURE should be removed, they would have to call munlockall()
> > followed by mlockall(MCL_CURRENT) which could potentially be very
> > expensive. The new munlockall2 system call allows a user to simply
> > clear the MCL_FUTURE flag.
>
> This is hard. Maybe we shouldn't have wired up anything other than
> x86. That's what we usually do with new syscalls.

Yeah I think so.

You haven't wired it up properly on powerpc, but I haven't mentioned it because
I'd rather we did it.

cheers

2015-07-22 09:16:23

by Vlastimil Babka

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On 07/21/2015 09:59 PM, Eric B Munson wrote:
> With the refactored mlock code, introduce new system calls for mlock,
> munlock, and munlockall. The new calls will allow the user to specify
> what lock states are being added or cleared. mlock2 and munlock2 are
> trivial at the moment, but a follow on patch will add a new mlock state
> making them useful.
>
> munlock2 addresses a limitation of the current implementation. If a

^ munlockall2?

> user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
> that MCL_FUTURE should be removed, they would have to call munlockall()
> followed by mlockall(MCL_CURRENT) which could potentially be very
> expensive. The new munlockall2 system call allows a user to simply
> clear the MCL_FUTURE flag.
>

2015-07-22 10:03:39

by Vlastimil Babka

Subject: Re: [PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it

On 07/21/2015 09:59 PM, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> For the example of a large file, this is the usage pattern for a large
> statistical language model (it probably applies to other statistical or
> graphical models as well). For the security example, any application
> transacting in data that cannot be swapped out (credit card data,
> medical records, etc).
>
> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in. This can be done an area at a time via the
> mlock2(MLOCK_ONFAULT) or the mlockall(MCL_ONFAULT) system calls. These
> calls can be undone via munlock2(MLOCK_ONFAULT) or
> munlockall2(MCL_ONFAULT).
>
> Applying the VM_LOCKONFAULT flag to a mapping with pages that are
> already present required the addition of a function in gup.c to pin all
> pages which are present in an address range. It borrows heavily from
> __mm_populate().
>
> To keep accounting checks out of the page fault path, users are billed
> for the entire mapping lock as if MLOCK_LOCKED was used.

Hi,

I think you should include a complete description of which transitions
for vma states and mlock2/munlock2 flags applied on them are valid and
what they do. It will also help with the manpages.
You explained some to Jon in the last thread, but I think there should
be a canonical description in changelog (if not also Documentation, if
mlock is covered there).

For example, take the scenario Jon asked about: what happens after
mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED)? The answer is
"nothing". Your promised code comment for apply_vma_flags() doesn't
suffice IMHO (and I'm not sure it's there, anyway?).

But the more I think about the scenario and your new VM_LOCKONFAULT vma
flag, it seems awkward to me. Why should munlocking at all care if the
vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either case the
result is that all pages currently populated are munlocked. So the flags
for munlock2 should be unnecessary.

I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be enough -
see how you had to handle the new flag in all places that had to handle
the old flag? I think the information whether mlock was supposed to
fault the whole vma is obsolete at the moment mlock returns. VM_LOCKED
should be enough for both modes, and the flag to mlock2 could just
control whether the pre-faulting is done.

So what should be IMHO enough:
- munlock can stay without flags
- mlock2 has only one new flag MLOCK_ONFAULT. If specified, pre-faulting
is not done, just set VM_LOCKED and mlock pages already present.
- same with mmap(MAP_LOCKONFAULT) (need to define what happens when both
MAP_LOCKED and MAP_LOCKONFAULT are specified).

Now mlockall(MCL_FUTURE) muddles the situation in that it stores the
information for future VMAs in current->mm->def_flags, and this
def_flags would need to distinguish VM_LOCKED with population and
without. But that could still be solvable without introducing a new vma
flag everywhere.
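
As a rough illustration of the simplified semantics proposed above, here
is a small userspace model (not kernel code). Everything in it is an
assumption made for the sketch: the model_* names, the four-page toy
"VMA", and the MODEL_MLOCK_ONFAULT value are hypothetical and do not
come from the patch set. It only shows that a single locked bit, plus a
flag that controls pre-faulting at lock time, is enough to express both
modes, and that unlocking then needs no flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGES 4
#define MODEL_MLOCK_ONFAULT 0x1 /* hypothetical value, for this model only */

/* Toy stand-in for a VMA: a single VM_LOCKED-style bit plus per-page state. */
struct model_vma {
	bool vm_locked;
	bool present[MODEL_PAGES];
	bool unevictable[MODEL_PAGES];
};

/* A fault makes the page present; it is unevictable iff the VMA is locked. */
static void model_fault(struct model_vma *vma, size_t page)
{
	assert(page < MODEL_PAGES);
	vma->present[page] = true;
	vma->unevictable[page] = vma->vm_locked;
}

/* Lock the whole VMA; the flag only decides whether pages are pre-faulted. */
static void model_mlock2(struct model_vma *vma, int flags)
{
	size_t i;

	vma->vm_locked = true;
	for (i = 0; i < MODEL_PAGES; i++) {
		if (!(flags & MODEL_MLOCK_ONFAULT))
			vma->present[i] = true; /* traditional mlock: pre-fault */
		if (vma->present[i])
			vma->unevictable[i] = true; /* lock what is populated */
	}
}

/* Unlock takes no flags: clear the bit and unlock every populated page. */
static void model_munlock(struct model_vma *vma)
{
	size_t i;

	vma->vm_locked = false;
	for (i = 0; i < MODEL_PAGES; i++)
		vma->unevictable[i] = false;
}
```

In this model, calling model_mlock2() with MODEL_MLOCK_ONFAULT on a VMA
with one populated page locks only that page, a later model_fault() on
another page locks it as it arrives, and model_munlock() undoes both
cases the same way.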

2015-07-22 10:42:32

by Kirill A. Shutemov

Subject: Re: [PATCH V4 1/6] mm: mlock: Refactor mlock, munlock, and munlockall code

On Tue, Jul 21, 2015 at 03:59:36PM -0400, Eric B Munson wrote:
> @@ -648,20 +656,23 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
> start &= PAGE_MASK;
>
> down_write(&current->mm->mmap_sem);
> - ret = do_mlock(start, len, 0);
> + ret = apply_vma_flags(start, len, flags, false);
> up_write(&current->mm->mmap_sem);
>
> return ret;
> }
>
> +SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
> +{
> + return do_munlock(start, len, VM_LOCKED);
> +}
> +
> static int do_mlockall(int flags)
> {
> struct vm_area_struct * vma, * prev = NULL;
>
> if (flags & MCL_FUTURE)
> current->mm->def_flags |= VM_LOCKED;
> - else
> - current->mm->def_flags &= ~VM_LOCKED;

I think this is wrong.

With the current code, mlockall(MCL_CURRENT) after mlockall(MCL_FUTURE |
MCL_CURRENT) would undo future mlocking without unlocking currently
mlocked memory.

The change will break that use case.
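
A minimal userspace sketch of the difference, assuming def_flags can be
modelled as a plain bitmask (the model_* function names and MODEL_*
constants are hypothetical and exist only for this illustration):

```c
#include <assert.h>

/* Model constants; only their distinctness matters for this sketch. */
#define MODEL_MCL_CURRENT 1
#define MODEL_MCL_FUTURE  2
#define MODEL_VM_LOCKED   0x2000UL

/* Behaviour before the patch: MCL_FUTURE is set or cleared on every call. */
static unsigned long model_mlockall_old(unsigned long def_flags, int flags)
{
	assert((flags & ~(MODEL_MCL_CURRENT | MODEL_MCL_FUTURE)) == 0);
	if (flags & MODEL_MCL_FUTURE)
		def_flags |= MODEL_VM_LOCKED;
	else
		def_flags &= ~MODEL_VM_LOCKED;
	return def_flags;
}

/* Behaviour with the quoted hunk applied: the else branch is gone. */
static unsigned long model_mlockall_patched(unsigned long def_flags, int flags)
{
	assert((flags & ~(MODEL_MCL_CURRENT | MODEL_MCL_FUTURE)) == 0);
	if (flags & MODEL_MCL_FUTURE)
		def_flags |= MODEL_VM_LOCKED;
	return def_flags;
}
```

Feeding MODEL_MCL_CURRENT | MODEL_MCL_FUTURE and then MODEL_MCL_CURRENT
through model_mlockall_old() leaves the locked bit clear, while the same
sequence through model_mlockall_patched() leaves it set, which is the
behaviour change described above.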

> if (flags == MCL_FUTURE)
> goto out;
>

--
Kirill A. Shutemov

2015-07-22 11:13:24

by Kirill A. Shutemov

Subject: Re: [PATCH V4 3/6] mm: gup: Add mm_lock_present()

On Tue, Jul 21, 2015 at 03:59:38PM -0400, Eric B Munson wrote:
> The upcoming mlock(MLOCK_ONFAULT) implementation will need a way to
> request that all present pages in a range are locked without faulting in
> pages that are not present. This logic is very close to what the
> __mm_populate() call handles without faulting pages so the patch pulls
> out the pieces that can be shared and adds mm_lock_present() to gup.c.
> The following patch will call it from do_mlock() when MLOCK_ONFAULT is
> specified.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/gup.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 157 insertions(+), 15 deletions(-)

I don't like that you've copy-pasted a lot of code. I think it can be
solved with new FOLL flags.

The totally untested patch below splits the mlock part of FOLL_POPULATE
out into a new FOLL_MLOCK flag. FOLL_POPULATE | FOLL_MLOCK will do what
FOLL_POPULATE currently does. The new MLOCK_ONFAULT can use just
FOLL_MLOCK, which will not trigger fault-in.
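
The intended flag combinations can be summarised with a small userspace
model. This is only a sketch of the two checks in the patch; the
model_* helpers and the MODEL_FOLL_* bit values are assumptions made
for the illustration and are not kernel interfaces.

```c
#include <assert.h> /* for callers that want to check the model */
#include <stdbool.h>

/* Illustrative bit values for this model; only distinctness matters. */
#define MODEL_FOLL_POPULATE 0x40u
#define MODEL_FOLL_MLOCK    0x1000u

/* A present page is mlocked only if FOLL_MLOCK is set and the VMA is locked. */
static bool model_mlocks_page(unsigned int foll_flags, bool vma_locked)
{
	return (foll_flags & MODEL_FOLL_MLOCK) && vma_locked;
}

/* FOLL_MLOCK without FOLL_POPULATE must not fault in missing pages. */
static bool model_skips_fault_in(unsigned int foll_flags)
{
	unsigned int mask = MODEL_FOLL_POPULATE | MODEL_FOLL_MLOCK;

	return (foll_flags & mask) == MODEL_FOLL_MLOCK;
}
```

With both flags set the model locks present pages and still allows
fault-in, matching what FOLL_POPULATE does today; with only
MODEL_FOLL_MLOCK it locks present pages and skips fault-in, which is
what MLOCK_ONFAULT would need.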

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c3a2b37365f6..c3834cddfcc7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2002,6 +2002,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
+#define FOLL_MLOCK 0x1000 /* mlock the page if the VMA is VM_LOCKED */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index a798293fc648..4c7ff23947b9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -129,7 +129,7 @@ retry:
*/
mark_page_accessed(page);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
/*
* The preliminary mapping check is mainly to avoid the
* pointless overhead of lock_page on the ZERO_PAGE
@@ -299,6 +299,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned int fault_flags = 0;
int ret;

+ /* mlock present pages, but not fault in new one */
+ if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
+ return -ENOENT;
/* For mm_populate(), just skip the stack guard page. */
if ((*flags & FOLL_POPULATE) &&
(stack_guard_page_start(vma, address) ||
@@ -890,7 +893,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_VMA(end > vma->vm_end, vma);
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

- gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+ gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f9a334a6c66..9eeb3bd304fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1306,7 +1306,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
pmd, _pmd, 1))
update_mmu_cache_pmd(vma, addr, pmd);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
if (page->mapping && trylock_page(page)) {
lru_add_drain();
if (page->mapping)
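
To make the intended flag semantics concrete, here is a rough userspace model
of the two checks touched above. It is only a sketch: the MODEL_ names and both
helper functions are made up for illustration, the 0x40 value is an assumption
for the model, and none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the FOLL_* bits; not taken from kernel headers. */
#define MODEL_FOLL_POPULATE 0x40u    /* assumed value, model only */
#define MODEL_FOLL_MLOCK    0x1000u  /* value proposed in the hunk above */

/*
 * Models the early return added to faultin_page(): FOLL_MLOCK without
 * FOLL_POPULATE means "lock what is present, do not fault in the rest".
 */
static bool model_should_fault_in(unsigned int gup_flags)
{
	return (gup_flags & (MODEL_FOLL_POPULATE | MODEL_FOLL_MLOCK))
		!= MODEL_FOLL_MLOCK;
}

/*
 * Models the follow_page() side: a present page is mlocked only when the
 * caller passed FOLL_MLOCK and the VMA is VM_LOCKED.
 */
static bool model_should_mlock_page(unsigned int gup_flags, bool vma_locked)
{
	return (gup_flags & MODEL_FOLL_MLOCK) && vma_locked;
}
```

In this model populate_vma_page_range() would keep passing both flags, while an
MLOCK_ONFAULT caller would pass MODEL_FOLL_MLOCK alone.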

--
Kirill A. Shutemov

2015-07-22 11:26:07

by Kirill A. Shutemov

Subject: Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> Now that we have the new VMA flag for the locked but not present state,
> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.

What is the advantage over mmap() + mlock(MLOCK_ONFAULT)?

--
Kirill A. Shutemov

2015-07-22 14:04:47

by Eric B Munson

Subject: Re: [PATCH V4 1/6] mm: mlock: Refactor mlock, munlock, and munlockall code

On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:

> On Tue, Jul 21, 2015 at 03:59:36PM -0400, Eric B Munson wrote:
> > @@ -648,20 +656,23 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
> > start &= PAGE_MASK;
> >
> > down_write(&current->mm->mmap_sem);
> > - ret = do_mlock(start, len, 0);
> > + ret = apply_vma_flags(start, len, flags, false);
> > up_write(&current->mm->mmap_sem);
> >
> > return ret;
> > }
> >
> > +SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
> > +{
> > + return do_munlock(start, len, VM_LOCKED);
> > +}
> > +
> > static int do_mlockall(int flags)
> > {
> > struct vm_area_struct * vma, * prev = NULL;
> >
> > if (flags & MCL_FUTURE)
> > current->mm->def_flags |= VM_LOCKED;
> > - else
> > - current->mm->def_flags &= ~VM_LOCKED;
>
> I think this is wrong.
>
> With current code mlockall(MCL_CURRENT) after mlockall(MCL_FUTURE |
> MCL_CURRENT) would undo future mlocking, without unlocking currently
> mlocked memory.
>
> The change will break the use-case.

It is wrong, and I have addressed it in this case as well as with the
MCL_ONFAULT flag introduced in patch 4. I will also update the mlockall
man page to specify this behavior.
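
For reference, a small userspace model of the two def_flags behaviours under
discussion. This is a sketch only: the MODEL_ constants and both helper
functions are made up for illustration and are not the kernel implementation.

```c
#include <assert.h>

/* Illustrative stand-ins; model values, not taken from kernel headers. */
#define MODEL_MCL_CURRENT 1
#define MODEL_MCL_FUTURE  2
#define MODEL_VM_LOCKED   0x2000UL

/* Existing behaviour: without MCL_FUTURE, VM_LOCKED is cleared again. */
static unsigned long model_def_flags_existing(unsigned long def_flags, int flags)
{
	if (flags & MODEL_MCL_FUTURE)
		def_flags |= MODEL_VM_LOCKED;
	else
		def_flags &= ~MODEL_VM_LOCKED;
	return def_flags;
}

/* V4 hunk: the else branch is gone, so the bit can never be cleared here. */
static unsigned long model_def_flags_v4(unsigned long def_flags, int flags)
{
	if (flags & MODEL_MCL_FUTURE)
		def_flags |= MODEL_VM_LOCKED;
	return def_flags;
}
```

Calling the existing variant with MODEL_MCL_CURRENT after MODEL_MCL_CURRENT |
MODEL_MCL_FUTURE drops the future lock, while the V4 variant leaves it set,
which is the regression pointed out above.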


2015-07-22 14:05:08

by Eric B Munson

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Wed, 22 Jul 2015, Vlastimil Babka wrote:

> On 07/21/2015 09:59 PM, Eric B Munson wrote:
> >With the refactored mlock code, introduce new system calls for mlock,
> >munlock, and munlockall. The new calls will allow the user to specify
> >what lock states are being added or cleared. mlock2 and munlock2 are
> >trivial at the moment, but a follow on patch will add a new mlock state
> >making them useful.
> >
> >munlock2 addresses a limitation of the current implementation. If a
>
> ^ munlockall2?

Fixed, thanks.


2015-07-22 14:11:36

by Eric B Munson

Subject: Re: [PATCH V4 3/6] mm: gup: Add mm_lock_present()

On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:

> On Tue, Jul 21, 2015 at 03:59:38PM -0400, Eric B Munson wrote:
> > The upcoming mlock(MLOCK_ONFAULT) implementation will need a way to
> > request that all present pages in a range are locked without faulting in
> > pages that are not present. This logic is very close to what the
> > __mm_populate() call handles without faulting pages so the patch pulls
> > out the pieces that can be shared and adds mm_lock_present() to gup.c.
> > The following patch will call it from do_mlock() when MLOCK_ONFAULT is
> > specified.
> >
> > Signed-off-by: Eric B Munson <[email protected]>
> > Cc: Jonathan Corbet <[email protected]>
> > Cc: Vlastimil Babka <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > mm/gup.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 157 insertions(+), 15 deletions(-)
>
> I don't like that you've copy-pasted a lot of code. I think it can be
> solved with new foll flags.
>
> The totally untested patch below splits the mlock part of FOLL_POPULATE out
> into a new FOLL_MLOCK flag. FOLL_POPULATE | FOLL_MLOCK will do what
> FOLL_POPULATE currently does. The new MLOCK_ONFAULT can use just FOLL_MLOCK,
> which will not trigger fault-in.

I originally tried to do this by adding a check for VM_LOCKONFAULT in
__get_user_pages() before the call to faultin_page() which would goto
next_page if LOCKONFAULT was specified. With the early out in
__get_user_pages(), all of the tests using lock on fault failed to lock
pages. I will try with a new FOLL flag and see if that can work out.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c3a2b37365f6..c3834cddfcc7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2002,6 +2002,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
> #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_MLOCK 0x1000 /* mlock the page if the VMA is VM_LOCKED */
>
> typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index a798293fc648..4c7ff23947b9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -129,7 +129,7 @@ retry:
> */
> mark_page_accessed(page);
> }
> - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
> + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
> /*
> * The preliminary mapping check is mainly to avoid the
> * pointless overhead of lock_page on the ZERO_PAGE
> @@ -299,6 +299,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> unsigned int fault_flags = 0;
> int ret;
>
> + /* mlock present pages, but not fault in new one */
> + if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
> + return -ENOENT;
> /* For mm_populate(), just skip the stack guard page. */
> if ((*flags & FOLL_POPULATE) &&
> (stack_guard_page_start(vma, address) ||
> @@ -890,7 +893,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> VM_BUG_ON_VMA(end > vma->vm_end, vma);
> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>
> - gup_flags = FOLL_TOUCH | FOLL_POPULATE;
> + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
> /*
> * We want to touch writable mappings with a write fault in order
> * to break COW, except for shared mappings because these don't COW
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8f9a334a6c66..9eeb3bd304fc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1306,7 +1306,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> pmd, _pmd, 1))
> update_mmu_cache_pmd(vma, addr, pmd);
> }
> - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
> + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
> if (page->mapping && trylock_page(page)) {
> lru_add_drain();
> if (page->mapping)
>
> --
> Kirill A. Shutemov


2015-07-22 14:15:10

by Eric B Munson

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Wed, 22 Jul 2015, Michael Ellerman wrote:

> On Tue, 2015-07-21 at 13:44 -0700, Andrew Morton wrote:
> > On Tue, 21 Jul 2015 15:59:37 -0400 Eric B Munson <[email protected]> wrote:
> >
> > > With the refactored mlock code, introduce new system calls for mlock,
> > > munlock, and munlockall. The new calls will allow the user to specify
> > > what lock states are being added or cleared. mlock2 and munlock2 are
> > > trivial at the moment, but a follow on patch will add a new mlock state
> > > making them useful.
> > >
> > > munlock2 addresses a limitation of the current implementation. If a
> > > user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
> > > that MCL_FUTURE should be removed, they would have to call munlockall()
> > > followed by mlockall(MCL_CURRENT) which could potentially be very
> > > expensive. The new munlockall2 system call allows a user to simply
> > > clear the MCL_FUTURE flag.
> >
> > This is hard. Maybe we shouldn't have wired up anything other than
> > x86. That's what we usually do with new syscalls.
>
> Yeah I think so.
>
> You haven't wired it up properly on powerpc, but I haven't mentioned it because
> I'd rather we did it.
>
> cheers

It looks like I will be spinning a V5, so I will drop all but the x86
system call additions in that version.


2015-07-22 14:32:26

by Eric B Munson

Subject: Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:

> On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
> > The cost of faulting in all memory to be locked can be very high when
> > working with large mappings. If only portions of the mapping will be
> > used this can incur a high penalty for locking.
> >
> > Now that we have the new VMA flag for the locked but not present state,
> > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
>
> What is the advantage over mmap() + mlock(MLOCK_ONFAULT)?

There isn't one; it was added to maintain parity with the
mlock(MLOCK_LOCK) -> mmap(MAP_LOCKED) set. I think not having it will lead
to confusion: we have MAP_LOCKED, so why don't we support LOCKONFAULT from
mmap as well?


2015-07-22 15:45:37

by Kirill A. Shutemov

Subject: Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

On Wed, Jul 22, 2015 at 10:32:20AM -0400, Eric B Munson wrote:
> On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:
>
> > On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
> > > The cost of faulting in all memory to be locked can be very high when
> > > working with large mappings. If only portions of the mapping will be
> > > used this can incur a high penalty for locking.
> > >
> > > Now that we have the new VMA flag for the locked but not present state,
> > > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> >
> > What is advantage over mmap() + mlock(MLOCK_ONFAULT)?
>
> There isn't one, it was added to maintain parity with the
> mlock(MLOCK_LOCK) -> mmap(MAP_LOCKED) set. I think not having it will lead
> to confusion because we have MAP_LOCKED so why don't we support
> LOCKONFAULT from mmap as well.

I don't think it's a good idea to spend bits in flags unless we have a
reason for that.

BTW, you have a typo on sparc: s/0x8000/0x80000/.


--
Kirill A. Shutemov

2015-07-22 18:43:48

by Eric B Munson

Subject: Re: [PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it

On Wed, 22 Jul 2015, Vlastimil Babka wrote:

> On 07/21/2015 09:59 PM, Eric B Munson wrote:
> >The cost of faulting in all memory to be locked can be very high when
> >working with large mappings. If only portions of the mapping will be
> >used this can incur a high penalty for locking.
> >
> >For the example of a large file, this is the usage pattern for a large
> >statistical language model (probably applies to other statistical or graphical
> >models as well). For the security example, any application transacting
> >in data that cannot be swapped out (credit card data, medical records,
> >etc).
> >
> >This patch introduces the ability to request that pages are not
> >pre-faulted, but are placed on the unevictable LRU when they are finally
> >faulted in. This can be done area at a time via the
> >mlock2(MLOCK_ONFAULT) or the mlockall(MCL_ONFAULT) system calls. These
> >calls can be undone via munlock2(MLOCK_ONFAULT) or
> >munlockall2(MCL_ONFAULT).
> >
> >Applying the VM_LOCKONFAULT flag to a mapping with pages that are
> >already present required the addition of a function in gup.c to pin all
> >pages which are present in an address range. It borrows heavily from
> >__mm_populate().
> >
> >To keep accounting checks out of the page fault path, users are billed
> >for the entire mapping lock as if MLOCK_LOCKED was used.
>
> Hi,
>
> I think you should include a complete description of which
> transitions for vma states and mlock2/munlock2 flags applied on them
> are valid and what they do. It will also help with the manpages.
> You explained some to Jon in the last thread, but I think there
> should be a canonical description in changelog (if not also
> Documentation, if mlock is covered there).
>
> For example the scenario Jon asked, what happens after a
> mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED), and that the
> answer is "nothing". Your promised code comment for
> apply_vma_flags() doesn't suffice IMHO (and I'm not sure it's there,
> anyway?).

I missed adding that comment to the code, will be there in V5 along with
the description in the changelog.

>
> But the more I think about the scenario and your new VM_LOCKONFAULT
> vma flag, it seems awkward to me. Why should munlocking at all care
> if the vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either
> case the result is that all pages currently populated are munlocked.
> So the flags for munlock2 should be unnecessary.

Say a user has a large area of interleaved MLOCK_LOCK and MLOCK_ONFAULT
mappings and they want to unlock only the ones with MLOCK_LOCK. With
the current implementation, this is possible in a single system call
that spans the entire region. With your suggestion, the user would have
to know which regions were locked with MLOCK_LOCK and call munlock() on
each of them. IMO, the way munlock2() works better mirrors the way
munlock() currently works when called on a large area of interleaved
locked and unlocked areas.
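
As a rough illustration of the interleaved case, here is a userspace model of
a flag-selective unlock over a list of mappings. It is only a sketch: the
model_vma struct, the MODEL_ flags, and model_munlock2() are hypothetical
names, and the model assumes the range covers whole mappings (no splitting).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative lock states; not the kernel's VM_* values. */
#define MODEL_VM_LOCKED      0x1UL
#define MODEL_VM_LOCKONFAULT 0x2UL

struct model_vma {
	unsigned long start;
	unsigned long end;	/* exclusive */
	unsigned long flags;
};

/*
 * Clear the requested lock state from every mapping that overlaps
 * [start, end). Mappings that do not carry that state are left alone.
 * Returns the number of mappings that changed.
 */
static size_t model_munlock2(struct model_vma *vmas, size_t n,
			     unsigned long start, unsigned long end,
			     unsigned long flags)
{
	size_t changed = 0;

	for (size_t i = 0; i < n; i++) {
		if (vmas[i].end <= start || vmas[i].start >= end)
			continue;	/* no overlap with the range */
		if (vmas[i].flags & flags) {
			vmas[i].flags &= ~flags;
			changed++;
		}
	}
	return changed;
}
```

One call spanning the whole region with MODEL_VM_LOCKED unlocks only the
MLOCK_LOCK mappings and leaves the lock-on-fault ones untouched.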

>
> I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be
> enough - see how you had to handle the new flag in all places that
> had to handle the old flag? I think the information whether mlock
> was supposed to fault the whole vma is obsolete at the moment mlock
> returns. VM_LOCKED should be enough for both modes, and the flag to
> mlock2 could just control whether the pre-faulting is done.
>
> So what should be IMHO enough:
> - munlock can stay without flags
> - mlock2 has only one new flag MLOCK_ONFAULT. If specified,
> pre-faulting is not done, just set VM_LOCKED and mlock pages already
> present.
> - same with mmap(MAP_LOCKONFAULT) (need to define what happens when
> both MAP_LOCKED and MAP_LOCKONFAULT are specified).
>
> Now mlockall(MCL_FUTURE) muddles the situation in that it stores the
> information for future VMA's in current->mm->def_flags, and this
> def_flags would need to distinguish VM_LOCKED with population and
> without. But that could be still solvable without introducing a new
> vma flag everywhere.

With you right up until that last paragraph. I have been staring at
this a while and I cannot come up with a way to handle the
mlockall(MCL_ONFAULT) without introducing a new vm flag. It doesn't
have to be VM_LOCKONFAULT, we could use the model that Michal Hocko
suggested with something like VM_FAULTPOPULATE. However, we can't
really use this flag anywhere except the mlock code because we have to
be able to distinguish a caller that wants to use MLOCK_LOCK with
whatever control VM_FAULTPOPULATE might grant outside of mlock and a
caller that wants MLOCK_ONFAULT. That was a long way of saying we need
an extra vma flag regardless. However, if that flag only controls if
mlock pre-populates it would work and it would do away with most of the
places I had to touch to handle VM_LOCKONFAULT properly.

I picked VM_LOCKONFAULT because it is explicit about what it is for and
there is little risk of someone coming along in 5 years and saying "why
not overload this flag to do this other thing completely unrelated to
mlock?". A flag for controlling speculative population is more likely to
be overloaded outside of mlock().

If you have a sane way of handling mlockall(MCL_ONFAULT) without a new
VMA flag, I am happy to give it a try, but I haven't been able to come
up with one that doesn't have its own gremlins.


2015-07-23 06:59:15

by Ralf Baechle

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Wed, Jul 22, 2015 at 10:15:01AM -0400, Eric B Munson wrote:

> >
> > You haven't wired it up properly on powerpc, but I haven't mentioned it because
> > I'd rather we did it.
> >
> > cheers
>
> It looks like I will be spinning a V5, so I will drop all but the x86
> system calls additions in that version.

The MIPS bits are looking good however, so

Acked-by: Ralf Baechle <[email protected]>

With my ack, will you keep them or maybe carry them as a separate patch?

Cheers,

Ralf

2015-07-23 10:03:50

by Vlastimil Babka

Subject: Re: [PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it

On 07/22/2015 08:43 PM, Eric B Munson wrote:
> On Wed, 22 Jul 2015, Vlastimil Babka wrote:
>
>>
>> Hi,
>>
>> I think you should include a complete description of which
>> transitions for vma states and mlock2/munlock2 flags applied on them
>> are valid and what they do. It will also help with the manpages.
>> You explained some to Jon in the last thread, but I think there
>> should be a canonical description in changelog (if not also
>> Documentation, if mlock is covered there).
>>
>> For example the scenario Jon asked, what happens after a
>> mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED), and that the
>> answer is "nothing". Your promised code comment for
>> apply_vma_flags() doesn't suffice IMHO (and I'm not sure it's there,
>> anyway?).
>
> I missed adding that comment to the code, will be there in V5 along with
> the description in the changelog.

Thanks!

>>
>> But the more I think about the scenario and your new VM_LOCKONFAULT
>> vma flag, it seems awkward to me. Why should munlocking at all care
>> if the vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either
>> case the result is that all pages currently populated are munlocked.
>> So the flags for munlock2 should be unnecessary.
>
> Say a user has a large area of interleaved MLOCK_LOCK and MLOCK_ONFAULT
> mappings and they want to unlock only the ones with MLOCK_LOCK. With
> the current implementation, this is possible in a single system call
> that spans the entire region. With your suggestion, the user would have
> to know which regions were locked with MLOCK_LOCK and call munlock() on
> each of them. IMO, the way munlock2() works better mirrors the way
> munlock() currently works when called on a large area of interleaved
> locked and unlocked areas.

Um OK, that scenario is possible in theory. But I have a hard time imagining
that somebody would really want to do that. I think many more people would
benefit from a simpler API.

>
>>
>> I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be
>> enough - see how you had to handle the new flag in all places that
>> had to handle the old flag? I think the information whether mlock
>> was supposed to fault the whole vma is obsolete at the moment mlock
>> returns. VM_LOCKED should be enough for both modes, and the flag to
>> mlock2 could just control whether the pre-faulting is done.
>>
>> So what should be IMHO enough:
>> - munlock can stay without flags
>> - mlock2 has only one new flag MLOCK_ONFAULT. If specified,
>> pre-faulting is not done, just set VM_LOCKED and mlock pages already
>> present.
>> - same with mmap(MAP_LOCKONFAULT) (need to define what happens when
>> both MAP_LOCKED and MAP_LOCKONFAULT are specified).
>>
>> Now mlockall(MCL_FUTURE) muddles the situation in that it stores the
>> information for future VMA's in current->mm->def_flags, and this
>> def_flags would need to distinguish VM_LOCKED with population and
>> without. But that could be still solvable without introducing a new
>> vma flag everywhere.
>
> With you right up until that last paragraph. I have been staring at
> this a while and I cannot come up with a way to handle the
> mlockall(MCL_ONFAULT) without introducing a new vm flag. It doesn't
> have to be VM_LOCKONFAULT, we could use the model that Michal Hocko
> suggested with something like VM_FAULTPOPULATE. However, we can't
> really use this flag anywhere except the mlock code because we have to
> be able to distinguish a caller that wants to use MLOCK_LOCK with
> whatever control VM_FAULTPOPULATE might grant outside of mlock and a
> caller that wants MLOCK_ONFAULT. That was a long way of saying we need
> an extra vma flag regardless. However, if that flag only controls if
> mlock pre-populates it would work and it would do away with most of the
> places I had to touch to handle VM_LOCKONFAULT properly.

Yes, it would be a good way. Adding a new vma flag is probably cleanest after
all, but the flag would be set *in addition* to VM_LOCKED, *just* to prevent
pre-faulting. The places that check VM_LOCKED for the actual page mlocking (i.e.
try_to_unmap_one) would just keep checking VM_LOCKED. The places where VM_LOCKED
is checked to trigger prepopulation, would skip that if VM_LOCKONFAULT is also
set. Having VM_LOCKONFAULT set without VM_LOCKED would be an invalid state.

This should work fine with the simplified API as I proposed, so let me reiterate
and try to fill in the blanks:

- mlock2 has only one new flag MLOCK_ONFAULT. If specified, VM_LOCKONFAULT is
set in addition to VM_LOCKED and no prefaulting is done
- old mlock syscall naturally behaves as mlock2 without MLOCK_ONFAULT
- calling mlock/mlock2 on an already-mlocked area (if that's permitted
already?) will add/remove VM_LOCKONFAULT as needed. If it's removing,
prepopulate whole range. Of course adding VM_LOCKONFAULT to a vma that was
already prefaulted doesn't make any difference, but it's consistent with the rest.
- munlock removes both VM_LOCKED and VM_LOCKONFAULT
- mmap could treat MAP_LOCKONFAULT as a modifier to MAP_LOCKED to be consistent?
or not? I'm not sure here, either way subtly differs from mlock API anyway, I
just wish MAP_LOCKED never existed...
- mlockall(MCL_CURRENT) sets or clears VM_LOCKONFAULT depending on
MCL_ONFAULT, mlockall(MCL_FUTURE) does the same on mm->def_flags
- munlockall2 removes both, like munlock. munlockall2(MCL_FUTURE) does that to
def_flags
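
The per-vma transitions in the list above can be sketched as a tiny userspace
state model. This is an illustration under stated assumptions, not the planned
implementation: the MODEL_ constants and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; not the kernel's VM_* or MLOCK_* definitions. */
#define MODEL_VM_LOCKED      0x1UL
#define MODEL_VM_LOCKONFAULT 0x2UL
#define MODEL_MLOCK_ONFAULT  0x1

/*
 * mlock2(): always sets VM_LOCKED. MLOCK_ONFAULT adds VM_LOCKONFAULT and
 * skips prefaulting; without it the modifier is removed and the whole
 * range is populated. *populate reports which of the two happens.
 */
static unsigned long model_mlock2(unsigned long vm_flags, int flags,
				  bool *populate)
{
	vm_flags |= MODEL_VM_LOCKED;
	if (flags & MODEL_MLOCK_ONFAULT) {
		vm_flags |= MODEL_VM_LOCKONFAULT;
		*populate = false;
	} else {
		vm_flags &= ~MODEL_VM_LOCKONFAULT;
		*populate = true;
	}
	return vm_flags;
}

/* The old mlock() behaves as mlock2() without MLOCK_ONFAULT. */
static unsigned long model_mlock(unsigned long vm_flags, bool *populate)
{
	return model_mlock2(vm_flags, 0, populate);
}

/* munlock() removes both flags. */
static unsigned long model_munlock(unsigned long vm_flags)
{
	return vm_flags & ~(MODEL_VM_LOCKED | MODEL_VM_LOCKONFAULT);
}

/* VM_LOCKONFAULT without VM_LOCKED is the invalid state. */
static bool model_state_valid(unsigned long vm_flags)
{
	return !(vm_flags & MODEL_VM_LOCKONFAULT) ||
	       (vm_flags & MODEL_VM_LOCKED);
}
```

Keeping VM_LOCKONFAULT as a pure modifier means every existing VM_LOCKED check
stays correct, and only the populate decision needs to look at the new bit.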

> I picked VM_LOCKONFAULT because it is explicit about what it is for and
> there is little risk of someone coming along in 5 years and saying "why
> not overload this flag to do this other thing completely unrelated to
> mlock?". A flag for controlling speculative population is more likely to
> be overloaded outside of mlock().

Sure, let's make clear the name is related to mlock, but the behavior could
still be additive to MAP_LOCKED.

> If you have a sane way of handling mlockall(MCL_ONFAULT) without a new
> VMA flag, I am happy to give it a try, but I haven't been able to come
> up with one that doesn't have its own gremlins.

Well we could store the MCL_FUTURE | MCL_ONFAULT bit elsewhere in mm_struct than
the def_flags field. The VM_LOCKED field is already evaluated specially from all
the other def_flags. We are nearing the full 32-bit space for vma flags. I think
all I've proposed above wouldn't change much if we removed the per-vma
VM_LOCKONFAULT flag from the equation. Just that re-mlocking an area already
mlocked *without* MLOCK_ONFAULT wouldn't know that it was already prepopulated,
and would have to re-populate in either case (I'm not sure, maybe it's already
done by current implementation anyway so it's not a potential performance
regression).
Only mlockall(MCL_FUTURE | MCL_ONFAULT) should really need the ONFAULT info to
"stick" somewhere in mm_struct, but it doesn't have to be def_flags?
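
A minimal userspace sketch of that alternative, assuming the on-fault choice
lives in its own field rather than in def_flags. Everything here is
hypothetical: the model_mm struct, its future_onfault field, the MODEL_
constants, and both helpers exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; not the kernel's MCL_* or VM_* definitions. */
#define MODEL_MCL_FUTURE  2
#define MODEL_MCL_ONFAULT 4
#define MODEL_VM_LOCKED   0x1UL

struct model_mm {
	unsigned long def_flags;
	bool future_onfault;	/* hypothetical field outside def_flags */
};

/* Record the MCL_FUTURE choice, keeping the on-fault bit out of def_flags. */
static void model_mlockall(struct model_mm *mm, int flags)
{
	if (flags & MODEL_MCL_FUTURE) {
		mm->def_flags |= MODEL_VM_LOCKED;
		mm->future_onfault = (flags & MODEL_MCL_ONFAULT) != 0;
	} else {
		mm->def_flags &= ~MODEL_VM_LOCKED;
		mm->future_onfault = false;
	}
}

/*
 * A new mapping inherits def_flags. It is prefaulted only when it is
 * locked and the mm-wide on-fault choice is not set.
 */
static bool model_new_mapping_populates(const struct model_mm *mm,
					unsigned long *vm_flags)
{
	*vm_flags = mm->def_flags;
	return (*vm_flags & MODEL_VM_LOCKED) && !mm->future_onfault;
}
```

The trade-off in this model is that a mapping no longer remembers how it was
locked, so a later mlock() on it cannot tell whether it was ever populated.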

2015-07-23 15:25:08

by Eric B Munson

Subject: Re: [PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it

On Thu, 23 Jul 2015, Vlastimil Babka wrote:

> On 07/22/2015 08:43 PM, Eric B Munson wrote:
> > On Wed, 22 Jul 2015, Vlastimil Babka wrote:
> >
> >>
> >> Hi,
> >>
> >> I think you should include a complete description of which
> >> transitions for vma states and mlock2/munlock2 flags applied on them
> >> are valid and what they do. It will also help with the manpages.
> >> You explained some to Jon in the last thread, but I think there
> >> should be a canonical description in changelog (if not also
> >> Documentation, if mlock is covered there).
> >>
> >> For example the scenario Jon asked, what happens after a
> >> mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED), and that the
> >> answer is "nothing". Your promised code comment for
> >> apply_vma_flags() doesn't suffice IMHO (and I'm not sure it's there,
> >> anyway?).
> >
> > I missed adding that comment to the code, will be there in V5 along with
> > the description in the changelog.
>
> Thanks!
>
> >>
> >> But the more I think about the scenario and your new VM_LOCKONFAULT
> >> vma flag, it seems awkward to me. Why should munlocking at all care
> >> if the vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either
> >> case the result is that all pages currently populated are munlocked.
> >> So the flags for munlock2 should be unnecessary.
> >
> > Say a user has a large area of interleaved MLOCK_LOCK and MLOCK_ONFAULT
> > mappings and they want to unlock only the ones with MLOCK_LOCK. With
> > the current implementation, this is possible in a single system call
> > that spans the entire region. With your suggestion, the user would have
> > to know which regions were locked with MLOCK_LOCK and call munlock() on
> > each of them. IMO, the way munlock2() works better mirrors the way
> > munlock() currently works when called on a large area of interleaved
> > locked and unlocked areas.
>
> Um OK, that scenario is possible in theory. But I have a hard time imagining
> that somebody would really want to do that. I think many more people would
> benefit from a simpler API.

It wasn't about imagining a scenario, more about keeping parity with
something that currently works (unlocking a large area of interleaved
locked and unlocked regions). However, there is no reason we can't add
the new munlock2 later if it is desired.

>
> >
> >>
> >> I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be
> >> enough - see how you had to handle the new flag in all places that
> >> had to handle the old flag? I think the information whether mlock
> >> was supposed to fault the whole vma is obsolete at the moment mlock
> >> returns. VM_LOCKED should be enough for both modes, and the flag to
> >> mlock2 could just control whether the pre-faulting is done.
> >>
> >> So what should be IMHO enough:
> >> - munlock can stay without flags
> >> - mlock2 has only one new flag MLOCK_ONFAULT. If specified,
> >> pre-faulting is not done, just set VM_LOCKED and mlock pages already
> >> present.
> >> - same with mmap(MAP_LOCKONFAULT) (need to define what happens when
> >> both MAP_LOCKED and MAP_LOCKONFAULT are specified).
> >>
> >> Now mlockall(MCL_FUTURE) muddles the situation in that it stores the
> >> information for future VMA's in current->mm->def_flags, and this
> >> def_flags would need to distinguish VM_LOCKED with population and
> >> without. But that could be still solvable without introducing a new
> >> vma flag everywhere.
> >
> > With you right up until that last paragraph. I have been staring at
> > this a while and I cannot come up with a way to handle the
> > mlockall(MCL_ONFAULT) without introducing a new vm flag. It doesn't
> > have to be VM_LOCKONFAULT, we could use the model that Michal Hocko
> > suggested with something like VM_FAULTPOPULATE. However, we can't
> > really use this flag anywhere except the mlock code because we have to
> > be able to distinguish a caller that wants to use MLOCK_LOCK with
> > whatever control VM_FAULTPOPULATE might grant outside of mlock and a
> > caller that wants MLOCK_ONFAULT. That was a long way of saying we need
> > an extra vma flag regardless. However, if that flag only controls if
> > mlock pre-populates it would work and it would do away with most of the
> > places I had to touch to handle VM_LOCKONFAULT properly.
>
> Yes, it would be a good way. Adding a new vma flag is probably cleanest after
> all, but the flag would be set *in addition* to VM_LOCKED, *just* to prevent
> pre-faulting. The places that check VM_LOCKED for the actual page mlocking (i.e.
> try_to_unmap_one) would just keep checking VM_LOCKED. The places where VM_LOCKED
> is checked to trigger prepopulation, would skip that if VM_LOCKONFAULT is also
> set. Having VM_LOCKONFAULT set without VM_LOCKED would be an invalid state.
>
> This should work fine with the simplified API as I proposed so let me reiterate
> and try to fill in the blanks:
>
> - mlock2 has only one new flag MLOCK_ONFAULT. If specified, VM_LOCKONFAULT is
>   set in addition to VM_LOCKED and no prefaulting is done
> - old mlock syscall naturally behaves as mlock2 without MLOCK_ONFAULT
> - calling mlock/mlock2 on an already-mlocked area (if that's permitted
>   already?) will add/remove VM_LOCKONFAULT as needed. If it's removing,
>   prepopulate whole range. Of course adding VM_LOCKONFAULT to a vma that was
>   already prefaulted doesn't make any difference, but it's consistent with the rest.
> - munlock removes both VM_LOCKED and VM_LOCKONFAULT
> - mmap could treat MAP_LOCKONFAULT as a modifier to MAP_LOCKED to be consistent?
>   or not? I'm not sure here, either way subtly differs from mlock API anyway, I
>   just wish MAP_LOCKED never existed...
> - mlockall(MCL_CURRENT) sets or clears VM_LOCKONFAULT depending on
>   MCL_LOCKONFAULT, mlockall(MCL_FUTURE) does the same on mm->def_flags
> - munlockall2 removes both, like munlock. munlockall2(MCL_FUTURE) does that to
>   def_flags
>
> > I picked VM_LOCKONFAULT because it is explicit about what it is for and
> > there is little risk of someone coming along in 5 years and saying "why
> > not overload this flag to do this other thing completely unrelated to
> > mlock?". A flag for controlling speculative population is more likely to
> > be overloaded outside of mlock().
>
> Sure, let's make clear the name is related to mlock, but the behavior could
> still be additive to MAP_LOCKED.
>
> > If you have a sane way of handling mlockall(MCL_ONFAULT) without a new
> > VMA flag, I am happy to give it a try, but I haven't been able to come
> > up with one that doesn't have its own gremlins.
>
> Well we could store the MCL_FUTURE | MCL_ONFAULT bit elsewhere in mm_struct than
> the def_flags field. The VM_LOCKED field is already evaluated specially from all
> the other def_flags. We are nearing the full 32bit space for vma flags. I think
> all I've proposed above wouldn't change much if we removed per-vma
> VM_LOCKONFAULT flag from the equation. Just that re-mlocking area already
> mlocked *without* MLOCK_ONFAULT wouldn't know that it was already prepopulated,
> and would have to re-populate in either case (I'm not sure, maybe it's already
> done by current implementation anyway so it's not a potential performance
> regression).
> Only mlockall(MCL_FUTURE | MCL_ONFAULT) should really need the ONFAULT info to
> "stick" somewhere in mm_struct, but it doesn't have to be def_flags?

This all sounds fine and should still cover the usecase that started
this adventure. I will include this change in the V5 spin.
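
The flag rules proposed in the quoted list can be modelled in a few lines
of user-space C. This is only a sketch of the proposal as discussed in this
thread, not kernel code: the sk_ names, the bit values, and the "populated"
field (a stand-in for "the whole range has been faulted in") are all
hypothetical, and the real mm code paths are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; these are NOT the kernel's VM_* constants. */
#define SK_VM_LOCKED      0x1u
#define SK_VM_LOCKONFAULT 0x2u
#define SK_MLOCK_ONFAULT  0x1u  /* hypothetical value for the mlock2() flag */

/* Toy stand-in for a vma: just its flags and whether it was prefaulted. */
struct sk_vma {
	unsigned int flags;
	bool populated;
};

/* VM_LOCKONFAULT without VM_LOCKED is the invalid state named in the thread. */
static bool sk_vma_valid(const struct sk_vma *vma)
{
	return !(vma->flags & SK_VM_LOCKONFAULT) ||
	       (vma->flags & SK_VM_LOCKED);
}

/*
 * mlock2(): always sets VM_LOCKED. With MLOCK_ONFAULT, VM_LOCKONFAULT is
 * added and prefaulting is skipped. Without it, VM_LOCKONFAULT is cleared
 * and the whole range is populated.
 */
static void sk_mlock2(struct sk_vma *vma, unsigned int flags)
{
	vma->flags |= SK_VM_LOCKED;
	if (flags & SK_MLOCK_ONFAULT) {
		vma->flags |= SK_VM_LOCKONFAULT;
	} else {
		vma->flags &= ~SK_VM_LOCKONFAULT;
		vma->populated = true;
	}
	assert(sk_vma_valid(vma));
}

/* The old mlock() behaves as mlock2() without MLOCK_ONFAULT. */
static void sk_mlock(struct sk_vma *vma)
{
	sk_mlock2(vma, 0);
}

/* munlock() removes both VM_LOCKED and VM_LOCKONFAULT. */
static void sk_munlock(struct sk_vma *vma)
{
	vma->flags &= ~(SK_VM_LOCKED | SK_VM_LOCKONFAULT);
	assert(sk_vma_valid(vma));
}
```

In this model, calling sk_mlock() on a range previously locked with
SK_MLOCK_ONFAULT clears the on-fault bit and marks the range populated,
which mirrors the "if it's removing, prepopulate whole range" rule above.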



2015-07-24 14:39:45

by Eric B Munson

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Thu, 23 Jul 2015, Ralf Baechle wrote:

> On Wed, Jul 22, 2015 at 10:15:01AM -0400, Eric B Munson wrote:
>
> > >
> > > You haven't wired it up properly on powerpc, but I haven't mentioned it because
> > > I'd rather we did it.
> > >
> > > cheers
> >
> > It looks like I will be spinning a V5, so I will drop all but the x86
> > system calls additions in that version.
>
> The MIPS bits are looking good however, so
>
> Acked-by: Ralf Baechle <[email protected]>
>
> With my ack, will you keep them or maybe carry them as a separate patch?

I will keep the MIPS additions as a separate patch in the series, though
I have dropped two of the new syscalls after some discussion. So I will
not include your ack on the new patch.

Eric



2015-07-24 15:46:54

by Guenter Roeck

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On 07/24/2015 07:39 AM, Eric B Munson wrote:
> On Thu, 23 Jul 2015, Ralf Baechle wrote:
>
>> On Wed, Jul 22, 2015 at 10:15:01AM -0400, Eric B Munson wrote:
>>
>>>>
>>>> You haven't wired it up properly on powerpc, but I haven't mentioned it because
>>>> I'd rather we did it.
>>>>
>>>> cheers
>>>
>>> It looks like I will be spinning a V5, so I will drop all but the x86
>>> system calls additions in that version.
>>
>> The MIPS bits are looking good however, so
>>
>> Acked-by: Ralf Baechle <[email protected]>
>>
>> With my ack, will you keep them or maybe carry them as a separate patch?
>
> I will keep the MIPS additions as a separate patch in the series, though
> I have dropped two of the new syscalls after some discussion. So I will
> not include your ack on the new patch.
>
> Eric
>

Hi Eric,

next-20150724 still has some failures due to this patch set. Are those
being looked at (I know parisc builds fail, but there may be others)?

Thanks,
Guenter

2015-07-24 15:53:22

by Eric B Munson

Subject: Re: [PATCH V4 2/6] mm: mlock: Add new mlock, munlock, and munlockall system calls

On Fri, 24 Jul 2015, Guenter Roeck wrote:

> On 07/24/2015 07:39 AM, Eric B Munson wrote:
> >On Thu, 23 Jul 2015, Ralf Baechle wrote:
> >
> >>On Wed, Jul 22, 2015 at 10:15:01AM -0400, Eric B Munson wrote:
> >>
> >>>>
> >>>>You haven't wired it up properly on powerpc, but I haven't mentioned it because
> >>>>I'd rather we did it.
> >>>>
> >>>>cheers
> >>>
> >>>It looks like I will be spinning a V5, so I will drop all but the x86
> >>>system calls additions in that version.
> >>
> >>The MIPS bits are looking good however, so
> >>
> >>Acked-by: Ralf Baechle <[email protected]>
> >>
> >>With my ack, will you keep them or maybe carry them as a separate patch?
> >
> >I will keep the MIPS additions as a separate patch in the series, though
> >I have dropped two of the new syscalls after some discussion. So I will
> >not include your ack on the new patch.
> >
> >Eric
> >
>
> Hi Eric,
>
> next-20150724 still has some failures due to this patch set. Are those
> being looked at (I know parisc builds fail, but there may be others)?
>
> Thanks,
> Guenter

Guenter,

Yes, the next respin will drop all new arch syscall entries except
x86[_64] and MIPS. I will leave it up to arch maintainers to add the
entries.

Eric

