mlock() allows a user to prevent pages of program memory from being
paged out, but this comes at the cost of faulting in the entire mapping
when it is locked. For large mappings where the entire area is not
necessary this is not ideal. Instead of forcing all locked pages to be
present when they are locked, this set creates a middle ground: pages
are marked to be placed on the unevictable LRU (locked) when they are
first used, but they are not faulted in by the mlock call.
This series introduces a new mlock2() system call that takes a flags
argument along with the start address and size. This flags argument
gives the caller the ability to request memory be locked in the
traditional way, or to be locked only after each page is faulted in.
New munlock2() and munlockall2() calls are added which give the caller
a way to specify which flags are to be cleared. A new MCL_ONFAULT flag
is added to mirror the lock-on-fault behavior of mlock2() in
mlockall(). Finally, a MAP_LOCKONFAULT flag for mmap() is added that
allows a user to specify that the covered area should not be paged out,
but only after a page has been touched for the first time.
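As a purely illustrative sketch (not part of the series), a caller
could use the new interface roughly as follows. There is no libc
wrapper yet, so the sketch invokes the call via syscall() with the
x86_64 number added by this series and defines the proposed
MLOCK_ONFAULT value locally:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x02      /* asm-generic value proposed by this series */
#endif
#ifndef __NR_mlock2
#define __NR_mlock2 323         /* x86_64 number added by this series */
#endif

int main(void)
{
        size_t len = 8 * getpagesize();
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /* Lock the range but do not prefault it. */
        if (syscall(__NR_mlock2, buf, len, MLOCK_ONFAULT)) {
                perror("mlock2");
                return 1;
        }

        /* Only the pages that are touched are faulted in and become unevictable. */
        memset(buf, 0, getpagesize());
        return 0;
}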
There are two main use cases that this set covers. The first is the
security-focused mlock case: a buffer is needed that cannot be written
to swap. The maximum size is known, but on average the memory used is
significantly less than this maximum. With lock on fault, the buffer
is guaranteed never to be paged out, yet the maximum size is not
consumed every time such a buffer is created.
The second use case is focused on performance. Portions of a large
file are needed and we want to keep the used portions in memory once
accessed. This is the case for large graphical models where the path
through the graph is not known until run time. The entire graph is
unlikely to be used in a given invocation, but once a node has been
used it needs to stay resident for further processing. Given these
constraints we have a number of options. We can potentially waste a
large amount of memory by mlocking the entire region (this can also
cause a significant stall at startup as the entire file is read in).
We can mlock every page as it is accessed without tracking whether the
page is already resident, but this introduces a large overhead for each
access. The third option is mapping the entire region with PROT_NONE
and using a SIGSEGV handler to mprotect(PROT_READ) and mlock() the
needed page. Doing this a page at a time adds a significant performance
penalty. Batching can be used to mitigate this overhead, but in order
to safely avoid trying to mprotect pages outside of the mapping, the
boundaries of each mapping used in this way must be tracked and
available to the signal handler. This is precisely what the mm
subsystem in the kernel should already be doing.
For mlock2(MLOCK_ONFAULT) and mmap(MAP_LOCKONFAULT) the user is charged
against RLIMIT_MEMLOCK as if mlock2(MLOCK_LOCKED) or mmap(MAP_LOCKED)
was used, i.e. when the VMA is created, not when the pages are faulted
in. For mlockall(MCL_ONFAULT) the user is charged as if MCL_FUTURE was
used. This decision was made to keep the accounting checks out of the
page fault path.
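One consequence worth spelling out, as a sketch only (it assumes an
unprivileged process with a finite RLIMIT_MEMLOCK and the x86_64
syscall number from this series): an over-limit request fails at
mlock2() time even though nothing has been faulted in.

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x02
#endif
#ifndef __NR_mlock2
#define __NR_mlock2 323         /* x86_64 number added by this series */
#endif

int main(void)
{
        struct rlimit rl;
        size_t len;
        void *map;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) || rl.rlim_cur == RLIM_INFINITY)
                return 1;

        /* One page more than the lock limit. */
        len = rl.rlim_cur + getpagesize();
        map = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED)
                return 1;

        /*
         * The whole range is charged when the VMA is marked, so this is
         * expected to fail with ENOMEM even though nothing was faulted in.
         */
        if (syscall(__NR_mlock2, map, len, MLOCK_ONFAULT) && errno == ENOMEM)
                printf("charged at mlock2() time, as described above\n");
        return 0;
}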
To illustrate the benefit of this set I wrote a test program that mmaps
a 5 GB file filled with random data and then makes 15,000,000 accesses
to random addresses in that mapping. The test program was run 20 times
for each setup. Results are reported for two program portions, setup
and execution. The setup phase is calling mmap and optionally mlock on
the entire region. For most experiments this is trivial, but it
highlights the cost of faulting in the entire region. Results are
averages across the 20 runs in milliseconds.
mmap with mlock2(MLOCK_LOCKED) on entire range:
Setup avg: 8228.666
Processing avg: 8274.257
mmap with mlock2(MLOCK_LOCKED) before each access:
Setup avg: 0.113
Processing avg: 90993.552
mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value of max_map_count, this gets ENOMEM as I attempt
to change the permissions; after raising the sysctl significantly I get:
Setup avg: 0.058
Processing avg: 69488.073
mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg: 0.068
Processing avg: 38204.116
mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg: 0.044
Processing avg: 29671.180
mmap with mlock2(MLOCK_ONFAULT) on entire range:
Setup avg: 0.189
Processing avg: 17904.899
The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping. The
first step covers the page that caused the fault, as we know that it
will be possible to lock it. The second step speculatively tries to
mlock and mprotect the batch size - 1 pages that follow. There may be a
clever way to avoid this without having the program track each mapping
covered by this handler in a globally accessible structure, but I could
not find it. It should be noted that with a large enough batch size
this two-step fault handler can still cause the program to crash if it
reaches far beyond the end of the mapping.
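For reference, the handler looked roughly like the following sketch.
This is illustrative only, not the actual test program; the BATCH
constant and the install_handler() helper are hypothetical names:

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BATCH 16        /* pages handled per fault, as in the 16 page run above */

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
        long psize = sysconf(_SC_PAGESIZE);
        char *page = (char *)((uintptr_t)info->si_addr & ~((uintptr_t)psize - 1));

        (void)sig;
        (void)ctx;

        /* Step 1: the faulting page is known to be inside the mapping. */
        mprotect(page, psize, PROT_READ);
        mlock(page, psize);

        /*
         * Step 2: speculatively handle the next BATCH - 1 pages.  This can
         * reach past the end of the mapping, which is why the boundaries
         * would otherwise have to be tracked by hand.
         */
        mprotect(page + psize, (BATCH - 1) * psize, PROT_READ);
        mlock(page + psize, (BATCH - 1) * psize);
}

void install_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
}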
These results show that if the developer knows that a majority of the
mapping will be used, it is better to try to fault it in all at once;
otherwise lock on fault is significantly faster.
The performance cost of these patches is minimal on the two benchmarks
I have tested (stream and kernbench). The following are the average
values across 20 runs of stream and 10 runs of kernbench after a warmup
run whose results were discarded.
Avg throughput in MB/s from stream using 1000000 element arrays
Test 4.2-rc1 4.2-rc1+lock-on-fault
Copy: 10,566.5 10,421
Scale: 10,685 10,503.5
Add: 12,044.1 11,814.2
Triad: 12,064.8 11,846.3
Kernbench optimal load
4.2-rc1 4.2-rc1+lock-on-fault
Elapsed Time 78.453 78.991
User Time 64.2395 65.2355
System Time 9.7335 9.7085
Context Switches 22211.5 22412.1
Sleeps 14965.3 14956.1
---
Changes from V2:
Added new system calls for mlock, munlock, and munlockall with added
flags arguments for controlling how memory is locked or unlocked.
Eric B Munson (5):
mm: mlock: Refactor mlock, munlock, and munlockall code
mm: mlock: Add new mlock, munlock, and munlockall system calls
mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it
mm: mmap: Add mmap flag to request VM_LOCKONFAULT
selftests: vm: Add tests for lock on fault
arch/alpha/include/asm/unistd.h | 2 +-
arch/alpha/include/uapi/asm/mman.h | 5 +
arch/alpha/kernel/systbls.S | 3 +
arch/arm/kernel/calls.S | 3 +
arch/arm64/include/asm/unistd32.h | 6 +
arch/avr32/kernel/syscall_table.S | 3 +
arch/blackfin/mach-common/entry.S | 3 +
arch/cris/arch-v10/kernel/entry.S | 3 +
arch/cris/arch-v32/kernel/entry.S | 3 +
arch/frv/kernel/entry.S | 3 +
arch/ia64/kernel/entry.S | 3 +
arch/m32r/kernel/entry.S | 3 +
arch/m32r/kernel/syscall_table.S | 3 +
arch/m68k/kernel/syscalltable.S | 3 +
arch/microblaze/kernel/syscall_table.S | 3 +
arch/mips/include/uapi/asm/mman.h | 8 +
arch/mips/kernel/scall32-o32.S | 3 +
arch/mips/kernel/scall64-64.S | 3 +
arch/mips/kernel/scall64-n32.S | 3 +
arch/mips/kernel/scall64-o32.S | 3 +
arch/mn10300/kernel/entry.S | 3 +
arch/parisc/include/uapi/asm/mman.h | 5 +
arch/powerpc/include/uapi/asm/mman.h | 5 +
arch/s390/kernel/syscalls.S | 3 +
arch/sh/kernel/syscalls_32.S | 3 +
arch/sparc/include/uapi/asm/mman.h | 5 +
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 +-
arch/tile/include/uapi/asm/mman.h | 8 +
arch/x86/entry/syscalls/syscall_32.tbl | 3 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
arch/xtensa/include/uapi/asm/mman.h | 8 +
arch/xtensa/include/uapi/asm/unistd.h | 10 +-
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 1 +
include/linux/mman.h | 3 +-
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/mman.h | 5 +
include/uapi/asm-generic/unistd.h | 8 +-
kernel/sys_ni.c | 3 +
mm/mlock.c | 135 +++++++++--
mm/mmap.c | 6 +-
mm/swap.c | 3 +-
tools/testing/selftests/vm/Makefile | 2 +
tools/testing/selftests/vm/lock-on-fault.c | 342 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 ++++
tools/testing/selftests/vm/run_vmtests | 22 ++
47 files changed, 681 insertions(+), 32 deletions(-)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
--
1.9.1
With the exception of mlockall(), none of the mlock family of system
calls take a flags argument, so they are not extensible. A later patch
in this set will extend the mlock family to support a middle ground
between pages that are locked and faulted in immediately and unlocked
pages. To pave the way for the new system calls, the code needs some
reorganization so that all the actual entry points have to do is check
the input and translate it to VMA flags.
This patch mostly moves code around, with the exception of the new
do_munlockall(). All three functions are changed to support a follow-on
patch which introduces new system calls that allow the user to specify
flags for these calls.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/mlock.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 46 insertions(+), 11 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..8e52c23 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -553,7 +553,8 @@ out:
return ret;
}
-static int do_mlock(unsigned long start, size_t len, int on)
+static int apply_vma_flags(unsigned long start, size_t len,
+ vm_flags_t flags, bool add_flags)
{
unsigned long nstart, end, tmp;
struct vm_area_struct * vma, * prev;
@@ -579,9 +580,11 @@ static int do_mlock(unsigned long start, size_t len, int on)
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
- newflags = vma->vm_flags & ~VM_LOCKED;
- if (on)
- newflags |= VM_LOCKED;
+ newflags = vma->vm_flags;
+ if (add_flags)
+ newflags |= flags;
+ else
+ newflags &= ~flags;
tmp = vma->vm_end;
if (tmp > end)
@@ -604,7 +607,7 @@ static int do_mlock(unsigned long start, size_t len, int on)
return error;
}
-SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
unsigned long locked;
unsigned long lock_limit;
@@ -628,7 +631,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = do_mlock(start, len, 1);
+ error = apply_vma_flags(start, len, flags, true);
up_write(&current->mm->mmap_sem);
if (error)
@@ -640,7 +643,12 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return 0;
}
-SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+{
+ return do_mlock(start, len, VM_LOCKED);
+}
+
+static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
{
int ret;
@@ -648,20 +656,23 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
start &= PAGE_MASK;
down_write(&current->mm->mmap_sem);
- ret = do_mlock(start, len, 0);
+ ret = apply_vma_flags(start, len, flags, false);
up_write(&current->mm->mmap_sem);
return ret;
}
+SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+{
+ return do_munlock(start, len, VM_LOCKED);
+}
+
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
if (flags & MCL_FUTURE)
current->mm->def_flags |= VM_LOCKED;
- else
- current->mm->def_flags &= ~VM_LOCKED;
if (flags == MCL_FUTURE)
goto out;
@@ -711,12 +722,36 @@ out:
return ret;
}
+static int do_munlockall(int flags)
+{
+ struct vm_area_struct * vma, * prev = NULL;
+
+ if (flags & MCL_FUTURE)
+ current->mm->def_flags &= ~VM_LOCKED;
+ if (flags == MCL_FUTURE)
+ goto out;
+
+ for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+ vm_flags_t newflags;
+
+ newflags = vma->vm_flags;
+ if (flags & MCL_CURRENT)
+ newflags &= ~VM_LOCKED;
+
+ /* Ignore errors */
+ mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
+ cond_resched_rcu_qs();
+ }
+out:
+ return 0;
+}
+
SYSCALL_DEFINE0(munlockall)
{
int ret;
down_write(&current->mm->mmap_sem);
- ret = do_mlockall(0);
+ ret = do_munlockall(MCL_CURRENT | MCL_FUTURE);
up_write(&current->mm->mmap_sem);
return ret;
}
--
1.9.1
With the refactored mlock code, introduce new system calls for mlock,
munlock, and munlockall. The new calls will allow the user to specify
what lock states are being added or cleared. mlock2 and munlock2 are
trivial at the moment, but a follow-on patch will add a new mlock state,
making them useful.
munlockall2 addresses a limitation of the current implementation. If a
user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
that MCL_FUTURE should be removed, they would have to call munlockall()
followed by mlockall(MCL_CURRENT) which could potentially be very
expensive. The new munlockall2 system call allows a user to simply
clear the MCL_FUTURE flag.
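As an illustration (sketch only; there is no libc wrapper, so
munlockall2 is invoked via syscall() with the x86_64 number added
below):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_munlockall2
#define __NR_munlockall2 325    /* x86_64 number added by this series */
#endif

int main(void)
{
        /* Lock everything mapped now and everything mapped in the future. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
                return 1;

        /* ... later: stop locking new mappings, keep current ones locked. */
        if (syscall(__NR_munlockall2, MCL_FUTURE))
                return 1;

        return 0;
}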
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/asm/unistd.h | 2 +-
arch/alpha/include/uapi/asm/mman.h | 2 ++
arch/alpha/kernel/systbls.S | 3 +++
arch/arm/kernel/calls.S | 3 +++
arch/arm64/include/asm/unistd32.h | 6 ++++++
arch/avr32/kernel/syscall_table.S | 3 +++
arch/blackfin/mach-common/entry.S | 3 +++
arch/cris/arch-v10/kernel/entry.S | 3 +++
arch/cris/arch-v32/kernel/entry.S | 3 +++
arch/frv/kernel/entry.S | 3 +++
arch/ia64/kernel/entry.S | 3 +++
arch/m32r/kernel/entry.S | 3 +++
arch/m32r/kernel/syscall_table.S | 3 +++
arch/m68k/kernel/syscalltable.S | 3 +++
arch/microblaze/kernel/syscall_table.S | 3 +++
arch/mips/include/uapi/asm/mman.h | 5 +++++
arch/mips/kernel/scall32-o32.S | 3 +++
arch/mips/kernel/scall64-64.S | 3 +++
arch/mips/kernel/scall64-n32.S | 3 +++
arch/mips/kernel/scall64-o32.S | 3 +++
arch/mn10300/kernel/entry.S | 3 +++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/powerpc/include/uapi/asm/mman.h | 2 ++
arch/s390/kernel/syscalls.S | 3 +++
arch/sh/kernel/syscalls_32.S | 3 +++
arch/sparc/include/uapi/asm/mman.h | 2 ++
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 ++--
arch/tile/include/uapi/asm/mman.h | 5 +++++
arch/x86/entry/syscalls/syscall_32.tbl | 3 +++
arch/x86/entry/syscalls/syscall_64.tbl | 3 +++
arch/xtensa/include/uapi/asm/mman.h | 5 +++++
arch/xtensa/include/uapi/asm/unistd.h | 10 ++++++++--
include/linux/syscalls.h | 4 ++++
include/uapi/asm-generic/mman.h | 2 ++
include/uapi/asm-generic/unistd.h | 8 +++++++-
kernel/sys_ni.c | 3 +++
mm/mlock.c | 28 ++++++++++++++++++++++++++++
38 files changed, 148 insertions(+), 7 deletions(-)
diff --git a/arch/alpha/include/asm/unistd.h b/arch/alpha/include/asm/unistd.h
index a56e608..1d09392 100644
--- a/arch/alpha/include/asm/unistd.h
+++ b/arch/alpha/include/asm/unistd.h
@@ -3,7 +3,7 @@
#include <uapi/asm/unistd.h>
-#define NR_SYSCALLS 514
+#define NR_SYSCALLS 517
#define __ARCH_WANT_OLD_READDIR
#define __ARCH_WANT_STAT64
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..ec72436 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -38,6 +38,8 @@
#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 9b62e3f..04d1cce 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -532,6 +532,9 @@ sys_call_table:
.quad sys_getrandom
.quad sys_memfd_create
.quad sys_execveat
+ .quad sys_mlock2
+ .quad sys_munlock2 /* 515 */
+ .quad sys_munlockall2
.size sys_call_table, . - sys_call_table
.type sys_call_table, @object
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 05745eb..514e77b 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -397,6 +397,9 @@
/* 385 */ CALL(sys_memfd_create)
CALL(sys_bpf)
CALL(sys_execveat)
+ CALL(sys_mlock2)
+ CALL(sys_munlock2)
+/* 400 */ CALL(sys_munlockall2)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cef934a..318072aa 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -797,3 +797,9 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 387
__SYSCALL(__NR_execveat, compat_sys_execveat)
+#define __NR_mlock2 388
+__SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_munlock2 389
+__SYSCALL(__NR_munlock2, sys_munlock2)
+#define __NR_munlockall2 390
+__SYSCALL(__NR_munlockall2, sys_munlockall2)
diff --git a/arch/avr32/kernel/syscall_table.S b/arch/avr32/kernel/syscall_table.S
index c3b593b..83928ab 100644
--- a/arch/avr32/kernel/syscall_table.S
+++ b/arch/avr32/kernel/syscall_table.S
@@ -334,4 +334,7 @@ sys_call_table:
.long sys_memfd_create
.long sys_bpf
.long sys_execveat /* 320 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
.long sys_ni_syscall /* r8 is saturated at nr_syscalls */
diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
index 8d9431e..5d83587 100644
--- a/arch/blackfin/mach-common/entry.S
+++ b/arch/blackfin/mach-common/entry.S
@@ -1704,6 +1704,9 @@ ENTRY(_sys_call_table)
.long _sys_memfd_create /* 390 */
.long _sys_bpf
.long _sys_execveat
+ .long _sys_mlock2
+ .long _sys_munlock2
+ .long _sys_munlockall2 /* 395 */
.rept NR_syscalls-(.-_sys_call_table)/4
.long _sys_ni_syscall
diff --git a/arch/cris/arch-v10/kernel/entry.S b/arch/cris/arch-v10/kernel/entry.S
index 81570fc..d0ce531 100644
--- a/arch/cris/arch-v10/kernel/entry.S
+++ b/arch/cris/arch-v10/kernel/entry.S
@@ -955,6 +955,9 @@ sys_call_table:
.long sys_process_vm_writev
.long sys_kcmp /* 350 */
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
/*
* NOTE!! This doesn't have to be exact - we just have
diff --git a/arch/cris/arch-v32/kernel/entry.S b/arch/cris/arch-v32/kernel/entry.S
index 026a0b2..7f50a0b 100644
--- a/arch/cris/arch-v32/kernel/entry.S
+++ b/arch/cris/arch-v32/kernel/entry.S
@@ -875,6 +875,9 @@ sys_call_table:
.long sys_process_vm_writev
.long sys_kcmp /* 350 */
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
/*
* NOTE!! This doesn't have to be exact - we just have
diff --git a/arch/frv/kernel/entry.S b/arch/frv/kernel/entry.S
index dfcd263..ee605a0 100644
--- a/arch/frv/kernel/entry.S
+++ b/arch/frv/kernel/entry.S
@@ -1515,5 +1515,8 @@ sys_call_table:
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_setns
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2 /* 340 */
syscall_table_size = (. - sys_call_table)
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index ae0de7b..3ef4457 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1768,5 +1768,8 @@ sys_call_table:
data8 sys_memfd_create // 1340
data8 sys_bpf
data8 sys_execveat
+ data8 sys_mlock2
+ data8 sys_munlock2
+ data8 sys_munlockall2 // 1345
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
diff --git a/arch/m32r/kernel/entry.S b/arch/m32r/kernel/entry.S
index c639bfa..4f7f2e2 100644
--- a/arch/m32r/kernel/entry.S
+++ b/arch/m32r/kernel/entry.S
@@ -76,6 +76,9 @@
#define sys_munlock sys_ni_syscall
#define sys_mlockall sys_ni_syscall
#define sys_munlockall sys_ni_syscall
+#define sys_mlock2 sys_ni_syscall
+#define sys_munlock2 sys_ni_syscall
+#define sys_munlockall2 sys_ni_syscall
#define sys_mremap sys_ni_syscall
#define sys_mincore sys_ni_syscall
#define sys_remap_file_pages sys_ni_syscall
diff --git a/arch/m32r/kernel/syscall_table.S b/arch/m32r/kernel/syscall_table.S
index f365c19..9918c3e 100644
--- a/arch/m32r/kernel/syscall_table.S
+++ b/arch/m32r/kernel/syscall_table.S
@@ -325,3 +325,6 @@ ENTRY(sys_call_table)
.long sys_eventfd
.long sys_fallocate
.long sys_setns /* 325 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
diff --git a/arch/m68k/kernel/syscalltable.S b/arch/m68k/kernel/syscalltable.S
index a0ec430..7963c03 100644
--- a/arch/m68k/kernel/syscalltable.S
+++ b/arch/m68k/kernel/syscalltable.S
@@ -376,4 +376,7 @@ ENTRY(sys_call_table)
.long sys_memfd_create
.long sys_bpf
.long sys_execveat /* 355 */
+ .long sys_mlock2
+ .long sys_munlock2
+ .long sys_munlockall2
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 29c8568..6e4b0fe 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -389,3 +389,6 @@ ENTRY(sys_call_table)
.long sys_memfd_create
.long sys_bpf
.long sys_execveat
+ .long sys_mlock2
+ .long sys_munlock2 /* 390 */
+ .long sys_munlockall2
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..67c1cdf 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -62,6 +62,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 6e8de80..7af6066 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -582,3 +582,6 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index ad4d4463..0aa2742 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -436,4 +436,7 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 5315 */
PTR sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 446cc65..eb21955 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -429,4 +429,7 @@ EXPORT(sysn32_call_table)
PTR sys_memfd_create
PTR sys_bpf
PTR compat_sys_execveat /* 6320 */
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index d07b210..ee59c82 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -567,4 +567,7 @@ EXPORT(sys32_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR compat_sys_execveat
+ PTR sys_mlock2
+ PTR sys_munlock2
+ PTR sys_munlockall2
.size sys32_call_table,.-sys32_call_table
diff --git a/arch/mn10300/kernel/entry.S b/arch/mn10300/kernel/entry.S
index 177d61d..d34adf5 100644
--- a/arch/mn10300/kernel/entry.S
+++ b/arch/mn10300/kernel/entry.S
@@ -767,6 +767,9 @@ ENTRY(sys_call_table)
.long sys_perf_event_open
.long sys_recvmmsg
.long sys_setns
+ .long sys_mlock2 /* 340 */
+ .long sys_munlock2
+ .long sys_munlockall2
nr_syscalls=(.-sys_call_table)/4
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..daab994 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -32,6 +32,8 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..189e85f 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -23,6 +23,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 1acad02..f6d81d6 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -363,3 +363,6 @@ SYSCALL(sys_bpf,compat_sys_bpf)
SYSCALL(sys_s390_pci_mmio_write,compat_sys_s390_pci_mmio_write)
SYSCALL(sys_s390_pci_mmio_read,compat_sys_s390_pci_mmio_read)
SYSCALL(sys_execveat,compat_sys_execveat)
+SYSCALL(sys_mlock2,compat_sys_mlock2) /* 355 */
+SYSCALL(sys_munlock2,compat_sys_munlock2)
+SYSCALL(sys_munlockall2,compat_sys_munlockall2)
diff --git a/arch/sh/kernel/syscalls_32.S b/arch/sh/kernel/syscalls_32.S
index 734234b..6d07867 100644
--- a/arch/sh/kernel/syscalls_32.S
+++ b/arch/sh/kernel/syscalls_32.S
@@ -386,3 +386,6 @@ ENTRY(sys_call_table)
.long sys_process_vm_writev
.long sys_kcmp
.long sys_finit_module
+ .long sys_mlock2
+ .long sys_munlock2 /* 370 */
+ .long sys_munlockall2
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..13d51be 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -18,6 +18,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index e31a905..72b68d4 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -87,4 +87,4 @@ sys_call_table:
/*335*/ .long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .long sys_execveat
+/*350*/ .long sys_execveat, sys_mlock2, sys_munlock2, sys_munlockall2
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index d72f76a..a96bfea 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -88,7 +88,7 @@ sys_call_table32:
.word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
.word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .word sys32_execveat
+/*350*/ .word sys32_execveat, sys_mlock2, sys_munlock2, sys_munlockall2
#endif /* CONFIG_COMPAT */
@@ -168,4 +168,4 @@ sys_call_table:
.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
.word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
-/*350*/ .word sys64_execveat
+/*350*/ .word sys64_execveat, sys_mlock2, sys_munlock2, sys_munlockall2
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..f69ce48 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -37,5 +37,10 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ef8187f..13ce950 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -365,3 +365,6 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 mlock2 sys_mlock2
+360 i386 munlock2 sys_munlock2
+361 i386 munlockall2 sys_munlockall2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9ef32d5..13b3cb1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -329,6 +329,9 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common mlock2 sys_mlock2
+324 common munlock2 sys_munlock2
+325 common munlockall2 sys_munlockall2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..11f354f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -75,6 +75,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/xtensa/include/uapi/asm/unistd.h b/arch/xtensa/include/uapi/asm/unistd.h
index b95c305..961913c 100644
--- a/arch/xtensa/include/uapi/asm/unistd.h
+++ b/arch/xtensa/include/uapi/asm/unistd.h
@@ -753,8 +753,14 @@ __SYSCALL(339, sys_memfd_create, 2)
__SYSCALL(340, sys_bpf, 3)
#define __NR_execveat 341
__SYSCALL(341, sys_execveat, 5)
-
-#define __NR_syscall_count 342
+#define __NR_mlock2 342
+__SYSCALL(342, sys_mlock2, 3)
+#define __NR_munlock2 343
+__SYSCALL(343, sys_munlock2, 3)
+#define __NR_munlockall2 344
+__SYSCALL(344, sys_munlockall2, 1)
+
+#define __NR_syscall_count 345
/*
* sysxtensa syscall handler
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b45c45b..aecab5d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);
+asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
+asmlinkage long sys_munlock2(unsigned long start, size_t len, int flags);
+asmlinkage long sys_munlockall2(int flags);
+
#endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..242436b 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -18,4 +18,6 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..e759fa2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,15 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_mlock2 282
+__SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_munlock2 283
+__SYSCALL(__NR_munlock2, sys_munlock2)
+#define __NR_munlockall2 284
+__SYSCALL(__NR_munlockall2, sys_munlockall2)
#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 285
/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..63529b7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,9 @@ cond_syscall(sys_mlock);
cond_syscall(sys_munlock);
cond_syscall(sys_mlockall);
cond_syscall(sys_munlockall);
+cond_syscall(sys_mlock2);
+cond_syscall(sys_munlock2);
+cond_syscall(sys_munlockall2);
cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8e52c23..d6e61d6 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -648,6 +648,14 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return do_mlock(start, len, VM_LOCKED);
}
+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+ if (!flags || flags & ~MLOCK_LOCKED)
+ return -EINVAL;
+
+ return do_mlock(start, len, VM_LOCKED);
+}
+
static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
{
int ret;
@@ -667,6 +675,13 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
return do_munlock(start, len, VM_LOCKED);
}
+SYSCALL_DEFINE3(munlock2, unsigned long, start, size_t, len, int, flags)
+{
+ if (!flags || flags & ~MLOCK_LOCKED)
+ return -EINVAL;
+ return do_munlock(start, len, VM_LOCKED);
+}
+
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
@@ -756,6 +771,19 @@ SYSCALL_DEFINE0(munlockall)
return ret;
}
+SYSCALL_DEFINE1(munlockall2, int, flags)
+{
+ int ret = -EINVAL;
+
+ if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE))
+ return ret;
+
+ down_write(&current->mm->mmap_sem);
+ ret = do_munlockall(flags);
+ up_write(&current->mm->mmap_sem);
+ return ret;
+}
+
/*
* Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
* shm segments) get accounted against the user_struct instead.
--
1.9.1
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large
statistical language model (it probably applies to other statistical or
graphical models as well). For the security example, consider any
application transacting in data that cannot be written to swap (credit
card data, medical records, etc).
This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. This can be done an area at a time via mlock2(MLOCK_ONFAULT),
or for the whole address space via mlockall(MCL_ONFAULT). These requests
can be undone via munlock2(MLOCK_ONFAULT) or munlockall2(MCL_ONFAULT).
To keep accounting checks out of the page fault path, users are billed
for the entire mapping lock as if MLOCK_LOCKED was used.
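A minimal usage sketch (assuming the asm-generic MCL_ONFAULT value
added below and a libc whose headers do not define it yet):

#include <sys/mman.h>
#include <unistd.h>

#ifndef MCL_ONFAULT
#define MCL_ONFAULT 4   /* asm-generic value proposed by this series */
#endif

int main(void)
{
        char *buf;

        /* Current and future mappings become lock-on-fault; nothing is prefaulted. */
        if (mlockall(MCL_ONFAULT))
                return 1;

        buf = mmap(NULL, 16 * getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Only this page is faulted in and placed on the unevictable LRU. */
        buf[0] = 1;
        return 0;
}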
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/uapi/asm/mman.h | 2 +
arch/mips/include/uapi/asm/mman.h | 2 +
arch/parisc/include/uapi/asm/mman.h | 2 +
arch/powerpc/include/uapi/asm/mman.h | 2 +
arch/sparc/include/uapi/asm/mman.h | 2 +
arch/tile/include/uapi/asm/mman.h | 3 ++
arch/xtensa/include/uapi/asm/mman.h | 2 +
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 1 +
include/uapi/asm-generic/mman.h | 2 +
mm/mlock.c | 72 ++++++++++++++++++++++++++----------
mm/mmap.c | 4 +-
mm/swap.c | 3 +-
13 files changed, 75 insertions(+), 23 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index ec72436..77ae8db 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -37,8 +37,10 @@
#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MCL_ONFAULT 32768 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 67c1cdf..71ed81d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -61,11 +61,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index daab994..c0871ce 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -31,8 +31,10 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 189e85f..f93f7eb 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -22,8 +22,10 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 13d51be..8cd2ebc 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -17,8 +17,10 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index f69ce48..acdd013 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -36,11 +36,14 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 11f354f..5725a15 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -74,11 +74,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091..38d69fc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_X86_INTEL_MPX
[ilog2(VM_MPX)] = "mp",
#endif
+ [ilog2(VM_LOCKONFAULT)] = "lf",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..ae40c7d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -127,6 +127,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
+#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 242436b..555aab0 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -17,7 +17,9 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index d6e61d6..d9414d6 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -502,11 +502,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff_t pgoff;
int nr_pages;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);
+ int lock = !!(newflags & (VM_LOCKED | VM_LOCKONFAULT));
if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
- goto out; /* don't set VM_LOCKED, don't count */
+ /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
+ goto out;
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
@@ -581,10 +582,12 @@ static int apply_vma_flags(unsigned long start, size_t len,
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
newflags = vma->vm_flags;
- if (add_flags)
+ if (add_flags) {
+ newflags &= ~(VM_LOCKED | VM_LOCKONFAULT);
newflags |= flags;
- else
+ } else {
newflags &= ~flags;
+ }
tmp = vma->vm_end;
if (tmp > end)
@@ -637,9 +640,12 @@ static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
if (error)
return error;
- error = __mm_populate(start, len, 0);
- if (error)
- return __mlock_posix_error_return(error);
+ if (flags & VM_LOCKED) {
+ error = __mm_populate(start, len, 0);
+ if (error)
+ return __mlock_posix_error_return(error);
+ }
+
return 0;
}
@@ -650,10 +656,14 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
- if (!flags || flags & ~MLOCK_LOCKED)
+ if (!flags || (flags & ~(MLOCK_LOCKED | MLOCK_ONFAULT)) ||
+ flags == (MLOCK_LOCKED | MLOCK_ONFAULT))
return -EINVAL;
- return do_mlock(start, len, VM_LOCKED);
+ if (flags & MLOCK_LOCKED)
+ return do_mlock(start, len, VM_LOCKED);
+
+ return do_mlock(start, len, VM_LOCKONFAULT);
}
static int do_munlock(unsigned long start, size_t len, vm_flags_t flags)
@@ -677,26 +687,41 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
SYSCALL_DEFINE3(munlock2, unsigned long, start, size_t, len, int, flags)
{
- if (!flags || flags & ~MLOCK_LOCKED)
+ vm_flags_t to_clear = 0;
+
+ if (!flags || flags & ~(MLOCK_LOCKED | MLOCK_ONFAULT))
return -EINVAL;
- return do_munlock(start, len, VM_LOCKED);
+
+ if (flags & MLOCK_LOCKED)
+ to_clear |= VM_LOCKED;
+ if (flags & MLOCK_ONFAULT)
+ to_clear |= VM_LOCKONFAULT;
+
+ return do_munlock(start, len, to_clear);
}
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ vm_flags_t to_add;
if (flags & MCL_FUTURE)
current->mm->def_flags |= VM_LOCKED;
if (flags == MCL_FUTURE)
goto out;
+ if (flags & MCL_ONFAULT) {
+ current->mm->def_flags |= VM_LOCKONFAULT;
+ to_add = VM_LOCKONFAULT;
+ } else {
+ to_add = VM_LOCKED;
+ }
+
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;
- newflags = vma->vm_flags & ~VM_LOCKED;
- if (flags & MCL_CURRENT)
- newflags |= VM_LOCKED;
+ newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
+ newflags |= to_add;
/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
@@ -711,7 +736,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
unsigned long lock_limit;
int ret = -EINVAL;
- if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
+ if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
+ (flags & (MCL_FUTURE | MCL_ONFAULT)) == (MCL_FUTURE | MCL_ONFAULT))
goto out;
ret = -EPERM;
@@ -740,18 +766,24 @@ out:
static int do_munlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ vm_flags_t to_clear = 0;
if (flags & MCL_FUTURE)
current->mm->def_flags &= ~VM_LOCKED;
+ if (flags & MCL_ONFAULT)
+ current->mm->def_flags &= ~VM_LOCKONFAULT;
if (flags == MCL_FUTURE)
goto out;
+ if (flags & MCL_CURRENT)
+ to_clear |= VM_LOCKED;
+ if (flags & MCL_ONFAULT)
+ to_clear |= VM_LOCKONFAULT;
+
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;
- newflags = vma->vm_flags;
- if (flags & MCL_CURRENT)
- newflags &= ~VM_LOCKED;
+ newflags = vma->vm_flags & ~to_clear;
/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
@@ -766,7 +798,7 @@ SYSCALL_DEFINE0(munlockall)
int ret;
down_write(&current->mm->mmap_sem);
- ret = do_munlockall(MCL_CURRENT | MCL_FUTURE);
+ ret = do_munlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
up_write(&current->mm->mmap_sem);
return ret;
}
@@ -775,7 +807,7 @@ SYSCALL_DEFINE1(munlockall2, int, flags)
{
int ret = -EINVAL;
- if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE))
+ if (!flags || flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT))
return ret;
down_write(&current->mm->mmap_sem);
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ad..eb970ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1232,8 +1232,8 @@ static inline int mlock_future_check(struct mm_struct *mm,
{
unsigned long locked, lock_limit;
- /* mlock MCL_FUTURE? */
- if (flags & VM_LOCKED) {
+ /* mlock MCL_FUTURE or MCL_ONFAULT? */
+ if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
locked = len >> PAGE_SHIFT;
locked += mm->locked_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK);
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f..3580a21 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -710,7 +710,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
{
VM_BUG_ON_PAGE(PageLRU(page), page);
- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+ if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
+ (vma->vm_flags & VM_SPECIAL)) {
SetPageActive(page);
lru_cache_add(page);
return;
--
1.9.1
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.
Now that we have the new VMA flag for the locked but not present state,
expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
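A minimal usage sketch (assuming the asm-generic MAP_LOCKONFAULT value
added below and a libc whose headers do not define it yet):

#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_LOCKONFAULT
#define MAP_LOCKONFAULT 0x80000 /* asm-generic value proposed by this series */
#endif

int main(void)
{
        size_t len = 8 * getpagesize();
        char *model;

        /* The mapping is created immediately; pages are locked as they are used. */
        model = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT, -1, 0);
        if (model == MAP_FAILED)
                return 1;

        model[0] = 1;   /* this page is now resident and unevictable */
        return 0;
}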
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/uapi/asm/mman.h | 1 +
arch/sparc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/mman.h | 3 ++-
include/uapi/asm-generic/mman.h | 1 +
mm/mmap.c | 2 +-
9 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 77ae8db..3f80ca4 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 71ed81d..905c1ea 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
/*
* Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index c0871ce..c4695f6 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index f93f7eb..40a3fda 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -31,5 +31,6 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
#endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 8cd2ebc..3d74ab7 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -26,6 +26,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 5725a15..689e1f2 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..437264b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
}
unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 555aab0..007b784 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
/* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
diff --git a/mm/mmap.c b/mm/mmap.c
index eb970ba..2dc4da3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
- if (flags & MAP_LOCKED)
+ if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
if (!can_do_mlock())
return -EPERM;
--
1.9.1
Test the new mmap() flag and the new mlockall() flag. These tests
ensure that pages are not faulted in until they are accessed, that the
pages are unevictable once faulted in, and that VMA splitting and
merging work with the new VM flag. The second test ensures that mlock
limits are respected. Note that the limit test needs to be run as a
normal user.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
tools/testing/selftests/vm/Makefile | 2 +
tools/testing/selftests/vm/lock-on-fault.c | 342 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 ++++
tools/testing/selftests/vm/run_vmtests | 22 ++
4 files changed, 413 insertions(+)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 231b9a0..01b4a90 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -5,7 +5,9 @@ BINARIES = compaction_test
BINARIES += hugepage-mmap
BINARIES += hugepage-shm
BINARIES += hugetlbfstest
+BINARIES += lock-on-fault
BINARIES += map_hugetlb
+BINARIES += on-fault-limit
BINARIES += thuge-gen
BINARIES += transhuge-stress
diff --git a/tools/testing/selftests/vm/lock-on-fault.c b/tools/testing/selftests/vm/lock-on-fault.c
new file mode 100644
index 0000000..6e0cdc7
--- /dev/null
+++ b/tools/testing/selftests/vm/lock-on-fault.c
@@ -0,0 +1,342 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+struct vm_boundaries {
+ unsigned long start;
+ unsigned long end;
+};
+
+static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
+{
+ FILE *file;
+ int ret = 1;
+ char line[1024] = {0};
+ char *end_addr;
+ char *stop;
+ unsigned long start;
+ unsigned long end;
+
+ if (!area)
+ return ret;
+
+ file = fopen("/proc/self/maps", "r");
+ if (!file) {
+ perror("fopen");
+ return ret;
+ }
+
+ memset(area, 0, sizeof(struct vm_boundaries));
+
+ while(fgets(line, 1024, file)) {
+ end_addr = strchr(line, '-');
+ if (!end_addr) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ *end_addr = '\0';
+ end_addr++;
+ stop = strchr(end_addr, ' ');
+ if (!stop) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ *stop = '\0';
+
+ sscanf(line, "%lx", &start);
+ sscanf(end_addr, "%lx", &end);
+
+ if (start <= addr && end > addr) {
+ area->start = start;
+ area->end = end;
+ ret = 0;
+ goto out;
+ }
+ }
+out:
+ fclose(file);
+ return ret;
+}
+
+static unsigned long get_pageflags(unsigned long addr)
+{
+ FILE *file;
+ unsigned long pfn;
+ unsigned long offset;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ offset = addr / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset, SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&pfn, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return pfn;
+}
+
+static unsigned long get_kpageflags(unsigned long pfn)
+{
+ unsigned long flags;
+ FILE *file;
+
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ if (fseek(file, pfn * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&flags, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return flags;
+}
+
+#define PRESENT_BIT 0x8000000000000000	/* bit 63 of a pagemap entry: page present */
+#define PFN_MASK 0x007FFFFFFFFFFFFF	/* bits 0-54 of a pagemap entry: PFN */
+#define UNEVICTABLE_BIT (1UL << 18)	/* KPF_UNEVICTABLE in /proc/kpageflags */
+
+static int test_mmap(int flags)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ void *map;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return 1;
+ }
+
+ /* Write something into the first page to ensure it is present */
+ *(char *)map = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* page2_flags should not be present */
+ if (page2_flags & PRESENT_BIT) {
+ printf("page map says 0x%lx\n", page2_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ /* page1_flags should be present */
+ if ((page1_flags & PRESENT_BIT) == 0) {
+ printf("page map says 0x%lx\n", page1_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ /* page1_flags now contains the entry from kpageflags for the first
+ * page, the unevictable bit should be set */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("kpageflags says 0x%lx\n", page1_flags);
+ printf("unevictable is 0x%lx\n", UNEVICTABLE_BIT);
+ return 1;
+ }
+
+ munmap(map, 2 * page_size);
+ return 0;
+}
+
+static int test_munlock(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page3_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ /* No pages should be present */
+ if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT) ||
+ (page3_flags & PRESENT_BIT)) {
+ printf("Page was made present by munlock()\n");
+ goto out;
+ }
+
+ /* Write something to each page so that they are faulted in */
+ *(char*)map = 1;
+ *(char*)(map + page_size) = 1;
+ *(char*)(map + page_size * 2) = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+ page3_flags = get_kpageflags(page3_flags & PFN_MASK);
+
+ /* Pages 1 and 3 should be unevictable */
+ if (!(page1_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page1\n");
+ goto out;
+ }
+ if (!(page3_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page3\n");
+ goto out;
+ }
+
+ /* Page 2 should not be unevictable */
+ if (page2_flags & UNEVICTABLE_BIT) {
+ printf("Unlocked page is still marked unevictable\n");
+ goto out;
+ }
+
+ ret = 0;
+
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+static int test_vma_management(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page_size = getpagesize();
+ struct vm_boundaries page1;
+ struct vm_boundaries page2;
+ struct vm_boundaries page3;
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /*
+ * Before we unlock a portion, we need to know that all three pages are
+ * in the same VMA. If they are not, we abort this test (note that this
+ * is not a failure)
+ */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("VMAs are not merged to start, aborting test\n");
+ ret = 0;
+ goto out;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* All three VMAs should be different */
+ if (page1.start == page2.start || page2.start == page3.start) {
+ printf("failed to split VMA for munlock\n");
+ goto out;
+ }
+
+ /* Now munlock the whole range, unlocking the remaining first and third pages, and check the VMAs again */
+ if (munlock(map, page_size * 3)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* Now all three VMAs should be the same */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("failed to merge VMAs after munlock\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_mlockall(int (test_function)(int flags))
+{
+ int ret = 1;
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ ret = test_function(MAP_PRIVATE | MAP_ANONYMOUS);
+ munlockall();
+ return ret;
+}
+
+#ifndef MAP_LOCKONFAULT
+#define MAP_LOCKONFAULT (MAP_HUGETLB << 1)
+#endif
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+ ret += test_mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_mmap);
+ ret += test_munlock(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_munlock);
+ ret += test_vma_management(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_vma_management);
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/on-fault-limit.c b/tools/testing/selftests/vm/on-fault-limit.c
new file mode 100644
index 0000000..ed2a109
--- /dev/null
+++ b/tools/testing/selftests/vm/on-fault-limit.c
@@ -0,0 +1,47 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_limit(void)
+{
+ int ret = 1;
+ struct rlimit lims;
+ void *map;
+
+ if (getrlimit(RLIMIT_MEMLOCK, &lims)) {
+ perror("getrlimit");
+ return ret;
+ }
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);
+ if (map != MAP_FAILED) {
+ printf("mmap should have failed, but didn't\n");
+ munmap(map, 2 * lims.rlim_max);
+ } else {
+ ret = 0;
+ }
+
+ munlockall();
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+
+ ret += test_limit();
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 49ece11..45241df 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -102,4 +102,26 @@ else
echo "[PASS]"
fi
+echo "--------------------"
+echo "running lock-on-fault"
+echo "--------------------"
+./lock-on-fault
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running on-fault-limit"
+echo "--------------------"
+sudo -u nobody ./on-fault-limit
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
exit $exitcode
--
1.9.1
On Tue, 7 Jul 2015 13:03:38 -0400 Eric B Munson <[email protected]> wrote:
> mlock() allows a user to control page out of program memory, but this
> comes at the cost of faulting in the entire mapping when it is
> allocated. For large mappings where the entire area is not necessary
> this is not ideal. Instead of forcing all locked pages to be present
> when they are allocated, this set creates a middle ground. Pages are
> marked to be placed on the unevictable LRU (locked) when they are first
> used, but they are not faulted in by the mlock call.
>
> This series introduces a new mlock() system call that takes a flags
> argument along with the start address and size. This flags argument
> gives the caller the ability to request memory be locked in the
> traditional way, or to be locked after the page is faulted in. New
> calls are added for munlock() and munlockall() which give the called a
> way to specify which flags are supposed to be cleared. A new MCL flag
> is added to mirror the lock on fault behavior from mlock() in
> mlockall(). Finally, a flag for mmap() is added that allows a user to
> specify that the covered are should not be paged out, but only after the
> memory has been used the first time.
Thanks for sticking with this. Adding new syscalls is a bit of a
hassle but I do think we end up with a better interface - the existing
mlock/munlock/mlockall interfaces just aren't appropriate for these
things.
I don't know whether these syscalls should be documented via new
manpages, or if we should instead add them to the existing
mlock/munlock/mlockall manpages. Michael, could you please advise?
On Tue, 7 Jul 2015 13:03:43 -0400 Eric B Munson <[email protected]> wrote:
> Test the mmap() flag and the mlockall() flag. These tests ensure that
> pages are not faulted in until they are accessed, that the pages are
> unevictable once faulted in, and that VMA splitting and merging work
> with the new VM flag. The second test ensures that mlock limits are
> respected. Note that the limit test needs to be run as a normal user.
So we don't have tests for the new syscalls?
I have renumbered those syscalls (added 1) because of sys_userfaultfd.
On Tue, Jul 07, 2015 at 01:03:40PM -0400, Eric B Munson wrote:
> With the refactored mlock code, introduce new system calls for mlock,
> munlock, and munlockall. The new calls will allow the user to specify
> what lock states are being added or cleared. mlock2 and munlock2 are
> trivial at the moment, but a follow on patch will add a new mlock state
> making them useful.
>
> munlock2 addresses a limitation of the current implementation. If a
> user calls mlockall(MCL_CURRENT | MCL_FUTURE) and then later decides
> that MCL_FUTURE should be removed, they would have to call munlockall()
> followed by mlockall(MCL_CURRENT) which could potentially be very
> expensive. The new munlockall2 system call allows a user to simply
> clear the MCL_FUTURE flag.
>
> Signed-off-by: Eric B Munson <[email protected]>
...
> diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
> index 1acad02..f6d81d6 100644
> --- a/arch/s390/kernel/syscalls.S
> +++ b/arch/s390/kernel/syscalls.S
> @@ -363,3 +363,6 @@ SYSCALL(sys_bpf,compat_sys_bpf)
> SYSCALL(sys_s390_pci_mmio_write,compat_sys_s390_pci_mmio_write)
> SYSCALL(sys_s390_pci_mmio_read,compat_sys_s390_pci_mmio_read)
> SYSCALL(sys_execveat,compat_sys_execveat)
> +SYSCALL(sys_mlock2,compat_sys_mlock2) /* 355 */
> +SYSCALL(sys_munlock2,compat_sys_munlock2)
> +SYSCALL(sys_munlockall2,compat_sys_munlockall2)
FWIW, you would also need to add matching lines to the two files
arch/s390/include/uapi/asm/unistd.h
arch/s390/kernel/compat_wrapper.c
so that the system call would be wired up on s390.
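For illustration only, the matching additions would look roughly like the
following. The numbers follow the /* 355 */ comment in the hunk above; the
wrapper macro arguments are assumptions rather than lines from a posted patch,
and the arch's syscall count would also need to be raised to match:

	/* arch/s390/include/uapi/asm/unistd.h */
	#define __NR_mlock2		355
	#define __NR_munlock2		356
	#define __NR_munlockall2	357

	/* arch/s390/kernel/compat_wrapper.c */
	COMPAT_SYSCALL_WRAP3(mlock2, unsigned long, start, size_t, len, int, flags);
	COMPAT_SYSCALL_WRAP3(munlock2, unsigned long, start, size_t, len, int, flags);
	COMPAT_SYSCALL_WRAP1(munlockall2, int, flags);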
On Wed, Jul 8, 2015 at 8:46 AM, Heiko Carstens
<[email protected]> wrote:
>> diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
>> index 1acad02..f6d81d6 100644
>> --- a/arch/s390/kernel/syscalls.S
>> +++ b/arch/s390/kernel/syscalls.S
>> @@ -363,3 +363,6 @@ SYSCALL(sys_bpf,compat_sys_bpf)
>> SYSCALL(sys_s390_pci_mmio_write,compat_sys_s390_pci_mmio_write)
>> SYSCALL(sys_s390_pci_mmio_read,compat_sys_s390_pci_mmio_read)
>> SYSCALL(sys_execveat,compat_sys_execveat)
>> +SYSCALL(sys_mlock2,compat_sys_mlock2) /* 355 */
>> +SYSCALL(sys_munlock2,compat_sys_munlock2)
>> +SYSCALL(sys_munlockall2,compat_sys_munlockall2)
>
> FWIW, you would also need to add matching lines to the two files
>
> arch/s390/include/uapi/asm/unistd.h
> arch/s390/kernel/compat_wrapper.c
>
> so that the system call would be wired up on s390.
Similar comment for m68k:
arch/m68k/include/asm/unistd.h
arch/m68k/include/uapi/asm/unistd.h
I think you best look at the last commits that added system calls, for all
architectures, to make sure you don't do partial updates.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Tue, Jul 07, 2015 at 01:03:40PM -0400, Eric B Munson wrote:
> diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
> index 05745eb..514e77b 100644
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -397,6 +397,9 @@
> /* 385 */ CALL(sys_memfd_create)
> CALL(sys_bpf)
> CALL(sys_execveat)
> + CALL(sys_mlock2)
> + CALL(sys_munlock2)
> +/* 400 */ CALL(sys_munlockall2)
s/400/390/
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index cef934a..318072aa 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -797,3 +797,9 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
> __SYSCALL(__NR_bpf, sys_bpf)
> #define __NR_execveat 387
> __SYSCALL(__NR_execveat, compat_sys_execveat)
> +#define __NR_mlock2 388
> +__SYSCALL(__NR_mlock2, sys_mlock2)
> +#define __NR_munlock2 389
> +__SYSCALL(__NR_munlock2, sys_munlock2)
> +#define __NR_munlockall2 390
> +__SYSCALL(__NR_munlockall2, sys_munlockall2)
These look fine.
Catalin
On Tue, 07 Jul 2015, Andrew Morton wrote:
> On Tue, 7 Jul 2015 13:03:38 -0400 Eric B Munson <[email protected]> wrote:
>
> > mlock() allows a user to control page out of program memory, but this
> > comes at the cost of faulting in the entire mapping when it is
> > allocated. For large mappings where the entire area is not necessary
> > this is not ideal. Instead of forcing all locked pages to be present
> > when they are allocated, this set creates a middle ground. Pages are
> > marked to be placed on the unevictable LRU (locked) when they are first
> > used, but they are not faulted in by the mlock call.
> >
> > This series introduces a new mlock() system call that takes a flags
> > argument along with the start address and size. This flags argument
> > gives the caller the ability to request memory be locked in the
> > traditional way, or to be locked after the page is faulted in. New
> > calls are added for munlock() and munlockall() which give the called a
> > way to specify which flags are supposed to be cleared. A new MCL flag
> > is added to mirror the lock on fault behavior from mlock() in
> > mlockall(). Finally, a flag for mmap() is added that allows a user to
> > specify that the covered are should not be paged out, but only after the
> > memory has been used the first time.
>
> Thanks for sticking with this. Adding new syscalls is a bit of a
> hassle but I do think we end up with a better interface - the existing
> mlock/munlock/mlockall interfaces just aren't appropriate for these
> things.
>
> I don't know whether these syscalls should be documented via new
> manpages, or if we should instead add them to the existing
> mlock/munlock/mlockall manpages. Michael, could you please advise?
>
Thanks for adding the series. I owe you several updates (getting the
new syscall right for all architectures and a set of tests for the new
syscalls). Would you prefer a new pair of patches or I update this set?
Eric
On Wed, 8 Jul 2015 09:23:02 -0400 Eric B Munson <[email protected]> wrote:
> > I don't know whether these syscalls should be documented via new
> > manpages, or if we should instead add them to the existing
> > mlock/munlock/mlockall manpages. Michael, could you please advise?
> >
>
> Thanks for adding the series. I owe you several updates (getting the
> new syscall right for all architectures and a set of tests for the new
> syscalls). Would you prefer a new pair of patches or I update this set?
It doesn't matter much. I guess a full update will be more convenient
at your end.
On Tue, 7 Jul 2015 13:03:41 -0400
Eric B Munson <[email protected]> wrote:
> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in. This can be done area at a time via the
> mlock2(MLOCK_ONFAULT) or the mlockall(MCL_ONFAULT) system calls. These
> calls can be undone via munlock2(MLOCK_ONFAULT) or
> munlockall2(MCL_ONFAULT).
Quick, possibly dumb question: I've been beating my head against these for
a little bit, and I can't figure out what's supposed to happen in this
case:
mlock2(addr, len, MLOCK_ONFAULT);
munlock2(addr, len, MLOCK_LOCKED);
It looks to me like it will clear VM_LOCKED without actually unlocking any
pages. Is that the intended result?
Thanks,
jon
On Wed, 08 Jul 2015, Jonathan Corbet wrote:
> On Tue, 7 Jul 2015 13:03:41 -0400
> Eric B Munson <[email protected]> wrote:
>
> > This patch introduces the ability to request that pages are not
> > pre-faulted, but are placed on the unevictable LRU when they are finally
> > faulted in. This can be done area at a time via the
> > mlock2(MLOCK_ONFAULT) or the mlockall(MCL_ONFAULT) system calls. These
> > calls can be undone via munlock2(MLOCK_ONFAULT) or
> > munlockall2(MCL_ONFAULT).
>
> Quick, possibly dumb question: I've been beating my head against these for
> a little bit, and I can't figure out what's supposed to happen in this
> case:
>
> mlock2(addr, len, MLOCK_ONFAULT);
> munlock2(addr, len, MLOCK_LOCKED);
>
> It looks to me like it will clear VM_LOCKED without actually unlocking any
> pages. Is that the intended result?
This is not quite right, what happens when you call munlock2(addr, len,
MLOCK_LOCKED); is we call apply_vma_flags(addr, len, VM_LOCKED, false).
The false argument means that we intend to clear the specified flags.
Here is the relevant snippet:
...
newflags = vma->vm_flags;
if (add_flags) {
newflags &= ~(VM_LOCKED | VM_LOCKONFAULT);
newflags |= flags;
} else {
newflags &= ~flags;
}
...
Note that when we are adding flags, we first clear both VM_LOCKED and
VM_LOCKONFAULT. This was done to match the behavior found in
mlockall(). When we are removing flags, we simply clear the specified
flag(s).
So in your example the state of the VMAs covered by addr and len would
remain unchanged.
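Put another way, the sequence from the question nets out like this (a
userspace sketch using the flag names proposed in this series):

	mlock2(addr, len, MLOCK_ONFAULT);	/* sets VM_LOCKONFAULT on the range */
	munlock2(addr, len, MLOCK_LOCKED);	/* clears only VM_LOCKED, which was
						 * never set here, so the VMAs are
						 * left with VM_LOCKONFAULT intact */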
It sounds like apply_vma_flags() needs a comment covering this topic, I
will include that in the set I am working on now.
Eric
On Wed, 8 Jul 2015 16:34:56 -0400
Eric B Munson <[email protected]> wrote:
> > Quick, possibly dumb question: I've been beating my head against these for
> > a little bit, and I can't figure out what's supposed to happen in this
> > case:
> >
> > mlock2(addr, len, MLOCK_ONFAULT);
> > munlock2(addr, len, MLOCK_LOCKED);
> >
> > It looks to me like it will clear VM_LOCKED without actually unlocking any
> > pages. Is that the intended result?
>
> This is not quite right, what happens when you call munlock2(addr, len,
> MLOCK_LOCKED); is we call apply_vma_flags(addr, len, VM_LOCKED, false).
From your explanation, it looks like what I said *was* right... what I was
missing was the fact that VM_LOCKED isn't set in the first place. So that
call would be a no-op, clearing a flag that's already cleared.
One other question...if I call mlock2(MLOCK_ONFAULT) on a range that
already has resident pages, I believe that those pages will not be locked
until they are reclaimed and faulted back in again, right? I suspect that
could be surprising to users.
Thanks,
jon
On Wed, 08 Jul 2015, Jonathan Corbet wrote:
> On Wed, 8 Jul 2015 16:34:56 -0400
> Eric B Munson <[email protected]> wrote:
>
> > > Quick, possibly dumb question: I've been beating my head against these for
> > > a little bit, and I can't figure out what's supposed to happen in this
> > > case:
> > >
> > > mlock2(addr, len, MLOCK_ONFAULT);
> > > munlock2(addr, len, MLOCK_LOCKED);
> > >
> > > It looks to me like it will clear VM_LOCKED without actually unlocking any
> > > pages. Is that the intended result?
> >
> > This is not quite right, what happens when you call munlock2(addr, len,
> > MLOCK_LOCKED); is we call apply_vma_flags(addr, len, VM_LOCKED, false).
>
> From your explanation, it looks like what I said *was* right...what I was
> missing was the fact that VM_LOCKED isn't set in the first place. So that
> call would be a no-op, clearing a flag that's already cleared.
Sorry, I misread the original. You are correct with the addition that
the call to munlock2(MLOCK_LOCKED) is a noop in this case.
>
> One other question...if I call mlock2(MLOCK_ONFAULT) on a range that
> already has resident pages, I believe that those pages will not be locked
> until they are reclaimed and faulted back in again, right? I suspect that
> could be surprising to users.
That is the case. I am looking into what it would take to find only the
present pages in a range and lock them, if that is the behavior that is
preferred I can include it in the updated series.
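For comparison, the semantics being discussed (lock only the pages that are
currently resident) can be approximated from userspace today with mincore();
a rough sketch, purely for illustration and not part of the series:

	#include <sys/mman.h>
	#include <unistd.h>
	#include <stdlib.h>

	/* Lock only the currently resident pages of [addr, addr + len). */
	static int mlock_present(char *addr, size_t len)
	{
		size_t page = getpagesize(), npages = len / page, i;
		unsigned char *vec = malloc(npages);

		if (!vec || mincore(addr, len, vec)) {
			free(vec);
			return -1;
		}
		for (i = 0; i < npages; i++)
			if (vec[i] & 1)		/* page i is resident */
				mlock(addr + i * page, page);
		free(vec);
		return 0;
	}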
>
> Thanks,
>
> jon
On Thu, 9 Jul 2015 14:46:35 -0400
Eric B Munson <[email protected]> wrote:
> > One other question...if I call mlock2(MLOCK_ONFAULT) on a range that
> > already has resident pages, I believe that those pages will not be locked
> > until they are reclaimed and faulted back in again, right? I suspect that
> > could be surprising to users.
>
> That is the case. I am looking into what it would take to find only the
> present pages in a range and lock them, if that is the behavior that is
> preferred I can include it in the updated series.
For whatever my $0.02 is worth, I think that should be done. Otherwise
the mlock2() interface is essentially nondeterministic; you'll never
really know if a specific page is locked or not.
Thanks,
jon
On Fri, 10 Jul 2015, Jonathan Corbet wrote:
> On Thu, 9 Jul 2015 14:46:35 -0400
> Eric B Munson <[email protected]> wrote:
>
> > > One other question...if I call mlock2(MLOCK_ONFAULT) on a range that
> > > already has resident pages, I believe that those pages will not be locked
> > > until they are reclaimed and faulted back in again, right? I suspect that
> > > could be surprising to users.
> >
> > That is the case. I am looking into what it would take to find only the
> > present pages in a range and lock them, if that is the behavior that is
> > preferred I can include it in the updated series.
>
> For whatever my $0.02 is worth, I think that should be done. Otherwise
> the mlock2() interface is essentially nondeterministic; you'll never
> really know if a specific page is locked or not.
>
> Thanks,
>
> jon
Okay, I likely won't have the new set out today then. This change is
more invasive. IIUC, I need an equivalent to __get_user_page() that skips
pages which are not present instead of faulting them in, plus the call
chain to get to it. Unless there is an easier way that I am missing.
Eric
On Tue, Jul 7, 2015 at 1:03 PM, Eric B Munson <[email protected]> wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> Now that we have the new VMA flag for the locked but not present state,
> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
An automatic bisection on arch/tile leads to this commit:
5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb is the first bad commit
commit 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb
Author: Eric B Munson <[email protected]>
Date: Thu Jul 16 10:09:22 2015 +1000
mm: mmap: add mmap flag to request VM_LOCKONFAULT
Fails with:
In file included from arch/tile/mm/init.c:24:
include/linux/mman.h: In function ‘calc_vm_flag_bits’:
include/linux/mman.h:90: error: ‘MAP_LOCKONFAULT’ undeclared (first
use in this function)
include/linux/mman.h:90: error: (Each undeclared identifier is
reported only once
include/linux/mman.h:90: error: for each function it appears in.)
In file included from arch/tile/mm/mmap.c:21:
include/linux/mman.h: In function ‘calc_vm_flag_bits’:
include/linux/mman.h:90: error: ‘MAP_LOCKONFAULT’ undeclared (first
use in this function)
include/linux/mman.h:90: error: (Each undeclared identifier is
reported only once
include/linux/mman.h:90: error: for each function it appears in.)
In file included from arch/tile/mm/fault.c:24:
include/linux/mman.h: In function ‘calc_vm_flag_bits’:
include/linux/mman.h:90: error: ‘MAP_LOCKONFAULT’ undeclared (first
use in this function)
include/linux/mman.h:90: error: (Each undeclared identifier is
reported only once
include/linux/mman.h:90: error: for each function it appears in.)
In file included from arch/tile/mm/hugetlbpage.c:27:
include/linux/mman.h: In function ‘calc_vm_flag_bits’:
include/linux/mman.h:90: error: ‘MAP_LOCKONFAULT’ undeclared (first
use in this function)
include/linux/mman.h:90: error: (Each undeclared identifier is
reported only once
include/linux/mman.h:90: error: for each function it appears in.)
make[1]: *** [arch/tile/mm/hugetlbpage.o] Error 1
http://kisskb.ellerman.id.au/kisskb/buildresult/12465365/
Paul.
--
On 07/18/2015 03:11 PM, Paul Gortmaker wrote:
> On Tue, Jul 7, 2015 at 1:03 PM, Eric B Munson<[email protected]> wrote:
>> >The cost of faulting in all memory to be locked can be very high when
>> >working with large mappings. If only portions of the mapping will be
>> >used this can incur a high penalty for locking.
>> >
>> >Now that we have the new VMA flag for the locked but not present state,
>> >expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> An automatic bisection on arch/tile leads to this commit:
>
> 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb is the first bad commit
> commit 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb
> Author: Eric B Munson<[email protected]>
> Date: Thu Jul 16 10:09:22 2015 +1000
>
> mm: mmap: add mmap flag to request VM_LOCKONFAULT
Eric, I'm happy to help with figuring out the tile issues.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
On 07/10/2015 06:19 PM, Eric B Munson wrote:
> On Fri, 10 Jul 2015, Jonathan Corbet wrote:
>
>> On Thu, 9 Jul 2015 14:46:35 -0400
>> Eric B Munson <[email protected]> wrote:
>>
>>>> One other question...if I call mlock2(MLOCK_ONFAULT) on a range that
>>>> already has resident pages, I believe that those pages will not be locked
>>>> until they are reclaimed and faulted back in again, right? I suspect that
>>>> could be surprising to users.
>>>
>>> That is the case. I am looking into what it would take to find only the
>>> present pages in a range and lock them, if that is the behavior that is
>>> preferred I can include it in the updated series.
>>
>> For whatever my $0.02 is worth, I think that should be done. Otherwise
>> the mlock2() interface is essentially nondeterministic; you'll never
>> really know if a specific page is locked or not.
>>
>> Thanks,
>>
>> jon
>
> Okay, I likely won't have the new set out today then. This change is
> more invasive. IIUC, I need an equivalent to __get_user_page() that skips
> pages which are not present instead of faulting them in, plus the call
> chain to get to it. Unless there is an easier way that I am missing.
IIRC, having the page PageMlocked and put on the unevictable list isn't
necessary to prevent it from being reclaimed; it just prevents it from
being scanned for reclaim in the first place. When attempting to unmap
the page, the vma flags are still checked, see the code in
try_to_unmap_one(). You should probably extend those checks to your new
VM_ flag as is done for VM_LOCKED, and then you shouldn't need to walk
the pages to mlock them (although doing so would probably still be better
for accounting accuracy).
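As a rough sketch of that suggestion (the surrounding try_to_unmap_one()
context is recalled from memory here, not quoted from this series), the
reclaim-side check would simply treat the new flag like VM_LOCKED:

	/* in try_to_unmap_one(), roughly: */
	if (!(flags & TTU_IGNORE_MLOCK)) {
		/*
		 * Leave pages of lock-on-fault VMAs mapped, just like
		 * VM_LOCKED ones, instead of reclaiming them.
		 */
		if (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT))
			goto out_mlock;
	}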
> Eric
>
On Mon, 20 Jul 2015, Chris Metcalf wrote:
> On 07/18/2015 03:11 PM, Paul Gortmaker wrote:
> >On Tue, Jul 7, 2015 at 1:03 PM, Eric B Munson<[email protected]> wrote:
> >>>The cost of faulting in all memory to be locked can be very high when
> >>>working with large mappings. If only portions of the mapping will be
> >>>used this can incur a high penalty for locking.
> >>>
> >>>Now that we have the new VMA flag for the locked but not present state,
> >>>expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> >An automatic bisection on arch/tile leads to this commit:
> >
> >5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb is the first bad commit
> >commit 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb
> >Author: Eric B Munson<[email protected]>
> >Date: Thu Jul 16 10:09:22 2015 +1000
> >
> > mm: mmap: add mmap flag to request VM_LOCKONFAULT
>
> Eric, I'm happy to help with figuring out the tile issues.
Thanks for the offer, I think I have it sorted in V4 (which I am
checking one last time before I post).
Eric
On 2015-07-21 11:37 AM, Eric B Munson wrote:
> On Mon, 20 Jul 2015, Chris Metcalf wrote:
>
>> On 07/18/2015 03:11 PM, Paul Gortmaker wrote:
>>> On Tue, Jul 7, 2015 at 1:03 PM, Eric B Munson<[email protected]> wrote:
>>>>> The cost of faulting in all memory to be locked can be very high when
>>>>> working with large mappings. If only portions of the mapping will be
>>>>> used this can incur a high penalty for locking.
>>>>>
>>>>> Now that we have the new VMA flag for the locked but not present state,
>>>>> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
>>> An automatic bisection on arch/tile leads to this commit:
>>>
>>> 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb is the first bad commit
>>> commit 5a5656f2c9b61c74c15f9ef3fa2e6513b6c237bb
>>> Author: Eric B Munson<[email protected]>
>>> Date: Thu Jul 16 10:09:22 2015 +1000
>>>
>>> mm: mmap: add mmap flag to request VM_LOCKONFAULT
>>
>> Eric, I'm happy to help with figuring out the tile issues.
>
> Thanks for the offer, I think I have it sorted in V4 (which I am
> checking one last time before I post).
Not quite sorted yet. It seems parisc fails on v4: the patch updated the
number of syscalls but did not update syscall_table.S, causing:
arch/parisc/kernel/syscall_table.S:444: Error: size of syscall table does not fit value of __NR_Linux_syscalls
http://kisskb.ellerman.id.au/kisskb/buildresult/12468884/
Paul.
--
> Eric
>