mlock() allows a user to control page out of program memory, but this
comes at the cost of faulting in the entire mapping when it is
allocated. For large mappings where the entire area is not necessary
this is not ideal. Instead of forcing all locked pages to be present
when they are allocated, this set creates a middle ground. Pages are
marked to be placed on the unevictable LRU (locked) when they are first
used, but they are not faulted in by the mlock call.
This series introduces a new mlock() system call that takes a flags
argument along with the start address and size. This flags argument
gives the caller the ability to request memory be locked in the
traditional way, or to be locked after the page is faulted in. A new
MCL flag is added to mirror the lock on fault behavior from mlock() in
mlockall(). Finally, a flag for mmap() is added that allows a user to
specify that the covered are should not be paged out, but only after the
memory has been used the first time.
There are two main use cases that this set covers. The first is the
security focussed mlock case. A buffer is needed that cannot be written
to swap. The maximum size is known, but on average the memory used is
significantly less than this maximum. With lock on fault, the buffer
is guaranteed to never be paged out without consuming the maximum size
every time such a buffer is created.
The second use case is focussed on performance. Portions of a large
file are needed and we want to keep the used portions in memory once
accessed. This is the case for large graphical models where the path
through the graph is not known until run time. The entire graph is
unlikely to be used in a given invocation, but once a node has been
used it needs to stay resident for further processing. Given these
constraints we have a number of options. We can potentially waste a
large amount of memory by mlocking the entire region (this can also
cause a significant stall at startup as the entire file is read in).
We can mlock every page as we access them without tracking if the page
is already resident but this introduces large overhead for each access.
The third option is mapping the entire region with PROT_NONE and using
a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the
needed page. Doing this page at a time adds a significant performance
penalty. Batching can be used to mitigate this overhead, but in order
to safely avoid trying to mprotect pages outside of the mapping, the
boundaries of each mapping to be used in this way must be tracked and
available to the signal handler. This is precisely what the mm system
in the kernel should already be doing.
For mlock(MLOCK_ONFAULT) and mmap(MAP_LOCKONFAULT) the user is charged
against RLIMIT_MEMLOCK as if mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was
used, so when the VMA is created not when the pages are faulted in. For
mlockall(MCL_ONFAULT) the user is charged as if MCL_FUTURE was used.
This decision was made to keep the accounting checks out of the page
fault path.
To illustrate the benefit of this set I wrote a test program that mmaps
a 5 GB file filled with random data and then makes 15,000,000 accesses
to random addresses in that mapping. The test program was run 20 times
for each setup. Results are reported for two program portions, setup
and execution. The setup phase is calling mmap and optionally mlock on
the entire region. For most experiments this is trivial, but it
highlights the cost of faulting in the entire region. Results are
averages across the 20 runs in milliseconds.
mmap with mlock(MLOCK_LOCKED) on entire range:
Setup avg: 8228.666
Processing avg: 8274.257
mmap with mlock(MLOCK_LOCKED) before each access:
Setup avg: 0.113
Processing avg: 90993.552
mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value in max_map_count, this gets ENOMEM as I attempt
to change the permissions, after upping the sysctl significantly I get:
Setup avg: 0.058
Processing avg: 69488.073
mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg: 0.068
Processing avg: 38204.116
mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg: 0.044
Processing avg: 29671.180
mmap with mlock(MLOCK_ONFAULT) on entire range:
Setup avg: 0.189
Processing avg: 17904.899
The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping. The
first step covers the page that caused the fault as we know that it will
be possible to lock. The second step speculatively tries to mlock and
mprotect the batch size - 1 pages that follow. There may be a clever
way to avoid this without having the program track each mapping to be
covered by this handeler in a globally accessible structure, but I could
not find it. It should be noted that with a large enough batch size
this two step fault handler can still cause the program to crash if it
reaches far beyond the end of the mapping.
These results show that if the developer knows that a majority of the
mapping will be used, it is better to try and fault it in at once,
otherwise MAP_LOCKONFAULT is significantly faster.
The performance cost of these patches are minimal on the two benchmarks
I have tested (stream and kernbench). The following are the average
values across 20 runs of stream and 10 runs of kernbench after a warmup
run whose results were discarded.
Avg throughput in MB/s from stream using 1000000 element arrays
Test 4.2-rc1 4.2-rc1+lock-on-fault
Copy: 10,566.5 10,421
Scale: 10,685 10,503.5
Add: 12,044.1 11,814.2
Triad: 12,064.8 11,846.3
Kernbench optimal load
4.2-rc1 4.2-rc1+lock-on-fault
Elapsed Time 78.453 78.991
User Time 64.2395 65.2355
System Time 9.7335 9.7085
Context Switches 22211.5 22412.1
Sleeps 14965.3 14956.1
---
Changes from V4:
Drop all architectures for new sys call entries except x86[_64] and MIPS
Drop munlock2 and munlockall2
Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
Adjust tests to match
Changes from V3:
Ensure that pages present when mlock2(MLOCK_ONFAULT) is called are locked
Ensure that VM_LOCKONFAULT is handled in cases that used to only check VM_LOCKED
Add tests for new system calls
Add missing syscall entries, fix NR_syscalls on multiple arch's
Add missing MAP_LOCKONFAULT for tile
Changes from V2:
Added new system calls for mlock, munlock, and munlockall with added
flags arguments for controlling how memory is locked or unlocked.
Eric B Munson (7):
mm: mlock: Refactor mlock, munlock, and munlockall code
mm: mlock: Add new mlock system call
mm: Introduce VM_LOCKONFAULT
mm: mlock: Add mlock flags to enable VM_LOCKONFAULT usage
mm: mmap: Add mmap flag to request VM_LOCKONFAULT
selftests: vm: Add tests for lock on fault
mips: Add entry for new mlock2 syscall
arch/alpha/include/uapi/asm/mman.h | 5 +
arch/mips/include/uapi/asm/mman.h | 8 +
arch/mips/include/uapi/asm/unistd.h | 15 +-
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
arch/parisc/include/uapi/asm/mman.h | 5 +
arch/powerpc/include/uapi/asm/mman.h | 5 +
arch/sparc/include/uapi/asm/mman.h | 5 +
arch/tile/include/uapi/asm/mman.h | 9 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/include/uapi/asm/mman.h | 8 +
drivers/gpu/drm/drm_vm.c | 8 +-
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 2 +
include/linux/mman.h | 3 +-
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman.h | 5 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/events/core.c | 3 +-
kernel/fork.c | 2 +-
kernel/sys_ni.c | 1 +
mm/debug.c | 1 +
mm/gup.c | 10 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 +-
mm/mlock.c | 77 +++--
mm/mmap.c | 10 +-
mm/rmap.c | 4 +-
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/lock-on-fault.c | 344 +++++++++++++++++++
tools/testing/selftests/vm/mlock2-tests.c | 507 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++
tools/testing/selftests/vm/run_vmtests | 33 ++
36 files changed, 1093 insertions(+), 46 deletions(-)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/mlock2-tests.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
--
1.9.1
Extending the mlock system call is very difficult because it currently
does not take a flags argument. A later patch in this set will extend
mlock to support a middle ground between pages that are locked and
faulted in immediately and unlocked pages. To pave the way for the new
system call, the code needs some reorganization so that all the actual
entry point handles is checking input and translating to VMA flags.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/mlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..1585cca 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -553,7 +553,8 @@ out:
return ret;
}
-static int do_mlock(unsigned long start, size_t len, int on)
+static int apply_vma_lock_flags(unsigned long start, size_t len,
+ vm_flags_t flags)
{
unsigned long nstart, end, tmp;
struct vm_area_struct * vma, * prev;
@@ -575,14 +576,10 @@ static int do_mlock(unsigned long start, size_t len, int on)
prev = vma;
for (nstart = start ; ; ) {
- vm_flags_t newflags;
+ vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED;
+ newflags |= flags;
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
-
- newflags = vma->vm_flags & ~VM_LOCKED;
- if (on)
- newflags |= VM_LOCKED;
-
tmp = vma->vm_end;
if (tmp > end)
tmp = end;
@@ -604,7 +601,7 @@ static int do_mlock(unsigned long start, size_t len, int on)
return error;
}
-SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
unsigned long locked;
unsigned long lock_limit;
@@ -628,7 +625,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = do_mlock(start, len, 1);
+ error = apply_vma_lock_flags(start, len, flags);
up_write(¤t->mm->mmap_sem);
if (error)
@@ -640,6 +637,11 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return 0;
}
+SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+{
+ return do_mlock(start, len, VM_LOCKED);
+}
+
SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;
@@ -648,13 +650,13 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
start &= PAGE_MASK;
down_write(¤t->mm->mmap_sem);
- ret = do_mlock(start, len, 0);
+ ret = apply_vma_lock_flags(start, len, 0);
up_write(¤t->mm->mmap_sem);
return ret;
}
-static int do_mlockall(int flags)
+static int apply_mlockall_flags(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
@@ -662,6 +664,7 @@ static int do_mlockall(int flags)
current->mm->def_flags |= VM_LOCKED;
else
current->mm->def_flags &= ~VM_LOCKED;
+
if (flags == MCL_FUTURE)
goto out;
@@ -703,7 +706,7 @@ SYSCALL_DEFINE1(mlockall, int, flags)
if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
capable(CAP_IPC_LOCK))
- ret = do_mlockall(flags);
+ ret = apply_mlockall_flags(flags);
up_write(¤t->mm->mmap_sem);
if (!ret && (flags & MCL_CURRENT))
mm_populate(0, TASK_SIZE);
@@ -716,7 +719,7 @@ SYSCALL_DEFINE0(munlockall)
int ret;
down_write(¤t->mm->mmap_sem);
- ret = do_mlockall(0);
+ ret = apply_mlockall_flags(0);
up_write(¤t->mm->mmap_sem);
return ret;
}
--
1.9.1
With the refactored mlock code, introduce a new system call for mlock.
The new call will allow the user to specify what lock states are being
added. mlock2 is trivial at the moment, but a follow on patch will add
a new mlock state making it useful.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from V4:
* Drop all architectures except x86[_64] from this patch, MIPS is added
later in the series. All others will be left to their maintainers.
Changes from V3:
* Do a (hopefully) complete job of adding the new system calls
arch/alpha/include/uapi/asm/mman.h | 2 ++
arch/mips/include/uapi/asm/mman.h | 5 +++++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/powerpc/include/uapi/asm/mman.h | 2 ++
arch/sparc/include/uapi/asm/mman.h | 2 ++
arch/tile/include/uapi/asm/mman.h | 5 +++++
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/include/uapi/asm/mman.h | 5 +++++
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/mman.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 +++-
kernel/sys_ni.c | 1 +
mm/mlock.c | 9 +++++++++
14 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..ec72436 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -38,6 +38,8 @@
#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..67c1cdf 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -62,6 +62,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..daab994 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -32,6 +32,8 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..189e85f 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -23,6 +23,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..13d51be 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -18,6 +18,8 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..f69ce48 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -37,5 +37,10 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ef8187f..839d5df 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 mlock2 sys_mlock2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9ef32d5..ad36769 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common mlock2 sys_mlock2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..11f354f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -75,6 +75,11 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+/*
+ * Flags for mlock
+ */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b45c45b..56a3d59 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);
+asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..242436b 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -18,4 +18,6 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+
#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..14a6013 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_mlock2 282
+__SYSCALL(__NR_mlock2, sys_mlock2)
#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..4818b71 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,7 @@ cond_syscall(sys_mlock);
cond_syscall(sys_munlock);
cond_syscall(sys_mlockall);
cond_syscall(sys_munlockall);
+cond_syscall(sys_mlock2);
cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
diff --git a/mm/mlock.c b/mm/mlock.c
index 1585cca..c9c6a0f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -642,6 +642,15 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
return do_mlock(start, len, VM_LOCKED);
}
+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+ vm_flags_t vm_flags = VM_LOCKED;
+ if (!flags || (flags & ~(MLOCK_LOCKED)))
+ return -EINVAL;
+
+ return do_mlock(start, len, vm_flags);
+}
+
SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;
--
1.9.1
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large
statical language model (probably applies to other statical or graphical
models as well). For the security example, any application transacting
in data that cannot be swapped out (credit card data, medical records,
etc).
This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with
VM_LOCKED and has no effect when set without VM_LOCKED. Setting the
VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to
be added to the unevictable LRU when they are faulted or if they are
already present, but will not cause any missing pages to be faulted in.
Exposing this new lock state means that we cannot overload the meaning
of the FOLL_POPULATE flag any longer. Prior to this patch it was used
to mean that the VMA for a fault was locked. This means we need the
new FOLL_MLOCK flag to communicate the locked state of a VMA.
FOLL_POPULATE will now only control if the VMA should be populated and
in the case of VM_LOCKONFAULT, it will not be set.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/gpu/drm/drm_vm.c | 8 +++++++-
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 2 ++
kernel/fork.c | 2 +-
mm/debug.c | 1 +
mm/gup.c | 10 ++++++++--
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/mlock.c | 2 +-
mm/mmap.c | 2 +-
mm/rmap.c | 4 ++--
11 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
index aab49ee..103a5f6 100644
--- a/drivers/gpu/drm/drm_vm.c
+++ b/drivers/gpu/drm/drm_vm.c
@@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data)
(void *)(unsigned long)virt_to_phys(high_memory));
list_for_each_entry(pt, &dev->vmalist, head) {
+ char lock_flag = '-';
+
vma = pt->vma;
if (!vma)
continue;
+ if (vma->vm_flags & VM_LOCKONFAULT)
+ lock_flag = 'f';
+ else if (vma->vm_flags & VM_LOCKED)
+ lock_flag = 'l';
seq_printf(m,
"\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000",
pt->pid,
@@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data)
vma->vm_flags & VM_WRITE ? 'w' : '-',
vma->vm_flags & VM_EXEC ? 'x' : '-',
vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
- vma->vm_flags & VM_LOCKED ? 'l' : '-',
+ lock_flag,
vma->vm_flags & VM_IO ? 'i' : '-',
vma->vm_pgoff);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091..38d69fc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_X86_INTEL_MPX
[ilog2(VM_MPX)] = "mp",
#endif
+ [ilog2(VM_LOCKONFAULT)] = "lf",
[ilog2(VM_LOCKED)] = "lo",
[ilog2(VM_IO)] = "io",
[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..c2f3551 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -127,6 +127,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
+#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
@@ -2043,6 +2044,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
+#define FOLL_MLOCK 0x1000 /* lock present pages */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d..a949228 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -454,7 +454,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
tmp->vm_mm = mm;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
- tmp->vm_flags &= ~VM_LOCKED;
+ tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
tmp->vm_next = tmp->vm_prev = NULL;
file = tmp->vm_file;
if (file) {
diff --git a/mm/debug.c b/mm/debug.c
index 76089dd..25176bb 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -121,6 +121,7 @@ static const struct trace_print_flags vmaflags_names[] = {
{VM_GROWSDOWN, "growsdown" },
{VM_PFNMAP, "pfnmap" },
{VM_DENYWRITE, "denywrite" },
+ {VM_LOCKONFAULT, "lockonfault" },
{VM_LOCKED, "locked" },
{VM_IO, "io" },
{VM_SEQ_READ, "seqread" },
diff --git a/mm/gup.c b/mm/gup.c
index 6297f6b..e632908 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -92,7 +92,7 @@ retry:
*/
mark_page_accessed(page);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
/*
* The preliminary mapping check is mainly to avoid the
* pointless overhead of lock_page on the ZERO_PAGE
@@ -265,6 +265,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned int fault_flags = 0;
int ret;
+ /* mlock all present pages, but do not fault in new pages */
+ if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
+ return -ENOENT;
/* For mm_populate(), just skip the stack guard page. */
if ((*flags & FOLL_POPULATE) &&
(stack_guard_page_start(vma, address) ||
@@ -850,7 +853,10 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_VMA(end > vma->vm_end, vma);
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
- gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+ gup_flags = FOLL_TOUCH | FOLL_MLOCK;
+ if ((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED)
+ gup_flags |= FOLL_POPULATE;
+
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c107094..5e22d90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1238,7 +1238,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
pmd, _pmd, 1))
update_mmu_cache_pmd(vma, addr, pmd);
}
- if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED )) {
if (page->mapping && trylock_page(page)) {
lru_add_drain();
if (page->mapping)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..82caa48 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3764,8 +3764,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
/* Allow segments to share if only one is marked locked */
- unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
- unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;
+ unsigned long vm_flags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
+ unsigned long svm_flags = svma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
/*
* match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index c9c6a0f..e98bdd4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -422,7 +422,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- vma->vm_flags &= ~VM_LOCKED;
+ vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
while (start < end) {
struct page *page = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ad..bdbefc3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1651,7 +1651,7 @@ out:
vma == get_gate_vma(current->mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
- vma->vm_flags &= ~VM_LOCKED;
+ vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
}
if (file)
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b687..47c855a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -744,7 +744,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED) {
spin_unlock(ptl);
- pra->vm_flags |= VM_LOCKED;
+ pra->vm_flags |= (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
return SWAP_FAIL; /* To break the loop */
}
@@ -765,7 +765,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED) {
pte_unmap_unlock(pte, ptl);
- pra->vm_flags |= VM_LOCKED;
+ pra->vm_flags |= (vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
return SWAP_FAIL; /* To break the loop */
}
--
1.9.1
The previous patch introduced a flag that specified pages in a VMA
should be placed on the unevictable LRU, but they should not be made
present when the area is created. This patch adds the ability to set
this state via the new mlock system calls.
We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
MLOCK_ONFAULT will set the VM_LOCKONFAULT flag as well as the VM_LOCKED
flag for the target region. MCL_CURRENT and MCL_ONFAULT are used to
lock current mappings. With MCL_CURRENT all pages are made present and
with MCL_ONFAULT they are locked when faulted in. When specified with
MCL_FUTURE all new mappings will be marked with VM_LOCKONFAULT.
Currently, mlockall() clears all VMA lock flags and then sets the
requested flags. For instance, if a process has MCL_FUTURE and
MCL_CURRENT set, but they want to clear MCL_FUTURE this would be
accomplished by calling mlockall(MCL_CURRENT). This still holds with
the introduction of MCL_ONFAULT. Each call to mlockall() resets all
VMA flags to the values specified in the current call. The new mlock2
system call behaves in the same way. If a region is locked with
MLOCK_ONFAULT and a user wants to force it to be populated now, a second
call to mlock2(MLOCK_LOCKED) will accomplish this.
munlock() will unconditionally clear both vma flags. munlockall()
unconditionally clears for VMA flags on all VMAs and in the
mm->def_flags field.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from V4:
* Split addition of VMA flag
Changes from V3:
* Do extensive search for VM_LOCKED and ensure that VM_LOCKONFAULT is also handled
where appropriate
arch/alpha/include/uapi/asm/mman.h | 2 ++
arch/mips/include/uapi/asm/mman.h | 2 ++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/powerpc/include/uapi/asm/mman.h | 2 ++
arch/sparc/include/uapi/asm/mman.h | 2 ++
arch/tile/include/uapi/asm/mman.h | 3 +++
arch/xtensa/include/uapi/asm/mman.h | 2 ++
include/uapi/asm-generic/mman.h | 2 ++
mm/mlock.c | 41 ++++++++++++++++++++++++------------
9 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index ec72436..77ae8db 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -37,8 +37,10 @@
#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MCL_ONFAULT 32768 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 67c1cdf..71ed81d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -61,11 +61,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index daab994..c0871ce 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -31,8 +31,10 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 189e85f..f93f7eb 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -22,8 +22,10 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 13d51be..8cd2ebc 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -17,8 +17,10 @@
#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index f69ce48..acdd013 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -36,11 +36,14 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 11f354f..5725a15 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -74,11 +74,13 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
/*
* Flags for mlock
*/
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 242436b..555aab0 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -17,7 +17,9 @@
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
#define MLOCK_LOCKED 0x01 /* Lock and populate the specified range */
+#define MLOCK_ONFAULT 0x02 /* Lock pages in range after they are faulted in, do not prefault */
#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index e98bdd4..3a99c80 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -506,7 +506,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
- goto out; /* don't set VM_LOCKED, don't count */
+ /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
+ goto out;
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
@@ -576,7 +577,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
prev = vma;
for (nstart = start ; ; ) {
- vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED;
+ vm_flags_t newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
newflags |= flags;
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
@@ -645,9 +646,13 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
vm_flags_t vm_flags = VM_LOCKED;
- if (!flags || (flags & ~(MLOCK_LOCKED)))
+ if (!flags || (flags & ~(MLOCK_LOCKED | MLOCK_ONFAULT)) ||
+ flags == (MLOCK_LOCKED | MLOCK_ONFAULT))
return -EINVAL;
+ if (flags & MLOCK_ONFAULT)
+ vm_flags |= VM_LOCKONFAULT;
+
return do_mlock(start, len, vm_flags);
}
@@ -668,21 +673,30 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
static int apply_mlockall_flags(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ vm_flags_t to_add = 0;
- if (flags & MCL_FUTURE)
+ current->mm->def_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+ if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
- else
- current->mm->def_flags &= ~VM_LOCKED;
- if (flags == MCL_FUTURE)
- goto out;
+ if (flags == MCL_FUTURE)
+ goto out;
+
+ if (flags & MCL_ONFAULT)
+ current->mm->def_flags |= VM_LOCKONFAULT;
+ }
+
+ if (flags & (MCL_ONFAULT | MCL_CURRENT)) {
+ to_add |= VM_LOCKED;
+ if (flags & MCL_ONFAULT)
+ to_add |= VM_LOCKONFAULT;
+ }
for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;
- newflags = vma->vm_flags & ~VM_LOCKED;
- if (flags & MCL_CURRENT)
- newflags |= VM_LOCKED;
+ newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
+ newflags |= to_add;
/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
@@ -697,7 +711,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
unsigned long lock_limit;
int ret = -EINVAL;
- if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
+ if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
+ (flags & (MCL_CURRENT | MCL_ONFAULT)) == (MCL_CURRENT | MCL_ONFAULT))
goto out;
ret = -EPERM;
@@ -717,7 +732,7 @@ SYSCALL_DEFINE1(mlockall, int, flags)
capable(CAP_IPC_LOCK))
ret = apply_mlockall_flags(flags);
up_write(¤t->mm->mmap_sem);
- if (!ret && (flags & MCL_CURRENT))
+ if (!ret && (flags & (MCL_CURRENT | MCL_ONFAULT)))
mm_populate(0, TASK_SIZE);
out:
return ret;
--
1.9.1
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.
Now that we have the new VMA flag for the locked but not present state,
expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/uapi/asm/mman.h | 1 +
arch/sparc/include/uapi/asm/mman.h | 1 +
arch/tile/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/mman.h | 3 ++-
include/uapi/asm-generic/mman.h | 1 +
kernel/events/core.c | 3 ++-
mm/mmap.c | 8 ++++++--
11 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 77ae8db..3f80ca4 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 71ed81d..905c1ea 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
/*
* Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index c0871ce..c4695f6 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index f93f7eb..40a3fda 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -31,5 +31,6 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
#endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 8cd2ebc..f66efa6 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -26,6 +26,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index acdd013..800e5c3 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_HUGETLB 0x4000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
/*
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 5725a15..689e1f2 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..8243268 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT | VM_LOCKED);
}
unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 555aab0..007b784 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
/* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..ec039f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5815,7 +5815,8 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
if (vma->vm_flags & VM_MAYEXEC)
flags |= MAP_EXECUTABLE;
if (vma->vm_flags & VM_LOCKED)
- flags |= MAP_LOCKED;
+ flags |= (vma->vm_flags & VM_LOCKONFAULT ?
+ MAP_LOCKONFAULT : MAP_LOCKED);
if (vma->vm_flags & VM_HUGETLB)
flags |= MAP_HUGETLB;
diff --git a/mm/mmap.c b/mm/mmap.c
index bdbefc3..56a842d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1261,6 +1261,10 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (!len)
return -EINVAL;
+ if ((flags & (MAP_LOCKED | MAP_LOCKONFAULT)) ==
+ (MAP_LOCKED | MAP_LOCKONFAULT))
+ return -EINVAL;
+
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*
@@ -1301,7 +1305,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
- if (flags & MAP_LOCKED)
+ if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
if (!can_do_mlock())
return -EPERM;
@@ -2674,7 +2678,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
flags &= MAP_NONBLOCK;
flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
if (vma->vm_flags & VM_LOCKED) {
- flags |= MAP_LOCKED;
+ flags |= (vma->vm_flags & VM_LOCKONFAULT ? MAP_LOCKONFAULT : MAP_LOCKED);
/* drop PG_Mlocked flag for over-mapped range */
munlock_vma_pages_range(vma, start, start + size);
}
--
1.9.1
Test the mmap() flag, and the mlockall() flag. These tests ensure that
pages are not faulted in until they are accessed, that the pages are
unevictable once faulted in, and that VMA splitting and merging works
with the new VM flag. The second test ensures that mlock limits are
respected. Note that the limit test needs to be run a normal user.
Also add tests to use the new mlock2 family of system calls.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/lock-on-fault.c | 344 +++++++++++++++++++
tools/testing/selftests/vm/mlock2-tests.c | 507 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++
tools/testing/selftests/vm/run_vmtests | 33 ++
5 files changed, 934 insertions(+)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/mlock2-tests.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 231b9a0..0fe6524 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -5,7 +5,10 @@ BINARIES = compaction_test
BINARIES += hugepage-mmap
BINARIES += hugepage-shm
BINARIES += hugetlbfstest
+BINARIES += lock-on-fault
BINARIES += map_hugetlb
+BINARIES += mlock2-tests
+BINARIES += on-fault-limit
BINARIES += thuge-gen
BINARIES += transhuge-stress
diff --git a/tools/testing/selftests/vm/lock-on-fault.c b/tools/testing/selftests/vm/lock-on-fault.c
new file mode 100644
index 0000000..9783994
--- /dev/null
+++ b/tools/testing/selftests/vm/lock-on-fault.c
@@ -0,0 +1,344 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <errno.h>
+
+struct vm_boundaries {
+ unsigned long start;
+ unsigned long end;
+};
+
+static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
+{
+ FILE *file;
+ int ret = 1;
+ char line[1024] = {0};
+ char *end_addr;
+ char *stop;
+ unsigned long start;
+ unsigned long end;
+
+ if (!area)
+ return ret;
+
+ file = fopen("/proc/self/maps", "r");
+ if (!file) {
+ perror("fopen");
+ return ret;
+ }
+
+ memset(area, 0, sizeof(struct vm_boundaries));
+
+ while(fgets(line, 1024, file)) {
+ end_addr = strchr(line, '-');
+ if (!end_addr) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ *end_addr = '\0';
+ end_addr++;
+ stop = strchr(end_addr, ' ');
+ if (!stop) {
+ printf("cannot parse /proc/self/maps\n");
+ goto out;
+ }
+ stop = '\0';
+
+ sscanf(line, "%lx", &start);
+ sscanf(end_addr, "%lx", &end);
+
+ if (start <= addr && end > addr) {
+ area->start = start;
+ area->end = end;
+ ret = 0;
+ goto out;
+ }
+ }
+out:
+ fclose(file);
+ return ret;
+}
+
+static unsigned long get_pageflags(unsigned long addr)
+{
+ FILE *file;
+ unsigned long pfn;
+ unsigned long offset;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ offset = addr / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset, SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&pfn, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return pfn;
+}
+
+static unsigned long get_kpageflags(unsigned long pfn)
+{
+ unsigned long flags;
+ FILE *file;
+
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen");
+ _exit(1);
+ }
+
+ if (fseek(file, pfn * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek");
+ _exit(1);
+ }
+
+ if (fread(&flags, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ _exit(1);
+ }
+
+ fclose(file);
+ return flags;
+}
+
+#define PRESENT_BIT 0x8000000000000000
+#define PFN_MASK 0x007FFFFFFFFFFFFF
+#define UNEVICTABLE_BIT (1UL << 18)
+
+static int test_mmap(int flags)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ void *map;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return 1;
+ }
+
+ /* Write something into the first page to ensure it is present */
+ *(char *)map = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* page2_flags should not be present */
+ if (page2_flags & PRESENT_BIT) {
+ printf("page map says 0x%lx\n", page2_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ /* page1_flags should be present */
+ if ((page1_flags & PRESENT_BIT) == 0) {
+ printf("page map says 0x%lx\n", page1_flags);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ /* page1_flags now contains the entry from kpageflags for the first
+ * page, the unevictable bit should be set */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("kpageflags says 0x%lx\n", page1_flags);
+ printf("unevictable is 0x%lx\n", UNEVICTABLE_BIT);
+ return 1;
+ }
+
+ munmap(map, 2 * page_size);
+ return 0;
+}
+
+static int test_munlock(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page3_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ /* No pages should be present */
+ if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT) ||
+ (page3_flags & PRESENT_BIT)) {
+ printf("Page was made present by munlock()\n");
+ goto out;
+ }
+
+ /* Write something to each page so that they are faulted in */
+ *(char*)map = 1;
+ *(char*)(map + page_size) = 1;
+ *(char*)(map + page_size * 2) = 1;
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page3_flags = get_pageflags((unsigned long)map + page_size * 2);
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+ page3_flags = get_kpageflags(page3_flags & PFN_MASK);
+
+ /* Pages 1 and 3 should be unevictable */
+ if (!(page1_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page1\n");
+ goto out;
+ }
+ if (!(page3_flags & UNEVICTABLE_BIT)) {
+ printf("Missing unevictable bit on lock on fault page3\n");
+ goto out;
+ }
+
+ /* Page 2 should not be unevictable */
+ if (page2_flags & UNEVICTABLE_BIT) {
+ printf("Unlocked page is still marked unevictable\n");
+ goto out;
+ }
+
+ ret = 0;
+
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+static int test_vma_management(int flags)
+{
+ int ret = 1;
+ void *map;
+ unsigned long page_size = getpagesize();
+ struct vm_boundaries page1;
+ struct vm_boundaries page2;
+ struct vm_boundaries page3;
+
+ map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /*
+ * Before we unlock a portion, we need to that all three pages are in
+ * the same VMA. If they are not we abort this test (Note that this is
+ * not a failure)
+ */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("VMAs are not merged to start, aborting test\n");
+ ret = 0;
+ goto out;
+ }
+
+ if (munlock(map + page_size, page_size)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* All three VMAs should be different */
+ if (page1.start == page2.start || page2.start == page3.start) {
+ printf("failed to split VMA for munlock\n");
+ goto out;
+ }
+
+ /* Now unlock the first and third page and check the VMAs again */
+ if (munlock(map, page_size * 3)) {
+ perror("munlock()");
+ goto out;
+ }
+
+ if (get_vm_area((unsigned long)map, &page1) ||
+ get_vm_area((unsigned long)map + page_size, &page2) ||
+ get_vm_area((unsigned long)map + page_size * 2, &page3)) {
+ printf("couldn't find mapping in /proc/self/maps\n");
+ goto out;
+ }
+
+ /* Now all three VMAs should be the same */
+ if (page1.start != page2.start || page2.start != page3.start) {
+ printf("failed to merge VMAs after munlock\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ munmap(map, 3 * page_size);
+ return ret;
+}
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_mlockall(int (test_function)(int flags))
+{
+ int ret = 1;
+
+ if (mlockall(MCL_ONFAULT | MCL_FUTURE)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ ret = test_function(MAP_PRIVATE | MAP_ANONYMOUS);
+ munlockall();
+ return ret;
+}
+
+#ifndef MAP_LOCKONFAULT
+#define MAP_LOCKONFAULT (MAP_HUGETLB << 1)
+#endif
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+ ret += test_mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_mmap);
+ ret += test_munlock(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_munlock);
+ ret += test_vma_management(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall(test_vma_management);
+ return ret;
+}
+
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c
new file mode 100644
index 0000000..9acf9c2
--- /dev/null
+++ b/tools/testing/selftests/vm/mlock2-tests.c
@@ -0,0 +1,507 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <errno.h>
+#include <stdbool.h>
+
+#ifndef MLOCK_LOCK
+#define MLOCK_LOCK 1
+#endif
+
+#ifndef MLOCK_ONFAULT
+#define MLOCK_ONFAULT 2
+#endif
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int mlock2_(void *start, size_t len, int flags)
+{
+#ifdef __NR_mlock2
+ return syscall(__NR_mlock2, start, len, flags);
+#else
+ errno = ENOSYS;
+ return -1;
+#endif
+}
+
+static unsigned long get_pageflags(unsigned long addr)
+{
+ FILE *file;
+ unsigned long pfn;
+ unsigned long offset;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen pagemap");
+ _exit(1);
+ }
+
+ offset = addr / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset, SEEK_SET)) {
+ perror("fseek pagemap");
+ _exit(1);
+ }
+
+ if (fread(&pfn, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread pagemap");
+ _exit(1);
+ }
+
+ fclose(file);
+ return pfn;
+}
+
+static unsigned long get_kpageflags(unsigned long pfn)
+{
+ unsigned long flags;
+ FILE *file;
+
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen kpageflags");
+ _exit(1);
+ }
+
+ if (fseek(file, pfn * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek kpageflags");
+ _exit(1);
+ }
+
+ if (fread(&flags, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread kpageflags");
+ _exit(1);
+ }
+
+ fclose(file);
+ return flags;
+}
+
+#define VMFLAGS "VmFlags:"
+
+static bool find_flag(FILE *file, const char *vmflag)
+{
+ char *line = NULL;
+ char *flags;
+ size_t size = 0;
+ bool ret = false;
+
+ while (getline(&line, &size, file) > 0) {
+ if (!strstr(line, VMFLAGS)) {
+ free(line);
+ line = NULL;
+ size = 0;
+ continue;
+ }
+
+ flags = line + strlen(VMFLAGS);
+ ret = (strstr(flags, vmflag) != NULL);
+ goto out;
+ }
+
+out:
+ free(line);
+ return ret;
+}
+
+static bool is_vmflag_set(unsigned long addr, const char *vmflag)
+{
+ FILE *file;
+ char *line = NULL;
+ size_t size = 0;
+ bool ret = false;
+ unsigned long start, end;
+ char perms[5];
+ unsigned long offset;
+ char dev[32];
+ unsigned long inode;
+ char path[BUFSIZ];
+
+ file = fopen("/proc/self/smaps", "r");
+ if (!file) {
+ perror("fopen smaps");
+ _exit(1);
+ }
+
+ while (getline(&line, &size, file) > 0) {
+ if (sscanf(line, "%lx-%lx %s %lx %s %lu %s\n",
+ &start, &end, perms, &offset, dev, &inode, path) < 6)
+ goto next;
+
+ if (start <= addr && addr < end) {
+ ret = find_flag(file, vmflag);
+ goto out;
+ }
+
+next:
+ free(line);
+ line = NULL;
+ size = 0;
+ }
+
+out:
+ free(line);
+ fclose(file);
+ return ret;
+}
+
+#define PRESENT_BIT 0x8000000000000000
+#define PFN_MASK 0x007FFFFFFFFFFFFF
+#define UNEVICTABLE_BIT (1UL << 18)
+
+#define LOCKED "lo"
+#define LOCKEDONFAULT "lf"
+
+static int lock_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Both pages should be present */
+ if (((page1_flags & PRESENT_BIT) == 0) ||
+ ((page2_flags & PRESENT_BIT) == 0)) {
+ printf("Failed to make both pages present\n");
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ /* Both pages should be unevictable */
+ if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
+ ((page2_flags & UNEVICTABLE_BIT) == 0)) {
+ printf("Failed to make both pages unevictable\n");
+ return 1;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKED) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is missing\n", LOCKED);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int unlock_lock_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
+ printf("A page is still marked unevictable after unlock\n");
+ return 1;
+ }
+
+ if (is_vmflag_set((unsigned long)map, LOCKED) ||
+ is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
+ printf("VMA flag %s is still set after unlock\n", LOCKED);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int test_mlock_lock()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ if (mlock2_(map, 2 * page_size, MLOCK_LOCK)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_LOCK)");
+ goto unmap;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ /* Now unlock and recheck attributes */
+ if (munlock(map, 2 * page_size)) {
+ perror("munlock()");
+ goto unmap;
+ }
+
+ ret = unlock_lock_check(map);
+
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int onfault_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Neither page should be present */
+ if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
+ printf("Pages were made present by MLOCK_ONFAULT\n");
+ return 1;
+ }
+
+ *map = 'a';
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+
+ /* Only page 1 should be present */
+ if ((page1_flags & PRESENT_BIT) == 0) {
+ printf("Page 1 is not present after fault\n");
+ return 1;
+ } else if (page2_flags & PRESENT_BIT) {
+ printf("Page 2 was made present\n");
+ return 1;
+ }
+
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ /* Page 1 should be unevictable */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("Failed to make faulted page unevictable\n");
+ return 1;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is missing\n", LOCKEDONFAULT);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int unlock_onfault_check(char *map)
+{
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+
+ if (page1_flags & UNEVICTABLE_BIT) {
+ printf("Page 1 is still marked unevictable after unlock\n");
+ return 1;
+ }
+
+ if (is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is still set after unlock\n", LOCKEDONFAULT);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int test_mlock_onfault()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ if (onfault_check(map))
+ goto unmap;
+
+ /* Now unlock and recheck attributes */
+ if (munlock(map, 2 * page_size)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("munlock2(MLOCK_LOCK)");
+ goto unmap;
+ }
+
+ ret = unlock_onfault_check(map);
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int test_lock_onfault_of_present()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("test_mlock_locked mmap");
+ goto out;
+ }
+
+ *map = 'a';
+
+ if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
+ if (errno == ENOSYS) {
+ printf("Cannot call new mlock family, skipping test\n");
+ _exit(0);
+ }
+ perror("mlock2(MLOCK_ONFAULT)");
+ goto unmap;
+ }
+
+ page1_flags = get_pageflags((unsigned long)map);
+ page2_flags = get_pageflags((unsigned long)map + page_size);
+ page1_flags = get_kpageflags(page1_flags & PFN_MASK);
+ page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+
+ /* Page 1 should be unevictable */
+ if ((page1_flags & UNEVICTABLE_BIT) == 0) {
+ printf("Failed to make present page unevictable\n");
+ goto unmap;
+ }
+
+ if (!is_vmflag_set((unsigned long)map, LOCKEDONFAULT) ||
+ !is_vmflag_set((unsigned long)map + page_size, LOCKEDONFAULT)) {
+ printf("VMA flag %s is missing for one of the pages\n", LOCKEDONFAULT);
+ goto unmap;
+ }
+ ret = 0;
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ return ret;
+}
+
+static int test_munlockall()
+{
+ char *map;
+ int ret = 1;
+ unsigned long page1_flags;
+ unsigned long page2_flags;
+ unsigned long page_size = getpagesize();
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+
+ if (map == MAP_FAILED) {
+ perror("test_munlockall mmap");
+ goto out;
+ }
+
+ if (mlockall(MCL_CURRENT)) {
+ perror("mlockall(MCL_CURRENT)");
+ goto out;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ if (munlockall()) {
+ perror("munlockall()");
+ goto unmap;
+ }
+
+ if (unlock_lock_check(map))
+ goto unmap;
+
+ munmap(map, 2 * page_size);
+
+ map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+
+ if (map == MAP_FAILED) {
+ perror("test_munlockall second mmap");
+ goto out;
+ }
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall(MCL_ONFAULT)");
+ goto unmap;
+ }
+
+ if (onfault_check(map))
+ goto unmap;
+
+ if (munlockall()) {
+ perror("munlockall()");
+ goto unmap;
+ }
+
+ if (unlock_onfault_check(map))
+ goto unmap;
+
+ if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
+ perror("mlockall(MCL_CURRENT | MCL_FUTURE)");
+ goto out;
+ }
+
+ if (lock_check(map))
+ goto unmap;
+
+ if (munlockall()) {
+ perror("munlockall()");
+ goto unmap;
+ }
+
+ ret = unlock_lock_check(map);
+
+unmap:
+ munmap(map, 2 * page_size);
+out:
+ munlockall();
+ return ret;
+}
+
+int main(char **argv, int argc)
+{
+ int ret = 0;
+ ret += test_mlock_lock();
+ ret += test_mlock_onfault();
+ ret += test_munlockall();
+ ret += test_lock_onfault_of_present();
+ return ret;
+}
+
diff --git a/tools/testing/selftests/vm/on-fault-limit.c b/tools/testing/selftests/vm/on-fault-limit.c
new file mode 100644
index 0000000..0ae458f
--- /dev/null
+++ b/tools/testing/selftests/vm/on-fault-limit.c
@@ -0,0 +1,47 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_limit(void)
+{
+ int ret = 1;
+ struct rlimit lims;
+ void *map;
+
+ if (getrlimit(RLIMIT_MEMLOCK, &lims)) {
+ perror("getrlimit");
+ return ret;
+ }
+
+ if (mlockall(MCL_ONFAULT | MCL_FUTURE)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);
+ if (map != MAP_FAILED)
+ printf("mmap should have failed, but didn't\n");
+ else {
+ ret = 0;
+ munmap(map, 2 * lims.rlim_max);
+ }
+
+ munlockall();
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+
+ ret += test_limit();
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 49ece11..990a61f 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -102,4 +102,37 @@ else
echo "[PASS]"
fi
+echo "--------------------"
+echo "running lock-on-fault"
+echo "--------------------"
+./lock-on-fault
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running on-fault-limit"
+echo "--------------------"
+sudo -u nobody ./on-fault-limit
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running mlock2-tests"
+echo "--------------------"
+./mlock2-tests
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
exit $exitcode
--
1.9.1
A previous commit introduced the new mlock2 syscall, add entries for the
MIPS architecture.
Signed-off-by: Eric B Munson <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
5 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index c03088f..d0bdfaa 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -377,16 +377,17 @@
#define __NR_memfd_create (__NR_Linux + 354)
#define __NR_bpf (__NR_Linux + 355)
#define __NR_execveat (__NR_Linux + 356)
+#define __NR_mlock2 (__NR_Linux + 357)
/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 356
+#define __NR_Linux_syscalls 357
#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 356
+#define __NR_O32_Linux_syscalls 357
#if _MIPS_SIM == _MIPS_SIM_ABI64
@@ -711,16 +712,17 @@
#define __NR_memfd_create (__NR_Linux + 314)
#define __NR_bpf (__NR_Linux + 315)
#define __NR_execveat (__NR_Linux + 316)
+#define __NR_mlock2 (__NR_Linux + 317)
/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 316
+#define __NR_Linux_syscalls 317
#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 316
+#define __NR_64_Linux_syscalls 317
#if _MIPS_SIM == _MIPS_SIM_NABI32
@@ -1049,15 +1051,16 @@
#define __NR_memfd_create (__NR_Linux + 318)
#define __NR_bpf (__NR_Linux + 319)
#define __NR_execveat (__NR_Linux + 320)
+#define __NR_mlock2 (__NR_Linux + 321)
/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 320
+#define __NR_Linux_syscalls 321
#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 320
+#define __NR_N32_Linux_syscalls 321
#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 4cc1350..b0b377a 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -599,3 +599,4 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR sys_execveat
+ PTR sys_mlock2
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index ad4d4463..97aaf51 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -436,4 +436,5 @@ EXPORT(sys_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 5315 */
PTR sys_execveat
+ PTR sys_mlock2
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 446cc65..e36f21e 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -429,4 +429,5 @@ EXPORT(sysn32_call_table)
PTR sys_memfd_create
PTR sys_bpf
PTR compat_sys_execveat /* 6320 */
+ PTR sys_mlock2
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index f543ff4..7a8b2df 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -584,4 +584,5 @@ EXPORT(sys32_call_table)
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
PTR compat_sys_execveat
+ PTR sys_mlock2
.size sys32_call_table,.-sys32_call_table
--
1.9.1
On Fri, Jul 24, 2015 at 05:28:39PM -0400, Eric B Munson wrote:
> Extending the mlock system call is very difficult because it currently
> does not take a flags argument. A later patch in this set will extend
> mlock to support a middle ground between pages that are locked and
> faulted in immediately and unlocked pages. To pave the way for the new
> system call, the code needs some reorganization so that all the actual
> entry point handles is checking input and translating to VMA flags.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:40PM -0400, Eric B Munson wrote:
> With the refactored mlock code, introduce a new system call for mlock.
> The new call will allow the user to specify what lock states are being
> added. mlock2 is trivial at the moment, but a follow on patch will add
> a new mlock state making it useful.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Guenter Roeck <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> Changes from V4:
> * Drop all architectures except x86[_64] from this patch, MIPS is added
> later in the series. All others will be left to their maintainers.
>
> Changes from V3:
> * Do a (hopefully) complete job of adding the new system calls
> arch/alpha/include/uapi/asm/mman.h | 2 ++
> arch/mips/include/uapi/asm/mman.h | 5 +++++
> arch/parisc/include/uapi/asm/mman.h | 2 ++
> arch/powerpc/include/uapi/asm/mman.h | 2 ++
> arch/sparc/include/uapi/asm/mman.h | 2 ++
> arch/tile/include/uapi/asm/mman.h | 5 +++++
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/include/uapi/asm/mman.h | 5 +++++
Define MLOCK_LOCKED in include/uapi/asm-generic/mman-common.h.
This way you can drop changes in powerpc, sparc and tile.
Otherwise looks good.
--
Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:41PM -0400, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> For the example of a large file, this is the usage pattern for a large
> statical language model (probably applies to other statical or graphical
> models as well). For the security example, any application transacting
> in data that cannot be swapped out (credit card data, medical records,
> etc).
>
> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in. The VM_LOCKONFAULT flag will be used together with
> VM_LOCKED and has no effect when set without VM_LOCKED. Setting the
> VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to
> be added to the unevictable LRU when they are faulted or if they are
> already present, but will not cause any missing pages to be faulted in.
>
> Exposing this new lock state means that we cannot overload the meaning
> of the FOLL_POPULATE flag any longer. Prior to this patch it was used
> to mean that the VMA for a fault was locked. This means we need the
> new FOLL_MLOCK flag to communicate the locked state of a VMA.
> FOLL_POPULATE will now only control if the VMA should be populated and
> in the case of VM_LOCKONFAULT, it will not be set.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> drivers/gpu/drm/drm_vm.c | 8 +++++++-
> fs/proc/task_mmu.c | 1 +
> include/linux/mm.h | 2 ++
> kernel/fork.c | 2 +-
> mm/debug.c | 1 +
> mm/gup.c | 10 ++++++++--
> mm/huge_memory.c | 2 +-
> mm/hugetlb.c | 4 ++--
> mm/mlock.c | 2 +-
> mm/mmap.c | 2 +-
> mm/rmap.c | 4 ++--
> 11 files changed, 27 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
> index aab49ee..103a5f6 100644
> --- a/drivers/gpu/drm/drm_vm.c
> +++ b/drivers/gpu/drm/drm_vm.c
> @@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data)
> (void *)(unsigned long)virt_to_phys(high_memory));
>
> list_for_each_entry(pt, &dev->vmalist, head) {
> + char lock_flag = '-';
> +
> vma = pt->vma;
> if (!vma)
> continue;
> + if (vma->vm_flags & VM_LOCKONFAULT)
> + lock_flag = 'f';
> + else if (vma->vm_flags & VM_LOCKED)
> + lock_flag = 'l';
> seq_printf(m,
> "\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000",
> pt->pid,
> @@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data)
> vma->vm_flags & VM_WRITE ? 'w' : '-',
> vma->vm_flags & VM_EXEC ? 'x' : '-',
> vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
> - vma->vm_flags & VM_LOCKED ? 'l' : '-',
> + lock_flag,
> vma->vm_flags & VM_IO ? 'i' : '-',
> vma->vm_pgoff);
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ca1e091..38d69fc 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> #ifdef CONFIG_X86_INTEL_MPX
> [ilog2(VM_MPX)] = "mp",
> #endif
> + [ilog2(VM_LOCKONFAULT)] = "lf",
> [ilog2(VM_LOCKED)] = "lo",
> [ilog2(VM_IO)] = "io",
> [ilog2(VM_SEQ_READ)] = "sr",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2e872f9..c2f3551 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -127,6 +127,7 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
>
> +#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
> #define VM_LOCKED 0x00002000
> #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
>
> @@ -2043,6 +2044,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
> #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_MLOCK 0x1000 /* lock present pages */
>
> typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> void *data);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dbd9b8d..a949228 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -454,7 +454,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> tmp->vm_mm = mm;
> if (anon_vma_fork(tmp, mpnt))
> goto fail_nomem_anon_vma_fork;
> - tmp->vm_flags &= ~VM_LOCKED;
> + tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> tmp->vm_next = tmp->vm_prev = NULL;
> file = tmp->vm_file;
> if (file) {
> diff --git a/mm/debug.c b/mm/debug.c
> index 76089dd..25176bb 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -121,6 +121,7 @@ static const struct trace_print_flags vmaflags_names[] = {
> {VM_GROWSDOWN, "growsdown" },
> {VM_PFNMAP, "pfnmap" },
> {VM_DENYWRITE, "denywrite" },
> + {VM_LOCKONFAULT, "lockonfault" },
> {VM_LOCKED, "locked" },
> {VM_IO, "io" },
> {VM_SEQ_READ, "seqread" },
> diff --git a/mm/gup.c b/mm/gup.c
> index 6297f6b..e632908 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -92,7 +92,7 @@ retry:
> */
> mark_page_accessed(page);
> }
> - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
> + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
> /*
> * The preliminary mapping check is mainly to avoid the
> * pointless overhead of lock_page on the ZERO_PAGE
> @@ -265,6 +265,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> unsigned int fault_flags = 0;
> int ret;
>
> + /* mlock all present pages, but do not fault in new pages */
> + if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
> + return -ENOENT;
> /* For mm_populate(), just skip the stack guard page. */
> if ((*flags & FOLL_POPULATE) &&
> (stack_guard_page_start(vma, address) ||
> @@ -850,7 +853,10 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> VM_BUG_ON_VMA(end > vma->vm_end, vma);
> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>
> - gup_flags = FOLL_TOUCH | FOLL_POPULATE;
> + gup_flags = FOLL_TOUCH | FOLL_MLOCK;
> + if ((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED)
> + gup_flags |= FOLL_POPULATE;
> +
> /*
> * We want to touch writable mappings with a write fault in order
> * to break COW, except for shared mappings because these don't COW
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c107094..5e22d90 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1238,7 +1238,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> pmd, _pmd, 1))
> update_mmu_cache_pmd(vma, addr, pmd);
> }
> - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
> + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED )) {
^^^
Space befor ')'.
Otherwise:
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:42PM -0400, Eric B Munson wrote:
> The previous patch introduced a flag that specified pages in a VMA
> should be placed on the unevictable LRU, but they should not be made
> present when the area is created. This patch adds the ability to set
> this state via the new mlock system calls.
>
> We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
> MLOCK_ONFAULT will set the VM_LOCKONFAULT flag as well as the VM_LOCKED
> flag for the target region. MCL_CURRENT and MCL_ONFAULT are used to
> lock current mappings. With MCL_CURRENT all pages are made present and
> with MCL_ONFAULT they are locked when faulted in. When specified with
> MCL_FUTURE all new mappings will be marked with VM_LOCKONFAULT.
>
> Currently, mlockall() clears all VMA lock flags and then sets the
> requested flags. For instance, if a process has MCL_FUTURE and
> MCL_CURRENT set, but they want to clear MCL_FUTURE this would be
> accomplished by calling mlockall(MCL_CURRENT). This still holds with
> the introduction of MCL_ONFAULT. Each call to mlockall() resets all
> VMA flags to the values specified in the current call. The new mlock2
> system call behaves in the same way. If a region is locked with
> MLOCK_ONFAULT and a user wants to force it to be populated now, a second
> call to mlock2(MLOCK_LOCKED) will accomplish this.
>
> munlock() will unconditionally clear both vma flags. munlockall()
> unconditionally clears for VMA flags on all VMAs and in the
> mm->def_flags field.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> Changes from V4:
> * Split addition of VMA flag
>
> Changes from V3:
> * Do extensive search for VM_LOCKED and ensure that VM_LOCKONFAULT is also handled
> where appropriate
> arch/alpha/include/uapi/asm/mman.h | 2 ++
> arch/mips/include/uapi/asm/mman.h | 2 ++
> arch/parisc/include/uapi/asm/mman.h | 2 ++
> arch/powerpc/include/uapi/asm/mman.h | 2 ++
> arch/sparc/include/uapi/asm/mman.h | 2 ++
> arch/tile/include/uapi/asm/mman.h | 3 +++
> arch/xtensa/include/uapi/asm/mman.h | 2 ++
Again, you can save few lines by moving some code into mman-common.h.
Otherwise looks good.
--
Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> Now that we have the new VMA flag for the locked but not present state,
> expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
As I mentioned before, I don't think this interface is justified.
MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
affects MAP_LOCKONFAULT, but still.
Let's not add new interface unless it's demonstrably useful.
[1] http://lkml.kernel.org/g/[email protected]
--
Kirill A. Shutemov
On 07/24/2015 11:28 PM, Eric B Munson wrote:
...
> Changes from V4:
> Drop all architectures for new sys call entries except x86[_64] and MIPS
> Drop munlock2 and munlockall2
> Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
> Adjust tests to match
Hi, thanks for considering my suggestions. Well, I do hope there were
correct as API's are hard and I'm no API expert. But since API's are
also impossible to change after merging, I'm sorry but I'll keep
pestering for one last thing. Thanks again for persisting, I do believe
it's for the good thing!
The thing is that I still don't like that one has to call
mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why is
that flag needed? We have two modes of locking now, and v5 no longer
treats them separately in vma flags. But having two flags gives us four
possible combinations, so two of them would serve nothing but to confuse
the programmer IMHO. What will mlock2() without flags do? What will
mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do? (Note I haven't studied the
code yet, as having agreed on the API should come first. But I did
suggest documenting these things more thoroughly too...)
OK I checked now and both cases above seem to return EINVAL.
So about the only point I see in MLOCK_LOCKED flag is parity with
MAP_LOCKED for mmap(). But as Kirill said (and me before as well)
MAP_LOCKED is broken anyway so we shouldn't twist the rest just of the
API to keep the poor thing happier in its misery.
Also note that AFAICS you don't have MCL_LOCKED for mlockall() so
there's no full parity anyway. But please don't fix that by adding
MCL_LOCKED :)
Thanks!
On Mon, 27 Jul 2015, Vlastimil Babka wrote:
> On 07/24/2015 11:28 PM, Eric B Munson wrote:
>
> ...
>
> >Changes from V4:
> >Drop all architectures for new sys call entries except x86[_64] and MIPS
> >Drop munlock2 and munlockall2
> >Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
> >Adjust tests to match
>
> Hi, thanks for considering my suggestions. Well, I do hope there
> were correct as API's are hard and I'm no API expert. But since
> API's are also impossible to change after merging, I'm sorry but
> I'll keep pestering for one last thing. Thanks again for persisting,
> I do believe it's for the good thing!
>
> The thing is that I still don't like that one has to call
> mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why
> is that flag needed? We have two modes of locking now, and v5 no
> longer treats them separately in vma flags. But having two flags
> gives us four possible combinations, so two of them would serve
> nothing but to confuse the programmer IMHO. What will mlock2()
> without flags do? What will mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do?
> (Note I haven't studied the code yet, as having agreed on the API
> should come first. But I did suggest documenting these things more
> thoroughly too...)
> OK I checked now and both cases above seem to return EINVAL.
>
> So about the only point I see in MLOCK_LOCKED flag is parity with
> MAP_LOCKED for mmap(). But as Kirill said (and me before as well)
> MAP_LOCKED is broken anyway so we shouldn't twist the rest just of
> the API to keep the poor thing happier in its misery.
>
> Also note that AFAICS you don't have MCL_LOCKED for mlockall() so
> there's no full parity anyway. But please don't fix that by adding
> MCL_LOCKED :)
>
> Thanks!
I have an MLOCK_LOCKED flag because I prefer an interface to be
explicit. The caller of mlock2() will be required to fill in the flags
argument regardless. I can drop the MLOCK_LOCKED flag with 0 being the
value for LOCKED, but I thought it easier to make clear what was going
on at any call to mlock2(). If user space defines a MLOCK_LOCKED that
happens to be 0, I suppose that would be okay.
We do actually have an MCL_LOCKED, we just call it MCL_CURRENT. Would
you prefer that I match the name in mlock2() (add MLOCK_CURRENT
instead)?
Finally, on the question of MAP_LOCKONFAULT, do you just dislike
MAP_LOCKED and do not want to see it extended, or is this a NAK on the
set if that patch is included. I ask because I have to spin a V6 to get
the MLOCK flag declarations right, but I would prefer not to do a V7+.
If this is a NAK with, I can drop that patch and rework the tests to
cover without the mmap flag. Otherwise I want to keep it, I have an
internal user that would like to see it added.
On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
> On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> > The cost of faulting in all memory to be locked can be very high when
> > working with large mappings. If only portions of the mapping will be
> > used this can incur a high penalty for locking.
> >
> > Now that we have the new VMA flag for the locked but not present state,
> > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
>
> As I mentioned before, I don't think this interface is justified.
>
> MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
> affects MAP_LOCKONFAULT, but still.
>
> Let's not add new interface unless it's demonstrably useful.
>
> [1] http://lkml.kernel.org/g/[email protected]
I understand and should have been more explicit. This patch is still
included becuase I have an internal user that wants to see it added.
The problem discussed in the thread you point out does not affect
MAP_LOCKONFAULT because we do not attempt to populate the region with
MAP_LOCKONFAULT.
As I told Vlastimil, if this is a hard NAK with the patch I can work
with that. Otherwise I prefer it stays.
On Mon, Jul 27, 2015 at 09:41:26AM -0400, Eric B Munson wrote:
> On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
>
> > On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> > > The cost of faulting in all memory to be locked can be very high when
> > > working with large mappings. If only portions of the mapping will be
> > > used this can incur a high penalty for locking.
> > >
> > > Now that we have the new VMA flag for the locked but not present state,
> > > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> >
> > As I mentioned before, I don't think this interface is justified.
> >
> > MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
> > affects MAP_LOCKONFAULT, but still.
> >
> > Let's not add new interface unless it's demonstrably useful.
> >
> > [1] http://lkml.kernel.org/g/[email protected]
>
> I understand and should have been more explicit. This patch is still
> included becuase I have an internal user that wants to see it added.
> The problem discussed in the thread you point out does not affect
> MAP_LOCKONFAULT because we do not attempt to populate the region with
> MAP_LOCKONFAULT.
>
> As I told Vlastimil, if this is a hard NAK with the patch I can work
> with that. Otherwise I prefer it stays.
That's not how it works.
Once an ABI added to the kernel it stays there practically forever.
Therefore it must be useful to justify maintenance cost. I don't see it
demonstrated.
So, NAK.
--
Kirill A. Shutemov
On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
> On Mon, Jul 27, 2015 at 09:41:26AM -0400, Eric B Munson wrote:
> > On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
> >
> > > On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
> > > > The cost of faulting in all memory to be locked can be very high when
> > > > working with large mappings. If only portions of the mapping will be
> > > > used this can incur a high penalty for locking.
> > > >
> > > > Now that we have the new VMA flag for the locked but not present state,
> > > > expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
> > >
> > > As I mentioned before, I don't think this interface is justified.
> > >
> > > MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
> > > affects MAP_LOCKONFAULT, but still.
> > >
> > > Let's not add new interface unless it's demonstrably useful.
> > >
> > > [1] http://lkml.kernel.org/g/[email protected]
> >
> > I understand and should have been more explicit. This patch is still
> > included becuase I have an internal user that wants to see it added.
> > The problem discussed in the thread you point out does not affect
> > MAP_LOCKONFAULT because we do not attempt to populate the region with
> > MAP_LOCKONFAULT.
> >
> > As I told Vlastimil, if this is a hard NAK with the patch I can work
> > with that. Otherwise I prefer it stays.
>
> That's not how it works.
I am not sure what you mean here. I have a user that will find this
useful and MAP_LOCKONFAULT does not suffer from the problem you point
out. I do not understand your NAK but thank you for explicit about it.
>
> Once an ABI added to the kernel it stays there practically forever.
> Therefore it must be useful to justify maintenance cost. I don't see it
> demonstrated.
I understand this, and I get that you do not like MAP_LOCKED, but I do
not see how your dislike for MAP_LOCKED means that this would not be
useful.
>
> So, NAK.
>
V6 will not have the new mmap flag unless there is someone else that
speaks up in favor of keeping it.
On 07/27/2015 03:35 PM, Eric B Munson wrote:
> On Mon, 27 Jul 2015, Vlastimil Babka wrote:
>
>> On 07/24/2015 11:28 PM, Eric B Munson wrote:
>>
>> ...
>>
>>> Changes from V4:
>>> Drop all architectures for new sys call entries except x86[_64] and MIPS
>>> Drop munlock2 and munlockall2
>>> Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
>>> Adjust tests to match
>>
>> Hi, thanks for considering my suggestions. Well, I do hope there
>> were correct as API's are hard and I'm no API expert. But since
>> API's are also impossible to change after merging, I'm sorry but
>> I'll keep pestering for one last thing. Thanks again for persisting,
>> I do believe it's for the good thing!
>>
>> The thing is that I still don't like that one has to call
>> mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why
>> is that flag needed? We have two modes of locking now, and v5 no
>> longer treats them separately in vma flags. But having two flags
>> gives us four possible combinations, so two of them would serve
>> nothing but to confuse the programmer IMHO. What will mlock2()
>> without flags do? What will mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do?
>> (Note I haven't studied the code yet, as having agreed on the API
>> should come first. But I did suggest documenting these things more
>> thoroughly too...)
>> OK I checked now and both cases above seem to return EINVAL.
>>
>> So about the only point I see in MLOCK_LOCKED flag is parity with
>> MAP_LOCKED for mmap(). But as Kirill said (and me before as well)
>> MAP_LOCKED is broken anyway so we shouldn't twist the rest just of
>> the API to keep the poor thing happier in its misery.
>>
>> Also note that AFAICS you don't have MCL_LOCKED for mlockall() so
>> there's no full parity anyway. But please don't fix that by adding
>> MCL_LOCKED :)
>>
>> Thanks!
>
>
> I have an MLOCK_LOCKED flag because I prefer an interface to be
> explicit.
I think it's already explicit enough that the user calls mlock2(), no?
He obviously wants the range mlocked. An optional flag says that there
should be no pre-fault.
> The caller of mlock2() will be required to fill in the flags
> argument regardless.
I guess users not caring about MLOCK_ONFAULT will continue using plain
mlock() without flags anyway.
I can drop the MLOCK_LOCKED flag with 0 being the
> value for LOCKED, but I thought it easier to make clear what was going
> on at any call to mlock2(). If user space defines a MLOCK_LOCKED that
> happens to be 0, I suppose that would be okay.
Yeah that would remove the weird 4-states-of-which-2-are-invalid problem
I mentioned, but at the cost of glibc wrapper behaving differently than
the kernel syscall itself. For little gain.
> We do actually have an MCL_LOCKED, we just call it MCL_CURRENT. Would
> you prefer that I match the name in mlock2() (add MLOCK_CURRENT
> instead)?
Hm it's similar but not exactly the same, because MCL_FUTURE is not the
same as MLOCK_ONFAULT :) So MLOCK_CURRENT would be even more confusing.
Especially if mlockall(MCL_CURRENT | MCL_FUTURE) is OK, but
mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) is invalid.
> Finally, on the question of MAP_LOCKONFAULT, do you just dislike
> MAP_LOCKED and do not want to see it extended, or is this a NAK on the
> set if that patch is included. I ask because I have to spin a V6 to get
> the MLOCK flag declarations right, but I would prefer not to do a V7+.
> If this is a NAK with, I can drop that patch and rework the tests to
> cover without the mmap flag. Otherwise I want to keep it, I have an
> internal user that would like to see it added.
I don't want to NAK that patch if you think it's useful.
On Mon, 27 Jul 2015, Vlastimil Babka wrote:
> On 07/27/2015 03:35 PM, Eric B Munson wrote:
> >On Mon, 27 Jul 2015, Vlastimil Babka wrote:
> >
> >>On 07/24/2015 11:28 PM, Eric B Munson wrote:
> >>
> >>...
> >>
> >>>Changes from V4:
> >>>Drop all architectures for new sys call entries except x86[_64] and MIPS
> >>>Drop munlock2 and munlockall2
> >>>Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
> >>>Adjust tests to match
> >>
> >>Hi, thanks for considering my suggestions. Well, I do hope there
> >>were correct as API's are hard and I'm no API expert. But since
> >>API's are also impossible to change after merging, I'm sorry but
> >>I'll keep pestering for one last thing. Thanks again for persisting,
> >>I do believe it's for the good thing!
> >>
> >>The thing is that I still don't like that one has to call
> >>mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why
> >>is that flag needed? We have two modes of locking now, and v5 no
> >>longer treats them separately in vma flags. But having two flags
> >>gives us four possible combinations, so two of them would serve
> >>nothing but to confuse the programmer IMHO. What will mlock2()
> >>without flags do? What will mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do?
> >>(Note I haven't studied the code yet, as having agreed on the API
> >>should come first. But I did suggest documenting these things more
> >>thoroughly too...)
> >>OK I checked now and both cases above seem to return EINVAL.
> >>
> >>So about the only point I see in MLOCK_LOCKED flag is parity with
> >>MAP_LOCKED for mmap(). But as Kirill said (and me before as well)
> >>MAP_LOCKED is broken anyway so we shouldn't twist the rest just of
> >>the API to keep the poor thing happier in its misery.
> >>
> >>Also note that AFAICS you don't have MCL_LOCKED for mlockall() so
> >>there's no full parity anyway. But please don't fix that by adding
> >>MCL_LOCKED :)
> >>
> >>Thanks!
> >
> >
> >I have an MLOCK_LOCKED flag because I prefer an interface to be
> >explicit.
>
> I think it's already explicit enough that the user calls mlock2(),
> no? He obviously wants the range mlocked. An optional flag says that
> there should be no pre-fault.
>
> >The caller of mlock2() will be required to fill in the flags
> >argument regardless.
>
> I guess users not caring about MLOCK_ONFAULT will continue using
> plain mlock() without flags anyway.
>
> I can drop the MLOCK_LOCKED flag with 0 being the
> >value for LOCKED, but I thought it easier to make clear what was going
> >on at any call to mlock2(). If user space defines a MLOCK_LOCKED that
> >happens to be 0, I suppose that would be okay.
>
> Yeah that would remove the weird 4-states-of-which-2-are-invalid
> problem I mentioned, but at the cost of glibc wrapper behaving
> differently than the kernel syscall itself. For little gain.
>
> >We do actually have an MCL_LOCKED, we just call it MCL_CURRENT. Would
> >you prefer that I match the name in mlock2() (add MLOCK_CURRENT
> >instead)?
>
> Hm it's similar but not exactly the same, because MCL_FUTURE is not
> the same as MLOCK_ONFAULT :) So MLOCK_CURRENT would be even more
> confusing. Especially if mlockall(MCL_CURRENT | MCL_FUTURE) is OK,
> but mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) is invalid.
MLOCK_ONFAULT isn't meant to be the same as MCL_FUTURE, rather it is
meant to be the same as MCL_ONFAULT. MCL_FUTURE only controls if the
locking policy will be applied to any new mappings made by this process,
not the locking policy itself. The better comparison is MCL_CURRENT to
MLOCK_LOCK and MCL_ONFAULT to MLOCK_ONFAULT. MCL_CURRENT and
MLOCK_LOCK do the same thing, only one requires a specific range of
addresses while the other works process wide. This is why I suggested
changing MLOCK_LOCK to MLOCK_CURRENT. It is an error to call
mlock2(MLOCK_LOCK | MLOCK_ONFAULT) just like it is an error to call
mlockall(MCL_CURRENT | MCL_ONFAULT). The combinations do no make sense.
This was all decided when VM_LOCKONFAULT was a separate state from
VM_LOCKED. Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
cannot be specified independentally, it might make more sense to mirror
that relationship to userspace. Which would lead to soemthing like the
following:
To lock and populate a region:
mlock2(start, len, 0);
To lock on fault a region:
mlock2(start, len, MLOCK_ONFAULT);
If LOCKONFAULT is seen as a modifier to mlock, then having the flags
argument as 0 mean do mlock classic makes more sense to me.
To mlock current on fault only:
mlockall(MCL_CURRENT | MCL_ONFAULT);
To mlock future on fault only:
mlockall(MCL_FUTURE | MCL_ONFAULT);
To lock everything on fault:
mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
I think I have talked myself into rewriting the set again :/
>
> >Finally, on the question of MAP_LOCKONFAULT, do you just dislike
> >MAP_LOCKED and do not want to see it extended, or is this a NAK on the
> >set if that patch is included. I ask because I have to spin a V6 to get
> >the MLOCK flag declarations right, but I would prefer not to do a V7+.
> >If this is a NAK with, I can drop that patch and rework the tests to
> >cover without the mmap flag. Otherwise I want to keep it, I have an
> >internal user that would like to see it added.
>
> I don't want to NAK that patch if you think it's useful.
>
>
On 07/27/2015 04:54 PM, Eric B Munson wrote:
> On Mon, 27 Jul 2015, Vlastimil Babka wrote:
>>
>>> We do actually have an MCL_LOCKED, we just call it MCL_CURRENT. Would
>>> you prefer that I match the name in mlock2() (add MLOCK_CURRENT
>>> instead)?
>>
>> Hm it's similar but not exactly the same, because MCL_FUTURE is not
>> the same as MLOCK_ONFAULT :) So MLOCK_CURRENT would be even more
>> confusing. Especially if mlockall(MCL_CURRENT | MCL_FUTURE) is OK,
>> but mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) is invalid.
>
> MLOCK_ONFAULT isn't meant to be the same as MCL_FUTURE, rather it is
> meant to be the same as MCL_ONFAULT. MCL_FUTURE only controls if the
> locking policy will be applied to any new mappings made by this process,
> not the locking policy itself. The better comparison is MCL_CURRENT to
> MLOCK_LOCK and MCL_ONFAULT to MLOCK_ONFAULT. MCL_CURRENT and
> MLOCK_LOCK do the same thing, only one requires a specific range of
> addresses while the other works process wide. This is why I suggested
> changing MLOCK_LOCK to MLOCK_CURRENT. It is an error to call
> mlock2(MLOCK_LOCK | MLOCK_ONFAULT) just like it is an error to call
> mlockall(MCL_CURRENT | MCL_ONFAULT). The combinations do no make sense.
How is it an error to call mlockall(MCL_CURRENT | MCL_ONFAULT)? How else
would you apply mlock2(MCL_ONFAULT) to all current mappings? Later below
you use the same example and I don't think it's different by removing
MLOCK_LOCKED flag.
> This was all decided when VM_LOCKONFAULT was a separate state from
> VM_LOCKED. Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
> cannot be specified independentally, it might make more sense to mirror
> that relationship to userspace. Which would lead to soemthing like the
> following:
>
> To lock and populate a region:
> mlock2(start, len, 0);
>
> To lock on fault a region:
> mlock2(start, len, MLOCK_ONFAULT);
>
> If LOCKONFAULT is seen as a modifier to mlock, then having the flags
> argument as 0 mean do mlock classic makes more sense to me.
Yup that's what I was trying to suggest.
> To mlock current on fault only:
> mlockall(MCL_CURRENT | MCL_ONFAULT);
>
> To mlock future on fault only:
> mlockall(MCL_FUTURE | MCL_ONFAULT);
>
> To lock everything on fault:
> mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
>
> I think I have talked myself into rewriting the set again :/
Sorry :) You could also wait a bit for more input than just from me...
>>
>>> Finally, on the question of MAP_LOCKONFAULT, do you just dislike
>>> MAP_LOCKED and do not want to see it extended, or is this a NAK on the
>>> set if that patch is included. I ask because I have to spin a V6 to get
>>> the MLOCK flag declarations right, but I would prefer not to do a V7+.
>>> If this is a NAK with, I can drop that patch and rework the tests to
>>> cover without the mmap flag. Otherwise I want to keep it, I have an
>>> internal user that would like to see it added.
>>
>> I don't want to NAK that patch if you think it's useful.
>>
>>
[I am sorry but I didn't get to this sooner.]
On Mon 27-07-15 10:54:09, Eric B Munson wrote:
> Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
> cannot be specified independentally, it might make more sense to mirror
> that relationship to userspace. Which would lead to soemthing like the
> following:
A modifier makes more sense.
> To lock and populate a region:
> mlock2(start, len, 0);
>
> To lock on fault a region:
> mlock2(start, len, MLOCK_ONFAULT);
>
> If LOCKONFAULT is seen as a modifier to mlock, then having the flags
> argument as 0 mean do mlock classic makes more sense to me.
>
> To mlock current on fault only:
> mlockall(MCL_CURRENT | MCL_ONFAULT);
>
> To mlock future on fault only:
> mlockall(MCL_FUTURE | MCL_ONFAULT);
>
> To lock everything on fault:
> mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
Makes sense to me. The only remaining and still tricky part would be
the munlock{all}(flags) behavior. What should munlock(MLOCK_ONFAULT)
do? Keep locked and poppulate the range or simply ignore the flag an
just unlock?
I can see some sense to allow munlockall(MCL_FUTURE[|MLOCK_ONFAULT]),
munlockall(MCL_CURRENT) resp. munlockall(MCL_CURRENT|MCL_FUTURE) but
other combinations sound weird to me.
Anyway munlock with flags opens new doors of trickiness.
--
Michal Hocko
SUSE Labs
On 07/28/2015 01:17 PM, Michal Hocko wrote:
> [I am sorry but I didn't get to this sooner.]
>
> On Mon 27-07-15 10:54:09, Eric B Munson wrote:
>> Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
>> cannot be specified independentally, it might make more sense to mirror
>> that relationship to userspace. Which would lead to soemthing like the
>> following:
>
> A modifier makes more sense.
>
>> To lock and populate a region:
>> mlock2(start, len, 0);
>>
>> To lock on fault a region:
>> mlock2(start, len, MLOCK_ONFAULT);
>>
>> If LOCKONFAULT is seen as a modifier to mlock, then having the flags
>> argument as 0 mean do mlock classic makes more sense to me.
>>
>> To mlock current on fault only:
>> mlockall(MCL_CURRENT | MCL_ONFAULT);
>>
>> To mlock future on fault only:
>> mlockall(MCL_FUTURE | MCL_ONFAULT);
>>
>> To lock everything on fault:
>> mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
>
> Makes sense to me. The only remaining and still tricky part would be
> the munlock{all}(flags) behavior. What should munlock(MLOCK_ONFAULT)
> do? Keep locked and poppulate the range or simply ignore the flag an
> just unlock?
munlock(all) already lost both MLOCK_LOCKED and MLOCK_ONFAULT flags in
this revision, so I suppose in the next revision it will also not accept
MLOCK_ONFAULT, and will just munlock whatever was mlocked in either mode.
> I can see some sense to allow munlockall(MCL_FUTURE[|MLOCK_ONFAULT]),
> munlockall(MCL_CURRENT) resp. munlockall(MCL_CURRENT|MCL_FUTURE) but
> other combinations sound weird to me.
The effect of munlockall(MCL_FUTURE|MLOCK_ONFAULT), which you probably
intended for converting the onfault to full prepopulation for future
mappings, can be achieved by calling mlockall(MCL_FUTURE) (without
MLOCK_ONFAULT).
> Anyway munlock with flags opens new doors of trickiness.
On Tue, 28 Jul 2015, Michal Hocko wrote:
> [I am sorry but I didn't get to this sooner.]
>
> On Mon 27-07-15 10:54:09, Eric B Munson wrote:
> > Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
> > cannot be specified independentally, it might make more sense to mirror
> > that relationship to userspace. Which would lead to soemthing like the
> > following:
>
> A modifier makes more sense.
>
> > To lock and populate a region:
> > mlock2(start, len, 0);
> >
> > To lock on fault a region:
> > mlock2(start, len, MLOCK_ONFAULT);
> >
> > If LOCKONFAULT is seen as a modifier to mlock, then having the flags
> > argument as 0 mean do mlock classic makes more sense to me.
> >
> > To mlock current on fault only:
> > mlockall(MCL_CURRENT | MCL_ONFAULT);
> >
> > To mlock future on fault only:
> > mlockall(MCL_FUTURE | MCL_ONFAULT);
> >
> > To lock everything on fault:
> > mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
>
> Makes sense to me. The only remaining and still tricky part would be
> the munlock{all}(flags) behavior. What should munlock(MLOCK_ONFAULT)
> do? Keep locked and poppulate the range or simply ignore the flag an
> just unlock?
>
> I can see some sense to allow munlockall(MCL_FUTURE[|MLOCK_ONFAULT]),
> munlockall(MCL_CURRENT) resp. munlockall(MCL_CURRENT|MCL_FUTURE) but
> other combinations sound weird to me.
>
> Anyway munlock with flags opens new doors of trickiness.
In the current revision there are no new munlock[all] system calls
introduced. munlockall() unconditionally cleared both MCL_CURRENT and
MCL_FUTURE before the set and now unconditionally clears all three.
munlock() does the same for VM_LOCK and VM_LOCKONFAULT. If the user
wants to adjust mlockall flags today, they need to call mlockall a
second time with the new flags, this remains true for mlockall after
this set and the same behavior is mirrored in mlock2. The only
remaining question I have is should we have 2 new mlockall flags so that
the caller can explicitly set VM_LOCKONFAULT in the mm->def_flags vs
locking all current VMAs on fault. I ask because if the user wants to
lock all current VMAs the old way, but all future VMAs on fault they
have to call mlockall() twice:
mlockall(MCL_CURRENT);
mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
This has the side effect of converting all the current VMAs to
VM_LOCKONFAULT, but because they were all made present and locked in the
first call, this should not matter in most cases. The catch is that,
like mmap(MAP_LOCKED), mlockall() does not communicate if mm_populate()
fails. This has been true of mlockall() from the beginning so I don't
know if it needs more than an entry in the man page to clarify (which I
will add when I add documentation for MCL_ONFAULT). In a much less
likely corner case, it is not possible in the current setup to request
all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
On 07/28/2015 03:49 PM, Eric B Munson wrote:
> On Tue, 28 Jul 2015, Michal Hocko wrote:
>
[...]
> The only
> remaining question I have is should we have 2 new mlockall flags so that
> the caller can explicitly set VM_LOCKONFAULT in the mm->def_flags vs
> locking all current VMAs on fault. I ask because if the user wants to
> lock all current VMAs the old way, but all future VMAs on fault they
> have to call mlockall() twice:
>
> mlockall(MCL_CURRENT);
> mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
>
> This has the side effect of converting all the current VMAs to
> VM_LOCKONFAULT, but because they were all made present and locked in the
> first call, this should not matter in most cases.
Shouldn't the user be able to do this?
mlockall(MCL_CURRENT)
mlockall(MCL_FUTURE | MCL_ONFAULT);
Note that the second call shouldn't change (i.e. munlock) existing vma's
just because MCL_CURRENT is not present. The current implementation
doesn't do that thanks to the following in do_mlockall():
if (flags == MCL_FUTURE)
goto out;
before current vma's are processed and MCL_CURRENT is checked. This is
probably so that do_mlockall() can also handle the munlockall() syscall.
So we should be careful not to break this, but otherwise there are no
limitations by not having two MCL_ONFAULT flags. Having to do invoke
syscalls instead of one is not an issue as this shouldn't be frequent
syscall.
> The catch is that,
> like mmap(MAP_LOCKED), mlockall() does not communicate if mm_populate()
> fails. This has been true of mlockall() from the beginning so I don't
> know if it needs more than an entry in the man page to clarify (which I
> will add when I add documentation for MCL_ONFAULT).
Good point.
> In a much less
> likely corner case, it is not possible in the current setup to request
> all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
So again this should work:
mlockall(MCL_CURRENT | MCL_ONFAULT)
mlockall(MCL_FUTURE);
But the order matters here, as current implementation of do_mlockall()
will clear VM_LOCKED from def_flags if MCL_FUTURE is not passed. So
*it's different* from how it handles MCL_CURRENT (as explained above).
And not documented in manpage. Oh crap, this API is a closet full of
skeletons. Maybe it was an unnoticed regression and we can restore some
sanity?
On Tue, 28 Jul 2015, Vlastimil Babka wrote:
> On 07/28/2015 03:49 PM, Eric B Munson wrote:
> >On Tue, 28 Jul 2015, Michal Hocko wrote:
> >
>
> [...]
>
> >The only
> >remaining question I have is should we have 2 new mlockall flags so that
> >the caller can explicitly set VM_LOCKONFAULT in the mm->def_flags vs
> >locking all current VMAs on fault. I ask because if the user wants to
> >lock all current VMAs the old way, but all future VMAs on fault they
> >have to call mlockall() twice:
> >
> > mlockall(MCL_CURRENT);
> > mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
> >
> >This has the side effect of converting all the current VMAs to
> >VM_LOCKONFAULT, but because they were all made present and locked in the
> >first call, this should not matter in most cases.
>
> Shouldn't the user be able to do this?
>
> mlockall(MCL_CURRENT)
> mlockall(MCL_FUTURE | MCL_ONFAULT);
>
> Note that the second call shouldn't change (i.e. munlock) existing
> vma's just because MCL_CURRENT is not present. The current
> implementation doesn't do that thanks to the following in
> do_mlockall():
>
> if (flags == MCL_FUTURE)
> goto out;
>
> before current vma's are processed and MCL_CURRENT is checked. This
> is probably so that do_mlockall() can also handle the munlockall()
> syscall.
> So we should be careful not to break this, but otherwise there are
> no limitations by not having two MCL_ONFAULT flags. Having to do
> invoke syscalls instead of one is not an issue as this shouldn't be
> frequent syscall.
Good catch, my current implementation did break this and is now fixed.
>
> >The catch is that,
> >like mmap(MAP_LOCKED), mlockall() does not communicate if mm_populate()
> >fails. This has been true of mlockall() from the beginning so I don't
> >know if it needs more than an entry in the man page to clarify (which I
> >will add when I add documentation for MCL_ONFAULT).
>
> Good point.
>
> >In a much less
> >likely corner case, it is not possible in the current setup to request
> >all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
>
> So again this should work:
>
> mlockall(MCL_CURRENT | MCL_ONFAULT)
> mlockall(MCL_FUTURE);
>
> But the order matters here, as current implementation of
> do_mlockall() will clear VM_LOCKED from def_flags if MCL_FUTURE is
> not passed. So *it's different* from how it handles MCL_CURRENT (as
> explained above). And not documented in manpage. Oh crap, this API
> is a closet full of skeletons. Maybe it was an unnoticed regression
> and we can restore some sanity?
I will add a note about the ordering problem to the manpage as well.
Unfortunately, the basic idea of clearing VM_LOCKED from mm->def_flags
if MCL_FUTURE is not specified but not doing the same for MCL_CURRENT
predates the move to git, so I am not sure if it was ever different.
On Tue 28-07-15 09:49:42, Eric B Munson wrote:
> On Tue, 28 Jul 2015, Michal Hocko wrote:
>
> > [I am sorry but I didn't get to this sooner.]
> >
> > On Mon 27-07-15 10:54:09, Eric B Munson wrote:
> > > Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
> > > cannot be specified independentally, it might make more sense to mirror
> > > that relationship to userspace. Which would lead to soemthing like the
> > > following:
> >
> > A modifier makes more sense.
> >
> > > To lock and populate a region:
> > > mlock2(start, len, 0);
> > >
> > > To lock on fault a region:
> > > mlock2(start, len, MLOCK_ONFAULT);
> > >
> > > If LOCKONFAULT is seen as a modifier to mlock, then having the flags
> > > argument as 0 mean do mlock classic makes more sense to me.
> > >
> > > To mlock current on fault only:
> > > mlockall(MCL_CURRENT | MCL_ONFAULT);
> > >
> > > To mlock future on fault only:
> > > mlockall(MCL_FUTURE | MCL_ONFAULT);
> > >
> > > To lock everything on fault:
> > > mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
> >
> > Makes sense to me. The only remaining and still tricky part would be
> > the munlock{all}(flags) behavior. What should munlock(MLOCK_ONFAULT)
> > do? Keep locked and poppulate the range or simply ignore the flag an
> > just unlock?
> >
> > I can see some sense to allow munlockall(MCL_FUTURE[|MLOCK_ONFAULT]),
> > munlockall(MCL_CURRENT) resp. munlockall(MCL_CURRENT|MCL_FUTURE) but
> > other combinations sound weird to me.
> >
> > Anyway munlock with flags opens new doors of trickiness.
>
> In the current revision there are no new munlock[all] system calls
> introduced. munlockall() unconditionally cleared both MCL_CURRENT and
> MCL_FUTURE before the set and now unconditionally clears all three.
> munlock() does the same for VM_LOCK and VM_LOCKONFAULT.
OK if new munlock{all}(flags) is not introduced then this is much saner
IMO.
> If the user
> wants to adjust mlockall flags today, they need to call mlockall a
> second time with the new flags, this remains true for mlockall after
> this set and the same behavior is mirrored in mlock2.
OK, this makes sense to me.
> The only
> remaining question I have is should we have 2 new mlockall flags so that
> the caller can explicitly set VM_LOCKONFAULT in the mm->def_flags vs
> locking all current VMAs on fault. I ask because if the user wants to
> lock all current VMAs the old way, but all future VMAs on fault they
> have to call mlockall() twice:
>
> mlockall(MCL_CURRENT);
> mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);
>
> This has the side effect of converting all the current VMAs to
> VM_LOCKONFAULT, but because they were all made present and locked in the
> first call, this should not matter in most cases.
I think this is OK (worth documenting though) considering that ONFAULT
is just modifier for the current mlock* operation. The memory is locked
the same way for both - aka once the memory is present you do not know
whether it was done during mlock call or later during the fault.
> The catch is that,
> like mmap(MAP_LOCKED), mlockall() does not communicate if mm_populate()
> fails. This has been true of mlockall() from the beginning so I don't
> know if it needs more than an entry in the man page to clarify (which I
> will add when I add documentation for MCL_ONFAULT).
Yes this is true but unlike mmap it seems fixable I guess. We do not have
to unmap and we can downgrade mmap_sem to read and the fault so nobody
can race with a concurent mlock.
> In a much less
> likely corner case, it is not possible in the current setup to request
> all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
Vlastimil has already pointed that out. MCL_FUTURE doesn't clear
MCL_CURRENT. I was quite surprised in the beginning but it makes a
perfect sense. mlockall call shouldn't lead into munlocking, that would
be just weird. Clearing MCL_FUTURE on MCL_CURRENT makes sense on the
other hand because the request is explicit about _current_ memory and it
doesn't lead to any munlocking.
--
Michal Hocko
SUSE Labs
On 07/29/2015 12:45 PM, Michal Hocko wrote:
>> In a much less
>> likely corner case, it is not possible in the current setup to request
>> all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
>
> Vlastimil has already pointed that out. MCL_FUTURE doesn't clear
> MCL_CURRENT. I was quite surprised in the beginning but it makes a
> perfect sense. mlockall call shouldn't lead into munlocking, that would
> be just weird. Clearing MCL_FUTURE on MCL_CURRENT makes sense on the
> other hand because the request is explicit about _current_ memory and it
> doesn't lead to any munlocking.
Yeah after more thinking it does make some sense despite the perceived
inconsistency, but it's definitely worth documenting properly. It also already
covers the usecase for munlockall2(MCL_FUTURE) which IIRC you had in the earlier
revisions...