2015-06-10 13:27:15

by Eric B Munson

[permalink] [raw]
Subject: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

mlock() allows a user to control page out of program memory, but this
comes at the cost of faulting in the entire mapping when it is
allocated. For large mappings where the entire area is not needed,
this is not ideal.

This series introduces new flags for mmap() and mlockall() that allow a
user to specify that the covered area should not be paged out, but only
after the memory has been touched for the first time.

There are two main use cases that this set covers. The first is the
security-focused mlock case: a buffer is needed that cannot be written
to swap. The maximum size is known, but on average the memory used is
significantly less than this maximum. With lock on fault, the buffer
is guaranteed to never be paged out, without consuming the maximum size
every time such a buffer is created.

The second use case is focused on performance. Portions of a large
file are needed and we want to keep the used portions in memory once
accessed. This is the case for large graphical models where the path
through the graph is not known until run time. The entire graph is
unlikely to be used in a given invocation, but once a node has been
used it needs to stay resident for further processing. Given these
constraints we have a number of options. We can potentially waste a
large amount of memory by mlocking the entire region (this can also
cause a significant stall at startup as the entire file is read in).
We can mlock each page as it is accessed, without tracking whether the
page is already resident, but this introduces a large overhead on each access.
The third option is mapping the entire region with PROT_NONE and using
a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the
needed page. Doing this page at a time adds a significant performance
penalty. Batching can be used to mitigate this overhead, but in order
to safely avoid trying to mprotect pages outside of the mapping, the
boundaries of each mapping to be used in this way must be tracked and
available to the signal handler. This is precisely what the mm system
in the kernel should already be doing.

For mmap(MAP_LOCKONFAULT) the user is charged against RLIMIT_MEMLOCK
as if MAP_LOCKED was used, that is, when the VMA is created rather than
when the pages are faulted in. For mlockall(MCL_ONFAULT) the user is
charged as if MCL_FUTURE was used. This decision was made to keep the
accounting checks out of the page fault path.

To illustrate the benefit of this patch I wrote a test program that
mmaps a 5 GB file filled with random data and then makes 15,000,000
accesses to random addresses in that mapping. The test program was run
20 times for each setup. Results are reported for two phases, setup
and processing. The setup phase is calling mmap() and optionally
mlock() on the entire region. For most experiments this is trivial, but
it highlights the cost of faulting in the entire region. Results are
averages across the 20 runs in milliseconds.

mmap with MAP_LOCKED:
Setup avg: 11821.193
Processing avg: 3404.286

mmap with mlock() before each access:
Setup avg: 0.054
Processing avg: 34263.201

mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value in max_map_count, this gets ENOMEM as I attempt
to change the permissions; after upping the sysctl significantly I get:
Setup avg: 0.050
Processing avg: 67690.625

mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg: 0.098
Processing avg: 37344.197

mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg: 0.0548
Processing avg: 29295.669

mmap with MAP_LOCKONFAULT:
Setup avg: 0.073
Processing avg: 18392.136

The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping. The
first step covers the page that caused the fault as we know that it will
be possible to lock. The second step speculatively tries to mlock and
mprotect the (batch size - 1) pages that follow. There may be a clever
way to avoid this without having the program track each mapping to be
covered by this handler in a globally accessible structure, but I could
not find it. It should be noted that with a large enough batch size
this two step fault handler can still cause the program to crash if it
reaches far beyond the end of the mapping.

These results show that if the developer knows that a majority of the
mapping will be used, it is better to fault it in all at once;
otherwise MAP_LOCKONFAULT is significantly faster.

The performance cost of these patches is minimal on the two benchmarks
I have tested (stream and kernbench). The following are the average
values across 20 runs of each benchmark after a warmup run whose
results were discarded.

Avg throughput in MB/s from stream using 1000000 element arrays
Test 4.1-rc2 4.1-rc2+lock-on-fault
Copy: 10,979.08 10,917.34
Scale: 11,094.45 11,023.01
Add: 12,487.29 12,388.65
Triad: 12,505.77 12,418.78

Kernbench optimal load
4.1-rc2 4.1-rc2+lock-on-fault
Elapsed Time 71.046 71.324
User Time 62.117 62.352
System Time 8.926 8.969
Context Switches 14531.9 14542.5
Sleeps 14935.9 14939

Eric B Munson (3):
Add mmap flag to request pages are locked after page fault
Add mlockall flag for locking pages on fault
Add tests for lock on fault

arch/alpha/include/uapi/asm/mman.h | 2 +
arch/mips/include/uapi/asm/mman.h | 2 +
arch/parisc/include/uapi/asm/mman.h | 2 +
arch/powerpc/include/uapi/asm/mman.h | 2 +
arch/sparc/include/uapi/asm/mman.h | 2 +
arch/tile/include/uapi/asm/mman.h | 2 +
arch/xtensa/include/uapi/asm/mman.h | 2 +
include/linux/mm.h | 1 +
include/linux/mman.h | 3 +-
include/uapi/asm-generic/mman.h | 2 +
mm/mlock.c | 13 ++-
mm/mmap.c | 4 +-
mm/swap.c | 3 +-
tools/testing/selftests/vm/Makefile | 8 +-
tools/testing/selftests/vm/lock-on-fault.c | 145 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++++++++
tools/testing/selftests/vm/run_vmtests | 23 +++++
17 files changed, 254 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c

Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

--
1.9.1


2015-06-10 13:27:31

by Eric B Munson

[permalink] [raw]
Subject: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used this can incur a high penalty for locking.

For the large-file example, this is the usage pattern for a large
statistical language model (and probably applies to other statistical
or graphical models as well). For the security example, consider any
application transacting in data that cannot be swapped out (credit
card data, medical records, etc.).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in.

To keep accounting checks out of the page fault path, users are billed
for the entire mapping lock as if MAP_LOCKED was used.

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/uapi/asm/mman.h | 1 +
arch/sparc/include/uapi/asm/mman.h | 1 +
arch/tile/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/mm.h | 1 +
include/linux/mman.h | 3 ++-
include/uapi/asm-generic/mman.h | 1 +
mm/mmap.c | 4 ++--
mm/swap.c | 3 ++-
12 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..15e96e1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..47846a5 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */

/*
* Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..1514cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */

#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..fce74fe 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -27,5 +27,6 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */

#endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..12425d8 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */


#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..ec04eaf 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_HUGETLB 0x4000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x8000 /* Lock pages after they are faulted in, do not prefault */


/*
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..42d43cc 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9f..3e31457 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */

+#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..437264b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
}

unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..fc4e586 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */

/* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */

diff --git a/mm/mmap.c b/mm/mmap.c
index bb50cac..ba1a6bf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
unsigned long locked, lock_limit;

/* mlock MCL_FUTURE? */
- if (flags & VM_LOCKED) {
+ if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
locked = len >> PAGE_SHIFT;
locked += mm->locked_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

- if (flags & MAP_LOCKED)
+ if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
if (!can_do_mlock())
return -EPERM;

diff --git a/mm/swap.c b/mm/swap.c
index a7251a8..07c905e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
{
VM_BUG_ON_PAGE(PageLRU(page), page);

- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+ if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
+ (vma->vm_flags & VM_SPECIAL)) {
SetPageActive(page);
lru_cache_add(page);
return;
--
1.9.1

2015-06-10 13:27:25

by Eric B Munson

[permalink] [raw]
Subject: [RESEND PATCH V2 2/3] Add mlockall flag for locking pages on fault

Building on the previous patch, extend mlockall() to give a process a
way to specify that pages should be locked when they are faulted in, but
that pre-faulting is not needed.

MCL_ONFAULT is preferable to MCL_FUTURE for the use cases enumerated
in the previous patch because MCL_FUTURE will behave as if each mapping
was made with MAP_LOCKED, causing the entire mapping to be faulted in
when new space is allocated or mapped. MCL_ONFAULT allows the user to
delay the fault-in cost of any given page until it is actually needed,
but then guarantees that that page will always be resident.

As with the mmap(MAP_LOCKONFAULT) case, the user is charged for the
mapping against the RLIMIT_MEMLOCK when the address space is allocated,
not when the page is faulted in. This decision was made to keep the
accounting checks out of the page fault path.

Signed-off-by: Eric B Munson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/uapi/asm/mman.h | 1 +
arch/sparc/include/uapi/asm/mman.h | 1 +
arch/tile/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/uapi/asm-generic/mman.h | 1 +
mm/mlock.c | 13 +++++++++----
9 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 15e96e1..dfdaecf 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -38,6 +38,7 @@

#define MCL_CURRENT 8192 /* lock all currently mapped pages */
#define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MCL_ONFAULT 32768 /* lock all pages that are faulted in */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 47846a5..f0705ff 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -62,6 +62,7 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 1514cd7..7c2eb85 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -32,6 +32,7 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index fce74fe..761137a 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@

#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */

#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 12425d8..dd027b8 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -17,6 +17,7 @@

#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
#define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */

#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index ec04eaf..0f7ae45 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -37,6 +37,7 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */


#endif /* _ASM_TILE_MMAN_H */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 42d43cc..10fbbb7 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -75,6 +75,7 @@
*/
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index fc4e586..7fb729b 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -18,5 +18,6 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */

#endif /* __ASM_GENERIC_MMAN_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..f15547f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -579,7 +579,7 @@ static int do_mlock(unsigned long start, size_t len, int on)

/* Here we know that vma->vm_start <= nstart < vma->vm_end. */

- newflags = vma->vm_flags & ~VM_LOCKED;
+ newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
if (on)
newflags |= VM_LOCKED;

@@ -662,13 +662,17 @@ static int do_mlockall(int flags)
current->mm->def_flags |= VM_LOCKED;
else
current->mm->def_flags &= ~VM_LOCKED;
- if (flags == MCL_FUTURE)
+ if (flags & MCL_ONFAULT)
+ current->mm->def_flags |= VM_LOCKONFAULT;
+ else
+ current->mm->def_flags &= ~VM_LOCKONFAULT;
+ if (flags == MCL_FUTURE || flags == MCL_ONFAULT)
goto out;

for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
vm_flags_t newflags;

- newflags = vma->vm_flags & ~VM_LOCKED;
+ newflags = vma->vm_flags & ~(VM_LOCKED | VM_LOCKONFAULT);
if (flags & MCL_CURRENT)
newflags |= VM_LOCKED;

@@ -685,7 +689,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
unsigned long lock_limit;
int ret = -EINVAL;

- if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
+ if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
+ ((flags & MCL_FUTURE) && (flags & MCL_ONFAULT)))
goto out;

ret = -EPERM;
--
1.9.1

2015-06-10 13:27:07

by Eric B Munson

[permalink] [raw]
Subject: [RESEND PATCH V2 3/3] Add tests for lock on fault

Test the mmap() flag, the mlockall() flag, and ensure that mlock limits
are respected. Note that the limit test needs to be run as a normal user.

Signed-off-by: Eric B Munson <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
tools/testing/selftests/vm/Makefile | 8 +-
tools/testing/selftests/vm/lock-on-fault.c | 145 ++++++++++++++++++++++++++++
tools/testing/selftests/vm/on-fault-limit.c | 47 +++++++++
tools/testing/selftests/vm/run_vmtests | 23 +++++
4 files changed, 222 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
create mode 100644 tools/testing/selftests/vm/on-fault-limit.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index a5ce953..32f3d20 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -1,7 +1,13 @@
# Makefile for vm selftests

CFLAGS = -Wall
-BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest
+BINARIES = hugepage-mmap
+BINARIES += hugepage-shm
+BINARIES += hugetlbfstest
+BINARIES += lock-on-fault
+BINARIES += map_hugetlb
+BINARIES += on-fault-limit
+BINARIES += thuge-gen
BINARIES += transhuge-stress

all: $(BINARIES)
diff --git a/tools/testing/selftests/vm/lock-on-fault.c b/tools/testing/selftests/vm/lock-on-fault.c
new file mode 100644
index 0000000..4659303
--- /dev/null
+++ b/tools/testing/selftests/vm/lock-on-fault.c
@@ -0,0 +1,145 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+#define PRESENT_BIT 0x8000000000000000
+#define PFN_MASK 0x007FFFFFFFFFFFFF
+#define UNEVICTABLE_BIT (1UL << 18)
+
+static int check_pageflags(void *map)
+{
+ FILE *file;
+ unsigned long pfn1;
+ unsigned long pfn2;
+ unsigned long offset1;
+ unsigned long offset2;
+ int ret = 1;
+
+ file = fopen("/proc/self/pagemap", "r");
+ if (!file) {
+ perror("fopen");
+ return ret;
+ }
+ offset1 = (unsigned long)map / getpagesize() * sizeof(unsigned long);
+ offset2 = ((unsigned long)map + getpagesize()) / getpagesize() * sizeof(unsigned long);
+ if (fseek(file, offset1, SEEK_SET)) {
+ perror("fseek");
+ goto out;
+ }
+
+ if (fread(&pfn1, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ goto out;
+ }
+
+ if (fseek(file, offset2, SEEK_SET)) {
+ perror("fseek");
+ goto out;
+ }
+
+ if (fread(&pfn2, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ goto out;
+ }
+
+ /* pfn2 should not be present */
+ if (pfn2 & PRESENT_BIT) {
+ printf("page map says 0x%lx\n", pfn2);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ goto out;
+ }
+
+ /* pfn1 should be present */
+ if ((pfn1 & PRESENT_BIT) == 0) {
+ printf("page map says 0x%lx\n", pfn1);
+ printf("present is 0x%lx\n", PRESENT_BIT);
+ goto out;
+ }
+
+ pfn1 &= PFN_MASK;
+ fclose(file);
+ file = fopen("/proc/kpageflags", "r");
+ if (!file) {
+ perror("fopen");
+ munmap(map, 2 * getpagesize());
+ return ret;
+ }
+
+ if (fseek(file, pfn1 * sizeof(unsigned long), SEEK_SET)) {
+ perror("fseek");
+ goto out;
+ }
+
+ if (fread(&pfn2, sizeof(unsigned long), 1, file) != 1) {
+ perror("fread");
+ goto out;
+ }
+
+ /* pfn2 now contains the entry from kpageflags for the first page, the
+ * unevictable bit should be set */
+ if ((pfn2 & UNEVICTABLE_BIT) == 0) {
+ printf("kpageflags says 0x%lx\n", pfn2);
+ printf("unevictable is 0x%lx\n", UNEVICTABLE_BIT);
+ goto out;
+ }
+
+ ret = 0;
+
+out:
+ fclose(file);
+ return ret;
+}
+
+static int test_mmap(int flags)
+{
+ int ret = 1;
+ void *map;
+
+ map = mmap(NULL, 2 * getpagesize(), PROT_READ | PROT_WRITE, flags, 0, 0);
+ if (map == MAP_FAILED) {
+ perror("mmap()");
+ return ret;
+ }
+
+ /* Write something into the first page to ensure it is present */
+ *(char *)map = 1;
+
+ ret = check_pageflags(map);
+
+ munmap(map, 2 * getpagesize());
+ return ret;
+}
+
+static int test_mlockall(void)
+{
+ int ret = 1;
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ ret = test_mmap(MAP_PRIVATE | MAP_ANONYMOUS);
+ munlockall();
+ return ret;
+}
+
+#ifndef MAP_LOCKONFAULT
+#define MAP_LOCKONFAULT (MAP_HUGETLB << 1)
+#endif
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+
+ ret += test_mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKONFAULT);
+ ret += test_mlockall();
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/on-fault-limit.c b/tools/testing/selftests/vm/on-fault-limit.c
new file mode 100644
index 0000000..ed2a109
--- /dev/null
+++ b/tools/testing/selftests/vm/on-fault-limit.c
@@ -0,0 +1,47 @@
+#include <sys/mman.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT (MCL_FUTURE << 1)
+#endif
+
+static int test_limit(void)
+{
+ int ret = 1;
+ struct rlimit lims;
+ void *map;
+
+ if (getrlimit(RLIMIT_MEMLOCK, &lims)) {
+ perror("getrlimit");
+ return ret;
+ }
+
+ if (mlockall(MCL_ONFAULT)) {
+ perror("mlockall");
+ return ret;
+ }
+
+ map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);
+ if (map != MAP_FAILED)
+ printf("mmap should have failed, but didn't\n");
+ else {
+ ret = 0;
+ munmap(map, 2 * lims.rlim_max);
+ }
+
+ munlockall();
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 0;
+
+ ret += test_limit();
+ return ret;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index c87b681..c1aecce 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -90,4 +90,27 @@ fi
umount $mnt
rm -rf $mnt
echo $nr_hugepgs > /proc/sys/vm/nr_hugepages
+
+echo "--------------------"
+echo "running lock-on-fault"
+echo "--------------------"
+./lock-on-fault
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running on-fault-limit"
+echo "--------------------"
+sudo -u nobody ./on-fault-limit
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
exit $exitcode
--
1.9.1

2015-06-10 21:59:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Wed, 10 Jun 2015 09:26:47 -0400 Eric B Munson <[email protected]> wrote:

> mlock() allows a user to control page out of program memory, but this
> comes at the cost of faulting in the entire mapping when it is

s/mapping/locked area/

> allocated. For large mappings where the entire area is not necessary
> this is not ideal.
>
> This series introduces new flags for mmap() and mlockall() that allow a
> user to specify that the covered are should not be paged out, but only
> after the memory has been used the first time.

The comparison with MCL_FUTURE is hiding over in the 2/3 changelog.
It's important so let's copy it here.

: MCL_ONFAULT is preferrable to MCL_FUTURE for the use cases enumerated
: in the previous patch becuase MCL_FUTURE will behave as if each mapping
: was made with MAP_LOCKED, causing the entire mapping to be faulted in
: when new space is allocated or mapped. MCL_ONFAULT allows the user to
: delay the fault in cost of any given page until it is actually needed,
: but then guarantees that that page will always be resident.

I *think* it all looks OK. I'd like someone else to go over it also if
poss.


I guess the 2/3 changelog should have something like

: munlockall() will clear MCL_ONFAULT on all vma's in the process's VM.

It's pretty obvious, but the manpage delta should make this clear also.


Also the changelog(s) and manpage delta should explain that munlock()
clears MCL_ONFAULT.

And now I'm wondering what happens if userspace does
mmap(MAP_LOCKONFAULT) and later does munlock() on just part of that
region. Does the vma get split? Is this tested? Should also be in
the changelogs and manpage.

Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure that
even makes sense but the behaviour should be understood and tested.


What's missing here is a syscall to set VM_LOCKONFAULT on an arbitrary
range of memory - mlock() for lock-on-fault. It's a shame that mlock()
didn't take a `mode' argument. Perhaps we should add such a syscall -
that would make the mmap flag unneeded but I suppose it should be kept
for symmetry.

2015-06-11 19:21:55

by Eric B Munson

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault


On 06/10/2015 05:59 PM, Andrew Morton wrote:
> On Wed, 10 Jun 2015 09:26:47 -0400 Eric B Munson
> <[email protected]> wrote:
>
>> mlock() allows a user to control page out of program memory, but
>> this comes at the cost of faulting in the entire mapping when it
>> is
>
> s/mapping/locked area/

Done.

>
>> allocated. For large mappings where the entire area is not
>> necessary this is not ideal.
>>
>> This series introduces new flags for mmap() and mlockall() that
>> allow a user to specify that the covered are should not be paged
>> out, but only after the memory has been used the first time.
>
> The comparison with MCL_FUTURE is hiding over in the 2/3 changelog.
> It's important so let's copy it here.
>
> : MCL_ONFAULT is preferrable to MCL_FUTURE for the use cases
> enumerated : in the previous patch becuase MCL_FUTURE will behave
> as if each mapping : was made with MAP_LOCKED, causing the entire
> mapping to be faulted in : when new space is allocated or mapped.
> MCL_ONFAULT allows the user to : delay the fault in cost of any
> given page until it is actually needed, : but then guarantees that
> that page will always be resident.

Done

>
> I *think* it all looks OK. I'd like someone else to go over it
> also if poss.
>
>
> I guess the 2/3 changelog should have something like
>
> : munlockall() will clear MCL_ONFAULT on all vma's in the process's
> VM.

Done

>
> It's pretty obvious, but the manpage delta should make this clear
> also.

Done

>
>
> Also the changelog(s) and manpage delta should explain that
> munlock() clears MCL_ONFAULT.

Done

>
> And now I'm wondering what happens if userspace does
> mmap(MAP_LOCKONFAULT) and later does munlock() on just part of
> that region. Does the vma get split? Is this tested? Should also
> be in the changelogs and manpage.
>
> Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure
> that even makes sense but the behaviour should be understood and
> tested.

I have extended the kselftest for lock-on-fault to try both of these
scenarios and they work as expected. The VMA is split and the VM
flags are set appropriately for the resulting VMAs.

>
>
> What's missing here is a syscall to set VM_LOCKONFAULT on an
> arbitrary range of memory - mlock() for lock-on-fault. It's a
> shame that mlock() didn't take a `mode' argument. Perhaps we
> should add such a syscall - that would make the mmap flag unneeded
> but I suppose it should be kept for symmetry.

Do you want such a system call as part of this set? I would need some
time to make sure I had thought through all the possible corners one
could get into with such a call, so it would delay a V3 quite a bit.
Otherwise I can send a V3 out immediately.

2015-06-11 19:34:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Thu, 11 Jun 2015 15:21:30 -0400 Eric B Munson <[email protected]> wrote:

> > Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure
> > that even makes sense but the behaviour should be understood and
> > tested.
>
> I have extended the kselftest for lock-on-fault to try both of these
> scenarios and they work as expected. The VMA is split and the VM
> flags are set appropriately for the resulting VMAs.

munlock() should do vma merging as well. I *think* we implemented
that. More tests for you to add ;)

How are you testing the vma merging and splitting, btw? Parsing
the procfs files?

> > What's missing here is a syscall to set VM_LOCKONFAULT on an
> > arbitrary range of memory - mlock() for lock-on-fault. It's a
> > shame that mlock() didn't take a `mode' argument. Perhaps we
> > should add such a syscall - that would make the mmap flag unneeded
> > but I suppose it should be kept for symmetry.
>
> Do you want such a system call as part of this set? I would need some
> time to make sure I had thought through all the possible corners one
> could get into with such a call, so it would delay a V3 quite a bit.
> Otherwise I can send a V3 out immediately.

I think the way to look at this is to pretend that mm/mlock.c doesn't
exist and ask "how should we design these features".

And that would be:

- mmap() takes a `flags' argument: MAP_LOCKED|MAP_LOCKONFAULT.

- mlock() takes a `flags' argument. Presently that's
MLOCK_LOCKED|MLOCK_LOCKONFAULT.

- munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
to specify which flags are being cleared.

- mlockall() and munlockall() ditto.


IOW, LOCKED and LOCKEDONFAULT are treated identically and independently.

Now, that's how we would have designed all this on day one. And I
think we can do this now, by adding new mlock2() and munlock2()
syscalls. And we may as well deprecate the old mlock() and munlock(),
not that this matters much.

*should* we do this? I'm thinking "yes" - it's all pretty simple
boilerplate and wrappers and such, and it gets the interface correct,
and extensible.

What do others think?

2015-06-11 19:56:00

by Eric B Munson

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault


On 06/11/2015 03:34 PM, Andrew Morton wrote:
> On Thu, 11 Jun 2015 15:21:30 -0400 Eric B Munson
> <[email protected]> wrote:
>
>>> Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not
>>> sure that even makes sense but the behaviour should be
>>> understood and tested.
>>
>> I have extended the kselftest for lock-on-fault to try both of
>> these scenarios and they work as expected. The VMA is split and
>> the VM flags are set appropriately for the resulting VMAs.
>
> munlock() should do vma merging as well. I *think* we implemented
> that. More tests for you to add ;)

I will add a test for this as well. But the code is in place to merge
VMAs IIRC.

>
> How are you testing the vma merging and splitting, btw? Parsing
> the procfs files?

To show the VMA split happened, I dropped a printk in mlock_fixup()
and the user space test simply checks that unlocked pages are not
marked as unevictable. The test does not parse maps or smaps for
actual VMA layout. Given that we want to check the merging of VMAs as
well, I will add this.

>
>>> What's missing here is a syscall to set VM_LOCKONFAULT on an
>>> arbitrary range of memory - mlock() for lock-on-fault. It's a
>>> shame that mlock() didn't take a `mode' argument. Perhaps we
>>> should add such a syscall - that would make the mmap flag
>>> unneeded but I suppose it should be kept for symmetry.
>>
>> Do you want such a system call as part of this set? I would need
>> some time to make sure I had thought through all the possible
>> corners one could get into with such a call, so it would delay a
>> V3 quite a bit. Otherwise I can send a V3 out immediately.
>
> I think the way to look at this is to pretend that mm/mlock.c
> doesn't exist and ask "how should we design these features".
>
> And that would be:
>
> - mmap() takes a `flags' argument: MAP_LOCKED|MAP_LOCKONFAULT.
>
> - mlock() takes a `flags' argument. Presently that's
> MLOCK_LOCKED|MLOCK_LOCKONFAULT.
>
> - munlock() takes a `flags' argument.
> MLOCK_LOCKED|MLOCK_LOCKONFAULT to specify which flags are being
> cleared.
>
> - mlockall() and munlockall() ditto.
>
>
> IOW, LOCKED and LOCKEDONFAULT are treated identically and
> independently.
>
> Now, that's how we would have designed all this on day one. And I
> think we can do this now, by adding new mlock2() and munlock2()
> syscalls. And we may as well deprecate the old mlock() and
> munlock(), not that this matters much.
>
> *should* we do this? I'm thinking "yes" - it's all pretty simple
> boilerplate and wrappers and such, and it gets the interface
> correct, and extensible.
>
> What do others think?
>

2015-06-12 12:05:26

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On 06/11/2015 09:34 PM, Andrew Morton wrote:
> On Thu, 11 Jun 2015 15:21:30 -0400 Eric B Munson <[email protected]> wrote:
>
>>> Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure
>>> that even makes sense but the behaviour should be understood and
>>> tested.
>>
>> I have extended the kselftest for lock-on-fault to try both of these
>> scenarios and they work as expected. The VMA is split and the VM
>> flags are set appropriately for the resulting VMAs.
>
> munlock() should do vma merging as well. I *think* we implemented
> that. More tests for you to add ;)
>
> How are you testing the vma merging and splitting, btw? Parsing
> the procfs files?
>
>>> What's missing here is a syscall to set VM_LOCKONFAULT on an
>>> arbitrary range of memory - mlock() for lock-on-fault. It's a
>>> shame that mlock() didn't take a `mode' argument. Perhaps we
>>> should add such a syscall - that would make the mmap flag unneeded
>>> but I suppose it should be kept for symmetry.
>>
>> Do you want such a system call as part of this set? I would need some
>> time to make sure I had thought through all the possible corners one
>> could get into with such a call, so it would delay a V3 quite a bit.
>> Otherwise I can send a V3 out immediately.
>
> I think the way to look at this is to pretend that mm/mlock.c doesn't
> exist and ask "how should we design these features".
>
> And that would be:
>
> - mmap() takes a `flags' argument: MAP_LOCKED|MAP_LOCKONFAULT.

Note that the semantic of MAP_LOCKED can be subtly surprising:

"mlock(2) fails if the memory range cannot get populated to guarantee
that no future major faults will happen on the range. mmap(MAP_LOCKED)
on the other hand silently succeeds even if the range was populated only
partially."

( from http://marc.info/?l=linux-mm&m=143152790412727&w=2 )

So MAP_LOCKED can silently behave like MAP_LOCKONFAULT. While
MAP_LOCKONFAULT doesn't suffer from such a problem, I wonder if that's
sufficient reason not to extend mmap with new mlock() flags at all, and
instead apply them to the VMA after mmapping, using the proposed
mlock2() with flags. I think we could then deprecate MAP_LOCKED more
prominently. I doubt the overhead of calling the extra syscall matters here?

> - mlock() takes a `flags' argument. Presently that's
> MLOCK_LOCKED|MLOCK_LOCKONFAULT.
>
> - munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
> to specify which flags are being cleared.
>
> - mlockall() and munlockall() ditto.
>
>
> IOW, LOCKED and LOCKEDONFAULT are treated identically and independently.
>
> Now, that's how we would have designed all this on day one. And I
> think we can do this now, by adding new mlock2() and munlock2()
> syscalls. And we may as well deprecate the old mlock() and munlock(),
> not that this matters much.
>
> *should* we do this? I'm thinking "yes" - it's all pretty simple
> boilerplate and wrappers and such, and it gets the interface correct,
> and extensible.

If the new LOCKONFAULT functionality is indeed desired (I haven't still
decided myself) then I agree that would be the cleanest way.

> What do others think?
>

2015-06-15 14:39:54

by Eric B Munson

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Thu, 11 Jun 2015, Andrew Morton wrote:

> On Thu, 11 Jun 2015 15:21:30 -0400 Eric B Munson <[email protected]> wrote:
>
> > > Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure
> > > that even makes sense but the behaviour should be understood and
> > > tested.
> >
> > I have extended the kselftest for lock-on-fault to try both of these
> > scenarios and they work as expected. The VMA is split and the VM
> > flags are set appropriately for the resulting VMAs.
>
> munlock() should do vma merging as well. I *think* we implemented
> that. More tests for you to add ;)
>
> How are you testing the vma merging and splitting, btw? Parsing
> the procfs files?

The lock-on-fault test now covers VMA splitting and merging by parsing
/proc/self/maps. VMA splitting and merging works as it should with both
MAP_LOCKONFAULT and MCL_ONFAULT.

>
> > > What's missing here is a syscall to set VM_LOCKONFAULT on an
> > > arbitrary range of memory - mlock() for lock-on-fault. It's a
> > > shame that mlock() didn't take a `mode' argument. Perhaps we
> > > should add such a syscall - that would make the mmap flag unneeded
> > > but I suppose it should be kept for symmetry.
> >
> > Do you want such a system call as part of this set? I would need some
> > time to make sure I had thought through all the possible corners one
> > could get into with such a call, so it would delay a V3 quite a bit.
> > Otherwise I can send a V3 out immediately.
>
> I think the way to look at this is to pretend that mm/mlock.c doesn't
> exist and ask "how should we design these features".
>
> And that would be:
>
> - mmap() takes a `flags' argument: MAP_LOCKED|MAP_LOCKONFAULT.
>
> - mlock() takes a `flags' argument. Presently that's
> MLOCK_LOCKED|MLOCK_LOCKONFAULT.
>
> - munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
> to specify which flags are being cleared.
>
> - mlockall() and munlockall() ditto.
>
>
> IOW, LOCKED and LOCKEDONFAULT are treated identically and independently.
>
> Now, that's how we would have designed all this on day one. And I
> think we can do this now, by adding new mlock2() and munlock2()
> syscalls. And we may as well deprecate the old mlock() and munlock(),
> not that this matters much.
>
> *should* we do this? I'm thinking "yes" - it's all pretty simple
> boilerplate and wrappers and such, and it gets the interface correct,
> and extensible.
>
> What do others think?

I am working on V3 which will introduce the new system calls.



2015-06-15 14:44:05

by Eric B Munson

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Fri, 12 Jun 2015, Vlastimil Babka wrote:

> On 06/11/2015 09:34 PM, Andrew Morton wrote:
> >On Thu, 11 Jun 2015 15:21:30 -0400 Eric B Munson <[email protected]> wrote:
> >
> >>>Ditto mlockall(MCL_ONFAULT) followed by munlock(). I'm not sure
> >>>that even makes sense but the behaviour should be understood and
> >>>tested.
> >>
> >>I have extended the kselftest for lock-on-fault to try both of these
> >>scenarios and they work as expected. The VMA is split and the VM
> >>flags are set appropriately for the resulting VMAs.
> >
> >munlock() should do vma merging as well. I *think* we implemented
> >that. More tests for you to add ;)
> >
> >How are you testing the vma merging and splitting, btw? Parsing
> >the procfs files?
> >
> >>>What's missing here is a syscall to set VM_LOCKONFAULT on an
> >>>arbitrary range of memory - mlock() for lock-on-fault. It's a
> >>>shame that mlock() didn't take a `mode' argument. Perhaps we
> >>>should add such a syscall - that would make the mmap flag unneeded
> >>>but I suppose it should be kept for symmetry.
> >>
> >>Do you want such a system call as part of this set? I would need some
> >>time to make sure I had thought through all the possible corners one
> >>could get into with such a call, so it would delay a V3 quite a bit.
> >>Otherwise I can send a V3 out immediately.
> >
> >I think the way to look at this is to pretend that mm/mlock.c doesn't
> >exist and ask "how should we design these features".
> >
> >And that would be:
> >
> >- mmap() takes a `flags' argument: MAP_LOCKED|MAP_LOCKONFAULT.
>
> Note that the semantic of MAP_LOCKED can be subtly surprising:
>
> "mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range.
> mmap(MAP_LOCKED) on the other hand silently succeeds even if the
> range was populated only
> partially."
>
> ( from http://marc.info/?l=linux-mm&m=143152790412727&w=2 )
>
> So MAP_LOCKED can silently behave like MAP_LOCKONFAULT. While
> MAP_LOCKONFAULT doesn't suffer from such problem, I wonder if that's
> sufficient reason not to extend mmap by new mlock() flags that can
> be instead applied to the VMA after mmapping, using the proposed
> mlock2() with flags. So I think instead we could deprecate
> MAP_LOCKED more prominently. I doubt the overhead of calling the
> extra syscall matters here?

We could talk about retiring the MAP_LOCKED flag but I suspect that
would get significantly more pushback than adding a new mmap flag.

The overhead likely does not matter in most cases, but presumably
there are cases where it does (which is why we have a MAP_LOCKED flag
today). Even with the proposed new system calls, I think we should
keep MAP_LOCKONFAULT for parity with MAP_LOCKED.

>
> >- mlock() takes a `flags' argument. Presently that's
> > MLOCK_LOCKED|MLOCK_LOCKONFAULT.
> >
> >- munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
> > to specify which flags are being cleared.
> >
> >- mlockall() and munlockall() ditto.
> >
> >
> >IOW, LOCKED and LOCKEDONFAULT are treated identically and independently.
> >
> >Now, that's how we would have designed all this on day one. And I
> >think we can do this now, by adding new mlock2() and munlock2()
> >syscalls. And we may as well deprecate the old mlock() and munlock(),
> >not that this matters much.
> >
> >*should* we do this? I'm thinking "yes" - it's all pretty simple
> >boilerplate and wrappers and such, and it gets the interface correct,
> >and extensible.
>
> If the new LOCKONFAULT functionality is indeed desired (I haven't
> still decided myself) then I agree that would be the cleanest way.

Do you disagree with the use cases I have listed or do you think there
is a better way of addressing those cases?

>
> >What do others think?



2015-06-18 15:29:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

[Sorry for the late reply - I meant to answer in the previous threads
but something always preempted me from that]

On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> For the example of a large file, this is the usage pattern for a large
> statistical language model (probably applies to other statistical or graphical
> models as well). For the security example, any application transacting
> in data that cannot be swapped out (credit card data, medical records,
> etc).

Such a use case makes some sense to me, but I am not sure the way you
implement it is the right one. This is another mlock-related flag for
mmap with a different semantic. You do not want to prefault, but is
e.g. readahead or fault-around acceptable? I do not see anything in
your patch to handle those...

Wouldn't it be much more reasonable and straightforward to have
MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
explicitly disallow any form of pre-faulting? It would be usable for
other usecases than with MAP_LOCKED combination.

> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are finally
> faulted in.
>
> To keep accounting checks out of the page fault path, users are billed
> for the entire mapping lock as if MAP_LOCKED was used.
>
> Signed-off-by: Eric B Munson <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/alpha/include/uapi/asm/mman.h | 1 +
> arch/mips/include/uapi/asm/mman.h | 1 +
> arch/parisc/include/uapi/asm/mman.h | 1 +
> arch/powerpc/include/uapi/asm/mman.h | 1 +
> arch/sparc/include/uapi/asm/mman.h | 1 +
> arch/tile/include/uapi/asm/mman.h | 1 +
> arch/xtensa/include/uapi/asm/mman.h | 1 +
> include/linux/mm.h | 1 +
> include/linux/mman.h | 3 ++-
> include/uapi/asm-generic/mman.h | 1 +
> mm/mmap.c | 4 ++--
> mm/swap.c | 3 ++-
> 12 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 0086b47..15e96e1 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -30,6 +30,7 @@
> #define MAP_NONBLOCK 0x40000 /* do not block on IO */
> #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x100000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */
>
> #define MS_ASYNC 1 /* sync memory asynchronously */
> #define MS_SYNC 2 /* synchronous memory sync */
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index cfcb876..47846a5 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -48,6 +48,7 @@
> #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
>
> /*
> * Flags for msync
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 294d251..1514cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -24,6 +24,7 @@
> #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
>
> #define MS_SYNC 1 /* synchronous memory sync */
> #define MS_ASYNC 2 /* sync memory asynchronously */
> diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
> index 6ea26df..fce74fe 100644
> --- a/arch/powerpc/include/uapi/asm/mman.h
> +++ b/arch/powerpc/include/uapi/asm/mman.h
> @@ -27,5 +27,6 @@
> #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
>
> #endif /* _UAPI_ASM_POWERPC_MMAN_H */
> diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
> index 0b14df3..12425d8 100644
> --- a/arch/sparc/include/uapi/asm/mman.h
> +++ b/arch/sparc/include/uapi/asm/mman.h
> @@ -22,6 +22,7 @@
> #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
>
>
> #endif /* _UAPI__SPARC_MMAN_H__ */
> diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
> index 81b8fc3..ec04eaf 100644
> --- a/arch/tile/include/uapi/asm/mman.h
> +++ b/arch/tile/include/uapi/asm/mman.h
> @@ -29,6 +29,7 @@
> #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
> #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
> #define MAP_HUGETLB 0x4000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x8000 /* Lock pages after they are faulted in, do not prefault */
>
>
> /*
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 201aec0..42d43cc 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -55,6 +55,7 @@
> #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
> #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
> # define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
> * uninitialized */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0755b9f..3e31457 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
>
> +#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
> #define VM_LOCKED 0x00002000
> #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
>
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 16373c8..437264b 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
> {
> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
> _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
> - _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
> + _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
> + _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
> }
>
> unsigned long vm_commit_limit(void);
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index e9fe6fd..fc4e586 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
> #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
>
> /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bb50cac..ba1a6bf 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
> unsigned long locked, lock_limit;
>
> /* mlock MCL_FUTURE? */
> - if (flags & VM_LOCKED) {
> + if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
> locked = len >> PAGE_SHIFT;
> locked += mm->locked_vm;
> lock_limit = rlimit(RLIMIT_MEMLOCK);
> @@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
> mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>
> - if (flags & MAP_LOCKED)
> + if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
> if (!can_do_mlock())
> return -EPERM;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index a7251a8..07c905e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
> {
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> + if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
> + (vma->vm_flags & VM_SPECIAL)) {
> SetPageActive(page);
> lru_cache_add(page);
> return;
> --
> 1.9.1
>

--
Michal Hocko
SUSE Labs

2015-06-18 20:30:56

by Eric B Munson

[permalink] [raw]
Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Thu, 18 Jun 2015, Michal Hocko wrote:

> [Sorry for the late reply - I meant to answer in the previous threads
> but something always preempted me from that]
>
> On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> > The cost of faulting in all memory to be locked can be very high when
> > working with large mappings. If only portions of the mapping will be
> > used this can incur a high penalty for locking.
> >
> > For the example of a large file, this is the usage pattern for a large
> > statistical language model (probably applies to other statistical or graphical
> > models as well). For the security example, any application transacting
> > in data that cannot be swapped out (credit card data, medical records,
> > etc).
>
> Such a use case makes some sense to me but I am not sure the way you
> implement it is the right one. This is another mlock related flag for
> mmap with a different semantic. You do not want to prefault but e.g. is
> the readahead or fault around acceptable? I do not see anything in your
> patch to handle those...

We haven't bumped into readahead or fault around causing performance
problems for us. If they cause problems for users when LOCKONFAULT is
in use then we can address them.

>
> Wouldn't it be much more reasonable and straightforward to have
> MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> explicitly disallow any form of pre-faulting? It would be usable for
> other usecases than with MAP_LOCKED combination.

I don't see a clear case for it being more reasonable; it is one
possible way to solve the problem. But I think it leaves us in an even
more awkward state WRT VMA flags. As you noted in your fix for the
mmap() man page, one can get into a state where a VMA is VM_LOCKED but
not present. Having VM_LOCKONFAULT states that this was intentional.
If we use MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no longer
set VM_LOCKONFAULT (unless we want to start mapping it to the presence
of two MAP_ flags), which can make detecting the MAP_LOCKED + populate
failure state harder.

If this is the preferred path for mmap(), I am fine with that. However,
I would like to see the new system calls that Andrew mentioned (and that
I am testing patches for) go in as well. That way we give users the
ability to request VM_LOCKONFAULT for memory allocated using something
other than mmap.

>
> > This patch introduces the ability to request that pages are not
> > pre-faulted, but are placed on the unevictable LRU when they are finally
> > faulted in.
> >
> > To keep accounting checks out of the page fault path, users are billed
> > for the entire mapping lock as if MAP_LOCKED was used.
> >
> > Signed-off-by: Eric B Munson <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > arch/alpha/include/uapi/asm/mman.h | 1 +
> > arch/mips/include/uapi/asm/mman.h | 1 +
> > arch/parisc/include/uapi/asm/mman.h | 1 +
> > arch/powerpc/include/uapi/asm/mman.h | 1 +
> > arch/sparc/include/uapi/asm/mman.h | 1 +
> > arch/tile/include/uapi/asm/mman.h | 1 +
> > arch/xtensa/include/uapi/asm/mman.h | 1 +
> > include/linux/mm.h | 1 +
> > include/linux/mman.h | 3 ++-
> > include/uapi/asm-generic/mman.h | 1 +
> > mm/mmap.c | 4 ++--
> > mm/swap.c | 3 ++-
> > 12 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 0086b47..15e96e1 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -30,6 +30,7 @@
> > #define MAP_NONBLOCK 0x40000 /* do not block on IO */
> > #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x100000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x200000 /* Lock pages after they are faulted in, do not prefault */
> >
> > #define MS_ASYNC 1 /* sync memory asynchronously */
> > #define MS_SYNC 2 /* synchronous memory sync */
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index cfcb876..47846a5 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -48,6 +48,7 @@
> > #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> > #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
> >
> > /*
> > * Flags for msync
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 294d251..1514cd7 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -24,6 +24,7 @@
> > #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> > #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
> >
> > #define MS_SYNC 1 /* synchronous memory sync */
> > #define MS_ASYNC 2 /* sync memory asynchronously */
> > diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
> > index 6ea26df..fce74fe 100644
> > --- a/arch/powerpc/include/uapi/asm/mman.h
> > +++ b/arch/powerpc/include/uapi/asm/mman.h
> > @@ -27,5 +27,6 @@
> > #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> > #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
> >
> > #endif /* _UAPI_ASM_POWERPC_MMAN_H */
> > diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
> > index 0b14df3..12425d8 100644
> > --- a/arch/sparc/include/uapi/asm/mman.h
> > +++ b/arch/sparc/include/uapi/asm/mman.h
> > @@ -22,6 +22,7 @@
> > #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> > #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
> >
> >
> > #endif /* _UAPI__SPARC_MMAN_H__ */
> > diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
> > index 81b8fc3..ec04eaf 100644
> > --- a/arch/tile/include/uapi/asm/mman.h
> > +++ b/arch/tile/include/uapi/asm/mman.h
> > @@ -29,6 +29,7 @@
> > #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
> > #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
> > #define MAP_HUGETLB 0x4000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x8000 /* Lock pages after they are faulted in, do not prefault */
> >
> >
> > /*
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 201aec0..42d43cc 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -55,6 +55,7 @@
> > #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> > #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x100000 /* Lock pages after they are faulted in, do not prefault */
> > #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
> > # define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
> > * uninitialized */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0755b9f..3e31457 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
> > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
> >
> > +#define VM_LOCKONFAULT 0x00001000 /* Lock the pages covered when they are faulted in */
> > #define VM_LOCKED 0x00002000
> > #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index 16373c8..437264b 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
> > {
> > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
> > _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
> > - _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
> > + _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
> > + _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
> > }
> >
> > unsigned long vm_commit_limit(void);
> > diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> > index e9fe6fd..fc4e586 100644
> > --- a/include/uapi/asm-generic/mman.h
> > +++ b/include/uapi/asm-generic/mman.h
> > @@ -12,6 +12,7 @@
> > #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> > #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
> > +#define MAP_LOCKONFAULT 0x80000 /* Lock pages after they are faulted in, do not prefault */
> >
> > /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index bb50cac..ba1a6bf 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
> > unsigned long locked, lock_limit;
> >
> > /* mlock MCL_FUTURE? */
> > - if (flags & VM_LOCKED) {
> > + if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
> > locked = len >> PAGE_SHIFT;
> > locked += mm->locked_vm;
> > lock_limit = rlimit(RLIMIT_MEMLOCK);
> > @@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> > vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
> > mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> >
> > - if (flags & MAP_LOCKED)
> > + if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
> > if (!can_do_mlock())
> > return -EPERM;
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index a7251a8..07c905e 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
> > {
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> > + if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
> > + (vma->vm_flags & VM_SPECIAL)) {
> > SetPageActive(page);
> > lru_cache_add(page);
> > return;
> > --
> > 1.9.1
> >
>
> --
> Michal Hocko
> SUSE Labs



2015-06-19 14:57:21

by Michal Hocko

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> On Thu, 18 Jun 2015, Michal Hocko wrote:
[...]
> > Wouldn't it be much more reasonable and straightforward to have
> > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > explicitly disallow any form of pre-faulting? It would be usable for
> > other usecases than with MAP_LOCKED combination.
>
> I don't see a clear case for it being more reasonable, it is one
> possible way to solve the problem.

MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
around is an all-or-nothing feature: either all mappings (which support
this) fault around or none do. There is no way to tell the kernel that
this particular mapping shouldn't fault around. I haven't seen such a
request yet, but we have seen requests for a way to opt out of
a global policy in the past (e.g. per-process opt out from THP). So
I can imagine somebody will come with a request to opt out of any
speculative operations on the mapped area in the future.

> But I think it leaves us in an even
> more awkward state WRT VMA flags. As you noted in your fix for the
> mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> not present. Having VM_LOCKONFAULT states that this was intentional, if
> we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> populate failure state harder.

I am not sure I understand your point here. Could you be more specific
how would you check for that and what for?

From my understanding MAP_LOCKONFAULT is essentially
MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
single MAP_LOCKED unfortunately). I would love to also have
MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
skeptical considering how my previous attempt to make MAP_POPULATE
reasonable went.

> If this is the preferred path for mmap(), I am fine with that.

> However,
> I would like to see the new system calls that Andrew mentioned (and that
> I am testing patches for) go in as well.

mlock with flags sounds like a good step, but I am not sure it will make
sense in the future. POSIX has screwed that up and I am not sure how many
applications would use it. This ship has sailed a long time ago.

> That way we give users the
> ability to request VM_LOCKONFAULT for memory allocated using something
> other than mmap.

mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
without changing mlock syscall.

> > > This patch introduces the ability to request that pages are not
> > > pre-faulted, but are placed on the unevictable LRU when they are finally
> > > faulted in.
> > >
> > > To keep accounting checks out of the page fault path, users are billed
> > > for the entire mapping lock as if MAP_LOCKED was used.
> > >
> > > Signed-off-by: Eric B Munson <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > ---
[...]
--
Michal Hocko
SUSE Labs

2015-06-19 16:43:45

by Eric B Munson

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Fri, 19 Jun 2015, Michal Hocko wrote:

> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > On Thu, 18 Jun 2015, Michal Hocko wrote:
> [...]
> > > Wouldn't it be much more reasonable and straightforward to have
> > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > explicitly disallow any form of pre-faulting? It would be usable for
> > > other usecases than with MAP_LOCKED combination.
> >
> > I don't see a clear case for it being more reasonable, it is one
> > possible way to solve the problem.
>
> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> around is all or nothing feature. Either all mappings (which support
> this) fault around or none. There is no way to tell the kernel that
> this particular mapping shouldn't fault around. I haven't seen such a
> request yet but we have seen requests to have a way to opt out from
> a global policy in the past (e.g. per-process opt out from THP). So
> I can imagine somebody will come with a request to opt out from any
> speculative operations on the mapped area in the future.
>
> > But I think it leaves us in an even
> > more awkward state WRT VMA flags. As you noted in your fix for the
> > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > not present. Having VM_LOCKONFAULT states that this was intentional, if
> > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> > populate failure state harder.
>
> I am not sure I understand your point here. Could you be more specific
> how would you check for that and what for?

My thought on detecting was that someone might want to know if they had
a VMA that was VM_LOCKED but had not been made present because of a
failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
is at least explicit about what is happening, which would make detecting
the VM_LOCKED but not present state easier. This assumes that
MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
it would have to.

>
> From my understanding MAP_LOCKONFAULT is essentially
> MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> single MAP_LOCKED unfortunately). I would love to also have
> MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> skeptical considering how my previous attempt to make MAP_POPULATE
> reasonable went.

Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
new MAP_LOCKONFAULT flag (or both)? If you prefer that MAP_LOCKED |
MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
instead of introducing MAP_LOCKONFAULT. I went with the new flag
because, to date, we have a one-to-one mapping of MAP_* to VM_* flags.

>
> > If this is the preferred path for mmap(), I am fine with that.
>
> > However,
> > I would like to see the new system calls that Andrew mentioned (and that
> > I am testing patches for) go in as well.
>
> mlock with flags sounds like a good step but I am not sure it will make
> sense in the future. POSIX has screwed that and I am not sure how many
> applications would use it. This ship has sailed a long time ago.

I don't know either, but the code is the question, right? I know that
we have at least one team that wants it here.

>
> > That way we give users the
> > ability to request VM_LOCKONFAULT for memory allocated using something
> > other than mmap.
>
> mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> without changing mlock syscall.

That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s). It
doesn't cover the actual case I was asking about, which is how do I get
lock on fault on malloc'd memory?

>
> > > > This patch introduces the ability to request that pages are not
> > > > pre-faulted, but are placed on the unevictable LRU when they are finally
> > > > faulted in.
> > > >
> > > > To keep accounting checks out of the page fault path, users are billed
> > > > for the entire mapping lock as if MAP_LOCKED was used.
> > > >
> > > > Signed-off-by: Eric B Munson <[email protected]>
> > > > Cc: Michal Hocko <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > ---
> [...]
> --
> Michal Hocko
> SUSE Labs



2015-06-22 12:38:50

by Michal Hocko

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> On Fri, 19 Jun 2015, Michal Hocko wrote:
>
> > On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > > On Thu, 18 Jun 2015, Michal Hocko wrote:
> > [...]
> > > > Wouldn't it be much more reasonable and straightforward to have
> > > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > > explicitly disallow any form of pre-faulting? It would be usable for
> > > > other usecases than with MAP_LOCKED combination.
> > >
> > > I don't see a clear case for it being more reasonable, it is one
> > > possible way to solve the problem.
> >
> > MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> > around is all or nothing feature. Either all mappings (which support
> > this) fault around or none. There is no way to tell the kernel that
> > this particular mapping shouldn't fault around. I haven't seen such a
> > request yet but we have seen requests to have a way to opt out from
> > a global policy in the past (e.g. per-process opt out from THP). So
> > I can imagine somebody will come with a request to opt out from any
> > speculative operations on the mapped area in the future.
> >
> > > But I think it leaves us in an even
> > > more awkward state WRT VMA flags. As you noted in your fix for the
> > > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > > not present. Having VM_LOCKONFAULT states that this was intentional, if
> > > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > > presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> > > populate failure state harder.
> >
> > I am not sure I understand your point here. Could you be more specific
> > how would you check for that and what for?
>
> My thought on detecting was that someone might want to know if they had
> > a VMA that was VM_LOCKED but had not been made present because of a
> failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
> is at least explicit about what is happening which would make detecting
> the VM_LOCKED but not present state easier.

One could use /proc/<pid>/pagemap to query the residency.

> This assumes that
> MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> it would have to.

Yes, it would have to have a VM flag for the vma.

> > From my understanding MAP_LOCKONFAULT is essentially
> > MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> > single MAP_LOCKED unfortunately). I would love to also have
> > MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> > skeptical considering how my previous attempt to make MAP_POPULATE
> > reasonable went.
>
> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> new MAP_LOCKONFAULT flag (or both)?

I thought the MAP_FAULTPOPULATE (or any other better name) would
directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
locked semantic. We already have VM_LOCKED for that. The direct effect
of the flag would be to prevent population other than by the direct
page fault - including any speculative actions like fault around or
read-ahead.

> If you prefer that MAP_LOCKED |
> MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
> instead of introducing MAP_LOCKONFAULT. I went with the new flag
> because to date, we have a one to one mapping of MAP_* to VM_* flags.
>
> >
> > > If this is the preferred path for mmap(), I am fine with that.
> >
> > > However,
> > > I would like to see the new system calls that Andrew mentioned (and that
> > > I am testing patches for) go in as well.
> >
> > mlock with flags sounds like a good step but I am not sure it will make
> > sense in the future. POSIX has screwed that and I am not sure how many
> > applications would use it. This ship has sailed a long time ago.
>
> I don't know either, but the code is the question, right? I know that
> we have at least one team that wants it here.
>
> >
> > > That way we give users the
> > > ability to request VM_LOCKONFAULT for memory allocated using something
> > > other than mmap.
> >
> > mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> > without changing mlock syscall.
>
> That is true as long as MAP_FAULTPOPULATE set a flag in the VMA(s). It
> doesn't cover the actual case I was asking about, which is how do I get
> lock on fault on malloc'd memory?

OK I see your point now. We would indeed need a flag argument for mlock.
--
Michal Hocko
SUSE Labs

2015-06-22 14:18:10

by Eric B Munson

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Mon, 22 Jun 2015, Michal Hocko wrote:

> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> > On Fri, 19 Jun 2015, Michal Hocko wrote:
> >
> > > On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > > > On Thu, 18 Jun 2015, Michal Hocko wrote:
> > > [...]
> > > > > Wouldn't it be much more reasonable and straightforward to have
> > > > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > > > explicitly disallow any form of pre-faulting? It would be usable for
> > > > > other usecases than with MAP_LOCKED combination.
> > > >
> > > > I don't see a clear case for it being more reasonable, it is one
> > > > possible way to solve the problem.
> > >
> > > MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> > > around is all or nothing feature. Either all mappings (which support
> > > this) fault around or none. There is no way to tell the kernel that
> > > this particular mapping shouldn't fault around. I haven't seen such a
> > > request yet but we have seen requests to have a way to opt out from
> > > a global policy in the past (e.g. per-process opt out from THP). So
> > > I can imagine somebody will come with a request to opt out from any
> > > speculative operations on the mapped area in the future.
> > >
> > > > But I think it leaves us in an even
> > > > more awkward state WRT VMA flags. As you noted in your fix for the
> > > > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > > > not present. Having VM_LOCKONFAULT states that this was intentional, if
> > > > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > > > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > > > presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> > > > populate failure state harder.
> > >
> > > I am not sure I understand your point here. Could you be more specific
> > > how would you check for that and what for?
> >
> > My thought on detecting was that someone might want to know if they had
> > a VMA that was VM_LOCKED but had not been made present because of a
> > failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
> > is at least explicit about what is happening which would make detecting
> > the VM_LOCKED but not present state easier.
>
> One could use /proc/<pid>/pagemap to query the residency.
>
> > This assumes that
> > MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> > it would have to.
>
> Yes, it would have to have a VM flag for the vma.
>
> > > From my understanding MAP_LOCKONFAULT is essentially
> > > MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> > > single MAP_LOCKED unfortunately). I would love to also have
> > > MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
> > > skeptical considering how my previous attempt to make MAP_POPULATE
> > > reasonable went.
> >
> > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > new MAP_LOCKONFAULT flag (or both)?
>
> I thought the MAP_FAULTPOPULATE (or any other better name) would
> directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
> locked semantic. We already have VM_LOCKED for that. The direct effect
> of the flag would be to prevent from population other than the direct
> page fault - including any speculative actions like fault around or
> read-ahead.

I like the ability to control other speculative population, but I am not
sure about overloading it with the VM_LOCKONFAULT case. Here is my
concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
LOCKONFAULT, how can we tell the difference between someone that wants
to avoid read-ahead and wants to use mlock()? This might lead to some
interesting states with mlock() and munlock() that take flags. For
instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by
munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
VM_LOCKONFAULT set. If we use VM_FAULTPOPULATE, the same pair of calls
would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in
the end, but I am concerned about the subtleties here.

>
> > If you prefer that MAP_LOCKED |
> > MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
> > instead of introducing MAP_LOCKONFAULT. I went with the new flag
> > because to date, we have a one to one mapping of MAP_* to VM_* flags.
> >
> > >
> > > > If this is the preferred path for mmap(), I am fine with that.
> > >
> > > > However,
> > > > I would like to see the new system calls that Andrew mentioned (and that
> > > > I am testing patches for) go in as well.
> > >
> > > mlock with flags sounds like a good step but I am not sure it will make
> > > sense in the future. POSIX has screwed that and I am not sure how many
> > > applications would use it. This ship has sailed a long time ago.
> >
> > I don't know either, but the code is the question, right? I know that
> > we have at least one team that wants it here.
> >
> > >
> > > > That way we give users the
> > > > ability to request VM_LOCKONFAULT for memory allocated using something
> > > > other than mmap.
> > >
> > > mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> > > without changing mlock syscall.
> >
> > That is true as long as MAP_FAULTPOPULATE set a flag in the VMA(s). It
> > doesn't cover the actual case I was asking about, which is how do I get
> > lock on fault on malloc'd memory?
>
> OK I see your point now. We would indeed need a flag argument for mlock.
> --
> Michal Hocko
> SUSE Labs



2015-06-23 12:45:41

by Vlastimil Babka

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On 06/22/2015 04:18 PM, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
>
>> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
>>> On Fri, 19 Jun 2015, Michal Hocko wrote:
>>>
>>>> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
>>>>> On Thu, 18 Jun 2015, Michal Hocko wrote:
>>>> [...]
>>>>>> Wouldn't it be much more reasonable and straightforward to have
>>>>>> MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
>>>>>> explicitly disallow any form of pre-faulting? It would be usable for
>>>>>> other usecases than with MAP_LOCKED combination.
>>>>>
>>>>> I don't see a clear case for it being more reasonable, it is one
>>>>> possible way to solve the problem.
>>>>
>>>> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
>>>> around is all or nothing feature. Either all mappings (which support
>>>> this) fault around or none. There is no way to tell the kernel that
>>>> this particular mapping shouldn't fault around. I haven't seen such a
>>>> request yet but we have seen requests to have a way to opt out from
>>>> a global policy in the past (e.g. per-process opt out from THP). So
>>>> I can imagine somebody will come with a request to opt out from any
>>>> speculative operations on the mapped area in the future.

That sounds like something where a new madvise() flag would make more
sense than a new mmap flag, and conflating it with locking behavior
would lead to all kinds of weird corner cases, as Eric mentioned.

>>>>
>>>>> But I think it leaves us in an even
>>>>> more awkward state WRT VMA flags. As you noted in your fix for the
>>>>> mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
>>>>> not present. Having VM_LOCKONFAULT states that this was intentional, if
>>>>> we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
>>>>> longer set VM_LOCKONFAULT (unless we want to start mapping it to the
>>>>> presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
>>>>> populate failure state harder.
>>>>
>>>> I am not sure I understand your point here. Could you be more specific
>>>> how would you check for that and what for?
>>>
>>> My thought on detecting was that someone might want to know if they had
>>> a VMA that was VM_LOCKED but had not been made present because of a
>>> failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
>>> is at least explicit about what is happening which would make detecting
>>> the VM_LOCKED but not present state easier.
>>
>> One could use /proc/<pid>/pagemap to query the residency.

I think that's altogether too complex a scenario for little gain. If
someone knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he
should either mlock() separately from mmap(), or fault the range in
manually with a for loop. Why try to detect whether the corner case was hit?

>>
>>> This assumes that
>>> MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
>>> it would have to.
>>
>> Yes, it would have to have a VM flag for the vma.

So with your approach, the VM_LOCKED flag is enough, right? The new MAP_ /
MLOCK_ flags would just cause VM_LOCKED to be set without faulting in the
whole vma, but otherwise nothing changes.

If that's true, I think it's better than a new vma flag.

>>
>>>> From my understanding MAP_LOCKONFAULT is essentially
>>>> MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
>>>> single MAP_LOCKED unfortunately). I would love to also have
>>>> MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really
>>>> skeptical considering how my previous attempt to make MAP_POPULATE
>>>> reasonable went.
>>>
>>> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
>>> new MAP_LOCKONFAULT flag (or both)?
>>
>> I thought the MAP_FAULTPOPULATE (or any other better name) would
>> directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
>> locked semantic. We already have VM_LOCKED for that. The direct effect
>> of the flag would be to prevent from population other than the direct
>> page fault - including any speculative actions like fault around or
>> read-ahead.
>
> I like the ability to control other speculative population, but I am not
> sure about overloading it with the VM_LOCKONFAULT case. Here is my
> concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> LOCKONFAULT, how can we tell the difference between someone that wants
> to avoid read-ahead and wants to use mlock()? This might lead to some
> interesting states with mlock() and munlock() that take flags. For
> instance, using VM_LOCKONFAULT mlock(MLOCK_ONFAULT) followed by
> munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> VM_LOCKONFAULT set. If we use VM_FAULTPOPULATE, the same pair of calls
> would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in
> the end, but I am concerned about the subtleties here.

Right.

>>
>>> If you prefer that MAP_LOCKED |
>>> MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
>>> instead of introducing MAP_LOCKONFAULT. I went with the new flag
>>> because to date, we have a one to one mapping of MAP_* to VM_* flags.
>>>
>>>>
>>>>> If this is the preferred path for mmap(), I am fine with that.
>>>>
>>>>> However,
>>>>> I would like to see the new system calls that Andrew mentioned (and that
>>>>> I am testing patches for) go in as well.
>>>>
>>>> mlock with flags sounds like a good step but I am not sure it will make
>>>> sense in the future. POSIX has screwed that and I am not sure how many
>>>> applications would use it. This ship sailed a long time ago.
>>>
>>> I don't know either, but the code is the question, right? I know that
>>> we have at least one team that wants it here.
>>>
>>>>
>>>>> That way we give users the
>>>>> ability to request VM_LOCKONFAULT for memory allocated using something
>>>>> other than mmap.
>>>>
>>>> mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
>>>> without changing mlock syscall.
>>>
>>> That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s). It
>>> doesn't cover the actual case I was asking about, which is how do I get
>>> lock on fault on malloc'd memory?
>>
>> OK I see your point now. We would indeed need a flag argument for mlock.
>> --
>> Michal Hocko
>> SUSE Labs

2015-06-23 13:04:37

by Vlastimil Babka

Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On 06/15/2015 04:43 PM, Eric B Munson wrote:
>> Note that the semantic of MAP_LOCKED can be subtly surprising:
>>
>> "mlock(2) fails if the memory range cannot get populated to guarantee
>> that no future major faults will happen on the range.
>> mmap(MAP_LOCKED) on the other hand silently succeeds even if the
>> range was populated only
>> partially."
>>
>> ( from http://marc.info/?l=linux-mm&m=143152790412727&w=2 )
>>
>> So MAP_LOCKED can silently behave like MAP_LOCKONFAULT. While
>> MAP_LOCKONFAULT doesn't suffer from such a problem, I wonder if that's
>> a sufficient reason not to extend mmap() with new mlock() flags that can
>> instead be applied to the VMA after mmapping, using the proposed
>> mlock2() with flags. So I think instead we could deprecate
>> MAP_LOCKED more prominently. I doubt the overhead of calling the
>> extra syscall matters here?
>
> We could talk about retiring the MAP_LOCKED flag but I suspect that
> would get significantly more pushback than adding a new mmap flag.

Oh no, we can't "retire" as in remove the flag, ever. We just shouldn't
continue down the path of mmap() flags related to mlock().

> Likely that the overhead does not matter in most cases, but presumably
> there are cases where it does (as we have a MAP_LOCKED flag today).
> Even with the proposed new system calls I think we should have the
> MAP_LOCKONFAULT for parity with MAP_LOCKED.

I'm not convinced, but it's not a major issue.

>>
>>> - mlock() takes a `flags' argument. Presently that's
>>> MLOCK_LOCKED|MLOCK_LOCKONFAULT.
>>>
>>> - munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
>>> to specify which flags are being cleared.
>>>
>>> - mlockall() and munlockall() ditto.
>>>
>>>
>>> IOW, LOCKED and LOCKONFAULT are treated identically and independently.
>>>
>>> Now, that's how we would have designed all this on day one. And I
>>> think we can do this now, by adding new mlock2() and munlock2()
>>> syscalls. And we may as well deprecate the old mlock() and munlock(),
>>> not that this matters much.
>>>
>>> *should* we do this? I'm thinking "yes" - it's all pretty simple
>>> boilerplate and wrappers and such, and it gets the interface correct,
>>> and extensible.
>>
>> If the new LOCKONFAULT functionality is indeed desired (I haven't
>> still decided myself) then I agree that would be the cleanest way.
>
> Do you disagree with the use cases I have listed or do you think there
> is a better way of addressing those cases?

I'm somewhat sceptical about the security one. Are security-sensitive
buffers really large enough to matter? The performance one is more convincing and
I don't see a better way, so OK.

>
>>
>>> What do others think?

2015-06-24 08:50:28

by Michal Hocko

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
>
> > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > > new MAP_LOCKONFAULT flag (or both)?
> >
> > > I thought the MAP_FAULTPOPULATE (or any other better name) would
> > > directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
> > > locked semantic. We already have VM_LOCKED for that. The direct effect
> > > of the flag would be to prevent any population other than the direct
> > > page fault - including any speculative actions like fault-around or
> > > read-ahead.
>
> I like the ability to control other speculative population, but I am not
> sure about overloading it with the VM_LOCKONFAULT case. Here is my
> concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> LOCKONFAULT, how can we tell the difference between someone who wants
> to avoid read-ahead and someone who wants to use mlock()?

Not sure I understand. Something like?
addr = mmap(MAP_FAULTPOPULATE) # To prevent speculative mappings into the vma
[...]
mlock(addr, len) # Now I want the full mlock semantic

and the latter to have the full mlock semantic and populate the given
area regardless of VM_FAULTPOPULATE being set on the vma? This is an
interesting question because the mlock man page clearly states the
semantic, and that is to _always_ populate or fail. So I originally
thought that it would obey VM_FAULTPOPULATE, but this needs more
thought.

> This might lead to some
> interesting states with mlock() and munlock() that take flags. For
> instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by
> munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> VM_LOCKONFAULT set.

This is really confusing. Let me try to rephrase that. So you have
mlock(addr, len, MLOCK_ONFAULT)
munlock(addr, len, MLOCK_LOCKED)

IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't
that behavior strange and unexpected? First of all, munlock has
traditionally dropped the lock on the address range (e.g. what should
happen if you did plain old munlock(addr, len)). But even setting that
aside, you are trying to unlock something that hasn't been locked the
same way. So I would expect -EINVAL at least, if the two modes should be
really represented by different flags.

Or did you mean the both types of lock like:
mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
mlock(addr, len, MLOCK_LOCKED)
munlock(addr, len, MLOCK_LOCKED)

and that should keep MLOCK_ONFAULT?
This sounds even more weird to me because that means that the vma in
question would be locked by two different mechanisms. MLOCK_LOCKED with
the "always populate" semantic would rule out MLOCK_ONFAULT so what
would be the meaning of the other flag then? Also what should regular
munlock(addr, len) without flags unlock? Both?

> If we use VM_FAULTPOPULATE, the same pair of calls
> would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in
> the end, but I am concerned about the subtleties here.

This sounds like the proper behavior to me. munlock should simply always
drop VM_LOCKED and the VM_FAULTPOPULATE can live its separate life.

Btw. could you be more specific about semantic of m{un}lock(addr, len, flags)
you want to propose? The more I think about that the more I am unclear
about it, especially munlock behavior and possible flags.
--
Michal Hocko
SUSE Labs

2015-06-24 09:47:53

by Michal Hocko

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Tue 23-06-15 14:45:17, Vlastimil Babka wrote:
> On 06/22/2015 04:18 PM, Eric B Munson wrote:
> >On Mon, 22 Jun 2015, Michal Hocko wrote:
> >
> >>On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> >>>My thought on detecting was that someone might want to know if they had
> >>>a VMA that was VM_LOCKED but had not been made present because of a
> >>>failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
> >>>is at least explicit about what is happening which would make detecting
> >>>the VM_LOCKED but not present state easier.
> >>
> >>One could use /proc/<pid>/pagemap to query the residency.
>
> I think that's all much too complex a scenario for little gain. If someone
> knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he should either
> mlock() separately from mmap(), or fault the range manually with a for loop.
> Why try to detect if the corner case was hit?

No idea. I have just offered a way to do that. I do not think it is
particularly useful, but who knows... I do agree that mlock should be used
for the full mlock semantic.
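[Editorial note: the "fault the range manually with a for loop" alternative Vlastimil mentions can be sketched in user space. Python's mmap module stands in for a raw mmap(2) call here; this is illustrative only, and the follow-up mlock() step is omitted.]

```python
# Touch one byte per page so every page of an anonymous mapping is
# populated by an ordinary minor fault, with no reliance on
# MAP_POPULATE or MAP_LOCKED at mmap time.
import mmap

length = 16 * mmap.PAGESIZE
buf = mmap.mmap(-1, length)          # anonymous, demand-zero mapping

for off in range(0, length, mmap.PAGESIZE):
    buf[off]                         # the read faults the page in

# All pages are now resident; an mlock() on the range afterwards
# (via ctypes, not shown) would give the full populate-and-lock
# semantic without mmap()-time flags.
```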

> >>>This assumes that
> >>>MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> >>>it would have to.
> >>
> >>Yes, it would have to have a VM flag for the vma.
>
> So with your approach, VM_LOCKED flag is enough, right? The new MAP_ /
> MLOCK_ flags just cause setting VM_LOCKED to not fault the whole vma, but
> otherwise nothing changes.

VM_FAULTPOPULATE would have to be sticky to prevent other
speculative population of the mapping. I mean, is it OK to have a new
mlock semantic (on fault) which might still populate&lock memory which
hasn't been faulted directly? Who knows what kind of speculative things
we will do in the future and then find out that the semantic of
lock-on-fault is not usable anymore.

[...]

--
Michal Hocko
SUSE Labs

2015-06-25 14:16:48

by Eric B Munson

Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Tue, 23 Jun 2015, Vlastimil Babka wrote:

> On 06/15/2015 04:43 PM, Eric B Munson wrote:
> >>Note that the semantic of MAP_LOCKED can be subtly surprising:
> >>
> >>"mlock(2) fails if the memory range cannot get populated to guarantee
> >>that no future major faults will happen on the range.
> >>mmap(MAP_LOCKED) on the other hand silently succeeds even if the
> >>range was populated only
> >>partially."
> >>
> >>( from http://marc.info/?l=linux-mm&m=143152790412727&w=2 )
> >>
> >>So MAP_LOCKED can silently behave like MAP_LOCKONFAULT. While
> >>MAP_LOCKONFAULT doesn't suffer from such a problem, I wonder if that's
> >>a sufficient reason not to extend mmap() with new mlock() flags that can
> >>instead be applied to the VMA after mmapping, using the proposed
> >>mlock2() with flags. So I think instead we could deprecate
> >>MAP_LOCKED more prominently. I doubt the overhead of calling the
> >>extra syscall matters here?
> >
> >We could talk about retiring the MAP_LOCKED flag but I suspect that
> >would get significantly more pushback than adding a new mmap flag.
>
> Oh no, we can't "retire" as in remove the flag, ever. We just
> shouldn't continue down the path of mmap() flags related to mlock().
>
> >Likely that the overhead does not matter in most cases, but presumably
> >there are cases where it does (as we have a MAP_LOCKED flag today).
> >Even with the proposed new system calls I think we should have the
> >MAP_LOCKONFAULT for parity with MAP_LOCKED.
>
> I'm not convinced, but it's not a major issue.
>
> >>
> >>>- mlock() takes a `flags' argument. Presently that's
> >>> MLOCK_LOCKED|MLOCK_LOCKONFAULT.
> >>>
> >>>- munlock() takes a `flags' argument. MLOCK_LOCKED|MLOCK_LOCKONFAULT
> >>> to specify which flags are being cleared.
> >>>
> >>>- mlockall() and munlockall() ditto.
> >>>
> >>>
> >>>IOW, LOCKED and LOCKONFAULT are treated identically and independently.
> >>>
> >>>Now, that's how we would have designed all this on day one. And I
> >>>think we can do this now, by adding new mlock2() and munlock2()
> >>>syscalls. And we may as well deprecate the old mlock() and munlock(),
> >>>not that this matters much.
> >>>
> >>>*should* we do this? I'm thinking "yes" - it's all pretty simple
> >>>boilerplate and wrappers and such, and it gets the interface correct,
> >>>and extensible.
> >>
> >>If the new LOCKONFAULT functionality is indeed desired (I haven't
> >>still decided myself) then I agree that would be the cleanest way.
> >
> >Do you disagree with the use cases I have listed or do you think there
> >is a better way of addressing those cases?
>
> I'm somewhat sceptical about the security one. Are security-sensitive
> buffers really large enough to matter? The performance one is more
> convincing and I don't see a better way, so OK.

They can be; the two that come to mind are medical images and
high-resolution sensor data.

>
> >
> >>
> >>>What do others think?
>



2015-06-25 14:26:52

by Andy Lutomirski

Subject: Re: [RESEND PATCH V2 0/3] Allow user to request memory to be locked on page fault

On Thu, Jun 25, 2015 at 7:16 AM, Eric B Munson <[email protected]> wrote:
> On Tue, 23 Jun 2015, Vlastimil Babka wrote:
>
>> On 06/15/2015 04:43 PM, Eric B Munson wrote:
>> >>
>> >>If the new LOCKONFAULT functionality is indeed desired (I haven't
>> >>still decided myself) then I agree that would be the cleanest way.
>> >
>> >Do you disagree with the use cases I have listed or do you think there
>> >is a better way of addressing those cases?
>>
>> I'm somewhat sceptical about the security one. Are security
>> sensitive buffers that large to matter? The performance one is more
>> convincing and I don't see a better way, so OK.
>
> They can be; the two that come to mind are medical images and
> high-resolution sensor data.

I think we've been handling sensitive memory pages wrong forever. We
shouldn't lock them into memory; we should flag them as sensitive and
encrypt them if they're ever written out to disk.

--Andy

2015-06-25 14:46:54

by Eric B Munson

Subject: Re: [RESEND PATCH V2 1/3] Add mmap flag to request pages are locked after page fault

On Wed, 24 Jun 2015, Michal Hocko wrote:

> On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> > On Mon, 22 Jun 2015, Michal Hocko wrote:
> >
> > > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> [...]
> > > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> > > > new MAP_LOCKONFAULT flag (or both)?
> > >
> > > I thought the MAP_FAULTPOPULATE (or any other better name) would
> > > directly translate into VM_FAULTPOPULATE and wouldn't be tied to the
> > > locked semantic. We already have VM_LOCKED for that. The direct effect
> > > of the flag would be to prevent any population other than the direct
> > > page fault - including any speculative actions like fault-around or
> > > read-ahead.
> >
> > I like the ability to control other speculative population, but I am not
> > sure about overloading it with the VM_LOCKONFAULT case. Here is my
> > concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote
> > LOCKONFAULT, how can we tell the difference between someone who wants
> > to avoid read-ahead and someone who wants to use mlock()?
>
> Not sure I understand. Something like?
> addr = mmap(MAP_FAULTPOPULATE) # To prevent speculative mappings into the vma
> [...]
> mlock(addr, len) # Now I want the full mlock semantic

So this leaves us without the LOCKONFAULT semantics? That is not at all
what I am looking for. What I want is a way to express 3 possible
states of a VMA WRT locking: locked (populated and all pages on the
unevictable LRU), lock on fault (populated by page fault; pages that are
present are on the unevictable LRU, and newly faulted pages are added to
the same), and not locked.
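[Editorial note: the three states Eric enumerates can be written down as a tiny classifier. The VM_LOCKED / VM_LOCKONFAULT names follow the thread; the values and the helper are made up for illustration.]

```python
# Classify a vma's lock state from two hypothetical flag bits.
VM_LOCKED, VM_LOCKONFAULT = 0x1, 0x2

def lock_state(vm_flags):
    if vm_flags & VM_LOCKED:
        return "locked"          # fully populated, all pages unevictable
    if vm_flags & VM_LOCKONFAULT:
        return "lock-on-fault"   # pages join the unevictable LRU as faulted
    return "unlocked"

print(lock_state(VM_LOCKED), lock_state(VM_LOCKONFAULT), lock_state(0))
```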

>
> and the latter to have the full mlock semantic and populate the given
> area regardless of VM_FAULTPOPULATE being set on the vma? This is an
> interesting question because the mlock man page clearly states the
> semantic, and that is to _always_ populate or fail. So I originally
> thought that it would obey VM_FAULTPOPULATE, but this needs more
> thought.
>
> > This might lead to some
> > interesting states with mlock() and munlock() that take flags. For
> > instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by
> > munlock(MLOCK_LOCKED) leaves the VMAs in the same state with
> > VM_LOCKONFAULT set.
>
> This is really confusing. Let me try to rephrase that. So you have
> mlock(addr, len, MLOCK_ONFAULT)
> munlock(addr, len, MLOCK_LOCKED)
>
> IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't
> that behavior strange and unexpected? First of all, munlock has
> traditionally dropped the lock on the address range (e.g. what should
> happen if you did plain old munlock(addr, len)). But even setting that
> aside, you are trying to unlock something that hasn't been locked the
> same way. So I would expect -EINVAL at least, if the two modes should be
> really represented by different flags.

I would expect it to remain MLOCK_LOCKONFAULT because the user requested
munlock(addr, len, MLOCK_LOCKED). It is not currently an error to
unlock memory that is not locked. We do this because we do not require
the user to track which areas are locked. It is acceptable to have a mostly
locked area with holes unlocked with a single call to munlock that spans
the entire area. The same semantics should hold for munlock with flags.
If I have an area with MLOCK_LOCKED and MLOCK_ONFAULT interleaved, it
should be acceptable to clear the MLOCK_ONFAULT flag from those areas
with a single munlock call that spans the area.
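[Editorial note: the munlock-with-flags semantic Eric argues for can be modelled in a few lines. The MLOCK_* names follow the proposed interface; the VMA list and values are invented for illustration.]

```python
# A single munlock(..., flags) over an interleaved area clears the
# requested bit wherever it is set, and is not an error on VMAs that
# never had it - mirroring how munlock today spans unlocked holes.
MLOCK_LOCKED, MLOCK_ONFAULT = 0x1, 0x2

def munlock_range(vmas, clear):
    """Clear the `clear` bits on every VMA in the range; never -EINVAL."""
    return [flags & ~clear for flags in vmas]

area = [MLOCK_LOCKED, MLOCK_ONFAULT, 0, MLOCK_ONFAULT]  # interleaved
print(munlock_range(area, MLOCK_ONFAULT))  # [1, 0, 0, 0]
```

Because the call never fails partway through, no rollback of already-modified (and possibly merged or split) VMAs is needed - the point of Eric's following paragraph.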

On top of continuing with munlock semantics, the implementation would
need the ability to roll back an munlock call if it failed after altering
VMAs. If we have the same interleaved area as before and we go to
return -EINVAL the first time we hit an area that was MLOCK_LOCKED, how
do we restore the state of the VMAs we have already processed, and
possibly merged/split?
>
> Or did you mean the both types of lock like:
> mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
> mlock(addr, len, MLOCK_LOCKED)
> munlock(addr, len, MLOCK_LOCKED)
>
> and that should keep MLOCK_ONFAULT?
> This sounds even more weird to me because that means that the vma in
> question would be locked by two different mechanisms. MLOCK_LOCKED with
> the "always populate" semantic would rule out MLOCK_ONFAULT so what
> would be the meaning of the other flag then? Also what should regular
> munlock(addr, len) without flags unlock? Both?

This is indeed confusing and not what I was trying to illustrate, but
since you bring it up. mlockall() currently clears all flags and then
sets the new flags with each subsequent call. mlock2 would use that
same behavior: if LOCKED were specified for an ONFAULT region, that region
would become LOCKED, and vice versa.

I have the new system call set ready, I am waiting to post for rc1 so I
can run the benchmarks again on a base more stable than the middle of a
merge window. We should wait to hash out implementations until the code
is up rather than talk past each other here.

>
> > If we use VM_FAULTPOPULATE, the same pair of calls
> > would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in
> > the end, but I am concerned about the subtleties here.
>
> This sounds like the proper behavior to me. munlock should simply always
> drop VM_LOCKED and the VM_FAULTPOPULATE can live its separate life.
>
> Btw. could you be more specific about semantic of m{un}lock(addr, len, flags)
> you want to propose? The more I think about that the more I am unclear
> about it, especially munlock behavior and possible flags.
> --
> Michal Hocko
> SUSE Labs

