Hi,
the previous discussion (http://marc.info/?l=linux-mm&m=143022313618001&w=2)
made it clear that giving mmap(MAP_LOCKED) true mlock() semantics is too
dangerous. Even though we can try to reduce the failure space, the mmap
man page should spell out the subtle distinction between the two. That is
what the first patch does.
The second patch is a small clarification for MAP_POPULATE based on
David Rientjes' feedback.
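
For illustration only, the mmap(2) + mlock(2) pattern recommended by the
first patch might look like the sketch below. The helper name
map_locked_strict() is hypothetical, and the example assumes an anonymous
private mapping. The point is only that a population failure surfaces as an
error from mlock() instead of being silently ignored as with MAP_LOCKED.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch): map `len` bytes of anonymous
 * memory and lock them with a separate mlock() call.
 * Returns the mapping, or NULL with errno set if either step fails.
 */
void *map_locked_strict(size_t len)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Unlike mmap(MAP_LOCKED), mlock() reports a failed population. */
    if (mlock(p, len) != 0) {
        int saved = errno;      /* munmap() may clobber errno */
        munmap(p, len);
        errno = saved;
        return NULL;
    }
    return p;
}
```

A caller would treat a NULL return as "the range could not be locked" and
fall back or bail out, and would release a successful mapping with
munmap(p, len).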
From: Michal Hocko <[email protected]>
MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
it has been introduced.
mlock(2) fails if the memory range cannot get populated to guarantee
that no future major faults will happen on the range. mmap(MAP_LOCKED) on
the other hand silently succeeds even if the range was populated only
partially.
Fixing this subtle difference in the kernel is rather awkward because
the memory population happens after mm locks have been dropped and so
the cleanup before returning failure (munlock) could operate on something
else than the originally mapped area.
E.g. speculative userspace page fault handler catching SEGV and doing
mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
mmap and lead to lost data. Although it is not clear whether such a
usage would be valid, mmap page doesn't explicitly describe requirements
for threaded applications so we cannot exclude this possibility.
This patch makes the semantics of MAP_LOCKED explicit and suggests using
mmap + mlock as the only way to guarantee no later major page faults.
Signed-off-by: Michal Hocko <[email protected]>
---
man2/mmap.2 | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/man2/mmap.2 b/man2/mmap.2
index 54d68cf87e9e..1486be2e96b3 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -235,8 +235,19 @@ See the Linux kernel source file
for further information.
.TP
.BR MAP_LOCKED " (since Linux 2.5.37)"
-Lock the pages of the mapped region into memory in the manner of
+Mark the mmaped region to be locked in the same way as
.BR mlock (2).
+This implementation will try to populate (prefault) the whole range but
+the mmap call doesn't fail with
+.B ENOMEM
+if this fails. Therefore major faults might happen later on. So the semantic
+is not as strong as
+.BR mlock (2).
+.BR mmap (2)
++
+.BR mlock (2)
+should be used when major faults are not acceptable after the initialization
+of the mapping.
This flag is ignored in older kernels.
.\" If set, the mapped pages will not be swapped out.
.TP
--
2.1.4
From: Michal Hocko <[email protected]>
David Rientjes has noticed that the MAP_POPULATE wording might promise much
more than the kernel actually provides or intends to provide. The
primary usage of the flag is to pre-fault the range. There is no
guarantee that no major faults will happen later on. The pages might
have been reclaimed by the time the process tries to access them.
Signed-off-by: Michal Hocko <[email protected]>
---
man2/mmap.2 | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/man2/mmap.2 b/man2/mmap.2
index 1486be2e96b3..dcf306f2f730 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -284,7 +284,7 @@ private writable mappings.
.BR MAP_POPULATE " (since Linux 2.5.46)"
Populate (prefault) page tables for a mapping.
For a file mapping, this causes read-ahead on the file.
-Later accesses to the mapping will not be blocked by page faults.
+This will help to reduce blocking on page faults later.
.BR MAP_POPULATE
is supported for private mappings only since Linux 2.6.23.
.TP
--
2.1.4
On Wed, 13 May 2015, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
>
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
>
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
>
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
>
> Signed-off-by: Michal Hocko <[email protected]>
Does the problem still happen when MAP_POPULATE | MAP_LOCKED is used?
(AFAICT MAP_POPULATE will cause the mmap to fail if all the pages cannot
be made present.)
Either way this is a good catch.
Acked-by: Eric B Munson <[email protected]>
On Wed, 13 May 2015, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David Rientjes has noticed that MAP_POPULATE wording might promise much
> more than the kernel actually provides and intend to provide. The
> primary usage of the flag is to pre-fault the range. There is no
> guarantee that no major faults will happen later on. The pages might
> have been reclaimed by the time the process tries to access them.
>
> Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Eric B Munson <[email protected]>
On Wed, 13 May 2015, Eric B Munson wrote:
> On Wed, 13 May 2015, Michal Hocko wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> > it has been introduced.
> > mlock(2) fails if the memory range cannot get populated to guarantee
> > that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> > the other hand silently succeeds even if the range was populated only
> > partially.
> >
> > Fixing this subtle difference in the kernel is rather awkward because
> > the memory population happens after mm locks have been dropped and so
> > the cleanup before returning failure (munlock) could operate on something
> > else than the originally mapped area.
> >
> > E.g. speculative userspace page fault handler catching SEGV and doing
> > mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> > mmap and lead to lost data. Although it is not clear whether such a
> > usage would be valid, mmap page doesn't explicitly describe requirements
> > for threaded applications so we cannot exclude this possibility.
> >
> > This patch makes the semantic of MAP_LOCKED explicit and suggest using
> > mmap + mlock as the only way to guarantee no later major page faults.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Does the problem still happend when MAP_POPULATE | MAP_LOCKED is used
> (AFAICT MAP_POPULATE will cause the mmap to fail if all the pages cannot
> be made present).
>
> Either way this is a good catch.
>
> Acked-by: Eric B Munson <[email protected]>
>
Sorry for the noise, this should have been a
Reviewed-by: Eric B Munson <[email protected]>
On Wed 13-05-15 10:45:06, Eric B Munson wrote:
> On Wed, 13 May 2015, Michal Hocko wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> > it has been introduced.
> > mlock(2) fails if the memory range cannot get populated to guarantee
> > that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> > the other hand silently succeeds even if the range was populated only
> > partially.
> >
> > Fixing this subtle difference in the kernel is rather awkward because
> > the memory population happens after mm locks have been dropped and so
> > the cleanup before returning failure (munlock) could operate on something
> > else than the originally mapped area.
> >
> > E.g. speculative userspace page fault handler catching SEGV and doing
> > mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> > mmap and lead to lost data. Although it is not clear whether such a
> > usage would be valid, mmap page doesn't explicitly describe requirements
> > for threaded applications so we cannot exclude this possibility.
> >
> > This patch makes the semantic of MAP_LOCKED explicit and suggest using
> > mmap + mlock as the only way to guarantee no later major page faults.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Does the problem still happend when MAP_POPULATE | MAP_LOCKED is used
> (AFAICT MAP_POPULATE will cause the mmap to fail if all the pages cannot
> be made present).
No, there is no difference, because MAP_POPULATE is implicit when
MAP_LOCKED is used and, as pointed out in the cover letter, we cannot fail
after the vma is created and the locks are dropped. The second patch tries
to clarify that MAP_POPULATE is just best effort.
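
To illustrate the best-effort point: a caller that cares can ask the kernel
afterwards which pages are actually resident. The following is a minimal
sketch using mincore(2). The helper name count_resident_pages() is
hypothetical, and the result is only a snapshot that may be stale by the
time it is returned.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes mincore() and MAP_POPULATE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of [addr, addr + len) are
 * resident in memory right now. `addr` must be page aligned.
 * Returns the number of resident pages, or -1 on error.
 */
long count_resident_pages(void *addr, size_t len)
{
    assert(addr != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return -1;

    /* mincore() fills one status byte per page of the range. */
    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;   /* low bit set: page is resident */

    free(vec);
    return resident;
}
```

After mmap(..., MAP_POPULATE or MAP_LOCKED, ...), a result smaller than the
number of pages in the range would indicate that the population was only
partial, or that pages have been reclaimed since.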
> Either way this is a good catch.
>
> Acked-by: Eric B Munson <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
On 05/13/2015 04:38 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
>
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
>
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
>
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
Thanks, Michal. Applied, with Reviewed-by: from Eric added.
Cheers,
Michael
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> man2/mmap.2 | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 54d68cf87e9e..1486be2e96b3 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -235,8 +235,19 @@ See the Linux kernel source file
> for further information.
> .TP
> .BR MAP_LOCKED " (since Linux 2.5.37)"
> -Lock the pages of the mapped region into memory in the manner of
> +Mark the mmaped region to be locked in the same way as
> .BR mlock (2).
> +This implementation will try to populate (prefault) the whole range but
> +the mmap call doesn't fail with
> +.B ENOMEM
> +if this fails. Therefore major faults might happen later on. So the semantic
> +is not as strong as
> +.BR mlock (2).
> +.BR mmap (2)
> ++
> +.BR mlock (2)
> +should be used when major faults are not acceptable after the initialization
> +of the mapping.
> This flag is ignored in older kernels.
> .\" If set, the mapped pages will not be swapped out.
> .TP
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
On 05/13/2015 04:38 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David Rientjes has noticed that MAP_POPULATE wording might promise much
> more than the kernel actually provides and intend to provide. The
> primary usage of the flag is to pre-fault the range. There is no
> guarantee that no major faults will happen later on. The pages might
> have been reclaimed by the time the process tries to access them.
Yes, thanks, Michal -- that's a good point to make clearer.
Applied, with Reviewed-by: from Eric added.
Cheers,
Michael
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> man2/mmap.2 | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 1486be2e96b3..dcf306f2f730 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -284,7 +284,7 @@ private writable mappings.
> .BR MAP_POPULATE " (since Linux 2.5.46)"
> Populate (prefault) page tables for a mapping.
> For a file mapping, this causes read-ahead on the file.
> -Later accesses to the mapping will not be blocked by page faults.
> +This will help to reduce blocking on page faults later.
> .BR MAP_POPULATE
> is supported for private mappings only since Linux 2.6.23.
> .TP
>
On Wed, 13 May 2015, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David Rientjes has noticed that MAP_POPULATE wording might promise much
> more than the kernel actually provides and intend to provide. The
> primary usage of the flag is to pre-fault the range. There is no
> guarantee that no major faults will happen later on. The pages might
> have been reclaimed by the time the process tries to access them.
>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Thanks for following up!
On Wed 13-05-15 16:38:10, Michal Hocko wrote:
> Hi,
> during the previous discussion http://marc.info/?l=linux-mm&m=143022313618001&w=2
> it was made clear that making mmap(MAP_LOCKED) semantic really have
> mlock() semantic is too dangerous. Even though we can try to reduce the
> failure space the mmap man page should make it really clear about the
> subtle distinctions between the two. This is what that first patch does.
> The second patch is a small clarification for MAP_POPULATE based on
> David Rientjes feedback.
I completely forgot about the in-kernel documentation.
---
From 9d1478ccd036f84e50da906e39cd1e7bcb94cecd Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 18 May 2015 11:07:00 +0200
Subject: [PATCH] Documentation/vm/unevictable-lru.txt: clarify MAP_LOCKED
behavior
There is a very subtle difference between the semantics of mmap()+mlock()
and mmap(MAP_LOCKED). The former fails if the population of the area
fails, while the latter doesn't. This basically means that
mmap(MAP_LOCKED) areas might see major faults after the mmap syscall
returns, which is not the case for mlock. The mmap man page has already
been altered, but Documentation/vm/unevictable-lru.txt deserves a
clarification as well.
Reported-by: David Rientjes <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/vm/unevictable-lru.txt | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 3be0bfc4738d..32ee3a67dba2 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -467,7 +467,13 @@ mmap(MAP_LOCKED) SYSTEM CALL HANDLING
In addition the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
-call. Furthermore, any mmap() call or brk() call that expands the heap by a
+call. There is one important and subtle difference here, though. mmap() + mlock()
+will fail if the range cannot be faulted in (e.g. because mm_populate fails)
+and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmaped
+area will still have properties of the locked area - aka. pages will not get
+swapped out - but major page faults to fault memory in might still happen.
+
+Furthermore, any mmap() call or brk() call that expands the heap by a
task that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked. Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages and
--
2.1.4
--
Michal Hocko
SUSE Labs