2017-12-13 09:26:10

by Michal Hocko

[permalink] [raw]
Subject: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE


Hi,
I am resending with some minor updates based on Michael's review and
ask for inclusion. There haven't been any fundamental objections for
the RFC [1] nor the previous version [2]. The biggest discussion
revolved around the naming. There were many suggestions flowing
around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
MAP_FIXED_NOREPLACE etc...

I am afraid we can bikeshed this to death and there will still be
somebody finding yet another better name. Therefore I've decided to
stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
MAP_FIXED prefix which should be recognized by developers and _SAFE
suffix should also be clear that all dangerous side effects of the old
MAP_FIXED are gone.

If somebody _really_ hates this then feel free to nack and resubmit
with a different name you can find a consensus for. I am sorry to be
stubborn here but I would rather have this merged than go over few more
iterations changing the name just because it seems like a good idea
now. My experience tells me that chances are that the name will turn out
to be "suboptimal" anyway over time.

Some more background:
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack). The
reason for the failure is that some architectures enforce an alignment
for the given address hint without MAP_FIXED used (e.g. for shared or
file backed mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.

The first patch introduced MAP_FIXED_SAFE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
flags evaluation because we do not EINVAL on unknown flags. On those
kernels we would simply use the traditional hint based semantic so the
caller can still get a different address (which sucks) but at least not
silently corrupt an existing mapping. I do not see a good way around
that. Except we won't export expose the new semantic to the userspace at
all.

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_SAFE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
follow. Actually real MAP_FIXED usages should be docummented properly
and they should be more of an exception.

Diffstat says
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/metag/kernel/process.c | 6 +++++-
arch/mips/include/uapi/asm/mman.h | 2 ++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/sparc/include/uapi/asm/mman.h | 1 -
arch/xtensa/include/uapi/asm/mman.h | 2 ++
fs/binfmt_elf.c | 12 ++++++++----
include/uapi/asm-generic/mman-common.h | 1 +
mm/mmap.c | 11 +++++++++++
9 files changed, 32 insertions(+), 6 deletions(-)

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]
[5] http://lkml.kernel.org/r/[email protected]
[6] http://lkml.kernel.org/r/[email protected]
[7] http://lkml.kernel.org/r/[email protected]


2017-12-13 09:26:07

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1/2] mm: introduce MAP_FIXED_SAFE

From: Michal Hocko <[email protected]>

MAP_FIXED is used quite often to enforce mapping at the particular
range. The main problem of this flag is, however, that it is inherently
dangerous because it unmaps existing mappings covered by the requested
range. This can cause silent memory corruptions. Some of them even with
serious security implications. While the current semantic might be
really desiderable in many cases there are others which would want to
enforce the given range but rather see a failure than a silent memory
corruption on a clashing range. Please note that there is no guarantee
that a given range is obeyed by the mmap even when it is free - e.g.
arch specific code is allowed to apply an alignment.

Introduce a new MAP_FIXED_SAFE flag for mmap to achieve this behavior.
It has the same semantic as MAP_FIXED wrt. the given address request
with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize
the flag. Unfortunately mmap sucks wrt. flags evaluation because we do
not EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.

Changes since v1
- define MAP_FIXED_SAFE in asm-generic/mman-common.h as per Michael
Ellerman because all architecture which use this header can share
the same value. This will leave us with only 4 arches which need
special handling.

[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Reviewed-by: Khalid Aziz <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 2 ++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/sparc/include/uapi/asm/mman.h | 1 -
arch/xtensa/include/uapi/asm/mman.h | 2 ++
include/uapi/asm-generic/mman-common.h | 1 +
mm/mmap.c | 11 +++++++++++
7 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 6bf730063e3f..7287dbf1e11b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -31,6 +31,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_FIXED_SAFE 0x200000 /* MAP_FIXED which doesn't unmap underlying mapping */

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 20c3df7a8fdd..f1e15890345c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -50,6 +50,8 @@
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */

+#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+
/*
* Flags for msync
*/
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index d1af0d74a188..daf0282ac417 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,6 +26,8 @@
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */

+#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
#define MS_INVALIDATE 4 /* invalidate the caches */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 715a2c927e79..d21bffd5d3dc 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -25,5 +25,4 @@
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */

-
#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 2bfe590694fc..0daf199caa57 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -56,6 +56,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
@@ -63,6 +64,7 @@
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif

+
/*
* Flags for msync
*/
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6d319c46fd90..1eca2cb10d44 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -25,6 +25,7 @@
#else
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif
+#define MAP_FIXED_SAFE 0x80000 /* MAP_FIXED which doesn't unmap underlying mapping */

/*
* Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 0de87a376aaa..447223a2e469 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!(file && path_noexec(&file->f_path)))
prot |= PROT_EXEC;

+ /* force arch specific MAP_FIXED handling in get_unmapped_area */
+ if (flags & MAP_FIXED_SAFE)
+ flags |= MAP_FIXED;
+
if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);

@@ -1365,6 +1369,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (offset_in_page(addr))
return addr;

+ if (flags & MAP_FIXED_SAFE) {
+ struct vm_area_struct *vma = find_vma(mm, addr);
+
+ if (vma && vma->vm_start <= addr)
+ return -EEXIST;
+ }
+
if (prot == PROT_EXEC) {
pkey = execute_only_pkey(mm);
if (pkey < 0)
--
2.15.0

2017-12-13 09:26:47

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

From: Michal Hocko <[email protected]>

Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable. Let's take CVE-2017-1000253 as an example. At the time
(before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address
should succeed even without MAP_FIXED. Something is terribly wrong if
this is not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

Cc: Abdul Haleem <[email protected]>
Cc: Joel Stanley <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/metag/kernel/process.c | 6 +++++-
fs/binfmt_elf.c | 12 ++++++++----
2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 0909834c83a7..867c8d0a5fb4 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
tcm_tag = tcm_lookup_tag(addr);

if (tcm_tag != TCM_INVALID_TAG)
- type &= ~MAP_FIXED;
+ type &= ~(MAP_FIXED | MAP_FIXED_SAFE);

/*
* total_size is the size of the ELF (interpreter) image.
@@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

+ if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+ pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+ task_pid_nr(current), tsk->comm, (void*)addr);
+
if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
struct tcm_allocation *tcm;
unsigned long tcm_addr;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 73b01e474fdc..5916d45f64a7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

+ if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+ pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+ task_pid_nr(current), current->comm, (void*)addr);
+
return(map_addr);
}

@@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
elf_prot |= PROT_EXEC;
vaddr = eppnt->p_vaddr;
if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
- elf_type |= MAP_FIXED;
+ elf_type |= MAP_FIXED_SAFE;
else if (no_base && interp_elf_ex->e_type == ET_DYN)
load_addr = -vaddr;

@@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
* the ET_DYN load_addr calculations, proceed normally.
*/
if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
- elf_flags |= MAP_FIXED;
+ elf_flags |= MAP_FIXED_SAFE;
} else if (loc->elf_ex.e_type == ET_DYN) {
/*
* This logic is run once for the first LOAD Program
@@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
load_bias = ELF_ET_DYN_BASE;
if (current->flags & PF_RANDOMIZE)
load_bias += arch_mmap_rnd();
- elf_flags |= MAP_FIXED;
+ elf_flags |= MAP_FIXED_SAFE;
} else
load_bias = 0;

@@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
(eppnt->p_filesz +
ELF_PAGEOFFSET(eppnt->p_vaddr)),
PROT_READ | PROT_WRITE | PROT_EXEC,
- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+ MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
(eppnt->p_offset -
ELF_PAGEOFFSET(eppnt->p_vaddr)));
if (error != ELF_PAGESTART(eppnt->p_vaddr))
--
2.15.0

2017-12-13 09:31:30

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1/2] mmap.2: document new MAP_FIXED_SAFE flag

From: Michal Hocko <[email protected]>

4.16+ kernels offer a new MAP_FIXED_SAFE flag which allows the caller to
atomicaly probe for a given address range.

[wording heavily updated by John Hubbard <[email protected]>]
Signed-off-by: Michal Hocko <[email protected]>
---
man2/mmap.2 | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/man2/mmap.2 b/man2/mmap.2
index a5a8eb47a263..02d391697ce6 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -227,6 +227,22 @@ in mind that the exact layout of a process' memory map is allowed to change
significantly between kernel versions, C library versions, and operating system
releases.
.TP
+.BR MAP_FIXED_SAFE " (since Linux 4.16)"
+Similar to MAP_FIXED with respect to the
+.I
+addr
+enforcement, but different in that MAP_FIXED_SAFE never clobbers a pre-existing
+mapped range. If the requested range would collide with an existing
+mapping, then this call fails with
+.B EEXIST.
+This flag can therefore be used as a way to atomically (with respect to other
+threads) attempt to map an address range: one thread will succeed; all others
+will report failure. Please note that older kernels which do not recognize this
+flag will typically (upon detecting a collision with a pre-existing mapping)
+fall back to a "non-MAP_FIXED" type of behavior: they will return an address that
+is different than the requested one. Therefore, backward-compatible software
+should check the returned address against the requested address.
+.TP
.B MAP_GROWSDOWN
This flag is used for stacks.
It indicates to the kernel virtual memory system that the mapping
@@ -451,6 +467,12 @@ is not a valid file descriptor (and
.B MAP_ANONYMOUS
was not set).
.TP
+.B EEXIST
+range covered by
+.IR addr ,
+.IR length
+is clashing with an existing mapping.
+.TP
.B EINVAL
We don't like
.IR addr ,
--
2.15.0

2017-12-13 09:31:45

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

From: John Hubbard <[email protected]>

-- Expand the documentation to discuss the hazards in
enough detail to allow avoiding them.

-- Mention the upcoming MAP_FIXED_SAFE flag.

-- Enhance the alignment requirement slightly.

CC: Michael Ellerman <[email protected]>
CC: Jann Horn <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Mike Rapoport <[email protected]>
CC: Cyril Hrubis <[email protected]>
CC: Pavel Machek <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 02d391697ce6..cb8789daec2d 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -212,8 +212,9 @@ Don't interpret
.I addr
as a hint: place the mapping at exactly that address.
.I addr
-must be a multiple of the page size.
-If the memory region specified by
+must be suitably aligned: for most architectures a multiple of page
+size is sufficient; however, some architectures may impose additional
+restrictions. If the memory region specified by
.I addr
and
.I len
@@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
in mind that the exact layout of a process' memory map is allowed to change
significantly between kernel versions, C library versions, and operating system
releases.
+.IP
+Furthermore, this option is extremely hazardous (when used on its own), because
+it forcibly removes pre-existing mappings, making it easy for a multi-threaded
+process to corrupt its own address space.
+.IP
+For example, thread A looks through
+.I /proc/<pid>/maps
+and locates an available
+address range, while thread B simultaneously acquires part or all of that same
+address range. Thread A then calls mmap(MAP_FIXED), effectively overwriting
+the mapping that thread B created.
+.IP
+Thread B need not create a mapping directly; simply making a library call
+that, internally, uses
+.I dlopen(3)
+to load some other shared library, will
+suffice. The dlopen(3) call will map the library into the process's address
+space. Furthermore, almost any library call may be implemented using this
+technique.
+Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
+(http://www.linux-pam.org).
+.IP
+Newer kernels
+(Linux 4.16 and later) have a
+.B MAP_FIXED_SAFE
+option that avoids the corruption problem; if available, MAP_FIXED_SAFE
+should be preferred over MAP_FIXED.
.TP
.BR MAP_FIXED_SAFE " (since Linux 4.16)"
Similar to MAP_FIXED with respect to the
--
2.15.0

2017-12-13 12:25:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Wed, Dec 13, 2017 at 10:25:48AM +0100, Michal Hocko wrote:
> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.

I liked basically every other name suggested more than MAP_FIXED_SAFE.
"Safe against what?" was an important question.

MAP_AT_ADDR was the best suggestion I saw that wasn't one of mine. Of
my suggestions, I liked MAP_STATIC the best.

2017-12-13 12:34:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Wed 13-12-17 04:25:33, Matthew Wilcox wrote:
> On Wed, Dec 13, 2017 at 10:25:48AM +0100, Michal Hocko wrote:
> > I am afraid we can bikeshed this to death and there will still be
> > somebody finding yet another better name. Therefore I've decided to
> > stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> > MAP_FIXED prefix which should be recognized by developers and _SAFE
> > suffix should also be clear that all dangerous side effects of the old
> > MAP_FIXED are gone.
>
> I liked basically every other name suggested more than MAP_FIXED_SAFE.
> "Safe against what?" was an important question.
>
> MAP_AT_ADDR was the best suggestion I saw that wasn't one of mine. Of
> my suggestions, I liked MAP_STATIC the best.

The question is whether you care enough to pursue this further yourself.
Because as I've said I do not want to spend another round discussing the
name. The flag is documented and I believe that the name has some merit.
Disagreeing on naming is the easiest pitfall to block otherwise useful
functionality from being merged. And I am pretty sure there will be
always somebody objecting...
--
Michal Hocko
SUSE Labs

2017-12-13 12:50:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: introduce MAP_FIXED_SAFE

On Wed, Dec 13, 2017 at 10:25:49AM +0100, Michal Hocko wrote:
> +++ b/mm/mmap.c
> @@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> if (!(file && path_noexec(&file->f_path)))
> prot |= PROT_EXEC;
>
> + /* force arch specific MAP_FIXED handling in get_unmapped_area */
> + if (flags & MAP_FIXED_SAFE)
> + flags |= MAP_FIXED;
> +
> if (!(flags & MAP_FIXED))
> addr = round_hint_to_min(addr);
>

We're up to 22 MAP_ flags now. We'll run out soon. Let's preserve half
of a flag by giving userspace the definition:

#define MAP_FIXED_SAFE (MAP_FIXED | _MAP_NOT_HINT)

then in here:

if ((flags & _MAP_NOT_HINT) && !(flags & MAP_FIXED))
return -EINVAL;

Now we can use _MAP_NOT_HINT all by itself in the future to mean
something else.

2017-12-13 12:55:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> From: John Hubbard <[email protected]>
>
> -- Expand the documentation to discuss the hazards in
> enough detail to allow avoiding them.
>
> -- Mention the upcoming MAP_FIXED_SAFE flag.

Pretty map everyone agreed MAP_FIXED_SAFE was a bad
name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (523.00 B)
signature.asc (181.00 B)
Digital signature
Download all attachments

2017-12-13 13:01:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: introduce MAP_FIXED_SAFE

On Wed 13-12-17 04:50:53, Matthew Wilcox wrote:
> On Wed, Dec 13, 2017 at 10:25:49AM +0100, Michal Hocko wrote:
> > +++ b/mm/mmap.c
> > @@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > if (!(file && path_noexec(&file->f_path)))
> > prot |= PROT_EXEC;
> >
> > + /* force arch specific MAP_FIXED handling in get_unmapped_area */
> > + if (flags & MAP_FIXED_SAFE)
> > + flags |= MAP_FIXED;
> > +
> > if (!(flags & MAP_FIXED))
> > addr = round_hint_to_min(addr);
> >
>
> We're up to 22 MAP_ flags now. We'll run out soon. Let's preserve half
> of a flag by giving userspace the definition:
>
> #define MAP_FIXED_SAFE (MAP_FIXED | _MAP_NOT_HINT)

I've already tried to explain why this cannot be a modifier for
MAP_FIXED. Read about the backward compatibility note...
Or do I misunderstand what you are saying here?

> then in here:
>
> if ((flags & _MAP_NOT_HINT) && !(flags & MAP_FIXED))
> return -EINVAL;
>
> Now we can use _MAP_NOT_HINT all by itself in the future to mean
> something else.

--
Michal Hocko
SUSE Labs

2017-12-13 13:04:41

by Cyril Hrubis

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Hi!
> Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

For what it's worth I do agree here.

--
Cyril Hrubis
[email protected]

2017-12-13 13:05:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > From: John Hubbard <[email protected]>
> >
> > -- Expand the documentation to discuss the hazards in
> > enough detail to allow avoiding them.
> >
> > -- Mention the upcoming MAP_FIXED_SAFE flag.
>
> Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

Please read http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2017-12-13 13:09:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > From: John Hubbard <[email protected]>
> > >
> > > -- Expand the documentation to discuss the hazards in
> > > enough detail to allow avoiding them.
> > >
> > > -- Mention the upcoming MAP_FIXED_SAFE flag.
> >
> > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
>
> Please read http://lkml.kernel.org/r/[email protected]

Please fix your patches according to the feedback...

NACCKED-by: Pavel Machek <[email protected]>

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (828.00 B)
signature.asc (181.00 B)
Digital signature
Download all attachments

2017-12-13 13:16:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > From: John Hubbard <[email protected]>
> > > >
> > > > -- Expand the documentation to discuss the hazards in
> > > > enough detail to allow avoiding them.
> > > >
> > > > -- Mention the upcoming MAP_FIXED_SAFE flag.
> > >
> > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> >
> > Please read http://lkml.kernel.org/r/[email protected]
>
> Please fix your patches according to the feedback...
>
> NACCKED-by: Pavel Machek <[email protected]>

Good luck pursuing this further then. I am not going to spend time on
naming bikeheds. I have more pressing stuff to work on.
--
Michal Hocko
SUSE Labs

2017-12-13 13:21:11

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 2017-12-13 14:16:40, Michal Hocko wrote:
> On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> > On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > > From: John Hubbard <[email protected]>
> > > > >
> > > > > -- Expand the documentation to discuss the hazards in
> > > > > enough detail to allow avoiding them.
> > > > >
> > > > > -- Mention the upcoming MAP_FIXED_SAFE flag.
> > > >
> > > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> > >
> > > Please read http://lkml.kernel.org/r/[email protected]
> >
> > Please fix your patches according to the feedback...
> >
> > NACCKED-by: Pavel Machek <[email protected]>
>
> Good luck pursuing this further then. I am not going to spend time on
> naming bikeheds. I have more pressing stuff to work on.

You selected stupid name for a flag. Everyone and their dog agrees
with that. There's even consensus on better name (and everyone agrees
it is better than .._SAFE). Of course, we could have debate if it is
NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
poor naming on your part.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.40 kB)
signature.asc (181.00 B)
Digital signature
Download all attachments

2017-12-13 13:35:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 13-12-17 14:21:05, Pavel Machek wrote:
> On Wed 2017-12-13 14:16:40, Michal Hocko wrote:
> > On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> > > On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > > > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > > > From: John Hubbard <[email protected]>
> > > > > >
> > > > > > -- Expand the documentation to discuss the hazards in
> > > > > > enough detail to allow avoiding them.
> > > > > >
> > > > > > -- Mention the upcoming MAP_FIXED_SAFE flag.
> > > > >
> > > > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> > > >
> > > > Please read http://lkml.kernel.org/r/[email protected]
> > >
> > > Please fix your patches according to the feedback...
> > >
> > > NACCKED-by: Pavel Machek <[email protected]>
> >
> > Good luck pursuing this further then. I am not going to spend time on
> > naming bikeheds. I have more pressing stuff to work on.
>
> You selected stupid name for a flag. Everyone and their dog agrees
> with that.

Not sure about your dog but mine says that a flag which fixes an
_unsafe_ aspects of MAP_FIXED can be called MAP_FIXED_SAFE just fine.

Anyway, I am not going to argue about this further. I've implemented the
code, gathered uscases and fortified an in-kernel user which already led
to a security issue in the past. I consider my part done here. I do not
agree that MAP_FIXED_NOREPLACE would be so much better to respin and
then deal with what about this MAP_$FOO. If there are really stong
feelings about this then feel free to take these patches, do
s@MAP_FIXED_SAFE@MAP_$FOO@ and try to upstream them yourself.
--
Michal Hocko
SUSE Labs

2017-12-13 14:42:18

by Cyril Hrubis

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Hi!
> You selected stupid name for a flag. Everyone and their dog agrees
> with that. There's even consensus on better name (and everyone agrees
> it is better than .._SAFE). Of course, we could have debate if it is
> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
> poor naming on your part.

Well while everybody agrees that the name is so bad that basically
anything else would be better, there does not seem to be consensus on
which one to pick. I do understand that this frustrating and fruitless.

So what do we do now, roll a dice to choose new name?

Or do we ask BFDL[1] to choose the name?

[1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life

--
Cyril Hrubis
[email protected]

2017-12-13 17:13:21

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Wed, Dec 13, 2017 at 1:25 AM, Michal Hocko <[email protected]> wrote:
>
> Hi,
> I am resending with some minor updates based on Michael's review and
> ask for inclusion. There haven't been any fundamental objections for
> the RFC [1] nor the previous version [2]. The biggest discussion
> revolved around the naming. There were many suggestions flowing
> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
> MAP_FIXED_NOREPLACE etc...

With this named MAP_FIXED_NOREPLACE (the best consensus we've got on a
name), please consider this series:

Acked-by: Kees Cook <[email protected]>

-Kees

>
> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.
>
> If somebody _really_ hates this then feel free to nack and resubmit
> with a different name you can find a consensus for. I am sorry to be
> stubborn here but I would rather have this merged than go over few more
> iterations changing the name just because it seems like a good idea
> now. My experience tells me that chances are that the name will turn out
> to be "suboptimal" anyway over time.
>
> Some more background:
> This has started as a follow up discussion [3][4] resulting in the
> runtime failure caused by hardening patch [5] which removes MAP_FIXED
> from the elf loader because MAP_FIXED is inherently dangerous as it
> might silently clobber an existing underlying mapping (e.g. stack). The
> reason for the failure is that some architectures enforce an alignment
> for the given address hint without MAP_FIXED used (e.g. for shared or
> file backed mappings).
>
> One way around this would be excluding those archs which do alignment
> tricks from the hardening [6]. The patch is really trivial but it has
> been objected, rightfully so, that this screams for a more generic
> solution. We basically want a non-destructive MAP_FIXED.
>
> The first patch introduced MAP_FIXED_SAFE which enforces the given
> address but unlike MAP_FIXED it fails with EEXIST if the given range
> conflicts with an existing one. The flag is introduced as a completely
> new one rather than a MAP_FIXED extension because of the backward
> compatibility. We really want a never-clobber semantic even on older
> kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> flags evaluation because we do not EINVAL on unknown flags. On those
> kernels we would simply use the traditional hint based semantic so the
> caller can still get a different address (which sucks) but at least not
> silently corrupt an existing mapping. I do not see a good way around
> that. Except we won't export expose the new semantic to the userspace at
> all.
>
> It seems there are users who would like to have something like that.
> Jemalloc has been mentioned by Michael Ellerman [7]
>
> Florian Weimer has mentioned the following:
> : glibc ld.so currently maps DSOs without hints. This means that the kernel
> : will map right next to each other, and the offsets between them a completely
> : predictable. We would like to change that and supply a random address in a
> : window of the address space. If there is a conflict, we do not want the
> : kernel to pick a non-random address. Instead, we would try again with a
> : random address.
>
> John Hubbard has mentioned CUDA example
> : a) Searches /proc/<pid>/maps for a "suitable" region of available
> : VA space. "Suitable" generally means it has to have a base address
> : within a certain limited range (a particular device model might
> : have odd limitations, for example), it has to be large enough, and
> : alignment has to be large enough (again, various devices may have
> : constraints that lead us to do this).
> :
> : This is of course subject to races with other threads in the process.
> :
> : Let's say it finds a region starting at va.
> :
> : b) Next it does:
> : p = mmap(va, ...)
> :
> : *without* setting MAP_FIXED, of course (so va is just a hint), to
> : attempt to safely reserve that region. If p != va, then in most cases,
> : this is a failure (almost certainly due to another thread getting a
> : mapping from that region before we did), and so this layer now has to
> : call munmap(), before returning a "failure: retry" to upper layers.
> :
> : IMPROVEMENT: --> if instead, we could call this:
> :
> : p = mmap(va, ... MAP_FIXED_SAFE ...)
> :
> : , then we could skip the munmap() call upon failure. This
> : is a small thing, but it is useful here. (Thanks to Piotr
> : Jaroszynski and Mark Hairgrove for helping me get that detail
> : exactly right, btw.)
> :
> : c) After that, CUDA suballocates from p, via:
> :
> : q = mmap(sub_region_start, ... MAP_FIXED ...)
> :
> : Interestingly enough, "freeing" is also done via MAP_FIXED, and
> : setting PROT_NONE to the subregion. Anyway, I just included (c) for
> : general interest.
>
> Atomic address range probing in the multithreaded programs in general
> sounds like an interesting thing to me.
>
> The second patch simply replaces MAP_FIXED use in elf loader by
> MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> follow. Actually real MAP_FIXED usages should be docummented properly
> and they should be more of an exception.
>
> Diffstat says
> arch/alpha/include/uapi/asm/mman.h | 1 +
> arch/metag/kernel/process.c | 6 +++++-
> arch/mips/include/uapi/asm/mman.h | 2 ++
> arch/parisc/include/uapi/asm/mman.h | 2 ++
> arch/sparc/include/uapi/asm/mman.h | 1 -
> arch/xtensa/include/uapi/asm/mman.h | 2 ++
> fs/binfmt_elf.c | 12 ++++++++----
> include/uapi/asm-generic/mman-common.h | 1 +
> mm/mmap.c | 11 +++++++++++
> 9 files changed, 32 insertions(+), 6 deletions(-)
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> [3] http://lkml.kernel.org/r/[email protected]
> [4] http://lkml.kernel.org/r/[email protected]
> [5] http://lkml.kernel.org/r/[email protected]
> [6] http://lkml.kernel.org/r/[email protected]
> [7] http://lkml.kernel.org/r/[email protected]
>



--
Kees Cook
Pixel Security

2017-12-13 23:20:14

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <[email protected]> wrote:
> Hi!
>> You selected stupid name for a flag. Everyone and their dog agrees
>> with that. There's even consensus on better name (and everyone agrees
>> it is better than .._SAFE). Of course, we could have debate if it is
>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>> poor naming on your part.
>
> Well while everybody agrees that the name is so bad that basically
> anything else would be better, there does not seem to be consensus on
> which one to pick. I do understand that this frustrating and fruitless.

Based on the earlier threads where I tried to end the bikeshedding, it
seemed like MAP_FIXED_NOREPLACE was the least bad option.

> So what do we do now, roll a dice to choose new name?
>
> Or do we ask BFDL[1] to choose the name?

I'd like to hear feedback from Michael Kerrisk, as he's had to deal
with these kinds of choices in the past. I'm fine to ask Linus too. I
just want to get past the name since the feature is quite valuable.

And if Michal doesn't want to touch this patch any more, I'm happy to
do the search/replace/resend. :P

-Kees

--
Kees Cook
Pixel Security

2017-12-14 00:32:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Wed, 13 Dec 2017 10:25:48 +0100 Michal Hocko <[email protected]> wrote:

>
> Hi,
> I am resending with some minor updates based on Michael's review and
> ask for inclusion. There haven't been any fundamental objections for
> the RFC [1] nor the previous version [2]. The biggest discussion
> revolved around the naming. There were many suggestions flowing
> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
> MAP_FIXED_NOREPLACE etc...

I like MAP_FIXED_CAREFUL :)

> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.
>
> If somebody _really_ hates this then feel free to nack and resubmit
> with a different name you can find a consensus for. I am sorry to be
> stubborn here but I would rather have this merged than go over few more
> iterations changing the name just because it seems like a good idea
> now. My experience tells me that chances are that the name will turn out
> to be "suboptimal" anyway over time.
>
> Some more background:
> This has started as a follow up discussion [3][4] resulting in the
> runtime failure caused by hardening patch [5] which removes MAP_FIXED
> from the elf loader because MAP_FIXED is inherently dangerous as it
> might silently clobber an existing underlying mapping (e.g. stack). The
> reason for the failure is that some architectures enforce an alignment
> for the given address hint without MAP_FIXED used (e.g. for shared or
> file backed mappings).
>
> One way around this would be excluding those archs which do alignment
> tricks from the hardening [6]. The patch is really trivial but it has
> been objected, rightfully so, that this screams for a more generic
> solution. We basically want a non-destructive MAP_FIXED.
>
> The first patch introduced MAP_FIXED_SAFE which enforces the given
> address but unlike MAP_FIXED it fails with EEXIST if the given range
> conflicts with an existing one. The flag is introduced as a completely
> new one rather than a MAP_FIXED extension because of the backward
> compatibility. We really want a never-clobber semantic even on older
> kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> flags evaluation because we do not EINVAL on unknown flags. On those
> kernels we would simply use the traditional hint based semantic so the
> caller can still get a different address (which sucks) but at least not
> silently corrupt an existing mapping. I do not see a good way around
> that. Except we won't export expose the new semantic to the userspace at
> all.
>
> It seems there are users who would like to have something like that.
> Jemalloc has been mentioned by Michael Ellerman [7]

http://lkml.kernel.org/r/[email protected].

It would be useful to get feedback from jemalloc developers (please).
I'll add some cc's.


> Florian Weimer has mentioned the following:
> : glibc ld.so currently maps DSOs without hints. This means that the kernel
> : will map right next to each other, and the offsets between them a completely
> : predictable. We would like to change that and supply a random address in a
> : window of the address space. If there is a conflict, we do not want the
> : kernel to pick a non-random address. Instead, we would try again with a
> : random address.
>
> John Hubbard has mentioned CUDA example
> : a) Searches /proc/<pid>/maps for a "suitable" region of available
> : VA space. "Suitable" generally means it has to have a base address
> : within a certain limited range (a particular device model might
> : have odd limitations, for example), it has to be large enough, and
> : alignment has to be large enough (again, various devices may have
> : constraints that lead us to do this).
> :
> : This is of course subject to races with other threads in the process.
> :
> : Let's say it finds a region starting at va.
> :
> : b) Next it does:
> : p = mmap(va, ...)
> :
> : *without* setting MAP_FIXED, of course (so va is just a hint), to
> : attempt to safely reserve that region. If p != va, then in most cases,
> : this is a failure (almost certainly due to another thread getting a
> : mapping from that region before we did), and so this layer now has to
> : call munmap(), before returning a "failure: retry" to upper layers.
> :
> : IMPROVEMENT: --> if instead, we could call this:
> :
> : p = mmap(va, ... MAP_FIXED_SAFE ...)
> :
> : , then we could skip the munmap() call upon failure. This
> : is a small thing, but it is useful here. (Thanks to Piotr
> : Jaroszynski and Mark Hairgrove for helping me get that detail
> : exactly right, btw.)
> :
> : c) After that, CUDA suballocates from p, via:
> :
> : q = mmap(sub_region_start, ... MAP_FIXED ...)
> :
> : Interestingly enough, "freeing" is also done via MAP_FIXED, and
> : setting PROT_NONE to the subregion. Anyway, I just included (c) for
> : general interest.
>
> Atomic address range probing in the multithreaded programs in general
> sounds like an interesting thing to me.
>
> The second patch simply replaces MAP_FIXED use in elf loader by
> MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> follow. Actually real MAP_FIXED usages should be docummented properly
> and they should be more of an exception.

2017-12-14 01:42:21

by David Goldblatt

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

(+cc the jemalloc jasone; -cc,+bcc the Google jasone).

The only time we would want MAP_FIXED (or rather, a non-broken
variant) is in the middle of trying to expand an allocation in place;
"atomic address range probing in the multithreaded programs" describes
our use case pretty well. That's in a pathway that usually fails; it's
pretty far down on our kernel mmap enhancements wish-list.

(Sorry if you get this twice, an html reply bounced).


On Wed, Dec 13, 2017 at 5:35 PM, David Goldblatt
<[email protected]> wrote:
> (+cc the jemalloc jasone; -cc,+bcc the Google jasone).
>
> The only time we would want MAP_FIXED (or rather, a non-broken variant) is
> in the middle of trying to expand an allocation in place; "atomic address
> range probing in the multithreaded programs" describes our use case pretty
> well. That's in a pathway that usually fails; it's pretty far down on our
> kernel mmap enhancements wish-list.
>
> On Wed, Dec 13, 2017 at 4:32 PM, Andrew Morton <[email protected]>
> wrote:
>>
>> On Wed, 13 Dec 2017 10:25:48 +0100 Michal Hocko <[email protected]> wrote:
>>
>> >
>> > Hi,
>> > I am resending with some minor updates based on Michael's review and
>> > ask for inclusion. There haven't been any fundamental objections for
>> > the RFC [1] nor the previous version [2]. The biggest discussion
>> > revolved around the naming. There were many suggestions flowing
>> > around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
>> > MAP_FIXED_NOREPLACE etc...
>>
>> I like MAP_FIXED_CAREFUL :)
>>
>> > I am afraid we can bikeshed this to death and there will still be
>> > somebody finding yet another better name. Therefore I've decided to
>> > stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
>> > MAP_FIXED prefix which should be recognized by developers and _SAFE
>> > suffix should also be clear that all dangerous side effects of the old
>> > MAP_FIXED are gone.
>> >
>> > If somebody _really_ hates this then feel free to nack and resubmit
>> > with a different name you can find a consensus for. I am sorry to be
>> > stubborn here but I would rather have this merged than go over few more
>> > iterations changing the name just because it seems like a good idea
>> > now. My experience tells me that chances are that the name will turn out
>> > to be "suboptimal" anyway over time.
>> >
>> > Some more background:
>> > This has started as a follow up discussion [3][4] resulting in the
>> > runtime failure caused by hardening patch [5] which removes MAP_FIXED
>> > from the elf loader because MAP_FIXED is inherently dangerous as it
>> > might silently clobber an existing underlying mapping (e.g. stack). The
>> > reason for the failure is that some architectures enforce an alignment
>> > for the given address hint without MAP_FIXED used (e.g. for shared or
>> > file backed mappings).
>> >
>> > One way around this would be excluding those archs which do alignment
>> > tricks from the hardening [6]. The patch is really trivial but it has
>> > been objected, rightfully so, that this screams for a more generic
>> > solution. We basically want a non-destructive MAP_FIXED.
>> >
>> > The first patch introduced MAP_FIXED_SAFE which enforces the given
>> > address but unlike MAP_FIXED it fails with EEXIST if the given range
>> > conflicts with an existing one. The flag is introduced as a completely
>> > new one rather than a MAP_FIXED extension because of the backward
>> > compatibility. We really want a never-clobber semantic even on older
>> > kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
>> > flags evaluation because we do not EINVAL on unknown flags. On those
>> > kernels we would simply use the traditional hint based semantic so the
>> > caller can still get a different address (which sucks) but at least not
>> > silently corrupt an existing mapping. I do not see a good way around
>> > that. Except we won't export expose the new semantic to the userspace at
>> > all.
>> >
>> > It seems there are users who would like to have something like that.
>> > Jemalloc has been mentioned by Michael Ellerman [7]
>>
>> http://lkml.kernel.org/r/[email protected].
>>
>> It would be useful to get feedback from jemalloc developers (please).
>> I'll add some cc's.
>>
>>
>> > Florian Weimer has mentioned the following:
>> > : glibc ld.so currently maps DSOs without hints. This means that the
>> > kernel
>> > : will map right next to each other, and the offsets between them a
>> > completely
>> > : predictable. We would like to change that and supply a random address
>> > in a
>> > : window of the address space. If there is a conflict, we do not want
>> > the
>> > : kernel to pick a non-random address. Instead, we would try again with
>> > a
>> > : random address.
>> >
>> > John Hubbard has mentioned CUDA example
>> > : a) Searches /proc/<pid>/maps for a "suitable" region of available
>> > : VA space. "Suitable" generally means it has to have a base address
>> > : within a certain limited range (a particular device model might
>> > : have odd limitations, for example), it has to be large enough, and
>> > : alignment has to be large enough (again, various devices may have
>> > : constraints that lead us to do this).
>> > :
>> > : This is of course subject to races with other threads in the process.
>> > :
>> > : Let's say it finds a region starting at va.
>> > :
>> > : b) Next it does:
>> > : p = mmap(va, ...)
>> > :
>> > : *without* setting MAP_FIXED, of course (so va is just a hint), to
>> > : attempt to safely reserve that region. If p != va, then in most cases,
>> > : this is a failure (almost certainly due to another thread getting a
>> > : mapping from that region before we did), and so this layer now has to
>> > : call munmap(), before returning a "failure: retry" to upper layers.
>> > :
>> > : IMPROVEMENT: --> if instead, we could call this:
>> > :
>> > : p = mmap(va, ... MAP_FIXED_SAFE ...)
>> > :
>> > : , then we could skip the munmap() call upon failure. This
>> > : is a small thing, but it is useful here. (Thanks to Piotr
>> > : Jaroszynski and Mark Hairgrove for helping me get that detail
>> > : exactly right, btw.)
>> > :
>> > : c) After that, CUDA suballocates from p, via:
>> > :
>> > : q = mmap(sub_region_start, ... MAP_FIXED ...)
>> > :
>> > : Interestingly enough, "freeing" is also done via MAP_FIXED, and
>> > : setting PROT_NONE to the subregion. Anyway, I just included (c) for
>> > : general interest.
>> >
>> > Atomic address range probing in the multithreaded programs in general
>> > sounds like an interesting thing to me.
>> >
>> > The second patch simply replaces MAP_FIXED use in elf loader by
>> > MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
>> > follow. Actually real MAP_FIXED usages should be docummented properly
>> > and they should be more of an exception.
>>
>

2017-12-14 02:53:24

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <[email protected]> wrote:
> From: John Hubbard <[email protected]>
>
> -- Expand the documentation to discuss the hazards in
> enough detail to allow avoiding them.
>
> -- Mention the upcoming MAP_FIXED_SAFE flag.
>
> -- Enhance the alignment requirement slightly.
>
> CC: Michael Ellerman <[email protected]>
> CC: Jann Horn <[email protected]>
> CC: Matthew Wilcox <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Mike Rapoport <[email protected]>
> CC: Cyril Hrubis <[email protected]>
> CC: Pavel Machek <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 02d391697ce6..cb8789daec2d 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
[...]
> @@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
> in mind that the exact layout of a process' memory map is allowed to change
> significantly between kernel versions, C library versions, and operating system
> releases.
> +.IP
> +Furthermore, this option is extremely hazardous (when used on its own), because
> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
> +process to corrupt its own address space.

I think this is worded unfortunately. It is dangerous if used
incorrectly, and it's a good tool when used correctly.

[...]
> +Thread B need not create a mapping directly; simply making a library call
> +that, internally, uses
> +.I dlopen(3)
> +to load some other shared library, will
> +suffice. The dlopen(3) call will map the library into the process's address
> +space. Furthermore, almost any library call may be implemented using this
> +technique.
> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
> +(http://www.linux-pam.org).

This is arkward. This first mentions dlopen(), which is a very niche
case, and then just very casually mentions the much bigger
problem that tons of library functions can allocate memory through
malloc(), causing mmap() calls, sometimes without that even being
a documented property of the function.

> +.IP
> +Newer kernels
> +(Linux 4.16 and later) have a
> +.B MAP_FIXED_SAFE
> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
> +should be preferred over MAP_FIXED.

This is bad advice. MAP_FIXED is completely safe if you use it on an address
range you've allocated, and it is used in this way by core system libraries to
place multiple VMAs in virtually contiguous memory, for example:

ld.so (from glibc) uses it to load dynamic libraries:

$ strace -e trace=open,mmap,close /usr/bin/id 2>&1 >/dev/null | head -n20
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f35811c0000
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 161237, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3581198000
close(3) = 0
open("/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2259664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f3580d78000
mmap(0x7f3580f9c000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f3580f9c000
mmap(0x7f3580f9e000, 6864, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580f9e000
close(3) = 0
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 3795360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f35809d9000
mmap(0x7f3580d6e000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f3580d6e000
mmap(0x7f3580d74000, 14752, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580d74000
close(3) = 0
[...]

As a comment in dl-map-segments.h in glibc explains:
/* This is a position-independent shared object. We can let the
kernel map it anywhere it likes, but we must have space for all
the segments in their specified positions relative to the first.
So we map the first segment without MAP_FIXED, but with its
extent increased to cover all the segments. Then we remove
access from excess portion, and there is known sufficient space
there to remap from the later segments.


And AFAIK anything that allocates thread stacks uses MAP_FIXED to
create the guard page at the bottom.


MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
or whatever it ends up being called. Please remove this advice or, better,
clarify what MAP_FIXED should be used for (creation of virtually contiguous
VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
allocate memory at a fixed address for some reason, with a failure instead of
the normal fallback to using a different address).

2017-12-14 05:28:41

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On 12/13/2017 06:52 PM, Jann Horn wrote:
> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <[email protected]> wrote:
>> From: John Hubbard <[email protected]>
>>
>> -- Expand the documentation to discuss the hazards in
>> enough detail to allow avoiding them.
>>
>> -- Mention the upcoming MAP_FIXED_SAFE flag.
>>
>> -- Enhance the alignment requirement slightly.
>>
>> CC: Michael Ellerman <[email protected]>
>> CC: Jann Horn <[email protected]>
>> CC: Matthew Wilcox <[email protected]>
>> CC: Michal Hocko <[email protected]>
>> CC: Mike Rapoport <[email protected]>
>> CC: Cyril Hrubis <[email protected]>
>> CC: Pavel Machek <[email protected]>
>> Acked-by: Michal Hocko <[email protected]>
>> Signed-off-by: John Hubbard <[email protected]>
>> Signed-off-by: Michal Hocko <[email protected]>
>> ---
>> man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/man2/mmap.2 b/man2/mmap.2
>> index 02d391697ce6..cb8789daec2d 100644
>> --- a/man2/mmap.2
>> +++ b/man2/mmap.2
> [...]
>> @@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
>> in mind that the exact layout of a process' memory map is allowed to change
>> significantly between kernel versions, C library versions, and operating system
>> releases.
>> +.IP
>> +Furthermore, this option is extremely hazardous (when used on its own), because
>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>> +process to corrupt its own address space.
>
> I think this is worded unfortunately. It is dangerous if used
> incorrectly, and it's a good tool when used correctly.
>

Hi Jann,

Hey, thanks for reviewing this again. I think I can accomodate all of your requests,
without contradicting other reviewers' earlier feedback...approximately. :) I'll
have a go at rewording this, and addressing your additional comments below, tomorrow
afternoon, so please look for an updated version later that day.

thanks,
--
John Hubbard
NVIDIA

> [...]
>> +Thread B need not create a mapping directly; simply making a library call
>> +that, internally, uses
>> +.I dlopen(3)
>> +to load some other shared library, will
>> +suffice. The dlopen(3) call will map the library into the process's address
>> +space. Furthermore, almost any library call may be implemented using this
>> +technique.
>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>> +(http://www.linux-pam.org).
>
> This is arkward. This first mentions dlopen(), which is a very niche
> case, and then just very casually mentions the much bigger
> problem that tons of library functions can allocate memory through
> malloc(), causing mmap() calls, sometimes without that even being
> a documented property of the function.
>
>> +.IP
>> +Newer kernels
>> +(Linux 4.16 and later) have a
>> +.B MAP_FIXED_SAFE
>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>> +should be preferred over MAP_FIXED.
>
> This is bad advice. MAP_FIXED is completely safe if you use it on an address
> range you've allocated, and it is used in this way by core system libraries to
> place multiple VMAs in virtually contiguous memory, for example:
>
> ld.so (from glibc) uses it to load dynamic libraries:
>
> $ strace -e trace=open,mmap,close /usr/bin/id 2>&1 >/dev/null | head -n20
> mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7f35811c0000
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 161237, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3581198000
> close(3) = 0
> open("/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 2259664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f3580d78000
> mmap(0x7f3580f9c000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f3580f9c000
> mmap(0x7f3580f9e000, 6864, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580f9e000
> close(3) = 0
> open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 3795360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f35809d9000
> mmap(0x7f3580d6e000, 24576, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f3580d6e000
> mmap(0x7f3580d74000, 14752, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580d74000
> close(3) = 0
> [...]
>
> As a comment in dl-map-segments.h in glibc explains:
> /* This is a position-independent shared object. We can let the
> kernel map it anywhere it likes, but we must have space for all
> the segments in their specified positions relative to the first.
> So we map the first segment without MAP_FIXED, but with its
> extent increased to cover all the segments. Then we remove
> access from excess portion, and there is known sufficient space
> there to remap from the later segments.
>
>
> And AFAIK anything that allocates thread stacks uses MAP_FIXED to
> create the guard page at the bottom.
>
>
> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
> or whatever it ends up being called. Please remove this advice or, better,
> clarify what MAP_FIXED should be used for (creation of virtually contiguous
> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
> allocate memory at a fixed address for some reason, with a failure instead of
> the normal fallback to using a different address).
>

2017-12-14 07:07:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Wed 13-12-17 15:19:00, Kees Cook wrote:
> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <[email protected]> wrote:
> > Hi!
> >> You selected stupid name for a flag. Everyone and their dog agrees
> >> with that. There's even consensus on better name (and everyone agrees
> >> it is better than .._SAFE). Of course, we could have debate if it is
> >> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
> >> poor naming on your part.
> >
> > Well while everybody agrees that the name is so bad that basically
> > anything else would be better, there does not seem to be consensus on
> > which one to pick. I do understand that this frustrating and fruitless.
>
> Based on the earlier threads where I tried to end the bikeshedding, it
> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>
> > So what do we do now, roll a dice to choose new name?
> >
> > Or do we ask BFDL[1] to choose the name?
>
> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
> with these kinds of choices in the past. I'm fine to ask Linus too. I
> just want to get past the name since the feature is quite valuable.
>
> And if Michal doesn't want to touch this patch any more, I'm happy to
> do the search/replace/resend. :P

I think Andrew can do the s@MAP_FIXED_SAFE@MAP_$FOO@ when adding the
patch to the mmotm tree. The reason why I refuse to repost is that a)
functionality doesn't really need a further rework (at least not based
on the review feedback) and b) I do not really see any large consensus
here. People claim to like this or that more but nobody (except of you
Kees) was willing to put their name under their preference in a form of
Acked-by. And that worries me, because generating "better" names sounds
too easy to allow a forward progress.
--
Michal Hocko
SUSE Labs

2017-12-14 13:15:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> Regarding the name - how about adopting MAP_EXCL? It was introduced in
> FreeBSD,
> and seems to do exactly this; quoting mmap(2):
>
> MAP_FIXED Do not permit the system to select a different address
> than the one specified. If the specified address
> cannot be used, mmap() will fail. If MAP_FIXED is
> specified, addr must be a multiple of the page size.
> If MAP_EXCL is not specified, a successful MAP_FIXED
> request replaces any previous mappings for the
> process' pages in the range from addr to addr + len.
> In contrast, if MAP_EXCL is specified, the request
> will fail if a mapping already exists within the
> range.

I am not familiar with the FreeBSD implementation but from the above it
looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
going to implement it in linux due to reasons mentioned in this cover
letter. Using the same name would be more confusing than helpful I am
afraid.

--
Michal Hocko
SUSE Labs

2017-12-14 14:54:52

by Edward Napierala

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On 1214T1415, Michal Hocko wrote:
> On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > Regarding the name - how about adopting MAP_EXCL? It was introduced in
> > FreeBSD,
> > and seems to do exactly this; quoting mmap(2):
> >
> > MAP_FIXED Do not permit the system to select a different address
> > than the one specified. If the specified address
> > cannot be used, mmap() will fail. If MAP_FIXED is
> > specified, addr must be a multiple of the page size.
> > If MAP_EXCL is not specified, a successful MAP_FIXED
> > request replaces any previous mappings for the
> > process' pages in the range from addr to addr + len.
> > In contrast, if MAP_EXCL is specified, the request
> > will fail if a mapping already exists within the
> > range.
>
> I am not familiar with the FreeBSD implementation but from the above it
> looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> going to implement it in linux due to reasons mentioned in this cover
> letter. Using the same name would be more confusing than helpful I am
> afraid.

Sorry, missed that. Indeed, reusing a name with a different semantics
would be a bad idea.

2017-12-14 23:07:00

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On 12/13/2017 06:52 PM, Jann Horn wrote:
> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <[email protected]> wrote:
>> From: John Hubbard <[email protected]>
[...]
>> +.IP
>> +Furthermore, this option is extremely hazardous (when used on its own), because
>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>> +process to corrupt its own address space.
>
> I think this is worded unfortunately. It is dangerous if used
> incorrectly, and it's a good tool when used correctly.
>
> [...]
>> +Thread B need not create a mapping directly; simply making a library call
>> +that, internally, uses
>> +.I dlopen(3)
>> +to load some other shared library, will
>> +suffice. The dlopen(3) call will map the library into the process's address
>> +space. Furthermore, almost any library call may be implemented using this
>> +technique.
>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>> +(http://www.linux-pam.org).
>
> This is arkward. This first mentions dlopen(), which is a very niche
> case, and then just very casually mentions the much bigger
> problem that tons of library functions can allocate memory through
> malloc(), causing mmap() calls, sometimes without that even being
> a documented property of the function.
>

Hi Jann,

Here is some proposed new wording, to address your two comments above. What do
you think of this:

NOTE: this option can be hazardous (when used on its own), because it
forcibly removes pre-existing mappings, making it easy for a multi-
threaded process to corrupt its own address space. For example, thread A
looks through /proc/<pid>/maps and locates an available address range,
while thread B simultaneously acquires part or all of that same address
range. Thread A then calls mmap(MAP_FIXED), effectively overwriting the
mapping that thread B created.

Thread B need not create a mapping directly; simply making a library call
whose implementation calls malloc(3), mmap(), or dlopen(3) will suffice,
because those calls all create new mappings.

>> +.IP
>> +Newer kernels
>> +(Linux 4.16 and later) have a
>> +.B MAP_FIXED_SAFE
>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>> +should be preferred over MAP_FIXED.
>
> This is bad advice. MAP_FIXED is completely safe if you use it on an address
> range you've allocated, and it is used in this way by core system libraries to
> place multiple VMAs in virtually contiguous memory, for example:
[...]
> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
> or whatever it ends up being called. Please remove this advice or, better,
> clarify what MAP_FIXED should be used for (creation of virtually contiguous
> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
> allocate memory at a fixed address for some reason, with a failure instead of
> the normal fallback to using a different address).
>

Rather than risk another back-and-forth with Michal (who doesn't want any advice
on how to use this safely, in the man page), I've simply removed this advice
entirely.

thanks,
--
John Hubbard
NVIDIA

2017-12-14 23:10:55

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Fri, Dec 15, 2017 at 12:06 AM, John Hubbard <[email protected]> wrote:
> On 12/13/2017 06:52 PM, Jann Horn wrote:
>> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <[email protected]> wrote:
>>> From: John Hubbard <[email protected]>
> [...]
>>> +.IP
>>> +Furthermore, this option is extremely hazardous (when used on its own), because
>>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>>> +process to corrupt its own address space.
>>
>> I think this is worded unfortunately. It is dangerous if used
>> incorrectly, and it's a good tool when used correctly.
>>
>> [...]
>>> +Thread B need not create a mapping directly; simply making a library call
>>> +that, internally, uses
>>> +.I dlopen(3)
>>> +to load some other shared library, will
>>> +suffice. The dlopen(3) call will map the library into the process's address
>>> +space. Furthermore, almost any library call may be implemented using this
>>> +technique.
>>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>>> +(http://www.linux-pam.org).
>>
>> This is arkward. This first mentions dlopen(), which is a very niche
>> case, and then just very casually mentions the much bigger
>> problem that tons of library functions can allocate memory through
>> malloc(), causing mmap() calls, sometimes without that even being
>> a documented property of the function.
>>
>
> Hi Jann,
>
> Here is some proposed new wording, to address your two comments above. What do
> you think of this:
>
> NOTE: this option can be hazardous (when used on its own), because it
> forcibly removes pre-existing mappings, making it easy for a multi-
> threaded process to corrupt its own address space. For example, thread A
> looks through /proc/<pid>/maps and locates an available address range,
> while thread B simultaneously acquires part or all of that same address
> range. Thread A then calls mmap(MAP_FIXED), effectively overwriting the
> mapping that thread B created.
>
> Thread B need not create a mapping directly; simply making a library call
> whose implementation calls malloc(3), mmap(), or dlopen(3) will suffice,
> because those calls all create new mappings.

Thanks! That sounds better to me.

>>> +.IP
>>> +Newer kernels
>>> +(Linux 4.16 and later) have a
>>> +.B MAP_FIXED_SAFE
>>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>>> +should be preferred over MAP_FIXED.
>>
>> This is bad advice. MAP_FIXED is completely safe if you use it on an address
>> range you've allocated, and it is used in this way by core system libraries to
>> place multiple VMAs in virtually contiguous memory, for example:
> [...]
>> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
>> or whatever it ends up being called. Please remove this advice or, better,
>> clarify what MAP_FIXED should be used for (creation of virtually contiguous
>> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
>> allocate memory at a fixed address for some reason, with a failure instead of
>> the normal fallback to using a different address).
>>
>
> Rather than risk another back-and-forth with Michal (who doesn't want any advice
> on how to use this safely, in the man page), I've simply removed this advice
> entirely.

Makes sense.

2017-12-15 09:02:09

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

Kees Cook <[email protected]> writes:

> On Wed, Dec 13, 2017 at 1:25 AM, Michal Hocko <[email protected]> wrote:
>>
>> Hi,
>> I am resending with some minor updates based on Michael's review and
>> ask for inclusion. There haven't been any fundamental objections for
>> the RFC [1] nor the previous version [2]. The biggest discussion
>> revolved around the naming. There were many suggestions flowing
>> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
>> MAP_FIXED_NOREPLACE etc...
>
> With this named MAP_FIXED_NOREPLACE (the best consensus we've got on a
> name), please consider this series:
>
> Acked-by: Kees Cook <[email protected]>

I don't feel like I'm actually qualified to ack the mm and binfmt
changes, but everything *looks* correct to me, and you've fixed the flag
numbering such that it can go in mman-common.h as I suggested.

So if the name was MAP_FIXED_NOREPLACE I would also be happy with it.

Acked-by: Michael Ellerman <[email protected]>

I can resubmit with the name changed if that's what it takes.

cheers

2017-12-16 00:49:55

by Andrei Vagin

[permalink] [raw]
Subject: Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map

Hi Michal,

We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
doesn't boot:

[ 3.492549] Freeing unused kernel memory: 1640K
[ 3.494547] Write protecting the kernel read-only data: 18432k
[ 3.498781] Freeing unused kernel memory: 2016K
[ 3.503330] Freeing unused kernel memory: 512K
[ 3.505232] rodata_test: all tests were successful
[ 3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
[ 3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
[ 3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
[ 3.532127] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
[ 3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 3.546081] Call Trace:
[ 3.547221] dump_stack+0x5c/0x79
[ 3.548768] ? rest_init+0x30/0xb0
[ 3.550320] panic+0xe4/0x232
[ 3.551669] ? rest_init+0xb0/0xb0
[ 3.553110] kernel_init+0xeb/0x100
[ 3.554701] ret_from_fork+0x1f/0x30
[ 3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3.564160] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.

If I revert this patch, it boots normally.

Thanks,
Andrei

On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
>
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
>
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
>
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
>
> Cc: Abdul Haleem <[email protected]>
> Cc: Joel Stanley <[email protected]>
> Acked-by: Kees Cook <[email protected]>
> Reviewed-by: Khalid Aziz <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> arch/metag/kernel/process.c | 6 +++++-
> fs/binfmt_elf.c | 12 ++++++++----
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> index 0909834c83a7..867c8d0a5fb4 100644
> --- a/arch/metag/kernel/process.c
> +++ b/arch/metag/kernel/process.c
> @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> tcm_tag = tcm_lookup_tag(addr);
>
> if (tcm_tag != TCM_INVALID_TAG)
> - type &= ~MAP_FIXED;
> + type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
>
> /*
> * total_size is the size of the ELF (interpreter) image.
> @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> + task_pid_nr(current), tsk->comm, (void*)addr);
> +
> if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> struct tcm_allocation *tcm;
> unsigned long tcm_addr;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 73b01e474fdc..5916d45f64a7 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> + task_pid_nr(current), current->comm, (void*)addr);
> +
> return(map_addr);
> }
>
> @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> elf_prot |= PROT_EXEC;
> vaddr = eppnt->p_vaddr;
> if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> - elf_type |= MAP_FIXED;
> + elf_type |= MAP_FIXED_SAFE;
> else if (no_base && interp_elf_ex->e_type == ET_DYN)
> load_addr = -vaddr;
>
> @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> * the ET_DYN load_addr calculations, proceed normally.
> */
> if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> - elf_flags |= MAP_FIXED;
> + elf_flags |= MAP_FIXED_SAFE;
> } else if (loc->elf_ex.e_type == ET_DYN) {
> /*
> * This logic is run once for the first LOAD Program
> @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> load_bias = ELF_ET_DYN_BASE;
> if (current->flags & PF_RANDOMIZE)
> load_bias += arch_mmap_rnd();
> - elf_flags |= MAP_FIXED;
> + elf_flags |= MAP_FIXED_SAFE;
> } else
> load_bias = 0;
>
> @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
> (eppnt->p_filesz +
> ELF_PAGEOFFSET(eppnt->p_vaddr)),
> PROT_READ | PROT_WRITE | PROT_EXEC,
> - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> + MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> (eppnt->p_offset -
> ELF_PAGEOFFSET(eppnt->p_vaddr)));
> if (error != ELF_PAGESTART(eppnt->p_vaddr))


Attachments:
(No filename) (6.76 kB)
config.gz (21.39 kB)
dmesg.txt (23.84 kB)
Download all attachments

2017-12-18 09:13:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Fri 15-12-17 16:49:28, Andrei Vagin wrote:
> Hi Michal,
>
> We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
> doesn't boot:
>
> [ 3.492549] Freeing unused kernel memory: 1640K
> [ 3.494547] Write protecting the kernel read-only data: 18432k
> [ 3.498781] Freeing unused kernel memory: 2016K
> [ 3.503330] Freeing unused kernel memory: 512K
> [ 3.505232] rodata_test: all tests were successful
> [ 3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already

Hmm, this interesting. What does the test actualy do? Could you add some
instrumentation to see what is actually mapped there? Something like

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 0e50230ce53d..1b68ddc34043 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,10 +372,28 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

- if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+ if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr)) {
+ struct vm_area_struct *vma;
+
pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
task_pid_nr(current), current->comm,
(void *)addr);
+ vma = find_vma(current->mm, map_addr);
+ if (vma && vma->vm_start < addr) {
+ pr_info("requested [%lx, %lx] mapped [%lx, %lx] %lx ", addr, addr + total_size,
+ vma->vm_start, vma->vm_end, vma->vm_flags);
+ if (!vma->vm_file) {
+ pr_cont("anon\n");
+ } else {
+ char path[512];
+ char *p = file_path(vma->vm_file, path, sizeof(path));
+ if (IS_ERR(p))
+ p = "?";
+ pr_cont("\"%s\"\n", kbasename(p));
+ }
+ dump_stack();
+ }
+ }

return(map_addr);
}

> [ 3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> [ 3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> [ 3.532127] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> [ 3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> [ 3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [ 3.546081] Call Trace:
> [ 3.547221] dump_stack+0x5c/0x79
> [ 3.548768] ? rest_init+0x30/0xb0
> [ 3.550320] panic+0xe4/0x232
> [ 3.551669] ? rest_init+0xb0/0xb0
> [ 3.553110] kernel_init+0xeb/0x100
> [ 3.554701] ret_from_fork+0x1f/0x30
> [ 3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 3.564160] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
>
> If I revert this patch, it boots normally.
>
> Thanks,
> Andrei
>
> On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> > on a controlled address and they use MAP_FIXED to enforce that. This is
> > however dangerous thing prone to silent data corruption which can be
> > even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> > (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> > ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> > the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> > we could end up mapping over the existing stack with some luck.
> >
> > The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> > fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> > further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> > revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> > stack consumption early during execve fully stopped by da029c11e6b1
> > ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> > safe and any attack should be impractical. On the other hand this is
> > just too subtle assumption so it can break quite easily and hard to
> > spot.
> >
> > I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> > fundamentally dangerous. Moreover it shouldn't be even needed. We are
> > at the early process stage and so there shouldn't be unrelated mappings
> > (except for stack and loader) existing so mmap for a given address
> > should succeed even without MAP_FIXED. Something is terribly wrong if
> > this is not the case and we should rather fail than silently corrupt the
> > underlying mapping.
> >
> > Address this issue by changing MAP_FIXED to the newly added
> > MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> > existing mapping clashing with the requested one without clobbering it.
> >
> > Cc: Abdul Haleem <[email protected]>
> > Cc: Joel Stanley <[email protected]>
> > Acked-by: Kees Cook <[email protected]>
> > Reviewed-by: Khalid Aziz <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > arch/metag/kernel/process.c | 6 +++++-
> > fs/binfmt_elf.c | 12 ++++++++----
> > 2 files changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> > index 0909834c83a7..867c8d0a5fb4 100644
> > --- a/arch/metag/kernel/process.c
> > +++ b/arch/metag/kernel/process.c
> > @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > tcm_tag = tcm_lookup_tag(addr);
> >
> > if (tcm_tag != TCM_INVALID_TAG)
> > - type &= ~MAP_FIXED;
> > + type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
> >
> > /*
> > * total_size is the size of the ELF (interpreter) image.
> > @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > } else
> > map_addr = vm_mmap(filep, addr, size, prot, type, off);
> >
> > + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > + task_pid_nr(current), tsk->comm, (void*)addr);
> > +
> > if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> > struct tcm_allocation *tcm;
> > unsigned long tcm_addr;
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 73b01e474fdc..5916d45f64a7 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> > } else
> > map_addr = vm_mmap(filep, addr, size, prot, type, off);
> >
> > + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > + task_pid_nr(current), current->comm, (void*)addr);
> > +
> > return(map_addr);
> > }
> >
> > @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> > elf_prot |= PROT_EXEC;
> > vaddr = eppnt->p_vaddr;
> > if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> > - elf_type |= MAP_FIXED;
> > + elf_type |= MAP_FIXED_SAFE;
> > else if (no_base && interp_elf_ex->e_type == ET_DYN)
> > load_addr = -vaddr;
> >
> > @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > * the ET_DYN load_addr calculations, proceed normally.
> > */
> > if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> > - elf_flags |= MAP_FIXED;
> > + elf_flags |= MAP_FIXED_SAFE;
> > } else if (loc->elf_ex.e_type == ET_DYN) {
> > /*
> > * This logic is run once for the first LOAD Program
> > @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > load_bias = ELF_ET_DYN_BASE;
> > if (current->flags & PF_RANDOMIZE)
> > load_bias += arch_mmap_rnd();
> > - elf_flags |= MAP_FIXED;
> > + elf_flags |= MAP_FIXED_SAFE;
> > } else
> > load_bias = 0;
> >
> > @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
> > (eppnt->p_filesz +
> > ELF_PAGEOFFSET(eppnt->p_vaddr)),
> > PROT_READ | PROT_WRITE | PROT_EXEC,
> > - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> > + MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> > (eppnt->p_offset -
> > ELF_PAGEOFFSET(eppnt->p_vaddr)));
> > if (error != ELF_PAGESTART(eppnt->p_vaddr))


> [ 0.000000] Linux version 4.15.0-rc3-next-20171215-00001-g6d6aea478fce (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Fri Dec 15 16:39:11 PST 2017
> [ 0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> [ 0.000000] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
> [ 0.000000] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
> [ 0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] random: fast init done
> [ 0.000000] SMBIOS 2.8 present.
> [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [ 0.000000] Hypervisor detected: KVM
> [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [ 0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
> [ 0.000000] MTRR default type: write-back
> [ 0.000000] MTRR fixed ranges enabled:
> [ 0.000000] 00000-9FFFF write-back
> [ 0.000000] A0000-BFFFF uncachable
> [ 0.000000] C0000-FFFFF write-protect
> [ 0.000000] MTRR variable ranges enabled:
> [ 0.000000] 0 base 0080000000 mask FF80000000 uncachable
> [ 0.000000] 1 disabled
> [ 0.000000] 2 disabled
> [ 0.000000] 3 disabled
> [ 0.000000] 4 disabled
> [ 0.000000] 5 disabled
> [ 0.000000] 6 disabled
> [ 0.000000] 7 disabled
> [ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
> [ 0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [ (ptrval)]
> [ 0.000000] Base memory trampoline at [ (ptrval)] 99000 size 24576
> [ 0.000000] Using GB pages for direct mapping
> [ 0.000000] BRK [0x2c984000, 0x2c984fff] PGTABLE
> [ 0.000000] BRK [0x2c985000, 0x2c985fff] PGTABLE
> [ 0.000000] BRK [0x2c986000, 0x2c986fff] PGTABLE
> [ 0.000000] BRK [0x2c987000, 0x2c987fff] PGTABLE
> [ 0.000000] BRK [0x2c988000, 0x2c988fff] PGTABLE
> [ 0.000000] BRK [0x2c989000, 0x2c989fff] PGTABLE
> [ 0.000000] ACPI: Early table checksum verification disabled
> [ 0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
> [ 0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
> [ 0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
> [ 0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001)
> [ 0.000000] ACPI: FACS 0x000000007FFE0000 000040
> [ 0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
> [ 0.000000] ACPI: Local APIC address 0xfee00000
> [ 0.000000] No NUMA configuration found
> [ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
> [ 0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
> [ 0.000000] kvm-clock: cpu 0, msr 0:7ffc0001, primary cpu clock
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 0.000000] kvm-clock: using sched offset of 1076013277 cycles
> [ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffd8fff]
> [ 0.000000] Normal empty
> [ 0.000000] Device empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000007ffd8fff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
> [ 0.000000] On node 0 totalpages: 524151
> [ 0.000000] DMA zone: 64 pages used for memmap
> [ 0.000000] DMA zone: 21 pages reserved
> [ 0.000000] DMA zone: 3998 pages, LIFO batch:0
> [ 0.000000] DMA32 zone: 8128 pages used for memmap
> [ 0.000000] DMA32 zone: 520153 pages, LIFO batch:31
> [ 0.000000] Reserved but unavailable: 98 pages
> [ 0.000000] ACPI: PM-Timer IO Port: 0x608
> [ 0.000000] ACPI: Local APIC address 0xfee00000
> [ 0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
> [ 0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
> [ 0.000000] ACPI: IRQ0 used by override.
> [ 0.000000] ACPI: IRQ5 used by override.
> [ 0.000000] ACPI: IRQ9 used by override.
> [ 0.000000] ACPI: IRQ10 used by override.
> [ 0.000000] ACPI: IRQ11 used by override.
> [ 0.000000] Using ACPI (MADT) for SMP configuration information
> [ 0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
> [ 0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
> [ 0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
> [ 0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices
> [ 0.000000] Booting paravirtualized kernel on KVM
> [ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [ 0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
> [ 0.000000] percpu: Embedded 44 pages/cpu @ (ptrval) s142296 r8192 d29736 u1048576
> [ 0.000000] pcpu-alloc: s142296 r8192 d29736 u1048576 alloc=1*2097152
> [ 0.000000] pcpu-alloc: [0] 0 1
> [ 0.000000] KVM setup async PF for cpu 0
> [ 0.000000] kvm-stealtime: cpu 0, msr 7fc122c0
> [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 515938
> [ 0.000000] Policy zone: DMA32
> [ 0.000000] Kernel command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> [ 0.000000] Memory: 2037056K/2096604K available (12300K kernel code, 1554K rwdata, 3584K rodata, 1640K init, 912K bss, 59548K reserved, 0K cma-reserved)
> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> [ 0.000000] ftrace: allocating 36554 entries in 143 pages
> [ 0.001000] Hierarchical RCU implementation.
> [ 0.001000] RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
> [ 0.001000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
> [ 0.001000] NR_IRQS: 4352, nr_irqs: 440, preallocated irqs: 16
> [ 0.001000] Offload RCU callbacks from CPUs: (none).
> [ 0.001000] Console: colour dummy device 80x25
> [ 0.001000] console [ttyS0] enabled
> [ 0.001000] ACPI: Core revision 20171110
> [ 0.001000] ACPI: 1 ACPI AML tables successfully acquired and loaded
> [ 0.001009] APIC: Switch to symmetric I/O mode setup
> [ 0.001571] x2apic enabled
> [ 0.002003] Switched APIC routing to physical x2apic.
> [ 0.003538] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.004000] tsc: Detected 2496.000 MHz processor
> [ 0.004014] Calibrating delay loop (skipped) preset value.. 4992.00 BogoMIPS (lpj=2496000)
> [ 0.005014] pid_max: default: 32768 minimum: 301
> [ 0.006057] Security Framework initialized
> [ 0.006548] Yama: becoming mindful.
> [ 0.007019] SELinux: Disabled at boot.
> [ 0.008206] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [ 0.009164] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [ 0.009816] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
> [ 0.010009] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
> [ 0.011322] mce: CPU supports 10 MCE banks
> [ 0.011740] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [ 0.012002] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [ 0.012610] Freeing SMP alternatives memory: 36K
> [ 0.013467] TSC deadline timer enabled
> [ 0.013820] smpboot: CPU0: Intel Core Processor (Skylake) (family: 0x6, model: 0x5e, stepping: 0x3)
> [ 0.014000] Performance Events: unsupported p6 CPU model 94 no PMU driver, software events only.
> [ 0.014041] Hierarchical SRCU implementation.
> [ 0.015133] NMI watchdog: Perf event create on CPU 0 failed with -2
> [ 0.015725] NMI watchdog: Perf NMI watchdog permanently disabled
> [ 0.016077] smp: Bringing up secondary CPUs ...
> [ 0.016654] x86: Booting SMP configuration:
> [ 0.017005] .... node #0, CPUs: #1
> [ 0.001000] kvm-clock: cpu 1, msr 0:7ffc0041, secondary cpu clock
> [ 0.019051] KVM setup async PF for cpu 1
> [ 0.019599] kvm-stealtime: cpu 1, msr 7fd122c0
> [ 0.020009] smp: Brought up 1 node, 2 CPUs
> [ 0.020531] smpboot: Max logical packages: 2
> [ 0.021009] smpboot: Total of 2 processors activated (9984.00 BogoMIPS)
> [ 0.023160] devtmpfs: initialized
> [ 0.023513] x86/mm: Memory block size: 128MB
> [ 0.024811] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [ 0.025015] futex hash table entries: 512 (order: 3, 32768 bytes)
> [ 0.026185] RTC time: 0:42:06, date: 12/16/17
> [ 0.026790] NET: Registered protocol family 16
> [ 0.027204] audit: initializing netlink subsys (disabled)
> [ 0.027914] audit: type=2000 audit(1513384927.133:1): state=initialized audit_enabled=0 res=1
> [ 0.028185] cpuidle: using governor menu
> [ 0.029118] ACPI: bus type PCI registered
> [ 0.029872] PCI: Using configuration type 1 for base access
> [ 0.034355] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [ 0.035011] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
> [ 0.036066] cryptd: max_cpu_qlen set to 1000
> [ 0.036579] ACPI: Added _OSI(Module Device)
> [ 0.037007] ACPI: Added _OSI(Processor Device)
> [ 0.037426] ACPI: Added _OSI(3.0 _SCP Extensions)
> [ 0.037857] ACPI: Added _OSI(Processor Aggregator Device)
> [ 0.041356] ACPI: Interpreter enabled
> [ 0.041764] ACPI: (supports S0 S5)
> [ 0.042005] ACPI: Using IOAPIC for interrupt routing
> [ 0.042655] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [ 0.043625] ACPI: Enabled 2 GPEs in block 00 to 0F
> [ 0.059248] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
> [ 0.059953] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]
> [ 0.060045] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
> [ 0.061180] PCI host bridge to bus 0000:00
> [ 0.061874] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
> [ 0.062013] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
> [ 0.063016] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> [ 0.064015] pci_bus 0000:00: root bus resource [mem 0x80000000-0xfebfffff window]
> [ 0.065014] pci_bus 0000:00: root bus resource [bus 00-ff]
> [ 0.065753] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
> [ 0.066487] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
> [ 0.067537] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
> [ 0.071700] pci 0000:00:01.1: reg 0x20: [io 0xc100-0xc10f]
> [ 0.074032] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io 0x01f0-0x01f7]
> [ 0.074908] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io 0x03f6]
> [ 0.075011] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io 0x0170-0x0177]
> [ 0.076010] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io 0x0376]
> [ 0.077121] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
> [ 0.078148] pci 0000:00:01.3: quirk: [io 0x0600-0x063f] claimed by PIIX4 ACPI
> [ 0.079014] pci 0000:00:01.3: quirk: [io 0x0700-0x070f] claimed by PIIX4 SMB
> [ 0.080224] pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
> [ 0.082007] pci 0000:00:03.0: reg 0x10: [io 0xc040-0xc05f]
> [ 0.083814] pci 0000:00:03.0: reg 0x14: [mem 0xfebc0000-0xfebc0fff]
> [ 0.089891] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
> [ 0.090545] pci 0000:00:05.0: [1af4:1003] type 00 class 0x078000
> [ 0.092708] pci 0000:00:05.0: reg 0x10: [io 0xc060-0xc07f]
> [ 0.094009] pci 0000:00:05.0: reg 0x14: [mem 0xfebc1000-0xfebc1fff]
> [ 0.102484] pci 0000:00:06.0: [8086:2934] type 00 class 0x0c0300
> [ 0.108028] pci 0000:00:06.0: reg 0x20: [io 0xc080-0xc09f]
> [ 0.110738] pci 0000:00:06.1: [8086:2935] type 00 class 0x0c0300
> [ 0.114388] pci 0000:00:06.1: reg 0x20: [io 0xc0a0-0xc0bf]
> [ 0.117339] pci 0000:00:06.2: [8086:2936] type 00 class 0x0c0300
> [ 0.122770] pci 0000:00:06.2: reg 0x20: [io 0xc0c0-0xc0df]
> [ 0.124738] pci 0000:00:06.7: [8086:293a] type 00 class 0x0c0320
> [ 0.125825] pci 0000:00:06.7: reg 0x10: [mem 0xfebc2000-0xfebc2fff]
> [ 0.130347] pci 0000:00:07.0: [1af4:1001] type 00 class 0x010000
> [ 0.133007] pci 0000:00:07.0: reg 0x10: [io 0xc000-0xc03f]
> [ 0.134793] pci 0000:00:07.0: reg 0x14: [mem 0xfebc3000-0xfebc3fff]
> [ 0.141808] pci 0000:00:08.0: [1af4:1002] type 00 class 0x00ff00
> [ 0.142914] pci 0000:00:08.0: reg 0x10: [io 0xc0e0-0xc0ff]
> [ 0.148977] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
> [ 0.149455] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
> [ 0.150390] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
> [ 0.151382] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
> [ 0.152380] ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
> [ 0.154508] vgaarb: loaded
> [ 0.155271] SCSI subsystem initialized
> [ 0.155887] EDAC MC: Ver: 3.0.0
> [ 0.156255] PCI: Using ACPI for IRQ routing
> [ 0.156566] PCI: pci_cache_line_size set to 64 bytes
> [ 0.157161] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
> [ 0.157914] e820: reserve RAM buffer [mem 0x7ffd9000-0x7fffffff]
> [ 0.158253] NetLabel: Initializing
> [ 0.158765] NetLabel: domain hash size = 128
> [ 0.159005] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
> [ 0.159775] NetLabel: unlabeled traffic allowed by default
> [ 0.160073] clocksource: Switched to clocksource kvm-clock
> [ 0.186764] VFS: Disk quotas dquot_6.6.0
> [ 0.187277] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [ 0.188251] FS-Cache: Loaded
> [ 0.188725] pnp: PnP ACPI init
> [ 0.189271] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
> [ 0.190231] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
> [ 0.191229] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
> [ 0.192096] pnp 00:03: [dma 2]
> [ 0.192514] pnp 00:03: Plug and Play ACPI device, IDs PNP0700 (active)
> [ 0.193631] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
> [ 0.195416] pnp: PnP ACPI: found 5 devices
> [ 0.206055] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [ 0.207127] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
> [ 0.207832] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
> [ 0.208594] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
> [ 0.209469] pci_bus 0000:00: resource 7 [mem 0x80000000-0xfebfffff window]
> [ 0.210493] NET: Registered protocol family 2
> [ 0.211244] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
> [ 0.212283] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
> [ 0.213285] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
> [ 0.214306] TCP: Hash tables configured (established 16384 bind 16384)
> [ 0.215065] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> [ 0.215797] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> [ 0.217934] NET: Registered protocol family 1
> [ 0.219126] RPC: Registered named UNIX socket transport module.
> [ 0.219676] RPC: Registered udp transport module.
> [ 0.220130] RPC: Registered tcp transport module.
> [ 0.220552] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [ 0.221153] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
> [ 0.221701] pci 0000:00:01.0: PIIX3: Enabling Passive Release
> [ 0.222319] pci 0000:00:01.0: Activating ISA DMA hang workarounds
> [ 0.444214] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
> [ 0.880141] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
> [ 1.311493] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
> [ 1.748829] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
> [ 1.962124] PCI: CLS 0 bytes, default 64
> [ 1.964749] Initialise system trusted keyrings
> [ 1.965289] workingset: timestamp_bits=37 max_order=19 bucket_order=0
> [ 1.969600] zbud: loaded
> [ 1.971287] SGI XFS with security attributes, no debug enabled
> [ 2.106071] NET: Registered protocol family 38
> [ 2.106556] Key type asymmetric registered
> [ 2.106931] Asymmetric key parser 'x509' registered
> [ 2.107514] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
> [ 2.108327] io scheduler noop registered
> [ 2.108813] io scheduler deadline registered
> [ 2.109608] io scheduler cfq registered (default)
> [ 2.110258] io scheduler mq-deadline registered
> [ 2.110796] io scheduler kyber registered
> [ 2.111688] intel_idle: Please enable MWAIT in BIOS SETUP
> [ 2.112310] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [ 2.113037] ACPI: Power Button [PWRF]
> [ 2.331642] virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
> [ 2.554093] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
> [ 2.775938] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
> [ 2.975053] tsc: Refined TSC clocksource calibration: 2495.981 MHz
> [ 2.975641] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa6529869, max_idle_ns: 440795218057 ns
> [ 3.029409] virtio-pci 0000:00:08.0: virtio_pci: leaving for legacy driver
> [ 3.032925] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
> [ 3.056849] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> [ 3.064748] Non-volatile memory driver v1.3
> [ 3.065925] ppdev: user-space parallel port driver
> [ 3.071816] loop: module loaded
> [ 3.075337] vda: vda1 vda2 vda3
> [ 3.076659] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [ 3.077996] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
> [ 3.079790] libphy: Fixed MDIO Bus: probed
> [ 3.080257] tun: Universal TUN/TAP device driver, 1.6
> [ 3.082222] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
> [ 3.083675] serio: i8042 KBD port at 0x60,0x64 irq 1
> [ 3.084160] serio: i8042 AUX port at 0x60,0x64 irq 12
> [ 3.084816] mousedev: PS/2 mouse device common for all mice
> [ 3.086603] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
> [ 3.089192] rtc_cmos 00:00: RTC can wake from S4
> [ 3.090116] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
> [ 3.092829] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram
> [ 3.093937] IR NEC protocol handler initialized
> [ 3.094408] IR RC5(x/sz) protocol handler initialized
> [ 3.094901] IR RC6 protocol handler initialized
> [ 3.095510] IR JVC protocol handler initialized
> [ 3.095952] IR Sony protocol handler initialized
> [ 3.096399] IR SANYO protocol handler initialized
> [ 3.096862] IR Sharp protocol handler initialized
> [ 3.097342] IR MCE Keyboard/mouse protocol handler initialized
> [ 3.097919] IR XMP protocol handler initialized
> [ 3.098530] device-mapper: uevent: version 1.0.3
> [ 3.099209] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: [email protected]
> [ 3.100398] device-mapper: multipath round-robin: version 1.2.0 loaded
> [ 3.101883] drop_monitor: Initializing network drop monitor service
> [ 3.102553] Netfilter messages via NETLINK v0.30.
> [ 3.103090] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> [ 3.103738] ctnetlink v0.93: registering with nfnetlink.
> [ 3.104494] ip_tables: (C) 2000-2006 Netfilter Core Team
> [ 3.105734] Initializing XFRM netlink socket
> [ 3.106885] NET: Registered protocol family 10
> [ 3.109341] Segment Routing with IPv6
> [ 3.109976] mip6: Mobile IPv6
> [ 3.111987] ip6_tables: (C) 2000-2006 Netfilter Core Team
> [ 3.114230] NET: Registered protocol family 17
> [ 3.115047] Bridge firewalling registered
> [ 3.115824] Ebtables v2.0 registered
> [ 3.117996] 8021q: 802.1Q VLAN Support v1.8
> [ 3.119429] AVX2 version of gcm_enc/dec engaged.
> [ 3.119886] AES CTR mode by8 optimization enabled
> [ 3.128818] sched_clock: Marking stable (3128714579, 0)->(3404180881, -275466302)
> [ 3.129945] registered taskstats version 1
> [ 3.130427] Loading compiled-in X.509 certificates
> [ 3.163216] Loaded X.509 cert 'Build time autogenerated kernel key: 38e0adea1af8bd8a23b02436d4acf2f8c7408d23'
> [ 3.166359] zswap: loaded using pool lzo/zbud
> [ 3.167943] Key type big_key registered
> [ 3.168778] Magic number: 13:918:708
> [ 3.169255] rtc_cmos 00:00: setting system clock to 2017-12-16 00:42:09 UTC (1513384929)
> [ 3.170604] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
> [ 3.171932] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
> [ 3.173871] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
> [ 3.175306] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
> [ 3.176212] EXT4-fs (vda2): write access will be enabled during recovery
> [ 3.397187] EXT4-fs (vda2): orphan cleanup on readonly fs
> [ 3.399412] EXT4-fs (vda2): 5 orphan inodes deleted
> [ 3.402759] EXT4-fs (vda2): recovery complete
> [ 3.466647] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)
> [ 3.469401] VFS: Mounted root (ext4 filesystem) readonly on device 253:2.
> [ 3.473719] devtmpfs: mounted
> [ 3.492549] Freeing unused kernel memory: 1640K
> [ 3.494547] Write protecting the kernel read-only data: 18432k
> [ 3.498781] Freeing unused kernel memory: 2016K
> [ 3.503330] Freeing unused kernel memory: 512K
> [ 3.505232] rodata_test: all tests were successful
> [ 3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
> [ 3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> [ 3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> [ 3.532127] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> [ 3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> [ 3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [ 3.546081] Call Trace:
> [ 3.547221] dump_stack+0x5c/0x79
> [ 3.548768] ? rest_init+0x30/0xb0
> [ 3.550320] panic+0xe4/0x232
> [ 3.551669] ? rest_init+0xb0/0xb0
> [ 3.553110] kernel_init+0xeb/0x100
> [ 3.554701] ret_from_fork+0x1f/0x30
> [ 3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 3.564160] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.


--
Michal Hocko
SUSE Labs

2017-12-18 18:12:55

by Andrei Vagin

[permalink] [raw]
Subject: Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Mon, Dec 18, 2017 at 10:13:02AM +0100, Michal Hocko wrote:
> On Fri 15-12-17 16:49:28, Andrei Vagin wrote:
> > Hi Michal,
> >
> > We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
> > doesn't boot:
> >
> > [ 3.492549] Freeing unused kernel memory: 1640K
> > [ 3.494547] Write protecting the kernel read-only data: 18432k
> > [ 3.498781] Freeing unused kernel memory: 2016K
> > [ 3.503330] Freeing unused kernel memory: 512K
> > [ 3.505232] rodata_test: all tests were successful
> > [ 3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
>
> Hmm, this interesting. What does the test actualy do? Could you add some
> instrumentation to see what is actually mapped there? Something like

There is nothing mapped there. It returns -95 (ENOSUPP)

The kernel is booted with this patch:

+ int ttype = type & ~MAP_FIXED_SAFE;
if (total_size) {
total_size = ELF_PAGEALIGN(total_size);
- map_addr = vm_mmap(filep, addr, total_size, prot, type,
off);
+ map_addr = vm_mmap(filep, addr, total_size, prot, ttype, off);
if (!BAD_ADDR(map_addr))
vm_munmap(map_addr+size, total_size-size);
} else
- map_addr = vm_mmap(filep, addr, size, prot, type, off);
+ map_addr = vm_mmap(filep, addr, size, prot, ttype, off);


>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 0e50230ce53d..1b68ddc34043 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,10 +372,28 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> - if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr)) {
> + struct vm_area_struct *vma;
> +
> pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
> task_pid_nr(current), current->comm,
> (void *)addr);
> + vma = find_vma(current->mm, map_addr);
> + if (vma && vma->vm_start < addr) {
> + pr_info("requested [%lx, %lx] mapped [%lx, %lx] %lx ", addr, addr + total_size,
> + vma->vm_start, vma->vm_end, vma->vm_flags);
> + if (!vma->vm_file) {
> + pr_cont("anon\n");
> + } else {
> + char path[512];
> + char *p = file_path(vma->vm_file, path, sizeof(path));
> + if (IS_ERR(p))
> + p = "?";
> + pr_cont("\"%s\"\n", kbasename(p));
> + }
> + dump_stack();
> + }
> + }
>
> return(map_addr);
> }
>
> > [ 3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> > [ 3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> > [ 3.532127] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > [ 3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> > [ 3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [ 3.546081] Call Trace:
> > [ 3.547221] dump_stack+0x5c/0x79
> > [ 3.548768] ? rest_init+0x30/0xb0
> > [ 3.550320] panic+0xe4/0x232
> > [ 3.551669] ? rest_init+0xb0/0xb0
> > [ 3.553110] kernel_init+0xeb/0x100
> > [ 3.554701] ret_from_fork+0x1f/0x30
> > [ 3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [ 3.564160] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> >
> > If I revert this patch, it boots normally.
> >
> > Thanks,
> > Andrei
> >
> > On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> > > on a controlled address and they use MAP_FIXED to enforce that. This is
> > > however dangerous thing prone to silent data corruption which can be
> > > even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> > > (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> > > ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> > > the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> > > we could end up mapping over the existing stack with some luck.
> > >
> > > The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> > > fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> > > further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> > > revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> > > stack consumption early during execve fully stopped by da029c11e6b1
> > > ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> > > safe and any attack should be impractical. On the other hand this is
> > > just too subtle assumption so it can break quite easily and hard to
> > > spot.
> > >
> > > I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> > > fundamentally dangerous. Moreover it shouldn't be even needed. We are
> > > at the early process stage and so there shouldn't be unrelated mappings
> > > (except for stack and loader) existing so mmap for a given address
> > > should succeed even without MAP_FIXED. Something is terribly wrong if
> > > this is not the case and we should rather fail than silently corrupt the
> > > underlying mapping.
> > >
> > > Address this issue by changing MAP_FIXED to the newly added
> > > MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> > > existing mapping clashing with the requested one without clobbering it.
> > >
> > > Cc: Abdul Haleem <[email protected]>
> > > Cc: Joel Stanley <[email protected]>
> > > Acked-by: Kees Cook <[email protected]>
> > > Reviewed-by: Khalid Aziz <[email protected]>
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > arch/metag/kernel/process.c | 6 +++++-
> > > fs/binfmt_elf.c | 12 ++++++++----
> > > 2 files changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> > > index 0909834c83a7..867c8d0a5fb4 100644
> > > --- a/arch/metag/kernel/process.c
> > > +++ b/arch/metag/kernel/process.c
> > > @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > > tcm_tag = tcm_lookup_tag(addr);
> > >
> > > if (tcm_tag != TCM_INVALID_TAG)
> > > - type &= ~MAP_FIXED;
> > > + type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
> > >
> > > /*
> > > * total_size is the size of the ELF (interpreter) image.
> > > @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > > } else
> > > map_addr = vm_mmap(filep, addr, size, prot, type, off);
> > >
> > > + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > > + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > > + task_pid_nr(current), tsk->comm, (void*)addr);
> > > +
> > > if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> > > struct tcm_allocation *tcm;
> > > unsigned long tcm_addr;
> > > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > > index 73b01e474fdc..5916d45f64a7 100644
> > > --- a/fs/binfmt_elf.c
> > > +++ b/fs/binfmt_elf.c
> > > @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> > > } else
> > > map_addr = vm_mmap(filep, addr, size, prot, type, off);
> > >
> > > + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > > + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > > + task_pid_nr(current), current->comm, (void*)addr);
> > > +
> > > return(map_addr);
> > > }
> > >
> > > @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> > > elf_prot |= PROT_EXEC;
> > > vaddr = eppnt->p_vaddr;
> > > if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> > > - elf_type |= MAP_FIXED;
> > > + elf_type |= MAP_FIXED_SAFE;
> > > else if (no_base && interp_elf_ex->e_type == ET_DYN)
> > > load_addr = -vaddr;
> > >
> > > @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > > * the ET_DYN load_addr calculations, proceed normally.
> > > */
> > > if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> > > - elf_flags |= MAP_FIXED;
> > > + elf_flags |= MAP_FIXED_SAFE;
> > > } else if (loc->elf_ex.e_type == ET_DYN) {
> > > /*
> > > * This logic is run once for the first LOAD Program
> > > @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > > load_bias = ELF_ET_DYN_BASE;
> > > if (current->flags & PF_RANDOMIZE)
> > > load_bias += arch_mmap_rnd();
> > > - elf_flags |= MAP_FIXED;
> > > + elf_flags |= MAP_FIXED_SAFE;
> > > } else
> > > load_bias = 0;
> > >
> > > @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
> > > (eppnt->p_filesz +
> > > ELF_PAGEOFFSET(eppnt->p_vaddr)),
> > > PROT_READ | PROT_WRITE | PROT_EXEC,
> > > - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> > > + MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> > > (eppnt->p_offset -
> > > ELF_PAGEOFFSET(eppnt->p_vaddr)));
> > > if (error != ELF_PAGESTART(eppnt->p_vaddr))
>
>
> > [ 0.000000] Linux version 4.15.0-rc3-next-20171215-00001-g6d6aea478fce (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Fri Dec 15 16:39:11 PST 2017
> > [ 0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> > [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> > [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> > [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> > [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> > [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> > [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> > [ 0.000000] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
> > [ 0.000000] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
> > [ 0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
> > [ 0.000000] e820: BIOS-provided physical RAM map:
> > [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
> > [ 0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > [ 0.000000] NX (Execute Disable) protection: active
> > [ 0.000000] random: fast init done
> > [ 0.000000] SMBIOS 2.8 present.
> > [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [ 0.000000] Hypervisor detected: KVM
> > [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> > [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> > [ 0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
> > [ 0.000000] MTRR default type: write-back
> > [ 0.000000] MTRR fixed ranges enabled:
> > [ 0.000000] 00000-9FFFF write-back
> > [ 0.000000] A0000-BFFFF uncachable
> > [ 0.000000] C0000-FFFFF write-protect
> > [ 0.000000] MTRR variable ranges enabled:
> > [ 0.000000] 0 base 0080000000 mask FF80000000 uncachable
> > [ 0.000000] 1 disabled
> > [ 0.000000] 2 disabled
> > [ 0.000000] 3 disabled
> > [ 0.000000] 4 disabled
> > [ 0.000000] 5 disabled
> > [ 0.000000] 6 disabled
> > [ 0.000000] 7 disabled
> > [ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
> > [ 0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [ (ptrval)]
> > [ 0.000000] Base memory trampoline at [ (ptrval)] 99000 size 24576
> > [ 0.000000] Using GB pages for direct mapping
> > [ 0.000000] BRK [0x2c984000, 0x2c984fff] PGTABLE
> > [ 0.000000] BRK [0x2c985000, 0x2c985fff] PGTABLE
> > [ 0.000000] BRK [0x2c986000, 0x2c986fff] PGTABLE
> > [ 0.000000] BRK [0x2c987000, 0x2c987fff] PGTABLE
> > [ 0.000000] BRK [0x2c988000, 0x2c988fff] PGTABLE
> > [ 0.000000] BRK [0x2c989000, 0x2c989fff] PGTABLE
> > [ 0.000000] ACPI: Early table checksum verification disabled
> > [ 0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
> > [ 0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
> > [ 0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
> > [ 0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001)
> > [ 0.000000] ACPI: FACS 0x000000007FFE0000 000040
> > [ 0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
> > [ 0.000000] ACPI: Local APIC address 0xfee00000
> > [ 0.000000] No NUMA configuration found
> > [ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
> > [ 0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
> > [ 0.000000] kvm-clock: cpu 0, msr 0:7ffc0001, primary cpu clock
> > [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [ 0.000000] kvm-clock: using sched offset of 1076013277 cycles
> > [ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffd8fff]
> > [ 0.000000] Normal empty
> > [ 0.000000] Device empty
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000007ffd8fff]
> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
> > [ 0.000000] On node 0 totalpages: 524151
> > [ 0.000000] DMA zone: 64 pages used for memmap
> > [ 0.000000] DMA zone: 21 pages reserved
> > [ 0.000000] DMA zone: 3998 pages, LIFO batch:0
> > [ 0.000000] DMA32 zone: 8128 pages used for memmap
> > [ 0.000000] DMA32 zone: 520153 pages, LIFO batch:31
> > [ 0.000000] Reserved but unavailable: 98 pages
> > [ 0.000000] ACPI: PM-Timer IO Port: 0x608
> > [ 0.000000] ACPI: Local APIC address 0xfee00000
> > [ 0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
> > [ 0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
> > [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> > [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
> > [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> > [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
> > [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
> > [ 0.000000] ACPI: IRQ0 used by override.
> > [ 0.000000] ACPI: IRQ5 used by override.
> > [ 0.000000] ACPI: IRQ9 used by override.
> > [ 0.000000] ACPI: IRQ10 used by override.
> > [ 0.000000] ACPI: IRQ11 used by override.
> > [ 0.000000] Using ACPI (MADT) for SMP configuration information
> > [ 0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
> > [ 0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> > [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
> > [ 0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
> > [ 0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices
> > [ 0.000000] Booting paravirtualized kernel on KVM
> > [ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [ 0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
> > [ 0.000000] percpu: Embedded 44 pages/cpu @ (ptrval) s142296 r8192 d29736 u1048576
> > [ 0.000000] pcpu-alloc: s142296 r8192 d29736 u1048576 alloc=1*2097152
> > [ 0.000000] pcpu-alloc: [0] 0 1
> > [ 0.000000] KVM setup async PF for cpu 0
> > [ 0.000000] kvm-stealtime: cpu 0, msr 7fc122c0
> > [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 515938
> > [ 0.000000] Policy zone: DMA32
> > [ 0.000000] Kernel command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> > [ 0.000000] Memory: 2037056K/2096604K available (12300K kernel code, 1554K rwdata, 3584K rodata, 1640K init, 912K bss, 59548K reserved, 0K cma-reserved)
> > [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> > [ 0.000000] ftrace: allocating 36554 entries in 143 pages
> > [ 0.001000] Hierarchical RCU implementation.
> > [ 0.001000] RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
> > [ 0.001000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
> > [ 0.001000] NR_IRQS: 4352, nr_irqs: 440, preallocated irqs: 16
> > [ 0.001000] Offload RCU callbacks from CPUs: (none).
> > [ 0.001000] Console: colour dummy device 80x25
> > [ 0.001000] console [ttyS0] enabled
> > [ 0.001000] ACPI: Core revision 20171110
> > [ 0.001000] ACPI: 1 ACPI AML tables successfully acquired and loaded
> > [ 0.001009] APIC: Switch to symmetric I/O mode setup
> > [ 0.001571] x2apic enabled
> > [ 0.002003] Switched APIC routing to physical x2apic.
> > [ 0.003538] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [ 0.004000] tsc: Detected 2496.000 MHz processor
> > [ 0.004014] Calibrating delay loop (skipped) preset value.. 4992.00 BogoMIPS (lpj=2496000)
> > [ 0.005014] pid_max: default: 32768 minimum: 301
> > [ 0.006057] Security Framework initialized
> > [ 0.006548] Yama: becoming mindful.
> > [ 0.007019] SELinux: Disabled at boot.
> > [ 0.008206] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> > [ 0.009164] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> > [ 0.009816] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
> > [ 0.010009] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
> > [ 0.011322] mce: CPU supports 10 MCE banks
> > [ 0.011740] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> > [ 0.012002] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> > [ 0.012610] Freeing SMP alternatives memory: 36K
> > [ 0.013467] TSC deadline timer enabled
> > [ 0.013820] smpboot: CPU0: Intel Core Processor (Skylake) (family: 0x6, model: 0x5e, stepping: 0x3)
> > [ 0.014000] Performance Events: unsupported p6 CPU model 94 no PMU driver, software events only.
> > [ 0.014041] Hierarchical SRCU implementation.
> > [ 0.015133] NMI watchdog: Perf event create on CPU 0 failed with -2
> > [ 0.015725] NMI watchdog: Perf NMI watchdog permanently disabled
> > [ 0.016077] smp: Bringing up secondary CPUs ...
> > [ 0.016654] x86: Booting SMP configuration:
> > [ 0.017005] .... node #0, CPUs: #1
> > [ 0.001000] kvm-clock: cpu 1, msr 0:7ffc0041, secondary cpu clock
> > [ 0.019051] KVM setup async PF for cpu 1
> > [ 0.019599] kvm-stealtime: cpu 1, msr 7fd122c0
> > [ 0.020009] smp: Brought up 1 node, 2 CPUs
> > [ 0.020531] smpboot: Max logical packages: 2
> > [ 0.021009] smpboot: Total of 2 processors activated (9984.00 BogoMIPS)
> > [ 0.023160] devtmpfs: initialized
> > [ 0.023513] x86/mm: Memory block size: 128MB
> > [ 0.024811] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [ 0.025015] futex hash table entries: 512 (order: 3, 32768 bytes)
> > [ 0.026185] RTC time: 0:42:06, date: 12/16/17
> > [ 0.026790] NET: Registered protocol family 16
> > [ 0.027204] audit: initializing netlink subsys (disabled)
> > [ 0.027914] audit: type=2000 audit(1513384927.133:1): state=initialized audit_enabled=0 res=1
> > [ 0.028185] cpuidle: using governor menu
> > [ 0.029118] ACPI: bus type PCI registered
> > [ 0.029872] PCI: Using configuration type 1 for base access
> > [ 0.034355] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > [ 0.035011] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
> > [ 0.036066] cryptd: max_cpu_qlen set to 1000
> > [ 0.036579] ACPI: Added _OSI(Module Device)
> > [ 0.037007] ACPI: Added _OSI(Processor Device)
> > [ 0.037426] ACPI: Added _OSI(3.0 _SCP Extensions)
> > [ 0.037857] ACPI: Added _OSI(Processor Aggregator Device)
> > [ 0.041356] ACPI: Interpreter enabled
> > [ 0.041764] ACPI: (supports S0 S5)
> > [ 0.042005] ACPI: Using IOAPIC for interrupt routing
> > [ 0.042655] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> > [ 0.043625] ACPI: Enabled 2 GPEs in block 00 to 0F
> > [ 0.059248] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
> > [ 0.059953] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]
> > [ 0.060045] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
> > [ 0.061180] PCI host bridge to bus 0000:00
> > [ 0.061874] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
> > [ 0.062013] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
> > [ 0.063016] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> > [ 0.064015] pci_bus 0000:00: root bus resource [mem 0x80000000-0xfebfffff window]
> > [ 0.065014] pci_bus 0000:00: root bus resource [bus 00-ff]
> > [ 0.065753] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
> > [ 0.066487] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
> > [ 0.067537] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
> > [ 0.071700] pci 0000:00:01.1: reg 0x20: [io 0xc100-0xc10f]
> > [ 0.074032] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io 0x01f0-0x01f7]
> > [ 0.074908] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io 0x03f6]
> > [ 0.075011] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io 0x0170-0x0177]
> > [ 0.076010] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io 0x0376]
> > [ 0.077121] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
> > [ 0.078148] pci 0000:00:01.3: quirk: [io 0x0600-0x063f] claimed by PIIX4 ACPI
> > [ 0.079014] pci 0000:00:01.3: quirk: [io 0x0700-0x070f] claimed by PIIX4 SMB
> > [ 0.080224] pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
> > [ 0.082007] pci 0000:00:03.0: reg 0x10: [io 0xc040-0xc05f]
> > [ 0.083814] pci 0000:00:03.0: reg 0x14: [mem 0xfebc0000-0xfebc0fff]
> > [ 0.089891] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
> > [ 0.090545] pci 0000:00:05.0: [1af4:1003] type 00 class 0x078000
> > [ 0.092708] pci 0000:00:05.0: reg 0x10: [io 0xc060-0xc07f]
> > [ 0.094009] pci 0000:00:05.0: reg 0x14: [mem 0xfebc1000-0xfebc1fff]
> > [ 0.102484] pci 0000:00:06.0: [8086:2934] type 00 class 0x0c0300
> > [ 0.108028] pci 0000:00:06.0: reg 0x20: [io 0xc080-0xc09f]
> > [ 0.110738] pci 0000:00:06.1: [8086:2935] type 00 class 0x0c0300
> > [ 0.114388] pci 0000:00:06.1: reg 0x20: [io 0xc0a0-0xc0bf]
> > [ 0.117339] pci 0000:00:06.2: [8086:2936] type 00 class 0x0c0300
> > [ 0.122770] pci 0000:00:06.2: reg 0x20: [io 0xc0c0-0xc0df]
> > [ 0.124738] pci 0000:00:06.7: [8086:293a] type 00 class 0x0c0320
> > [ 0.125825] pci 0000:00:06.7: reg 0x10: [mem 0xfebc2000-0xfebc2fff]
> > [ 0.130347] pci 0000:00:07.0: [1af4:1001] type 00 class 0x010000
> > [ 0.133007] pci 0000:00:07.0: reg 0x10: [io 0xc000-0xc03f]
> > [ 0.134793] pci 0000:00:07.0: reg 0x14: [mem 0xfebc3000-0xfebc3fff]
> > [ 0.141808] pci 0000:00:08.0: [1af4:1002] type 00 class 0x00ff00
> > [ 0.142914] pci 0000:00:08.0: reg 0x10: [io 0xc0e0-0xc0ff]
> > [ 0.148977] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
> > [ 0.149455] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
> > [ 0.150390] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
> > [ 0.151382] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
> > [ 0.152380] ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
> > [ 0.154508] vgaarb: loaded
> > [ 0.155271] SCSI subsystem initialized
> > [ 0.155887] EDAC MC: Ver: 3.0.0
> > [ 0.156255] PCI: Using ACPI for IRQ routing
> > [ 0.156566] PCI: pci_cache_line_size set to 64 bytes
> > [ 0.157161] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
> > [ 0.157914] e820: reserve RAM buffer [mem 0x7ffd9000-0x7fffffff]
> > [ 0.158253] NetLabel: Initializing
> > [ 0.158765] NetLabel: domain hash size = 128
> > [ 0.159005] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
> > [ 0.159775] NetLabel: unlabeled traffic allowed by default
> > [ 0.160073] clocksource: Switched to clocksource kvm-clock
> > [ 0.186764] VFS: Disk quotas dquot_6.6.0
> > [ 0.187277] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > [ 0.188251] FS-Cache: Loaded
> > [ 0.188725] pnp: PnP ACPI init
> > [ 0.189271] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
> > [ 0.190231] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
> > [ 0.191229] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
> > [ 0.192096] pnp 00:03: [dma 2]
> > [ 0.192514] pnp 00:03: Plug and Play ACPI device, IDs PNP0700 (active)
> > [ 0.193631] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
> > [ 0.195416] pnp: PnP ACPI: found 5 devices
> > [ 0.206055] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [ 0.207127] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
> > [ 0.207832] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
> > [ 0.208594] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
> > [ 0.209469] pci_bus 0000:00: resource 7 [mem 0x80000000-0xfebfffff window]
> > [ 0.210493] NET: Registered protocol family 2
> > [ 0.211244] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
> > [ 0.212283] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
> > [ 0.213285] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
> > [ 0.214306] TCP: Hash tables configured (established 16384 bind 16384)
> > [ 0.215065] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> > [ 0.215797] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> > [ 0.217934] NET: Registered protocol family 1
> > [ 0.219126] RPC: Registered named UNIX socket transport module.
> > [ 0.219676] RPC: Registered udp transport module.
> > [ 0.220130] RPC: Registered tcp transport module.
> > [ 0.220552] RPC: Registered tcp NFSv4.1 backchannel transport module.
> > [ 0.221153] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
> > [ 0.221701] pci 0000:00:01.0: PIIX3: Enabling Passive Release
> > [ 0.222319] pci 0000:00:01.0: Activating ISA DMA hang workarounds
> > [ 0.444214] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
> > [ 0.880141] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
> > [ 1.311493] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
> > [ 1.748829] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
> > [ 1.962124] PCI: CLS 0 bytes, default 64
> > [ 1.964749] Initialise system trusted keyrings
> > [ 1.965289] workingset: timestamp_bits=37 max_order=19 bucket_order=0
> > [ 1.969600] zbud: loaded
> > [ 1.971287] SGI XFS with security attributes, no debug enabled
> > [ 2.106071] NET: Registered protocol family 38
> > [ 2.106556] Key type asymmetric registered
> > [ 2.106931] Asymmetric key parser 'x509' registered
> > [ 2.107514] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
> > [ 2.108327] io scheduler noop registered
> > [ 2.108813] io scheduler deadline registered
> > [ 2.109608] io scheduler cfq registered (default)
> > [ 2.110258] io scheduler mq-deadline registered
> > [ 2.110796] io scheduler kyber registered
> > [ 2.111688] intel_idle: Please enable MWAIT in BIOS SETUP
> > [ 2.112310] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > [ 2.113037] ACPI: Power Button [PWRF]
> > [ 2.331642] virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
> > [ 2.554093] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
> > [ 2.775938] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
> > [ 2.975053] tsc: Refined TSC clocksource calibration: 2495.981 MHz
> > [ 2.975641] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa6529869, max_idle_ns: 440795218057 ns
> > [ 3.029409] virtio-pci 0000:00:08.0: virtio_pci: leaving for legacy driver
> > [ 3.032925] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
> > [ 3.056849] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > [ 3.064748] Non-volatile memory driver v1.3
> > [ 3.065925] ppdev: user-space parallel port driver
> > [ 3.071816] loop: module loaded
> > [ 3.075337] vda: vda1 vda2 vda3
> > [ 3.076659] Rounding down aligned max_sectors from 4294967295 to 4294967288
> > [ 3.077996] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
> > [ 3.079790] libphy: Fixed MDIO Bus: probed
> > [ 3.080257] tun: Universal TUN/TAP device driver, 1.6
> > [ 3.082222] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
> > [ 3.083675] serio: i8042 KBD port at 0x60,0x64 irq 1
> > [ 3.084160] serio: i8042 AUX port at 0x60,0x64 irq 12
> > [ 3.084816] mousedev: PS/2 mouse device common for all mice
> > [ 3.086603] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
> > [ 3.089192] rtc_cmos 00:00: RTC can wake from S4
> > [ 3.090116] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
> > [ 3.092829] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram
> > [ 3.093937] IR NEC protocol handler initialized
> > [ 3.094408] IR RC5(x/sz) protocol handler initialized
> > [ 3.094901] IR RC6 protocol handler initialized
> > [ 3.095510] IR JVC protocol handler initialized
> > [ 3.095952] IR Sony protocol handler initialized
> > [ 3.096399] IR SANYO protocol handler initialized
> > [ 3.096862] IR Sharp protocol handler initialized
> > [ 3.097342] IR MCE Keyboard/mouse protocol handler initialized
> > [ 3.097919] IR XMP protocol handler initialized
> > [ 3.098530] device-mapper: uevent: version 1.0.3
> > [ 3.099209] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: [email protected]
> > [ 3.100398] device-mapper: multipath round-robin: version 1.2.0 loaded
> > [ 3.101883] drop_monitor: Initializing network drop monitor service
> > [ 3.102553] Netfilter messages via NETLINK v0.30.
> > [ 3.103090] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> > [ 3.103738] ctnetlink v0.93: registering with nfnetlink.
> > [ 3.104494] ip_tables: (C) 2000-2006 Netfilter Core Team
> > [ 3.105734] Initializing XFRM netlink socket
> > [ 3.106885] NET: Registered protocol family 10
> > [ 3.109341] Segment Routing with IPv6
> > [ 3.109976] mip6: Mobile IPv6
> > [ 3.111987] ip6_tables: (C) 2000-2006 Netfilter Core Team
> > [ 3.114230] NET: Registered protocol family 17
> > [ 3.115047] Bridge firewalling registered
> > [ 3.115824] Ebtables v2.0 registered
> > [ 3.117996] 8021q: 802.1Q VLAN Support v1.8
> > [ 3.119429] AVX2 version of gcm_enc/dec engaged.
> > [ 3.119886] AES CTR mode by8 optimization enabled
> > [ 3.128818] sched_clock: Marking stable (3128714579, 0)->(3404180881, -275466302)
> > [ 3.129945] registered taskstats version 1
> > [ 3.130427] Loading compiled-in X.509 certificates
> > [ 3.163216] Loaded X.509 cert 'Build time autogenerated kernel key: 38e0adea1af8bd8a23b02436d4acf2f8c7408d23'
> > [ 3.166359] zswap: loaded using pool lzo/zbud
> > [ 3.167943] Key type big_key registered
> > [ 3.168778] Magic number: 13:918:708
> > [ 3.169255] rtc_cmos 00:00: setting system clock to 2017-12-16 00:42:09 UTC (1513384929)
> > [ 3.170604] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
> > [ 3.171932] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
> > [ 3.173871] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
> > [ 3.175306] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
> > [ 3.176212] EXT4-fs (vda2): write access will be enabled during recovery
> > [ 3.397187] EXT4-fs (vda2): orphan cleanup on readonly fs
> > [ 3.399412] EXT4-fs (vda2): 5 orphan inodes deleted
> > [ 3.402759] EXT4-fs (vda2): recovery complete
> > [ 3.466647] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)
> > [ 3.469401] VFS: Mounted root (ext4 filesystem) readonly on device 253:2.
> > [ 3.473719] devtmpfs: mounted
> > [ 3.492549] Freeing unused kernel memory: 1640K
> > [ 3.494547] Write protecting the kernel read-only data: 18432k
> > [ 3.498781] Freeing unused kernel memory: 2016K
> > [ 3.503330] Freeing unused kernel memory: 512K
> > [ 3.505232] rodata_test: all tests were successful
> > [ 3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
> > [ 3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> > [ 3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> > [ 3.532127] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > [ 3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> > [ 3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [ 3.546081] Call Trace:
> > [ 3.547221] dump_stack+0x5c/0x79
> > [ 3.548768] ? rest_init+0x30/0xb0
> > [ 3.550320] panic+0xe4/0x232
> > [ 3.551669] ? rest_init+0xb0/0xb0
> > [ 3.553110] kernel_init+0xeb/0x100
> > [ 3.554701] ret_from_fork+0x1f/0x30
> > [ 3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [ 3.564160] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
>
>
> --
> Michal Hocko
> SUSE Labs

2017-12-18 18:49:24

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH] mm: don't use the same value for MAP_FIXED_SAFE and MAP_SYNC

Cc: Michal Hocko <[email protected]>
Fixes: ("fs, elf: drop MAP_FIXED usage from elf_map")
Signed-off-by: Andrei Vagin <[email protected]>
---
include/uapi/asm-generic/mman-common.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b37502cbbef7..2db3fa287274 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -26,7 +26,9 @@
#else
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif
-#define MAP_FIXED_SAFE 0x80000 /* MAP_FIXED which doesn't unmap underlying mapping */
+
+/* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
+#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */

/*
* Flags for mlock
--
2.13.6

Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Hello Kees,

I'm late to the party, and only just caught up with the fuss :-).

On 12/14/2017 12:19 AM, Kees Cook wrote:
> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <[email protected]> wrote:
>> Hi!
>>> You selected stupid name for a flag. Everyone and their dog agrees
>>> with that. There's even consensus on better name (and everyone agrees
>>> it is better than .._SAFE). Of course, we could have debate if it is
>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>> poor naming on your part.
>>
>> Well while everybody agrees that the name is so bad that basically
>> anything else would be better, there does not seem to be consensus on
>> which one to pick. I do understand that this frustrating and fruitless.
>
> Based on the earlier threads where I tried to end the bikeshedding, it
> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>
>> So what do we do now, roll a dice to choose new name?
>>
>> Or do we ask BFDL[1] to choose the name?
>
> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
> with these kinds of choices in the past. I'm fine to ask Linus too. I
> just want to get past the name since the feature is quite valuable.
>
> And if Michal doesn't want to touch this patch any more, I'm happy to
> do the search/replace/resend. :P

Something with the prefix MAP_FIXED_ seems to me obviously desirable,
both to suggest that the function is similar, and also for easy
grepping of the source code to look for instances of both.
MAP_FIXED_SAFE didn't really bother me as a name, but
MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
descriptive of what the flag actually does, so a little better.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2017-12-18 20:19:33

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Mon, Dec 18, 2017 at 11:12 AM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> Hello Kees,
>
> I'm late to the party, and only just caught up with the fuss :-).

No worries!

> On 12/14/2017 12:19 AM, Kees Cook wrote:
>> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <[email protected]> wrote:
>>> Hi!
>>>> You selected stupid name for a flag. Everyone and their dog agrees
>>>> with that. There's even consensus on better name (and everyone agrees
>>>> it is better than .._SAFE). Of course, we could have debate if it is
>>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>>> poor naming on your part.
>>>
>>> Well while everybody agrees that the name is so bad that basically
>>> anything else would be better, there does not seem to be consensus on
>>> which one to pick. I do understand that this frustrating and fruitless.
>>
>> Based on the earlier threads where I tried to end the bikeshedding, it
>> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>>
>>> So what do we do now, roll a dice to choose new name?
>>>
>>> Or do we ask BFDL[1] to choose the name?
>>
>> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
>> with these kinds of choices in the past. I'm fine to ask Linus too. I
>> just want to get past the name since the feature is quite valuable.
>>
>> And if Michal doesn't want to touch this patch any more, I'm happy to
>> do the search/replace/resend. :P
>
> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> both to suggest that the function is similar, and also for easy
> grepping of the source code to look for instances of both.
> MAP_FIXED_SAFE didn't really bother me as a name, but
> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> descriptive of what the flag actually does, so a little better.

Great, thanks!

Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

-Kees

--
Kees Cook
Pixel Security

2017-12-18 20:34:09

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Mon, Dec 18, 2017 at 12:19:21PM -0800, Kees Cook wrote:
> Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

+1

2017-12-18 20:41:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't use the same value for MAP_FIXED_SAFE and MAP_SYNC

On Mon 18-12-17 10:49:16, Andrei Vagin wrote:
> Cc: Michal Hocko <[email protected]>
> Fixes: ("fs, elf: drop MAP_FIXED usage from elf_map")
> Signed-off-by: Andrei Vagin <[email protected]>
> ---
> include/uapi/asm-generic/mman-common.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index b37502cbbef7..2db3fa287274 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -26,7 +26,9 @@
> #else
> # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
> #endif
> -#define MAP_FIXED_SAFE 0x80000 /* MAP_FIXED which doesn't unmap underlying mapping */
> +
> +/* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
> +#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */

Ouch, I was developing on top of mmotm which didn't have the new the new
MAP_SYNC. Thanks for catching that. Andrew, could you fold this into the
patch, please?
--
Michal Hocko
SUSE Labs

2017-12-19 12:40:16

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

From: Edward Napierala
> Sent: 14 December 2017 14:55
>
> On 1214T1415, Michal Hocko wrote:
> > On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > > Regarding the name - how about adopting MAP_EXCL? It was introduced in
> > > FreeBSD,
> > > and seems to do exactly this; quoting mmap(2):
> > >
> > > MAP_FIXED Do not permit the system to select a different address
> > > than the one specified. If the specified address
> > > cannot be used, mmap() will fail. If MAP_FIXED is
> > > specified, addr must be a multiple of the page size.
> > > If MAP_EXCL is not specified, a successful MAP_FIXED
> > > request replaces any previous mappings for the
> > > process' pages in the range from addr to addr + len.
> > > In contrast, if MAP_EXCL is specified, the request
> > > will fail if a mapping already exists within the
> > > range.
> >
> > I am not familiar with the FreeBSD implementation but from the above it
> > looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> > going to implement it in linux due to reasons mentioned in this cover
> > letter. Using the same name would be more confusing than helpful I am
> > afraid.
>
> Sorry, missed that. Indeed, reusing a name with a different semantics
> would be a bad idea.

I don't remember any discussion about using MAP_FIXED | MAP_EXCL ?

Why not match the prior art??

David

2017-12-19 12:46:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE

On Tue 19-12-17 12:40:16, David Laight wrote:
> From: Edward Napierala
> > Sent: 14 December 2017 14:55
> >
> > On 1214T1415, Michal Hocko wrote:
> > > On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > > > Regarding the name - how about adopting MAP_EXCL? It was introduced in
> > > > FreeBSD,
> > > > and seems to do exactly this; quoting mmap(2):
> > > >
> > > > MAP_FIXED Do not permit the system to select a different address
> > > > than the one specified. If the specified address
> > > > cannot be used, mmap() will fail. If MAP_FIXED is
> > > > specified, addr must be a multiple of the page size.
> > > > If MAP_EXCL is not specified, a successful MAP_FIXED
> > > > request replaces any previous mappings for the
> > > > process' pages in the range from addr to addr + len.
> > > > In contrast, if MAP_EXCL is specified, the request
> > > > will fail if a mapping already exists within the
> > > > range.
> > >
> > > I am not familiar with the FreeBSD implementation but from the above it
> > > looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> > > going to implement it in linux due to reasons mentioned in this cover
> > > letter. Using the same name would be more confusing than helpful I am
> > > afraid.
> >
> > Sorry, missed that. Indeed, reusing a name with a different semantics
> > would be a bad idea.
>
> I don't remember any discussion about using MAP_FIXED | MAP_EXCL ?
>
> Why not match the prior art??

See the cover letter which explains why an extension is not backward
compatible.
--
Michal Hocko
SUSE Labs

2017-12-21 12:38:43

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Kees Cook <[email protected]> writes:

> On Mon, Dec 18, 2017 at 11:12 AM, Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> Hello Kees,
>>
>> I'm late to the party, and only just caught up with the fuss :-).
>
> No worries!
>
>> On 12/14/2017 12:19 AM, Kees Cook wrote:
>>> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <[email protected]> wrote:
>>>> Hi!
>>>>> You selected stupid name for a flag. Everyone and their dog agrees
>>>>> with that. There's even consensus on better name (and everyone agrees
>>>>> it is better than .._SAFE). Of course, we could have debate if it is
>>>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>>>> poor naming on your part.
>>>>
>>>> Well while everybody agrees that the name is so bad that basically
>>>> anything else would be better, there does not seem to be consensus on
>>>> which one to pick. I do understand that this frustrating and fruitless.
>>>
>>> Based on the earlier threads where I tried to end the bikeshedding, it
>>> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>>>
>>>> So what do we do now, roll a dice to choose new name?
>>>>
>>>> Or do we ask BFDL[1] to choose the name?
>>>
>>> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
>>> with these kinds of choices in the past. I'm fine to ask Linus too. I
>>> just want to get past the name since the feature is quite valuable.
>>>
>>> And if Michal doesn't want to touch this patch any more, I'm happy to
>>> do the search/replace/resend. :P
>>
>> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
>> both to suggest that the function is similar, and also for easy
>> grepping of the source code to look for instances of both.
>> MAP_FIXED_SAFE didn't really bother me as a name, but
>> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
>> descriptive of what the flag actually does, so a little better.
>
> Great, thanks!
>
> Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

This seems to have not happened. Presumably Andrew just missed the mail
in the flood. And will probably miss this one too ... :)

cheers

2017-12-21 14:59:12

by Pavel Machek

[permalink] [raw]
Subject: known bad patch in -mm tree was Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Hi!

> >>> And if Michal doesn't want to touch this patch any more, I'm happy to
> >>> do the search/replace/resend. :P
> >>
> >> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> >> both to suggest that the function is similar, and also for easy
> >> grepping of the source code to look for instances of both.
> >> MAP_FIXED_SAFE didn't really bother me as a name, but
> >> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> >> descriptive of what the flag actually does, so a little better.
> >
> > Great, thanks!
> >
> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
>
> This seems to have not happened. Presumably Andrew just missed the mail
> in the flood. And will probably miss this one too ... :)

Nice way to mess up kernel development, Michal. Thank you! :-(.

Andrew, everyone and their dog agrees MAP_FIXED_SAFE is stupid name,
but Michal decided to just go ahead, ignoring feedback...

Can you either s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g or drop the patches?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.17 kB)
signature.asc (181.00 B)
Digital signature
Download all attachments

2017-12-21 15:08:53

by Michal Hocko

[permalink] [raw]
Subject: Re: known bad patch in -mm tree was Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Thu 21-12-17 15:59:07, Pavel Machek wrote:
> Hi!
>
> > >>> And if Michal doesn't want to touch this patch any more, I'm happy to
> > >>> do the search/replace/resend. :P
> > >>
> > >> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> > >> both to suggest that the function is similar, and also for easy
> > >> grepping of the source code to look for instances of both.
> > >> MAP_FIXED_SAFE didn't really bother me as a name, but
> > >> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> > >> descriptive of what the flag actually does, so a little better.
> > >
> > > Great, thanks!
> > >
> > > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
> >
> > This seems to have not happened. Presumably Andrew just missed the mail
> > in the flood. And will probably miss this one too ... :)
>
> Nice way to mess up kernel development, Michal. Thank you! :-(.

Thank you for your valuable feedback! Maybe you have noticed that I
haven't enforced the patch and led others to decide the final name
(either by resubmitting patches or a simple replace in mmotm tree). Or
maybe you haven't because you are so busy bikesheding that you can
hardly see anything else.

> Andrew, everyone and their dog agrees MAP_FIXED_SAFE is stupid name,
> but Michal decided to just go ahead, ignoring feedback...
>
> Can you either s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g or drop the patches?

You have surely saved the world today and I hardly find words to thank
you (and your dog of course).

Thanks!
--
Michal Hocko
SUSE Labs

2017-12-21 22:24:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

On Thu, 21 Dec 2017 23:38:37 +1100 Michael Ellerman <[email protected]> wrote:

> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

Done.

2017-12-22 00:06:09

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation

Andrew Morton <[email protected]> writes:

> On Thu, 21 Dec 2017 23:38:37 +1100 Michael Ellerman <[email protected]> wrote:
>
>> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
>
> Done.

Thanks.

I sent an ack at some point, here's another if you like:

Acked-by: Michael Ellerman <[email protected]>

There's also a couple of stray whitespace changes in the version in
linux-next, and some inconsistent whitespace between the various mman.h
changes. Patch below to fix them up if you haven't already.

cheers


diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index bb9ccb5ff3ed..5e362db59780 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -50,7 +50,6 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
-
#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */

/*
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index dedc09ead4cb..a0702506d7c6 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,7 +26,6 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
-
#define MAP_FIXED_SAFE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */

#define MS_SYNC 1 /* synchronous memory sync */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index d21bffd5d3dc..715a2c927e79 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -25,4 +25,5 @@
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */

+
#endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index da73b6d5dbcd..52f4d21923b3 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -65,7 +65,6 @@
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif

-
/*
* Flags for msync
*/

2018-04-18 10:52:40

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

From 0ba20dcbbc40b703413c9a6907a77968b087811b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 18 Apr 2018 15:31:48 +0900
Subject: [PATCH] fs, elf: don't complain MAP_FIXED_NOREPLACE if mapping
failed.

Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.

9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Don't complain if IS_ERR_VALUE(), and use %px for printing the address.

Signed-off-by: Tetsuo Handa <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Joel Stanley <[email protected]>
Cc: Anshuman Khandual <[email protected]>
---
fs/binfmt_elf.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 41e0418..559d35b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -377,10 +377,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

- if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
- pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
- task_pid_nr(current), current->comm,
- (void *)addr);
+ if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr) &&
+ !IS_ERR_VALUE(map_addr))
+ pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
+ task_pid_nr(current), current->comm, (void *)addr);

return(map_addr);
}
--
1.8.3.1



2018-04-18 11:35:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Wed 18-04-18 19:51:05, Tetsuo Handa wrote:
> >From 0ba20dcbbc40b703413c9a6907a77968b087811b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Wed, 18 Apr 2018 15:31:48 +0900
> Subject: [PATCH] fs, elf: don't complain MAP_FIXED_NOREPLACE if mapping
> failed.
>
> Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
> printing spurious messages under memory pressure due to map_addr == -ENOMEM.
>
> 9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
> 14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
> 16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Hmm this is ENOMEM.

> Don't complain if IS_ERR_VALUE(),

this is simply wrong. We do want to warn on the failure because this is
when the actual clash happens. We should just warn on EEXIST.

> and use %px for printing the address.
>
> Signed-off-by: Tetsuo Handa <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Khalid Aziz <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Abdul Haleem <[email protected]>
> Cc: Joel Stanley <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> ---
> fs/binfmt_elf.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 41e0418..559d35b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -377,10 +377,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> - if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
> - pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
> - task_pid_nr(current), current->comm,
> - (void *)addr);
> + if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr) &&
> + !IS_ERR_VALUE(map_addr))
> + pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
> + task_pid_nr(current), current->comm, (void *)addr);
>
> return(map_addr);
> }
> --
> 1.8.3.1
>
>

--
Michal Hocko
SUSE Labs

2018-04-18 11:44:44

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

Michal Hocko wrote:
> > Don't complain if IS_ERR_VALUE(),
>
> this is simply wrong. We do want to warn on the failure because this is
> when the actual clash happens. We should just warn on EEXIST.

>From 25442cdd31aa5cc8522923a0153a77dfd2ebc832 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 18 Apr 2018 20:38:15 +0900
Subject: [PATCH] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST
error.

Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.

9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Complain only if -EEXIST, and use %px for printing the address.

Signed-off-by: Tetsuo Handa <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Joel Stanley <[email protected]>
Cc: Anshuman Khandual <[email protected]>
---
fs/binfmt_elf.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 41e0418..96615d9 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -377,10 +377,9 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

- if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
- pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
- task_pid_nr(current), current->comm,
- (void *)addr);
+ if ((type & MAP_FIXED_NOREPLACE) && map_addr == -EEXIST)
+ pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
+ task_pid_nr(current), current->comm, (void *)addr);

return(map_addr);
}
--
1.8.3.1

2018-04-18 11:57:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Wed 18-04-18 20:43:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Don't complain if IS_ERR_VALUE(),
> >
> > this is simply wrong. We do want to warn on the failure because this is
> > when the actual clash happens. We should just warn on EEXIST.
>
> >From 25442cdd31aa5cc8522923a0153a77dfd2ebc832 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Wed, 18 Apr 2018 20:38:15 +0900
> Subject: [PATCH] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST
> error.
>
> Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
> printing spurious messages under memory pressure due to map_addr == -ENOMEM.
>
> 9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
> 14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
> 16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already
>
> Complain only if -EEXIST, and use %px for printing the address.

Yes this is better. But...

[...]
> - if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
> - pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
> - task_pid_nr(current), current->comm,
> - (void *)addr);
> + if ((type & MAP_FIXED_NOREPLACE) && map_addr == -EEXIST)

... please use PTR_ERR(map_addr) == -EEXIST

then you can add
Acked-by: Michal Hocko <[email protected]>

> + pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
> + task_pid_nr(current), current->comm, (void *)addr);
>
> return(map_addr);
> }
> --
> 1.8.3.1

--
Michal Hocko
SUSE Labs

2018-04-18 14:08:45

by Tetsuo Handa

[permalink] [raw]
Subject: [PATCH v2] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error.

>From 3f396857d23d4bf1fac4d4332316b5ba0af6d2f9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 18 Apr 2018 23:00:53 +0900
Subject: [PATCH v2] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error.

Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.

9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Complain only if -EEXIST, and use %px for printing the address.

Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Joel Stanley <[email protected]>
Cc: Anshuman Khandual <[email protected]>
---
fs/binfmt_elf.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 41e0418..4ad6f66 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -377,10 +377,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
} else
map_addr = vm_mmap(filep, addr, size, prot, type, off);

- if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
- pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
- task_pid_nr(current), current->comm,
- (void *)addr);
+ if ((type & MAP_FIXED_NOREPLACE) &&
+ PTR_ERR((void *)map_addr) == -EEXIST)
+ pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
+ task_pid_nr(current), current->comm, (void *)addr);

return(map_addr);
}
--
1.8.3.1

2018-04-19 05:59:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error.

On Wed 18-04-18 23:07:12, Tetsuo Handa wrote:
> >From 3f396857d23d4bf1fac4d4332316b5ba0af6d2f9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Wed, 18 Apr 2018 23:00:53 +0900
> Subject: [PATCH v2] fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error.
>
> Commit 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
> printing spurious messages under memory pressure due to map_addr == -ENOMEM.
>
> 9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
> 14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
> 16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already
>
> Complain only if -EEXIST, and use %px for printing the address.
>
> Signed-off-by: Tetsuo Handa <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Khalid Aziz <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Abdul Haleem <[email protected]>
> Cc: Joel Stanley <[email protected]>
> Cc: Anshuman Khandual <[email protected]>

Thanks!

> ---
> fs/binfmt_elf.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 41e0418..4ad6f66 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -377,10 +377,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> - if ((type & MAP_FIXED_NOREPLACE) && BAD_ADDR(map_addr))
> - pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
> - task_pid_nr(current), current->comm,
> - (void *)addr);
> + if ((type & MAP_FIXED_NOREPLACE) &&
> + PTR_ERR((void *)map_addr) == -EEXIST)
> + pr_info("%d (%s): Uhuuh, elf segment at %px requested but the memory is mapped already\n",
> + task_pid_nr(current), current->comm, (void *)addr);
>
> return(map_addr);
> }
> --
> 1.8.3.1

--
Michal Hocko
SUSE Labs

2018-05-29 22:23:56

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

Just a quick heads up. I noticed a change in libhugetlbfs testing starting
with v4.17-rc1.

V4.16 libhugetlbfs test results
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 105 111
* FAIL: 0 0
* Killed by signal: 4 1
* Bad configuration: 1 1
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

v4.17-rc1 (and later) libhugetlbfs test results
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 98 111
* FAIL: 0 0
* Killed by signal: 11 1
* Bad configuration: 1 1
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

I traced the 7 additional (32-bit) killed by signal results to this
commit 4ed28639519c fs, elf: drop MAP_FIXED usage from elf_map.

libhugetlbfs does unusual things and even provides custom linker scripts.
So, in hindsight this change in behavior does not seem too unexpected. I
JUST discovered this while running libhugetlbfs tests for an unrelated
issue/change and, will do some analysis to see exactly what is happening.

Also, will take it upon myself to run libhugetlbfs test suite on a
regular (at least weekly) basis.
--
Mike Kravetz

2018-05-30 08:03:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Tue 29-05-18 15:21:14, Mike Kravetz wrote:
> Just a quick heads up. I noticed a change in libhugetlbfs testing starting
> with v4.17-rc1.
>
> V4.16 libhugetlbfs test results
> ********** TEST SUMMARY
> * 2M
> * 32-bit 64-bit
> * Total testcases: 110 113
> * Skipped: 0 0
> * PASS: 105 111
> * FAIL: 0 0
> * Killed by signal: 4 1
> * Bad configuration: 1 1
> * Expected FAIL: 0 0
> * Unexpected PASS: 0 0
> * Test not present: 0 0
> * Strange test result: 0 0
> **********
>
> v4.17-rc1 (and later) libhugetlbfs test results
> ********** TEST SUMMARY
> * 2M
> * 32-bit 64-bit
> * Total testcases: 110 113
> * Skipped: 0 0
> * PASS: 98 111
> * FAIL: 0 0
> * Killed by signal: 11 1
> * Bad configuration: 1 1
> * Expected FAIL: 0 0
> * Unexpected PASS: 0 0
> * Test not present: 0 0
> * Strange test result: 0 0
> **********
>
> I traced the 7 additional (32-bit) killed by signal results to this
> commit 4ed28639519c fs, elf: drop MAP_FIXED usage from elf_map.
>
> libhugetlbfs does unusual things and even provides custom linker scripts.
> So, in hindsight this change in behavior does not seem too unexpected. I
> JUST discovered this while running libhugetlbfs tests for an unrelated
> issue/change and, will do some analysis to see exactly what is happening.

I am definitely interested about further details. Are there any messages
in the kernel log?

--
Michal Hocko
SUSE Labs

2018-05-30 15:03:02

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On 05/30/2018 01:02 AM, Michal Hocko wrote:
> On Tue 29-05-18 15:21:14, Mike Kravetz wrote:
>> Just a quick heads up. I noticed a change in libhugetlbfs testing starting
>> with v4.17-rc1.
>>
>> V4.16 libhugetlbfs test results
>> ********** TEST SUMMARY
>> * 2M
>> * 32-bit 64-bit
>> * Total testcases: 110 113
>> * Skipped: 0 0
>> * PASS: 105 111
>> * FAIL: 0 0
>> * Killed by signal: 4 1
>> * Bad configuration: 1 1
>> * Expected FAIL: 0 0
>> * Unexpected PASS: 0 0
>> * Test not present: 0 0
>> * Strange test result: 0 0
>> **********
>>
>> v4.17-rc1 (and later) libhugetlbfs test results
>> ********** TEST SUMMARY
>> * 2M
>> * 32-bit 64-bit
>> * Total testcases: 110 113
>> * Skipped: 0 0
>> * PASS: 98 111
>> * FAIL: 0 0
>> * Killed by signal: 11 1
>> * Bad configuration: 1 1
>> * Expected FAIL: 0 0
>> * Unexpected PASS: 0 0
>> * Test not present: 0 0
>> * Strange test result: 0 0
>> **********
>>
>> I traced the 7 additional (32-bit) killed by signal results to this
>> commit 4ed28639519c fs, elf: drop MAP_FIXED usage from elf_map.
>>
>> libhugetlbfs does unusual things and even provides custom linker scripts.
>> So, in hindsight this change in behavior does not seem too unexpected. I
>> JUST discovered this while running libhugetlbfs tests for an unrelated
>> issue/change and, will do some analysis to see exactly what is happening.
>
> I am definitely interested about further details. Are there any messages
> in the kernel log?
>

Yes, new messages associated with the failures.

[ 47.570451] 1368 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
[ 47.606991] 1372 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
[ 47.641351] 1376 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
[ 47.726138] 1384 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
[ 47.773169] 1393 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
[ 47.817788] 1402 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
[ 47.857338] 1406 (xB.linkshare): Uhuuh, elf segment at 0000000018430471 requested but the memory is mapped already
[ 47.956355] 1427 (xB.linkshare): Uhuuh, elf segment at 0000000018430471 requested but the memory is mapped already
[ 48.054894] 1448 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
[ 48.071221] 1451 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already

Just curious, the addresses printed in those messages does not seem correct.
They should be page aligned. Correct? I think that %p conversion in the
pr_info() may doing something wrong.

Also, the new failures in question are indeed being built with custom linker
scripts designed for use with binutils older than 2.16 (really old). So, no
new users should encounter this issue (I think). It appears that this may
only impact old applications built long ago with pre-2.16 binutils.
--
Mike Kravetz

2018-05-30 16:25:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Wed 30-05-18 08:00:29, Mike Kravetz wrote:
> On 05/30/2018 01:02 AM, Michal Hocko wrote:
> > On Tue 29-05-18 15:21:14, Mike Kravetz wrote:
> >> Just a quick heads up. I noticed a change in libhugetlbfs testing starting
> >> with v4.17-rc1.
> >>
> >> V4.16 libhugetlbfs test results
> >> ********** TEST SUMMARY
> >> * 2M
> >> * 32-bit 64-bit
> >> * Total testcases: 110 113
> >> * Skipped: 0 0
> >> * PASS: 105 111
> >> * FAIL: 0 0
> >> * Killed by signal: 4 1
> >> * Bad configuration: 1 1
> >> * Expected FAIL: 0 0
> >> * Unexpected PASS: 0 0
> >> * Test not present: 0 0
> >> * Strange test result: 0 0
> >> **********
> >>
> >> v4.17-rc1 (and later) libhugetlbfs test results
> >> ********** TEST SUMMARY
> >> * 2M
> >> * 32-bit 64-bit
> >> * Total testcases: 110 113
> >> * Skipped: 0 0
> >> * PASS: 98 111
> >> * FAIL: 0 0
> >> * Killed by signal: 11 1
> >> * Bad configuration: 1 1
> >> * Expected FAIL: 0 0
> >> * Unexpected PASS: 0 0
> >> * Test not present: 0 0
> >> * Strange test result: 0 0
> >> **********
> >>
> >> I traced the 7 additional (32-bit) killed by signal results to this
> >> commit 4ed28639519c fs, elf: drop MAP_FIXED usage from elf_map.
> >>
> >> libhugetlbfs does unusual things and even provides custom linker scripts.
> >> So, in hindsight this change in behavior does not seem too unexpected. I
> >> JUST discovered this while running libhugetlbfs tests for an unrelated
> >> issue/change and, will do some analysis to see exactly what is happening.
> >
> > I am definitely interested about further details. Are there any messages
> > in the kernel log?
> >
>
> Yes, new messages associated with the failures.
>
> [ 47.570451] 1368 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
> [ 47.606991] 1372 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
> [ 47.641351] 1376 (xB.linkhuge_nof): Uhuuh, elf segment at 00000000a731413b requested but the memory is mapped already
> [ 47.726138] 1384 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
> [ 47.773169] 1393 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
> [ 47.817788] 1402 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
> [ 47.857338] 1406 (xB.linkshare): Uhuuh, elf segment at 0000000018430471 requested but the memory is mapped already
> [ 47.956355] 1427 (xB.linkshare): Uhuuh, elf segment at 0000000018430471 requested but the memory is mapped already
> [ 48.054894] 1448 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
> [ 48.071221] 1451 (xB.linkhuge): Uhuuh, elf segment at 0000000090b9eaf6 requested but the memory is mapped already
>
> Just curious, the addresses printed in those messages does not seem correct.
> They should be page aligned. Correct?

I have no idea what the loader actually does here.

> I think that %p conversion in the pr_info() may doing something wrong.

Well, we are using %px and that shouldn't do any tricks to the given
address.

> Also, the new failures in question are indeed being built with custom linker
> scripts designed for use with binutils older than 2.16 (really old). So, no
> new users should encounter this issue (I think). It appears that this may
> only impact old applications built long ago with pre-2.16 binutils.

Could you add a debugging data to dump the VMA which overlaps the
requested adress and who requested that? E.g. hook into do_mmap and dump
all requests from the linker.
--
Michal Hocko
SUSE Labs

2018-05-31 00:52:40

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On 05/30/2018 09:25 AM, Michal Hocko wrote:
> Could you add a debugging data to dump the VMA which overlaps the
> requested adress and who requested that? E.g. hook into do_mmap and dump
> all requests from the linker.

Here you go. I added a bunch of stuff as I clearly do not understand
how elf loading works. To me, the 'sections' parsed by the kernel code
do not seem to directly align with those produced by objdump.

[ 38.899260] load_elf_binary: attempting to load file ./tests/obj32/xB.linkhuge_nofd
[ 38.902340] dumping section headers
[ 38.903534] index 0 p_offset = 34
[ 38.904683] index 0 p_vaddr = 8048034
[ 38.905680] index 0 p_paddr = 8048034
[ 38.906442] index 0 p_filesz = 120
[ 38.907110] index 0 p_memsz = 120
[ 38.907764]
[ 38.908019] index 1 p_offset = 154
[ 38.908521] index 1 p_vaddr = 8048154
[ 38.909081] index 1 p_paddr = 8048154
[ 38.909496] index 1 p_filesz = 13
[ 38.909855] index 1 p_memsz = 13
[ 38.910453]
[ 38.910731] index 2 p_offset = 0
[ 38.911317] index 2 p_vaddr = 8048000
[ 38.911997] index 2 p_paddr = 8048000
[ 38.912590] index 2 p_filesz = 169c
[ 38.913141] index 2 p_memsz = 169c
[ 38.913713]
[ 38.913987] index 3 p_offset = 169c
[ 38.914518] index 3 p_vaddr = 804969c
[ 38.915101] index 3 p_paddr = 804969c
[ 38.915718] index 3 p_filesz = 1878
[ 38.916266] index 3 p_memsz = 1878
[ 38.916799]
[ 38.917032] index 4 p_offset = 3000
[ 38.917537] index 4 p_vaddr = 9000000
[ 38.918119] index 4 p_paddr = 9000000
[ 38.918709] index 4 p_filesz = 0
[ 38.919525] index 4 p_memsz = 10
[ 38.919993]
[ 38.920275] index 5 p_offset = 2d88
[ 38.920791] index 5 p_vaddr = 804ad88
[ 38.921307] index 5 p_paddr = 804ad88
[ 38.921800] index 5 p_filesz = 18c
[ 38.922288] index 5 p_memsz = 18c
[ 38.922739]
[ 38.922973] index 6 p_offset = 168
[ 38.923431] index 6 p_vaddr = 8048168
[ 38.923946] index 6 p_paddr = 8048168
[ 38.924457] index 6 p_filesz = 44
[ 38.924931] index 6 p_memsz = 44
[ 38.925414]
[ 38.925593] index 7 p_offset = 0
[ 38.926031] index 7 p_vaddr = 0
[ 38.926510] index 7 p_paddr = 0
[ 38.926957] index 7 p_filesz = 0
[ 38.927443] index 7 p_memsz = 0
[ 38.927879]
[ 38.928115] index 8 p_offset = 169c
[ 38.928594] index 8 p_vaddr = 804969c
[ 38.929091] index 8 p_paddr = 804969c
[ 38.929646] index 8 p_filesz = 8c
[ 38.930177] index 8 p_memsz = 8c
[ 38.930710]
[ 38.931497] load_elf_binary: skipping index 0 p_vaddr = 8048034
[ 38.932321] load_elf_binary: skipping index 1 p_vaddr = 8048154
[ 38.933165] load_elf_binary: calling elf_map() index 2 bias 0 vaddr 8048000
[ 38.934087] map_addr ELF_PAGESTART(addr) 8048000 total_size 0 ELF_PAGEALIGN(size) 2000
[ 38.935101] eppnt->p_offset = 0
[ 38.935561] eppnt->p_vaddr = 8048000
[ 38.936073] eppnt->p_paddr = 8048000
[ 38.936897] eppnt->p_filesz = 169c
[ 38.937493] eppnt->p_memsz = 169c
[ 38.938042] load_elf_binary: calling elf_map() index 3 bias 0 vaddr 804969c
[ 38.939002] map_addr ELF_PAGESTART(addr) 8049000 total_size 0 ELF_PAGEALIGN(size) 2000
[ 38.939959] eppnt->p_offset = 169c
[ 38.940410] eppnt->p_vaddr = 804969c
[ 38.940897] eppnt->p_paddr = 804969c
[ 38.941507] eppnt->p_filesz = 1878
[ 38.942019] eppnt->p_memsz = 1878
[ 38.942516] 1123 (xB.linkhuge_nof): Uhuuh, elf segment at 8049000 requested but the memory is mapped already

It is pretty easy to see the mmap conflict. I'm still trying to determine if
the executable file is 'valid'. It did not throw an error previously as
MAP_FIXED unmapped the overlapping page. However, this does not seem right.
--
Mike Kravetz

2018-05-31 09:26:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Wed 30-05-18 17:51:15, Mike Kravetz wrote:
[...]
> [ 38.931497] load_elf_binary: skipping index 0 p_vaddr = 8048034
> [ 38.932321] load_elf_binary: skipping index 1 p_vaddr = 8048154
> [ 38.933165] load_elf_binary: calling elf_map() index 2 bias 0 vaddr 8048000
> [ 38.934087] map_addr ELF_PAGESTART(addr) 8048000 total_size 0 ELF_PAGEALIGN(size) 2000
> [ 38.935101] eppnt->p_offset = 0
> [ 38.935561] eppnt->p_vaddr = 8048000
> [ 38.936073] eppnt->p_paddr = 8048000
> [ 38.936897] eppnt->p_filesz = 169c
> [ 38.937493] eppnt->p_memsz = 169c
> [ 38.938042] load_elf_binary: calling elf_map() index 3 bias 0 vaddr 804969c
> [ 38.939002] map_addr ELF_PAGESTART(addr) 8049000 total_size 0 ELF_PAGEALIGN(size) 2000
> [ 38.939959] eppnt->p_offset = 169c
> [ 38.940410] eppnt->p_vaddr = 804969c
> [ 38.940897] eppnt->p_paddr = 804969c
> [ 38.941507] eppnt->p_filesz = 1878
> [ 38.942019] eppnt->p_memsz = 1878
> [ 38.942516] 1123 (xB.linkhuge_nof): Uhuuh, elf segment at 8049000 requested but the memory is mapped already
>
> It is pretty easy to see the mmap conflict. I'm still trying to determine if
> the executable file is 'valid'. It did not throw an error previously as
> MAP_FIXED unmapped the overlapping page. However, this does not seem right.

Yes, it looks suspicious to say the least. How come the original content
is not needed anymore? Maybe the first section should be 0x1000 rather
than 0x169c?

I am not an expert on the load linkers myself so I cannot really answer
this question. Please note that ppc had something similar. See
ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").
Maybe we need to sprinkle more of those at other places?
--
Michal Hocko
SUSE Labs

2018-05-31 21:47:07

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On 05/31/2018 02:24 AM, Michal Hocko wrote:
> I am not an expert on the load linkers myself so I cannot really answer
> this question. Please note that ppc had something similar. See
> ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").
> Maybe we need to sprinkle more of those at other places?

I finally understand the issue, and it is NOT a problem with the kernel.
The issue is with old libhugetlbfs provided linker scripts, and yes,
starting with v4.17 people who run libhugetlbfs tests on x86 (at least)
will see additional failures.

I'll try to work this from the libhugetlbfs side. In the unlikely event
that anyone knows about those linker scripts, assistance and/or feedback
would be appreciated.

Read on only if you want additional details about this failure.

The executable files which are now failing are created with the elf_i386.xB
linker script. This script is provided for pre-2.17 versions of binutils.
binutils-2.17 came out aprox in 2007, and this script is disabled by default
if binutils-2.17 or later is used. The only way to create executables with
this script today is by setting the HUGETLB_DEPRECATED_LINK env variable.
This is what libhugetlbfs tests do to simply continue testing the old scripts.

I previously was mistaken about which tests were causing the additional
failures. The example I previously provided failed on v4.16 as well as
v4.17-rc kernels. So, please ignore that information.

For an executable that runs on v4.16 and fails on v4.17-rc, here is a listing
of elf sections that the kernel will attempt to load.

Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x08048000 0x08048000 0x11c24 0x11c24 R E 0x1000
LOAD 0x011c24 0x08059c24 0x08059c24 0x10d04 0x10d04 RW 0x1000
LOAD 0x023000 0x09000000 0x09000000 0x00000 0x10048 RWE 0x1000

The first section is loaded without issue. elf_map() will create a vma
based on the following:
map_addr ELF_PAGESTART(addr) 8048000 ELF_PAGEALIGN(size) 12000
File_offset 0

We then attempt to load the following section with:
map_addr ELF_PAGESTART(addr) 8059000 ELF_PAGEALIGN(size) 12000
File_offset 11000

This results in,
Uhuuh, elf segment at 8059000 requested but the memory is mapped already

Note that the last page of the first section overlaps with the first page
of the second section. Unlike the case in ad55eac74f20, the access
permissions on section 1 (RE) are different than section 2 (RW). If we
allowed the previous MAP_FIXED behavior, we would be changing part of a
read only section to read write. This is exactly what MAP_FIXED_NOREPLACE
was designed to prevent.
--
Mike Kravetz

2017-11-29 17:47:41

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On 11/29/2017 07:42 AM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
>
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
>
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
>
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
>
> Cc: Abdul Haleem <[email protected]>
> Cc: Joel Stanley <[email protected]>
> Acked-by: Kees Cook <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---

Reviewed-by: Khalid Aziz <[email protected]>


From 1585412000925815567@xxx Wed Nov 29 14:45:06 +0000 2017
X-GM-THRID: 1585412000925815567
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread

2017-11-17 09:38:28

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map

On Thu, Nov 16, 2017 at 2:19 AM, Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
>
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
>
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
>
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
>
> Cc: Abdul Haleem <[email protected]>
> Cc: Joel Stanley <[email protected]>
> Cc: Kees Cook <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Once (if?) the name gets settled, this looks good to me:

Acked-by: Kees Cook <[email protected]>

-Kees

> ---
> arch/metag/kernel/process.c | 6 +++++-
> fs/binfmt_elf.c | 12 ++++++++----
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> index c4606ce743d2..2286140e54e0 100644
> --- a/arch/metag/kernel/process.c
> +++ b/arch/metag/kernel/process.c
> @@ -398,7 +398,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> tcm_tag = tcm_lookup_tag(addr);
>
> if (tcm_tag != TCM_INVALID_TAG)
> - type &= ~MAP_FIXED;
> + type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
>
> /*
> * total_size is the size of the ELF (interpreter) image.
> @@ -416,6 +416,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> + task_pid_nr(current), tsk->comm, (void*)addr);
> +
> if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> struct tcm_allocation *tcm;
> unsigned long tcm_addr;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 6466153f2bf0..12b21942ccde 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> } else
> map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> + if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> + pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> + task_pid_nr(current), current->comm, (void*)addr);
> +
> return(map_addr);
> }
>
> @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> elf_prot |= PROT_EXEC;
> vaddr = eppnt->p_vaddr;
> if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> - elf_type |= MAP_FIXED;
> + elf_type |= MAP_FIXED_SAFE;
> else if (no_base && interp_elf_ex->e_type == ET_DYN)
> load_addr = -vaddr;
>
> @@ -929,7 +933,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> * the ET_DYN load_addr calculations, proceed normally.
> */
> if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> - elf_flags |= MAP_FIXED;
> + elf_flags |= MAP_FIXED_SAFE;
> } else if (loc->elf_ex.e_type == ET_DYN) {
> /*
> * This logic is run once for the first LOAD Program
> @@ -965,7 +969,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> load_bias = ELF_ET_DYN_BASE;
> if (current->flags & PF_RANDOMIZE)
> load_bias += arch_mmap_rnd();
> - elf_flags |= MAP_FIXED;
> + elf_flags |= MAP_FIXED_SAFE;
> } else
> load_bias = 0;
>
> @@ -1220,7 +1224,7 @@ static int load_elf_library(struct file *file)
> (eppnt->p_filesz +
> ELF_PAGEOFFSET(eppnt->p_vaddr)),
> PROT_READ | PROT_WRITE | PROT_EXEC,
> - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> + MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> (eppnt->p_offset -
> ELF_PAGEOFFSET(eppnt->p_vaddr)));
> if (error != ELF_PAGESTART(eppnt->p_vaddr))
> --
> 2.15.0
>



--
Kees Cook
Pixel Security

From 1584222617397004653@xxx Thu Nov 16 11:40:21 +0000 2017
X-GM-THRID: 1584222617397004653
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread