2017-08-11 19:19:53

by Rik van Riel

Subject: [PATCH v3 0/2] mm,fork,security: introduce MADV_WIPEONFORK

v3: simplify implementation, limit to anonymous, private mappings
v2: fix kbuild warnings

Remaining question: should this be under madvise (like MADV_DONTDUMP,
MADV_DONTFORK, etc.) or should we implement a minherit syscall? Linus?


Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.
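
A minimal userspace sketch of the behavior described above, assuming a
kernel that carries this patch. The helper name is hypothetical, and the
fallback value 18 for MADV_WIPEONFORK is the generic value proposed in
this series (parisc uses a different number).

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* assumption: generic value from this series */
#endif

/*
 * Hypothetical helper: returns 1 if the child saw zeroes, 0 if it saw
 * the parent's data, -1 if setup failed or the kernel rejected the advice.
 */
int wipeonfork_child_sees_zeroes(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status, ret = -1;
	pid_t pid;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);	/* stand-in for cached or secret state */

	if (madvise(p, len, MADV_WIPEONFORK) != 0)
		goto out;	/* e.g. a kernel without this patch */

	pid = fork();
	if (pid < 0)
		goto out;
	if (pid == 0) {
		/* Child: the range is still mapped, so this read must not fault. */
		_exit((p[0] == 0 && p[len - 1] == 0) ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = (WEXITSTATUS(status) == 0) ? 1 : 0;
out:
	munmap(p, len);
	return ret;
}
```

The parent's copy of the page is untouched; only the child starts from
an empty mapping.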

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.
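
For contrast, a small sketch of the MADV_DONTFORK case using the
existing madvise interface. The helper name is hypothetical; the child
is expected to die with SIGSEGV on its first access.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the child was killed by SIGSEGV when
 * touching a MADV_DONTFORK range, 0 if it survived, -1 on setup failure.
 */
int dontfork_child_segfaults(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status, ret = -1;
	pid_t pid;

	if (p == MAP_FAILED)
		return -1;

	memset((void *)p, 0xaa, len);

	if (madvise((void *)p, len, MADV_DONTFORK) != 0)
		goto out;

	pid = fork();
	if (pid < 0)
		goto out;
	if (pid == 0) {
		/* Child: the range was not inherited, so this read faults. */
		unsigned char c = p[0];
		_exit(c == 0xaa ? 0 : 2);
	}
	if (waitpid(pid, &status, 0) == pid)
		ret = (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
out:
	munmap((void *)p, len);
	return ret;
}
```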

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
(replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
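
The pattern such a library might use can be sketched as follows. All
names here are hypothetical, the "reseed" is a placeholder counter
rather than real entropy gathering, and the fallback value 18 for
MADV_WIPEONFORK is the generic value proposed in this series.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* assumption: generic value from this series */
#endif

/* Hypothetical library state. */
static volatile unsigned int *fork_guard;	/* nonzero while state is valid */
static unsigned int reseed_count;		/* placeholder for real PRNG state */

/* Placeholder: a real library would fetch fresh entropy here. */
static void lib_reseed(void)
{
	reseed_count++;
}

/* Returns 0 on success, -1 if the guard page could not be set up. */
int lib_init(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	if (madvise(p, len, MADV_WIPEONFORK) != 0) {
		munmap(p, len);
		return -1;
	}
	fork_guard = p;
	return 0;
}

/*
 * Hypothetical API entry point. Returns how many times this process's
 * copy of the state has been reseeded. No getpid() call is needed: a
 * zeroed guard means "first use" or "we are a freshly forked child".
 */
unsigned int lib_reseed_count(void)
{
	if (*fork_guard == 0) {
		lib_reseed();
		*fork_guard = 1;
	}
	return reseed_count;
}
```

The check on every entry is a single memory read, which is the point of
replacing the getpid comparison.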

The security benefits of a forking server having a re-initialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

https://man.openbsd.org/minherit.2


2017-08-11 19:19:54

by Rik van Riel

Subject: [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag

From: Rik van Riel <[email protected]>

MPX only seems to be available on 64 bit CPUs, starting with Skylake
and Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags,
in order to free up a VMA flag.
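
As a rough illustration of why the high flag bits need a 64-bit build:
the sketch below assumes the flags word has the width of unsigned long
(32 bits on 32-bit architectures) and uses hypothetical helper names. It
also shows that a bit index such as 36 and the mask built from it are
different values.

```c
#include <stdint.h>

#define VM_HIGH_ARCH_BIT_4	36	/* bit index, as in the hunk below */

/* Hypothetical helper: build the flag mask for a given bit index. */
uint64_t flag_mask(unsigned int bit)
{
	return UINT64_C(1) << bit;
}

/* Hypothetical helper: would this mask fit in a 32-bit flags word? */
int fits_in_32bit_flags(uint64_t mask)
{
	return mask <= UINT32_MAX;
}
```

With these helpers, flag_mask(VM_HIGH_ARCH_BIT_4) does not fit in 32
bits, while the low flag 0x02000000 that this patch frees up does.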

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
arch/x86/Kconfig | 4 +++-
include/linux/mm.h | 8 ++++++--
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b7cf9e..6dff14fadc6f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1756,7 +1756,9 @@ config X86_SMAP
config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
- depends on CPU_SUP_INTEL
+ # Note: only available in 64-bit mode due to VMA flags shortage
+ depends on CPU_SUP_INTEL && X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS
---help---
MPX provides hardware features that can be used in
conjunction with compiler-instrumented code to check
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..7550eeb06ccf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,10 +208,12 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#if defined(CONFIG_X86)
@@ -235,9 +237,11 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */
#endif

-#if defined(CONFIG_X86)
+#if defined(CONFIG_X86_INTEL_MPX)
/* MPX specific bounds table or bounds directory */
-# define VM_MPX VM_ARCH_2
+# define VM_MPX VM_HIGH_ARCH_4
+#else
+# define VM_MPX VM_NONE
#endif

#ifndef VM_GROWSUP
--
2.9.4

2017-08-11 19:20:15

by Rik van Riel

Subject: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

From: Rik van Riel <[email protected]>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

MADV_WIPEONFORK only works on private, anonymous VMAs.
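
A short sketch of what that restriction looks like from userspace,
assuming the check in the madvise.c hunk below. The helper name is
hypothetical and the fallback value 18 is the generic value proposed in
this series.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* assumption: generic value from this series */
#endif

/*
 * Hypothetical helper: try MADV_WIPEONFORK on a shared anonymous mapping.
 * Returns the errno from madvise, 0 if the call unexpectedly succeeded,
 * or -1 if the mapping could not be created.
 */
int wipeonfork_errno_on_shared(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int err;

	if (p == MAP_FAILED)
		return -1;

	/* With this patch applied, the expected result is EINVAL. */
	err = (madvise(p, len, MADV_WIPEONFORK) == 0) ? 0 : errno;

	munmap(p, len);
	return err;
}
```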

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
(replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-initialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.
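
The missed-hook problem can be sketched like this. The helper names are
hypothetical, and the raw clone call assumes the common argument order
(flags first, then stack), which holds on x86-64 and arm64 but not on
every architecture.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int child_hook_ran;

static void child_hook(void)
{
	child_hook_ran = 1;
}

/* Register the child-side atfork handler; returns 0 on success. */
int install_child_hook(void)
{
	return pthread_atfork(NULL, NULL, child_hook);
}

/* In the child, report whether the hook ran; in the parent, collect it. */
static int report_hook_state(pid_t pid)
{
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(child_hook_ran);
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* fork() goes through glibc, so the handler runs in the child. */
int hook_ran_after_fork(void)
{
	return report_hook_state(fork());
}

/* A raw clone bypasses glibc, so the handler is never called. */
int hook_ran_after_raw_clone(void)
{
	long pid = syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L);

	return report_hook_state((pid_t)pid);
}
```

After install_child_hook(), hook_ran_after_fork() is expected to return
1 and hook_ran_after_raw_clone() to return 0.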

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <[email protected]>
Reported-by: Colm MacCártaigh <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 3 +++
arch/mips/include/uapi/asm/mman.h | 3 +++
arch/parisc/include/uapi/asm/mman.h | 3 +++
arch/xtensa/include/uapi/asm/mman.h | 3 +++
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 2 +-
include/trace/events/mmflags.h | 8 +-------
include/uapi/asm-generic/mman-common.h | 3 +++
kernel/fork.c | 1 +
mm/madvise.c | 13 +++++++++++++
mm/memory.c | 10 ++++++++++
11 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 70 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 71 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0
#define MAP_VARIABLE 0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_NORESERVE)] = "nr",
[ilog2(VM_HUGETLB)] = "ht",
[ilog2(VM_ARCH_1)] = "ar",
+ [ilog2(VM_WIPEONFORK)] = "wf",
[ilog2(VM_DONTDUMP)] = "dd",
#ifdef CONFIG_MEM_SOFT_DIRTY
[ilog2(VM_SOFTDIRTY)] = "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
-#define VM_ARCH_2 0x02000000
+#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */

#ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8e50d01c645f..4c2e4737d7bc 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,12 +125,6 @@ IF_HAVE_PG_IDLE(PG_idle, "idle" )
#define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1, "arch_1" }
#endif

-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_2 {VM_MPX, "mpx" }
-#else
-#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
-#endif
-
#ifdef CONFIG_MEM_SOFT_DIRTY
#define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
#else
@@ -162,7 +156,7 @@ IF_HAVE_PG_IDLE(PG_idle, "idle" )
{VM_NORESERVE, "noreserve" }, \
{VM_HUGETLB, "hugetlb" }, \
__VM_ARCH_SPECIFIC_1 , \
- __VM_ARCH_SPECIFIC_2 , \
+ {VM_WIPEONFORK, "wipeonfork" }, \
{VM_DONTDUMP, "dontdump" }, \
IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
{VM_MIXEDMAP, "mixedmap" }, \
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_DONTDUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..74be75373ee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -659,6 +659,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
tmp->vm_next = tmp->vm_prev = NULL;
file = tmp->vm_file;
+
if (file) {
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9b82cfa88ccf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
}
new_flags &= ~VM_DONTCOPY;
break;
+ case MADV_WIPEONFORK:
+ /* MADV_WIPEONFORK is only supported on anonymous memory. */
+ if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags |= VM_WIPEONFORK;
+ break;
+ case MADV_KEEPONFORK:
+ new_flags &= ~VM_WIPEONFORK;
+ break;
case MADV_DONTDUMP:
new_flags |= VM_DONTDUMP;
break;
@@ -689,6 +700,8 @@ madvise_behavior_valid(int behavior)
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
+ case MADV_WIPEONFORK:
+ case MADV_KEEPONFORK:
#ifdef CONFIG_MEMORY_FAILURE
case MADV_SOFT_OFFLINE:
case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
!vma->anon_vma)
return 0;

+ /*
+ * With VM_WIPEONFORK, the child inherits the VMA from the
+ * parent, but not its contents.
+ *
+ * A child accessing VM_WIPEONFORK memory will see all zeroes;
+ * a child accessing VM_DONTCOPY memory receives a segfault.
+ */
+ if (vma->vm_flags & VM_WIPEONFORK)
+ return 0;
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);

--
2.9.4

2017-08-11 19:42:41

by Linus Torvalds

Subject: Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

On Fri, Aug 11, 2017 at 12:19 PM, <[email protected]> wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 0e517be91a89..f9b0ad7feb57 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> !vma->anon_vma)
> return 0;
>
> + /*
> + * With VM_WIPEONFORK, the child inherits the VMA from the
> + * parent, but not its contents.
> + *
> + * A child accessing VM_WIPEONFORK memory will see all zeroes;
> + * a child accessing VM_DONTCOPY memory receives a segfault.
> + */
> + if (vma->vm_flags & VM_WIPEONFORK)
> + return 0;
> +

Is this right?

Yes, you don't do the page table copies. Fine. But you leave vma with
the anon_vma pointer - doesn't that mean that it's still connected
to the original anonvma chain, and we might end up swapping something
in?

And even if that ends up not being an issue, I'd expect that you'd
want to break the anon_vma chain just to not make it grow
unnecessarily.

So my gut feel is that doing this in "copy_page_range()" is wrong, and
the logic should be moved up to dup_mmap(), where we can also
short-circuit the anon_vma chain entirely.

No?

The madvise() interface looks fine to me.

Linus

2017-08-11 20:27:50

by Rik van Riel

Subject: Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

On Fri, 2017-08-11 at 12:42 -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 12:19 PM,  <[email protected]> wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e517be91a89..f9b0ad7feb57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct
> > *dst_mm, struct mm_struct *src_mm,
> >                         !vma->anon_vma)
> >                 return 0;
> >
> > +       /*
> > +        * With VM_WIPEONFORK, the child inherits the VMA from the
> > +        * parent, but not its contents.
> > +        *
> > +        * A child accessing VM_WIPEONFORK memory will see all
> > zeroes;
> > +        * a child accessing VM_DONTCOPY memory receives a
> > segfault.
> > +        */
> > +       if (vma->vm_flags & VM_WIPEONFORK)
> > +               return 0;
> > +
>
> Is this right?
>
> Yes, you don't do the page table copies. Fine. But you leave vma with
> the anon_vma pointer - doesn't that mean that it's still
> connected
> to the original anonvma chain, and we might end up swapping something
> in?

Swapping something in would require there to be a swap entry in
the page table entries, which we are not copying, so this should
not be a correctness issue.

> And even if that ends up not being an issue, I'd expect that you'd
> want to break the anon_vma chain just to not make it grow
> unnecessarily.

This is a good point. I can send a v4 that skips the anon_vma_fork()
call if VM_WIPEONFORK, and calls anon_vma_prepare(), instead.

> So my gut feel is that doing this in "copy_page_range()" is wrong,
> and
> the logic should be moved up to dup_mmap(), where we can also
> short-circuit the anon_vma chain entirely.
>
> No?

There is another test in copy_page_range already which ends up
skipping the page table copy when it should not be done.

If you want, I can move that test into a should_copy_page_range()
function, and call that from dup_mmap(), skipping the call to
copy_page_range() if should_copy_page_range() returns false.

Having only one of the two sets of tests in dup_mmap(), and
the other in copy_page_range() seems wrong.

Just let me know what you prefer, and I'll put that in v4.

> The madvise() interface looks fine to me.

That was the main reason for adding you to the thread :)

kind regards,

Rik

2017-08-11 20:50:10

by Linus Torvalds

Subject: Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

On Fri, Aug 11, 2017 at 1:27 PM, Rik van Riel <[email protected]> wrote:
>>
>> Yes, you don't do the page table copies. Fine. But you leave vma with
>> the anon_vma pointer - doesn't that mean that it's still
>> connected
>> to the original anonvma chain, and we might end up swapping something
>> in?
>
> Swapping something in would require there to be a swap entry in
> the page table entries, which we are not copying, so this should
> not be a correctness issue.

Yeah, I thought the rmap code just used the offset from the start to
avoid even doing swap entries, but I guess we don't actually ever
populate the page tables without the swap entry being there.

> There is another test in copy_page_range already which ends up
> skipping the page table copy when it should not be done.

Well, the VM_DONTCOPY test is in dup_mmap(), and I think I'd rather
match up the VM_WIPEONFORK logic with VM_DONTCOPY than with the
copy_page_range() tests.

Because I assume you are talking about the "if it's a shared mapping,
we don't need to copy the page tables and can just do it at page fault
time instead" part? That's a rather different thing, which isn't so
much about semantics, as about just a trade-off about when to touch
the page tables.

But yes, that one *might* make sense in dup_mmap() too. I just don't
think it's really analogous to the WIPEONFORK and DONTCOPY tests.
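
To make the placement being discussed concrete, here is a userspace toy
model, not kernel code. Every name in it is hypothetical, and the
"chain depth" counter is only a stand-in for the anon_vma chain. It
handles both flags at the point where the VMA is duplicated: a DONTCOPY
range is dropped, and a WIPEONFORK range is kept but starts with empty
contents and a fresh chain.

```c
#include <string.h>

#define TOY_DONTCOPY	0x1u
#define TOY_WIPEONFORK	0x2u
#define TOY_VMA_SIZE	16

/* Toy stand-in for a VMA; not the kernel's vm_area_struct. */
struct toy_vma {
	unsigned int flags;
	int present;		/* 0 if the range does not exist here */
	int anon_chain_depth;	/* stand-in for the anon_vma chain length */
	unsigned char data[TOY_VMA_SIZE];
};

/* Toy model of duplicating one VMA at fork time. */
void toy_dup_vma(const struct toy_vma *parent, struct toy_vma *child)
{
	memset(child, 0, sizeof(*child));

	/* DONTCOPY: the range is simply absent in the child. */
	if (!parent->present || (parent->flags & TOY_DONTCOPY))
		return;

	child->present = 1;
	child->flags = parent->flags;

	/* WIPEONFORK: keep the range, skip contents and chain linkage. */
	if (parent->flags & TOY_WIPEONFORK) {
		child->anon_chain_depth = 1;
		return;
	}

	/* Default: link to the parent's chain and inherit the contents. */
	child->anon_chain_depth = parent->anon_chain_depth + 1;
	memcpy(child->data, parent->data, TOY_VMA_SIZE);
}
```

In this model both flag checks sit side by side, which is the symmetry
argued for above; the shared-mapping shortcut is left out on purpose.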

Linus