2017-08-04 19:01:18

by Rik van Riel

Subject: [PATCH 0/2] mm,fork: MADV_WIPEONFORK - an empty VMA in the child

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid; they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.
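
The difference can be observed from user space with a small test. The following is only a sketch, not part of the patch: child_outcome() is a made-up helper name, and the fallback value 18 for MADV_WIPEONFORK is the asm-generic number proposed in this series (parisc uses a different one).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* assumption: asm-generic value from this series */
#endif

/*
 * Hypothetical helper: map one anonymous page, fill it with 0xAA,
 * apply `advice`, fork, and report what the child saw at offset 0.
 *
 * Returns 0 if the child read zero, 1 if it read the parent's data,
 * 2 if it was killed by SIGSEGV, and -1 on setup failure (including
 * a kernel that rejects the advice value).
 */
int child_outcome(int advice)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int status, result = -1;
	pid_t pid;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xAA, len);

	if (madvise(p, len, advice) != 0) {
		munmap(p, len);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(p, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: a single read decides the outcome. */
		volatile unsigned char *q = p;
		_exit(*q == 0 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) == pid) {
		if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
			result = 2;
		else if (WIFEXITED(status))
			result = WEXITSTATUS(status);
	}
	munmap(p, len);
	return result;
}
```

On a kernel with this series applied, child_outcome(MADV_WIPEONFORK) should report a zero read, while child_outcome(MADV_DONTFORK) should report the child dying with SIGSEGV. The parent's copy of the page is not affected in either case.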

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
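
As a rough illustration of how a library could use this, here is a minimal sketch of a fork "canary" kept in a wipe-on-fork page. None of these names come from the patch or from any of the libraries listed above; fork_canary_init() and fork_canary_tripped() are hypothetical, and the fallback define assumes the asm-generic value proposed in this series.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* assumption: asm-generic value from this series */
#endif

/*
 * Hypothetical library state: the first word of a wipe-on-fork page.
 * It is nonzero only in the process that last armed it.
 */
static volatile unsigned int *fork_canary;

/* Returns 0 on success, -1 if the mapping or the advice is unavailable. */
int fork_canary_init(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	if (madvise(p, len, MADV_WIPEONFORK) != 0) {
		munmap(p, len);
		return -1;
	}
	fork_canary = p;
	*fork_canary = 1;	/* arm the canary in this process */
	return 0;
}

/*
 * Returns 1 the first time it is called in a process that inherited a
 * wiped canary, and 0 otherwise. The canary is re-armed so that the
 * caller regenerates its state only once per fork.
 */
int fork_canary_tripped(void)
{
	if (fork_canary == NULL || *fork_canary != 0)
		return 0;
	*fork_canary = 1;
	return 1;
}
```

A PRNG would call fork_canary_tripped() at the top of its output function and reseed when it returns 1. Unlike a pthread_atfork hook, the check does not depend on how the child was created, and unlike a getpid comparison it costs one memory load.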

The security benefits of a forking server having a re-initialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

https://man.openbsd.org/minherit.2



2017-08-04 19:01:20

by Rik van Riel

Subject: [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag

From: Rik van Riel <[email protected]>

MPX only seems to be available on 64 bit CPUs, starting with Skylake
and Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags,
in order to free up a VMA flag.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/Kconfig | 4 +++-
include/linux/mm.h | 8 ++++++--
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b7cf9e..6dff14fadc6f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1756,7 +1756,9 @@ config X86_SMAP
config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
- depends on CPU_SUP_INTEL
+ # Note: only available in 64-bit mode due to VMA flags shortage
+ depends on CPU_SUP_INTEL && X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS
---help---
MPX provides hardware features that can be used in
conjunction with compiler-instrumented code to check
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..7550eeb06ccf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,10 +208,12 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#if defined(CONFIG_X86)
@@ -235,9 +237,11 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */
#endif

-#if defined(CONFIG_X86)
+#if defined(CONFIG_X86_INTEL_MPX)
/* MPX specific bounds table or bounds directory */
-# define VM_MPX VM_ARCH_2
+# define VM_MPX VM_HIGH_ARCH_4
+#else
+# define VM_MPX VM_NONE
#endif

#ifndef VM_GROWSUP
--
2.9.4
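
For readers unfamiliar with the VM_HIGH_ARCH_* pairs touched above: VM_HIGH_ARCH_BIT_n is a bit index, and VM_HIGH_ARCH_n is the mask built from it with BIT(). The following user-space model (not kernel code) shows why a flag at bit 36 needs a 64-bit vm_flags. BIT64() and mask_fits() are made-up names used only for this illustration.

```c
#include <stdint.h>

/* Stand-in for the kernel's BIT() macro, fixed at 64 bits here. */
#define BIT64(n)	((uint64_t)1 << (n))

#define HIGH_ARCH_BIT_4	36			/* bit index */
#define HIGH_ARCH_4	BIT64(HIGH_ARCH_BIT_4)	/* mask for that bit */

/*
 * Returns 1 if `mask` can be stored in a flags word that is `width`
 * bits wide without losing any set bit, 0 otherwise.
 */
int mask_fits(uint64_t mask, unsigned int width)
{
	if (width >= 64)
		return 1;
	return (mask >> width) == 0;
}
```

With these definitions, mask_fits(HIGH_ARCH_4, 32) is 0 and mask_fits(HIGH_ARCH_4, 64) is 1. That is the reason for the new X86_64 dependency, since vm_flags is an unsigned long. Note also that the index 36 (0x24) and the mask 0x1000000000 are very different values, so only the mask form is usable as a flag.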

2017-08-04 19:01:44

by Rik van Riel

Subject: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

From: Rik van Riel <[email protected]>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid; they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-initialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <[email protected]>
Reported-by: Colm MacCártaigh <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 3 +++
arch/mips/include/uapi/asm/mman.h | 3 +++
arch/parisc/include/uapi/asm/mman.h | 3 +++
arch/xtensa/include/uapi/asm/mman.h | 3 +++
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 2 +-
include/uapi/asm-generic/mman-common.h | 3 +++
kernel/fork.c | 8 ++++++--
mm/madvise.c | 8 ++++++++
mm/memory.c | 10 ++++++++++
10 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 70 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 71 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0
#define MAP_VARIABLE 0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_NORESERVE)] = "nr",
[ilog2(VM_HUGETLB)] = "ht",
[ilog2(VM_ARCH_1)] = "ar",
+ [ilog2(VM_WIPEONFORK)] = "wf",
[ilog2(VM_DONTDUMP)] = "dd",
#ifdef CONFIG_MEM_SOFT_DIRTY
[ilog2(VM_SOFTDIRTY)] = "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
-#define VM_ARCH_2 0x02000000
+#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */

#ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_DONTDUMP flag */

+#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..2dd0d0cae3bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -628,7 +628,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,

prev = NULL;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
- struct file *file;
+ struct file *file = NULL;

if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
@@ -658,7 +658,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
tmp->vm_next = tmp->vm_prev = NULL;
- file = tmp->vm_file;
+
+ /* With VM_WIPEONFORK, the child gets an empty VMA. */
+ if (!(tmp->vm_flags & VM_WIPEONFORK))
+ file = tmp->vm_file;
+
if (file) {
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9e644c0ed4dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,12 @@ static long madvise_behavior(struct vm_area_struct *vma,
}
new_flags &= ~VM_DONTCOPY;
break;
+ case MADV_WIPEONFORK:
+ new_flags |= VM_WIPEONFORK;
+ break;
+ case MADV_KEEPONFORK:
+ new_flags &= ~VM_WIPEONFORK;
+ break;
case MADV_DONTDUMP:
new_flags |= VM_DONTDUMP;
break;
@@ -689,6 +695,8 @@ madvise_behavior_valid(int behavior)
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
+ case MADV_WIPEONFORK:
+ case MADV_KEEPONFORK:
#ifdef CONFIG_MEMORY_FAILURE
case MADV_SOFT_OFFLINE:
case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
!vma->anon_vma)
return 0;

+ /*
+ * With VM_WIPEONFORK, the child inherits the VMA from the
+ * parent, but not its contents.
+ *
+ * A child accessing VM_WIPEONFORK memory will see all zeroes;
+ * a child accessing VM_DONTCOPY memory receives a segfault.
+ */
+ if (vma->vm_flags & VM_WIPEONFORK)
+ return 0;
+
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);

--
2.9.4
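
The fs/proc/task_mmu.c hunk above makes the new flag visible as "wf" on the VmFlags line of /proc/<pid>/smaps. A test program could use that to confirm that the advice was recorded on a mapping. The helper below is a sketch under that assumption; vma_has_flag() is a hypothetical name and the parsing is deliberately simple.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up the mapping containing `addr` in
 * /proc/self/smaps and check its VmFlags line for the two-letter
 * mnemonic `flag` (for example "wf").
 *
 * Returns 1 if the flag is listed, 0 if it is not, and -1 if smaps is
 * unavailable or the mapping has no VmFlags line.
 */
int vma_has_flag(const void *addr, const char *flag)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	uintptr_t target = (uintptr_t)addr;
	char line[512];
	int in_vma = 0;
	int result = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;
		char *tok;

		/* A mapping header looks like "start-end perms offset ...". */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = (target >= start && target < end);
			continue;
		}
		if (!in_vma || strncmp(line, "VmFlags:", 8) != 0)
			continue;

		/* Compare whole tokens so "wf" cannot match inside another name. */
		result = 0;
		for (tok = strtok(line + 8, " \n"); tok; tok = strtok(NULL, " \n")) {
			if (strcmp(tok, flag) == 0) {
				result = 1;
				break;
			}
		}
		break;
	}

	fclose(f);
	return result;
}
```

After a successful madvise(p, len, MADV_WIPEONFORK), vma_has_flag(p, "wf") would be expected to return 1 on a kernel carrying this patch, and 0 again after MADV_KEEPONFORK.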

2017-08-05 18:47:23

by kernel test robot

Subject: Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All error/warnings (new ones prefixed by >>):

In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here (not in a function)
#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
^
>> include/trace/events/mmflags.h:165:2: note: in expansion of macro '__VM_ARCH_SPECIFIC_2'
__VM_ARCH_SPECIFIC_2 , \
^
>> mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
__def_vmaflag_names,
^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17 127
bcf669179 Kirill A. Shutemov 2016-03-17 128 #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17 129 #define __VM_ARCH_SPECIFIC_2 {VM_MPX, "mpx" }
bcf669179 Kirill A. Shutemov 2016-03-17 130 #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131 #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
420adbe9f Vlastimil Babka 2016-03-15 132 #endif
420adbe9f Vlastimil Babka 2016-03-15 133
420adbe9f Vlastimil Babka 2016-03-15 134 #ifdef CONFIG_MEM_SOFT_DIRTY
420adbe9f Vlastimil Babka 2016-03-15 135 #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
420adbe9f Vlastimil Babka 2016-03-15 136 #else
420adbe9f Vlastimil Babka 2016-03-15 137 #define IF_HAVE_VM_SOFTDIRTY(flag,name)
420adbe9f Vlastimil Babka 2016-03-15 138 #endif
420adbe9f Vlastimil Babka 2016-03-15 139
420adbe9f Vlastimil Babka 2016-03-15 140 #define __def_vmaflag_names \
420adbe9f Vlastimil Babka 2016-03-15 141 {VM_READ, "read" }, \
420adbe9f Vlastimil Babka 2016-03-15 142 {VM_WRITE, "write" }, \
420adbe9f Vlastimil Babka 2016-03-15 143 {VM_EXEC, "exec" }, \
420adbe9f Vlastimil Babka 2016-03-15 144 {VM_SHARED, "shared" }, \
420adbe9f Vlastimil Babka 2016-03-15 145 {VM_MAYREAD, "mayread" }, \
420adbe9f Vlastimil Babka 2016-03-15 146 {VM_MAYWRITE, "maywrite" }, \
420adbe9f Vlastimil Babka 2016-03-15 147 {VM_MAYEXEC, "mayexec" }, \
420adbe9f Vlastimil Babka 2016-03-15 148 {VM_MAYSHARE, "mayshare" }, \
420adbe9f Vlastimil Babka 2016-03-15 149 {VM_GROWSDOWN, "growsdown" }, \
bcf669179 Kirill A. Shutemov 2016-03-17 150 {VM_UFFD_MISSING, "uffd_missing" }, \
420adbe9f Vlastimil Babka 2016-03-15 151 {VM_PFNMAP, "pfnmap" }, \
420adbe9f Vlastimil Babka 2016-03-15 152 {VM_DENYWRITE, "denywrite" }, \
bcf669179 Kirill A. Shutemov 2016-03-17 153 {VM_UFFD_WP, "uffd_wp" }, \
420adbe9f Vlastimil Babka 2016-03-15 154 {VM_LOCKED, "locked" }, \
420adbe9f Vlastimil Babka 2016-03-15 155 {VM_IO, "io" }, \
420adbe9f Vlastimil Babka 2016-03-15 156 {VM_SEQ_READ, "seqread" }, \
420adbe9f Vlastimil Babka 2016-03-15 157 {VM_RAND_READ, "randread" }, \
420adbe9f Vlastimil Babka 2016-03-15 158 {VM_DONTCOPY, "dontcopy" }, \
420adbe9f Vlastimil Babka 2016-03-15 159 {VM_DONTEXPAND, "dontexpand" }, \
bcf669179 Kirill A. Shutemov 2016-03-17 160 {VM_LOCKONFAULT, "lockonfault" }, \
420adbe9f Vlastimil Babka 2016-03-15 161 {VM_ACCOUNT, "account" }, \
420adbe9f Vlastimil Babka 2016-03-15 162 {VM_NORESERVE, "noreserve" }, \
420adbe9f Vlastimil Babka 2016-03-15 163 {VM_HUGETLB, "hugetlb" }, \
bcf669179 Kirill A. Shutemov 2016-03-17 164 __VM_ARCH_SPECIFIC_1 , \
bcf669179 Kirill A. Shutemov 2016-03-17 @165 __VM_ARCH_SPECIFIC_2 , \
420adbe9f Vlastimil Babka 2016-03-15 166 {VM_DONTDUMP, "dontdump" }, \
420adbe9f Vlastimil Babka 2016-03-15 167 IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
420adbe9f Vlastimil Babka 2016-03-15 168 {VM_MIXEDMAP, "mixedmap" }, \
420adbe9f Vlastimil Babka 2016-03-15 169 {VM_HUGEPAGE, "hugepage" }, \
420adbe9f Vlastimil Babka 2016-03-15 170 {VM_NOHUGEPAGE, "nohugepage" }, \
420adbe9f Vlastimil Babka 2016-03-15 171 {VM_MERGEABLE, "mergeable" } \
420adbe9f Vlastimil Babka 2016-03-15 172

:::::: The code at line 131 was first introduced by commit
:::::: bcf6691797f425b301f629bb783b7ff2d0bcfa5a mm, tracing: refresh __def_vmaflag_names

:::::: TO: Kirill A. Shutemov <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>
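
The error comes from the non-x86 branch of __VM_ARCH_SPECIFIC_2, which still refers to VM_ARCH_2 after patch 2/2 renamed that bit to VM_WIPEONFORK. One possible direction, which is an assumption and not something proposed in this thread, is to name the bit by its new meaning on every architecture. The stand-alone model below uses a simplified struct in place of the kernel's struct trace_print_flags, and the "wipeonfork" string is made up for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct trace_print_flags. */
struct flag_name {
	unsigned long mask;
	const char *name;
};

/* Values as they appear in include/linux/mm.h after patch 2/2. */
#define VM_ARCH_1	0x01000000UL
#define VM_WIPEONFORK	0x02000000UL	/* the bit that used to be VM_ARCH_2 */
#define VM_DONTDUMP	0x04000000UL

/*
 * Hypothetical excerpt of the name table: the VM_WIPEONFORK entry sits
 * where __VM_ARCH_SPECIFIC_2 used to be, so no architecture refers to
 * the removed VM_ARCH_2 identifier.
 */
static const struct flag_name vmaflag_names[] = {
	{ VM_ARCH_1,     "arch_1"     },
	{ VM_WIPEONFORK, "wipeonfork" },
	{ VM_DONTDUMP,   "dontdump"   },
	{ 0, NULL }
};

/* Returns the name for an exact single-bit mask, or NULL if unknown. */
const char *vmaflag_name(unsigned long mask)
{
	const struct flag_name *e;

	for (e = vmaflag_names; e->name; e++)
		if (e->mask == mask)
			return e->name;
	return NULL;
}
```

In the real header the x86 branch would also need attention, because VM_MPX no longer shares bit 0x02000000 after patch 1/2.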

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (5.01 kB)
.config.gz (49.73 kB)

2017-08-05 19:33:28

by kernel test robot

Subject: Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile

All errors (new ones prefixed by >>):

>> mm/debug.c:39:2: error: 'VM_ARCH_2' undeclared here (not in a function)

vim +/VM_ARCH_2 +39 mm/debug.c

420adbe9f Vlastimil Babka 2016-03-15 37
edf14cdbf Vlastimil Babka 2016-03-15 38 const struct trace_print_flags vmaflag_names[] = {
edf14cdbf Vlastimil Babka 2016-03-15 @39 __def_vmaflag_names,
edf14cdbf Vlastimil Babka 2016-03-15 40 {0, NULL}
82742a3a5 Sasha Levin 2014-10-09 41 };
82742a3a5 Sasha Levin 2014-10-09 42

:::::: The code at line 39 was first introduced by commit
:::::: edf14cdbf9a0e5ab52698ca66d07a76ade0d5c46 mm, printk: introduce new format string for flags

:::::: TO: Vlastimil Babka <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.53 kB)
.config.gz (48.61 kB)