2008-12-03 19:04:47

by Lee Schermerhorn

[permalink] [raw]
Subject: [PATCH] - support inheritance of mlocks across fork/exec V2



Against; 2.6.28-rc7-mmotm-081203

V02: rework vetting of flags argument as suggested by
Kosaki Motohiro.
enhance description as requested by Andrew Morton.

Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):

MCL_CURRENT[|MCL_FUTURE]|MCL_INHERIT - inherit memory locks
[vmas' VM_LOCKED flags] across fork(), and inherit
MCL_FUTURE behavior [mm's def_flags] across fork()
and exec(). Behaves as if child and/or new task
called mlockall(MCL_CURRENT|MCL_FUTURE) as first
instruction.

MCL_RECURSIVE - inherit MCL_CURRENT|MCL_FUTURE|MCL_INHERIT
[vmas' VM_LOCKED flags for fork() and mm's def_flags
and mcl_inherit across fork() and exec()] for all
future generations of calling process's descendants.
Behaves as if child and/or new task called
mlockall(MCL_CURRENT|MCL_FUTURE|MCL_INHERIT|MCL_RECURSIVE)
as the first instruction.

In support of a "lock prefix command"--e.g., mlock <cmd> <args> ...
Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
policy.

Together with patches to keep mlocked pages off the LRU, this will
allow users/admins to lock down applications without modifying them,
if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
off the LRU and out of consideration for reclaim.

Potentially useful, as well, in real-time environments to force
prefaulting and residency for applications that don't mlock themselves.

Jeff Sharkey at Montana State developed a similar patch for Linux
[link no longer accessible], but apparently he never submitted the patch.

I submitted an earlier version of this patch around a year ago. I
resurrected it to test the unevictable lru/mlocked pages patches--
e.g., by "mlock -r make -j<N*nr_cpus> all". This did shake out a few
races and vmstat accounting bugs, but NOT something I'd recommend as
general practice--for kernel builds, that is.

----

Define MCL_INHERIT, MCL_RECURSIVE in <asm-*/mman.h>.
+ x86 and ia64 versions included.
+ other arch can/will be created, if this patch deemed merge-worthy.

Similarly, I'll provide kernel man page update if/when needed.

Example "lock prefix command" in Documentation/vm/mlock.c

Signed-off-by: Lee Schermerhorn <[email protected]>

Documentation/vm/mlock.c | 149 +++++++++++++++++++++++++++++++++++++++++++
arch/ia64/include/asm/mman.h | 2
arch/x86/include/asm/mman.h | 3
fs/binfmt_elf.c | 9 ++
include/linux/mm_types.h | 2
kernel/fork.c | 15 +++-
mm/mlock.c | 19 ++++-
7 files changed, 191 insertions(+), 8 deletions(-)

Index: linux-2.6.28-rc7-mmotm-081203/arch/ia64/include/asm/mman.h
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/arch/ia64/include/asm/mman.h 2008-12-03 09:33:42.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/arch/ia64/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500
@@ -21,6 +21,8 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_INHERIT 4 /* inherit '_FUTURE across fork/exec */
+#define MCL_RECURSIVE 8 /* inherit '_FUTURE recursively */

#ifdef __KERNEL__
#ifndef __ASSEMBLY__
Index: linux-2.6.28-rc7-mmotm-081203/mm/mlock.c
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/mm/mlock.c 2008-12-03 10:33:11.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/mm/mlock.c 2008-12-03 10:33:29.000000000 -0500
@@ -573,15 +573,18 @@ asmlinkage long sys_munlock(unsigned lon
static int do_mlockall(int flags)
{
struct vm_area_struct * vma, * prev = NULL;
+ struct mm_struct *mm = current->mm;
unsigned int def_flags = 0;

if (flags & MCL_FUTURE)
- def_flags = VM_LOCKED;
- current->mm->def_flags = def_flags;
- if (flags == MCL_FUTURE)
+ def_flags = VM_LOCKED;;
+ mm->def_flags = def_flags;
+ if (flags & MCL_INHERIT)
+ mm->mcl_inherit = flags & (MCL_INHERIT | MCL_RECURSIVE);
+ if ((flags & ~(MCL_INHERIT | MCL_RECURSIVE)) == MCL_FUTURE)
goto out;

- for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+ for (vma = mm->mmap; vma ; vma = prev->vm_next) {
unsigned int newflags;

newflags = vma->vm_flags | VM_LOCKED;
@@ -600,9 +603,15 @@ asmlinkage long sys_mlockall(int flags)
unsigned long lock_limit;
int ret = -EINVAL;

- if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
+ if (!(flags & (MCL_CURRENT | MCL_FUTURE)))
goto out;

+ if (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_INHERIT | MCL_RECURSIVE))
+ goto out; /* undefined flag bits */
+
+ if ((flags & (MCL_INHERIT | MCL_RECURSIVE)) == MCL_RECURSIVE)
+ goto out; /* 'RECURSIVE undefined without 'INHERIT */
+
ret = -EPERM;
if (!can_do_mlock())
goto out;
Index: linux-2.6.28-rc7-mmotm-081203/kernel/fork.c
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/kernel/fork.c 2008-12-03 10:18:15.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/kernel/fork.c 2008-12-03 10:33:29.000000000 -0500
@@ -278,7 +278,8 @@ static int dup_mmap(struct mm_struct *mm
*/
down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);

- mm->locked_vm = 0;
+ if (!mm->mcl_inherit)
+ mm->locked_vm = 0;
mm->mmap = NULL;
mm->mmap_cache = NULL;
mm->free_area_cache = oldmm->mmap_base;
@@ -316,7 +317,8 @@ static int dup_mmap(struct mm_struct *mm
if (IS_ERR(pol))
goto fail_nomem_policy;
vma_set_policy(tmp, pol);
- tmp->vm_flags &= ~VM_LOCKED;
+ if (!mm->mcl_inherit)
+ tmp->vm_flags &= ~VM_LOCKED;
tmp->vm_mm = mm;
tmp->vm_next = NULL;
anon_vma_link(tmp);
@@ -406,6 +408,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLO

static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
{
+ unsigned long def_flags = 0;
+
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
@@ -422,9 +426,14 @@ static struct mm_struct * mm_init(struct
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
mm_init_owner(mm, p);
+ if (current->mm && current->mm->mcl_inherit) {
+ def_flags = current->mm->def_flags & VM_LOCKED;
+ if (mm->mcl_inherit & MCL_RECURSIVE)
+ mm->mcl_inherit = current->mm->mcl_inherit;
+ }

if (likely(!mm_alloc_pgd(mm))) {
- mm->def_flags = 0;
+ mm->def_flags = def_flags;
mmu_notifier_mm_init(mm);
return mm;
}
Index: linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/fs/binfmt_elf.c 2008-12-03 10:19:21.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c 2008-12-03 10:33:29.000000000 -0500
@@ -585,6 +585,7 @@ static int load_elf_binary(struct linux_
unsigned long reloc_func_desc = 0;
int executable_stack = EXSTACK_DEFAULT;
unsigned long def_flags = 0;
+ int mcl_inherit = 0;
struct {
struct elfhdr elf_ex;
struct elfhdr interp_elf_ex;
@@ -749,6 +750,13 @@ static int load_elf_binary(struct linux_
SET_PERSONALITY(loc->elf_ex);
}

+ /* Optionally inherit MCL_FUTURE state before destroying old mm */
+ if (current->mm && current->mm->mcl_inherit) {
+ def_flags = current->mm->def_flags & VM_LOCKED;
+ if (current->mm->mcl_inherit & MCL_RECURSIVE)
+ mcl_inherit = current->mm->mcl_inherit;
+ }
+
/* Flush all traces of the currently running executable */
retval = flush_old_exec(bprm);
if (retval)
@@ -757,6 +765,7 @@ static int load_elf_binary(struct linux_
/* OK, This is the point of no return */
current->flags &= ~PF_FORKNOEXEC;
current->mm->def_flags = def_flags;
+ current->mm->mcl_inherit = mcl_inherit;

/* Do this immediately, since STACK_TOP as used in setup_arg_pages
may depend on the personality. */
Index: linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/arch/x86/include/asm/mman.h 2008-12-03 10:16:26.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500
@@ -16,5 +16,8 @@

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_INHERIT 4 /* inherit mlocks across fork */
+ /* inherit '_FUTURE flag across fork/exec */
+#define MCL_RECURSIVE 8 /* inherit mlocks recursively */

#endif /* _ASM_X86_MMAN_H */
Index: linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h
===================================================================
--- linux-2.6.28-rc7-mmotm-081203.orig/include/linux/mm_types.h 2008-12-03 10:18:01.000000000 -0500
+++ linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h 2008-12-03 10:33:29.000000000 -0500
@@ -235,6 +235,8 @@ struct mm_struct {
unsigned int token_priority;
unsigned int last_interval;

+ int mcl_inherit; /* inherit current/future locks */
+
unsigned long flags; /* Must use atomic bitops to access the bits */

struct core_state *core_state; /* coredumping support */
Index: linux-2.6.28-rc7-mmotm-081203/Documentation/vm/mlock.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.28-rc7-mmotm-081203/Documentation/vm/mlock.c 2008-12-03 10:33:29.000000000 -0500
@@ -0,0 +1,149 @@
+/*
+ * mlock.c
+ *
+ * Command-line utility for launching a program with the
+ * mlockall() MCL_FUTURE flag set such that all of the task's
+ * pages will be locked into memory. This depends on the
+ * MCL_INHERIT|MCL_RECURSIVE enhancement to mlockall(2).
+ *
+ * Based on the taskset command from the schedutils package by
+ *
+ * Robert Love <[email protected]>
+ *
+ * Compile with:
+ *
+ * gcc -o mlock mlock.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, v2, as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Copyright (C) 2004 Robert Love
+ * Copyright (C) 2008 Hewlett-Packard, Inc.
+ */
+
+#include <sys/types.h>
+#include <sys/mman.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#define MLOCK_VERSION "0.2"
+
+/*
+ * Version Info
+ *
+ * 0.1 - initial implementation
+ *
+ * 0.2 - add "--recursive" support
+ */
+
+#define OPTIONS "+hr"
+static struct option l_opts[] = {
+ {
+ .name = "help",
+ .has_arg = no_argument,
+ .flag = NULL,
+ .val = 'h'
+ },
+ {
+ .name = "recursive",
+ .has_arg = no_argument,
+ .flag = NULL,
+ .val = 'r'
+ },
+ {
+ .name = NULL,
+ }
+};
+
+/*
+ * For testing before MCL_INHERIT and MCL_RECURSIVE exist in a
+ * user space header. mlockall() will fail if these flags are
+ * not implemented.
+ *
+ * N.B., won't work on platforms with "interesting" values for
+ * MCL_FUTURE -- e.g., powerpc, sparc, alpha
+ * [maybe OK for alpha, but ...]
+ */
+#ifndef MCL_INHERIT
+#define MCL_INHERIT (MCL_FUTURE << 1)
+#define MCL_RECURSIVE (MCL_INHERIT << 1)
+#endif
+
+static const char *usage = "\
+\nmlock version " MLOCK_VERSION "\n\n\
+Usage: %s [-hr] <cmd> [args...]]\n\n\
+Where:\n\
+\t--help/-h = show this help/usage\n\
+\t--recursive/-r = inherit recursively--i.e., across future\n\
+\t generations.\n\n\
+Run <cmd> as if it had called mlockall(2) with the MCL_CURRENT|MCL_FUTURE\n\
+flags set. That is, all of <cmd>'s pages will be locked into memory.\n\
+If '--recursive/-r' specified, the MCL_RECURSIVE flag will be added, and\n\
+all future descendants of <cmd> will run with inherit this condition,\n\
+unless one of them calls munlockall(2) or mlockall(2) without the\n\
+MCL_INHERIT|MCL_RECURSIVE flags.\n\n\
+";
+
+static void show_usage(const char *cmd)
+{
+ fprintf(stderr, usage, cmd);
+}
+
+int main(int argc, char *argv[])
+{
+
+ int opt;
+ int flags = MCL_FUTURE|MCL_INHERIT;
+
+ while ((opt = getopt_long(argc, argv, OPTIONS, l_opts, NULL)) != -1) {
+ int ret = 1;
+
+ switch (opt) {
+ case 'r':
+ flags |= MCL_RECURSIVE;
+ break;
+ case 'h':
+ ret = 0;
+ /* fall through */
+
+ default:
+ show_usage(argv[0]);
+ return ret;
+ }
+ }
+
+ if ((argc - optind) < 1) {
+ show_usage(argv[0]);
+ return 1;
+ }
+
+ if (mlockall(flags) == -1) {
+ fprintf(stderr, "%s mlockall() failed - %s\n", argv[0],
+ strerror(errno));
+ return 1;
+ }
+
+ argv += optind;
+ execvp(argv[0], argv);
+ perror("execvp");
+ fprintf(stderr, "failed to execute %s\n", argv[0]);
+ return 1;
+
+}
+


2008-12-04 01:57:52

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

> @@ -600,9 +603,15 @@ asmlinkage long sys_mlockall(int flags)
> unsigned long lock_limit;
> int ret = -EINVAL;
>
> - if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
> + if (!(flags & (MCL_CURRENT | MCL_FUTURE)))
> goto out;
>
> + if (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_INHERIT | MCL_RECURSIVE))
> + goto out; /* undefined flag bits */
> +
> + if ((flags & (MCL_INHERIT | MCL_RECURSIVE)) == MCL_RECURSIVE)
> + goto out; /* 'RECURSIVE undefined without 'INHERIT */
> +
> ret = -EPERM;
> if (!can_do_mlock())
> goto out;

looks good to me.

Reviewed-by: KOSAKI Motohiro <[email protected]>


2008-12-07 06:08:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2


(cc linux-abi. Again.)

On Wed, 03 Dec 2008 14:04:29 -0500 Lee Schermerhorn <[email protected]> wrote:

>
>
> Against; 2.6.28-rc7-mmotm-081203
>
> V02: rework vetting of flags argument as suggested by
> Kosaki Motohiro.
> enhance description as requested by Andrew Morton.
>
> Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):
>
> MCL_CURRENT[|MCL_FUTURE]|MCL_INHERIT - inherit memory locks
> [vmas' VM_LOCKED flags] across fork(), and inherit
> MCL_FUTURE behavior [mm's def_flags] across fork()
> and exec(). Behaves as if child and/or new task
> called mlockall(MCL_CURRENT|MCL_FUTURE) as first
> instruction.
>
> MCL_RECURSIVE - inherit MCL_CURRENT|MCL_FUTURE|MCL_INHERIT
> [vmas' VM_LOCKED flags for fork() and mm's def_flags
> and mcl_inherit across fork() and exec()] for all
> future generations of calling process's descendants.
> Behaves as if child and/or new task called
> mlockall(MCL_CURRENT|MCL_FUTURE|MCL_INHERIT|MCL_RECURSIVE)
> as the first instruction.
>
> In support of a "lock prefix command"--e.g., mlock <cmd> <args> ...
> Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> policy.
>
> Together with patches to keep mlocked pages off the LRU, this will
> allow users/admins to lock down applications without modifying them,
> if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> off the LRU and out of consideration for reclaim.
>
> Potentially useful, as well, in real-time environments to force
> prefaulting and residency for applications that don't mlock themselves.
>
> Jeff Sharkey at Montana State developed a similar patch for Linux
> [link no longer accessible], but apparently he never submitted the patch.
>
> I submitted an earlier version of this patch around a year ago. I
> resurrected it to test the unevictable lru/mlocked pages patches--
> e.g., by "mlock -r make -j<N*nr_cpus> all". This did shake out a few
> races and vmstat accounting bugs, but NOT something I'd recommend as
> general practice--for kernel builds, that is.
>
> ----
>
> Define MCL_INHERIT, MCL_RECURSIVE in <asm-*/mman.h>.
> + x86 and ia64 versions included.
> + other arch can/will be created, if this patch deemed merge-worthy.
>
> Similarly, I'll provide kernel man page update if/when needed.
>
> Example "lock prefix command" in Documentation/vm/mlock.c
>
>
> ...
>
> --- linux-2.6.28-rc7-mmotm-081203.orig/mm/mlock.c 2008-12-03 10:33:11.000000000 -0500
> +++ linux-2.6.28-rc7-mmotm-081203/mm/mlock.c 2008-12-03 10:33:29.000000000 -0500
> @@ -573,15 +573,18 @@ asmlinkage long sys_munlock(unsigned lon
> static int do_mlockall(int flags)
> {
> struct vm_area_struct * vma, * prev = NULL;
> + struct mm_struct *mm = current->mm;
> unsigned int def_flags = 0;
>
> if (flags & MCL_FUTURE)
> - def_flags = VM_LOCKED;
> - current->mm->def_flags = def_flags;
> - if (flags == MCL_FUTURE)
> + def_flags = VM_LOCKED;;

You get paid by the semicolon?

> + mm->def_flags = def_flags;
> + if (flags & MCL_INHERIT)
> + mm->mcl_inherit = flags & (MCL_INHERIT | MCL_RECURSIVE);

aaaaaaarrrrrrggggggghhhhhhh!

So mm->mcl_inherit is _not_ a boolean meaning "this mm has MCL_INHERIT
set", as I naively believed when I saw it. Is it in fact a composite
of both MCL_INHERIT and MCL_RECURSIVE, given a badly misleading name
and a waffly comment.

Can we please make this clearer?

>
> ...
>
> --- linux-2.6.28-rc7-mmotm-081203.orig/kernel/fork.c 2008-12-03 10:18:15.000000000 -0500
> +++ linux-2.6.28-rc7-mmotm-081203/kernel/fork.c 2008-12-03 10:33:29.000000000 -0500
> @@ -278,7 +278,8 @@ static int dup_mmap(struct mm_struct *mm
> */
> down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
>
> - mm->locked_vm = 0;
> + if (!mm->mcl_inherit)
> + mm->locked_vm = 0;
> mm->mmap = NULL;
> mm->mmap_cache = NULL;
> mm->free_area_cache = oldmm->mmap_base;
> @@ -316,7 +317,8 @@ static int dup_mmap(struct mm_struct *mm
> if (IS_ERR(pol))
> goto fail_nomem_policy;
> vma_set_policy(tmp, pol);
> - tmp->vm_flags &= ~VM_LOCKED;
> + if (!mm->mcl_inherit)
> + tmp->vm_flags &= ~VM_LOCKED;
> tmp->vm_mm = mm;
> tmp->vm_next = NULL;
> anon_vma_link(tmp);
> @@ -406,6 +408,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
>
> static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> {
> + unsigned long def_flags = 0;
> +
> atomic_set(&mm->mm_users, 1);
> atomic_set(&mm->mm_count, 1);
> init_rwsem(&mm->mmap_sem);
> @@ -422,9 +426,14 @@ static struct mm_struct * mm_init(struct
> mm->free_area_cache = TASK_UNMAPPED_BASE;
> mm->cached_hole_size = ~0UL;
> mm_init_owner(mm, p);
> + if (current->mm && current->mm->mcl_inherit) {
> + def_flags = current->mm->def_flags & VM_LOCKED;
> + if (mm->mcl_inherit & MCL_RECURSIVE)
> + mm->mcl_inherit = current->mm->mcl_inherit;
> + }

hm. That code looks familiar. Not worth a helper function?

> if (likely(!mm_alloc_pgd(mm))) {
> - mm->def_flags = 0;
> + mm->def_flags = def_flags;
> mmu_notifier_mm_init(mm);
> return mm;
> }
> Index: linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c
> ===================================================================
> --- linux-2.6.28-rc7-mmotm-081203.orig/fs/binfmt_elf.c 2008-12-03 10:19:21.000000000 -0500
> +++ linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c 2008-12-03 10:33:29.000000000 -0500
> @@ -585,6 +585,7 @@ static int load_elf_binary(struct linux_
> unsigned long reloc_func_desc = 0;
> int executable_stack = EXSTACK_DEFAULT;
> unsigned long def_flags = 0;
> + int mcl_inherit = 0;
> struct {
> struct elfhdr elf_ex;
> struct elfhdr interp_elf_ex;
> @@ -749,6 +750,13 @@ static int load_elf_binary(struct linux_
> SET_PERSONALITY(loc->elf_ex);
> }
>
> + /* Optionally inherit MCL_FUTURE state before destroying old mm */
> + if (current->mm && current->mm->mcl_inherit) {

umm, OK, it looks like current->mm can be null here.

> + def_flags = current->mm->def_flags & VM_LOCKED;

ooh, we finally used local variable def_flags for something. That's
been wasting space since 2.2.26 (or earlier).

> + if (current->mm->mcl_inherit & MCL_RECURSIVE)
> + mcl_inherit = current->mm->mcl_inherit;
> + }
> +
> /* Flush all traces of the currently running executable */
> retval = flush_old_exec(bprm);
> if (retval)
> @@ -757,6 +765,7 @@ static int load_elf_binary(struct linux_
> /* OK, This is the point of no return */
> current->flags &= ~PF_FORKNOEXEC;
> current->mm->def_flags = def_flags;
> + current->mm->mcl_inherit = mcl_inherit;
>
> /* Do this immediately, since STACK_TOP as used in setup_arg_pages
> may depend on the personality. */

Why is this done in binfmt_elf.c? Doesn't this preclude the operation
of these new flags for other binary formats?

> Index: linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h
> ===================================================================
> --- linux-2.6.28-rc7-mmotm-081203.orig/arch/x86/include/asm/mman.h 2008-12-03 10:16:26.000000000 -0500
> +++ linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500
> @@ -16,5 +16,8 @@
>
> #define MCL_CURRENT 1 /* lock all current mappings */
> #define MCL_FUTURE 2 /* lock all future mappings */
> +#define MCL_INHERIT 4 /* inherit mlocks across fork */
> + /* inherit '_FUTURE flag across fork/exec */
> +#define MCL_RECURSIVE 8 /* inherit mlocks recursively */
>
> #endif /* _ASM_X86_MMAN_H */
> Index: linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.28-rc7-mmotm-081203.orig/include/linux/mm_types.h 2008-12-03 10:18:01.000000000 -0500
> +++ linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h 2008-12-03 10:33:29.000000000 -0500
> @@ -235,6 +235,8 @@ struct mm_struct {
> unsigned int token_priority;
> unsigned int last_interval;
>
> + int mcl_inherit; /* inherit current/future locks */
> +
> unsigned long flags; /* Must use atomic bitops to access the bits */

I was thinking that the boolean mcl_inherit could become a bit in
mm.flags. That was before I unmisled myself about it.

Perhaps we could use two bits. That might get a bit nasty.


2008-12-08 15:01:55

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Sat, 2008-12-06 at 22:07 -0800, Andrew Morton wrote:
> (cc linux-abi. Again.)

Sorry.

>
> On Wed, 03 Dec 2008 14:04:29 -0500 Lee Schermerhorn <[email protected]> wrote:
>
> >
> >
> > Against; 2.6.28-rc7-mmotm-081203
> >
> > V02: rework vetting of flags argument as suggested by
> > Kosaki Motohiro.
> > enhance description as requested by Andrew Morton.
> >
> > Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):
> >
> > MCL_CURRENT[|MCL_FUTURE]|MCL_INHERIT - inherit memory locks
> > [vmas' VM_LOCKED flags] across fork(), and inherit
> > MCL_FUTURE behavior [mm's def_flags] across fork()
> > and exec(). Behaves as if child and/or new task
> > called mlockall(MCL_CURRENT|MCL_FUTURE) as first
> > instruction.
> >
> > MCL_RECURSIVE - inherit MCL_CURRENT|MCL_FUTURE|MCL_INHERIT
> > [vmas' VM_LOCKED flags for fork() and mm's def_flags
> > and mcl_inherit across fork() and exec()] for all
> > future generations of calling process's descendants.
> > Behaves as if child and/or new task called
> > mlockall(MCL_CURRENT|MCL_FUTURE|MCL_INHERIT|MCL_RECURSIVE)
> > as the first instruction.
> >
> > In support of a "lock prefix command"--e.g., mlock <cmd> <args> ...
> > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> > policy.
> >
> > Together with patches to keep mlocked pages off the LRU, this will
> > allow users/admins to lock down applications without modifying them,
> > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> > off the LRU and out of consideration for reclaim.
> >
> > Potentially useful, as well, in real-time environments to force
> > prefaulting and residency for applications that don't mlock themselves.
> >
> > Jeff Sharkey at Montana State developed a similar patch for Linux
> > [link no longer accessible], but apparently he never submitted the patch.
> >
> > I submitted an earlier version of this patch around a year ago. I
> > resurrected it to test the unevictable lru/mlocked pages patches--
> > e.g., by "mlock -r make -j<N*nr_cpus> all". This did shake out a few
> > races and vmstat accounting bugs, but NOT something I'd recommend as
> > general practice--for kernel builds, that is.
> >
> > ----
> >
> > Define MCL_INHERIT, MCL_RECURSIVE in <asm-*/mman.h>.
> > + x86 and ia64 versions included.
> > + other arch can/will be created, if this patch deemed merge-worthy.
> >
> > Similarly, I'll provide kernel man page update if/when needed.
> >
> > Example "lock prefix command" in Documentation/vm/mlock.c
> >
> >
> > ...
> >
> > --- linux-2.6.28-rc7-mmotm-081203.orig/mm/mlock.c 2008-12-03 10:33:11.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/mm/mlock.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -573,15 +573,18 @@ asmlinkage long sys_munlock(unsigned lon
> > static int do_mlockall(int flags)
> > {
> > struct vm_area_struct * vma, * prev = NULL;
> > + struct mm_struct *mm = current->mm;
> > unsigned int def_flags = 0;
> >
> > if (flags & MCL_FUTURE)
> > - def_flags = VM_LOCKED;
> > - current->mm->def_flags = def_flags;
> > - if (flags == MCL_FUTURE)
> > + def_flags = VM_LOCKED;;
>
> You get paid by the semicolon?

Nope. Interestingly, ckeckpatch was fine with this. [It did warn about
>80 char line in x86 mman.h, but there was already a much longer line,
so I left it. Fixed now so that patch is squeaky clean, as far as
checkpatch is concerned.]

>
> > + mm->def_flags = def_flags;
> > + if (flags & MCL_INHERIT)
> > + mm->mcl_inherit = flags & (MCL_INHERIT | MCL_RECURSIVE);
>
> aaaaaaarrrrrrggggggghhhhhhh!
>
> So mm->mcl_inherit is _not_ a boolean meaning "this mm has MCL_INHERIT
> set", as I naively believed when I saw it. Is it in fact a composite
> of both MCL_INHERIT and MCL_RECURSIVE, given a badly misleading name
> and a waffly comment.
>
> Can we please make this clearer?

How about "mcl_inherit_mlocks_possibly_recursively" ?

Seriously, it made/makes sense to me. If mcl_inherit is non-zero, it
WILL contain MCL_INHERIT, meaning that existing mlocks and any VM_LOCKED
in def_flags will be inherited across the next fork or exec [of course,
existing mlocks can't be inherited across exec as we discard the address
space]. It may also contain MCL_RECURSIVE--tested separately--meaning
that the contents of mcl_inherit, itself, will also be inherited across
the next and future forks/execs, until/unless a descendant invokes
mlockall() w/o MCL_INHERIT or MCL_RECURSIVE to clear out mcl_inherit.

This seemed a bit much to try to explain on line where mcl_inherit is
declared in mm_types. More appropriate to a man page, I would have
thought. How about a Documentation/vm doc?

Meanwhile, perhaps: "mcl_inherit" -> "mcl_inherit_flags" or "mcl_flags"
or "mlock_flags", or such?

>
> >
> > ...
> >
> > --- linux-2.6.28-rc7-mmotm-081203.orig/kernel/fork.c 2008-12-03 10:18:15.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/kernel/fork.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -278,7 +278,8 @@ static int dup_mmap(struct mm_struct *mm
> > */
> > down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
> >
> > - mm->locked_vm = 0;
> > + if (!mm->mcl_inherit)
> > + mm->locked_vm = 0;
> > mm->mmap = NULL;
> > mm->mmap_cache = NULL;
> > mm->free_area_cache = oldmm->mmap_base;
> > @@ -316,7 +317,8 @@ static int dup_mmap(struct mm_struct *mm
> > if (IS_ERR(pol))
> > goto fail_nomem_policy;
> > vma_set_policy(tmp, pol);
> > - tmp->vm_flags &= ~VM_LOCKED;
> > + if (!mm->mcl_inherit)
> > + tmp->vm_flags &= ~VM_LOCKED;
> > tmp->vm_mm = mm;
> > tmp->vm_next = NULL;
> > anon_vma_link(tmp);
> > @@ -406,6 +408,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
> >
> > static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> > {
> > + unsigned long def_flags = 0;
> > +
> > atomic_set(&mm->mm_users, 1);
> > atomic_set(&mm->mm_count, 1);
> > init_rwsem(&mm->mmap_sem);
> > @@ -422,9 +426,14 @@ static struct mm_struct * mm_init(struct
> > mm->free_area_cache = TASK_UNMAPPED_BASE;
> > mm->cached_hole_size = ~0UL;
> > mm_init_owner(mm, p);
> > + if (current->mm && current->mm->mcl_inherit) {
> > + def_flags = current->mm->def_flags & VM_LOCKED;
> > + if (mm->mcl_inherit & MCL_RECURSIVE)
> > + mm->mcl_inherit = current->mm->mcl_inherit;
> > + }
>
> hm. That code looks familiar. Not worth a helper function?

If we could remove the copy in load_elf_binary(), point would be moot.

More below.

>
> > if (likely(!mm_alloc_pgd(mm))) {
> > - mm->def_flags = 0;
> > + mm->def_flags = def_flags;
> > mmu_notifier_mm_init(mm);
> > return mm;
> > }
> > Index: linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/fs/binfmt_elf.c 2008-12-03 10:19:21.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -585,6 +585,7 @@ static int load_elf_binary(struct linux_
> > unsigned long reloc_func_desc = 0;
> > int executable_stack = EXSTACK_DEFAULT;
> > unsigned long def_flags = 0;
> > + int mcl_inherit = 0;
> > struct {
> > struct elfhdr elf_ex;
> > struct elfhdr interp_elf_ex;
> > @@ -749,6 +750,13 @@ static int load_elf_binary(struct linux_
> > SET_PERSONALITY(loc->elf_ex);
> > }
> >
> > + /* Optionally inherit MCL_FUTURE state before destroying old mm */
> > + if (current->mm && current->mm->mcl_inherit) {
>
> umm, OK, it looks like current->mm can be null here.

Probably not here. IIRC, it can be NULL [for init] up in mm_init() and
I probably just cut and pasted, instead of a helper function...

>
> > + def_flags = current->mm->def_flags & VM_LOCKED;
>
> ooh, we finally used local variable def_flags for something. That's
> been wasting space since 2.2.26 (or earlier).

def_flags was being used to clear out the current->mm->def_flags. Don't
know why, tho'. See below...
>
> > + if (current->mm->mcl_inherit & MCL_RECURSIVE)
> > + mcl_inherit = current->mm->mcl_inherit;
> > + }
> > +
> > /* Flush all traces of the currently running executable */
> > retval = flush_old_exec(bprm);
> > if (retval)
> > @@ -757,6 +765,7 @@ static int load_elf_binary(struct linux_
> > /* OK, This is the point of no return */
> > current->flags &= ~PF_FORKNOEXEC;
> > current->mm->def_flags = def_flags;
> > + current->mm->mcl_inherit = mcl_inherit;
> >
> > /* Do this immediately, since STACK_TOP as used in setup_arg_pages
> > may depend on the personality. */
>
> Why is this done in binfmt_elf.c? Doesn't this preclude the operation
> of these new flags for other binary formats?

Because binfmt_elf.c is the only binary format loader [that I can see]
that takes it upon itself to reinitialize mm->def_flags, which has
already been set up in mm_init(). I wondered, myself, why this is the
case, but I didn't dare just remove it, lest it reintroduce some obscure
bug. So, I duplicated my version of def_flags [and mcl_inherit]
initialization here.

Now that you've raised the question: why, indeed, does the elf format
loader reinit def_flags? Can we just remove this?

>
> > Index: linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/arch/x86/include/asm/mman.h 2008-12-03 10:16:26.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500
> > @@ -16,5 +16,8 @@
> >
> > #define MCL_CURRENT 1 /* lock all current mappings */
> > #define MCL_FUTURE 2 /* lock all future mappings */
> > +#define MCL_INHERIT 4 /* inherit mlocks across fork */
> > + /* inherit '_FUTURE flag across fork/exec */
> > +#define MCL_RECURSIVE 8 /* inherit mlocks recursively */
> >
> > #endif /* _ASM_X86_MMAN_H */
> > Index: linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/include/linux/mm_types.h 2008-12-03 10:18:01.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h 2008-12-03 10:33:29.000000000 -0500
> > @@ -235,6 +235,8 @@ struct mm_struct {
> > unsigned int token_priority;
> > unsigned int last_interval;
> >
> > + int mcl_inherit; /* inherit current/future locks */
> > +
> > unsigned long flags; /* Must use atomic bitops to access the bits */
>
> I was thinking that the boolean mcl_inherit could become a bit in
> mm.flags. That was before I unmisled myself about it.
>
> Perhaps we could use two bits. That might get a bit nasty.

I did look for someplace to add these flags, before deciding on a new
member. The "Must use atomic bitops", and the fact that mm->flags seems
used soley for controlling core dumping [and defined in sched.h, no
less!], put me off from going that route. Similarly, because def_flags
is an initializer for the vma's vm_flags, I didn't want to pollute that.
I figured the 'int' would fill a padding slot on 64-bit archs, but does
add to mm size for 32-bit archs.

Suggestions?
I'll hold off on a re-spin, pending resolution of some of these issues.

Thanks for the thorough review.

Lee

2008-12-08 21:06:28

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

Sorry for the resend. I had the 'linux-api' address wrong on my reponse
to Andrew. Want to include the api list in this exchange.

On Sat, 2008-12-06 at 22:07 -0800, Andrew Morton wrote:
> (cc linux-abi. Again.)
>
> On Wed, 03 Dec 2008 14:04:29 -0500 Lee Schermerhorn <[email protected]> wrote:
>
> >
> >
> > Against; 2.6.28-rc7-mmotm-081203
> >
> > V02: rework vetting of flags argument as suggested by
> > Kosaki Motohiro.
> > enhance description as requested by Andrew Morton.
> >
> > Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):
> >
> > MCL_CURRENT[|MCL_FUTURE]|MCL_INHERIT - inherit memory locks
> > [vmas' VM_LOCKED flags] across fork(), and inherit
> > MCL_FUTURE behavior [mm's def_flags] across fork()
> > and exec(). Behaves as if child and/or new task
> > called mlockall(MCL_CURRENT|MCL_FUTURE) as first
> > instruction.
> >
> > MCL_RECURSIVE - inherit MCL_CURRENT|MCL_FUTURE|MCL_INHERIT
> > [vmas' VM_LOCKED flags for fork() and mm's def_flags
> > and mcl_inherit across fork() and exec()] for all
> > future generations of calling process's descendants.
> > Behaves as if child and/or new task called
> > mlockall(MCL_CURRENT|MCL_FUTURE|MCL_INHERIT|MCL_RECURSIVE)
> > as the first instruction.
> >
> > In support of a "lock prefix command"--e.g., mlock <cmd> <args> ...
> > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> > policy.
> >
> > Together with patches to keep mlocked pages off the LRU, this will
> > allow users/admins to lock down applications without modifying them,
> > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> > off the LRU and out of consideration for reclaim.
> >
> > Potentially useful, as well, in real-time environments to force
> > prefaulting and residency for applications that don't mlock themselves.
> >
> > Jeff Sharkey at Montana State developed a similar patch for Linux
> > [link no longer accessible], but apparently he never submitted the patch.
> >
> > I submitted an earlier version of this patch around a year ago. I
> > resurrected it to test the unevictable lru/mlocked pages patches--
> > e.g., by "mlock -r make -j<N*nr_cpus> all". This did shake out a few
> > races and vmstat accounting bugs, but NOT something I'd recommend as
> > general practice--for kernel builds, that is.
> >
> > ----
> >
> > Define MCL_INHERIT, MCL_RECURSIVE in <asm-*/mman.h>.
> > + x86 and ia64 versions included.
> > + other arch can/will be created, if this patch deemed merge-worthy.
> >
> > Similarly, I'll provide kernel man page update if/when needed.
> >
> > Example "lock prefix command" in Documentation/vm/mlock.c
> >
> >
> > ...
> >
> > --- linux-2.6.28-rc7-mmotm-081203.orig/mm/mlock.c 2008-12-03 10:33:11.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/mm/mlock.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -573,15 +573,18 @@ asmlinkage long sys_munlock(unsigned lon
> > static int do_mlockall(int flags)
> > {
> > struct vm_area_struct * vma, * prev = NULL;
> > + struct mm_struct *mm = current->mm;
> > unsigned int def_flags = 0;
> >
> > if (flags & MCL_FUTURE)
> > - def_flags = VM_LOCKED;
> > - current->mm->def_flags = def_flags;
> > - if (flags == MCL_FUTURE)
> > + def_flags = VM_LOCKED;;
>
> You get paid by the semicolon?

Nope. Interestingly, ckeckpatch was fine with this. [It did warn about
>80 char line in x86 mman.h, but there was already a much longer line,
so I left it. Fixed now so that patch is squeaky clean, as far as
checkpatch is concerned.]

>
> > + mm->def_flags = def_flags;
> > + if (flags & MCL_INHERIT)
> > + mm->mcl_inherit = flags & (MCL_INHERIT | MCL_RECURSIVE);
>
> aaaaaaarrrrrrggggggghhhhhhh!
>
> So mm->mcl_inherit is _not_ a boolean meaning "this mm has MCL_INHERIT
> set", as I naively believed when I saw it. Is it in fact a composite
> of both MCL_INHERIT and MCL_RECURSIVE, given a badly misleading name
> and a waffly comment.
>
> Can we please make this clearer?

How about "mcl_inherit_mlocks_possibly_recursively" ?

Seriously, it made/makes sense to me. If mcl_inherit is non-zero, it
WILL contain MCL_INHERIT, meaning that existing mlocks and any VM_LOCKED
in def_flags will be inherited across the next fork or exec [def_flags
only]. It may also contain MCL_RECURSIVE--tested separately--meaning
that the contents of mcl_inherit, itself, will also be inherited across
the next and future forks/execs, until/unless a descendant invokes
mlockall() w/o MCL_INHERIT or MCL_RECURSIVE to clear out mcl_inherit.

This seemed a bit much to try to explain on line where mcl_inherit is
declared in mm_types. More appropriate to a man page, I would have
thought. How about a Documentation/vm doc?

Meanwhile, perhaps: "mcl_inherit" -> "mcl_inherit_flags" or "mcl_flags"
or "mlock_flags", or such?

>
> >
> > ...
> >
> > --- linux-2.6.28-rc7-mmotm-081203.orig/kernel/fork.c 2008-12-03 10:18:15.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/kernel/fork.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -278,7 +278,8 @@ static int dup_mmap(struct mm_struct *mm
> > */
> > down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
> >
> > - mm->locked_vm = 0;
> > + if (!mm->mcl_inherit)
> > + mm->locked_vm = 0;
> > mm->mmap = NULL;
> > mm->mmap_cache = NULL;
> > mm->free_area_cache = oldmm->mmap_base;
> > @@ -316,7 +317,8 @@ static int dup_mmap(struct mm_struct *mm
> > if (IS_ERR(pol))
> > goto fail_nomem_policy;
> > vma_set_policy(tmp, pol);
> > - tmp->vm_flags &= ~VM_LOCKED;
> > + if (!mm->mcl_inherit)
> > + tmp->vm_flags &= ~VM_LOCKED;
> > tmp->vm_mm = mm;
> > tmp->vm_next = NULL;
> > anon_vma_link(tmp);
> > @@ -406,6 +408,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
> >
> > static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
> > {
> > + unsigned long def_flags = 0;
> > +
> > atomic_set(&mm->mm_users, 1);
> > atomic_set(&mm->mm_count, 1);
> > init_rwsem(&mm->mmap_sem);
> > @@ -422,9 +426,14 @@ static struct mm_struct * mm_init(struct
> > mm->free_area_cache = TASK_UNMAPPED_BASE;
> > mm->cached_hole_size = ~0UL;
> > mm_init_owner(mm, p);
> > + if (current->mm && current->mm->mcl_inherit) {
> > + def_flags = current->mm->def_flags & VM_LOCKED;
> > + if (mm->mcl_inherit & MCL_RECURSIVE)
> > + mm->mcl_inherit = current->mm->mcl_inherit;
> > + }
>
> hm. That code looks familiar. Not worth a helper function?

If we could remove the copy in load_elf_binary(), point would be moot.

More below.

>
> > if (likely(!mm_alloc_pgd(mm))) {
> > - mm->def_flags = 0;
> > + mm->def_flags = def_flags;
> > mmu_notifier_mm_init(mm);
> > return mm;
> > }
> > Index: linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/fs/binfmt_elf.c 2008-12-03 10:19:21.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c 2008-12-03 10:33:29.000000000 -0500
> > @@ -585,6 +585,7 @@ static int load_elf_binary(struct linux_
> > unsigned long reloc_func_desc = 0;
> > int executable_stack = EXSTACK_DEFAULT;
> > unsigned long def_flags = 0;
> > + int mcl_inherit = 0;
> > struct {
> > struct elfhdr elf_ex;
> > struct elfhdr interp_elf_ex;
> > @@ -749,6 +750,13 @@ static int load_elf_binary(struct linux_
> > SET_PERSONALITY(loc->elf_ex);
> > }
> >
> > + /* Optionally inherit MCL_FUTURE state before destroying old mm */
> > + if (current->mm && current->mm->mcl_inherit) {
>
> umm, OK, it looks like current->mm can be null here.

Probably not here. IIRC, it can be NULL [for init] up in mm_init() and
I probably just cut and pasted, instead of a helper function...

>
> > + def_flags = current->mm->def_flags & VM_LOCKED;
>
> ooh, we finally used local variable def_flags for something. That's
> been wasting space since 2.2.26 (or earlier).

def_flags was being used to clear out the current->mm->def_flags. Don't
know why, tho'. See below...
>
> > + if (current->mm->mcl_inherit & MCL_RECURSIVE)
> > + mcl_inherit = current->mm->mcl_inherit;
> > + }
> > +
> > /* Flush all traces of the currently running executable */
> > retval = flush_old_exec(bprm);
> > if (retval)
> > @@ -757,6 +765,7 @@ static int load_elf_binary(struct linux_
> > /* OK, This is the point of no return */
> > current->flags &= ~PF_FORKNOEXEC;
> > current->mm->def_flags = def_flags;
> > + current->mm->mcl_inherit = mcl_inherit;
> >
> > /* Do this immediately, since STACK_TOP as used in setup_arg_pages
> > may depend on the personality. */
>
> Why is this done in binfmt_elf.c? Doesn't this preclude the operation
> of these new flags for other binary formats?

Because binfmt_elf.c is the only binary format loader [that I can see]
that takes it upon itself to reinitialize mm->def_flags, which has
already been set up in mm_init(). I wondered, myself, why this is the
case, but I didn't dare just remove it, lest it reintroduce some obscure
bug. So, I duplicated my version of def_flags [and mcl_inherit]
initialization here.

Now that you've raised the question: why, indeed, does the elf format
loader reinit def_flags? Can we just remove this?

>
> > Index: linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/arch/x86/include/asm/mman.h 2008-12-03 10:16:26.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500
> > @@ -16,5 +16,8 @@
> >
> > #define MCL_CURRENT 1 /* lock all current mappings */
> > #define MCL_FUTURE 2 /* lock all future mappings */
> > +#define MCL_INHERIT 4 /* inherit mlocks across fork */
> > + /* inherit '_FUTURE flag across fork/exec */
> > +#define MCL_RECURSIVE 8 /* inherit mlocks recursively */
> >
> > #endif /* _ASM_X86_MMAN_H */
> > Index: linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h
> > ===================================================================
> > --- linux-2.6.28-rc7-mmotm-081203.orig/include/linux/mm_types.h 2008-12-03 10:18:01.000000000 -0500
> > +++ linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h 2008-12-03 10:33:29.000000000 -0500
> > @@ -235,6 +235,8 @@ struct mm_struct {
> > unsigned int token_priority;
> > unsigned int last_interval;
> >
> > + int mcl_inherit; /* inherit current/future locks */
> > +
> > unsigned long flags; /* Must use atomic bitops to access the bits */
>
> I was thinking that the boolean mcl_inherit could become a bit in
> mm.flags. That was before I unmisled myself about it.
>
> Perhaps we could use two bits. That might get a bit nasty.

I did look for someplace to add these flags, before deciding on a new
member. The "Must use atomic bitops", and the fact that mm->flags seems
used soley for controlling core dumping [and defined in sched.h, no
less!], put me off from going that route. Similarly, because def_flags
is an initializer for the vma's vm_flags, I didn't want to pollute that.
I figured the 'int' would fill a padding slot on 64-bit archs, but does
add to mm size for 32-bit archs.

Suggestions?
I'll hold off on a re-spin, pending resolution of some of these issues.

Thanks for the thorough review.

Lee

2008-12-08 21:33:59

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Mon, 2008-12-08 at 16:05 -0500, Lee Schermerhorn wrote:
> > > In support of a "lock prefix command"--e.g., mlock <cmd>
> <args> ...
> > > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> > > policy.
> > >
> > > Together with patches to keep mlocked pages off the LRU, this will
> > > allow users/admins to lock down applications without modifying them,
> > > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> > > off the LRU and out of consideration for reclaim.
> > >
> > > Potentially useful, as well, in real-time environments to force
> > > prefaulting and residency for applications that don't mlock themselves.

This is a bit scary to me. Privilege and mode inheritance across
processes is the root of many nasty surprises, security and otherwise.

Here's a crazy alternative: add a flag to containers instead? I think
this is a better match to what you're trying to do and will keep people
from being surprised when an mlockall call in one thread causes a
fork/exec in another thread to crash their box, but only sometimes.

--
Mathematics is the supreme nostalgia of our time.

2008-12-09 19:41:40

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Mon, 2008-12-08 at 15:33 -0600, Matt Mackall wrote:
> On Mon, 2008-12-08 at 16:05 -0500, Lee Schermerhorn wrote:
> > > > In support of a "lock prefix command"--e.g., mlock <cmd>
> > <args> ...
> > > > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> > > > policy.
> > > >
> > > > Together with patches to keep mlocked pages off the LRU, this will
> > > > allow users/admins to lock down applications without modifying them,
> > > > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> > > > off the LRU and out of consideration for reclaim.
> > > >
> > > > Potentially useful, as well, in real-time environments to force
> > > > prefaulting and residency for applications that don't mlock themselves.
>
> This is a bit scary to me. Privilege and mode inheritance across
> processes is the root of many nasty surprises, security and otherwise.

Hi, Matt:

Could you explain more about this issue? I believe that the patch
doesn't provide any privileges or capabilities that a process doesn't
already have. It just allows one to cause a process and, optionally,
its descendants, to behave as if one had modified the sources to invoke
mlockall() directly. It is still subject to each individual task's
resource limits. At least, that was my intent.

>
> Here's a crazy alternative: add a flag to containers instead? I think
> this is a better match to what you're trying to do and will keep people
> from being surprised when an mlockall call in one thread causes a
> fork/exec in another thread to crash their box, but only sometimes.

Not so crazy. It isn't really a better match, IMO [more below], but I
can see providing a flag to enable/disable this ability in the
container. Initially, the default could be disabled. If/when it's been
around long enough that people are comfortable with it, perhaps the
default could be flipped to enabled.

As far as "what I'm trying to do": I see this as exactly like using
taskset to set the cpu affinity of a task, or numactl to set the task
mempolicy without modifying the source. If one had access to the
program source and wouldn't, e.g., void a support contract by modifying
it, one could just insert the calls into the source itself. These would
still be subject to, but "finer grain" than, external administrative
controls, such as cpuset allowed mems and cpus.

But, sometimes, it's very convenient to use the prefix commands to
effect the same behavior. I could certainly use it in the Oracle,
SAP, ... benchmarking that I occasionally help out with. Actually, as I
have the patches, I can create custom kernels for the purpose of
demonstrating the effects. However, I think it would be useful for
customers someday to have this feature in distro kernels.

I don't understand your comment about threads and crashing the box. If
an inherited mlockall() can crash Linux, then inserting the call into
the source could also crash it. I think that's a bug we'd need to fix
in any case. Indeed, using this facility during testing of the
unevictable lru/mlocked pages work DID uncover bugs that I might not
have seen otherwise. If you meant "cause OOMs", I agree that this
could happen, as well but, again, I think this would be the same
behavior as if the calls were inserted into the source.

Regarding one thread being surprised by what another thread did: if you
mean threads within the same executable, I'd expect them to part of the
same program. One hopes that someone knows the big picture--what all
the threads are doing--obviating any such surprises. But, I recall that
you work/ed in the embedded space, and perhaps threads mean something
different to you.

Anyway...

This is a pretty small patch that I can maintain out of tree for my own
use in testing if people don't think its usefulness outweighs any
concerns it raises. Customers won't be able to use it, but thems the
breaks.

Regards,
Lee
>

2008-12-09 20:42:42

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Tue, 2008-12-09 at 14:40 -0500, Lee Schermerhorn wrote:
> On Mon, 2008-12-08 at 15:33 -0600, Matt Mackall wrote:
> > On Mon, 2008-12-08 at 16:05 -0500, Lee Schermerhorn wrote:
> > > > > In support of a "lock prefix command"--e.g., mlock <cmd>
> > > <args> ...
> > > > > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory
> > > > > policy.
> > > > >
> > > > > Together with patches to keep mlocked pages off the LRU, this will
> > > > > allow users/admins to lock down applications without modifying them,
> > > > > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages
> > > > > off the LRU and out of consideration for reclaim.
> > > > >
> > > > > Potentially useful, as well, in real-time environments to force
> > > > > prefaulting and residency for applications that don't mlock themselves.
> >
> > This is a bit scary to me. Privilege and mode inheritance across
> > processes is the root of many nasty surprises, security and otherwise.
>
> Hi, Matt:
>
> Could you explain more about this issue? I believe that the patch
> doesn't provide any privileges or capabilities that a process doesn't
> already have. It just allows one to cause a process and, optionally,
> its descendants, to behave as if one had modified the sources to invoke
> mlockall() directly. It is still subject to each individual task's
> resource limits. At least, that was my intent.

Again, it's about inheriting *something* across processes. This
historically creates surprises. When the thing being inherited is a
privilege, the surprises are security holes. I can't tell you what the
surprises are, only that I expect there to be some.

mlockall is not a privilege in itself, but it is a non-standard mode of
operation that most processes are not designed for or expecting, and it
does relate to the handling of a finite resource with system stability
implications. If the thing that turns on this mode is buried inside some
poorly written app that decides it needs to mlockall somewhere, the
'surprise' can occur far away - in the child of a child of a thread an
hour later. And the surprise can be fatal - all memory eaten up, because
our process just happened to run without an rlimit.

I've seen this with RT. An app temporarily elevates itself to RT, and
unrelatedly forks another process. The second short-lived process kicks
off a daemon that sometime later consumes all CPU in a busy loop (in
this case waiting on I/O that never happens because everything else is
shut out).

Doing it as a container parameter means that you explicitly recognize
that 'everything in the container' gets this mode. And you've probably
also given a thought to 'how big is this container' and the like as
well.

> As far as "what I'm trying to do": I see this as exactly like using
> taskset to set the cpu affinity of a task, or numactl to set the task
> mempolicy without modifying the source.

Oh sure, I completely get that and I think it's a useful notion. And I
think the above analogous interfaces have more or less the same issues,
except that mlockall is a much older and more widely used API.
Containers are a better match here too, but the above predate
containers.

> If one had access to the
> program source and wouldn't, e.g., void a support contract by modifying
> it, one could just insert the calls into the source itself.

Huh? We would of course set up a container 'from the outside'. By
comparison, your mlockall() call traditionally operates from 'from the
inside', and you're proposing to add a flag and a helper program that
makes it work 'from the outside' too. Which is a bit hackish.

--
Mathematics is the supreme nostalgia of our time.

2009-06-05 04:40:47

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Wed, 2008-12-03 at 14:04 -0500, Lee Schermerhorn wrote:

> Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):

FWIW, I really liked this patch series. And I think there is still value
in a generic "mlock" wrapper utility that I can use. Sure, the later on
containers suggestions are all wonderful in theory but I don't see that
that went anywhere either (and I disagree that we can't trust people to
use this right without doing silly things) - if I'm really right that
this got dropped on the floor, can we resurrect it in .31 please?

Jon.

2009-06-05 04:50:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

> On Wed, 2008-12-03 at 14:04 -0500, Lee Schermerhorn wrote:
>
> > Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):
>
> FWIW, I really liked this patch series. And I think there is still value
> in a generic "mlock" wrapper utility that I can use. Sure, the later on
> containers suggestions are all wonderful in theory but I don't see that
> that went anywhere either (and I disagree that we can't trust people to
> use this right without doing silly things) - if I'm really right that
> this got dropped on the floor, can we resurrect it in .31 please?

I guess Lee is really really busy now.
Can you make V3 patch instead?



2009-06-05 05:13:43

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2

On Fri, 2009-06-05 at 13:49 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2008-12-03 at 14:04 -0500, Lee Schermerhorn wrote:
> >
> > > Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE):
> >
> > FWIW, I really liked this patch series. And I think there is still value
> > in a generic "mlock" wrapper utility that I can use. Sure, the later on
> > containers suggestions are all wonderful in theory but I don't see that
> > that went anywhere either (and I disagree that we can't trust people to
> > use this right without doing silly things) - if I'm really right that
> > this got dropped on the floor, can we resurrect it in .31 please?
>
> I guess Lee is really really busy now.

Who isn't? :)

> Can you make V3 patch instead?

I'm happy to rebase onto a recent kernel and repost if it's not
something that's instantly going to get dropped on the floor. I thought
about this patch series a few minutes ago when I found myself
recompiling a certain piece of audio software and realized there's no
reason I shouldn't just be able to e.g. just do the following:

mlock --all -- pulseaudio --start --high-priority=1

As a test of my sanity in this case, but there are other times when I'm
running software on RT kernels and would love to have that as a wrapper
to temporarily prevent a performance hit.

Jon.