Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753395AbYLHVG2 (ORCPT ); Mon, 8 Dec 2008 16:06:28 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752041AbYLHVGT (ORCPT ); Mon, 8 Dec 2008 16:06:19 -0500 Received: from g5t0009.atlanta.hp.com ([15.192.0.46]:10435 "EHLO g5t0009.atlanta.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751361AbYLHVGR (ORCPT ); Mon, 8 Dec 2008 16:06:17 -0500 Subject: Re: [PATCH] - support inheritance of mlocks across fork/exec V2 From: Lee Schermerhorn To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel , riel@redhat.com, hugh@veritas.com, kosaki.motohiro@jp.fujitsu.com, linux-api@vger.kernel.org In-Reply-To: <20081206220729.042a926e.akpm@linux-foundation.org> References: <1227561707.6937.61.camel@lts-notebook> <20081125152651.b4c3c18f.akpm@linux-foundation.org> <1228331069.6693.73.camel@lts-notebook> <20081206220729.042a926e.akpm@linux-foundation.org> Content-Type: text/plain Organization: HP/OSLO Date: Mon, 08 Dec 2008 16:05:37 -0500 Message-Id: <1228770337.31442.44.camel@lts-notebook> Mime-Version: 1.0 X-Mailer: Evolution 2.22.3.1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11593 Lines: 292 Sorry for the resend. I had the 'linux-api' address wrong on my reponse to Andrew. Want to include the api list in this exchange. On Sat, 2008-12-06 at 22:07 -0800, Andrew Morton wrote: > (cc linux-abi. Again.) > > On Wed, 03 Dec 2008 14:04:29 -0500 Lee Schermerhorn wrote: > > > > > > > Against; 2.6.28-rc7-mmotm-081203 > > > > V02: rework vetting of flags argument as suggested by > > Kosaki Motohiro. > > enhance description as requested by Andrew Morton. > > > > Add support for mlockall(MCL_INHERIT|MCL_RECURSIVE): > > > > MCL_CURRENT[|MCL_FUTURE]|MCL_INHERIT - inherit memory locks > > [vmas' VM_LOCKED flags] across fork(), and inherit > > MCL_FUTURE behavior [mm's def_flags] across fork() > > and exec(). Behaves as if child and/or new task > > called mlockall(MCL_CURRENT|MCL_FUTURE) as first > > instruction. > > > > MCL_RECURSIVE - inherit MCL_CURRENT|MCL_FUTURE|MCL_INHERIT > > [vmas' VM_LOCKED flags for fork() and mm's def_flags > > and mcl_inherit across fork() and exec()] for all > > future generations of calling process's descendants. > > Behaves as if child and/or new task called > > mlockall(MCL_CURRENT|MCL_FUTURE|MCL_INHERIT|MCL_RECURSIVE) > > as the first instruction. > > > > In support of a "lock prefix command"--e.g., mlock ... > > Analogous to taskset(1) for cpu affinity or numactl(8) for numa memory > > policy. > > > > Together with patches to keep mlocked pages off the LRU, this will > > allow users/admins to lock down applications without modifying them, > > if their RLIMIT_MEMLOCK is sufficiently large, keeping their pages > > off the LRU and out of consideration for reclaim. > > > > Potentially useful, as well, in real-time environments to force > > prefaulting and residency for applications that don't mlock themselves. > > > > Jeff Sharkey at Montana State developed a similar patch for Linux > > [link no longer accessible], but apparently he never submitted the patch. > > > > I submitted an earlier version of this patch around a year ago. I > > resurrected it to test the unevictable lru/mlocked pages patches-- > > e.g., by "mlock -r make -j all". This did shake out a few > > races and vmstat accounting bugs, but NOT something I'd recommend as > > general practice--for kernel builds, that is. > > > > ---- > > > > Define MCL_INHERIT, MCL_RECURSIVE in . > > + x86 and ia64 versions included. > > + other arch can/will be created, if this patch deemed merge-worthy. > > > > Similarly, I'll provide kernel man page update if/when needed. > > > > Example "lock prefix command" in Documentation/vm/mlock.c > > > > > > ... > > > > --- linux-2.6.28-rc7-mmotm-081203.orig/mm/mlock.c 2008-12-03 10:33:11.000000000 -0500 > > +++ linux-2.6.28-rc7-mmotm-081203/mm/mlock.c 2008-12-03 10:33:29.000000000 -0500 > > @@ -573,15 +573,18 @@ asmlinkage long sys_munlock(unsigned lon > > static int do_mlockall(int flags) > > { > > struct vm_area_struct * vma, * prev = NULL; > > + struct mm_struct *mm = current->mm; > > unsigned int def_flags = 0; > > > > if (flags & MCL_FUTURE) > > - def_flags = VM_LOCKED; > > - current->mm->def_flags = def_flags; > > - if (flags == MCL_FUTURE) > > + def_flags = VM_LOCKED;; > > You get paid by the semicolon? Nope. Interestingly, ckeckpatch was fine with this. [It did warn about >80 char line in x86 mman.h, but there was already a much longer line, so I left it. Fixed now so that patch is squeaky clean, as far as checkpatch is concerned.] > > > + mm->def_flags = def_flags; > > + if (flags & MCL_INHERIT) > > + mm->mcl_inherit = flags & (MCL_INHERIT | MCL_RECURSIVE); > > aaaaaaarrrrrrggggggghhhhhhh! > > So mm->mcl_inherit is _not_ a boolean meaning "this mm has MCL_INHERIT > set", as I naively believed when I saw it. Is it in fact a composite > of both MCL_INHERIT and MCL_RECURSIVE, given a badly misleading name > and a waffly comment. > > Can we please make this clearer? How about "mcl_inherit_mlocks_possibly_recursively" ? Seriously, it made/makes sense to me. If mcl_inherit is non-zero, it WILL contain MCL_INHERIT, meaning that existing mlocks and any VM_LOCKED in def_flags will be inherited across the next fork or exec [def_flags only]. It may also contain MCL_RECURSIVE--tested separately--meaning that the contents of mcl_inherit, itself, will also be inherited across the next and future forks/execs, until/unless a descendant invokes mlockall() w/o MCL_INHERIT or MCL_RECURSIVE to clear out mcl_inherit. This seemed a bit much to try to explain on line where mcl_inherit is declared in mm_types. More appropriate to a man page, I would have thought. How about a Documentation/vm doc? Meanwhile, perhaps: "mcl_inherit" -> "mcl_inherit_flags" or "mcl_flags" or "mlock_flags", or such? > > > > > ... > > > > --- linux-2.6.28-rc7-mmotm-081203.orig/kernel/fork.c 2008-12-03 10:18:15.000000000 -0500 > > +++ linux-2.6.28-rc7-mmotm-081203/kernel/fork.c 2008-12-03 10:33:29.000000000 -0500 > > @@ -278,7 +278,8 @@ static int dup_mmap(struct mm_struct *mm > > */ > > down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING); > > > > - mm->locked_vm = 0; > > + if (!mm->mcl_inherit) > > + mm->locked_vm = 0; > > mm->mmap = NULL; > > mm->mmap_cache = NULL; > > mm->free_area_cache = oldmm->mmap_base; > > @@ -316,7 +317,8 @@ static int dup_mmap(struct mm_struct *mm > > if (IS_ERR(pol)) > > goto fail_nomem_policy; > > vma_set_policy(tmp, pol); > > - tmp->vm_flags &= ~VM_LOCKED; > > + if (!mm->mcl_inherit) > > + tmp->vm_flags &= ~VM_LOCKED; > > tmp->vm_mm = mm; > > tmp->vm_next = NULL; > > anon_vma_link(tmp); > > @@ -406,6 +408,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLO > > > > static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) > > { > > + unsigned long def_flags = 0; > > + > > atomic_set(&mm->mm_users, 1); > > atomic_set(&mm->mm_count, 1); > > init_rwsem(&mm->mmap_sem); > > @@ -422,9 +426,14 @@ static struct mm_struct * mm_init(struct > > mm->free_area_cache = TASK_UNMAPPED_BASE; > > mm->cached_hole_size = ~0UL; > > mm_init_owner(mm, p); > > + if (current->mm && current->mm->mcl_inherit) { > > + def_flags = current->mm->def_flags & VM_LOCKED; > > + if (mm->mcl_inherit & MCL_RECURSIVE) > > + mm->mcl_inherit = current->mm->mcl_inherit; > > + } > > hm. That code looks familiar. Not worth a helper function? If we could remove the copy in load_elf_binary(), point would be moot. More below. > > > if (likely(!mm_alloc_pgd(mm))) { > > - mm->def_flags = 0; > > + mm->def_flags = def_flags; > > mmu_notifier_mm_init(mm); > > return mm; > > } > > Index: linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c > > =================================================================== > > --- linux-2.6.28-rc7-mmotm-081203.orig/fs/binfmt_elf.c 2008-12-03 10:19:21.000000000 -0500 > > +++ linux-2.6.28-rc7-mmotm-081203/fs/binfmt_elf.c 2008-12-03 10:33:29.000000000 -0500 > > @@ -585,6 +585,7 @@ static int load_elf_binary(struct linux_ > > unsigned long reloc_func_desc = 0; > > int executable_stack = EXSTACK_DEFAULT; > > unsigned long def_flags = 0; > > + int mcl_inherit = 0; > > struct { > > struct elfhdr elf_ex; > > struct elfhdr interp_elf_ex; > > @@ -749,6 +750,13 @@ static int load_elf_binary(struct linux_ > > SET_PERSONALITY(loc->elf_ex); > > } > > > > + /* Optionally inherit MCL_FUTURE state before destroying old mm */ > > + if (current->mm && current->mm->mcl_inherit) { > > umm, OK, it looks like current->mm can be null here. Probably not here. IIRC, it can be NULL [for init] up in mm_init() and I probably just cut and pasted, instead of a helper function... > > > + def_flags = current->mm->def_flags & VM_LOCKED; > > ooh, we finally used local variable def_flags for something. That's > been wasting space since 2.2.26 (or earlier). def_flags was being used to clear out the current->mm->def_flags. Don't know why, tho'. See below... > > > + if (current->mm->mcl_inherit & MCL_RECURSIVE) > > + mcl_inherit = current->mm->mcl_inherit; > > + } > > + > > /* Flush all traces of the currently running executable */ > > retval = flush_old_exec(bprm); > > if (retval) > > @@ -757,6 +765,7 @@ static int load_elf_binary(struct linux_ > > /* OK, This is the point of no return */ > > current->flags &= ~PF_FORKNOEXEC; > > current->mm->def_flags = def_flags; > > + current->mm->mcl_inherit = mcl_inherit; > > > > /* Do this immediately, since STACK_TOP as used in setup_arg_pages > > may depend on the personality. */ > > Why is this done in binfmt_elf.c? Doesn't this preclude the operation > of these new flags for other binary formats? Because binfmt_elf.c is the only binary format loader [that I can see] that takes it upon itself to reinitialize mm->def_flags, which has already been set up in mm_init(). I wondered, myself, why this is the case, but I didn't dare just remove it, lest it reintroduce some obscure bug. So, I duplicated my version of def_flags [and mcl_inherit] initialization here. Now that you've raised the question: why, indeed, does the elf format loader reinit def_flags? Can we just remove this? > > > Index: linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h > > =================================================================== > > --- linux-2.6.28-rc7-mmotm-081203.orig/arch/x86/include/asm/mman.h 2008-12-03 10:16:26.000000000 -0500 > > +++ linux-2.6.28-rc7-mmotm-081203/arch/x86/include/asm/mman.h 2008-12-03 10:33:29.000000000 -0500 > > @@ -16,5 +16,8 @@ > > > > #define MCL_CURRENT 1 /* lock all current mappings */ > > #define MCL_FUTURE 2 /* lock all future mappings */ > > +#define MCL_INHERIT 4 /* inherit mlocks across fork */ > > + /* inherit '_FUTURE flag across fork/exec */ > > +#define MCL_RECURSIVE 8 /* inherit mlocks recursively */ > > > > #endif /* _ASM_X86_MMAN_H */ > > Index: linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h > > =================================================================== > > --- linux-2.6.28-rc7-mmotm-081203.orig/include/linux/mm_types.h 2008-12-03 10:18:01.000000000 -0500 > > +++ linux-2.6.28-rc7-mmotm-081203/include/linux/mm_types.h 2008-12-03 10:33:29.000000000 -0500 > > @@ -235,6 +235,8 @@ struct mm_struct { > > unsigned int token_priority; > > unsigned int last_interval; > > > > + int mcl_inherit; /* inherit current/future locks */ > > + > > unsigned long flags; /* Must use atomic bitops to access the bits */ > > I was thinking that the boolean mcl_inherit could become a bit in > mm.flags. That was before I unmisled myself about it. > > Perhaps we could use two bits. That might get a bit nasty. I did look for someplace to add these flags, before deciding on a new member. The "Must use atomic bitops", and the fact that mm->flags seems used soley for controlling core dumping [and defined in sched.h, no less!], put me off from going that route. Similarly, because def_flags is an initializer for the vma's vm_flags, I didn't want to pollute that. I figured the 'int' would fill a padding slot on 64-bit archs, but does add to mm size for 32-bit archs. Suggestions? I'll hold off on a re-spin, pending resolution of some of these issues. Thanks for the thorough review. Lee -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/