Date: Tue, 14 Apr 2009 23:59:09 +0400
From: Alexey Dobriyan
To: Oren Laadan
Cc: containers@lists.osdl.org, Dave Hansen, "Serge E. Hallyn", Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar
Subject: Re: Creating tasks on restart: userspace vs kernel
Message-ID: <20090414195909.GA28353@x200.localdomain>
References: <49E40662.2040508@cs.columbia.edu> <20090414163633.GE27461@x200.localdomain> <49E4D89D.9060903@cs.columbia.edu>
In-Reply-To: <49E4D89D.9060903@cs.columbia.edu>

> > In the end correctness of chopping will be equal to how well the user
> > understands that two task_struct's are independent of each other.
> >
> >> But it will still be a useful tool for many use cases, like batch cpu jobs,
> >> some servers, vnc sessions (if you want graphics) etc. Imagine you run
> >> 'octave' for a week and must reboot now - 'octave' wouldn't care if
> >> you checkpointed it and then restarted it with a different pid!
> >>
> >> <3> Clone with pid:
> >>
> >> To restart processes from userspace, there needs to be a way to
> >> request a specific pid--in the current pid_ns--for the child process
> >> (clearly, if it isn't in use).
> >>
> >> Why is it a disadvantage? To Linus, a syscall clone_with_pid()
> >> "sounds like a _wonderful_ attack vector against badly written
> >> user-land software...". Actually, getting a specific pid is possible
> >> without this syscall. But the point is that it's undesirable to have
> >> this functionality unrestricted.
> >>
> >> So one option is to require root privileges. Another option is to
> >> restrict such action to a pid_ns created by the same user. Even more so,
> >> restrict it to only containers that are being restarted.
> >
> > You want to do a small part in userspace and consequently end up with hacks
> > both userspace-visible and in-kernel.
>
> I want to extend the existing kernel interface to leverage fork/clone
> from user space, AND to allow the flexibility mentioned above (which
> you conveniently ignored).
>
> All hacks are in-kernel, aren't they?

mktree.c can be viewed as a hack, why not?

The whole existence of these requirements. You want a new syscall or
SET_NEXT_PID or a /proc file or something.

> As for asking for a specific pid from user space, it can be done by:
> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
> * setting a special /proc/PID/next_id file which is consulted by fork

/proc/*/next_id was discussed and hopefully died, but no.

> and in all cases, limit this so it can only be allowed in a restarting
> container, under the proper security model (again, e.g., Serge's
> suggestion).
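
To make the restriction quoted above concrete, here is a rough userspace model of the check it describes: a specific pid may be requested only in a restarting container, only by the namespace owner or a CAP_SYS_ADMIN holder, and only if the pid is free. Everything in it (struct pidns_model, struct requester_model, may_request_pid(), the 64-entry pid table) is made up for illustration and is not kernel code or any proposed interface.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_PID 64	/* toy limit, chosen only for this sketch */

/* Hypothetical model of a pid namespace; NOT the kernel's struct pid_namespace. */
struct pidns_model {
	unsigned int owner_uid;			/* user who created the namespace */
	bool restarting;			/* container is being restarted */
	bool pid_in_use[MODEL_MAX_PID];		/* toy "is this pid taken" table */
};

/* Hypothetical model of the task asking for a specific pid. */
struct requester_model {
	unsigned int uid;
	bool cap_sys_admin;
};

/*
 * Returns true only if all of the quoted conditions hold:
 *  - the container is in the middle of a restart,
 *  - the requester owns the namespace or has CAP_SYS_ADMIN,
 *  - the requested pid is valid and not already in use.
 */
static bool may_request_pid(const struct requester_model *req,
			    const struct pidns_model *ns, int pid)
{
	assert(req && ns);

	if (pid <= 0 || pid >= MODEL_MAX_PID)
		return false;
	if (!ns->restarting)
		return false;
	if (!req->cap_sys_admin && req->uid != ns->owner_uid)
		return false;
	return !ns->pid_in_use[pid];
}
```

Note that even in this toy form the check is one more piece of state ("restarting") and one more policy decision that exists only because the pid is chosen from outside.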
>
> >
> > Pids aren't special, they are struct pid, dynamically allocated and
> > refcounted just like any other structures.
> >
> > They _become_ special for your intended method of restart.
>
> They are special. And I allow them not to be restored, as well, if
> the use case so wishes.

The use case is to restore as much as possible, to a state as equal to the
original as possible. Not going with fork_with_pid() in any form helps the
kernel to ensure correctness of restore and helps to avoid surprise failure
modes from the user's POV.

> > You also have flags in nsproxy image (or where?) like "do clone with
> > CLONE_NEWUTS".
>
> Nope. Read the code.

Which code?

    static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t)
    {
        ...
        new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns, &hh->uts_ref, CR_OBJ_UTSNS, 0);
        if (new_uts < 0) {
            ret = new_uts;
            goto out;
        }

        hh->flags = 0;
        if (new_uts)
    ===>        hh->flags |= CLONE_NEWUTS;

        ret = cr_write_obj(ctx, &h, hh);
        ...

> > This is unneeded!
> >
> > nsproxy (or task_struct) image has a reference (objref/position) to the uts_ns image.
> >
> > On restart, one looks up the object by reference or restores it if needed,
> > takes a refcount and glues. Just like with every other two structures.
>
> That's exactly how it's done.

Not for uts_ns and future namespaces.

    ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
                                             ^^^^^^^^^
                                             comes from disk

> > No "what to do, what to do" logic.
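
For illustration, the "look up by reference, or restore if needed, take a refcount and glue" scheme can be modelled in userspace roughly as below. This is only a sketch under my own assumptions: struct obj, struct restart_ctx, obj_lookup_or_restore() and the fixed-size table are hypothetical names, not the cr_* code from the patchset, and a real restore would read the object's image instead of just allocating a structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OBJS 16	/* toy table size, chosen only for this sketch */

/* Hypothetical stand-in for any refcounted object (uts_ns, struct pid, ...). */
struct obj {
	int refcount;
	int objref;	/* reference (position) of the object's image */
};

/* Hypothetical restart context: objects restored so far, indexed by objref. */
struct restart_ctx {
	struct obj *table[MAX_OBJS];
	int restored;	/* how many objects were actually rebuilt */
};

/* Pretend to rebuild an object from its image. */
static struct obj *restore_obj(struct restart_ctx *ctx, int objref)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return NULL;
	o->objref = objref;
	o->refcount = 1;	/* reference held by the table itself */
	ctx->restored++;
	return o;
}

/*
 * Look the object up by reference; restore it on first use.
 * The caller gets its own reference either way, so no flag in the
 * image has to say "create a new one" versus "share the existing one".
 */
static struct obj *obj_lookup_or_restore(struct restart_ctx *ctx, int objref)
{
	if (objref < 0 || objref >= MAX_OBJS)
		return NULL;
	if (!ctx->table[objref]) {
		ctx->table[objref] = restore_obj(ctx, objref);
		if (!ctx->table[objref])
			return NULL;
	}
	ctx->table[objref]->refcount++;
	return ctx->table[objref];
}

/* Drop one reference; free the object when the last one goes away. */
static void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		free(o);
}

/* Drop the table's own references once restart is finished. */
static void restart_ctx_release(struct restart_ctx *ctx)
{
	for (size_t i = 0; i < MAX_OBJS; i++) {
		if (ctx->table[i]) {
			obj_put(ctx->table[i]);
			ctx->table[i] = NULL;
		}
	}
}
```

Two tasks whose images carry the same objref end up sharing one object, and the decision to create it falls out of the lookup miss rather than out of a flag read back from disk.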