Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755086AbZKEFkc (ORCPT ); Thu, 5 Nov 2009 00:40:32 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753042AbZKEFkb (ORCPT ); Thu, 5 Nov 2009 00:40:31 -0500 Received: from e32.co.us.ibm.com ([32.97.110.150]:49213 "EHLO e32.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751956AbZKEFka (ORCPT ); Thu, 5 Nov 2009 00:40:30 -0500 Date: Wed, 4 Nov 2009 21:41:24 -0800 From: Sukadev Bhattiprolu To: linux-kernel@vger.kernel.org Cc: arnd@arndb.de, Containers , "Eric W. Biederman" , hpa@zytor.com, Alexey Dobriyan , roland@redhat.com, Pavel Emelyanov Subject: [v11][PATCH 8/9] Define clone_with_pids() syscall Message-ID: <20091105054124.GH16142@us.ibm.com> References: <20091105053053.GA11289@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20091105053053.GA11289@us.ibm.com> X-Operating-System: Linux 2.0.32 on an i486 User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 13993 Lines: 418 From: Sukadev Bhattiprolu Subject: [v11][PATCH 8/9] Define clone_with_pids() syscall Container restart requires that a task have the same pid it had when it was checkpointed. When containers are nested the tasks within the containers exist in multiple pid namespaces and hence have multiple pids to specify during restart. clone_with_pids(), intended for use during restart, is the same as clone(), except that it takes a 'pids' paramter. This parameter lets caller choose specific pid numbers for the child process, in the process's active and ancestor pid namespaces. (Descendant pid namespaces in general don't matter since processes don't have pids in them anyway, but see comments in copy_target_pids() regarding CLONE_NEWPID). clone_with_pids() also attempts to address a second limitation of the clone() system call. clone() is restricted to 32 clone flags and all but one of these are in use. If more new clone flags are needed, we will be forced to define a new variant of the clone() system call. To address this, clone_with_pids() allows at least 64 clone flags with some room for more if necessary. To prevent unprivileged processes from misusing this interface, clone_with_pids() currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL. See Documentation/clone_with_pids in next patch for more details and an example of its usage. NOTE: - System calls are restricted to 6 parameters and the number and sizes of parameters needed for clone_with_pids() exceed 6 integers. The new prototype works around this restriction while providing some flexibility if clone_with_pids() needs to be further extended in the future. TODO: - We should convert clone-flags to 64-bit value in all architectures. Its probably best to do that as a separate patchset since clone_flags touches several functions and that patchset seems independent of this new system call. Changelog[v11]: - [Dave Hansen] Move clone_args validation checks to arch-indpeendent code. - [Oren Laadan] Make args_size a parameter to system call and remove it from 'struct clone_args' Changelog[v10]: - Rename clone3() to clone_with_pids() - [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall implementation Changelog[v9]: - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit architectures split the new clone-flags into 'low' and 'high' words and pass in the 'lower' flags as the first argument. This would maintain similarity of the clone3() with clone()/ clone2(). Also has the side-effect of the name matching the number of parameters :-) - [Roland McGrath] Rename structure to 'clone_args' and add a 'child_stack_size' field Changelog[v8] - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg' must be 64-bit. - clone2() is in use in IA64. Rename system call to clone3(). Changelog[v7]: - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2() and group parameters into a new 'struct clone_struct' object. Changelog[v6]: - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds) Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set' is constant across architectures. - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove 'unum_pids < 0' check. Changelog[v4]: - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set' Changelog[v3]: - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid in the target_pids[] list and setting it 0. See copy_target_pids()). - (Oren Laadan) Specified target pids should apply only to youngest pid-namespaces (see copy_target_pids()) - (Matt Helsley) Update patch description. Changelog[v2]: - Remove unnecessary printk and add a note to callers of copy_target_pids() to free target_pids. - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and 'num_pids == 0' (fall back to normal clone()). - Move arch-independent code (sanity checks and copy-in of target-pids) into kernel/fork.c and simplify sys_clone_with_pids() Changelog[v1]: - Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them) Signed-off-by: Sukadev Bhattiprolu --- arch/x86/include/asm/syscalls.h | 1 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/kernel/entry_32.S | 1 + arch/x86/kernel/process_32.c | 49 ++++++++++++++ arch/x86/kernel/syscall_table_32.S | 1 + include/linux/sched.h | 2 + include/linux/types.h | 11 +++ kernel/fork.c | 124 +++++++++++++++++++++++++++++++++++- 8 files changed, 189 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 372b76e..3205f55 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *); /* kernel/process_32.c */ int sys_clone(struct pt_regs *); +int sys_clone_with_pids(struct pt_regs *); int sys_execve(struct pt_regs *); /* kernel/signal.c */ diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6fb3c20..a427571 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -342,6 +342,7 @@ #define __NR_pwritev 334 #define __NR_rt_tgsigqueueinfo 335 #define __NR_perf_event_open 336 +#define __NR_clone_with_pids 337 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index c097e7d..c7bd1f6 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -718,6 +718,7 @@ ptregs_##name: \ PTREGSCALL(iopl) PTREGSCALL(fork) PTREGSCALL(clone) +PTREGSCALL(clone_with_pids) PTREGSCALL(vfork) PTREGSCALL(execve) PTREGSCALL(sigaltstack) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 4cf7956..55dd051 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -445,6 +445,55 @@ int sys_clone(struct pt_regs *regs) return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); } +int sys_clone_with_pids(struct pt_regs *regs) +{ + int rc; + int args_size; + struct clone_args kca; + unsigned long flags; + int __user *parent_tid_ptr; + int __user *child_tid_ptr; + unsigned long __user child_stack; + unsigned long stack_size; + unsigned int flags_low; + struct clone_args __user *uca; + pid_t __user *pids; + + flags_low = regs->bx; + uca = (struct clone_args __user *)regs->cx; + args_size = regs->dx; + pids = (pid_t __user *)regs->di; + + rc = fetch_clone_args_from_user(uca, args_size, &kca); + if (rc) + return rc; + + /* + * TODO: Convert 'clone-flags' to 64-bits on all architectures. + * TODO: When ->clone_flags_high is non-zero, copy it in to the + * higher word(s) of 'flags': + * + * flags = (kca.clone_flags_high << 32) | flags_low; + */ + flags = flags_low; + parent_tid_ptr = (int *)kca.parent_tid_ptr; + child_tid_ptr = (int *)kca.child_tid_ptr; + + stack_size = (unsigned long)kca.child_stack_size; + child_stack = (unsigned long)kca.child_stack_base + stack_size; + + if (!child_stack) + child_stack = regs->sp; + + /* + * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value + * to several functions. Need to convert clone_flags to 64-bit. + */ + return do_fork_with_pids(flags, child_stack, regs, stack_size, + parent_tid_ptr, child_tid_ptr, kca.nr_pids, + pids); +} + /* * sys_execve() executes a new program. */ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index 0157cd2..1357e83 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -336,3 +336,4 @@ ENTRY(sys_call_table) .long sys_pwritev .long sys_rt_tgsigqueueinfo /* 335 */ .long sys_perf_event_open + .long ptregs_clone_with_pids diff --git a/include/linux/sched.h b/include/linux/sched.h index 265efe5..0095ba0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2153,6 +2153,8 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern int fetch_clone_args_from_user(struct clone_args __user *, int, + struct clone_args *); extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, unsigned int, pid_t __user *); diff --git a/include/linux/types.h b/include/linux/types.h index c42724f..0d08683 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -204,6 +204,17 @@ struct ustat { char f_fpack[6]; }; +struct clone_args { + u64 clone_flags_high; + u64 child_stack_base; + u64 child_stack_size; + u64 parent_tid_ptr; + u64 child_tid_ptr; + u32 nr_pids; + u32 reserved0; + u64 reserved1; +}; + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ #endif /* _LINUX_TYPES_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 210e841..da28baf 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu) } /* + * If user specified any 'target-pids' in @upid_setp, copy them from + * user and return a pointer to a local copy of the list of pids. The + * caller must free the list, when they are done using it. + * + * If user did not specify any target pids, return NULL (caller should + * treat this like normal clone). + * + * On any errors, return the error code + */ +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids) +{ + int j; + int rc; + int size; + int knum_pids; /* # of pids needed in kernel */ + pid_t *target_pids; + + if (!unum_pids) + return NULL; + + knum_pids = task_pid(current)->level + 1; + if (unum_pids > knum_pids) + return ERR_PTR(-EINVAL); + + /* + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[] + * and set it to 0. This last entry in target_pids[] corresponds to the + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was + * specified. If CLONE_NEWPID was not specified, this last entry will + * simply be ignored. + */ + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL); + if (!target_pids) + return ERR_PTR(-ENOMEM); + + /* + * A process running in a level 2 pid namespace has three pid namespaces + * and hence three pid numbers. If this process is checkpointed, + * information about these three namespaces are saved. We refer to these + * namespaces as 'known namespaces'. + * + * If this checkpointed process is however restarted in a level 3 pid + * namespace, the restarted process has an extra ancestor pid namespace + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'. + * + * During restart, the process requests specific pids for its 'known + * namespaces' and lets kernel assign pids to its 'unknown namespaces'. + * + * Since the requested-pids correspond to 'known namespaces' and since + * 'known-namespaces' are younger than (i.e descendants of) 'unknown- + * namespaces', copy requested pids to the back-end of target_pids[] + * (i.e before the last entry for CLONE_NEWPID mentioned above). + * Any entries in target_pids[] not corresponding to a requested pid + * will be set to zero and kernel assigns a pid in those namespaces. + * + * NOTE: The order of pids in target_pids[] is oldest pid namespace to + * youngest (target_pids[0] corresponds to init_pid_ns). i.e. + * the order is: + * + * - pids for 'unknown-namespaces' (if any) + * - pids for 'known-namespaces' (requested pids) + * - 0 in the last entry (for CLONE_NEWPID). + */ + j = knum_pids - unum_pids; + size = unum_pids * sizeof(pid_t); + + rc = copy_from_user(&target_pids[j], upids, size); + if (rc) { + rc = -EFAULT; + goto out_free; + } + + return target_pids; + +out_free: + kfree(target_pids); + return ERR_PTR(rc); +} + +int +fetch_clone_args_from_user(struct clone_args __user *uca, int args_size, + struct clone_args *kca) +{ + int rc; + + /* + * TODO: If size of clone_args is not what the kernel expects, it + * could be that kernel is newer and has an extended structure. + * When that happens, this check needs to be smarter. For now, + * assume exact match. + */ + if (args_size != sizeof(struct clone_args)) + return -EINVAL; + + rc = copy_from_user(kca, uca, args_size); + if (rc) + return -EFAULT; + + /* + * To avoid future compatibility issues, ensure unused fields are 0. + */ + if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high) + return -EINVAL; + + return 0; +} + +/* * Ok, this is the main fork-routine. * * It copies the process, and if successful kick-starts @@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; + pid_t *target_pids; /* * Do some preliminary argument and permissions checking before we @@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags, } } + target_pids = copy_target_pids(num_pids, upids); + if (target_pids) { + if (IS_ERR(target_pids)) + return PTR_ERR(target_pids); + + nr = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto out_free; + } + /* * When called from kernel_thread, don't do user tracing stuff. */ @@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags, } else { nr = PTR_ERR(p); } + +out_free: + kfree(target_pids); + return nr; } -- 1.6.0.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/