From: Christian Brauner
To: torvalds@linux-foundation.org, viro@zeniv.linux.org.uk, jannh@google.com,
    dhowells@redhat.com, oleg@redhat.com, linux-api@vger.kernel.org,
    linux-kernel@vger.kernel.org
Cc: serge@hallyn.com, luto@kernel.org, arnd@arndb.de, ebiederm@xmission.com,
    keescook@chromium.org, tglx@linutronix.de, mtk.manpages@gmail.com,
    akpm@linux-foundation.org, cyphar@cyphar.com, joel@joelfernandes.org,
    dancol@google.com, Christian Brauner
Subject: [PATCH v2 2/5] clone: add CLONE_PIDFD
Date: Thu, 18 Apr 2019 12:18:38 +0200
Message-Id: <20190418101841.4476-3-christian@brauner.io>
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>
References: <20190418101841.4476-1-christian@brauner.io>
X-Mailing-List: linux-kernel@vger.kernel.org

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested implementing this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one flag bit left for clone().

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the
new mount API. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd can refer not only to a tgid but also to a tid or, in theory
(given appropriate flag arguments in the relevant syscalls), to a
process group or session. A pidfd does not represent a privilege. This
does not imply that it can never do so, but for now this is not the
case.

A pidfd comes with additional information in fdinfo if the kernel
supports procfs. The fdinfo file contains the pid of the process in the
caller's pid namespace in the same format as the procfs status file,
i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
fourth argument of clone. This has the advantage that we can give back
the associated pid and the pidfd at the same time.

To address concerns about losing access to process metadata, this
patchset comes with a sample program that illustrates how a combination
of CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
access to process metadata through /proc/. The sample program can
easily be translated into a helper that would be suitable for inclusion
in libc so that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds
Signed-off-by: Christian Brauner
Signed-off-by: Jann Horn
Cc: Arnd Bergmann
Cc: "Eric W. Biederman"
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: David Howells
Cc: "Michael Kerrisk (man-pages)"
Cc: Andy Lutomirski
Cc: Andrew Morton
Cc: Oleg Nesterov
Cc: Aleksa Sarai
Cc: Linus Torvalds
Cc: Al Viro
---
/* changelog */
v1:
- Oleg Nesterov:
  - return pidfd in fourth argument of clone
    This way we can return the pid and the pidfd at the same time to
    the caller and can also start pid file descriptor numbering at 0,
    as is customary for file descriptors.
- Christian Brauner:
  - update comments to reflect changes based on Oleg's idea
v2:
- Oleg Nesterov:
  - move put_user() before clone()'s point of no return so we can
    handle put_user() errors
- Christian Brauner:
  - change pidfd_create() to also fd_install()
    With Oleg's change it makes sense to do the fd_install() right
    before the moved put_user().
---
 include/linux/pid.h        |  2 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 96 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..ed4ee170bee2 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..201aafdac727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include
 #include
 #include
 #include
@@ -21,8 +22,10 @@
 #include
 #include
 #include
+#include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1662,6 +1665,58 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ *         On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+	int fd;
+
+	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid),
+			      O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		put_pid(pid);
+
+	return fd;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
 					unsigned long clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
+					int __user *parent_tidptr,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
 					int node)
 {
-	int retval;
+	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
 
@@ -1730,6 +1786,19 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/* Pidfds will be returned through parent_tidptr. */
+	if ((clone_flags & (CLONE_PIDFD | CLONE_PARENT_SETTID)) ==
+	    (CLONE_PIDFD | CLONE_PARENT_SETTID))
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * Ensure that we can potentially reuse CLONE_DETACHED for
+	 * CLONE_PIDFD in the future.
+	 */
+	if ((clone_flags & (CLONE_PIDFD | CLONE_DETACHED)) ==
+	    (CLONE_PIDFD | CLONE_DETACHED))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -1936,6 +2005,22 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child
+	 * if the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create(pid);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+
+		pidfd = retval;
+		retval = put_user(pidfd, parent_tidptr);
+		if (retval)
+			goto bad_fork_put_pidfd;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2081,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2111,6 +2196,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		ksys_close(pidfd);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -2176,7 +2264,7 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
 			    cpu_to_node(cpu));
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
@@ -2223,7 +2311,7 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size,
+	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
 			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
 	add_latent_entropy();
 
-- 
2.21.0
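
The commit message says that the pidfd comes back through clone()'s
parent_tidptr argument and that its fdinfo file carries a "Pid:\t%d"
line. The following is a minimal userspace sketch of how a caller might
use both. It is not the sample program shipped with the patchset. The
helper names (clone_with_pidfd, parse_fdinfo_pid, pidfd_to_pid) are
hypothetical, and the raw syscall layout with parent_tidptr as the third
argument is an assumption that holds on x86-64 and arm64 but not on
every architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from the sched.h hunk of this patch */
#endif

/*
 * Hypothetical helper: fork-like clone() that also requests a pidfd.
 * Returns the child's pid in the parent, 0 in the child, -1 on error.
 * Assumption: parent_tidptr is the third raw syscall argument.
 */
pid_t clone_with_pidfd(int *pidfd)
{
	return (pid_t)syscall(SYS_clone,
			      (unsigned long)(CLONE_PIDFD | SIGCHLD),
			      0UL,	/* no new stack: child uses a copy */
			      pidfd,	/* parent_tidptr receives the pidfd */
			      0UL, 0UL);
}

/*
 * Hypothetical helper: scan fdinfo text line by line for "Pid:\t%d".
 * Returns the pid, or -1 if no such line is present.
 */
int parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	assert(text != NULL);
	while (line && *line) {
		int pid;

		if (sscanf(line, "Pid:\t%d", &pid) == 1)
			return pid;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

/*
 * Hypothetical helper: read /proc/self/fdinfo/<pidfd> and extract the
 * pid. Returns -1 if the file cannot be read or has no "Pid:" line.
 */
int pidfd_to_pid(int pidfd)
{
	char path[64], buf[512];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_fdinfo_pid(buf);
}
```

A caller would invoke clone_with_pidfd(&pidfd), have the child call
_exit() when the return value is 0, and in the parent compare
pidfd_to_pid(pidfd) against the returned pid. Keeping the parser
separate from the file access means the "Pid:\t%d" handling can be
checked on a kernel that does not support CLONE_PIDFD.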