From: Christian Brauner
To: torvalds@linux-foundation.org, viro@zeniv.linux.org.uk,
	jannh@google.com, dhowells@redhat.com, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: serge@hallyn.com, luto@kernel.org, arnd@arndb.de,
	ebiederm@xmission.com, keescook@chromium.org, tglx@linutronix.de,
	mtk.manpages@gmail.com, akpm@linux-foundation.org, oleg@redhat.com,
	cyphar@cyphar.com, joel@joelfernandes.org, dancol@google.com,
	Christian Brauner
Subject: [PATCH 2/4] clone: add CLONE_PIDFD
Date: Sun, 14 Apr 2019 22:14:34 +0200
Message-Id: <20190414201436.19502-3-christian@brauner.io>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190414201436.19502-1-christian@brauner.io>
References: <20190414201436.19502-1-christian@brauner.io>
X-Mailing-List: linux-kernel@vger.kernel.org

As discussed this patchset makes it possible to retrieve pid file
descriptors at process creation time by introducing the new flag
CLONE_PIDFD to the clone() system call. Linus originally suggested to
implement this as a new flag to clone() instead of making it a separate
system call. As spotted by Linus, there is exactly one bit for clone()
left.

CLONE_PIDFD returns file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount API. They serve as a simple opaque handle on pids. Logically, this
makes it possible to interpret a pidfd differently, narrowing or widening
the scope of various operations (e.g. signal sending). Thus, a pidfd can
refer not only to a tgid but also to a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way, but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel
supports procfs. The fdinfo file contains the pid of the process in the
caller's pid namespace in the same format as the procfs status file,
i.e. "Pid:\t%d".

To remove worries about missing metadata access, this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD,
fdinfo, and pidfd_send_signal() can be used to gain race-free access to
process metadata through /proc/. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds
Signed-off-by: Christian Brauner
Signed-off-by: Jann Horn
Cc: Arnd Bergmann
Cc: "Eric W. Biederman"
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: David Howells
Cc: "Michael Kerrisk (man-pages)"
Cc: Andy Lutomirski
Cc: Andrew Morton
Cc: Oleg Nesterov
Cc: Aleksa Sarai
Cc: Linus Torvalds
Cc: Al Viro
---
 include/linux/pid.h        |   2 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c              | 117 +++++++++++++++++++++++++++++++++++--
 3 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..06fa224d2c48 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* set if a pidfd instead of a pid should be returned */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..4825d5205604 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include
 #include
 #include
 #include
@@ -21,8 +22,10 @@
 #include
 #include
 #include
+#include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1662,6 +1665,87 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ * @file: struct file referencing @pid to return to caller
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note that this function can only be called after the fd table has
+ * potentially been shared to avoid leaking the pidfd to the new process.
+ *
+ * File descriptor numbering for pidfds starts at 1. This allows users
+ * of CLONE_PIDFD to perform the same checks as for pids, i.e.:
+ * pid/pidfd <  0: error
+ * pid/pidfd == 0: child
+ * pid/pidfd >  0: parent
+ *
+ * Return: On success, a cloexec pidfd ready to be installed through
+ *         fd_install() will be returned. The corresponding file will be
+ *         returned through @file.
+ *         On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid, struct file **file)
+{
+	unsigned int flags = O_RDWR | O_CLOEXEC;
+	int error, fd;
+	struct file *f;
+
+	error = __alloc_fd(current->files, 1, rlimit(RLIMIT_NOFILE), flags);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	f = anon_inode_getfile("pidfd", &pidfd_fops, get_pid(pid), flags);
+	if (IS_ERR(f)) {
+		put_pid(pid);
+		error = PTR_ERR(f);
+		goto err_put_unused_fd;
+	}
+
+	*file = f;
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+	return error;
+}
+
+static inline void pidfd_put(int fd, struct file *file)
+{
+	put_unused_fd(fd);
+	fput(file);
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1678,11 +1762,12 @@ static __latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
-					int node)
+					int node, int *pidfd)
 {
 	int retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	struct file *pidfdf = NULL;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1936,6 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child if
+	 * the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create(pid, &pidfdf);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+		*pidfd = retval;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2093,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2097,6 +2194,9 @@ static __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	if (clone_flags & CLONE_PIDFD)
+		fd_install(*pidfd, pidfdf);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	cgroup_threadgroup_change_end(current);
@@ -2111,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		pidfd_put(*pidfd, pidfdf);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -2177,7 +2280,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+			    cpu_to_node(cpu), NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2202,7 +2305,7 @@ long _do_fork(unsigned long clone_flags,
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
-	int trace = 0;
+	int pidfd, trace = 0;
 	long nr;
 
 	/*
@@ -2224,7 +2327,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE, &pidfd);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	put_pid(pid);
+
+	if (clone_flags & CLONE_PIDFD)
+		nr = pidfd;
+
 	return nr;
 }
 
-- 
2.21.0
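
The commit message says the fdinfo file of a pidfd carries the pid as a "Pid:\t%d" line, in the same format as the procfs status file. As a rough userspace illustration of consuming that format, here is a minimal sketch that is not part of the patch. The helper names parse_fdinfo_pid() and read_pidfd_pid() and the 4096-byte buffer size are assumptions made for this example; only the "Pid:\t%d" field layout comes from the patch itself, and /proc/self/fdinfo/ is the usual procfs location for per-descriptor information.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): scan fdinfo-style text for
 * a line starting with "Pid:\t" and return the number that follows.
 * Returns -1 if no such line is present.
 */
int parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	assert(text != NULL);

	while (*line != '\0') {
		int pid;

		/* The field name is matched only at the start of a line. */
		if (strncmp(line, "Pid:\t", 5) == 0 &&
		    sscanf(line + 5, "%d", &pid) == 1)
			return pid;

		/* Advance to the character after the next newline. */
		line = strchr(line, '\n');
		if (line == NULL)
			break;
		line++;
	}

	return -1;
}

/*
 * Hypothetical helper: read /proc/self/fdinfo/<fd> for the given
 * descriptor and extract the pid. Returns -1 if the file cannot be
 * opened or holds no "Pid:" line.
 */
int read_pidfd_pid(int pidfd)
{
	char path[64];
	char buf[4096];	/* assumed to be large enough for one fdinfo file */
	size_t len;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	len = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[len] = '\0';

	return parse_fdinfo_pid(buf);
}
```

A caller holding a pidfd could pass it to read_pidfd_pid() to learn which /proc entry to open, and then, as the commit message suggests, use pidfd_send_signal() to confirm that the process behind the pidfd is still the one that entry describes.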