2019-03-27 16:24:59

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 0/4] pidfd_open()

Hey,

After the discussion over the last days, this is a fresh approach to
getting pidfds independent of the translate_pid() patchset.

pidfd_open() allows to retrieve pidfds for processes and removes the
dependency of pidfd on procfs.
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. Currently they do not
support any advanced operations. The pidfds are not associated with a
specific pid namespaces but rather only reference struct pid of a given
process in their private_data member.

One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/<pid> directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/<pid> based on a pidfd (and
the other way around).
IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
/proc mount in a given pid namespace and a pidfd pidfd_open() will return a
file descriptor to the corresponding /proc/<pid> directory in procfs
mounts' pid namespace. pidfd_open() is very careful to verify that the pid
hasn't been recycled in between.
IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
referencing a /proc/<pid> directory a pidfd referencing the struct pid
stashed in /proc/<pid> of the process will be returned.
The pidfd_open() syscalls in that manner resembles openat() as it uses a
flag argument to modify what type of file descriptor will be returned.

The pidfd_open() implementation together with the flags argument strikes me
as an elegant compromise between splitting this into multiple syscalls and
avoiding ioctls().

Note that this patchset also includes Al's and David's commit to make anon
inodes unconditional. The original intention is to make it possible to use
anon inodes in core vfs functions. pidctl() has the same requirement so
David suggested I sent this in alongside this patch. Both are informed of
this.

The syscall comes with appropriate basic testing.

/* Examples */
// Retrieve pidfd
int pidfd = pidfd_open(1234, -1, -1, 0);

// Retrieve /proc/<pid> handle for pidfd
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD);

// Retrieve pidfd for /proc/<pid>
int procpidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int pidfd = pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD);

Thanks!
Christian

Christian Brauner (3):
pid: add pidfd_open()
signal: support pidfd_open() with pidfd_send_signal()
tests: add pidfd_open() tests

David Howells (1):
Make anon_inodes unconditional

arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
include/linux/pid.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/linux/wait.h | 3 +
init/Kconfig | 10 -
kernel/pid.c | 247 ++++++++++++++++++
kernel/signal.c | 14 +-
kernel/sys_ni.c | 3 -
tools/testing/selftests/pidfd/Makefile | 2 +-
.../testing/selftests/pidfd/pidfd_open_test.c | 201 ++++++++++++++
28 files changed, 469 insertions(+), 35 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c

--
2.21.0



2019-03-27 16:23:22

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 4/4] tests: add pidfd_open() tests

This adds a simple test case for pidfd_open().

Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
---
tools/testing/selftests/pidfd/Makefile | 2 +-
.../testing/selftests/pidfd/pidfd_open_test.c | 201 ++++++++++++++++++
2 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c

diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..b36c0be70848 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
CFLAGS += -g -I../../../../usr/include/

-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidfd_open_test

include ../lib.mk

diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
new file mode 100644
index 000000000000..07a262a9ef2c
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_pidfd_open(pid_t pid, int procfd, int pidfd,
+ unsigned int flags)
+{
+ return syscall(__NR_pidfd_open, pid, procfd, pidfd, flags);
+}
+
+static int safe_int(const char *numstr, int *converted)
+{
+ char *err = NULL;
+ signed long int sli;
+
+ errno = 0;
+ sli = strtol(numstr, &err, 0);
+ if (errno == ERANGE && (sli == LONG_MAX || sli == LONG_MIN))
+ return -ERANGE;
+
+ if (errno != 0 && sli == 0)
+ return -EINVAL;
+
+ if (err == numstr || *err != '\0')
+ return -EINVAL;
+
+ if (sli > INT_MAX || sli < INT_MIN)
+ return -ERANGE;
+
+ *converted = (int)sli;
+ return 0;
+}
+
+static int char_left_gc(const char *buffer, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t')
+ continue;
+
+ return i;
+ }
+
+ return 0;
+}
+
+static int char_right_gc(const char *buffer, size_t len)
+{
+ int i;
+
+ for (i = len - 1; i >= 0; i--) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t' ||
+ buffer[i] == '\n' ||
+ buffer[i] == '\0')
+ continue;
+
+ return i + 1;
+ }
+
+ return 0;
+}
+
+static char *trim_whitespace_in_place(char *buffer)
+{
+ buffer += char_left_gc(buffer, strlen(buffer));
+ buffer[char_right_gc(buffer, strlen(buffer))] = '\0';
+ return buffer;
+}
+
+static pid_t get_pid_from_status_file(int *fd)
+{
+ int ret;
+ FILE *f;
+ size_t n = 0;
+ pid_t result = -1;
+ char *line = NULL;
+
+ /* fd now belongs to FILE and will be closed by fclose() */
+ f = fdopen(*fd, "r");
+ if (!f)
+ return -1;
+
+ while (getline(&line, &n, f) != -1) {
+ char *numstr;
+
+ if (strncmp(line, "Pid:", 4))
+ continue;
+
+ numstr = trim_whitespace_in_place(line + 4);
+ ret = safe_int(numstr, &result);
+ if (ret < 0)
+ goto out;
+
+ break;
+ }
+
+out:
+ free(line);
+ fclose(f);
+ *fd = -1;
+ return result;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 1;
+ int pidfd = -1, pidfd2 = -1, procfd = -1, procpidfd = -1, statusfd = -1;
+ pid_t pid;
+
+ pidfd = sys_pidfd_open(getpid(), -1, -1, 0);
+ if (pidfd < 0) {
+ ksft_print_msg("%s - failed to open pidfd\n", strerror(errno));
+ goto on_error;
+ }
+
+ procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (procfd < 0) {
+ ksft_print_msg("%s - failed to open /proc\n", strerror(errno));
+ goto on_error;
+ }
+
+ procpidfd = sys_pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD);
+ if (procpidfd < 0) {
+ ksft_print_msg(
+ "%s - failed to retrieve /proc/<pid> from pidfd\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ pidfd2 = sys_pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD);
+ if (pidfd2 < 0) {
+ ksft_print_msg(
+ "%s - failed to retrieve pidfd from procpidfd\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ statusfd = openat(procpidfd, "status", O_CLOEXEC | O_RDONLY);
+ if (statusfd < 0) {
+ ksft_print_msg("%s - failed to open /proc/<pid>/status\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ pid = get_pid_from_status_file(&statusfd);
+ if (pid < 0) {
+ ksft_print_msg(
+ "%s - failed to retrieve pid from /proc/<pid>/status\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ if (pid != getpid()) {
+ ksft_print_msg(
+ "%s - actual pid %d does not equal retrieved pid from /proc/<pid>/status\n",
+ strerror(errno), pid, getpid());
+ goto on_error;
+ }
+
+ ret = 0;
+
+on_error:
+ if (pidfd >= 0)
+ close(pidfd);
+
+ if (pidfd2 >= 0)
+ close(pidfd2);
+
+ if (procfd >= 0)
+ close(procfd);
+
+ if (procpidfd >= 0)
+ close(procpidfd);
+
+ if (statusfd >= 0)
+ close(statusfd);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
--
2.21.0


2019-03-27 16:23:24

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 2/4] pid: add pidfd_open()

pidfd_open() allows to retrieve pidfds for processes and removes the
dependency of pidfd on procfs. Multiple people have expressed a desire to
do this even when pidfd_send_signal() was merged. It is even recorded in
the commit message for pidfd_send_signal() itself
(cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
Q-06: (Andrew Morton [1])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [1]).
Alexey made a similar request (cf. [2]). Additionally, Andy made an
argument that we should go forward with non-proc-dirfd file descriptors for
the sake of security and extensibility (cf. [3]).
This will unblock or help move along work on pidfd_wait which is currently
ongoing.

/* pidfds are anon inode file descriptors */
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. Currently they do not
support any advanced operations. The pidfds are not associated with a
specific pid namespaces but rather only reference struct pid of a given
process in their private_data member.

/* Process Metadata Access */
One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/<pid> directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/<pid> based on a pidfd (and
the other way around).
IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
/proc mount in a given pid namespace and a pidfd pidfd_open() will return a
file descriptor to the corresponding /proc/<pid> directory in procfs
mounts' pid namespace. pidfd_open() is very careful to verify that the pid
hasn't been recycled in between.
IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
referencing a /proc/<pid> directory a pidfd referencing the struct pid
stashed in /proc/<pid> of the process will be returned.
The pidfd_open() syscalls in that manner resembles openat() as it uses a
flag argument to modify what type of file descriptor will be returned.

The pidfd_open() implementation together with the flags argument strikes me
as an elegant compromise between splitting this into multiple syscalls and
avoiding ioctls().

/* Examples */
// Retrieve pidfd
int pidfd = pidfd_open(1234, -1, -1, 0);

// Retrieve /proc/<pid> handle for pidfd
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD);

// Retrieve pidfd for /proc/<pid>
int procpidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int pidfd = pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD);

/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com

Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/pid.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/linux/wait.h | 3 +
kernel/pid.c | 247 +++++++++++++++++++++++++
6 files changed, 256 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1f9607ed087c..c8046f261bee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -433,3 +433,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4378d4..f714a3d57b88 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid

extern struct pid init_struct_pid;

+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..79b274698036 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, int procfd, int pidfd,
+ unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
index ac49a220cf2a..8282fc19d8f6 100644
--- a/include/uapi/linux/wait.h
+++ b/include/uapi/linux/wait.h
@@ -18,5 +18,8 @@
#define P_PID 1
#define P_PGID 2

+/* Flags for pidfd_open */
+#define PIDFD_TO_PROCFD 1 /* retrieve file descriptor to /proc/<pid> for pidfd */
+#define PROCFD_TO_PIDFD 2 /* retrieve pidfd for /proc/<pid> */

#endif /* _UAPI_LINUX_WAIT_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..c9e24e726aba 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,8 +26,10 @@
*
*/

+#include <linux/anon_inodes.h>
#include <linux/mm.h>
#include <linux/export.h>
+#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/rculist.h>
@@ -40,6 +42,7 @@
#include <linux/proc_fs.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
+#include <linux/wait.h>

struct pid init_struct_pid = {
.count = ATOMIC_INIT(1),
@@ -451,6 +454,250 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}

+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ if (pid) {
+ file->private_data = NULL;
+ put_pid(pid);
+ }
+
+ return 0;
+}
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+};
+
+static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
+{
+ int fd;
+
+ fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid), O_RDWR | o_flags);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
+#ifdef CONFIG_PROC_FS
+static struct pid_namespace *pidfd_get_proc_pid_ns(const struct file *file)
+{
+ struct inode *inode;
+ struct super_block *sb;
+
+ inode = file_inode(file);
+ sb = inode->i_sb;
+ if (sb->s_magic != PROC_SUPER_MAGIC)
+ return ERR_PTR(-EINVAL);
+
+ if (inode->i_ino != PROC_ROOT_INO)
+ return ERR_PTR(-EINVAL);
+
+ return get_pid_ns(inode->i_sb->s_fs_info);
+}
+
+static struct pid *pidfd_get_pid(const struct file *file)
+{
+ if (file->f_op != &pidfd_fops)
+ return ERR_PTR(-EINVAL);
+
+ return get_pid(file->private_data);
+}
+
+static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
+ const struct pid *pidfd_pid)
+{
+ char name[11]; /* int to strlen + \0 */
+ struct file *file;
+ struct pid *proc_pid;
+
+ snprintf(name, sizeof(name), "%d", pid);
+ file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
+ O_DIRECTORY | O_NOFOLLOW, 0);
+ if (IS_ERR(file))
+ return file;
+
+ proc_pid = tgid_pidfd_to_pid(file);
+ if (IS_ERR(proc_pid)) {
+ filp_close(file, NULL);
+ return ERR_CAST(proc_pid);
+ }
+
+ if (pidfd_pid != proc_pid) {
+ filp_close(file, NULL);
+ return ERR_PTR(-ESRCH);
+ }
+
+ return file;
+}
+
+static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
+{
+ long fd;
+ pid_t ns_pid;
+ struct fd fdproc, fdpid;
+ struct file *file = NULL;
+ struct pid *pidfd_pid = NULL;
+ struct pid_namespace *proc_pid_ns = NULL;
+
+ fdproc = fdget(procfd);
+ if (!fdproc.file)
+ return -EBADF;
+
+ fdpid = fdget(pidfd);
+ if (!fdpid.file) {
+ fdput(fdpid);
+ return -EBADF;
+ }
+
+ proc_pid_ns = pidfd_get_proc_pid_ns(fdproc.file);
+ if (IS_ERR(proc_pid_ns)) {
+ fd = PTR_ERR(proc_pid_ns);
+ proc_pid_ns = NULL;
+ goto err;
+ }
+
+ pidfd_pid = pidfd_get_pid(fdpid.file);
+ if (IS_ERR(pidfd_pid)) {
+ fd = PTR_ERR(pidfd_pid);
+ pidfd_pid = NULL;
+ goto err;
+ }
+
+ ns_pid = pid_nr_ns(pidfd_pid, proc_pid_ns);
+ if (!ns_pid) {
+ fd = -ESRCH;
+ goto err;
+ }
+
+ file = pidfd_open_proc_pid(fdproc.file, ns_pid, pidfd_pid);
+ if (IS_ERR(file)) {
+ fd = PTR_ERR(file);
+ file = NULL;
+ goto err;
+ }
+
+ fd = get_unused_fd_flags(O_CLOEXEC);
+ if (fd < 0)
+ goto err;
+
+ fsnotify_open(file);
+ fd_install(fd, file);
+ file = NULL;
+
+err:
+ fdput(fdproc);
+ fdput(fdpid);
+ if (proc_pid_ns)
+ put_pid_ns(proc_pid_ns);
+ put_pid(pidfd_pid);
+ if (file)
+ filp_close(file, NULL);
+
+ return fd;
+}
+
+static int procfd_to_pidfd(int procfd)
+{
+ int fd;
+ struct fd fdproc;
+ struct pid *proc_pid;
+
+ fdproc = fdget(procfd);
+ if (!fdproc.file)
+ return -EBADF;
+
+ proc_pid = tgid_pidfd_to_pid(fdproc.file);
+ if (IS_ERR(proc_pid)) {
+ fdput(fdproc);
+ return PTR_ERR(proc_pid);
+ }
+
+ fd = pidfd_create_fd(proc_pid, O_CLOEXEC);
+ fdput(fdproc);
+ return fd;
+}
+#else
+static inline int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int procfd_to_pidfd(int procfd)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * pidfd_open - open a pidfd
+ * @pid: pid for which to retrieve a pidfd
+ * @procfd: procfd file descriptor
+ * @pidfd: pidfd file descriptor
+ * @flags: flags to pass
+ *
+ * Creates a new pidfd or translates between pidfds and procfds.
+ * If no flag is passed, pidfd_open() will return a new pidfd for @pid. If
+ * PROCFD_TO_PIDFD is in @flags then a pidfd for struct pid referenced by
+ * @procfd is created. If PIDFD_TO_PROCFD is passed then a file descriptor to
+ * the process /proc/<pid> directory relative to the procfs referenced by
+ * @procfd will be returned.
+ */
+SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned int,
+ flags)
+{
+ long fd = -EINVAL;
+
+ if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD))
+ return -EINVAL;
+
+ if (!flags) {
+ struct pid *pidfd_pid;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ if (procfd != -1 || pidfd != -1)
+ return -EINVAL;
+
+ rcu_read_lock();
+ pidfd_pid = get_pid(find_pid_ns(pid, task_active_pid_ns(current)));
+ rcu_read_unlock();
+
+ fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
+ put_pid(pidfd_pid);
+ } else if (flags & PIDFD_TO_PROCFD) {
+ if (flags & ~PIDFD_TO_PROCFD)
+ return -EINVAL;
+
+ if (pid != -1)
+ return -EINVAL;
+
+ if (procfd < 0 || pidfd < 0)
+ return -EINVAL;
+
+ fd = pidfd_to_procfd(pid, procfd, pidfd);
+ } else if (flags & PROCFD_TO_PIDFD) {
+ if (flags & ~PROCFD_TO_PIDFD)
+ return -EINVAL;
+
+ if (pid != -1)
+ return -EINVAL;
+
+ if (pidfd >= 0)
+ return -EINVAL;
+
+ if (procfd < 0)
+ return -EINVAL;
+
+ fd = procfd_to_pidfd(procfd);
+ }
+
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0


2019-03-27 16:23:39

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 1/4] Make anon_inodes unconditional

From: David Howells <[email protected]>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and the pidfd_open() syscall.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[[email protected]: adapt commit message to mention pidfd_open()]
Signed-off-by: Christian Brauner <[email protected]>
---
arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
init/Kconfig | 10 ----------
18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on MMU && OF
select PREEMPT_NOTIFIERS
- select ANON_INODES
select ARM_GIC
select ARM_GIC_V3
select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
depends on OF
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
depends on MIPS_FP_SUPPORT
select EXPORT_UASM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
config KVM
bool
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
prompt "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..18f2c954464e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
- select ANON_INODES
select ARCH_32BIT_OFF_T if X86_32
select ARCH_CLOCKSOURCE_DATA
select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
select MMU_NOTIFIER
- select ANON_INODES
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
config DMA_SHARED_BUFFER
bool
default n
- select ANON_INODES
select IRQ_WORK
help
This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
config TCG_VTPM_PROXY
tristate "VTPM Proxy Interface"
depends on TCG_TPM
- select ANON_INODES
---help---
This driver proxies for an emulated TPM (vTPM) running in userspace.
A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
config SYNC_FILE
bool "Explicit Synchronization Framework"
default n
- select ANON_INODES
select DMA_SHARED_BUFFER
---help---
The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H

menuconfig GPIOLIB
bool "GPIO Support"
- select ANON_INODES
help
This enables GPIO support through the generic GPIO library.
You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@

menuconfig IIO
tristate "Industrial I/O support"
- select ANON_INODES
help
The industrial I/O subsystem provides a unified framework for
drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD

config INFINIBAND_USER_ACCESS
tristate "InfiniBand userspace access (verbs and CM)"
- select ANON_INODES
depends on MMU
---help---
Userspace InfiniBand access support. This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
- select ANON_INODES
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o

obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
-obj-$(CONFIG_ANON_INODES) += anon_inodes.o
+obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
config FANOTIFY
bool "Filesystem wide access notification"
select FSNOTIFY
- select ANON_INODES
select EXPORTFS
default n
---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
config INOTIFY_USER
bool "Inotify support for userspace"
- select ANON_INODES
select FSNOTIFY
default y
---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
config SYSCTL
bool

-config ANON_INODES
- bool
-
config HAVE_UID16
bool

@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
- select ANON_INODES
help
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.

config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD

config TIMERFD
bool "Enable timerfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD

config EVENTFD
bool "Enable eventfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call"
- select ANON_INODES
select BPF
select IRQ_WORK
default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON

config USERFAULTFD
bool "Enable userfaultfd() system call"
- select ANON_INODES
depends on MMU
help
Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
bool "Kernel performance events and counters"
default y if PROFILING
depends on HAVE_PERF_EVENTS
- select ANON_INODES
select IRQ_WORK
select SRCU
help
--
2.21.0


2019-03-27 16:25:00

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 3/4] signal: support pidfd_open() with pidfd_send_signal()

Let pidfd_send_signal() use pidfds retrieved via pidfd_open(). With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
---
kernel/signal.c | 14 ++++++++++----
kernel/sys_ni.c | 3 ---
2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..eb97d0cc6ef7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
return kill_something_info(sig, &info, pid);
}

-#ifdef CONFIG_PROC_FS
/*
* Verify that the signaler and signalee either are in the same pid namespace
* or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
return copy_siginfo_from_user(kinfo, info);
}

+static struct pid *pidfd_to_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return tgid_pidfd_to_pid(file);
+}
+
/**
* sys_pidfd_send_signal - send a signal to a process through a task file
* descriptor
@@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
if (flags)
return -EINVAL;

- f = fdget_raw(pidfd);
+ f = fdget(pidfd);
if (!f.file)
return -EBADF;

/* Is this a pidfd? */
- pid = tgid_pidfd_to_pid(f.file);
+ pid = pidfd_to_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
goto err;
@@ -3625,7 +3632,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
fdput(f);
return ret;
}
-#endif /* CONFIG_PROC_FS */

static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);

/* kernel/sched/core.c */

-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
/* kernel/sys.c */
COND_SYSCALL(setregid);
COND_SYSCALL(setgid);
--
2.21.0


2019-03-27 17:09:52

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner <[email protected]> wrote:
> pidfd_open() allows to retrieve pidfds for processes and removes the
> dependency of pidfd on procfs. Multiple people have expressed a desire to
> do this even when pidfd_send_signal() was merged. It is even recorded in
[...]
> IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
> /proc mount in a given pid namespace and a pidfd pidfd_open() will return a
> file descriptor to the corresponding /proc/<pid> directory in procfs
> mounts' pid namespace. pidfd_open() is very careful to verify that the pid

nit: s/mounts'/mount's/

> hasn't been recycled in between.
> IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
> referencing a /proc/<pid> directory a pidfd referencing the struct pid
> stashed in /proc/<pid> of the process will be returned.

nit: s/of the process //?

> The pidfd_open() syscalls in that manner resembles openat() as it uses a

nit: s/syscalls/syscall/

[...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..c9e24e726aba 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
[...]
> +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
> + const struct pid *pidfd_pid)
> +{
> + char name[11]; /* int to strlen + \0 */

nit: The comment is a bit off; an unconstrained int needs 1+10+1
bytes, I think? minus sign, 10 digits, nullbyte? But of course that
can't actually happen here.

> + struct file *file;
> + struct pid *proc_pid;
> +
> + snprintf(name, sizeof(name), "%d", pid);
> + file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
> + O_DIRECTORY | O_NOFOLLOW, 0);

Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity?

[...]
> +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
> +{
> + long fd;
> + pid_t ns_pid;
> + struct fd fdproc, fdpid;
> + struct file *file = NULL;
> + struct pid *pidfd_pid = NULL;
> + struct pid_namespace *proc_pid_ns = NULL;
> +
> + fdproc = fdget(procfd);
> + if (!fdproc.file)
> + return -EBADF;
> +
> + fdpid = fdget(pidfd);
> + if (!fdpid.file) {
> + fdput(fdpid);

Typo: s/fdput(fdpid)/fdput(fdproc)/

[...]
> +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned int,
> + flags)
[...]
> + if (!flags) {
[...]
> + rcu_read_lock();
> + pidfd_pid = get_pid(find_pid_ns(pid, task_active_pid_ns(current)));
> + rcu_read_unlock();

The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`.

> + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);

Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of
the second function argument if you want to.

> + put_pid(pidfd_pid);
> + } else if (flags & PIDFD_TO_PROCFD) {
[...]
> + fd = pidfd_to_procfd(pid, procfd, pidfd);

The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes
sense to get rid of that?

2019-03-27 17:46:06

by Yann Droneaud

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit :

> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..c9e24e726aba 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -26,8 +26,10 @@
> +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd,
> unsigned int,
> + flags)
> +{
> + long fd = -EINVAL;
> +
> + if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD))
> + return -EINVAL;
> +
> + if (!flags) {
> + struct pid *pidfd_pid;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + if (procfd != -1 || pidfd != -1)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + pidfd_pid = get_pid(find_pid_ns(pid, task_active_pid_ns(current)));
> + rcu_read_unlock();
> +
> + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
> + put_pid(pidfd_pid);
> + } else if (flags & PIDFD_TO_PROCFD) {

[...]

> + } else if (flags & PROCFD_TO_PIDFD) {
> + if (flags & ~PROCFD_TO_PIDFD)
> + return -EINVAL;
> +
> + if (pid != -1)
> + return -EINVAL;
> +
> + if (pidfd >= 0)
>

I think it can be stricter with:

if (pidfd != -1)

(and match the check done for flag == 0).

Regards.

--
Yann Droneaud
OPTEYA



2019-03-27 19:39:21

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

Christian,

Giving this some thought, it looks fine to me, but I am not convinced
this should take a pid argument. I much prefer if obtaining a pidfd
goes in the translation stuff, that has very real usecases, and you
would end up duplicating the same thing over in two places.

If the /proc/<PID> dir fd of a pidfd is relative to its procrootfd, a
pid's pidfd is also relative to its namespace. Currently, you change
things which would mean I now have to setns and spawn a child that
passes me back the pidfd for a pid in the said namespace.

I prefer Andy's version of

int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
this procfd_open, maybe

This is just a nicer race free openat.

But, Joel and you agreed that it makes sense to split out the
translation stuff out of pidfds.

My suggestion would be to extend ioctl_ns(2) then. It already provides
a (rather clumsy) mechanism to determine the relationship by comparing
st_dev and st_ino, which userspace can do itself for two namespace
descriptors. For translation, you can extend this namespace ioctl
(there is one to get a owner UID, you could add one to get a relative
PID).

Then, your pidfd call will be:

pidfd_open(pid_t pid, int ns, unsigned int flags);

You would also be able to compile out anything procfs related (include
the new API to procfs dir fd translation), otherwise, the way to open
pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good
as pidfd_open(pid_t pid) (of which a better version I propose above).
The new API should be its own thing, and the procfs translation tuple
its own thing, tied to the proc fs option. pidfds need not have any
relation to /proc.

For this procfd conversion system call, I would also lift any
namespace related restrictions if you are planning to, given the
constraint that I can only open a pidfd from an ancestor namespace
with the new pidfd_open system call, or acquire it through a
cooperative userspace process by fd passing otherwise, and I need the
/proc root fd, having both only permits me to open the said process's
/proc/<PID> dir fd subject to normal access restrictions. This means
the simplified procfd_open can be used to do metadata access without
even talking of PIDs at all, your process could live in its own world
and, given it has the authority to open the /proc directory, be able
to purely collect metadata based on the two file descriptors it is
given.

Once you have the restriction in the same call that allows you to open
a pidfd for an addressable PID from the given namespace fd, you can
finally remove the restriction to signal across namespaces once the
siginfo stuff is sorted out, as that will only work when you
explicitly push the fd into a sandbox, the process cannot get it out
of thin air on its own (and you already mentioned it has nothing to do
with security). What I do worry about is one can use NS_GET_PARENT
ioctl to get the parent pidns if the owning userns is the same, and
just passing that gives me back a pidfd for the task. **So, you might
want to add the constraint that the PID is actually reachable by the
current task as well, apart from being reachable in the passed in
namespace.**

Lastly, I also see no need of /proc/<PID> dir fd to pidfd conversion,
I would even recommend getting rid of that, so we only have one type
of pidfd, the anon inode one. What is the usecase behind that? It
would only be needed if you did not have a way to be able to metadata
access through a pidfd, which would be the case only prior to this
patch.

I think this would simplify a lot of things, and ioctl_ns(2) is
probably already the place to do comparison operations and query
operations on hierarichal namespaces, just adding the relative PID bit
will make it gain feature parity with translate_pid.

2019-03-27 19:44:25

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 4/4] tests: add pidfd_open() tests

On Wed, Mar 27, 2019 at 9:22 AM Christian Brauner <[email protected]> wrote:
>
> This adds a simple test case for pidfd_open().

Thank you for the tests. :) It might be nice to see more extensive
negative testing -- making sure pids aren't reachable from mismatched
pid namespaces, etc, but this gets us basic regression testing. Also,
you might save some copy/paste code using kselftest_harness.h's
interfaces.

--
Kees Cook

2019-03-27 20:18:35

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 7:38 PM Jonathan Kowalski <[email protected]> wrote:
> ...
> ... the process cannot get it out
> of thin air on its own (and you already mentioned it has nothing to do
> with security). What I do worry about is one can use NS_GET_PARENT

disregard this, it works as it should.

> ioctl to get the parent pidns if the owning userns is the same, and
> just passing that gives me back a pidfd for the task. **So, you might
> want to add the constraint that the PID is actually reachable by the
> current task as well, apart from being reachable in the passed in
> namespace.**
>
> Lastly, I also see no need of /proc/<PID> dir fd to pidfd conversion,
> I would even recommend getting rid of that, so we only have one type
> of pidfd, the anon inode one. What is the usecase behind that? It
> would only be needed if you did not have a way to be able to metadata
> access through a pidfd, which would be the case only prior to this
> patch.
>
> I think this would simplify a lot of things, and ioctl_ns(2) is
> probably already the place to do comparison operations and query
> operations on hierarichal namespaces, just adding the relative PID bit
> will make it gain feature parity with translate_pid.

2019-03-27 20:56:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 4/4] tests: add pidfd_open() tests

On Wed, Mar 27, 2019 at 12:36:46PM -0700, Kees Cook wrote:
> On Wed, Mar 27, 2019 at 9:22 AM Christian Brauner <[email protected]> wrote:
> >
> > This adds a simple test case for pidfd_open().
>
> Thank you for the tests. :) It might be nice to see more extensive
> negative testing -- making sure pids aren't reachable from mismatched

There'll be more tests for sure! Once people say that they are happy
with the design I can put in the effort to expand the tests to more
elaborate scenarios. :)

> pid namespaces, etc, but this gets us basic regression testing. Also,
> you might save some copy/paste code using kselftest_harness.h's
> interfaces.
>
> --
> Kees Cook

2019-03-27 21:01:56

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 06:21:24PM +0100, Yann Droneaud wrote:
> Le mercredi 27 mars 2019 à 17:21 +0100, Christian Brauner a écrit :
>
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..c9e24e726aba 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -26,8 +26,10 @@
> > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd,
> > unsigned int,
> > + flags)
> > +{
> > + long fd = -EINVAL;
> > +
> > + if (flags & ~(PIDFD_TO_PROCFD | PROCFD_TO_PIDFD))
> > + return -EINVAL;
> > +
> > + if (!flags) {
> > + struct pid *pidfd_pid;
> > +
> > + if (pid <= 0)
> > + return -EINVAL;
> > +
> > + if (procfd != -1 || pidfd != -1)
> > + return -EINVAL;
> > +
> > + rcu_read_lock();
> > + pidfd_pid = get_pid(find_pid_ns(pid, task_active_pid_ns(current)));
> > + rcu_read_unlock();
> > +
> > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
> > + put_pid(pidfd_pid);
> > + } else if (flags & PIDFD_TO_PROCFD) {
>
> [...]
>
> > + } else if (flags & PROCFD_TO_PIDFD) {
> > + if (flags & ~PROCFD_TO_PIDFD)
> > + return -EINVAL;
> > +
> > + if (pid != -1)
> > + return -EINVAL;
> > +
> > + if (pidfd >= 0)
> >
>
> I think it can be stricter with:
>
> if (pidfd != -1)

Yes.

>
> (and match the check done for flag == 0).
>
> Regards.
>
> --
> Yann Droneaud
> OPTEYA
>
>

2019-03-27 21:03:35

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 06:07:54PM +0100, Jann Horn wrote:
> On Wed, Mar 27, 2019 at 5:22 PM Christian Brauner <[email protected]> wrote:
> > pidfd_open() allows to retrieve pidfds for processes and removes the
> > dependency of pidfd on procfs. Multiple people have expressed a desire to
> > do this even when pidfd_send_signal() was merged. It is even recorded in
> [...]
> > IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
> > /proc mount in a given pid namespace and a pidfd pidfd_open() will return a
> > file descriptor to the corresponding /proc/<pid> directory in procfs
> > mounts' pid namespace. pidfd_open() is very careful to verify that the pid
>
> nit: s/mounts'/mount's/

Thanks.

>
> > hasn't been recycled in between.
> > IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
> > referencing a /proc/<pid> directory a pidfd referencing the struct pid
> > stashed in /proc/<pid> of the process will be returned.
>
> nit: s/of the process //?

Yes.

>
> > The pidfd_open() syscalls in that manner resembles openat() as it uses a
>
> nit: s/syscalls/syscall/

Thanks.

>
> [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..c9e24e726aba 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> [...]
> > +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
> > + const struct pid *pidfd_pid)
> > +{
> > + char name[11]; /* int to strlen + \0 */
>
> nit: The comment is a bit off; an unconstrained int needs 1+10+1
> bytes, I think? minus sign, 10 digits, nullbyte? But of course that
> can't actually happen here.

Yes, the comment is misleading.

>
> > + struct file *file;
> > + struct pid *proc_pid;
> > +
> > + snprintf(name, sizeof(name), "%d", pid);
> > + file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
> > + O_DIRECTORY | O_NOFOLLOW, 0);
>
> Maybe explicitly write the implied O_RDONLY (which is 0) here for clarity?

Yes.

>
> [...]
> > +static int pidfd_to_procfd(pid_t pid, int procfd, int pidfd)
> > +{
> > + long fd;
> > + pid_t ns_pid;
> > + struct fd fdproc, fdpid;
> > + struct file *file = NULL;
> > + struct pid *pidfd_pid = NULL;
> > + struct pid_namespace *proc_pid_ns = NULL;
> > +
> > + fdproc = fdget(procfd);
> > + if (!fdproc.file)
> > + return -EBADF;
> > +
> > + fdpid = fdget(pidfd);
> > + if (!fdpid.file) {
> > + fdput(fdpid);
>
> Typo: s/fdput(fdpid)/fdput(fdproc)/

Good catch!

>
> [...]
> > +SYSCALL_DEFINE4(pidfd_open, pid_t, pid, int, procfd, int, pidfd, unsigned int,
> > + flags)
> [...]
> > + if (!flags) {
> [...]
> > + rcu_read_lock();
> > + pidfd_pid = get_pid(find_pid_ns(pid, task_active_pid_ns(current)));
> > + rcu_read_unlock();
>
> The previous three lines are equivalent to `pidfd_pid = find_get_pid(pid)`.

Perfect, will replace.

>
> > + fd = pidfd_create_fd(pidfd_pid, O_CLOEXEC);
>
> Nit: You could hardcode O_CLOEXEC in pidfd_create_fd() and get rid of
> the second function argument if you want to.

Hm, let me rename this to pidfd_create_cloexec(pidfd_pid) then.

>
> > + put_pid(pidfd_pid);
> > + } else if (flags & PIDFD_TO_PROCFD) {
> [...]
> > + fd = pidfd_to_procfd(pid, procfd, pidfd);
>
> The `pid` argument of pidfd_to_procfd() looks unused, maybe it makes
> sense to get rid of that?

Yes.

2019-03-27 21:35:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 07:38:13PM +0000, Jonathan Kowalski wrote:
> Christian,
>
> Giving this some thought, it looks fine to me, but I am not convinced
> this should take a pid argument. I much prefer if obtaining a pidfd

The signatures for pidfd_open() you're suggesting are conflicting. Even
taking into account that you're referring to Andy's version:

int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name

the "not convinced this should take a pid argument" confuses me wrt to
the proposed second signature:

pidfd_open(pid_t pid, int ns, unsigned int flags);

> goes in the translation stuff, that has very real usecases, and you
> would end up duplicating the same thing over in two places.
>
> If the /proc/<PID> dir fd of a pidfd is relative to its procrootfd, a
> pid's pidfd is also relative to its namespace. Currently, you change

Pass a /proc/<pid> file descriptor that you have been given access to to
pidfd_open() and retrieve a pidfd to that process.

> things which would mean I now have to setns and spawn a child that
> passes me back the pidfd for a pid in the said namespace.

There are two scenarios:
- you're in a cousin pid namespace
- you're in an ancestor pid namespace

If you're in an ancestor pid namespace you can just get a pidfd based on
the pid of the process in your namespace. If you're in a cousin
pid namespace it seems reasonable that you have to do a setns to get a
pidfd. After all, you're not able to see any pids in there by default.
If you want pidfd to a cousin pid namespace prove that you can access it
by attaching to the owning user namespace of the pid namespace and get
your pidfd.

>
> I prefer Andy's version of
>
> int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
> this procfd_open, maybe
>
> This is just a nicer race free openat.
>
> But, Joel and you agreed that it makes sense to split out the
> translation stuff out of pidfds.

I don't quite remember that part.

I honestly think that the version proposed here covers for the most part
what we want and provides a decent compromise by avoiding ioctl()s for
the translation bits, allows us to not split this into multiple
syscalls, and also allows retrieving pidfds from other pid namespaces
provided that one has gotten access to a file descriptor for
/proc/<pid>. If really needed at some point - I doubt we will - you
can extend pidfd_open() to take the flag CLONE_NEWPID:

int pidfd = pidfd_open(pid, pid_ns_fd, -1, CLONE_NEWPID);

and get pidfd for another pid namespace back.

Christian

>
> My suggestion would be to extend ioctl_ns(2) then. It already provides
> a (rather clumsy) mechanism to determine the relationship by comparing
> st_dev and st_ino, which userspace can do itself for two namespace
> descriptors. For translation, you can extend this namespace ioctl
> (there is one to get a owner UID, you could add one to get a relative
> PID).
>
> Then, your pidfd call will be:
>
> pidfd_open(pid_t pid, int ns, unsigned int flags);
>
> You would also be able to compile out anything procfs related (include
> the new API to procfs dir fd translation), otherwise, the way to open
> pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good
> as pidfd_open(pid_t pid) (of which a better version I propose above).
> The new API should be its own thing, and the procfs translation tuple
> its own thing, tied to the proc fs option. pidfds need not have any
> relation to /proc.

They don't have a relation to procfs right now in this syscall. Works
fine without procfs. That's the point.

>
> For this procfd conversion system call, I would also lift any
> namespace related restrictions if you are planning to, given the
> constraint that I can only open a pidfd from an ancestor namespace
> with the new pidfd_open system call, or acquire it through a
> cooperative userspace process by fd passing otherwise, and I need the
> /proc root fd, having both only permits me to open the said process's
> /proc/<PID> dir fd subject to normal access restrictions. This means
> the simplified procfd_open can be used to do metadata access without
> even talking of PIDs at all, your process could live in its own world
> and, given it has the authority to open the /proc directory, be able
> to purely collect metadata based on the two file descriptors it is
> given.
>
> Once you have the restriction in the same call that allows you to open
> a pidfd for an addressable PID from the given namespace fd, you can
> finally remove the restriction to signal across namespaces once the
> siginfo stuff is sorted out, as that will only work when you
> explicitly push the fd into a sandbox, the process cannot get it out
> of thin air on its own (and you already mentioned it has nothing to do
> with security). What I do worry about is one can use NS_GET_PARENT
> ioctl to get the parent pidns if the owning userns is the same, and
> just passing that gives me back a pidfd for the task. **So, you might
> want to add the constraint that the PID is actually reachable by the
> current task as well, apart from being reachable in the passed in
> namespace.**
>
> Lastly, I also see no need of /proc/<PID> dir fd to pidfd conversion,
> I would even recommend getting rid of that, so we only have one type
> of pidfd, the anon inode one. What is the usecase behind that? It
> would only be needed if you did not have a way to be able to metadata
> access through a pidfd, which would be the case only prior to this
> patch.
>
> I think this would simplify a lot of things, and ioctl_ns(2) is
> probably already the place to do comparison operations and query
> operations on hierarichal namespaces, just adding the relative PID bit
> will make it gain feature parity with translate_pid.

2019-03-27 22:14:20

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Wed, Mar 27, 2019 at 9:34 PM Christian Brauner <[email protected]> wrote:
>
> On Wed, Mar 27, 2019 at 07:38:13PM +0000, Jonathan Kowalski wrote:
> > Christian,
> >
> > Giving this some thought, it looks fine to me, but I am not convinced
> > this should take a pid argument. I much prefer if obtaining a pidfd
>
> The signatures for pidfd_open() you're suggesting are conflicting. Even
> taking into account that you're referring to Andy's version

My version pidfd_open takes a pid, the pid namespace to search it in,
and has an optional flags value to extend it further in the future to
an immediate need. This is a simpler version of the pidctl's GET_PIDFD
command that you yourself proposed in the last patch, and you don't
need two namespace arguments, just one here (in that case you were
passing -1 in either of those anyway).

so pidctl(PIDCMD_GET_PIDFD, pid, source, -1, 0) becomes
pidfd_open(pid, source, 0);
pidctl(PIDCMD_GET_PIDFD, pid, -1, target or -1, 0) would be
pidfd_open(pid, -1, 0);

The target was just there due to it having to support
comparison/translation commands, but it did not serve much purpose
except the pid was reachable in both. I am not sure that is needed if
you just want a pidfd for pid relative to a namespace.

What I prefer is dropping the pid argument from *your* pidfd_open,
rename it to procfd/procpidfd_open to give it a clearer name, and have
the signature the same as what Andy suggested. This system call is
tied to CONFIG_PROC_FS, all it does is take a pidfd, a proc rootfd,
and return the /proc/<PID> dir fd. It is up to you if you want to
support procpidfd to pidfd conversion, I don't think there was any
usecase mentioned regarding that far, so you may just ignore it.

>
> int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
>
> the "not convinced this should take a pid argument" confuses me wrt to
> the proposed second signature:
>
> pidfd_open(pid_t pid, int ns, unsigned int flags);
>
> > goes in the translation stuff, that has very real usecases, and you
> > would end up duplicating the same thing over in two places.
> >
> > If the /proc/<PID> dir fd of a pidfd is relative to its procrootfd, a
> > pid's pidfd is also relative to its namespace. Currently, you change
>
> Pass a /proc/<pid> file descriptor that you have been given access to to
> pidfd_open() and retrieve a pidfd to that process.

It would be nice if it would be free of /proc entirely. The usecase
this system call is after is helping race free translation from pidfd
to procfd. pidfd_open as a separate system call as mentioned above
sounds much cleaner to me, and it works regardless of what /proc is
(compiled in, mounted, not compiled in, no access, etc).

>
> > things which would mean I now have to setns and spawn a child that
> > passes me back the pidfd for a pid in the said namespace.
>
> There are two scenarios:
> - you're in a cousin pid namespace
> - you're in an ancestor pid namespace
>
> If you're in an ancestor pid namespace you can just get a pidfd based on
> the pid of the process in your namespace. If you're in a cousin
> pid namespace it seems reasonable that you have to do a setns to get a
> pidfd. After all, you're not able to see any pids in there by default.
> If you want pidfd to a cousin pid namespace prove that you can access it
> by attaching to the owning user namespace of the pid namespace and get
> your pidfd.

Yes, but you break the part that you had working in the pidctl
patchset, I learn the pid of a process in a child pidns, and I am able
to get a pidfd for it with the pid relative to the passed in namespace
descriptor. You could even get a pidfd for an ancestor process, given
it explicitly gives you its namespace descriptor (or a cousin pidns,
but that still means userspace will pass the namespace descriptor
explicitly), but it is up to you if you want to limit its scope to
down the hierarchy strictly (and it would be nice to know now so that
you don't end up wasting more time later, and the next patchset is
hopefully the last one before merge modulo nits).

>
> >
> > I prefer Andy's version of
> >
> > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD); //name
> > this procfd_open, maybe
> >
> > This is just a nicer race free openat.
> >
> > But, Joel and you agreed that it makes sense to split out the
> > translation stuff out of pidfds.
>
> I don't quite remember that part.
>
> I honestly think that the version proposed here covers for the most part
> what we want and provides a decent compromise by avoiding ioctl()s for
> the translation bits, allows us to not split this into multiple

I was not wanting to suggest an ioctl to burden you further, but it
occured to me that you would have to do less, as it already has
support to determine ownership and is meant for such tasks in
ioctl_ns(2), but I don't care if it's a translate_pid system call.

> syscalls, and also allows retrieving pidfds from other pid namespaces
> provided that one has gotten access to a file descriptor for
> /proc/<pid>. If really needed at some point - I doubt we will - you
> can extend pidfd_open() to take the flag CLONE_NEWPID:
>
> int pidfd = pidfd_open(pid, pid_ns_fd, -1, CLONE_NEWPID);
>
> and get pidfd for another pid namespace back.
>
> Christian
>
> >
> > My suggestion would be to extend ioctl_ns(2) then. It already provides
> > a (rather clumsy) mechanism to determine the relationship by comparing
> > st_dev and st_ino, which userspace can do itself for two namespace
> > descriptors. For translation, you can extend this namespace ioctl
> > (there is one to get a owner UID, you could add one to get a relative
> > PID).
> >
> > Then, your pidfd call will be:
> >
> > pidfd_open(pid_t pid, int ns, unsigned int flags);
> >
> > You would also be able to compile out anything procfs related (include
> > the new API to procfs dir fd translation), otherwise, the way to open
> > pidfds is in this call, and without CONFIG_PROC_FS=Y, this is as good
> > as pidfd_open(pid_t pid) (of which a better version I propose above).
> > The new API should be its own thing, and the procfs translation tuple
> > its own thing, tied to the proc fs option. pidfds need not have any
> > relation to /proc.
>
> They don't have a relation to procfs right now in this syscall. Works
> fine without procfs. That's the point.

Yes, but then you might as well put the translation thing to procfs in
its own system call, and take out pidctl's PIDCMD_GET_PIDFD in its
own, and keep the translate_pid system call as is from Eric's tree.

This gives you three clearly focused APIs:

procfd_open(int procrootfd, int pidfd, unsigned int flags);

pass in a /proc dir fd, and get back the /proc/<PID> dir fd of PID it
maps to, this a race free openat. That's it. flags could tell it to
give a thread's dir fd in the future. It only does translation (and
I'd like to know why the /proc/<PID> to pidfd translation is
necessary, is there a real use case? that will break the namespace
scoping you have in the current iteration of your patch too).

pidfd_open(pid_t pid, int nsfd, unsigned int flags);

Give me a pidfd for the pid relative to the namespace descriptor. It
is up to you if you want to strictly check that the current task has
this pid reachable from jts namespace, or rely on the fact that it
cannot acquire the namespace fd for a cousin or an ancestor without a
cooperating process that passes it in explicitly (again, would be nice
to know what others think about opening it up, i may be missing holes
in the swiss cheese that do allow you get hold of such a namespace
descriptor easily, in which case making it strictly down the hierarchy
is the most sensible).

and translate_pid

translate_pid(pid_t pid, int sourcens, int targetns, ...) as Konstantin had.


>
> >
> > For this procfd conversion system call, I would also lift any
> > namespace related restrictions if you are planning to, given the
> > constraint that I can only open a pidfd from an ancestor namespace
> > with the new pidfd_open system call, or acquire it through a
> > cooperative userspace process by fd passing otherwise, and I need the
> > /proc root fd, having both only permits me to open the said process's
> > /proc/<PID> dir fd subject to normal access restrictions. This means
> > the simplified procfd_open can be used to do metadata access without
> > even talking of PIDs at all, your process could live in its own world
> > and, given it has the authority to open the /proc directory, be able
> > to purely collect metadata based on the two file descriptors it is
> > given.
> >
> > Once you have the restriction in the same call that allows you to open
> > a pidfd for an addressable PID from the given namespace fd, you can
> > finally remove the restriction to signal across namespaces once the
> > siginfo stuff is sorted out, as that will only work when you
> > explicitly push the fd into a sandbox, the process cannot get it out
> > of thin air on its own (and you already mentioned it has nothing to do
> > with security). What I do worry about is one can use NS_GET_PARENT
> > ioctl to get the parent pidns if the owning userns is the same, and
> > just passing that gives me back a pidfd for the task. **So, you might
> > want to add the constraint that the PID is actually reachable by the
> > current task as well, apart from being reachable in the passed in
> > namespace.**
> >
> > Lastly, I also see no need of /proc/<PID> dir fd to pidfd conversion,
> > I would even recommend getting rid of that, so we only have one type
> > of pidfd, the anon inode one. What is the usecase behind that? It
> > would only be needed if you did not have a way to be able to metadata
> > access through a pidfd, which would be the case only prior to this
> > patch.
> >
> > I think this would simplify a lot of things, and ioctl_ns(2) is
> > probably already the place to do comparison operations and query
> > operations on hierarichal namespaces, just adding the relative PID bit
> > will make it gain feature parity with translate_pid.

2019-03-27 22:27:09

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

<snip>

> procfd_open(int procrootfd, int pidfd, unsigned int flags);
> pidfd_open(pid_t pid, int nsfd, unsigned int flags);

That honestly just feels like splitting openat into:
openat_dir()
and
opentat_file()

2019-03-28 00:43:14

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

pidfd_open is open pidfd for pid relative to pidns, so a better
analogy is that it is like openat for a relative pathname wrt dirfd.
O_DIRECTORY is analogous to what type of object, so a TIDFD flag in
the future which interprets pid (pathname) as thread id only and pins
that specific struct pid. That has now limited you to the specific
object and operations you can perform on said object with other pidfd
APIs.

procfd_open in my proposed signature is just a fancy race free openat
on the corresponding pid of the pidfd relative to the procrootfd
(which becomes the dirfd if i were doing it in userspace), which might
as well been implemented in userspace if things were not racy. Andy
suggested something similar.

My point is, when I am talking of the pidfd API, procfs is irrelevant.
You are thinking of it as a process directory and a process file, I am
thinking of it in terms of a process object and the proc dir fd as an
file system based interface to query process state (through read)
which is why I object to them being usable with pidfd_send_signal as
well, in principle. In the future, people may use task_diag for the
same purpose or a different interface, to work around its limitations.
This would just be another interface of the kernel to query process
state, not representative of the process object itself. Hence, keeping
the pidfd to procfd translation entirely separate (as already
suggested) sounds much, much better to me.

The pidfd API and related calls are untouched and unaffected by
presence, absence of procfs or not (they are, but you do unrelated
stuff in the same system call). To me atleast, munging opening (and
then changing what the procfd means to support the flag use case),
having flags like PIDFD_TO_PROCFD that will work only without
CLONE_NEWPID, then having eg. GET_TIDFID that may work with
CLONE_NEWPID, etc.

I find this interface confusing.

I have a few steps when starting to work with a pidfd:

1. Acquire

pidfd_open(pid, ns, flags) or pidfd_clone(...)

2. Operate

pidfd_send_signal(...)

For those who need a race free way to open the correct /proc/<PID> for
a pidfd relative to a /proc dir fd, for the purposes of metadata
access, you will have procfd_open, which is in the kernel because the
same thing is racy to do in userspace.

Otherwise, pidfd_open in this patchset is this and also a polymorphic
system call that can become procfd_open in my example when passed a
flag. It is doing vastly different things given the presence and
absence of options. This is similar to a multiplexor again, but it
looks more confusing. You have to mask options.

pidfd_open currently:

pidfd_open(pid, -1, -1, 0); gets pidfd in current active ns
pidfd_open(-1, procrootfd, pidfd, PIDFD_TO_PROCFD); returns dir fd of
/proc/<PID> it maps to rel. to proc rootfd
pidfd_open(-1, nsfd, pidfd, CLONE_NEWPID); as you propose this
searches pid in pidns pinned by nsfd, and returns a pidfd file
descriptor.

Extend this to threads in the future, and the combination and
permutation starts getting confusing. Based on the flag, it is
entirely changing what will it work upon, and what it will do.

I can reasonably summarise my pidfd_open and procfd_open in their man
page in one line:

pidfd_open(pid, ns, flags) returns pidfd for pid, searching it in the
pidns pinned by ns fd, and flags will determine further if this is
thread local or process local (i.e. tid or tgid, and tgid == tid for
single threaded) (in the future) (so you could do thread directed
signals by passing a flag to pidfd_send_signal and this pidfd).

Your call without CONFIG_PROC_FS will be literally this, but a few
options will have to be set as -1.

procfd_open(procrootfd, pidfd, flags), returns the proc dir fd for the
pid/tid depending on if the pidfd is thread local, process local, hint
it in flags, etc. It is just a race free wrapper around an openat in
userspace, undergoing the same access control checks.

Yes, pidfd_open as it is now works *just fine*, but it is more
confusing to use and discuss. The conclusion from the previous
discussion also seemed to be to split pidctl's PIDCMD_GET_PIDFD into
its own thing, and provide a translation from pidfd to its proc dir fd
on its own. Then, translate_pid can be its own thing, or you could
extend ioctl_ns(2) if you want.

All that said, thanks for the work on this once again. My intention is
just that we don't end up with an API that could have been done better
and be cleaner to use for potential users in the coming years.

2019-03-28 10:39:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

> All that said, thanks for the work on this once again. My intention is
> just that we don't end up with an API that could have been done better
> and be cleaner to use for potential users in the coming years.

Thanks for your input on all of this. I still don't find multiplexers in
the style of seccomp()/fsconfig()/keyctl() to be a problem since they
deal with a specific task. They are very much different from ioctl()s in
that regard. But since Joel, you, and Daniel found the pidctl() approach
not very nice I dropped it. The interface needs to be satisfactory for
all of us especially since Android and other system managers will be the
main consumers.
So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
allows to cleanly get pidfds independent procfs and do the translation
to procpidfds in an ioctl() as we've discussed in prior threads. This
should also accommodate comments and ideas from Andy and Jann.
I'm coding this up now.

Christian

2019-03-28 17:01:06

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()



On March 28, 2019 6:38:15 AM EDT, Christian Brauner <[email protected]> wrote:
>> All that said, thanks for the work on this once again. My intention
>is
>> just that we don't end up with an API that could have been done
>better
>> and be cleaner to use for potential users in the coming years.
>
>Thanks for your input on all of this. I still don't find multiplexers
>in
>the style of seccomp()/fsconfig()/keyctl() to be a problem since they
>deal with a specific task. They are very much different from ioctl()s
>in
>that regard. But since Joel, you, and Daniel found the pidctl()
>approach
>not very nice I dropped it. The interface needs to be satisfactory for
>all of us especially since Android and other system managers will be
>the
>main consumers.
>So let's split this into pidfd_open(pid_t pid, unsigned int flags)
>which
>allows to cleanly get pidfds independent procfs and do the translation
>to procpidfds in an ioctl() as we've discussed in prior threads. This
>should also accommodate comments and ideas from Andy and Jann.
>I'm coding this up now.

This sounds quite sensible to me. Thanks!


Joel Fernandes, Android kernel team
Sent from k9-mail on Android

2019-03-28 17:09:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Thu, Mar 28, 2019 at 12:59:46PM -0400, Joel Fernandes wrote:
>
>
> On March 28, 2019 6:38:15 AM EDT, Christian Brauner <[email protected]> wrote:
> >> All that said, thanks for the work on this once again. My intention
> >is
> >> just that we don't end up with an API that could have been done
> >better
> >> and be cleaner to use for potential users in the coming years.
> >
> >Thanks for your input on all of this. I still don't find multiplexers
> >in
> >the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> >deal with a specific task. They are very much different from ioctl()s
> >in
> >that regard. But since Joel, you, and Daniel found the pidctl()
> >approach
> >not very nice I dropped it. The interface needs to be satisfactory for
> >all of us especially since Android and other system managers will be
> >the
> >main consumers.
> >So let's split this into pidfd_open(pid_t pid, unsigned int flags)
> >which
> >allows to cleanly get pidfds independent procfs and do the translation
> >to procpidfds in an ioctl() as we've discussed in prior threads. This
> >should also accommodate comments and ideas from Andy and Jann.
> >I'm coding this up now.
>
> This sounds quite sensible to me. Thanks!

Thanks! I have it ready and hope to send it out later today.

Christian

2019-03-30 05:36:45

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner <[email protected]> wrote:
>
> > All that said, thanks for the work on this once again. My intention is
> > just that we don't end up with an API that could have been done better
> > and be cleaner to use for potential users in the coming years.
>
> Thanks for your input on all of this. I still don't find multiplexers in
> the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> deal with a specific task. They are very much different from ioctl()s in
> that regard. But since Joel, you, and Daniel found the pidctl() approach
> not very nice I dropped it. The interface needs to be satisfactory for
> all of us especially since Android and other system managers will be the
> main consumers.

Thanks.

> So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> allows to cleanly get pidfds independent procfs and do the translation
> to procpidfds in an ioctl() as we've discussed in prior threads. This

I sustain my objection to adding an ioctl. Compared to a system call,
an ioctl has a more rigid interface, greater susceptibility to
programmer error (due to the same ioctl control code potentially doing
different things for different file types), longer path length, and
more awkward filtering/monitoring/auditing/tracing. We've discussed
this issue at length before, and I thought we all agreed to use system
calls, not ioctl, for core kernel functionality. So why is an ioctl
suddenly back on the table? The way I see it, an ioctl has no
advantages except for 1) conserving system call numbers, which are not
scarce, and 2) avoiding the system call number coordination problem
(and the coordination problem isn't a factor for core kernel code). I
don't understand everyone's reluctance to add new system calls. What
am I missing? Why would we give up all the advantages that a system
call gives us?

I also don't understand Andy's argument on the other thread that an
ioctl is okay if it's an "operation on an FD" --- *most* system calls
are operations on FDs. We don't have an ioctl for sendmsg(2) and it's
an "operation on an FD".

2019-03-30 06:26:06

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione <[email protected]> wrote:
>
> On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner <[email protected]> wrote:
> >
> > > All that said, thanks for the work on this once again. My intention is
> > > just that we don't end up with an API that could have been done better
> > > and be cleaner to use for potential users in the coming years.
> >
> > Thanks for your input on all of this. I still don't find multiplexers in
> > the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> > deal with a specific task. They are very much different from ioctl()s in
> > that regard. But since Joel, you, and Daniel found the pidctl() approach
> > not very nice I dropped it. The interface needs to be satisfactory for
> > all of us especially since Android and other system managers will be the
> > main consumers.
>
> Thanks.
>
> > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> > allows to cleanly get pidfds independent procfs and do the translation
> > to procpidfds in an ioctl() as we've discussed in prior threads. This
>
> I sustain my objection to adding an ioctl. Compared to a system call,
> an ioctl has a more rigid interface, greater susceptibility to
> programmer error (due to the same ioctl control code potentially doing
> different things for different file types), longer path length, and
> more awkward filtering/monitoring/auditing/tracing. We've discussed
> this issue at length before, and I thought we all agreed to use system
> calls, not ioctl, for core kernel functionality. So why is an ioctl
> suddenly back on the table? The way I see it, an ioctl has no
> advantages except for 1) conserving system call numbers, which are not
> scarce, and 2) avoiding the system call number coordination problem
> (and the coordination problem isn't a factor for core kernel code). I
> don't understand everyone's reluctance to add new system calls. What
> am I missing? Why would we give up all the advantages that a system
> call gives us?
>

I agree in general, but in this particular case a system call or an
ioctl doesn't matter much, all it does is take the pidfd, the command,
and /proc's dir fd.

If you start adding a system call for every specific operation on file
descriptors, it *will* become a problem. Besides, the translation is
just there because it is racy to do in userspace, it is not some well
defined core kernel functionality. Therefore, it is just a way to
enter the kernel to do the openat in a race free and safe manner.

As is, the facility being provided through an ioctl on the pidfd is
not something I'd consider a problem. I think the translation stuff
should also probably be an extension of ioctl_ns(2) (but I wouldn't be
opposed if translate_pid is resurrected as is).

For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
procrootfd), I'd agree that a system call would be a cleaner
interface, otherwise, if you cannot generalise it, using ioctls as a
command interface is probably the better tradeoff here.

> I also don't understand Andy's argument on the other thread that an
> ioctl is okay if it's an "operation on an FD" --- *most* system calls
> are operations on FDs. We don't have an ioctl for sendmsg(2) and it's
> an "operation on an FD".

2019-03-30 07:40:12

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Fri, Mar 29, 2019 at 11:25 PM Jonathan Kowalski <[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner <[email protected]> wrote:
> > >
> > > > All that said, thanks for the work on this once again. My intention is
> > > > just that we don't end up with an API that could have been done better
> > > > and be cleaner to use for potential users in the coming years.
> > >
> > > Thanks for your input on all of this. I still don't find multiplexers in
> > > the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> > > deal with a specific task. They are very much different from ioctl()s in
> > > that regard. But since Joel, you, and Daniel found the pidctl() approach
> > > not very nice I dropped it. The interface needs to be satisfactory for
> > > all of us especially since Android and other system managers will be the
> > > main consumers.
> >
> > Thanks.
> >
> > > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> > > allows to cleanly get pidfds independent procfs and do the translation
> > > to procpidfds in an ioctl() as we've discussed in prior threads. This
> >
> > I sustain my objection to adding an ioctl. Compared to a system call,
> > an ioctl has a more rigid interface, greater susceptibility to
> > programmer error (due to the same ioctl control code potentially doing
> > different things for different file types), longer path length, and
> > more awkward filtering/monitoring/auditing/tracing. We've discussed
> > this issue at length before, and I thought we all agreed to use system
> > calls, not ioctl, for core kernel functionality. So why is an ioctl
> > suddenly back on the table? The way I see it, an ioctl has no
> > advantages except for 1) conserving system call numbers, which are not
> > scarce, and 2) avoiding the system call number coordination problem
> > (and the coordination problem isn't a factor for core kernel code). I
> > don't understand everyone's reluctance to add new system calls. What
> > am I missing? Why would we give up all the advantages that a system
> > call gives us?
> >
>
> I agree in general, but in this particular case a system call or an
> ioctl doesn't matter much, all it does is take the pidfd, the command,
> and /proc's dir fd.

Thanks again.

I agree that the operation we're discussing has a simple signature,
but signature flexibility isn't the only reason to prefer a system
call over an ioctl. There are other reasons for preferring system
calls to ioctls (safety, tracing, etc.) that apply even if the
operation we're discussing has a relatively simple signature: for
example, every system call has a distinct and convenient ftrace event,
but ioctls don't; strace filtering Just Works on a
system-call-by-system-call basis, but it doesn't for ioctls; and
documentation for system calls is much more discoverable (e.g., man
-k) than documentation for ioctls. Even if the distinction doesn't
matter much, IMHO, it still matters a little, enough to favor a system
call without an offsetting advantage for the ioctl option.

> If you start adding a system call for every specific operation on file
> descriptors, it *will* become a problem.

I'm not sure what you mean. Do you mean that adding a top-level system
call for every operation that might apply to one specific kind of file
descriptor would lead, as the overall result, to the kernel having
enough system calls to cause negative consequences? I'm not sure I
agree, but accepting this idea for the sake of discussion: shouldn't
we be more okay with system calls for features present on almost all
systems --- like procfs --- even if we punt to ioctls very rarely-used
functionality, e.g., some hypothetical special squeak noise that you
could get some specific 1995 adlib clone to make?

> Besides, the translation is
> just there because it is racy to do in userspace, it is not some well
> defined core kernel functionality.
> Therefore, it is just a way to
> enter the kernel to do the openat in a race free and safe manner.

I agree that the translation has to be done in the kernel, not
userspace, and that the kernel must provide to userspace some
interface for requesting that the translation happen: we're just
discussing the shape of this interface. Shouldn't all interfaces
provided by the kernel to userspace be equally well defined? I'm not
sure that the internal simplicity of the operation matters much
either. There are already explicit system calls for some
simple-to-implement things, e.g., timerfd_gettime. It's worth noting
that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
and optional.

> As is, the facility being provided through an ioctl on the pidfd is
> not something I'd consider a problem.

You're right that from a signature perspective, using an ioctl isn't a
problem. I just want to make sure we take into account the other,
non-signature advantages that system calls have over ioctls.

> I think the translation stuff
> should also probably be an extension of ioctl_ns(2) (but I wouldn't be
> opposed if translate_pid is resurrected as is).
> For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
> procrootfd), I'd agree that a system call would be a cleaner
> interface, otherwise, if you cannot generalise it, using ioctls as a
> command interface is probably the better tradeoff here.

Sure: I just want to better understand everyone else's thought process
here, having been frustrated with things like the termios ioctls being
ioctls.

2019-03-30 14:31:49

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione <[email protected]> wrote:
>
> [SNIP]
>
> Thanks again.
>
> I agree that the operation we're discussing has a simple signature,
> but signature flexibility isn't the only reason to prefer a system
> call over an ioctl. There are other reasons for preferring system
> calls to ioctls (safety, tracing, etc.) that apply even if the
> operation we're discussing has a relatively simple signature: for
> example, every system call has a distinct and convenient ftrace event,
> but ioctls don't; strace filtering Just Works on a
> system-call-by-system-call basis, but it doesn't for ioctls; and

It does for those with a unique number.

> documentation for system calls is much more discoverable (e.g., man
> -k) than documentation for ioctls. Even if the distinction doesn't
> matter much, IMHO, it still matters a little, enough to favor a system
> call without an offsetting advantage for the ioctl option.

It will be alongside other I/O operations in the pidfd man page.

>
> > If you start adding a system call for every specific operation on file
> > descriptors, it *will* become a problem.
>
> I'm not sure what you mean. Do you mean that adding a top-level system
> call for every operation that might apply to one specific kind of file
> descriptor would lead, as the overall result, to the kernel having
> enough system calls to cause negative consequences? I'm not sure I
> agree, but accepting this idea for the sake of discussion: shouldn't
> we be more okay with system calls for features present on almost all
> systems --- like procfs --- even if we punt to ioctls very rarely-used
> functionality, e.g., some hypothetical special squeak noise that you
> could get some specific 1995 adlib clone to make?

Consider if we were to do:

int procpidfd_open(int pidfd, int procrootfd);

This system call is useless besides a very specific operation for a
very specific usecase on a file descriptor. Then, you figure you might
want to support procpidfd back to pidfd (although YAGNI), so you will
add a CMD or flags argument now,

int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);

int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);

vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
Then, you want pidfd2pid, and so on.

If you don't, you will have to add a command interface in the new
system call, which changes parameters depending on the flags. This is
already starting to get ugly. In the end, it's a matter of taste, this
pattern if exploited leads to endless proliferation of the system call
interface, often ridden with short-sighted APIs, because you cannot
know if you want a focused call or a command style call. ioctl is
already a command interface, and 10 system calls for each command is
_NOT_ nice, from a user perspective.

>
> > Besides, the translation is
> > just there because it is racy to do in userspace, it is not some well
> > defined core kernel functionality.
> > Therefore, it is just a way to
> > enter the kernel to do the openat in a race free and safe manner.
>
> I agree that the translation has to be done in the kernel, not
> userspace, and that the kernel must provide to userspace some
> interface for requesting that the translation happen: we're just
> discussing the shape of this interface. Shouldn't all interfaces
> provided by the kernel to userspace be equally well defined? I'm not
> sure that the internal simplicity of the operation matters much
> either. There are already explicit system calls for some
> simple-to-implement things, e.g., timerfd_gettime. It's worth noting
> that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
> and optional.

It is well defined, it has a well defined signature, and it will error
out if you don't use it properly. Again, what it does is very limited
and niche. I am not sure it warrants a system call of its own.

timerfd_gettime was an afterthought anyway, that's probably not a good
example (it was more to just match the POSIX timers interface as the
original timerfd never had support for querying, so the split into two
steps, create and initialize, you could argue one could do it without
a syscall, but it still has a well defined argument list, and
accepting elaborate data structures into an ioctl is not a good
interface to plumb, so that's easily justified).

It's much like socket and setsockopt/getsockopt in nature. I would
even say APIs separating creation and configuration age well and are
better, but a process doesn't fit such a model cleanly.

>
> > As is, the facility being provided through an ioctl on the pidfd is
> > not something I'd consider a problem.
>
> You're right that from a signature perspective, using an ioctl isn't a
> problem. I just want to make sure we take into account the other,
> non-signature advantages that system calls have over ioctls.
>
> > I think the translation stuff
> > should also probably be an extension of ioctl_ns(2) (but I wouldn't be
> > opposed if translate_pid is resurrected as is).
> > For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
> > procrootfd), I'd agree that a system call would be a cleaner
> > interface, otherwise, if you cannot generalise it, using ioctls as a
> > command interface is probably the better tradeoff here.
>
> Sure: I just want to better understand everyone else's thought process
> here, having been frustrated with things like the termios ioctls being
> ioctls.

On that note, I don't buy the safety argument here, or the seccomp argument.

You need to consider what this is, on a case by case basis.

ioctl(pidfd, PIDFD_TO_PROCFD, procrootfd);

See? You need /proc's root fd to be able to get to the /proc/<PID> dir
fd, in a race free manner. You don't need to have the overhead of
filtering to contain access, as privilege is bound to descriptors.

However, an ioctl to do pidfd_send_signal would have indeed been
problematic, it takes in structures, it has a very core use case, and
a very compelling advantage over existing signal sending APIs.

On the point, using file descriptors, you can safely delegate metadata
access to a process it has a pidfd open for, without /proc being in
its mount namespace (which was what Andy was after). File descriptors
are acting as capabilities, so if you don't leak the /proc's root fd
to the process, you can be assured it wouldn't be able to do any
metadata access (and you don't have /proc mounted).

*No need* to blacklist the ioctl or the system call, as it wouldn't
work anyway. *This* is a good interface, with little to no overhead
for security. So whether this is an ioctl or a system call doesn't
matter at all, because it doesn't suffer from all those cases where
you really want to avoid ioctls and do a well scoped system call as an
entrypoint instead.

There is not much incentive to it, really.

2019-03-30 16:09:41

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH 2/4] pid: add pidfd_open()

On Sat, Mar 30, 2019 at 7:30 AM Jonathan Kowalski <[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione <[email protected]> wrote:
> >
> > [SNIP]
> >
> > Thanks again.
> >
> > I agree that the operation we're discussing has a simple signature,
> > but signature flexibility isn't the only reason to prefer a system
> > call over an ioctl. There are other reasons for preferring system
> > calls to ioctls (safety, tracing, etc.) that apply even if the
> > operation we're discussing has a relatively simple signature: for
> > example, every system call has a distinct and convenient ftrace event,
> > but ioctls don't; strace filtering Just Works on a
> > system-call-by-system-call basis, but it doesn't for ioctls; and
>
> It does for those with a unique number.

There's no such thing as a unique ioctl number though. Anyone is free
to create an ioctl with a conflicting number. Sure, you can guarantee
ioctl request number uniqueness within things that ship with the
kernel, but you can also guarantee system call number uniqueness, so
what do we gain by using an ioctl? And yes, you can do *some*
filtering based on ioctl command, but it's frequently more awkward
than the system call equivalent and sometimes doesn't work at all.
What's the strace -e pidfd_open equivalent for an ioctl? Grep isn't
quite equivalent.

> > documentation for system calls is much more discoverable (e.g., man
> > -k) than documentation for ioctls. Even if the distinction doesn't
> > matter much, IMHO, it still matters a little, enough to favor a system
> > call without an offsetting advantage for the ioctl option.
>
> It will be alongside other I/O operations in the pidfd man page.
>
> >
> > > If you start adding a system call for every specific operation on file
> > > descriptors, it *will* become a problem.
> >
> > I'm not sure what you mean. Do you mean that adding a top-level system
> > call for every operation that might apply to one specific kind of file
> > descriptor would lead, as the overall result, to the kernel having
> > enough system calls to cause negative consequences? I'm not sure I
> > agree, but accepting this idea for the sake of discussion: shouldn't
> > we be more okay with system calls for features present on almost all
> > systems --- like procfs --- even if we punt to ioctls very rarely-used
> > functionality, e.g., some hypothetical special squeak noise that you
> > could get some specific 1995 adlib clone to make?
>
> Consider if we were to do:
>
> int procpidfd_open(int pidfd, int procrootfd);
>
> This system call is useless besides a very specific operation for a
> very specific usecase on a file descriptor.

I don't understand this argument based on an operation being "very
specific". Are you suggesting that epoll_wait should have been an
ioctl? I don't see how an operation being specific to a type of file
descriptor is an argument for making that operation an ioctl instead
of a system call.

> Then, you figure you might
> want to support procpidfd back to pidfd (although YAGNI), so you will
> add a CMD or flags argument now,

Why would we need a system call at all? And why are we invoking
hypothetical argument B in an argument against adding operation A as a
system call? B doesn't exist yet. As Christian mentioned, we could
create a named /proc/pid/handle file which, when opened, would yield a
pidfd. More generally, though: if we expose new functionality, we can
make a new system call. Why is that worse than adding a new ioctl
code?

> int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);
>
> int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
> int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);
>
> vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
> Then, you want pidfd2pid, and so on.
>
> If you don't, you will have to add a command interface in the new
> system call, which changes parameters depending on the flags. This is
> already starting to get ugly.

> In the end, it's a matter of taste,

I don't think it's quite a matter of taste. That idiom suggests that
the two options are roughly functionally equivalent. In this case,
there are clear and specific technical reasons to prefer system calls.
We're not talking about two equivalent approaches. Making the
operation a system call gives users more power, and we shouldn't
deprive them of this power without a good reason. So far, I haven't
seen such a reason.

> this
> pattern if exploited leads to endless proliferation of the system call
> interface, often ridden with short-sighted APIs, because you cannot
> know if you want a focused call or a command style call.

Bad interfaces proliferate no matter what, and there's no difference
in support burden between an ioctl and a system call: both need to
work indefinitely. Are you saying that it's easier to remove a bad
interface if it's an ioctl instead of a system call? I don't recall
any precedent, but maybe I'm wrong.

> ioctl is
> already a command interface, and 10 system calls for each command is
> _NOT_ nice, from a user perspective.

syscall(2) is already a "command" interface, just with more automatic
features than ioctl. What do we gain by pushing the "command" down a
level from the syscall(2) multiplexer to the ioctl(2) multiplexer?

> > > Besides, the translation is
> > > just there because it is racy to do in userspace, it is not some well
> > > defined core kernel functionality.
> > > Therefore, it is just a way to
> > > enter the kernel to do the openat in a race free and safe manner.
> >
> > I agree that the translation has to be done in the kernel, not
> > userspace, and that the kernel must provide to userspace some
> > interface for requesting that the translation happen: we're just
> > discussing the shape of this interface. Shouldn't all interfaces
> > provided by the kernel to userspace be equally well defined? I'm not
> > sure that the internal simplicity of the operation matters much
> > either. There are already explicit system calls for some
> > simple-to-implement things, e.g., timerfd_gettime. It's worth noting
> > that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
> > and optional.
>
> It is well defined, it has a well defined signature, and it will error
> out if you don't use it properly. Again, what it does is very limited
> and niche. I am not sure it warrants a system call of its own.

Do we need to think in terms of whether a bit of functionality
"warrants" a system call? A system call is not expensive to add.
Instead of thinking of whether a function meets some bar for promotion
to a system call, we should be thinking of whether a function meets
the bar for demotion to ioctl. System calls are generally more useful
to users than ioctls, and all things being equal, creating an
interface more pleasant for users should be the priority.

> timerfd_gettime was an afterthought anyway, that's probably not a good
> example (it was more to just match the POSIX timers interface as the
> original timerfd never had support for querying, so the split into two
> steps, create and initialize, you could argue one could do it without
> a syscall, but it still has a well defined argument list, and
> accepting elaborate data structures into an ioctl is not a good
> interface to plumb, so that's easily justified).

TCGETS deals with an elaborate data structure.

In any case, I don't think the elaborate data structure argument is
relevant. The things that the system call mechanism lets you do that
the ioctl mechanism does not are independent of whether a system call
accepts simple or complex argument types. And ioctl and a system call
both specify a bit of kernel mode code to run: the difference is in
identifying that bit of kernel-side code to run and in the features
you get automatically through the call mechanism.

> It's much like socket and setsockopt/getsockopt in nature. I would
> even say APIs separating creation and configuration age well and are
> better, but a process doesn't fit such a model cleanly.
>
> > > As is, the facility being provided through an ioctl on the pidfd is
> > > not something I'd consider a problem.
> >
> > You're right that from a signature perspective, using an ioctl isn't a
> > problem. I just want to make sure we take into account the other,
> > non-signature advantages that system calls have over ioctls.
> >
> > > I think the translation stuff
> > > should also probably be an extension of ioctl_ns(2) (but I wouldn't be
> > > opposed if translate_pid is resurrected as is).
> > > For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
> > > procrootfd), I'd agree that a system call would be a cleaner
> > > interface, otherwise, if you cannot generalise it, using ioctls as a
> > > command interface is probably the better tradeoff here.
> >
> > Sure: I just want to better understand everyone else's thought process
> > here, having been frustrated with things like the termios ioctls being
> > ioctls.
>
> On that note, I don't buy the safety argument here, or the seccomp argument.
>
> You need to consider what this is, on a case by case basis.
>
> ioctl(pidfd, PIDFD_TO_PROCFD, procrootfd);
>
> See? You need /proc's root fd to be able to get to the /proc/<PID> dir
> fd, in a race free manner. You don't need to have the overhead of
> filtering to contain access, as privilege is bound to descriptors.

There's no filtering overhead when you haven't enabled filtering, right?

You seem to be making an argument that we should make this facility an
ioctl because we don't need the features that the system call
mechanism provides --- correct me if I'm wrong. Even if we can't see a
case for filtering or containment or tracing _today_, such a case
might arise in the future, as has happened many times before. Besides,
we haven't touched on the other features we get "for free" when we
make an operation a system call, e.g., explicit tracing support. I
don't understand why we should forestall these uses cases, both
present and future, by using an ioctl instead of a system call,
especially when we get so little in return. I think there's a lot of
value in regularity and consistency and that making some operations
ioctls and some syscalls based on details like whether an operation
accepts a structure or a primitive makes the system less flexible and
understandable.

> [snip further discussion along the same lines]