2019-04-18 10:20:33

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 0/5] clone: add CLONE_PIDFD

Hey,

/* v2 summary */
Move put_user() into copy process before clone's point of no return so
that we can handle put_user() errors as suggested by Oleg. The good news
is that this again allows us to make the patch smaller.

/* v1 summary */
As suggested by Oleg, have pidfds returned in the fourth argument of
clone allowing us to return a pidfd and its pid to the caller at the
same time. This has various advantages:
- callers get the associated pid for the pidfd without additional
parsing
This makes it easier for userspce to get metadata access through
procfs.
- the type of the return value for clone() remains unchanged
(was changed to return an fd in the previous iteration)
- pid file descriptor numbering can start at 0 as is customary for
file descriptors
(was changed to start at 1 in the previous patchset to not break
fork()-like error checking when returning pidfds)
- finally, the patchset has gotten smaller

The patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call as previously discussed.

As decided last week [1] Jann and I have refined the implementation of
pidfds as anonymous inodes. Based on last weeks RFC we have only tweaked
documentation and naming, as well as making the sample program how to
get easy metadata access from a pidfd a little cleaner and more paranoid
when checking for errors.
The sample program can also serve as a test for the patchset.

When clone is called with CLONE_PIDFD a pidfd will be returned in the
fourth argument of clone. This is based on an idea from Oleg. It allows
us to return a pidfd and the associated pid to the caller at the same
time.

We have taken care that pidfds are created *after* the fd table has
been unshared to not leak pidfds into child processes.

The actual code for CLONE_PIDFD in patch 2 is completely confined to
fork.c (apart from the CLONE_PIDFD definition of course) and is rather
small and hopefully good to review.

The additional changes listed under David's name in the diffstat below
are here to make anon_inodes available unconditionally. They are needed
for the new mount api and thus for core vfs code in addition to pidfds.
David knows this and he has informed Al that this patch is sent out
here. The changes themselves are rather automatic.

As promised I have also contacted Joel who has sent a patchset to make
pidfds pollable. He has been informed and is happy to port his patchset
once we have moved forward [2].
Jann and I currently plan to target this patchset for inclusion in the 5.2
merge window.

Thanks!
Jann & Christian

[1]: https://lore.kernel.org/lkml/CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/[email protected]/

Christian Brauner (4):
clone: add CLONE_PIDFD
signal: use fdget() since we don't allow O_PATH
signal: support CLONE_PIDFD with pidfd_send_signal
samples: show race-free pidfd metadata access

David Howells (1):
Make anon_inodes unconditional

arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 10 ---
kernel/fork.c | 96 ++++++++++++++++++++++++++--
kernel/signal.c | 14 +++--
kernel/sys_ni.c | 3 -
samples/Makefile | 2 +-
samples/pidfd/Makefile | 6 ++
samples/pidfd/pidfd-metadata.c | 112 +++++++++++++++++++++++++++++++++
26 files changed, 225 insertions(+), 39 deletions(-)
create mode 100644 samples/pidfd/Makefile
create mode 100644 samples/pidfd/pidfd-metadata.c

--
2.21.0


2019-04-18 10:20:37

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 2/5] clone: add CLONE_PIDFD

This patchset makes it possible to retrieve pid file descriptors at process
creation time by introducing the new flag CLONE_PIDFD to the clone() system
call. Linus originally suggested to implement this as a new flag to clone()
instead of making it a separate system call. As spotted by Linus, there is
exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically, this
makes it possible to interpret a pidfd differently, narrowing or widening
the scope of various operations (e.g. signal sending). Thus, a pidfd cannot
just refer to a tgid, but also a tid, or in theory - given appropriate flag
arguments in relevant syscalls - a process group or session. A pidfd does
not represent a privilege. This does not imply it cannot ever be that way
but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers pid
namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the fourth
argument of clone. This has the advantage that we can give back the
associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with a
sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be translated
into a helper that would be suitable for inclusion in libc so that users
don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
/* changelog */
v1:
- Oleg Nesterov <[email protected]>:
- return pidfd in fourth argument of clone
This way we can return the pid and the pidfd at the same time to the
caller and can also start pid file descriptor numbering at 0 as is
customary for file descriptors.
- Christian Brauner <[email protected]>:
- update comments to reflect changes based on Oleg's idea
v2:
- Oleg Nesterov <[email protected]>:
- move put_user() before clone()'s point of no return so we can handle
put_user() errors
- Christian Brauner <[email protected]>:
- change pidfd_create() to also fd_install()
With Oleg's change it makes sense to do the fd_install() right before
the moved put_user().
---
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 96 ++++++++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid

extern struct pid init_struct_pid;

+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..ed4ee170bee2 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..201aafdac727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/

+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
#include <linux/sched/mm.h>
@@ -21,8 +22,10 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
+#include <linux/fsnotify.h>
#include <linux/unistd.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
@@ -1662,6 +1665,58 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif /* #ifdef CONFIG_TASKS_RCU */
}

+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+ int fd;
+
+ fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid),
+ O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
- int retval;
+ int pidfd = -1, retval;
struct task_struct *p;
struct multiprocess_signals delayed;

@@ -1730,6 +1786,19 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}

+ /* Pidfds will be returned through parent_tidptr. */
+ if ((clone_flags & (CLONE_PIDFD | CLONE_PARENT_SETTID)) ==
+ (CLONE_PIDFD | CLONE_PARENT_SETTID))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Ensure that we can potentially reuse CLONE_DETACHED for
+ * CLONE_PIDFD in the future.
+ */
+ if ((clone_flags & (CLONE_PIDFD | CLONE_DETACHED)) ==
+ (CLONE_PIDFD | CLONE_DETACHED))
+ return ERR_PTR(-EINVAL);
+
/*
* Force any signals received before this point to be delivered
* before the fork happens. Collect up signals sent to multiple
@@ -1936,6 +2005,22 @@ static __latent_entropy struct task_struct *copy_process(
}
}

+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child
+ * if the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+
+ pidfd = retval;
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1996,7 +2081,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;

/*
* From this point on we must avoid any synchronous user-space
@@ -2111,6 +2196,9 @@ static __latent_entropy struct task_struct *copy_process(
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ ksys_close(pidfd);
bad_fork_free_pid:
cgroup_threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -2176,7 +2264,7 @@ static inline void init_idle_pids(struct task_struct *idle)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
cpu_to_node(cpu));
if (!IS_ERR(task)) {
init_idle_pids(task);
@@ -2223,7 +2311,7 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}

- p = copy_process(clone_flags, stack_start, stack_size,
+ p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();

--
2.21.0

2019-04-18 10:20:41

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
/* changelog */
v1: patch unchanged
v2:
- Oleg Nesterov <[email protected]>:
- split s/fdget_raw()/fdget()/ into separate patch as it has nothing to
do with supporting CLONE_PIDFD
---
kernel/signal.c | 12 +++++++++---
kernel/sys_ni.c | 3 ---
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 227ba170298e..cd83cc376767 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
return kill_something_info(sig, &info, pid);
}

-#ifdef CONFIG_PROC_FS
/*
* Verify that the signaler and signalee either are in the same pid namespace
* or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
return copy_siginfo_from_user(kinfo, info);
}

+static struct pid *pidfd_to_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return tgid_pidfd_to_pid(file);
+}
+
/**
* sys_pidfd_send_signal - send a signal to a process through a task file
* descriptor
@@ -3586,7 +3593,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
return -EBADF;

/* Is this a pidfd? */
- pid = tgid_pidfd_to_pid(f.file);
+ pid = pidfd_to_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
goto err;
@@ -3620,7 +3627,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
fdput(f);
return ret;
}
-#endif /* CONFIG_PROC_FS */

static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);

/* kernel/sched/core.c */

-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
/* kernel/sys.c */
COND_SYSCALL(setregid);
COND_SYSCALL(setgid);
--
2.21.0

2019-04-18 10:20:49

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 5/5] samples: show race-free pidfd metadata access

This is a sample program showing userspace how to get race-free access to
process metadata from a pidfd. It is rather easy to do and userspace can
actually simply reuse code that currently parses a process's status file in
procfs.
The program can easily be extended into a generic helper suitable for
inclusion in a libc to make it even easier for userspace to gain metadata
access.

Since this came up in a discussion since this API is going to be used in
various service managers. A lot of programs will have a whitelist seccomp
filter that returns EPERM for all new syscalls. This means that programs
might get confused if CLONE_PIDFD works but the later pidfd_send_signal()
syscall doesn't. Hence, here's a ahead of time check that
pidfd_send_signal() is supported:

bool pidfd_send_signal_supported()
{
int procfd = open("/proc/self", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
if (procfd < 0)
return false;

/* pidfd_send_signal() should never fail this test. So it must
* mean it is not available or blocked by an LSM or seccomp or
* other. So * fallback to using pids in this case.
*/
return pidfd_send_signal(procfd, 0, NULL, 0) == 0;
}

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
/* changelog */
v1:
- Christian Brauner <[email protected]>:
- adapt sample program to changes in how CLONE_PIDFD returns the pidfd
With Oleg's suggestion we can simplify the program even more.
v2: patch unchanged
---
samples/Makefile | 2 +-
samples/pidfd/Makefile | 6 ++
samples/pidfd/pidfd-metadata.c | 112 +++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+), 1 deletion(-)
create mode 100644 samples/pidfd/Makefile
create mode 100644 samples/pidfd/pidfd-metadata.c

diff --git a/samples/Makefile b/samples/Makefile
index b1142a958811..fadadb1c3b05 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
configfs/ connector/ v4l/ trace_printk/ \
- vfio-mdev/ statx/ qmi/ binderfs/
+ vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
new file mode 100644
index 000000000000..0ff97784177a
--- /dev/null
+++ b/samples/pidfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+hostprogs-y := pidfd-metadata
+always := $(hostprogs-y)
+HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
+all: pidfd-metadata
diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
new file mode 100644
index 000000000000..bd8456fc4c0e
--- /dev/null
+++ b/samples/pidfd/pidfd-metadata.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+static int do_child(void *args)
+{
+ printf("%d\n", getpid());
+ _exit(EXIT_SUCCESS);
+}
+
+static pid_t pidfd_clone(int flags, int *pidfd)
+{
+ size_t stack_size = 1024;
+ char *stack[1024] = { 0 };
+
+#ifdef __ia64__
+ return __clone2(do_child, stack, stack_size, flags | SIGCHLD, NULL, pidfd);
+#else
+ return clone(do_child, stack + stack_size, flags | SIGCHLD, NULL, pidfd);
+#endif
+}
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+ unsigned int flags)
+{
+ return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static int pidfd_metadata_fd(pid_t pid, int pidfd)
+{
+ int procfd, ret;
+ char path[100];
+
+ snprintf(path, sizeof(path), "/proc/%d", pid);
+ procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (procfd < 0) {
+ warn("Failed to open %s\n", path);
+ return -1;
+ }
+
+ /*
+ * Verify that the pid has not been recycled and our /proc/<pid> handle
+ * is still valid.
+ */
+ ret = sys_pidfd_send_signal(pidfd, 0, NULL, 0);
+ if (ret < 0) {
+ switch (errno) {
+ case EPERM:
+ /* Process exists, just not allowed to signal it. */
+ break;
+ default:
+ warn("Failed to signal process\n");
+ close(procfd);
+ procfd = -1;
+ }
+ }
+
+ return procfd;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = EXIT_FAILURE;
+ char buf[4096] = { 0 };
+ pid_t pid;
+ int pidfd, procfd, statusfd;
+ ssize_t bytes;
+
+ pid = pidfd_clone(CLONE_PIDFD, &pidfd);
+ if (pid < 0)
+ exit(ret);
+
+ procfd = pidfd_metadata_fd(pid, pidfd);
+ close(pidfd);
+ if (procfd < 0)
+ goto out;
+
+ statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
+ close(procfd);
+ if (statusfd < 0)
+ goto out;
+
+ bytes = read(statusfd, buf, sizeof(buf));
+ if (bytes > 0)
+ bytes = write(STDOUT_FILENO, buf, bytes);
+ close(statusfd);
+ ret = EXIT_SUCCESS;
+
+out:
+ (void)wait(NULL);
+
+ exit(ret);
+}
--
2.21.0

2019-04-18 10:22:32

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH

As stated in the original commit for pidfd_send_signal() we don't allow to
signal processes through O_PATH file descriptors since it is semantically
equivalent to a write on the pidfd. We already correctly error out right
now and return EBADF if an O_PATH fd is passed. This is because we use
file->f_op to detect whether a pidfd is passed and O_PATH fds have their
file->f_op set to empty_fops in do_dentry_open() and thus fail the test.
Thus, there is no regression. It's just semantically correct to use fdget()
and return an error right from there instead of taking a reference and
returning an error later.

Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
/* changelog */
v1: patch not present
v2:
- Oleg Nesterov <[email protected]>:
- split out from following patch since the s/fdget_raw()/fdget()/
replacement is a fix unrelated to the theme of the following patch
---
kernel/signal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..227ba170298e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3581,7 +3581,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
if (flags)
return -EINVAL;

- f = fdget_raw(pidfd);
+ f = fdget(pidfd);
if (!f.file)
return -EBADF;

--
2.21.0

2019-04-18 10:22:32

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 1/5] Make anon_inodes unconditional

From: David Howells <[email protected]>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[[email protected]: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <[email protected]>
---
/* changelog */
v1: patch unchanged
v2: patch unchanged
---
arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
init/Kconfig | 10 ----------
18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on MMU && OF
select PREEMPT_NOTIFIERS
- select ANON_INODES
select ARM_GIC
select ARM_GIC_V3
select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
depends on OF
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
depends on MIPS_FP_SUPPORT
select EXPORT_UASM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
config KVM
bool
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
prompt "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..7a70fb58b2d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
- select ANON_INODES
select ARCH_32BIT_OFF_T if X86_32
select ARCH_CLOCKSOURCE_DATA
select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
select MMU_NOTIFIER
- select ANON_INODES
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
config DMA_SHARED_BUFFER
bool
default n
- select ANON_INODES
select IRQ_WORK
help
This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
config TCG_VTPM_PROXY
tristate "VTPM Proxy Interface"
depends on TCG_TPM
- select ANON_INODES
---help---
This driver proxies for an emulated TPM (vTPM) running in userspace.
A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
config SYNC_FILE
bool "Explicit Synchronization Framework"
default n
- select ANON_INODES
select DMA_SHARED_BUFFER
---help---
The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H

menuconfig GPIOLIB
bool "GPIO Support"
- select ANON_INODES
help
This enables GPIO support through the generic GPIO library.
You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@

menuconfig IIO
tristate "Industrial I/O support"
- select ANON_INODES
help
The industrial I/O subsystem provides a unified framework for
drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD

config INFINIBAND_USER_ACCESS
tristate "InfiniBand userspace access (verbs and CM)"
- select ANON_INODES
depends on MMU
---help---
Userspace InfiniBand access support. This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
- select ANON_INODES
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o

obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
-obj-$(CONFIG_ANON_INODES) += anon_inodes.o
+obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
config FANOTIFY
bool "Filesystem wide access notification"
select FSNOTIFY
- select ANON_INODES
select EXPORTFS
default n
---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
config INOTIFY_USER
bool "Inotify support for userspace"
- select ANON_INODES
select FSNOTIFY
default y
---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
config SYSCTL
bool

-config ANON_INODES
- bool
-
config HAVE_UID16
bool

@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
- select ANON_INODES
help
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.

config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD

config TIMERFD
bool "Enable timerfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD

config EVENTFD
bool "Enable eventfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call"
- select ANON_INODES
select BPF
select IRQ_WORK
default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON

config USERFAULTFD
bool "Enable userfaultfd() system call"
- select ANON_INODES
depends on MMU
help
Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
bool "Kernel performance events and counters"
default y if PROFILING
depends on HAVE_PERF_EVENTS
- select ANON_INODES
select IRQ_WORK
select SRCU
help
--
2.21.0

2019-04-18 13:14:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] clone: add CLONE_PIDFD

On 04/18, Christian Brauner wrote:
>
> @@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
> unsigned long clone_flags,
> unsigned long stack_start,
> unsigned long stack_size,
> + int __user *parent_tidptr,
> int __user *child_tidptr,
> struct pid *pid,
> int trace,
> unsigned long tls,
> int node)
> {
> - int retval;
> + int pidfd = -1, retval;

it seems that initialization is unneeded, but this is cosmetic.

I see no technical problems, feel free to add my reviewed-by.


But let me ask a couple of questions...


Why O_CLOEXEC? I am just curious, I do not really care.


Should we allow CLONE_THREAD | CLONE_PIDFD ?


Are you sure we will never need to extend this interface? If not, then perhaps it
make sense to add something like

if (CLONE_PIDFD) {
unsigned long not_used_yet;
if (get_user(not_used_yet, parent_tidptr) ||
not_used_yet != 0)
return -EINVAL;
}

this way we can easily add more arguments in future or even turn CLONE_PIDFD into
CLONE_MORE_ARGS_IN_PARENT_TIDPTR.

Not that I think this is really good idea, sys_clone2() makes more sense, but still.

Oleg.

2019-04-18 13:18:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH

On 04/18, Christian Brauner wrote:
>
> It's just semantically correct to use fdget()
> and return an error right from there instead of taking a reference and
> returning an error later.

agreed, and thanks for your explanations

Acked-by: Oleg Nesterov <[email protected]>

2019-04-18 13:31:25

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] clone: add CLONE_PIDFD

On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
> On 04/18, Christian Brauner wrote:
> >
> > @@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
> > unsigned long clone_flags,
> > unsigned long stack_start,
> > unsigned long stack_size,
> > + int __user *parent_tidptr,
> > int __user *child_tidptr,
> > struct pid *pid,
> > int trace,
> > unsigned long tls,
> > int node)
> > {
> > - int retval;
> > + int pidfd = -1, retval;
>
> it seems that initialization is unneeded, but this is cosmetic.
>
> I see no technical problems, feel free to add my reviewed-by.

Thank you!

>
>
> But let me ask a couple of questions...

Sure!

>
>
> Why O_CLOEXEC? I am just curious, I do not really care.

I think that having the file descriptor O_CLOEXEC by default is a good
security measure in general. Most of the time you do not want to pass a
file descriptor through exec() (apart from 0,1,2) and it is usually more
of an issue when you accidently do it then when you accidently don't. So
if users really care about passing a pidfd they should do so by removing
the O_CLOEXEC flag explicitly.
(New file descriptors should probably all default to that but that's just
my opinion.)
Another thing is that for a pidfds it makes even more sense to have them
cloexec by default. You don't want to *unintentionally* leak an fd that
can be used to operate on a process.

>
>
> Should we allow CLONE_THREAD | CLONE_PIDFD ?

I think so, yes. I have thought about this. Yes, due to CLONE_FILES |
CLONE_VM you'd necessarily hand the pidfd to the child but threads are
no security boundary in the first place. So if you have a
non-cooperating thread you very much already have a problem. The
situation is not very much different from passing the tid.
Plus, CLONE_PIDFD can make use of the CLONE_UNDETACHED flag in the
future in contrast to all the other flags. So one could potentially (not
saying we have this planned!) add a flag CLONE_PIDFD_KILL_ON_CLOSE (This
is just an improvised bad name rn.) which would cause the child to kill
itself if it does close(pidfd) on its own pidfd and noone else is
holding a reference at which point you'd hand a suicide switch to a new
thread but nothing else.

>
>
> Are you sure we will never need to extend this interface? If not, then perhaps it

Well, we can already make use of CLONE_UNDETACHED with CLONE_PIDFD since
we don't just ignore it. :)

> make sense to add something like
>
> if (CLONE_PIDFD) {
> unsigned long not_used_yet;
> if (get_user(not_used_yet, parent_tidptr) ||
> not_used_yet != 0)
> return -EINVAL;

Oh, very interesting point. Yes, I think this is worth it.

> }
>
> this way we can easily add more arguments in future or even turn CLONE_PIDFD into
> CLONE_MORE_ARGS_IN_PARENT_TIDPTR.
>
> Not that I think this is really good idea, sys_clone2() makes more sense, but still.

No, you're right about leaving this option open. Even if we don't extend
not allowing garbage to be passed is always a good idea.

Christian

2019-04-18 14:11:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] clone: add CLONE_PIDFD

On 04/18, Christian Brauner wrote:
>
> On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
> > Should we allow CLONE_THREAD | CLONE_PIDFD ?
>
> I think so, yes. I have thought about this.

OK, I won't insist. But let me explain why did I ask.

> Yes, due to CLONE_FILES |
> CLONE_VM you'd necessarily hand the pidfd to the child but threads are
> no security boundary in the first place.

No, no, I am not not worried about security. CLONE_PARENT | CLONE_PIDFD
looks more problematic to me, but I see nothing dangerous security-wise..

I agree that CLONE_THREAD | CLONE_PIDFD may be usefule, but I am not sure
we should allow this from the very begining, until we have a "real" use-case.

IIUC, we are going to make it pollable soon. OK, but proc_tgid_base_poll()
(which should be turned into pidfd_poll) simply can't work if pid_task() is
not a group leader. poll(pidfd) will hang forever if pidfd was created by
CLONE_THREAD | CLONE_PIDFD.

Sure, we can (should?) improve pidfd_poll() but this will need more nasty
changes in the core kernel code. Do we really need/want this? Right now it
is not clear to me. Instead, we can simply disallow CLONE_THREAD|CLONE_PIDFD
until we decide that yes, we want to poll sub-threads.

But again, I am fine with CLONE_THREAD | CLONE_PIDFD.

Oleg.

2019-04-18 14:16:52

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] clone: add CLONE_PIDFD

On April 18, 2019 4:10:20 PM GMT+02:00, Oleg Nesterov <[email protected]> wrote:
>On 04/18, Christian Brauner wrote:
>>
>> On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
>> > Should we allow CLONE_THREAD | CLONE_PIDFD ?
>>
>> I think so, yes. I have thought about this.
>
>OK, I won't insist. But let me explain why did I ask.
>
>> Yes, due to CLONE_FILES |
>> CLONE_VM you'd necessarily hand the pidfd to the child but threads
>are
>> no security boundary in the first place.
>
>No, no, I am not not worried about security. CLONE_PARENT | CLONE_PIDFD
>looks more problematic to me, but I see nothing dangerous
>security-wise..
>
>I agree that CLONE_THREAD | CLONE_PIDFD may be usefule, but I am not
>sure
>we should allow this from the very begining, until we have a "real"
>use-case.
>
>IIUC, we are going to make it pollable soon. OK, but
>proc_tgid_base_poll()
>(which should be turned into pidfd_poll) simply can't work if
>pid_task() is
>not a group leader. poll(pidfd) will hang forever if pidfd was created
>by
>CLONE_THREAD | CLONE_PIDFD.
>
>Sure, we can (should?) improve pidfd_poll() but this will need more
>nasty
>changes in the core kernel code. Do we really need/want this? Right now
>it
>is not clear to me. Instead, we can simply disallow
>CLONE_THREAD|CLONE_PIDFD
>until we decide that yes, we want to poll sub-threads.

If you think that makes the polling work simpler for Joel then for sure.
And yes, I have argued for "disable until someone needs it" often before so I can't really argue the other way around here. :)
I'll send an updated version soon.

Christian

>
>But again, I am fine with CLONE_THREAD | CLONE_PIDFD.
>
>Oleg.

2019-04-18 14:27:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal

On 04/18, Christian Brauner wrote:
>
> +static struct pid *pidfd_to_pid(const struct file *file)
> +{
> + if (file->f_op == &pidfd_fops)
> + return file->private_data;
> +
> + return tgid_pidfd_to_pid(file);
> +}

the patch looks obviously fine to me, but I have an absolutely off-topic
question... why tgid_pidfd_to_pid() has to check d_is_dir() ?

Oleg.

2019-04-18 15:02:34

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal

On April 18, 2019 4:26:00 PM GMT+02:00, Oleg Nesterov <[email protected]> wrote:
>On 04/18, Christian Brauner wrote:
>>
>> +static struct pid *pidfd_to_pid(const struct file *file)
>> +{
>> + if (file->f_op == &pidfd_fops)
>> + return file->private_data;
>> +
>> + return tgid_pidfd_to_pid(file);
>> +}
>
>the patch looks obviously fine to me, but I have an absolutely
>off-topic
>question... why tgid_pidfd_to_pid() has to check d_is_dir() ?

It doesn't have too; pure paranoia.

Christian

2019-04-18 15:40:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH

On Thu, Apr 18, 2019 at 3:19 AM Christian Brauner <[email protected]> wrote:
>
> It's just semantically correct to use fdget()
> and return an error right from there instead of taking a reference and
> returning an error later.

I've applied this one directly, because it ends up affecting the
existing code, and I'd rather just have the initial release of
pidfd_send_signal() right.

The other patches I consider to be "future release" material.

Linus

2019-04-18 15:54:41

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH

On Thu, Apr 18, 2019 at 08:37:28AM -0700, Linus Torvalds wrote:
> On Thu, Apr 18, 2019 at 3:19 AM Christian Brauner <[email protected]> wrote:
> >
> > It's just semantically correct to use fdget()
> > and return an error right from there instead of taking a reference and
> > returning an error later.
>
> I've applied this one directly, because it ends up affecting the
> existing code, and I'd rather just have the initial release of
> pidfd_send_signal() right.

Perfect! Thank you!

>
> The other patches I consider to be "future release" material.

Yes, indeed. Jann and I are targeting the 5.2 merge window if that's ok.

Christian

2019-04-19 19:08:29

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] clone: add CLONE_PIDFD

On 2019-04-18, Christian Brauner <[email protected]> wrote:
> > Why O_CLOEXEC? I am just curious, I do not really care.
>
> I think that having the file descriptor O_CLOEXEC by default is a good
> security measure in general. Most of the time you do not want to pass a
> file descriptor through exec() (apart from 0,1,2) and it is usually more
> of an issue when you accidently do it then when you accidently don't. So
> if users really care about passing a pidfd they should do so by removing
> the O_CLOEXEC flag explicitly.
> (New file descriptors should probably all default to that but that's just
> my opinion.)
> Another thing is that for a pidfds it makes even more sense to have them
> cloexec by default. You don't want to *unintentionally* leak an fd that
> can be used to operate on a process.

There is another factor as well -- if you want to set O_CLOEXEC in a
multi-threaded process you can't be sure that another thread didn't fork
in between you getting the fd_install'd and the userspace process
setting O_CLOEXEC (leading to the fd leaking outside the current
process). This is why a lot of syscalls have a way to get an O_CLOEXEC
fd from the outset.

So I'm +1 on doing O_CLOEXEC by default -- you can always disable it
safely but enabling it safely isn't so simple (and I don't think it
makes much sense to add the mechanism to pass PIDFD_CLOEXEC as well,
given how tight the flags are getting).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.52 kB)
signature.asc (849.00 B)
Download all attachments