2019-04-14 20:16:06

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 0/4] clone: add CLONE_PIDFD

Hey Linus,

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call as previously discussed.

As decided last week [1] Jann and I have refined the implementation of
pidfds as anonymous inodes. Based on last weeks RFC we have only tweaked
documentation and naming, as well as making the sample program how to
get easy metadata access from a pidfd a little cleaner and more paranoid
when checking for errors.
The sample program can also serve as a test for the patchset.

When clone is called with CLONE_PIDFD a pidfd instead of a pid will be
returned. To make it possible for users of CLONE_PIDFD to apply standard
error checking that is common all across userspace, file descriptor
numbering for pidfds starts at 1 and not 0. This has the major advantage
that users can do:

int pidfd = clone(CLONE_PIDFD);
if (pidfd < 0) {
/* handle error */
exit(EXIT_FAILURE):
}

if (pidfd == 0) {
/* child */
exit(EXIT_SUCCESS);
}

/* parent */
exit(EXIT_SUCCESS);

We have also taken care that pidfds are created *after* the fd table has
been unshared to not leak pidfds into child processes.
pidfd creation during clone is split into two distinct steps:
1. preparing both an fd and a file referencing struct pid for fd_install()
2. fd_install()ing the pidfd
Step 1. is performed before clone's point of no return and especially
before write_lock_irq(&tasklist_lock) is taken.
Performing 1. before clone's point of no return ensures that we don't
need to fail a process that is already visible to userspace when pidfd
creation fails. Step 2. happens after attach_pid() is performed and the
process is visible to userspace.
Technically, we could have also performed step 1. and 2. together before
clone's point of no return and then calling close on the file descriptor
on failure. This would slightly increase code-locality but it is
semantically more correct and clean to bring the pidfd into existence
once the process is fully attached and not before.

The actual code for CLONE_PIDFD in patch 2 is completely confined to
fork.c (apart from the CLONE_PIDFD definition of course) and is rather
small and hopefully good to review.

The additional changes listed under David's name in the diffstat below
are here to make anon_inodes available unconditionally. They are needed
for the new mount api and thus for core vfs code in addition to pidfds.
David knows this and he has informed Al that this patch is sent out
here. The changes themselves are rather automatic.

As promised I have also contacted Joel who has sent a patchset to make
pidfds pollable. He has been informed and is happy to port his patchset
once we have moved forward [2].
Jann and I currently plan to target this patchset for inclusion in the 5.2
merge window.

Thanks!
Jann & Christian

[1]: https://lore.kernel.org/lkml/CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/[email protected]/

Christian Brauner (3):
clone: add CLONE_PIDFD
signal: support CLONE_PIDFD with pidfd_send_signal
samples: show race-free pidfd metadata access

David Howells (1):
Make anon_inodes unconditional

arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 10 --
kernel/fork.c | 117 +++++++++++++++++++++-
kernel/signal.c | 14 ++-
kernel/sys_ni.c | 3 -
samples/Makefile | 2 +-
samples/pidfd/Makefile | 6 ++
samples/pidfd/pidfd-metadata.c | 172 +++++++++++++++++++++++++++++++++
26 files changed, 305 insertions(+), 40 deletions(-)
create mode 100644 samples/pidfd/Makefile
create mode 100644 samples/pidfd/pidfd-metadata.c

--
2.21.0


2019-04-14 20:16:10

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
kernel/signal.c | 14 ++++++++++----
kernel/sys_ni.c | 3 ---
2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..cd83cc376767 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
return kill_something_info(sig, &info, pid);
}

-#ifdef CONFIG_PROC_FS
/*
* Verify that the signaler and signalee either are in the same pid namespace
* or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
return copy_siginfo_from_user(kinfo, info);
}

+static struct pid *pidfd_to_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return tgid_pidfd_to_pid(file);
+}
+
/**
* sys_pidfd_send_signal - send a signal to a process through a task file
* descriptor
@@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
if (flags)
return -EINVAL;

- f = fdget_raw(pidfd);
+ f = fdget(pidfd);
if (!f.file)
return -EBADF;

/* Is this a pidfd? */
- pid = tgid_pidfd_to_pid(f.file);
+ pid = pidfd_to_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
goto err;
@@ -3620,7 +3627,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
fdput(f);
return ret;
}
-#endif /* CONFIG_PROC_FS */

static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);

/* kernel/sched/core.c */

-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
/* kernel/sys.c */
COND_SYSCALL(setregid);
COND_SYSCALL(setgid);
--
2.21.0

2019-04-14 20:16:13

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 4/4] samples: show race-free pidfd metadata access

This is a sample program showing userspace how to get race-free access to
process metadata from a pidfd. It is rather easy to do and userspace can
actually simply reuse code that currently parses a process's status file in
procfs.
The program can easily be extended into a generic helper suitable for
inclusion in a libc to make it even easier for userspace to gain metadata
access.

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
samples/Makefile | 2 +-
samples/pidfd/Makefile | 6 ++
samples/pidfd/pidfd-metadata.c | 172 +++++++++++++++++++++++++++++++++
3 files changed, 179 insertions(+), 1 deletion(-)
create mode 100644 samples/pidfd/Makefile
create mode 100644 samples/pidfd/pidfd-metadata.c

diff --git a/samples/Makefile b/samples/Makefile
index b1142a958811..fadadb1c3b05 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
configfs/ connector/ v4l/ trace_printk/ \
- vfio-mdev/ statx/ qmi/ binderfs/
+ vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
new file mode 100644
index 000000000000..0ff97784177a
--- /dev/null
+++ b/samples/pidfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+hostprogs-y := pidfd-metadata
+always := $(hostprogs-y)
+HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
+all: pidfd-metadata
diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
new file mode 100644
index 000000000000..23a44e582ccb
--- /dev/null
+++ b/samples/pidfd/pidfd-metadata.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+static int raw_clone_pidfd(void)
+{
+ unsigned long flags = CLONE_PIDFD | SIGCHLD;
+
+#if defined(__s390x__) || defined(__s390__) || defined(__CRIS__)
+ /*
+ * On s390/s390x and cris the order of the first and second arguments
+ * of the system call is reversed.
+ */
+ return (int)syscall(__NR_clone, NULL, flags);
+#elif defined(__sparc__) && defined(__arch64__)
+ {
+ /*
+ * sparc64 always returns the other process id in %o0, and a
+ * boolean flag whether this is the child or the parent in %o1.
+ * Inline assembly is needed to get the flag returned in %o1.
+ */
+ int child_pid, in_child;
+
+ asm volatile("mov %2, %%g1\n\t"
+ "mov %3, %%o0\n\t"
+ "mov 0 , %%o1\n\t"
+ "t 0x6d\n\t"
+ "mov %%o1, %0\n\t"
+ "mov %%o0, %1"
+ : "=r"(in_child), "=r"(child_pid)
+ : "i"(__NR_clone), "r"(flags)
+ : "%o1", "%o0", "%g1");
+
+ if (in_child)
+ return 0;
+ else
+ return child_pid;
+ }
+#elif defined(__ia64__)
+ /* On ia64 stack and stack size are passed as separate arguments. */
+ return (int)syscall(__NR_clone, flags, NULL, 0UL);
+#else
+ return (int)syscall(__NR_clone, flags, NULL);
+#endif
+}
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+ unsigned int flags)
+{
+ return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static int pidfd_metadata_fd(int pidfd)
+{
+ int procfd, ret;
+ char path[100];
+ FILE *f;
+ size_t n = 0;
+ char *line = NULL;
+
+ snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
+
+ f = fopen(path, "re");
+ if (!f)
+ return -1;
+
+ ret = 0;
+ while (getline(&line, &n, f) != -1) {
+ char *numstr;
+ size_t len;
+
+ if (strncmp(line, "Pid:\t", 5))
+ continue;
+
+ numstr = line + 5;
+ len = strlen(numstr);
+ if (len > 0 && numstr[len - 1] == '\n')
+ numstr[len - 1] = '\0';
+ ret = snprintf(path, sizeof(path), "/proc/%s", numstr);
+ break;
+ }
+ free(line);
+ fclose(f);
+
+ if (!ret) {
+ errno = ENOENT;
+ warn("Failed to parse pid from fdinfo\n");
+ return -1;
+ }
+
+ procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (procfd < 0) {
+ warn("Failed to open %s\n", path);
+ return -1;
+ }
+
+ /*
+ * Verify that the pid has not been recycled and our /proc/<pid> handle
+ * is still valid.
+ */
+ ret = sys_pidfd_send_signal(pidfd, 0, NULL, 0);
+ if (ret < 0) {
+ switch (errno) {
+ case EPERM:
+ /* Process exists, just not allowed to signal it. */
+ break;
+ default:
+ warn("Failed to signal process\n");
+ close(procfd);
+ procfd = -1;
+ }
+ }
+
+ return procfd;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = EXIT_FAILURE;
+ char buf[4096] = { 0 };
+ int pidfd, procfd, statusfd;
+ ssize_t bytes;
+
+ pidfd = raw_clone_pidfd();
+ if (pidfd < 0)
+ exit(ret);
+
+ if (pidfd == 0) {
+ printf("%d\n", getpid());
+ exit(EXIT_SUCCESS);
+ }
+
+ procfd = pidfd_metadata_fd(pidfd);
+ close(pidfd);
+ if (procfd < 0)
+ goto out;
+
+ statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
+ close(procfd);
+ if (statusfd < 0)
+ goto out;
+
+ bytes = read(statusfd, buf, sizeof(buf));
+ if (bytes > 0)
+ bytes = write(STDOUT_FILENO, buf, bytes);
+ close(statusfd);
+ ret = EXIT_SUCCESS;
+
+out:
+ (void)wait(NULL);
+
+ exit(ret);
+}
--
2.21.0

2019-04-14 20:16:35

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 2/4] clone: add CLONE_PIDFD

As discussed this patchset makes it possible to retrieve pid file
descriptors at process creation time by introducing the new flag
CLONE_PIDFD to the clone() system call. Linus originally suggested to
implement this as a new flag to clone() instead of making it a separate
system call. As spotted by Linus, there is exactly one bit for clone()
left.

CLONE_PIDFD returns file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically, this
makes it possible to interpret a pidfd differently, narrowing or widening
the scope of various operations (e.g. signal sending). Thus, a pidfd cannot
just refer to a tgid, but also a tid, or in theory - given appropriate flag
arguments in relevant syscalls - a process group or session. A pidfd does
not represent a privilege. This does not imply it cannot ever be that way
but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers pid
namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

To remove worries about missing metadata access this patchset comes with a
sample program that illustrates how a combination of CLONE_PIDFD, fdinfo,
and pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be translated
into a helper that would be suitable for inclusion in libc so that users
don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 117 +++++++++++++++++++++++++++++++++++--
3 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid

extern struct pid init_struct_pid;

+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..06fa224d2c48 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd instead of a pid should be returned */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..4825d5205604 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/

+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
#include <linux/sched/mm.h>
@@ -21,8 +22,10 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
+#include <linux/fsnotify.h>
#include <linux/unistd.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
@@ -1662,6 +1665,87 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif /* #ifdef CONFIG_TASKS_RCU */
}

+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ * @file: struct file referencing @pid to return to caller
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * potentially been shared to avoid leaking the pidfd to the new process.
+ *
+ * File descriptor numbering for pidfds starts at 1. This allows users
+ * of CLONE_PIDFD to perform the same checks as for pids, i.e.:
+ * pid/pidfd < 0: error
+ * pid/pidfd == 0: child
+ * pid/pidfd > 0: parent
+ *
+ * Return: On success, a cloexec pidfd ready to be installed through
+ * fd_install() will be returned. The corresponding file will be
+ * returned through @file.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid, struct file **file)
+{
+ unsigned int flags = O_RDWR | O_CLOEXEC;
+ int error, fd;
+ struct file *f;
+
+ error = __alloc_fd(current->files, 1, rlimit(RLIMIT_NOFILE), flags);
+ if (error < 0)
+ return error;
+ fd = error;
+
+ f = anon_inode_getfile("pidfd", &pidfd_fops, get_pid(pid), flags);
+ if (IS_ERR(f)) {
+ put_pid(pid);
+ error = PTR_ERR(f);
+ goto err_put_unused_fd;
+ }
+
+ *file = f;
+ return fd;
+
+err_put_unused_fd:
+ put_unused_fd(fd);
+ return error;
+}
+
+static inline void pidfd_put(int fd, struct file *file)
+{
+ put_unused_fd(fd);
+ fput(file);
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1678,11 +1762,12 @@ static __latent_entropy struct task_struct *copy_process(
struct pid *pid,
int trace,
unsigned long tls,
- int node)
+ int node, int *pidfd)
{
int retval;
struct task_struct *p;
struct multiprocess_signals delayed;
+ struct file *pidfdf = NULL;

/*
* Don't allow sharing the root directory with processes in a different
@@ -1936,6 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
}
}

+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child if
+ * the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid, &pidfdf);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+ *pidfd = retval;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1996,7 +2093,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;

/*
* From this point on we must avoid any synchronous user-space
@@ -2097,6 +2194,9 @@ static __latent_entropy struct task_struct *copy_process(
syscall_tracepoint_update(p);
write_unlock_irq(&tasklist_lock);

+ if (clone_flags & CLONE_PIDFD)
+ fd_install(*pidfd, pidfdf);
+
proc_fork_connector(p);
cgroup_post_fork(p);
cgroup_threadgroup_change_end(current);
@@ -2111,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ pidfd_put(*pidfd, pidfdf);
bad_fork_free_pid:
cgroup_threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -2177,7 +2280,7 @@ struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
- cpu_to_node(cpu));
+ cpu_to_node(cpu), NULL);
if (!IS_ERR(task)) {
init_idle_pids(task);
init_idle(task, cpu);
@@ -2202,7 +2305,7 @@ long _do_fork(unsigned long clone_flags,
struct completion vfork;
struct pid *pid;
struct task_struct *p;
- int trace = 0;
+ int pidfd, trace = 0;
long nr;

/*
@@ -2224,7 +2327,7 @@ long _do_fork(unsigned long clone_flags,
}

p = copy_process(clone_flags, stack_start, stack_size,
- child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+ child_tidptr, NULL, trace, tls, NUMA_NO_NODE, &pidfd);
add_latent_entropy();

if (IS_ERR(p))
@@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
}

put_pid(pid);
+
+ if (clone_flags & CLONE_PIDFD)
+ nr = pidfd;
+
return nr;
}

--
2.21.0

2019-04-14 20:16:55

by Christian Brauner

[permalink] [raw]
Subject: [PATCH 1/4] Make anon_inodes unconditional

From: David Howells <[email protected]>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[[email protected]: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <[email protected]>
---
arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
init/Kconfig | 10 ----------
18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on MMU && OF
select PREEMPT_NOTIFIERS
- select ANON_INODES
select ARM_GIC
select ARM_GIC_V3
select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
depends on OF
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
depends on MIPS_FP_SUPPORT
select EXPORT_UASM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
config KVM
bool
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
prompt "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..7a70fb58b2d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
- select ANON_INODES
select ARCH_32BIT_OFF_T if X86_32
select ARCH_CLOCKSOURCE_DATA
select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
select MMU_NOTIFIER
- select ANON_INODES
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
config DMA_SHARED_BUFFER
bool
default n
- select ANON_INODES
select IRQ_WORK
help
This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
config TCG_VTPM_PROXY
tristate "VTPM Proxy Interface"
depends on TCG_TPM
- select ANON_INODES
---help---
This driver proxies for an emulated TPM (vTPM) running in userspace.
A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
config SYNC_FILE
bool "Explicit Synchronization Framework"
default n
- select ANON_INODES
select DMA_SHARED_BUFFER
---help---
The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H

menuconfig GPIOLIB
bool "GPIO Support"
- select ANON_INODES
help
This enables GPIO support through the generic GPIO library.
You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@

menuconfig IIO
tristate "Industrial I/O support"
- select ANON_INODES
help
The industrial I/O subsystem provides a unified framework for
drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD

config INFINIBAND_USER_ACCESS
tristate "InfiniBand userspace access (verbs and CM)"
- select ANON_INODES
depends on MMU
---help---
Userspace InfiniBand access support. This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
- select ANON_INODES
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o

obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
-obj-$(CONFIG_ANON_INODES) += anon_inodes.o
+obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
config FANOTIFY
bool "Filesystem wide access notification"
select FSNOTIFY
- select ANON_INODES
select EXPORTFS
default n
---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
config INOTIFY_USER
bool "Inotify support for userspace"
- select ANON_INODES
select FSNOTIFY
default y
---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
config SYSCTL
bool

-config ANON_INODES
- bool
-
config HAVE_UID16
bool

@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
- select ANON_INODES
help
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.

config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD

config TIMERFD
bool "Enable timerfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD

config EVENTFD
bool "Enable eventfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call"
- select ANON_INODES
select BPF
select IRQ_WORK
default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON

config USERFAULTFD
bool "Enable userfaultfd() system call"
- select ANON_INODES
depends on MMU
help
Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
bool "Kernel performance events and counters"
default y if PROFILING
depends on HAVE_PERF_EVENTS
- select ANON_INODES
select IRQ_WORK
select SRCU
help
--
2.21.0

Subject: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 14.04.19 22:14, Christian Brauner wrote:

Hi folks,

> This patchset makes it possible to retrieve pid file descriptors at
> process creation time by introducing the new flag CLONE_PIDFD to the
> clone() system call as previously discussed.

Sorry, for highjacking this thread, but I'm curious on what things to
consider when introducing new CLONE_* flags.

The reason I'm asking is:

I'm working on implementing plan9-like fs namespaces, where unprivileged
processes can change their own namespace at will. For that, certain
traditional unix'ish things have to be disabled, most notably suid.
As forbidding suid can be helpful in other scenarios, too, I thought
about making this its own feature. Doing that switch on clone() seems
a nice place for that, IMHO.

As there might be potentially even more CLONE_* flags in the future,
and the bitmask size is limited, this raises the question on how to
proceed with those flag additions in the future.

What's your thoughts on that ?


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: [PATCH 0/4] clone: add CLONE_PIDFD

On 14.04.19 22:14, Christian Brauner wrote:

Hi,

> When clone is called with CLONE_PIDFD a pidfd instead of a pid will be
> returned.

Uh, changing the return type by some flags produces a very strange
feeling in my stomach. Isn't there any other way for fetching a pidfs
for some pid ?

> To make it possible for users of CLONE_PIDFD to apply standard
> error checking that is common all across userspace, file descriptor
> numbering for pidfds starts at 1 and not 0.

Feels even more strange. I'm used to the kernel always picking the
lowest available pid. In some cases, one really wants to have fd 0.
Even though the actual usecases for doing that w/ a pidfd might be
pretty limited, I don't feel good w/ breaking this old habit.


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-04-15 10:53:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On 04/14, Christian Brauner wrote:
>
> @@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
> }
>
> put_pid(pid);
> +
> + if (clone_flags & CLONE_PIDFD)
> + nr = pidfd;
> +

Well, this doesn't look nice ...

CLONE_PARENT_SETTID doesn't look very usefule, so what if we add

if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
(CLONE_PIDFD|CLONE_PARENT_SETTID))
return ERR_PTR(-EINVAL);

at the start of copy_process() ?

Then it can do

if (clone_flags & CLONE_PIDFD) {
retval = pidfd_create(pid, &pidfdf);
if (retval < 0)
goto bad_fork_free_pid;
retval = put_user(retval, parent_tidptr)
if (retval < 0)
goto bad_fork_free_pid;
}

Oleg.

2019-04-15 11:44:59

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On Mon, Apr 15, 2019 at 12:52:09PM +0200, Oleg Nesterov wrote:
> On 04/14, Christian Brauner wrote:
> >
> > @@ -2260,6 +2363,10 @@ long _do_fork(unsigned long clone_flags,
> > }
> >
> > put_pid(pid);
> > +
> > + if (clone_flags & CLONE_PIDFD)
> > + nr = pidfd;
> > +
>
> Well, this doesn't look nice ...
>
> CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
>
> if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> (CLONE_PIDFD|CLONE_PARENT_SETTID))
> return ERR_PTR(-EINVAL);
>
> at the start of copy_process() ?
>
> Then it can do
>
> if (clone_flags & CLONE_PIDFD) {
> retval = pidfd_create(pid, &pidfdf);
> if (retval < 0)
> goto bad_fork_free_pid;
> retval = put_user(retval, parent_tidptr)
> if (retval < 0)
> goto bad_fork_free_pid;
> }

Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
let us return the pid and the pidfd in one go and we can also start
pidfd numbering at 0.

2019-04-15 13:26:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On 04/15, Christian Brauner wrote:
>
> > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> >
> > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > return ERR_PTR(-EINVAL);
> >
> > at the start of copy_process() ?
> >
> > Then it can do
> >
> > if (clone_flags & CLONE_PIDFD) {
> > retval = pidfd_create(pid, &pidfdf);
> > if (retval < 0)
> > goto bad_fork_free_pid;
> > retval = put_user(retval, parent_tidptr)
> > if (retval < 0)
> > goto bad_fork_free_pid;
> > }
>
> Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> let us return the pid and the pidfd in one go and we can also start
> pidfd numbering at 0.

Christian, sorry if it was already discussed, but I can't force myself to
read all the previous discussions ;)

If we forget about CONFIG_PROC_FS, why do we really want to create a file?


Suppose we add a global u64 counter incremented by copy_process and reported
in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
*parent_tidptr. Let's denote this counter as UNIQ_PID.

Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
can do

kill_by_pid_uniq(int pid, u64 uniq_pid)
{
pidfd = open("/proc/$pid", O_DIRECTORY);

status = openat(pidfd, "status");
u64 this_uniq_pid = ... read UNIQ_PID from status ...;

if (uniq_pid != this_uniq_pid)
return;

pidfd_send_signal(pidfd);
}

Why else do we want pidfd?

Oleg.

2019-04-15 13:54:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On Mon, Apr 15, 2019 at 03:24:16PM +0200, Oleg Nesterov wrote:
> On 04/15, Christian Brauner wrote:
> >
> > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > >
> > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > return ERR_PTR(-EINVAL);
> > >
> > > at the start of copy_process() ?
> > >
> > > Then it can do
> > >
> > > if (clone_flags & CLONE_PIDFD) {
> > > retval = pidfd_create(pid, &pidfdf);
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > retval = put_user(retval, parent_tidptr)
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > }
> >
> > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > let us return the pid and the pidfd in one go and we can also start
> > pidfd numbering at 0.
>
> Christian, sorry if it was already discussed, but I can't force myself to
> read all the previous discussions ;)
>
> If we forget about CONFIG_PROC_FS, why do we really want to create a file?
>
>
> Suppose we add a global u64 counter incremented by copy_process and reported
> in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> *parent_tidptr. Let's denote this counter as UNIQ_PID.
>
> Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> can do
>
> kill_by_pid_uniq(int pid, u64 uniq_pid)
> {
> pidfd = open("/proc/$pid", O_DIRECTORY);
>
> status = openat(pidfd, "status");
> u64 this_uniq_pid = ... read UNIQ_PID from status ...;
>
> if (uniq_pid != this_uniq_pid)
> return;
>
> pidfd_send_signal(pidfd);
> }
>
> Why else do we want pidfd?

I think this was thrown around at one point but this is rather
inelegant imho. It basically makes a process unique by using a
combination of two identifiers. You end up with a similar concept but
you make it way less flexible and extensible imho. With pidfds you can
not care about pids at all if you don't want to. The UNIQ_PID thing
would require you to always juggle two identifiers.

Your example would also only work if CONFIG_PROC_FS is set (Not sure if
that's what you meant by "forget about CONFIG_PROC_FS")? Say, you get
a pid from clone() and your UNIQ_PID thing. Then you still can't
reliably kill a process because pidfd_send_signal() is not useable since
you can't get pidfds. And if you go the kill way you end up with the same
problem. Yes, you could solve this by probably extending syscalls to
take a UNIQ_PID argument but that seems very inelegant.

The UNIQ_PID implementation would also require being tracked in the
kernel either in task_struct or struct pid potentially and thus would
probably add more infrastructure in the kernel. We don't need any of
that if we simply rely on pidfds.

Most of all, the pidfd concept allows one way more flexibility in
extending it. For example, Joel is working on a patchset to make pidfds
pollable so you can get information about a process death by polling
them. We also want to be able to potentially wait on a process with
waitid(W_PIDFD) or similar as suggested by Linus in earlier threads. At
that point you end up in a similar situation as tgkill() where you pass
a tgid and a pid already to make sure that the pid you pass has the tgid
as thread-group leader. That is all way simpler with pidfds.

2019-04-15 15:51:28

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 12:08:09PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 14.04.19 22:14, Christian Brauner wrote:
>
> Hi folks,
>
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
>
> Sorry, for highjacking this thread, but I'm curious on what things to
> consider when introducing new CLONE_* flags.
>
> The reason I'm asking is:
>
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain

Is there any place where we can see previous discussion about this?

> traditional unix'ish things have to be disabled, most notably suid.

If you have to disable suid anyway, then is there any reason why the
existing ability to do this in a private user namespace, with only
your own uid mapped (which you can do without any privilege) does
not suffice? That was actually one of the main design goals of user
namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,
then clone(CLONE_NEWNS) and bind mount at will.

> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.
>
> As there might be potentially even more CLONE_* flags in the future,
> and the bitmask size is limited, this raises the question on how to
> proceed with those flag additions in the future.
>
> What's your thoughts on that ?
>
>
> --mtx
>
> --
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> [email protected] -- +49-151-27565287

2019-04-15 16:26:38

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On Mon, Apr 15, 2019 at 03:52:48PM +0200, Christian Brauner wrote:
> On Mon, Apr 15, 2019 at 03:24:16PM +0200, Oleg Nesterov wrote:
> > On 04/15, Christian Brauner wrote:
> > >
> > > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > > >
> > > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > > return ERR_PTR(-EINVAL);
> > > >
> > > > at the start of copy_process() ?
> > > >
> > > > Then it can do
> > > >
> > > > if (clone_flags & CLONE_PIDFD) {
> > > > retval = pidfd_create(pid, &pidfdf);
> > > > if (retval < 0)
> > > > goto bad_fork_free_pid;
> > > > retval = put_user(retval, parent_tidptr)
> > > > if (retval < 0)
> > > > goto bad_fork_free_pid;
> > > > }
> > >
> > > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > > let us return the pid and the pidfd in one go and we can also start
> > > pidfd numbering at 0.
> >
> > Christian, sorry if it was already discussed, but I can't force myself to
> > read all the previous discussions ;)
> >
> > If we forget about CONFIG_PROC_FS, why do we really want to create a file?
> >
> >
> > Suppose we add a global u64 counter incremented by copy_process and reported
> > in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> > *parent_tidptr. Let's denote this counter as UNIQ_PID.
> >
> > Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> > can do
> >
> > kill_by_pid_uniq(int pid, u64 uniq_pid)
> > {
> > pidfd = open("/proc/$pid", O_DIRECTORY);
> >
> > status = openat(pidfd, "status");
> > u64 this_uniq_pid = ... read UNIQ_PID from status ...;
> >
> > if (uniq_pid != this_uniq_pid)
> > return;
> >
> > pidfd_send_signal(pidfd);
> > }
> >
> > Why else do we want pidfd?
>
> I think this was thrown around at one point but this is rather
> inelegant imho. It basically makes a process unique by using a
> combination of two identifiers. You end up with a similar concept but
> you make it way less flexible and extensible imho. With pidfds you can
> not care about pids at all if you don't want to. The UNIQ_PID thing
> would require you to always juggle two identifiers.
>
> Your example would also only work if CONFIG_PROC_FS is set (Not sure if
> that's what you meant by "forget about CONFIG_PROC_FS")? Say, you get
> a pid from clone() and your UNIQ_PID thing. Then you still can't
> reliably kill a process because pidfd_send_signal() is not useable since
> you can't get pidfds. And if you go the kill way you end up with the same
> problem. Yes, you could solve this by probably extending syscalls to
> take a UNIQ_PID argument but that seems very inelegant.
>
> The UNIQ_PID implementation would also require being tracked in the
> kernel either in task_struct or struct pid potentially and thus would
> probably add more infrastructure in the kernel. We don't need any of
> that if we simply rely on pidfds.
>
> Most of all, the pidfd concept allows one way more flexibility in
> extending it. For example, Joel is working on a patchset to make pidfds
> pollable so you can get information about a process death by polling
> them. We also want to be able to potentially wait on a process with
> waitid(W_PIDFD) or similar as suggested by Linus in earlier threads. At
> that point you end up in a similar situation as tgkill() where you pass
> a tgid and a pid already to make sure that the pid you pass has the tgid
> as thread-group leader. That is all way simpler with pidfds.

I agree the pidfd file descriptor approach is simpler than dealing with 2
pids and is needed for the poll notification support I posted.

Also in the future it allows for a pidfd to sent over IPC to another
process using binder or unix domain sockets.

thanks,

- Joel

2019-04-15 17:17:27

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On Mon, Apr 15, 2019 at 2:25 PM Oleg Nesterov <[email protected]> wrote:
>
> On 04/15, Christian Brauner wrote:
> >
> > > CLONE_PARENT_SETTID doesn't look very usefule, so what if we add
> > >
> > > if ((clone_flags & (CLONE_PIDFD|CLONE_PARENT_SETTID)) ==
> > > (CLONE_PIDFD|CLONE_PARENT_SETTID))
> > > return ERR_PTR(-EINVAL);
> > >
> > > at the start of copy_process() ?
> > >
> > > Then it can do
> > >
> > > if (clone_flags & CLONE_PIDFD) {
> > > retval = pidfd_create(pid, &pidfdf);
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > retval = put_user(retval, parent_tidptr)
> > > if (retval < 0)
> > > goto bad_fork_free_pid;
> > > }
> >
> > Uhhh Oleg, that is nifty. I have to say I like that a lot. This would
> > let us return the pid and the pidfd in one go and we can also start
> > pidfd numbering at 0.
>
> Christian, sorry if it was already discussed, but I can't force myself to
> read all the previous discussions ;)
>
> If we forget about CONFIG_PROC_FS, why do we really want to create a file?
>
>
> Suppose we add a global u64 counter incremented by copy_process and reported
> in /proc/$pid/status. Suppose that clone(CLONE_PIDFD) writes this counter to
> *parent_tidptr. Let's denote this counter as UNIQ_PID.
>
> Now, if you want to (say) safely kill a task and you have its UNIQ_PID, you
> can do
>
> kill_by_pid_uniq(int pid, u64 uniq_pid)
> {
> pidfd = open("/proc/$pid", O_DIRECTORY);
>
> status = openat(pidfd, "status");
> u64 this_uniq_pid = ... read UNIQ_PID from status ...;
>
> if (uniq_pid != this_uniq_pid)
> return;
>
> pidfd_send_signal(pidfd);
> }
>
> Why else do we want pidfd?

Apart from what others have already pointed out, there are two other
things I am looking forward to:

* Currently, when ptracing from a thread, waitpid means that I need to
block or constantly loop over with pauses to receive the ptrace
related results, since ptrace is thread directed (and to be able to
poll other event sources as well, eg. to receive further commands over
a pipe/message passing fd), and related waitpid responses only arrive
to the attached thread. The waitfd patchset was rejected on the
grounds that one could use a separate thread to do the waitpid while
polling from the attached thread or a new thread, but due to ptrace
this is false. pidfds would allow for this to work (this does mean
we'd also need to be able to return one at ATTACH/SEIZE time, though).
Note that waitid and other variants throw away a lot of needed
information.

* Descriptors mean you can optionally choose to bind your privileges
to the file descriptor and then pass it around to others. They do not
work this way now but the choice of such an extension has been kept
open. One of the examples is binding one's CAP_KILL capability and
then pass it to another process, so that it can freely signal the said
process (and only that), or be able to optionally poke holes in the
restrictions imposed by PID namespaces (possibly in the future), etc.

>
> Oleg.
>

2019-04-15 19:40:18

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone: add CLONE_PIDFD

On Mon, Apr 15, 2019 at 10:15 AM Jonathan Kowalski <[email protected]> wrote:
> > Why else do we want pidfd?
>
> Apart from what others have already pointed out, there are two other
> things I am looking forward to:

Everything that Christian, Joel, and Jonathan have said is right.

If I can wax philosophical for a bit (as I've been accused to doing
:-)), there's a lot of value in consistency itself, a "more than the
sum of its parts" effect that arises from modeling all kernel-resource
handles as file descriptors. You get lifecycle consistency, API
consistency (e.g., dup, close), introspection consistency (via
/proc/pid/fd and friends), wait consistency, IPC consistency, and tons
of other benefits from using a file descriptor. The alternatives tend
to be very ugly: one of SysV IPC's* biggest mistakes, for example, was
having users manage its resources via non-file-descriptor kernel
handles. The process is, I think, the last major class of kernel
resource that users can't manipulate via file descriptor. Even if
using pidfds didn't provide strong immediate and direct benefits, it'd
*still* be worth moving to a file descriptor resource handle model for
the sake of making the system interface regular and uniform.

* Does anyone know *why* the SysV people didn't use FDs? The
consistency argument I'm making was just as relevant then as it is
now.

2019-04-15 20:02:30

by Aleksa Sarai

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call as previously discussed.
>
> Sorry, for highjacking this thread, but I'm curious on what things to
> consider when introducing new CLONE_* flags.
>
> The reason I'm asking is:
>
> I'm working on implementing plan9-like fs namespaces, where unprivileged
> processes can change their own namespace at will. For that, certain
> traditional unix'ish things have to be disabled, most notably suid.
> As forbidding suid can be helpful in other scenarios, too, I thought
> about making this its own feature. Doing that switch on clone() seems
> a nice place for that, IMHO.

Just spit-balling -- is no_new_privs not sufficient for this usecase?
Not granting privileges such as setuid during execve(2) is the main
point of that flag.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.07 kB)
signature.asc (849.00 B)
Download all attachments

2019-04-15 20:34:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
>
> On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > This patchset makes it possible to retrieve pid file descriptors at
> > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > clone() system call as previously discussed.
> >
> > Sorry, for highjacking this thread, but I'm curious on what things to
> > consider when introducing new CLONE_* flags.
> >
> > The reason I'm asking is:
> >
> > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > processes can change their own namespace at will. For that, certain
> > traditional unix'ish things have to be disabled, most notably suid.
> > As forbidding suid can be helpful in other scenarios, too, I thought
> > about making this its own feature. Doing that switch on clone() seems
> > a nice place for that, IMHO.
>
> Just spit-balling -- is no_new_privs not sufficient for this usecase?
> Not granting privileges such as setuid during execve(2) is the main
> point of that flag.
>

I would personally *love* it if distros started setting no_new_privs
for basically all processes. And pidfd actually gets us part of the
way toward a straightforward way to make sudo and su still work in a
no_new_privs world: su could call into a daemon that would spawn the
privileged task, and su would get a (read-only!) pidfd back and then
wait for the fd and exit. I suppose that, done naively, this might
cause some odd effects with respect to tty handling, but I bet it's
solveable. I suppose it would be nifty if there were a way for a
process, by mutual agreement, to reparent itself to an unrelated
process.

Anyway, clone(2) is an enormous mess. Surely the right solution here
is to have a whole new process creation API that takes a big,
extensible struct as an argument, and supports *at least* the full
abilities of posix_spawn() and ideally covers all the use cases for
fork() + do stuff + exec(). It would be nifty if this API also had a
way to say "add no_new_privs and therefore enable extra functionality
that doesn't work without no_new_privs". This functionality would
include things like returning a future extra-privileged pidfd that
gives ptrace-like access.

As basic examples, the improved process creation API should take a
list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
from, fds to close (or, maybe even better, a list of fds to *not*
close), a list of rlimit changes to make, a list of signal changes to
make, the ability to set sid, pgrp, uid, gid (as in
setresuid/setresgid), the ability to do capset() operations, etc. The
posix_spawn() API, for all that it's rather complicated, covers a
bunch of the basics pretty well.

Sharing the parent's VM, signal set, fd table, etc, should all be
options, but they should default to *off*.

(Many other operating systems allow one to create a process and gain a
capability to do all kinds of things to that process. It's a
generally good idea.)

--Andy

2019-04-15 21:27:35

by Jonathan Kowalski

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes. And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a

Hmm, isn't what you're describing roughly what systemd-run -t does? It
will serialize the argument list, ask PID 1 to create a transient unit
(go through the polkit stuff), and then set the stdout/stderr and
stdin of the service to your tty, make it the controlling terminal of
the process and
reset it. So I guess it should work with sudo/su just fine too.

There is also s6-sudod (and a s6-sudoc client to it) that works in a
similar fashion, though it's a lot less fancy.

> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

My idea was that this intent could be supplied at clone time, you
could attach ptrace access modes to a pidfd (we could make those a bit
granular, perhaps) and any API that takes PIDs and checks against the
caller's ptrace access mode could instead derive so from the pidfd.
Since killing is a bit convoluted due to setuid binaries, that should
work if one is CAP_KILL capable in the owning userns of the task, and
if not that, has permissions to kill and the target has NNP set. This
would allow you to bind kill privileges in a way that is compatible
with both worlds, the upshot being NNP allows for the functionality to
be available to a lot more of userspace. Ofcourse, this would require
a new clone version, possibly with taking a clone2 struct which sets a
few parameters for the process and the flags for the pidfd.

Another point is that you have a pidfd_open (or something else) that
can create multiple pidfds from a pidfd obtained at clone time and
create pidfds with varying level of rights. It can also work by taking
a TID to open a pidfd for an external task (and then for all the
rights you wish to acquire on it, check against your ambient
authority).

(Actually, in general, having FMODE_* style bits spanning all methods
a file descriptor can take (through system calls), with the type of
object as key (class containing a set), and be able to enable/disable
them and seal them would be a useful addition, this all happening at
the struct file level instead of inode level sealing in memfds).

>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.
>
> Sharing the parent's VM, signal set, fd table, etc, should all be
> options, but they should default to *off*.

Historical note: Plan 9's rfork has RFC* flags for resources like
namespace/env/fd table, which means supplying those means you start
with an clean/empty view of that resource.


>
> (Many other operating systems allow one to create a process and gain a
> capability to do all kinds of things to that process. It's a
> generally good idea.)
>
> --Andy

2019-04-15 23:59:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 2:26 PM Jonathan Kowalski <[email protected]> wrote:
>
> On Mon, Apr 15, 2019 at 9:34 PM Andy Lutomirski <[email protected]> wrote:
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable. I suppose it would be nifty if there were a way for a
>
> Hmm, isn't what you're describing roughly what systemd-run -t does? It
> will serialize the argument list, ask PID 1 to create a transient unit
> (go through the polkit stuff), and then set the stdout/stderr and
> stdin of the service to your tty, make it the controlling terminal of
> the process and
> reset it. So I guess it should work with sudo/su just fine too.
>
> There is also s6-sudod (and a s6-sudoc client to it) that works in a
> similar fashion, though it's a lot less fancy.

Cute. Now we just distros to work out the kinks and to ship these as
sudo and su :)

>
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
>
> My idea was that this intent could be supplied at clone time, you
> could attach ptrace access modes to a pidfd (we could make those a bit
> granular, perhaps) and any API that takes PIDs and checks against the
> caller's ptrace access mode could instead derive so from the pidfd.
> Since killing is a bit convoluted due to setuid binaries, that should
> work if one is CAP_KILL capable in the owning userns of the task, and
> if not that, has permissions to kill and the target has NNP set.

This CAP_KILL trick makes me nervous. This particular permission is
really quite powerful, and it would need some analysis to conclude
that it's not *more* powerful than CAP_KILL.

> This
> would allow you to bind kill privileges in a way that is compatible
> with both worlds, the upshot being NNP allows for the functionality to
> be available to a lot more of userspace. Ofcourse, this would require
> a new clone version, possibly with taking a clone2 struct which sets a
> few parameters for the process and the flags for the pidfd.
>
> Another point is that you have a pidfd_open (or something else) that
> can create multiple pidfds from a pidfd obtained at clone time and
> create pidfds with varying level of rights. It can also work by taking
> a TID to open a pidfd for an external task (and then for all the
> rights you wish to acquire on it, check against your ambient
> authority).

Indeed.

>
> (Actually, in general, having FMODE_* style bits spanning all methods
> a file descriptor can take (through system calls), with the type of
> object as key (class containing a set), and be able to enable/disable
> them and seal them would be a useful addition, this all happening at
> the struct file level instead of inode level sealing in memfds).

At the risk of saying a dirty word, the Windows API works quite a bit
like this :)

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 15.04.19 17:50, Serge E. Hallyn wrote:

Hi,

>> I'm working on implementing plan9-like fs namespaces, where unprivileged>> processes can change their own namespace at will. For that, certain>
> Is there any place where we can see previous discussion about this?
Yes, lkml and constainers list.
It's stalled since few month, as I'm too busy w/ other things.

> If you have to disable suid anyway, then is there any reason why the> existing ability to do this in a private user namespace, with only>
your own uid mapped (which you can do without any privilege) does> not
suffice? That was actually one of the main design goals of user>
namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,>
then clone(CLONE_NEWNS) and bind mount at will.
Well, it's not that easy ... maybe I should explain a bit more about how
Plan9 works, and how I intent to map it into Linux:

* on plan9, anybody can alter his own fs namespace (bind and mount), as
well as spawning new ones
* basically anything is coming from some fileserver - even devices
(eg. there is no such thing like device nodes)
* access control is done by the individual fileservers, based on the
initial authentication (on connecting to the server, before mounting)
* all users are equal - no root at all. the only exception is the
initial process, which gets the kernel devices mounted into his
namespace.

What I'd like to achieve on Linux:

* unprivileged users can have their own mount namespace, where they
can mount at will (maybe just 9P).
* but they still appear as the same normal users to the rest of the
system
* 9p programs (compiled for Linux ABI) can run parallel to traditional
linux programs within the same user and sessions (eg. from a terminal,
i can call both the same way)
* namespace modifications affect both equally (eg. I could run ff in
an own ns)
* these namespaces exist as long as there's one process alive in here
* creating a new ns can be done by unprivileged user
One of the things to make this work (w/o introducing a massive security
hole) is disable suid for those processes (actually, one day i'd like to
get rid of it completely, but that's another story).


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 15.04.19 21:59, Aleksa Sarai wrote:

> Just spit-balling -- is no_new_privs not sufficient for this usecase?> Not granting privileges such as setuid during execve(2) is the main>
point of that flag.
Oh, I wasn't aware of that. Thanks.


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 15.04.19 22:29, Andy Lutomirski wrote:

<snip>

> I would personally *love* it if distros started setting no_new_privs> for basically all processes.

Maybe a pam module for that would be fine.
But this should be configurable per-user, as so many things still rely
on suid.

Actually, I'd like to move all authentication / privilege switching
to factotum (login(1), sshd, etc then also could run as unprivileged
users).

> And pidfd actually gets us part of the> way toward a straightforward way to make sudo and su still work in a>
no_new_privs world: su could call into a daemon that would spawn the>
privileged task, and su would get a (read-only!) pidfd back and then>
wait for the fd and exit.

How exactly would the pidfd improve this scenario ?
IMHO, would just need to pass the inherited fd's to that daemon (eg.
via unix socket) which then sets them up in the new child process.

> I suppose that, done naively, this might> cause some odd effects with respect to tty handling, but I bet it's>
solveable.

Yes, signals and process groups would be a bit tricky. Some signals
could be transmitted in a similar way as ssh does.

But: how can we handle things like cgroups ?


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-04-16 21:32:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Tue, Apr 16, 2019 at 11:46 AM Enrico Weigelt, metux IT consult
<[email protected]> wrote:
>
> On 15.04.19 22:29, Andy Lutomirski wrote:
>
> <snip>
>
> > I would personally *love* it if distros started setting no_new_privs> for basically all processes.
>
> Maybe a pam module for that would be fine.
> But this should be configurable per-user, as so many things still rely
> on suid.
>
> Actually, I'd like to move all authentication / privilege switching
> to factotum (login(1), sshd, etc then also could run as unprivileged
> users).
>
> > And pidfd actually gets us part of the> way toward a straightforward way to make sudo and su still work in a>
> no_new_privs world: su could call into a daemon that would spawn the>
> privileged task, and su would get a (read-only!) pidfd back and then>
> wait for the fd and exit.
>
> How exactly would the pidfd improve this scenario ?
> IMHO, would just need to pass the inherited fd's to that daemon (eg.
> via unix socket) which then sets them up in the new child process.
>

It makes it easier to wait until the privileged program exits.
Without pidfd, you can't just wait(2) because the program that gets
spawned isn't a child. With pidfd, the daemon can pass the pidfd
back. Without pidfd, of course, you can wait by asking the daemon to
tell you when the program exits, but that's a uglier IMO.

> > I suppose that, done naively, this might> cause some odd effects with respect to tty handling, but I bet it's>
> solveable.
>
> Yes, signals and process groups would be a bit tricky. Some signals
> could be transmitted in a similar way as ssh does.
>
> But: how can we handle things like cgroups ?

Find a secure way to tell the daemon what cgroups to use?


--Andy

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 16.04.19 23:31, Andy Lutomirski wrote:

>> How exactly would the pidfd improve this scenario ?
>> IMHO, would just need to pass the inherited fd's to that daemon (eg.
>> via unix socket) which then sets them up in the new child process.
>
> It makes it easier to wait until the privileged program exits.
> Without pidfd, you can't just wait(2) because the program that gets
> spawned isn't a child.

Ah, that is a cool thing !
I suppose that also works across namespaces ?

What other things can be done via pidfd ?

>> But: how can we handle things like cgroups ?
>
> Find a secure way to tell the daemon what cgroups to use?

hmm, do we have some fd-handle to cgroups ?
In that case a process could send a handle of his cgroup to some
other process (eg. some "login" deamon) allowing him to join in.

We could look at cgroups more as kind of capabilities instead of
limitations (eg. things like: members of cgroup "net-foo1" are
granted n% of network bandwith, etc). That would open up completely
new approaches to security and resource control :)

It could go even further: anybody can create new cgroups within his
own, narrow down some limits and pass this to some other agent that
acts on behalf of him and is allowed to use his share of the system
resources for that.


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-04-17 12:20:44

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Andy Lutomirski:

> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.

Wouldn't no_new_privs inhibit all security transitions, including those
that reduce privileges? And therefore effectively reduce security?

> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

I agree that consistent replacement for the clone system call makes
sense. I'm not sure if covering everything that posix_spawn could do
would make sense. There seems to be some demand to be able to do large
parts of container setup using posix_spawn, so we'll probably add
support for things like writing to arbitrary files eventually. And of
course, proper error reporting, so that you can figure out which file
creation action failed.

Thanks,
Florian

2019-04-17 12:56:54

by Christian Brauner

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Wed, Apr 17, 2019 at 02:03:16PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 16.04.19 23:31, Andy Lutomirski wrote:
>
> >> How exactly would the pidfd improve this scenario ?
> >> IMHO, would just need to pass the inherited fd's to that daemon (eg.
> >> via unix socket) which then sets them up in the new child process.
> >
> > It makes it easier to wait until the privileged program exits.
> > Without pidfd, you can't just wait(2) because the program that gets
> > spawned isn't a child.
>
> Ah, that is a cool thing !
> I suppose that also works across namespaces ?

Yes, it should. If you hand off the pidfd to another pidns (e.g. via SCM
credentials) for example.

>
> What other things can be done via pidfd ?

Very basic things right now and until CLONE_PIDFD is accepted (possibly
for 5.2) we won't enable any more features.
I'm not a fan of wild speculations and grand schemes that then don't
come to fruition. :) Right now it's about focussing on somewhat cleanly
covering the basics. Coming to a consensus there was hard enough. We
have no intention in making this more complex right now as it needs to
be.

Christian

2019-04-17 16:49:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]



> On Apr 17, 2019, at 5:19 AM, Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski:
>
>> I would personally *love* it if distros started setting no_new_privs
>> for basically all processes.
>
> Wouldn't no_new_privs inhibit all security transitions, including those
> that reduce privileges? And therefore effectively reduce security?

In principle, you still can reduce privileges with no_new_privs. SELinux has a whole mechanism for privilege-reducing transitions on exec that works in no_new_privs mode. Also, all the traditional privilege dropping techniques work — setresuid(), unshare(), etc are all unaffected.

>
>> There seems to be some demand to be able to do large
> parts of container setup using posix_spawn, so we'll probably add
> support for things like writing to arbitrary files eventually. And of
> course, proper error reporting, so that you can figure out which file
> creation action failed.
>

ISTM the way to handle this is to have a way to make a container, set it up, and then clone/spawn into it. The current unshare() API is severely awkward.

Maybe the new better kernel spawn API shouldn’t support unshare-like semantics at all and should instead work like setns().

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 17.04.19 14:54, Christian Brauner wrote:

>> Ah, that is a cool thing !>> I suppose that also works across namespaces ?> > Yes, it should. If
you hand off the pidfd to another pidns (e.g. via SCM> credentials) for
example.
I thought about things like sending the pidfd via unix socket.
It would be really cool if the receiving process could then control
the referred process (eg. send signals), even if it's in a different
pidns.

>> What other things can be done via pidfd ?
>
> Very basic things right now and until CLONE_PIDFD is accepted (possibly
> for 5.2) we won't enable any more features.
> I'm not a fan of wild speculations and grand schemes that then don't
> come to fruition. :) Right now it's about focussing on somewhat cleanly
> covering the basics. Coming to a consensus there was hard enough. We
> have no intention in making this more complex right now as it needs to
> be.

IMHO, it would be good if it would support all operations that now can
be done via numerical PID, and w/ the permissions of the process who
created that pidfd. Certainly, that would also need some lockdown
mechanism, so the creating process can define what the holder of the
fd can actually do.

--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-04-20 07:23:08

by Kevin Easton

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes. And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a
> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.

The idea of a system call that takes an infinitely-extendable laundry
list of operations to perform in kernel space seems quite inelegant, if
only for the error-reporting reason.

Instead, I suggest that what you'd want is a way to create a new
embryonic process that has no address space and isn't yet schedulable.
You then just need other-process-directed variants of all the normal
setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
etc.

Then when it's all set up you pr_execve() to kick it off.

- Kevin

2019-04-20 11:16:27

by Christian Brauner

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On April 20, 2019 9:14:06 AM GMT+02:00, Kevin Easton <[email protected]> wrote:
>On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]>
>wrote:
>> >
>> > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]>
>wrote:
>> > > > This patchset makes it possible to retrieve pid file
>descriptors at
>> > > > process creation time by introducing the new flag CLONE_PIDFD
>to the
>> > > > clone() system call as previously discussed.
>> > >
>> > > Sorry, for highjacking this thread, but I'm curious on what
>things to
>> > > consider when introducing new CLONE_* flags.
>> > >
>> > > The reason I'm asking is:
>> > >
>> > > I'm working on implementing plan9-like fs namespaces, where
>unprivileged
>> > > processes can change their own namespace at will. For that,
>certain
>> > > traditional unix'ish things have to be disabled, most notably
>suid.
>> > > As forbidding suid can be helpful in other scenarios, too, I
>thought
>> > > about making this its own feature. Doing that switch on clone()
>seems
>> > > a nice place for that, IMHO.
>> >
>> > Just spit-balling -- is no_new_privs not sufficient for this
>usecase?
>> > Not granting privileges such as setuid during execve(2) is the main
>> > point of that flag.
>> >
>>
>> I would personally *love* it if distros started setting no_new_privs
>> for basically all processes. And pidfd actually gets us part of the
>> way toward a straightforward way to make sudo and su still work in a
>> no_new_privs world: su could call into a daemon that would spawn the
>> privileged task, and su would get a (read-only!) pidfd back and then
>> wait for the fd and exit. I suppose that, done naively, this might
>> cause some odd effects with respect to tty handling, but I bet it's
>> solveable. I suppose it would be nifty if there were a way for a
>> process, by mutual agreement, to reparent itself to an unrelated
>> process.
>>
>> Anyway, clone(2) is an enormous mess. Surely the right solution here
>> is to have a whole new process creation API that takes a big,
>> extensible struct as an argument, and supports *at least* the full
>> abilities of posix_spawn() and ideally covers all the use cases for
>> fork() + do stuff + exec(). It would be nifty if this API also had a
>> way to say "add no_new_privs and therefore enable extra functionality
>> that doesn't work without no_new_privs". This functionality would
>> include things like returning a future extra-privileged pidfd that
>> gives ptrace-like access.
>>
>> As basic examples, the improved process creation API should take a
>> list of dup2() operations to perform, fds to remove the O_CLOEXEC
>flag
>> from, fds to close (or, maybe even better, a list of fds to *not*
>> close), a list of rlimit changes to make, a list of signal changes to
>> make, the ability to set sid, pgrp, uid, gid (as in
>> setresuid/setresgid), the ability to do capset() operations, etc.
>The
>> posix_spawn() API, for all that it's rather complicated, covers a
>> bunch of the basics pretty well.
>
>The idea of a system call that takes an infinitely-extendable laundry
>list of operations to perform in kernel space seems quite inelegant, if
>only for the error-reporting reason.
>
>Instead, I suggest that what you'd want is a way to create a new
>embryonic process that has no address space and isn't yet schedulable.
>You then just need other-process-directed variants of all the normal
>setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
>pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
>etc.
>
>Then when it's all set up you pr_execve() to kick it off.
>
> - Kevin

I proposed a version of this a while back when we first started talking about this.

2019-04-20 15:08:40

by Daniel Colascione

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Sat, Apr 20, 2019 at 12:14 AM Kevin Easton <[email protected]> wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable. I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc. The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.

Providing process-directed versions of these functions would be useful
for a variety of management tasks anyway,

> Then when it's all set up you pr_execve() to kick it off.

Yes. That's the right general approach.

2019-04-20 15:32:39

by Al Viro

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:

> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

You had been two weeks too late with that, and a bit too obvious with the use
of "surely" too close to the beginning...

If it was _not_ a belated AFD posting, alt.tasteless is over -> that way...

2019-04-29 15:51:14

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Tue, Apr 16, 2019 at 08:32:50PM +0200, Enrico Weigelt, metux IT consult wrote:

(Sorry for the late reply, I had missed this one)

> On 15.04.19 17:50, Serge E. Hallyn wrote:
>
> Hi,
>
> >> I'm working on implementing plan9-like fs namespaces, where unprivileged>> processes can change their own namespace at will. For that, certain>
> > Is there any place where we can see previous discussion about this?
> Yes, lkml and constainers list.
> It's stalled since few month, as I'm too busy w/ other things.
>
> > If you have to disable suid anyway, then is there any reason why the> existing ability to do this in a private user namespace, with only>
> your own uid mapped (which you can do without any privilege) does> not
> suffice? That was actually one of the main design goals of user>
> namespaces, to be able to clone(CLONE_NEWUSER), map your current uid,>
> then clone(CLONE_NEWNS) and bind mount at will.
> Well, it's not that easy ... maybe I should explain a bit more about how
> Plan9 works, and how I intent to map it into Linux:
>
> * on plan9, anybody can alter his own fs namespace (bind and mount), as
> well as spawning new ones
> * basically anything is coming from some fileserver - even devices
> (eg. there is no such thing like device nodes)
> * access control is done by the individual fileservers, based on the
> initial authentication (on connecting to the server, before mounting)

yes, so far I'm aware of this,

> * all users are equal - no root at all. the only exception is the
> initial process, which gets the kernel devices mounted into his
> namespace.

This does not match my understanding, but I'm most likely wrong. (I thought
there was an actual 'host owner' uid, which mostly is only used for initial
process, but is basically root with a different name, and used far less. No
uid transitions without factotem so that it *looked* like no root user).

> What I'd like to achieve on Linux:
>
> * unprivileged users can have their own mount namespace, where they
> can mount at will (maybe just 9P).

No problem, you can do that now.

> * but they still appear as the same normal users to the rest of the
> system

No problem, you can do that now.

> * 9p programs (compiled for Linux ABI) can run parallel to traditional
> linux programs within the same user and sessions (eg. from a terminal,
> i can call both the same way)
> * namespace modifications affect both equally (eg. I could run ff in
> an own ns)

affect both of what equally?

> * these namespaces exist as long as there's one process alive in here

That's sort of how it is now, except you can also pin the namespaces
with their fds.

> * creating a new ns can be done by unprivileged user

That's true now.

> One of the things to make this work (w/o introducing a massive security
> hole) is disable suid for those processes (actually, one day i'd like to
> get rid of it completely, but that's another story).

That's exactly what user namespaces are for. You can create a new
user namespace, using no privilege at all, with your current uid (i.e.
1000) mapped to whatever uid you like; if you pick 0, then you can unshare all
the namespaces you like. Once you unshare mnt_ns, you can mount to your
heart's content. To other processes on the host, your process is
uid 1000. Host uid 0 is not mapped into your ns, so you cannot exploit
suid to host root.

Regarding factotem, I agree that with the pidfd work going on etc, it's getting
more and more tempting to attempt a switch to that. Looking back at my folder,
I see you posted a kernel patch for it. I had done the same long ago. Happy to
work with you again on that, and put a simple daemon into shadow package, if
util-linux isn't deemed the far better place.

-serge

Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 29.04.19 17:49, Serge E. Hallyn wrote:

>> * all users are equal - no root at all. the only exception is the>> initial process, which gets the kernel devices mounted into his>>
namespace.> > This does not match my understanding, but I'm most likely
wrong. (I thought> there was an actual 'host owner' uid, which mostly
is only used for initial> process, but is basically root with a
different name, and used far less. No> uid transitions without factotem
so that it *looked* like no root user).
Not quite (IIRC). The hostowner is just the user who booted the machine,
the initial process runs under this uname and gets the kernel devices
bound into his namespace, so he can start fileservers on them.

Also the caphash device (the one you can create capabilities, eg. for
user change, which then can be used via capuse device) can only be
opened once - usually by the host factotum.

There really is no such thing like root user.

>> What I'd like to achieve on Linux:>>>> * unprivileged users can have their own mount namespace, where
they>> can mount at will (maybe just 9P).> > No problem, you can do
that now.
But only within separate userns, IMHO. (and, when I last tried, plain
users couldn't directly create their userns).

>> * but they still appear as the same normal users to the rest of the
>> system
>
> No problem, you can do that now.

How exactly ? Did I miss something vital ?

>> * 9p programs (compiled for Linux ABI) can run parallel to traditional
>> linux programs within the same user and sessions (eg. from a terminal,
>> i can call both the same way)
>> * namespace modifications affect both equally (eg. I could run ff in
>> an own ns)
>
> affect both of what equally?

mount / bind.

> That's exactly what user namespaces are for. You can create a new
> user namespace, using no privilege at all, with your current uid (i.e.
> 1000) mapped to whatever uid you like; if you pick 0, then you can unshare all
> the namespaces you like.

But I don't like to appear as 'root' in here. I just wanna have my own
filesystem namespace, nothing more.

> Once you unshare mnt_ns, you can mount to your
> heart's content. To other processes on the host, your process is
> uid 1000.

Is that the uid, I'm appearing to filesystems ?

> Regarding factotem, I agree that with the pidfd work going on etc, it's getting
> more and more tempting to attempt a switch to that. Looking back at my folder,
> I see you posted a kernel patch for it. I had done the same long ago. Happy to
> work with you again on that, and put a simple daemon into shadow package, if
> util-linux isn't deemed the far better place.

Yeah :)


--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-04-29 19:31:46

by Jann Horn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton <[email protected]> wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable. I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc. The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.
>
> Then when it's all set up you pr_execve() to kick it off.

Is this really necessary? I agree that fork()+exec() is suboptimal,
but if you just want to avoid the cost of duplicating the address
space, you can AFAICS already do that in userspace with
clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
the parent can block on a futex until the child leaves the mm_struct
through execve() (or by exiting, in the case of an error), and the
child can temporarily have its stack at the bottom of the caller's
stack. You could build an API like this around it in userspace:

int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
<clone flags and arguments, maybe in a struct>)

and then you'd use it like this to fork off a child process:

int spawn_shell_subprocess_(void *arg) {
char *cmdline = arg;
execl("/bin/sh", "sh", "-c", cmdline);
return -1;
}
pid_t spawn_shell_subprocess(char *cmdline) {
pid_t child_pid;
int res = clone_temporary(spawn_shell_subprocess_, cmdline,
&child_pid, [...]);
if (res == 0) return child_pid;
return res;
}

clone_temporary() could be implemented roughly as follows by the libc
(or other userspace code):

sigset_t sigset, sigset_old;
sigfillset(&sigset);
sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
int child_pid;
int result = 0;
/* starting here, use inline assembly to ensure that no stack
allocations occur */
long child = syscall(__NR_clone,
CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
if (child == -1) { result = -1; goto reset_sigmask; }
if (child == 0) {
result = fn(arg);
syscall(__NR_exit, 0);
}
futex(&child_pid, FUTEX_WAIT, child, NULL);
/* end of no-stack-allocations zone */
reset_sigmask:
sigprocmask(SIG_SETMASK, &sigset_old, NULL);
return result;

2019-04-29 20:03:51

by Jann Horn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 3:30 PM Jann Horn <[email protected]> wrote:
> On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton <[email protected]> wrote:
> > On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai <[email protected]> wrote:
> > > >
> > > > On 2019-04-15, Enrico Weigelt, metux IT consult <[email protected]> wrote:
> > > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > > clone() system call as previously discussed.
> > > > >
> > > > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > > > consider when introducing new CLONE_* flags.
> > > > >
> > > > > The reason I'm asking is:
> > > > >
> > > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > > processes can change their own namespace at will. For that, certain
> > > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > > about making this its own feature. Doing that switch on clone() seems
> > > > > a nice place for that, IMHO.
> > > >
> > > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > > Not granting privileges such as setuid during execve(2) is the main
> > > > point of that flag.
> > > >
> > >
> > > I would personally *love* it if distros started setting no_new_privs
> > > for basically all processes. And pidfd actually gets us part of the
> > > way toward a straightforward way to make sudo and su still work in a
> > > no_new_privs world: su could call into a daemon that would spawn the
> > > privileged task, and su would get a (read-only!) pidfd back and then
> > > wait for the fd and exit. I suppose that, done naively, this might
> > > cause some odd effects with respect to tty handling, but I bet it's
> > > solveable. I suppose it would be nifty if there were a way for a
> > > process, by mutual agreement, to reparent itself to an unrelated
> > > process.
> > >
> > > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > > is to have a whole new process creation API that takes a big,
> > > extensible struct as an argument, and supports *at least* the full
> > > abilities of posix_spawn() and ideally covers all the use cases for
> > > fork() + do stuff + exec(). It would be nifty if this API also had a
> > > way to say "add no_new_privs and therefore enable extra functionality
> > > that doesn't work without no_new_privs". This functionality would
> > > include things like returning a future extra-privileged pidfd that
> > > gives ptrace-like access.
> > >
> > > As basic examples, the improved process creation API should take a
> > > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > > from, fds to close (or, maybe even better, a list of fds to *not*
> > > close), a list of rlimit changes to make, a list of signal changes to
> > > make, the ability to set sid, pgrp, uid, gid (as in
> > > setresuid/setresgid), the ability to do capset() operations, etc. The
> > > posix_spawn() API, for all that it's rather complicated, covers a
> > > bunch of the basics pretty well.
> >
> > The idea of a system call that takes an infinitely-extendable laundry
> > list of operations to perform in kernel space seems quite inelegant, if
> > only for the error-reporting reason.
> >
> > Instead, I suggest that what you'd want is a way to create a new
> > embryonic process that has no address space and isn't yet schedulable.
> > You then just need other-process-directed variants of all the normal
> > setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> > pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> > etc.
> >
> > Then when it's all set up you pr_execve() to kick it off.
>
> Is this really necessary? I agree that fork()+exec() is suboptimal,
> but if you just want to avoid the cost of duplicating the address
> space, you can AFAICS already do that in userspace with
> clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
> the parent can block on a futex until the child leaves the mm_struct
> through execve() (or by exiting, in the case of an error), and the
> child can temporarily have its stack at the bottom of the caller's
> stack. You could build an API like this around it in userspace:
>
> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> <clone flags and arguments, maybe in a struct>)
>
> and then you'd use it like this to fork off a child process:
>
> int spawn_shell_subprocess_(void *arg) {
> char *cmdline = arg;
> execl("/bin/sh", "sh", "-c", cmdline);
> return -1;
> }
> pid_t spawn_shell_subprocess(char *cmdline) {
> pid_t child_pid;
> int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> &child_pid, [...]);
> if (res == 0) return child_pid;
> return res;
> }
>
> clone_temporary() could be implemented roughly as follows by the libc
> (or other userspace code):
>
> sigset_t sigset, sigset_old;
> sigfillset(&sigset);
> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> int child_pid;
> int result = 0;
> /* starting here, use inline assembly to ensure that no stack
> allocations occur */
> long child = syscall(__NR_clone,
> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> if (child == -1) { result = -1; goto reset_sigmask; }
> if (child == 0) {
> result = fn(arg);
> syscall(__NR_exit, 0);
> }
> futex(&child_pid, FUTEX_WAIT, child, NULL);
> /* end of no-stack-allocations zone */
> reset_sigmask:
> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> return result;

... I guess that already has a name, and it's called vfork(). (Well,
except that the Linux vfork() isn't a real vfork().)

So I guess my question is: Why not vfork()?

And if vfork() alone isn't flexible enough, alternatively: How about
an API that forks a new child in the same address space, and then
allows the parent to invoke arbitrary syscalls in the context of the
child? You could also build that in userspace if you wanted, I think -
just let the child run an assembly loop that reads registers from a
unix seqpacket socket, invokes the syscall instruction, and writes the
value of the result register back into the seqpacket socket. As long
as you use CLONE_VM, you don't have to worry about moving the pointer
targets of syscalls. The user-visible API could look like this:

// flags added by the implementation: CLONE_VM|CLONE_CHILD_SETTID
puppet_handle = fork_puppet(CLONE_NEWUSER|SIGCHLD);
int uid_map_fd = puppet_syscall(SYS_open, "/proc/self/uid_map");
char uid_map_buf[1000];
puppet_syscall(SYS_write, uid_map_fd, uid_map_buf, strlen(uid_map_buf));
puppet_syscall(SYS_close, uid_map_fd);
// waits for the child to either exit or switch to new mm via
CLONE_CHILD_CLEARTID
puppet_finish_execve(path, argv, envv);

2019-04-29 20:22:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 12:55 PM Jann Horn <[email protected]> wrote:
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)

What?

Linux vfork() is very much a real vfork(). What do you mean?

Linus

2019-04-29 20:39:51

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Linus Torvalds:

> On Mon, Apr 29, 2019 at 12:55 PM Jann Horn <[email protected]> wrote:
>>
>> ... I guess that already has a name, and it's called vfork(). (Well,
>> except that the Linux vfork() isn't a real vfork().)
>
> What?
>
> Linux vfork() is very much a real vfork(). What do you mean?

In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
is sometimes implemented as fork, so applications cannot rely on the
vfork behavior regarding the stopped parent and the shared address
space.

In fact, it would be nice to have a flag we can check in the posix_spawn
implementation, so that we can support vfork-as-fork without any run
time cost to native Linux.

Thanks,
Florian

2019-04-29 20:51:17

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Jann Horn:

>> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
>> <clone flags and arguments, maybe in a struct>)
>>
>> and then you'd use it like this to fork off a child process:
>>
>> int spawn_shell_subprocess_(void *arg) {
>> char *cmdline = arg;
>> execl("/bin/sh", "sh", "-c", cmdline);
>> return -1;
>> }
>> pid_t spawn_shell_subprocess(char *cmdline) {
>> pid_t child_pid;
>> int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>> &child_pid, [...]);
>> if (res == 0) return child_pid;
>> return res;
>> }
>>
>> clone_temporary() could be implemented roughly as follows by the libc
>> (or other userspace code):
>>
>> sigset_t sigset, sigset_old;
>> sigfillset(&sigset);
>> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
>> int child_pid;
>> int result = 0;
>> /* starting here, use inline assembly to ensure that no stack
>> allocations occur */
>> long child = syscall(__NR_clone,
>> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
>> if (child == -1) { result = -1; goto reset_sigmask; }
>> if (child == 0) {
>> result = fn(arg);
>> syscall(__NR_exit, 0);
>> }
>> futex(&child_pid, FUTEX_WAIT, child, NULL);
>> /* end of no-stack-allocations zone */
>> reset_sigmask:
>> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
>> return result;
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)
>
> So I guess my question is: Why not vfork()?

Mainly because some users want access to the clone flags, and that's not
possible with the current userspace wrappers. The stack setup for the
undocumented clone wrapper is also cumbersome, and the ia64 pecularity
annoying.

For the stack sharing, the callback-based interface looks like the
absolutely right thing to do to me. It enforces the notion that you can
safely return on the child path from a function calling vfork.

> And if vfork() alone isn't flexible enough, alternatively: How about
> an API that forks a new child in the same address space, and then
> allows the parent to invoke arbitrary syscalls in the context of the
> child?

As long it's not an eBPF script …

> You could also build that in userspace if you wanted, I think - just
> let the child run an assembly loop that reads registers from a unix
> seqpacket socket, invokes the syscall instruction, and writes the
> value of the result register back into the seqpacket socket. As long
> as you use CLONE_VM, you don't have to worry about moving the pointer
> targets of syscalls. The user-visible API could look like this:

People already use a variant of this, execve'ing twice. See
jspawnhelper.

Thanks,
Florian

2019-04-29 20:53:01

by Christian Brauner

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 10:38 PM Florian Weimer <[email protected]> wrote:
>
> * Linus Torvalds:
>
> > On Mon, Apr 29, 2019 at 12:55 PM Jann Horn <[email protected]> wrote:
> >>
> >> ... I guess that already has a name, and it's called vfork(). (Well,
> >> except that the Linux vfork() isn't a real vfork().)
> >
> > What?
> >
> > Linux vfork() is very much a real vfork(). What do you mean?
>
> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
> is sometimes implemented as fork, so applications cannot rely on the
> vfork behavior regarding the stopped parent and the shared address
> space.
>
> In fact, it would be nice to have a flag we can check in the posix_spawn
> implementation, so that we can support vfork-as-fork without any run
> time cost to native Linux.

After the next merge window we'll be out of flags if things go as planned.
To address this problem, Jann and I are currently in the middle of working
on a clone version that we intend to send out for discussion afterwards.
If the proposal is acceptable it would bump the number of available flags
significantly, putting things like this within reach.

Christian

2019-04-29 20:55:30

by Christian Brauner

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 10:50 PM Florian Weimer <[email protected]> wrote:
>
> * Jann Horn:
>
> >> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> >> <clone flags and arguments, maybe in a struct>)
> >>
> >> and then you'd use it like this to fork off a child process:
> >>
> >> int spawn_shell_subprocess_(void *arg) {
> >> char *cmdline = arg;
> >> execl("/bin/sh", "sh", "-c", cmdline);
> >> return -1;
> >> }
> >> pid_t spawn_shell_subprocess(char *cmdline) {
> >> pid_t child_pid;
> >> int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> >> &child_pid, [...]);
> >> if (res == 0) return child_pid;
> >> return res;
> >> }
> >>
> >> clone_temporary() could be implemented roughly as follows by the libc
> >> (or other userspace code):
> >>
> >> sigset_t sigset, sigset_old;
> >> sigfillset(&sigset);
> >> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> >> int child_pid;
> >> int result = 0;
> >> /* starting here, use inline assembly to ensure that no stack
> >> allocations occur */
> >> long child = syscall(__NR_clone,
> >> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> >> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> >> if (child == -1) { result = -1; goto reset_sigmask; }
> >> if (child == 0) {
> >> result = fn(arg);
> >> syscall(__NR_exit, 0);
> >> }
> >> futex(&child_pid, FUTEX_WAIT, child, NULL);
> >> /* end of no-stack-allocations zone */
> >> reset_sigmask:
> >> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> >> return result;
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
> >
> > So I guess my question is: Why not vfork()?
>
> Mainly because some users want access to the clone flags, and that's not
> possible with the current userspace wrappers. The stack setup for the
> undocumented clone wrapper is also cumbersome, and the ia64 pecularity
> annoying.
>
> For the stack sharing, the callback-based interface looks like the
> absolutely right thing to do to me. It enforces the notion that you can
> safely return on the child path from a function calling vfork.
>
> > And if vfork() alone isn't flexible enough, alternatively: How about
> > an API that forks a new child in the same address space, and then
> > allows the parent to invoke arbitrary syscalls in the context of the
> > child?
>
> As long it's not an eBPF script …

You shouldn't even joke about this (I'm serious.).
I'm very certain there are people who'd think this is a good idea.

>
> > You could also build that in userspace if you wanted, I think - just
> > let the child run an assembly loop that reads registers from a unix
> > seqpacket socket, invokes the syscall instruction, and writes the
> > value of the result register back into the seqpacket socket. As long
> > as you use CLONE_VM, you don't have to worry about moving the pointer
> > targets of syscalls. The user-visible API could look like this:
>
> People already use a variant of this, execve'ing twice. See
> jspawnhelper.
>
> Thanks,
> Florian

2019-04-29 21:41:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer <[email protected]> wrote:
>
> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
> is sometimes implemented as fork, so applications cannot rely on the
> vfork behavior regarding the stopped parent and the shared address
> space.

What broken library does that?

Sure, we didn't have a proper vfork() long long long ago. But that
predates both git and BK, so it's some time in the 90's. We've had a
proper vfork() *forever*.

> In fact, it would be nice to have a flag we can check in the posix_spawn
> implementation, so that we can support vfork-as-fork without any run
> time cost to native Linux.

No. Just make a bug-report to whatever broken library you use. What's
the point of having a library that can't even get vfork() right? Why
would you want to have a flag to say "vfork is broken"?

Linus

2019-04-30 00:41:37

by Jann Horn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 4:21 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 29, 2019 at 12:55 PM Jann Horn <[email protected]> wrote:
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
>
> What?
>
> Linux vfork() is very much a real vfork(). What do you mean?

... uuuh, whoops. Turns out I don't know what I'm talking about.
Nevermind. For some reason I thought vfork() was just
CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.

2019-04-30 02:17:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 5:39 PM Jann Horn <[email protected]> wrote:
>
> ... uuuh, whoops. Turns out I don't know what I'm talking about.

Well, apparently there's some odd libc issue accoprding to Florian, so
there *might* be something to it.

> Nevermind. For some reason I thought vfork() was just
> CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.

Well, inside the kernel, that's actually *very* close to what vfork() is:

SYSCALL_DEFINE0(vfork)
{
return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
0, NULL, NULL, 0);
}

but that's just an internal implementation detail. It's a real vfork()
and should act as the traditional BSD "share everything" without any
address space copying. The CLONE_VFORK flag is what does the "wait for
child to exit or execve" magic.

Note that vfork() is "exciting" for the compiler in much the same way
"setjmp/longjmp()" is, because of the shared stack use in the child
and the parent. It is *very* easy to get this wrong and cause massive
and subtle memory corruption issues because the parent returns to
something that has been messed up by the child.

That may be why some libc might end up just using "fork()", because it
ends up avoiding bugs in user space.
(In fact, if I recall correctly, the _reason_ we have an explicit
'vfork()' entry point rather than using clone() with magic parameters
was that the lack of arguments meant that you didn't have to
save/restore any registers in user space, which made the whole stack
issue simpler. But it's been two decades, so my memory is bitrotting).

Also, particularly if you have a big address space, vfork()+execve()
can be quite a bit faster than fork()+execve(). Linux fork() is pretty
efficient, but if you have gigabytes of VM space to copy, it's going
to take time even if you do it fairly well.

Linus

2019-04-30 07:03:53

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Linus Torvalds:

> On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer <[email protected]> wrote:
>>
>> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
>> is sometimes implemented as fork, so applications cannot rely on the
>> vfork behavior regarding the stopped parent and the shared address
>> space.
>
> What broken library does that?
>
> Sure, we didn't have a proper vfork() long long long ago. But that
> predates both git and BK, so it's some time in the 90's. We've had a
> proper vfork() *forever*.

It's not so much about libraries, it's alternative implementations of
the system call interface: valgrind, qemu-user and WSL. For valgrind
and qemu-user, it's about cloning the internal data structures, so that
the subprocess does not clobber what's in the parent process (which may
have multiple threads and may not be fully blocked due to vfork). For
WSL, who knows what's going on there.

>> In fact, it would be nice to have a flag we can check in the posix_spawn
>> implementation, so that we can support vfork-as-fork without any run
>> time cost to native Linux.
>
> No. Just make a bug-report to whatever broken library you use. What's
> the point of having a library that can't even get vfork() right? Why
> would you want to have a flag to say "vfork is broken"?

It's apparently quite difficult to fix valgrind and qemu-user. WSL is
apparently not given the resources it needs, yet a surprising number of
people see it as a viable replacement and report what are essentially
vfork-related bugs.

If I had the flag, I could at least fix posix_spawn in glibc to consult
it before assuming that vfork shares address space. (The sharing allows
straightforward reporting of the vfork error code, without resorting to
pipes or creating a MAP_SHARED mapping.) For obvious reasons, I do not
want to apply the workaround unconditionally.

Thanks,
Florian

2019-04-30 08:24:16

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Linus Torvalds:

> Note that vfork() is "exciting" for the compiler in much the same way
> "setjmp/longjmp()" is, because of the shared stack use in the child
> and the parent. It is *very* easy to get this wrong and cause massive
> and subtle memory corruption issues because the parent returns to
> something that has been messed up by the child.

Just using a wrapper around vfork is enough for that, if the return
address is saved on the stack. It's surprising hard to write a test
case for that, but the corruption is definitely there.

> (In fact, if I recall correctly, the _reason_ we have an explicit
> 'vfork()' entry point rather than using clone() with magic parameters
> was that the lack of arguments meant that you didn't have to
> save/restore any registers in user space, which made the whole stack
> issue simpler. But it's been two decades, so my memory is bitrotting).

That's an interesting point. Using a callback-style interface avoids
that because you never need to restore the registers in the new
subprocess. It's still appropriate to use an assembler implementation,
I think, because it will be more obviously correct.

> Also, particularly if you have a big address space, vfork()+execve()
> can be quite a bit faster than fork()+execve(). Linux fork() is pretty
> efficient, but if you have gigabytes of VM space to copy, it's going
> to take time even if you do it fairly well.

vfork is also more benign from a memory accounting perspective. In some
environments, it's not possible to call fork from a large process
because the accounting assumes (conservatively) that the new process
will dirty a lot of its private memory.

Thanks,
Florian

2019-04-30 12:41:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On 04/29, Linus Torvalds wrote:
>
> Linux vfork() is very much a real vfork(). What do you mean?

Yes, but I am wondering if man vfork should clarify what "child terminates"
actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
this will wake the parent thread up before the child process exits or execs.

I see nothing wrong, but I was always curious whether it was designed this
way on purpose or not.

Oleg.

2019-04-30 16:21:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Tue, Apr 30, 2019 at 1:21 AM Florian Weimer <[email protected]> wrote:
>
> > (In fact, if I recall correctly, the _reason_ we have an explicit
> > 'vfork()' entry point rather than using clone() with magic parameters
> > was that the lack of arguments meant that you didn't have to
> > save/restore any registers in user space, which made the whole stack
> > issue simpler. But it's been two decades, so my memory is bitrotting).
>
> That's an interesting point. Using a callback-style interface avoids
> that because you never need to restore the registers in the new
> subprocess. It's still appropriate to use an assembler implementation,
> I think, because it will be more obviously correct.

I agree that a callback interface would have been a whole lot more
obvious and less prone to subtle problems.

But if you want vfork() because the programs you want to build use it,
that's the interface you need..

Of course, if you *don't* need the exact vfork() semantics, clone
itself actually very much supports a callback model with s separate
stack. You can basically do this:

- allocate new stack for the child
- in trivial asm wrapper, do:
- push the callback address on the child stack
- clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
- "ret"
- free new stack

where the "ret" in the child will just go to the callback, while the
parent (eventually) just returns from the trivial wrapper and frees
the new stack (which by definition is no longer used, since the child
has exited or execve'd.

So you can most definitely create a "vfork_with_child_callback()" with
clone, and it would arguably be a much superior interface to vfork()
anyway (maybe you'd like to pass in some arguments to the callback too
- add more stack setup for the child as needed), but it wouldn't be
the right solution for programs that just want to use the standard BSD
vfork() model.

> vfork is also more benign from a memory accounting perspective. In some
> environments, it's not possible to call fork from a large process
> because the accounting assumes (conservatively) that the new process
> will dirty a lot of its private memory.

Indeed.

Linus

2019-04-30 16:28:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
<[email protected]> wrote:
>
> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
> - allocate new stack for the child
> - in trivial asm wrapper, do:
> - push the callback address on the child stack
> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
> - "ret"
> - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.

In fact, Florian, maybe this is the solution to your "I want to use
vfork for posix_spawn(), but I don't know if I can trust it" problem.

Just use clone() directly. On WSL it will presumably just fail, and
you can then fall back on doing the slow stupid
fork+pipes-to-communicate.

On valgrind, I don't know what will happen. Maybe it will just do an
unchecked posix_spawn() because valgrind doesn't catch it?

Linus

2019-04-30 16:33:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Tue, Apr 30, 2019 at 5:39 AM Oleg Nesterov <[email protected]> wrote:
>
> Yes, but I am wondering if man vfork should clarify what "child terminates"
> actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
> this will wake the parent thread up before the child process exits or execs.

That falls solidly into the "give people rope" category.

If the vfork() child wants to mess with the parent, it has many easier
ways to do it than create more threads.

As mentioned, the real problem with vfork() tends to be that the child
unintentionally messes with the parent because it just gets the stack
sharing wrong. No need to add intention there.

> I see nothing wrong, but I was always curious whether it was designed this
> way on purpose or not.

Oh, it's definitely on purpose. Trying to do some nested usage count
would be horrendously complex, and even a trivial "don't allow any
other clone() calls if we already have a vfork completion pending" is
just unnecessary logic.

Because at least in *theory*, there's actually nothing horribly wrong
with allowing a thread to be created during the vfork(). I don't see
the _point_, but it's not conceptually something that couldn't work
(you'd need a separate thread stack etc, but that's normal clone()).

So no, there's no safety or bogus "you can't do that". If you want to
play games after vfork(), go wild.

Linus

2019-04-30 17:09:14

by Florian Weimer

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Linus Torvalds:

> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with s separate
>> stack. You can basically do this:
>>
>> - allocate new stack for the child
>> - in trivial asm wrapper, do:
>> - push the callback address on the child stack
>> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>> - "ret"
>> - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd.
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly. On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.

We already use clone. I don't know why. We should add a comment that
provides the reason.

> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?

I think what happens with these emulators that they use fork (no shared
address space) but suspend the parent thread. clone with CLONE_VFORK
definitely does not fail. That mostly works, except that you do not get
back the error code from the execve. Instead, the process is considered
launched, and the caller collects the exit status from the _exit after
the failed execve.

> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
> - allocate new stack for the child
> - in trivial asm wrapper, do:
> - push the callback address on the child stack
> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
> - "ret"
> - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.
>
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.

As far as we understand the situation, we believe that we absolutely
must block all signals for both the parent thread and the new
subprocess. Signals can be unblocked in the subprocess, but only after
setting their handlers to SIG_DFL or SIG_IGN. (Parent signal handlers
cannot run in the subprocess because application-supplied signal
handlers generally do not expect to run with a corrupt TCB—or a
different PID.)

At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.

Thanks,
Florian

2019-05-05 02:47:20

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

On Mon, Apr 29, 2019 at 07:31:43PM +0200, Enrico Weigelt, metux IT consult wrote:

Argh. Sorry, it seems your emails aren't making it into my inbox, only
my once-in-a-long-while-checked lkml folder. Sorry again.

> On 29.04.19 17:49, Serge E. Hallyn wrote:
>
> >> * all users are equal - no root at all. the only exception is the>> initial process, which gets the kernel devices mounted into his>>
> namespace.> > This does not match my understanding, but I'm most likely
> wrong. (I thought> there was an actual 'host owner' uid, which mostly
> is only used for initial> process, but is basically root with a
> different name, and used far less. No> uid transitions without factotem
> so that it *looked* like no root user).
> Not quite (IIRC). The hostowner is just the user who booted the machine,
> the initial process runs under this uname and gets the kernel devices
> bound into his namespace, so he can start fileservers on them.
>
> Also the caphash device (the one you can create capabilities, eg. for
> user change, which then can be used via capuse device) can only be
> opened once - usually by the host factotum.
>
> There really is no such thing like root user.
>
> >> What I'd like to achieve on Linux:>>>> * unprivileged users can have their own mount namespace, where
> they>> can mount at will (maybe just 9P).> > No problem, you can do
> that now.
>
> But only within separate userns, IMHO. (and, when I last tried, plain

"Only within a separate userns" - but why does that matter? It's just
a different uid mapping.

> users couldn't directly create their userns).

Plain users can definately create their own userns, directly. On some
distros there is a kernel knob like

#cat /proc/sys/kernel/unprivileged_userns_clone
1

which when unset prevents unprivileged users creating a namespace.

> >> * but they still appear as the same normal users to the rest of the
> >> system
> >
> > No problem, you can do that now.
>
> How exactly ? Did I miss something vital ?

By unsharing your namespace and writing the new uid mapping. You can of
course only map your own uid without using any privileged helpers at all.
And it requires help from a second process, which does the writing to
the uid map file after the first process has unshared. But you can do it.
For instance, using the nsexec.c at

https://github.com/fcicq/nsexec

You can:

Terminal 1:
shallyn@stp:~/src/nsexec$ ./nsexec -UWm
about to unshare with 10020000
Press any key to exec (I am 31157)

Now in terminal 2:

Terminal 2:
shallyn@stp:~/src/nsexec$ echo "0 1000 1" > /proc/31157/uid_map
shallyn@stp:~/src/nsexec$ echo deny > /proc/31157/setgroups
shallyn@stp:~/src/nsexec$ echo "0 1000 1" > /proc/31157/gid_map

Then back in terminal 1:
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# mount --bind /etc /mnt
# echo $?
0
# ls /root
ls: cannot open directory '/root': Permission denied

To the rest of the system you look like uid 1000. You could have
chosen uid 1000 in your new namespace, but then you couldn't mount.
Of course you can nest user namespaces so you could create another,
this time mapping uid 1000 so you look like 1000 to yourself as well.

> >> * 9p programs (compiled for Linux ABI) can run parallel to traditional
> >> linux programs within the same user and sessions (eg. from a terminal,
> >> i can call both the same way)
> >> * namespace modifications affect both equally (eg. I could run ff in
> >> an own ns)
> >
> > affect both of what equally?
>
> mount / bind.
>
> > That's exactly what user namespaces are for. You can create a new
> > user namespace, using no privilege at all, with your current uid (i.e.
> > 1000) mapped to whatever uid you like; if you pick 0, then you can unshare all
> > the namespaces you like.
>
> But I don't like to appear as 'root' in here. I just wanna have my own
> filesystem namespace, nothing more.

Right. As you know setuid makes that impossible, unfortunately. That's
where nonewprivs shows promise.

> > Once you unshare mnt_ns, you can mount to your
> > heart's content. To other processes on the host, your process is
> > uid 1000.
>
> Is that the uid, I'm appearing to filesystems ?

Yes.

> > Regarding factotem, I agree that with the pidfd work going on etc, it's getting
> > more and more tempting to attempt a switch to that. Looking back at my folder,
> > I see you posted a kernel patch for it. I had done the same long ago. Happy to
> > work with you again on that, and put a simple daemon into shadow package, if
> > util-linux isn't deemed the far better place.
>
> Yeah :)
>
>
> --mtx
>
> --
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> [email protected] -- +49-151-27565287