/* Introduction */
This adds the pidfd_open() syscall.
pidfd_open() allows to retrieve file descriptors for a given pid. This
includes both file descriptors for processes and file descriptors for
threads.
With the addition of this syscalls pidfds become independent of procfs just
as pids are. Of course, if CONFIG_PROC_FS is not set then metadata access
for processes will not be possible but everything else will just work fine.
In addition, this allows us to remove the dependency of pidfd_send_signal()
on procfs and enable it unconditionally.
With the ability to call pidfd_open() on tids we can now add a flag to
pidfd_send_signal() to signal to a specific thread capturing the
functionality of tgkill() and related thread-focused signal syscalls.
The desire to lift the restriction for pidfds on procfs has been expressed
by multiple people (cf. the commit message of commit
3eb39f47934f9d5a3027fe00d906a45fe3a15fad and [2]).
/* Signature */
int pidfd_open(pid_t pid, unsigned int flags);
/* pidfds are anon inode file descriptors */
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. The pidfds are not
associated with a specific pid namespaces but rather only reference struct
pid of a given process in their private_data member.
Additionally, Andy made an argument that we should go forward with
non-proc-dirfd file descriptors for the sake of security and extensibility
(cf. [3]). This will unblock or help move along work on pidfd_wait which
is currently ongoing.
/* Process Metadata Access */
One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/<pid> directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/<pid> based on a pidfd
(cf. [5]). The prefered solution for how to do this has been to implement
an ioctl that for pidfds that translates a pidfd into a dirfd for
/proc/<pid>. This has been implemented in this patchset as well. If
PIDFD_GET_PROCFD is passed as a command to an ioctl() taking a pidfd and an
fd referring to a procfs directory as an argument a corresponding dirfd to
/proc/<pid> can be retrieved.
The ioctl() makes very sure that the struct pid associated with the
/proc/<pid> fd is identical to the struct pid stashed in the pidfd. This
ensures that we avoid pid recycling issues.
/* Testing */
The patchset comes with tests (which btw. I consider mandatory with
every feature-patch that intends to go through the pidfd tree):
- test that no invalid flags can be passed to pidfd_open()
- test that no invalid pid can be passed to pidfd_open()
- test that a pidfd can be retrieved with pidfd_open()
- test whether a pidfd can be converted into an fd to /proc/<pid> to get
metadata access
- test that a pidfd retrieved based on a pid that has been recycled cannot
be converted into /proc/<pid> for that recycled pid
/* Example */
int pidfd = pidfd_open(1234, 0);
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = ioctl(pidfd, PIDFD_GET_PROCFD, procfd);
int statusfd = openat(procpidfd, "status", O_RDONLY | O_CLOEXEC);
int ret = read(statusfd, buf, sizeof(buf));
ret = pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com
[4]: https://lore.kernel.org/lkml/CAHk-=wgmKZm-fESEiLq_W37sKpqCY89nQkPNfWhvF_CQ1ANgcw@mail.gmail.com
[5]: https://lore.kernel.org/lkml/[email protected]
Christian Brauner (4):
pid: add pidfd_open()
signal: support pidfd_open() with pidfd_send_signal()
signal: PIDFD_SIGNAL_TID threads via pidfds
tests: add pidfd_open() tests
David Howells (1):
Make anon_inodes unconditional
arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
include/linux/pid.h | 2 +
include/linux/syscalls.h | 1 +
include/uapi/linux/wait.h | 5 +
init/Kconfig | 10 -
kernel/pid.c | 181 +++++++++
kernel/signal.c | 130 +++++--
kernel/sys_ni.c | 3 -
tools/testing/selftests/pidfd/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 57 +++
.../testing/selftests/pidfd/pidfd_open_test.c | 361 ++++++++++++++++++
tools/testing/selftests/pidfd/pidfd_test.c | 41 +-
30 files changed, 701 insertions(+), 112 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd.h
create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c
--
2.21.0
From: David Howells <[email protected]>
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and the pidfd_open() syscall.
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[[email protected]: adapt commit message to mention pidfd_open()]
Signed-off-by: Christian Brauner <[email protected]>
---
arch/arm/kvm/Kconfig | 1 -
arch/arm64/kvm/Kconfig | 1 -
arch/mips/kvm/Kconfig | 1 -
arch/powerpc/kvm/Kconfig | 1 -
arch/s390/kvm/Kconfig | 1 -
arch/x86/Kconfig | 1 -
arch/x86/kvm/Kconfig | 1 -
drivers/base/Kconfig | 1 -
drivers/char/tpm/Kconfig | 1 -
drivers/dma-buf/Kconfig | 1 -
drivers/gpio/Kconfig | 1 -
drivers/iio/Kconfig | 1 -
drivers/infiniband/Kconfig | 1 -
drivers/vfio/Kconfig | 1 -
fs/Makefile | 2 +-
fs/notify/fanotify/Kconfig | 1 -
fs/notify/inotify/Kconfig | 1 -
init/Kconfig | 10 ----------
18 files changed, 1 insertion(+), 27 deletions(-)
diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on MMU && OF
select PREEMPT_NOTIFIERS
- select ANON_INODES
select ARM_GIC
select ARM_GIC_V3
select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
depends on OF
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
depends on MIPS_FP_SUPPORT
select EXPORT_UASM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
config KVM
bool
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
prompt "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
- select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_VCPU_ASYNC_IOCTL
select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..18f2c954464e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
- select ANON_INODES
select ARCH_32BIT_OFF_T if X86_32
select ARCH_CLOCKSOURCE_DATA
select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
depends on X86_LOCAL_APIC
select PREEMPT_NOTIFIERS
select MMU_NOTIFIER
- select ANON_INODES
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
config DMA_SHARED_BUFFER
bool
default n
- select ANON_INODES
select IRQ_WORK
help
This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
config TCG_VTPM_PROXY
tristate "VTPM Proxy Interface"
depends on TCG_TPM
- select ANON_INODES
---help---
This driver proxies for an emulated TPM (vTPM) running in userspace.
A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
config SYNC_FILE
bool "Explicit Synchronization Framework"
default n
- select ANON_INODES
select DMA_SHARED_BUFFER
---help---
The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
menuconfig GPIOLIB
bool "GPIO Support"
- select ANON_INODES
help
This enables GPIO support through the generic GPIO library.
You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
menuconfig IIO
tristate "Industrial I/O support"
- select ANON_INODES
help
The industrial I/O subsystem provides a unified framework for
drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
config INFINIBAND_USER_ACCESS
tristate "InfiniBand userspace access (verbs and CM)"
- select ANON_INODES
depends on MMU
---help---
Userspace InfiniBand access support. This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
- select ANON_INODES
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
-obj-$(CONFIG_ANON_INODES) += anon_inodes.o
+obj-y += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
config FANOTIFY
bool "Filesystem wide access notification"
select FSNOTIFY
- select ANON_INODES
select EXPORTFS
default n
---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
config INOTIFY_USER
bool "Inotify support for userspace"
- select ANON_INODES
select FSNOTIFY
default y
---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
config SYSCTL
bool
-config ANON_INODES
- bool
-
config HAVE_UID16
bool
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
- select ANON_INODES
help
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.
config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
config TIMERFD
bool "Enable timerfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
config EVENTFD
bool "Enable eventfd() system call" if EXPERT
- select ANON_INODES
default y
help
Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call"
- select ANON_INODES
select BPF
select IRQ_WORK
default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
config USERFAULTFD
bool "Enable userfaultfd() system call"
- select ANON_INODES
depends on MMU
help
Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
bool "Kernel performance events and counters"
default y if PROFILING
depends on HAVE_PERF_EVENTS
- select ANON_INODES
select IRQ_WORK
select SRCU
help
--
2.21.0
This adds testing for pidfd_open():
- test that no invalid flags can be passed to pidfd_open()
- test that no invalid pid can be passed to pidfd_open()
- test that a pidfd can be retrieved with pidfd_open()
- test whether a pidfd can be converted into an fd to /proc/<pid> to get
metadata access
- test that a pidfd retrieved based on a pid that has been recycled cannot
be converted into /proc/<pid> for that recycled pid
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
---
tools/testing/selftests/pidfd/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 57 +++
.../testing/selftests/pidfd/pidfd_open_test.c | 361 ++++++++++++++++++
tools/testing/selftests/pidfd/pidfd_test.c | 41 +-
4 files changed, 420 insertions(+), 41 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd.h
create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..b36c0be70848 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
CFLAGS += -g -I../../../../usr/include/
-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidfd_open_test
include ../lib.mk
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
new file mode 100644
index 000000000000..8452e910463f
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __PIDFD_H
+#define __PIDFD_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+
+#include "../kselftest.h"
+
+/*
+ * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
+ * That means, when it wraps around any pid < 300 will be skipped.
+ * So we need to use a pid > 300 in order to test recycling.
+ */
+#define PID_RECYCLE 1000
+
+/*
+ * Define a few custom error codes for the child process to clearly indicate
+ * what is happening. This way we can tell the difference between a system
+ * error, a test error, etc.
+ */
+#define PIDFD_PASS 0
+#define PIDFD_FAIL 1
+#define PIDFD_ERROR 2
+#define PIDFD_SKIP 3
+#define PIDFD_XFAIL 4
+
+int wait_for_pid(pid_t pid)
+{
+ int status, ret;
+
+again:
+ ret = waitpid(pid, &status, 0);
+ if (ret == -1) {
+ if (errno == EINTR)
+ goto again;
+
+ return -1;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+
+#endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
new file mode 100644
index 000000000000..ebe7a96cc8d6
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -0,0 +1,361 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+
+static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+ return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int safe_int(const char *numstr, int *converted)
+{
+ char *err = NULL;
+ signed long int sli;
+
+ errno = 0;
+ sli = strtol(numstr, &err, 0);
+ if (errno == ERANGE && (sli == LONG_MAX || sli == LONG_MIN))
+ return -ERANGE;
+
+ if (errno != 0 && sli == 0)
+ return -EINVAL;
+
+ if (err == numstr || *err != '\0')
+ return -EINVAL;
+
+ if (sli > INT_MAX || sli < INT_MIN)
+ return -ERANGE;
+
+ *converted = (int)sli;
+ return 0;
+}
+
+static int char_left_gc(const char *buffer, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t')
+ continue;
+
+ return i;
+ }
+
+ return 0;
+}
+
+static int char_right_gc(const char *buffer, size_t len)
+{
+ int i;
+
+ for (i = len - 1; i >= 0; i--) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t' ||
+ buffer[i] == '\n' ||
+ buffer[i] == '\0')
+ continue;
+
+ return i + 1;
+ }
+
+ return 0;
+}
+
+static char *trim_whitespace_in_place(char *buffer)
+{
+ buffer += char_left_gc(buffer, strlen(buffer));
+ buffer[char_right_gc(buffer, strlen(buffer))] = '\0';
+ return buffer;
+}
+
+static pid_t get_pid_from_status_file(int *fd, const char *key, size_t keylen)
+{
+ int ret;
+ FILE *f;
+ size_t n = 0;
+ pid_t result = -1;
+ char *line = NULL;
+
+ /* fd now belongs to FILE and will be closed by fclose() */
+ f = fdopen(*fd, "r");
+ if (!f)
+ return -1;
+
+ while (getline(&line, &n, f) != -1) {
+ char *numstr;
+
+ if (strncmp(line, key, keylen))
+ continue;
+
+ numstr = trim_whitespace_in_place(line + 4);
+ ret = safe_int(numstr, &result);
+ if (ret < 0)
+ goto out;
+
+ break;
+ }
+
+out:
+ free(line);
+ fclose(f);
+ *fd = -1;
+ return result;
+}
+
+static int set_ns_last_pid(void)
+{
+ int fd;
+ ssize_t bytes;
+
+ fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY | O_CLOEXEC);
+ if (fd < 0) {
+ ksft_print_msg("failed to open \"/proc/sys/kernel/ns_last_pid\"n");
+ return -1;
+ }
+
+ bytes = write(fd, "999", sizeof("999") - 1);
+ close(fd);
+ if (bytes < 0 || (size_t)bytes != (sizeof("999") - 1))
+ return -1;
+
+ return 0;
+}
+
+static int test_pidfd_to_procfd_recycled_pid_fail(void)
+{
+ const char *test_name = "pidfd_open pidfd to procfd conversion on pid recycling";
+ int ret;
+ pid_t pid1;
+
+ ret = unshare(CLONE_NEWPID);
+ if (ret < 0)
+ ksft_exit_fail_msg("%s test: Failed to unshare pid namespace\n",
+ test_name);
+
+ ret = unshare(CLONE_NEWNS);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to unshare mount namespace\n",
+ test_name);
+
+ ret = mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("%s test: Failed to remount / private\n",
+ test_name);
+
+ /* pid 1 in new pid namespace */
+ pid1 = fork();
+ if (pid1 < 0)
+ ksft_exit_fail_msg("%s test: Failed to create new process\n",
+ test_name);
+
+ /*
+ * This is lazy with file descriptor closing in the child process but
+ * it doesn't matter since we are very short-lived anyway and the
+ * cleanup would just be more complexity in this test.
+ */
+ if (pid1 == 0) {
+ int pidfd, procpidfd, procfd, ret;
+ pid_t pid2;
+
+
+ (void)umount2("/proc", MNT_DETACH);
+ ret = mount("proc", "/proc", "proc", 0, NULL);
+ if (ret < 0) {
+ ksft_print_msg("failed to mount a fresh /proc instance\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ /* get pidfd for pid 1000 */
+ if (set_ns_last_pid() < 0) {
+ ksft_print_msg("failed to set ns_last_pid\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ pid2 = fork();
+ if (pid2 < 0) {
+ ksft_print_msg("failed to create new process\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ if (pid2 == 0)
+ _exit(PIDFD_PASS);
+
+ if (pid2 == PID_RECYCLE) {
+ pidfd = sys_pidfd_open(pid2, 0);
+ } else {
+ ksft_print_msg("failed to create process with pid %d\n",
+ PID_RECYCLE);
+ _exit(PIDFD_ERROR);
+ }
+
+ if (wait_for_pid(pid2))
+ _exit(PIDFD_ERROR);
+
+ if (pidfd < 0)
+ _exit(PIDFD_ERROR);
+
+ /* recycle pid 1000 */
+ if (set_ns_last_pid() < 0) {
+ ksft_print_msg("failed to set ns_last_pid\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ pid2 = fork();
+ if (pid2 < 0) {
+ ksft_print_msg("failed to create new process\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ if (pid2 == 0)
+ _exit(PIDFD_PASS);
+
+ if (pid2 != PID_RECYCLE) {
+ ksft_print_msg("failed to recycle pid %d\n",
+ PID_RECYCLE);
+ _exit(PIDFD_ERROR);
+ }
+
+ /*
+ * Now PID_RECYCLE is in zombie state since we have not waited
+ * on it yet, but it might have already exited. This ensures
+ * that the /proc/<pid> directory stays around.
+ */
+ procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (procfd < 0) {
+ ksft_print_msg("%s - failed to open /proc\n",
+ strerror(errno));
+ _exit(PIDFD_ERROR);
+ }
+
+ procpidfd = ioctl(pidfd, PIDFD_GET_PROCFD, procfd);
+ if (procpidfd >= 0) {
+ ksft_print_msg(
+ "managed to get access to /proc/<pid> of recycled pid\n");
+ _exit(PIDFD_ERROR);
+ }
+
+ if (wait_for_pid(pid2))
+ _exit(PIDFD_ERROR);
+
+ _exit(PIDFD_PASS);
+ }
+
+ if (wait_for_pid(pid1) != 0)
+ return -1;
+
+ ksft_test_result_pass("%s test: passed\n", test_name);
+ ksft_inc_pass_cnt();
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int ret = 1;
+ int pidfd = -1, procfd = -1, procpidfd = -1, statusfd = -1;
+ pid_t pid;
+
+ pidfd = sys_pidfd_open(-1, 0);
+ if (pidfd >= 0) {
+ ksft_print_msg(
+ "%s - succeeded to open pidfd for invalid pid -1\n",
+ strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("do not allow invalid pid test: passed\n");
+ ksft_inc_pass_cnt();
+
+ pidfd = sys_pidfd_open(getpid(), 1);
+ if (pidfd >= 0) {
+ ksft_print_msg(
+ "%s - succeeded to open pidfd with invalid flag value specified\n",
+ strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("do not allow invalid flag test: passed\n");
+ ksft_inc_pass_cnt();
+
+ pidfd = sys_pidfd_open(getpid(), 0);
+ if (pidfd < 0) {
+ ksft_print_msg("%s - failed to open pidfd\n", strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("open a new pidfd test: passed\n");
+ ksft_inc_pass_cnt();
+
+ procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (procfd < 0) {
+ ksft_print_msg("%s - failed to open /proc\n", strerror(errno));
+ goto on_error;
+ }
+
+ procpidfd = ioctl(pidfd, PIDFD_GET_PROCFD, procfd);
+ if (procpidfd < 0) {
+ ksft_print_msg(
+ "%s - failed to retrieve /proc/<pid> from pidfd\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ statusfd = openat(procpidfd, "status", O_CLOEXEC | O_RDONLY);
+ if (statusfd < 0) {
+ ksft_print_msg("%s - failed to open /proc/<pid>/status\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ pid = get_pid_from_status_file(&statusfd, "Pid:", sizeof("Pid:") - 1);
+ if (pid < 0) {
+ ksft_print_msg(
+ "%s - failed to retrieve pid from /proc/<pid>/status\n",
+ strerror(errno));
+ goto on_error;
+ }
+
+ if (pid != getpid()) {
+ ksft_print_msg(
+ "%s - actual pid %d does not equal retrieved pid from /proc/<pid>/status\n",
+ strerror(errno), pid, getpid());
+ goto on_error;
+ }
+ ksft_test_result_pass("convert pidfd to /proc/<pid> fd test: passed\n");
+ ksft_inc_pass_cnt();
+
+ ret = test_pidfd_to_procfd_recycled_pid_fail();
+
+on_error:
+ if (pidfd >= 0)
+ close(pidfd);
+
+ if (procfd >= 0)
+ close(procfd);
+
+ if (procpidfd >= 0)
+ close(procpidfd);
+
+ if (statusfd >= 0)
+ close(statusfd);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
diff --git a/tools/testing/selftests/pidfd/pidfd_test.c b/tools/testing/selftests/pidfd/pidfd_test.c
index d59378a93782..f01de87249c9 100644
--- a/tools/testing/selftests/pidfd/pidfd_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_test.c
@@ -14,6 +14,7 @@
#include <sys/wait.h>
#include <unistd.h>
+#include "pidfd.h"
#include "../kselftest.h"
static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
@@ -62,28 +63,6 @@ static int test_pidfd_send_signal_simple_success(void)
return 0;
}
-static int wait_for_pid(pid_t pid)
-{
- int status, ret;
-
-again:
- ret = waitpid(pid, &status, 0);
- if (ret == -1) {
- if (errno == EINTR)
- goto again;
-
- return -1;
- }
-
- if (ret != pid)
- goto again;
-
- if (!WIFEXITED(status))
- return -1;
-
- return WEXITSTATUS(status);
-}
-
static int test_pidfd_send_signal_exited_fail(void)
{
int pidfd, ret, saved_errno;
@@ -128,13 +107,6 @@ static int test_pidfd_send_signal_exited_fail(void)
return 0;
}
-/*
- * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
- * That means, when it wraps around any pid < 300 will be skipped.
- * So we need to use a pid > 300 in order to test recycling.
- */
-#define PID_RECYCLE 1000
-
/*
* Maximum number of cycles we allow. This is equivalent to PID_MAX_DEFAULT.
* If users set a higher limit or we have cycled PIDFD_MAX_DEFAULT number of
@@ -143,17 +115,6 @@ static int test_pidfd_send_signal_exited_fail(void)
*/
#define PIDFD_MAX_DEFAULT 0x8000
-/*
- * Define a few custom error codes for the child process to clearly indicate
- * what is happening. This way we can tell the difference between a system
- * error, a test error, etc.
- */
-#define PIDFD_PASS 0
-#define PIDFD_FAIL 1
-#define PIDFD_ERROR 2
-#define PIDFD_SKIP 3
-#define PIDFD_XFAIL 4
-
static int test_pidfd_send_signal_recycled_pid_fail(void)
{
int i, ret;
--
2.21.0
/* Introduction */
This adds the pidfd_open() syscall.
pidfd_open() allows to retrieve file descriptors for a given pid. This
includes both file descriptors for processes and file descriptors for
threads.
With the addition of this syscalls pidfd become independent of procfs just
as pids are. Of course, if CONFIG_PROC_FS is not set then metadata access
for processes will not be possible but everything else will just work fine.
In addition, this allows us to remove the dependency of pidfd_send_signal()
on procfs and enable it unconditionally.
With the ability to call pidfd_open() on tids we can now add a flag to
pidfd_send_signal() to signal to a specific thread capturing the
functionality of tgkill() and related thread-focused signal syscalls.
The desire to lift the restriction for pidfds on procfs has been expressed
by multiple people (cf. the commit message of commit
3eb39f47934f9d5a3027fe00d906a45fe3a15fad and [2]).
/* Signature */
int pidfd_open(pid_t pid, unsigned int flags);
/* pidfds are anon inode file descriptors */
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. The pidfds are not
associated with a specific pid namespaces but rather only reference struct
pid of a given process in their private_data member.
Additionally, Andy made an argument that we should go forward with
non-proc-dirfd file descriptors for the sake of security and extensibility
(cf. [3]). This will unblock or help move along work on pidfd_wait which
is currently ongoing.
/* Process Metadata Access */
One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/<pid> directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/<pid> based on a pidfd
(cf. [5]). The prefered solution for how to do this has been to implement
an ioctl that for pidfds that translates a pidfd into a dirfd for
/proc/<pid>. This has been implemented in this patchset as well. If
PIDFD_GET_PROCFD is passed as a command to an ioctl() taking a pidfd and an
fd referring to a procfs directory as an argument a corresponding dirfd to
/proc/<pid> can be retrieved.
The ioctl() makes very sure that the struct pid associated with the
/proc/<pid> fd is identical to the struct pid stashed in the pidfd. This
ensures that we avoid pid recycling issues.
/* Example */
int pidfd = pidfd_open(1234, 0);
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = ioctl(pidfd, PIDFD_GET_PROCFD, procfd);
int statusfd = openat(procpidfd, "status", O_RDONLY | O_CLOEXEC);
int ret = read(statusfd, buf, sizeof(buf));
ret = pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com
[4]: https://lore.kernel.org/lkml/CAHk-=wgmKZm-fESEiLq_W37sKpqCY89nQkPNfWhvF_CQ1ANgcw@mail.gmail.com
[5]: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/pid.h | 2 +
include/linux/syscalls.h | 1 +
include/uapi/linux/wait.h | 2 +
kernel/pid.c | 181 +++++++++++++++++++++++++
6 files changed, 188 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1f9607ed087c..c8046f261bee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -433,3 +433,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4378d4..f714a3d57b88 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
extern struct pid init_struct_pid;
+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..117463673fb5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
index ac49a220cf2a..d6c7c0701997 100644
--- a/include/uapi/linux/wait.h
+++ b/include/uapi/linux/wait.h
@@ -18,5 +18,7 @@
#define P_PID 1
#define P_PGID 2
+/* Get a file descriptor for /proc/<pid> of the corresponding pidfd */
+#define PIDFD_GET_PROCFD _IOR('p', 1, int)
#endif /* _UAPI_LINUX_WAIT_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..8c9e15e0e463 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,8 +26,10 @@
*
*/
+#include <linux/anon_inodes.h>
#include <linux/mm.h>
#include <linux/export.h>
+#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/rculist.h>
@@ -40,6 +42,7 @@
#include <linux/proc_fs.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
+#include <linux/wait.h>
struct pid init_struct_pid = {
.count = ATOMIC_INIT(1),
@@ -451,6 +454,184 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+#ifdef CONFIG_PROC_FS
+static struct pid_namespace *pidfd_get_proc_pid_ns(const struct file *file)
+{
+ struct inode *inode;
+ struct super_block *sb;
+
+ inode = file_inode(file);
+ sb = inode->i_sb;
+ if (sb->s_magic != PROC_SUPER_MAGIC)
+ return ERR_PTR(-EINVAL);
+
+ if (inode->i_ino != PROC_ROOT_INO)
+ return ERR_PTR(-EINVAL);
+
+ return get_pid_ns(inode->i_sb->s_fs_info);
+}
+
+static struct pid *pidfd_get_pid(const struct file *file)
+{
+ if (file->f_op != &pidfd_fops)
+ return ERR_PTR(-EINVAL);
+
+ return get_pid(file->private_data);
+}
+
+static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
+ const struct pid *pidfd_pid)
+{
+ char name[12]; /* int to strlen + \0 but with */
+ struct file *file;
+ struct pid *proc_pid;
+
+ snprintf(name, sizeof(name), "%d", pid);
+ file = file_open_root(procf->f_path.dentry, procf->f_path.mnt, name,
+ O_DIRECTORY | O_RDONLY | O_NOFOLLOW, 0);
+ if (IS_ERR(file))
+ return file;
+
+ proc_pid = tgid_pidfd_to_pid(file);
+ if (IS_ERR(proc_pid)) {
+ filp_close(file, NULL);
+ return ERR_CAST(proc_pid);
+ }
+
+ if (pidfd_pid != proc_pid) {
+ filp_close(file, NULL);
+ return ERR_PTR(-ESRCH);
+ }
+
+ return file;
+}
+
+static inline int pidfd_to_procfd(int procfd, struct file *pidfd_file)
+{
+ long fd;
+ pid_t ns_pid;
+ struct fd fdproc;
+ struct file *file = NULL;
+ struct pid *pidfd_pid = NULL;
+ struct pid_namespace *proc_pid_ns = NULL;
+
+ fdproc = fdget(procfd);
+ if (!fdproc.file)
+ return -EBADF;
+
+ proc_pid_ns = pidfd_get_proc_pid_ns(fdproc.file);
+ if (IS_ERR(proc_pid_ns)) {
+ fd = PTR_ERR(proc_pid_ns);
+ proc_pid_ns = NULL;
+ goto err;
+ }
+
+ pidfd_pid = pidfd_get_pid(pidfd_file);
+ if (IS_ERR(pidfd_pid)) {
+ fd = PTR_ERR(pidfd_pid);
+ pidfd_pid = NULL;
+ goto err;
+ }
+
+ ns_pid = pid_nr_ns(pidfd_pid, proc_pid_ns);
+ if (!ns_pid) {
+ fd = -ESRCH;
+ goto err;
+ }
+
+ file = pidfd_open_proc_pid(fdproc.file, ns_pid, pidfd_pid);
+ if (IS_ERR(file)) {
+ fd = PTR_ERR(file);
+ file = NULL;
+ goto err;
+ }
+
+ fd = get_unused_fd_flags(O_CLOEXEC);
+ if (fd < 0)
+ goto err;
+
+ fsnotify_open(file);
+ fd_install(fd, file);
+ file = NULL;
+
+err:
+ fdput(fdproc);
+ if (proc_pid_ns)
+ put_pid_ns(proc_pid_ns);
+ put_pid(pidfd_pid);
+ if (file)
+ filp_close(file, NULL);
+
+ return fd;
+}
+#else
+static inline int pidfd_to_procfd(int procfd, struct file *pidfd_file)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_FS */
+
+static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ int procfd = arg;
+
+ switch (cmd) {
+ case PIDFD_GET_PROCFD:
+ return pidfd_to_procfd(procfd, file);
+ default:
+ return -ENOTTY;
+ }
+}
+
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ if (pid) {
+ file->private_data = NULL;
+ put_pid(pid);
+ }
+
+ return 0;
+}
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+ .unlocked_ioctl = pidfd_ioctl,
+};
+
+static int pidfd_create_fd_cloexec(pid_t pid)
+{
+ int fd;
+ struct pid *p;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ fd = anon_inode_getfd("pidfd", &pidfd_fops, p, O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(p);
+
+ return fd;
+}
+
+/*
+ * pidfd_open - open a pidfd
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ return pidfd_create_fd_cloexec(pid);
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
With the addition of pidfd_open() it is possible for users to reference a
specific thread by doing:
int pidfd = pidfd_open(<tid>, 0);
This means we can extend pidfd_send_signal() to signal a specific thread.
As promised in the commit for pidfd_send_signal() [1] the extension is
based on a flag argument, i.e. the scope of the signal delivery is based on
the flag argument, not on the type of file descriptor.
To this end the flag PIDFD_SIGNAL_TID is added. With this change we now
cover most of the functionality of all the other signal sending functions
combined:
- pidfd_send_signal(<pidfd>, <sig>, NULL, 0);
which is equivalent to
kill(<positive-pid>, <signal>)
- pidfd_send_signal(<pidfd>, <sig>, <info>, 0);
which is equivalent to
rt_sigqueueinfo(<tgid>, <sig>, <uinfo>)
- pidfd_send_signal(<pidfd>, <sig>, NULL, PIDFD_SIGNAL_TID);
which is equivalent to
tgkill(<tgid>, <tid>, <signal)
rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
- pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
which is equivalent to
rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
/* References */
[1]: commit 3eb39f47934 ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Florian Weimer <[email protected]>
---
include/uapi/linux/wait.h | 3 +
kernel/signal.c | 116 ++++++++++++++++++++++++++------------
2 files changed, 82 insertions(+), 37 deletions(-)
diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
index d6c7c0701997..b72f0ef84fe5 100644
--- a/include/uapi/linux/wait.h
+++ b/include/uapi/linux/wait.h
@@ -21,4 +21,7 @@
/* Get a file descriptor for /proc/<pid> of the corresponding pidfd */
#define PIDFD_GET_PROCFD _IOR('p', 1, int)
+/* Flags to pass to pidfd_send_signal */
+#define PIDFD_SIGNAL_TID 1 /* Send signal to specific thread */
+
#endif /* _UAPI_LINUX_WAIT_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index eb97d0cc6ef7..9f93da85b2b9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3557,17 +3557,86 @@ static struct pid *pidfd_to_pid(const struct file *file)
return tgid_pidfd_to_pid(file);
}
+static int __do_send_specific(struct task_struct *p, int sig,
+ struct kernel_siginfo *info)
+{
+ int error = -ESRCH;
+
+ error = check_kill_permission(sig, info, p);
+ /*
+ * The null signal is a permissions and process existence probe.
+ * No signal is actually delivered.
+ */
+ if (!error && sig) {
+ error = do_send_sig_info(sig, info, p, PIDTYPE_PID);
+ /*
+ * If lock_task_sighand() failed we pretend the task
+ * dies after receiving the signal. The window is tiny,
+ * and the signal is private anyway.
+ */
+ if (unlikely(error == -ESRCH))
+ error = 0;
+ }
+
+ return error;
+}
+
+static int do_send_specific(pid_t tgid, pid_t pid, int sig,
+ struct kernel_siginfo *info)
+{
+ struct task_struct *p;
+ int error = -ESRCH;
+
+ rcu_read_lock();
+ p = find_task_by_vpid(pid);
+ if (p && (tgid <= 0 || task_tgid_vnr(p) == tgid))
+ error = __do_send_specific(p, sig, info);
+ rcu_read_unlock();
+
+ return error;
+}
+
+static int pidfd_send_signal_specific(struct pid *pid, int sig,
+ struct kernel_siginfo *info)
+{
+ struct task_struct *p;
+ int error = -ESRCH;
+
+ rcu_read_lock();
+ p = pid_task(pid, PIDTYPE_PID);
+ if (p)
+ error = __do_send_specific(p, sig, info);
+ rcu_read_unlock();
+
+ return error;
+}
+
/**
- * sys_pidfd_send_signal - send a signal to a process through a task file
- * descriptor
+ * sys_pidfd_send_signal - send a signal to a process through a pidfd
+
* @pidfd: the file descriptor of the process
* @sig: signal to be sent
* @info: the signal info
* @flags: future flags to be passed
*
- * The syscall currently only signals via PIDTYPE_PID which covers
- * kill(<positive-pid>, <signal>. It does not signal threads or process
- * groups.
+ * The syscall currently covers:
+ * - pidfd_send_signal(<pidfd>, <sig>, NULL, 0);
+ * which is equivalent to
+ * kill(<positive-pid>, <signal>)
+ *
+ * - pidfd_send_signal(<pidfd>, <sig>, <info>, 0);
+ * which is equivalent to
+ * rt_sigqueueinfo(<tgid>, <sig>, <uinfo>)
+ *
+ * - pidfd_send_signal(<pidfd>, <sig>, NULL, PIDFD_SIGNAL_TID);
+ * which is equivalent to
+ * tgkill(<tgid>, <tid>, <signal)
+ *
+ * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
+ * - pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
+ * which is equivalent to
+ * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
+ *
* In order to extend the syscall to threads and process groups the @flags
* argument should be used. In essence, the @flags argument will determine
* what is signaled and not the file descriptor itself. Put in other words,
@@ -3585,7 +3654,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
kernel_siginfo_t kinfo;
/* Enforce flags be set to 0 until we add an extension. */
- if (flags)
+ if (flags & ~PIDFD_SIGNAL_TID)
return -EINVAL;
f = fdget(pidfd);
@@ -3626,43 +3695,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
prepare_kill_siginfo(sig, &kinfo);
}
- ret = kill_pid_info(sig, &kinfo, pid);
+ if (flags & PIDFD_SIGNAL_TID)
+ ret = pidfd_send_signal_specific(pid, sig, &kinfo);
+ else
+ ret = kill_pid_info(sig, &kinfo, pid);
err:
fdput(f);
return ret;
}
-static int
-do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
-{
- struct task_struct *p;
- int error = -ESRCH;
-
- rcu_read_lock();
- p = find_task_by_vpid(pid);
- if (p && (tgid <= 0 || task_tgid_vnr(p) == tgid)) {
- error = check_kill_permission(sig, info, p);
- /*
- * The null signal is a permissions and process existence
- * probe. No signal is actually delivered.
- */
- if (!error && sig) {
- error = do_send_sig_info(sig, info, p, PIDTYPE_PID);
- /*
- * If lock_task_sighand() failed we pretend the task
- * dies after receiving the signal. The window is tiny,
- * and the signal is private anyway.
- */
- if (unlikely(error == -ESRCH))
- error = 0;
- }
- }
- rcu_read_unlock();
-
- return error;
-}
-
static int do_tkill(pid_t tgid, pid_t pid, int sig)
{
struct kernel_siginfo info;
--
2.21.0
Let pidfd_send_signal() use pidfds retrieved via pidfd_open(). With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: "Dmitry V. Levin" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Nagarathnam Muthusamy <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Florian Weimer <[email protected]>
---
kernel/signal.c | 14 ++++++++++----
kernel/sys_ni.c | 3 ---
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..eb97d0cc6ef7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
return kill_something_info(sig, &info, pid);
}
-#ifdef CONFIG_PROC_FS
/*
* Verify that the signaler and signalee either are in the same pid namespace
* or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
return copy_siginfo_from_user(kinfo, info);
}
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return tgid_pidfd_to_pid(file);
+}
+
/**
* sys_pidfd_send_signal - send a signal to a process through a task file
* descriptor
@@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
if (flags)
return -EINVAL;
- f = fdget_raw(pidfd);
+ f = fdget(pidfd);
if (!f.file)
return -EBADF;
/* Is this a pidfd? */
- pid = tgid_pidfd_to_pid(f.file);
+ pid = pidfd_to_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
goto err;
@@ -3625,7 +3632,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
fdput(f);
return ret;
}
-#endif /* CONFIG_PROC_FS */
static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
/* kernel/sched/core.c */
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
/* kernel/sys.c */
COND_SYSCALL(setregid);
COND_SYSCALL(setgid);
--
2.21.0
On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> /* Introduction */
> This adds the pidfd_open() syscall.
> pidfd_open() allows to retrieve file descriptors for a given pid. This
> includes both file descriptors for processes and file descriptors for
> threads.
Looks good to me, overall. Apart from a few nits below:
Reviewed-by: Jann Horn <[email protected]>
[...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..8c9e15e0e463 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
[...]
> +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
> + const struct pid *pidfd_pid)
> +{
> + char name[12]; /* int to strlen + \0 but with */
nit: comment suddenly ends at "but with"?
[...]
> +}
> +
> +static inline int pidfd_to_procfd(int procfd, struct file *pidfd_file)
> +{
> + long fd;
nit: This should probably be an int?
[...]
> + return fd;
> +}
[...]
> +static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + int procfd = arg;
nit: I think it'd be semantically cleaner to move this assignment into
the switch case, but I don't feel about it strongly.
> + switch (cmd) {
> + case PIDFD_GET_PROCFD:
> + return pidfd_to_procfd(procfd, file);
> + default:
> + return -ENOTTY;
> + }
> +}
On Sat, Mar 30, 2019 at 12:45:46AM +0100, Jann Horn wrote:
> On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> > /* Introduction */
> > This adds the pidfd_open() syscall.
> > pidfd_open() allows to retrieve file descriptors for a given pid. This
> > includes both file descriptors for processes and file descriptors for
> > threads.
>
> Looks good to me, overall. Apart from a few nits below:
> Reviewed-by: Jann Horn <[email protected]>
Thanks! Will fixup the nits and add your Reviewed-by!
>
> [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..8c9e15e0e463 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> [...]
> > +static struct file *pidfd_open_proc_pid(const struct file *procf, pid_t pid,
> > + const struct pid *pidfd_pid)
> > +{
> > + char name[12]; /* int to strlen + \0 but with */
>
> nit: comment suddenly ends at "but with"?
Will fix.
>
> [...]
> > +}
> > +
> > +static inline int pidfd_to_procfd(int procfd, struct file *pidfd_file)
> > +{
> > + long fd;
>
> nit: This should probably be an int?
Yes.
>
> [...]
> > + return fd;
> > +}
> [...]
> > +static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > +{
> > + int procfd = arg;
>
> nit: I think it'd be semantically cleaner to move this assignment into
> the switch case, but I don't feel about it strongly.
Agreed.
>
> > + switch (cmd) {
> > + case PIDFD_GET_PROCFD:
> > + return pidfd_to_procfd(procfd, file);
> > + default:
> > + return -ENOTTY;
> > + }
> > +}
On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> With the addition of pidfd_open() it is possible for users to reference a
> specific thread by doing:
>
> int pidfd = pidfd_open(<tid>, 0);
>
> This means we can extend pidfd_send_signal() to signal a specific thread.
> As promised in the commit for pidfd_send_signal() [1] the extension is
> based on a flag argument, i.e. the scope of the signal delivery is based on
> the flag argument, not on the type of file descriptor.
> To this end the flag PIDFD_SIGNAL_TID is added. With this change we now
> cover most of the functionality of all the other signal sending functions
> combined:
[...]
> diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> index d6c7c0701997..b72f0ef84fe5 100644
> --- a/include/uapi/linux/wait.h
> +++ b/include/uapi/linux/wait.h
[...]
> +/* Flags to pass to pidfd_send_signal */
> +#define PIDFD_SIGNAL_TID 1 /* Send signal to specific thread */
nit: s/1/1U/; the flags argument is an `unsigned int`
> #endif /* _UAPI_LINUX_WAIT_H */
> diff --git a/kernel/signal.c b/kernel/signal.c
> index eb97d0cc6ef7..9f93da85b2b9 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
[...]
> +static int pidfd_send_signal_specific(struct pid *pid, int sig,
> + struct kernel_siginfo *info)
> +{
> + struct task_struct *p;
> + int error = -ESRCH;
> +
> + rcu_read_lock();
> + p = pid_task(pid, PIDTYPE_PID);
> + if (p)
> + error = __do_send_specific(p, sig, info);
> + rcu_read_unlock();
> +
> + return error;
> +}
> +
> /**
> - * sys_pidfd_send_signal - send a signal to a process through a task file
> - * descriptor
> + * sys_pidfd_send_signal - send a signal to a process through a pidfd
> +
> * @pidfd: the file descriptor of the process
> * @sig: signal to be sent
> * @info: the signal info
> * @flags: future flags to be passed
nit: comment is outdated, it isn't "future flags" anymore
[...]
> + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> + * - pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
> + * which is equivalent to
> + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> + *
> * In order to extend the syscall to threads and process groups the @flags
> * argument should be used. In essence, the @flags argument will determine
> * what is signaled and not the file descriptor itself. Put in other words,
nit: again, outdated comment about @flags
[...]
> @@ -3626,43 +3695,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> prepare_kill_siginfo(sig, &kinfo);
> }
>
> - ret = kill_pid_info(sig, &kinfo, pid);
> + if (flags & PIDFD_SIGNAL_TID)
> + ret = pidfd_send_signal_specific(pid, sig, &kinfo);
> + else
> + ret = kill_pid_info(sig, &kinfo, pid);
nit: maybe give pidfd_send_signal_specific() and kill_pid_info() the
same signatures, since they perform similar operations with the same
argument types?
Something that was already kinda weird in the existing code, but is
getting worse with TIDs is the handling of SI_USER with siginfo.
Copying context lines from above here:
if (info) {
ret = copy_siginfo_from_user_any(&kinfo, info);
if (unlikely(ret))
goto err;
ret = -EINVAL;
if (unlikely(sig != kinfo.si_signo))
goto err;
if ((task_pid(current) != pid) &&
(kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
/* Only allow sending arbitrary signals to yourself. */
ret = -EPERM;
if (kinfo.si_code != SI_USER)
goto err;
/* Turn this into a regular kill signal. */
prepare_kill_siginfo(sig, &kinfo);
}
} else {
prepare_kill_siginfo(sig, &kinfo);
}
So for signals to PIDs, the rule is that if you send siginfo with
SI_USER to yourself, the siginfo is preserved; otherwise the kernel
silently clobbers it. That's already kind of weird - silent behavior
difference depending on a security check. But now, for signals to
threads, I think the result is going to be that signalling the thread
group leader preserves information, and signalling any other thread
clobbers it? If so, that seems bad.
do_rt_sigqueueinfo() seems to have the same issue, from a glance - but
there, at least the error case is just a -EPERM, not a silent behavior
difference.
Would it make sense to refuse sending siginfo with SI_USER to
non-current? If you actually want to send a normal SI_USER signal, you
can use info==NULL, right? That should create wrongness parity with
do_rt_sigqueueinfo().
To improve things further, I guess you'd have to move the comparison
against current into pidfd_send_signal_specific(), or move the task
lookup out of it, or something like that?
> err:
> fdput(f);
> return ret;
> }
[...]
On Sat, Mar 30, 2019 at 02:06:34AM +0100, Jann Horn wrote:
> On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> > With the addition of pidfd_open() it is possible for users to reference a
> > specific thread by doing:
> >
> > int pidfd = pidfd_open(<tid>, 0);
> >
> > This means we can extend pidfd_send_signal() to signal a specific thread.
> > As promised in the commit for pidfd_send_signal() [1] the extension is
> > based on a flag argument, i.e. the scope of the signal delivery is based on
> > the flag argument, not on the type of file descriptor.
> > To this end the flag PIDFD_SIGNAL_TID is added. With this change we now
> > cover most of the functionality of all the other signal sending functions
> > combined:
> [...]
> > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > index d6c7c0701997..b72f0ef84fe5 100644
> > --- a/include/uapi/linux/wait.h
> > +++ b/include/uapi/linux/wait.h
> [...]
> > +/* Flags to pass to pidfd_send_signal */
> > +#define PIDFD_SIGNAL_TID 1 /* Send signal to specific thread */
>
> nit: s/1/1U/; the flags argument is an `unsigned int`
Will change.
>
> > #endif /* _UAPI_LINUX_WAIT_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index eb97d0cc6ef7..9f93da85b2b9 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> [...]
> > +static int pidfd_send_signal_specific(struct pid *pid, int sig,
> > + struct kernel_siginfo *info)
> > +{
> > + struct task_struct *p;
> > + int error = -ESRCH;
> > +
> > + rcu_read_lock();
> > + p = pid_task(pid, PIDTYPE_PID);
> > + if (p)
> > + error = __do_send_specific(p, sig, info);
> > + rcu_read_unlock();
> > +
> > + return error;
> > +}
> > +
> > /**
> > - * sys_pidfd_send_signal - send a signal to a process through a task file
> > - * descriptor
> > + * sys_pidfd_send_signal - send a signal to a process through a pidfd
> > +
> > * @pidfd: the file descriptor of the process
> > * @sig: signal to be sent
> > * @info: the signal info
> > * @flags: future flags to be passed
>
> nit: comment is outdated, it isn't "future flags" anymore
Will remove.
>
> [...]
> > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > + * - pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
> > + * which is equivalent to
> > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > + *
> > * In order to extend the syscall to threads and process groups the @flags
> > * argument should be used. In essence, the @flags argument will determine
> > * what is signaled and not the file descriptor itself. Put in other words,
>
> nit: again, outdated comment about @flags
Will update.
>
> [...]
> > @@ -3626,43 +3695,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > prepare_kill_siginfo(sig, &kinfo);
> > }
> >
> > - ret = kill_pid_info(sig, &kinfo, pid);
> > + if (flags & PIDFD_SIGNAL_TID)
> > + ret = pidfd_send_signal_specific(pid, sig, &kinfo);
> > + else
> > + ret = kill_pid_info(sig, &kinfo, pid);
>
> nit: maybe give pidfd_send_signal_specific() and kill_pid_info() the
> same signatures, since they perform similar operations with the same
> argument types?
Yes, let's do
pidfd_send_signal_specific.(pid, sig, &kinfo);
kill_pid_info..............(pid, sig, &kinfo);
so it matches the argument order of the syscalls itself too.
>
> Something that was already kinda weird in the existing code, but is
> getting worse with TIDs is the handling of SI_USER with siginfo.
Right, that's what we discussed earlier.
> Copying context lines from above here:
>
> if (info) {
> ret = copy_siginfo_from_user_any(&kinfo, info);
> if (unlikely(ret))
> goto err;
> ret = -EINVAL;
> if (unlikely(sig != kinfo.si_signo))
> goto err;
> if ((task_pid(current) != pid) &&
> (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> /* Only allow sending arbitrary signals to yourself. */
> ret = -EPERM;
> if (kinfo.si_code != SI_USER)
> goto err;
> /* Turn this into a regular kill signal. */
> prepare_kill_siginfo(sig, &kinfo);
> }
> } else {
> prepare_kill_siginfo(sig, &kinfo);
> }
>
> So for signals to PIDs, the rule is that if you send siginfo with
> SI_USER to yourself, the siginfo is preserved; otherwise the kernel
> silently clobbers it. That's already kind of weird - silent behavior
Clobbers as in "silently replaces it whatever it seems fit?
> difference depending on a security check. But now, for signals to
> threads, I think the result is going to be that signalling the thread
> group leader preserves information, and signalling any other thread
> clobbers it? If so, that seems bad.
>
> do_rt_sigqueueinfo() seems to have the same issue, from a glance - but
> there, at least the error case is just a -EPERM, not a silent behavior
> difference.
>
> Would it make sense to refuse sending siginfo with SI_USER to
> non-current? If you actually want to send a normal SI_USER signal, you
Yeah.
> can use info==NULL, right? That should create wrongness parity with
> do_rt_sigqueueinfo().
So you'd just do (just doing it non-elegantly rn):
if ((task_pid(current) != pid) &&
(kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
ret = -EPERM;
goto err;
}
> To improve things further, I guess you'd have to move the comparison
> against current into pidfd_send_signal_specific(), or move the task
> lookup out of it, or something like that?
Looks like a sane suggestion to me. Would you care to send a patch for
that? This is clearly a bugfix suitable for 5.1 so I'd rather not wait
until 5.2.
On Sat, Mar 30, 2019 at 02:22:29AM +0100, Christian Brauner wrote:
> On Sat, Mar 30, 2019 at 02:06:34AM +0100, Jann Horn wrote:
> > On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> > > With the addition of pidfd_open() it is possible for users to reference a
> > > specific thread by doing:
> > >
> > > int pidfd = pidfd_open(<tid>, 0);
> > >
> > > This means we can extend pidfd_send_signal() to signal a specific thread.
> > > As promised in the commit for pidfd_send_signal() [1] the extension is
> > > based on a flag argument, i.e. the scope of the signal delivery is based on
> > > the flag argument, not on the type of file descriptor.
> > > To this end the flag PIDFD_SIGNAL_TID is added. With this change we now
> > > cover most of the functionality of all the other signal sending functions
> > > combined:
> > [...]
> > > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > > index d6c7c0701997..b72f0ef84fe5 100644
> > > --- a/include/uapi/linux/wait.h
> > > +++ b/include/uapi/linux/wait.h
> > [...]
> > > +/* Flags to pass to pidfd_send_signal */
> > > +#define PIDFD_SIGNAL_TID 1 /* Send signal to specific thread */
> >
> > nit: s/1/1U/; the flags argument is an `unsigned int`
>
> Will change.
>
> >
> > > #endif /* _UAPI_LINUX_WAIT_H */
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index eb97d0cc6ef7..9f93da85b2b9 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > [...]
> > > +static int pidfd_send_signal_specific(struct pid *pid, int sig,
> > > + struct kernel_siginfo *info)
> > > +{
> > > + struct task_struct *p;
> > > + int error = -ESRCH;
> > > +
> > > + rcu_read_lock();
> > > + p = pid_task(pid, PIDTYPE_PID);
> > > + if (p)
> > > + error = __do_send_specific(p, sig, info);
> > > + rcu_read_unlock();
> > > +
> > > + return error;
> > > +}
> > > +
> > > /**
> > > - * sys_pidfd_send_signal - send a signal to a process through a task file
> > > - * descriptor
> > > + * sys_pidfd_send_signal - send a signal to a process through a pidfd
> > > +
> > > * @pidfd: the file descriptor of the process
> > > * @sig: signal to be sent
> > > * @info: the signal info
> > > * @flags: future flags to be passed
> >
> > nit: comment is outdated, it isn't "future flags" anymore
>
> Will remove.
>
> >
> > [...]
> > > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > > + * - pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
> > > + * which is equivalent to
> > > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > > + *
> > > * In order to extend the syscall to threads and process groups the @flags
> > > * argument should be used. In essence, the @flags argument will determine
> > > * what is signaled and not the file descriptor itself. Put in other words,
> >
> > nit: again, outdated comment about @flags
>
> Will update.
>
> >
> > [...]
> > > @@ -3626,43 +3695,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > > prepare_kill_siginfo(sig, &kinfo);
> > > }
> > >
> > > - ret = kill_pid_info(sig, &kinfo, pid);
> > > + if (flags & PIDFD_SIGNAL_TID)
> > > + ret = pidfd_send_signal_specific(pid, sig, &kinfo);
> > > + else
> > > + ret = kill_pid_info(sig, &kinfo, pid);
> >
> > nit: maybe give pidfd_send_signal_specific() and kill_pid_info() the
> > same signatures, since they perform similar operations with the same
> > argument types?
>
> Yes, let's do
> pidfd_send_signal_specific.(pid, sig, &kinfo);
> kill_pid_info..............(pid, sig, &kinfo);
>
> so it matches the argument order of the syscalls itself too.
Strike that. We should do:
pidfd_send_signal_specific.(sig, &kinfo, pid);
kill_pid_info..............(sig, &kinfo, pid);
because kill_pid_info() is called in multiple places so we would
needlessly shovle code around.
>
> >
> > Something that was already kinda weird in the existing code, but is
> > getting worse with TIDs is the handling of SI_USER with siginfo.
>
> Right, that's what we discussed earlier.
>
> > Copying context lines from above here:
> >
> > if (info) {
> > ret = copy_siginfo_from_user_any(&kinfo, info);
> > if (unlikely(ret))
> > goto err;
> > ret = -EINVAL;
> > if (unlikely(sig != kinfo.si_signo))
> > goto err;
> > if ((task_pid(current) != pid) &&
> > (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> > /* Only allow sending arbitrary signals to yourself. */
> > ret = -EPERM;
> > if (kinfo.si_code != SI_USER)
> > goto err;
> > /* Turn this into a regular kill signal. */
> > prepare_kill_siginfo(sig, &kinfo);
> > }
> > } else {
> > prepare_kill_siginfo(sig, &kinfo);
> > }
> >
> > So for signals to PIDs, the rule is that if you send siginfo with
> > SI_USER to yourself, the siginfo is preserved; otherwise the kernel
> > silently clobbers it. That's already kind of weird - silent behavior
>
> Clobbers as in "silently replaces it whatever it seems fit?
>
> > difference depending on a security check. But now, for signals to
> > threads, I think the result is going to be that signalling the thread
> > group leader preserves information, and signalling any other thread
> > clobbers it? If so, that seems bad.
> >
> > do_rt_sigqueueinfo() seems to have the same issue, from a glance - but
> > there, at least the error case is just a -EPERM, not a silent behavior
> > difference.
> >
> > Would it make sense to refuse sending siginfo with SI_USER to
> > non-current? If you actually want to send a normal SI_USER signal, you
>
> Yeah.
>
> > can use info==NULL, right? That should create wrongness parity with
> > do_rt_sigqueueinfo().
>
> So you'd just do (just doing it non-elegantly rn):
> if ((task_pid(current) != pid) &&
> (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> ret = -EPERM;
> goto err;
> }
>
> > To improve things further, I guess you'd have to move the comparison
> > against current into pidfd_send_signal_specific(), or move the task
> > lookup out of it, or something like that?
>
> Looks like a sane suggestion to me. Would you care to send a patch for
> that? This is clearly a bugfix suitable for 5.1 so I'd rather not wait
> until 5.2.
On Sat, Mar 30, 2019 at 02:34:16AM +0100, Christian Brauner wrote:
> On Sat, Mar 30, 2019 at 02:22:29AM +0100, Christian Brauner wrote:
> > On Sat, Mar 30, 2019 at 02:06:34AM +0100, Jann Horn wrote:
> > > On Fri, Mar 29, 2019 at 4:54 PM Christian Brauner <[email protected]> wrote:
> > > > With the addition of pidfd_open() it is possible for users to reference a
> > > > specific thread by doing:
> > > >
> > > > int pidfd = pidfd_open(<tid>, 0);
> > > >
> > > > This means we can extend pidfd_send_signal() to signal a specific thread.
> > > > As promised in the commit for pidfd_send_signal() [1] the extension is
> > > > based on a flag argument, i.e. the scope of the signal delivery is based on
> > > > the flag argument, not on the type of file descriptor.
> > > > To this end the flag PIDFD_SIGNAL_TID is added. With this change we now
> > > > cover most of the functionality of all the other signal sending functions
> > > > combined:
> > > [...]
> > > > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > > > index d6c7c0701997..b72f0ef84fe5 100644
> > > > --- a/include/uapi/linux/wait.h
> > > > +++ b/include/uapi/linux/wait.h
> > > [...]
> > > > +/* Flags to pass to pidfd_send_signal */
> > > > +#define PIDFD_SIGNAL_TID 1 /* Send signal to specific thread */
> > >
> > > nit: s/1/1U/; the flags argument is an `unsigned int`
> >
> > Will change.
> >
> > >
> > > > #endif /* _UAPI_LINUX_WAIT_H */
> > > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > > index eb97d0cc6ef7..9f93da85b2b9 100644
> > > > --- a/kernel/signal.c
> > > > +++ b/kernel/signal.c
> > > [...]
> > > > +static int pidfd_send_signal_specific(struct pid *pid, int sig,
> > > > + struct kernel_siginfo *info)
> > > > +{
> > > > + struct task_struct *p;
> > > > + int error = -ESRCH;
> > > > +
> > > > + rcu_read_lock();
> > > > + p = pid_task(pid, PIDTYPE_PID);
> > > > + if (p)
> > > > + error = __do_send_specific(p, sig, info);
> > > > + rcu_read_unlock();
> > > > +
> > > > + return error;
> > > > +}
> > > > +
> > > > /**
> > > > - * sys_pidfd_send_signal - send a signal to a process through a task file
> > > > - * descriptor
> > > > + * sys_pidfd_send_signal - send a signal to a process through a pidfd
> > > > +
> > > > * @pidfd: the file descriptor of the process
> > > > * @sig: signal to be sent
> > > > * @info: the signal info
> > > > * @flags: future flags to be passed
> > >
> > > nit: comment is outdated, it isn't "future flags" anymore
> >
> > Will remove.
> >
> > >
> > > [...]
> > > > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > > > + * - pidfd_send_signal(<pidfd>, <sig>, <info>, PIDFD_SIGNAL_TID);
> > > > + * which is equivalent to
> > > > + * rt_tgsigqueueinfo(<tgid>, <tid>, <sig>, <uinfo>)
> > > > + *
> > > > * In order to extend the syscall to threads and process groups the @flags
> > > > * argument should be used. In essence, the @flags argument will determine
> > > > * what is signaled and not the file descriptor itself. Put in other words,
> > >
> > > nit: again, outdated comment about @flags
> >
> > Will update.
> >
> > >
> > > [...]
> > > > @@ -3626,43 +3695,16 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > > > prepare_kill_siginfo(sig, &kinfo);
> > > > }
> > > >
> > > > - ret = kill_pid_info(sig, &kinfo, pid);
> > > > + if (flags & PIDFD_SIGNAL_TID)
> > > > + ret = pidfd_send_signal_specific(pid, sig, &kinfo);
> > > > + else
> > > > + ret = kill_pid_info(sig, &kinfo, pid);
> > >
> > > nit: maybe give pidfd_send_signal_specific() and kill_pid_info() the
> > > same signatures, since they perform similar operations with the same
> > > argument types?
> >
> > Yes, let's do
> > pidfd_send_signal_specific.(pid, sig, &kinfo);
> > kill_pid_info..............(pid, sig, &kinfo);
> >
> > so it matches the argument order of the syscalls itself too.
>
> Strike that. We should do:
> pidfd_send_signal_specific.(sig, &kinfo, pid);
> kill_pid_info..............(sig, &kinfo, pid);
>
> because kill_pid_info() is called in multiple places so we would
> needlessly shovle code around.
>
> >
> > >
> > > Something that was already kinda weird in the existing code, but is
> > > getting worse with TIDs is the handling of SI_USER with siginfo.
> >
> > Right, that's what we discussed earlier.
> >
> > > Copying context lines from above here:
> > >
> > > if (info) {
> > > ret = copy_siginfo_from_user_any(&kinfo, info);
> > > if (unlikely(ret))
> > > goto err;
> > > ret = -EINVAL;
> > > if (unlikely(sig != kinfo.si_signo))
> > > goto err;
> > > if ((task_pid(current) != pid) &&
> > > (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> > > /* Only allow sending arbitrary signals to yourself. */
> > > ret = -EPERM;
> > > if (kinfo.si_code != SI_USER)
> > > goto err;
> > > /* Turn this into a regular kill signal. */
> > > prepare_kill_siginfo(sig, &kinfo);
> > > }
> > > } else {
> > > prepare_kill_siginfo(sig, &kinfo);
> > > }
> > >
> > > So for signals to PIDs, the rule is that if you send siginfo with
> > > SI_USER to yourself, the siginfo is preserved; otherwise the kernel
> > > silently clobbers it. That's already kind of weird - silent behavior
> >
> > Clobbers as in "silently replaces it whatever it seems fit?
> >
> > > difference depending on a security check. But now, for signals to
> > > threads, I think the result is going to be that signalling the thread
> > > group leader preserves information, and signalling any other thread
> > > clobbers it? If so, that seems bad.
> > >
> > > do_rt_sigqueueinfo() seems to have the same issue, from a glance - but
> > > there, at least the error case is just a -EPERM, not a silent behavior
> > > difference.
> > >
> > > Would it make sense to refuse sending siginfo with SI_USER to
> > > non-current? If you actually want to send a normal SI_USER signal, you
> >
> > Yeah.
> >
> > > can use info==NULL, right? That should create wrongness parity with
> > > do_rt_sigqueueinfo().
> >
> > So you'd just do (just doing it non-elegantly rn):
> > if ((task_pid(current) != pid) &&
> > (kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL)) {
> > ret = -EPERM;
> > goto err;
> > }
> >
> > > To improve things further, I guess you'd have to move the comparison
> > > against current into pidfd_send_signal_specific(), or move the task
> > > lookup out of it, or something like that?
> >
> > Looks like a sane suggestion to me. Would you care to send a patch for
> > that? This is clearly a bugfix suitable for 5.1 so I'd rather not wait
> > until 5.2.
Sorry, that was probably unclear. I was referring to:
> > > Would it make sense to refuse sending siginfo with SI_USER to
> > > non-current? If you actually want to send a normal SI_USER signal, you
This makes perfect sense as a bugfix for 5.1 imo.
On Fri, 2019-03-29 at 16:54 +0100, Christian Brauner wrote:
> diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> index ac49a220cf2a..d6c7c0701997 100644
> --- a/include/uapi/linux/wait.h
> +++ b/include/uapi/linux/wait.h
> @@ -18,5 +18,7 @@
> #define P_PID 1
> #define P_PGID 2
>
> +/* Get a file descriptor for /proc/<pid> of the corresponding pidfd
> */
> +#define PIDFD_GET_PROCFD _IOR('p', 1, int)
>
> #endif /* _UAPI_LINUX_WAIT_H */
This is missing an entry in Documentation/ioctl/ioctl-number.txt and is
actually conflicting with existing entries.
However, I'd actually prefer a syscall to allow strict whitelisting via
seccomp and avoid the other ioctl disadvantages that Daniel has already
mentioned.
Cheers,
Jürg
On Sat, Mar 30, 2019 at 12:53:57PM +0100, Jürg Billeter wrote:
> On Fri, 2019-03-29 at 16:54 +0100, Christian Brauner wrote:
> > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > index ac49a220cf2a..d6c7c0701997 100644
> > --- a/include/uapi/linux/wait.h
> > +++ b/include/uapi/linux/wait.h
> > @@ -18,5 +18,7 @@
> > #define P_PID 1
> > #define P_PGID 2
> >
> > +/* Get a file descriptor for /proc/<pid> of the corresponding pidfd
> > */
> > +#define PIDFD_GET_PROCFD _IOR('p', 1, int)
> >
> > #endif /* _UAPI_LINUX_WAIT_H */
>
> This is missing an entry in Documentation/ioctl/ioctl-number.txt and is
> actually conflicting with existing entries.
Thanks. Yes, Jann mentioned this too.
>
> However, I'd actually prefer a syscall to allow strict whitelisting via
> seccomp and avoid the other ioctl disadvantages that Daniel has already
> mentioned.
You can filter ioctls with seccomp.
I have compromised quite a bit now and I think what we have is perfectly
fine. a single clean syscalls pidfd_open() that lets you get pidfds for
threads and thread-group leaders independent of procfs and a clean,
simple fd->fd converstion ioctl() that is a property of the f_ops of the
pidfd to get an fd to /proc/<pid> for metadata access. Btw, this being a
part of the pidfd f_ops seems strikingly elegant to me. Because it
expresses the notion that the metadata is implicitly part of the pidfd
nicely. But I might just be dumb.
I do not see the need to add another syscall that is conditional on
CONFIG_PROC_FS and only does a pidfd to /proc/<pid>-fd conversion.
That's almost the definition of what an ioctl() is most suited for.
I get the opposition to multiplexers but consider if we where to oppose
all of them. Let's leave ioctls out and just look at a few widely used
multiplexer syscalls:
1. seccomp()
- number of supported commands: 4
2. prctl()
- number of supported commands: 45
3. keyctl()
- number of supported commands: 25
4. bpf()
- number of supported commands: 18
5. proposed fsconfig()
- number of supported commands: 8
Total Number of required syscalls: 100
That means for bpf() alone Linux would have had to gain *18* additional
single syscalls and for the new mount api only for configuring a mount
context 8 additional syscalls would need to be pulled.
That all hinges on the argument that "syscalls are cheap" and that
running out of syscall numbers is not a real problem because there is a
patchset that lifts this restriction _eventually_. That patchset hasn't
been merged yet and I have not even seen it sent out yet. So we're still
short of syscall numbers. _Even_ if this patchset would have landed,
adding 26 syscalls for two apis seems excessive.
So unless Linus jumps in here (Cced) and says that he's fine that the
pidfd to /proc/<pid>-fd conversion is suited for yet another syscalls
what we have here is perfectly acceptable.
Again, as I've said before I don't see the point in sending piles of
syscalls when it is not really justified and I find none of the
arguments against this implementation we have here right now very
convincing.
Christian
On Sat, Mar 30, 2019 at 2:37 PM Christian Brauner <[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 12:53:57PM +0100, Jürg Billeter wrote:
> > On Fri, 2019-03-29 at 16:54 +0100, Christian Brauner wrote:
> > > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > > index ac49a220cf2a..d6c7c0701997 100644
> > > --- a/include/uapi/linux/wait.h
> > > +++ b/include/uapi/linux/wait.h
> > > @@ -18,5 +18,7 @@
> > > #define P_PID 1
> > > #define P_PGID 2
> > >
> > > +/* Get a file descriptor for /proc/<pid> of the corresponding pidfd
> > > */
> > > +#define PIDFD_GET_PROCFD _IOR('p', 1, int)
> > >
> > > #endif /* _UAPI_LINUX_WAIT_H */
> >
> > This is missing an entry in Documentation/ioctl/ioctl-number.txt and is
> > actually conflicting with existing entries.
>
> Thanks. Yes, Jann mentioned this too.
>
> >
> > However, I'd actually prefer a syscall to allow strict whitelisting via
> > seccomp and avoid the other ioctl disadvantages that Daniel has already
> > mentioned.
>
> You can filter ioctls with seccomp.
>
You probably wouldn't even need to, because the only way the ioctl
would be useful is to have a dir fd to the procfs root. As such, the
pidfd file descriptor itself is useless with the ioctl. There's also
no filtering to be done, as one pidfd strictly maps to a specific
task, so it's not that you get access to other things than what you
weren't permitted to, and that's pretty neat the way it is. If /proc
is not mounted in its namespace, you'd need to pass it to the process
explicitly, and if it is, then it doesn't matter anyway (even if it
can open /proc, hidepid based restrictions still work -- it's
essentially a race free openat).
On Fri, Mar 29, 2019 at 8:54 AM Christian Brauner <[email protected]> wrote:
>
> /* Introduction */
> This adds the pidfd_open() syscall.
> pidfd_open() allows to retrieve file descriptors for a given pid. This
> includes both file descriptors for processes and file descriptors for
> threads.
I'm ok with the pidfd_open() call to just get a pidfd, but that
"pidfd_to_profs()" needs to go away.
If you want to open the /proc/<pid>/status file, then just do that.
This whole "do one and convert to the other" needs to die. No, no, no.
Linus
On Sat, Mar 30, 2019 at 9:09 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Mar 29, 2019 at 8:54 AM Christian Brauner <[email protected]> wrote:
> >
> > /* Introduction */
> > This adds the pidfd_open() syscall.
> > pidfd_open() allows to retrieve file descriptors for a given pid. This
> > includes both file descriptors for processes and file descriptors for
> > threads.
>
> I'm ok with the pidfd_open() call to just get a pidfd, but that
> "pidfd_to_profs()" needs to go away.
>
> If you want to open the /proc/<pid>/status file, then just do that.
> This whole "do one and convert to the other" needs to die. No, no, no.
How do you propose that someone open the /proc/<pid>/status file in a
race-free manner starting with the result of calling pidfd_open?
On Sat, Mar 30, 2019 at 9:11 AM Daniel Colascione <[email protected]> wrote:
>
> How do you propose that someone open the /proc/<pid>/status file in a
> race-free manner starting with the result of calling pidfd_open?
CXhrist.
If you want to open a /proc file, then open a /proc file. Don't use
pidfd_open().
Just open /proc/<pid>/
Then you can use openat() to open whatever sub-files you want.
You have two choices: either you use /proc, or you don't. Don't mix the two.
And no, we are *NOT* making pidfd_open() into some "let's expose /proc
even when it's not mounted" horror. Seriously. That's disgusting,
wrong, insecure and stupid.
Linus
On Sat, Mar 30, 2019 at 9:16 AM Linus Torvalds
<[email protected]> wrote:
>
> And no, we are *NOT* making pidfd_open() into some "let's expose /proc
> even when it's not mounted" horror. Seriously. That's disgusting,
> wrong, insecure and stupid.
And btw, this is not debatable. In fact, this whole discussion is
making me feel like I should just revert pidfd, not because the code I
merged is wrong, but because people are clearly intending to do
completely inappropriate things with this.
Get your act together. Stop hacking up garbage.
Linus
On March 30, 2019 5:09:10 PM GMT+01:00, Linus Torvalds <[email protected]> wrote:
>On Fri, Mar 29, 2019 at 8:54 AM Christian Brauner
><[email protected]> wrote:
>>
>> /* Introduction */
>> This adds the pidfd_open() syscall.
>> pidfd_open() allows to retrieve file descriptors for a given pid.
>This
>> includes both file descriptors for processes and file descriptors for
>> threads.
>
>I'm ok with the pidfd_open() call to just get a pidfd, but that
From pure API perspective that's all I care about: independence of procfs.
Once we have pidfd_open() we can cleanly signal threads etc.
>"pidfd_to_profs()" needs to go away.
>
>If you want to open the /proc/<pid>/status file, then just do that.
>This whole "do one and convert to the other" needs to die. No, no, no.
>
> Linus
On Sat, Mar 30, 2019 at 9:19 AM Christian Brauner <[email protected]> wrote:
>
> From pure API perspective that's all I care about: independence of procfs.
> Once we have pidfd_open() we can cleanly signal threads etc.
But "independence from procfs" means that you damn well don't then do
"oh, now I have a pidfd, I want to turn it into a /proc fd and then
munge around there".
So I'm literally saying that it had better really *be* independent
from /proc. It is the standalone version, but it's most definitely
also the version that doesn't then give you secret access to /proc.
And it weorries me a lot that people are trying to play these kinds of
games. I'm just seeing some android patch that adds this horror and
then starts using it.
Linus
On Sat, Mar 30, 2019 at 9:24 AM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 9:19 AM Christian Brauner <[email protected]> wrote:
> >
> > From pure API perspective that's all I care about: independence of procfs.
> > Once we have pidfd_open() we can cleanly signal threads etc.
>
> But "independence from procfs" means that you damn well don't then do
> "oh, now I have a pidfd, I want to turn it into a /proc fd and then
> munge around there".
>
> So I'm literally saying that it had better really *be* independent
> from /proc. It is the standalone version, but it's most definitely
> also the version that doesn't then give you secret access to /proc.
Just to be clear, I'm not proposing granting secret access to procfs,
and as far as I can see, nobody else is either. We've been talking
about making it easier to avoid races when you happen to want a pidfd
and a procfs fd that point to the same process, not granting access
that you didn't have before. If you'd rather not connect procfs and
pidfds, we can take this functionality off the table.
On Sat, Mar 30, 2019 at 09:24:23AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:19 AM Christian Brauner <[email protected]> wrote:
> >
> > From pure API perspective that's all I care about: independence of procfs.
> > Once we have pidfd_open() we can cleanly signal threads etc.
>
> But "independence from procfs" means that you damn well don't then do
> "oh, now I have a pidfd, I want to turn it into a /proc fd and then
> munge around there".
>
> So I'm literally saying that it had better really *be* independent
> from /proc. It is the standalone version, but it's most definitely
> also the version that doesn't then give you secret access to /proc.
>
> And it weorries me a lot that people are trying to play these kinds of
> games. I'm just seeing some android patch that adds this horror and
> then starts using it.
The original need for this conversion thing came from Android indeed. I
won't go forward with any such patches.
Christian
On Sat, Mar 30, 2019 at 09:34:02AM -0700, Daniel Colascione wrote:
> On Sat, Mar 30, 2019 at 9:24 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Sat, Mar 30, 2019 at 9:19 AM Christian Brauner <[email protected]> wrote:
> > >
> > > From pure API perspective that's all I care about: independence of procfs.
> > > Once we have pidfd_open() we can cleanly signal threads etc.
> >
> > But "independence from procfs" means that you damn well don't then do
> > "oh, now I have a pidfd, I want to turn it into a /proc fd and then
> > munge around there".
> >
> > So I'm literally saying that it had better really *be* independent
> > from /proc. It is the standalone version, but it's most definitely
> > also the version that doesn't then give you secret access to /proc.
>
> Just to be clear, I'm not proposing granting secret access to procfs,
> and as far as I can see, nobody else is either. We've been talking
> about making it easier to avoid races when you happen to want a pidfd
> and a procfs fd that point to the same process, not granting access
> that you didn't have before. If you'd rather not connect procfs and
> pidfds, we can take this functionality off the table.
This is dead! Nothing like this will make it through this tree. I have
no intention of endangering pidfd_send_signal().
Christian
On Sat, Mar 30, 2019 at 9:34 AM Daniel Colascione <[email protected]> wrote:
>
> Just to be clear, I'm not proposing granting secret access to procfs,
> and as far as I can see, nobody else is either. We've been talking
> about making it easier to avoid races when you happen to want a pidfd
> and a procfs fd that point to the same process
So I thought that was the whole point of just opening /proc/<pid>.
Exactly because that way you can then use openat() from there on.
Linus
On Sat, Mar 30, 2019 at 10:04:33AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:34 AM Daniel Colascione <[email protected]> wrote:
> >
> > Just to be clear, I'm not proposing granting secret access to procfs,
> > and as far as I can see, nobody else is either. We've been talking
> > about making it easier to avoid races when you happen to want a pidfd
> > and a procfs fd that point to the same process
>
> So I thought that was the whole point of just opening /proc/<pid>.
> Exactly because that way you can then use openat() from there on.
To clarify, what the Android guys really wanted to be part of the api is
a way to get race-free access to metadata associated with a given pidfd.
And the idea was that *if and only if procfs is mounted* you could do:
int pidfd = pidfd_open(1234, 0);
int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
and then we internally verify that the struct pid that the pidfd is
refering to, is still the same as the one that /proc/<pid> is refering
to and only then do we return an fd for the process /proc/<pid>
directory which would then allow you to do, e.g.:
int statusfd = openat(procpidfd, "status", O_RDONLY | O_CLOEXEC);
this would provide race-free access to metadat but again, only if /proc
is mounted and available to the user. But if that's an instant NAK we
will definitely *not* do this.
On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
>
>
> To clarify, what the Android guys really wanted to be part of the api is
> a way to get race-free access to metadata associated with a given pidfd.
> And the idea was that *if and only if procfs is mounted* you could do:
>
> int pidfd = pidfd_open(1234, 0);
>
> int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
And my claim is that this is three system calls - one of them very
hacky - to just do
int pidfd = open("/proc/%d", O_PATH);
and you're done. It acts as the pidfd _and_ the way to get the
associated status files etc.
So there is absolutely zero advantage to going through pidfd_open().
No. No. No.
So the *only* reason for "pidfd_open()" is if you don't have /proc in
the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
Yeah, yeah, if you want to avoid going through the pathname
translation, that's one thing, but if that's your aim, then you again
should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
and you're basically saying "ok, I'm not going to do /proc at all".
So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
really has to be *SIMPLER* and *NO PROCFS*.
PIDFD_TO_PROCFD violates *everything*.
Linus
On Sat, Mar 30, 2019 at 10:24:19AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> >
> >
> > To clarify, what the Android guys really wanted to be part of the api is
> > a way to get race-free access to metadata associated with a given pidfd.
> > And the idea was that *if and only if procfs is mounted* you could do:
> >
> > int pidfd = pidfd_open(1234, 0);
> >
> > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
>
> And my claim is that this is three system calls - one of them very
> hacky - to just do
>
> int pidfd = open("/proc/%d", O_PATH);
>
> and you're done. It acts as the pidfd _and_ the way to get the
> associated status files etc.
>
> So there is absolutely zero advantage to going through pidfd_open().
>
> No. No. No.
>
> So the *only* reason for "pidfd_open()" is if you don't have /proc in
> the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
>
> Yeah, yeah, if you want to avoid going through the pathname
> translation, that's one thing, but if that's your aim, then you again
> should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
> and you're basically saying "ok, I'm not going to do /proc at all".
>
> So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
Understood.
> really has to be *SIMPLER* and *NO PROCFS*.
>
> PIDFD_TO_PROCFD violates *everything*.
On Sat, Mar 30, 2019 at 5:24 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> >
> >
> > To clarify, what the Android guys really wanted to be part of the api is
> > a way to get race-free access to metadata associated with a given pidfd.
> > And the idea was that *if and only if procfs is mounted* you could do:
> >
> > int pidfd = pidfd_open(1234, 0);
> >
> > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
>
> And my claim is that this is three system calls - one of them very
> hacky - to just do
>
> int pidfd = open("/proc/%d", O_PATH);
>
> and you're done. It acts as the pidfd _and_ the way to get the
> associated status files etc.
>
> So there is absolutely zero advantage to going through pidfd_open().
>
> No. No. No.
>
> So the *only* reason for "pidfd_open()" is if you don't have /proc in
> the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
>
> Yeah, yeah, if you want to avoid going through the pathname
> translation, that's one thing, but if that's your aim, then you again
> should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
> and you're basically saying "ok, I'm not going to do /proc at all".
>
> So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
> really has to be *SIMPLER* and *NO PROCFS*.
>
(Resending because accidently it wasn't a reply-all)
If you go with pidfd_open, that should also mean you remove the
ability to be able to use /proc/<PID> dir fds in pidfd_send_signal.
Otherwise the semantics are hairy: I can only pidfd_open a task
reachable from my active namespace, but somehow also be able to open a
pidfd if I happen to see someone's /proc in my mount namespace and
have the access to open it?
> PIDFD_TO_PROCFD violates *everything*.
>
> Linus
On Sat, Mar 30, 2019 at 05:50:20PM +0000, Jonathan Kowalski wrote:
> On Sat, Mar 30, 2019 at 5:24 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> > >
> > >
> > > To clarify, what the Android guys really wanted to be part of the api is
> > > a way to get race-free access to metadata associated with a given pidfd.
> > > And the idea was that *if and only if procfs is mounted* you could do:
> > >
> > > int pidfd = pidfd_open(1234, 0);
> > >
> > > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
> >
> > And my claim is that this is three system calls - one of them very
> > hacky - to just do
> >
> > int pidfd = open("/proc/%d", O_PATH);
> >
> > and you're done. It acts as the pidfd _and_ the way to get the
> > associated status files etc.
> >
> > So there is absolutely zero advantage to going through pidfd_open().
> >
> > No. No. No.
> >
> > So the *only* reason for "pidfd_open()" is if you don't have /proc in
> > the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
> >
> > Yeah, yeah, if you want to avoid going through the pathname
> > translation, that's one thing, but if that's your aim, then you again
> > should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
> > and you're basically saying "ok, I'm not going to do /proc at all".
> >
> > So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
> > really has to be *SIMPLER* and *NO PROCFS*.
> >
>
> (Resending because accidently it wasn't a reply-all)
>
> If you go with pidfd_open, that should also mean you remove the
> ability to be able to use /proc/<PID> dir fds in pidfd_send_signal.
>
> Otherwise the semantics are hairy: I can only pidfd_open a task
> reachable from my active namespace, but somehow also be able to open a
You can easily setns() to another pid namespace and get a pidfd there.
That's how most namespace interactions work right now. We already had
that discussion.
On Sat, Mar 30, 2019 at 5:52 PM Christian Brauner <[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 05:50:20PM +0000, Jonathan Kowalski wrote:
> > On Sat, Mar 30, 2019 at 5:24 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> > > >
> > > >
> > > > To clarify, what the Android guys really wanted to be part of the api is
> > > > a way to get race-free access to metadata associated with a given pidfd.
> > > > And the idea was that *if and only if procfs is mounted* you could do:
> > > >
> > > > int pidfd = pidfd_open(1234, 0);
> > > >
> > > > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > > > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
> > >
> > > And my claim is that this is three system calls - one of them very
> > > hacky - to just do
> > >
> > > int pidfd = open("/proc/%d", O_PATH);
> > >
> > > and you're done. It acts as the pidfd _and_ the way to get the
> > > associated status files etc.
> > >
> > > So there is absolutely zero advantage to going through pidfd_open().
> > >
> > > No. No. No.
> > >
> > > So the *only* reason for "pidfd_open()" is if you don't have /proc in
> > > the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
> > >
> > > Yeah, yeah, if you want to avoid going through the pathname
> > > translation, that's one thing, but if that's your aim, then you again
> > > should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
> > > and you're basically saying "ok, I'm not going to do /proc at all".
> > >
> > > So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
> > > really has to be *SIMPLER* and *NO PROCFS*.
> > >
> >
> > (Resending because accidently it wasn't a reply-all)
> >
> > If you go with pidfd_open, that should also mean you remove the
> > ability to be able to use /proc/<PID> dir fds in pidfd_send_signal.
> >
> > Otherwise the semantics are hairy: I can only pidfd_open a task
> > reachable from my active namespace, but somehow also be able to open a
>
> You can easily setns() to another pid namespace and get a pidfd there.
> That's how most namespace interactions work right now. We already had
> that discussion.
Only if it is a child namespace, or you have the relevant capabilities to setns.
Currently, if I just put a task in PID namespace, it can see /proc of
an ancestor PID namespace, and opendir /proc/<PID>, this is accepted
by pidfd_send_signal.
If you ever allow signalling across PID namespaces (because file
descriptors should be able to do that, they are not namespaced, see
files, sockets, etc), it will become a problem. Getting pidfds from
outside my active namespace should require userspace cooperation.
So, opening a pidfd should be limited to what *I* can see in my
namespace, like every other namespace. That is what a namespace is,
and PIDs have their own namespace, they're not exposed in the
filesystem namespace.
On Sat, Mar 30, 2019 at 6:24 PM Linus Torvalds
<[email protected]> wrote:
> On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> > To clarify, what the Android guys really wanted to be part of the api is
> > a way to get race-free access to metadata associated with a given pidfd.
> > And the idea was that *if and only if procfs is mounted* you could do:
> >
> > int pidfd = pidfd_open(1234, 0);
> >
> > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
>
> And my claim is that this is three system calls - one of them very
> hacky - to just do
>
> int pidfd = open("/proc/%d", O_PATH);
>
> and you're done. It acts as the pidfd _and_ the way to get the
> associated status files etc.
>
> So there is absolutely zero advantage to going through pidfd_open().
>
> No. No. No.
>
> So the *only* reason for "pidfd_open()" is if you don't have /proc in
> the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
So if, in the future, there is some sort of "create a new task and
return an fd to it" syscall, do you think it should always return
pidfds, or do you think it should return fds to /proc if procfs is
available? And if it should return fds to /proc, does that mean that
this "create a task" API should take an extra argument with a file
descriptor to the procfs instance you want to use?
(This can't always be implemented easily in userspace on top of normal
clone(), because if you create a task without a termination signal -
like a thread -, its PID can be recycled under you.)
An API like this would have less complexity stuffed into a single
syscall if it always returns pidfds, and if you then actually want an
fd to procfs, you can do the conversion that requires specifying a
procfs instance separately.
Of course, if you think that we shouldn't add an API for
pidfd-to-procfs conversion before we have an API for
clone()-with-an-fd-retval, that's understandable.
On Sat, Mar 30, 2019 at 05:59:34PM +0000, Jonathan Kowalski wrote:
> On Sat, Mar 30, 2019 at 5:52 PM Christian Brauner <[email protected]> wrote:
> >
> > On Sat, Mar 30, 2019 at 05:50:20PM +0000, Jonathan Kowalski wrote:
> > > On Sat, Mar 30, 2019 at 5:24 PM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> > > > >
> > > > >
> > > > > To clarify, what the Android guys really wanted to be part of the api is
> > > > > a way to get race-free access to metadata associated with a given pidfd.
> > > > > And the idea was that *if and only if procfs is mounted* you could do:
> > > > >
> > > > > int pidfd = pidfd_open(1234, 0);
> > > > >
> > > > > int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> > > > > int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
> > > >
> > > > And my claim is that this is three system calls - one of them very
> > > > hacky - to just do
> > > >
> > > > int pidfd = open("/proc/%d", O_PATH);
> > > >
> > > > and you're done. It acts as the pidfd _and_ the way to get the
> > > > associated status files etc.
> > > >
> > > > So there is absolutely zero advantage to going through pidfd_open().
> > > >
> > > > No. No. No.
> > > >
> > > > So the *only* reason for "pidfd_open()" is if you don't have /proc in
> > > > the first place. In which case the whole PIDFD_TO_PROCFD is bogus.
> > > >
> > > > Yeah, yeah, if you want to avoid going through the pathname
> > > > translation, that's one thing, but if that's your aim, then you again
> > > > should also just admit that PIDFD_TO_PROCFD is disgusting and wrong,
> > > > and you're basically saying "ok, I'm not going to do /proc at all".
> > > >
> > > > So I'm ok with the whole "simpler, faster, no-proc pidfd", but then it
> > > > really has to be *SIMPLER* and *NO PROCFS*.
> > > >
> > >
> > > (Resending because accidently it wasn't a reply-all)
> > >
> > > If you go with pidfd_open, that should also mean you remove the
> > > ability to be able to use /proc/<PID> dir fds in pidfd_send_signal.
> > >
> > > Otherwise the semantics are hairy: I can only pidfd_open a task
> > > reachable from my active namespace, but somehow also be able to open a
> >
> > You can easily setns() to another pid namespace and get a pidfd there.
> > That's how most namespace interactions work right now. We already had
> > that discussion.
>
> Only if it is a child namespace, or you have the relevant capabilities to setns.
>
> Currently, if I just put a task in PID namespace, it can see /proc of
> an ancestor PID namespace, and opendir /proc/<PID>, this is accepted
> by pidfd_send_signal.
>
> If you ever allow signalling across PID namespaces (because file
That's not on the plate for now and pidfd_send_signal() is checking for
that to not allow it.
On Sat, Mar 30, 2019 at 09:18:12AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:16 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > And no, we are *NOT* making pidfd_open() into some "let's expose /proc
> > even when it's not mounted" horror. Seriously. That's disgusting,
> > wrong, insecure and stupid.
>
> And btw, this is not debatable. In fact, this whole discussion is
> making me feel like I should just revert pidfd, not because the code I
> merged is wrong, but because people are clearly intending to do
> completely inappropriate things with this.
>
> Get your act together. Stop hacking up garbage.
>
> Linus
I am supportive of Linus's view of keeping /proc separate from pidfd. I
didn't really care about "pidfd" solving any racing issues with /proc/<pid>/*
access. My main interest in pidfd is because we can implement "waiting"
semantics in the future using something like a pidfd_wait call, which solves
one of the issues of Android's low-memory where it needs to be notified as
quickly as possible about a non-child process's death. Android Low memory
killer right now just keeps checking for /proc/<pid> 's existence which is
slow and more CPU intense for this.
As I said I don't really care about "pidfd" solving any racing issues with
/proc/<pid>/* accesses - because I still find it hard to imagine that the pid
number can be reused easily from the time you know which <pid> to deal with,
to the time when you want to read, say, the /proc/<pid>/status file. I am yet
to see any real data to show that such overflow happens - you literally need
32k process deaths and forks in such a short time frame, and on 64-bit, that
number is really high that its not even an issue. And if this is really an
issue, then you can just open a handle to /proc/<pid> at process creation
time and keep it around. If the <pid> is reused, you can still use openat(2)
on that handle without any races.
What I think will be most immediately valuable for Android in my opinion is
the pidfd_open() and pidfd_send_signal() syscalls, along with the future
pidfd_wait() that we're working on to solve the issue with Android's Low
memory killer I mentioned above.
Now, in relation to Daniel's concern with procfs, I feel we need to be clear
what is the meaning of a "pidfd" and it should consistently work across all
APIs no matter how pidfd is obtained. There shouldn't be like "if you obtain
pidfd using this method and it only works with these APIs", that's just plain
wrong IMO.
For example: with pidfd_open() or whatever that ends up being called is
available, the pidfd obtained is not tied to /proc. However, if one obtained
a pidfd using open("/proc/<pid>/", ..), then that pidfd *is* tied to /proc.
pidfd_send_signal will happily work on this "pidfd". Both these methods of
obtainings pidfd(s) make openat(2) work on them inconsistently. In one case
openat(2) fails, and in the other case it succeeds. I feel there should be no
ambiguity between how "pidfd" works with openat(2). Either make openat(2)
never work on any "pidfd", or always make it work on it.
I would say, don't call a /proc/<pid> file descriptor as a "pidfd". Next, to
remedy all this, I feel pidfd_send_signal *should not* work on fds that are
obtained by opening of /proc/<pid> if we can agree those are not pidfds.
Only the ones obtained using the proposed pidfd_open (or pidfd_clone) should
be allowed to be used with pidfd_send_signal.
So I believe the call to tgid_pidfd_to_pid() from pidfd_to_pid in Christian's
patch 3/5 needs to go away.
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return tgid_pidfd_to_pid(file); <-------- Should just return error
+}
+
I agree with Christian that lets not put pidfd_send_signal at risk. Let us
make the notion of a pidfd and the APIs that work on it *consistent* and make
it do just what we really need for our usescases. Which for Android, is,
pidfd_open/clone, pidfd_send_signal and pidfd_wait. I hope I added some
clarity around the usecases.
thanks,
- Joel
On Sun, Mar 31, 2019 at 3:07 AM Joel Fernandes <[email protected]> wrote:
> As I said I don't really care about "pidfd" solving any racing issues with
> /proc/<pid>/* accesses - because I still find it hard to imagine that the pid
> number can be reused easily from the time you know which <pid> to deal with,
> to the time when you want to read, say, the /proc/<pid>/status file.
There have been several Android security bugs related to PID reuse.
> I am yet
> to see any real data to show that such overflow happens - you literally need
> 32k process deaths and forks in such a short time frame
This seems very inaccurate to me.
The time frame in which the PID has to wrap around is not the time
between process death and use of the PID. It is the time between *the
creation* of the old process and the use of the PID. Consider the
following sequence of events:
- process A starts with PID 1000
- some time passes in which some process repeatedly forks, with PIDs
wrapping around to 999
- process B starts an attempt to access process A (using PID 1000)
- process A dies
- process C spawns with PID 1000
- process B accidentally accesses process C
Also, it's probably worth clarifying that here, "processes" means "threads".
If there are a lot of active processes, that reduces the number of
times you have to clone() to get the PID to wrap around.
> and on 64-bit, that
> number is really high
Which number is really high on 64-bit? Checking on a walleye phone,
pid_max is still only 32768:
walleye:/ # cat /proc/sys/kernel/pid_max
32768
walleye:/ #
> that its not even an issue. And if this is really an
> issue, then you can just open a handle to /proc/<pid> at process creation
> time and keep it around. If the <pid> is reused, you can still use openat(2)
> on that handle without any races.
But not if you want to implement something like killall in a
race-free way, for example.
On Sun, Mar 31, 2019 at 04:34:57AM +0200, Jann Horn wrote:
> On Sun, Mar 31, 2019 at 3:07 AM Joel Fernandes <[email protected]> wrote:
> > As I said I don't really care about "pidfd" solving any racing issues with
> > /proc/<pid>/* accesses - because I still find it hard to imagine that the pid
> > number can be reused easily from the time you know which <pid> to deal with,
> > to the time when you want to read, say, the /proc/<pid>/status file.
>
> There have been several Android security bugs related to PID reuse.
Yes PID reuse will be a problem till we have pidfd_clone and
pidfd_send_signal (and any other pidfd related syscalls). I've never denied
PID reuse is *currently* a problem and the set of pidfd syscalls being
proposed are designed to avoid those. So I'm not fully sure what you mean.
Anyway, I would love to see those security bugs you mentioned if you could
point me to them.
> > I am yet
> > to see any real data to show that such overflow happens - you literally need
> > 32k process deaths and forks in such a short time frame
>
> This seems very inaccurate to me.
>
> The time frame in which the PID has to wrap around is not the time
> between process death and use of the PID. It is the time between *the
> creation* of the old process and the use of the PID. Consider the
> following sequence of events:
>
> - process A starts with PID 1000
> - some time passes in which some process repeatedly forks, with PIDs
> wrapping around to 999
> - process B starts an attempt to access process A (using PID 1000)
> - process A dies
> - process C spawns with PID 1000
> - process B accidentally accesses process C
>
> Also, it's probably worth clarifying that here, "processes" means "threads".
>
> If there are a lot of active processes, that reduces the number of
> times you have to clone() to get the PID to wrap around.
Ok, that's fair and I take your point. But I wonder what access you're
talking about, is it killing the process? If yes, pidfd_clone +
pidfd_send_signal will solve that in the race free way without relying on the
PID number. Is it accessing /proc/<pid>/? then see below.
> > and on 64-bit, that
> > number is really high
>
> Which number is really high on 64-bit? Checking on a walleye phone,
> pid_max is still only 32768:
>
> walleye:/ # cat /proc/sys/kernel/pid_max
> 32768
> walleye:/ #
Ok. I was talking about the theoretical limit of pid_max on a 64-bit
platform. But since we are talking about NOT relying on the PID number in the
first place, we can move on from this point.
> > that its not even an issue. And if this is really an
> > issue, then you can just open a handle to /proc/<pid> at process creation
> > time and keep it around. If the <pid> is reused, you can still use openat(2)
> > on that handle without any races.
>
> But not if you want to implement something like killall in a
> race-free way, for example.
I am not at all talking about killing processes in your last quote of my
email above, I'm talking about access to /proc/<pid>/ files.
As I said, at the time of process creation, you can obtain an fd by opening
/proc/<pid>/ and keep it open. Then you can do an openat(2) on that fd
without worrying at <pid> reuse, no? And then access all the files that way.
As for killall in Android. I don't think that "killing processes by name" is
relied on for the runtime operation of Android. That would be a very bad
idea. Low memory killer does not kill processes by name. It kills processes
by the PID number using kill(2) which we'd like to replace with
pidfd_send_signal.
Again if you want to convince Linus about having a "pidfd to procfd"
conversion mechanism, then by all means go for it. I just don't think it is
urgently necessary (and others may disagree with me on this), but I wouldn't
care if such a mechanism existed either. Whatever we do, I just want the
notion of "pidfd" to be consistent as I mentioned in my previous email.
thank you!
- Joel
On Sun, Mar 31, 2019 at 6:08 AM Joel Fernandes <[email protected]> wrote:
> On Sun, Mar 31, 2019 at 04:34:57AM +0200, Jann Horn wrote:
> > On Sun, Mar 31, 2019 at 3:07 AM Joel Fernandes <[email protected]> wrote:
> > > As I said I don't really care about "pidfd" solving any racing issues with
> > > /proc/<pid>/* accesses - because I still find it hard to imagine that the pid
> > > number can be reused easily from the time you know which <pid> to deal with,
> > > to the time when you want to read, say, the /proc/<pid>/status file.
> >
> > There have been several Android security bugs related to PID reuse.
>
> Yes PID reuse will be a problem till we have pidfd_clone and
> pidfd_send_signal (and any other pidfd related syscalls). I've never denied
> PID reuse is *currently* a problem and the set of pidfd syscalls being
> proposed are designed to avoid those. So I'm not fully sure what you mean.
> Anyway, I would love to see those security bugs you mentioned if you could
> point me to them.
https://bugs.chromium.org/p/project-zero/issues/detail?id=851
"Android: racy getpidcon usage permits binder service replacement"
https://bugs.chromium.org/p/project-zero/issues/detail?id=853
"Android: debuggerd mitigation bypass and infoleak"
https://bugs.chromium.org/p/project-zero/issues/detail?id=1404
"Android: Hardware Service Manager Arbitrary Service Replacement due
to getpidcon"
https://bugs.chromium.org/p/project-zero/issues/detail?id=1406
"Android: Permission bypass in KeyStore service due to getpidcon"
https://bugs.chromium.org/p/project-zero/issues/detail?id=1741
"Android: getpidcon() usage in hardware binder servicemanager permits
ACL bypass"
> > > I am yet
> > > to see any real data to show that such overflow happens - you literally need
> > > 32k process deaths and forks in such a short time frame
> >
> > This seems very inaccurate to me.
> >
> > The time frame in which the PID has to wrap around is not the time
> > between process death and use of the PID. It is the time between *the
> > creation* of the old process and the use of the PID. Consider the
> > following sequence of events:
> >
> > - process A starts with PID 1000
> > - some time passes in which some process repeatedly forks, with PIDs
> > wrapping around to 999
> > - process B starts an attempt to access process A (using PID 1000)
> > - process A dies
> > - process C spawns with PID 1000
> > - process B accidentally accesses process C
> >
> > Also, it's probably worth clarifying that here, "processes" means "threads".
> >
> > If there are a lot of active processes, that reduces the number of
> > times you have to clone() to get the PID to wrap around.
>
> Ok, that's fair and I take your point. But I wonder what access you're
> talking about, is it killing the process? If yes, pidfd_clone +
> pidfd_send_signal will solve that in the race free way without relying on the
> PID number.
Sure, given a pidfd_clone() syscall, as long as the parent of the
process is giving you a pidfd for it and you don't have to deal with
grandchildren created by fork() calls outside your control, that
works.
> Is it accessing /proc/<pid>/? then see below.
> > > and on 64-bit, that
> > > number is really high
> >
> > Which number is really high on 64-bit? Checking on a walleye phone,
> > pid_max is still only 32768:
> >
> > walleye:/ # cat /proc/sys/kernel/pid_max
> > 32768
> > walleye:/ #
>
> Ok. I was talking about the theoretical limit of pid_max on a 64-bit
> platform. But since we are talking about NOT relying on the PID number in the
> first place, we can move on from this point.
(pid_t is a signed 32-bit value, that's UAPI.)
> > > that its not even an issue. And if this is really an
> > > issue, then you can just open a handle to /proc/<pid> at process creation
> > > time and keep it around. If the <pid> is reused, you can still use openat(2)
> > > on that handle without any races.
> >
> > But not if you want to implement something like killall in a
> > race-free way, for example.
>
> I am not at all talking about killing processes in your last quote of my
> email above, I'm talking about access to /proc/<pid>/ files.
>
> As I said, at the time of process creation, you can obtain an fd by opening
> /proc/<pid>/ and keep it open. Then you can do an openat(2) on that fd
> without worrying at <pid> reuse, no? And then access all the files that way.
Yeah, if you're the parent of a process with a termination signal, that works.
> As for killall in Android. I don't think that "killing processes by name" is
> relied on for the runtime operation of Android. That would be a very bad
> idea. Low memory killer does not kill processes by name. It kills processes
> by the PID number using kill(2) which we'd like to replace with
> pidfd_send_signal.
Yeah, I somehow lost context when replying to your mail here; sorry
about that. I didn't mean to imply that this is a usecase on Android.
On Sat, Mar 30, 2019 at 9:47 PM Jann Horn <[email protected]> wrote:
>
> Sure, given a pidfd_clone() syscall, as long as the parent of the
> process is giving you a pidfd for it and you don't have to deal with
> grandchildren created by fork() calls outside your control, that
> works.
Don't do pidfd_clone() and pidfd_wait().
Both of those existing system calls already get a "flags" argument.
Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
make the existing system calls just take/return a pidfd.
Side note: we could (should?) also make the default maxpid just be
larger. It needs to fit in an 'int', but MAXINT instead of 65535 would
likely alreadt make a lot of these attacks harder.
There was some really old legacy reason why we actually limited it to
65535 originally. It was old and crufty even back when..
Linus
Linus
On Sun, Mar 31, 2019 at 07:52:28AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:47 PM Jann Horn <[email protected]> wrote:
> >
> > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > process is giving you a pidfd for it and you don't have to deal with
> > grandchildren created by fork() calls outside your control, that
> > works.
>
> Don't do pidfd_clone() and pidfd_wait().
>
> Both of those existing system calls already get a "flags" argument.
> Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> make the existing system calls just take/return a pidfd.
Yes, that's one of the options I was considering but was afraid of
pitching it because of the very massive opposition I got
regarding"multiplexers". I'm perfectly happy with doing it this way.
>
> Side note: we could (should?) also make the default maxpid just be
> larger. It needs to fit in an 'int', but MAXINT instead of 65535 would
> likely alreadt make a lot of these attacks harder.
Yes, agreed.
>
> There was some really old legacy reason why we actually limited it to
> 65535 originally. It was old and crufty even back when..
So Jann and I have been thinking about going forward with the following
idea:
With the pidfd_open() patchset I have pidfds are simple anone inode file
descriptors stashing a reference to struct pid of a process. I have
mentioned this is in prior mails. This cleanly decouples pidfds from
procfs completely.
The reason why we want to use pidfds with no connection to a specific procfs
instance, even in environments that have procfs, is that we would like to add
the API to clone with CLONE_PIDFD that you just mentioned that creates a
new process or thread and returns a pidfd to it. In the context of such
a syscall, it would be awkward to have the kernel open a file in some
procfs instance, since then userspace would have to specify which procfs
instance the fd should come from.
There is an argument to be made that for consistency's sake we should - although
I don't feel strongly about it - disallow the usage of pidfd_send_signal() with
fds referring to /proc/<pid> then. Unless you want this to always work. If you
want this to work then we would simply submit pidfd_open() for the 5.2
window. If you agree that it makes sense to only have pidfd_open() file
descriptors working with pidfd_send_signal() we would send a revert for
pidfd_send_signal() now and resubmit it together with pidfd_open()
during the 5.2. merge window. This decouples pidfds completely from
procfs not just when it is not compiled in or mounted.
I very much care about this being done right. If this means temporarily
kicking pidfd_send_signal() out until 5.2 I'm happy to do so.
Btw, the /proc/<pid> race issue that is mentioned constantly is simply
avoidable by placing the pid that the pidfd has stashed relative to the
callers' procfs mount's pid namespace in the pidfd's fdinfo. So there's
not even a need to really go through /proc/<pid> in the first place. A
caller wanting to get metadata access and avoid a race with pid
recycling can then simply do:
int pidfd = pidfd_open(pid, 0);
int pid = parse_fdinfo("/proc/self/fdinfo/<pidfd>");
int procpidfd = open("/proc/<pid>", ...);
/* Test if process still exists by sending signal 0 through our pidfd. */
int ret = pidfd_send_signal(pid, 0, NULL, PIDFD_SIGNAL_THREAD);
if (ret < 0 && errno == ESRCH) {
/* pid has been recycled and procpidfd refers to another process */
}
Christian
On Sun, Mar 31, 2019 at 8:05 AM Christian Brauner <[email protected]> wrote:
>
> On Sun, Mar 31, 2019 at 07:52:28AM -0700, Linus Torvalds wrote:
> > On Sat, Mar 30, 2019 at 9:47 PM Jann Horn <[email protected]> wrote:
> > >
> > > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > > process is giving you a pidfd for it and you don't have to deal with
> > > grandchildren created by fork() calls outside your control, that
> > > works.
> >
> > Don't do pidfd_clone() and pidfd_wait().
> >
> > Both of those existing system calls already get a "flags" argument.
> > Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> > make the existing system calls just take/return a pidfd.
>
> Yes, that's one of the options I was considering but was afraid of
> pitching it because of the very massive opposition I got
> regarding"multiplexers". I'm perfectly happy with doing it this way.
This approach is fine by me, FWIW. I like it more than a general-purpose pidctl.
> Btw, the /proc/<pid> race issue that is mentioned constantly is simply
> avoidable by placing the pid that the pidfd has stashed relative to the
> callers' procfs mount's pid namespace in the pidfd's fdinfo. So there's
> not even a need to really go through /proc/<pid> in the first place. A
> caller wanting to get metadata access and avoid a race with pid
> recycling can then simply do:
>
> int pidfd = pidfd_open(pid, 0);
> int pid = parse_fdinfo("/proc/self/fdinfo/<pidfd>");
> int procpidfd = open("/proc/<pid>", ...);
IMHO, it's worth documenting this procedure in the pidfd man page.
> /* Test if process still exists by sending signal 0 through our pidfd. */
Are you planning on officially documenting this incantation in the
pidfd man page?
> int ret = pidfd_send_signal(pid, 0, NULL, PIDFD_SIGNAL_THREAD);
> if (ret < 0 && errno == ESRCH) {
> /* pid has been recycled and procpidfd refers to another process */
> }
I was going to suggest that WNOHANG also works for this purpose, but
that idea raises another question: normally, you can wait*(2) on a
process only once. Are you imagining waitid on a pidfd succeeding more
than one? ISTM that the pidfd would need to internally store not just
a struct pid, but the exit status as well or some way to get to it.
On Sun, Mar 31, 2019 at 3:59 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, Mar 30, 2019 at 9:47 PM Jann Horn <[email protected]> wrote:
> >
> > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > process is giving you a pidfd for it and you don't have to deal with
> > grandchildren created by fork() calls outside your control, that
> > works.
>
> Don't do pidfd_clone() and pidfd_wait().
>
> Both of those existing system calls already get a "flags" argument.
> Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> make the existing system calls just take/return a pidfd.
clone is out of flags, so there will have to be a new system call.
I am not sure about the waitid bit. Are you suggesting it takes a
pidfd and waits using it? I was thinking if we could make the pidfd
itself pollable and readable for exit status. At pidfd_open time, you
pass the flag and only if you're a parent you get a readable instance,
if not, a pollable one for everyone (eg. for an indirect child as a
reaper), and it fails for threads.
Then, the pidfd clone2 returns can also be polled and read from.
The main pain point is, currently when I ptrace from a thread a
process, I need to use waitpid (waitid throws away ptrace critical
information), and since ptrace works on a thread by thread basis, only
the attached thread can do the waitpid. This means I cannot do
anything else from the attached thread concurrently. waitfd was
supposed to solve this (back in 2009) but it never made it in, and
clone4 from Josh Triplett did something similar (returned exit status
over the clonefd).
FreeBSD's process descriptors are also pollable (which is where all
this work was originally inspired from) and it would help with
adoption if semantics were similar. Besides that, it would help
libraries to be able to host their own set of children without
affecting the entire process's waiting logic oe mucking with the
SIGCHLD handler (you wouldn't need signals).
>
> Side note: we could (should?) also make the default maxpid just be
> larger. It needs to fit in an 'int', but MAXINT instead of 65535 would
> likely alreadt make a lot of these attacks harder.
>
> There was some really old legacy reason why we actually limited it to
> 65535 originally. It was old and crufty even back when..
>
> Linus
>
> Linus
> On Mar 30, 2019, at 11:24 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
>>
>>
>> To clarify, what the Android guys really wanted to be part of the api is
>> a way to get race-free access to metadata associated with a given pidfd.
>> And the idea was that *if and only if procfs is mounted* you could do:
>>
>> int pidfd = pidfd_open(1234, 0);
>>
>> int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
>> int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
>
> And my claim is that this is three system calls - one of them very
> hacky - to just do
>
> int pidfd = open("/proc/%d", O_PATH);
Hi Linus-
I want to re-check this because I think Christian’s example was bad. I proposed these ioctls, but that wasn’t the intended use. The real point is:
int pidfd = new_improved_clone(...);
To be useful, this type of API *must* work without proc mounted.
And, later:
openat(fd to pidfd’s proc directory, “status”, ...);
And we want a non-utterly-crappy way to do this. The ioctl is certainly ugly, but it *works*.
Another approach is:
pid_t pid = pidfd_get_pid(pidfd);
sprintf(buf, “/proc/%d”, pid);
int procfd = open(buf, O_PATH);
if (pidfd_get_pid(pidfd) != pid) {
we lose;
}
But this is clunky.
Do you think the clunky version is okay, or do you have a suggestion for making it better?
—Andy
On Sun, Mar 31, 2019 at 1:38 PM Andy Lutomirski <[email protected]> wrote:
>
> openat(fd to pidfd’s proc directory, “status”, ...);
>
> And we want a non-utterly-crappy way to do this. The ioctl is certainly ugly, but it *works*.
It's beyond clunky. It's a disgrace.
If people really want equivalency between open("/proc/%d") and some
new pidfd_open(), then just *make* the two equivalent.
No effing crappy ioctl idiocy to create one from the other. Just make
the damn things be the exact same thing (and then if we extend clone()
to return one, make that return the same exact thing too).
Btw, we have several clone bits left:
- if we don't have CLONE_PARENT set, the low 8 bits are still available
- ignoring that, wehave bit #12 free: It used to be CLONE_IDLETASK
long long ago, but it was always kernel-only so it's never been
exposed as a user space bit.
Linus
On Sun, Mar 31, 2019 at 02:03:25PM -0700, Linus Torvalds wrote:
> On Sun, Mar 31, 2019 at 1:38 PM Andy Lutomirski <[email protected]> wrote:
> >
> > openat(fd to pidfd’s proc directory, “status”, ...);
> >
> > And we want a non-utterly-crappy way to do this. The ioctl is certainly ugly, but it *works*.
>
> It's beyond clunky. It's a disgrace.
>
> If people really want equivalency between open("/proc/%d") and some
> new pidfd_open(), then just *make* the two equivalent.
I don't think that we want or can make them equivalent since that would
mean we depend on procfs. If userspace really wants to turn a pidfd into
an fd for /proc/<pid> then they can be burdened to do so by parsing out
the pid relative to their procfs pid namespace from the pidfds fdinfo:
int pidfd = pidfd_open(pid, 0);
int pid = parse_fdinfo("/proc/self/fdinfo/<pidfd>");
int procpidfd = open("/proc/<pid>", ...);
/* Test if process still exists by sending signal 0 through our pidfd. */
int ret = pidfd_send_signal(pid, 0, NULL, PIDFD_SIGNAL_THREAD);
if (ret < 0 && errno == ESRCH) {
/* pid has been recycled and procpidfd refers to another process */
}
it's race free and no ioctl() is needed.
>
> No effing crappy ioctl idiocy to create one from the other. Just make
> the damn things be the exact same thing (and then if we extend clone()
> to return one, make that return the same exact thing too).
>
> Btw, we have several clone bits left:
>
> - if we don't have CLONE_PARENT set, the low 8 bits are still available
>
> - ignoring that, wehave bit #12 free: It used to be CLONE_IDLETASK
> long long ago, but it was always kernel-only so it's never been
> exposed as a user space bit.
That's good to know.
On Sun, Mar 31, 2019 at 02:09:03PM -0600, Andy Lutomirski wrote:
>
>
> > On Mar 30, 2019, at 11:24 AM, Linus Torvalds <[email protected]> wrote:
> >
> >> On Sat, Mar 30, 2019 at 10:12 AM Christian Brauner <[email protected]> wrote:
> >>
> >>
> >> To clarify, what the Android guys really wanted to be part of the api is
> >> a way to get race-free access to metadata associated with a given pidfd.
> >> And the idea was that *if and only if procfs is mounted* you could do:
> >>
> >> int pidfd = pidfd_open(1234, 0);
> >>
> >> int procfd = open("/proc", O_RDONLY | O_CLOEXEC);
> >> int procpidfd = ioctl(pidfd, PIDFD_TO_PROCFD, procfd);
> >
> > And my claim is that this is three system calls - one of them very
> > hacky - to just do
> >
> > int pidfd = open("/proc/%d", O_PATH);
>
> Hi Linus-
>
> I want to re-check this because I think Christian’s example was bad. I proposed these ioctls, but that wasn’t the intended use. The real point is:
Getting metadata access was pushed as essential originally which is why
this ioctl() came up in the first place. The concerns about
CLONE_PIDFD were not relevant when this came up [1]:
<quote>
> And how do you propose, given one of these handle objects, getting a
> process's current priority, or its current oom score, or its list of
> memory maps? As I mentioned in my original email, and which nobody has
> addressed, if you don't use a dirfd as your process handle or you
> don't provide an easy way to get one of these proc directory FDs, you
> need to duplicate a lot of metadata access interfaces.
An API that takes a process handle object and an fd pointing at /proc
(the root of the proc fs) and gives you back a proc dirfd would do the
trick. You could do this with no new kernel features at all if you're
willing to read the pid, call openat(2), and handle the races in user
code.
<quote>
[1]: https://lore.kernel.org/lkml/CALCETrUFrFKC2YTLH7ViM_7XPYk3LNmNiaz6s8wtWo1pmJQXzg@mail.gmail.com/
>
> int pidfd = new_improved_clone(...);
>
> To be useful, this type of API *must* work without proc mounted.
>
> And, later:
>
> openat(fd to pidfd’s proc directory, “status”, ...);
>
> And we want a non-utterly-crappy way to do this. The ioctl is certainly ugly, but it *works*.
>
> Another approach is:
>
> pid_t pid = pidfd_get_pid(pidfd);
> sprintf(buf, “/proc/%d”, pid);
> int procfd = open(buf, O_PATH);
> if (pidfd_get_pid(pidfd) != pid) {
> we lose;
> }
>
> But this is clunky.
>
> Do you think the clunky version is okay, or do you have a suggestion for making it better?
>
> —Andy
On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
>
> I don't think that we want or can make them equivalent since that would
> mean we depend on procfs.
Sure we can.
If /proc is enabled, then you always do that dance YOU ALREADY WROTE
THE CODE FOR to do the stupid ioctl.
And if /procfs isn't enabled, then you don't do that.
Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
the same damn thing.
And guess what? If /proc isn't enabled, then obviously pidfd_open()
gives you the /proc-less thing, but at least there is no crazy "two
different file descriptors for the same thing" situation, because then
the /proc one doesn't exist.
Notice? No incompatibility. No crazy stupid new "convert one to the
other", because "the other model" NEVER EXISTS. There is only one
pidfd - it might be proc-less if CONFIG_PROC isn't there, but let's
face it, nobody even cares, because nobody ever disabled /proc anyway.
And no need for some new "convert" interface (ioctl or other).
Problem solved.
Linus
On Sun, Mar 31, 2019 at 02:17:48PM -0700, Linus Torvalds wrote:
> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
> >
> > I don't think that we want or can make them equivalent since that would
> > mean we depend on procfs.
>
> Sure we can.
>
> If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> THE CODE FOR to do the stupid ioctl.
>
> And if /procfs isn't enabled, then you don't do that.
>
> Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> the same damn thing.
>
> And guess what? If /proc isn't enabled, then obviously pidfd_open()
> gives you the /proc-less thing, but at least there is no crazy "two
> different file descriptors for the same thing" situation, because then
> the /proc one doesn't exist.
>
> Notice? No incompatibility. No crazy stupid new "convert one to the
> other", because "the other model" NEVER EXISTS. There is only one
> pidfd - it might be proc-less if CONFIG_PROC isn't there, but let's
> face it, nobody even cares, because nobody ever disabled /proc anyway.
Thanks for the input. The problem Jann and I saw with this is that it
would be awkward to have the kernel open a file in some procfs instance,
since then userspace would have to specify which procfs instance the fd
should come from. Yes, it could probably always be the callers procfs
instance. But I'm concerned how this will interact with procfs mount
options such as hidepid={1,2} and pid namespaces.
One concern is what should happen when there's no procfs mount in the
callers mount namespace. Do we check for that in pidfd_open() and at
CLONE_PIDFD time and then refuse to give the caller a pidfd in that case
or do they still get a dirfd to /proc/<pid>? If we refuse it would mean
that having procfs mounted is necessary to get a pidfd if we don't
refuse it would mean that we're making process metadata available
without procfs being mounted which an administrator migh not want by not
having mounted procfs.
Another question is what happens when procfs is mounted with
hidepid={1,2}. If the caller accidently learns a pid they could
pidfd_open() it and circumvent hidepid={1,2}. So we would need to verify
the procfs mount options as well. I think these problems would be gone
if pidfds were opaque handles.
Christian
On Sun, Mar 31, 2019 at 10:18 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
> >
> > I don't think that we want or can make them equivalent since that would
> > mean we depend on procfs.
>
> Sure we can.
>
> If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> THE CODE FOR to do the stupid ioctl.
>
> And if /procfs isn't enabled, then you don't do that.
>
> Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> the same damn thing.
So this is already inherently broken with the /proc filesystem mounted
with hidepid=2. What type of object will it return when CONFIG_PROC_FS
is enabled, but I cannot see anyone's /proc entry except the
processe's running as my own user.
>
> And guess what? If /proc isn't enabled, then obviously pidfd_open()
> gives you the /proc-less thing, but at least there is no crazy "two
> different file descriptors for the same thing" situation, because then
> the /proc one doesn't exist.
There will be when you have no procfs mount for the PID namespace.
What do you return in that case? You run into the problem of two types
of descriptors on the same system.
If you choose to error out, the whole API is useless.
Yes, you can now argue that if I use hidepid=2 and cannot see other
process's /proc entry other than my own user, I wouldn't be able to
kill them anyway, so erroring out is fine, but let's not forget
CAP_KILL.
>
> Notice? No incompatibility. No crazy stupid new "convert one to the
> other", because "the other model" NEVER EXISTS. There is only one
> pidfd - it might be proc-less if CONFIG_PROC isn't there, but let's
> face it, nobody even cares, because nobody ever disabled /proc anyway.
I agree that the ioctl really sucks, but using /proc sucks even further.
Processes have their own namespace, and that is not the filesystem, it
is the PID namespace. As such, pidfd_open should be my open() for an
addressable task I can see in my PID namespace, unrelated to /proc.
Then, processes could pass a pidfd for something in their namespace
*elsewhere*, crossing namespace boundaries, as file descriptors remain
unaffected by that, and subject to kernel permissions, you could
implement some neat communication primitive just through such process
descriptors (which are stable).
It's like the filesystem in some sense, the processes form some
hierarchy, so I should be only be able to open something I can address
in this namespace. It is otherwise a gross layering violation if the
/proc filesystem works like some sort of backdoor to open some
different kind of descriptors pidfd calls accept (and it ends up
working for those I cannot even see).
>
> And no need for some new "convert" interface (ioctl or other).
>
> Problem solved.
>
> Linus
On Sun, Mar 31, 2019 at 3:03 PM Christian Brauner <[email protected]> wrote:
>
> Thanks for the input. The problem Jann and I saw with this is that it
> would be awkward to have the kernel open a file in some procfs instance,
> since then userspace would have to specify which procfs instance the fd
> should come from.
I would actually suggest we just make the rules be that the
pidfd_open() always return the internal /proc entry regardless of any
mount-point (or any "hidepid") but also suggest that exactly *because*
it gives you visibility into the target pid, you'd basically require
the strictest kind of control of the process you're trying to get the
pidfd of.
Ie likely something along the lines of
ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS)
kind of requirements.
But honestly, just how much do you need pidfd_open()? If this is
purely because somebody goes "oh, ASCII is expensive", then just stop
doing it entirely. It's not. It's fine. Going throuigh a filesystem is
a *good* thing, exactly because it allows MIS to control it.
So it's entirely possible that the right answer is: "just open
/proc/<pid>/", and accept the fact that everybody has it anyway, and
people who don't have it don't get the new functionality (with the
possible exception of clone(CLONE_PIDFD), which only gives you access
to a child you created yourself.
Linus
On Sun, Mar 31, 2019 at 03:16:47PM -0700, Linus Torvalds wrote:
> On Sun, Mar 31, 2019 at 3:03 PM Christian Brauner <[email protected]> wrote:
> >
> > Thanks for the input. The problem Jann and I saw with this is that it
> > would be awkward to have the kernel open a file in some procfs instance,
> > since then userspace would have to specify which procfs instance the fd
> > should come from.
>
> I would actually suggest we just make the rules be that the
> pidfd_open() always return the internal /proc entry regardless of any
> mount-point (or any "hidepid") but also suggest that exactly *because*
> it gives you visibility into the target pid, you'd basically require
> the strictest kind of control of the process you're trying to get the
> pidfd of.
>
> Ie likely something along the lines of
>
> ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS)
I can live with that but I would like to hear what Jann thinks too if
that's ok.
Christian
On Sun, Mar 31, 2019 at 3:16 PM Linus Torvalds
<[email protected]> wrote:
>
> I would actually suggest we just make the rules be that the
> pidfd_open() always return the internal /proc entry regardless of any
> mount-point (or any "hidepid")
The clever alternative, which might be the RightWay(tm) is to just
create a completely unattached dentry, and basically tie it into the
actual /proc filesystem hierarchy at lookup() time when somebody does
the openat() using it for the first time. You'd get a very simple
callback: since the dentry would be unattached, you'd be guaranteed to
get a "lookup()" from the VFS layer, and that lookup would then do the
"hook into the actual /proc filesystem".
We already kind of do things like that in the VFS layer when we have
unattached dentries (because of "look up by filehandle" etc). In many
ways the "pidfd_open()" really is exactly a "look up by file handle"
operation, it just so happens that the "file handle" is just the
pid/namespace combination.
And if the splice alias (which is what the VFS layer calls that "tie
aliased dentry to the parent" operation) fails, because the /proc
filesystem isn't mounted or whatever, then trying to look up names off
the thing will also fail.
It's a tiny bit too clever for my taste, and it's not *exactly* the
same thing as our normal inode alias handling, but it's pretty close
conceptually (and even practically).
So it would basically do something very similar to the ioctl, but it
would do it implicitly and automatically at that first lookup.
That would also mean that you'd not actually pay the cost of doing any
of this *unless* you also end up trying to open things in /proc using
that pidfd.
Linus
On Sun, Mar 31, 2019 at 04:40:15PM -0700, Linus Torvalds wrote:
> The clever alternative, which might be the RightWay(tm) is to just
> create a completely unattached dentry, and basically tie it into the
> actual /proc filesystem hierarchy at lookup() time when somebody does
> the openat() using it for the first time. You'd get a very simple
> callback: since the dentry would be unattached, you'd be guaranteed to
> get a "lookup()" from the VFS layer, and that lookup would then do the
> "hook into the actual /proc filesystem".
Ugh... Which vfsmount would you have to go with it?
> We already kind of do things like that in the VFS layer when we have
> unattached dentries (because of "look up by filehandle" etc). In many
> ways the "pidfd_open()" really is exactly a "look up by file handle"
> operation, it just so happens that the "file handle" is just the
> pid/namespace combination.
Except that we never let unattached _directory_ dentries out - if
we can't reattach them to the tree, open-by-handle will tell you to
take a hike.
> And if the splice alias (which is what the VFS layer calls that "tie
> aliased dentry to the parent" operation) fails, because the /proc
> filesystem isn't mounted or whatever, then trying to look up names off
> the thing will also fail.
> It's a tiny bit too clever for my taste, and it's not *exactly* the
> same thing as our normal inode alias handling, but it's pretty close
> conceptually (and even practically).
It's more than a tiny bit too clever for mine...
> So it would basically do something very similar to the ioctl, but it
> would do it implicitly and automatically at that first lookup.
>
> That would also mean that you'd not actually pay the cost of doing any
> of this *unless* you also end up trying to open things in /proc using
> that pidfd.
Al, back to normal life and digging through several flamefests from
hell...
On Sun, Mar 31, 2019 at 05:18:10PM -0700, Linus Torvalds wrote:
> On Sun, Mar 31, 2019 at 5:09 PM Al Viro <[email protected]> wrote:
> >
> > Ugh... Which vfsmount would you have to go with it?
>
> I'd literally just do a lookup of "/proc" in the current root
> directory in the lookup() function for that special pseudo-dentry.
>
> If it's not mounted, or not a /proc filesystem, screw it.
>
> > Except that we never let unattached _directory_ dentries out - if
> > we can't reattach them to the tree, open-by-handle will tell you to
> > take a hike.
>
> Absolutely. Which is why I said it's _conceptually_ similar to the alias lookup.
>
> And I suspect we can even use some of the same practical logic, but
> it's definitely not _exactly_ the same. This thing very much involves
> magic hooking into the lookup() function (but we then have to look up
> the alias not for the path we're looking up, but for the _parent_
> we're looking that path up in, which is very different from the normal
> case).
>
> > It's more than a tiny bit too clever for mine...
>
> Fair enough. The whole "just do the whole lookup at pidfd creation
> time" is certainly a whole lot simpler.
Even just from a pure maintenance perspective this sounds a better to me.
On Sun, Mar 31, 2019 at 5:09 PM Al Viro <[email protected]> wrote:
>
> Ugh... Which vfsmount would you have to go with it?
I'd literally just do a lookup of "/proc" in the current root
directory in the lookup() function for that special pseudo-dentry.
If it's not mounted, or not a /proc filesystem, screw it.
> Except that we never let unattached _directory_ dentries out - if
> we can't reattach them to the tree, open-by-handle will tell you to
> take a hike.
Absolutely. Which is why I said it's _conceptually_ similar to the alias lookup.
And I suspect we can even use some of the same practical logic, but
it's definitely not _exactly_ the same. This thing very much involves
magic hooking into the lookup() function (but we then have to look up
the alias not for the path we're looking up, but for the _parent_
we're looking that path up in, which is very different from the normal
case).
> It's more than a tiny bit too clever for mine...
Fair enough. The whole "just do the whole lookup at pidfd creation
time" is certainly a whole lot simpler.
> Al, back to normal life and digging through several flamefests from
> hell...
Yeah, I would like to see the actual aio.c pull request and the
use-after-free fixes. All the patches look fine, I just don't have the
final end result..
And that takes precedence anyway. Right now the "open_pidfd()" is a
future discussion.
Linus
On Mon, Apr 1, 2019 at 12:33 AM Christian Brauner <[email protected]> wrote:
> On Sun, Mar 31, 2019 at 03:16:47PM -0700, Linus Torvalds wrote:
> > On Sun, Mar 31, 2019 at 3:03 PM Christian Brauner <[email protected]> wrote:
> > > Thanks for the input. The problem Jann and I saw with this is that it
> > > would be awkward to have the kernel open a file in some procfs instance,
> > > since then userspace would have to specify which procfs instance the fd
> > > should come from.
> >
> > I would actually suggest we just make the rules be that the
> > pidfd_open() always return the internal /proc entry regardless of any
> > mount-point (or any "hidepid") but also suggest that exactly *because*
> > it gives you visibility into the target pid, you'd basically require
> > the strictest kind of control of the process you're trying to get the
> > pidfd of.
> >
> > Ie likely something along the lines of
> >
> > ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS)
>
> I can live with that but I would like to hear what Jann thinks too if
> that's ok.
Ah, yes. That seems reasonable. And, as Linus said, pidfd_open() is
less important if you can just do open("/proc/...") on systems with
procfs instead.
One minor detail to keep in mind for the future is that in a
straightforward implementation of this concept, if a non-capable
process is running in a mount namespace, but in the initial network
namespace, without any reachable /proc mount, it will be able to look
at information about other processes' network connections by first
using pidfd_open() on itself or by using clone(CLONE_PIDFD), then
looking at the "net" directory under the resulting file descriptor.
> On Mar 31, 2019, at 3:17 PM, Linus Torvalds <[email protected]> wrote:
>
>> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
>>
>> I don't think that we want or can make them equivalent since that would
>> mean we depend on procfs.
>
> Sure we can.
>
> If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> THE CODE FOR to do the stupid ioctl.
>
> And if /procfs isn't enabled, then you don't do that.
>
> Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> the same damn thing.
>
> And guess what? If /proc isn't enabled, then obviously pidfd_open()
> gives you the /proc-less thing, but at least there is no crazy "two
> different file descriptors for the same thing" situation, because then
> the /proc one doesn't exist.
>
I wish we could do this, and, in a clean design, it would be a no-brainer. But /proc has too much baggage. Just to mention two such things, there’s “net” and “../sys”. This crud is why we have all kinds of crazy rules that prevent programs in sandboxes from making a new mounts and mounting /proc in it. If we make it possible to clone a new process and this access /proc without having /proc mounted, we’ll open up a big can of worms.
Maybe we could have a sanitized view of /proc and make a pidfd be a directory fd pointing at that.
On Sun, Mar 31, 2019 at 05:18:10PM -0700, Linus Torvalds wrote:
> Yeah, I would like to see the actual aio.c pull request and the
> use-after-free fixes. All the patches look fine, I just don't have the
> final end result..
use-after-free fixes: ceph is already in mainline, Daniel's bpf fix is in
bpf tree (1da6c4d9140c "bpf: fix use after free in bpf_evict_inode"),
the rest is in vfs.git#fixes.
The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:
Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git fixes
for you to fetch changes up to 93b919da64c15b90953f96a536e5e61df896ca57:
debugfs: fix use-after-free on symlink traversal (2019-04-01 00:31:02 -0400)
----------------------------------------------------------------
Al Viro (3):
jffs2: fix use-after-free on symlink traversal
ubifs: fix use-after-free on symlink traversal
debugfs: fix use-after-free on symlink traversal
fs/debugfs/inode.c | 13 +++++++++----
fs/jffs2/readinode.c | 5 -----
fs/jffs2/super.c | 5 ++++-
fs/ubifs/super.c | 4 +---
4 files changed, 14 insertions(+), 13 deletions(-)
On Mon, Apr 01, 2019 at 07:37:42AM +0100, Al Viro wrote:
> On Sun, Mar 31, 2019 at 05:18:10PM -0700, Linus Torvalds wrote:
>
> > Yeah, I would like to see the actual aio.c pull request and the
> > use-after-free fixes. All the patches look fine, I just don't have the
> > final end result..
>
> use-after-free fixes: ceph is already in mainline, Daniel's bpf fix is in
> bpf tree (1da6c4d9140c "bpf: fix use after free in bpf_evict_inode"),
> the rest is in vfs.git#fixes.
... and aio stuff is
The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:
Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.aio
for you to fetch changes up to 7316b49c2a117ca0611bc9af779d2108b764a7f9:
aio: move sanity checks and request allocation to io_submit_one() (2019-03-17 20:52:32 -0400)
----------------------------------------------------------------
Al Viro (8):
aio: fold lookup_kiocb() into its sole caller
aio: keep io_event in aio_kiocb
aio: store event at final iocb_put()
Fix aio_poll() races
make aio_read()/aio_write() return int
aio: move dropping ->ki_eventfd into iocb_destroy()
deal with get_reqs_available() in aio_get_req() itself
aio: move sanity checks and request allocation to io_submit_one()
Linus Torvalds (1):
pin iocb through aio.
fs/aio.c | 338 ++++++++++++++++++++++++++++-----------------------------------
1 file changed, 150 insertions(+), 188 deletions(-)
Hi,
Le lundi 01 avril 2019 à 02:52 +0200, Jann Horn a écrit :
> One minor detail to keep in mind for the future is that in a
> straightforward implementation of this concept, if a non-capable
> process is running in a mount namespace, but in the initial network
> namespace, without any reachable /proc mount, it will be able to look
> at information about other processes' network connections by first
> using pidfd_open() on itself or by using clone(CLONE_PIDFD), then
> looking at the "net" directory under the resulting file descriptor.
I also think it would punch a hole in chroot() ... (but in 2019, nobody
should rely on it for security purpose).
Regards.
--
Yann Droneaud
OPTEYA
On Mon, Apr 1, 2019 at 1:53 AM Jann Horn <[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 12:33 AM Christian Brauner <[email protected]> wrote:
> > On Sun, Mar 31, 2019 at 03:16:47PM -0700, Linus Torvalds wrote:
> > > On Sun, Mar 31, 2019 at 3:03 PM Christian Brauner <[email protected]> wrote:
> > > > Thanks for the input. The problem Jann and I saw with this is that it
> > > > would be awkward to have the kernel open a file in some procfs instance,
> > > > since then userspace would have to specify which procfs instance the fd
> > > > should come from.
> > >
> > > I would actually suggest we just make the rules be that the
> > > pidfd_open() always return the internal /proc entry regardless of any
> > > mount-point (or any "hidepid") but also suggest that exactly *because*
> > > it gives you visibility into the target pid, you'd basically require
> > > the strictest kind of control of the process you're trying to get the
> > > pidfd of.
> > >
> > > Ie likely something along the lines of
> > >
> > > ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS)
This then restricts the usage of the API under YAMA etc to processes
which have CAP_SYS_PTRACE or are parents wanting to manage their
children (which has worked fine for all these years anyway).
If they were just stable file descriptors referring to the process,
none of it would be a problem. You would just need normal permissions
when signalling using the pidfd (and depending on if you have CAP_KILL
in the owning userns, you could send any signal to it), ptrace
privileges when you use the pidfd with ptrace itself (suppose we
extend it to take a pidfd in the future, and it has a well established
model), so there is some separation of responsibilities. This is more
useful in general for userspace IMO.
All of the complication comes from the fact that we're trying to bind
a pid reference to also its /proc directory, and there's now another
way to get to that apart from the mount namespace, when there is
already a race free to do so yourself.
> >
> > I can live with that but I would like to hear what Jann thinks too if
> > that's ok.
>
> Ah, yes. That seems reasonable. And, as Linus said, pidfd_open() is
> less important if you can just do open("/proc/...") on systems with
> procfs instead.
>
> One minor detail to keep in mind for the future is that in a
> straightforward implementation of this concept, if a non-capable
> process is running in a mount namespace, but in the initial network
> namespace, without any reachable /proc mount, it will be able to look
> at information about other processes' network connections by first
> using pidfd_open() on itself or by using clone(CLONE_PIDFD), then
> looking at the "net" directory under the resulting file descriptor.
On 2019-03-31, Andy Lutomirski <[email protected]> wrote:
> > On Mar 31, 2019, at 3:17 PM, Linus Torvalds <[email protected]> wrote:
> >> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
> >>
> >> I don't think that we want or can make them equivalent since that would
> >> mean we depend on procfs.
> >
> > Sure we can.
> >
> > If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> > THE CODE FOR to do the stupid ioctl.
> >
> > And if /procfs isn't enabled, then you don't do that.
> >
> > Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> > the same damn thing.
> >
> > And guess what? If /proc isn't enabled, then obviously pidfd_open()
> > gives you the /proc-less thing, but at least there is no crazy "two
> > different file descriptors for the same thing" situation, because then
> > the /proc one doesn't exist.
> >
>
> I wish we could do this, and, in a clean design, it would be a
> no-brainer. But /proc has too much baggage. Just to mention two such
> things, there’s “net” and “../sys”. This crud is why we have all
> kinds of crazy rules that prevent programs in sandboxes from making a
> new mounts and mounting /proc in it. If we make it possible to clone
> a new process and this access /proc without having /proc mounted,
> we’ll open up a big can of worms.
>
> Maybe we could have a sanitized view of /proc and make a pidfd be a
> directory fd pointing at that.
Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
an attempt to make it possible one day to mount /proc inside a container
without adding a bunch of masked paths), though it was just an idea and
I don't know if he ever had a patch for it.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Sun, Mar 31, 2019 at 08:13:38PM -0600, Andy Lutomirski wrote:
>
>
> > On Mar 31, 2019, at 3:17 PM, Linus Torvalds <[email protected]> wrote:
> >
> >> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
> >>
> >> I don't think that we want or can make them equivalent since that would
> >> mean we depend on procfs.
> >
> > Sure we can.
> >
> > If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> > THE CODE FOR to do the stupid ioctl.
> >
> > And if /procfs isn't enabled, then you don't do that.
> >
> > Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> > the same damn thing.
> >
> > And guess what? If /proc isn't enabled, then obviously pidfd_open()
> > gives you the /proc-less thing, but at least there is no crazy "two
> > different file descriptors for the same thing" situation, because then
> > the /proc one doesn't exist.
> >
>
> I wish we could do this, and, in a clean design, it would be a no-brainer. But /proc has too much baggage. Just to mention two such things, there’s “net” and “../sys”. This crud is why we have all kinds of crazy rules that prevent programs in sandboxes from making a new mounts and mounting /proc in it. If we make it possible to clone a new process and this access /proc without having /proc mounted, we’ll open up a big can of worms.
>
> Maybe we could have a sanitized view of /proc and make a pidfd be a directory fd pointing at that.
We can also just create something like an internal bind-mount without a
parent, i.e. similar to
open_tree(<internal-procfs-mount>, "<pid>", OPEN_TREE_CLONE);
on a clone(CLONE_PIDFD);
that would block any openat(fd, "..");
On Mon, Apr 1, 2019 at 2:04 PM Christian Brauner <[email protected]> wrote:
> On Sun, Mar 31, 2019 at 08:13:38PM -0600, Andy Lutomirski wrote:
> > > On Mar 31, 2019, at 3:17 PM, Linus Torvalds <[email protected]> wrote:
> > >> On Sun, Mar 31, 2019 at 2:10 PM Christian Brauner <[email protected]> wrote:
> > >>
> > >> I don't think that we want or can make them equivalent since that would
> > >> mean we depend on procfs.
> > >
> > > Sure we can.
> > >
> > > If /proc is enabled, then you always do that dance YOU ALREADY WROTE
> > > THE CODE FOR to do the stupid ioctl.
> > >
> > > And if /procfs isn't enabled, then you don't do that.
> > >
> > > Ta-daa. Done. No stupid ioctl, and now /proc and pidfd_open() return
> > > the same damn thing.
> > >
> > > And guess what? If /proc isn't enabled, then obviously pidfd_open()
> > > gives you the /proc-less thing, but at least there is no crazy "two
> > > different file descriptors for the same thing" situation, because then
> > > the /proc one doesn't exist.
> > >
> >
> > I wish we could do this, and, in a clean design, it would be a no-brainer. But /proc has too much baggage. Just to mention two such things, there’s “net” and “../sys”. This crud is why we have all kinds of crazy rules that prevent programs in sandboxes from making a new mounts and mounting /proc in it. If we make it possible to clone a new process and this access /proc without having /proc mounted, we’ll open up a big can of worms.
> >
> > Maybe we could have a sanitized view of /proc and make a pidfd be a directory fd pointing at that.
>
> We can also just create something like an internal bind-mount without a
> parent, i.e. similar to
>
> open_tree(<internal-procfs-mount>, "<pid>", OPEN_TREE_CLONE);
>
> on a clone(CLONE_PIDFD);
>
> that would block any openat(fd, "..");
Or we add a check to follow_dotdot()/follow_dotdot_rcu() that throws
an error if nd->path.mnt->mnt_flags has some new flag for "no dotdot
traversal on this mountpoint", and then set that on the internal procfs
mount... if Al Viro doesn't think that that's too hideous.
On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
>
> Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> an attempt to make it possible one day to mount /proc inside a container
> without adding a bunch of masked paths), though it was just an idea and
> I don't know if he ever had a patch for it.
I wonder if we really want a fill procfs2, or maybe we could just make
the pidfd readable (yes, it's a directory file descriptor, but we
could allow reading).
What are the *actual* use cases for opening /proc files through it? If
it's really just for a small subset that android wants to do this
(getting basic process state like "running" etc), rather than anything
else, then we could skip the whole /proc linking entirely and go the
other way instead (ie open_pidfd() would get that limited IO model,
and we could make the /proc directory node get the same limited IO
model).
Linus
On Mon, Apr 01, 2019 at 08:36:26AM -0700, Linus Torvalds wrote:
> On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
> >
> > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > an attempt to make it possible one day to mount /proc inside a container
> > without adding a bunch of masked paths), though it was just an idea and
> > I don't know if he ever had a patch for it.
>
> I wonder if we really want a fill procfs2, or maybe we could just make
No, I don't think we want a full procfs2.
> the pidfd readable (yes, it's a directory file descriptor, but we
> could allow reading).
Hm, if I understand this correctly, then the pidfd we return from
pidfd_open() would still be a dirfd but not tied to procfs? So I would
implement a "dummy" procfs anon_procfs that is a kernel internal mount
from which we allocate inodes, stash struct pid and off to userspace we
go?
>
> What are the *actual* use cases for opening /proc files through it? If
> it's really just for a small subset that android wants to do this
> (getting basic process state like "running" etc), rather than anything
> else, then we could skip the whole /proc linking entirely and go the
> other way instead (ie open_pidfd() would get that limited IO model,
> and we could make the /proc directory node get the same limited IO
> model).
From the original thread where metadata access was apparently very
important things that were listed:
<quote>
And how do you propose, given one of these handle objects, getting a
process's current priority, or its current oom score, or its list of
memory maps? As I mentioned in my original email, and which nobody has
addressed, if you don't use a dirfd as your process handle or you
don't provide an easy way to get one of these proc directory FDs, you
need to duplicate a lot of metadata access interfaces.
<quote>
( https://lore.kernel.org/lkml/CALCETrUFrFKC2YTLH7ViM_7XPYk3LNmNiaz6s8wtWo1pmJQXzg@mail.gmail.com/ )
Joel can probably speak best to this.
On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
> >
> > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > an attempt to make it possible one day to mount /proc inside a container
> > without adding a bunch of masked paths), though it was just an idea and
> > I don't know if he ever had a patch for it.
Couldn't this mode just be a relatively simple procfs mount option
instead of a whole new filesystem? It'd be a bit like hidepid, right?
The internal bind mount option and the no-dotdot-traversal options
also look good to me.
> I wonder if we really want a fill procfs2, or maybe we could just make
> the pidfd readable (yes, it's a directory file descriptor, but we
> could allow reading).
What would read(2) read?
> What are the *actual* use cases for opening /proc files through it? If
> it's really just for a small subset that android wants to do this
> (getting basic process state like "running" etc), rather than anything
> else, then we could skip the whole /proc linking entirely and go the
> other way instead (ie open_pidfd() would get that limited IO model,
> and we could make the /proc directory node get the same limited IO
> model).
We do a lot of process state inspection and manipulation, including
reading and writing the oom killer adjustment score, reading smaps,
and the occasional cgroup manipulation. More generally, I'd also like
to be able to write a race-free pkill(1). Doing this work via pidfd
would be convenient. More generally, we can't enumerate the specific
use cases, because what we want to do with processes isn't bounded in
advance, and we regularly find new things in /proc/pid that we want to
read and write. I'd rather not prematurely limit the applicability of
the pidfd interface, especially when there's a simple option (the
procfs directory file descriptor approach) that doesn't require
in-advance enumeration of supported process inspection and
manipulation actions or a separate per-option pidfd equivalent. I very
much want a general-purpose API that reuses the metadata interfaces
the kernel already exposes. It's not clear to me how this rich
interface could be matched by read(2) on a pidfd.
On Mon, Apr 1, 2019 at 4:55 PM Daniel Colascione <[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
> > >
> > > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > > an attempt to make it possible one day to mount /proc inside a container
> > > without adding a bunch of masked paths), though it was just an idea and
> > > I don't know if he ever had a patch for it.
>
> Couldn't this mode just be a relatively simple procfs mount option
> instead of a whole new filesystem? It'd be a bit like hidepid, right?
> The internal bind mount option and the no-dotdot-traversal options
> also look good to me.
>
> > I wonder if we really want a fill procfs2, or maybe we could just make
> > the pidfd readable (yes, it's a directory file descriptor, but we
> > could allow reading).
>
> What would read(2) read?
>
> > What are the *actual* use cases for opening /proc files through it? If
> > it's really just for a small subset that android wants to do this
> > (getting basic process state like "running" etc), rather than anything
> > else, then we could skip the whole /proc linking entirely and go the
> > other way instead (ie open_pidfd() would get that limited IO model,
> > and we could make the /proc directory node get the same limited IO
> > model).
>
> We do a lot of process state inspection and manipulation, including
> reading and writing the oom killer adjustment score, reading smaps,
> and the occasional cgroup manipulation. More generally, I'd also like
> to be able to write a race-free pkill(1). Doing this work via pidfd
> would be convenient. More generally, we can't enumerate the specific
> use cases, because what we want to do with processes isn't bounded in
> advance, and we regularly find new things in /proc/pid that we want to
> read and write. I'd rather not prematurely limit the applicability of
> the pidfd interface, especially when there's a simple option (the
> procfs directory file descriptor approach) that doesn't require
> in-advance enumeration of supported process inspection and
> manipulation actions or a separate per-option pidfd equivalent. I very
> much want a general-purpose API that reuses the metadata interfaces
> the kernel already exposes. It's not clear to me how this rich
> interface could be matched by read(2) on a pidfd.
With the POLLHUP model on a simple pidfd, you'd know when the process
you were referring to is dead (and one can map POLLPRI to dead and
POLLHUP to zombie, etc).
This is just an extension of the child process model, since you'd know
when it's dead, there's no race involved with opening the wrong
/proc/<PID> entry. The entire thing is already non-racy for direct
children, the same model can be extended to non-direct ones.
This simplifies a lot of things, now I am essentially just passing a
file descriptor pinning the struct pid associated with the original
task, and not process state around to others (I may even want the
other process to not read that stuff out even if it was allowed to, as
it wouldn't have been able to otherwise, due to being a in a different
mount namespace). KISS.
The upshot is this same descriptor can be returned from clone, which
would allow you to directly register it in your event loop (like
signalfd, timerfd, file fd, sockets, etc) and POLLIN generated for you
to read back its exit status (it is arguable if non-parents should be
returned a readable instance from pidfd_open, but parents sure
should). You can disable SIGCHLD for the child, and read back exit
status much later. The entire point of waiting and reaping was that
it'd be lost, but now you have a descriptor where it is kept for you
to consume.
On Mon, Apr 1, 2019 at 8:55 AM Daniel Colascione <[email protected]> wrote:
>
>
> > I wonder if we really want a fill procfs2, or maybe we could just make
> > the pidfd readable (yes, it's a directory file descriptor, but we
> > could allow reading).
>
> What would read(2) read?
We could make it read anything, but it would have to be something
people agree is sufficient (and not so expensive to create that rare
users of that data would find the overhead excessive).
Eg we could make it return the same thing that /proc/<pid>/status
reads right now.
But it sounds like you need pretty much all of /proc/<pid>/xyz:
> We do a lot of process state inspection and manipulation, including
> reading and writing the oom killer adjustment score, reading smaps,
> and the occasional cgroup manipulation. More generally, I'd also like
> to be able to write a race-free pkill(1)
I suspect most of what pkill wants is indeed in that 'status' file,
but other things aren't.
Linus
On Mon, Apr 1, 2019 at 9:01 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 8:55 AM Daniel Colascione <[email protected]> wrote:
> >
> >
> > > I wonder if we really want a fill procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
>
> We could make it read anything, but it would have to be something
> people agree is sufficient (and not so expensive to create that rare
> users of that data would find the overhead excessive).
In my exithand patch last year, I took a minimal approach and just had
read(2) read EOF once the process exited and blocked until then. Maybe
we could do that? But if we're supported waitid(2) on pidfds instead,
then we don't need read(2) at all on pidfds, right? Right now, I don't
see a situation in which I'd take advantage of a pidfd read(2) that
just gave me the running/sleeping/zombie state.
I'm also a bit worried about the implications of making a directory FD
also readable. Didn't problems with that come up in one of the
multiple-named-streams proposals?
> Eg we could make it return the same thing that /proc/<pid>/status
> reads right now.
>
> But it sounds like you need pretty much all of /proc/<pid>/xyz:
>
> > We do a lot of process state inspection and manipulation, including
> > reading and writing the oom killer adjustment score, reading smaps,
> > and the occasional cgroup manipulation. More generally, I'd also like
> > to be able to write a race-free pkill(1)
>
> I suspect most of what pkill wants is indeed in that 'status' file,
> but other things aren't.
Right. It's hard to predict what we might need. pkill also needs
/proc/pid/cmdline, FWIW.
> On Apr 1, 2019, at 9:36 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
>>
>> Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
>> an attempt to make it possible one day to mount /proc inside a container
>> without adding a bunch of masked paths), though it was just an idea and
>> I don't know if he ever had a patch for it.
>
> I wonder if we really want a fill procfs2, or maybe we could just make
> the pidfd readable (yes, it's a directory file descriptor, but we
> could allow reading).
There were patches to make procfs mount options work sensibly, and I think they got merged. So we could probably avoid a whole procfs2 by instead having it be procfs plus a special (maybe purely internal) mount option to restrict it.
On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski <[email protected]> wrote:
>
> With the POLLHUP model on a simple pidfd, you'd know when the process
> you were referring to is dead (and one can map POLLPRI to dead and
> POLLHUP to zombie, etc).
Adding ->poll() to the pidfd should be easy. Again, it would be
trivially be made to work for the directory fd you get from
/proc/<pid> too.
Yeah, yeah, pollable directories are odd, but the vfs layer doesn't
care about things like "is this a directory or not". It will just call
the f_op->poll() method.
Linus
On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski <[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 4:55 PM Daniel Colascione <[email protected]> wrote:
> >
> > On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai <[email protected]> wrote:
> > > >
> > > > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > > > an attempt to make it possible one day to mount /proc inside a container
> > > > without adding a bunch of masked paths), though it was just an idea and
> > > > I don't know if he ever had a patch for it.
> >
> > Couldn't this mode just be a relatively simple procfs mount option
> > instead of a whole new filesystem? It'd be a bit like hidepid, right?
> > The internal bind mount option and the no-dotdot-traversal options
> > also look good to me.
> >
> > > I wonder if we really want a fill procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
> >
> > > What are the *actual* use cases for opening /proc files through it? If
> > > it's really just for a small subset that android wants to do this
> > > (getting basic process state like "running" etc), rather than anything
> > > else, then we could skip the whole /proc linking entirely and go the
> > > other way instead (ie open_pidfd() would get that limited IO model,
> > > and we could make the /proc directory node get the same limited IO
> > > model).
> >
> > We do a lot of process state inspection and manipulation, including
> > reading and writing the oom killer adjustment score, reading smaps,
> > and the occasional cgroup manipulation. More generally, I'd also like
> > to be able to write a race-free pkill(1). Doing this work via pidfd
> > would be convenient. More generally, we can't enumerate the specific
> > use cases, because what we want to do with processes isn't bounded in
> > advance, and we regularly find new things in /proc/pid that we want to
> > read and write. I'd rather not prematurely limit the applicability of
> > the pidfd interface, especially when there's a simple option (the
> > procfs directory file descriptor approach) that doesn't require
> > in-advance enumeration of supported process inspection and
> > manipulation actions or a separate per-option pidfd equivalent. I very
> > much want a general-purpose API that reuses the metadata interfaces
> > the kernel already exposes. It's not clear to me how this rich
> > interface could be matched by read(2) on a pidfd.
>
> With the POLLHUP model on a simple pidfd, you'd know when the process
> you were referring to is dead (and one can map POLLPRI to dead and
> POLLHUP to zombie, etc).
I agree that there needs to be *some* kind of pollable file descriptor
that fires on process death. Should it be the pifd itself? I've been
thinking that a pollable directory would be too weird, but if that's
not actually a problem, I'm fine with it.
Mapping different state transitions to different poll bits is an
interesting idea, but I'm also worried about making poll the "source
of truth" with respect to process state as reported by a pidfd.
Consider a socket: for a socket, read(2)/recv(2) is the "source of
truth" and poll is just a hint that says "now is a good time to
attempt a read". Some other systems even allow for spurious wakeups
from poll. It's also worth keeping in mind that some pre-existing
event loops let you provide "is readable" and "is writable" callbacks,
but don't support the more exotic poll notification bits. That's why
I've tried to keep my proposals limited to poll signaling readability.
But I'm not really that picky. I just want something that works.
> This is just an extension of the child process model, since you'd know
> when it's dead, there's no race involved with opening the wrong
> /proc/<PID> entry. The entire thing is already non-racy for direct
> children, the same model can be extended to non-direct ones.
>
> This simplifies a lot of things, now I am essentially just passing a
> file descriptor pinning the struct pid associated with the original
> task, and not process state around to others (I may even want the
> other process to not read that stuff out even if it was allowed to, as
> it wouldn't have been able to otherwise, due to being a in a different
> mount namespace). KISS.
>
> The upshot is this same descriptor can be returned from clone, which
> would allow you to directly register it in your event loop (like
> signalfd, timerfd, file fd, sockets, etc) and POLLIN generated for you
> to read back its exit status (it is arguable if non-parents should be
> returned a readable instance from pidfd_open, but parents sure
> should). You can disable SIGCHLD for the child, and read back exit
> status much later. The entire point of waiting and reaping was that
> it'd be lost, but now you have a descriptor where it is kept for you
> to consume.
There's a subtlety: suppose I'm a library and I want to create a
private subprocess. I use the new clone facility, whatever it is, and
get a pidfd back. I need to be able to read the child's exit status
even if the child exits before clone returns in the parent. Doesn't
this requirement imply that the pidfd, kernel-side, contain something
a bit more than a struct pid?
On Mon, Apr 1, 2019 at 5:15 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski <[email protected]> wrote:
> >
> > With the POLLHUP model on a simple pidfd, you'd know when the process
> > you were referring to is dead (and one can map POLLPRI to dead and
> > POLLHUP to zombie, etc).
>
> Adding ->poll() to the pidfd should be easy. Again, it would be
> trivially be made to work for the directory fd you get from
> /proc/<pid> too.
>
> Yeah, yeah, pollable directories are odd, but the vfs layer doesn't
> care about things like "is this a directory or not". It will just call
> the f_op->poll() method.
I know, Andy even sent a patch for that long back. The question is,
this sure solves the immediate usecase, but it inhibits some very
powerful (and natural) things from being realised in the future, and
makes some choices harder.
Currently, pidfd_send_signal doesn't work across PID namespaces. It
would be possible to make it work, but some things need to be taken
care of, precisely, that one allows a task to open pidfds for tasks
*it* can see. Why? because you essentially isolate the PID namespace,
so your open() for this namespace suddenly doesn't start opening
things it cannot see through some other namespace (i.e. /proc),
precisely how you cannot open sockets in network namespaces from the
outside, though if you can setns, you should be able to (same with
pidfds). This makes for a nice delegation model, I can essentially put
a task in a namespace with no other tasks, keep pushing pidfds into
the damn thing, and subject to kernel permissions and capabilities it
can yield in the owner userns, signal the said task. You can extend
this to ptrace and other things, by making them accept a pidfd. This
means userspace has to explicitly pass such descriptors around to make
this work, like it does today (and how I can use an open socket and
accept connections whilst living in totally isolated network
namespace).
Besides that, /proc comes with too much stuff, it should be possible
to go from pidfd to /proc/<PID> and do whatever you wish to, but
atleast two things that require varying levels of capabilities of
inspection, the latter of which can be isolated by mount namespaces
even if the process would usially be allowed to peek into it and read
the entire thing, do not end being munged together. I can choose to
pass both, but if /proc dir fds *are* pidfds, you need the entire
complexity of masking and whatnot (which would be usable on its own,
no doubt), making directory descriptors pollable and readable, etc
etc.
>
> Linus
On Mon, Apr 1, 2019 at 9:22 AM Daniel Colascione <[email protected]> wrote:
>
> There's a subtlety: suppose I'm a library and I want to create a
> private subprocess. I use the new clone facility, whatever it is, and
> get a pidfd back. I need to be able to read the child's exit status
> even if the child exits before clone returns in the parent. Doesn't
> this requirement imply that the pidfd, kernel-side, contain something
> a bit more than a struct pid?
Note that clone() has always supported this, but it has basically
never been used.
One of the early thinkings behind clone() was that it could be used
for basically "AIO in user space", by having exactly these kinds of
library-private internal subthreads that are hidden as real threads.
It's why we have that special CSIGNAL mask for setting the exit
signal. It doesn't *just* set the signal to be sent when the thread
exits - setting the signal to something else than SIGCHLD actually
changes behavior in other ways too.
In particular a non-SIGCHLD thread won't be found by a regular
"waitpid()". You have to explicitly wait for it with __WCLONE (which
in turn is supposed to be used with the explicit pid to be waited
for).
Now, none of this was ever really used. The people who wanted AIO
wanted the brain-damaged POSIX kind of AIO, not something cool and
clever. So I suspect the whole exit-signal thing was just used for
random small per-project things.
Linus
On Mon, Apr 1, 2019 at 9:30 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 9:22 AM Daniel Colascione <[email protected]> wrote:
> >
> > There's a subtlety: suppose I'm a library and I want to create a
> > private subprocess. I use the new clone facility, whatever it is, and
> > get a pidfd back. I need to be able to read the child's exit status
> > even if the child exits before clone returns in the parent. Doesn't
> > this requirement imply that the pidfd, kernel-side, contain something
> > a bit more than a struct pid?
>
> Note that clone() has always supported this, but it has basically
> never been used.
>
> One of the early thinkings behind clone() was that it could be used
> for basically "AIO in user space", by having exactly these kinds of
> library-private internal subthreads that are hidden as real threads.
>
> It's why we have that special CSIGNAL mask for setting the exit
> signal. It doesn't *just* set the signal to be sent when the thread
> exits - setting the signal to something else than SIGCHLD actually
> changes behavior in other ways too.
>
> In particular a non-SIGCHLD thread won't be found by a regular
> "waitpid()". You have to explicitly wait for it with __WCLONE (which
> in turn is supposed to be used with the explicit pid to be waited
> for).
But doesn't the CSIGNAL approach still require that libraries somehow
coordinate which non-SIGCHLD signal they use? (Signal coordination a
separate problem, unfortunately.) If you don't set a signal, you don't
get any notification at all, and in that case, you have to burn a
thread on a blocking wait, right? And you don't *have* to wait for a
specific PID with __WCLONE. If two libraries each did a separate
__WCLONE or __WALL wait on all subprocesses, they'd still interfere,
AIUI, even if the interface isn't supposed to be used that way.
I like the pidfd way of doing private subprocesses a bit more than the
CSIGNAL approach because it's more "natural" (in that a pidfd is just
a handle like we have handles to tons of other things), integrates
with existing event and wait facilities, doesn't require any global
coordination at all, and works for more than this one use case.
(CSIGNAL is limited to immediate parents. You could, in principle,
SCM_RIGHTS a pidfd to someone else.)
> Now, none of this was ever really used. The people who wanted AIO
> wanted the brain-damaged POSIX kind of AIO, not something cool and
> clever. So I suspect the whole exit-signal thing was just used for
> random small per-project things.
AIO is a whole other ball of wax. :-)
From: Daniel Colascione
> Sent: 01 April 2019 17:45
...
> > Now, none of this was ever really used. The people who wanted AIO
> > wanted the brain-damaged POSIX kind of AIO, not something cool and
> > clever. So I suspect the whole exit-signal thing was just used for
> > random small per-project things.
>
> AIO is a whole other ball of wax. :-)
For AIO you need to go back and tell K&R to use the IO_QIO stuff from
RSX/11M rather than the 'IO in process context' they used for Unix :-)
Then you'd end up with the windows kernel.....
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Apr 1, 2019 at 9:45 AM Daniel Colascione <[email protected]> wrote:
>
> But doesn't the CSIGNAL approach still require that libraries somehow
> coordinate which non-SIGCHLD signal they use?
Yes. As mentioned, this was seldom used.
In some cases it's ok, eg aio would just specify the signal in
sigev_signo (except it was never fleshed out to have sival_ptr etc
because nobody ever did it, afaik).
> (Signal coordination a separate problem, unfortunately.)
Yeah, in several cases you would just want to signal a handler
directly, not an index into handlers.
Anyway, I'm just saying that the whole "reap children" part is not
necessarily tied to the pidfd thing. And honestly, nobody ever does
anything like that anyway in practice, because of all the other issues
it causes that you'll hit.
Linus
On Mon, Apr 01, 2019 at 09:01:29AM -0700, Linus Torvalds wrote:
> On Mon, Apr 1, 2019 at 8:55 AM Daniel Colascione <[email protected]> wrote:
> >
> >
> > > I wonder if we really want a fill procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
>
> We could make it read anything, but it would have to be something
> people agree is sufficient (and not so expensive to create that rare
> users of that data would find the overhead excessive).
>
> Eg we could make it return the same thing that /proc/<pid>/status
> reads right now.
>
> But it sounds like you need pretty much all of /proc/<pid>/xyz:
From what I gather from this thread we are still best of with using fds
to /proc/<pid> as pidfds. Linus, do you agree or have I misunderstood?
Yes, we can have an internal mount option to restrict access to various
parts of procfs from such pidfds or do the parent-less bind-mount trick
but I think this beats having a stunted dummy dirfd that we implement a
read method on.
One thing is that we also need something to disable access to the
"/proc/<pid>/net". One option could be to give the files in "net/" an
->open-handler which checks that our file->f_path.mnt is not one of our
special clone() mounts and if it is refuse the open.
To clarify the way forward:
Jann and I were discussing whether pidfd_open() still makes sense and
whether I shouldn't just jump straight to a first version of
CLONE_PIDFD.
Basically, if you have a system without CONFIG_PROC_FS it makes sense
that clone gives back an anon inode file descriptor as pidfds because
you can still signal threads in a race-free way. But it doesn't make a
lot of sense to have pidfd_open() in this scenario because you can't
really do anything with that pidfd apart from sending signals. And on a
system like that sending a signal is still racy. Since the process can
be recycled between learning the pid number and calling pidfd_open()
[1]. So it only makes sense to have _clone()_ give back anon_inode() fds
on a system without CONFIG_PROC_FS but it doesn't make sense for
pidfd_open() In other news, I think it makes more sense if I jump to the
implementation of CLONE_PIDFD instead of working on pidfd_open().
[1]: The only case - that seems rather far-fetched - where it makes
sense is when the parent wants to create that pidfd and hand it to
someone else.
Christian
On Mon, Apr 1, 2019 at 12:42 PM Christian Brauner <[email protected]> wrote:
>
> From what I gather from this thread we are still best of with using fds
> to /proc/<pid> as pidfds. Linus, do you agree or have I misunderstood?
That does seem to be the most flexible option.
> Yes, we can have an internal mount option to restrict access to various
> parts of procfs from such pidfds
I suspect you'd find that other parties might want such a "restrict
proc" mount option too, so I don't think it needs to be anything
internal.
But it would be pretty much independent of the pidfd issue, of course.
> One thing is that we also need something to disable access to the
> "/proc/<pid>/net". One option could be to give the files in "net/" an
> ->open-handler which checks that our file->f_path.mnt is not one of our
> special clone() mounts and if it is refuse the open.
I would expect that that would be part of the "restrict proc" mount options, no?
> Basically, if you have a system without CONFIG_PROC_FS it makes sense
> that clone gives back an anon inode file descriptor as pidfds because
> you can still signal threads in a race-free way. But it doesn't make a
> lot of sense to have pidfd_open() in this scenario because you can't
> really do anything with that pidfd apart from sending signals.
Well, people might want that.
But realistically, everybody enables /proc support anyway. Even if you
don't actually fully *mount* it in some restricted area, the support
is pretty much always there in any kernel config.
But yes, in general I agree that that also most likely means that a
separate system call for "open_pidfd()" isn't worth it.
Because if the main objection to /proc is that it exposes too much,
then I think a much better option is to see how to sanely restrict the
"too much" parts.
Because I think there might be a lot of people who want a restricted
/proc, in various container models etc.
Linus
On Mon, Apr 1, 2019 at 10:35 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 12:42 PM Christian Brauner <[email protected]> wrote:
> >
> > From what I gather from this thread we are still best of with using fds
> > to /proc/<pid> as pidfds. Linus, do you agree or have I misunderstood?
You mention the race about learning the PID, PID being recycled, and
pidfd_open getting the wrong reference.
This exists with the /proc model to way. How do you propose to address this?
>
> That does seem to be the most flexible option.
>
> > Yes, we can have an internal mount option to restrict access to various
> > parts of procfs from such pidfds
>
> I suspect you'd find that other parties might want such a "restrict
> proc" mount option too, so I don't think it needs to be anything
> internal.
>
> But it would be pretty much independent of the pidfd issue, of course.
>
> > One thing is that we also need something to disable access to the
> > "/proc/<pid>/net". One option could be to give the files in "net/" an
> > ->open-handler which checks that our file->f_path.mnt is not one of our
> > special clone() mounts and if it is refuse the open.
>
> I would expect that that would be part of the "restrict proc" mount options, no?
I was thinking if some part of /proc could be split in a procpidfs,
with possibly shared code, which means with the new mount API, you
could configure a superblock with restricted views by virtue of mount
options, per task. This only gives you the view as inside /proc/<PID>,
and you might not be able to lift restrctions depending on privileges
in owning userns of the task.
Now, this would mean we keep the notion of anon inode fds as pidfds,
and on supported systems, you could configure an instance, pass the
pidfd descriptor in the fsconfig stage (David Howells demonstrated
similar support for passing user and net namespace descriptors during
superblock configuration) and also the pidns descriptor. Without
mounting the fs, the mount fd can then be used to do metadata access
passing it to openat and friends, possibly passed to others. This is
more granular than restrcting access over the entire /proc instance,
and can be adjusted per process, and since you need the pidfd's pidns
descriptor, you cannot easily cross namespace boundaries with just a
single pidfd. You can also create many variants of restricted versions
of a single task's proc directory.
The pidfd API and /proc access can be connected while also keeping
them both separate, conceptually.
There are some details but this will wound up being much more powerful
(restricting /proc as a whole is ofcourse also fine, in addition to
this). There are some details on when and how someone should be able
to do this, but is something like this up for discussion?
>
> > Basically, if you have a system without CONFIG_PROC_FS it makes sense
> > that clone gives back an anon inode file descriptor as pidfds because
> > you can still signal threads in a race-free way. But it doesn't make a
> > lot of sense to have pidfd_open() in this scenario because you can't
> > really do anything with that pidfd apart from sending signals.
>
> Well, people might want that.
>
> But realistically, everybody enables /proc support anyway. Even if you
> don't actually fully *mount* it in some restricted area, the support
> is pretty much always there in any kernel config.
>
> But yes, in general I agree that that also most likely means that a
> separate system call for "open_pidfd()" isn't worth it.
>
> Because if the main objection to /proc is that it exposes too much,
> then I think a much better option is to see how to sanely restrict the
> "too much" parts.
>
> Because I think there might be a lot of people who want a restricted
> /proc, in various container models etc.
>
> Linus
On Mon, Apr 1, 2019 at 2:58 PM Jonathan Kowalski <[email protected]> wrote:
>
> You mention the race about learning the PID, PID being recycled, and
> pidfd_open getting the wrong reference.
>
> This exists with the /proc model to way. How do you propose to address this?
Note that that race exists _regardless_ of any interfaces.
pidfd_open() has the same race: any time you have a pid, the lifetime
of it is only as long as the process existing.
That's why we talked about the CLONE_PIDFD flag, which would return
the pidfd itself when creating a new process. That's one truly
race-free way to handle it.
Or just do the fork(), and know that the pid won't be re-used until
you've done the wait() for it, and block SIGCHLD until you've done the
lookup.
That said, in *practice*, you can probably use any of the racy "look
up pidfd using pid" models, as long as you just verify the end result
after you've opened it.
That verification could be as simple as "look up the parent pid of the
pidfd I got", if you know you created it with fork() (and you can
obviously track what _other_ thread you yourself created, so you can
verify whether it is yours or not).
For example, using "openat(pidfd, "status", ..)", but also by just
tracking what you've done waitpid() on (but you need to look out for
subtle races with another thread being in the process of doing so).
Or you can just say that as long as you got the pidfd quickly after
the fork(), any pid wrapping attack is practically not possible even
if it might be racy in theory.
Linus
On Mon, Apr 1, 2019 at 3:13 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 1, 2019 at 2:58 PM Jonathan Kowalski <[email protected]> wrote:
> >
> > You mention the race about learning the PID, PID being recycled, and
> > pidfd_open getting the wrong reference.
> >
> > This exists with the /proc model to way. How do you propose to address this?
>
> Note that that race exists _regardless_ of any interfaces.
> pidfd_open() has the same race: any time you have a pid, the lifetime
> of it is only as long as the process existing.
>
> That's why we talked about the CLONE_PIDFD flag, which would return
> the pidfd itself when creating a new process. That's one truly
> race-free way to handle it.
Yes. Returning a pidfd from clone seems like a simple and robust approach.
> Or just do the fork(), and know that the pid won't be re-used until
> you've done the wait() for it, and block SIGCHLD until you've done the
> lookup.
That doesn't work when some other thread is running a waitpid(-1)
loop. I think it's important to create an interface that libraries can
use without global coordination.
> That said, in *practice*, you can probably use any of the racy "look
> up pidfd using pid" models, as long as you just verify the end result
> after you've opened it.
>
> That verification could be as simple as "look up the parent pid of the
> pidfd I got", if you know you created it with fork() (and you can
> obviously track what _other_ thread you yourself created, so you can
> verify whether it is yours or not).
>
> For example, using "openat(pidfd, "status", ..)", but also by just
> tracking what you've done waitpid() on (but you need to look out for
> subtle races with another thread being in the process of doing so).
>
> Or you can just say that as long as you got the pidfd quickly after
> the fork(), any pid wrapping attack is practically not possible even
> if it might be racy in theory.
I don't like ignoring races just because they're rare. The cost of
complete race freedom for the process interface is low considering the
work we're doing on pidfds anyway.
* Daniel Colascione:
> But doesn't the CSIGNAL approach still require that libraries somehow
> coordinate which non-SIGCHLD signal they use? (Signal coordination a
> separate problem, unfortunately.)
There's already an allocation mechanism for realtime signals in glibc,
via __libc_allocate_rtsig. I don't know what it is about: It's clearly
intended as an external interface, yet there isn't a declaration in any
installed header file. ALSA has some optional code to use it, but I
don't think distributions compile ALSA in that way; it always uses SIGIO.
Thanks,
Florian