Hey everyone,
This adds the ability to wait on processes using pidfds. This is one of
the few missing pieces to make it possible to manage processes using
only pidfds.
pidfd_wait() does explicitly not allow scoping of the process referred
to by the pidfd, i.e. generic wait requests such as P_PGID and P_ALL and
other trickery such as passing in negative values and so on is not
supported.
The series also adds support for CLONE_WAIT_PID which prevents the
process referred to by the pidfd to appear in generic wait requests
similar to what is the default in FreeBSD.
This feature has been requested multiple times when I gave talks about
this work (for extensions see [1]).
The syscall patch is rather small overall. The largest portion of this
series are the tests and the cleanup to remove struct waitid_info from
exit.c.
Thanks!
Christian
[1]: In the future, we might add something like
CLONE_WAIT_STATUS_FOREIGN (or some better name).
Such pidfds will allow anyone to retrieve the exit status of a
non-parent process by calling pidfd_wait() on it without reaping
it. This has also been requested quite often and fits nicely into
the api. But that's for a later patchset.
Christian Brauner (5):
exit: kill struct waitid_info
pidfd: add pidfd_wait()
arch: wire-up pidfd_wait()
pidfd: add CLONE_WAIT_PID
pidfd: add pidfd_wait tests
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 4 +-
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 5 +
include/linux/sched.h | 1 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 1 +
kernel/exit.c | 191 ++++++----
kernel/fork.c | 19 +-
kernel/signal.c | 7 +-
tools/testing/selftests/pidfd/pidfd.h | 25 ++
tools/testing/selftests/pidfd/pidfd_test.c | 14 -
tools/testing/selftests/pidfd/pidfd_wait.c | 398 ++++++++++++++++++++
29 files changed, 606 insertions(+), 85 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_wait.c
--
2.22.0
The code here uses a struct waitid_info to catch basic information about
process exit including the pid, uid, status, and signal that caused the
process to exit. This information is then stuffed into a struct siginfo
for the waitid() syscall. That seems like an odd thing to do. We can
just pass down a siginfo_t struct directly which let's us cleanup and
simplify the whole code quite a bit.
This patch also simplifies the following implementation of pidfd_wait().
Note that this changes how struct siginfo is filled in for users of
waitid. POSIX doesn't mandate anything else other than si_pid, si_uid,
si_code, and si_signo. So it seems up to the implementation. In case
anyone relies on the old behavior we can just revert but I highly doubt
that users fill in siginfo_t before a call to waitid and expect all
fields other then the ones mention above to be untouched.
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
---
kernel/exit.c | 101 ++++++++++++++++++--------------------------------
1 file changed, 36 insertions(+), 65 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7f458a..73392a455b72 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -994,19 +994,12 @@ SYSCALL_DEFINE1(exit_group, int, error_code)
return 0;
}
-struct waitid_info {
- pid_t pid;
- uid_t uid;
- int status;
- int cause;
-};
-
struct wait_opts {
enum pid_type wo_type;
int wo_flags;
struct pid *wo_pid;
- struct waitid_info *wo_info;
+ kernel_siginfo_t *wo_info;
int wo_stat;
struct rusage *wo_rusage;
@@ -1058,7 +1051,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
int state, status;
pid_t pid = task_pid_vnr(p);
uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p));
- struct waitid_info *infop;
+ kernel_siginfo_t *infop;
if (!likely(wo->wo_flags & WEXITED))
return 0;
@@ -1169,14 +1162,14 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
infop = wo->wo_info;
if (infop) {
if ((status & 0x7f) == 0) {
- infop->cause = CLD_EXITED;
- infop->status = status >> 8;
+ infop->si_code = CLD_EXITED;
+ infop->si_status = status >> 8;
} else {
- infop->cause = (status & 0x80) ? CLD_DUMPED : CLD_KILLED;
- infop->status = status & 0x7f;
+ infop->si_code = (status & 0x80) ? CLD_DUMPED : CLD_KILLED;
+ infop->si_status = status & 0x7f;
}
- infop->pid = pid;
- infop->uid = uid;
+ infop->si_pid = pid;
+ infop->si_uid = uid;
}
return pid;
@@ -1215,7 +1208,7 @@ static int *task_stopped_code(struct task_struct *p, bool ptrace)
static int wait_task_stopped(struct wait_opts *wo,
int ptrace, struct task_struct *p)
{
- struct waitid_info *infop;
+ kernel_siginfo_t *infop;
int exit_code, *p_code, why;
uid_t uid = 0; /* unneeded, required by compiler */
pid_t pid;
@@ -1270,10 +1263,10 @@ static int wait_task_stopped(struct wait_opts *wo,
infop = wo->wo_info;
if (infop) {
- infop->cause = why;
- infop->status = exit_code;
- infop->pid = pid;
- infop->uid = uid;
+ infop->si_code = why;
+ infop->si_status = exit_code;
+ infop->si_pid = pid;
+ infop->si_uid = uid;
}
return pid;
}
@@ -1286,7 +1279,7 @@ static int wait_task_stopped(struct wait_opts *wo,
*/
static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
{
- struct waitid_info *infop;
+ kernel_siginfo_t *infop;
pid_t pid;
uid_t uid;
@@ -1316,13 +1309,13 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
put_task_struct(p);
infop = wo->wo_info;
- if (!infop) {
- wo->wo_stat = 0xffff;
+ if (infop) {
+ infop->si_code = CLD_CONTINUED;
+ infop->si_status = SIGCONT;
+ infop->si_pid = pid;
+ infop->si_uid = uid;
} else {
- infop->cause = CLD_CONTINUED;
- infop->pid = pid;
- infop->uid = uid;
- infop->status = SIGCONT;
+ wo->wo_stat = 0xffff;
}
return pid;
}
@@ -1552,7 +1545,7 @@ static long do_wait(struct wait_opts *wo)
return retval;
}
-static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
+static long kernel_waitid(int which, pid_t upid, kernel_siginfo_t *infop,
int options, struct rusage *ru)
{
struct wait_opts wo;
@@ -1602,33 +1595,22 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
infop, int, options, struct rusage __user *, ru)
{
struct rusage r;
- struct waitid_info info = {.status = 0};
- long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
- int signo = 0;
-
+ kernel_siginfo_t kinfo = {
+ .si_signo = 0,
+ };
+ long err = kernel_waitid(which, upid, infop ? &kinfo : NULL, options,
+ ru ? &r : NULL);
if (err > 0) {
- signo = SIGCHLD;
+ kinfo.si_signo = SIGCHLD;
err = 0;
if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
return -EFAULT;
}
- if (!infop)
- return err;
- if (!user_access_begin(infop, sizeof(*infop)))
+ if (infop && copy_siginfo_to_user(infop, &kinfo))
return -EFAULT;
- unsafe_put_user(signo, &infop->si_signo, Efault);
- unsafe_put_user(0, &infop->si_errno, Efault);
- unsafe_put_user(info.cause, &infop->si_code, Efault);
- unsafe_put_user(info.pid, &infop->si_pid, Efault);
- unsafe_put_user(info.uid, &infop->si_uid, Efault);
- unsafe_put_user(info.status, &infop->si_status, Efault);
- user_access_end();
- return err;
-Efault:
- user_access_end();
- return -EFAULT;
+ return err ;
}
long kernel_wait4(pid_t upid, int __user *stat_addr, int options,
@@ -1722,11 +1704,13 @@ COMPAT_SYSCALL_DEFINE5(waitid,
struct compat_rusage __user *, uru)
{
struct rusage ru;
- struct waitid_info info = {.status = 0};
- long err = kernel_waitid(which, pid, &info, options, uru ? &ru : NULL);
- int signo = 0;
+ kernel_siginfo_t kinfo = {
+ .si_signo = 0,
+ };
+ long err = kernel_waitid(which, pid, infop ? &kinfo : NULL, options,
+ uru ? &ru : NULL);
if (err > 0) {
- signo = SIGCHLD;
+ kinfo.si_signo = SIGCHLD;
err = 0;
if (uru) {
/* kernel_waitid() overwrites everything in ru */
@@ -1739,23 +1723,10 @@ COMPAT_SYSCALL_DEFINE5(waitid,
}
}
- if (!infop)
- return err;
-
- if (!user_access_begin(infop, sizeof(*infop)))
+ if (infop && copy_siginfo_to_user32(infop, &kinfo))
return -EFAULT;
- unsafe_put_user(signo, &infop->si_signo, Efault);
- unsafe_put_user(0, &infop->si_errno, Efault);
- unsafe_put_user(info.cause, &infop->si_code, Efault);
- unsafe_put_user(info.pid, &infop->si_pid, Efault);
- unsafe_put_user(info.uid, &infop->si_uid, Efault);
- unsafe_put_user(info.status, &infop->si_status, Efault);
- user_access_end();
return err;
-Efault:
- user_access_end();
- return -EFAULT;
}
#endif
--
2.22.0
This adds the pidfd_wait() syscall.
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With this syscall implemented parts of userspace that
want to use this api can finally switch to managing processes completely
through pidfds if they so desire (cf. [1]).
The pidfd_wait() syscall does not allow scoping of the process
identified by the pidfd, i.e. it explicitly does not try to mirror the
behavior of: wait4(-1), wait4(0), waitid(P_ALL), waitid(P_PGID) etc. It
only allows for semantics equivalent to wait4(pid), waitid(P_PID). Users
that need scoping should rely on pid-based wait*() syscalls for now.
pidfd_wait() allows to specify which changes to wait for. The states to
wait for can be or-ed and are specified in the states argument:
WEXITED Wait for children that have terminated.
WSTOPPED Wait for children that have been stopped by
delivery of a signal.
WCONTINUED Wait for (previously stopped) children that have
been resumed by delivery of SIGCONT.
WUNTRACED Return if a child has stopped.
The behavior of pidfd_wait() can be further modified by specifying the
following or-able options in the flags argument:
__WCLONE Only wait for a process that delivers no signal
or a different signal than SIGCHLD to the parent
on termination.
__WALL Wait for all children indepedent of whether or
not they deliver no signal or another signal
than SIGCHLD to the parent on termination.
parent
__WNOTHREAD Do not wait for children of other threads in the
same thread-group.
WNOHANG Return immediately if no child has exited.
WNOWAIT Leave the child in a waitable state.
pidfd_wait() takes an additional siginfo_t argument. If it is non-NULL,
pidfd_wait() will fill in si_pid, si_uid, si_signo, si_status, and
si_code. The si_code field will be set to one of CLD_EXITED, CLD_KILLED,
CLD_DUMPED, CLD_STOPPED, CLD_TRAPPED, or CLD_CONTINUED.
Information about resource usage of the process in question is returned
in the struct rusage argument of pidfd_wait().
On success, pidfd_wait() will return the pid of the process the pidfd
referred to. On failure, a negative error code will be returned.
/* Prior approach */
The first implementation was based on a flag WPIDFD which got added to
the wait*() system calls. However, that involved passing the pidfd
through the pid_t pid argument and do in-kernel type switching based on
the flag which feels like a really unclean solution and overall like a
mishmash of two apis. This is something we luckily have avoided so far
and I think we're better off in the long run if we keep it that way.
/* References */
[1]: https://github.com/systemd/systemd/issues/13101
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
---
include/linux/pid.h | 5 +++
kernel/exit.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 8 +++++
kernel/signal.c | 7 ++--
4 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..443cd4108943 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -72,6 +72,11 @@ extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/kernel/exit.c b/kernel/exit.c
index 73392a455b72..8086c76e1959 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1738,3 +1738,90 @@ __weak void abort(void)
panic("Oops failed to kill thread");
}
EXPORT_SYMBOL(abort);
+
+static int copy_rusage_to_user_any(struct rusage *kru, struct rusage __user *ru)
+{
+#ifdef CONFIG_COMPAT
+ if (in_compat_syscall())
+ return put_compat_rusage(kru, (struct compat_rusage __user *)ru);
+#endif
+ return copy_to_user(ru, kru, sizeof(*kru));
+}
+
+static int copy_siginfo_to_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
+{
+#ifdef CONFIG_COMPAT
+ if (in_compat_syscall())
+ return copy_siginfo_to_user32(
+ (struct compat_siginfo __user *)info, kinfo);
+#endif
+ return copy_siginfo_to_user(info, kinfo);
+}
+
+SYSCALL_DEFINE6(pidfd_wait, int, pidfd, int __user *, stat_addr,
+ siginfo_t __user *, info, struct rusage __user *, ru,
+ unsigned int, states, unsigned int, flags)
+{
+ long ret;
+ struct fd f;
+ struct pid *pid;
+ struct wait_opts wo;
+ struct rusage kru = {};
+ kernel_siginfo_t kinfo = {
+ .si_signo = 0,
+ };
+
+ if (pidfd < 0)
+ return -EINVAL;
+
+ if (states & ~(WEXITED | WSTOPPED | WCONTINUED | WUNTRACED))
+ return -EINVAL;
+
+ if (!(states & (WEXITED | WSTOPPED | WCONTINUED | WUNTRACED)))
+ return -EINVAL;
+
+ if (flags & ~(__WNOTHREAD | __WCLONE | __WALL | WNOWAIT | WNOHANG))
+ return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid)) {
+ ret = PTR_ERR(pid);
+ goto out_fdput;
+ }
+
+ wo = (struct wait_opts){
+ .wo_type = PIDTYPE_PID,
+ .wo_pid = pid,
+ .wo_flags = states | flags,
+ .wo_info = info ? &kinfo : NULL,
+ .wo_rusage = ru ? &kru : NULL,
+ };
+
+ ret = do_wait(&wo);
+ if (ret > 0) {
+ kinfo.si_signo = SIGCHLD;
+
+ if (stat_addr && put_user(wo.wo_stat, stat_addr)) {
+ ret = -EFAULT;
+ goto out_fdput;
+ }
+
+ if (ru && copy_rusage_to_user_any(&kru, ru)) {
+ ret = -EFAULT;
+ goto out_fdput;
+ }
+ } else {
+ kinfo.si_signo = 0;
+ }
+
+ if (info && copy_siginfo_to_user_any(&kinfo, info))
+ ret = -EFAULT;
+
+out_fdput:
+ fdput(f);
+ return ret;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b4148..baaff6570517 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1743,6 +1743,14 @@ const struct file_operations pidfd_fops = {
#endif
};
+struct pid *pidfd_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return ERR_PTR(-EBADF);
+}
+
static void __delayed_free_task(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
diff --git a/kernel/signal.c b/kernel/signal.c
index 91b789dd6e72..2e567f64812f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3672,8 +3672,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
static struct pid *pidfd_to_pid(const struct file *file)
{
- if (file->f_op == &pidfd_fops)
- return file->private_data;
+ struct pid *pid;
+
+ pid = pidfd_pid(file);
+ if (!IS_ERR(pid))
+ return pid;
return tgid_pidfd_to_pid(file);
}
--
2.22.0
This wires up the pidfd_wait() syscall into all arches at once.
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 4 +++-
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 4 ++++
include/uapi/asm-generic/unistd.h | 4 +++-
20 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..ca3e593f0c7a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+548 common pidfd_wait sys_pidfd_wait
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..5e448d915b2f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 439
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..ca77c9d4f7a1 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -877,7 +877,9 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
__SYSCALL(__NR_fspick, sys_fspick)
#define __NR_pidfd_open 434
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
-#define __NR_clone3 435
+#define __NR_pidfd_wait 438
+__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
+#define __NR_clone3 439
__SYSCALL(__NR_clone3, sys_clone3)
/*
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..f038afaced9b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..51f86f7b4cec 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..24f912ac5dfa 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index c9c879ec9b6d..edc144c4040c 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 n32 pidfd_wait sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index bbce9159caa1..da4486ea0f4f 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 n64 pidfd_wait sys_pidfd_wait
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9653591428ec..d738688e50d8 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 o32 pidfd_wait sys_pidfd_wait
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 670d1371aca1..d60f44d8145c 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -432,3 +432,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 3331749aab20..3309bf5f5370 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index a90d3e945445..ef8ba9a9c3bb 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait sys_pidfd_wait
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..9e786a198bfd 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..ef4f13907894 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c00019abd076..76ec8c905745 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+438 i386 pidfd_wait sys_pidfd_wait __ia32_sys_pidfd_wait
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..733c206130f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+438 common pidfd_wait __x64_sys_pidfd_wait
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..417203971292 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_wait sys_pidfd_wait
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88145da7d140..760e8eacb93c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -932,6 +932,10 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
+asmlinkage long sys_pidfd_wait(int pidfd, int __user *stat_addr,
+ struct siginfo __user *info,
+ struct rusage __user *ru, unsigned int states,
+ unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1be0e798e362..0dd5b9d4dba0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,9 +850,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_pidfd_wait 438
+__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 439
/*
* 32 bit systems traditionally used different
--
2.22.0
If CLONE_WAIT_PID is set the newly created process will not be
considered by process wait requests that wait generically on children
such as:
syscall(__NR_wait4, -1, wstatus, options, rusage)
syscall(__NR_waitpid, -1, wstatus, options)
syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
syscall(__NR_waitpid, -pid, wstatus, options)
syscall(__NR_wait4, -pid, wstatus, options, rusage)
A process created with CLONE_WAIT_PID can only be waited upon with a
focussed wait call. This ensures that processes can be reaped even if
all file descriptors referring to it are closed.
/* Usecases */
This feature has been requested in discussions when I presented this
work multiple times. Here are concrete use cases people have:
1. Process managers that would like to use pidfd for all process
watching needs require this feature.
A process manager (e.g. PID 1) that needs to reap all children
assigned to it needs to invoke some form of waitall request as
outlined above. This has to be done since the process manager might
not know about processes that got re-parented to it. Without
CLONE_WAIT_PID the process manager will end up reaping processes it
uses pidfds to watch for since they are crucial internal processes.
2. Various libraries want to be able to fork off helper processes
internally that do not otherwise affect the program they are used in.
This is currently not possible.
However, if a process invokes a waitall request the internal
helper process of the library might get reaped, confusing the library
which expected it to reap it itself.
Careful programs will thus generally avoid waitall requests which is
inefficient.
3. A general class of programs are ones that use event loops (e.g. GLib,
systemd, and LXC etc.). Such event loops currently call focused wait
requests iteratively on all processes they are configured to watch to
avoid waitall request pitfalls.
This is ugly and inefficient since it cannot be used to watch large
numbers of file descriptors without paying the O(n) cost on each
event loop iteration.
/* Prior art */
FreeBSD has a similar concept (cf. [1], [2]). They are currently doing
it the other way around, i.e. by default all procdescs are not visible
in waitall requests. Howver, originally, they allowed procdescs to
appear in waitall and changed it later (cf. [1]).
Currently, CLONE_WAIT_PID can only be used in conjunction with
CLONE_PIDFD since the usecases above only make sense when used in
combination with both.
/* References */
[1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201054
[2]: https://svnweb.freebsd.org/base/head/sys/kern/kern_exit.c
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
include/linux/sched.h | 1 +
include/uapi/linux/sched.h | 1 +
kernel/exit.c | 3 +++
kernel/fork.c | 11 ++++++++++-
4 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811487f5..f0166f630a1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_WAIT_PID 0x20000000 /* This task will not appear in generic wait requests */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..ffb1cac18e4e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -32,6 +32,7 @@
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
+#define CLONE_WAIT_PID 0x200000000ULL /* set if process should not appear in generic wait requests */
/*
* Arguments for the clone3 syscall
diff --git a/kernel/exit.c b/kernel/exit.c
index 8086c76e1959..aa15de1108b2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1019,6 +1019,9 @@ eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
if (!eligible_pid(wo, p))
return 0;
+ if ((p->flags & PF_WAIT_PID) && (wo->wo_type != PIDTYPE_PID))
+ return 0;
+
/*
* Wait for all children (clone and not) if __WALL is set or
* if it is traced by us.
diff --git a/kernel/fork.c b/kernel/fork.c
index baaff6570517..a067f3876e2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct *copy_process(
delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
p->flags |= PF_FORKNOEXEC;
+ if (clone_flags & CLONE_WAIT_PID)
+ p->flags |= PF_WAIT_PID;
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
rcu_copy_process(p);
@@ -2590,7 +2592,7 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
* All lower bits of the flag word are taken.
* Verify that no other unknown flags are passed along.
*/
- if (kargs->flags & ~CLONE_LEGACY_FLAGS)
+ if (kargs->flags & ~(CLONE_LEGACY_FLAGS | CLONE_WAIT_PID))
return false;
/*
@@ -2600,6 +2602,13 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
if (kargs->flags & (CLONE_DETACHED | CSIGNAL))
return false;
+ /*
+ * Currently only allow CLONE_WAIT_PID for processes created as
+ * pidfds until someone needs this feature for regular pids too.
+ */
+ if ((kargs->flags & CLONE_WAIT_PID) && !(kargs->flags & CLONE_PIDFD))
+ return false;
+
if ((kargs->flags & (CLONE_THREAD | CLONE_PARENT)) &&
kargs->exit_signal)
return false;
--
2.22.0
Add tests for pidfd_wait() and CLONE_WAIT_PID:
- test that pidfd_wait() syscall is supported
- test that pidfd_wait() does not accept unknown flags
- test that pidfd_wait() can wait on a pidfd
- test that pidfd_wait() can wait on a pidfd and return struct rusage
- test that pidfd_wait() can wait on a pidfd and return siginfo_t
- test that pidfd_wait() works with WEXITED
- test that pidfd_wait() works with WSTOPPED
- test that pidfd_wait() works with WUNTRACED
- test that pidfd_wait() works with WCONTINUED
- test that pidfd_wait() works with WNOWAIT
- test that pidfd_wait() works with WNOHANG
- test that CLONE_WAIT_PID does not appear in waitid(P_ALL) requests
- test that CLONE_WAIT_PID does not appear in waitid(P_PGID) requests
- test that CLONE_WAIT_PID does not appear in wait4(-1) requests
- test that CLONE_WAIT_PID does not appear in wait4(0) requests
- test that CLONE_WAIT_PID does appear in waitid(P_PID) requests
- test that CLONE_WAIT_PID does appear in wait4(pid) requests
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
---
tools/testing/selftests/pidfd/pidfd.h | 25 ++
tools/testing/selftests/pidfd/pidfd_test.c | 14 -
tools/testing/selftests/pidfd/pidfd_wait.c | 398 +++++++++++++++++++++
3 files changed, 423 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_wait.c
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 8452e910463f..d56bca179b9e 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -16,6 +16,26 @@
#include "../kselftest.h"
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+#ifndef CLONE_WAIT_PID
+#define CLONE_WAIT_PID 0x200000000ULL
+#endif
+
+#ifndef __NR_pidfd_open
+#define __NR_pidfd_open -1
+#endif
+
+#ifndef __NR_pidfd_send_signal
+#define __NR_pidfd_send_signal -1
+#endif
+
+#ifndef __NR_clone3
+#define __NR_clone3 -1
+#endif
+
/*
* The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
* That means, when it wraps around any pid < 300 will be skipped.
@@ -53,5 +73,10 @@ int wait_for_pid(pid_t pid)
return WEXITSTATUS(status);
}
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+ unsigned int flags)
+{
+ return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
#endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_test.c b/tools/testing/selftests/pidfd/pidfd_test.c
index 7eaa8a3de262..42e3eb494d72 100644
--- a/tools/testing/selftests/pidfd/pidfd_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_test.c
@@ -21,20 +21,12 @@
#include "pidfd.h"
#include "../kselftest.h"
-#ifndef __NR_pidfd_send_signal
-#define __NR_pidfd_send_signal -1
-#endif
-
#define str(s) _str(s)
#define _str(s) #s
#define CHILD_THREAD_MIN_WAIT 3 /* seconds */
#define MAX_EVENTS 5
-#ifndef CLONE_PIDFD
-#define CLONE_PIDFD 0x00001000
-#endif
-
static pid_t pidfd_clone(int flags, int *pidfd, int (*fn)(void *))
{
size_t stack_size = 1024;
@@ -47,12 +39,6 @@ static pid_t pidfd_clone(int flags, int *pidfd, int (*fn)(void *))
#endif
}
-static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
- unsigned int flags)
-{
- return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
-}
-
static int signal_received;
static void set_signal_received_on_sigusr1(int sig)
diff --git a/tools/testing/selftests/pidfd/pidfd_wait.c b/tools/testing/selftests/pidfd/pidfd_wait.c
new file mode 100644
index 000000000000..fcbd10098067
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_wait.c
@@ -0,0 +1,398 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <signal.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+
+#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
+
+static pid_t sys_clone3(struct clone_args *args)
+{
+ return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static pid_t sys_pidfd_wait(int pidfd, int *wstatus, siginfo_t *info,
+ struct rusage *ru, unsigned int states,
+ unsigned int flags)
+{
+ return syscall(__NR_pidfd_wait, pidfd, wstatus, info, ru, states, flags);
+}
+
+static int test_pidfd_wait_syscall_support(void)
+{
+ int ret;
+ const char *test_name = "pidfd_wait check for support";
+
+ ret = sys_pidfd_wait(-EBADF, NULL, NULL, NULL, WEXITED, 0);
+ if (ret < 0 && errno == ENOSYS)
+ ksft_exit_skip("%s test: pidfd_wait() syscall not supported\n",
+ test_name);
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+static int test_pidfd_wait_simple(void)
+{
+ const char *test_name = "pidfd_wait simple";
+ int pidfd = -1, status = 0;
+ pid_t parent_tid = -1;
+ struct clone_args args = {
+ .parent_tid = ptr_to_u64(&parent_tid),
+ .pidfd = ptr_to_u64(&pidfd),
+ .flags = CLONE_PIDFD | CLONE_PARENT_SETTID,
+ .exit_signal = SIGCHLD,
+ };
+ pid_t pid;
+
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_SUCCESS);
+
+ pid = sys_pidfd_wait(pidfd, NULL, NULL, NULL, -1, 0);
+ if (pid > 0)
+ ksft_exit_fail_msg(
+ "%s test: succeeded to wait on process with invalid flags passed",
+ test_name);
+
+ pid = sys_pidfd_wait(pidfd, NULL, NULL, NULL, 0, -1);
+ if (pid > 0)
+ ksft_exit_fail_msg(
+ "%s test: succeeded to wait on process with invalid flags passed",
+ test_name);
+
+ pid = sys_pidfd_wait(pidfd, NULL, NULL, NULL, 0, 0);
+ if (pid > 0)
+ ksft_exit_fail_msg(
+ "%s test: succeeded to wait on process with invalid flags passed",
+ test_name);
+
+ pid = sys_pidfd_wait(pidfd, NULL, NULL, NULL, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ pidfd = -1;
+ parent_tid = -1;
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_SUCCESS);
+
+ pid = sys_pidfd_wait(pidfd, &status, NULL, NULL, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ ksft_exit_fail_msg(
+ "%s test: unexpected status received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ pidfd = -1;
+ parent_tid = -1;
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_FAILURE);
+
+ pid = sys_pidfd_wait(pidfd, &status, NULL, NULL, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ if (!WIFEXITED(status) || !WEXITSTATUS(status))
+ ksft_exit_fail_msg(
+ "%s test: unexpected status received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+static int test_pidfd_wait_rusage(void)
+{
+ const char *test_name = "pidfd_wait rusage";
+ int pidfd = -1, status = 0;
+ pid_t parent_tid = -1;
+ struct clone_args args = {
+ .parent_tid = ptr_to_u64(&parent_tid),
+ .pidfd = ptr_to_u64(&pidfd),
+ .flags = CLONE_PIDFD | CLONE_PARENT_SETTID,
+ .exit_signal = SIGCHLD,
+ };
+ pid_t pid;
+ struct rusage rusage = { 0 };
+
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_SUCCESS);
+
+ pid = sys_pidfd_wait(pidfd, &status, NULL, &rusage, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ ksft_exit_fail_msg(
+ "%s test: unexpected status received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+static int test_pidfd_wait_siginfo(void)
+{
+ const char *test_name = "pidfd_wait siginfo";
+ int pidfd = -1, status = 0;
+ pid_t parent_tid = -1;
+ struct clone_args args = {
+ .parent_tid = ptr_to_u64(&parent_tid),
+ .pidfd = ptr_to_u64(&pidfd),
+ .flags = CLONE_PIDFD | CLONE_PARENT_SETTID,
+ .exit_signal = SIGCHLD,
+ };
+ pid_t pid;
+ siginfo_t info = {
+ .si_signo = 0,
+ };
+
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_SUCCESS);
+
+ pid = sys_pidfd_wait(pidfd, &status, &info, NULL, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ ksft_exit_fail_msg(
+ "%s test: unexpected status received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ if (info.si_signo != SIGCHLD)
+ ksft_exit_fail_msg(
+ "%s test: unexpected si_signo value %d received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, info.si_signo, parent_tid, pidfd,
+ strerror(errno));
+
+ if (info.si_code != CLD_EXITED)
+ ksft_exit_fail_msg(
+ "%s test: unexpected si_code value %d received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, info.si_code, parent_tid, pidfd,
+ strerror(errno));
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+static int test_pidfd_wait_states(void)
+{
+ const char *test_name = "pidfd_wait states";
+ int pidfd = -1, status = 0;
+ pid_t parent_tid = -1;
+ struct clone_args args = {
+ .parent_tid = ptr_to_u64(&parent_tid),
+ .pidfd = ptr_to_u64(&pidfd),
+ .flags = CLONE_PIDFD | CLONE_PARENT_SETTID,
+ .exit_signal = SIGCHLD,
+ };
+ int ret;
+ pid_t pid;
+ siginfo_t info = {
+ .si_signo = 0,
+ };
+
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0) {
+ kill(getpid(), SIGSTOP);
+ kill(getpid(), SIGSTOP);
+ exit(EXIT_SUCCESS);
+ }
+
+ pid = sys_pidfd_wait(pidfd, &status, &info, NULL, WSTOPPED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ ret = sys_pidfd_send_signal(pidfd, SIGCONT, NULL, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ pid = sys_pidfd_wait(pidfd, &status, &info, NULL, WCONTINUED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ pid = sys_pidfd_wait(pidfd, &status, &info, NULL, WUNTRACED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ ret = sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ pid = sys_pidfd_wait(pidfd, &status, &info, NULL, WEXITED, 0);
+ if (pid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ close(pidfd);
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+static int test_clone_wait_pid(void)
+{
+ const char *test_name = "CLONE_WAIT_PID";
+ int pidfd = -1, status = 0;
+ pid_t parent_tid = -1;
+ struct clone_args args = {
+ .parent_tid = ptr_to_u64(&parent_tid),
+ .pidfd = ptr_to_u64(&pidfd),
+ .flags = CLONE_PARENT_SETTID | CLONE_WAIT_PID,
+ .exit_signal = SIGCHLD,
+ };
+ int ret;
+ pid_t pid, rpid;
+
+ pid = sys_clone3(&args);
+ if (pid > 0)
+ ksft_exit_fail_msg(
+ "%s test: managed to create new process with CLONE_WAIT_PID but without CLONE_PIDFD",
+ test_name);
+
+ args.flags |= CLONE_PIDFD;
+ pid = sys_clone3(&args);
+ if (pid < 0)
+ ksft_exit_fail_msg("%s test: failed to create new process %s",
+ test_name, strerror(errno));
+
+ if (pid == 0)
+ exit(EXIT_SUCCESS);
+
+ ret = syscall(__NR_waitid, P_ALL, -1, NULL, WEXITED, NULL);
+ if (ret == 0)
+ ksft_exit_fail_msg(
+ "%s test: managed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through waitid(P_ALL) request",
+ test_name, parent_tid, pidfd);
+
+ ret = syscall(__NR_waitid, P_PGID, getpgid(0), NULL, WEXITED, NULL);
+ if (ret == 0)
+ ksft_exit_fail_msg(
+ "%s test: managed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through waitid(P_PGID) request",
+ test_name, parent_tid, pidfd);
+
+ rpid = syscall(__NR_wait4, -1, NULL, 0, NULL);
+ if (rpid > 0)
+ ksft_exit_fail_msg(
+ "%s test: managed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through wait4(-1) request",
+ test_name, parent_tid, pidfd);
+
+ rpid = syscall(__NR_wait4, 0, NULL, 0, NULL);
+ if (rpid > 0)
+ ksft_exit_fail_msg(
+ "%s test: managed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through wait4(0) request",
+ test_name, parent_tid, pidfd);
+
+ ret = syscall(__NR_waitid, P_PID, pid, NULL, WEXITED | WNOWAIT | WNOHANG, NULL);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through waitid(P_PID) request: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ rpid = sys_pidfd_wait(pidfd, NULL, NULL, NULL, WEXITED, WNOWAIT | WNOHANG);
+ if (rpid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+
+ rpid = syscall(__NR_wait4, pid, &status, 0, NULL);
+ if (rpid < 0)
+ ksft_exit_fail_msg(
+ "%s test: failed to wait on process created with CLONE_PIDFD | CLONE_WAIT_PID with pid %d and pidfd %d through wait4(%d) request: %s",
+ test_name, parent_tid, pidfd, parent_tid, strerror(errno));
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ ksft_exit_fail_msg(
+ "%s test: unexpected status received after waiting on process with pid %d and pidfd %d: %s",
+ test_name, parent_tid, pidfd, strerror(errno));
+ close(pidfd);
+
+ ksft_test_result_pass("%s test: Passed\n", test_name);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(6);
+
+ test_pidfd_wait_syscall_support();
+ test_pidfd_wait_simple();
+ test_pidfd_wait_rusage();
+ test_pidfd_wait_siginfo();
+ test_pidfd_wait_states();
+ test_clone_wait_pid();
+
+ return ksft_exit_pass();
+}
--
2.22.0
On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
>
> The code here uses a struct waitid_info to catch basic information about
> process exit including the pid, uid, status, and signal that caused the
> process to exit. This information is then stuffed into a struct siginfo
> for the waitid() syscall. That seems like an odd thing to do. We can
> just pass down a siginfo_t struct directly which let's us cleanup and
> simplify the whole code quite a bit.
Ack. Except I'd like the commit message to explain where this comes
from instead of that "That seems like an odd thing to do".
The _original_ reason for "struct waitid_info" was that "siginfo_t" is
huge because of all the insane padding that various architectures do.
So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
of siginfo to syscall itself") very much to avoid stack usage issues.
And I quote:
collect the information needed for siginfo into
a small structure (waitid_info)
simply because "sigset_t" was big.
But that size came from the explicit "pad it out to 128 bytes for
future expansion that will never happen", and the kernel using the
same exact sigset_t that was exposed to user space.
Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
the kernel") we got rid of the insane padding for in-kernel use,
exactly because it causes issues like this.
End result: that "struct waitid_info" no longer makes sense. It's not
appreciably smaller than kernel_siginfo_t is today, but it made sense
at the time.
Linus
On July 24, 2019 7:45:38 PM GMT+02:00, Linus Torvalds <[email protected]> wrote:
>On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner
><[email protected]> wrote:
>>
>> This adds the pidfd_wait() syscall.
>
>I despise this patch.
>
>Why can't this just be a new P_PIDFD flag, and then use
>"waitid(P_PIDFD, pidfd, ...);"
>
>Yes, yes, yes, I realize that "pidfd" is of type "int", and waitid()
>takes an argument of type pid_t, but it's the same type in the end,
>and it does seem like the whole *point* of "waitid()" is that
>"idtype_t idtype" which tells you what kind of ID you're passing it.
>
> Linus
Well in that case we could add P_PIDFD.
But then I would like to _only_ enable it for waitid(). How's that sound?
Christian
On July 24, 2019 7:50:49 PM GMT+02:00, Christian Brauner <[email protected]> wrote:
>On July 24, 2019 7:45:38 PM GMT+02:00, Linus Torvalds
><[email protected]> wrote:
>>On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner
>><[email protected]> wrote:
>>>
>>> This adds the pidfd_wait() syscall.
>>
>>I despise this patch.
>>
>>Why can't this just be a new P_PIDFD flag, and then use
>>"waitid(P_PIDFD, pidfd, ...);"
>>
>>Yes, yes, yes, I realize that "pidfd" is of type "int", and waitid()
>>takes an argument of type pid_t, but it's the same type in the end,
>>and it does seem like the whole *point* of "waitid()" is that
>>"idtype_t idtype" which tells you what kind of ID you're passing it.
>>
>> Linus
>
>Well in that case we could add P_PIDFD.
>But then I would like to _only_ enable it for waitid(). How's that
>sound?
>
>Christian
Ah, sorry, just saw that that's what you suggested.
Christian
On Wed, Jul 24, 2019 at 4:48 PM Christian Brauner <[email protected]> wrote:
> If CLONE_WAIT_PID is set the newly created process will not be
> considered by process wait requests that wait generically on children
> such as:
>
> syscall(__NR_wait4, -1, wstatus, options, rusage)
> syscall(__NR_waitpid, -1, wstatus, options)
> syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
> syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
> syscall(__NR_waitpid, -pid, wstatus, options)
> syscall(__NR_wait4, -pid, wstatus, options, rusage)
>
> A process created with CLONE_WAIT_PID can only be waited upon with a
> focussed wait call. This ensures that processes can be reaped even if
> all file descriptors referring to it are closed.
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index baaff6570517..a067f3876e2e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct *copy_process(
> delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> p->flags |= PF_FORKNOEXEC;
> + if (clone_flags & CLONE_WAIT_PID)
> + p->flags |= PF_WAIT_PID;
> INIT_LIST_HEAD(&p->children);
> INIT_LIST_HEAD(&p->sibling);
> rcu_copy_process(p);
This means that if a process with PF_WAIT_PID forks, the child
inherits the flag, right? That seems unintended? You might have to add
something like "if (clone_flags & CLONE_THREAD == 0) p->flags &=
~PF_WAIT_PID;" before this. (I think threads do have to inherit the
flag so that the case where a non-leader thread of the child goes
through execve and steals the leader's identity is handled properly.)
Or you could cram it somewhere into signal_struct instead of on the
task - that might be a more logical place for it?
On July 24, 2019 9:07:54 PM GMT+02:00, Jann Horn <[email protected]> wrote:
>On Wed, Jul 24, 2019 at 8:27 PM Christian Brauner
><[email protected]> wrote:
>> On July 24, 2019 8:14:26 PM GMT+02:00, Jann Horn <[email protected]>
>wrote:
>> >On Wed, Jul 24, 2019 at 4:48 PM Christian Brauner
>> ><[email protected]> wrote:
>> >> If CLONE_WAIT_PID is set the newly created process will not be
>> >> considered by process wait requests that wait generically on
>children
>> >> such as:
>> >>
>> >> syscall(__NR_wait4, -1, wstatus, options, rusage)
>> >> syscall(__NR_waitpid, -1, wstatus, options)
>> >> syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
>> >> syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
>> >> syscall(__NR_waitpid, -pid, wstatus, options)
>> >> syscall(__NR_wait4, -pid, wstatus, options, rusage)
>> >>
>> >> A process created with CLONE_WAIT_PID can only be waited upon with
>a
>> >> focussed wait call. This ensures that processes can be reaped even
>if
>> >> all file descriptors referring to it are closed.
>> >[...]
>> >> diff --git a/kernel/fork.c b/kernel/fork.c
>> >> index baaff6570517..a067f3876e2e 100644
>> >> --- a/kernel/fork.c
>> >> +++ b/kernel/fork.c
>> >> @@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct
>> >*copy_process(
>> >> delayacct_tsk_init(p); /* Must remain after
>> >dup_task_struct() */
>> >> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
>> >> p->flags |= PF_FORKNOEXEC;
>> >> + if (clone_flags & CLONE_WAIT_PID)
>> >> + p->flags |= PF_WAIT_PID;
>> >> INIT_LIST_HEAD(&p->children);
>> >> INIT_LIST_HEAD(&p->sibling);
>> >> rcu_copy_process(p);
>> >
>> >This means that if a process with PF_WAIT_PID forks, the child
>> >inherits the flag, right? That seems unintended? You might have to
>add
>> >something like "if (clone_flags & CLONE_THREAD == 0) p->flags &=
>> >~PF_WAIT_PID;" before this. (I think threads do have to inherit the
>> >flag so that the case where a non-leader thread of the child goes
>> >through execve and steals the leader's identity is handled
>properly.)
>> >Or you could cram it somewhere into signal_struct instead of on the
>> >task - that might be a more logical place for it?
>>
>> Hm, CLONE_WAIT_PID is only useable with CLONE_PIDFD which in turn is
>> not useable with CLONE_THREAD.
>> But we should probably make that explicit for CLONE_WAIT_PID too.
>
>To clarify:
>
>This code looks buggy to me because p->flags is inherited from the
>parent, with the exception of flags that are explicitly stripped out.
>Since PF_WAIT_PID is not stripped out, this means that if task A
>creates a child B with clone(CLONE_WAIT_PID), and then task B uses
>fork() to create a child C, then B will not be able to use
>wait(&status) to wait for C since C inherited PF_WAIT_PID from B.
>
>The obvious way to fix that would be to always strip out PF_WAIT_PID;
>but that would also be wrong, because if task B creates a thread C,
>and then C calls execve(), the task_struct of B goes away and B's TGID
>is taken over by C. When C eventually exits, it should still obey the
>CLONE_WAIT_PID (since to A, it's all the same process). Therefore, if
>p->flags is used to track whether the task was created with
>CLONE_WAIT_PID, PF_WAIT_PID must be inherited if CLONE_THREAD is set.
>So:
>
>diff --git a/kernel/fork.c b/kernel/fork.c
>index d8ae0f1b4148..b32e1e9a6c9c 100644
>--- a/kernel/fork.c
>+++ b/kernel/fork.c
>@@ -1902,6 +1902,10 @@ static __latent_entropy struct task_struct
>*copy_process(
> delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> p->flags |= PF_FORKNOEXEC;
>+ if (!(clone_flags & CLONE_THREAD))
>+ p->flags &= ~PF_PF_WAIT_PID;
>+ if (clone_flags & CLONE_WAIT_PID)
>+ p->flags |= PF_PF_WAIT_PID;
> INIT_LIST_HEAD(&p->children);
> INIT_LIST_HEAD(&p->sibling);
> rcu_copy_process(p);
>
>An alternative would be to not use p->flags at all, but instead make
>this a property of the signal_struct - since the property is shared by
>all threads, that might make more sense?
Yeah, thanks for clarifying.
Now it's more obvious.
I need to take a look at the signal struct before I can say anything about this.
On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
>
> This adds the pidfd_wait() syscall.
I despise this patch.
Why can't this just be a new P_PIDFD flag, and then use
"waitid(P_PIDFD, pidfd, ...);"
Yes, yes, yes, I realize that "pidfd" is of type "int", and waitid()
takes an argument of type pid_t, but it's the same type in the end,
and it does seem like the whole *point* of "waitid()" is that
"idtype_t idtype" which tells you what kind of ID you're passing it.
Linus
On Wed, Jul 24, 2019 at 8:27 PM Christian Brauner <[email protected]> wrote:
> On July 24, 2019 8:14:26 PM GMT+02:00, Jann Horn <[email protected]> wrote:
> >On Wed, Jul 24, 2019 at 4:48 PM Christian Brauner
> ><[email protected]> wrote:
> >> If CLONE_WAIT_PID is set the newly created process will not be
> >> considered by process wait requests that wait generically on children
> >> such as:
> >>
> >> syscall(__NR_wait4, -1, wstatus, options, rusage)
> >> syscall(__NR_waitpid, -1, wstatus, options)
> >> syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
> >> syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
> >> syscall(__NR_waitpid, -pid, wstatus, options)
> >> syscall(__NR_wait4, -pid, wstatus, options, rusage)
> >>
> >> A process created with CLONE_WAIT_PID can only be waited upon with a
> >> focussed wait call. This ensures that processes can be reaped even if
> >> all file descriptors referring to it are closed.
> >[...]
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index baaff6570517..a067f3876e2e 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct
> >*copy_process(
> >> delayacct_tsk_init(p); /* Must remain after
> >dup_task_struct() */
> >> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> >> p->flags |= PF_FORKNOEXEC;
> >> + if (clone_flags & CLONE_WAIT_PID)
> >> + p->flags |= PF_WAIT_PID;
> >> INIT_LIST_HEAD(&p->children);
> >> INIT_LIST_HEAD(&p->sibling);
> >> rcu_copy_process(p);
> >
> >This means that if a process with PF_WAIT_PID forks, the child
> >inherits the flag, right? That seems unintended? You might have to add
> >something like "if (clone_flags & CLONE_THREAD == 0) p->flags &=
> >~PF_WAIT_PID;" before this. (I think threads do have to inherit the
> >flag so that the case where a non-leader thread of the child goes
> >through execve and steals the leader's identity is handled properly.)
> >Or you could cram it somewhere into signal_struct instead of on the
> >task - that might be a more logical place for it?
>
> Hm, CLONE_WAIT_PID is only useable with CLONE_PIDFD which in turn is
> not useable with CLONE_THREAD.
> But we should probably make that explicit for CLONE_WAIT_PID too.
To clarify:
This code looks buggy to me because p->flags is inherited from the
parent, with the exception of flags that are explicitly stripped out.
Since PF_WAIT_PID is not stripped out, this means that if task A
creates a child B with clone(CLONE_WAIT_PID), and then task B uses
fork() to create a child C, then B will not be able to use
wait(&status) to wait for C since C inherited PF_WAIT_PID from B.
The obvious way to fix that would be to always strip out PF_WAIT_PID;
but that would also be wrong, because if task B creates a thread C,
and then C calls execve(), the task_struct of B goes away and B's TGID
is taken over by C. When C eventually exits, it should still obey the
CLONE_WAIT_PID (since to A, it's all the same process). Therefore, if
p->flags is used to track whether the task was created with
CLONE_WAIT_PID, PF_WAIT_PID must be inherited if CLONE_THREAD is set.
So:
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b4148..b32e1e9a6c9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1902,6 +1902,10 @@ static __latent_entropy struct task_struct *copy_process(
delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
p->flags |= PF_FORKNOEXEC;
+ if (!(clone_flags & CLONE_THREAD))
+ p->flags &= ~PF_PF_WAIT_PID;
+ if (clone_flags & CLONE_WAIT_PID)
+ p->flags |= PF_PF_WAIT_PID;
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
rcu_copy_process(p);
An alternative would be to not use p->flags at all, but instead make
this a property of the signal_struct - since the property is shared by
all threads, that might make more sense?
On July 24, 2019 8:14:26 PM GMT+02:00, Jann Horn <[email protected]> wrote:
>On Wed, Jul 24, 2019 at 4:48 PM Christian Brauner
><[email protected]> wrote:
>> If CLONE_WAIT_PID is set the newly created process will not be
>> considered by process wait requests that wait generically on children
>> such as:
>>
>> syscall(__NR_wait4, -1, wstatus, options, rusage)
>> syscall(__NR_waitpid, -1, wstatus, options)
>> syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
>> syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
>> syscall(__NR_waitpid, -pid, wstatus, options)
>> syscall(__NR_wait4, -pid, wstatus, options, rusage)
>>
>> A process created with CLONE_WAIT_PID can only be waited upon with a
>> focussed wait call. This ensures that processes can be reaped even if
>> all file descriptors referring to it are closed.
>[...]
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index baaff6570517..a067f3876e2e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct
>*copy_process(
>> delayacct_tsk_init(p); /* Must remain after
>dup_task_struct() */
>> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
>> p->flags |= PF_FORKNOEXEC;
>> + if (clone_flags & CLONE_WAIT_PID)
>> + p->flags |= PF_WAIT_PID;
>> INIT_LIST_HEAD(&p->children);
>> INIT_LIST_HEAD(&p->sibling);
>> rcu_copy_process(p);
>
>This means that if a process with PF_WAIT_PID forks, the child
>inherits the flag, right? That seems unintended? You might have to add
>something like "if (clone_flags & CLONE_THREAD == 0) p->flags &=
>~PF_WAIT_PID;" before this. (I think threads do have to inherit the
>flag so that the case where a non-leader thread of the child goes
>through execve and steals the leader's identity is handled properly.)
>Or you could cram it somewhere into signal_struct instead of on the
>task - that might be a more logical place for it?
Hm, CLONE_WAIT_PID is only useable with CLONE_PIDFD which in turn is
not useable with CLONE_THREAD.
But we should probably make that explicit for CLONE_WAIT_PID too.
On Wed, Jul 24, 2019 at 10:37:34AM -0700, Linus Torvalds wrote:
> On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
> >
> > The code here uses a struct waitid_info to catch basic information about
> > process exit including the pid, uid, status, and signal that caused the
> > process to exit. This information is then stuffed into a struct siginfo
> > for the waitid() syscall. That seems like an odd thing to do. We can
> > just pass down a siginfo_t struct directly which let's us cleanup and
> > simplify the whole code quite a bit.
>
> Ack. Except I'd like the commit message to explain where this comes
> from instead of that "That seems like an odd thing to do".
I'll respin and will add an explanation based on the info you gave
below. Thanks for providing the background on that!
Christian
>
> The _original_ reason for "struct waitid_info" was that "siginfo_t" is
> huge because of all the insane padding that various architectures do.
>
> So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
> of siginfo to syscall itself") very much to avoid stack usage issues.
> And I quote:
>
> collect the information needed for siginfo into
> a small structure (waitid_info)
>
> simply because "sigset_t" was big.
>
> But that size came from the explicit "pad it out to 128 bytes for
> future expansion that will never happen", and the kernel using the
> same exact sigset_t that was exposed to user space.
>
> Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
> the kernel") we got rid of the insane padding for in-kernel use,
> exactly because it causes issues like this.
>
> End result: that "struct waitid_info" no longer makes sense. It's not
> appreciably smaller than kernel_siginfo_t is today, but it made sense
> at the time.
>
> Linus
On 07/24, Christian Brauner wrote:
>
> Note that this changes how struct siginfo is filled in for users of
> waitid.
Namely, copy_siginfo_to_user() will nullify the extra SI_EXPANSION_SIZE
bytes + 2*sizeof(__ARCH_SI_CLOCK_T) from _sigchld (waitid doesn't report
utime/stime in siginfo).
Looks correct... even the compat case, but please double-check
copy_siginfo_to_user32/siginfo_layout. Looks like both SIL_KILL and
SIL_CHLD cases are fine in that this patch can't add other user-visible
changes, but I could easily miss something.
> In case
> anyone relies on the old behavior we can just revert
we won't need to rever the whole patch, we can just replace
copy_siginfo_to_user() with copy_to_user(offsetof(si_utime)).
I see you are going to update the changelog and resend, feel free to add
my reviewed-by.
Oleg.
On Thu, Jul 25, 2019 at 11:40:52AM +0200, Oleg Nesterov wrote:
> On 07/24, Christian Brauner wrote:
> >
> > Note that this changes how struct siginfo is filled in for users of
> > waitid.
>
> Namely, copy_siginfo_to_user() will nullify the extra SI_EXPANSION_SIZE
> bytes + 2*sizeof(__ARCH_SI_CLOCK_T) from _sigchld (waitid doesn't report
> utime/stime in siginfo).
>
> Looks correct... even the compat case, but please double-check
> copy_siginfo_to_user32/siginfo_layout. Looks like both SIL_KILL and
> SIL_CHLD cases are fine in that this patch can't add other user-visible
> changes, but I could easily miss something.
>
> > In case
> > anyone relies on the old behavior we can just revert
>
> we won't need to rever the whole patch, we can just replace
> copy_siginfo_to_user() with copy_to_user(offsetof(si_utime)).
>
> I see you are going to update the changelog and resend, feel free to add
> my reviewed-by.
Will do.
Thanks, Oleg!
Christian
On Wed, Jul 24, 2019 at 09:10:20PM +0200, Christian Brauner wrote:
> On July 24, 2019 9:07:54 PM GMT+02:00, Jann Horn <[email protected]> wrote:
> >On Wed, Jul 24, 2019 at 8:27 PM Christian Brauner
> ><[email protected]> wrote:
> >> On July 24, 2019 8:14:26 PM GMT+02:00, Jann Horn <[email protected]>
> >wrote:
> >> >On Wed, Jul 24, 2019 at 4:48 PM Christian Brauner
> >> ><[email protected]> wrote:
> >> >> If CLONE_WAIT_PID is set the newly created process will not be
> >> >> considered by process wait requests that wait generically on
> >children
> >> >> such as:
> >> >>
> >> >> syscall(__NR_wait4, -1, wstatus, options, rusage)
> >> >> syscall(__NR_waitpid, -1, wstatus, options)
> >> >> syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
> >> >> syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
> >> >> syscall(__NR_waitpid, -pid, wstatus, options)
> >> >> syscall(__NR_wait4, -pid, wstatus, options, rusage)
> >> >>
> >> >> A process created with CLONE_WAIT_PID can only be waited upon with
> >a
> >> >> focussed wait call. This ensures that processes can be reaped even
> >if
> >> >> all file descriptors referring to it are closed.
> >> >[...]
> >> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> >> index baaff6570517..a067f3876e2e 100644
> >> >> --- a/kernel/fork.c
> >> >> +++ b/kernel/fork.c
> >> >> @@ -1910,6 +1910,8 @@ static __latent_entropy struct task_struct
> >> >*copy_process(
> >> >> delayacct_tsk_init(p); /* Must remain after
> >> >dup_task_struct() */
> >> >> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> >> >> p->flags |= PF_FORKNOEXEC;
> >> >> + if (clone_flags & CLONE_WAIT_PID)
> >> >> + p->flags |= PF_WAIT_PID;
> >> >> INIT_LIST_HEAD(&p->children);
> >> >> INIT_LIST_HEAD(&p->sibling);
> >> >> rcu_copy_process(p);
> >> >
> >> >This means that if a process with PF_WAIT_PID forks, the child
> >> >inherits the flag, right? That seems unintended? You might have to
> >add
> >> >something like "if (clone_flags & CLONE_THREAD == 0) p->flags &=
> >> >~PF_WAIT_PID;" before this. (I think threads do have to inherit the
> >> >flag so that the case where a non-leader thread of the child goes
> >> >through execve and steals the leader's identity is handled
> >properly.)
> >> >Or you could cram it somewhere into signal_struct instead of on the
> >> >task - that might be a more logical place for it?
> >>
> >> Hm, CLONE_WAIT_PID is only useable with CLONE_PIDFD which in turn is
> >> not useable with CLONE_THREAD.
> >> But we should probably make that explicit for CLONE_WAIT_PID too.
> >
> >To clarify:
> >
> >This code looks buggy to me because p->flags is inherited from the
> >parent, with the exception of flags that are explicitly stripped out.
> >Since PF_WAIT_PID is not stripped out, this means that if task A
> >creates a child B with clone(CLONE_WAIT_PID), and then task B uses
> >fork() to create a child C, then B will not be able to use
> >wait(&status) to wait for C since C inherited PF_WAIT_PID from B.
> >
> >The obvious way to fix that would be to always strip out PF_WAIT_PID;
> >but that would also be wrong, because if task B creates a thread C,
> >and then C calls execve(), the task_struct of B goes away and B's TGID
> >is taken over by C. When C eventually exits, it should still obey the
> >CLONE_WAIT_PID (since to A, it's all the same process). Therefore, if
> >p->flags is used to track whether the task was created with
> >CLONE_WAIT_PID, PF_WAIT_PID must be inherited if CLONE_THREAD is set.
> >So:
> >
> >diff --git a/kernel/fork.c b/kernel/fork.c
> >index d8ae0f1b4148..b32e1e9a6c9c 100644
> >--- a/kernel/fork.c
> >+++ b/kernel/fork.c
> >@@ -1902,6 +1902,10 @@ static __latent_entropy struct task_struct
> >*copy_process(
> > delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> > p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> > p->flags |= PF_FORKNOEXEC;
> >+ if (!(clone_flags & CLONE_THREAD))
> >+ p->flags &= ~PF_PF_WAIT_PID;
> >+ if (clone_flags & CLONE_WAIT_PID)
> >+ p->flags |= PF_PF_WAIT_PID;
> > INIT_LIST_HEAD(&p->children);
> > INIT_LIST_HEAD(&p->sibling);
> > rcu_copy_process(p);
> >
> >An alternative would be to not use p->flags at all, but instead make
> >this a property of the signal_struct - since the property is shared by
> >all threads, that might make more sense?
>
> Yeah, thanks for clarifying.
> Now it's more obvious.
> I need to take a look at the signal struct before I can say anything about this.
I've been looking at this a bit late last night.
Putting this in the flags argument of signal_struct would indeed be
possible. But it feels misplaced to me there. I think the implied
semantics by having this part of task_struct are nicer, i.e. the intent
is clearer especially when the task is filtered later on in exit.c.
So unless anyone sees a clear problem or otherwise objects I would keep
it as a property of task_struct for now and fix it up.
Christian
On 07/24, Christian Brauner wrote:
>
> +SYSCALL_DEFINE6(pidfd_wait, int, pidfd, int __user *, stat_addr,
> + siginfo_t __user *, info, struct rusage __user *, ru,
> + unsigned int, states, unsigned int, flags)
> +{
Oh, I too think that P_PIDFD makes more sense.
and could you explain in the changelog why? I am not arguing and if
nothing else this is consistent with other pidfd features, but if you
are parent/debugger you can't hit the problem with pid-reuse, unless
you races with your sub-threads.
Oleg.
On Thu, Jul 25, 2019 at 12:16:27PM +0200, Oleg Nesterov wrote:
> On 07/24, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE6(pidfd_wait, int, pidfd, int __user *, stat_addr,
> > + siginfo_t __user *, info, struct rusage __user *, ru,
> > + unsigned int, states, unsigned int, flags)
> > +{
>
> Oh, I too think that P_PIDFD makes more sense.
I have already updated the patch to introduce P_PIDFD.
>
> and could you explain in the changelog why? I am not arguing and if
> nothing else this is consistent with other pidfd features, but if you
> are parent/debugger you can't hit the problem with pid-reuse, unless
> you races with your sub-threads.
One of the things is that later on this will allow us to make it
possible to retrieve the exit status via waitid(P_PIDFD) for non-parent
processes if handed a _suitable_ pidfd that has this feature set. Maybe
even - if safe - make it possible to wait on a process as a non-parent.
And some tools just really want to do away with pids completely.
Christian
On Thu, Jul 25, 2019 at 12:30:48PM +0200, Oleg Nesterov wrote:
> On 07/24, Jann Horn wrote:
> >
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1902,6 +1902,10 @@ static __latent_entropy struct task_struct *copy_process(
> > delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> > p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> > p->flags |= PF_FORKNOEXEC;
> > + if (!(clone_flags & CLONE_THREAD))
> > + p->flags &= ~PF_PF_WAIT_PID;
> > + if (clone_flags & CLONE_WAIT_PID)
> > + p->flags |= PF_PF_WAIT_PID;
>
> agreed, but then the "if (!thread_group_leader(tsk))" block in de_thread()
> should also copy PF_PF_WAIT_PID.
>
> > An alternative would be to not use p->flags at all, but instead make
> > this a property of the signal_struct - since the property is shared by
> > all threads, that might make more sense?
>
> I tend to agree.
Hm, ok. That's two people that prefer to make this a flag in
signal_struct. Ok, let me adapt the patch.
Thanks!
Christian
Or. We can change wait_consider_task() to not clear ->notask_error if
WXXX and the child is PF_WAIT_PID.
This way you can "safely" use wait() without WNOHANG, it won't block if
all the children which can report an even are PF_WAIT_PID.
But I do not understand your use-cases, I have no idea if this can help
or not. Just I think the more discussion is always better when we are
going to add the new API.
On 07/25, Oleg Nesterov wrote:
>
> On 07/25, Christian Brauner wrote:
> >
> > On Thu, Jul 25, 2019 at 12:35:44PM +0200, Oleg Nesterov wrote:
> > >
> > > I have to admit this feature looks a bit exotic to me...
> >
> > It might look like it from the kernels perspective but from the feedback
> > on this when presenting on this userspace has real usecases for this.
>
> OK...
>
> but then perhaps we can make PF_WAIT_PID more flexible.
>
> Say, we can add the new WXXX wait option and change eligible_child()
>
> if ((p->flags & PF_WAIT_PID) && (wo->options & WXXX))
> return 0;
>
> this way the parent can tell waitid() whether the PF_WAIT_PID tasks should
> be filtered or not.
>
> And if we do this we can even add PR_SET_WAIT_PID/PR_CLR_WAIT_PID instead
> of the new CLONE_ flag.
>
> Oleg.
Linus Torvalds <[email protected]> writes:
> On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
>>
>> The code here uses a struct waitid_info to catch basic information about
>> process exit including the pid, uid, status, and signal that caused the
>> process to exit. This information is then stuffed into a struct siginfo
>> for the waitid() syscall. That seems like an odd thing to do. We can
>> just pass down a siginfo_t struct directly which let's us cleanup and
>> simplify the whole code quite a bit.
>
> Ack. Except I'd like the commit message to explain where this comes
> from instead of that "That seems like an odd thing to do".
>
> The _original_ reason for "struct waitid_info" was that "siginfo_t" is
> huge because of all the insane padding that various architectures do.
>
> So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
> of siginfo to syscall itself") very much to avoid stack usage issues.
> And I quote:
>
> collect the information needed for siginfo into
> a small structure (waitid_info)
>
> simply because "sigset_t" was big.
>
> But that size came from the explicit "pad it out to 128 bytes for
> future expansion that will never happen", and the kernel using the
> same exact sigset_t that was exposed to user space.
>
> Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
> the kernel") we got rid of the insane padding for in-kernel use,
> exactly because it causes issues like this.
>
> End result: that "struct waitid_info" no longer makes sense. It's not
> appreciably smaller than kernel_siginfo_t is today, but it made sense
> at the time.
Apologies. I meant to reply yesterday but I was preempted by baby
issues.
I strongly disagree that this direction makes sense. The largest
value that I see from struct waitid_info is that it makes it possible to
reason about which values are returned where struct kernel_siginfo does
not.
One of the details the existence of struct waitid_info makes clear is
that unlike the related child death path the wait code does not
fillin si_utime and si_stime. Which is very important to know when you
are dealing with y2038 issues and Arnd Bergmann is.
The most egregious example I know of using siginfo wrong is:
70f1b0d34bdf ("signal/usb: Replace kill_pid_info_as_cred with
kill_pid_usb_asyncio"). But just by moving struct siginfo out of the
program logic and into dedicated little functions that just deal with
the craziness of struct siginfo I have found lots of little bugs.
We don't need that kind of invitation to bugs in the wait logic.
Especially since it is already known that the wait logic is subtely
wrong in corner cases dealing with threads. Especially since we have
not fixed those bugs since we don't have enough people who understand
the code well enough to review those fixes.
I think a better direction for cleanups would be to merge struct
waitid_info into struct wait_opts so that we could remove unnecessary
conditionals from the wait code, and maybe make the whole thing less
subtle.
I took a look at this change and the only reduction in code size
or complexity comes from using copy_siginfo_to_user instead of
multiple lines of unsafe_put_user. That seems to be a worth while
change. However that can be done by just modifying waitid and
compat_waitid.
There is also a bug in these changes. The issue is using structure
initializers of struct kernel_siginfo and then copying the structure to
userspace. I believe KASAN can warn about that now but it might be
sensitive to the architecture specific details of struct kernel_siginfo.
So instead of doing:
+ kernel_siginfo_t kinfo = {
+ .si_signo = 0,
+ };
...
+ if (infop && copy_siginfo_to_user(infop, &kinfo))
return -EFAULT;
The code should do:
+ kernel_siginfo_t kinfo;
+ clear_siginfo(&kinfo);
...
+ if (infop && copy_siginfo_to_user(infop, &kinfo))
return -EFAULT;
Oleg raised a valid concern about copy_signfo_to_user32 when WNOHANG is
specified. In that case all of the the siginfo fields should be
set to 0. Although posix only requires si_pid and si_signo to be zero.
Return without a process due to WNOHANG would probably be clearer if the
code just called clear_user on the whole of the user siginfo.
Eric
On 07/24, Jann Horn wrote:
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1902,6 +1902,10 @@ static __latent_entropy struct task_struct *copy_process(
> delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> p->flags |= PF_FORKNOEXEC;
> + if (!(clone_flags & CLONE_THREAD))
> + p->flags &= ~PF_PF_WAIT_PID;
> + if (clone_flags & CLONE_WAIT_PID)
> + p->flags |= PF_PF_WAIT_PID;
agreed, but then the "if (!thread_group_leader(tsk))" block in de_thread()
should also copy PF_PF_WAIT_PID.
> An alternative would be to not use p->flags at all, but instead make
> this a property of the signal_struct - since the property is shared by
> all threads, that might make more sense?
I tend to agree.
Oleg.
On 07/24, Christian Brauner wrote:
>
> If CLONE_WAIT_PID is set the newly created process will not be
> considered by process wait requests that wait generically on children
> such as:
I have to admit this feature looks a bit exotic to me...
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1019,6 +1019,9 @@ eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
> if (!eligible_pid(wo, p))
> return 0;
>
> + if ((p->flags & PF_WAIT_PID) && (wo->wo_type != PIDTYPE_PID))
> + return 0;
Even if ptrace == T ?
This doesn't look right. Say, strace should work even if its tracee (or
one of the tracees) has PF_WAIT_PID.
Oleg.
On Thu, Jul 25, 2019 at 12:35:44PM +0200, Oleg Nesterov wrote:
> On 07/24, Christian Brauner wrote:
> >
> > If CLONE_WAIT_PID is set the newly created process will not be
> > considered by process wait requests that wait generically on children
> > such as:
>
> I have to admit this feature looks a bit exotic to me...
It might look like it from the kernels perspective but from the feedback
on this when presenting on this userspace has real usecases for this.
>
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -1019,6 +1019,9 @@ eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
> > if (!eligible_pid(wo, p))
> > return 0;
> >
> > + if ((p->flags & PF_WAIT_PID) && (wo->wo_type != PIDTYPE_PID))
> > + return 0;
>
> Even if ptrace == T ?
>
> This doesn't look right. Say, strace should work even if its tracee (or
> one of the tracees) has PF_WAIT_PID.
As in
if (!ptrace && (p->flags & PF_WAIT_PID) && (wo->wo_type != PIDTYPE_PID))
return 0;
Sure, we can allow that.
Christian
On 07/25, Christian Brauner wrote:
>
> On Thu, Jul 25, 2019 at 12:35:44PM +0200, Oleg Nesterov wrote:
> >
> > I have to admit this feature looks a bit exotic to me...
>
> It might look like it from the kernels perspective but from the feedback
> on this when presenting on this userspace has real usecases for this.
OK...
but then perhaps we can make PF_WAIT_PID more flexible.
Say, we can add the new WXXX wait option and change eligible_child()
if ((p->flags & PF_WAIT_PID) && (wo->options & WXXX))
return 0;
this way the parent can tell waitid() whether the PF_WAIT_PID tasks should
be filtered or not.
And if we do this we can even add PR_SET_WAIT_PID/PR_CLR_WAIT_PID instead
of the new CLONE_ flag.
Oleg.
On Thu, Jul 25, 2019 at 01:25:03PM +0200, Oleg Nesterov wrote:
> On 07/25, Christian Brauner wrote:
> >
> > On Thu, Jul 25, 2019 at 12:35:44PM +0200, Oleg Nesterov wrote:
> > >
> > > I have to admit this feature looks a bit exotic to me...
> >
> > It might look like it from the kernels perspective but from the feedback
> > on this when presenting on this userspace has real usecases for this.
>
> OK...
>
> but then perhaps we can make PF_WAIT_PID more flexible.
I've changed this to be a property on signal_struct, i.e. a bitfield
entry following the {is,has}_child_subreaper entries.
>
> Say, we can add the new WXXX wait option and change eligible_child()
>
> if ((p->flags & PF_WAIT_PID) && (wo->options & WXXX))
> return 0;
>
> this way the parent can tell waitid() whether the PF_WAIT_PID tasks should
> be filtered or not.
>
> And if we do this we can even add PR_SET_WAIT_PID/PR_CLR_WAIT_PID instead
> of the new CLONE_ flag.
Hm, I prefer to not do this.
The idea is that this is a property of the pidfd itself and not of the
waitid() call. And currently, I don't think it makes sense to change
this property at runtime.
What might make sense is to remove this property when a task gets
reparented, i.e. when its parent has died.
On Thu, Jul 25, 2019 at 01:43:59PM +0200, Oleg Nesterov wrote:
> Or. We can change wait_consider_task() to not clear ->notask_error if
> WXXX and the child is PF_WAIT_PID.
>
> This way you can "safely" use wait() without WNOHANG, it won't block if
> all the children which can report an even are PF_WAIT_PID.
>
> But I do not understand your use-cases, I have no idea if this can help
One usecase (among others listed in the commit message) are shared
libraries. P_ALL is usually something you can't really use in a shared
library because you have no idea what else might be fork()ed off. Only
the main program can use this but none of the auxiliary libraries that
it uses.
The other way around you want to be able fork() off something without
affecting P_ALL in the main program.
The key is that you want to be able to create child processes in a
shared library without the main programing having to know about this so
that it can use P_ALL and never get stuff from the library.
Assume you have a project with a main loop with a million things
happening in that mainloop like some gui app running an avi video. For
example, gtk uses gstreamer which forks off all codecs in child
processes which are sandboxed for security. So gstreamer is using helper
processes in the background which are my children now. Now I'm creating
four more additional helper processes as well. Now, in my (glib, qt
whatever) mainloop on SIGCHLD some part of the app is checking with
WNHOANG and finds a process has exited. It's cleaning this thing up now
but it's not a process it wanted to clean up. The other part of the app
is now doing waitid(P_PID, pid) but will find the process already gone
it wanted to reap.
I hope I'm expressing this well enough.
Linus Torvalds <[email protected]> writes:
> On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
>>
>> This adds the pidfd_wait() syscall.
>
> I despise this patch.
>
> Why can't this just be a new P_PIDFD flag, and then use
> "waitid(P_PIDFD, pidfd, ...);"
>
> Yes, yes, yes, I realize that "pidfd" is of type "int", and waitid()
> takes an argument of type pid_t, but it's the same type in the end,
> and it does seem like the whole *point* of "waitid()" is that
> "idtype_t idtype" which tells you what kind of ID you're passing it.
Just to emphasize this point.
The posix declaration of waitid is:
>int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
Where id_t is defined as:
> id_t - Used as a general identifier; can be used to contain at least a pid_t, uid_t, or gid_t.
And the BSDs at least have defined P_UID and P_GID. So that flexibility
has been used.
So it looks entirely reasonable to have P_PIDFD that just waits for a
specified child.
Eric
On 07/25, Christian Brauner wrote:
>
> The key is that you want to be able to create child processes in a
> shared library without the main programing having to know about this so
> that it can use P_ALL and never get stuff from the library.
OK, thanks...
in this case you should probablu pass 0 in CSIGNAL to ensure that the main
program won't be notified when that child exits.
Oleg.
On Thu, Jul 25, 2019 at 06:13:17PM +0200, Oleg Nesterov wrote:
> On 07/25, Christian Brauner wrote:
> >
> > The key is that you want to be able to create child processes in a
> > shared library without the main programing having to know about this so
> > that it can use P_ALL and never get stuff from the library.
>
> OK, thanks...
>
> in this case you should probablu pass 0 in CSIGNAL to ensure that the main
> program won't be notified when that child exits.
Yes, that's the idea. So you'd turn off SIGCHLD and rely on pidfd polling
only. That's similar to how pdfork() on FreeBSD works. It's just that we
need to do:
struct clone_args args = {
.exit_signal = 0,
.flags = CLONE_PIDFD | CLONE_WAIT_PID,
};
Now the shared library can guarantee that noone else in the mainloop
gets woken by a SIGCHLD from the pidfd-based helper process (because
exit_signal is 0) but afaict it also needs CLONE_WAIT_PID. Since the
latter allows it to guarantee that if someone in the mainloop gets
SIGCHLD from another regular process and calls waitid(P_ALL) it won't
accidently reap a pidfd-based process that has exited.
Fyi, I'm splitting CLONE_WAIT_PID out into a separate patchset and only
keep P_PIDFD for now. So we can discuss this independently. You think
that's better Oleg?
Christian
Christian Brauner <[email protected]> writes:
> On Thu, Jul 25, 2019 at 01:43:59PM +0200, Oleg Nesterov wrote:
>> Or. We can change wait_consider_task() to not clear ->notask_error if
>> WXXX and the child is PF_WAIT_PID.
>>
>> This way you can "safely" use wait() without WNOHANG, it won't block if
>> all the children which can report an even are PF_WAIT_PID.
>>
>> But I do not understand your use-cases, I have no idea if this can help
>
> One usecase (among others listed in the commit message) are shared
> libraries. P_ALL is usually something you can't really use in a shared
> library because you have no idea what else might be fork()ed off. Only
> the main program can use this but none of the auxiliary libraries that
> it uses.
> The other way around you want to be able fork() off something without
> affecting P_ALL in the main program.
> The key is that you want to be able to create child processes in a
> shared library without the main programing having to know about this so
> that it can use P_ALL and never get stuff from the library.
>
> Assume you have a project with a main loop with a million things
> happening in that mainloop like some gui app running an avi video. For
> example, gtk uses gstreamer which forks off all codecs in child
> processes which are sandboxed for security. So gstreamer is using helper
> processes in the background which are my children now. Now I'm creating
> four more additional helper processes as well. Now, in my (glib, qt
> whatever) mainloop on SIGCHLD some part of the app is checking with
> WNHOANG and finds a process has exited. It's cleaning this thing up now
> but it's not a process it wanted to clean up. The other part of the app
> is now doing waitid(P_PID, pid) but will find the process already gone
> it wanted to reap.
>
> I hope I'm expressing this well enough.
I think so.
A) I think Oleg is correct that you should test the flag in
do_wait_thread rather than elsewhere.
B) We have a deficiency in do_wait that should be addressed. The
do_wait function does not have a fast path for waiting on a
particular process. For adding this functionality I such a fast path
goes from a nice to have to a necessity for getting all of the
fiddly details correct.
C) I believe the semantics should be that while such a file descriptor
is open, only that file descriptor can be used to reap the process.
And that it should be allowed to pass the file descriptor between
processes. Which means the parent can die and the process be
reparted to init and we should still be able to reap the process with
the file descriptor.
D) I think it is a toss up how we should deal with such a process when
the file descriptor is closed. Setting the process to autoreap
or reparent to init and let init deal with it. My inclination is
that autoreap is the correct behavior.
Eric
On Thu, Jul 25, 2019 at 07:46:50AM -0500, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
> >>
> >> The code here uses a struct waitid_info to catch basic information about
> >> process exit including the pid, uid, status, and signal that caused the
> >> process to exit. This information is then stuffed into a struct siginfo
> >> for the waitid() syscall. That seems like an odd thing to do. We can
> >> just pass down a siginfo_t struct directly which let's us cleanup and
> >> simplify the whole code quite a bit.
> >
> > Ack. Except I'd like the commit message to explain where this comes
> > from instead of that "That seems like an odd thing to do".
> >
> > The _original_ reason for "struct waitid_info" was that "siginfo_t" is
> > huge because of all the insane padding that various architectures do.
> >
> > So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
> > of siginfo to syscall itself") very much to avoid stack usage issues.
> > And I quote:
> >
> > collect the information needed for siginfo into
> > a small structure (waitid_info)
> >
> > simply because "sigset_t" was big.
> >
> > But that size came from the explicit "pad it out to 128 bytes for
> > future expansion that will never happen", and the kernel using the
> > same exact sigset_t that was exposed to user space.
> >
> > Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
> > the kernel") we got rid of the insane padding for in-kernel use,
> > exactly because it causes issues like this.
> >
> > End result: that "struct waitid_info" no longer makes sense. It's not
> > appreciably smaller than kernel_siginfo_t is today, but it made sense
> > at the time.
>
> Apologies. I meant to reply yesterday but I was preempted by baby
> issues.
>
> I strongly disagree that this direction makes sense. The largest
> value that I see from struct waitid_info is that it makes it possible to
> reason about which values are returned where struct kernel_siginfo does
> not.
>
> One of the details the existence of struct waitid_info makes clear is
> that unlike the related child death path the wait code does not
> fillin si_utime and si_stime. Which is very important to know when you
> are dealing with y2038 issues and Arnd Bergmann is.
>
> The most egregious example I know of using siginfo wrong is:
> 70f1b0d34bdf ("signal/usb: Replace kill_pid_info_as_cred with
> kill_pid_usb_asyncio"). But just by moving struct siginfo out of the
> program logic and into dedicated little functions that just deal with
> the craziness of struct siginfo I have found lots of little bugs.
>
> We don't need that kind of invitation to bugs in the wait logic.
I don't think it's a strong enough argument for rejecting this change.
Suspecting that something might go wrong if we simplify something is a
valid call to proceed with caution and be on the lookout for potential
regressions so we can react fast. I respect that. But it's not
necessarily a good argument to reject a change.
I'm happy to switch from an initializer (which is not even clear is a
bug) to using clear_siginfo().
And I'm also going to split this patch out of the P_PIDFD patch but I'm
going to send this out again. I haven't heard a sound argument why this
patch is worse than what we have right now in there.
Christian
On Wed, Jul 24, 2019 at 4:47 PM Christian Brauner <[email protected]> wrote:
> +
> +static int copy_rusage_to_user_any(struct rusage *kru, struct rusage __user *ru)
> +{
> +#ifdef CONFIG_COMPAT
> + if (in_compat_syscall())
> + return put_compat_rusage(kru, (struct compat_rusage __user *)ru);
> +#endif
> + return copy_to_user(ru, kru, sizeof(*kru));
> +}
I think this code needs a check for COMPAT_USE_64BIT_TIME in order
to handle x32 correctly.
It would be nice to introduce it in a separate patch, and then use it
to kill off
compat_sys_getrusage() and compat_sys_wait4(), and possibly even
compat_sys_waitid() in combination with your copy_siginfo_to_user_any().
That could be done as a cleanup patch afterwards, or as part of your series.
Arnd
On Fri, Jul 26, 2019 at 10:19:55AM +0200, Arnd Bergmann wrote:
> On Wed, Jul 24, 2019 at 4:47 PM Christian Brauner <[email protected]> wrote:
>
> > +
> > +static int copy_rusage_to_user_any(struct rusage *kru, struct rusage __user *ru)
> > +{
> > +#ifdef CONFIG_COMPAT
> > + if (in_compat_syscall())
> > + return put_compat_rusage(kru, (struct compat_rusage __user *)ru);
> > +#endif
> > + return copy_to_user(ru, kru, sizeof(*kru));
> > +}
>
> I think this code needs a check for COMPAT_USE_64BIT_TIME in order
> to handle x32 correctly.
Similar to waitid(), indeed.
>
> It would be nice to introduce it in a separate patch, and then use it
> to kill off
> compat_sys_getrusage() and compat_sys_wait4(), and possibly even
> compat_sys_waitid() in combination with your copy_siginfo_to_user_any().
> That could be done as a cleanup patch afterwards, or as part of your series.
Right, but we won't go the syscall route but instead go the P_PIDFD
route for waitid(). :)
Christian
On Fri, Jul 26, 2019 at 10:24 AM Christian Brauner <[email protected]> wrote:
>
> > It would be nice to introduce it in a separate patch, and then use it
> > to kill off
> > compat_sys_getrusage() and compat_sys_wait4(), and possibly even
> > compat_sys_waitid() in combination with your copy_siginfo_to_user_any().
> > That could be done as a cleanup patch afterwards, or as part of your series.
>
> Right, but we won't go the syscall route but instead go the P_PIDFD
> route for waitid(). :)
Ah, of course, nevermind then. It would still be a useful cleanup, but
many other things would be as well.
Arnd
Christian Brauner <[email protected]> writes:
> On Thu, Jul 25, 2019 at 07:46:50AM -0500, Eric W. Biederman wrote:
>> Linus Torvalds <[email protected]> writes:
>>
>> > On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
>> >>
>> >> The code here uses a struct waitid_info to catch basic information about
>> >> process exit including the pid, uid, status, and signal that caused the
>> >> process to exit. This information is then stuffed into a struct siginfo
>> >> for the waitid() syscall. That seems like an odd thing to do. We can
>> >> just pass down a siginfo_t struct directly which let's us cleanup and
>> >> simplify the whole code quite a bit.
>> >
>> > Ack. Except I'd like the commit message to explain where this comes
>> > from instead of that "That seems like an odd thing to do".
>> >
>> > The _original_ reason for "struct waitid_info" was that "siginfo_t" is
>> > huge because of all the insane padding that various architectures do.
>> >
>> > So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
>> > of siginfo to syscall itself") very much to avoid stack usage issues.
>> > And I quote:
>> >
>> > collect the information needed for siginfo into
>> > a small structure (waitid_info)
>> >
>> > simply because "sigset_t" was big.
>> >
>> > But that size came from the explicit "pad it out to 128 bytes for
>> > future expansion that will never happen", and the kernel using the
>> > same exact sigset_t that was exposed to user space.
>> >
>> > Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
>> > the kernel") we got rid of the insane padding for in-kernel use,
>> > exactly because it causes issues like this.
>> >
>> > End result: that "struct waitid_info" no longer makes sense. It's not
>> > appreciably smaller than kernel_siginfo_t is today, but it made sense
>> > at the time.
>>
>> Apologies. I meant to reply yesterday but I was preempted by baby
>> issues.
>>
>> I strongly disagree that this direction makes sense. The largest
>> value that I see from struct waitid_info is that it makes it possible to
>> reason about which values are returned where struct kernel_siginfo does
>> not.
>>
>> One of the details the existence of struct waitid_info makes clear is
>> that unlike the related child death path the wait code does not
>> fillin si_utime and si_stime. Which is very important to know when you
>> are dealing with y2038 issues and Arnd Bergmann is.
>>
>> The most egregious example I know of using siginfo wrong is:
>> 70f1b0d34bdf ("signal/usb: Replace kill_pid_info_as_cred with
>> kill_pid_usb_asyncio"). But just by moving struct siginfo out of the
>> program logic and into dedicated little functions that just deal with
>> the craziness of struct siginfo I have found lots of little bugs.
>>
>> We don't need that kind of invitation to bugs in the wait logic.
>
> I don't think it's a strong enough argument for rejecting this change.
> Suspecting that something might go wrong if we simplify something is a
> valid call to proceed with caution and be on the lookout for potential
> regressions so we can react fast. I respect that. But it's not
> necessarily a good argument to reject a change.
Except your change was not a simplification. Your change was
a substitution to do the work of filling in struct kernel_siginfo in 4
locations instead of just 2.
The only simplification came from not using unsafe_put_user. Which is
valid but has nothing to do with struct waitid_info.
> I'm happy to switch from an initializer (which is not even clear is a
> bug) to using clear_siginfo().
I just double checked the definitions in signal_types.h and
uapi/asm-generic/siginfo.h and there is definitely padding on 64bit.
So yes barring magic compiler plug ins it is a bug.
> And I'm also going to split this patch out of the P_PIDFD patch but I'm
> going to send this out again. I haven't heard a sound argument why
> this
> patch is worse than what we have right now in there.
Then I am afraid I have not expressed myself well.
When I read through this patch I saw.
- A bug when dealing with struct kernel_siginfo.
- A substitution from of struct waitid_info to struct kernel_siginfo.
- An actual simplification in replacing several unsafe_put_user calls
with copy_siginfo_to_user.
- A gratuitous change in change the order of several of the statements.
- No simplification in the logic of do_wait.
Or in short I saw you did "s/struct waitid_info/struct kernel_siginfo/"
and introduced bugs. Further you increased the number of locations that
we need to be very careful with struct kernel_siginfo from 2 to 4.
For myself I am extremely sensitive to people taking a cavalier attitude
to using siginfo. So that we have a chance of maintaining that code I
have been steadily cleaning it up for the last year or two. I hope
things are better now. I think I have gotten the core code that
manipulates struct siginfo from something that no one had the time to
read through and understand to something that a person can look at and
in a couple hours understand the nuances of.
But it still remains the case that we must be much more careful with
struct siginfo than we are with other structures. It still remains
entirely too easy to fill it out wrong unless you are paying close
attention to all of the details.
Or in other words I think if you simplified your cleanup something like
below it would be a good cleanup.
Eric
diff --git a/kernel/exit.c b/kernel/exit.c
index 3d86930f035e..9ce896b478f5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1552,11 +1552,12 @@ static long do_wait(struct wait_opts *wo)
return retval;
}
-static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
+static long kernel_waitid(int which, pid_t upid, struct kernel_siginfo *info,
int options, struct rusage *ru)
{
struct wait_opts wo;
struct pid *pid = NULL;
+ struct waitid_info winfo = { .status = 0 };
enum pid_type type;
long ret;
@@ -1592,11 +1593,21 @@ static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
wo.wo_type = type;
wo.wo_pid = pid;
wo.wo_flags = options;
- wo.wo_info = infop;
+ wo.wo_info = &winfo;
wo.wo_rusage = ru;
ret = do_wait(&wo);
-
put_pid(pid);
+
+ clear_siginfo(info);
+ if (ret > 0) {
+ info->si_signo = SIGCHLD;
+ info->si_errno = 0;
+ info->si_code = winfo.cause;
+ info->si_pid = winfo.pid;
+ info->si_uid = winfo.uid;
+ info->si_status = winfo.status;
+ }
+
return ret;
}
@@ -1604,33 +1615,18 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
infop, int, options, struct rusage __user *, ru)
{
struct rusage r;
- struct waitid_info info = {.status = 0};
+ struct kernel_siginfo info;
long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
- int signo = 0;
if (err > 0) {
- signo = SIGCHLD;
err = 0;
if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
return -EFAULT;
}
- if (!infop)
- return err;
- if (!user_access_begin(infop, sizeof(*infop)))
+ if (infop && copy_siginfo_to_user(infop, &info))
return -EFAULT;
-
- unsafe_put_user(signo, &infop->si_signo, Efault);
- unsafe_put_user(0, &infop->si_errno, Efault);
- unsafe_put_user(info.cause, &infop->si_code, Efault);
- unsafe_put_user(info.pid, &infop->si_pid, Efault);
- unsafe_put_user(info.uid, &infop->si_uid, Efault);
- unsafe_put_user(info.status, &infop->si_status, Efault);
- user_access_end();
return err;
-Efault:
- user_access_end();
- return -EFAULT;
}
long kernel_wait4(pid_t upid, int __user *stat_addr, int options,
@@ -1724,11 +1720,9 @@ COMPAT_SYSCALL_DEFINE5(waitid,
struct compat_rusage __user *, uru)
{
struct rusage ru;
- struct waitid_info info = {.status = 0};
+ struct kernel_siginfo info;
long err = kernel_waitid(which, pid, &info, options, uru ? &ru : NULL);
- int signo = 0;
if (err > 0) {
- signo = SIGCHLD;
err = 0;
if (uru) {
/* kernel_waitid() overwrites everything in ru */
@@ -1741,23 +1735,9 @@ COMPAT_SYSCALL_DEFINE5(waitid,
}
}
- if (!infop)
- return err;
-
- if (!user_access_begin(infop, sizeof(*infop)))
+ if (infop && copy_siginfo_to_user32(infop, &info))
return -EFAULT;
-
- unsafe_put_user(signo, &infop->si_signo, Efault);
- unsafe_put_user(0, &infop->si_errno, Efault);
- unsafe_put_user(info.cause, &infop->si_code, Efault);
- unsafe_put_user(info.pid, &infop->si_pid, Efault);
- unsafe_put_user(info.uid, &infop->si_uid, Efault);
- unsafe_put_user(info.status, &infop->si_status, Efault);
- user_access_end();
return err;
-Efault:
- user_access_end();
- return -EFAULT;
}
#endif
On Fri, Jul 26, 2019 at 06:59:39AM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Thu, Jul 25, 2019 at 07:46:50AM -0500, Eric W. Biederman wrote:
> >> Linus Torvalds <[email protected]> writes:
> >>
> >> > On Wed, Jul 24, 2019 at 7:47 AM Christian Brauner <[email protected]> wrote:
> >> >>
> >> >> The code here uses a struct waitid_info to catch basic information about
> >> >> process exit including the pid, uid, status, and signal that caused the
> >> >> process to exit. This information is then stuffed into a struct siginfo
> >> >> for the waitid() syscall. That seems like an odd thing to do. We can
> >> >> just pass down a siginfo_t struct directly which let's us cleanup and
> >> >> simplify the whole code quite a bit.
> >> >
> >> > Ack. Except I'd like the commit message to explain where this comes
> >> > from instead of that "That seems like an odd thing to do".
> >> >
> >> > The _original_ reason for "struct waitid_info" was that "siginfo_t" is
> >> > huge because of all the insane padding that various architectures do.
> >> >
> >> > So it was introduced by commit 67d7ddded322 ("waitid(2): leave copyout
> >> > of siginfo to syscall itself") very much to avoid stack usage issues.
> >> > And I quote:
> >> >
> >> > collect the information needed for siginfo into
> >> > a small structure (waitid_info)
> >> >
> >> > simply because "sigset_t" was big.
> >> >
> >> > But that size came from the explicit "pad it out to 128 bytes for
> >> > future expansion that will never happen", and the kernel using the
> >> > same exact sigset_t that was exposed to user space.
> >> >
> >> > Then in commit 4ce5f9c9e754 ("signal: Use a smaller struct siginfo in
> >> > the kernel") we got rid of the insane padding for in-kernel use,
> >> > exactly because it causes issues like this.
> >> >
> >> > End result: that "struct waitid_info" no longer makes sense. It's not
> >> > appreciably smaller than kernel_siginfo_t is today, but it made sense
> >> > at the time.
> >>
> >> Apologies. I meant to reply yesterday but I was preempted by baby
> >> issues.
> >>
> >> I strongly disagree that this direction makes sense. The largest
> >> value that I see from struct waitid_info is that it makes it possible to
> >> reason about which values are returned where struct kernel_siginfo does
> >> not.
> >>
> >> One of the details the existence of struct waitid_info makes clear is
> >> that unlike the related child death path the wait code does not
> >> fillin si_utime and si_stime. Which is very important to know when you
> >> are dealing with y2038 issues and Arnd Bergmann is.
> >>
> >> The most egregious example I know of using siginfo wrong is:
> >> 70f1b0d34bdf ("signal/usb: Replace kill_pid_info_as_cred with
> >> kill_pid_usb_asyncio"). But just by moving struct siginfo out of the
> >> program logic and into dedicated little functions that just deal with
> >> the craziness of struct siginfo I have found lots of little bugs.
> >>
> >> We don't need that kind of invitation to bugs in the wait logic.
> >
> > I don't think it's a strong enough argument for rejecting this change.
> > Suspecting that something might go wrong if we simplify something is a
> > valid call to proceed with caution and be on the lookout for potential
> > regressions so we can react fast. I respect that. But it's not
> > necessarily a good argument to reject a change.
>
> Except your change was not a simplification. Your change was
> a substitution to do the work of filling in struct kernel_siginfo in 4
> locations instead of just 2.
>
> The only simplification came from not using unsafe_put_user. Which is
> valid but has nothing to do with struct waitid_info.
>
> > I'm happy to switch from an initializer (which is not even clear is a
> > bug) to using clear_siginfo().
>
> I just double checked the definitions in signal_types.h and
> uapi/asm-generic/siginfo.h and there is definitely padding on 64bit.
> So yes barring magic compiler plug ins it is a bug.
>
> > And I'm also going to split this patch out of the P_PIDFD patch but I'm
> > going to send this out again. I haven't heard a sound argument why
> > this
> > patch is worse than what we have right now in there.
>
> Then I am afraid I have not expressed myself well.
>
> When I read through this patch I saw.
> - A bug when dealing with struct kernel_siginfo.
> - A substitution from of struct waitid_info to struct kernel_siginfo.
> - An actual simplification in replacing several unsafe_put_user calls
> with copy_siginfo_to_user.
> - A gratuitous change in change the order of several of the statements.
> - No simplification in the logic of do_wait.
I'm not going to feel bad for trying to improve a piece of code and not
getting it right the first time.
Removal of a custom struct that is only used to copy bits of information
into another struct for which we have a dedicated in-kernel struct to
avoid exactly that seems like a valid use of
s/<custom-struct>/<dedicated-kernel-struct>/ to me; especially since I
needed to touch that file anyway.
>
> Or in short I saw you did "s/struct waitid_info/struct kernel_siginfo/"
> and introduced bugs. Further you increased the number of locations that
> we need to be very careful with struct kernel_siginfo from 2 to 4.
If you're concerned about siginfo_t being used in more places than it is
right now, then please put a comment above it that it should be avoided
and that because of architectural nuances clear_signfo() needs to be
used to initialize it correctly.
Christian