2018-11-19 10:34:39

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v1 0/2] proc: allow signaling processes via file descriptors

Hey,

This little series introduces the ability to signal processes via file
descriptors to eliminate race-conditions caused by pid recycling.
With this patch an open() call on /proc/<pid> will give userspace a
handle to struct pid of the process associated with /proc/<pid>. This
allows to maintain a stable handle on a process.
Discussion has shown that a dedicated syscall is prefered over an ioctl().
Thus, the new syscall procfd_signal() is introduced to solve this
problem. It operates on a process file descriptor. More details are
found in the individual commit messages.

With this series a process can be killed via:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
int ret;
char buf[1000];

if (argc < 2)
exit(EXIT_FAILURE);

ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
if (ret < 0)
exit(EXIT_FAILURE);

int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
exit(EXIT_FAILURE);
}

ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
if (ret < 0) {
printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}

close(fd);

exit(EXIT_SUCCESS);
}

Thanks!
Christian

Christian Brauner (2):
proc: get process file descriptor from /proc/<pid>
signal: add procfd_signal() syscall
procfd_signal.2: document procfd_signal syscall

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/proc/base.c | 23 ++++++++
include/linux/proc_fs.h | 1 +
include/linux/syscalls.h | 2 +
kernel/signal.c | 76 ++++++++++++++++++++++++--
6 files changed, 98 insertions(+), 6 deletions(-)

--
2.19.1



2018-11-19 10:34:53

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v1 2/2] signal: add procfd_signal() syscall

The kill() syscall operates on process identifiers. After a process has
exited its pid can be reused by another process. If a caller sends a signal
to a reused pid it will end up signaling the wrong process. This issue has
often surfaced and there has been a push [1] to address this problem.

A prior patch has introduced the ability to get a file descriptor
referencing struct pid by opening /proc/<pid>. This guarantees a stable
handle on a process which can be used to send signals to the referenced
process. Discussion has shown that a dedicated syscall is preferable over
ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
this problem. It operates on a process file descriptor.
The syscall takes an additional siginfo_t and flags argument. If siginfo_t
is NULL then procfd_signal() behaves like kill() if it is not NULL it
behaves like rt_sigqueueinfo.
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0.

With this patch a process can be killed via:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
int ret;
char buf[1000];

if (argc < 2)
exit(EXIT_FAILURE);

ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
if (ret < 0)
exit(EXIT_FAILURE);

int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
exit(EXIT_FAILURE);
}

ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
if (ret < 0) {
printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
close(fd);
exit(EXIT_FAILURE);
}

close(fd);

exit(EXIT_SUCCESS);
}

[1]: https://lkml.org/lkml/2018/11/18/130

Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
Changelog:
v1:
- patch introduced
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/proc/base.c | 6 ++
include/linux/proc_fs.h | 1 +
include/linux/syscalls.h | 2 +
kernel/signal.c | 76 ++++++++++++++++++++++++--
6 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..e637eab883e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
+387 i386 procfd_signal sys_procfd_signal __ia32_sys_procfd_signal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..e95f6741ab42 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
+335 common procfd_signal __x64_sys_procfd_signal

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6365a4fea314..a12c9de92bd0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3055,6 +3055,12 @@ static const struct file_operations proc_tgid_base_operations = {
.release = proc_tgid_release,
};

+bool proc_is_procfd(const struct file *file)
+{
+ return d_is_dir(file->f_path.dentry) &&
+ (file->f_op == &proc_tgid_base_operations);
+}
+
static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
{
return proc_pident_lookup(dir, dentry,
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..2d53a47fba34 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
int (*show)(struct seq_file *, void *),
proc_write_t write,
void *data);
+extern bool proc_is_procfd(const struct file *file);

#else /* CONFIG_PROC_FS */

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ac3d13a915b..a5ca8cb84566 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_procfd_signal(int fd, int sig, siginfo_t __user *info,
+ int flags);

/*
* Architecture-specific system calls
diff --git a/kernel/signal.c b/kernel/signal.c
index 9a32bc2088c9..e8a8929de710 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -19,7 +19,9 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/proc_fs.h>
#include <linux/tty.h>
#include <linux/binfmts.h>
#include <linux/coredump.h>
@@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
}
#endif

+static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
+{
+ clear_siginfo(info);
+ info->si_signo = sig;
+ info->si_errno = 0;
+ info->si_code = SI_USER;
+ info->si_pid = task_tgid_vnr(current);
+ info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
+}
+
/**
* sys_kill - send a signal to a process
* @pid: the PID of the process
@@ -3295,16 +3307,68 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
{
struct kernel_siginfo info;

- clear_siginfo(&info);
- info.si_signo = sig;
- info.si_errno = 0;
- info.si_code = SI_USER;
- info.si_pid = task_tgid_vnr(current);
- info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+ prepare_kill_siginfo(sig, &info);

return kill_something_info(sig, &info, pid);
}

+/**
+ * sys_procfd_signal - send a signal to a process through a process file
+ * descriptor
+ * @fd: the file descriptor of the process
+ * @sig: signal to be sent
+ * @info: the signal info
+ * @flags: future flags to be passed
+ */
+SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
+ int, flags)
+{
+ int ret;
+ struct pid *pid;
+ kernel_siginfo_t kinfo;
+ struct fd f;
+
+ /* Enforce flags be set to 0 until we add an extension. */
+ if (flags)
+ return -EINVAL;
+
+ f = fdget_raw(fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ /* Is this a process file descriptor? */
+ if (!proc_is_procfd(f.file))
+ goto err;
+
+ pid = f.file->private_data;
+ if (!pid)
+ goto err;
+
+ if (info) {
+ ret = __copy_siginfo_from_user(sig, &kinfo, info);
+ if (unlikely(ret))
+ goto err;
+ /*
+ * Not even root can pretend to send signals from the kernel.
+ * Nor can they impersonate a kill()/tgkill(), which adds
+ * source info.
+ */
+ ret = -EPERM;
+ if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
+ (task_pid(current) != pid))
+ goto err;
+ } else {
+ prepare_kill_siginfo(sig, &kinfo);
+ }
+
+ ret = kill_pid_info(sig, &kinfo, pid);
+
+err:
+ fdput(f);
+ return ret;
+}
+
static int
do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
{
--
2.19.1


2018-11-19 10:35:09

by Christian Brauner

[permalink] [raw]
Subject: [PATCH] procfd_signal.2: document procfd_signal syscall

Signed-off-by: Christian Brauner <[email protected]>
---
Changelog:
v1:
- patch introduced
---
man2/procfd_signal.2 | 147 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 147 insertions(+)
create mode 100644 man2/procfd_signal.2

diff --git a/man2/procfd_signal.2 b/man2/procfd_signal.2
new file mode 100644
index 000000000..6af0b74f4
--- /dev/null
+++ b/man2/procfd_signal.2
@@ -0,0 +1,147 @@
+.\" Copyright (C) 2018 Christian Brauner <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH PROCFD_SIGNAL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
+.SH NAME
+procfd_signal \- send signal to a process through a process descriptor
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <signal.h>
+.PP
+.BI "int procfd_signal(int " fd ", int " sig ", siginfo_t *" info ", int " flags );
+.fi
+.SH DESCRIPTION
+.BR procfd_signal ()
+sends the signal specified in
+.I sig
+to the process identified by the file descriptor
+.IR fd .
+The permissions required to send a signal are the same as for
+.BR kill (2).
+As with
+.BR kill (2),
+the null signal (0) can be used to check if a process with a given
+PID exists.
+.PP
+The optional
+.I info
+argument specifies the data to accompany the signal.
+This argument is a pointer to a structure of type
+.IR siginfo_t ,
+described in
+.BR sigaction (2)
+(and defined by including
+.IR <sigaction.h> ).
+The caller should set the following fields in this structure:
+.TP
+.I si_code
+This must be one of the
+.B SI_*
+codes in the Linux kernel source file
+.IR include/asm-generic/siginfo.h ,
+with the restriction that the code must be negative
+(i.e., cannot be
+.BR SI_USER ,
+which is used by the kernel to indicate a signal sent by
+.BR kill (2))
+and cannot (since Linux 2.6.39) be
+.BR SI_TKILL
+(which is used by the kernel to indicate a signal sent using
+.\" tkill(2) or
+.BR tgkill (2)).
+.TP
+.I si_pid
+This should be set to a process ID,
+typically the process ID of the sender.
+.TP
+.I si_uid
+This should be set to a user ID,
+typically the real user ID of the sender.
+.TP
+.I si_value
+This field contains the user data to accompany the signal.
+For more information, see the description of the last
+.RI ( "union sigval" )
+argument of
+.BR sigqueue (3).
+.PP
+Internally, the kernel sets the
+.I si_signo
+field to the value specified in
+.IR sig ,
+so that the receiver of the signal can also obtain
+the signal number via that field.
+.PP
+The
+.I flags
+argument is reserved for future extension and must be set to 0.
+.PP
+.SH RETURN VALUE
+On success, this system call returns 0.
+On error, it returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EINVAL
+An invalid signal was specified.
+.TP
+.B EINVAL
+.I fd
+does not refer to a process.
+.TP
+.B EINVAL
+The flags argument was not 0.
+.TP
+.B EPERM
+The caller does not have permission to send the signal to the target.
+For the required permissions, see
+.BR kill (2).
+Or:
+.I uinfo->si_code
+is invalid.
+.TP
+.B ESRCH
+The process or process group does not exist.
+Note that an existing process might be a zombie,
+a process that has terminated execution, but
+has not yet been
+.BR wait (2)ed
+for.
+.SH CONFORMING TO
+This system call is Linux-specific.
+.SH SEE ALSO
+.BR kill (2),
+.BR sigaction (2),
+.BR sigprocmask (2),
+.BR tgkill (2),
+.BR pthread_sigqueue (3),
+.BR rt_sigqueueinfo (2),
+.BR sigqueue (3),
+.BR signal (7)
--
2.19.1


2018-11-19 10:35:59

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v1 1/2] proc: get process file descriptor from /proc/<pid>

With this patch an open() call on /proc/<pid> will give userspace a handle
to struct pid of the process associated with /proc/<pid>. This allows to
maintain a stable handle on a process.
I have been discussing various approaches extensively during technical
conferences this year culminating in a long argument with Eric at Linux
Plumbers. The general consensus was that having a handle on a process
should be something that is very simple and easy to maintain with the
option of being extensible via a more advanced api if the need arises. I
believe that this patch is the most simple, dumb, and therefore
maintainable solution.

[1]: https://lkml.org/lkml/2018/10/30/118

Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
Changelog:
v1:
- remove ioctl() to signal processes and replace with a dedicated syscall
in the next patch
---
fs/proc/base.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ce3465479447..6365a4fea314 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3032,10 +3032,27 @@ static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
}

+static int proc_tgid_open(struct inode *inode, struct file *file)
+{
+ /* grab reference to struct pid and stash the pointer away */
+ file->private_data = get_pid(proc_pid(inode));
+ return 0;
+}
+
+static int proc_tgid_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+ /* drop reference to struct pid */
+ put_pid(pid);
+ return 0;
+}
+
static const struct file_operations proc_tgid_base_operations = {
+ .open = proc_tgid_open,
.read = generic_read_dir,
.iterate_shared = proc_tgid_base_readdir,
.llseek = generic_file_llseek,
+ .release = proc_tgid_release,
};

static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
--
2.19.1


2018-11-19 15:35:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] proc: get process file descriptor from /proc/<pid>

On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner <[email protected]> wrote:
>
> With this patch an open() call on /proc/<pid> will give userspace a handle
> to struct pid of the process associated with /proc/<pid>. This allows to
> maintain a stable handle on a process.
> I have been discussing various approaches extensively during technical
> conferences this year culminating in a long argument with Eric at Linux
> Plumbers. The general consensus was that having a handle on a process
> should be something that is very simple and easy to maintain with the
> option of being extensible via a more advanced api if the need arises. I
> believe that this patch is the most simple, dumb, and therefore
> maintainable solution.

How does the mechanism you're adding here differ from proc_pid()?

2018-11-19 15:46:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner <[email protected]> wrote:
>
> The kill() syscall operates on process identifiers. After a process has
> exited its pid can be reused by another process. If a caller sends a signal
> to a reused pid it will end up signaling the wrong process. This issue has
> often surfaced and there has been a push [1] to address this problem.
>
> A prior patch has introduced the ability to get a file descriptor
> referencing struct pid by opening /proc/<pid>. This guarantees a stable
> handle on a process which can be used to send signals to the referenced
> process. Discussion has shown that a dedicated syscall is preferable over
> ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
> this problem. It operates on a process file descriptor.
> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> is NULL then procfd_signal() behaves like kill() if it is not NULL it
> behaves like rt_sigqueueinfo.
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0.

A few questions. First: you've made this work on /proc/PID, but
should it also work on /proc/PID/task/TID to send signals to a
specific thread?

> +bool proc_is_procfd(const struct file *file)
> +{
> + return d_is_dir(file->f_path.dentry) &&
> + (file->f_op == &proc_tgid_base_operations);
> +}

Maybe rename to proc_is_tgid_procfd() to leave room for proc_is_tid_procfd()?

> + if (info) {
> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> + if (unlikely(ret))
> + goto err;
> + /*
> + * Not even root can pretend to send signals from the kernel.
> + * Nor can they impersonate a kill()/tgkill(), which adds
> + * source info.
> + */
> + ret = -EPERM;
> + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> + (task_pid(current) != pid))
> + goto err;

Is the exception for signaling yourself actually useful here?

2018-11-19 15:58:37

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 7:45 AM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner <[email protected]> wrote:
>>
>> The kill() syscall operates on process identifiers. After a process has
>> exited its pid can be reused by another process. If a caller sends a signal
>> to a reused pid it will end up signaling the wrong process. This issue has
>> often surfaced and there has been a push [1] to address this problem.
>>
>> A prior patch has introduced the ability to get a file descriptor
>> referencing struct pid by opening /proc/<pid>. This guarantees a stable
>> handle on a process which can be used to send signals to the referenced
>> process. Discussion has shown that a dedicated syscall is preferable over
>> ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
>> this problem. It operates on a process file descriptor.
>> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
>> is NULL then procfd_signal() behaves like kill() if it is not NULL it
>> behaves like rt_sigqueueinfo.
>> The flags argument is added to allow for future extensions of this syscall.
>> It currently needs to be passed as 0.
>
> A few questions. First: you've made this work on /proc/PID, but
> should it also work on /proc/PID/task/TID to send signals to a
> specific thread?

+1

>> + if (info) {
>> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
>> + if (unlikely(ret))
>> + goto err;
>> + /*
>> + * Not even root can pretend to send signals from the kernel.
>> + * Nor can they impersonate a kill()/tgkill(), which adds
>> + * source info.
>> + */
>> + ret = -EPERM;
>> + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
>> + (task_pid(current) != pid))
>> + goto err;
>
> Is the exception for signaling yourself actually useful here?

All the signal functions exempt the current process from access
checks. Whether that's useful or not (and I think it is), being
inconsistent here would be wrong.

2018-11-19 16:00:25

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 2:32 AM, Christian Brauner <[email protected]> wrote:
> The kill() syscall operates on process identifiers. After a process has
> exited its pid can be reused by another process. If a caller sends a signal
> to a reused pid it will end up signaling the wrong process. This issue has
> often surfaced and there has been a push [1] to address this problem.
>
> A prior patch has introduced the ability to get a file descriptor
> referencing struct pid by opening /proc/<pid>. This guarantees a stable
> handle on a process which can be used to send signals to the referenced
> process. Discussion has shown that a dedicated syscall is preferable over
> ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
> this problem. It operates on a process file descriptor.
> The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> is NULL then procfd_signal() behaves like kill() if it is not NULL it
> behaves like rt_sigqueueinfo.
> The flags argument is added to allow for future extensions of this syscall.
> It currently needs to be passed as 0.
>
> With this patch a process can be killed via:
>
> #define _GNU_SOURCE
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <signal.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> int ret;
> char buf[1000];
>
> if (argc < 2)
> exit(EXIT_FAILURE);
>
> ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
> if (ret < 0)
> exit(EXIT_FAILURE);
>
> int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
> if (fd < 0) {
> printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
> exit(EXIT_FAILURE);
> }
>
> ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
> if (ret < 0) {
> printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
> close(fd);
> exit(EXIT_FAILURE);
> }
>
> close(fd);
>
> exit(EXIT_SUCCESS);
> }
>
> [1]: https://lkml.org/lkml/2018/11/18/130
>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Andy Lutomirsky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Aleksa Sarai <[email protected]>
> Cc: Al Viro <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>
> ---
> Changelog:
> v1:
> - patch introduced
> ---
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/proc/base.c | 6 ++
> include/linux/proc_fs.h | 1 +
> include/linux/syscalls.h | 2 +
> kernel/signal.c | 76 ++++++++++++++++++++++++--
> 6 files changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..e637eab883e9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
> 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
> 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
> 386 i386 rseq sys_rseq __ia32_sys_rseq
> +387 i386 procfd_signal sys_procfd_signal __ia32_sys_procfd_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..e95f6741ab42 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
> 332 common statx __x64_sys_statx
> 333 common io_pgetevents __x64_sys_io_pgetevents
> 334 common rseq __x64_sys_rseq
> +335 common procfd_signal __x64_sys_procfd_signal
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 6365a4fea314..a12c9de92bd0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3055,6 +3055,12 @@ static const struct file_operations proc_tgid_base_operations = {
> .release = proc_tgid_release,
> };
>
> +bool proc_is_procfd(const struct file *file)
> +{
> + return d_is_dir(file->f_path.dentry) &&
> + (file->f_op == &proc_tgid_base_operations);
> +}
> +
> static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
> {
> return proc_pident_lookup(dir, dentry,
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index d0e1f1522a78..2d53a47fba34 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
> int (*show)(struct seq_file *, void *),
> proc_write_t write,
> void *data);
> +extern bool proc_is_procfd(const struct file *file);
>
> #else /* CONFIG_PROC_FS */
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 2ac3d13a915b..a5ca8cb84566 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -907,6 +907,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer);
> asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> int flags, uint32_t sig);
> +asmlinkage long sys_procfd_signal(int fd, int sig, siginfo_t __user *info,
> + int flags);
>
> /*
> * Architecture-specific system calls
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9a32bc2088c9..e8a8929de710 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -19,7 +19,9 @@
> #include <linux/sched/task.h>
> #include <linux/sched/task_stack.h>
> #include <linux/sched/cputime.h>
> +#include <linux/file.h>
> #include <linux/fs.h>
> +#include <linux/proc_fs.h>
> #include <linux/tty.h>
> #include <linux/binfmts.h>
> #include <linux/coredump.h>
> @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
> }
> #endif
>
> +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
> +{
> + clear_siginfo(info);
> + info->si_signo = sig;
> + info->si_errno = 0;
> + info->si_code = SI_USER;
> + info->si_pid = task_tgid_vnr(current);
> + info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
> +}
> +
> /**
> * sys_kill - send a signal to a process
> * @pid: the PID of the process
> @@ -3295,16 +3307,68 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
> {
> struct kernel_siginfo info;
>
> - clear_siginfo(&info);
> - info.si_signo = sig;
> - info.si_errno = 0;
> - info.si_code = SI_USER;
> - info.si_pid = task_tgid_vnr(current);
> - info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> + prepare_kill_siginfo(sig, &info);
>
> return kill_something_info(sig, &info, pid);
> }
>
> +/**
> + * sys_procfd_signal - send a signal to a process through a process file
> + * descriptor
> + * @fd: the file descriptor of the process
> + * @sig: signal to be sent
> + * @info: the signal info
> + * @flags: future flags to be passed
> + */
> +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> + int, flags)
> +{
> + int ret;
> + struct pid *pid;
> + kernel_siginfo_t kinfo;
> + struct fd f;
> +
> + /* Enforce flags be set to 0 until we add an extension. */
> + if (flags)
> + return -EINVAL;
> +
> + f = fdget_raw(fd);
> + if (!f.file)
> + return -EBADF;
> +
> + ret = -EINVAL;
> + /* Is this a process file descriptor? */
> + if (!proc_is_procfd(f.file))
> + goto err;
> +
> + pid = f.file->private_data;

You never addressed my comment on the previous patch about your use of
private_data here. Why can't you use the struct pid reference that's
already in the inode?

2018-11-19 17:12:23

by Eugene Syromiatnikov

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 3cf7b533b3d1..e637eab883e9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -398,3 +398,4 @@
> 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
> 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
> 386 i386 rseq sys_rseq __ia32_sys_rseq
> +387 i386 procfd_signal sys_procfd_signal __ia32_sys_procfd_signal
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f0b1709a5ffb..e95f6741ab42 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -343,6 +343,7 @@
> 332 common statx __x64_sys_statx
> 333 common io_pgetevents __x64_sys_io_pgetevents
> 334 common rseq __x64_sys_rseq
> +335 common procfd_signal __x64_sys_procfd_signal

You have wired up the syscall on x86 but have not added the syscall number
to the generic syscall header (include/uapi/asm-generic/unistd.h), why?

2018-11-19 17:15:35

by Eugene Syromiatnikov

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> +/**
> + * sys_procfd_signal - send a signal to a process through a process file
> + * descriptor
> + * @fd: the file descriptor of the process
> + * @sig: signal to be sent
> + * @info: the signal info
> + * @flags: future flags to be passed
> + */
> +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> + int, flags)

It could be considered better to use an unsigned type for the "flags"
argument.

2018-11-19 18:22:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] proc: get process file descriptor from /proc/<pid>

On Mon, Nov 19, 2018 at 07:32:33AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner <[email protected]> wrote:
> >
> > With this patch an open() call on /proc/<pid> will give userspace a handle
> > to struct pid of the process associated with /proc/<pid>. This allows to
> > maintain a stable handle on a process.
> > I have been discussing various approaches extensively during technical
> > conferences this year culminating in a long argument with Eric at Linux
> > Plumbers. The general consensus was that having a handle on a process
> > should be something that is very simple and easy to maintain with the
> > option of being extensible via a more advanced api if the need arises. I
> > believe that this patch is the most simple, dumb, and therefore
> > maintainable solution.
>
> How does the mechanism you're adding here differ from proc_pid()?

I'm sorry, I am missing the context around proc_pid()?. Are you talking
about the the proc_pid() function in "internal.h"? What exactly is the
proc_pid() mechanism?

2018-11-19 18:24:22

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 06:10:53PM +0100, Eugene Syromiatnikov wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 3cf7b533b3d1..e637eab883e9 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -398,3 +398,4 @@
> > 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
> > 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
> > 386 i386 rseq sys_rseq __ia32_sys_rseq
> > +387 i386 procfd_signal sys_procfd_signal __ia32_sys_procfd_signal
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f0b1709a5ffb..e95f6741ab42 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -343,6 +343,7 @@
> > 332 common statx __x64_sys_statx
> > 333 common io_pgetevents __x64_sys_io_pgetevents
> > 334 common rseq __x64_sys_rseq
> > +335 common procfd_signal __x64_sys_procfd_signal
>
> You have wired up the syscall on x86 but have not added the syscall number
> to the generic syscall header (include/uapi/asm-generic/unistd.h), why?

Oversight on my part, sorry.

2018-11-19 18:31:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 2:32 AM, Christian Brauner <[email protected]> wrote:
> > The kill() syscall operates on process identifiers. After a process has
> > exited its pid can be reused by another process. If a caller sends a signal
> > to a reused pid it will end up signaling the wrong process. This issue has
> > often surfaced and there has been a push [1] to address this problem.
> >
> > A prior patch has introduced the ability to get a file descriptor
> > referencing struct pid by opening /proc/<pid>. This guarantees a stable
> > handle on a process which can be used to send signals to the referenced
> > process. Discussion has shown that a dedicated syscall is preferable over
> > ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
> > this problem. It operates on a process file descriptor.
> > The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> > is NULL then procfd_signal() behaves like kill() if it is not NULL it
> > behaves like rt_sigqueueinfo.
> > The flags argument is added to allow for future extensions of this syscall.
> > It currently needs to be passed as 0.
> >
> > With this patch a process can be killed via:
> >
> > #define _GNU_SOURCE
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <signal.h>
> > #include <sys/stat.h>
> > #include <sys/syscall.h>
> > #include <sys/types.h>
> > #include <unistd.h>
> >
> > int main(int argc, char *argv[])
> > {
> > int ret;
> > char buf[1000];
> >
> > if (argc < 2)
> > exit(EXIT_FAILURE);
> >
> > ret = snprintf(buf, sizeof(buf), "/proc/%s", argv[1]);
> > if (ret < 0)
> > exit(EXIT_FAILURE);
> >
> > int fd = open(buf, O_DIRECTORY | O_CLOEXEC);
> > if (fd < 0) {
> > printf("%s - Failed to open \"%s\"\n", strerror(errno), buf);
> > exit(EXIT_FAILURE);
> > }
> >
> > ret = syscall(__NR_procfd_signal, fd, SIGKILL, NULL, 0);
> > if (ret < 0) {
> > printf("Failed to send SIGKILL \"%s\"\n", strerror(errno));
> > close(fd);
> > exit(EXIT_FAILURE);
> > }
> >
> > close(fd);
> >
> > exit(EXIT_SUCCESS);
> > }
> >
> > [1]: https://lkml.org/lkml/2018/11/18/130
> >
> > Cc: "Eric W. Biederman" <[email protected]>
> > Cc: Serge Hallyn <[email protected]>
> > Cc: Jann Horn <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: Andy Lutomirsky <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Aleksa Sarai <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
> > Changelog:
> > v1:
> > - patch introduced
> > ---
> > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > fs/proc/base.c | 6 ++
> > include/linux/proc_fs.h | 1 +
> > include/linux/syscalls.h | 2 +
> > kernel/signal.c | 76 ++++++++++++++++++++++++--
> > 6 files changed, 81 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 3cf7b533b3d1..e637eab883e9 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -398,3 +398,4 @@
> > 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
> > 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
> > 386 i386 rseq sys_rseq __ia32_sys_rseq
> > +387 i386 procfd_signal sys_procfd_signal __ia32_sys_procfd_signal
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f0b1709a5ffb..e95f6741ab42 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -343,6 +343,7 @@
> > 332 common statx __x64_sys_statx
> > 333 common io_pgetevents __x64_sys_io_pgetevents
> > 334 common rseq __x64_sys_rseq
> > +335 common procfd_signal __x64_sys_procfd_signal
> >
> > #
> > # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 6365a4fea314..a12c9de92bd0 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3055,6 +3055,12 @@ static const struct file_operations proc_tgid_base_operations = {
> > .release = proc_tgid_release,
> > };
> >
> > +bool proc_is_procfd(const struct file *file)
> > +{
> > + return d_is_dir(file->f_path.dentry) &&
> > + (file->f_op == &proc_tgid_base_operations);
> > +}
> > +
> > static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
> > {
> > return proc_pident_lookup(dir, dentry,
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index d0e1f1522a78..2d53a47fba34 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -73,6 +73,7 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
> > int (*show)(struct seq_file *, void *),
> > proc_write_t write,
> > void *data);
> > +extern bool proc_is_procfd(const struct file *file);
> >
> > #else /* CONFIG_PROC_FS */
> >
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 2ac3d13a915b..a5ca8cb84566 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -907,6 +907,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> > unsigned mask, struct statx __user *buffer);
> > asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> > int flags, uint32_t sig);
> > +asmlinkage long sys_procfd_signal(int fd, int sig, siginfo_t __user *info,
> > + int flags);
> >
> > /*
> > * Architecture-specific system calls
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 9a32bc2088c9..e8a8929de710 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -19,7 +19,9 @@
> > #include <linux/sched/task.h>
> > #include <linux/sched/task_stack.h>
> > #include <linux/sched/cputime.h>
> > +#include <linux/file.h>
> > #include <linux/fs.h>
> > +#include <linux/proc_fs.h>
> > #include <linux/tty.h>
> > #include <linux/binfmts.h>
> > #include <linux/coredump.h>
> > @@ -3286,6 +3288,16 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
> > }
> > #endif
> >
> > +static inline void prepare_kill_siginfo(int sig, struct kernel_siginfo *info)
> > +{
> > + clear_siginfo(info);
> > + info->si_signo = sig;
> > + info->si_errno = 0;
> > + info->si_code = SI_USER;
> > + info->si_pid = task_tgid_vnr(current);
> > + info->si_uid = from_kuid_munged(current_user_ns(), current_uid());
> > +}
> > +
> > /**
> > * sys_kill - send a signal to a process
> > * @pid: the PID of the process
> > @@ -3295,16 +3307,68 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
> > {
> > struct kernel_siginfo info;
> >
> > - clear_siginfo(&info);
> > - info.si_signo = sig;
> > - info.si_errno = 0;
> > - info.si_code = SI_USER;
> > - info.si_pid = task_tgid_vnr(current);
> > - info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> > + prepare_kill_siginfo(sig, &info);
> >
> > return kill_something_info(sig, &info, pid);
> > }
> >
> > +/**
> > + * sys_procfd_signal - send a signal to a process through a process file
> > + * descriptor
> > + * @fd: the file descriptor of the process
> > + * @sig: signal to be sent
> > + * @info: the signal info
> > + * @flags: future flags to be passed
> > + */
> > +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> > + int, flags)
> > +{
> > + int ret;
> > + struct pid *pid;
> > + kernel_siginfo_t kinfo;
> > + struct fd f;
> > +
> > + /* Enforce flags be set to 0 until we add an extension. */
> > + if (flags)
> > + return -EINVAL;
> > +
> > + f = fdget_raw(fd);
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + ret = -EINVAL;
> > + /* Is this a process file descriptor? */
> > + if (!proc_is_procfd(f.file))
> > + goto err;
> > +
> > + pid = f.file->private_data;
>
> You never addressed my comment on the previous patch about your use of

Sorry, that thread exploded so quickly that I might have missed it.

> private_data here. Why can't you use the struct pid reference that's
> already in the inode?

If that's what people prefer we can probably use that. There was
precedent for stashing away such data in fs/proc/base.c already for
various other things including user namespaces and struct mm so I
followed this model. A prior version of my patch (I didn't send out) did
retrive the inode via proc_pid() in .open() took an additional reference
via get_pid() and dropped it in .release(). Do we prefer that?

2018-11-19 18:53:50

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 07:45:04AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 2:33 AM Christian Brauner <[email protected]> wrote:
> >
> > The kill() syscall operates on process identifiers. After a process has
> > exited its pid can be reused by another process. If a caller sends a signal
> > to a reused pid it will end up signaling the wrong process. This issue has
> > often surfaced and there has been a push [1] to address this problem.
> >
> > A prior patch has introduced the ability to get a file descriptor
> > referencing struct pid by opening /proc/<pid>. This guarantees a stable
> > handle on a process which can be used to send signals to the referenced
> > process. Discussion has shown that a dedicated syscall is preferable over
> > ioctl()s. Thus, the new syscall procfd_signal() is introduced to solve
> > this problem. It operates on a process file descriptor.
> > The syscall takes an additional siginfo_t and flags argument. If siginfo_t
> > is NULL then procfd_signal() behaves like kill() if it is not NULL it
> > behaves like rt_sigqueueinfo.
> > The flags argument is added to allow for future extensions of this syscall.
> > It currently needs to be passed as 0.
>
> A few questions. First: you've made this work on /proc/PID, but
> should it also work on /proc/PID/task/TID to send signals to a
> specific thread?

Yeah, so I thought about that. Your point being to combine: kill(),
tgkill() aka rt_sigqueueinfo() and rt_tg_sigqueueinfo(). If I understand
this correctly the implication is to also get file descriptors to
/proc/PID/task/TID and pass them to procfd_signal()? Can we hold of on
that one? Adding this in the future should be easily doable by simply
getting /proc/PID/task/TID file descriptors but I would like this
patchset to be as small as possible.

>
> > +bool proc_is_procfd(const struct file *file)
> > +{
> > + return d_is_dir(file->f_path.dentry) &&
> > + (file->f_op == &proc_tgid_base_operations);
> > +}
>
> Maybe rename to proc_is_tgid_procfd() to leave room for proc_is_tid_procfd()?

Yes, good idea!

>
> > + if (info) {
> > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > + if (unlikely(ret))
> > + goto err;
> > + /*
> > + * Not even root can pretend to send signals from the kernel.
> > + * Nor can they impersonate a kill()/tgkill(), which adds
> > + * source info.
> > + */
> > + ret = -EPERM;
> > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > + (task_pid(current) != pid))
> > + goto err;
>
> Is the exception for signaling yourself actually useful here?

I tried to strictly follow the sigqueue-based permission checks. I'm not
comfortable removing this check without signal-experts telling me that
it is safe to do.

2018-11-19 19:03:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

Christian Brauner <[email protected]> writes:

> On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
>> You never addressed my comment on the previous patch about your use of
>
> Sorry, that thread exploded so quickly that I might have missed it.
>
>> private_data here. Why can't you use the struct pid reference that's
>> already in the inode?
>
> If that's what people prefer we can probably use that. There was
> precedent for stashing away such data in fs/proc/base.c already for
> various other things including user namespaces and struct mm so I
> followed this model. A prior version of my patch (I didn't send out) did
> retrive the inode via proc_pid() in .open() took an additional reference
> via get_pid() and dropped it in .release(). Do we prefer that?

If you are using proc/<pid>/ directories as your file descriptors, you
don't need to add an open or a release method at all. The existing file
descriptors hold a reference to the inode which holds a reference the
the struct pid.

The only time you need to get a reference is when you need a task
and kill_pid_info already performs that work for you.

So using proc_pid is all you need to do to get the pid from the existing
file descriptors.

Eric


2018-11-19 19:33:46

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 01:02:06PM -0600, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
> >> You never addressed my comment on the previous patch about your use of
> >
> > Sorry, that thread exploded so quickly that I might have missed it.
> >
> >> private_data here. Why can't you use the struct pid reference that's
> >> already in the inode?
> >
> > If that's what people prefer we can probably use that. There was
> > precedent for stashing away such data in fs/proc/base.c already for
> > various other things including user namespaces and struct mm so I
> > followed this model. A prior version of my patch (I didn't send out) did
> > retrive the inode via proc_pid() in .open() took an additional reference
> > via get_pid() and dropped it in .release(). Do we prefer that?
>
> If you are using proc/<pid>/ directories as your file descriptors, you
> don't need to add an open or a release method at all. The existing file
> descriptors hold a reference to the inode which holds a reference the
> the struct pid.
>
> The only time you need to get a reference is when you need a task
> and kill_pid_info already performs that work for you.

Oh, I see what you and Andy are saying now. Sweet, that means we can
trim down the patch even more. Less code, less headache.

Thanks!

>
> So using proc_pid is all you need to do to get the pid from the existing
> file descriptors.
>
> Eric
>

2018-11-19 19:41:21

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

Yep. That's also what I was talking about, FWIW.

On Mon, Nov 19, 2018 at 11:31 AM, Christian Brauner
<[email protected]> wrote:
> On Mon, Nov 19, 2018 at 01:02:06PM -0600, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > On Mon, Nov 19, 2018 at 07:59:24AM -0800, Daniel Colascione wrote:
>> >> You never addressed my comment on the previous patch about your use of
>> >
>> > Sorry, that thread exploded so quickly that I might have missed it.
>> >
>> >> private_data here. Why can't you use the struct pid reference that's
>> >> already in the inode?
>> >
>> > If that's what people prefer we can probably use that. There was
>> > precedent for stashing away such data in fs/proc/base.c already for
>> > various other things including user namespaces and struct mm so I
>> > followed this model. A prior version of my patch (I didn't send out) did
>> > retrive the inode via proc_pid() in .open() took an additional reference
>> > via get_pid() and dropped it in .release(). Do we prefer that?
>>
>> If you are using proc/<pid>/ directories as your file descriptors, you
>> don't need to add an open or a release method at all. The existing file
>> descriptors hold a reference to the inode which holds a reference the
>> the struct pid.
>>
>> The only time you need to get a reference is when you need a task
>> and kill_pid_info already performs that work for you.
>
> Oh, I see what you and Andy are saying now. Sweet, that means we can
> trim down the patch even more. Less code, less headache.
>
> Thanks!
>
>>
>> So using proc_pid is all you need to do to get the pid from the existing
>> file descriptors.
>>
>> Eric
>>

2018-11-19 20:31:40

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On 2018-11-19, Christian Brauner <[email protected]> wrote:
> + if (info) {
> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> + if (unlikely(ret))
> + goto err;
> + /*
> + * Not even root can pretend to send signals from the kernel.
> + * Nor can they impersonate a kill()/tgkill(), which adds
> + * source info.
> + */
> + ret = -EPERM;
> + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> + (task_pid(current) != pid))
> + goto err;
> + } else {
> + prepare_kill_siginfo(sig, &kinfo);
> + }

I wonder whether we should also have a pidns restriction here, since
currently it isn't possible for a container process using a pidns to
signal processes outside its pidns. AFAICS, this isn't done through an
explicit check -- it's a side-effect of processes in a pidns not being
able to address non-descendant-pidns processes.

But maybe it's reasonable to allow sending a procfd to a different pidns
and the same operations working on it? If we extend the procfd API to
allow process creation this would allow a container to create a process
outside its pidns.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.20 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 20:56:23

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > + if (info) {
> > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > + if (unlikely(ret))
> > + goto err;
> > + /*
> > + * Not even root can pretend to send signals from the kernel.
> > + * Nor can they impersonate a kill()/tgkill(), which adds
> > + * source info.
> > + */
> > + ret = -EPERM;
> > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > + (task_pid(current) != pid))
> > + goto err;
> > + } else {
> > + prepare_kill_siginfo(sig, &kinfo);
> > + }
>
> I wonder whether we should also have a pidns restriction here, since
> currently it isn't possible for a container process using a pidns to
> signal processes outside its pidns. AFAICS, this isn't done through an
> explicit check -- it's a side-effect of processes in a pidns not being
> able to address non-descendant-pidns processes.
>
> But maybe it's reasonable to allow sending a procfd to a different pidns
> and the same operations working on it? If we extend the procfd API to

No, I don't think so. I really don't want any fancy semantics in here.
Fancy doesn't get merged and fancy is hard to maintain. So we should do
something like:

if (proc_pid_ns() != current_pid_ns)
return EINVAL

> allow process creation this would allow a container to create a process
> outside its pidns.
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>



2018-11-19 21:15:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 09:55:18PM +0100, Christian Brauner wrote:
> On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> > > + /*
> > > + * Not even root can pretend to send signals from the kernel.
> > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > + * source info.
> > > + */
> > > + ret = -EPERM;
> > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > + (task_pid(current) != pid))
> > > + goto err;
> > > + } else {
> > > + prepare_kill_siginfo(sig, &kinfo);
> > > + }
> >
> > I wonder whether we should also have a pidns restriction here, since
> > currently it isn't possible for a container process using a pidns to
> > signal processes outside its pidns. AFAICS, this isn't done through an
> > explicit check -- it's a side-effect of processes in a pidns not being
> > able to address non-descendant-pidns processes.
> >
> > But maybe it's reasonable to allow sending a procfd to a different pidns
> > and the same operations working on it? If we extend the procfd API to
>
> No, I don't think so. I really don't want any fancy semantics in here.
> Fancy doesn't get merged and fancy is hard to maintain. So we should do
> something like:
>
> if (proc_pid_ns() != current_pid_ns)
> return EINVAL

To be more precise, we need to detect if fd refers to an ancestor pidns
and if so return EINVAL.

>
> > allow process creation this would allow a container to create a process
> > outside its pidns.
> >
> > --
> > Aleksa Sarai
> > Senior Software Engineer (Containers)
> > SUSE Linux GmbH
> > <https://www.cyphar.com/>
>
>

2018-11-19 21:19:48

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On 2018-11-19, Christian Brauner <[email protected]> wrote:
> On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> > > + /*
> > > + * Not even root can pretend to send signals from the kernel.
> > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > + * source info.
> > > + */
> > > + ret = -EPERM;
> > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > + (task_pid(current) != pid))
> > > + goto err;
> > > + } else {
> > > + prepare_kill_siginfo(sig, &kinfo);
> > > + }
> >
> > I wonder whether we should also have a pidns restriction here, since
> > currently it isn't possible for a container process using a pidns to
> > signal processes outside its pidns. AFAICS, this isn't done through an
> > explicit check -- it's a side-effect of processes in a pidns not being
> > able to address non-descendant-pidns processes.
> >
> > But maybe it's reasonable to allow sending a procfd to a different pidns
> > and the same operations working on it? If we extend the procfd API to
>
> No, I don't think so. I really don't want any fancy semantics in here.
> Fancy doesn't get merged and fancy is hard to maintain. So we should do
> something like:
>
> if (proc_pid_ns() != current_pid_ns)
> return EINVAL

This isn't quite sufficient. The key thing is that you have to be in an
*ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
the check already in pidns_get_parent, and expose it. It would be
something as trivial as:

bool pidns_is_descendant(struct pid_namespace *ns,
struct pid_namespace *ancestor)
{
for (;;) {
if (!ns)
return false;
if (ns == ancestor)
break;
ns = ns->parent;
}
return true;
}

And you can rewrite pidns_get_parent to use it. So you would instead be
doing:

if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
return -EPERM;

(Or you can just copy the 5-line loop into procfd_signal -- though I
imagine we'll need this for all of the procfd_* APIs.)

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.38 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 21:21:45

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > > + if (info) {
> > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > + if (unlikely(ret))
> > > > + goto err;
> > > > + /*
> > > > + * Not even root can pretend to send signals from the kernel.
> > > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > > + * source info.
> > > > + */
> > > > + ret = -EPERM;
> > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > + (task_pid(current) != pid))
> > > > + goto err;
> > > > + } else {
> > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > + }
> > >
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > >
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> >
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> >
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
>
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use

See my next mail.

> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
>
> bool pidns_is_descendant(struct pid_namespace *ns,
> struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }
>
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
>
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;
>
> (Or you can just copy the 5-line loop into procfd_signal -- though I
> imagine we'll need this for all of the procfd_* APIs.)
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>



2018-11-19 21:23:21

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > > + if (info) {
> > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > + if (unlikely(ret))
> > > > + goto err;
> > > > + /*
> > > > + * Not even root can pretend to send signals from the kernel.
> > > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > > + * source info.
> > > > + */
> > > > + ret = -EPERM;
> > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > + (task_pid(current) != pid))
> > > > + goto err;
> > > > + } else {
> > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > + }
> > >
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > >
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> >
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> >
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
>
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
>
> bool pidns_is_descendant(struct pid_namespace *ns,
> struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }

That can be done without a loop by comparing the level counter for the
two pid namespaces.

>
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
>
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;
>
> (Or you can just copy the 5-line loop into procfd_signal -- though I
> imagine we'll need this for all of the procfd_* APIs.)
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>



2018-11-19 21:24:53

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On 2018-11-20, Aleksa Sarai <[email protected]> wrote:
> On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > > + if (info) {
> > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > + if (unlikely(ret))
> > > > + goto err;
> > > > + /*
> > > > + * Not even root can pretend to send signals from the kernel.
> > > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > > + * source info.
> > > > + */
> > > > + ret = -EPERM;
> > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > + (task_pid(current) != pid))
> > > > + goto err;
> > > > + } else {
> > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > + }
> > >
> > > I wonder whether we should also have a pidns restriction here, since
> > > currently it isn't possible for a container process using a pidns to
> > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > explicit check -- it's a side-effect of processes in a pidns not being
> > > able to address non-descendant-pidns processes.
> > >
> > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > and the same operations working on it? If we extend the procfd API to
> >
> > No, I don't think so. I really don't want any fancy semantics in here.
> > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > something like:
> >
> > if (proc_pid_ns() != current_pid_ns)
> > return EINVAL
>
> This isn't quite sufficient. The key thing is that you have to be in an
> *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> the check already in pidns_get_parent, and expose it. It would be
> something as trivial as:
>
> bool pidns_is_descendant(struct pid_namespace *ns,
> struct pid_namespace *ancestor)
> {
> for (;;) {
> if (!ns)
> return false;
> if (ns == ancestor)
> break;
> ns = ns->parent;
> }
> return true;
> }
>
> And you can rewrite pidns_get_parent to use it. So you would instead be
> doing:
>
> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> return -EPERM;

Scratch the last bit, -EPERM is wrong here. I would argue that -EINVAL
is *somewhat* wrong because arguable the more semantically consistent
error (with kill(2)) would be -ESRCH -- but then you're mixing the "pid
is dead" and "pid is not visible to you" cases. I'm not sure what the
right errno would be here (I'm sure some of the LKML greybeards will
have a better clue.) :P

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.80 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 21:26:30

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On 2018-11-19, Christian Brauner <[email protected]> wrote:
> On Tue, Nov 20, 2018 at 08:18:10AM +1100, Aleksa Sarai wrote:
> > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > > > + if (info) {
> > > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > > + if (unlikely(ret))
> > > > > + goto err;
> > > > > + /*
> > > > > + * Not even root can pretend to send signals from the kernel.
> > > > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > > > + * source info.
> > > > > + */
> > > > > + ret = -EPERM;
> > > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > > + (task_pid(current) != pid))
> > > > > + goto err;
> > > > > + } else {
> > > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > > + }
> > > >
> > > > I wonder whether we should also have a pidns restriction here, since
> > > > currently it isn't possible for a container process using a pidns to
> > > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > > explicit check -- it's a side-effect of processes in a pidns not being
> > > > able to address non-descendant-pidns processes.
> > > >
> > > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > > and the same operations working on it? If we extend the procfd API to
> > >
> > > No, I don't think so. I really don't want any fancy semantics in here.
> > > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > > something like:
> > >
> > > if (proc_pid_ns() != current_pid_ns)
> > > return EINVAL
> >
> > This isn't quite sufficient. The key thing is that you have to be in an
> > *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> > the check already in pidns_get_parent, and expose it. It would be
> > something as trivial as:
> >
> > bool pidns_is_descendant(struct pid_namespace *ns,
> > struct pid_namespace *ancestor)
> > {
> > for (;;) {
> > if (!ns)
> > return false;
> > if (ns == ancestor)
> > break;
> > ns = ns->parent;
> > }
> > return true;
> > }
>
> That can be done without a loop by comparing the level counter for the
> two pid namespaces.

If so, we can refactor how pidns_get_parent() and family work. :P

But yes, I agree with doing the above check.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.62 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 21:27:39

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> That can be done without a loop by comparing the level counter for the
> two pid namespaces.
>
>>
>> And you can rewrite pidns_get_parent to use it. So you would instead be
>> doing:
>>
>> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
>> return -EPERM;
>>
>> (Or you can just copy the 5-line loop into procfd_signal -- though I
>> imagine we'll need this for all of the procfd_* APIs.)

Why is any of this even necessary? Why does the child namespace we're
considering even have a file descriptor to its ancestor's procfs? If
it has one of these FDs, it can already *read* all sorts of
information it really shouldn't be able to acquire, so the additional
ability to send a signal (subject to the usual permission checks)
feels like sticking a finger in a dike that's already well-perforated.
IMHO, we shouldn't bother with this check. The patch would be simpler
without it.

2018-11-19 21:38:32

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On 2018-11-19, Daniel Colascione <[email protected]> wrote:
> On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> > That can be done without a loop by comparing the level counter for the
> > two pid namespaces.
> >
> >>
> >> And you can rewrite pidns_get_parent to use it. So you would instead be
> >> doing:
> >>
> >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> return -EPERM;
> >>
> >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> imagine we'll need this for all of the procfd_* APIs.)
>
> Why is any of this even necessary? Why does the child namespace we're
> considering even have a file descriptor to its ancestor's procfs? If
> it has one of these FDs, it can already *read* all sorts of
> information it really shouldn't be able to acquire, so the additional
> ability to send a signal (subject to the usual permission checks)
> feels like sticking a finger in a dike that's already well-perforated.
> IMHO, we shouldn't bother with this check. The patch would be simpler
> without it.

First of all, currently it isn't possible to signal processes in an
ancestor pidns. Given the long thread about exit code visibility
semantics, I'm sure you see why bringing up this question is reasonable.

Some people (stupidly) bind-mount / into containers. There were several
CVEs in both LXC and runc where you could access the host filesystem
(including the host /proc). I'd prefer to not provide a mechanism for
such escalations to start sending signals to host processes, since I
don't see a strong reason why it should be allowed (and allowing it
would add more cracks to the isolation of pidns).

I think there is a huge difference between having read access to /proc
and being able to use /proc to signal processes which you ordinarily
would not be able to signal.

And another important point is that of semantics.

If we move forward with procfd_new() and the rest of the API we are
discussing, I'd argue we'd want to allow passing an nsfs fd to specify
what pidns we want the process to be created in (for procfd_new()). This
will obviously require a permission check to make sure we aren't
creating processes in a parent pidns -- and so for consistency all
procfd_* operations should have similar checks.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.41 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 21:39:58

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> > That can be done without a loop by comparing the level counter for the
> > two pid namespaces.
> >
> >>
> >> And you can rewrite pidns_get_parent to use it. So you would instead be
> >> doing:
> >>
> >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> return -EPERM;
> >>
> >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> imagine we'll need this for all of the procfd_* APIs.)
>
> Why is any of this even necessary? Why does the child namespace we're
> considering even have a file descriptor to its ancestor's procfs? If

Because you can send file descriptors between processes and container
runtimes tend to do that.

> it has one of these FDs, it can already *read* all sorts of
> information it really shouldn't be able to acquire, so the additional
> ability to send a signal (subject to the usual permission checks)
> feels like sticking a finger in a dike that's already well-perforated.
> IMHO, we shouldn't bother with this check. The patch would be simpler
> without it.

We will definitely not allow signaling processes in an ancestor pid
namespace! That is a security issue! I can imagine container runtimes
killing their monitoring process etc. pp. Not happening, unless someone
with deep expertise in signals can convince me otherwise.

2018-11-19 21:43:51

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner <[email protected]> wrote:
>
> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> > > That can be done without a loop by comparing the level counter for the
> > > two pid namespaces.
> > >
> > >>
> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
> > >> doing:
> > >>
> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > >> return -EPERM;
> > >>
> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> > >> imagine we'll need this for all of the procfd_* APIs.)
> >
> > Why is any of this even necessary? Why does the child namespace we're
> > considering even have a file descriptor to its ancestor's procfs? If
>
> Because you can send file descriptors between processes and container
> runtimes tend to do that.

Right. But why *would* a container runtime send one of these procfs
FDs to a container?

> > it has one of these FDs, it can already *read* all sorts of
> > information it really shouldn't be able to acquire, so the additional
> > ability to send a signal (subject to the usual permission checks)
> > feels like sticking a finger in a dike that's already well-perforated.
> > IMHO, we shouldn't bother with this check. The patch would be simpler
> > without it.
>
> We will definitely not allow signaling processes in an ancestor pid
> namespace! That is a security issue! I can imagine container runtimes
> killing their monitoring process etc. pp. Not happening, unless someone
> with deep expertise in signals can convince me otherwise.

If parent namespace procfs FDs or mounts really can leak into child
namespaces as easily as Aleksa says, then I don't mind adding the
check. I was under the impression that if you find yourself in this
situation, you already have a big problem.

2018-11-19 22:40:52

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
>
> +/**
> + * sys_procfd_signal - send a signal to a process through a process file
> + * descriptor
> + * @fd: the file descriptor of the process
> + * @sig: signal to be sent
> + * @info: the signal info
> + * @flags: future flags to be passed
> + */
> +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> + int, flags)
> +{

Can I just register an objection here that I think using a syscall
just for this is silly?

My understanding is that the concern is that some code might do:

unknown_fd = recv_fd();
ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
// whoops, unknown_fd was a procfd and we killed a task!

In my experience when writing fd sending/receiving code, the sender and
receiver are fairly tightly coupled. Has anyone ever actually fixed a
bug where they had an fd that they lost track of what "type" it was
and screwed up like this? It seems completely theoretical to me.

The ioctl() approach has the benefit of being extensible. Adding a
syscall means that everyone has to do all the boilerplate for each new
pid op in the kernel, arches, libc, strace, etc.

Tycho

2018-11-19 23:08:17

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 02:49:22PM -0800, Daniel Colascione wrote:
> On Mon, Nov 19, 2018 at 2:40 PM Tycho Andersen <[email protected]> wrote:
> > Can I just register an objection here that I think using a syscall
> > just for this is silly?
>
> Yes, you can argue that the bikeshed should be ioctl-colored and not
> syscall-colored.
>
> > My understanding is that the concern is that some code might do:
> >
> > unknown_fd = recv_fd();
> > ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> > // whoops, unknown_fd was a procfd and we killed a task!
> >
> > In my experience when writing fd sending/receiving code, the sender and
> > receiver are fairly tightly coupled. Has anyone ever actually fixed a
> > bug where they had an fd that they lost track of what "type" it was
> > and screwed up like this? It seems completely theoretical to me.
>
> Yes, I have fixed bugs of this form.
>
> > The ioctl() approach has the benefit of being extensible.
>
> The system call table is also extensible.

But not infinitely so. The x32 ABI starts at 512, and right now I see
334 syscalls on x86_64. So the next 178 people can say "let's just
define a syscall", and after that? I suppose we could move to setting
BIT(10), but how much userspace assumes > 512 => compat syscall and
blocks it via seccomp or whatever?

Contrast that with the ioctl space, which is an unsigned long and
fairly sparse still (Documentation/ioctl/ioctl-number.txt).

> ioctl is for when a given piece of functionality *can't*
> realistically get its own system call because it's separated from
> the main kernel somehow. procfs is a core part of the kernel, so we
> can and should expose interfaces to it using system calls.

I suppose it's obvious, but I disagree.

> > Adding a
> > syscall means that everyone has to do all the boilerplate for each new
> > pid op in the kernel, arches, libc, strace, etc.
>
> These tools also care about ioctls. Adding a system call is a pain,
> but the solution is to make adding system calls less of a pain, not to
> permanently make the Linux ABI worse.

For user-defined values of "worse" :)

Tycho

2018-11-19 23:29:47

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 2:40 PM Tycho Andersen <[email protected]> wrote:
> Can I just register an objection here that I think using a syscall
> just for this is silly?

Yes, you can argue that the bikeshed should be ioctl-colored and not
syscall-colored.

> My understanding is that the concern is that some code might do:
>
> unknown_fd = recv_fd();
> ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> // whoops, unknown_fd was a procfd and we killed a task!
>
> In my experience when writing fd sending/receiving code, the sender and
> receiver are fairly tightly coupled. Has anyone ever actually fixed a
> bug where they had an fd that they lost track of what "type" it was
> and screwed up like this? It seems completely theoretical to me.

Yes, I have fixed bugs of this form.

> The ioctl() approach has the benefit of being extensible.

The system call table is also extensible. ioctl is for when a given
piece of functionality *can't* realistically get its own system call
because it's separated from the main kernel somehow. procfs is a core
part of the kernel, so we can and should expose interfaces to it using
system calls.

> Adding a
> syscall means that everyone has to do all the boilerplate for each new
> pid op in the kernel, arches, libc, strace, etc.

These tools also care about ioctls. Adding a system call is a pain,
but the solution is to make adding system calls less of a pain, not to
permanently make the Linux ABI worse.

2018-11-20 00:08:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

Hi Christian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3]
[cannot apply to next-20181119]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv

All warnings (new ones prefixed by >>):

>> <stdin>:1338:2: warning: #warning syscall procfd_signal not implemented [-Wcpp]

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.05 kB)
.config.gz (4.31 kB)
Download all attachments

2018-11-20 00:11:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 07:37:58AM +0800, kbuild test robot wrote:
> Hi Christian,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.20-rc3]
> [cannot apply to next-20181119]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
> config: riscv-tinyconfig (attached as .config)
> compiler: riscv64-linux-gcc (GCC) 8.1.0
> reproduce:
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=8.1.0 make.cross ARCH=riscv
>
> All errors (new ones prefixed by >>):
>
> kernel/signal.c: In function '__do_sys_procfd_signal':
> >> kernel/signal.c:3341:7: error: implicit declaration of function 'proc_is_procfd'; did you mean 'clockid_to_fd'? [-Werror=implicit-function-declaration]
> if (!proc_is_procfd(f.file))
> ^~~~~~~~~~~~~~

On my radar and fixed. This happens when CONFIG_PROC_FS unset.

> clockid_to_fd
> cc1: some warnings being treated as errors
>
> vim +3341 kernel/signal.c
>
> 3314
> 3315 /**
> 3316 * sys_procfd_signal - send a signal to a process through a process file
> 3317 * descriptor
> 3318 * @fd: the file descriptor of the process
> 3319 * @sig: signal to be sent
> 3320 * @info: the signal info
> 3321 * @flags: future flags to be passed
> 3322 */
> 3323 SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> 3324 int, flags)
> 3325 {
> 3326 int ret;
> 3327 struct pid *pid;
> 3328 kernel_siginfo_t kinfo;
> 3329 struct fd f;
> 3330
> 3331 /* Enforce flags be set to 0 until we add an extension. */
> 3332 if (flags)
> 3333 return -EINVAL;
> 3334
> 3335 f = fdget_raw(fd);
> 3336 if (!f.file)
> 3337 return -EBADF;
> 3338
> 3339 ret = -EINVAL;
> 3340 /* Is this a process file descriptor? */
> > 3341 if (!proc_is_procfd(f.file))
> 3342 goto err;
> 3343
> 3344 pid = f.file->private_data;
> 3345 if (!pid)
> 3346 goto err;
> 3347
> 3348 if (info) {
> 3349 ret = __copy_siginfo_from_user(sig, &kinfo, info);
> 3350 if (unlikely(ret))
> 3351 goto err;
> 3352 /*
> 3353 * Not even root can pretend to send signals from the kernel.
> 3354 * Nor can they impersonate a kill()/tgkill(), which adds
> 3355 * source info.
> 3356 */
> 3357 ret = -EPERM;
> 3358 if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> 3359 (task_pid(current) != pid))
> 3360 goto err;
> 3361 } else {
> 3362 prepare_kill_siginfo(sig, &kinfo);
> 3363 }
> 3364
> 3365 ret = kill_pid_info(sig, &kinfo, pid);
> 3366
> 3367 err:
> 3368 fdput(f);
> 3369 return ret;
> 3370 }
> 3371
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-11-20 00:29:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen <[email protected]> wrote:
> > These tools also care about ioctls. Adding a system call is a pain,
> > but the solution is to make adding system calls less of a pain, not to
> > permanently make the Linux ABI worse.
>
> For user-defined values of "worse" :)
>

I tend to agree with Tycho here. But I'm wondering if it might be
worth considering a better ioctl.

/me dons flame-proof hat

We could do:

long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
const void *outbuf, size_t outlen);

and have a central table in the kernel listing all possible nr values
along with which driver they belong to. We could have a sane
signature and get rid of the nr collision problem.

The major problem I see is that u32 isn't really enough to have a sane
way to allow out-of-tree drivers to use this, and that we can't
readily use anything bigger than u32 without indirection because we're
out of syscall argument space.

2018-11-20 00:34:03

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 04:27:49PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen <[email protected]> wrote:
> > > These tools also care about ioctls. Adding a system call is a pain,
> > > but the solution is to make adding system calls less of a pain, not to
> > > permanently make the Linux ABI worse.
> >
> > For user-defined values of "worse" :)
> >
>
> I tend to agree with Tycho here. But I'm wondering if it might be
> worth considering a better ioctl.
>
> /me dons flame-proof hat
>
> We could do:
>
> long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> const void *outbuf, size_t outlen);

I'm the writer of this patch so take this with a grain of salt.
I think it is a bad idea to stop this patch and force us to introduce a
new type of ioctl().
An ioctl() is also not easy to use for this task because we want to add
a siginfo_t argument which I actually think provides value and makes
this interface more useful.

>
> and have a central table in the kernel listing all possible nr values
> along with which driver they belong to. We could have a sane
> signature and get rid of the nr collision problem.
>
> The major problem I see is that u32 isn't really enough to have a sane
> way to allow out-of-tree drivers to use this, and that we can't
> readily use anything bigger than u32 without indirection because we're
> out of syscall argument space.

2018-11-20 00:45:53

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

Hi Christian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc3]
[cannot apply to next-20181119]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-signaling-processes-via-file-descriptors/20181120-063836
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv

All errors (new ones prefixed by >>):

kernel/signal.c: In function '__do_sys_procfd_signal':
>> kernel/signal.c:3341:7: error: implicit declaration of function 'proc_is_procfd'; did you mean 'clockid_to_fd'? [-Werror=implicit-function-declaration]
if (!proc_is_procfd(f.file))
^~~~~~~~~~~~~~
clockid_to_fd
cc1: some warnings being treated as errors

vim +3341 kernel/signal.c

3314
3315 /**
3316 * sys_procfd_signal - send a signal to a process through a process file
3317 * descriptor
3318 * @fd: the file descriptor of the process
3319 * @sig: signal to be sent
3320 * @info: the signal info
3321 * @flags: future flags to be passed
3322 */
3323 SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
3324 int, flags)
3325 {
3326 int ret;
3327 struct pid *pid;
3328 kernel_siginfo_t kinfo;
3329 struct fd f;
3330
3331 /* Enforce flags be set to 0 until we add an extension. */
3332 if (flags)
3333 return -EINVAL;
3334
3335 f = fdget_raw(fd);
3336 if (!f.file)
3337 return -EBADF;
3338
3339 ret = -EINVAL;
3340 /* Is this a process file descriptor? */
> 3341 if (!proc_is_procfd(f.file))
3342 goto err;
3343
3344 pid = f.file->private_data;
3345 if (!pid)
3346 goto err;
3347
3348 if (info) {
3349 ret = __copy_siginfo_from_user(sig, &kinfo, info);
3350 if (unlikely(ret))
3351 goto err;
3352 /*
3353 * Not even root can pretend to send signals from the kernel.
3354 * Nor can they impersonate a kill()/tgkill(), which adds
3355 * source info.
3356 */
3357 ret = -EPERM;
3358 if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
3359 (task_pid(current) != pid))
3360 goto err;
3361 } else {
3362 prepare_kill_siginfo(sig, &kinfo);
3363 }
3364
3365 ret = kill_pid_info(sig, &kinfo, pid);
3366
3367 err:
3368 fdput(f);
3369 return ret;
3370 }
3371

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (2.96 kB)
.config.gz (4.31 kB)
Download all attachments

2018-11-20 00:52:25

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 4:28 PM Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen <[email protected]> wrote:
> > > These tools also care about ioctls. Adding a system call is a pain,
> > > but the solution is to make adding system calls less of a pain, not to
> > > permanently make the Linux ABI worse.
> >
> > For user-defined values of "worse" :)
> >
>
> I tend to agree with Tycho here. But I'm wondering if it might be
> worth considering a better ioctl.
>
> /me dons flame-proof hat
>
> We could do:
>
> long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> const void *outbuf, size_t outlen);
>
> and have a central table in the kernel listing all possible nr values
> along with which driver they belong to. We could have a sane
> signature and get rid of the nr collision problem.

The essential difference between a regular system call and an ioctl is
that in the former, the invoked kernel-side code depends on the
operation number, and in the latter, the invoked kernel-side code
depends on the operation number and file descriptor type. By creating
a new kind of collision-free ioctl, all you've done is re-invent the
system call, but with a funky calling convention and less operand
space. It makes no sense. Previous system call multiplexers --- e.g.,
socketcall --- are widely regarded as mistakes, and there's no reason
to repeat these mistakes.

System call numbers are not scarce, and your other proposal to clean
up the x86 numbering will make wiring up a new system call less
annoying. The *only* purpose of an ioctl is to solve the system call
numbering coordination problem --- if the invoked kernel-side code
depends on (DRIVER, OPERATION_NUMBER), and DRIVER can vary out-of-tree
with ioctl, ioctl lets out-of-tree code expose interfaces. For in-tree
code, this problem doesn't exist, so there's no reason to use the
awkward ioctl workaround!

2018-11-20 02:06:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 4:33 PM Christian Brauner <[email protected]> wrote:
>
> On Mon, Nov 19, 2018 at 04:27:49PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 19, 2018 at 3:07 PM Tycho Andersen <[email protected]> wrote:
> > > > These tools also care about ioctls. Adding a system call is a pain,
> > > > but the solution is to make adding system calls less of a pain, not to
> > > > permanently make the Linux ABI worse.
> > >
> > > For user-defined values of "worse" :)
> > >
> >
> > I tend to agree with Tycho here. But I'm wondering if it might be
> > worth considering a better ioctl.
> >
> > /me dons flame-proof hat
> >
> > We could do:
> >
> > long better_ioctl(int fd, u32 nr, const void *inbuf, size_t inlen,
> > const void *outbuf, size_t outlen);
>
> I'm the writer of this patch so take this with a grain of salt.
> I think it is a bad idea to stop this patch and force us to introduce a
> new type of ioctl().

I agree completely.

> An ioctl() is also not easy to use for this task because we want to add
> a siginfo_t argument which I actually think provides value and makes
> this interface more useful.
>

You could always have a struct procfd_kill and pass a pointer to
*that*. But sure, it's ugly.

2018-11-20 05:00:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

Daniel Colascione <[email protected]> writes:

> On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner <[email protected]> wrote:
>>
>> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
>> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
>> > > That can be done without a loop by comparing the level counter for the
>> > > two pid namespaces.
>> > >
>> > >>
>> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
>> > >> doing:
>> > >>
>> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
>> > >> return -EPERM;
>> > >>
>> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
>> > >> imagine we'll need this for all of the procfd_* APIs.)
>> >
>> > Why is any of this even necessary? Why does the child namespace we're
>> > considering even have a file descriptor to its ancestor's procfs? If
>>
>> Because you can send file descriptors between processes and container
>> runtimes tend to do that.
>
> Right. But why *would* a container runtime send one of these procfs
> FDs to a container?
>
>> > it has one of these FDs, it can already *read* all sorts of
>> > information it really shouldn't be able to acquire, so the additional
>> > ability to send a signal (subject to the usual permission checks)
>> > feels like sticking a finger in a dike that's already well-perforated.
>> > IMHO, we shouldn't bother with this check. The patch would be simpler
>> > without it.
>>
>> We will definitely not allow signaling processes in an ancestor pid
>> namespace! That is a security issue! I can imagine container runtimes
>> killing their monitoring process etc. pp. Not happening, unless someone
>> with deep expertise in signals can convince me otherwise.
>
> If parent namespace procfs FDs or mounts really can leak into child
> namespaces as easily as Aleksa says, then I don't mind adding the
> check. I was under the impression that if you find yourself in this
> situation, you already have a big problem.

There is one big reason to have the check, and I have not seen it
mentioned yet in this thread.

When SI_USER is set we report the pid of the sender of the signal in
si_pid. When the signal comes from the kernel si_pid == 0. When signal
is sent from an ancestor pid namespace si_pid also equals 0 (which is
reasonable).

A signal out to a process in a parent pid namespace such as SIGCHLD is
reasonable as we can map the pid. I really don't see the point of
forbidding that. From the perspective of the process in the parent pid
namespace it is just another process in it's pid namespace. So it
should pose no problem from the perspective of the receiving process.

A signal to a process in a pid namespace that is neither a parent nor a
descendent pid namespace would be a problem, as there is no well defined
notion of what si_pid should be set to. So for that case perhaps we
should have something like a noprocess pid that we can set. Perhaps we
could set si_pid to 0xffffffff. That would take a small extension to
pid_nr_ns.

File descriptors are not namespaced. It is completely legitimate to use
file descriptors to get around limitations of namespaces.

Adding limitations to a file descriptor based api because someone else
can't set up their processes in such a way as to get the restrictions
they are looking for seems very sad.

Frankly I think it is one of the better features of namespaces that we
have to carefully handle and define these cases so that when the
inevitable leaks happen you are not immediately in a world of hurt. All
of the other permission checks etc continue to do their job. Plus you
are prepared for the case when someone wants their containers to have an
interesting communication primitive.

Eric





2018-11-20 10:32:25

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 10:59:12PM -0600, Eric W. Biederman wrote:
> Daniel Colascione <[email protected]> writes:
>
> > On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner <[email protected]> wrote:
> >>
> >> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> >> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> >> > > That can be done without a loop by comparing the level counter for the
> >> > > two pid namespaces.
> >> > >
> >> > >>
> >> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
> >> > >> doing:
> >> > >>
> >> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> >> > >> return -EPERM;
> >> > >>
> >> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> >> > >> imagine we'll need this for all of the procfd_* APIs.)
> >> >
> >> > Why is any of this even necessary? Why does the child namespace we're
> >> > considering even have a file descriptor to its ancestor's procfs? If
> >>
> >> Because you can send file descriptors between processes and container
> >> runtimes tend to do that.
> >
> > Right. But why *would* a container runtime send one of these procfs
> > FDs to a container?
> >
> >> > it has one of these FDs, it can already *read* all sorts of
> >> > information it really shouldn't be able to acquire, so the additional
> >> > ability to send a signal (subject to the usual permission checks)
> >> > feels like sticking a finger in a dike that's already well-perforated.
> >> > IMHO, we shouldn't bother with this check. The patch would be simpler
> >> > without it.
> >>
> >> We will definitely not allow signaling processes in an ancestor pid
> >> namespace! That is a security issue! I can imagine container runtimes
> >> killing their monitoring process etc. pp. Not happening, unless someone
> >> with deep expertise in signals can convince me otherwise.
> >
> > If parent namespace procfs FDs or mounts really can leak into child
> > namespaces as easily as Aleksa says, then I don't mind adding the
> > check. I was under the impression that if you find yourself in this
> > situation, you already have a big problem.
>
> There is one big reason to have the check, and I have not seen it
> mentioned yet in this thread.
>
> When SI_USER is set we report the pid of the sender of the signal in
> si_pid. When the signal comes from the kernel si_pid == 0. When signal
> is sent from an ancestor pid namespace si_pid also equals 0 (which is
> reasonable).
>
> A signal out to a process in a parent pid namespace such as SIGCHLD is
> reasonable as we can map the pid. I really don't see the point of
> forbidding that. From the perspective of the process in the parent pid
> namespace it is just another process in it's pid namespace. So it
> should pose no problem from the perspective of the receiving process.
>
> A signal to a process in a pid namespace that is neither a parent nor a
> descendent pid namespace would be a problem, as there is no well defined
> notion of what si_pid should be set to. So for that case perhaps we
> should have something like a noprocess pid that we can set. Perhaps we
> could set si_pid to 0xffffffff. That would take a small extension to
> pid_nr_ns.
>
> File descriptors are not namespaced. It is completely legitimate to use
> file descriptors to get around limitations of namespaces.

Frankly, I don't see a good argument for why we would allow that even if
safe. I have not heard a legitimate use-case or need for this.
At this point I care about very simple semantics. Being able to signal
into ancestor pid namespaces and cousin namespaces is interesting but
makes the syscall more brittle and harder to understand.
Changing pid_nr_ns() might be the solution but this function is called
all over the place in the kernel and I'm not going to risk breaking
something by changing it for a feature that no one so far has ever
asked for.
If you are ok with this then we should hold off on this. We can always
add this feature later by removing the check when someone has a use-case
for it.
I'll send a v2 of the patch that keeps the restriction for now. If you
insist on it being removed we can make the change in a follow-up
iteration.

Christian

>
> Adding limitations to a file descriptor based api because someone else
> can't set up their processes in such a way as to get the restrictions
> they are looking for seems very sad.
>
> Frankly I think it is one of the better features of namespaces that we
> have to carefully handle and define these cases so that when the
> inevitable leaks happen you are not immediately in a world of hurt. All
> of the other permission checks etc continue to do their job. Plus you
> are prepared for the case when someone wants their containers to have an
> interesting communication primitive.
>
> Eric
>
>
>
>

Subject: Re: [PATCH] procfd_signal.2: document procfd_signal syscall

Hello Christian,

On 11/19/18 11:32 AM, Christian Brauner wrote:
> Signed-off-by: Christian Brauner <[email protected]>
> ---
> Changelog:
> v1:
> - patch introduced
> ---
> man2/procfd_signal.2 | 147 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 147 insertions(+)
> create mode 100644 man2/procfd_signal.2
>
> diff --git a/man2/procfd_signal.2 b/man2/procfd_signal.2
> new file mode 100644
> index 000000000..6af0b74f4
> --- /dev/null
> +++ b/man2/procfd_signal.2
> @@ -0,0 +1,147 @@
> +.\" Copyright (C) 2018 Christian Brauner <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH PROCFD_SIGNAL 2 2017-09-15 "Linux" "Linux Programmer's Manual"

Bad timestamp :-)

> +.SH NAME
> +procfd_signal \- send signal to a process through a process descriptor

s/through/via/

> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <signal.h>
> +.PP
> +.BI "int procfd_signal(int " fd ", int " sig ", siginfo_t *" info ", int " flags );
> +.fi
> +.SH DESCRIPTION
> +.BR procfd_signal ()
> +sends the signal specified in
> +.I sig
> +to the process identified by the file descriptor
> +.IR fd .

Here, I think we need some words about how one obtains that
file descriptor.

> +The permissions required to send a signal are the same as for
> +.BR kill (2).
> +As with
> +.BR kill (2),
> +the null signal (0) can be used to check if a process with a given
> +PID exists.

But there is no PID mentioned on this page?

I suppose:

As with
.BR kill (2),
the null signal (0) can be used to check if the process referred to by
.I fd
exists.

?

> +.PP
> +The optional
> +.I info
> +argument specifies the data to accompany the signal.
> +This argument is a pointer to a structure of type
> +.IR siginfo_t ,
> +described in
> +.BR sigaction (2)
> +(and defined by including
> +.IR <sigaction.h> ).
> +The caller should set the following fields in this structure:
> +.TP
> +.I si_code
> +This must be one of the
> +.B SI_*
> +codes in the Linux kernel source file
> +.IR include/asm-generic/siginfo.h ,
> +with the restriction that the code must be negative
> +(i.e., cannot be
> +.BR SI_USER ,
> +which is used by the kernel to indicate a signal sent by
> +.BR kill (2))
> +and cannot (since Linux 2.6.39) be
> +.BR SI_TKILL
> +(which is used by the kernel to indicate a signal sent using
> +.\" tkill(2) or
> +.BR tgkill (2)).
> +.TP
> +.I si_pid
> +This should be set to a process ID,
> +typically the process ID of the sender.
> +.TP
> +.I si_uid
> +This should be set to a user ID,
> +typically the real user ID of the sender.
> +.TP
> +.I si_value
> +This field contains the user data to accompany the signal.
> +For more information, see the description of the last
> +.RI ( "union sigval" )
> +argument of
> +.BR sigqueue (3).

With sigqueue(3), one sends only a signal plus accompanying
data. It is of course based on the lower level rt_sigqueueinfo(2).
The man page for that system call says:

These system calls are not intended for direct application use;
they are provided to allow the implementation of sigqueue(3) and
pthread_sigqueue(3).

Is procfd_signal() intended to be the API directly used from
user space? If it is, then I think there should be some
explanation of why there is a 'siginfo_t' argument (vis-à-vis
sigqueue(3), which makes do with just union sigval).
If procfd_signal() is not intended to be the API used directly
from user space, then I think there needs to be a paragraph
similar to the one in the rt_sigqueueinfo(2) page queoted above.

> +.PP
> +Internally, the kernel sets the
> +.I si_signo
> +field to the value specified in
> +.IR sig ,
> +so that the receiver of the signal can also obtain
> +the signal number via that field.
> +.PP
> +The
> +.I flags
> +argument is reserved for future extension and must be set to 0.
> +.PP
> +.SH RETURN VALUE
> +On success, this system call returns 0.
> +On error, it returns \-1 and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EBADF
> +.I fd
> +is not a valid file descriptor.
> +.TP
> +.B EINVAL
> +An invalid signal was specified.
> +.TP
> +.B EINVAL
> +.I fd
> +does not refer to a process.
> +.TP
> +.B EINVAL
> +The flags argument was not 0.
> +.TP
> +.B EPERM
> +The caller does not have permission to send the signal to the target.
> +For the required permissions, see
> +.BR kill (2).
> +Or:
> +.I uinfo->si_code
> +is invalid.
> +.TP
> +.B ESRCH
> +The process or process group does not exist.

"or process group"? I suspect a cut and paste error here :-)

The connection between the preceding sentence and the one
that follows it is not quite clear:

> +Note that an existing process might be a zombie,
> +a process that has terminated execution, but
> +has not yet been
> +.BR wait (2)ed
> +for.

Is it the case that using procfd_signal() with a file
descriptor that refers to a zombie will yield EINVAL?
If yes, this could be made clearer with the following:

.B ESRCH
The specified process no longer exists or is a process in the
zombie state (a process that has terminated execution, but
has not yet been
BR wait (2)ed
for).

> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +.SH SEE ALSO
> +.BR kill (2),
> +.BR sigaction (2),
> +.BR sigprocmask (2),
> +.BR tgkill (2),
> +.BR pthread_sigqueue (3),
> +.BR rt_sigqueueinfo (2),
> +.BR sigqueue (3),
> +.BR signal (7)

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2018-11-22 08:37:20

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 11:31:13AM +0100, Christian Brauner wrote:
> On Mon, Nov 19, 2018 at 10:59:12PM -0600, Eric W. Biederman wrote:
> > Daniel Colascione <[email protected]> writes:
> >
> > > On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner <[email protected]> wrote:
> > >>
> > >> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> > >> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <[email protected]> wrote:
> > >> > > That can be done without a loop by comparing the level counter for the
> > >> > > two pid namespaces.
> > >> > >
> > >> > >>
> > >> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
> > >> > >> doing:
> > >> > >>
> > >> > >> if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > >> > >> return -EPERM;
> > >> > >>
> > >> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> > >> > >> imagine we'll need this for all of the procfd_* APIs.)
> > >> >
> > >> > Why is any of this even necessary? Why does the child namespace we're
> > >> > considering even have a file descriptor to its ancestor's procfs? If
> > >>
> > >> Because you can send file descriptors between processes and container
> > >> runtimes tend to do that.
> > >
> > > Right. But why *would* a container runtime send one of these procfs
> > > FDs to a container?
> > >
> > >> > it has one of these FDs, it can already *read* all sorts of
> > >> > information it really shouldn't be able to acquire, so the additional
> > >> > ability to send a signal (subject to the usual permission checks)
> > >> > feels like sticking a finger in a dike that's already well-perforated.
> > >> > IMHO, we shouldn't bother with this check. The patch would be simpler
> > >> > without it.
> > >>
> > >> We will definitely not allow signaling processes in an ancestor pid
> > >> namespace! That is a security issue! I can imagine container runtimes
> > >> killing their monitoring process etc. pp. Not happening, unless someone
> > >> with deep expertise in signals can convince me otherwise.
> > >
> > > If parent namespace procfs FDs or mounts really can leak into child
> > > namespaces as easily as Aleksa says, then I don't mind adding the
> > > check. I was under the impression that if you find yourself in this
> > > situation, you already have a big problem.
> >
> > There is one big reason to have the check, and I have not seen it
> > mentioned yet in this thread.
> >
> > When SI_USER is set we report the pid of the sender of the signal in
> > si_pid. When the signal comes from the kernel si_pid == 0. When signal
> > is sent from an ancestor pid namespace si_pid also equals 0 (which is
> > reasonable).
> >
> > A signal out to a process in a parent pid namespace such as SIGCHLD is
> > reasonable as we can map the pid. I really don't see the point of
> > forbidding that. From the perspective of the process in the parent pid
> > namespace it is just another process in it's pid namespace. So it
> > should pose no problem from the perspective of the receiving process.
> >
> > A signal to a process in a pid namespace that is neither a parent nor a
> > descendent pid namespace would be a problem, as there is no well defined
> > notion of what si_pid should be set to. So for that case perhaps we
> > should have something like a noprocess pid that we can set. Perhaps we
> > could set si_pid to 0xffffffff. That would take a small extension to
> > pid_nr_ns.
> >
> > File descriptors are not namespaced. It is completely legitimate to use
> > file descriptors to get around limitations of namespaces.
>
> Frankly, I don't see a good argument for why we would allow that even if
> safe. I have not heard a legitimate use-case or need for this.
> At this point I care about very simple semantics. Being able to signal
> into ancestor pid namespaces and cousin namespaces is interesting but
> makes the syscall more brittle and harder to understand.

Yeah, I'm with you on that. We can always open that door later if a good
use case comes up, but I prefer simple at first.

> Changing pid_nr_ns() might be the solution but this function is called
> all over the place in the kernel and I'm not going to risk breaking
> something by changing it for a feature that no one so far has ever
> asked for.
> If you are ok with this then we should hold off on this. We can always
> add this feature later by removing the check when someone has a use-case
> for it.
> I'll send a v2 of the patch that keeps the restriction for now. If you
> insist on it being removed we can make the change in a follow-up
> iteration.
>
> Christian
>
> >
> > Adding limitations to a file descriptor based api because someone else
> > can't set up their processes in such a way as to get the restrictions
> > they are looking for seems very sad.
> >
> > Frankly I think it is one of the better features of namespaces that we
> > have to carefully handle and define these cases so that when the
> > inevitable leaks happen you are not immediately in a world of hurt. All
> > of the other permission checks etc continue to do their job. Plus you
> > are prepared for the case when someone wants their containers to have an
> > interesting communication primitive.
> >
> > Eric
> >
> >
> >
> >

2018-11-22 23:07:46

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Tue, Nov 20, 2018 at 08:23:43AM +1100, Aleksa Sarai wrote:
> On 2018-11-20, Aleksa Sarai <[email protected]> wrote:
> > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > On Tue, Nov 20, 2018 at 07:28:57AM +1100, Aleksa Sarai wrote:
> > > > On 2018-11-19, Christian Brauner <[email protected]> wrote:
> > > > > + if (info) {
> > > > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > > > + if (unlikely(ret))
> > > > > + goto err;
> > > > > + /*
> > > > > + * Not even root can pretend to send signals from the kernel.
> > > > > + * Nor can they impersonate a kill()/tgkill(), which adds
> > > > > + * source info.
> > > > > + */
> > > > > + ret = -EPERM;
> > > > > + if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
> > > > > + (task_pid(current) != pid))
> > > > > + goto err;
> > > > > + } else {
> > > > > + prepare_kill_siginfo(sig, &kinfo);
> > > > > + }
> > > >
> > > > I wonder whether we should also have a pidns restriction here, since
> > > > currently it isn't possible for a container process using a pidns to
> > > > signal processes outside its pidns. AFAICS, this isn't done through an
> > > > explicit check -- it's a side-effect of processes in a pidns not being
> > > > able to address non-descendant-pidns processes.
> > > >
> > > > But maybe it's reasonable to allow sending a procfd to a different pidns
> > > > and the same operations working on it? If we extend the procfd API to
> > >
> > > No, I don't think so. I really don't want any fancy semantics in here.
> > > Fancy doesn't get merged and fancy is hard to maintain. So we should do
> > > something like:
> > >
> > > if (proc_pid_ns() != current_pid_ns)
> > > return EINVAL
> >
> > This isn't quite sufficient. The key thing is that you have to be in an
> > *ancestor* (or same) pidns, not the *same* pidns. Ideally you can re-use
> > the check already in pidns_get_parent, and expose it. It would be
> > something as trivial as:
> >
> > bool pidns_is_descendant(struct pid_namespace *ns,
> > struct pid_namespace *ancestor)
> > {
> > for (;;) {
> > if (!ns)
> > return false;
> > if (ns == ancestor)
> > break;
> > ns = ns->parent;
> > }
> > return true;
> > }
> >
> > And you can rewrite pidns_get_parent to use it. So you would instead be
> > doing:
> >
> > if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > return -EPERM;
>
> Scratch the last bit, -EPERM is wrong here. I would argue that -EINVAL
> is *somewhat* wrong because arguable the more semantically consistent
> error (with kill(2)) would be -ESRCH -- but then you're mixing the "pid
> is dead" and "pid is not visible to you" cases. I'm not sure what the
> right errno would be here (I'm sure some of the LKML greybeards will
> have a better clue.) :P

Actually I like EXDEV for this. ERMOTE also works.

2018-11-22 23:46:40

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 03:39:54PM -0700, Tycho Andersen wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> >
> > +/**
> > + * sys_procfd_signal - send a signal to a process through a process file
> > + * descriptor
> > + * @fd: the file descriptor of the process
> > + * @sig: signal to be sent
> > + * @info: the signal info
> > + * @flags: future flags to be passed
> > + */
> > +SYSCALL_DEFINE4(procfd_signal, int, fd, int, sig, siginfo_t __user *, info,
> > + int, flags)
> > +{
>
> Can I just register an objection here that I think using a syscall
> just for this is silly?
>
> My understanding is that the concern is that some code might do:
>
> unknown_fd = recv_fd();
> ioctl(unknown_fd, SOME_IOCTL, NULL); // where SOME_IOCTL == PROC_FD_KILL
> // whoops, unknown_fd was a procfd and we killed a task!

This could just be my own mental model, but for something like "kill a
task", an ioctl just seems wrong. Syscall seems more natural.

I'd ack either method.

-serge

2018-11-28 21:01:06

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH] procfd_signal.2: document procfd_signal syscall

* Christian Brauner:

> +.\" Copyright (C) 2018 Christian Brauner <[email protected]>

The text seems to be largely derived from rt_sigqueueinfo, so I'm not
sure if this appropriate here.

> +the null signal (0) can be used to check if a process with a given
> +PID exists.

What does this mean if hte process is identified by file descriptor?

> +.PP
> +The optional
> +.I info
> +argument specifies the data to accompany the signal.
> +This argument is a pointer to a structure of type
> +.IR siginfo_t ,
> +described in
> +.BR sigaction (2)
> +(and defined by including
> +.IR <sigaction.h> ).
> +The caller should set the following fields in this structure:
> +.TP
> +.I si_code
> +This must be one of the
> +.B SI_*
> +codes in the Linux kernel source file
> +.IR include/asm-generic/siginfo.h ,
> +with the restriction that the code must be negative
> +(i.e., cannot be
> +.BR SI_USER ,
> +which is used by the kernel to indicate a signal sent by
> +.BR kill (2))
> +and cannot (since Linux 2.6.39) be

Obsolete reference in this context.

> +.TP
> +.B ESRCH
> +The process or process group does not exist.
> +Note that an existing process might be a zombie,
> +a process that has terminated execution, but
> +has not yet been
> +.BR wait (2)ed
> +for.

Again: What does this mean if the process identified by a descriptor?
Does a process in zombie state exist in this sense or not?

Thanks,
Florian

2018-11-28 21:13:19

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] procfd_signal.2: document procfd_signal syscall

On November 29, 2018 9:59:52 AM GMT+13:00, Florian Weimer <[email protected]> wrote:
>* Christian Brauner:
>
>> +.\" Copyright (C) 2018 Christian Brauner <[email protected]>
>
>The text seems to be largely derived from rt_sigqueueinfo, so I'm not
>sure if this appropriate here.
>
>> +the null signal (0) can be used to check if a process with a given
>> +PID exists.
>
>What does this mean if hte process is identified by file descriptor?
>
>> +.PP
>> +The optional
>> +.I info
>> +argument specifies the data to accompany the signal.
>> +This argument is a pointer to a structure of type
>> +.IR siginfo_t ,
>> +described in
>> +.BR sigaction (2)
>> +(and defined by including
>> +.IR <sigaction.h> ).
>> +The caller should set the following fields in this structure:
>> +.TP
>> +.I si_code
>> +This must be one of the
>> +.B SI_*
>> +codes in the Linux kernel source file
>> +.IR include/asm-generic/siginfo.h ,
>> +with the restriction that the code must be negative
>> +(i.e., cannot be
>> +.BR SI_USER ,
>> +which is used by the kernel to indicate a signal sent by
>> +.BR kill (2))
>> +and cannot (since Linux 2.6.39) be
>
>Obsolete reference in this context.
>
>> +.TP
>> +.B ESRCH
>> +The process or process group does not exist.
>> +Note that an existing process might be a zombie,
>> +a process that has terminated execution, but
>> +has not yet been
>> +.BR wait (2)ed
>> +for.
>
>Again: What does this mean if the process identified by a descriptor?
>Does a process in zombie state exist in this sense or not?
>
>Thanks,
>Florian

Updating the document is on my Todo.
Florian, can you take a look at the actual patch too, please?

2018-11-28 21:46:29

by Joey Pabalinas

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> + if (info) {
> + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> + if (unlikely(ret))
> + goto err;

What's the reason you don't propagate up the errors from __copy_siginfo_from_user()?
Granted, I admit that -E2BIG is kind of weird to return, but -EFAULT seems like a
fairly sane error.

Or is there some reason it's more useful to just return -EINVAL for all of the
failure cases here?

--
Cheers,
Joey Pabalinas


Attachments:
(No filename) (510.00 B)
signature.asc (849.00 B)
Download all attachments

2018-11-28 22:07:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Wed, Nov 28, 2018 at 11:45:34AM -1000, Joey Pabalinas wrote:
> On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > + if (info) {
> > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > + if (unlikely(ret))
> > + goto err;
>

I think you're misreading here. It jumps to:

err:
fdput(f);
return ret;

and does propagate the error. This is also an old iteration of the patch.

Christian

> What's the reason you don't propagate up the errors from __copy_siginfo_from_user()?
> Granted, I admit that -E2BIG is kind of weird to return, but -EFAULT seems like a
> fairly sane error.
>
> Or is there some reason it's more useful to just return -EINVAL for all of the
> failure cases here?
>
> --
> Cheers,
> Joey Pabalinas



2018-11-28 23:08:58

by Joey Pabalinas

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

On Wed, Nov 28, 2018 at 11:05:49PM +0100, Christian Brauner wrote:
> On Wed, Nov 28, 2018 at 11:45:34AM -1000, Joey Pabalinas wrote:
> > On Mon, Nov 19, 2018 at 11:32:39AM +0100, Christian Brauner wrote:
> > > + if (info) {
> > > + ret = __copy_siginfo_from_user(sig, &kinfo, info);
> > > + if (unlikely(ret))
> > > + goto err;
> >
>
> I think you're misreading here. It jumps to:
>
> err:
> fdput(f);
> return ret;
>
> and does propagate the error. This is also an old iteration of the patch.

Whoops, that I am. Sorry about that.

--
Cheers,
Joey Pabalinas


Attachments:
(No filename) (599.00 B)
signature.asc (849.00 B)
Download all attachments