2018-11-18 11:19:54

by Christian Brauner

[permalink] [raw]
Subject: [PATCH] proc: allow killing processes via file descriptors

With this patch an open() call on /proc/<pid> will give userspace a handle
to struct pid of the process associated with /proc/<pid>. This allows to
maintain a stable handle on a process.
I have been discussing various approaches extensively during technical
conferences this year culminating in a long argument with Eric at Linux
Plumbers. The general consensus was that having a handle on a process
will be something that is very simple and easy to maintain with the
option of being extensible via a more advanced api if the need arises. I
believe that this patch is the most simple, dumb, and therefore
maintainable solution.

The need for this has arisen in order to reliably kill a process without
running into issues of the pid being recycled as has been described in the
rejected patch [1]. To fulfill the need described in that patchset a new
ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
process via a file descriptor:

int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
close(fd);

Note, the stable handle will allow us to carefully extend this feature in
the future.

[1]: https://lkml.org/lkml/2018/10/30/118

Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
fs/proc/base.c | 33 +++++++++++++++++++++++++++++++++
include/uapi/linux/procfd.h | 11 +++++++++++
2 files changed, 44 insertions(+)
create mode 100644 include/uapi/linux/procfd.h

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ce3465479447..dfde564a21eb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -88,6 +88,7 @@
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
+#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/sched/coredump.h>
#include <linux/sched/debug.h>
@@ -95,6 +96,7 @@
#include <linux/flex_array.h>
#include <linux/posix-timers.h>
#include <trace/events/oom.h>
+#include <uapi/linux/procfd.h>
#include "internal.h"
#include "fd.h"

@@ -3032,10 +3034,41 @@ static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
}

+static int proc_tgid_open(struct inode *inode, struct file *file)
+{
+ /* grab reference to struct pid */
+ file->private_data = get_pid(proc_pid(inode));
+ return 0;
+}
+
+static int proc_tgid_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+ /* drop reference to struct pid */
+ put_pid(pid);
+ return 0;
+}
+
+static long proc_tgid_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct pid *pid = file->private_data;
+
+ switch (cmd) {
+ case PROC_FD_SIGNAL:
+ return kill_pid(pid, arg, 1);
+ default:
+ return -EINVAL;
+ }
+}
+
static const struct file_operations proc_tgid_base_operations = {
+ .open = proc_tgid_open,
.read = generic_read_dir,
.iterate_shared = proc_tgid_base_readdir,
.llseek = generic_file_llseek,
+ .release = proc_tgid_release,
+ .unlocked_ioctl = proc_tgid_ioctl,
};

static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
diff --git a/include/uapi/linux/procfd.h b/include/uapi/linux/procfd.h
new file mode 100644
index 000000000000..8e4c07a9f3a3
--- /dev/null
+++ b/include/uapi/linux/procfd.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_PROCFD_H
+#define __LINUX_PROCFD_H
+
+#include <linux/ioctl.h>
+
+/* Returns a file descriptor that refers to a struct pid */
+#define PROC_FD_SIGNAL _IOW('p', 1, __s32)
+
+#endif /* __LINUX_PROCFD_H */
+
--
2.19.1



2018-11-18 14:00:19

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

I had been led to believe that the proposal would be a comprehensive
process API, not an ioctl basically equivalent to my previous patch.
If you had a more comprehensive proposal, please just share it on LKML
instead of limiting the discussion to those able to attend these
various conferences. If there's some determined opposition to a
general new process API, this opposition needs a fair and full airing,
as not everyone can attend these conferences.

On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> With this patch an open() call on /proc/<pid> will give userspace a handle
> to struct pid of the process associated with /proc/<pid>. This allows to
> maintain a stable handle on a process.
> I have been discussing various approaches extensively during technical
> conferences this year culminating in a long argument with Eric at Linux
> Plumbers. The general consensus was that having a handle on a process
> will be something that is very simple and easy to maintain

ioctls are the opposite of "easy to maintain". Their
file-descriptor-specific behavior makes it difficult to use the things
safely. If you want to take this approach, please make a new system
call. An ioctl is just a system call with a very strange spelling and
unfortunate collision semantics.

> with the
> option of being extensible via a more advanced api if the need arises.

The need *has* arisen; see my exithand patch.

> I
> believe that this patch is the most simple, dumb, and therefore
> maintainable solution.
>
> The need for this has arisen in order to reliably kill a process without
> running into issues of the pid being recycled as has been described in the
> rejected patch [1].

That patch was not "rejected". It was tabled pending the more
comprehensive process API proposal that was supposed to have emerged.
This patch is just another variant of the sort of approach we
discussed on that patch's thread here. As I mentioned on that thread,
the right approach option is a new system call, not an ioctl.

To fulfill the need described in that patchset a new
> ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> process via a file descriptor:
>
> int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> close(fd);
>
> Note, the stable handle will allow us to carefully extend this feature in
> the future.

We still need the ability to synchronously wait on a process's death,
as in my patch set. I will be refreshing that patch set.

2018-11-18 15:40:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <[email protected]> wrote:
>
> I had been led to believe that the proposal would be a comprehensive
> process API, not an ioctl basically equivalent to my previous patch.
> If you had a more comprehensive proposal, please just share it on LKML
> instead of limiting the discussion to those able to attend these
> various conferences. If there's some determined opposition to a
> general new process API, this opposition needs a fair and full airing,
> as not everyone can attend these conferences.
>
> On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> > With this patch an open() call on /proc/<pid> will give userspace a handle
> > to struct pid of the process associated with /proc/<pid>. This allows to
> > maintain a stable handle on a process.
> > I have been discussing various approaches extensively during technical
> > conferences this year culminating in a long argument with Eric at Linux
> > Plumbers. The general consensus was that having a handle on a process
> > will be something that is very simple and easy to maintain
>
> ioctls are the opposite of "easy to maintain". Their
> file-descriptor-specific behavior makes it difficult to use the things
> safely. If you want to take this approach, please make a new system
> call. An ioctl is just a system call with a very strange spelling and
> unfortunate collision semantics.
>
> > with the
> > option of being extensible via a more advanced api if the need arises.
>
> The need *has* arisen; see my exithand patch.
>
> > I
> > believe that this patch is the most simple, dumb, and therefore
> > maintainable solution.
> >
> > The need for this has arisen in order to reliably kill a process without
> > running into issues of the pid being recycled as has been described in the
> > rejected patch [1].
>
> That patch was not "rejected". It was tabled pending the more
> comprehensive process API proposal that was supposed to have emerged.
> This patch is just another variant of the sort of approach we
> discussed on that patch's thread here. As I mentioned on that thread,
> the right approach option is a new system call, not an ioctl.
>
> To fulfill the need described in that patchset a new
> > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> > process via a file descriptor:
> >
> > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> > ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> > close(fd);
> >
> > Note, the stable handle will allow us to carefully extend this feature in
> > the future.
>
> We still need the ability to synchronously wait on a process's death,
> as in my patch set. I will be refreshing that patch set.

I fully agree that a more comprehensive, less expensive API for
managing processes would be nice. But I also think that this patch
(using the directory fd and ioctl) is better from a security
perspective than using a new file in /proc.

I have an old patch to make proc directory fds pollable:

https://lore.kernel.org/patchwork/patch/345098/

That patch plus the one in this thread might make a nice addition to
the kernel even if we expect something much better to come along
later.

This patch that uses ioctl

2018-11-18 15:55:03

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
> I fully agree that a more comprehensive, less expensive API for
> managing processes would be nice. But I also think that this patch
> (using the directory fd and ioctl) is better from a security
> perspective than using a new file in /proc.

That's an assertion, not an argument. And I'm not opposed to an
operation on the directory FD, now that it's clear Linus has banned
"write(2)-as-a-command" APIs. I just insist that we implement the API
with a system call instead of a less-reliable ioctl due to the
inherent namespace collision issues in ioctl command names.

> I have an old patch to make proc directory fds pollable:
>
> https://lore.kernel.org/patchwork/patch/345098/
>
> That patch plus the one in this thread might make a nice addition to
> the kernel even if we expect something much better to come along
> later.

I've always commented on that patch. You never addressed my technical
objections. Why are you bringing up this patch again as if that
discussion had never happened? To review, that patch has various race
conditions, and even if it were technically correct, it'd be an abuse
of directory objects (in what other circumstance do we poll
directories?) and not logically generalizable to a model in which we
expose process exit status via the exit-monitoring API.

2018-11-18 16:04:09

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> +static int proc_tgid_open(struct inode *inode, struct file *file)
> +{
> + /* grab reference to struct pid */
> + file->private_data = get_pid(proc_pid(inode));

Why grab another reference to the struct pid when the inode already has one?

2018-11-18 16:19:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
> > I fully agree that a more comprehensive, less expensive API for
> > managing processes would be nice. But I also think that this patch
> > (using the directory fd and ioctl) is better from a security
> > perspective than using a new file in /proc.
>
> That's an assertion, not an argument. And I'm not opposed to an
> operation on the directory FD, now that it's clear Linus has banned
> "write(2)-as-a-command" APIs. I just insist that we implement the API
> with a system call instead of a less-reliable ioctl due to the
> inherent namespace collision issues in ioctl command names.

Linus banned it because of bugs iike the ones in the patch.

>
> > I have an old patch to make proc directory fds pollable:
> >
> > https://lore.kernel.org/patchwork/patch/345098/
> >
> > That patch plus the one in this thread might make a nice addition to
> > the kernel even if we expect something much better to come along
> > later.
>
> I've always commented on that patch. You never addressed my technical
> objections. Why are you bringing up this patch again as if that
> discussion had never happened? To review, that patch has various race
> conditions

I don't think I ever saw that review.

> and even if it were technically correct, it'd be an abuse
> of directory objects (in what other circumstance do we poll
> directories?) and not logically generalizable to a model in which we
> expose process exit status via the exit-monitoring API.

I agree it's weird. It might be better to have /proc/PID/exit_status
and make *that* pollable.

2018-11-18 16:31:36

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>>
>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>> > I fully agree that a more comprehensive, less expensive API for
>> > managing processes would be nice. But I also think that this patch
>> > (using the directory fd and ioctl) is better from a security
>> > perspective than using a new file in /proc.
>>
>> That's an assertion, not an argument. And I'm not opposed to an
>> operation on the directory FD, now that it's clear Linus has banned
>> "write(2)-as-a-command" APIs. I just insist that we implement the API
>> with a system call instead of a less-reliable ioctl due to the
>> inherent namespace collision issues in ioctl command names.
>
> Linus banned it because of bugs iike the ones in the patch.

Maybe: he didn't provide a reason. What's your point?

>> > I have an old patch to make proc directory fds pollable:
>> >
>> > https://lore.kernel.org/patchwork/patch/345098/
>> >
>> > That patch plus the one in this thread might make a nice addition to
>> > the kernel even if we expect something much better to come along
>> > later.
>>
>> I've always commented on that patch. You never addressed my technical
>> objections. Why are you bringing up this patch again as if that
>> discussion had never happened? To review, that patch has various race
>> conditions
>
> I don't think I ever saw that review.

https://lore.kernel.org/lkml/CAKOZuevXXqwJepmLNUtrU=f8jtdgdKAzUAnAA2+KVcWoMxMyFg@mail.gmail.com/

>> and even if it were technically correct, it'd be an abuse
>> of directory objects (in what other circumstance do we poll
>> directories?) and not logically generalizable to a model in which we
>> expose process exit status via the exit-monitoring API.
>
> I agree it's weird. It might be better to have /proc/PID/exit_status
> and make *that* pollable.

That's exactly what my patch provides. I feel like we're rehashing the
previous discussion.

2018-11-18 16:35:02

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On 11/18/18 8:17 AM, Andy Lutomirski wrote:
> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>>
>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>>> I fully agree that a more comprehensive, less expensive API for
>>> managing processes would be nice. But I also think that this patch
>>> (using the directory fd and ioctl) is better from a security
>>> perspective than using a new file in /proc.
>>
>> That's an assertion, not an argument. And I'm not opposed to an
>> operation on the directory FD, now that it's clear Linus has banned
>> "write(2)-as-a-command" APIs. I just insist that we implement the API
>> with a system call instead of a less-reliable ioctl due to the
>> inherent namespace collision issues in ioctl command names.
>
> Linus banned it because of bugs iike the ones in the patch.
>
>>
>>> I have an old patch to make proc directory fds pollable:
>>>
>>> https://lore.kernel.org/patchwork/patch/345098/
>>>
>>> That patch plus the one in this thread might make a nice addition to
>>> the kernel even if we expect something much better to come along
>>> later.
>>
>> I've always commented on that patch. You never addressed my technical
>> objections. Why are you bringing up this patch again as if that
>> discussion had never happened? To review, that patch has various race
>> conditions
>
> I don't think I ever saw that review.
>
>> and even if it were technically correct, it'd be an abuse
>> of directory objects (in what other circumstance do we poll
>> directories?) and not logically generalizable to a model in which we
>> expose process exit status via the exit-monitoring API.
>
> I agree it's weird. It might be better to have /proc/PID/exit_status
> and make *that* pollable.
>

If there is a new exit_status file, it could even be more than
8 bits of exit status:

See https://lore.kernel.org/lkml/[email protected]/T/#u
and http://austingroupbugs.net/view.php?id=594#c1317

--
~Randy

2018-11-18 16:51:16

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <[email protected]> wrote:
> On 11/18/18 8:17 AM, Andy Lutomirski wrote:
>> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>>>
>>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>>>> I fully agree that a more comprehensive, less expensive API for
>>>> managing processes would be nice. But I also think that this patch
>>>> (using the directory fd and ioctl) is better from a security
>>>> perspective than using a new file in /proc.
>>>
>>> That's an assertion, not an argument. And I'm not opposed to an
>>> operation on the directory FD, now that it's clear Linus has banned
>>> "write(2)-as-a-command" APIs. I just insist that we implement the API
>>> with a system call instead of a less-reliable ioctl due to the
>>> inherent namespace collision issues in ioctl command names.
>>
>> Linus banned it because of bugs iike the ones in the patch.
>>
>>>
>>>> I have an old patch to make proc directory fds pollable:
>>>>
>>>> https://lore.kernel.org/patchwork/patch/345098/
>>>>
>>>> That patch plus the one in this thread might make a nice addition to
>>>> the kernel even if we expect something much better to come along
>>>> later.
>>>
>>> I've always commented on that patch. You never addressed my technical
>>> objections. Why are you bringing up this patch again as if that
>>> discussion had never happened? To review, that patch has various race
>>> conditions
>>
>> I don't think I ever saw that review.
>>
>>> and even if it were technically correct, it'd be an abuse
>>> of directory objects (in what other circumstance do we poll
>>> directories?) and not logically generalizable to a model in which we
>>> expose process exit status via the exit-monitoring API.
>>
>> I agree it's weird. It might be better to have /proc/PID/exit_status
>> and make *that* pollable.
>>
>
> If there is a new exit_status file, it could even be more than
> 8 bits of exit status:
>
> See https://lore.kernel.org/lkml/[email protected]/T/#u
> and http://austingroupbugs.net/view.php?id=594#c1317

First of all, as I discussed in [1], we need to first figure out *who*
should have access to the process exit information. FreeBSD appears to
make it public without disaster, and if we make exit status public, a
lot of problems just disappear.

[1] https://lore.kernel.org/lkml/CAKOZueussVwABQaC+O9fW+MZayccvttKQZfWg0hh-cZ+1ykXig@mail.gmail.com/

Providing more than eight bits of status information is a separate
discussion. I don't think it's necessary, since it breaks assumptions
people make today without really adding much in the way of new
capabilities. If we want to let processes die while leaving behind
more information, let's let them leave behind an arbitrary-sized
string or something instead of repeating the mistake of an integral
exit status, but with more bits.

2018-11-18 17:10:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 8:49 AM Daniel Colascione <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <[email protected]> wrote:
> > On 11/18/18 8:17 AM, Andy Lutomirski wrote:
> >> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
> >>>
> >>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
> >>>> I fully agree that a more comprehensive, less expensive API for
> >>>> managing processes would be nice. But I also think that this patch
> >>>> (using the directory fd and ioctl) is better from a security
> >>>> perspective than using a new file in /proc.
> >>>
> >>> That's an assertion, not an argument. And I'm not opposed to an
> >>> operation on the directory FD, now that it's clear Linus has banned
> >>> "write(2)-as-a-command" APIs. I just insist that we implement the API
> >>> with a system call instead of a less-reliable ioctl due to the
> >>> inherent namespace collision issues in ioctl command names.
> >>
> >> Linus banned it because of bugs iike the ones in the patch.
> >>
> >>>
> >>>> I have an old patch to make proc directory fds pollable:
> >>>>
> >>>> https://lore.kernel.org/patchwork/patch/345098/
> >>>>
> >>>> That patch plus the one in this thread might make a nice addition to
> >>>> the kernel even if we expect something much better to come along
> >>>> later.
> >>>
> >>> I've always commented on that patch. You never addressed my technical
> >>> objections. Why are you bringing up this patch again as if that
> >>> discussion had never happened? To review, that patch has various race
> >>> conditions
> >>
> >> I don't think I ever saw that review.
> >>
> >>> and even if it were technically correct, it'd be an abuse
> >>> of directory objects (in what other circumstance do we poll
> >>> directories?) and not logically generalizable to a model in which we
> >>> expose process exit status via the exit-monitoring API.
> >>
> >> I agree it's weird. It might be better to have /proc/PID/exit_status
> >> and make *that* pollable.
> >>
> >
> > If there is a new exit_status file, it could even be more than
> > 8 bits of exit status:
> >
> > See https://lore.kernel.org/lkml/[email protected]/T/#u
> > and http://austingroupbugs.net/view.php?id=594#c1317
>
> First of all, as I discussed in [1], we need to first figure out *who*
> should have access to the process exit information. FreeBSD appears to
> make it public without disaster, and if we make exit status public, a
> lot of problems just disappear.

I kind of want to go in the other direction of making a lot of process
information (especially cmdline) less publicly accessible.

In general, any kind of API where a process has an fd is tricky to do
right on UNIXy systems because of SUID, SGID, and LSM transition
rules. Windows has an easy time of it because it's always safe for a
parent process to introspect the child. (Well, almost, because
Windows gained their privilege elevation stuff. I'm not saying we
shouldn't do it -- I'm just saying that it's nontrivial.

2018-11-18 17:14:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
> >>
> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
> >> > I fully agree that a more comprehensive, less expensive API for
> >> > managing processes would be nice. But I also think that this patch
> >> > (using the directory fd and ioctl) is better from a security
> >> > perspective than using a new file in /proc.
> >>
> >> That's an assertion, not an argument. And I'm not opposed to an
> >> operation on the directory FD, now that it's clear Linus has banned
> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
> >> with a system call instead of a less-reliable ioctl due to the
> >> inherent namespace collision issues in ioctl command names.
> >
> > Linus banned it because of bugs iike the ones in the patch.
>
> Maybe: he didn't provide a reason. What's your point?

My point is that an API that involves a file like /proc/PID/kill is
very tricky to get right. Here are some considerations:

1. The .write handler for the file must not behave differently
depending on current. So we can't check the caller's creds. (There
are legacy things that are generally buggy that look at current's
creds in write handlers. They're legacy, and they're almost
universally at least a little bit buggy, and many have been
exploitable.)

2. Even if we have ioctl() or a new syscall on /proc/PID/kill, we at
least have to define whether *opening* kill checks any credentials.
It probably shouldn't, since I don't see what the semantics should be.

3. The current Linux kill_pid stuff doesn't take a cred parameter, so
it's not so easy to check f_cred even if we wanted to.

So the simplest implementation using /proc/PID/kill would be for .open
to do essentially nothing and for ioctl or a new syscall to check
credentials as usual. But this seems to have no technical advantage
over just using /proc/PID itself, and it's a good deal slower, as it
requires an open/close cycle.

Now if we had an ioctlat() API, maybe it would make sense. But we
don't, and I think it would be a bit crazy to add one.

--Andy

2018-11-18 17:18:27

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <[email protected]> wrote:
>>
>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>> >>
>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>> >> > I fully agree that a more comprehensive, less expensive API for
>> >> > managing processes would be nice. But I also think that this patch
>> >> > (using the directory fd and ioctl) is better from a security
>> >> > perspective than using a new file in /proc.
>> >>
>> >> That's an assertion, not an argument. And I'm not opposed to an
>> >> operation on the directory FD, now that it's clear Linus has banned
>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
>> >> with a system call instead of a less-reliable ioctl due to the
>> >> inherent namespace collision issues in ioctl command names.
>> >
>> > Linus banned it because of bugs iike the ones in the patch.
>>
>> Maybe: he didn't provide a reason. What's your point?
>
> My point is that an API that involves a file like /proc/PID/kill is
> very tricky to get right. Here are some considerations:

Moot. write(2) for this interface is off the table anyway. The right
approach here is a system call that accepts a /proc/pid directory file
descriptor, a signal number, and a signal information field (as in
sigqueue(2)).

> Now if we had an ioctlat() API, maybe it would make sense. But we
> don't, and I think it would be a bit crazy to add one.

A process is not a driver. Why won't this idea of using an ioctl for
the kill-process-by-dfd thing just won't die? An ioctl has *zero*
advantage in this context.

2018-11-18 17:27:06

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:09 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 8:49 AM Daniel Colascione <[email protected]> wrote:
>>
>> On Sun, Nov 18, 2018 at 8:33 AM, Randy Dunlap <[email protected]> wrote:
>> > On 11/18/18 8:17 AM, Andy Lutomirski wrote:
>> >> On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>> >>>
>> >>> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>> >>>> I fully agree that a more comprehensive, less expensive API for
>> >>>> managing processes would be nice. But I also think that this patch
>> >>>> (using the directory fd and ioctl) is better from a security
>> >>>> perspective than using a new file in /proc.
>> >>>
>> >>> That's an assertion, not an argument. And I'm not opposed to an
>> >>> operation on the directory FD, now that it's clear Linus has banned
>> >>> "write(2)-as-a-command" APIs. I just insist that we implement the API
>> >>> with a system call instead of a less-reliable ioctl due to the
>> >>> inherent namespace collision issues in ioctl command names.
>> >>
>> >> Linus banned it because of bugs iike the ones in the patch.
>> >>
>> >>>
>> >>>> I have an old patch to make proc directory fds pollable:
>> >>>>
>> >>>> https://lore.kernel.org/patchwork/patch/345098/
>> >>>>
>> >>>> That patch plus the one in this thread might make a nice addition to
>> >>>> the kernel even if we expect something much better to come along
>> >>>> later.
>> >>>
>> >>> I've always commented on that patch. You never addressed my technical
>> >>> objections. Why are you bringing up this patch again as if that
>> >>> discussion had never happened? To review, that patch has various race
>> >>> conditions
>> >>
>> >> I don't think I ever saw that review.
>> >>
>> >>> and even if it were technically correct, it'd be an abuse
>> >>> of directory objects (in what other circumstance do we poll
>> >>> directories?) and not logically generalizable to a model in which we
>> >>> expose process exit status via the exit-monitoring API.
>> >>
>> >> I agree it's weird. It might be better to have /proc/PID/exit_status
>> >> and make *that* pollable.
>> >>
>> >
>> > If there is a new exit_status file, it could even be more than
>> > 8 bits of exit status:
>> >
>> > See https://lore.kernel.org/lkml/[email protected]/T/#u
>> > and http://austingroupbugs.net/view.php?id=594#c1317
>>
>> First of all, as I discussed in [1], we need to first figure out *who*
>> should have access to the process exit information. FreeBSD appears to
>> make it public without disaster, and if we make exit status public, a
>> lot of problems just disappear.
>
> I kind of want to go in the other direction of making a lot of process
> information (especially cmdline) less publicly accessible.

Okay. That has nothing to do with exit status. Please address the
points related to the API we're discussing and that I raised in the
other thread.

Assuming we don't broaden exit status readability (which would make a
lot of things simpler), the exit notification mechanism must work like
this: if you can see a process in /proc, you should be able to wait on
it. If you learn that process's exit status through some other means
--- e.g., you're the process's parent, you can ptrace the process, you
have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate
of the process. Otherwise you just be able to learn that the process
exited.

> Windows has an easy time of it because

Windows has an easier time of it because it doesn't use an ad-hoc
ambient authority permission model. In Windows, if you can open a
handle to do something, that handle lets you do the thing. Period.
There's none of this "well, I opened this process FD, but since I
opened it, the process called setuid, so now I can't get its exit
status" nonsense. Privilege elevation is always accomplished via a
separate call to CreateProcessWithToken, which creates a *new* process
with the elevated privileges. An existing process can't suddenly and
magically become this special thing that you can't inspect, but that
has the same PID and identity as this other process that you used to
be able to inspect. The model is just better, because permission is
baked into the HANDLE. Now, that ship has sailed. We're stuck with
setreuid and exec. But let's be clear about what's causing the
complexity.

2018-11-18 17:43:03

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote:
> On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <[email protected]> wrote:
> >
> > I had been led to believe that the proposal would be a comprehensive
> > process API, not an ioctl basically equivalent to my previous patch.
> > If you had a more comprehensive proposal, please just share it on LKML
> > instead of limiting the discussion to those able to attend these
> > various conferences. If there's some determined opposition to a
> > general new process API, this opposition needs a fair and full airing,
> > as not everyone can attend these conferences.
> >
> > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> > > With this patch an open() call on /proc/<pid> will give userspace a handle
> > > to struct pid of the process associated with /proc/<pid>. This allows to
> > > maintain a stable handle on a process.
> > > I have been discussing various approaches extensively during technical
> > > conferences this year culminating in a long argument with Eric at Linux
> > > Plumbers. The general consensus was that having a handle on a process
> > > will be something that is very simple and easy to maintain
> >
> > ioctls are the opposite of "easy to maintain". Their
> > file-descriptor-specific behavior makes it difficult to use the things
> > safely. If you want to take this approach, please make a new system
> > call. An ioctl is just a system call with a very strange spelling and
> > unfortunate collision semantics.
> >
> > > with the
> > > option of being extensible via a more advanced api if the need arises.
> >
> > The need *has* arisen; see my exithand patch.
> >
> > > I
> > > believe that this patch is the most simple, dumb, and therefore
> > > maintainable solution.
> > >
> > > The need for this has arisen in order to reliably kill a process without
> > > running into issues of the pid being recycled as has been described in the
> > > rejected patch [1].
> >
> > That patch was not "rejected". It was tabled pending the more
> > comprehensive process API proposal that was supposed to have emerged.
> > This patch is just another variant of the sort of approach we
> > discussed on that patch's thread here. As I mentioned on that thread,
> > the right approach option is a new system call, not an ioctl.
> >
> > To fulfill the need described in that patchset a new
> > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> > > process via a file descriptor:
> > >
> > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> > > close(fd);
> > >
> > > Note, the stable handle will allow us to carefully extend this feature in
> > > the future.
> >
> > We still need the ability to synchronously wait on a process's death,
> > as in my patch set. I will be refreshing that patch set.
>
> I fully agree that a more comprehensive, less expensive API for
> managing processes would be nice. But I also think that this patch
> (using the directory fd and ioctl) is better from a security
> perspective than using a new file in /proc.
>
> I have an old patch to make proc directory fds pollable:
>
> https://lore.kernel.org/patchwork/patch/345098/
>
> That patch plus the one in this thread might make a nice addition to
> the kernel even if we expect something much better to come along
> later.

I agree. Eric's point was to make the first implementation of this as
simple as possible that's why this patch is intentionally almost
trivial. And I like it for its simplicity.

I had a more comprehensive API proposal of which open(/proc/<pid>) was a
part. I didn't send out alongside this patch as Eric clearly prefered to
only have the /proc/<pid> part. Here is the full proposal as I intended
to originally send it out:

The gist is to have file descriptors for processes which is obviously not a new
idea. This has been done before in other OSes and it has been tried before in
Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
make it very clear that I'm not laying claim to this being my or even a novel
idea in any way. However, I want to diverge from previous approaches with my
suggestion. (Though I can't be sure that there's not something similar in other
OSes already.)

One of the main motivations for having procfds is to have a race-free way of
configuring, starting, polling, and killing a process. Basically, a process
lifecycle api if you want to think about it that way. The api should also be
easily extendable in the future to avoid running into the limitations we
currently see with the clone*() syscall(s) again.

One of the crucial points of the api is to *separate the configuration
of a process through a procfd from actually creating the process*.
This is a crucial property expressed in the open*() system calls. First, get a
stable handle on an object then allow for ways to configure it. As such the
procfd api shares the same insight with Al's and David's new mount api.
(Fwiw, Andy also pointed out similarities with posix_spawn().)
What I envisioned was to have the following syscalls (multiple name suggestions):

1. int process_open / proc_open / procopen
2. int process_config / proc_config / procconfig or ioctl()-based
3. int process_info / proc_info / procinfo or ioctl()-based
4. int process_manage / proc_manage / procmanage or ioctl()-based

and the following procfs extension:

int procfd = open("/proc/<pid>", O_DIRECTORY | O_CLOEXEC);

Some of you will notice right away that we could replace 2-4 with ioctl()s.

#### process_open()
will return an fd that creates a process context.
The fd returned by process_open() does neither refer to any existing process
nor has the process actually been started yet. So non-configuration operations
on it or trying to interact with it would fail with e.g. ESRCH/EINVAL.

#### process_config() / ioctl()
takes an fd returned by process_open() and can be used to configure a process
context *before it is alive*.
Some things that I would like to be able to do with this syscall are:
- configure signals
- set clone flags
- write idmappings if the process runs in a new user namespace
- configure what happens when all procfds referring to the process are gone
- ...

Just to get a very rough feel for this without detailing parameters right now:

/* process should have own mountns */
process_config/ioctl(fd, PROC_SET_FLAG, CLONE_NEWNS, <potentially additional arguments>)

/* process should get SIGKILL when all procfds are closed */
process_config/ioctl(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>)

After the caller is done configuring the process there would be a final step:

process_config/ioctl(fd, PROC_CREATE, 0, <potentially additional arguments>)

which would create the process and (either as return value or through a
parameter) return the pid of the newly created process.

These fds should be pollable (though this is maybe out of scope for a first
implementation). In combination with the split between getting an fd for a
process context and starting the process would this would then allow for nice
things such as adding an fd gotten via process_open() to an epoll() instance
where other processes can poll the fd to e.g. (given appropriate privileges)
get an event when process_config/ioctl()(fd, PROC_CREATE, *, <potentially
additional arguments>) has actually started the process or it exited.

#### int process_info / ioctl()
allows to retrieve information about a process (e.g. signals, namespaces, or
even information available through getrusage()).
This would be a more performant and race-free way then parsing through various
files in /proc. I remember quite some people asking for a variation of this.

#### process_manage / ioctl()
allows to interact/manage a process through a procfd.
Specifically, one would be able to send signals to the process, retrieve the
exit status from it etc. Here is an example to get a feel for it:

/* send SIGTERM to process /
process_manage/ioctl(fd, PROC_SIGNAL, SIGTERM, <potentially additional arguments>)

/* block until the process has exited and retrieve exit information via
* <potentially additional arguments>.
* One could also make it possible to specify a timeout here.
*/
process_manage/ioctl(fd, PROC_WAIT, 0, <potentially additional arguments>)

#### /proc/<pid>
allow to get a procfd for an existing process.
This adds a new file "handle" to /proc that serves as a way to get a procfd for
a process.

I hope that's enough information for now without too much detail.
I think that /proc/<pid> is probably the easiest to target and that I
prototyped.

[1]: https://lkml.org/lkml/2018/10/30/118
[2]: https://lkml.kernel.org/r/[email protected]
[3]: https://lore.kernel.org/lkml/2279556.Wl6mCVq5Zi@tjmaciei-mobl4

2018-11-18 17:44:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:24 AM Daniel Colascione <[email protected]> wrote:
> Assuming we don't broaden exit status readability (which would make a
> lot of things simpler), the exit notification mechanism must work like
> this: if you can see a process in /proc, you should be able to wait on
> it. If you learn that process's exit status through some other means
> --- e.g., you're the process's parent, you can ptrace the process, you
> have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate
> of the process. Otherwise you just be able to learn that the process
> exited.

Sounds reasonable to me. Except for the obvious turd that, if you
open /proc/PID/whatever, and the process calls execve(), then the
resulting semantics are awkward at best.

>
> > Windows has an easy time of it because
>
> Windows has an easier time of it because it doesn't use an ad-hoc
> ambient authority permission model. In Windows, if you can open a
> handle to do something, that handle lets you do the thing. Period.
> There's none of this "well, I opened this process FD, but since I
> opened it, the process called setuid, so now I can't get its exit
> status" nonsense. Privilege elevation is always accomplished via a
> separate call to CreateProcessWithToken, which creates a *new* process
> with the elevated privileges. An existing process can't suddenly and
> magically become this special thing that you can't inspect, but that
> has the same PID and identity as this other process that you used to
> be able to inspect. The model is just better, because permission is
> baked into the HANDLE. Now, that ship has sailed. We're stuck with
> setreuid and exec. But let's be clear about what's causing the
> complexity.

I'm not entirely sure that ship has sailed. In the kernel, we already
have a bit of a distinction between a pid (and tid, etc -- I'm
referring to struct pid) and a task. If we make a new
process-management API, we could put a distinction like this into the
API. As a straw-man proposal (highly incomplete and probably wrong,
but maybe it gets the idea across):

Have a way to get an fd that refers to a "running program". (I'm
calling it that to distinguish it from "task" and "pid", both of which
already mean something.) You'd be able to open such an fd given a
pid, and your permissions would be checked at that time. R access
means you can read the running program's memory and otherwise
introspect it. W means you can modify it's memory and otherwise mess
with it. X means you can send it signals. We might need more bits to
really do this right.

Now here's the kicker: if the "running program" calls execve(), it
goes away. The fd gets some sort of notification that this happened
and there's an API to get a handle to the new running program *if the
caller has the appropriate permissions*. setresuid() has no effect
here -- if you have W access to the process and the process calls
setresuid(), you still have W access.

To make this fully useful, we'd probably want to elaborate it with a
race-free way to track all descendents and, if needed, kill them all,
subject to permissions.

This API ought to be extensible to replace ptrace() eventually.

Does this seem like a reasonable direction to go in?

2018-11-18 17:44:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

Daniel Colascione <[email protected]> writes:

> On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <[email protected]> wrote:
>> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <[email protected]> wrote:
>>>
>>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
>>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>>> >>
>>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>>> >> > I fully agree that a more comprehensive, less expensive API for
>>> >> > managing processes would be nice. But I also think that this patch
>>> >> > (using the directory fd and ioctl) is better from a security
>>> >> > perspective than using a new file in /proc.
>>> >>
>>> >> That's an assertion, not an argument. And I'm not opposed to an
>>> >> operation on the directory FD, now that it's clear Linus has banned
>>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
>>> >> with a system call instead of a less-reliable ioctl due to the
>>> >> inherent namespace collision issues in ioctl command names.
>>> >
>>> > Linus banned it because of bugs iike the ones in the patch.
>>>
>>> Maybe: he didn't provide a reason. What's your point?
>>
>> My point is that an API that involves a file like /proc/PID/kill is
>> very tricky to get right. Here are some considerations:
>
> Moot. write(2) for this interface is off the table anyway. The right
> approach here is a system call that accepts a /proc/pid directory file
> descriptor, a signal number, and a signal information field (as in
> sigqueue(2)).

If we did not have the permission check challenges and could perform
the permission checks in open, write(2) would be on the table.
Performing write(2) would only be concrend about data.

Unfortunately we have setresuid and exec which make that infeasible
for the kill operations.

>> Now if we had an ioctlat() API, maybe it would make sense. But we
>> don't, and I think it would be a bit crazy to add one.
>
> A process is not a driver. Why won't this idea of using an ioctl for
> the kill-process-by-dfd thing just won't die? An ioctl has *zero*
> advantage in this context.

An ioctl has an advantage in implementation complexity. An ioctl is
very much easier to wire up that a system call.

I don't know if that outweighs ioctls disadvantages in long term
maintainability.

Eric


2018-11-18 17:45:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:42 AM Christian Brauner <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote:
> > On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <[email protected]> wrote:
> > >
> > > I had been led to believe that the proposal would be a comprehensive
> > > process API, not an ioctl basically equivalent to my previous patch.
> > > If you had a more comprehensive proposal, please just share it on LKML
> > > instead of limiting the discussion to those able to attend these
> > > various conferences. If there's some determined opposition to a
> > > general new process API, this opposition needs a fair and full airing,
> > > as not everyone can attend these conferences.
> > >
> > > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> > > > With this patch an open() call on /proc/<pid> will give userspace a handle
> > > > to struct pid of the process associated with /proc/<pid>. This allows to
> > > > maintain a stable handle on a process.
> > > > I have been discussing various approaches extensively during technical
> > > > conferences this year culminating in a long argument with Eric at Linux
> > > > Plumbers. The general consensus was that having a handle on a process
> > > > will be something that is very simple and easy to maintain
> > >
> > > ioctls are the opposite of "easy to maintain". Their
> > > file-descriptor-specific behavior makes it difficult to use the things
> > > safely. If you want to take this approach, please make a new system
> > > call. An ioctl is just a system call with a very strange spelling and
> > > unfortunate collision semantics.
> > >
> > > > with the
> > > > option of being extensible via a more advanced api if the need arises.
> > >
> > > The need *has* arisen; see my exithand patch.
> > >
> > > > I
> > > > believe that this patch is the most simple, dumb, and therefore
> > > > maintainable solution.
> > > >
> > > > The need for this has arisen in order to reliably kill a process without
> > > > running into issues of the pid being recycled as has been described in the
> > > > rejected patch [1].
> > >
> > > That patch was not "rejected". It was tabled pending the more
> > > comprehensive process API proposal that was supposed to have emerged.
> > > This patch is just another variant of the sort of approach we
> > > discussed on that patch's thread here. As I mentioned on that thread,
> > > the right approach option is a new system call, not an ioctl.
> > >
> > > To fulfill the need described in that patchset a new
> > > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> > > > process via a file descriptor:
> > > >
> > > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> > > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> > > > close(fd);
> > > >
> > > > Note, the stable handle will allow us to carefully extend this feature in
> > > > the future.
> > >
> > > We still need the ability to synchronously wait on a process's death,
> > > as in my patch set. I will be refreshing that patch set.
> >
> > I fully agree that a more comprehensive, less expensive API for
> > managing processes would be nice. But I also think that this patch
> > (using the directory fd and ioctl) is better from a security
> > perspective than using a new file in /proc.
> >
> > I have an old patch to make proc directory fds pollable:
> >
> > https://lore.kernel.org/patchwork/patch/345098/
> >
> > That patch plus the one in this thread might make a nice addition to
> > the kernel even if we expect something much better to come along
> > later.
>
> I agree. Eric's point was to make the first implementation of this as
> simple as possible that's why this patch is intentionally almost
> trivial. And I like it for its simplicity.
>
> I had a more comprehensive API proposal of which open(/proc/<pid>) was a
> part. I didn't send out alongside this patch as Eric clearly prefered to
> only have the /proc/<pid> part. Here is the full proposal as I intended
> to originally send it out:
>
> The gist is to have file descriptors for processes which is obviously not a new
> idea. This has been done before in other OSes and it has been tried before in
> Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
> make it very clear that I'm not laying claim to this being my or even a novel
> idea in any way. However, I want to diverge from previous approaches with my
> suggestion. (Though I can't be sure that there's not something similar in other
> OSes already.)
>
> One of the main motivations for having procfds is to have a race-free way of
> configuring, starting, polling, and killing a process. Basically, a process
> lifecycle api if you want to think about it that way. The api should also be
> easily extendable in the future to avoid running into the limitations we
> currently see with the clone*() syscall(s) again.
>
> One of the crucial points of the api is to *separate the configuration
> of a process through a procfd from actually creating the process*.
> This is a crucial property expressed in the open*() system calls. First, get a
> stable handle on an object then allow for ways to configure it. As such the
> procfd api shares the same insight with Al's and David's new mount api.
> (Fwiw, Andy also pointed out similarities with posix_spawn().)
> What I envisioned was to have the following syscalls (multiple name suggestions):
>
> 1. int process_open / proc_open / procopen
> 2. int process_config / proc_config / procconfig or ioctl()-based
> 3. int process_info / proc_info / procinfo or ioctl()-based
> 4. int process_manage / proc_manage / procmanage or ioctl()-based

Emails crossed :(

For process management, I generally like this, although we might do
better if we make execve() effectively invalidate the handle. Then we
avoid a bunch of nasty permission issues. For process *creation*, we
have the problem that libc authors feel that they can't safely use fds
at all. There was a proposal for "high fds" a long time back to solve
that. We might finally need to do something like that.

2018-11-18 17:46:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:43 AM Eric W. Biederman <[email protected]> wrote:
>
> Daniel Colascione <[email protected]> writes:
>
> > On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <[email protected]> wrote:
> >> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <[email protected]> wrote:
> >>>
> >>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
> >>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
> >>> >>
> >>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
> >>> >> > I fully agree that a more comprehensive, less expensive API for
> >>> >> > managing processes would be nice. But I also think that this patch
> >>> >> > (using the directory fd and ioctl) is better from a security
> >>> >> > perspective than using a new file in /proc.
> >>> >>
> >>> >> That's an assertion, not an argument. And I'm not opposed to an
> >>> >> operation on the directory FD, now that it's clear Linus has banned
> >>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
> >>> >> with a system call instead of a less-reliable ioctl due to the
> >>> >> inherent namespace collision issues in ioctl command names.
> >>> >
> >>> > Linus banned it because of bugs iike the ones in the patch.
> >>>
> >>> Maybe: he didn't provide a reason. What's your point?
> >>
> >> My point is that an API that involves a file like /proc/PID/kill is
> >> very tricky to get right. Here are some considerations:
> >
> > Moot. write(2) for this interface is off the table anyway. The right
> > approach here is a system call that accepts a /proc/pid directory file
> > descriptor, a signal number, and a signal information field (as in
> > sigqueue(2)).
>
> If we did not have the permission check challenges and could perform
> the permission checks in open, write(2) would be on the table.
> Performing write(2) would only be concrend about data.
>
> Unfortunately we have setresuid and exec which make that infeasible
> for the kill operations.

setresuid() should be irrelevant. If you had permission to kill a
process and the process calls setresuid(), you should still have
permission to kill it.

For execve(), we could make execve() invalidate the fd. (See other email.)

2018-11-18 17:52:44

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:42 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 9:24 AM Daniel Colascione <[email protected]> wrote:
>> Assuming we don't broaden exit status readability (which would make a
>> lot of things simpler), the exit notification mechanism must work like
>> this: if you can see a process in /proc, you should be able to wait on
>> it. If you learn that process's exit status through some other means
>> --- e.g., you're the process's parent, you can ptrace the process, you
>> have CAP_WHATEVER_IT_IS_ --- then you should be able to learn the fate
>> of the process. Otherwise you just be able to learn that the process
>> exited.
>
> Sounds reasonable to me. Except for the obvious turd that, if you
> open /proc/PID/whatever, and the process calls execve(), then the
> resulting semantics are awkward at best.

A process calling execve does not give up its logical identity. Lots
of programs exec themselves, e.g., to reload configuration.

>> > Windows has an easy time of it because
>>
>> Windows has an easier time of it because it doesn't use an ad-hoc
>> ambient authority permission model. In Windows, if you can open a
>> handle to do something, that handle lets you do the thing. Period.
>> There's none of this "well, I opened this process FD, but since I
>> opened it, the process called setuid, so now I can't get its exit
>> status" nonsense. Privilege elevation is always accomplished via a
>> separate call to CreateProcessWithToken, which creates a *new* process
>> with the elevated privileges. An existing process can't suddenly and
>> magically become this special thing that you can't inspect, but that
>> has the same PID and identity as this other process that you used to
>> be able to inspect. The model is just better, because permission is
>> baked into the HANDLE. Now, that ship has sailed. We're stuck with
>> setreuid and exec. But let's be clear about what's causing the
>> complexity.
>
> I'm not entirely sure that ship has sailed. In the kernel, we already
> have a bit of a distinction between a pid (and tid, etc -- I'm
> referring to struct pid) and a task. If we make a new
> process-management API, we could put a distinction like this into the
> API.

It would be a disaster to have different APIs give callers a different
idea of process identity over its lifetime. The precedent is
well-established that execve and setreuid do not change a process's
identity. Invaliding some identifiers but not others in response to
supposedly-internal things a process might do under rare circumstances
is creating a bug machine..

> setresuid() has no effect
> here -- if you have W access to the process and the process calls
> setresuid(), you still have W access.

Now you've created a situation in which an operation that security
policy previously blocked now becomes possible, invaliding previous
designs based on the old security invariant. That's the definition of
introducing a security hole.

2018-11-18 17:57:42

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:43 AM, Eric W. Biederman
<[email protected]> wrote:
> Daniel Colascione <[email protected]> writes:
>
>> On Sun, Nov 18, 2018 at 9:13 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Sun, Nov 18, 2018 at 8:29 AM Daniel Colascione <[email protected]> wrote:
>>>>
>>>> On Sun, Nov 18, 2018 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
>>>> > On Sun, Nov 18, 2018 at 7:53 AM Daniel Colascione <[email protected]> wrote:
>>>> >>
>>>> >> On Sun, Nov 18, 2018 at 7:38 AM, Andy Lutomirski <[email protected]> wrote:
>>>> >> > I fully agree that a more comprehensive, less expensive API for
>>>> >> > managing processes would be nice. But I also think that this patch
>>>> >> > (using the directory fd and ioctl) is better from a security
>>>> >> > perspective than using a new file in /proc.
>>>> >>
>>>> >> That's an assertion, not an argument. And I'm not opposed to an
>>>> >> operation on the directory FD, now that it's clear Linus has banned
>>>> >> "write(2)-as-a-command" APIs. I just insist that we implement the API
>>>> >> with a system call instead of a less-reliable ioctl due to the
>>>> >> inherent namespace collision issues in ioctl command names.
>>>> >
>>>> > Linus banned it because of bugs iike the ones in the patch.
>>>>
>>>> Maybe: he didn't provide a reason. What's your point?
>>>
>>> My point is that an API that involves a file like /proc/PID/kill is
>>> very tricky to get right. Here are some considerations:
>>
>> Moot. write(2) for this interface is off the table anyway. The right
>> approach here is a system call that accepts a /proc/pid directory file
>> descriptor, a signal number, and a signal information field (as in
>> sigqueue(2)).
>
> If we did not have the permission check challenges and could perform
> the permission checks in open, write(2) would be on the table.
> Performing write(2) would only be concrend about data.
>
> Unfortunately we have setresuid and exec which make that infeasible
> for the kill operations.
>
>>> Now if we had an ioctlat() API, maybe it would make sense. But we
>>> don't, and I think it would be a bit crazy to add one.
>>
>> A process is not a driver. Why won't this idea of using an ioctl for
>> the kill-process-by-dfd thing just won't die? An ioctl has *zero*
>> advantage in this context.
>
> An ioctl has an advantage in implementation complexity. An ioctl is
> very much easier to wire up that a system call.
>
> I don't know if that outweighs ioctls disadvantages in long term
> maintainability.

It's not just maintainability. It's safety. We want to expose the new
kill interface to userspace via some kill(1) extension, probably. So
you should be able to write something like `cd /proc/12345 && kill
--by-handle .`. How does kill --by-handle know that it's safe to
perform the kill-by-proc-dfd operation on the file descriptor that it
opens? If the kill operation is an ioctl, you could pass it
/proc/self/fd/whatever of a completely different type; kill would call
ioctl on whatever FD it got, and potentially do a completely random
thing instead of killing a process. In the same situation, a new
system call would fail reliably. Yes, kill could check that the device
numbers of the file it opened matched proc's somehow, but that's
annoying and error-prone and nobody's going to bother in practice. A
new system call, by contrast, fails safe.

I really don't want to give up safety and fail-safe behavior forever
just because it's annoying, today, to wire up a new system call. (The
new table-driven system call stuff, if it ever lands, would make
things easier.)

2018-11-18 18:10:31

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:41 AM, Christian Brauner <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <[email protected]> wrote:
>> >
>> > I had been led to believe that the proposal would be a comprehensive
>> > process API, not an ioctl basically equivalent to my previous patch.
>> > If you had a more comprehensive proposal, please just share it on LKML
>> > instead of limiting the discussion to those able to attend these
>> > various conferences. If there's some determined opposition to a
>> > general new process API, this opposition needs a fair and full airing,
>> > as not everyone can attend these conferences.
>> >
>> > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
>> > > With this patch an open() call on /proc/<pid> will give userspace a handle
>> > > to struct pid of the process associated with /proc/<pid>. This allows to
>> > > maintain a stable handle on a process.
>> > > I have been discussing various approaches extensively during technical
>> > > conferences this year culminating in a long argument with Eric at Linux
>> > > Plumbers. The general consensus was that having a handle on a process
>> > > will be something that is very simple and easy to maintain
>> >
>> > ioctls are the opposite of "easy to maintain". Their
>> > file-descriptor-specific behavior makes it difficult to use the things
>> > safely. If you want to take this approach, please make a new system
>> > call. An ioctl is just a system call with a very strange spelling and
>> > unfortunate collision semantics.
>> >
>> > > with the
>> > > option of being extensible via a more advanced api if the need arises.
>> >
>> > The need *has* arisen; see my exithand patch.
>> >
>> > > I
>> > > believe that this patch is the most simple, dumb, and therefore
>> > > maintainable solution.
>> > >
>> > > The need for this has arisen in order to reliably kill a process without
>> > > running into issues of the pid being recycled as has been described in the
>> > > rejected patch [1].
>> >
>> > That patch was not "rejected". It was tabled pending the more
>> > comprehensive process API proposal that was supposed to have emerged.
>> > This patch is just another variant of the sort of approach we
>> > discussed on that patch's thread here. As I mentioned on that thread,
>> > the right approach option is a new system call, not an ioctl.
>> >
>> > To fulfill the need described in that patchset a new
>> > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
>> > > process via a file descriptor:
>> > >
>> > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
>> > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
>> > > close(fd);
>> > >
>> > > Note, the stable handle will allow us to carefully extend this feature in
>> > > the future.
>> >
>> > We still need the ability to synchronously wait on a process's death,
>> > as in my patch set. I will be refreshing that patch set.
>>
>> I fully agree that a more comprehensive, less expensive API for
>> managing processes would be nice. But I also think that this patch
>> (using the directory fd and ioctl) is better from a security
>> perspective than using a new file in /proc.
>>
>> I have an old patch to make proc directory fds pollable:
>>
>> https://lore.kernel.org/patchwork/patch/345098/
>>
>> That patch plus the one in this thread might make a nice addition to
>> the kernel even if we expect something much better to come along
>> later.
>
> I agree. Eric's point was to make the first implementation of this as
> simple as possible that's why this patch is intentionally almost
> trivial. And I like it for its simplicity.
>
> I had a more comprehensive API proposal of which open(/proc/<pid>) was a
> part. I didn't send out alongside this patch as Eric clearly prefered to
> only have the /proc/<pid> part. Here is the full proposal as I intended
> to originally send it out:

Thanks.

> The gist is to have file descriptors for processes which is obviously not a new
> idea. This has been done before in other OSes and it has been tried before in
> Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
> make it very clear that I'm not laying claim to this being my or even a novel
> idea in any way. However, I want to diverge from previous approaches with my
> suggestion. (Though I can't be sure that there's not something similar in other
> OSes already.)

Windows works basically as you describe. You can create a process is
suspended state, configure it however you want, then let it run.
CreateProcess (and even moreso, NtCreateProcess) also provide a rich
(and *extensible*) interface for pre-creation process configuration.

>> One of the main motivations for having procfds is to have a race-free way of
> configuring, starting, polling, and killing a process. Basically, a process
> lifecycle api if you want to think about it that way. The api should also be
> easily extendable in the future to avoid running into the limitations we
> currently see with the clone*() syscall(s) again.
>
> One of the crucial points of the api is to *separate the configuration
> of a process through a procfd from actually creating the process*.
> This is a crucial property expressed in the open*() system calls. First, get a
> stable handle on an object then allow for ways to configure it. As such the
> procfd api shares the same insight with Al's and David's new mount api.
> (Fwiw, Andy also pointed out similarities with posix_spawn().)
> What I envisioned was to have the following syscalls (multiple name suggestions):
>
> 1. int process_open / proc_open / procopen
> 2. int process_config / proc_config / procconfig or ioctl()-based
> 3. int process_info / proc_info / procinfo or ioctl()-based
> 4. int process_manage / proc_manage / procmanage or ioctl()-based

The API you've proposed seems fine to me, although I'd either 1)
consolidate operations further into one system call, or 2) separate
the different management operations into more and different system
calls that can be audited independently. The grouping you've proposed
seems to have the worst aspects of API splitting and API multiplexing.
But I have no objection to it in spirit.

That said, while I do want to fix process configuration and startup
generally, I want to fix specific holes in the existing API surface
first. The two patches I've already sent do that, and this work
shouldn't wait on an ideal larger process-API overhaul that may or may
not arrive. Based on previous history, I suspect that an API of the
scope you're proposing would take years to overcome all LKML
objections and land. I don't want to wait that long when we can make
smaller fixes that would not conflict with the general architecture.

The original patch on this thread is half of the right fix. While I
think we should use a system call instead of an ioctl, and while I
have some specific implementation critiques (which I described in a
different message), it's the right general sort of thing. We should
merge it.

Next, I want to merge my exithand proposal, or something like it. It's
likewise a simple change that, in a minimal way, addresses a
longstanding API deficiency. I'm very strongly against the
POLLERR-on-directory variant of the idea.

2018-11-18 18:17:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 10:07 AM Daniel Colascione <[email protected]> wrote:
> Next, I want to merge my exithand proposal, or something like it. It's
> likewise a simple change that, in a minimal way, addresses a
> longstanding API deficiency. I'm very strongly against the
> POLLERR-on-directory variant of the idea.

Can you explain why you don't like POLLERR-on-a-directory? I'm not
saying that POLLERR-on-a-directory is fantastic, but I'd like to
understand what your objection is.

2018-11-18 18:30:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 9:51 AM Daniel Colascione <[email protected]> wrote:
>
> > I'm not entirely sure that ship has sailed. In the kernel, we already
> > have a bit of a distinction between a pid (and tid, etc -- I'm
> > referring to struct pid) and a task. If we make a new
> > process-management API, we could put a distinction like this into the
> > API.
>
> It would be a disaster to have different APIs give callers a different
> idea of process identity over its lifetime. The precedent is
> well-established that execve and setreuid do not change a process's
> identity. Invaliding some identifiers but not others in response to
> supposedly-internal things a process might do under rare circumstances
> is creating a bug machine..

Here's my point: if we're really going to make a new API to manipulate
processes by their fd, I think we should have at least a decent idea
of how that API will get extended in the future. Right now, we have
an extremely awkward situation where opening an fd in /proc requires
certain capabilities or uids, and using those fds often also checks
current's capabilities, and the target process may have changed its
own security context, including gaining privilege via SUID, SGID, or
LSM transition rules in the mean time. This has been a huge source of
security bugs. It would be nice to have a model for future APIs that
avoids these problems.

And I didn't say in my proposal that a process's identity should
fundamentally change when it calls execve(). I'm suggesting that
certain operations that could cause a process to gain privilege or
otherwise require greater permission to introspect (mainly execve)
could be handled by invalidating the new process management fds.
Sure, if init re-execs itself, it's still PID 1, but that doesn't
necessarily mean that:

fd = process_open_management_fd(1);
[init reexecs]
process_do_something(fd);

needs to work.

>
> > setresuid() has no effect
> > here -- if you have W access to the process and the process calls
> > setresuid(), you still have W access.
>
> Now you've created a situation in which an operation that security
> policy previously blocked now becomes possible, invaliding previous
> designs based on the old security invariant. That's the definition of
> introducing a security hole.

I think you're overstating your case. To a pretty good approximation,
setresuid() allows the caller to remove elements from the set {ruid,
suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a
process before it calls setresuid(), you might as well be able to
ptrace() it after, since you could have just ptraced it and made it
call setresuid() while still ptracing it. Similarly, it seems like
it's probably safe to be able to open an fd that lets you watch the
exit status of a process, have the process call setresuid(), and still
see the exit status.

Regardless of how you feel about these issues, if you're going to add
an API by which you open an fd, wait for a process to exit, and read
the exit status, you need to define the conditions under which you may
open the fd and under which you may read the exit status once you have
the fd. There are probably multiple valid answers, but the question
still needs to be answered. My POLLERR hack, aside from being ugly,
avoids this particular issue because it merely lets you wait for
something you already could have observed using readdir().

2018-11-18 18:32:39

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 10:15 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 10:07 AM Daniel Colascione <[email protected]> wrote:
>> Next, I want to merge my exithand proposal, or something like it. It's
>> likewise a simple change that, in a minimal way, addresses a
>> longstanding API deficiency. I'm very strongly against the
>> POLLERR-on-directory variant of the idea.
>
> Can you explain why you don't like POLLERR-on-a-directory? I'm not
> saying that POLLERR-on-a-directory is fantastic, but I'd like to
> understand what your objection is.

I've written my objections in more detail at [1]. Basically,

1) Nothing else today works by polling on directory file descriptors.

2) POLLERR is wrong because POLLERR indicates, well, an error, but
since we want notifications upon either a transition to a zombie *or*
actual death, and /proc/pid operations work perfectly well on zombie
processes, there's no error to report, and reporting POLLERR would be
a weird kind of lie. POLLHUP might be less awkward here.

3) POLLERR is a single bit of information. I want exit status as well,
or at least a logical path from whatever we add to some kind of exit
status reporting. You can get exit status from a zombie with openat on
/proc/pid/stat, but what if the process fully dies by the time you get
around to reading its exit status? Should we synthesize a
/proc/pid/stat? It seems simpler to be able to just read(2) the exit
information from some FD.

4) Event monitoring frameworks generally treat POLLERR as some kind of
black sheep. Most people think in terms of bits indicating reading and
writing. I want something that can cleanly integrate into existing
wait mechanisms.

5) Poll events are *hints* that some other operation probably won't
block if attempted. Using poll as the primary way of communicating a
bit of information instead of an attempt-IO-now hint feels awkward.

6) A POLLERR interface can't be used by the shell without some kind of helper.

What *advantage* does a POLLERR interface have? That you don't have to
openat a separate file? That's a trivial operation in profs,
especially compared to the machinery of process shutdown.

I'm not particularly tied to a proc file; if we're adding a system
call interface for killing a process by its procfs dfd, we can add one
for returning an eventfd-like FD representing that process's status.
It's unfortunate that the process handle FD also happens to be a
directory FD; if it were a standalone object type, we could just use
POLLIN more naturally.

[1] https://lore.kernel.org/lkml/CAKOZueszfoSM0pxhmuFLOuPmJqSfYXxgutstyCgqxAyoUi4h3w@mail.gmail.com/

2018-11-18 18:45:30

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 10:28 AM, Andy Lutomirski <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 9:51 AM Daniel Colascione <[email protected]> wrote:
>>
>> > I'm not entirely sure that ship has sailed. In the kernel, we already
>> > have a bit of a distinction between a pid (and tid, etc -- I'm
>> > referring to struct pid) and a task. If we make a new
>> > process-management API, we could put a distinction like this into the
>> > API.
>>
>> It would be a disaster to have different APIs give callers a different
>> idea of process identity over its lifetime. The precedent is
>> well-established that execve and setreuid do not change a process's
>> identity. Invaliding some identifiers but not others in response to
>> supposedly-internal things a process might do under rare circumstances
>> is creating a bug machine..
>
> Here's my point: if we're really going to make a new API to manipulate
> processes by their fd, I think we should have at least a decent idea
> of how that API will get extended in the future. Right now, we have
> an extremely awkward situation where opening an fd in /proc requires
> certain capabilities or uids, and using those fds often also checks
> current's capabilities, and the target process may have changed its
> own security context, including gaining privilege via SUID, SGID, or
> LSM transition rules in the mean time. This has been a huge source of
> security bugs. It would be nice to have a model for future APIs that
> avoids these problems.
>
> And I didn't say in my proposal that a process's identity should
> fundamentally change when it calls execve(). I'm suggesting that
> certain operations that could cause a process to gain privilege or
> otherwise require greater permission to introspect (mainly execve)
> could be handled by invalidating the new process management fds.
> Sure, if init re-execs itself, it's still PID 1, but that doesn't
> necessarily mean that:
>
> fd = process_open_management_fd(1);
> [init reexecs]
> process_do_something(fd);
>
> needs to work.

PID 1 is a bad example here, because it doesn't get recycled. Other
PIDs do. The snippet you gave *does* need to work, in general, because
if exec invalidates the handle, and you need to reopen by PID to
re-establish your right to do something with the process, that process
may in fact have died between the invalidation and your reopen, and
your reopened FD may refer to some other random process.

The only way around this problem is to have two separate FDs --- one
to represent process identity, which *must* be continuous across
execve, and the other to represent some specific capability, some
ability to do something to that process. It's reasonable to invalidate
capability after execve, but it's not reasonable to invalidate
identity. In concrete terms, I don't see a big advantage to this
separation, and I think a single identity FD combined with
per-operation capability checks is sufficient. And much simpler.

>> > setresuid() has no effect
>> > here -- if you have W access to the process and the process calls
>> > setresuid(), you still have W access.
>>
>> Now you've created a situation in which an operation that security
>> policy previously blocked now becomes possible, invaliding previous
>> designs based on the old security invariant. That's the definition of
>> introducing a security hole.
>
> I think you're overstating your case. To a pretty good approximation,
> setresuid() allows the caller to remove elements from the set {ruid,
> suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a
> process before it calls setresuid(), you might as well be able to
> ptrace() it after, since you could have just ptraced it and made it
> call setresuid() while still ptracing it.

What about a child that execs a setuid binary?

> Similarly, it seems like
> it's probably safe to be able to open an fd that lets you watch the
> exit status of a process, have the process call setresuid(), and still
> see the exit status.

Is it? That's an open question.

>
> Regardless of how you feel about these issues, if you're going to add
> an API by which you open an fd, wait for a process to exit, and read
> the exit status, you need to define the conditions under which you may
> open the fd and under which you may read the exit status once you have
> the fd. There are probably multiple valid answers, but the question
> still needs to be answered.

Yes. That's the point I made in that previous message of mine that I referenced.

> My POLLERR hack, aside from being ugly,
> avoids this particular issue because it merely lets you wait for
> something you already could have observed using readdir().

Yes. I mentioned this same issue-punting as the motivation behind
exithand, initially, just reading EOF on exit.

2018-11-18 19:06:18

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On 2018-11-18, Daniel Colascione <[email protected]> wrote:
> > Here's my point: if we're really going to make a new API to manipulate
> > processes by their fd, I think we should have at least a decent idea
> > of how that API will get extended in the future. Right now, we have
> > an extremely awkward situation where opening an fd in /proc requires
> > certain capabilities or uids, and using those fds often also checks
> > current's capabilities, and the target process may have changed its
> > own security context, including gaining privilege via SUID, SGID, or
> > LSM transition rules in the mean time. This has been a huge source of
> > security bugs. It would be nice to have a model for future APIs that
> > avoids these problems.
> >
> > And I didn't say in my proposal that a process's identity should
> > fundamentally change when it calls execve(). I'm suggesting that
> > certain operations that could cause a process to gain privilege or
> > otherwise require greater permission to introspect (mainly execve)
> > could be handled by invalidating the new process management fds.
> > Sure, if init re-execs itself, it's still PID 1, but that doesn't
> > necessarily mean that:
> >
> > fd = process_open_management_fd(1);
> > [init reexecs]
> > process_do_something(fd);
> >
> > needs to work.
>
> PID 1 is a bad example here, because it doesn't get recycled. Other
> PIDs do. The snippet you gave *does* need to work, in general, because
> if exec invalidates the handle, and you need to reopen by PID to
> re-establish your right to do something with the process, that process
> may in fact have died between the invalidation and your reopen, and
> your reopened FD may refer to some other random process.

I imagine the error would be -EPERM rather than -ESRCH in this case,
which would be incredibly trivial for userspace to differentiate
between. If you wish to re-open the path that is also trivial by
re-opening through /proc/self/fd/$fd -- which will re-do any permission
checks and will guarantee that you are re-opening the same 'struct file'
and thus the same 'struct pid'.

> The only way around this problem is to have two separate FDs --- one
> to represent process identity, which *must* be continuous across
> execve, and the other to represent some specific capability, some
> ability to do something to that process. It's reasonable to invalidate
> capability after execve, but it's not reasonable to invalidate
> identity. In concrete terms, I don't see a big advantage to this
> separation, and I think a single identity FD combined with
> per-operation capability checks is sufficient. And much simpler.

I think that the error separation above would trivially allow user-space
to know whether the identity or capability of a process being monitored
has changed.

Currently, all operations on a '/proc/$pid' which you've previously
opened and has died will give you -ESRCH. So the above separation I
mentioned is entirely consistent with how users are using '/proc/$pid'
to check for PID death today.

> > I think you're overstating your case. To a pretty good approximation,
> > setresuid() allows the caller to remove elements from the set {ruid,
> > suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a
> > process before it calls setresuid(), you might as well be able to
> > ptrace() it after, since you could have just ptraced it and made it
> > call setresuid() while still ptracing it.
>
> What about a child that execs a setuid binary?

Yeah, for this reason I think that using -EPERM on operations that we
think are not reasonable to allow possibly-less-privileged processes to
do -- probably the most reasonable choice would be ptrace_may_access().

> > Similarly, it seems like
> > it's probably safe to be able to open an fd that lets you watch the
> > exit status of a process, have the process call setresuid(), and still
> > see the exit status.
>
> Is it? That's an open question.

Well, if we consider wait4(2) it seems that this is already the case.
If you fork+exec a setuid binary you can definitely see its exit code.

> > My POLLERR hack, aside from being ugly,
> > avoids this particular issue because it merely lets you wait for
> > something you already could have observed using readdir().
>
> Yes. I mentioned this same issue-punting as the motivation behind
> exithand, initially, just reading EOF on exit.

One question I have about EOF-on-exit is that if we wish to extend it to
allow providing the exit status (which is something we discussed in the
original thread), how will multiple-readers be handled in such a
scenario? Would we be storing the exit status or siginfo in the
equivalent of a locked memfd?

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (4.79 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-18 19:26:24

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 10:07:31AM -0800, Daniel Colascione wrote:
> On Sun, Nov 18, 2018 at 9:41 AM, Christian Brauner <[email protected]> wrote:
> > On Sun, Nov 18, 2018 at 07:38:09AM -0800, Andy Lutomirski wrote:
> >> On Sun, Nov 18, 2018 at 5:59 AM Daniel Colascione <[email protected]> wrote:
> >> >
> >> > I had been led to believe that the proposal would be a comprehensive
> >> > process API, not an ioctl basically equivalent to my previous patch.
> >> > If you had a more comprehensive proposal, please just share it on LKML
> >> > instead of limiting the discussion to those able to attend these
> >> > various conferences. If there's some determined opposition to a
> >> > general new process API, this opposition needs a fair and full airing,
> >> > as not everyone can attend these conferences.
> >> >
> >> > On Sun, Nov 18, 2018 at 3:17 AM, Christian Brauner <[email protected]> wrote:
> >> > > With this patch an open() call on /proc/<pid> will give userspace a handle
> >> > > to struct pid of the process associated with /proc/<pid>. This allows to
> >> > > maintain a stable handle on a process.
> >> > > I have been discussing various approaches extensively during technical
> >> > > conferences this year culminating in a long argument with Eric at Linux
> >> > > Plumbers. The general consensus was that having a handle on a process
> >> > > will be something that is very simple and easy to maintain
> >> >
> >> > ioctls are the opposite of "easy to maintain". Their
> >> > file-descriptor-specific behavior makes it difficult to use the things
> >> > safely. If you want to take this approach, please make a new system
> >> > call. An ioctl is just a system call with a very strange spelling and
> >> > unfortunate collision semantics.
> >> >
> >> > > with the
> >> > > option of being extensible via a more advanced api if the need arises.
> >> >
> >> > The need *has* arisen; see my exithand patch.
> >> >
> >> > > I
> >> > > believe that this patch is the most simple, dumb, and therefore
> >> > > maintainable solution.
> >> > >
> >> > > The need for this has arisen in order to reliably kill a process without
> >> > > running into issues of the pid being recycled as has been described in the
> >> > > rejected patch [1].
> >> >
> >> > That patch was not "rejected". It was tabled pending the more
> >> > comprehensive process API proposal that was supposed to have emerged.
> >> > This patch is just another variant of the sort of approach we
> >> > discussed on that patch's thread here. As I mentioned on that thread,
> >> > the right approach option is a new system call, not an ioctl.
> >> >
> >> > To fulfill the need described in that patchset a new
> >> > > ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> >> > > process via a file descriptor:
> >> > >
> >> > > int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> >> > > ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> >> > > close(fd);
> >> > >
> >> > > Note, the stable handle will allow us to carefully extend this feature in
> >> > > the future.
> >> >
> >> > We still need the ability to synchronously wait on a process's death,
> >> > as in my patch set. I will be refreshing that patch set.
> >>
> >> I fully agree that a more comprehensive, less expensive API for
> >> managing processes would be nice. But I also think that this patch
> >> (using the directory fd and ioctl) is better from a security
> >> perspective than using a new file in /proc.
> >>
> >> I have an old patch to make proc directory fds pollable:
> >>
> >> https://lore.kernel.org/patchwork/patch/345098/
> >>
> >> That patch plus the one in this thread might make a nice addition to
> >> the kernel even if we expect something much better to come along
> >> later.
> >
> > I agree. Eric's point was to make the first implementation of this as
> > simple as possible that's why this patch is intentionally almost
> > trivial. And I like it for its simplicity.
> >
> > I had a more comprehensive API proposal of which open(/proc/<pid>) was a
> > part. I didn't send out alongside this patch as Eric clearly prefered to
> > only have the /proc/<pid> part. Here is the full proposal as I intended
> > to originally send it out:
>
> Thanks.
>
> > The gist is to have file descriptors for processes which is obviously not a new
> > idea. This has been done before in other OSes and it has been tried before in
> > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
> > make it very clear that I'm not laying claim to this being my or even a novel
> > idea in any way. However, I want to diverge from previous approaches with my
> > suggestion. (Though I can't be sure that there's not something similar in other
> > OSes already.)
>
> Windows works basically as you describe. You can create a process is
> suspended state, configure it however you want, then let it run.
> CreateProcess (and even moreso, NtCreateProcess) also provide a rich
> (and *extensible*) interface for pre-creation process configuration.
>
> >> One of the main motivations for having procfds is to have a race-free way of
> > configuring, starting, polling, and killing a process. Basically, a process
> > lifecycle api if you want to think about it that way. The api should also be
> > easily extendable in the future to avoid running into the limitations we
> > currently see with the clone*() syscall(s) again.
> >
> > One of the crucial points of the api is to *separate the configuration
> > of a process through a procfd from actually creating the process*.
> > This is a crucial property expressed in the open*() system calls. First, get a
> > stable handle on an object then allow for ways to configure it. As such the
> > procfd api shares the same insight with Al's and David's new mount api.
> > (Fwiw, Andy also pointed out similarities with posix_spawn().)
> > What I envisioned was to have the following syscalls (multiple name suggestions):
> >
> > 1. int process_open / proc_open / procopen
> > 2. int process_config / proc_config / procconfig or ioctl()-based
> > 3. int process_info / proc_info / procinfo or ioctl()-based
> > 4. int process_manage / proc_manage / procmanage or ioctl()-based
>
> The API you've proposed seems fine to me, although I'd either 1)
> consolidate operations further into one system call, or 2) separate
> the different management operations into more and different system
> calls that can be audited independently. The grouping you've proposed
> seems to have the worst aspects of API splitting and API multiplexing.
> But I have no objection to it in spirit.
>
> That said, while I do want to fix process configuration and startup
> generally, I want to fix specific holes in the existing API surface
> first. The two patches I've already sent do that, and this work
> shouldn't wait on an ideal larger process-API overhaul that may or may
> not arrive. Based on previous history, I suspect that an API of the
> scope you're proposing would take years to overcome all LKML
> objections and land. I don't want to wait that long when we can make
> smaller fixes that would not conflict with the general architecture.
>
> The original patch on this thread is half of the right fix. While I
> think we should use a system call instead of an ioctl, and while I
> have some specific implementation critiques (which I described in a
> different message), it's the right general sort of thing. We should
> merge it.

Thanks. I agree. Note, I don't care if it's an ioctl() or not. I'm happy
to instead add a syscall process_signal() alongside this patchset. What
do people prefer?

>
> Next, I want to merge my exithand proposal, or something like it. It's
> likewise a simple change that, in a minimal way, addresses a
> longstanding API deficiency. I'm very strongly against the
> POLLERR-on-directory variant of the idea.

2018-11-18 19:45:43

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <[email protected]> wrote:
> On 2018-11-18, Daniel Colascione <[email protected]> wrote:
>> > Here's my point: if we're really going to make a new API to manipulate
>> > processes by their fd, I think we should have at least a decent idea
>> > of how that API will get extended in the future. Right now, we have
>> > an extremely awkward situation where opening an fd in /proc requires
>> > certain capabilities or uids, and using those fds often also checks
>> > current's capabilities, and the target process may have changed its
>> > own security context, including gaining privilege via SUID, SGID, or
>> > LSM transition rules in the mean time. This has been a huge source of
>> > security bugs. It would be nice to have a model for future APIs that
>> > avoids these problems.
>> >
>> > And I didn't say in my proposal that a process's identity should
>> > fundamentally change when it calls execve(). I'm suggesting that
>> > certain operations that could cause a process to gain privilege or
>> > otherwise require greater permission to introspect (mainly execve)
>> > could be handled by invalidating the new process management fds.
>> > Sure, if init re-execs itself, it's still PID 1, but that doesn't
>> > necessarily mean that:
>> >
>> > fd = process_open_management_fd(1);
>> > [init reexecs]
>> > process_do_something(fd);
>> >
>> > needs to work.
>>
>> PID 1 is a bad example here, because it doesn't get recycled. Other
>> PIDs do. The snippet you gave *does* need to work, in general, because
>> if exec invalidates the handle, and you need to reopen by PID to
>> re-establish your right to do something with the process, that process
>> may in fact have died between the invalidation and your reopen, and
>> your reopened FD may refer to some other random process.
>
> I imagine the error would be -EPERM rather than -ESRCH in this case,
> which would be incredibly trivial for userspace to differentiate
> between.

Why would userspace necessarily see EPERM? The PID might get recycled
into a different random process that the caller has the ability to
affect.

> If you wish to re-open the path that is also trivial by
> re-opening through /proc/self/fd/$fd -- which will re-do any permission
> checks and will guarantee that you are re-opening the same 'struct file'
> and thus the same 'struct pid'.

When you reopen via /proc/self/fd, you get a new struct file
referencing the existing inode, not the same struct file. A new
reference to the old struct file would just be dup.

Anyway: what other API requires, for correct operation, occasional
reopening through /proc/self/fd? It's cumbersome, and it doesn't add
anything. If we invalidate process handles on execve, and processes
are legally allowed to re-exec themselves for arbitrary reasons at any
time, that's tantamount to saying that handles might become invalid at
any time and that all callers must be prepared to go through the
reopen-and-retry path before any operation.

Why are we making them do that? So that a process can have an open FD
that represents a process-operation capability? Which capability does
the open FD represent?

I think when you and Andy must be talking about is an API that looks like this:

int open_process_operation_handle(int procfs_dirfd, int capability_bitmask)

capability_bitmask would have bits like

PROCESS_CAPABILITY_KILL --- send a signal
PROCESS_CAPABILITY_PTRACE --- attach to a process
PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin
PROCESS_CAPABILITY_READ_CMDLINE --- etc.

Then you'd have system calls like

int process_kill(int process_capability_fd, int signo, const union sigval data)
int process_ptrace_attach(int process_capability_fd)
int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info)

that worked on these capability bits. If a process execs or does
something else to change its security capabilities, operations on
these capability FDs would fail with ESTALE or something and callers
would have to re-acquire their capabilities.

This approach works fine. It has some nice theoretical properties, and
could allow for things like nicer privilege separation for debuggers.
I wouldn't mind something like this getting into the kernel.

I just don't think this model is necessary right now. I want a small
change from what we have today, one likely to actually make it into
the tree. And bypassing the capability FDs and just allowing callers
to operate directly on process *identity* FDs, using privileges in
effect at the time of all, is at least no worse than what we have now.

That is, I'm proposing an API that looks like this:

int process_kill(int procfs_dfd, int signo, const union sigval value)

If, later, process_kill were to *also* accept process-capability FDs,
nothing would break.

>> The only way around this problem is to have two separate FDs --- one
>> to represent process identity, which *must* be continuous across
>> execve, and the other to represent some specific capability, some
>> ability to do something to that process. It's reasonable to invalidate
>> capability after execve, but it's not reasonable to invalidate
>> identity. In concrete terms, I don't see a big advantage to this
>> separation, and I think a single identity FD combined with
>> per-operation capability checks is sufficient. And much simpler.
>
> I think that the error separation above would trivially allow user-space
> to know whether the identity or capability of a process being monitored
> has changed.
>
> Currently, all operations on a '/proc/$pid' which you've previously
> opened and has died will give you -ESRCH.

Not the case. Zombies have died, but profs operations work fine on zombies.

>> > Similarly, it seems like
>> > it's probably safe to be able to open an fd that lets you watch the
>> > exit status of a process, have the process call setresuid(), and still
>> > see the exit status.
>>
>> Is it? That's an open question.
>
> Well, if we consider wait4(2) it seems that this is already the case.
> If you fork+exec a setuid binary you can definitely see its exit code.

Only if you're the parent. Otherwise, you can't see the process exit
status unless you pass a ptrace access check and consult
/proc/pid/stat after the process dies, but before the zombie
disappears. Random unrelated and unprivileged processes can't see exit
statuses from distant parts of the system.

>> > My POLLERR hack, aside from being ugly,
>> > avoids this particular issue because it merely lets you wait for
>> > something you already could have observed using readdir().
>>
>> Yes. I mentioned this same issue-punting as the motivation behind
>> exithand, initially, just reading EOF on exit.
>
> One question I have about EOF-on-exit is that if we wish to extend it to
> allow providing the exit status (which is something we discussed in the
> original thread), how will multiple-readers be handled in such a
> scenario?
> Would we be storing the exit status or siginfo in the
> equivalent of a locked memfd?

Yes, that's what I have in mind. A siginfo_t is small enough that we
could just store it as a blob allocated off the procfs inode or
something like that without bothering with a shmfs file. You'd be able
to read(2) the exit status as many times as you wanted.

2018-11-18 20:23:36

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:15 PM, Christian Brauner
<[email protected]> wrote:
>> That is, I'm proposing an API that looks like this:
>>
>> int process_kill(int procfs_dfd, int signo, const union sigval value)
>
> I've started a second tree with process_signal(int procpid_dfd, int sig)

Thanks.

> instead of an ioctl(). Why do you want sigval too?

API completeness. The sigqueue interface is a superset of kill, and I
don't want process_kill to do less than any PID-based kill. Maybe
taking a siginfo_t, like rt_sigqueueinfo does, would be even better in
that respect, come to think of it.

2018-11-18 20:30:53

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 11:44:19AM -0800, Daniel Colascione wrote:
> On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <[email protected]> wrote:
> > On 2018-11-18, Daniel Colascione <[email protected]> wrote:
> >> > Here's my point: if we're really going to make a new API to manipulate
> >> > processes by their fd, I think we should have at least a decent idea
> >> > of how that API will get extended in the future. Right now, we have
> >> > an extremely awkward situation where opening an fd in /proc requires
> >> > certain capabilities or uids, and using those fds often also checks
> >> > current's capabilities, and the target process may have changed its
> >> > own security context, including gaining privilege via SUID, SGID, or
> >> > LSM transition rules in the mean time. This has been a huge source of
> >> > security bugs. It would be nice to have a model for future APIs that
> >> > avoids these problems.
> >> >
> >> > And I didn't say in my proposal that a process's identity should
> >> > fundamentally change when it calls execve(). I'm suggesting that
> >> > certain operations that could cause a process to gain privilege or
> >> > otherwise require greater permission to introspect (mainly execve)
> >> > could be handled by invalidating the new process management fds.
> >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't
> >> > necessarily mean that:
> >> >
> >> > fd = process_open_management_fd(1);
> >> > [init reexecs]
> >> > process_do_something(fd);
> >> >
> >> > needs to work.
> >>
> >> PID 1 is a bad example here, because it doesn't get recycled. Other
> >> PIDs do. The snippet you gave *does* need to work, in general, because
> >> if exec invalidates the handle, and you need to reopen by PID to
> >> re-establish your right to do something with the process, that process
> >> may in fact have died between the invalidation and your reopen, and
> >> your reopened FD may refer to some other random process.
> >
> > I imagine the error would be -EPERM rather than -ESRCH in this case,
> > which would be incredibly trivial for userspace to differentiate
> > between.
>
> Why would userspace necessarily see EPERM? The PID might get recycled
> into a different random process that the caller has the ability to
> affect.
>
> > If you wish to re-open the path that is also trivial by
> > re-opening through /proc/self/fd/$fd -- which will re-do any permission
> > checks and will guarantee that you are re-opening the same 'struct file'
> > and thus the same 'struct pid'.
>
> When you reopen via /proc/self/fd, you get a new struct file
> referencing the existing inode, not the same struct file. A new
> reference to the old struct file would just be dup.
>
> Anyway: what other API requires, for correct operation, occasional
> reopening through /proc/self/fd? It's cumbersome, and it doesn't add
> anything. If we invalidate process handles on execve, and processes
> are legally allowed to re-exec themselves for arbitrary reasons at any
> time, that's tantamount to saying that handles might become invalid at
> any time and that all callers must be prepared to go through the
> reopen-and-retry path before any operation.
>
> Why are we making them do that? So that a process can have an open FD
> that represents a process-operation capability? Which capability does
> the open FD represent?
>
> I think when you and Andy must be talking about is an API that looks like this:
>
> int open_process_operation_handle(int procfs_dirfd, int capability_bitmask)
>
> capability_bitmask would have bits like
>
> PROCESS_CAPABILITY_KILL --- send a signal
> PROCESS_CAPABILITY_PTRACE --- attach to a process
> PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin
> PROCESS_CAPABILITY_READ_CMDLINE --- etc.
>
> Then you'd have system calls like
>
> int process_kill(int process_capability_fd, int signo, const union sigval data)
> int process_ptrace_attach(int process_capability_fd)
> int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info)
>
> that worked on these capability bits. If a process execs or does
> something else to change its security capabilities, operations on
> these capability FDs would fail with ESTALE or something and callers
> would have to re-acquire their capabilities.
>
> This approach works fine. It has some nice theoretical properties, and
> could allow for things like nicer privilege separation for debuggers.
> I wouldn't mind something like this getting into the kernel.
>
> I just don't think this model is necessary right now. I want a small
> change from what we have today, one likely to actually make it into
> the tree. And bypassing the capability FDs and just allowing callers
> to operate directly on process *identity* FDs, using privileges in
> effect at the time of all, is at least no worse than what we have now.
>
> That is, I'm proposing an API that looks like this:
>
> int process_kill(int procfs_dfd, int signo, const union sigval value)

I've started a second tree with process_signal(int procpid_dfd, int sig)
instead of an ioctl(). Why do you want sigval too?

>
> If, later, process_kill were to *also* accept process-capability FDs,
> nothing would break.
>
> >> The only way around this problem is to have two separate FDs --- one
> >> to represent process identity, which *must* be continuous across
> >> execve, and the other to represent some specific capability, some
> >> ability to do something to that process. It's reasonable to invalidate
> >> capability after execve, but it's not reasonable to invalidate
> >> identity. In concrete terms, I don't see a big advantage to this
> >> separation, and I think a single identity FD combined with
> >> per-operation capability checks is sufficient. And much simpler.
> >
> > I think that the error separation above would trivially allow user-space
> > to know whether the identity or capability of a process being monitored
> > has changed.
> >
> > Currently, all operations on a '/proc/$pid' which you've previously
> > opened and has died will give you -ESRCH.
>
> Not the case. Zombies have died, but profs operations work fine on zombies.
>
> >> > Similarly, it seems like
> >> > it's probably safe to be able to open an fd that lets you watch the
> >> > exit status of a process, have the process call setresuid(), and still
> >> > see the exit status.
> >>
> >> Is it? That's an open question.
> >
> > Well, if we consider wait4(2) it seems that this is already the case.
> > If you fork+exec a setuid binary you can definitely see its exit code.
>
> Only if you're the parent. Otherwise, you can't see the process exit
> status unless you pass a ptrace access check and consult
> /proc/pid/stat after the process dies, but before the zombie
> disappears. Random unrelated and unprivileged processes can't see exit
> statuses from distant parts of the system.
>
> >> > My POLLERR hack, aside from being ugly,
> >> > avoids this particular issue because it merely lets you wait for
> >> > something you already could have observed using readdir().
> >>
> >> Yes. I mentioned this same issue-punting as the motivation behind
> >> exithand, initially, just reading EOF on exit.
> >
> > One question I have about EOF-on-exit is that if we wish to extend it to
> > allow providing the exit status (which is something we discussed in the
> > original thread), how will multiple-readers be handled in such a
> > scenario?
> > Would we be storing the exit status or siginfo in the
> > equivalent of a locked memfd?
>
> Yes, that's what I have in mind. A siginfo_t is small enough that we
> could just store it as a blob allocated off the procfs inode or
> something like that without bothering with a shmfs file. You'd be able
> to read(2) the exit status as many times as you wanted.

2018-11-18 20:42:45

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:28 PM, Andy Lutomirski <[email protected]> wrote:
>> That is, I'm proposing an API that looks like this:
>>
>> int process_kill(int procfs_dfd, int signo, const union sigval value)
>>
>> If, later, process_kill were to *also* accept process-capability FDs,
>> nothing would break.
>
> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.

Sure. A flag might make for better ergonomics.

>> Yes, that's what I have in mind. A siginfo_t is small enough that we
>> could just store it as a blob allocated off the procfs inode or
>> something like that without bothering with a shmfs file. You'd be able
>> to read(2) the exit status as many times as you wanted.
>
> I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly.

Why? The file pointer would work normally.

> Read and poll have the same problem as write: we can’t check caps in read or poll either.

Why not? Reading /proc/pid/stat does an access check today and
conditionally replaces the exit status with zero.

2018-11-18 20:46:10

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
>
>
> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
> >
>
> >
> > That is, I'm proposing an API that looks like this:
> >
> > int process_kill(int procfs_dfd, int signo, const union sigval value)
> >
> > If, later, process_kill were to *also* accept process-capability FDs,
> > nothing would break.
>
> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.

I can add a flag argument
int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
That is siginfo_t is cleared and set to:

info.si_signo = sig;
info.si_errno = 0;
info.si_code = SI_USER;
info.si_pid = task_tgid_vnr(current);
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

>
> > Yes, that's what I have in mind. A siginfo_t is small enough that we
> > could just store it as a blob allocated off the procfs inode or
> > something like that without bothering with a shmfs file. You'd be able
> > to read(2) the exit status as many times as you wanted.
>
> I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly.
>
> Read and poll have the same problem as write: we can’t check caps in read or poll either.

2018-11-18 20:53:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors



> On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
>

>
> That is, I'm proposing an API that looks like this:
>
> int process_kill(int procfs_dfd, int signo, const union sigval value)
>
> If, later, process_kill were to *also* accept process-capability FDs,
> nothing would break.

Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.

> Yes, that's what I have in mind. A siginfo_t is small enough that we
> could just store it as a blob allocated off the procfs inode or
> something like that without bothering with a shmfs file. You'd be able
> to read(2) the exit status as many times as you wanted.

I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly.

Read and poll have the same problem as write: we can’t check caps in read or poll either.

2018-11-18 20:58:52

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner
<[email protected]> wrote:
> On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
>>
>>
>> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
>> >
>>
>> >
>> > That is, I'm proposing an API that looks like this:
>> >
>> > int process_kill(int procfs_dfd, int signo, const union sigval value)
>> >
>> > If, later, process_kill were to *also* accept process-capability FDs,
>> > nothing would break.
>>
>> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
>
> I can add a flag argument
> int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
> The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
> That is siginfo_t is cleared and set to:
>
> info.si_signo = sig;
> info.si_errno = 0;
> info.si_code = SI_USER;
> info.si_pid = task_tgid_vnr(current);
> info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

That makes sense. I just don't want to get into a situation where
callers feel that they *have* to use the PID-based APIs to send a
signal because process_kill doesn't offer some bit of functionality.

Are you imagining something like requiring info t be NULL unless flags
contains some "I have a siginfo_t" value?

BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the
passed-in si_pid and si_uid match what the kernel would set them to in
the kill(2) case. The whole point of SI_USER is that the recipient
knows that it can trust the origin information embedded in the
siginfo_t in the signal handler. If the kernel verifies that a signal
sender isn't actually lying, why not let people send SI_USER with
rt_sigqueueinfo?

2018-11-18 21:25:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote:
> On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner
> <[email protected]> wrote:
> > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
> >>
> >>
> >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
> >> >
> >>
> >> >
> >> > That is, I'm proposing an API that looks like this:
> >> >
> >> > int process_kill(int procfs_dfd, int signo, const union sigval value)
> >> >
> >> > If, later, process_kill were to *also* accept process-capability FDs,
> >> > nothing would break.
> >>
> >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
> >
> > I can add a flag argument
> > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
> > The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
> > That is siginfo_t is cleared and set to:
> >
> > info.si_signo = sig;
> > info.si_errno = 0;
> > info.si_code = SI_USER;
> > info.si_pid = task_tgid_vnr(current);
> > info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>
> That makes sense. I just don't want to get into a situation where
> callers feel that they *have* to use the PID-based APIs to send a
> signal because process_kill doesn't offer some bit of functionality.

Yeah.

>
> Are you imagining something like requiring info t be NULL unless flags
> contains some "I have a siginfo_t" value?

Well, I was actually thinking about something like:

/**
* sys_process_signal - send a signal to a process trough a process file descriptor
* @fd: the file descriptor of the process
* @sig: signal to be sent
* @info: the signal info
* @flags: future flags to be passed
*/
SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
int, flags)
{
struct pid *pid;
struct fd *f;
kernel_siginfo_t kinfo;

/* Do not allow users to pass garbage. */
if (flags)
return -EINVAL;

int ret = __copy_siginfo_from_user(sig, &kinfo, info);
if (unlikely(ret))
return ret;

/* For now, enforce that caller's creds are used. */
kinfo.si_pid = task_tgid_vnr(current);
kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid());

if (signal_impersonates_kernel(kinfo))
return -EPERM;

f = fdget(fd);
if (!f.file)
return -EBADF;

pid = f.file->private_data;
if (!pid)
return -EBADF;

return kill_pid_info(sig, kinfo, pid);
}

>
> BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the
> passed-in si_pid and si_uid match what the kernel would set them to in
> the kill(2) case. The whole point of SI_USER is that the recipient
> knows that it can trust the origin information embedded in the
> siginfo_t in the signal handler. If the kernel verifies that a signal
> sender isn't actually lying, why not let people send SI_USER with
> rt_sigqueueinfo?

2018-11-18 21:58:56

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote:
> On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote:
> > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner
> > <[email protected]> wrote:
> > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
> > >>
> > >>
> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
> > >> >
> > >>
> > >> >
> > >> > That is, I'm proposing an API that looks like this:
> > >> >
> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value)
> > >> >
> > >> > If, later, process_kill were to *also* accept process-capability FDs,
> > >> > nothing would break.
> > >>
> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
> > >
> > > I can add a flag argument
> > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
> > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
> > > That is siginfo_t is cleared and set to:
> > >
> > > info.si_signo = sig;
> > > info.si_errno = 0;
> > > info.si_code = SI_USER;
> > > info.si_pid = task_tgid_vnr(current);
> > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> >
> > That makes sense. I just don't want to get into a situation where
> > callers feel that they *have* to use the PID-based APIs to send a
> > signal because process_kill doesn't offer some bit of functionality.
>
> Yeah.
>
> >
> > Are you imagining something like requiring info t be NULL unless flags
> > contains some "I have a siginfo_t" value?
>
> Well, I was actually thinking about something like:
>
> /**
> * sys_process_signal - send a signal to a process trough a process file descriptor
> * @fd: the file descriptor of the process
> * @sig: signal to be sent
> * @info: the signal info
> * @flags: future flags to be passed
> */
> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
> int, flags)
> {
> struct pid *pid;
> struct fd *f;
> kernel_siginfo_t kinfo;
>
> /* Do not allow users to pass garbage. */
> if (flags)
> return -EINVAL;
>
> int ret = __copy_siginfo_from_user(sig, &kinfo, info);
> if (unlikely(ret))
> return ret;
>
> /* For now, enforce that caller's creds are used. */
> kinfo.si_pid = task_tgid_vnr(current);
> kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>
> if (signal_impersonates_kernel(kinfo))
> return -EPERM;
>
> f = fdget(fd);
> if (!f.file)
> return -EBADF;
>
> pid = f.file->private_data;
> if (!pid)
> return -EBADF;
>
> return kill_pid_info(sig, kinfo, pid);
> }

Just jotted this down here briefly. This will need an fput and so on
obvs.

>
> >
> > BTW: passing SI_USER to rt_sigqueueinfo *should* as long as the
> > passed-in si_pid and si_uid match what the kernel would set them to in
> > the kill(2) case. The whole point of SI_USER is that the recipient
> > knows that it can trust the origin information embedded in the
> > siginfo_t in the signal handler. If the kernel verifies that a signal
> > sender isn't actually lying, why not let people send SI_USER with
> > rt_sigqueueinfo?

2018-11-19 00:10:39

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On 2018-11-18, Daniel Colascione <[email protected]> wrote:
> > The gist is to have file descriptors for processes which is obviously not a new
> > idea. This has been done before in other OSes and it has been tried before in
> > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
> > make it very clear that I'm not laying claim to this being my or even a novel
> > idea in any way. However, I want to diverge from previous approaches with my
> > suggestion. (Though I can't be sure that there's not something similar in other
> > OSes already.)
>
> Windows works basically as you describe. You can create a process is
> suspended state, configure it however you want, then let it run.
> CreateProcess (and even moreso, NtCreateProcess) also provide a rich
> (and *extensible*) interface for pre-creation process configuration.
>
> >> One of the main motivations for having procfds is to have a race-free way of
> > configuring, starting, polling, and killing a process. Basically, a process
> > lifecycle api if you want to think about it that way. The api should also be
> > easily extendable in the future to avoid running into the limitations we
> > currently see with the clone*() syscall(s) again.
> >
> > One of the crucial points of the api is to *separate the configuration
> > of a process through a procfd from actually creating the process*.
> > This is a crucial property expressed in the open*() system calls. First, get a
> > stable handle on an object then allow for ways to configure it. As such the
> > procfd api shares the same insight with Al's and David's new mount api.
> > (Fwiw, Andy also pointed out similarities with posix_spawn().)
> > What I envisioned was to have the following syscalls (multiple name suggestions):
> >
> > 1. int process_open / proc_open / procopen
> > 2. int process_config / proc_config / procconfig or ioctl()-based
> > 3. int process_info / proc_info / procinfo or ioctl()-based
> > 4. int process_manage / proc_manage / procmanage or ioctl()-based
>
> The API you've proposed seems fine to me, although I'd either 1)
> consolidate operations further into one system call, or 2) separate
> the different management operations into more and different system
> calls that can be audited independently. The grouping you've proposed
> seems to have the worst aspects of API splitting and API multiplexing.
> But I have no objection to it in spirit.

I think combining it all into one API is going to be a soft re-invention
of ioctls, but specifically for procfds. This would be an improvement
over just ioctls (since the current ioctl namespacing is based on
well-behaved drivers and hoping we never have more than 256 ioctl
drivers), but I'm not sure it would help make the API nicer than having
separate syscalls (we'd have to do something similar to bpf(2) which I'm
not a huge fan of).

> That said, while I do want to fix process configuration and startup
> generally, I want to fix specific holes in the existing API surface
> first. The two patches I've already sent do that, and this work
> shouldn't wait on an ideal larger process-API overhaul that may or may
> not arrive. Based on previous history, I suspect that an API of the
> scope you're proposing would take years to overcome all LKML
> objections and land. I don't want to wait that long when we can make
> smaller fixes that would not conflict with the general architecture.

I believe this is precisely what Christian is trying to do with this
patch (and you say as much later in your mail).

I think that adding all of {sighand,sighand_exitcode,kill,...} would not
help the path of landing a much larger API change. We should instead
think about the API we want at the end of the day, and then land smaller
changes which are long-term compatible (and won't just become deprecated
APIs -- there's far too many of them, let's not add more needlessly).

If the plan is to have an ioctl API we should merge minor ioctls first.
If the idea is to have an explosion of syscalls, then we should merge
minor syscalls first. We shouldn't merge procfiles if the end goal is to
not use them.

> Next, I want to merge my exithand proposal, or something like it. It's
> likewise a simple change that, in a minimal way, addresses a
> longstanding API deficiency. I'm very strongly against the
> POLLERR-on-directory variant of the idea.

I agree with you on this need. I will admit I do somewhat like the EOF
solution (mainly because it seamlessly deals with the multi-reader case)
but I'm still not sure I like /proc/$pid/exit_sighand. As mentioned in
the other discussion, ideally we would be only ever operating with the

An ugly strawman of an alternative would be an interface that gave you
an fd that you could similarly wait-until-EOF on, but that's probably
not a major API improvement unless we also make said API provide exit
status information in a way that works with the
multiple-readers-with-different-creds usecase.

One other thing I think we should eventually consider is to provide an
API which pings a listener whenever a process does an execve() (and
possibly fork()). This is something you can get from FreeBSD's kqueue --
and is something that we have in a really neutered form in the "proc
connector". But of course we can discuss this separately, especially if
we have an extensible API idea in mind when we start.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (5.44 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 00:12:47

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On 2018-11-18, Daniel Colascione <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <[email protected]> wrote:
> > On 2018-11-18, Daniel Colascione <[email protected]> wrote:
> >> > Here's my point: if we're really going to make a new API to manipulate
> >> > processes by their fd, I think we should have at least a decent idea
> >> > of how that API will get extended in the future. Right now, we have
> >> > an extremely awkward situation where opening an fd in /proc requires
> >> > certain capabilities or uids, and using those fds often also checks
> >> > current's capabilities, and the target process may have changed its
> >> > own security context, including gaining privilege via SUID, SGID, or
> >> > LSM transition rules in the mean time. This has been a huge source of
> >> > security bugs. It would be nice to have a model for future APIs that
> >> > avoids these problems.
> >> >
> >> > And I didn't say in my proposal that a process's identity should
> >> > fundamentally change when it calls execve(). I'm suggesting that
> >> > certain operations that could cause a process to gain privilege or
> >> > otherwise require greater permission to introspect (mainly execve)
> >> > could be handled by invalidating the new process management fds.
> >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't
> >> > necessarily mean that:
> >> >
> >> > fd = process_open_management_fd(1);
> >> > [init reexecs]
> >> > process_do_something(fd);
> >> >
> >> > needs to work.
> >>
> >> PID 1 is a bad example here, because it doesn't get recycled. Other
> >> PIDs do. The snippet you gave *does* need to work, in general, because
> >> if exec invalidates the handle, and you need to reopen by PID to
> >> re-establish your right to do something with the process, that process
> >> may in fact have died between the invalidation and your reopen, and
> >> your reopened FD may refer to some other random process.
> >
> > I imagine the error would be -EPERM rather than -ESRCH in this case,
> > which would be incredibly trivial for userspace to differentiate
> > between.
>
> Why would userspace necessarily see EPERM? The PID might get recycled
> into a different random process that the caller has the ability to
> affect.

I'm not sure what you're talking about. execve() doesn't change the PID
of a process, and in the case we are talking about:

pidX_handle = open_pid_handle(pidX);
[ pidX execs a setuid binary ]
do_something(pidX_handle);

pidX still has the same PID (so PID recycling is irrelevant in this
case). The key point is whether do_something() should give you an error
in such a state transition, and in that case I would say you'd get
-EPERM which would indicate (obviously) insufficient privileges.

If the PID has died you'd get -ESRCH. Even if it was eventually
recycled. Because you've pinned a 'struct pid'.

> > If you wish to re-open the path that is also trivial by
> > re-opening through /proc/self/fd/$fd -- which will re-do any permission
> > checks and will guarantee that you are re-opening the same 'struct file'
> > and thus the same 'struct pid'.
>
> When you reopen via /proc/self/fd, you get a new struct file
> referencing the existing inode, not the same struct file. A new
> reference to the old struct file would just be dup.

I don't think this is really relevant to what I'm trying to say...

> Anyway: what other API requires, for correct operation, occasional
> reopening through /proc/self/fd? It's cumbersome, and it doesn't add
> anything. If we invalidate process handles on execve, and processes
> are legally allowed to re-exec themselves for arbitrary reasons at any
> time, that's tantamount to saying that handles might become invalid at
> any time and that all callers must be prepared to go through the
> reopen-and-retry path before any operation.

O_PATH. In container runtimes this is necessary for several reasons to
protect against malicious container root filesystems as well as avoiding
exposing a dirfd to the container.

In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other
operations. In runc we use it for FIFO re-opening so that we can signal
pid1 in a container to execve() into user code.

So this isn't a new thing.

> Why are we making them do that? So that a process can have an open FD
> that represents a process-operation capability? Which capability does
> the open FD represent?

The re-opening part was just an argument to show that there isn't a
condition where you wouldn't be able to get access to the 'struct pid'.
I doubt that anyone would actually need to use this -- since you'd need
to pass "/proc/pid/fd/..." to a more privileged process in order to use
the re-opening.

But this also means that we don't need to have a concept of a pidfd that
isn't actually associated with a PID but is instead associated with
current->mm (which is what you appear to be proposing with the whole
"identity fd" concept).

> I think when you and Andy must be talking about is an API that looks like this:
>
> int open_process_operation_handle(int procfs_dirfd, int capability_bitmask)
>
> capability_bitmask would have bits like
>
> PROCESS_CAPABILITY_KILL --- send a signal
> PROCESS_CAPABILITY_PTRACE --- attach to a process
> PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin
> PROCESS_CAPABILITY_READ_CMDLINE --- etc.
>
> Then you'd have system calls like
>
> int process_kill(int process_capability_fd, int signo, const union sigval data)
> int process_ptrace_attach(int process_capability_fd)
> int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info)
>
> that worked on these capability bits. If a process execs or does
> something else to change its security capabilities, operations on
> these capability FDs would fail with ESTALE or something and callers
> would have to re-acquire their capabilities.
>
> This approach works fine. It has some nice theoretical properties, and
> could allow for things like nicer privilege separation for debuggers.
> I wouldn't mind something like this getting into the kernel.

Andy might be arguing for this (and as you said, I can see the benefit
of doing it this way).

I'm not convinced that doing permission checks on-open is necessary here
-- I get Andy's point about write(2) semantics but I think a new set of
proc_* syscalls wouldn't need to follow those semantics. I might be
wrong though.

> I just don't think this model is necessary right now. I want a small
> change from what we have today, one likely to actually make it into
> the tree. And bypassing the capability FDs and just allowing callers
> to operate directly on process *identity* FDs, using privileges in
> effect at the time of all, is at least no worse than what we have now.
>
> That is, I'm proposing an API that looks like this:
>
> int process_kill(int procfs_dfd, int signo, const union sigval value)
>
> If, later, process_kill were to *also* accept process-capability FDs,
> nothing would break.

Again, I think we should agree on whether it's necessary to have both
types of fds before we commit to maintaining both APIs forever...

> >> The only way around this problem is to have two separate FDs --- one
> >> to represent process identity, which *must* be continuous across
> >> execve, and the other to represent some specific capability, some
> >> ability to do something to that process. It's reasonable to invalidate
> >> capability after execve, but it's not reasonable to invalidate
> >> identity. In concrete terms, I don't see a big advantage to this
> >> separation, and I think a single identity FD combined with
> >> per-operation capability checks is sufficient. And much simpler.
> >
> > I think that the error separation above would trivially allow user-space
> > to know whether the identity or capability of a process being monitored
> > has changed.
> >
> > Currently, all operations on a '/proc/$pid' which you've previously
> > opened and has died will give you -ESRCH.
>
> Not the case. Zombies have died, but profs operations work fine on zombies.

It is the case if the process is dead in the sense that the PID might be
re-used. That is what I meant be "dead" here, not semi-dead in the sense
that zombies are.

> >> > Similarly, it seems like
> >> > it's probably safe to be able to open an fd that lets you watch the
> >> > exit status of a process, have the process call setresuid(), and still
> >> > see the exit status.
> >>
> >> Is it? That's an open question.
> >
> > Well, if we consider wait4(2) it seems that this is already the case.
> > If you fork+exec a setuid binary you can definitely see its exit code.
>
> Only if you're the parent. Otherwise, you can't see the process exit
> status unless you pass a ptrace access check and consult
> /proc/pid/stat after the process dies, but before the zombie
> disappears. Random unrelated and unprivileged processes can't see exit
> statuses from distant parts of the system.

Sure, I'd propose that ptrace_may_access() is what we should use for
operation permission checks.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (9.16 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-19 00:35:13

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 1:30 PM, Christian Brauner <[email protected]> wrote:
> On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote:
>> On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote:
>> > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner
>> > <[email protected]> wrote:
>> > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
>> > >>
>> > >>
>> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
>> > >> >
>> > >>
>> > >> >
>> > >> > That is, I'm proposing an API that looks like this:
>> > >> >
>> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value)
>> > >> >
>> > >> > If, later, process_kill were to *also* accept process-capability FDs,
>> > >> > nothing would break.
>> > >>
>> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
>> > >
>> > > I can add a flag argument
>> > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
>> > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
>> > > That is siginfo_t is cleared and set to:
>> > >
>> > > info.si_signo = sig;
>> > > info.si_errno = 0;
>> > > info.si_code = SI_USER;
>> > > info.si_pid = task_tgid_vnr(current);
>> > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>> >
>> > That makes sense. I just don't want to get into a situation where
>> > callers feel that they *have* to use the PID-based APIs to send a
>> > signal because process_kill doesn't offer some bit of functionality.
>>
>> Yeah.
>>
>> >
>> > Are you imagining something like requiring info t be NULL unless flags
>> > contains some "I have a siginfo_t" value?
>>
>> Well, I was actually thinking about something like:
>>
>> /**
>> * sys_process_signal - send a signal to a process trough a process file descriptor
>> * @fd: the file descriptor of the process
>> * @sig: signal to be sent
>> * @info: the signal info
>> * @flags: future flags to be passed
>> */
>> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
>> int, flags)
>> {
>> struct pid *pid;
>> struct fd *f;
>> kernel_siginfo_t kinfo;
>>
>> /* Do not allow users to pass garbage. */
>> if (flags)
>> return -EINVAL;
>>
>> int ret = __copy_siginfo_from_user(sig, &kinfo, info);
>> if (unlikely(ret))
>> return ret;
>>
>> /* For now, enforce that caller's creds are used. */
>> kinfo.si_pid = task_tgid_vnr(current);
>> kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid());

How about doing it this way? If info is NULL, act like kill(2);
otherwise, act like rt_sigqueueinfo(2).

(Not actual working or compiled code.)

SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
int, flags)
{
struct fd f = { 0 };
kernel_siginfo_t kinfo;
int ret;

/* Make API extension possible. */
ret = -EINVAL;
if (flags)
goto out;

ret = -EBADF;
f = fdget(fd);
if (!f.file)
goto out;

ret = mumble_mumble_check_real_proc_file(f.file);
if (ret)
goto out;

/* Act like kill(2) or rt_sigqueueinfo(2) depending on whether
* the user gave us a siginfo structure.
*/
if (info) {
ret = __copy_siginfo_from_user(sig, &kinfo, info);
if (ret)
goto out;
/* Combine this logic with rt_sigqueueinfo(2) */
ret = -EPERM;
if ((info->si_code >= 0 || info->si_code == SI_TKILL) &&
(task_pid_vnr(current) != pid))
goto out;

} else {
/* Combine this logic with kill(2) */
clear_siginfo(&kinfo);
kinfo.si_signo = sig;
kinfo.si_errno = 0;
kinfo.si_code = SI_USER;
kinfo.si_pid = task_tgid_vnr(current);
kinfo.si_uid = from_kuid_munged(current_user_ns(),
current_uid());
}

ret = kill_pid_info(sig, &kinfo, proc_pid(file_inode(f.file)));

out:
if (f.file)
fput(f);
return ret;
}

2018-11-19 00:44:03

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 04:31:22PM -0800, Daniel Colascione wrote:
> On Sun, Nov 18, 2018 at 1:30 PM, Christian Brauner <[email protected]> wrote:
> > On Sun, Nov 18, 2018 at 10:23:36PM +0100, Christian Brauner wrote:
> >> On Sun, Nov 18, 2018 at 12:54:10PM -0800, Daniel Colascione wrote:
> >> > On Sun, Nov 18, 2018 at 12:43 PM, Christian Brauner
> >> > <[email protected]> wrote:
> >> > > On Sun, Nov 18, 2018 at 01:28:41PM -0700, Andy Lutomirski wrote:
> >> > >>
> >> > >>
> >> > >> > On Nov 18, 2018, at 12:44 PM, Daniel Colascione <[email protected]> wrote:
> >> > >> >
> >> > >>
> >> > >> >
> >> > >> > That is, I'm proposing an API that looks like this:
> >> > >> >
> >> > >> > int process_kill(int procfs_dfd, int signo, const union sigval value)
> >> > >> >
> >> > >> > If, later, process_kill were to *also* accept process-capability FDs,
> >> > >> > nothing would break.
> >> > >>
> >> > >> Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
> >> > >
> >> > > I can add a flag argument
> >> > > int process_signal(int procfs_dfd, int signo, siginfo_t *info, int flags)
> >> > > The way I see it process_signal() should be equivalent to kill(pid, signal) for now.
> >> > > That is siginfo_t is cleared and set to:
> >> > >
> >> > > info.si_signo = sig;
> >> > > info.si_errno = 0;
> >> > > info.si_code = SI_USER;
> >> > > info.si_pid = task_tgid_vnr(current);
> >> > > info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
> >> >
> >> > That makes sense. I just don't want to get into a situation where
> >> > callers feel that they *have* to use the PID-based APIs to send a
> >> > signal because process_kill doesn't offer some bit of functionality.
> >>
> >> Yeah.
> >>
> >> >
> >> > Are you imagining something like requiring info t be NULL unless flags
> >> > contains some "I have a siginfo_t" value?
> >>
> >> Well, I was actually thinking about something like:
> >>
> >> /**
> >> * sys_process_signal - send a signal to a process trough a process file descriptor
> >> * @fd: the file descriptor of the process
> >> * @sig: signal to be sent
> >> * @info: the signal info
> >> * @flags: future flags to be passed
> >> */
> >> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
> >> int, flags)
> >> {
> >> struct pid *pid;
> >> struct fd *f;
> >> kernel_siginfo_t kinfo;
> >>
> >> /* Do not allow users to pass garbage. */
> >> if (flags)
> >> return -EINVAL;
> >>
> >> int ret = __copy_siginfo_from_user(sig, &kinfo, info);
> >> if (unlikely(ret))
> >> return ret;
> >>
> >> /* For now, enforce that caller's creds are used. */
> >> kinfo.si_pid = task_tgid_vnr(current);
> >> kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>
> How about doing it this way? If info is NULL, act like kill(2);
> otherwise, act like rt_sigqueueinfo(2).
>
> (Not actual working or compiled code.)
>
> SYSCALL_DEFINE4(process_signal, int, fd, int, sig, siginfo_t __user *, info,
> int, flags)
> {
> struct fd f = { 0 };
> kernel_siginfo_t kinfo;
> int ret;
>
> /* Make API extension possible. */
> ret = -EINVAL;
> if (flags)
> goto out;
>
> ret = -EBADF;
> f = fdget(fd);
> if (!f.file)
> goto out;
>
> ret = mumble_mumble_check_real_proc_file(f.file);
> if (ret)
> goto out;
>
> /* Act like kill(2) or rt_sigqueueinfo(2) depending on whether
> * the user gave us a siginfo structure.
> */
> if (info) {
> ret = __copy_siginfo_from_user(sig, &kinfo, info);
> if (ret)
> goto out;
> /* Combine this logic with rt_sigqueueinfo(2) */
> ret = -EPERM;
> if ((info->si_code >= 0 || info->si_code == SI_TKILL) &&
> (task_pid_vnr(current) != pid))
> goto out;
>
> } else {
> /* Combine this logic with kill(2) */
> clear_siginfo(&kinfo);
> kinfo.si_signo = sig;
> kinfo.si_errno = 0;
> kinfo.si_code = SI_USER;
> kinfo.si_pid = task_tgid_vnr(current);
> kinfo.si_uid = from_kuid_munged(current_user_ns(),
> current_uid());
> }
>
> ret = kill_pid_info(sig, &kinfo, proc_pid(file_inode(f.file)));
>
> out:
> if (f.file)
> fput(f);
> return ret;
> }

Right, allowing to ass NULL might make sense. I had:

/**
* sys_process_signal - send a signal to a process trough a process file descriptor
* @fd: the file descriptor of the process
* @sig: signal to be sent
* @info: the signal info
* @flags: future flags to be passed
*/
SYSCALL_DEFINE4(procfd_kill, int, fd, int, sig, siginfo_t __user *, info, int, flags)
{
int ret;
struct pid *pid;
kernel_siginfo_t kinfo;
struct fd f;

/* Do not allow users to pass garbage. */
if (flags)
return -EINVAL;

ret = __copy_siginfo_from_user(sig, &kinfo, info);
if (unlikely(ret))
return ret;

/* For now, enforce that caller's creds are used. */
kinfo.si_pid = task_tgid_vnr(current);
kinfo.si_uid = from_kuid_munged(current_user_ns(), current_uid());

f = fdget_raw(fd);
if (!f.file)
return -EBADF;

ret = -EINVAL;
/* Is this a process file descriptor? */
if (!proc_is_procfd(f.file) || !d_is_dir(f.file->f_path.dentry))
goto err;

pid = f.file->private_data;
if (!pid)
goto err;

ret = -EPERM;
/*
* Not even root can pretend to send signals from the kernel.
* Nor can they impersonate a kill()/tgkill(), which adds source info.
*/
if ((kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL) &&
(task_pid(current) != pid))
goto err;

ret = kill_pid_info(sig, &kinfo, pid);

err:
fdput(f);
return ret;
}

2018-11-19 00:55:22

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 4:09 PM, Aleksa Sarai <[email protected]> wrote:
> On 2018-11-18, Daniel Colascione <[email protected]> wrote:
>> On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai <[email protected]> wrote:
>> > On 2018-11-18, Daniel Colascione <[email protected]> wrote:
>> >> > Here's my point: if we're really going to make a new API to manipulate
>> >> > processes by their fd, I think we should have at least a decent idea
>> >> > of how that API will get extended in the future. Right now, we have
>> >> > an extremely awkward situation where opening an fd in /proc requires
>> >> > certain capabilities or uids, and using those fds often also checks
>> >> > current's capabilities, and the target process may have changed its
>> >> > own security context, including gaining privilege via SUID, SGID, or
>> >> > LSM transition rules in the mean time. This has been a huge source of
>> >> > security bugs. It would be nice to have a model for future APIs that
>> >> > avoids these problems.
>> >> >
>> >> > And I didn't say in my proposal that a process's identity should
>> >> > fundamentally change when it calls execve(). I'm suggesting that
>> >> > certain operations that could cause a process to gain privilege or
>> >> > otherwise require greater permission to introspect (mainly execve)
>> >> > could be handled by invalidating the new process management fds.
>> >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't
>> >> > necessarily mean that:
>> >> >
>> >> > fd = process_open_management_fd(1);
>> >> > [init reexecs]
>> >> > process_do_something(fd);
>> >> >
>> >> > needs to work.
>> >>
>> >> PID 1 is a bad example here, because it doesn't get recycled. Other
>> >> PIDs do. The snippet you gave *does* need to work, in general, because
>> >> if exec invalidates the handle, and you need to reopen by PID to
>> >> re-establish your right to do something with the process, that process
>> >> may in fact have died between the invalidation and your reopen, and
>> >> your reopened FD may refer to some other random process.
>> >
>> > I imagine the error would be -EPERM rather than -ESRCH in this case,
>> > which would be incredibly trivial for userspace to differentiate
>> > between.
>>
>> Why would userspace necessarily see EPERM? The PID might get recycled
>> into a different random process that the caller has the ability to
>> affect.
>
> I'm not sure what you're talking about. execve() doesn't change the PID
> of a process, and in the case we are talking about:
>
> pidX_handle = open_pid_handle(pidX);
> [ pidX execs a setuid binary ]
> do_something(pidX_handle);
>
> pidX still has the same PID (so PID recycling is irrelevant in this
> case). The key point is whether do_something() should give you an error
> in such a state transition, and in that case I would say you'd get
> -EPERM which would indicate (obviously) insufficient privileges.

EPERM is the wrong error. All that's happened here is that the process
has execed itself; you may still have permission to operate on the
post-execve process. ESTALE is the right error here.

But yes, there is a PID trace. What do you do after getting ESTALE?
You reopen the handle and retry your operation. How do you open a new
handle? Unless you're using some awful /proc/self/fd/... hack, you
reopen by PID. And at that point, you've introduced a PID race again.
That's why, in my sketch below, I imagined creating the capability
handle from the process-identity handle and not, as in the snippet
above, directly from the PID.

>> Anyway: what other API requires, for correct operation, occasional
>> reopening through /proc/self/fd? It's cumbersome, and it doesn't add
>> anything. If we invalidate process handles on execve, and processes
>> are legally allowed to re-exec themselves for arbitrary reasons at any
>> time, that's tantamount to saying that handles might become invalid at
>> any time and that all callers must be prepared to go through the
>> reopen-and-retry path before any operation.
>
> O_PATH. In container runtimes this is necessary for several reasons to
> protect against malicious container root filesystems as well as avoiding
> exposing a dirfd to the container.
>
> In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other
> operations. In runc we use it for FIFO re-opening so that we can signal
> pid1 in a container to execve() into user code.
>
> So this isn't a new thing.

Yuck. I'd still argue that 1) the reopen trick isn't really intended
as the mainline path for that kernel functionality, and 2) there ought
to be a way to do what you're describing in a cleaner way. I'd
classify this approach as a hack. It's one thing to require a hack in
specialized container initialization code, but it's another to bake it
into a hopefully-common API for something as fundamental as process
management, especially when there's a perfectly good alternative that
doesn't require this hack.

>> Why are we making them do that? So that a process can have an open FD
>> that represents a process-operation capability? Which capability does
>> the open FD represent?
>
> The re-opening part was just an argument to show that there isn't a
> condition where you wouldn't be able to get access to the 'struct pid'.
> I doubt that anyone would actually need to use this -- since you'd need
> to pass "/proc/pid/fd/..." to a more privileged process in order to use
> the re-opening.
>
> But this also means that we don't need to have a concept of a pidfd that
> isn't actually associated with a PID but is instead associated with
> current->mm (which is what you appear to be proposing with the whole
> "identity fd" concept).

Not current->mm; that can be shared with clone. struct signal is the
right long-term identity. It's usually easier to keep the struct pid
around though, which is exactly what a procfs FD is today: just a
lightweight handle to a struct pid.

>> I think when you and Andy must be talking about is an API that looks like this:
>>
>> int open_process_operation_handle(int procfs_dirfd, int capability_bitmask)
>>
>> capability_bitmask would have bits like
>>
>> PROCESS_CAPABILITY_KILL --- send a signal
>> PROCESS_CAPABILITY_PTRACE --- attach to a process
>> PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin
>> PROCESS_CAPABILITY_READ_CMDLINE --- etc.
>>
>> Then you'd have system calls like
>>
>> int process_kill(int process_capability_fd, int signo, const union sigval data)
>> int process_ptrace_attach(int process_capability_fd)
>> int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info)
>>
>> that worked on these capability bits. If a process execs or does
>> something else to change its security capabilities, operations on
>> these capability FDs would fail with ESTALE or something and callers
>> would have to re-acquire their capabilities.
>>
>> This approach works fine. It has some nice theoretical properties, and
>> could allow for things like nicer privilege separation for debuggers.
>> I wouldn't mind something like this getting into the kernel.
>
> Andy might be arguing for this (and as you said, I can see the benefit
> of doing it this way).
>
> I'm not convinced that doing permission checks on-open is necessary here
> -- I get Andy's point about write(2) semantics but I think a new set of
> proc_* syscalls wouldn't need to follow those semantics. I might be
> wrong though.

For now, it's fine to just expose system calls that operate directly
on the procfs dfd.

>> I just don't think this model is necessary right now. I want a small
>> change from what we have today, one likely to actually make it into
>> the tree. And bypassing the capability FDs and just allowing callers
>> to operate directly on process *identity* FDs, using privileges in
>> effect at the time of all, is at least no worse than what we have now.
>>
>> That is, I'm proposing an API that looks like this:
>>
>> int process_kill(int procfs_dfd, int signo, const union sigval value)
>>
>> If, later, process_kill were to *also* accept process-capability FDs,
>> nothing would break.
>
> Again, I think we should agree on whether it's necessary to have both
> types of fds before we commit to maintaining both APIs forever...

I don't think noting that an API *could* be extended in a certain way
in the future creates any obligation to decide, immediately, whether
that extension will ever be needed. Right now, I don't see a reason to
supply the capability FD API I described. I'm just saying that it
could be added in a low-friction way if necessary one day.

>> >> > Similarly, it seems like
>> >> > it's probably safe to be able to open an fd that lets you watch the
>> >> > exit status of a process, have the process call setresuid(), and still
>> >> > see the exit status.
>> >>
>> >> Is it? That's an open question.
>> >
>> > Well, if we consider wait4(2) it seems that this is already the case.
>> > If you fork+exec a setuid binary you can definitely see its exit code.
>>
>> Only if you're the parent. Otherwise, you can't see the process exit
>> status unless you pass a ptrace access check and consult
>> /proc/pid/stat after the process dies, but before the zombie
>> disappears. Random unrelated and unprivileged processes can't see exit
>> statuses from distant parts of the system.
>
> Sure, I'd propose that ptrace_may_access() is what we should use for
> operation permission checks.

The tricky part is that ptrace_may_access takes a struct task. We want
logic that's *like* ptrace_may_access, but that works posthumously.
It's especially tricky because there's an LSM hook that lets
__ptrace_may_access do arbitrary things. And we can't just run that
hook upon process death, since *after* a process dies, a process
holding an exithand FD (or whatever we call it) may pass that FD to
another process, and *that* process can read(2) from it.

Another option is doing the exithand access check at open time. I
think that's probably fine, and it would make things a lot simpler.
But if we use this option, we should understand what we're doing, and
get some security-conscious people to think through the implications.

2018-11-19 01:23:09

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 4:53 PM, Daniel Colascione <[email protected]> wrote:
>> Sure, I'd propose that ptrace_may_access() is what we should use for
>> operation permission checks.
>
> The tricky part is that ptrace_may_access takes a struct task. We want
> logic that's *like* ptrace_may_access, but that works posthumously.
> It's especially tricky because there's an LSM hook that lets
> __ptrace_may_access do arbitrary things. And we can't just run that
> hook upon process death, since *after* a process dies, a process
> holding an exithand FD (or whatever we call it) may pass that FD to
> another process, and *that* process can read(2) from it.
>
> Another option is doing the exithand access check at open time. I
> think that's probably fine, and it would make things a lot simpler.
> But if we use this option, we should understand what we're doing, and
> get some security-conscious people to think through the implications.

A ptrace check is also probably too strict. Yama's ptrace_scope
feature will block ptrace between unrelated processes within a single
user context, but applying this restriction to exit code monitoring
seems too severe to me.

2018-11-19 01:44:34

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 4:08 PM, Aleksa Sarai <[email protected]> wrote:
> On 2018-11-18, Daniel Colascione <[email protected]> wrote:
>> > The gist is to have file descriptors for processes which is obviously not a new
>> > idea. This has been done before in other OSes and it has been tried before in
>> > Linux [2], [3] (Thanks to Kees for pointing out these patches.). So I want to
>> > make it very clear that I'm not laying claim to this being my or even a novel
>> > idea in any way. However, I want to diverge from previous approaches with my
>> > suggestion. (Though I can't be sure that there's not something similar in other
>> > OSes already.)
>>
>> Windows works basically as you describe. You can create a process is
>> suspended state, configure it however you want, then let it run.
>> CreateProcess (and even moreso, NtCreateProcess) also provide a rich
>> (and *extensible*) interface for pre-creation process configuration.
>>
>> >> One of the main motivations for having procfds is to have a race-free way of
>> > configuring, starting, polling, and killing a process. Basically, a process
>> > lifecycle api if you want to think about it that way. The api should also be
>> > easily extendable in the future to avoid running into the limitations we
>> > currently see with the clone*() syscall(s) again.
>> >
>> > One of the crucial points of the api is to *separate the configuration
>> > of a process through a procfd from actually creating the process*.
>> > This is a crucial property expressed in the open*() system calls. First, get a
>> > stable handle on an object then allow for ways to configure it. As such the
>> > procfd api shares the same insight with Al's and David's new mount api.
>> > (Fwiw, Andy also pointed out similarities with posix_spawn().)
>> > What I envisioned was to have the following syscalls (multiple name suggestions):
>> >
>> > 1. int process_open / proc_open / procopen
>> > 2. int process_config / proc_config / procconfig or ioctl()-based
>> > 3. int process_info / proc_info / procinfo or ioctl()-based
>> > 4. int process_manage / proc_manage / procmanage or ioctl()-based
>>
>> The API you've proposed seems fine to me, although I'd either 1)
>> consolidate operations further into one system call, or 2) separate
>> the different management operations into more and different system
>> calls that can be audited independently. The grouping you've proposed
>> seems to have the worst aspects of API splitting and API multiplexing.
>> But I have no objection to it in spirit.
>
> I think combining it all into one API is going to be a soft re-invention
> of ioctls, but specifically for procfds. This would be an improvement
> over just ioctls (since the current ioctl namespacing is based on
> well-behaved drivers and hoping we never have more than 256 ioctl
> drivers), but I'm not sure it would help make the API nicer than having
> separate syscalls (we'd have to do something similar to bpf(2) which I'm
> not a huge fan of).

Right. Multiplexers are nothing new, and I'm not a huge fan of them.
From an API design perspective, having a bunch of different system
calls is probably best.

(I do wonder what happens to system call cache behavior once the
top-level system call table becomes huge though.)

>> That said, while I do want to fix process configuration and startup
>> generally, I want to fix specific holes in the existing API surface
>> first. The two patches I've already sent do that, and this work
>> shouldn't wait on an ideal larger process-API overhaul that may or may
>> not arrive. Based on previous history, I suspect that an API of the
>> scope you're proposing would take years to overcome all LKML
>> objections and land. I don't want to wait that long when we can make
>> smaller fixes that would not conflict with the general architecture.
>
> I believe this is precisely what Christian is trying to do with this
> patch (and you say as much later in your mail).
>
> I think that adding all of {sighand,sighand_exitcode,kill,...} would not
> help the path of landing a much larger API change. We should instead
> think about the API we want at the end of the day, and then land smaller
> changes which are long-term compatible (and won't just become deprecated
> APIs -- there's far too many of them, let's not add more needlessly).

I don't think we need to reach consensus on some long-term design to
address specific problems that we know today. The changes we're
talking about here *are* long-term compatible with a bigger process
API overhaul. They may or may not be *part* of that solution, but I
don't see them making that solution harder either. And the proposals
so far all seem to go in the right direction.

>> Next, I want to merge my exithand proposal, or something like it. It's
>> likewise a simple change that, in a minimal way, addresses a
>> longstanding API deficiency. I'm very strongly against the
>> POLLERR-on-directory variant of the idea.
>
> I agree with you on this need. I will admit I do somewhat like the EOF
> solution (mainly because it seamlessly deals with the multi-reader case)
> but I'm still not sure I like /proc/$pid/exit_sighand. As mentioned in
> the other discussion, ideally we would be only ever operating with the

This sentence got cut off.

> An ugly strawman of an alternative would be an interface that gave you
> an fd that you could similarly wait-until-EOF on, but that's probably
> not a major API improvement unless we also make said API provide exit
> status information in a way that works with the
> multiple-readers-with-different-creds usecase.

What about something like this then?

#define PROCESS_EXIT_HANDLE_CLOEXEC (1<<0)
#define PROCESS_EXIT_HANDLE_NONBLOCK (1<<1)
#define PROCESS_EXIT_HANDLE_WANT_STATUS (1<<2)

/* Open an "status handle" for the process identified by PROCFS_DFD,
* which must be an open descriptor to a /proc/pid directory. FLAGS is
* a combination of zero or more of the
* PROCESS_EXIT_HANDLE_* constants.
*
* Return -1 on error. On success, return a descriptor for a process
* status handle. Before the process identified by PROCFS_DFD exits,
* reads from the status handle block. After exit, reads from the
* status handle yield either EOF (if PROCESS_EXIT_HANDLE_WANT_STATUS
* is not specified) or a siginfo_t describing how the process exited
* (if PROCESS_EXIT_HANDLE_WANT_STATUS is specified), as from
* waitid(2).
*
* The returned file descriptor also supports poll(2) and other
* notification APIs.
*
* Any process may call process_get_statusfd from any PROCFS_DFD
* without PROCESS_EXIT_HANDLE_WANT_STATUS.
* If PROCESS_EXIT_HANDLE_WANT_STATUS is specified, PROCFS_DFD must
* refer either to the calling process (or one of its threads), a
* child of the current process, or a process for which the current
* process would be able to successfully call ptrace(2).
*
* The PROCESS_EXIT_HANDLE_WANT_STATUS permission check happens only
* at open time, not at read time, and the process handle can be
transferred like any other FD.
*/
int process_get_statusfd(int procfs_dfd, int flags);

> One other thing I think we should eventually consider is to provide an
> API which pings a listener whenever a process does an execve() (and
> possibly fork()).

I thought about providing this facility too, but it's not immediately
apparent to me who would use it. ISTM that most callers that would
want this would be happy grabbing the process with ptrace or passively
monitoring it with ftrace or the process connector.

2018-11-19 02:52:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:32 PM Daniel Colascione <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 12:28 PM, Andy Lutomirski <[email protected]> wrote:
> >> That is, I'm proposing an API that looks like this:
> >>
> >> int process_kill(int procfs_dfd, int signo, const union sigval value)
> >>
> >> If, later, process_kill were to *also* accept process-capability FDs,
> >> nothing would break.
> >
> > Except that this makes it ambiguous to the caller as to whether their current creds are considered. So it would need to be a different syscall or at least a flag. Otherwise a lot of those nice theoretical properties go away.
>
> Sure. A flag might make for better ergonomics.
>
> >> Yes, that's what I have in mind. A siginfo_t is small enough that we
> >> could just store it as a blob allocated off the procfs inode or
> >> something like that without bothering with a shmfs file. You'd be able
> >> to read(2) the exit status as many times as you wanted.
> >
> > I think that, if the syscall in question is read(2), then it should work *once* per struct file. Otherwise running cat on the file would behave very oddly.
>
> Why? The file pointer would work normally.

Can you explain the exact semantics? If I have an fd where read(2)
returns the same 4-byte value every time read(2) is called, then cat
will just return an infinite sequence of the same value. This is not
a complete disaster, but it's not really a good thing.

>
> > Read and poll have the same problem as write: we can’t check caps in read or poll either.
>
> Why not? Reading /proc/pid/stat does an access check today and
> conditionally replaces the exit status with zero.

And that's probably a bug. It's at least a giant kludge that we shouldn't copy.

Here is the general rule: the basic operations that are expected to
treat file descriptors as capabilities *must* treat file descriptors
as capabilities, at least for new APIs. This includes read(2),
write(2), and poll(2). We should have an exceedingly good reason to
check current's creds, mm, or anything else about current in those
syscalls.

There is a good reason for this: consider what happens if you type:

sudo >/proc/PID/whatever

or

sudo </proc/PID/whatever

2018-11-19 03:00:15

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 09:42:35AM -0800, Andy Lutomirski wrote:

> Now here's the kicker: if the "running program" calls execve(), it
> goes away. The fd gets some sort of notification that this happened

Type error, parser failed.

Define "fd", please. If it's a "file descriptor", thank you do playing,
you've lost. That's not going to work. If it's "opened file" (aka
"file description" in horrible POSIXese), who's going to get notifications
and what kind of exclusion are you going to use?

2018-11-19 03:49:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 6:47 PM Al Viro <[email protected]> wrote:
>
> On Sun, Nov 18, 2018 at 09:42:35AM -0800, Andy Lutomirski wrote:
>
> > Now here's the kicker: if the "running program" calls execve(), it
> > goes away. The fd gets some sort of notification that this happened
>
> Type error, parser failed.
>
> Define "fd", please. If it's a "file descriptor", thank you do playing,
> you've lost. That's not going to work. If it's "opened file" (aka
> "file description" in horrible POSIXese), who's going to get notifications
> and what kind of exclusion are you going to use?

What I meant was: a program that has one of these fds would be able to
find out that an execve() happened and the program needs to refresh
its access to the target task. This could be as simple as POLLHUP
and, if needed, some syscall indicating exactly why we got POLLHUP
(e.g. execve vs exit).

There would be some sort of indication that a program that holds an fd
pointing at an "opened file" could get -- probably poll() would return
some status indicating that execve() happened and our capability is
gone, and, if needed

2018-11-19 10:58:57

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

Hi Christian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3 next-20181116]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Christian-Brauner/proc-allow-killing-processes-via-file-descriptors/20181119-170316
config: riscv-tinyconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=riscv

All warnings (new ones prefixed by >>):

./usr/include/linux/v4l2-controls.h:1105: found __[us]{8,16,32,64} type without #include <linux/types.h>
>> ./usr/include/linux/procfd.h:8: found __[us]{8,16,32,64} type without #include <linux/types.h>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.15 kB)
.config.gz (4.31 kB)
Download all attachments

2018-11-19 14:16:47

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] proc: allow killing processes via file descriptors

From: > Christian Brauner
> Sent: 18 November 2018 11:18
>
> With this patch an open() call on /proc/<pid> will give userspace a handle
> to struct pid of the process associated with /proc/<pid>. This allows to
> maintain a stable handle on a process.

My 3c...

You need to add a version of fork() that returns an open fd to
/proc/pid to the parent.

Is it possible to overload fcntl() rather than ioctl() ?

More interestingly what about a 'unique pid' (eg the pid extended to
(say) 128 bits with a use count) that can be safely put into a /var/run/pid
file for a daemon and used later in a 'kill' that will only ever
reference the correct process.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2018-11-19 15:51:05

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, Nov 18, 2018 at 12:17:51PM +0100, Christian Brauner wrote:
> With this patch an open() call on /proc/<pid> will give userspace a handle
> to struct pid of the process associated with /proc/<pid>. This allows to
> maintain a stable handle on a process.
> I have been discussing various approaches extensively during technical
> conferences this year culminating in a long argument with Eric at Linux
> Plumbers. The general consensus was that having a handle on a process
> will be something that is very simple and easy to maintain with the
> option of being extensible via a more advanced api if the need arises. I
> believe that this patch is the most simple, dumb, and therefore
> maintainable solution.
>
> The need for this has arisen in order to reliably kill a process without
> running into issues of the pid being recycled as has been described in the
> rejected patch [1]. To fulfill the need described in that patchset a new

It would certainly be good to fix this. IIUC, things like pkill(1) are
a gamble today and can probably kill a process that doesn't fulfil the
match criteria due to PID recycling. (If not, I'd certainly like to
understand how that is prevented.)

> ioctl() PROC_FD_SIGNAL is added. It can be used to send signals to a
> process via a file descriptor:
>
> int fd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
> ioctl(fd, PROC_FD_SIGNAL, SIGKILL);
> close(fd);
>
> Note, the stable handle will allow us to carefully extend this feature in
> the future.

A concern here would be that an fd-based shadow API may need to be
created, duplicating the whole PID-based API that already exists.

However, so long as the PID is stabilised against recycling, an ordinary
kill() call seems fine as a way to kill the target process, and no new
API or permission model seems to be needed.


It occurs to me that a mechanism for holding a reference on a third
process already exists: ptrace.

Suppose we were to have something like

ptrace(PTRACE_MONITOR, pid, 0, 0);

that subscribes the caller for the same set of notifications via wait()
as the process' real parent gets, and prevents PID recycling until the
zombie is consumed be everyone who is subscribed.

Multiple PTRACE_MONITOR attachments could be allowed for a given target,
in addition to the real parent and regular ptrace-parent (if any).


There are a couple of wrinkles:

* ptrace() operates on tasks, not processes, so the precise semantics
of PTRACE_MONITOR attachment would need a bit of thought.

* Odd mechanisms for discovering PIDs, like the unix(7) SCM_CREDENTIALS
message might need a special variant to get a PTRACE_MONITOR attachment
along with the message. This variant behaviour would need to be opt-in
for the recipient.


OTOH, extending ptrace may bring problems of its own :/

Cheers
---Dave

2018-11-19 16:16:16

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Sun, 18 Nov 2018 at 18:30, Andy Lutomirski <[email protected]> wrote:
> Here's my point: if we're really going to make a new API to manipulate
> processes by their fd, I think we should have at least a decent idea
> of how that API will get extended in the future. Right now, we have
> an extremely awkward situation where opening an fd in /proc requires
> certain capabilities or uids, and using those fds often also checks
> current's capabilities, and the target process may have changed its
> own security context, including gaining privilege via SUID, SGID, or
> LSM transition rules in the mean time. This has been a huge source of
> security bugs. It would be nice to have a model for future APIs that
> avoids these problems.
>
> And I didn't say in my proposal that a process's identity should
> fundamentally change when it calls execve(). I'm suggesting that
> certain operations that could cause a process to gain privilege or
> otherwise require greater permission to introspect (mainly execve)
> could be handled by invalidating the new process management fds.
> Sure, if init re-execs itself, it's still PID 1, but that doesn't
> necessarily mean that:
>
> fd = process_open_management_fd(1);
> [init reexecs]
> process_do_something(fd);
>
> needs to work.
>
> >
> > > setresuid() has no effect
> > > here -- if you have W access to the process and the process calls
> > > setresuid(), you still have W access.
> >
> > Now you've created a situation in which an operation that security
> > policy previously blocked now becomes possible, invaliding previous
> > designs based on the old security invariant. That's the definition of
> > introducing a security hole.
>
> I think you're overstating your case. To a pretty good approximation,
> setresuid() allows the caller to remove elements from the set {ruid,
> suid, euid}, unless the caller has CAP_SETUID. If you could ptrace a
> process before it calls setresuid(), you might as well be able to
> ptrace() it after, since you could have just ptraced it and made it
> call setresuid() while still ptracing it. Similarly, it seems like
> it's probably safe to be able to open an fd that lets you watch the
> exit status of a process, have the process call setresuid(), and still
> see the exit status.
>
> Regardless of how you feel about these issues, if you're going to add
> an API by which you open an fd, wait for a process to exit, and read
> the exit status, you need to define the conditions under which you may
> open the fd and under which you may read the exit status once you have
> the fd. There are probably multiple valid answers, but the question
> still needs to be answered. My POLLERR hack, aside from being ugly,
> avoids this particular issue because it merely lets you wait for
> something you already could have observed using readdir().

Beg your pardon for hijacking the thread..

I wonder how fast it would be holding a pid with another open()ed fd.
And then you need to read comm (or how you filter whom to kill).
It seems to me that procfs will be even slower with this safe-way.
But I might misunderstand the idea, excuses.

So, I just wanted to gently remind about procfs with netlink socket[1].
It seems to me that whenever you receive() pid information, you
can request some uniq 64(?) bit number and kill the process using it.
Whenever uniqueness of 64-bit number to handle pids will be a question
the netlink message might be painlessly extended to 128 or whatever.

Also, it may provide the facilities to atomically kill process say by name
by adding another field to netlink message.

Probably, if it's time to add a new API for procfs, netlink may be more
desirable.

[1]: https://lwn.net/Articles/650243/

Thanks,
Dmitry

2018-11-19 16:28:52

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On Mon, Nov 19, 2018 at 8:13 AM, Dmitry Safonov <[email protected]> wrote:
> I wonder how fast it would be holding a pid with another open()ed fd.
> And then you need to read comm (or how you filter whom to kill).
> It seems to me that procfs will be even slower with this safe-way.
> But I might misunderstand the idea, excuses.
>
> So, I just wanted to gently remind about procfs with netlink socket[1].

We discussed netlink was extensively on the thread about
/proc/pid/kill. For numerous reasons, it's not suitable for
fundamental process management. We really need an FD-based interface
to processes, just like we have FD-based interfaces to other resource
types. We need something consistent and reliable, not an abuse of a
monitoring interface.

> Probably, if it's time to add a new API for procfs, netlink may be more
> desirable.

Definitely not.

2018-11-19 16:29:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors (Larger pids)

Dmitry Safonov <[email protected]> writes:
>
> So, I just wanted to gently remind about procfs with netlink socket[1].
> It seems to me that whenever you receive() pid information, you
> can request some uniq 64(?) bit number and kill the process using it.
> Whenever uniqueness of 64-bit number to handle pids will be a question
> the netlink message might be painlessly extended to 128 or whatever.

No.

I have seen this requested twice in this thread now, and despite
understanding where everyone is coming from I do not believe it will
be wise to implement larger pids.

The problem is that we then have to make these larger pids live in
the pid namespace, make struct pid larger to hold them and update
CRIU to save and restore them.

All for a very small handful of processes that use this extended API.

Eric

2018-11-19 20:23:00

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH] proc: allow killing processes via file descriptors

On 2018-11-19, Daniel Colascione <[email protected]> wrote:
> > I wonder how fast it would be holding a pid with another open()ed fd.
> > And then you need to read comm (or how you filter whom to kill).
> > It seems to me that procfs will be even slower with this safe-way.
> > But I might misunderstand the idea, excuses.
> >
> > So, I just wanted to gently remind about procfs with netlink socket[1].
>
> We discussed netlink was extensively on the thread about
> /proc/pid/kill. For numerous reasons, it's not suitable for
> fundamental process management. We really need an FD-based interface
> to processes, just like we have FD-based interfaces to other resource
> types. We need something consistent and reliable, not an abuse of a
> monitoring interface.

Another significant problem with using netlink for something like this
is that (as its name suggest) it's tied to network namespaces and not
pid namespaces so you wouldn't reasonably be able to use the API inside
a container. Using an fd side-steps the problem somewhat (though this
just gave me an idea -- I will add it to the other thread).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.20 kB)
signature.asc (849.00 B)
Download all attachments