2018-10-29 22:14:05

by Daniel Colascione

[permalink] [raw]
Subject: [RFC PATCH] Implement /proc/pid/kill

Add a simple proc-based kill interface. To use /proc/pid/kill, just
write the signal number in base-10 ASCII to the kill file of the
process to be killed: for example, 'echo 9 > /proc/$$/kill'.

Semantically, /proc/pid/kill works like kill(2), except that the
process ID comes from the proc filesystem context instead of from an
explicit system call parameter. This way, it's possible to avoid races
between inspecting some aspect of a process and that process's PID
being reused for some other process.

With /proc/pid/kill, it's possible to write a proper race-free and
safe pkill(1). An approximation follows. A real program might use
openat(2), having opened a process's /proc/pid directory explicitly,
with the directory file descriptor serving as a sort of "process
handle".

#!/bin/bash
set -euo pipefail
pat=$1
for proc_status in /proc/*/status; do (
cd $(dirname $proc_status)
readarray proc_argv -d'' < cmdline
if ((${#proc_argv[@]} > 0)) &&
[[ ${proc_argv[0]} = *$pat* ]];
then
echo 15 > kill
fi
) || true; done

Signed-off-by: Daniel Colascione <[email protected]>
---
fs/proc/base.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7e9f07bf260d..923d62b21e67 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
return result;
}

+static ssize_t proc_pid_kill_write(struct file *file,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t res;
+ int sig;
+ char buffer[4];
+
+ res = -EINVAL;
+ if (*ppos != 0)
+ goto out;
+
+ res = -EINVAL;
+ if (count > sizeof(buffer) - 1)
+ goto out;
+
+ res = -EFAULT;
+ if (copy_from_user(buffer, buf, count))
+ goto out;
+
+ buffer[count] = '\0';
+ res = kstrtoint(strstrip(buffer), 10, &sig);
+ if (res)
+ goto out;
+
+ res = kill_pid(proc_pid(file_inode(file)), sig, 0);
+ if (res)
+ goto out;
+ res = count;
+out:
+ return res;
+
+}
+
+static const struct file_operations proc_pid_kill_ops = {
+ .write = proc_pid_kill_write,
+};
+
static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
size_t count, loff_t *ppos)
{
@@ -2935,6 +2973,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall", S_IRUSR, proc_pid_syscall),
#endif
+ REG("kill", S_IRUGO | S_IWUGO, proc_pid_kill_ops),
REG("cmdline", S_IRUGO, proc_pid_cmdline_ops),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
2.19.1.568.g152ad8e336-goog



2018-10-30 03:22:41

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
>
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".

How long does the 'inspection' procedure take? If its a short
duration, then is PID reuse really an issue, I mean the PIDs are not
reused until wrap around and the only reason this can be a problem is
if you have the wrap around while the 'inspecting some aspect'
procedure takes really long.

Also the proc fs is typically not the right place for this. Some
entries in proc are writeable, but those are for changing values of
kernel data structures. The title of man proc(5) is "proc - process
information pseudo-filesystem". So its "information" right?

IMO without a really good reason for this, it could really be a hard
sell but the RFC was worth it anyway to discuss it ;-)

thanks,

- Joel

2018-10-30 05:01:42

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 2018-10-29, Daniel Colascione <[email protected]> wrote:
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.

(Aside from any UX concerns other folks might have.)

I think it would be a good idea to (at least temporarily) restrict this
so that only processes that are in the same PID namespace as the /proc
being resolved through may use this interface. Otherwise you might have
cases where partial container breakouts can start sending signals to
PIDs they wouldn't normally be able to address.

> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".

I do like the idea of holding a dirfd to /proc/$pid to address
processes, and it something I considered doing in runc. (Unfortunately
there are lots of things that make it a bit difficult to use /proc/$pid
exclusively for introspection of a process -- especially in the context
of containers.)

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.65 kB)
signature.asc (849.00 B)
Download all attachments

2018-10-30 08:52:41

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
>>
>> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> write the signal number in base-10 ASCII to the kill file of the
>> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>>
>> Semantically, /proc/pid/kill works like kill(2), except that the
>> process ID comes from the proc filesystem context instead of from an
>> explicit system call parameter. This way, it's possible to avoid races
>> between inspecting some aspect of a process and that process's PID
>> being reused for some other process.
>>
>> With /proc/pid/kill, it's possible to write a proper race-free and
>> safe pkill(1). An approximation follows. A real program might use
>> openat(2), having opened a process's /proc/pid directory explicitly,
>> with the directory file descriptor serving as a sort of "process
>> handle".
>
> How long does the 'inspection' procedure take? If its a short
> duration, then is PID reuse really an issue, I mean the PIDs are not
> reused until wrap around and the only reason this can be a problem is
> if you have the wrap around while the 'inspecting some aspect'
> procedure takes really long.

It's a race. Would you make similar statements about a similar fix for
a race condition involving a mutex and a double-free just because the
race didn't crash most of the time? The issue I'm trying to fix here
is the same problem, one level higher up in the abstraction hierarchy.

> Also the proc fs is typically not the right place for this. Some
> entries in proc are writeable, but those are for changing values of
> kernel data structures. The title of man proc(5) is "proc - process
> information pseudo-filesystem". So its "information" right?

Why should userspace care whether a particular operation is "changing
[a] value[] of [a] kernel data structure" or something else? That
something in /proc is a struct field is an implementation detail. It's
the interface semantics that matters, and whether a particular
operation is achieved by changing a struct field or by making a
function call is irrelevant to userspace. Proc is a filesystem about
processes. Why shouldn't you be able to send a signal to a process via
proc? It's an operation involving processes.

It's already possible to do things *to* processes via proc, e.g.,
adjust OOM killer scores. Proc filesystem file descriptors are
userspace references to kernel-side struct pid instances, and as such,
make good process handles. There are already "verb" files in procfs,
such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
a kill "verb", especially if it closes a race that can't be closed
some other way?

You could implement this interface as a system call that took a procfs
directory file descriptor, but relative to this proposal, it would be
all downside. Such a thing would act just the same way as
/pric/pid/kill, and wouldn't be usable from the shell or from programs
that didn't want to use syscall(2). (Since glibc isn't adding new
system call wrappers.) AFAIK, the only downside of having a "kill"
file is the need for a string-to-integer conversion, but compared to
process killing, integer parsing is insignificant.

> IMO without a really good reason for this, it could really be a hard
> sell but the RFC was worth it anyway to discuss it ;-)

The traditional unix process API is down there at level -10 of Rusty
Russel's old bad API scale: "It's impossible to get right". The races
in the current API are unavoidable. That most programs don't hit these
races most of the time doesn't mean that the race isn't present.

We've moved to a model where we identify other system resources, like
DRM fences, locks, sockets, and everything else via file descriptors.
This change is a step toward using procfs file descriptors to work
with processes, which makes the system more regular and easier to
reason about. A clean API that's possible to use correctly is a
worthwhile project.

2018-10-30 09:06:24

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 5:00 AM, Aleksa Sarai <[email protected]> wrote:
> On 2018-10-29, Daniel Colascione <[email protected]> wrote:
>> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> write the signal number in base-10 ASCII to the kill file of the
>> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>>
>> Semantically, /proc/pid/kill works like kill(2), except that the
>> process ID comes from the proc filesystem context instead of from an
>> explicit system call parameter. This way, it's possible to avoid races
>> between inspecting some aspect of a process and that process's PID
>> being reused for some other process.
>
> (Aside from any UX concerns other folks might have.)
>
> I think it would be a good idea to (at least temporarily) restrict this
> so that only processes that are in the same PID namespace as the /proc
> being resolved through may use this interface. Otherwise you might have
> cases where partial container breakouts can start sending signals to
> PIDs they wouldn't normally be able to address.

That's a good idea.

>> With /proc/pid/kill, it's possible to write a proper race-free and
>> safe pkill(1). An approximation follows. A real program might use
>> openat(2), having opened a process's /proc/pid directory explicitly,
>> with the directory file descriptor serving as a sort of "process
>> handle".
>
> I do like the idea of holding a dirfd to /proc/$pid to address
> processes, and it something I considered doing in runc.

I did think about explicit system calls to create userspace struct pid
references, independent of proc --- something like open_process(int
pid). But when we already have procfs, which is already a way of
getting struct pid references, why add a new interface? Granted, not
everyone has /proc mounted. One idea add a special PROCFS_FD constant
(say, -2) that you would supply to openat(2) as dirfd. When openat(2)
saw PROCFS_FD, it'd interpret the open as relative to procfs whether
or not /proc were actually mounted anywhere. This facility would let
you open a "proc" file from anywhere even without the right mounts set
up.

> (Unfortunately
> there are lots of things that make it a bit difficult to use /proc/$pid
> exclusively for introspection of a process -- especially in the context
> of containers.)

Tons of things already break without a working /proc. What do you have in mind?

2018-10-30 10:39:54

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
> On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
> >>
> >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> write the signal number in base-10 ASCII to the kill file of the
> >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >>
> >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> process ID comes from the proc filesystem context instead of from an
> >> explicit system call parameter. This way, it's possible to avoid races
> >> between inspecting some aspect of a process and that process's PID
> >> being reused for some other process.
> >>
> >> With /proc/pid/kill, it's possible to write a proper race-free and
> >> safe pkill(1). An approximation follows. A real program might use
> >> openat(2), having opened a process's /proc/pid directory explicitly,
> >> with the directory file descriptor serving as a sort of "process
> >> handle".
> >
> > How long does the 'inspection' procedure take? If its a short
> > duration, then is PID reuse really an issue, I mean the PIDs are not
> > reused until wrap around and the only reason this can be a problem is
> > if you have the wrap around while the 'inspecting some aspect'
> > procedure takes really long.
>
> It's a race. Would you make similar statements about a similar fix for
> a race condition involving a mutex and a double-free just because the
> race didn't crash most of the time? The issue I'm trying to fix here
> is the same problem, one level higher up in the abstraction hierarchy.
>
> > Also the proc fs is typically not the right place for this. Some
> > entries in proc are writeable, but those are for changing values of
> > kernel data structures. The title of man proc(5) is "proc - process
> > information pseudo-filesystem". So its "information" right?
>
> Why should userspace care whether a particular operation is "changing
> [a] value[] of [a] kernel data structure" or something else? That
> something in /proc is a struct field is an implementation detail. It's
> the interface semantics that matters, and whether a particular
> operation is achieved by changing a struct field or by making a
> function call is irrelevant to userspace. Proc is a filesystem about
> processes. Why shouldn't you be able to send a signal to a process via
> proc? It's an operation involving processes.
>
> It's already possible to do things *to* processes via proc, e.g.,
> adjust OOM killer scores. Proc filesystem file descriptors are
> userspace references to kernel-side struct pid instances, and as such,
> make good process handles. There are already "verb" files in procfs,
> such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
> a kill "verb", especially if it closes a race that can't be closed
> some other way?
>
> You could implement this interface as a system call that took a procfs
> directory file descriptor, but relative to this proposal, it would be
> all downside. Such a thing would act just the same way as
> /pric/pid/kill, and wouldn't be usable from the shell or from programs
> that didn't want to use syscall(2). (Since glibc isn't adding new
> system call wrappers.) AFAIK, the only downside of having a "kill"
> file is the need for a string-to-integer conversion, but compared to
> process killing, integer parsing is insignificant.
>
> > IMO without a really good reason for this, it could really be a hard
> > sell but the RFC was worth it anyway to discuss it ;-)
>
> The traditional unix process API is down there at level -10 of Rusty
> Russel's old bad API scale: "It's impossible to get right". The races
> in the current API are unavoidable. That most programs don't hit these
> races most of the time doesn't mean that the race isn't present.
>
> We've moved to a model where we identify other system resources, like
> DRM fences, locks, sockets, and everything else via file descriptors.
> This change is a step toward using procfs file descriptors to work
> with processes, which makes the system more regular and easier to
> reason about. A clean API that's possible to use correctly is a
> worthwhile project.

So I have been disucssing a new process API With David Howells, Kees
Cook and a few others and I am working on an RFC/proposal for this. It
is partially inspired by the new mount API. So I would like to block
this patch until then. I would like to get this right very much and I
don't think this is the way to go. I hope to have a more detailed
proposal out soon(ish). David and I were also thinking about an adhoc
session at the kernel summit but we aren't clear whether there's still a
slot.

2018-10-30 10:42:03

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 11:39:11AM +0100, Christian Brauner wrote:
> On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
> > On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> > > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
> > >>
> > >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> > >> write the signal number in base-10 ASCII to the kill file of the
> > >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> > >>
> > >> Semantically, /proc/pid/kill works like kill(2), except that the
> > >> process ID comes from the proc filesystem context instead of from an
> > >> explicit system call parameter. This way, it's possible to avoid races
> > >> between inspecting some aspect of a process and that process's PID
> > >> being reused for some other process.
> > >>
> > >> With /proc/pid/kill, it's possible to write a proper race-free and
> > >> safe pkill(1). An approximation follows. A real program might use
> > >> openat(2), having opened a process's /proc/pid directory explicitly,
> > >> with the directory file descriptor serving as a sort of "process
> > >> handle".
> > >
> > > How long does the 'inspection' procedure take? If its a short
> > > duration, then is PID reuse really an issue, I mean the PIDs are not
> > > reused until wrap around and the only reason this can be a problem is
> > > if you have the wrap around while the 'inspecting some aspect'
> > > procedure takes really long.
> >
> > It's a race. Would you make similar statements about a similar fix for
> > a race condition involving a mutex and a double-free just because the
> > race didn't crash most of the time? The issue I'm trying to fix here
> > is the same problem, one level higher up in the abstraction hierarchy.
> >
> > > Also the proc fs is typically not the right place for this. Some
> > > entries in proc are writeable, but those are for changing values of
> > > kernel data structures. The title of man proc(5) is "proc - process
> > > information pseudo-filesystem". So its "information" right?
> >
> > Why should userspace care whether a particular operation is "changing
> > [a] value[] of [a] kernel data structure" or something else? That
> > something in /proc is a struct field is an implementation detail. It's
> > the interface semantics that matters, and whether a particular
> > operation is achieved by changing a struct field or by making a
> > function call is irrelevant to userspace. Proc is a filesystem about
> > processes. Why shouldn't you be able to send a signal to a process via
> > proc? It's an operation involving processes.
> >
> > It's already possible to do things *to* processes via proc, e.g.,
> > adjust OOM killer scores. Proc filesystem file descriptors are
> > userspace references to kernel-side struct pid instances, and as such,
> > make good process handles. There are already "verb" files in procfs,
> > such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
> > a kill "verb", especially if it closes a race that can't be closed
> > some other way?
> >
> > You could implement this interface as a system call that took a procfs
> > directory file descriptor, but relative to this proposal, it would be
> > all downside. Such a thing would act just the same way as
> > /pric/pid/kill, and wouldn't be usable from the shell or from programs
> > that didn't want to use syscall(2). (Since glibc isn't adding new
> > system call wrappers.) AFAIK, the only downside of having a "kill"
> > file is the need for a string-to-integer conversion, but compared to
> > process killing, integer parsing is insignificant.
> >
> > > IMO without a really good reason for this, it could really be a hard
> > > sell but the RFC was worth it anyway to discuss it ;-)
> >
> > The traditional unix process API is down there at level -10 of Rusty
> > Russel's old bad API scale: "It's impossible to get right". The races
> > in the current API are unavoidable. That most programs don't hit these
> > races most of the time doesn't mean that the race isn't present.
> >
> > We've moved to a model where we identify other system resources, like
> > DRM fences, locks, sockets, and everything else via file descriptors.
> > This change is a step toward using procfs file descriptors to work
> > with processes, which makes the system more regular and easier to
> > reason about. A clean API that's possible to use correctly is a
> > worthwhile project.
>
> So I have been disucssing a new process API With David Howells, Kees
> Cook and a few others and I am working on an RFC/proposal for this. It
> is partially inspired by the new mount API. So I would like to block
> this patch until then. I would like to get this right very much and I
> don't think this is the way to go. I hope to have a more detailed
> proposal out soon(ish). David and I were also thinking about an adhoc
> session at the kernel summit but we aren't clear whether there's still a
> slot.

It's also entertaining since I talked with Dylan Reid at Google about
this during {O,L}SS too. :)

2018-10-30 10:51:57

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 10:40 AM, Christian Brauner
<[email protected]> wrote:
> On Tue, Oct 30, 2018 at 11:39:11AM +0100, Christian Brauner wrote:
>> On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
>> > On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
>> > > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
>> > >>
>> > >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> > >> write the signal number in base-10 ASCII to the kill file of the
>> > >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>> > >>
>> > >> Semantically, /proc/pid/kill works like kill(2), except that the
>> > >> process ID comes from the proc filesystem context instead of from an
>> > >> explicit system call parameter. This way, it's possible to avoid races
>> > >> between inspecting some aspect of a process and that process's PID
>> > >> being reused for some other process.
>> > >>
>> > >> With /proc/pid/kill, it's possible to write a proper race-free and
>> > >> safe pkill(1). An approximation follows. A real program might use
>> > >> openat(2), having opened a process's /proc/pid directory explicitly,
>> > >> with the directory file descriptor serving as a sort of "process
>> > >> handle".
>> > >
>> > > How long does the 'inspection' procedure take? If its a short
>> > > duration, then is PID reuse really an issue, I mean the PIDs are not
>> > > reused until wrap around and the only reason this can be a problem is
>> > > if you have the wrap around while the 'inspecting some aspect'
>> > > procedure takes really long.
>> >
>> > It's a race. Would you make similar statements about a similar fix for
>> > a race condition involving a mutex and a double-free just because the
>> > race didn't crash most of the time? The issue I'm trying to fix here
>> > is the same problem, one level higher up in the abstraction hierarchy.
>> >
>> > > Also the proc fs is typically not the right place for this. Some
>> > > entries in proc are writeable, but those are for changing values of
>> > > kernel data structures. The title of man proc(5) is "proc - process
>> > > information pseudo-filesystem". So its "information" right?
>> >
>> > Why should userspace care whether a particular operation is "changing
>> > [a] value[] of [a] kernel data structure" or something else? That
>> > something in /proc is a struct field is an implementation detail. It's
>> > the interface semantics that matters, and whether a particular
>> > operation is achieved by changing a struct field or by making a
>> > function call is irrelevant to userspace. Proc is a filesystem about
>> > processes. Why shouldn't you be able to send a signal to a process via
>> > proc? It's an operation involving processes.
>> >
>> > It's already possible to do things *to* processes via proc, e.g.,
>> > adjust OOM killer scores. Proc filesystem file descriptors are
>> > userspace references to kernel-side struct pid instances, and as such,
>> > make good process handles. There are already "verb" files in procfs,
>> > such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
>> > a kill "verb", especially if it closes a race that can't be closed
>> > some other way?
>> >
>> > You could implement this interface as a system call that took a procfs
>> > directory file descriptor, but relative to this proposal, it would be
>> > all downside. Such a thing would act just the same way as
>> > /pric/pid/kill, and wouldn't be usable from the shell or from programs
>> > that didn't want to use syscall(2). (Since glibc isn't adding new
>> > system call wrappers.) AFAIK, the only downside of having a "kill"
>> > file is the need for a string-to-integer conversion, but compared to
>> > process killing, integer parsing is insignificant.
>> >
>> > > IMO without a really good reason for this, it could really be a hard
>> > > sell but the RFC was worth it anyway to discuss it ;-)
>> >
>> > The traditional unix process API is down there at level -10 of Rusty
>> > Russel's old bad API scale: "It's impossible to get right". The races
>> > in the current API are unavoidable. That most programs don't hit these
>> > races most of the time doesn't mean that the race isn't present.
>> >
>> > We've moved to a model where we identify other system resources, like
>> > DRM fences, locks, sockets, and everything else via file descriptors.
>> > This change is a step toward using procfs file descriptors to work
>> > with processes, which makes the system more regular and easier to
>> > reason about. A clean API that's possible to use correctly is a
>> > worthwhile project.
>>
>> So I have been disucssing a new process API With David Howells, Kees
>> Cook and a few others and I am working on an RFC/proposal for this. It
>> is partially inspired by the new mount API. So I would like to block
>> this patch until then. I would like to get this right very much and

It's good to hear that others are thinking about this problem.

>> I
>> don't think this is the way to go.

Why not?

Does your proposed API allow for a race-free pkill, with arbitrary
selection criteria? This capability is a good litmus test for fixing
the long-standing Unix process API issues.

2018-10-30 11:06:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 10:40 AM, Christian Brauner
> <[email protected]> wrote:
> > On Tue, Oct 30, 2018 at 11:39:11AM +0100, Christian Brauner wrote:
> >> On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
> >> > On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> >> > > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
> >> > >>
> >> > >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> > >> write the signal number in base-10 ASCII to the kill file of the
> >> > >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >> > >>
> >> > >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> > >> process ID comes from the proc filesystem context instead of from an
> >> > >> explicit system call parameter. This way, it's possible to avoid races
> >> > >> between inspecting some aspect of a process and that process's PID
> >> > >> being reused for some other process.
> >> > >>
> >> > >> With /proc/pid/kill, it's possible to write a proper race-free and
> >> > >> safe pkill(1). An approximation follows. A real program might use
> >> > >> openat(2), having opened a process's /proc/pid directory explicitly,
> >> > >> with the directory file descriptor serving as a sort of "process
> >> > >> handle".
> >> > >
> >> > > How long does the 'inspection' procedure take? If its a short
> >> > > duration, then is PID reuse really an issue, I mean the PIDs are not
> >> > > reused until wrap around and the only reason this can be a problem is
> >> > > if you have the wrap around while the 'inspecting some aspect'
> >> > > procedure takes really long.
> >> >
> >> > It's a race. Would you make similar statements about a similar fix for
> >> > a race condition involving a mutex and a double-free just because the
> >> > race didn't crash most of the time? The issue I'm trying to fix here
> >> > is the same problem, one level higher up in the abstraction hierarchy.
> >> >
> >> > > Also the proc fs is typically not the right place for this. Some
> >> > > entries in proc are writeable, but those are for changing values of
> >> > > kernel data structures. The title of man proc(5) is "proc - process
> >> > > information pseudo-filesystem". So its "information" right?
> >> >
> >> > Why should userspace care whether a particular operation is "changing
> >> > [a] value[] of [a] kernel data structure" or something else? That
> >> > something in /proc is a struct field is an implementation detail. It's
> >> > the interface semantics that matters, and whether a particular
> >> > operation is achieved by changing a struct field or by making a
> >> > function call is irrelevant to userspace. Proc is a filesystem about
> >> > processes. Why shouldn't you be able to send a signal to a process via
> >> > proc? It's an operation involving processes.
> >> >
> >> > It's already possible to do things *to* processes via proc, e.g.,
> >> > adjust OOM killer scores. Proc filesystem file descriptors are
> >> > userspace references to kernel-side struct pid instances, and as such,
> >> > make good process handles. There are already "verb" files in procfs,
> >> > such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
> >> > a kill "verb", especially if it closes a race that can't be closed
> >> > some other way?
> >> >
> >> > You could implement this interface as a system call that took a procfs
> >> > directory file descriptor, but relative to this proposal, it would be
> >> > all downside. Such a thing would act just the same way as
> >> > /pric/pid/kill, and wouldn't be usable from the shell or from programs
> >> > that didn't want to use syscall(2). (Since glibc isn't adding new
> >> > system call wrappers.) AFAIK, the only downside of having a "kill"
> >> > file is the need for a string-to-integer conversion, but compared to
> >> > process killing, integer parsing is insignificant.
> >> >
> >> > > IMO without a really good reason for this, it could really be a hard
> >> > > sell but the RFC was worth it anyway to discuss it ;-)
> >> >
> >> > The traditional unix process API is down there at level -10 of Rusty
> >> > Russel's old bad API scale: "It's impossible to get right". The races
> >> > in the current API are unavoidable. That most programs don't hit these
> >> > races most of the time doesn't mean that the race isn't present.
> >> >
> >> > We've moved to a model where we identify other system resources, like
> >> > DRM fences, locks, sockets, and everything else via file descriptors.
> >> > This change is a step toward using procfs file descriptors to work
> >> > with processes, which makes the system more regular and easier to
> >> > reason about. A clean API that's possible to use correctly is a
> >> > worthwhile project.
> >>
> >> So I have been disucssing a new process API With David Howells, Kees
> >> Cook and a few others and I am working on an RFC/proposal for this. It
> >> is partially inspired by the new mount API. So I would like to block
> >> this patch until then. I would like to get this right very much and
>
> It's good to hear that others are thinking about this problem.
>
> >> I
> >> don't think this is the way to go.

Because we want this to be generic and things like getting handles on
processes via /proc is just a part of that.

>
> Why not?
>
> Does your proposed API allow for a race-free pkill, with arbitrary
> selection criteria? This capability is a good litmus test for fixing
> the long-standing Unix process API issues.

You'd have a handle on the process with an fd so yes, it would be.

2018-10-30 11:13:07

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 11:04 AM, Christian Brauner
<[email protected]> wrote:
> On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione <[email protected]> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:40 AM, Christian Brauner
>> <[email protected]> wrote:
>> > On Tue, Oct 30, 2018 at 11:39:11AM +0100, Christian Brauner wrote:
>> >> On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
>> >> > On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
>> >> > > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
>> >> > >>
>> >> > >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> >> > >> write the signal number in base-10 ASCII to the kill file of the
>> >> > >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>> >> > >>
>> >> > >> Semantically, /proc/pid/kill works like kill(2), except that the
>> >> > >> process ID comes from the proc filesystem context instead of from an
>> >> > >> explicit system call parameter. This way, it's possible to avoid races
>> >> > >> between inspecting some aspect of a process and that process's PID
>> >> > >> being reused for some other process.
>> >> > >>
>> >> > >> With /proc/pid/kill, it's possible to write a proper race-free and
>> >> > >> safe pkill(1). An approximation follows. A real program might use
>> >> > >> openat(2), having opened a process's /proc/pid directory explicitly,
>> >> > >> with the directory file descriptor serving as a sort of "process
>> >> > >> handle".
>> >> > >
>> >> > > How long does the 'inspection' procedure take? If its a short
>> >> > > duration, then is PID reuse really an issue, I mean the PIDs are not
>> >> > > reused until wrap around and the only reason this can be a problem is
>> >> > > if you have the wrap around while the 'inspecting some aspect'
>> >> > > procedure takes really long.
>> >> >
>> >> > It's a race. Would you make similar statements about a similar fix for
>> >> > a race condition involving a mutex and a double-free just because the
>> >> > race didn't crash most of the time? The issue I'm trying to fix here
>> >> > is the same problem, one level higher up in the abstraction hierarchy.
>> >> >
>> >> > > Also the proc fs is typically not the right place for this. Some
>> >> > > entries in proc are writeable, but those are for changing values of
>> >> > > kernel data structures. The title of man proc(5) is "proc - process
>> >> > > information pseudo-filesystem". So its "information" right?
>> >> >
>> >> > Why should userspace care whether a particular operation is "changing
>> >> > [a] value[] of [a] kernel data structure" or something else? That
>> >> > something in /proc is a struct field is an implementation detail. It's
>> >> > the interface semantics that matters, and whether a particular
>> >> > operation is achieved by changing a struct field or by making a
>> >> > function call is irrelevant to userspace. Proc is a filesystem about
>> >> > processes. Why shouldn't you be able to send a signal to a process via
>> >> > proc? It's an operation involving processes.
>> >> >
>> >> > It's already possible to do things *to* processes via proc, e.g.,
>> >> > adjust OOM killer scores. Proc filesystem file descriptors are
>> >> > userspace references to kernel-side struct pid instances, and as such,
>> >> > make good process handles. There are already "verb" files in procfs,
>> >> > such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
>> >> > a kill "verb", especially if it closes a race that can't be closed
>> >> > some other way?
>> >> >
>> >> > You could implement this interface as a system call that took a procfs
>> >> > directory file descriptor, but relative to this proposal, it would be
>> >> > all downside. Such a thing would act just the same way as
>> >> > /pric/pid/kill, and wouldn't be usable from the shell or from programs
>> >> > that didn't want to use syscall(2). (Since glibc isn't adding new
>> >> > system call wrappers.) AFAIK, the only downside of having a "kill"
>> >> > file is the need for a string-to-integer conversion, but compared to
>> >> > process killing, integer parsing is insignificant.
>> >> >
>> >> > > IMO without a really good reason for this, it could really be a hard
>> >> > > sell but the RFC was worth it anyway to discuss it ;-)
>> >> >
>> >> > The traditional unix process API is down there at level -10 of Rusty
>> >> > Russel's old bad API scale: "It's impossible to get right". The races
>> >> > in the current API are unavoidable. That most programs don't hit these
>> >> > races most of the time doesn't mean that the race isn't present.
>> >> >
>> >> > We've moved to a model where we identify other system resources, like
>> >> > DRM fences, locks, sockets, and everything else via file descriptors.
>> >> > This change is a step toward using procfs file descriptors to work
>> >> > with processes, which makes the system more regular and easier to
>> >> > reason about. A clean API that's possible to use correctly is a
>> >> > worthwhile project.
>> >>
>> >> So I have been disucssing a new process API With David Howells, Kees
>> >> Cook and a few others and I am working on an RFC/proposal for this. It
>> >> is partially inspired by the new mount API. So I would like to block
>> >> this patch until then. I would like to get this right very much and
>>
>> It's good to hear that others are thinking about this problem.
>>
>> >> I
>> >> don't think this is the way to go.
>
> Because we want this to be generic and things like getting handles on
> processes via /proc is just a part of that

The word "generic" is like the word "secure": it's hard to tell what
it means in isolation. :-) Over what domain do we need to be generic?
Procfs file descriptors already work on processes generally, and they
allow for race-free access to anything that's reachable via a procfs
pid directory. In what way would an alternate approach be even more
generic?

>> Why not?
>>
>> Does your proposed API allow for a race-free pkill, with arbitrary
>> selection criteria? This capability is a good litmus test for fixing
>> the long-standing Unix process API issues.
>
> You'd have a handle on the process with an fd so yes, it would be.

Thanks. That's good to hear.

Any idea on the timetable for this proposal? I'm open to lots of
alternative technical approaches, but I don't want this capability to
languish for a long time.

2018-10-30 11:22:28

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 12:12 PM Daniel Colascione <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 11:04 AM, Christian Brauner
> <[email protected]> wrote:
> > On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione <[email protected]> wrote:
> >>
> >> On Tue, Oct 30, 2018 at 10:40 AM, Christian Brauner
> >> <[email protected]> wrote:
> >> > On Tue, Oct 30, 2018 at 11:39:11AM +0100, Christian Brauner wrote:
> >> >> On Tue, Oct 30, 2018 at 08:50:22AM +0000, Daniel Colascione wrote:
> >> >> > On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> >> >> > > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
> >> >> > >>
> >> >> > >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> >> > >> write the signal number in base-10 ASCII to the kill file of the
> >> >> > >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >> >> > >>
> >> >> > >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> >> > >> process ID comes from the proc filesystem context instead of from an
> >> >> > >> explicit system call parameter. This way, it's possible to avoid races
> >> >> > >> between inspecting some aspect of a process and that process's PID
> >> >> > >> being reused for some other process.
> >> >> > >>
> >> >> > >> With /proc/pid/kill, it's possible to write a proper race-free and
> >> >> > >> safe pkill(1). An approximation follows. A real program might use
> >> >> > >> openat(2), having opened a process's /proc/pid directory explicitly,
> >> >> > >> with the directory file descriptor serving as a sort of "process
> >> >> > >> handle".
> >> >> > >
> >> >> > > How long does the 'inspection' procedure take? If its a short
> >> >> > > duration, then is PID reuse really an issue, I mean the PIDs are not
> >> >> > > reused until wrap around and the only reason this can be a problem is
> >> >> > > if you have the wrap around while the 'inspecting some aspect'
> >> >> > > procedure takes really long.
> >> >> >
> >> >> > It's a race. Would you make similar statements about a similar fix for
> >> >> > a race condition involving a mutex and a double-free just because the
> >> >> > race didn't crash most of the time? The issue I'm trying to fix here
> >> >> > is the same problem, one level higher up in the abstraction hierarchy.
> >> >> >
> >> >> > > Also the proc fs is typically not the right place for this. Some
> >> >> > > entries in proc are writeable, but those are for changing values of
> >> >> > > kernel data structures. The title of man proc(5) is "proc - process
> >> >> > > information pseudo-filesystem". So its "information" right?
> >> >> >
> >> >> > Why should userspace care whether a particular operation is "changing
> >> >> > [a] value[] of [a] kernel data structure" or something else? That
> >> >> > something in /proc is a struct field is an implementation detail. It's
> >> >> > the interface semantics that matters, and whether a particular
> >> >> > operation is achieved by changing a struct field or by making a
> >> >> > function call is irrelevant to userspace. Proc is a filesystem about
> >> >> > processes. Why shouldn't you be able to send a signal to a process via
> >> >> > proc? It's an operation involving processes.
> >> >> >
> >> >> > It's already possible to do things *to* processes via proc, e.g.,
> >> >> > adjust OOM killer scores. Proc filesystem file descriptors are
> >> >> > userspace references to kernel-side struct pid instances, and as such,
> >> >> > make good process handles. There are already "verb" files in procfs,
> >> >> > such as /proc/sys/vm/drop_caches and /proc/sysrq-trigger. Why not add
> >> >> > a kill "verb", especially if it closes a race that can't be closed
> >> >> > some other way?
> >> >> >
> >> >> > You could implement this interface as a system call that took a procfs
> >> >> > directory file descriptor, but relative to this proposal, it would be
> >> >> > all downside. Such a thing would act just the same way as
> >> >> > /pric/pid/kill, and wouldn't be usable from the shell or from programs
> >> >> > that didn't want to use syscall(2). (Since glibc isn't adding new
> >> >> > system call wrappers.) AFAIK, the only downside of having a "kill"
> >> >> > file is the need for a string-to-integer conversion, but compared to
> >> >> > process killing, integer parsing is insignificant.
> >> >> >
> >> >> > > IMO without a really good reason for this, it could really be a hard
> >> >> > > sell but the RFC was worth it anyway to discuss it ;-)
> >> >> >
> >> >> > The traditional unix process API is down there at level -10 of Rusty
> >> >> > Russel's old bad API scale: "It's impossible to get right". The races
> >> >> > in the current API are unavoidable. That most programs don't hit these
> >> >> > races most of the time doesn't mean that the race isn't present.
> >> >> >
> >> >> > We've moved to a model where we identify other system resources, like
> >> >> > DRM fences, locks, sockets, and everything else via file descriptors.
> >> >> > This change is a step toward using procfs file descriptors to work
> >> >> > with processes, which makes the system more regular and easier to
> >> >> > reason about. A clean API that's possible to use correctly is a
> >> >> > worthwhile project.
> >> >>
> >> >> So I have been disucssing a new process API With David Howells, Kees
> >> >> Cook and a few others and I am working on an RFC/proposal for this. It
> >> >> is partially inspired by the new mount API. So I would like to block
> >> >> this patch until then. I would like to get this right very much and
> >>
> >> It's good to hear that others are thinking about this problem.
> >>
> >> >> I
> >> >> don't think this is the way to go.
> >
> > Because we want this to be generic and things like getting handles on
> > processes via /proc is just a part of that
>
> The word "generic" is like the word "secure": it's hard to tell what

Ha, trolling hard but true. :)

> it means in isolation. :-) Over what domain do we need to be generic?
> Procfs file descriptors already work on processes generally, and they
> allow for race-free access to anything that's reachable via a procfs
> pid directory. In what way would an alternate approach be even more
> generic?
>
> >> Why not?
> >>
> >> Does your proposed API allow for a race-free pkill, with arbitrary
> >> selection criteria? This capability is a good litmus test for fixing
> >> the long-standing Unix process API issues.
> >
> > You'd have a handle on the process with an fd so yes, it would be.
>
> Thanks. That's good to hear.
>
> Any idea on the timetable for this proposal? I'm open to lots of
> alternative technical approaches, but I don't want this capability to
> languish for a long time.

Latest end of year likely sooner depending on the feedback I'm getting
during LPC.

2018-10-30 17:02:29

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 1:50 AM Daniel Colascione <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes <[email protected]> wrote:
> > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione <[email protected]> wrote:
> >>
> >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> write the signal number in base-10 ASCII to the kill file of the
> >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >>
> >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> process ID comes from the proc filesystem context instead of from an
> >> explicit system call parameter. This way, it's possible to avoid races
> >> between inspecting some aspect of a process and that process's PID
> >> being reused for some other process.
> >>
> >> With /proc/pid/kill, it's possible to write a proper race-free and
> >> safe pkill(1). An approximation follows. A real program might use
> >> openat(2), having opened a process's /proc/pid directory explicitly,
> >> with the directory file descriptor serving as a sort of "process
> >> handle".
> >
> > How long does the 'inspection' procedure take? If its a short
> > duration, then is PID reuse really an issue, I mean the PIDs are not
> > reused until wrap around and the only reason this can be a problem is
> > if you have the wrap around while the 'inspecting some aspect'
> > procedure takes really long.
>
> It's a race. Would you make similar statements about a similar fix for
> a race condition involving a mutex and a double-free just because the
> race didn't crash most of the time? The issue I'm trying to fix here
> is the same problem, one level higher up in the abstraction hierarchy.

I was just curious that if this was a real issue you are hitting in a
production system, it wasn't clear from the commit message. When I
read your commit I thought "Does the inspection process take that long
that we wrap around an entire PID range?". So perhaps you should amend
your commit message to address that it is not really a problem you ARE
seeing, but rather something you anticipate and that this patch would
be a nice-to-have to avoid that. Typically there should be good
reasons/real-usecases to add a new interface to the kernel. Linus has
repeatedly rejected new interfaces on the grounds of non existent
use-cases or non real-world use cases. Again if I am missing something
here, then please improve the commit message so others don't have
similar questions :) Its completely upto you though..

> > IMO without a really good reason for this, it could really be a hard
> > sell but the RFC was worth it anyway to discuss it ;-)
>
> The traditional unix process API is down there at level -10 of Rusty
> Russel's old bad API scale: "It's impossible to get right". The races
> in the current API are unavoidable. That most programs don't hit these
> races most of the time doesn't mean that the race isn't present.
>
> We've moved to a model where we identify other system resources, like
> DRM fences, locks, sockets, and everything else via file descriptors.
> This change is a step toward using procfs file descriptors to work
> with processes, which makes the system more regular and easier to
> reason about. A clean API that's possible to use correctly is a
> worthwhile project.

Ok, agreed. thanks,

- Joel

2018-10-30 21:09:52

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 2018-10-30, Daniel Colascione <[email protected]> wrote:
> >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> write the signal number in base-10 ASCII to the kill file of the
> >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >>
> >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> process ID comes from the proc filesystem context instead of from an
> >> explicit system call parameter. This way, it's possible to avoid races
> >> between inspecting some aspect of a process and that process's PID
> >> being reused for some other process.
> >
> > (Aside from any UX concerns other folks might have.)
> >
> > I think it would be a good idea to (at least temporarily) restrict this
> > so that only processes that are in the same PID namespace as the /proc
> > being resolved through may use this interface. Otherwise you might have
> > cases where partial container breakouts can start sending signals to
> > PIDs they wouldn't normally be able to address.
>
> That's a good idea.

(Oh and I wonder how this interacts with SELinux/AppArmor signal
mediation.)

> > (Unfortunately
> > there are lots of things that make it a bit difficult to use /proc/$pid
> > exclusively for introspection of a process -- especially in the context
> > of containers.)
>
> Tons of things already break without a working /proc. What do you have in mind?

Heh, if only that was the only blocker. :P

The basic problem is that currently container runtimes either depend on
some non-transient on-disk state (which becomes invalid on machine
reboots or dead processes and so on), or on long-running processes that
keep file descriptors required for administration of a container alive
(think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
attacks). Usually both.

What would be really useful would be having some way of "hiding away" a
mount namespace (of the pid1 of the container) that has all of the
information and bind-mounts-to-file-descriptors that are necessary for
administration. If the container's pid1 dies all of the transient state
has disappeared automatically -- because the stashed mount namespace has
died. In addition, if this was done the way I'm thinking with (and this
is the contentious bit) hierarchical mount namespaces you could make it
so that the pid1 could not manipulate its current mount namespace to
confuse the administrative process. You would also then create an
intermediate user namespace to help with several race conditions (that
have caused security bugs like CVE-2016-9962) we've seen when joining
containers.

Unfortunately this all depends on hierarchical mount namespaces (and
note that this would just be that NS_GET_PARENT gives you the mount
namespace that it was created in -- I'm not suggesting we redesign peers
or anything like that). This makes it basically a non-starter.

But if, on top of this ground-work, we then referenced containers
entirely via an fd to /proc/$pid then you could also avoid PID reuse
races (as well as being able to find out implicitly whether a container
has died thanks to the error semantics of /proc/$pid). And that's the
way I would suggest doing it (if we had these other things in place).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (3.32 kB)
signature.asc (849.00 B)
Download all attachments

2018-10-30 22:16:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
[...]
> > > (Unfortunately
> > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > exclusively for introspection of a process -- especially in the context
> > > of containers.)
> >
> > Tons of things already break without a working /proc. What do you have in mind?
>
> Heh, if only that was the only blocker. :P
>
> The basic problem is that currently container runtimes either depend on
> some non-transient on-disk state (which becomes invalid on machine
> reboots or dead processes and so on), or on long-running processes that
> keep file descriptors required for administration of a container alive
> (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> attacks). Usually both.
>
> What would be really useful would be having some way of "hiding away" a
> mount namespace (of the pid1 of the container) that has all of the
> information and bind-mounts-to-file-descriptors that are necessary for
> administration. If the container's pid1 dies all of the transient state
> has disappeared automatically -- because the stashed mount namespace has
> died. In addition, if this was done the way I'm thinking with (and this
> is the contentious bit) hierarchical mount namespaces you could make it
> so that the pid1 could not manipulate its current mount namespace to
> confuse the administrative process. You would also then create an
> intermediate user namespace to help with several race conditions (that
> have caused security bugs like CVE-2016-9962) we've seen when joining
> containers.
>
> Unfortunately this all depends on hierarchical mount namespaces (and
> note that this would just be that NS_GET_PARENT gives you the mount
> namespace that it was created in -- I'm not suggesting we redesign peers
> or anything like that). This makes it basically a non-starter.
>
> But if, on top of this ground-work, we then referenced containers
> entirely via an fd to /proc/$pid then you could also avoid PID reuse
> races (as well as being able to find out implicitly whether a container
> has died thanks to the error semantics of /proc/$pid). And that's the
> way I would suggest doing it (if we had these other things in place).

I didn't fully follow exactly what you mean. If you can explain for the
layman who doesn't know much experience with containers..

Are you saying that keeping open a /proc/$pid directory handle is not
sufficient to prevent PID reuse while the proc entries under /proc/$pid are
being looked into? If its not sufficient, then isn't that a bug? If it is
sufficient, then can we not just keep the handle open while we do whatever we
want under /proc/$pid ?

- Joel


2018-10-30 22:24:44

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> [...]
> > > > (Unfortunately
> > > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > > exclusively for introspection of a process -- especially in the context
> > > > of containers.)
> > >
> > > Tons of things already break without a working /proc. What do you have in mind?
> >
> > Heh, if only that was the only blocker. :P
> >
> > The basic problem is that currently container runtimes either depend on
> > some non-transient on-disk state (which becomes invalid on machine
> > reboots or dead processes and so on), or on long-running processes that
> > keep file descriptors required for administration of a container alive
> > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > attacks). Usually both.
> >
> > What would be really useful would be having some way of "hiding away" a
> > mount namespace (of the pid1 of the container) that has all of the
> > information and bind-mounts-to-file-descriptors that are necessary for
> > administration. If the container's pid1 dies all of the transient state
> > has disappeared automatically -- because the stashed mount namespace has
> > died. In addition, if this was done the way I'm thinking with (and this
> > is the contentious bit) hierarchical mount namespaces you could make it
> > so that the pid1 could not manipulate its current mount namespace to
> > confuse the administrative process. You would also then create an
> > intermediate user namespace to help with several race conditions (that
> > have caused security bugs like CVE-2016-9962) we've seen when joining
> > containers.
> >
> > Unfortunately this all depends on hierarchical mount namespaces (and
> > note that this would just be that NS_GET_PARENT gives you the mount
> > namespace that it was created in -- I'm not suggesting we redesign peers
> > or anything like that). This makes it basically a non-starter.
> >
> > But if, on top of this ground-work, we then referenced containers
> > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > races (as well as being able to find out implicitly whether a container
> > has died thanks to the error semantics of /proc/$pid). And that's the
> > way I would suggest doing it (if we had these other things in place).
>
> I didn't fully follow exactly what you mean. If you can explain for the
> layman who doesn't know much experience with containers..
>
> Are you saying that keeping open a /proc/$pid directory handle is not
> sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> being looked into? If its not sufficient, then isn't that a bug? If it is
> sufficient, then can we not just keep the handle open while we do whatever we
> want under /proc/$pid ?

Sorry, I went on a bit of a tangent about various internals of container
runtimes. My main point is that I would love to use /proc/$pid because
it makes reuse handling very trivial and is always correct, but that
there are things which stop us from being able to use it for everything
(which is what my incoherent rambling was on about).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (3.29 kB)
signature.asc (849.00 B)
Download all attachments

2018-10-30 22:34:43

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> > [...]
> > > > > (Unfortunately
> > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > > > exclusively for introspection of a process -- especially in the context
> > > > > of containers.)
> > > >
> > > > Tons of things already break without a working /proc. What do you have in mind?
> > >
> > > Heh, if only that was the only blocker. :P
> > >
> > > The basic problem is that currently container runtimes either depend on
> > > some non-transient on-disk state (which becomes invalid on machine
> > > reboots or dead processes and so on), or on long-running processes that
> > > keep file descriptors required for administration of a container alive
> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > attacks). Usually both.
> > >
> > > What would be really useful would be having some way of "hiding away" a
> > > mount namespace (of the pid1 of the container) that has all of the
> > > information and bind-mounts-to-file-descriptors that are necessary for
> > > administration. If the container's pid1 dies all of the transient state
> > > has disappeared automatically -- because the stashed mount namespace has
> > > died. In addition, if this was done the way I'm thinking with (and this
> > > is the contentious bit) hierarchical mount namespaces you could make it
> > > so that the pid1 could not manipulate its current mount namespace to
> > > confuse the administrative process. You would also then create an
> > > intermediate user namespace to help with several race conditions (that
> > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > containers.
> > >
> > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > note that this would just be that NS_GET_PARENT gives you the mount
> > > namespace that it was created in -- I'm not suggesting we redesign peers
> > > or anything like that). This makes it basically a non-starter.
> > >
> > > But if, on top of this ground-work, we then referenced containers
> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > races (as well as being able to find out implicitly whether a container
> > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > way I would suggest doing it (if we had these other things in place).
> >
> > I didn't fully follow exactly what you mean. If you can explain for the
> > layman who doesn't know much experience with containers..
> >
> > Are you saying that keeping open a /proc/$pid directory handle is not
> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> > being looked into? If its not sufficient, then isn't that a bug? If it is
> > sufficient, then can we not just keep the handle open while we do whatever we
> > want under /proc/$pid ?
>
> Sorry, I went on a bit of a tangent about various internals of container
> runtimes. My main point is that I would love to use /proc/$pid because
> it makes reuse handling very trivial and is always correct, but that
> there are things which stop us from being able to use it for everything
> (which is what my incoherent rambling was on about).

Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
needed.

1. open /proc/<pid> directory
2. inspect /proc/<pid> or do whatever with <pid>
3. Issue the kill on <pid>
4. Close the /proc/<pid> directory opened in step 1.

So unless I missed something, the above sequence will not cause any PID reuse
races.

- Joel



2018-10-30 23:12:49

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 10:33 PM, Joel Fernandes <[email protected]> wrote:
> On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
>> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
>> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
>> > [...]
>> > > > > (Unfortunately
>> > > > > there are lots of things that make it a bit difficult to use /proc/$pid
>> > > > > exclusively for introspection of a process -- especially in the context
>> > > > > of containers.)
>> > > >
>> > > > Tons of things already break without a working /proc. What do you have in mind?
>> > >
>> > > Heh, if only that was the only blocker. :P
>> > >
>> > > The basic problem is that currently container runtimes either depend on
>> > > some non-transient on-disk state (which becomes invalid on machine
>> > > reboots or dead processes and so on), or on long-running processes that
>> > > keep file descriptors required for administration of a container alive
>> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
>> > > attacks). Usually both.
>> > >
>> > > What would be really useful would be having some way of "hiding away" a
>> > > mount namespace (of the pid1 of the container) that has all of the
>> > > information and bind-mounts-to-file-descriptors that are necessary for
>> > > administration. If the container's pid1 dies all of the transient state
>> > > has disappeared automatically -- because the stashed mount namespace has
>> > > died. In addition, if this was done the way I'm thinking with (and this
>> > > is the contentious bit) hierarchical mount namespaces you could make it
>> > > so that the pid1 could not manipulate its current mount namespace to
>> > > confuse the administrative process. You would also then create an
>> > > intermediate user namespace to help with several race conditions (that
>> > > have caused security bugs like CVE-2016-9962) we've seen when joining
>> > > containers.
>> > >
>> > > Unfortunately this all depends on hierarchical mount namespaces (and
>> > > note that this would just be that NS_GET_PARENT gives you the mount
>> > > namespace that it was created in -- I'm not suggesting we redesign peers
>> > > or anything like that). This makes it basically a non-starter.
>> > >
>> > > But if, on top of this ground-work, we then referenced containers
>> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
>> > > races (as well as being able to find out implicitly whether a container
>> > > has died thanks to the error semantics of /proc/$pid). And that's the
>> > > way I would suggest doing it (if we had these other things in place).
>> >
>> > I didn't fully follow exactly what you mean. If you can explain for the
>> > layman who doesn't know much experience with containers..
>> >
>> > Are you saying that keeping open a /proc/$pid directory handle is not
>> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
>> > being looked into? If its not sufficient, then isn't that a bug? If it is
>> > sufficient, then can we not just keep the handle open while we do whatever we
>> > want under /proc/$pid ?
>>
>> Sorry, I went on a bit of a tangent about various internals of container
>> runtimes. My main point is that I would love to use /proc/$pid because
>> it makes reuse handling very trivial and is always correct, but that
>> there are things which stop us from being able to use it for everything
>> (which is what my incoherent rambling was on about).
>
> Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> needed.
>
> 1. open /proc/<pid> directory
> 2. inspect /proc/<pid> or do whatever with <pid>
> 3. Issue the kill on <pid>
> 4. Close the /proc/<pid> directory opened in step 1.
>
> So unless I missed something, the above sequence will not cause any PID reuse
> races.

Keeping a /proc/$PID directory file descriptor open does not prevent
$PID being used to name some other process. If it could, you could
pretty quickly fill a whole system's process table. See the program
below, which demonstrates the PID collision.

I think Aleksa's larger point is that it's useful to treat processes
as other file-descriptor-named, poll-able, wait-able resources.
Consistency is important. A process is just another system resource,
and like any other system resource, you should be open to hold a file
descriptor to it and do things to that process via that file
descriptor. The precise form of this process-handle FD is up for
debate. The existing /proc/$PID directory FD is a good candidate for a
process handle FD, since it does almost all of what's needed. But
regardless of what form a process handle FD takes, we need it. I don't
see a case for continuing to treat processes in a non-unixy,
non-file-descriptor-based manner.

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main()
{
int child_pid = fork();
if (child_pid < 0)
abort();

char buf[64];
int child_procfs_fd;

if (child_pid == 0) {
for (;;)
pause();
abort();
}

printf("child PID is %d\n", child_pid);
sprintf(buf, "/proc/%d", child_pid);
child_procfs_fd = open(buf, O_DIRECTORY | O_RDONLY);
if (child_procfs_fd < 0)
abort();
printf("FD# of open /proc/%d is %d\n",
child_pid,
child_procfs_fd);
printf("killing child with SIGKILL\n");
kill(child_pid, SIGKILL);
if (wait(NULL) != child_pid)
abort();
printf("child is now dead. PROCFS FD STILL OPEN\n");
for (;;) {
int new_child_pid = fork();
if (new_child_pid < 0)
abort();
if (new_child_pid == 0)
_exit(0);
// printf("new child PID: %d\n", new_child_pid);
if (wait(NULL) != new_child_pid)
abort();
if (new_child_pid == child_pid) {
printf("FOUND PID COLLISION %d\n", child_pid);
printf("old child had pid %d. new, "
"different child has pid %d. "
"procfs directory for old child still open!\n",
child_pid, child_pid);
break;
}
}

return 0;
}

2018-10-30 23:16:23

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> > > [...]
> > > > > > (Unfortunately
> > > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > > > > exclusively for introspection of a process -- especially in the context
> > > > > > of containers.)
> > > > >
> > > > > Tons of things already break without a working /proc. What do you have in mind?
> > > >
> > > > Heh, if only that was the only blocker. :P
> > > >
> > > > The basic problem is that currently container runtimes either depend on
> > > > some non-transient on-disk state (which becomes invalid on machine
> > > > reboots or dead processes and so on), or on long-running processes that
> > > > keep file descriptors required for administration of a container alive
> > > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > > attacks). Usually both.
> > > >
> > > > What would be really useful would be having some way of "hiding away" a
> > > > mount namespace (of the pid1 of the container) that has all of the
> > > > information and bind-mounts-to-file-descriptors that are necessary for
> > > > administration. If the container's pid1 dies all of the transient state
> > > > has disappeared automatically -- because the stashed mount namespace has
> > > > died. In addition, if this was done the way I'm thinking with (and this
> > > > is the contentious bit) hierarchical mount namespaces you could make it
> > > > so that the pid1 could not manipulate its current mount namespace to
> > > > confuse the administrative process. You would also then create an
> > > > intermediate user namespace to help with several race conditions (that
> > > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > > containers.
> > > >
> > > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > > note that this would just be that NS_GET_PARENT gives you the mount
> > > > namespace that it was created in -- I'm not suggesting we redesign peers
> > > > or anything like that). This makes it basically a non-starter.
> > > >
> > > > But if, on top of this ground-work, we then referenced containers
> > > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > > races (as well as being able to find out implicitly whether a container
> > > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > > way I would suggest doing it (if we had these other things in place).
> > >
> > > I didn't fully follow exactly what you mean. If you can explain for the
> > > layman who doesn't know much experience with containers..
> > >
> > > Are you saying that keeping open a /proc/$pid directory handle is not
> > > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> > > being looked into? If its not sufficient, then isn't that a bug? If it is
> > > sufficient, then can we not just keep the handle open while we do whatever we
> > > want under /proc/$pid ?
> >
> > Sorry, I went on a bit of a tangent about various internals of container
> > runtimes. My main point is that I would love to use /proc/$pid because
> > it makes reuse handling very trivial and is always correct, but that
> > there are things which stop us from being able to use it for everything
> > (which is what my incoherent rambling was on about).
>
> Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> needed.
>
> 1. open /proc/<pid> directory
> 2. inspect /proc/<pid> or do whatever with <pid>
> 3. Issue the kill on <pid>
> 4. Close the /proc/<pid> directory opened in step 1.
>
> So unless I missed something, the above sequence will not cause any PID reuse
> races.

(Sorry, I misunderstood your original question.)

The problem is that holding /proc/$pid doesn't stop the PID from dying
and being reused. The benefit of holding open /proc/$pid is that you
will get an error if you try to use it *after* the PID has died -- which
means that you don't need to worry about explicitly checking for PID
reuse if you are only operating with the file descriptor and not the
PID.

So that sequence won't always work. There is a race where the pid might
die and be recycled by the time you call kill(2) -- after you've done
step 2. By tying step 2 and 3 together -- in this patch -- you remove
the race (since in order to resolve the "kill" procfs file VFS must
resolve the PID first -- atomically).

Though this race window is likely very tiny, and I wonder how much PID
churn you really need to hit it.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (4.64 kB)
signature.asc (849.00 B)
Download all attachments

2018-10-30 23:57:20

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 11:23 PM, Christian Brauner
<[email protected]> wrote:
> On Wed, Oct 31, 2018 at 12:10 AM Daniel Colascione <[email protected]> wrote:
>> I think Aleksa's larger point is that it's useful to treat processes
>> as other file-descriptor-named, poll-able, wait-able resources.
>> Consistency is important. A process is just another system resource,
>> and like any other system resource, you should be open to hold a file
>> descriptor to it and do things to that process via that file
>> descriptor. The precise form of this process-handle FD is up for
>> debate. The existing /proc/$PID directory FD is a good candidate for a
>> process handle FD, since it does almost all of what's needed. But
>> regardless of what form a process handle FD takes, we need it. I don't
>> see a case for continuing to treat processes in a non-unixy,
>> non-file-descriptor-based manner.
>
> That's what I'm proposing in the API for which I'm gathering feedback.
> I have presented parts of this in various discussions at LSS Europe last week
> and will be at LPC.
> We don't want to rush an API like this though. It was tried before in
> other forms
> and these proposals didn't make it.

And I'm looking forward to that proposal.

AIUI, one of the reasons previous proposals in this area failed is
that they'd have their process handles pin entire task_struct
instances and keep PID-table entries reserved as long as handles were
kept open. If you keep a PID reserved as long as you have an open
handle to a process, then existing PID-taking interfaces become
race-free automatically, so there's little new API. (This
PID-preserving approach is the one Windows uses, FWIW.)

An alternate approach is that have your process "handle" reference a
struct pid instead of a numeric PID or task_struct. You can't prevent
PID reuse this way, but you can at least distinguish between different
processes that have used the same PID, so you still get race freedom.
The downside of this approach is that since you don't prevent PID
reuse, existing PID-accepting interfaces remain racy and so, since you
need to let people name processes by process handle instead of by PID
whenever they want to do something with a process, you need to provide
an FD-accepting version of each current pid_t-accepting system call,
e.g. ptrace.

2018-10-31 00:07:52

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 12:10 AM Daniel Colascione <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 10:33 PM, Joel Fernandes <[email protected]> wrote:
> > On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
> >> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> >> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> >> > [...]
> >> > > > > (Unfortunately
> >> > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> >> > > > > exclusively for introspection of a process -- especially in the context
> >> > > > > of containers.)
> >> > > >
> >> > > > Tons of things already break without a working /proc. What do you have in mind?
> >> > >
> >> > > Heh, if only that was the only blocker. :P
> >> > >
> >> > > The basic problem is that currently container runtimes either depend on
> >> > > some non-transient on-disk state (which becomes invalid on machine
> >> > > reboots or dead processes and so on), or on long-running processes that
> >> > > keep file descriptors required for administration of a container alive
> >> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> >> > > attacks). Usually both.
> >> > >
> >> > > What would be really useful would be having some way of "hiding away" a
> >> > > mount namespace (of the pid1 of the container) that has all of the
> >> > > information and bind-mounts-to-file-descriptors that are necessary for
> >> > > administration. If the container's pid1 dies all of the transient state
> >> > > has disappeared automatically -- because the stashed mount namespace has
> >> > > died. In addition, if this was done the way I'm thinking with (and this
> >> > > is the contentious bit) hierarchical mount namespaces you could make it
> >> > > so that the pid1 could not manipulate its current mount namespace to
> >> > > confuse the administrative process. You would also then create an
> >> > > intermediate user namespace to help with several race conditions (that
> >> > > have caused security bugs like CVE-2016-9962) we've seen when joining
> >> > > containers.
> >> > >
> >> > > Unfortunately this all depends on hierarchical mount namespaces (and
> >> > > note that this would just be that NS_GET_PARENT gives you the mount
> >> > > namespace that it was created in -- I'm not suggesting we redesign peers
> >> > > or anything like that). This makes it basically a non-starter.
> >> > >
> >> > > But if, on top of this ground-work, we then referenced containers
> >> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> >> > > races (as well as being able to find out implicitly whether a container
> >> > > has died thanks to the error semantics of /proc/$pid). And that's the
> >> > > way I would suggest doing it (if we had these other things in place).
> >> >
> >> > I didn't fully follow exactly what you mean. If you can explain for the
> >> > layman who doesn't know much experience with containers..
> >> >
> >> > Are you saying that keeping open a /proc/$pid directory handle is not
> >> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> >> > being looked into? If its not sufficient, then isn't that a bug? If it is
> >> > sufficient, then can we not just keep the handle open while we do whatever we
> >> > want under /proc/$pid ?
> >>
> >> Sorry, I went on a bit of a tangent about various internals of container
> >> runtimes. My main point is that I would love to use /proc/$pid because
> >> it makes reuse handling very trivial and is always correct, but that
> >> there are things which stop us from being able to use it for everything
> >> (which is what my incoherent rambling was on about).
> >
> > Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> > needed.
> >
> > 1. open /proc/<pid> directory
> > 2. inspect /proc/<pid> or do whatever with <pid>
> > 3. Issue the kill on <pid>
> > 4. Close the /proc/<pid> directory opened in step 1.
> >
> > So unless I missed something, the above sequence will not cause any PID reuse
> > races.
>
> Keeping a /proc/$PID directory file descriptor open does not prevent
> $PID being used to name some other process. If it could, you could
> pretty quickly fill a whole system's process table. See the program
> below, which demonstrates the PID collision.
>
> I think Aleksa's larger point is that it's useful to treat processes
> as other file-descriptor-named, poll-able, wait-able resources.
> Consistency is important. A process is just another system resource,
> and like any other system resource, you should be open to hold a file
> descriptor to it and do things to that process via that file
> descriptor. The precise form of this process-handle FD is up for
> debate. The existing /proc/$PID directory FD is a good candidate for a
> process handle FD, since it does almost all of what's needed. But
> regardless of what form a process handle FD takes, we need it. I don't
> see a case for continuing to treat processes in a non-unixy,
> non-file-descriptor-based manner.

That's what I'm proposing in the API for which I'm gathering feedback.
I have presented parts of this in various discussions at LSS Europe last week
and will be at LPC.
We don't want to rush an API like this though. It was tried before in
other forms
and these proposals didn't make it.


>
> #include <fcntl.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> int
> main()
> {
> int child_pid = fork();
> if (child_pid < 0)
> abort();
>
> char buf[64];
> int child_procfs_fd;
>
> if (child_pid == 0) {
> for (;;)
> pause();
> abort();
> }
>
> printf("child PID is %d\n", child_pid);
> sprintf(buf, "/proc/%d", child_pid);
> child_procfs_fd = open(buf, O_DIRECTORY | O_RDONLY);
> if (child_procfs_fd < 0)
> abort();
> printf("FD# of open /proc/%d is %d\n",
> child_pid,
> child_procfs_fd);
> printf("killing child with SIGKILL\n");
> kill(child_pid, SIGKILL);
> if (wait(NULL) != child_pid)
> abort();
> printf("child is now dead. PROCFS FD STILL OPEN\n");
> for (;;) {
> int new_child_pid = fork();
> if (new_child_pid < 0)
> abort();
> if (new_child_pid == 0)
> _exit(0);
> // printf("new child PID: %d\n", new_child_pid);
> if (wait(NULL) != new_child_pid)
> abort();
> if (new_child_pid == child_pid) {
> printf("FOUND PID COLLISION %d\n", child_pid);
> printf("old child had pid %d. new, "
> "different child has pid %d. "
> "procfs directory for old child still open!\n",
> child_pid, child_pid);
> break;
> }
> }
>
> return 0;
> }

2018-10-31 01:25:08

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 09:49:08AM +1100, Aleksa Sarai wrote:
> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> > > > [...]
> > > > > > > (Unfortunately
> > > > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > > > > > exclusively for introspection of a process -- especially in the context
> > > > > > > of containers.)
> > > > > >
> > > > > > Tons of things already break without a working /proc. What do you have in mind?
> > > > >
> > > > > Heh, if only that was the only blocker. :P
> > > > >
> > > > > The basic problem is that currently container runtimes either depend on
> > > > > some non-transient on-disk state (which becomes invalid on machine
> > > > > reboots or dead processes and so on), or on long-running processes that
> > > > > keep file descriptors required for administration of a container alive
> > > > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > > > attacks). Usually both.
> > > > >
> > > > > What would be really useful would be having some way of "hiding away" a
> > > > > mount namespace (of the pid1 of the container) that has all of the
> > > > > information and bind-mounts-to-file-descriptors that are necessary for
> > > > > administration. If the container's pid1 dies all of the transient state
> > > > > has disappeared automatically -- because the stashed mount namespace has
> > > > > died. In addition, if this was done the way I'm thinking with (and this
> > > > > is the contentious bit) hierarchical mount namespaces you could make it
> > > > > so that the pid1 could not manipulate its current mount namespace to
> > > > > confuse the administrative process. You would also then create an
> > > > > intermediate user namespace to help with several race conditions (that
> > > > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > > > containers.
> > > > >
> > > > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > > > note that this would just be that NS_GET_PARENT gives you the mount
> > > > > namespace that it was created in -- I'm not suggesting we redesign peers
> > > > > or anything like that). This makes it basically a non-starter.
> > > > >
> > > > > But if, on top of this ground-work, we then referenced containers
> > > > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > > > races (as well as being able to find out implicitly whether a container
> > > > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > > > way I would suggest doing it (if we had these other things in place).
> > > >
> > > > I didn't fully follow exactly what you mean. If you can explain for the
> > > > layman who doesn't know much experience with containers..
> > > >
> > > > Are you saying that keeping open a /proc/$pid directory handle is not
> > > > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> > > > being looked into? If its not sufficient, then isn't that a bug? If it is
> > > > sufficient, then can we not just keep the handle open while we do whatever we
> > > > want under /proc/$pid ?
> > >
> > > Sorry, I went on a bit of a tangent about various internals of container
> > > runtimes. My main point is that I would love to use /proc/$pid because
> > > it makes reuse handling very trivial and is always correct, but that
> > > there are things which stop us from being able to use it for everything
> > > (which is what my incoherent rambling was on about).
> >
> > Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> > needed.
> >
> > 1. open /proc/<pid> directory
> > 2. inspect /proc/<pid> or do whatever with <pid>
> > 3. Issue the kill on <pid>
> > 4. Close the /proc/<pid> directory opened in step 1.
> >
> > So unless I missed something, the above sequence will not cause any PID reuse
> > races.
>
> (Sorry, I misunderstood your original question.)
>
> The problem is that holding /proc/$pid doesn't stop the PID from dying
> and being reused. The benefit of holding open /proc/$pid is that you
> will get an error if you try to use it *after* the PID has died -- which
> means that you don't need to worry about explicitly checking for PID
> reuse if you are only operating with the file descriptor and not the
> PID.
>
> So that sequence won't always work. There is a race where the pid might
> die and be recycled by the time you call kill(2) -- after you've done
> step 2. By tying step 2 and 3 together -- in this patch -- you remove
> the race (since in order to resolve the "kill" procfs file VFS must
> resolve the PID first -- atomically).

Makes sense, thanks.

> Though this race window is likely very tiny, and I wonder how much PID
> churn you really need to hit it.

Yeah that's what I asked initially how much of a problem it really is.

Also, I am wondering why the implementation does not want to keep a reference
to the task_struct for the duration of any open proc files/directories. Is
there a good reason?

- Joel


2018-10-31 01:28:42

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 11:10:47PM +0000, Daniel Colascione wrote:
> On Tue, Oct 30, 2018 at 10:33 PM, Joel Fernandes <[email protected]> wrote:
> > On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
> >> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
> >> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> >> > [...]
> >> > > > > (Unfortunately
> >> > > > > there are lots of things that make it a bit difficult to use /proc/$pid
> >> > > > > exclusively for introspection of a process -- especially in the context
> >> > > > > of containers.)
> >> > > >
> >> > > > Tons of things already break without a working /proc. What do you have in mind?
> >> > >
> >> > > Heh, if only that was the only blocker. :P
> >> > >
> >> > > The basic problem is that currently container runtimes either depend on
> >> > > some non-transient on-disk state (which becomes invalid on machine
> >> > > reboots or dead processes and so on), or on long-running processes that
> >> > > keep file descriptors required for administration of a container alive
> >> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> >> > > attacks). Usually both.
> >> > >
> >> > > What would be really useful would be having some way of "hiding away" a
> >> > > mount namespace (of the pid1 of the container) that has all of the
> >> > > information and bind-mounts-to-file-descriptors that are necessary for
> >> > > administration. If the container's pid1 dies all of the transient state
> >> > > has disappeared automatically -- because the stashed mount namespace has
> >> > > died. In addition, if this was done the way I'm thinking with (and this
> >> > > is the contentious bit) hierarchical mount namespaces you could make it
> >> > > so that the pid1 could not manipulate its current mount namespace to
> >> > > confuse the administrative process. You would also then create an
> >> > > intermediate user namespace to help with several race conditions (that
> >> > > have caused security bugs like CVE-2016-9962) we've seen when joining
> >> > > containers.
> >> > >
> >> > > Unfortunately this all depends on hierarchical mount namespaces (and
> >> > > note that this would just be that NS_GET_PARENT gives you the mount
> >> > > namespace that it was created in -- I'm not suggesting we redesign peers
> >> > > or anything like that). This makes it basically a non-starter.
> >> > >
> >> > > But if, on top of this ground-work, we then referenced containers
> >> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> >> > > races (as well as being able to find out implicitly whether a container
> >> > > has died thanks to the error semantics of /proc/$pid). And that's the
> >> > > way I would suggest doing it (if we had these other things in place).
> >> >
> >> > I didn't fully follow exactly what you mean. If you can explain for the
> >> > layman who doesn't know much experience with containers..
> >> >
> >> > Are you saying that keeping open a /proc/$pid directory handle is not
> >> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> >> > being looked into? If its not sufficient, then isn't that a bug? If it is
> >> > sufficient, then can we not just keep the handle open while we do whatever we
> >> > want under /proc/$pid ?
> >>
> >> Sorry, I went on a bit of a tangent about various internals of container
> >> runtimes. My main point is that I would love to use /proc/$pid because
> >> it makes reuse handling very trivial and is always correct, but that
> >> there are things which stop us from being able to use it for everything
> >> (which is what my incoherent rambling was on about).
> >
> > Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
> > needed.
> >
> > 1. open /proc/<pid> directory
> > 2. inspect /proc/<pid> or do whatever with <pid>
> > 3. Issue the kill on <pid>
> > 4. Close the /proc/<pid> directory opened in step 1.
> >
> > So unless I missed something, the above sequence will not cause any PID reuse
> > races.
>
> Keeping a /proc/$PID directory file descriptor open does not prevent
> $PID being used to name some other process. If it could, you could
> pretty quickly fill a whole system's process table. See the program
> below, which demonstrates the PID collision.

I know. We both were not sure about that earlier, that's why I requested you
to write the program when we were privately chatting. Now I'm sure because
Aleska answered that and the you program you wrote showed that too.

I wonder if this cannot be plumbed by just making the /proc/$PID directory
opens hold a reference to task_struct (and a reference to whatever else is
supposed to prevent the PID from getting reused), instead of introducing a
brand new API.

> I think Aleksa's larger point is that it's useful to treat processes
> as other file-descriptor-named, poll-able, wait-able resources.
> Consistency is important. A process is just another system resource,
> and like any other system resource, you should be open to hold a file
> descriptor to it and do things to that process via that file
> descriptor. The precise form of this process-handle FD is up for
> debate. The existing /proc/$PID directory FD is a good candidate for a
> process handle FD, since it does almost all of what's needed. But
> regardless of what form a process handle FD takes, we need it. I don't
> see a case for continuing to treat processes in a non-unixy,
> non-file-descriptor-based manner.

So wait, how is that supposed to address what you're now saying above
"quickly fill a whole process table"? You either want this, or you don't :)


2018-10-31 01:58:55

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 12:57 AM, Joel Fernandes <[email protected]> wrote:
> On Tue, Oct 30, 2018 at 11:10:47PM +0000, Daniel Colascione wrote:
>> On Tue, Oct 30, 2018 at 10:33 PM, Joel Fernandes <[email protected]> wrote:
>> > On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
>> >> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
>> >> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
>> >> > [...]
>> >> > > > > (Unfortunately
>> >> > > > > there are lots of things that make it a bit difficult to use /proc/$pid
>> >> > > > > exclusively for introspection of a process -- especially in the context
>> >> > > > > of containers.)
>> >> > > >
>> >> > > > Tons of things already break without a working /proc. What do you have in mind?
>> >> > >
>> >> > > Heh, if only that was the only blocker. :P
>> >> > >
>> >> > > The basic problem is that currently container runtimes either depend on
>> >> > > some non-transient on-disk state (which becomes invalid on machine
>> >> > > reboots or dead processes and so on), or on long-running processes that
>> >> > > keep file descriptors required for administration of a container alive
>> >> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
>> >> > > attacks). Usually both.
>> >> > >
>> >> > > What would be really useful would be having some way of "hiding away" a
>> >> > > mount namespace (of the pid1 of the container) that has all of the
>> >> > > information and bind-mounts-to-file-descriptors that are necessary for
>> >> > > administration. If the container's pid1 dies all of the transient state
>> >> > > has disappeared automatically -- because the stashed mount namespace has
>> >> > > died. In addition, if this was done the way I'm thinking with (and this
>> >> > > is the contentious bit) hierarchical mount namespaces you could make it
>> >> > > so that the pid1 could not manipulate its current mount namespace to
>> >> > > confuse the administrative process. You would also then create an
>> >> > > intermediate user namespace to help with several race conditions (that
>> >> > > have caused security bugs like CVE-2016-9962) we've seen when joining
>> >> > > containers.
>> >> > >
>> >> > > Unfortunately this all depends on hierarchical mount namespaces (and
>> >> > > note that this would just be that NS_GET_PARENT gives you the mount
>> >> > > namespace that it was created in -- I'm not suggesting we redesign peers
>> >> > > or anything like that). This makes it basically a non-starter.
>> >> > >
>> >> > > But if, on top of this ground-work, we then referenced containers
>> >> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
>> >> > > races (as well as being able to find out implicitly whether a container
>> >> > > has died thanks to the error semantics of /proc/$pid). And that's the
>> >> > > way I would suggest doing it (if we had these other things in place).
>> >> >
>> >> > I didn't fully follow exactly what you mean. If you can explain for the
>> >> > layman who doesn't know much experience with containers..
>> >> >
>> >> > Are you saying that keeping open a /proc/$pid directory handle is not
>> >> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
>> >> > being looked into? If its not sufficient, then isn't that a bug? If it is
>> >> > sufficient, then can we not just keep the handle open while we do whatever we
>> >> > want under /proc/$pid ?
>> >>
>> >> Sorry, I went on a bit of a tangent about various internals of container
>> >> runtimes. My main point is that I would love to use /proc/$pid because
>> >> it makes reuse handling very trivial and is always correct, but that
>> >> there are things which stop us from being able to use it for everything
>> >> (which is what my incoherent rambling was on about).
>> >
>> > Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
>> > needed.
>> >
>> > 1. open /proc/<pid> directory
>> > 2. inspect /proc/<pid> or do whatever with <pid>
>> > 3. Issue the kill on <pid>
>> > 4. Close the /proc/<pid> directory opened in step 1.
>> >
>> > So unless I missed something, the above sequence will not cause any PID reuse
>> > races.
>>
>> Keeping a /proc/$PID directory file descriptor open does not prevent
>> $PID being used to name some other process. If it could, you could
>> pretty quickly fill a whole system's process table. See the program
>> below, which demonstrates the PID collision.
>
> I know. We both were not sure about that earlier, that's why I requested you
> to write the program when we were privately chatting. Now I'm sure because
> Aleska answered that and the you program you wrote showed that too.

I don't think that this behavior was ever in doubt from my side.

> I wonder if this cannot be plumbed by just making the /proc/$PID directory
> opens hold a reference to task_struct (and a reference to whatever else is
> supposed to prevent the PID from getting reused), instead of introducing a
> brand new API.

That *is* a brand-new API ---- just spelled the same as an old API.
Besides, the PID-preserving handle approach has a problem with
rlimits. In particular, a user that is otherwise limited by
RLIMIT_NPROC could squat on far more entries in the process table than
he could otherwise. (And the whole point of RLIMIT_NPROC is to limit
process table squatting.) You can't just make procfs directory FDs
count against RLIMIT_NPROC, because that'd break existing user code
that assumes that procfs FDs *don't* count against the user process
limit.

>> I think Aleksa's larger point is that it's useful to treat processes
>> as other file-descriptor-named, poll-able, wait-able resources.
>> Consistency is important. A process is just another system resource,
>> and like any other system resource, you should be open to hold a file
>> descriptor to it and do things to that process via that file
>> descriptor. The precise form of this process-handle FD is up for
>> debate. The existing /proc/$PID directory FD is a good candidate for a
>> process handle FD, since it does almost all of what's needed. But
>> regardless of what form a process handle FD takes, we need it. I don't
>> see a case for continuing to treat processes in a non-unixy,
>> non-file-descriptor-based manner.
>
> So wait, how is that supposed to address what you're now saying above
> "quickly fill a whole process table"? You either want this, or you don't :)

I don't understand what you're getting at.

2018-10-31 02:00:42

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 12:42 AM, Joel Fernandes <[email protected]> wrote:
> On Wed, Oct 31, 2018 at 09:49:08AM +1100, Aleksa Sarai wrote:
>> On 2018-10-30, Joel Fernandes <[email protected]> wrote:
>> > > > [...]
>> > > > > > > (Unfortunately
>> > > > > > > there are lots of things that make it a bit difficult to use /proc/$pid
>> > > > > > > exclusively for introspection of a process -- especially in the context
>> > > > > > > of containers.)
>> > > > > >
>> > > > > > Tons of things already break without a working /proc. What do you have in mind?
>> > > > >
>> > > > > Heh, if only that was the only blocker. :P
>> > > > >
>> > > > > The basic problem is that currently container runtimes either depend on
>> > > > > some non-transient on-disk state (which becomes invalid on machine
>> > > > > reboots or dead processes and so on), or on long-running processes that
>> > > > > keep file descriptors required for administration of a container alive
>> > > > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
>> > > > > attacks). Usually both.
>> > > > >
>> > > > > What would be really useful would be having some way of "hiding away" a
>> > > > > mount namespace (of the pid1 of the container) that has all of the
>> > > > > information and bind-mounts-to-file-descriptors that are necessary for
>> > > > > administration. If the container's pid1 dies all of the transient state
>> > > > > has disappeared automatically -- because the stashed mount namespace has
>> > > > > died. In addition, if this was done the way I'm thinking with (and this
>> > > > > is the contentious bit) hierarchical mount namespaces you could make it
>> > > > > so that the pid1 could not manipulate its current mount namespace to
>> > > > > confuse the administrative process. You would also then create an
>> > > > > intermediate user namespace to help with several race conditions (that
>> > > > > have caused security bugs like CVE-2016-9962) we've seen when joining
>> > > > > containers.
>> > > > >
>> > > > > Unfortunately this all depends on hierarchical mount namespaces (and
>> > > > > note that this would just be that NS_GET_PARENT gives you the mount
>> > > > > namespace that it was created in -- I'm not suggesting we redesign peers
>> > > > > or anything like that). This makes it basically a non-starter.
>> > > > >
>> > > > > But if, on top of this ground-work, we then referenced containers
>> > > > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
>> > > > > races (as well as being able to find out implicitly whether a container
>> > > > > has died thanks to the error semantics of /proc/$pid). And that's the
>> > > > > way I would suggest doing it (if we had these other things in place).
>> > > >
>> > > > I didn't fully follow exactly what you mean. If you can explain for the
>> > > > layman who doesn't know much experience with containers..
>> > > >
>> > > > Are you saying that keeping open a /proc/$pid directory handle is not
>> > > > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
>> > > > being looked into? If its not sufficient, then isn't that a bug? If it is
>> > > > sufficient, then can we not just keep the handle open while we do whatever we
>> > > > want under /proc/$pid ?
>> > >
>> > > Sorry, I went on a bit of a tangent about various internals of container
>> > > runtimes. My main point is that I would love to use /proc/$pid because
>> > > it makes reuse handling very trivial and is always correct, but that
>> > > there are things which stop us from being able to use it for everything
>> > > (which is what my incoherent rambling was on about).
>> >
>> > Ok thanks. So I am guessing if the following sequence works, then Dan's patch is not
>> > needed.
>> >
>> > 1. open /proc/<pid> directory
>> > 2. inspect /proc/<pid> or do whatever with <pid>
>> > 3. Issue the kill on <pid>
>> > 4. Close the /proc/<pid> directory opened in step 1.
>> >
>> > So unless I missed something, the above sequence will not cause any PID reuse
>> > races.
>>
>> (Sorry, I misunderstood your original question.)
>>
>> The problem is that holding /proc/$pid doesn't stop the PID from dying
>> and being reused. The benefit of holding open /proc/$pid is that you
>> will get an error if you try to use it *after* the PID has died -- which
>> means that you don't need to worry about explicitly checking for PID
>> reuse if you are only operating with the file descriptor and not the
>> PID.
>>
>> So that sequence won't always work. There is a race where the pid might
>> die and be recycled by the time you call kill(2) -- after you've done
>> step 2. By tying step 2 and 3 together -- in this patch -- you remove
>> the race (since in order to resolve the "kill" procfs file VFS must
>> resolve the PID first -- atomically).
>
> Makes sense, thanks.
>
>> Though this race window is likely very tiny, and I wonder how much PID
>> churn you really need to hit it.
>
> Yeah that's what I asked initially how much of a problem it really is.

It's fundamentally impossible to use the process stuff today in a
race-free manner today. That the race occurs rarely isn't a good
reason to fix it. The fixes people are proposing are all lightweight,
so I don't understand this desire to stick with the status quo.
There's a longstanding API bug here. We can fix it, so we should.

2018-10-31 02:59:06

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 2018-10-31, Christian Brauner <[email protected]> wrote:
> > I think Aleksa's larger point is that it's useful to treat processes
> > as other file-descriptor-named, poll-able, wait-able resources.
> > Consistency is important. A process is just another system resource,
> > and like any other system resource, you should be open to hold a file
> > descriptor to it and do things to that process via that file
> > descriptor. The precise form of this process-handle FD is up for
> > debate. The existing /proc/$PID directory FD is a good candidate for a
> > process handle FD, since it does almost all of what's needed. But
> > regardless of what form a process handle FD takes, we need it. I don't
> > see a case for continuing to treat processes in a non-unixy,
> > non-file-descriptor-based manner.
>
> That's what I'm proposing in the API for which I'm gathering feedback.
> I have presented parts of this in various discussions at LSS Europe last week
> and will be at LPC.
> We don't want to rush an API like this though. It was tried before in
> other forms
> and these proposals didn't make it.

:+1: on a well thought-out and generic proposal. As we've discussed
elsewhere, this is an issue that really would be great to (finally)
solve.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (1.36 kB)
signature.asc (849.00 B)
Download all attachments

2018-10-31 04:26:06

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai <[email protected]> wrote:
> On 2018-10-31, Christian Brauner <[email protected]> wrote:
>> > I think Aleksa's larger point is that it's useful to treat processes
>> > as other file-descriptor-named, poll-able, wait-able resources.
>> > Consistency is important. A process is just another system resource,
>> > and like any other system resource, you should be open to hold a file
>> > descriptor to it and do things to that process via that file
>> > descriptor. The precise form of this process-handle FD is up for
>> > debate. The existing /proc/$PID directory FD is a good candidate for a
>> > process handle FD, since it does almost all of what's needed. But
>> > regardless of what form a process handle FD takes, we need it. I don't
>> > see a case for continuing to treat processes in a non-unixy,
>> > non-file-descriptor-based manner.
>>
>> That's what I'm proposing in the API for which I'm gathering feedback.
>> I have presented parts of this in various discussions at LSS Europe last week
>> and will be at LPC.
>> We don't want to rush an API like this though. It was tried before in
>> other forms
>> and these proposals didn't make it.
>
> :+1: on a well thought-out and generic proposal. As we've discussed
> elsewhere, this is an issue that really would be great to (finally)
> solve.

Excited to see this and please count me in for discussions around this. thanks.

- Joel

2018-10-31 04:47:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

Daniel Colascione <[email protected]> writes:

> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
> [[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>

In general this looks good.

Unfortunately the permission checks are are subject to a serious
problem. Even if I don't have permission to kill a process I quite
likely will be allowed to open the file.

Then I just need to find a setuid or setcap executable will write to
stdout or stderr a number. Then I have killed something I should not
have the privileges to kill.

At a bare minimum you need to perform the permission check using the
credentials of the opener of the file. Which means refactoring
kill_pid so that you can perform the permission check for killing the
application during open. Given that process credentials can change
completely during exec you also need to rule out a change in process
credentials making it so that the original opener of the file would not
be able to kill the process as it is now.

But overall this looks quite reasaonble.

Eric

> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> fs/proc/base.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..923d62b21e67 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
> return result;
> }
>
> +static ssize_t proc_pid_kill_write(struct file *file,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';
> + res = kstrtoint(strstrip(buffer), 10, &sig);
> + if (res)
> + goto out;
> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;
> + res = count;
> +out:
> + return res;
> +
> +}
> +
> +static const struct file_operations proc_pid_kill_ops = {
> + .write = proc_pid_kill_write,
> +};
> +
> static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
> size_t count, loff_t *ppos)
> {
> @@ -2935,6 +2973,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
> ONE("syscall", S_IRUSR, proc_pid_syscall),
> #endif
> + REG("kill", S_IRUGO | S_IWUGO, proc_pid_kill_ops),
> REG("cmdline", S_IRUGO, proc_pid_cmdline_ops),
> ONE("stat", S_IRUGO, proc_tgid_stat),
> ONE("statm", S_IRUGO, proc_pid_statm),

2018-10-31 04:48:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

Aleksa Sarai <[email protected]> writes:

> On 2018-10-29, Daniel Colascione <[email protected]> wrote:
>> Add a simple proc-based kill interface. To use /proc/pid/kill, just
>> write the signal number in base-10 ASCII to the kill file of the
>> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>>
>> Semantically, /proc/pid/kill works like kill(2), except that the
>> process ID comes from the proc filesystem context instead of from an
>> explicit system call parameter. This way, it's possible to avoid races
>> between inspecting some aspect of a process and that process's PID
>> being reused for some other process.
>
> (Aside from any UX concerns other folks might have.)
>
> I think it would be a good idea to (at least temporarily) restrict this
> so that only processes that are in the same PID namespace as the /proc
> being resolved through may use this interface. Otherwise you might have
> cases where partial container breakouts can start sending signals to
> PIDs they wouldn't normally be able to address.

No. That is the container managers job. If you have the wrong proc
mounted in your container or otherwise allow access to it that is the
fault of the application that set up the container.

The pid namespace limits visibility. If something becomes visible and
you have permissions over it, it is perfectly reasonable for you to
execute those permissions.

Eric

2018-10-31 05:12:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

Christian Brauner <[email protected]> writes:

> On Tue, Oct 30, 2018 at 12:12 PM Daniel Colascione <[email protected]> wrote:
>>
>> On Tue, Oct 30, 2018 at 11:04 AM, Christian Brauner
>> <[email protected]> wrote:
>> > On Tue, Oct 30, 2018 at 11:48 AM Daniel Colascione <[email protected]> wrote:
>> >>
>> >> Why not?
>> >>
>> >> Does your proposed API allow for a race-free pkill, with arbitrary
>> >> selection criteria? This capability is a good litmus test for fixing
>> >> the long-standing Unix process API issues.
>> >
>> > You'd have a handle on the process with an fd so yes, it would be.
>>
>> Thanks. That's good to hear.
>>
>> Any idea on the timetable for this proposal? I'm open to lots of
>> alternative technical approaches, but I don't want this capability to
>> languish for a long time.
>
> Latest end of year likely sooner depending on the feedback I'm getting
> during LPC.

Frankly. If you want a race free fork variant probably the easiest
thing to do is to return a open copy of the proc directory entry. Wrapped
in a bind mount so that you can't see beyond that directory in proc.

My only concern would be if a vfsmount per process would be too heavy for such a use.

Eric




2018-10-31 12:45:49

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 10/30, Eric W. Biederman wrote:
>
> At a bare minimum you need to perform the permission check using the
> credentials of the opener of the file. Which means refactoring
> kill_pid so that you can perform the permission check for killing the
> application during open.

perhaps it would be simpler to do

my_cred = override_creds(file->f_cred);
kill_pid(...);
revert_creds(my_cred);

?

> But overall this looks quite reasaonble.

Agreed.

Oleg.


2018-10-31 13:28:47

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 12:44 PM, Oleg Nesterov <[email protected]> wrote:
> On 10/30, Eric W. Biederman wrote:
>>
>> At a bare minimum you need to perform the permission check using the
>> credentials of the opener of the file. Which means refactoring
>> kill_pid so that you can perform the permission check for killing the
>> application during open.

Absolutely right. Thanks for spotting that.

> perhaps it would be simpler to do
>
> my_cred = override_creds(file->f_cred);
> kill_pid(...);
> revert_creds(my_cred);

Thanks for the suggestion. That looks neat, but it's not quite enough.
The problem is that check_kill_permission looks for
same_thread_group(current, t) _before_ checking kill_of_by_cred, so
with just this code snippet, it'd still be possible for an
unprivileged process to open /proc/$PRIVILEGED_PID/kill and hand the
FD to $PRIVILEGED_PID, which would write to it and kill itself or any
of its threads. I think, with some rearrangement of permissions
checks, this problem can be overcome.

There's another problem though: say we open /proc/pid/5/kill *, with
proc 5 being an ordinary unprivileged process, e.g., the shell. At
open(2) time, the access check passes. Now suppose PID 5 execve(2)s
into a setuid process. The kill FD is still open, so the kill FD's
holder can send a signal to a process it normally wouldn't be able to
kill.

You might say, "let's somehow invalidate open kill FDs upon setuid
exec", but the problem that then results is then that a legitimate
privileged user of /proc/pid/kill (say, Android lmkd) might see its
/proc/pid/kill handle spuriously become invalidated if the process to
which it refers execs in a setuid way. Maybe in this case we make
could write(2) on the kill FD fail with ECHILD ("no child process"?)
and have callers, if they see ECHILD, close the kill FD, re-open it,
and try again. But now we're getting into an interface that's too
complicated to use from the shell.

Maybe a simpler approach would be to bind the kill FD to the struct
cred that opened it and keep the access check in write(2) --- a
write(2) with current->cred not equal to f_cred would fail with EPERM.
This way, you could play standard-output-of-setuid-program or
SCM_RIGHTS games with the kill FD itself, but you wouldn't be able to
do anything with the FD having done so.

Honestly, though, maybe a new procfs_sigqueue(2) system call would be
simpler and more robust. With a single system call, we wouldn't split
the permissions check between open(2) and write(2), and so the whole
problem disappears.

The downside is that you wouldn't be able to use the new feature via
the shell without a helper program. :-(

What do you think?

* I actually have a local variant of the patch that would have you
open "/proc/$PID/kill/$SIGNO" instead, since different signal numbers
have different permission checks.

This approach is kind of neat, since /proc/pid/kill/$SIGNO would act
as an "option" to kill a process only with a particular signal, and a
write(2) to a /proc/$PID/kill/$SIGNO file would allow you to specify a
sigqueue(2)-style siginfo value along with the actual signal number
(since the signal number is encoded in the filename). For example, a
privileged process could open /proc/self/kill/10 (SIGUSR1) and hand
the FD to an unprivileged process, letting that process signal (via
signal) completion of some process without giving that unprivileged
process the ability to send *any* signal to the privileged process.
But eventfd is almost certainly a better choice in this situation
anyway, I think.

2018-10-31 14:40:12

by Daniel Colascione

[permalink] [raw]
Subject: [PATCH v2] Implement /proc/pid/kill

Add a simple proc-based kill interface. To use /proc/pid/kill, just
write the signal number in base-10 ASCII to the kill file of the
process to be killed: for example, 'echo 9 > /proc/$$/kill'.

Semantically, /proc/pid/kill works like kill(2), except that the
process ID comes from the proc filesystem context instead of from an
explicit system call parameter. This way, it's possible to avoid races
between inspecting some aspect of a process and that process's PID
being reused for some other process.

Note that only the real user ID that opened a /proc/pid/kill file can
write to it; other users get EPERM. This check prevents confused
deputy attacks via, e.g., standard output of setuid programs.

With /proc/pid/kill, it's possible to write a proper race-free and
safe pkill(1). An approximation follows. A real program might use
openat(2), having opened a process's /proc/pid directory explicitly,
with the directory file descriptor serving as a sort of "process
handle".

#!/bin/bash
set -euo pipefail
pat=$1
for proc_status in /proc/*/status; do (
cd $(dirname $proc_status)
readarray proc_argv -d'' < cmdline
if ((${#proc_argv[@]} > 0)) &&
[[ ${proc_argv[0]} = *$pat* ]];
then
echo 15 > kill
fi
) || true; done

Signed-off-by: Daniel Colascione <[email protected]>
---

Added a real-user-ID check to prevent confused deputy attacks.

fs/proc/base.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7e9f07bf260d..74e494f24b28 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -205,6 +205,56 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
return result;
}

+static ssize_t proc_pid_kill_write(struct file *file,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t res;
+ int sig;
+ char buffer[4];
+
+ /* This check prevents a confused deputy attack in which an
+ * unprivileged process opens /proc/victim/kill and convinces
+ * a privileged process to write to that kill FD, effectively
+ * performing a kill with the privileges of the unwitting
+ * privileged process. Here, we just fail the kill operation
+ * if someone calls write(2) with a real user ID that differs
+ * from the one used to open the kill FD.
+ */
+ res = -EPERM;
+ if (file->f_cred->user != current_user())
+ goto out;
+
+ res = -EINVAL;
+ if (*ppos != 0)
+ goto out;
+
+ res = -EINVAL;
+ if (count > sizeof(buffer) - 1)
+ goto out;
+
+ res = -EFAULT;
+ if (copy_from_user(buffer, buf, count))
+ goto out;
+
+ buffer[count] = '\0';
+ res = kstrtoint(strstrip(buffer), 10, &sig);
+ if (res)
+ goto out;
+
+ res = kill_pid(proc_pid(file_inode(file)), sig, 0);
+ if (res)
+ goto out;
+ res = count;
+out:
+ return res;
+
+}
+
+static const struct file_operations proc_pid_kill_ops = {
+ .write = proc_pid_kill_write,
+};
+
static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
size_t count, loff_t *ppos)
{
@@ -2935,6 +2985,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall", S_IRUSR, proc_pid_syscall),
#endif
+ REG("kill", S_IRUGO | S_IWUGO, proc_pid_kill_ops),
REG("cmdline", S_IRUGO, proc_pid_cmdline_ops),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
2.19.1.568.g152ad8e336-goog


2018-10-31 15:06:48

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 02:37:44PM +0000, Daniel Colascione wrote:
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> Note that only the real user ID that opened a /proc/pid/kill file can
> write to it; other users get EPERM. This check prevents confused
> deputy attacks via, e.g., standard output of setuid programs.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
> [[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
>
> Added a real-user-ID check to prevent confused deputy attacks.
>
> fs/proc/base.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..74e494f24b28 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,56 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
> return result;
> }
>
> +static ssize_t proc_pid_kill_write(struct file *file,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + /* This check prevents a confused deputy attack in which an
> + * unprivileged process opens /proc/victim/kill and convinces
> + * a privileged process to write to that kill FD, effectively
> + * performing a kill with the privileges of the unwitting
> + * privileged process. Here, we just fail the kill operation
> + * if someone calls write(2) with a real user ID that differs
> + * from the one used to open the kill FD.
> + */
> + res = -EPERM;
> + if (file->f_cred->user != current_user())
> + goto out;

nit: You could get rid of the out label and just do direct returns. Will save
a few lines and is more readable.

> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';

I think you can just zero-initialize buffer with "= {};" and get rid of this line.

> + res = kstrtoint(strstrip(buffer), 10, &sig);
> + if (res)
> + goto out;


> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;

if (res)
return res;

Other than the security issues which I still think you're discussing, since
we need this, I suggest to maintainers we take this in as an intermediate
solution since we don't have anything close to it and this is a real issue,
and the fix proposed is simple. So FWIW feel free to add my reviewed-by
(with the above nits and security issues taken care off) on any future
respins:

Reviewed-by: Joel Fernandes (Google) <[email protected]>

thanks,

- Joel


2018-10-31 15:14:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 10/31, Daniel Colascione wrote:
>
> > perhaps it would be simpler to do
> >
> > my_cred = override_creds(file->f_cred);
> > kill_pid(...);
> > revert_creds(my_cred);
>
> Thanks for the suggestion. That looks neat, but it's not quite enough.
> The problem is that check_kill_permission looks for
> same_thread_group(current, t) _before_ checking kill_of_by_cred,

Yes, you are right.

Looks like kill_pid_info_as_cred() can find another user, but probably
it needs some changes with or without /proc/pid/kill ...

> There's another problem though: say we open /proc/pid/5/kill *, with
> proc 5 being an ordinary unprivileged process, e.g., the shell. At
> open(2) time, the access check passes. Now suppose PID 5 execve(2)s
> into a setuid process. The kill FD is still open, so the kill FD's
> holder can send a signal

Confused... why? kill_ok_by_cred() should fail?

Oleg.


2018-10-31 15:17:31

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 3:10 PM, Oleg Nesterov <[email protected]> wrote:
> On 10/31, Daniel Colascione wrote:
>>
>> > perhaps it would be simpler to do
>> >
>> > my_cred = override_creds(file->f_cred);
>> > kill_pid(...);
>> > revert_creds(my_cred);
>>
>> Thanks for the suggestion. That looks neat, but it's not quite enough.
>> The problem is that check_kill_permission looks for
>> same_thread_group(current, t) _before_ checking kill_of_by_cred,
>
> Yes, you are right.
>
> Looks like kill_pid_info_as_cred() can find another user, but probably
> it needs some changes with or without /proc/pid/kill ...
>
>> There's another problem though: say we open /proc/pid/5/kill *, with
>> proc 5 being an ordinary unprivileged process, e.g., the shell. At
>> open(2) time, the access check passes. Now suppose PID 5 execve(2)s
>> into a setuid process. The kill FD is still open, so the kill FD's
>> holder can send a signal
>
> Confused... why? kill_ok_by_cred() should fail?

Not if we don't run it. :-) I thought you were proposing that we do
*all* access checks in open() and let write() succeed unconditionally,
since that's the model that a lot of FD-mediated resources (like
regular files) use. (MAC notwithstanding.)

Anyway, I sent a v2 patch that I think closes the hole another way. In
v2, we just require that the real user ID that opens a /proc/pid/kill
file is the same one that writes to it. It successfully blocks the
setuid attack above while preserving all the write-time permission
checks and keeping the close correspondence between
write()-on-proc-pid-kill-fd and kill(2). Can you think of any
situation where this scheme breaks? I *think* comparing struct user
addresses instead of numeric UIDs will protect the check against user
namespace shenanigans.

2018-10-31 15:50:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On 10/31, Daniel Colascione wrote:
>
> > Confused... why? kill_ok_by_cred() should fail?
>
> Not if we don't run it. :-) I thought you were proposing that we do
> *all* access checks in open() and let write() succeed unconditionally,

Ah, no ;)

> Anyway, I sent a v2 patch that I think closes the hole another way. In
> v2, we just require that the real user ID that opens a /proc/pid/kill
> file is the same one that writes to it. It successfully blocks the
> setuid attack above while preserving all the write-time permission
> checks and keeping the close correspondence between
> write()-on-proc-pid-kill-fd and kill(2). Can you think of any
> situation where this scheme breaks?

I see no problems...

but again, perhaps we should fix kill_pid_info_as_cred() and use it in
/proc/pid/kill? I dunno.

Oleg.


2018-10-31 16:02:03

by Daniel Colascione

[permalink] [raw]
Subject: [PATCH v3] Implement /proc/pid/kill

Add a simple proc-based kill interface. To use /proc/pid/kill, just
write the signal number in base-10 ASCII to the kill file of the
process to be killed: for example, 'echo 9 > /proc/$$/kill'.

Semantically, /proc/pid/kill works like kill(2), except that the
process ID comes from the proc filesystem context instead of from an
explicit system call parameter. This way, it's possible to avoid races
between inspecting some aspect of a process and that process's PID
being reused for some other process.

Note that the write(2) to the kill file descriptor works only if it
happens in the security context as the call to open(2), where
"security context" is defined as the set of all ambient user IDs
(effective uid, fs uid, real uid, and saved uid) as well as the
presence of the CAP_KILL capability. This check prevents confused
deputy attacks via, e.g., supplying a /proc/$(pidof httpd)/kill file
descriptor as the standard output of setuid program and convincing
that program to write a "9".

With /proc/pid/kill, it's possible to write a proper race-free and
safe pkill(1). An approximation follows. A real program might use
openat(2), having opened a process's /proc/pid directory explicitly,
with the directory file descriptor serving as a sort of "process
handle".

#!/bin/bash
set -euo pipefail
pat=$1
for proc_status in /proc/*/status; do (
cd $(dirname $proc_status)
readarray proc_argv -d'' < cmdline
if ((${#proc_argv[@]} > 0)) &&
[[ ${proc_argv[0]} = *$pat* ]];
then
echo 15 > kill
fi
) || true; done

Signed-off-by: Daniel Colascione <[email protected]>
---

Turns out that checking struct user isn't sufficient, since signal.c's
permissions check also cares about effective UIDs. Let's be
extra-paranoid and bail if _anything_ relevant in struct cred
has changed.

Also, as Joel suggested, switch from goto-return to direct return.

fs/proc/base.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7e9f07bf260d..b0e7ded96af9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -205,6 +205,72 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
return result;
}

+static ssize_t proc_pid_kill_write(struct file *file,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t res;
+ int sig;
+ char buffer[4];
+ const struct cred *cur_cred;
+ const struct cred *open_cred;
+ bool security_changed;
+
+ /* This check prevents a confused deputy attack in which an
+ * unprivileged process opens /proc/victim/kill and convinces
+ * a privileged process to write to that kill FD, effectively
+ * performing a kill with the privileges of the unwitting
+ * privileged process. Here, we just fail the kill operation
+ * if someone calls write(2) with a real user ID that differs
+ * from the one used to open the kill FD.
+ */
+ cur_cred = current_cred();
+ open_cred = file->f_cred;
+ security_changed =
+ cur_cred->user_ns != open_cred->user_ns ||
+ !uid_eq(cur_cred->euid, open_cred->euid) ||
+ !uid_eq(cur_cred->fsuid, open_cred->fsuid) ||
+ !uid_eq(cur_cred->suid, open_cred->suid) ||
+ !uid_eq(cur_cred->uid, open_cred->uid) ||
+ /* No audit: if we actually use the capability, we'll
+ * audit during the actual kill. Here, we're just
+ * checking whether our kill-FD has escaped its
+ * original security context and bailing if it has.
+ */
+ (security_capable_noaudit(cur_cred,
+ cur_cred->user_ns,
+ CAP_KILL)
+ != security_capable_noaudit(open_cred,
+ open_cred->user_ns,
+ CAP_KILL));
+ if (security_changed)
+ return -EPERM;
+
+ if (*ppos != 0)
+ return -EINVAL;
+
+ if (count > sizeof(buffer) - 1)
+ return -EINVAL;
+
+ if (copy_from_user(buffer, buf, count))
+ return -EINVAL;
+
+ buffer[count] = '\0';
+ res = kstrtoint(strstrip(buffer), 10, &sig);
+ if (res)
+ return res;
+
+ res = kill_pid(proc_pid(file_inode(file)), sig, 0);
+ if (res)
+ return res;
+
+ return count;
+}
+
+static const struct file_operations proc_pid_kill_ops = {
+ .write = proc_pid_kill_write,
+};
+
static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
size_t count, loff_t *ppos)
{
@@ -2935,6 +3001,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall", S_IRUSR, proc_pid_syscall),
#endif
+ REG("kill", S_IRUGO | S_IWUGO, proc_pid_kill_ops),
REG("cmdline", S_IRUGO, proc_pid_cmdline_ops),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
2.19.1.568.g152ad8e336-goog


2018-10-31 16:25:37

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

+linux-api, Andy Lutomirski, Eric Biederman

On Wed, Oct 31, 2018 at 3:12 AM Daniel Colascione <[email protected]> wrote:
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.

This is a kernel API change, you should CC the linux-api list.

I think that getting the semantics of this right might be easier if
you used an ioctl handler instead of a write handler.

> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
> [[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> fs/proc/base.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..923d62b21e67 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
> return result;
> }
>
> +static ssize_t proc_pid_kill_write(struct file *file,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';
> + res = kstrtoint(strstrip(buffer), 10, &sig);
> + if (res)
> + goto out;
> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;
> + res = count;
> +out:
> + return res;
> +
> +}

2018-10-31 17:35:20

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v2] Implement /proc/pid/kill

On 2018-10-31, Joel Fernandes <[email protected]> wrote:
> I suggest to maintainers we take this in as an intermediate solution
> since we don't have anything close to it and this is a real issue, and
> the fix proposed is simple.

I would suggest we wait until after LPC to see what Christian's design
is (given that, from what I've heard, it will help us solve more
problems than just the kill(2) issue).

I am very skeptical of adding new procfs files as an "intermediate
solution" (procfs is the most obvious example of an interface which is
effectively written in stone). Especially if this would conflict with
the idea Christian will propose -- as he said, there were proposals to
do this in the past and they didn't get anywhere because of lack of
discussion and brainstorming before posting patches.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (932.00 B)
signature.asc (849.00 B)
Download all attachments

2018-10-31 18:04:00

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 5:54 PM, Tycho Andersen <[email protected]> wrote:
> Why not just use an ioctl() like Jann suggested instead of this big
> security check? Then we avoid the whole setuid writer thing entirely,

Don't you think a system call would be better than a new ioctl? With
either an ioctl or a new system call, though, the shell would need a
helper program to use the facility, whereas with the existing
approach, the shell can use the new facility without any additional
binaries.

> and we can pass the fd around if we want to.

You can pass the FD around today --- specifically, you just pass the
/proc/pid directory FD, not the /proc/pid/kill FD. The /proc/pid
directory FD acts as a process handle. (It's literally a reference to
a struct pid.) Anyone who receives one of these process handle FDs and
who wants to use the corresponding kill file can open the kill fd with
openat(2). What you can't do is pass the /proc/pid/kill FD to another
security context and use it, but when would you ever want to do that?

2018-10-31 18:18:07

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 06:00:49PM +0000, Daniel Colascione wrote:
> On Wed, Oct 31, 2018 at 5:54 PM, Tycho Andersen <[email protected]> wrote:
> > Why not just use an ioctl() like Jann suggested instead of this big
> > security check? Then we avoid the whole setuid writer thing entirely,
>
> Don't you think a system call would be better than a new ioctl?

We already have a kill() system call :)

> With either an ioctl or a new system call, though, the shell would
> need a helper program to use the facility, whereas with the existing
> approach, the shell can use the new facility without any additional
> binaries.

...and a binary to use it!

The nice thing about an ioctl is that it avoids this class of attacks
entirely.

> > and we can pass the fd around if we want to.
>
> You can pass the FD around today --- specifically, you just pass the
> /proc/pid directory FD, not the /proc/pid/kill FD. The /proc/pid
> directory FD acts as a process handle. (It's literally a reference to
> a struct pid.) Anyone who receives one of these process handle FDs and
> who wants to use the corresponding kill file can open the kill fd with
> openat(2). What you can't do is pass the /proc/pid/kill FD to another
> security context and use it, but when would you ever want to do that?

Perhaps I don't have a good imagination, because it's not clear to me
when I'd ever use this from a shell instead of the kill binary,
either. Using this from the shell is still racy, because if I do
something like:

echo 9 > /proc/$pid/kill

There's exactly the same race that there is with kill, that $pid might
be something else. Of course I could do some magic with bind mounts or
my pwd or something to keep it alive, but I can already do that today
with kill.

I can understand the desire to have a race free way to do this, but
"it must use write(2)" seems a little unnecessary, given that the
shell use case isn't particularly convincing to me.

Tycho

2018-10-31 19:00:17

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 03:59:12PM +0000, Daniel Colascione wrote:
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> Note that the write(2) to the kill file descriptor works only if it
> happens in the security context as the call to open(2), where
> "security context" is defined as the set of all ambient user IDs
> (effective uid, fs uid, real uid, and saved uid) as well as the
> presence of the CAP_KILL capability. This check prevents confused
> deputy attacks via, e.g., supplying a /proc/$(pidof httpd)/kill file
> descriptor as the standard output of setuid program and convincing
> that program to write a "9".
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
>
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray proc_argv -d'' < cmdline
> if ((${#proc_argv[@]} > 0)) &&
> [[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
>
> Turns out that checking struct user isn't sufficient, since signal.c's
> permissions check also cares about effective UIDs. Let's be
> extra-paranoid and bail if _anything_ relevant in struct cred
> has changed.
>
> Also, as Joel suggested, switch from goto-return to direct return.
>
> fs/proc/base.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 67 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..b0e7ded96af9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,72 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
> return result;
> }
>
> +static ssize_t proc_pid_kill_write(struct file *file,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> + const struct cred *cur_cred;
> + const struct cred *open_cred;
> + bool security_changed;
> +
> + /* This check prevents a confused deputy attack in which an
> + * unprivileged process opens /proc/victim/kill and convinces
> + * a privileged process to write to that kill FD, effectively
> + * performing a kill with the privileges of the unwitting
> + * privileged process. Here, we just fail the kill operation
> + * if someone calls write(2) with a real user ID that differs
> + * from the one used to open the kill FD.
> + */
> + cur_cred = current_cred();
> + open_cred = file->f_cred;
> + security_changed =
> + cur_cred->user_ns != open_cred->user_ns ||
> + !uid_eq(cur_cred->euid, open_cred->euid) ||
> + !uid_eq(cur_cred->fsuid, open_cred->fsuid) ||
> + !uid_eq(cur_cred->suid, open_cred->suid) ||
> + !uid_eq(cur_cred->uid, open_cred->uid) ||
> + /* No audit: if we actually use the capability, we'll
> + * audit during the actual kill. Here, we're just
> + * checking whether our kill-FD has escaped its
> + * original security context and bailing if it has.
> + */
> + (security_capable_noaudit(cur_cred,
> + cur_cred->user_ns,
> + CAP_KILL)
> + != security_capable_noaudit(open_cred,
> + open_cred->user_ns,
> + CAP_KILL));
> + if (security_changed)
> + return -EPERM;

Why not just use an ioctl() like Jann suggested instead of this big
security check? Then we avoid the whole setuid writer thing entirely,
and we can pass the fd around if we want to.

Tycho

2018-10-31 19:34:21

by Daniel Colascione

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 6:17 PM, Tycho Andersen <[email protected]> wrote:
> On Wed, Oct 31, 2018 at 06:00:49PM +0000, Daniel Colascione wrote:
>> On Wed, Oct 31, 2018 at 5:54 PM, Tycho Andersen <[email protected]> wrote:
>> > Why not just use an ioctl() like Jann suggested instead of this big
>> > security check? Then we avoid the whole setuid writer thing entirely,
>>
>> Don't you think a system call would be better than a new ioctl?
>
> We already have a kill() system call :)

kill(2) is useless this purpose: it accepts a numeric PID, but we'd
need it to accept a process file descriptor instead. It's true that
the existing kill(1) binary might be the vehicle for using a
hypothetical new system call, but that's a separate matter.

>> With either an ioctl or a new system call, though, the shell would
>> need a helper program to use the facility, whereas with the existing
>> approach, the shell can use the new facility without any additional
>> binaries.
>
> ...and a binary to use it!
>
> The nice thing about an ioctl is that it avoids this class of attacks
> entirely.

Let's stop talking about adding an ioctl. Ioctls have problems with
namespacing of the request argument; it's not safe, in general, to
issue an ioctl against a file descriptor of an unknown type. You don't
know how that FD will interpret your request code. The two good
options before us are a write(2) interface and a new system call. I
think both are defensible. But I don't see a good reason to consider
adding an ioctl instead of a system call.

>> > and we can pass the fd around if we want to.
>>
>> You can pass the FD around today --- specifically, you just pass the
>> /proc/pid directory FD, not the /proc/pid/kill FD. The /proc/pid
>> directory FD acts as a process handle. (It's literally a reference to
>> a struct pid.) Anyone who receives one of these process handle FDs and
>> who wants to use the corresponding kill file can open the kill fd with
>> openat(2). What you can't do is pass the /proc/pid/kill FD to another
>> security context and use it, but when would you ever want to do that?
>
> Perhaps I don't have a good imagination, because it's not clear to me
> when I'd ever use this from a shell instead of the kill binary,

I'm not against a new system call per se; I'd just prefer a write(2)
interface if we can get away with it. The trouble with a system call
is that it would have to accept a /proc/pid file descriptor, and file
descriptor management in the shell is awkward. I imagine the interface
would look something like kill -f PATH, where PATH is a PATH to a
/proc/pid directory. You'd be able to cd into /proc/$SOMETHING,
inspect state, and then, if you wanted to kill the process, you'd run
kill -f . 9 (or whatever signal number you want). It seems to be about
as ergonomic as 'echo 9 > ./kill'. But a new system call means new
kernel headers, coordination with procps, and bash, and every other
shell that has a kill builtin. You could provide a different, non-kill
binary, but then who'd distribute it? A new proc file, OTOH, would
Just Work. I agree that a system call interface would avoid the need
for the security check thing in the patch, but is avoiding this check
worth the coordination flowing from adding a new system call? I don't
know.

All of this is moot if the new comprehensive process interface that
comes out of LPC ends up being better anyway.

> either. Using this from the shell is still racy, because if I do
> something like:
>
> echo 9 > /proc/$pid/kill
>
> There's exactly the same race that there is with kill, that $pid might
> be something else.

> Of course I could do some magic with bind mounts or
> my pwd or something to keep it alive, but I can already do that today
> with kill.

You can't do it today with kill. The idea that keeping a open file
descriptor to a /proc/pid or a file within it prevents PID reuse is
widespread, but incorrect.

2018-10-31 20:32:30

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 07:33:06PM +0000, Daniel Colascione wrote:
> On Wed, Oct 31, 2018 at 6:17 PM, Tycho Andersen <[email protected]> wrote:
> > On Wed, Oct 31, 2018 at 06:00:49PM +0000, Daniel Colascione wrote:
> >> On Wed, Oct 31, 2018 at 5:54 PM, Tycho Andersen <[email protected]> wrote:
> >> > Why not just use an ioctl() like Jann suggested instead of this big
> >> > security check? Then we avoid the whole setuid writer thing entirely,
> >>
> >> Don't you think a system call would be better than a new ioctl?
> >
> > We already have a kill() system call :)
>
> kill(2) is useless this purpose: it accepts a numeric PID, but we'd
> need it to accept a process file descriptor instead. It's true that
> the existing kill(1) binary might be the vehicle for using a
> hypothetical new system call, but that's a separate matter.
>
> >> With either an ioctl or a new system call, though, the shell would
> >> need a helper program to use the facility, whereas with the existing
> >> approach, the shell can use the new facility without any additional
> >> binaries.
> >
> > ...and a binary to use it!
> >
> > The nice thing about an ioctl is that it avoids this class of attacks
> > entirely.
>
> Let's stop talking about adding an ioctl. Ioctls have problems with
> namespacing of the request argument; it's not safe, in general, to
> issue an ioctl against a file descriptor of an unknown type.

So don't lose track of the fd type. I'm not sure I see this as a big
problem.

> You don't know how that FD will interpret your request code. The two
> good options before us are a write(2) interface and a new system
> call. I think both are defensible. But I don't see a good reason to
> consider adding an ioctl instead of a system call.

https://www.mail-archive.com/[email protected]/msg1729911.html
https://www.mail-archive.com/[email protected]/msg1729921.html

maybe? :)

> All of this is moot if the new comprehensive process interface that
> comes out of LPC ends up being better anyway.

+1, I think a way to do all of this sort of thing would be nice.

> > either. Using this from the shell is still racy, because if I do
> > something like:
> >
> > echo 9 > /proc/$pid/kill
> >
> > There's exactly the same race that there is with kill, that $pid might
> > be something else.
>
> > Of course I could do some magic with bind mounts or
> > my pwd or something to keep it alive, but I can already do that today
> > with kill.
>
> You can't do it today with kill. The idea that keeping a open file
> descriptor to a /proc/pid or a file within it prevents PID reuse is
> widespread, but incorrect.

Good to know :)

Tycho

2018-10-31 21:48:43

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2] Implement /proc/pid/kill

On Thu, Nov 01, 2018 at 04:33:53AM +1100, Aleksa Sarai wrote:
> On 2018-10-31, Joel Fernandes <[email protected]> wrote:
> > I suggest to maintainers we take this in as an intermediate solution
> > since we don't have anything close to it and this is a real issue, and
> > the fix proposed is simple.
>
> I would suggest we wait until after LPC to see what Christian's design
> is (given that, from what I've heard, it will help us solve more
> problems than just the kill(2) issue).
>
> I am very skeptical of adding new procfs files as an "intermediate
> solution" (procfs is the most obvious example of an interface which is
> effectively written in stone). Especially if this would conflict with
> the idea Christian will propose -- as he said, there were proposals to
> do this in the past and they didn't get anywhere because of lack of
> discussion and brainstorming before posting patches.
>

Fine with me, thanks.


2018-11-01 04:56:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Wed, Oct 31, 2018 at 9:23 AM Jann Horn <[email protected]> wrote:
>
> +linux-api, Andy Lutomirski, Eric Biederman
>
> On Wed, Oct 31, 2018 at 3:12 AM Daniel Colascione <[email protected]> wrote:
> > Add a simple proc-based kill interface. To use /proc/pid/kill, just
> > write the signal number in base-10 ASCII to the kill file of the
> > process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> This is a kernel API change, you should CC the linux-api list.
>
> I think that getting the semantics of this right might be easier if
> you used an ioctl handler instead of a write handler.
>
> > Semantically, /proc/pid/kill works like kill(2), except that the
> > process ID comes from the proc filesystem context instead of from an
> > explicit system call parameter. This way, it's possible to avoid races
> > between inspecting some aspect of a process and that process's PID
> > being reused for some other process.
> >
> > With /proc/pid/kill, it's possible to write a proper race-free and
> > safe pkill(1). An approximation follows. A real program might use
> > openat(2), having opened a process's /proc/pid directory explicitly,
> > with the directory file descriptor serving as a sort of "process
> > handle".
> >
> > #!/bin/bash
> > set -euo pipefail
> > pat=$1
> > for proc_status in /proc/*/status; do (
> > cd $(dirname $proc_status)
> > readarray proc_argv -d'' < cmdline
> > if ((${#proc_argv[@]} > 0)) &&
> > [[ ${proc_argv[0]} = *$pat* ]];
> > then
> > echo 15 > kill
> > fi
> > ) || true; done
> >
> > Signed-off-by: Daniel Colascione <[email protected]>
> > ---
> > fs/proc/base.c | 39 +++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 39 insertions(+)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 7e9f07bf260d..923d62b21e67 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -205,6 +205,44 @@ static int proc_root_link(struct dentry *dentry, struct path *path)
> > return result;
> > }
> >
> > +static ssize_t proc_pid_kill_write(struct file *file,
> > + const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + ssize_t res;
> > + int sig;
> > + char buffer[4];
> > +
> > + res = -EINVAL;
> > + if (*ppos != 0)
> > + goto out;
> > +
> > + res = -EINVAL;
> > + if (count > sizeof(buffer) - 1)
> > + goto out;
> > +
> > + res = -EFAULT;
> > + if (copy_from_user(buffer, buf, count))
> > + goto out;
> > +
> > + buffer[count] = '\0';
> > + res = kstrtoint(strstrip(buffer), 10, &sig);
> > + if (res)
> > + goto out;
> > +
> > + res = kill_pid(proc_pid(file_inode(file)), sig, 0);

Indeed, you can't do this from .write unless you manage to pass a cred
struct pointer in. ioctl or a new syscall would be better.

--Andy

2018-11-01 11:54:23

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH] Implement /proc/pid/kill

From: Sent: 31 October 2018 13:28
...
> * I actually have a local variant of the patch that would have you
> open "/proc/$PID/kill/$SIGNO" instead, since different signal numbers
> have different permission checks.

I think you'd need the open() to specify some specific unusual
open modes.
Otherwise it'll be far too easy for scripts (and people) to
accidentally send every signal to every process.

Also think of the memory footprint.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2018-11-01 12:40:19

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v3] Implement /proc/pid/kill

From: Daniel Colascione
> Sent: 31 October 2018 19:33
...
> You can't do it today with kill. The idea that keeping a open file
> descriptor to a /proc/pid or a file within it prevents PID reuse is
> widespread, but incorrect.

Is there a real good reason why that shouldn't be the case?
ie Holding a reference on the 'struct pid' being enough to stop reuse.

A patch to do that would be more generally useful.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2018-11-01 15:51:23

by Daniel Colascione

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Thu, Nov 1, 2018 at 11:53 AM, David Laight <[email protected]> wrote:
> From: Sent: 31 October 2018 13:28
> ...
>> * I actually have a local variant of the patch that would have you
>> open "/proc/$PID/kill/$SIGNO" instead, since different signal numbers
>> have different permission checks.
>
> I think you'd need the open() to specify some specific unusual
> open modes.
> Otherwise it'll be far too easy for scripts (and people) to
> accidentally send every signal to every process.

I think the /proc/$PID/kill/$SIGNO idea is dead anyway, and even
dead-er since Linus banned write() for commands. (Looks like we'll
need a system call after all.)

That said, for the record, I was talking about the *write* sending the
signal, not the open, so grep of /proc wouldn't send random signals to
every process.

> Also think of the memory footprint.

Proc inodes are created on-demand, so AIUI, the memory overhead of
heaving a per-FD directory of stuff isn't very high.

2018-11-01 20:41:44

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Tue, Oct 30, 2018 at 09:24:00PM -0700, Joel Fernandes wrote:
> On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai <[email protected]> wrote:
> > On 2018-10-31, Christian Brauner <[email protected]> wrote:
> >> > I think Aleksa's larger point is that it's useful to treat processes
> >> > as other file-descriptor-named, poll-able, wait-able resources.
> >> > Consistency is important. A process is just another system resource,
> >> > and like any other system resource, you should be open to hold a file
> >> > descriptor to it and do things to that process via that file
> >> > descriptor. The precise form of this process-handle FD is up for
> >> > debate. The existing /proc/$PID directory FD is a good candidate for a
> >> > process handle FD, since it does almost all of what's needed. But
> >> > regardless of what form a process handle FD takes, we need it. I don't
> >> > see a case for continuing to treat processes in a non-unixy,
> >> > non-file-descriptor-based manner.
> >>
> >> That's what I'm proposing in the API for which I'm gathering feedback.
> >> I have presented parts of this in various discussions at LSS Europe last week
> >> and will be at LPC.
> >> We don't want to rush an API like this though. It was tried before in
> >> other forms
> >> and these proposals didn't make it.
> >
> > :+1: on a well thought-out and generic proposal. As we've discussed
> > elsewhere, this is an issue that really would be great to (finally)
> > solve.
>
> Excited to see this and please count me in for discussions around this. thanks.
>

Just a quick question, is there a track planned at LPC for discussing this
new proposal or topics around/related to the proposal?

If not, should that be planned?

- Joel


2018-11-02 09:47:27

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

On Thu, Nov 01, 2018 at 01:40:59PM -0700, Joel Fernandes wrote:
> On Tue, Oct 30, 2018 at 09:24:00PM -0700, Joel Fernandes wrote:
> > On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai <[email protected]> wrote:
> > > On 2018-10-31, Christian Brauner <[email protected]> wrote:
> > >> > I think Aleksa's larger point is that it's useful to treat processes
> > >> > as other file-descriptor-named, poll-able, wait-able resources.
> > >> > Consistency is important. A process is just another system resource,
> > >> > and like any other system resource, you should be open to hold a file
> > >> > descriptor to it and do things to that process via that file
> > >> > descriptor. The precise form of this process-handle FD is up for
> > >> > debate. The existing /proc/$PID directory FD is a good candidate for a
> > >> > process handle FD, since it does almost all of what's needed. But
> > >> > regardless of what form a process handle FD takes, we need it. I don't
> > >> > see a case for continuing to treat processes in a non-unixy,
> > >> > non-file-descriptor-based manner.
> > >>
> > >> That's what I'm proposing in the API for which I'm gathering feedback.
> > >> I have presented parts of this in various discussions at LSS Europe last week
> > >> and will be at LPC.
> > >> We don't want to rush an API like this though. It was tried before in
> > >> other forms
> > >> and these proposals didn't make it.
> > >
> > > :+1: on a well thought-out and generic proposal. As we've discussed
> > > elsewhere, this is an issue that really would be great to (finally)
> > > solve.
> >
> > Excited to see this and please count me in for discussions around this. thanks.
> >
>
> Just a quick question, is there a track planned at LPC for discussing this
> new proposal or topics around/related to the proposal?
>
> If not, should that be planned?

There isn't currently one planned but I'm happy to have a hallway track
session around this.

But note, I think not all relevant people are going to be there (e.g.
Andy). File descriptors for processes seems interesting to a lot of
people so I'm going to send out a pitch of the idea I have and see how
much I'm going to get yelled at latest on Tuesday. Even if it just
triggers a design discussion.
I have been urged by people I pitched this to to send it to lkml
already. Sorry for the delay and the initial non-transparency. The only
reason I didn't do it right away was to ensure that this idea is not
completely crazy. :) (Eric probably still thinks I am though. :))
It's just that I'm at a conference and I want to have a nicer writeup of
this. Given the speed with which this is all coming I have given up on
preparing a first set of patches. :)

Christian

2018-11-02 14:35:00

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

Quoting Christian Brauner ([email protected]):
> On Thu, Nov 01, 2018 at 01:40:59PM -0700, Joel Fernandes wrote:
> > On Tue, Oct 30, 2018 at 09:24:00PM -0700, Joel Fernandes wrote:
> > > On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai <[email protected]> wrote:
> > > > On 2018-10-31, Christian Brauner <[email protected]> wrote:
> > > >> > I think Aleksa's larger point is that it's useful to treat processes
> > > >> > as other file-descriptor-named, poll-able, wait-able resources.
> > > >> > Consistency is important. A process is just another system resource,
> > > >> > and like any other system resource, you should be open to hold a file
> > > >> > descriptor to it and do things to that process via that file
> > > >> > descriptor. The precise form of this process-handle FD is up for
> > > >> > debate. The existing /proc/$PID directory FD is a good candidate for a
> > > >> > process handle FD, since it does almost all of what's needed. But
> > > >> > regardless of what form a process handle FD takes, we need it. I don't
> > > >> > see a case for continuing to treat processes in a non-unixy,
> > > >> > non-file-descriptor-based manner.
> > > >>
> > > >> That's what I'm proposing in the API for which I'm gathering feedback.
> > > >> I have presented parts of this in various discussions at LSS Europe last week
> > > >> and will be at LPC.
> > > >> We don't want to rush an API like this though. It was tried before in
> > > >> other forms
> > > >> and these proposals didn't make it.
> > > >
> > > > :+1: on a well thought-out and generic proposal. As we've discussed
> > > > elsewhere, this is an issue that really would be great to (finally)
> > > > solve.
> > >
> > > Excited to see this and please count me in for discussions around this. thanks.
> > >
> >
> > Just a quick question, is there a track planned at LPC for discussing this
> > new proposal or topics around/related to the proposal?
> >
> > If not, should that be planned?
>
> There isn't currently one planned but I'm happy to have a hallway track
> session around this.
>
> But note, I think not all relevant people are going to be there (e.g.
> Andy). File descriptors for processes seems interesting to a lot of
> people so I'm going to send out a pitch of the idea I have and see how
> much I'm going to get yelled at latest on Tuesday. Even if it just
> triggers a design discussion.
> I have been urged by people I pitched this to to send it to lkml
> already. Sorry for the delay and the initial non-transparency. The only
> reason I didn't do it right away was to ensure that this idea is not
> completely crazy. :) (Eric probably still thinks I am though. :))
> It's just that I'm at a conference and I want to have a nicer writeup of
> this. Given the speed with which this is all coming I have given up on
> preparing a first set of patches. :)
>
> Christian

Sounds good, thanks, looking forward to it.

2018-11-12 01:20:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3] Implement /proc/pid/kill

David Laight <[email protected]> writes:

> From: Daniel Colascione
>> Sent: 31 October 2018 19:33
> ...
>> You can't do it today with kill. The idea that keeping a open file
>> descriptor to a /proc/pid or a file within it prevents PID reuse is
>> widespread, but incorrect.
>
> Is there a real good reason why that shouldn't be the case?
> ie Holding a reference on the 'struct pid' being enough to stop reuse.
>
> A patch to do that would be more generally useful.

Holding an open file descriptor to /proc/pid is enough to prevent pid
reuse problems. If a pid number is reused a new 'struct pid' is
generated. There is not 'struct pid' reuse.

So in solving this the kernel data structure you would need to hold onto
is a 'struct pid'. It is just necessary to add an interface to sending
signals to that 'struct pid' and not looking up the 'struct pid' again
by number.

Eric

2018-11-12 23:15:30

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH] Implement /proc/pid/kill

Hi!

> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.

I'd really prefer this to use a new syscall, taking directory fd as an
argument. Various security solutions (seccomp, chromium sandbox,
ptrace/subterfugue) could be pretty surprised what write to a file is
doing...

> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".

Actually.. if you open /proc/123 directory, is pid 123 still going to
be reused?

If so... could we simply block reuse of that pid as long as directory
is open? It would solve the race w/o new file and w/o new syscall...

Good luck,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.41 kB)
signature.asc (188.00 B)
Digital signature
Download all attachments