Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
MIME-Version: 1.0
In-Reply-To: <20181101070036.l24c2p432ohuwmqf@yavin>
References: <20181029175322.189042-1-dancol@google.com> <20181029192250.130551-1-dancol@google.com>
 <20181101070036.l24c2p432ohuwmqf@yavin>
From:   Daniel Colascione <dancol@google.com>
Date:   Thu, 1 Nov 2018 09:59:40 +0000
Message-ID: <CAKOZueszfoSM0pxhmuFLOuPmJqSfYXxgutstyCgqxAyoUi4h3w@mail.gmail.com>
Subject: Re: [RFC PATCH v2] Minimal non-child process exit notification support
To:     Aleksa Sarai <cyphar@cyphar.com>
Cc:     linux-kernel <linux-kernel@vger.kernel.org>,
        Tim Murray <timmurray@google.com>,
        Joel Fernandes <joelaf@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Thu, Nov 1, 2018 at 7:00 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2018-10-29, Daniel Colascione <dancol@google.com> wrote:
>> This patch adds a new file under /proc/pid, /proc/pid/exithand.
>> Attempting to read from an exithand file will block until the
>> corresponding process exits, at which point the read will successfully
>> complete with EOF.  The file descriptor supports both blocking
>> operations and poll(2). It's intended to be a minimal interface for
>> allowing a program to wait for the exit of a process that is not one
>> of its children.
>>
>> Why might we want this interface? Android's lmkd kills processes in
>> order to free memory in response to various memory pressure
>> signals. It's desirable to wait until a killed process actually exits
>> before moving on (if needed) to killing the next process. Since the
>> processes that lmkd kills are not lmkd's children, lmkd currently
>> lacks a way to wait for a process to actually die after being sent
>> SIGKILL; today, lmkd resorts to polling the proc filesystem pid
>> entry. This interface allow lmkd to give up polling and instead block
>> and wait for process death.
>
> I agree with the need for this interface (with a few caveats), but there
> are a few points I'd like to make:
>
>  * I don't think that making a new procfile is necessary. When you open
>    /proc/$pid you already have a handle for the underlying process, and
>    you can already poll to check whether the process has died (fstatat
>    fails for instance). What if we just used an inotify event to tell
>    userspace that the process has died -- to avoid userspace doing a
>    poll loop?

I'm trying to make a simple interface. The basic unix data access
model is that a userspace application wants information (e.g., next
bunch of bytes in a file, next packet from a socket, next signal from
a signal FD, etc.), and tells the kernel so by making a system call on
a file descriptor. Ordinarily, the kernel returns to userspace with
the requested information when it's available, potentially after
blocking until the information is available. Sometimes userspace
doesn't want to block, so it adds O_NONBLOCK to the open file mode,
and in this mode, the kernel can tell the userspace requestor "try
again later", but the source of truth is still that
ordinarily-blocking system call. How does userspace know when to try
again in the "try again later" case? By using
select/poll/epoll/whatever, which suggests a good time for that "try
again later" retry, but is not dispositive about it, since that
ordinarily-blocking system call is still the sole source of truth, and
that poll is allowed to report spurious readabilty.

This model works fine and has a ton of mental and technical
infrastructure built around it.  It's the one the system uses for
almost every bit of information useful to an application. I feel very
strongly that process exit should also adhere to this model. It's
consistent, robust, and simple to use. That's why I added a procfile
that adheres to this model. It lets processes deal with exits in
exactly the same way they do any other event, with bit of userspace
code that works with possibly-blocking file descriptors generally
(e.g. libevent) without special logic in any polling loop. The event
file I'm proposing is so ordinary, in fact, that it works from the
shell. Without some specific technical reason to do something
different, we shouldn't do something unusual.

Given that we *can*, cheaply, provide a clean and consistent API to
userspace, why would we instead want to inflict some exotic and
hard-to-use interface on userspace instead? Asking that userspace poll
on a directory file descriptor and, when poll returns, check by
looking for certain errors (we'd have to spec which ones) from fstatat
is awkward. /proc/pid is a directory. In what other context does the
kernel ask userspace to use a directory this way?

I don't want to get bogged down in a discussion of a thousand exotic
ways we could provide this feature when there's one clear approach
that works everywhere else and that we should just copy.

>  * There is a fairly old interface called the proc_connector which gives
>    you global fork+exec+exit events (similar to kevents from FreeBSD
>    though much less full-featured). I was working on some patches to
>    extend proc_connector so that it could be used inside containers as
>    well as unprivileged users. This would be another way we could
>    implement this.

Both netlink and the *notify APIs are intended for broad monitoring of
system activity, not for waiting for some specific event. They require
a substantial amount of setup code, and since both are event-streaming
APIs with buffers that can overflow, both need some logic for
userspace to detect buffer overrun and fall back to explicit scanning
if that happens. They're also optional part of the kernel, and as you
note, there's a lot of work needed before either is usable with /proc,
most of that work being unrelated to the race-free process management
operations I'd like to support. We don't need work on either to fix
these longstanding problems with the process API.

Ideally it wouldn't require /proc at all, but /proc is how we ask
about non-child processes generally, so we're stuck with it.

> I'm really not a huge fan of the "blocking read" semantic (though if we
> have to have it, can we at least provide as much information as you get
> from proc_connector -- such as the exit status?).

That a process *exists* is one bit of information. You can get it
today by hammering /proc, so the current exithand patch doesn't leak
information.

You're asking for the kernel to communicate additional information to
userspace when a process dies. Who should have access to that
information? If the answer is "everyone", then yes, we can just have
the read() yield a siginfo_t, just like for waitid. That's simple and
useful. But if the answer is "some users", then who? The exit status
in /proc/pid/stat is zeroed out for readers that fail do_task_stat's
ptrace_may_access call. (Falsifying the exit status in stat seems a
privilege check fails seems like a bad idea from a correctness POV.)
Should open() on exithand perform the same ptrace_may_access privilege
check? What if the process *becomes* untraceable during its lifetime
(e.g., with setuid). Should that read() on the exithand FD still yield
a siginfo_t? Just having exithand yield EOF all the time punts the
privilege problem to a later discussion because this approach doesn't
leak information. We can always add an "exithand_full" or something
that actually yields a siginfo_t.

Another option would be to make exithand's read() always yield a
siginfo_t, but have the open() just fail if the caller couldn't
ptrace_may_access it. But why shouldn't you be able to wait on other
processes? If you can see it in /proc, you should be able to wait on
it exiting.

> Also maybe we should
> integrate this into the exit machinery instead of this loop...

I don't know what you mean. It's already integrated into the exit
machinery: it's what runs the waitqueue.