Date: Thu, 1 Nov 2018 21:47:51 +1100
From: Aleksa Sarai
To: Daniel Colascione
Cc: linux-kernel, Tim Murray, Joel Fernandes
Subject: Re: [RFC PATCH v2] Minimal non-child process exit notification support
Message-ID: <20181101104750.q23rb3hczx2tzakq@yavin>
References: <20181029175322.189042-1-dancol@google.com>
 <20181029192250.130551-1-dancol@google.com>
 <20181101070036.l24c2p432ohuwmqf@yavin>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="snyupnd52lalsuup"
Content-Disposition: inline
In-Reply-To:
Sender:
 linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

--snyupnd52lalsuup
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2018-11-01, Daniel Colascione wrote:
> On Thu, Nov 1, 2018 at 7:00 AM, Aleksa Sarai wrote:
> > On 2018-10-29, Daniel Colascione wrote:
> >> This patch adds a new file under /proc/pid, /proc/pid/exithand.
> >> Attempting to read from an exithand file will block until the
> >> corresponding process exits, at which point the read will successfully
> >> complete with EOF. The file descriptor supports both blocking
> >> operations and poll(2). It's intended to be a minimal interface for
> >> allowing a program to wait for the exit of a process that is not one
> >> of its children.
> >>
> >> Why might we want this interface? Android's lmkd kills processes in
> >> order to free memory in response to various memory pressure
> >> signals. It's desirable to wait until a killed process actually exits
> >> before moving on (if needed) to killing the next process. Since the
> >> processes that lmkd kills are not lmkd's children, lmkd currently
> >> lacks a way to wait for a process to actually die after being sent
> >> SIGKILL; today, lmkd resorts to polling the proc filesystem pid
> >> entry. This interface allows lmkd to give up polling and instead block
> >> and wait for process death.
> >
> > I agree with the need for this interface (with a few caveats), but there
> > are a few points I'd like to make:
> >
> > * I don't think that making a new procfile is necessary. When you open
> >   /proc/$pid you already have a handle for the underlying process, and
> >   you can already poll to check whether the process has died (fstatat
> >   fails for instance). What if we just used an inotify event to tell
> >   userspace that the process has died -- to avoid userspace doing a
> >   poll loop?
>
> I'm trying to make a simple interface.
> The basic unix data access
> model is that a userspace application wants information (e.g., next
> bunch of bytes in a file, next packet from a socket, next signal from
> a signal FD, etc.), and tells the kernel so by making a system call on
> a file descriptor. Ordinarily, the kernel returns to userspace with
> the requested information when it's available, potentially after
> blocking until the information is available. Sometimes userspace
> doesn't want to block, so it adds O_NONBLOCK to the open file mode,
> and in this mode, the kernel can tell the userspace requestor "try
> again later", but the source of truth is still that
> ordinarily-blocking system call. How does userspace know when to try
> again in the "try again later" case? By using
> select/poll/epoll/whatever, which suggests a good time for that "try
> again later" retry, but is not dispositive about it, since that
> ordinarily-blocking system call is still the sole source of truth, and
> that poll is allowed to report spurious readability.

inotify gives you an event if a file or directory is deleted. A pid
dying is semantically similar to a /proc/$pid being deleted. I don't see
how a blocking read on a new procfile is simpler than using the existing
notification-on-file-events infrastructure -- not to mention that the
idea of "this file blocks until the thing we are indirectly referencing
by this file is gone" seems to me to be a really strange interface.
Sure, it uses read(2) -- but is that the only constraint on designing
simple interfaces?

> The event file I'm proposing is so ordinary, in fact, that it works
> from the shell. Without some specific technical reason to do something
> different, we shouldn't do something unusual.

inotify-tools are available on effectively every distribution.

> Given that we *can*, cheaply, provide a clean and consistent API to
> userspace, why would we instead want to inflict some exotic and
> hard-to-use interface on userspace instead?
> Asking that userspace poll
> on a directory file descriptor and, when poll returns, check by
> looking for certain errors (we'd have to spec which ones) from fstatat
> is awkward. /proc/pid is a directory. In what other context does the
> kernel ask userspace to use a directory this way?

I'm not sure you understood my proposal. I said that we need an
interface to do this, and I was trying to explain (by noting what the
current way of doing it would be) what I think the interface should be.

To reiterate, I believe that having an inotify event (IN_DELETE_SELF on
/proc/$pid) would be in keeping with the current way of doing things
while allowing userspace to avoid all of the annoyances that you just
mentioned and that I was alluding to. I *don't* think that the current
scheme of looping on fstatat is the way it should be left.

And there is an argument that inotify is not sufficient to

> > I'm really not a huge fan of the "blocking read" semantic (though if we
> > have to have it, can we at least provide as much information as you get
> > from proc_connector -- such as the exit status?).
> [...]
> The exit status in /proc/pid/stat is zeroed out for readers that fail
> do_task_stat's ptrace_may_access call. (Falsifying the exit status in
> stat if a privilege check fails seems like a bad idea from a
> correctness POV.)

It's not clear to me what the purpose of that field is within procfs for
*dead* processes -- which is what we're discussing here. As far as I can
tell, you will get an ESRCH when you try to read it. When testing this
it also looked like you didn't even get the exit_status as a zombie, but
I might be mistaken.

So while it is masked for !ptrace_may_access, it's also zero (or
unreadable) for almost every case outside of stopped processes (AFAICS).
Am I missing something?

> Should open() on exithand perform the same ptrace_may_access privilege
> check? What if the process *becomes* untraceable during its lifetime
> (e.g., with setuid).
> Should that read() on the exithand FD still yield
> a siginfo_t? Just having exithand yield EOF all the time punts the
> privilege problem to a later discussion because this approach doesn't
> leak information. We can always add an "exithand_full" or something
> that actually yields a siginfo_t.

I agree that read(2) makes this hard. I don't think we should use it.
But if we have to use it, I would like us to have feature parity with
features that FreeBSD had 18 years ago.

> Another option would be to make exithand's read() always yield a
> siginfo_t, but have the open() just fail if the caller couldn't
> ptrace_may_access it. But why shouldn't you be able to wait on other
> processes? If you can see it in /proc, you should be able to wait on
> it exiting.

I would suggest looking at FreeBSD's kevent semantics for inspiration
(or at least to see an alternative way of doing things). In particular,
EVFILT_PROC+NOTE_EXIT -- which is attached to a particular process. I
wonder what their view is on these sorts of questions.

> > Also maybe we should
> > integrate this into the exit machinery instead of this loop...
>
> I don't know what you mean. It's already integrated into the exit
> machinery: it's what runs the waitqueue.

My mistake, I missed the last hunk of the patch.
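
For reference, the userspace poll loop discussed in this thread (hold
/proc/$pid open and retry a lookup until it fails) can be sketched
roughly as below. This is only an illustration, not code from the patch
or from lmkd: the helper name wait_for_exit, the 10ms interval and the
5s timeout are made up, and it assumes a Linux procfs mounted at /proc.
A zombie still has a /proc entry, so the loop only ends once the
process has been reaped by its parent.

```python
import os
import subprocess
import sys
import threading
import time


def wait_for_exit(pid, interval=0.01, timeout=5.0):
    """Poll /proc/<pid> until the process is gone.

    Hypothetical helper: returns True once lookups under the process
    directory start failing, or False if the timeout expires first.
    """
    try:
        # Take a handle on the process directory so that pid reuse cannot
        # redirect later lookups to a different process.
        dirfd = os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)
    except FileNotFoundError:
        return True  # already gone before we could take a handle
    try:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                # Equivalent of fstatat(dirfd, "stat", ...); this fails
                # once the process has exited and been reaped.
                os.stat("stat", dir_fd=dirfd)
            except OSError:
                return True
            time.sleep(interval)
        return False
    finally:
        os.close(dirfd)


if __name__ == "__main__":
    # Spawn a short-lived process and reap it from a background thread, so
    # the main thread can treat it like an unrelated (non-child) process.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.2)"]
    )
    threading.Thread(target=proc.wait, daemon=True).start()
    print("exited:", wait_for_exit(proc.pid))
```

The sleep between lookups is exactly the latency/wakeup trade-off that
both the exithand file and an inotify event are meant to remove.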

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH

--snyupnd52lalsuup
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEb6Gz4/mhjNy+aiz1Snvnv3Dem58FAlva2dQACgkQSnvnv3De
m5+qAxAA1x2Kbzbwil79xUBlcjKcEehEww8PpIKyZ9hdUHImdjTJlP4Yzo7sJfwz
xCfiUPUG1I8etcBDSgmAowAzPqFfyRQGhj0QcRI/BfK1RDMEit1ndGtm7UJEZgse
LpCYh7jv0l5pvZK85AyaXJx/JcRE2t8Ec8fVNQeEIfwxpKp6C+vYQLaNV7+X95+f
lW5fC4ek6z9+KpZPOxJw31XZgBUyZZHq4zhxLwHCdNOHAyN/EMXHhXxd1OdWFi0A
Z9DAnW17aeqSbVY69mgWOYBnK0cSXf0LMYeD85hNlJpLVrfG96QRw/sU+TyZ0+Sb
0cOwI8Kgp2WW3LoSiXULnk/U0aP1uPg7WCpmOsZ1fb/SLpOOqEfm8SaJqDlpF0kq
rv3r0VYbo0y2KADzC0A+HokPzorJ3fhWScGFfoBeKgpyhDr9wUxLA8tRVW6jM5LG
QIxjcn0ww2HolUR1shRNt9bl2Ffuvkj4LVPd4wD6WY8mk+yEvxS15WZZzBatulYb
zWzqQXfPB8RbIYX+bO1/m/ZaeClraQz3BnzrrFmvbAiJeSLvkDugBfvTGFYCoP0t
sPRvmKmQF5zTYgsYXfvjLOW2SD8pYyWrUvpyD3gJr0JF1MXrdGMLMp9MAz7cHKH4
o/DWJNNerVNjcajMncfB+1lkfMzl7590z67VoFswcKNCj8PMA3o=
=kCkq
-----END PGP SIGNATURE-----

--snyupnd52lalsuup--