From: Christian Brauner
Date: Sat, 20 Apr 2019 02:17:21 +0200
Subject: Re: [PATCH RFC 1/2] Add polling support to pidfd
To: Daniel Colascione
Cc: Joel Fernandes, Jann Horn, Oleg Nesterov, Florian Weimer, kernel list,
    Andy Lutomirski, Steven Rostedt, Suren Baghdasaryan, Linus Torvalds,
    Alexey Dobriyan, Al Viro, Andrei Vagin, Andrew Morton, Arnd Bergmann,
    "Eric W. Biederman", Kees Cook, linux-fsdevel,
    "open list:KERNEL SELFTEST FRAMEWORK", Michal Hocko, Nadav Amit,
    Serge Hallyn, Shuah Khan, Stephen Rothwell, Taehee Yoo, Tejun Heo,
    Thomas Gleixner, kernel-team, Tycho Andersen
References: <20190411175043.31207-1-joel@joelfernandes.org>
    <20190416120430.GA15437@redhat.com>
    <20190416192051.GA184889@google.com>
    <20190417130940.GC32622@redhat.com>
    <20190419190247.GB251571@google.com>
    <20190419191858.iwcvqm6fihbkaata@brauner.io>
    <20190419194902.GE251571@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Apr 20, 2019 at 1:47 AM Daniel Colascione wrote:
>
> On Fri, Apr 19, 2019 at 4:12 PM Christian Brauner wrote:
> >
> > On Sat, Apr 20, 2019 at 12:46 AM Daniel Colascione wrote:
> > >
> > > On Fri, Apr 19, 2019 at 3:02 PM Christian Brauner wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 11:48 PM Christian Brauner wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 11:21 PM Daniel Colascione wrote:
> > > > > >
> > > > > > On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner wrote:
> > > > > > >
> > > > > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn wrote:
> > > > > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov wrote:
> > > > > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Could you explain when it should return POLLIN? When the whole process exits?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't exist anymore, or when it
> > > > > > > > > > > > >> > is in a zombie state and there's no other thread in the thread group.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> IOW, when the whole thread group exits, so it can't be used to monitor sub-threads.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> just in case... speaking of this patch it doesn't modify proc_tid_base_operations,
> > > > > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you are going to use
> > > > > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > > > > >
> > > > > > > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > > > > > > > >proc_tgid_base_operations despite not being a thread group leader.
> > > > > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this code can
> > > > > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > > > > >
> > > > > > > > > > > > >@Joel: I think you'll have to either rewrite this to explicitly bail
> > > > > > > > > > > > >out if you're dealing with a thread group leader, or make the code
> > > > > > > > > > > > >work for threads, too.
> > > > > > > > > > > >
> > > > > > > > > > > > The latter case probably being preferred if this API is supposed to be
> > > > > > > > > > > > useable for thread management in userspace.
> > > > > > > > > > >
> > > > > > > > > > > At the moment, we are not planning to use this for sub-thread management. I
> > > > > > > > > > > am reworking this patch to only work on clone(2) pidfds which makes the above
> > > > > > > > > >
> > > > > > > > > > Indeed and agreed.
> > > > > > > > > >
> > > > > > > > > > > discussion about /proc a bit unnecessary I think. Per the latest CLONE_PIDFD
> > > > > > > > > > > patches, CLONE_THREAD with pidfd is not supported.
> > > > > > > > > >
> > > > > > > > > > Yes. We have no one asking for it right now and we can easily add this
> > > > > > > > > > later.
> > > > > > > > > >
> > > > > > > > > > Admittedly I haven't gotten around to reviewing the patches here yet
> > > > > > > > > > completely. But one thing about using POLLIN. FreeBSD is using POLLHUP
> > > > > > > > > > on process exit which I think is nice as well. How about returning
> > > > > > > > > > POLLIN | POLLHUP on process exit?
> > > > > > > > > > We already do things like this. For example, when you proxy between
> > > > > > > > > > ttys. If the process that you're reading data from has exited and closed
> > > > > > > > > > its end you still can't usually simply exit because it might still have
> > > > > > > > > > buffered data that you want to read. The way one can deal with this
> > > > > > > > > > from userspace is that you can observe a (POLLHUP | POLLIN) event and
> > > > > > > > > > you keep on reading until you only observe a POLLHUP without a POLLIN
> > > > > > > > > > event at which point you know you have read all data.
> > > > > > > > > > I like the semantics for pidfds as well as it would indicate:
> > > > > > > > > > - POLLHUP -> process has exited
> > > > > > > > > > - POLLIN -> information can be read
> > > > > > > > >
> > > > > > > > > Actually I think a bit differently about this, in my opinion the pidfd should
> > > > > > > > > always be readable (we would store the exit status somewhere in the future
> > > > > > > > > which would be readable, even after task_struct is dead). So I was thinking
> > > > > > > > > we always return EPOLLIN. If process has not exited, then it blocks.
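
[Editorial sketch] The tty-proxy pattern described in the quoted text above
(keep reading while POLLIN is reported, stop once only POLLHUP remains) might
look roughly like the following. It is only an illustration on an ordinary
pipe, not on a pidfd, and the helper name drain_until_hangup is hypothetical.

```python
import os
import select


def drain_until_hangup(fd, chunk_size=4096):
    """Read from fd until the peer has hung up and no buffered data remains."""
    poller = select.poll()
    # Only POLLIN is requested; POLLHUP is reported regardless of the mask.
    poller.register(fd, select.POLLIN)
    collected = bytearray()
    while True:
        revents = dict(poller.poll()).get(fd, 0)
        if revents & select.POLLIN:
            # Data is still buffered, possibly alongside POLLHUP: keep reading.
            data = os.read(fd, chunk_size)
            if not data:
                break  # end of file
            collected += data
        elif revents & (select.POLLHUP | select.POLLERR | select.POLLNVAL):
            # Hangup without POLLIN: everything has been read.
            break
    return bytes(collected)


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    os.write(write_end, b"buffered output")
    os.close(write_end)  # the writer is gone, but its data is still queued
    print(drain_until_hangup(read_end))
    os.close(read_end)
```

The loop treats POLLIN as "more data to fetch" and a bare POLLHUP as "done",
which is the distinction the POLLIN | POLLHUP proposal relies on.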
> > > > > > > >
> > > > > > > > ITYM that a pidfd polls as readable *once a task exits* and stays
> > > > > > > > readable forever. Before a task exit, a poll on a pidfd should *not*
> > > > > > > > yield POLLIN and reading that pidfd should *not* complete immediately.
> > > > > > > > There's no way that, having observed POLLIN on a pidfd, you should
> > > > > > > > ever then *not* see POLLIN on that pidfd in the future --- it's a
> > > > > > > > one-way transition from not-ready-to-get-exit-status to
> > > > > > > > ready-to-get-exit-status.
> > > > > > >
> > > > > > > What do you consider interesting state transitions? A listener on a pidfd
> > > > > > > in epoll_wait() might be interested if the process execs for example.
> > > > > > > That's a very valid use-case for e.g. systemd.
> > > > > >
> > > > > > Sure, but systemd is specialized.
> > > > >
> > > > > So is Android and we're not designing an interface for Android but for
> > > > > all of userspace.
> > > > > I hope this is clear. Service managers are quite important and systemd
> > > > > is the largest one and they can make good use of this feature.
> > > > >
> > > > > >
> > > > > > There are two broad classes of programs that care about process exit
> > > > > > status: 1) those that just want to do something and wait for it to
> > > > > > complete, and 2) programs that want to perform detailed monitoring of
> > > > > > processes and intervention in their state. #1 is overwhelmingly more
> > > > > > common. The basic pidfd feature should take care of case #1 only, as
> > > > > > wait*() in file descriptor form. I definitely don't think we should be
> > > > > > complicating the interface and making it more error-prone (see below)
> > > > > > for the sake of that rare program that cares about non-exit
> > > > > > notification conditions.
> > > > > > You're proposing a complicated combination of
> > > > > > poll bit flags that most users (the ones who just want to wait for
> > > > > > processes) don't care about and that risk making the facility hard to
> > > > > > use with existing event loops, which generally recognize readability
> > > > > > and writability as the only properties that are worth monitoring.
> > > > >
> > > > > That whole paragraph is about dismissing a range of valid use-cases based on
> > > > > assumptions such as "way more common" and
> > > > > even argues that service managers are special cases and therefore not
> > > > > really worth considering. I would like to be more open to other use cases.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > We can't use EPOLLIN for that too otherwise you'd need to waitid(_WNOHANG)
> > > > > > > to check whether an exit status can be read which is not nice and then you
> > > > > > > multiplex different meanings on the same bit.
> > > > > > > I would prefer if the exit status can only be read from the parent which is
> > > > > > > clean and the least complicated semantics, i.e. Linus' waitid() idea.
> > > > > >
> > > > > > Exit status information should be *at least* as broadly available
> > > > > > through pidfds as it is through the last field of /proc/pid/stat
> > > > > > today, and probably more broadly. I've been saying for six months now
> > > > > > that we need to talk about *who* should have access to exit status
> > > > > > information. We haven't had that conversation yet. My preference is to
> > > > > > just make exit status information globally available, as FreeBSD seems
> > > >
> > > > Totally aside from whether or not this is a good idea, but since you
> > > > keep bringing this up and I'm really curious about this: where is this
> > > > documented and how does this work, please?
> > >
> > > According to the kqueue FreeBSD man page [1] (I'm reading the FreeBSD
> > > 12 version), it's possible to register in a kqueue instance a PID of
> > > interest via EVFILT_PROC and receive a NOTE_EXIT notification when
> > > that process dies. NOTE_EXIT comes with the exit status of the process
> > > that died. I don't see any requirement that EVFILT_PROC work only on
> > > child processes of the waiter: on the contrary, the man page states
> > > that "if a process can normally see another
> > > process, it can attach an event to it.". This documentation reads to
> > > me like on FreeBSD process exit status is much more widely available
> > > than it is on Linux. Am I missing something?
> >
> > So in fact FreeBSD has what I'm proposing fully for pids but partially
> > for pidfds:
> > state transition monitoring via NOTE_EXIT, NOTE_FORK, NOTE_EXEC, and with
> > NOTE_TRACK even more.
> > For NOTE_EXIT you register a pid or pidfd in an epoll_wait()/kqueue loop,
> > you get an event, and you can get access to that data in the case of
> > kqueue by looking at the "data" member or by getting another event flag.
> > I was putting the idea on the table to do this via EPOLLIN and then
> > looking at a simple struct that contains that information.
>
> If you turn pidfd into an event stream, reads have to be destructive.
> If reads are destructive, you can't share pidfd instances between
> multiple readers. If you can't get a pidfd except via clone, you can't
> have more than one pidfd instance for a single process. The overall

It's not off the table that we can add a pidfd_open() if that becomes
a real thing.

> result is that we're back in the same place we were before with the
> old wait system, i.e., only one entity can monitor a process for
> interesting state transitions and everyone else gets a racy,
> inadequate interface via /proc.
> FreeBSD doesn't have this problem
> because you can create an *arbitrary* number of *different* kqueue
> objects, register a PID in each of them, and get an independent
> destructively-read event stream in each context. It's worth noting

If we add pidfd_open() we should be able to do this too though.

> that the FreeBSD process file descriptor from pdfork(2) is *NOT* an
> event stream, as you're describing, but a level-triggered

Because it only implements the exit event. That's what I said before, but
you can do it with pids already. And they thought about this use-case for
pidfds at least. My point is that we shouldn't end up with an interface
that then doesn't allow us to extend it to cover such an API.

> one-transition facility of the sort that I'm advocating.
>
> In other words, FreeBSD already implements the model I'm describing:
> level-triggered simple exit notification for pidfd and a separate
> edge-triggered monitoring facility.
>
> > I like this idea to be honest.
>
> I'm not opposed to some facility that delivers a stream of events
> relating to some process. That could even be epoll, as our rough
> equivalent to kqueue. I don't see a need to make the pidfd the channel
> through which we deliver these events. There's room for both an event

The problem is that epoll_wait() currently does not return data that the
user hasn't already passed in via the "data" argument at ADD or MOD time.
And it seems nice to deliver this through the pidfd itself.

> stream like the one FreeBSD provides and a level-triggered "did this
> process exit or not?" indication via pidfd.
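
[Editorial sketch] The argument quoted above, that destructive reads prevent
several readers from sharing one descriptor while a level-triggered exit
indication does not, can be illustrated with a toy model. This is plain
Python, not the kernel interface; the names LevelTriggeredExit,
DestructiveEventStream and two_readers are hypothetical.

```python
from collections import deque


class LevelTriggeredExit:
    """One-way state: once the exit status is set, every reader sees it."""

    def __init__(self):
        self._status = None

    def mark_exited(self, status):
        if self._status is None:  # the transition happens only once
            self._status = status

    def poll_readable(self):
        return self._status is not None

    def read(self):
        return self._status  # non-destructive: the state stays in place


class DestructiveEventStream:
    """Queue of events: reading an event consumes it for everyone."""

    def __init__(self):
        self._events = deque()

    def post(self, event):
        self._events.append(event)

    def poll_readable(self):
        return bool(self._events)

    def read(self):
        return self._events.popleft() if self._events else None


def two_readers(source):
    """Simulate two independent readers sharing the same object."""
    return source.read(), source.read()


if __name__ == "__main__":
    level = LevelTriggeredExit()
    level.mark_exited(0)
    print(two_readers(level))   # both readers observe the exit status

    stream = DestructiveEventStream()
    stream.post(("exit", 0))
    print(two_readers(stream))  # the second reader finds nothing left
```

In this model the level-triggered object stays readable after the first read,
whereas the event stream needs one queue per reader, which is what separate
kqueue objects (or separately opened pidfds) would provide.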