From: Daniel Colascione
Date: Fri, 19 Apr 2019 14:21:03 -0700
Subject: Re: [PATCH RFC 1/2] Add polling support to pidfd
References: <20190411175043.31207-1-joel@joelfernandes.org> <20190416120430.GA15437@redhat.com> <20190416192051.GA184889@google.com> <20190417130940.GC32622@redhat.com> <20190419190247.GB251571@google.com> <20190419191858.iwcvqm6fihbkaata@brauner.io> <20190419194902.GE251571@google.com>
To: Christian Brauner
Cc: Joel Fernandes, Jann Horn, Oleg Nesterov, Florian Weimer, kernel list, Andy Lutomirski, Steven Rostedt, Suren Baghdasaryan, Linus Torvalds, Alexey Dobriyan, Al Viro, Andrei Vagin, Andrew Morton, Arnd Bergmann, "Eric W.
Biederman", Kees Cook, linux-fsdevel, "open list:KERNEL SELFTEST FRAMEWORK", Michal Hocko, Nadav Amit, Serge Hallyn, Shuah Khan, Stephen Rothwell, Taehee Yoo, Tejun Heo, Thomas Gleixner, kernel-team, Tycho Andersen
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner wrote:
>
> On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione wrote:
> >
> > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes wrote:
> > >
> > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn wrote:
> > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov wrote:
> > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> > > > > > >> > >
> > > > > > >> > > Could you explain when it should return POLLIN? When the whole
> > > > > > >process exits?
> > > > > > >> >
> > > > > > >> > It returns POLLIN when the task is dead or doesn't exist anymore,
> > > > > > >or when it
> > > > > > >> > is in a zombie state and there's no other thread in the thread
> > > > > > >group.
> > > > > > >>
> > > > > > >> IOW, when the whole thread group exits, so it can't be used to
> > > > > > >monitor sub-threads.
> > > > > > >>
> > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > >proc_tid_base_operations,
> > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you are
> > > > > > >going to use
> > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > >
> > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > >proc_tgid_base_operations despite not being a thread group leader.
> > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this code can
> > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > >
> > > > > > >@Joel: I think you'll have to either rewrite this to explicitly bail
> > > > > > >out if you're dealing with a thread group leader, or make the code
> > > > > > >work for threads, too.
> > > > > >
> > > > > > The latter case probably being preferred if this API is supposed to be
> > > > > > useable for thread management in userspace.
> > > > >
> > > > > At the moment, we are not planning to use this for sub-thread management. I
> > > > > am reworking this patch to only work on clone(2) pidfds which makes the above
> > > >
> > > > Indeed and agreed.
> > > >
> > > > > discussion about /proc a bit unnecessary I think. Per the latest CLONE_PIDFD
> > > > > patches, CLONE_THREAD with pidfd is not supported.
> > > >
> > > > Yes. We have no one asking for it right now and we can easily add this
> > > > later.
> > > >
> > > > Admittedly I haven't gotten around to reviewing the patches here yet
> > > > completely. But one thing about using POLLIN. FreeBSD is using POLLHUP
> > > > on process exit which I think is nice as well. How about returning
> > > > POLLIN | POLLHUP on process exit?
> > > > We already do things like this. For example, when you proxy between
> > > > ttys. If the process that you're reading data from has exited and closed
> > > > it's end you still can't usually simply exit because it might have still
> > > > buffered data that you want to read. The way one can deal with this
> > > > from userspace is that you can observe a (POLLHUP | POLLIN) event and
> > > > you keep on reading until you only observe a POLLHUP without a POLLIN
> > > > event at which point you know you have read
> > > > all data.
> > > > I like the semantics for pidfds as well as it would indicate:
> > > > - POLLHUP -> process has exited
> > > > - POLLIN -> information can be read
> > >
> > > Actually I think a bit different about this, in my opinion the pidfd should
> > > always be readable (we would store the exit status somewhere in the future
> > > which would be readable, even after task_struct is dead). So I was thinking
> > > we always return EPOLLIN. If process has not exited, then it blocks.
> >
> > ITYM that a pidfd polls as readable *once a task exits* and stays
> > readable forever. Before a task exit, a poll on a pidfd should *not*
> > yield POLLIN and reading that pidfd should *not* complete immediately.
> > There's no way that, having observed POLLIN on a pidfd, you should
> > ever then *not* see POLLIN on that pidfd in the future --- it's a
> > one-way transition from not-ready-to-get-exit-status to
> > ready-to-get-exit-status.
>
> What do you consider interesting state transitions? A listener on a pidfd
> in epoll_wait() might be interested if the process execs for example.
> That's a very valid use-case for e.g. systemd.

Sure, but systemd is specialized. There are two broad classes of
programs that care about process exit status: 1) those that just want
to do something and wait for it to complete, and 2) programs that want
to perform detailed monitoring of processes and intervention in their
state. #1 is overwhelmingly more common. The basic pidfd feature
should take care of case #1 only, as wait*() in file descriptor form.
I definitely don't think we should be complicating the interface and
making it more error-prone (see below) for the sake of that rare
program that cares about non-exit notification conditions.
You're proposing a complicated combination of poll bit flags that most
users (the ones who just want to wait for processes) don't care about
and that risks making the facility hard to use with existing event
loops, which generally recognize readability and writability as the
only properties that are worth monitoring.

> We can't use EPOLLIN for that too otherwise you'd need to to waitid(_WNOHANG)
> to check whether an exit status can be read which is not nice and then you
> multiplex different meanings on the same bit.
> I would prefer if the exit status can only be read from the parent which is
> clean and the least complicated semantics, i.e. Linus waitid() idea.

Exit status information should be *at least* as broadly available
through pidfds as it is through the last field of /proc/pid/stat
today, and probably more broadly. I've been saying for six months now
that we need to talk about *who* should have access to exit status
information. We haven't had that conversation yet. My preference is to
just make exit status information globally available, as FreeBSD seems
to do. I think it would be broadly useful for something like pkill to
wait for processes to exit and to retrieve their exit information.

Speaking of pkill: AIUI, in your current patch set, one can get a
pidfd *only* via clone. Joel indicated that he believes poll(2)
shouldn't be supported on procfs pidfds. Is that your thinking as
well? If that's the case, then we're in a state where non-parents
can't wait for process exit, and providing this facility is an
important goal of the whole project.

> EPOLLIN on a pidfd could very well mean that data can be read via
> a read() on the pidfd *other* than the exit status. The read could e.g.
> give you a lean struct that indicates the type of state transition: NOTIFY_EXIT,
> NOTIFY_EXEC, etc.. This way we are not bound to a specific poll event indicating
> a specific state.
> Though there's a case to be made that EPOLLHUP could indicate process exit
> and EPOLLIN a state change + read().

And do you imagine making read() destructive? Does that read() then
reset the POLLIN state? You're essentially proposing that a pidfd
provide an "event stream" interface, delivering notification packets
that indicate state changes like "process exited" or "process stopped"
or "process execed". While this sort of interface is powerful and has
some nice properties that tools like debuggers and daemon monitors
might want to use, I think it's too complicated and error-prone for
the overwhelmingly common case of wanting to monitor process lifetime.
I'd much rather pidfd provide a simple one-state-transition
level-triggered (not edge-triggered, as your suggestion implies)
facility. If we want to let sophisticated programs read a stream of
notification packets indicating changes in process state, we can
provide that as a separate interface in future work.

I like Linus' idea of just making waitid(2) (not waitpid(2), as I
mistakenly mentioned earlier) on a pidfd act *exactly* like a
waitid(2) on the corresponding process and making POLLIN just mean
"waitid will succeed". It's a nice simple model that's easy to reason
about and that makes it easy to port existing code to pidfds. I am
very much against signaling additional information on basic pidfds
using non-POLLIN poll flags.