Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932592AbcCNTcX (ORCPT ); Mon, 14 Mar 2016 15:32:23 -0400
Received: from prod-mail-xrelay07.akamai.com ([23.79.238.175]:25570 "EHLO
        prod-mail-xrelay07.akamai.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1755658AbcCNTcV (ORCPT );
        Mon, 14 Mar 2016 15:32:21 -0400
Subject: Re: [PATCH] epoll: add exclusive wakeups flag
To: "Michael Kerrisk (man-pages)", Andrew Morton
References: <56A9C03B.7020104@gmail.com> <56AA56A2.3000700@akamai.com>
        <56AB1F6C.7000609@gmail.com> <56E1C2B5.2040905@akamai.com>
        <56E1D1D7.8040000@gmail.com> <56E1DBC2.6040109@akamai.com>
        <56E32FC5.4030902@akamai.com> <56E353CF.6050503@gmail.com>
        <56E6D0ED.20609@akamai.com> <56E6F941.9040307@gmail.com>
Cc: mingo@kernel.org, peterz@infradead.org, viro@ftp.linux.org.uk,
        normalperson@yhbt.net, m@silodev.com, corbet@lwn.net,
        luto@amacapital.net, torvalds@linux-foundation.org, hagen@jauu.net,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-api@vger.kernel.org
From: Jason Baron
X-Enigmail-Draft-Status: N1110
Message-ID: <56E711C3.8020008@akamai.com>
Date: Mon, 14 Mar 2016 15:32:19 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
        Thunderbird/38.5.1
MIME-Version: 1.0
In-Reply-To: <56E6F941.9040307@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 6515
Lines: 163

On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>
> Hi Jason,
>
> Thanks for the review. I've tweaked one piece to respond to your
> feedback. But I also have another new question below.
>
> On 03/15/2016 03:55 AM, Jason Baron wrote:
>> On 03/11/2016 06:25 PM, Michael Kerrisk (man-pages) wrote:
>>> On 03/11/2016 09:51 PM, Jason Baron wrote:
>>>> On 03/11/2016 03:30 PM, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
>> Hi Michael,
>>
>> Looks good. One comment below.
>>
>> Thanks,
>>
>>>        EPOLLEXCLUSIVE (since Linux 4.5)
>>>               Sets an exclusive wakeup mode for the epoll file
>>>               descriptor that is being attached to the target file
>>>               descriptor, fd. When a wakeup event occurs and multiple
>>>               epoll file descriptors are attached to the same target
>>>               file using EPOLLEXCLUSIVE, one or more of the epoll file
>>>               descriptors will receive an event with epoll_wait(2).
>>>               The default in this scenario (when EPOLLEXCLUSIVE is not
>>>               set) is for all epoll file descriptors to receive an
>>>               event. EPOLLEXCLUSIVE is thus useful for avoiding
>>>               thundering herd problems in certain scenarios.
>>>
>>>               If the same file descriptor is in multiple epoll
>>>               instances, some with the EPOLLEXCLUSIVE flag, and others
>>>               without, then events will be provided to all epoll
>>>               instances that did not specify EPOLLEXCLUSIVE, and at
>>>               least one of the epoll instances that did specify
>>>               EPOLLEXCLUSIVE.
>>>
>>>               The following values may be specified in conjunction
>>>               with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
>>>               EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
>>>               but are ignored (as usual). Attempts to specify other
>>
>> I'm not sure 'ignored' is the right wording here. 'EPOLLHUP' and
>> 'EPOLLERR' are always included in the set of events when something is
>> added as EPOLLEXCLUSIVE. This is consistent with the non-EPOLLEXCLUSIVE
>> add case.
>
> Yes.
>
>> So 'EPOLLHUP' and 'EPOLLERR' may be specified but will be
>> included in the set of events on an add, whether they are specified or not.
>
> Yes.
> I understand your discomfort with the word "ignored", but the
> problem was that, because it made special mention of EPOLLHUP and EPOLLERR,
> your proposed text made it sound as though EPOLLEXCLUSIVE somehow was
> special with respect to these two flags. I wanted to clarify that it is not.
> How about this:
>
>               The following values may be specified in conjunction
>               with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
>               EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
>               but this is not required: as usual, these events are
>               always reported if they occur, regardless of whether
>               they are specified in events.
> ?

Yes, nothing special here with respect to EPOLLHUP and EPOLLERR. So this
looks fine to me.

>
>>>               values in events yield an error. EPOLLEXCLUSIVE may be
>>>               used only in an EPOLL_CTL_ADD operation; attempts to
>>>               employ it with EPOLL_CTL_MOD yield an error. If
>>>               EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a
>>>               subsequent EPOLL_CTL_MOD on the same epfd, fd pair
>>>               yields an error. An epoll_ctl(2) that specifies
>>>               EPOLLEXCLUSIVE in events and specifies the target file
>>>               descriptor fd as an epoll instance will likewise fail.
>>>               The error in all of these cases is EINVAL.
>>>
>>> ERRORS
>>>        EINVAL An invalid event type was specified along with
>>>               EPOLLEXCLUSIVE in events.
>>>
>>>        EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.
>>>
>>>        EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
>>>               previously been applied to this epfd, fd pair.
>>>
>>>        EINVAL EPOLLEXCLUSIVE was specified in event and fd refers
>>>               to an epoll instance.
>
> Returning to the second sentence in this description:
>
>               When a wakeup event occurs and multiple epoll file
>               descriptors are attached to the same target file using
>               EPOLLEXCLUSIVE, one or more of the epoll file descriptors
>               will receive an event with epoll_wait(2).
>
> There is a point that is unclear to me: what does "target file" refer to?
> Is it an open file description (aka open file table entry) or an inode?
> I suspect the former, but it was not clear in your original text.
>

So from epoll's perspective, the wakeups are associated with a 'wait
queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
file->poll()) result in adding to the same 'wait queue', then we will
get 'exclusive' wakeup behavior.

So in general, I think the answer here is that it's associated with the
inode (I couldn't say with 100% certainty without really looking at all
file->poll() implementations). Certainly, with the 'FIFO' example below,
the two scenarios will have the same behavior with respect to
EPOLLEXCLUSIVE.

Also, the 'non-exclusive' mode would be subject to the same question of
which wait queue the epfd is associated with...

Thanks,

-Jason

> To make this point even clearer, here are two scenarios I'm thinking of.
> In each case, we're talking of monitoring the read end of a FIFO.
>
> ===
>
> Scenario 1:
>
> We have three processes each of which
> 1. Creates an epoll instance
> 2. Opens the read end of the FIFO
> 3. Adds the read end of the FIFO to the epoll instance, specifying
>    EPOLLEXCLUSIVE
>
> When input becomes available on the FIFO, how many processes
> get a wakeup?
>
> ===
>
> Scenario 2:
>
> A parent process opens the read end of a FIFO and then calls
> fork() three times to create three children. Each child then:
>
> 1. Creates an epoll instance
> 2. Adds the read end of the FIFO to the epoll instance, specifying
>    EPOLLEXCLUSIVE
>
> When input becomes available on the FIFO, how many processes
> get a wakeup?
>
> ===
>
> Cheers,
>
> Michael
>
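
For anyone who wants to try the forked-children scenario quoted above, here
is a minimal sketch in C. It is an illustration under stated assumptions,
not code from this thread: it assumes Linux 4.5 or later and a writable
/tmp, and the names count_exclusive_wakeups() and child_wait(), the temp
directory template, and the 100 ms settle delay are all made up for the
example. It only counts how many children return from epoll_wait(2) with
an event; no particular count is implied, and the result can vary with
timing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE			/* userspace headers older than 4.5 */
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/* Child (hypothetical helper): create a private epoll instance and add the
 * inherited FIFO fd with EPOLLEXCLUSIVE.
 * Exit status: 1 = got an event, 0 = timed out, 2 = setup failed. */
static void child_wait(int fifo_rfd, int ready_wfd, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
	struct epoll_event out;
	int epfd = epoll_create1(0);

	if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, fifo_rfd, &ev) < 0)
		_exit(2);
	if (write(ready_wfd, "r", 1) != 1)	/* tell parent we are registered */
		_exit(2);
	_exit(epoll_wait(epfd, &out, 1, timeout_ms) == 1 ? 1 : 0);
}

/* Parent opens the FIFO once, forks nchildren waiters, writes one byte, and
 * returns how many children saw an event (or -1 on setup failure; error
 * paths skip cleanup for brevity). */
int count_exclusive_wakeups(int nchildren, int timeout_ms)
{
	char dir[] = "/tmp/epollex-XXXXXX";
	char path[64], c;
	int ready[2], rfd, wfd, status, woken = 0;

	assert(nchildren > 0);
	if (!mkdtemp(dir) || pipe(ready) < 0)
		return -1;
	snprintf(path, sizeof(path), "%s/fifo", dir);
	if (mkfifo(path, 0600) < 0)
		return -1;
	/* O_NONBLOCK lets the read-side open() return before a writer exists. */
	rfd = open(path, O_RDONLY | O_NONBLOCK);
	wfd = open(path, O_WRONLY);
	if (rfd < 0 || wfd < 0)
		return -1;

	for (int i = 0; i < nchildren; i++) {
		pid_t pid = fork();
		if (pid < 0)
			return -1;
		if (pid == 0)
			child_wait(rfd, ready[1], timeout_ms);	/* never returns */
	}
	close(ready[1]);

	/* Wait until every child has done its EPOLL_CTL_ADD. */
	for (int i = 0; i < nchildren; i++)
		if (read(ready[0], &c, 1) != 1)
			return -1;
	usleep(100 * 1000);	/* heuristic: let the children block in epoll_wait */

	if (write(wfd, "x", 1) != 1)	/* make input available on the FIFO */
		return -1;

	for (int i = 0; i < nchildren; i++)
		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 1)
			woken++;

	close(rfd);
	close(wfd);
	close(ready[0]);
	unlink(path);
	rmdir(dir);
	return woken;
}
```

Calling count_exclusive_wakeups(3, 500) forks three waiters and reports
how many of them saw an event within 500 ms. Moving the open() of the read
end into child_wait() would give a comparable harness for the other
scenario, where each process opens the FIFO itself.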