2015-12-08 03:24:59

by Jason Baron

Subject: [PATCH] epoll: add exclusive wakeups flag

Hi,

Re-post of an old series addressing thundering herd issues when sharing
an event source fd amongst multiple epoll fds. Last posting was here
for reference: https://lkml.org/lkml/2015/2/25/56

The patch herein drops the core scheduler 'rotate' changes I had previously
proposed as this patch seems performant without those.

I was prompted to re-post this because Madars Vitolins reported some good
speedups with this patch using the Enduro/X application. His writeup is here:
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

Thanks,

-Jason

Sample epoll_ctl text:

EPOLLEXCLUSIVE
Sets an exclusive wakeup mode for the epfd file descriptor that is
being attached to the target file descriptor, fd. Thus, when an
event occurs and multiple epfd file descriptors are attached to the
same target file using EPOLLEXCLUSIVE, one or more epfds will receive
an event with epoll_wait(2). The default in this scenario (when
EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.
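
For illustration only (this is not part of the patch): a minimal user-space
sketch of the behaviour described in the sample text. It attaches one shared
eventfd to several epoll instances with EPOLLEXCLUSIVE, signals it once, and
counts how many instances report the event. The helper name
count_exclusive_wakeups() and MAX_EPFDS are hypothetical; the fallback
#define uses the value from the uapi header change in this patch and is only
needed if the libc headers do not yet provide the flag.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value from the uapi header in this patch */
#endif

#define MAX_EPFDS 16 /* arbitrary limit for this sketch */

/*
 * Attach one shared eventfd to n epoll instances with EPOLLEXCLUSIVE,
 * signal it once, and return how many instances report an event
 * (-1 on error). No thread is blocked in epoll_wait() at signal time.
 */
int count_exclusive_wakeups(int n)
{
    int epfds[MAX_EPFDS];
    int created = 0, woken = -1, i;
    uint64_t one = 1;
    int src;

    if (n < 1 || n > MAX_EPFDS)
        return -1;
    src = eventfd(0, EFD_NONBLOCK);
    if (src < 0)
        return -1;

    for (i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

        ev.data.fd = src;
        epfds[i] = epoll_create1(0);
        if (epfds[i] < 0)
            goto out;
        created++;
        if (epoll_ctl(epfds[i], EPOLL_CTL_ADD, src, &ev) < 0)
            goto out;
    }

    /* Generate a single event on the shared source. */
    if (write(src, &one, sizeof(one)) != (ssize_t)sizeof(one))
        goto out;

    /* Poll every instance without blocking and count the ready ones. */
    woken = 0;
    for (i = 0; i < n; i++) {
        struct epoll_event got;

        if (epoll_wait(epfds[i], &got, 1, 0) == 1)
            woken++;
    }
out:
    for (i = 0; i < created; i++)
        close(epfds[i]);
    close(src);
    return woken;
}
```

With three instances, count_exclusive_wakeups(3) is expected to return a
value between 1 and 3. The exact number depends on whether any epfd has
threads blocked in epoll_wait() when the event arrives, so the sketch makes
no stronger claim than "one or more".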

Jason Baron (1):
epoll: add EPOLLEXCLUSIVE flag

fs/eventpoll.c | 24 +++++++++++++++++++++---
include/uapi/linux/eventpoll.h | 3 +++
2 files changed, 24 insertions(+), 3 deletions(-)

--
2.6.1


2015-12-08 03:25:32

by Jason Baron

Subject: [PATCH] epoll: add EPOLLEXCLUSIVE flag

Currently, epoll file descriptors or epfds (the fd returned from
epoll_create[1]()) that are added to a shared wakeup source are always
added in a non-exclusive manner. This means that when we have multiple
epfds attached to a shared fd source they are all woken up. This creates
thundering herd type behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new
flag allows for exclusive wakeups when there are multiple epfds attached to
a shared fd event source.

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait(). The idea is to search for threads which are
idle and ready to process the wakeup events. Thus, we queue an event to at
least 1 epfd, but may still potentially queue an event to all epfds that
are attached to the shared fd source.
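
The walk described above can be modelled in user space as follows. This is a
simplified sketch, not kernel code: struct model_waiter and model_wake() are
hypothetical names, and the queue order (non-exclusive entries first,
exclusive entries after them) is an assumption based on how add_wait_queue()
and add_wait_queue_exclusive() normally place entries.

```c
#include <stdbool.h>
#include <stddef.h>

/* One epfd attached to the shared event source (model only). */
struct model_waiter {
    bool exclusive;   /* attached with EPOLLEXCLUSIVE */
    bool has_sleeper; /* some thread is blocked in epoll_wait() on it */
    bool got_event;   /* set by model_wake(): an event was queued */
};

/*
 * Deliver one wakeup to the queue q[0..n-1] and return how many epfds
 * had an event queued. Non-exclusive waiters always receive the event.
 * The walk stops after the first exclusive waiter that has a sleeper,
 * mirroring "return ewake" in ep_poll_callback().
 */
size_t model_wake(struct model_waiter *q, size_t n)
{
    size_t delivered = 0;
    size_t i;

    for (i = 0; i < n; i++)
        q[i].got_event = false;

    for (i = 0; i < n; i++) {
        /* The callback reports 1 for non-exclusive, ewake otherwise. */
        bool cb_ret = q[i].exclusive ? q[i].has_sleeper : true;

        q[i].got_event = true;
        delivered++;
        if (cb_ret && q[i].exclusive)
            break; /* one exclusive waiter was actually woken */
    }
    return delivered;
}
```

In this model, a queue with no sleeping exclusive waiters delivers the event
to every epfd, which matches the "may still potentially queue an event to
all epfds" caveat above.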

Performance testing was done by Madars Vitolins using a modified version of
Enduro/X. The use of the 'EPOLLEXCLUSIVE' flag reduced the run time of this
particular workload from 860s down to 24s.

Tested-by: Madars Vitolins <[email protected]>
Signed-off-by: Jason Baron <[email protected]>
---
fs/eventpoll.c | 24 +++++++++++++++++++++---
include/uapi/linux/eventpoll.h | 3 +++
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..ae1dbcf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
*/

/* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
unsigned long flags;
struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;
+ int ewake = 0;

if ((unsigned long)key & POLLFREE) {
ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
- if (waitqueue_active(&ep->wq))
+ if (waitqueue_active(&ep->wq)) {
+ ewake = 1;
wake_up_locked(&ep->wq);
+ }
if (waitqueue_active(&ep->poll_wait))
pwake++;

@@ -1078,6 +1081,9 @@ out_unlock:
if (pwake)
ep_poll_safewake(&ep->poll_wait);

+ if (epi->event.events & EPOLLEXCLUSIVE)
+ return ewake;
+
return 1;
}

@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
- add_wait_queue(whead, &pwq->wait);
+ if (epi->event.events & EPOLLEXCLUSIVE)
+ add_wait_queue_exclusive(whead, &pwq->wait);
+ else
+ add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
@@ -1862,6 +1871,15 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
goto error_tgt_fput;

/*
+ * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
+ * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
+ * Also, we do not currently supported nested exclusive wakeups.
+ */
+ if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+ (op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+ goto error_tgt_fput;
+
+ /*
* At this point it is safe to assume that the "private_data" contains
* our own data structure.
*/
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..1c31549 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3

+/* Set exclusive wakeup mode for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 28)
+
/*
* Request the handling of system wakeup events so as to prevent system suspends
* from happening while those events are being processed.
--
2.6.1

2016-03-10 18:53:53

by Jason Baron

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Michael,

On 01/29/2016 03:14 AM, Michael Kerrisk (man-pages) wrote:
> Hello Jason,
> On 01/28/2016 06:57 PM, Jason Baron wrote:
>> Hi,
>>
>> On 01/28/2016 02:16 AM, Michael Kerrisk (man-pages) wrote:
>>> Hi Jason,
>>>
>>> On 12/08/2015 04:23 AM, Jason Baron wrote:
>>>> Hi,
>>>>
>>>> Re-post of an old series addressing thundering herd issues when sharing
>>>> an event source fd amongst multiple epoll fds. Last posting was here
>>>> for reference: https://lkml.org/lkml/2015/2/25/56
>>>>
>>>> The patch herein drops the core scheduler 'rotate' changes I had previously
>>>> proposed as this patch seems performant without those.
>>>>
>>>> I was prompted to re-post this because Madars Vitolins reported some good
>>>> speedups with this patch using Enduro/X application. His writeup is here:
>>>> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>>
>>>> Sample epoll_ctl text:
>>>
>>> Thanks for the proposed text. I have some questions about points
>>> that are not quite clear to me.
>>>
>>>> EPOLLEXCLUSIVE
>>>> Sets an exclusive wakeup mode for the epfd file descriptor that is
>>>> being attached to the target file descriptor, fd. Thus, when an
>>>> event occurs and multiple epfd file descriptors are attached to the
>>>> same target file using EPOLLEXCLUSIVE, one or more epfds will receive
>>>> an event with epoll_wait(2). The default in this scenario (when
>>>> EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
>>>> EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.
>>>
>>> So, assuming an FD is present in the interest list of multiple (say 6)
>>> epoll FDs, and some (say 3) of those attachments were done using
>>> EPOLLEXCLUSIVE. Which of the following statements are correct:
>>>
>>> (a) It's guaranteed that *none* of the epoll FDs that did NOT specify
>>> EPOLLEXCLUSIVE will receive an event.
>>>
>>> (b) It's guaranteed that *all* of the epoll FDs that did NOT specify
>>> EPOLLEXCLUSIVE will receive an event.
>>>
>>> (c) From 1 to 3 of the epoll FDs that did specify EPOLLEXCLUSIVE
>>> will receive an event.
>>>
>>> (d) Exactly one epoll FD that did specify EPOLLEXCLUSIVE will get
>>> an event, and it is indeterminate which one.
>>>
>>
>> So b and c. All the non-exclusive adds will get it and at least 1 of the
>> exclusive adds will as well.
>
> So is it fair to say that the expected use case is that all epoll sets
> would use EPOLLEXCLUSIVE?
>
>>> I suppose one point I'm trying to uncover in the above is: what is
>>> the scope of EPOLLEXCLUSIVE? Is it just applicable for one process's
>>> FD, or is it setting an attribute in the epoll "interest list" record
>>> for that FD that affects notification behavior across all processes?
>>>
>>
>> Right - so 'EPOLLEXCLUSIVE' will affect other epoll sets that are also
>> using 'EPOLLEXCLUSIVE' against the same fd, but will have no effect
>> on epoll sets connected to fd that do not specify it.
>>
>>
>>> And then:
>>>
>>> (1) What are the semantics of EPOLLEXCLUSIVE if the added FD becomes
>>> disabled via EPOLLONESHOT (or explicitly via EPOLL_CTL_MOD with
>>> the 'events' field set to 0)?
>>>
>>
>> In the case of EPOLLEXCLUSIVE and EPOLLONESHOT, one would have to re-arm
>> at least 1 of the threads that was woken up by doing EPOLL_CTL_MOD to
>> guarantee further wakeups.
>>
>> And likewise with an EPOLL_CTL_MOD with 'events' all set to 0, one
>> would need to either re-arm the thread that set the 'events' field to 0
>> (by setting back to non-zero), or re-arm in at least one other thread
>> via EPOLL_CTL_MOD (or delete and add).
>
> Okay -- so when an EPOLLEXCLUSIVE FD becomes disarmed it is possible
> to re-enable with EPOLL_CTL_MOD; one doesn't need to delete and re-add
> the FD.
>
>>> (2) The source code contains a comment "we do not currently supported
>>> nested exclusive wakeups". Could you elaborate on this point? It
>>> sounds like something that should be documented.
>>
>> So I was just trying to say that we return -EINVAL if you try to do an
>> EPOLL_CTL_ADD with EPOLLEXCLUSIVE and the 'fd' argument is an epoll fd
>> returned via epoll_create().
>
> Okay -- that definitely belongs in the man page.
>
> I'll work up a text, but would like to get input about the "use case"
> question above.
>
> Cheers,
>
> Michael
>
>
>

Ok, here's some updated text:

EPOLLEXCLUSIVE

Sets an exclusive wakeup mode for the epfd file descriptor that is being
attached to the target file descriptor, fd. When a wakeup event occurs
and multiple epfd file descriptors are attached to the same target file
using EPOLLEXCLUSIVE, one or more epfds will receive an event with
epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not
set) is for all epfds to receive an event.

The events supported by EPOLLEXCLUSIVE are: EPOLLIN, EPOLLOUT, EPOLLERR,
EPOLLHUP, EPOLLWAKEUP, and EPOLLET. epoll_wait(2) will always wait for
EPOLLERR and EPOLLHUP; it is not necessary to set them in events. If
EPOLLEXCLUSIVE is set using epoll_ctl(2), then a subsequent
EPOLL_CTL_MOD on the same epfd, fd pair will return -EINVAL. An
epoll_ctl(2) that specifies EPOLLEXCLUSIVE in events and specifies the
target file descriptor fd as an epoll instance will return -EINVAL
as well.

Thanks,

-Jason



Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Jason,

> Ok, here's some updated text:
>
> EPOLLEXCLUSIVE
>
> Sets an exclusive wakeup mode for the epfd file descriptor that is being
> attached to the target file descriptor, fd. When a wakeup event occurs
> and multiple epfd file descriptors are attached to the same target file
> using EPOLLEXCLUSIVE, one or more epfds will receive an event with
> epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not
> set) is for all epfds to receive an event.
>
> The events supported by EPOLLEXCLUSIVE are: EPOLLIN, EPOLLOUT, EPOLLERR,
> EPOLLHUP, EPOLLWAKEUP, and EPOLLET. epoll_wait(2) will always wait for
> EPOLLERR and EPOLLHUP; it is not necessary to set them in events. If
> EPOLLEXCLUSIVE is set using epoll_ctl(2), then a subsequent
> EPOLL_CTL_MOD on the same epfd, fd pair will return -EINVAL. An
> epoll_ctl(2) that specifies EPOLLEXCLUSIVE in events and specifies the
> target file descriptor fd as an epoll instance will return -EINVAL
> as well.

So, I worked that up into the following text:

EPOLLEXCLUSIVE (since Linux 4.5)
Sets an exclusive wakeup mode for the epoll file
descriptor that is being attached to the target file
descriptor, fd. When a wakeup event occurs and multiple
epoll file descriptors are attached to the same target
file using EPOLLEXCLUSIVE, one or more of the epoll file
descriptors will receive an event with epoll_wait(2).
The default in this scenario (when EPOLLEXCLUSIVE is not
set) is for all epoll file descriptors to receive an
event. EPOLLEXCLUSIVE is thus useful for avoiding thun‐
dering herd problems in certain scenarios.

If the same file descriptor is in multiple epoll
instances, some with the EPOLLEXCLUSIVE flag, and others
without, then events will be provided to all epoll
instances that did not specify EPOLLEXCLUSIVE, and at
least one of the epoll instances that did specify
EPOLLEXCLUSIVE.

The following values may be specified in conjunction
with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
but are ignored (as usual). Attempts to specify other
values in events yield an error. EPOLLEXCLUSIVE may be
used only in an EPOLL_CTL_ADD operation; attempts to
employ it with EPOLL_CTL_MOD yield an error. If
EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a
subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an
error. An epoll_ctl(2) that specifies EPOLLEXCLUSIVE in
events and specifies the target file descriptor fd as an
epoll instance will likewise fail. The error in all of
these cases is EINVAL.

ERRORS
EINVAL An invalid event type was specified along with EPOLLEX‐
CLUSIVE in events.

EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.

EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
previously been applied to this epfd, fd pair.

EINVAL EPOLLEXCLUSIVE was specified in event and fd refers
to an epoll instance.
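
The four EINVAL cases listed above can be exercised from user space with a
sketch along these lines. It is an illustration under assumptions, not a
test that shipped with the patch: the enum, ctl_errno() and
exclusive_error_case() are hypothetical names, and the fallback #define uses
the flag value from the patch for libc headers that lack it.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value from the uapi header in the patch */
#endif

enum excl_case {
    CASE_BAD_EVENT,          /* disallowed event combined with the flag */
    CASE_MOD_ADDS_FLAG,      /* EPOLL_CTL_MOD tries to set the flag */
    CASE_MOD_AFTER_EXCL_ADD, /* EPOLL_CTL_MOD on an exclusive entry */
    CASE_TARGET_IS_EPOLL     /* target fd is itself an epoll instance */
};

/* Run epoll_ctl() and return 0 on success or the errno value on failure. */
static int ctl_errno(int epfd, int op, int fd, unsigned int events)
{
    struct epoll_event ev = { .events = events };

    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Returns the errno of the final epoll_ctl() call, or -1 on setup failure. */
int exclusive_error_case(enum excl_case which)
{
    int epfd = epoll_create1(0);
    int inner = epoll_create1(0);
    int src = eventfd(0, 0);
    int ret = -1;

    if (epfd < 0 || inner < 0 || src < 0)
        goto out;

    switch (which) {
    case CASE_BAD_EVENT:
        ret = ctl_errno(epfd, EPOLL_CTL_ADD, src,
                        EPOLLIN | EPOLLONESHOT | EPOLLEXCLUSIVE);
        break;
    case CASE_MOD_ADDS_FLAG:
        if (ctl_errno(epfd, EPOLL_CTL_ADD, src, EPOLLIN) == 0)
            ret = ctl_errno(epfd, EPOLL_CTL_MOD, src,
                            EPOLLIN | EPOLLEXCLUSIVE);
        break;
    case CASE_MOD_AFTER_EXCL_ADD:
        if (ctl_errno(epfd, EPOLL_CTL_ADD, src,
                      EPOLLIN | EPOLLEXCLUSIVE) == 0)
            ret = ctl_errno(epfd, EPOLL_CTL_MOD, src, EPOLLIN);
        break;
    case CASE_TARGET_IS_EPOLL:
        ret = ctl_errno(epfd, EPOLL_CTL_ADD, inner,
                        EPOLLIN | EPOLLEXCLUSIVE);
        break;
    }
out:
    if (src >= 0)
        close(src);
    if (inner >= 0)
        close(inner);
    if (epfd >= 0)
        close(epfd);
    return ret;
}
```

On a kernel that implements the behaviour documented above (Linux 4.5 or
later), each case is expected to report EINVAL.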

Is there anything that needs to be fixed in the above text?

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

On 03/10/2016 07:53 PM, Jason Baron wrote:
> [...]

By the way, in the code you have

    case EPOLL_CTL_MOD:
        if (epi) {
            if (!(epi->event.events & EPOLLEXCLUSIVE)) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_modify(ep, epi, &epds);
            }

I think the "if" here is redundant. IIUC, earlier in the code you
disallow EPOLL_CTL_MOD with EPOLLEXCLUSIVE.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2016-03-10 20:40:41

by Jason Baron

Subject: Re: [PATCH] epoll: add exclusive wakeups flag



On 03/10/2016 02:58 PM, Michael Kerrisk (man-pages) wrote:
> [...]
> By the way, in the code you have
>
> case EPOLL_CTL_MOD:
> if (epi) {
> if (!(epi->event.events & EPOLLEXCLUSIVE)) {
> epds.events |= POLLERR | POLLHUP;
> error = ep_modify(ep, epi, &epds);
> }
>
> I think the "if" here is redundant. IIUC, earlier in the code you
> disallow EPOLL_CTL_MOD with EPOLLEXCLUSIVE.
>
> Cheers,
>
> Michael
>
>

Hi Michael,

So the previous check ensures that you cannot add the EPOLLEXCLUSIVE
flag to the events via an EPOLL_CTL_MOD operation, where EPOLLEXCLUSIVE
may not be in the existing events set. This check here, in contrast,
ensures you can't modify an existing set that already has the
EPOLLEXCLUSIVE flag.
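
To make the distinction between the two checks concrete, here is a small
user-space model of the decision logic. It is a sketch only: the function
model_exclusive_checks() and its parameters are hypothetical and merely
restate the two conditions discussed here; it is not the code in
fs/eventpoll.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value from the uapi header in the patch */
#endif

/*
 * Model of the two EPOLLEXCLUSIVE checks in epoll_ctl().
 *   requested:       events passed in by the caller
 *   existing:        events already stored for this epfd, fd pair
 *   target_is_epoll: the target fd is itself an epoll instance
 * Returns 0 if the operation would pass both checks, -EINVAL otherwise.
 */
int model_exclusive_checks(int op, uint32_t requested, uint32_t existing,
                           bool target_is_epoll)
{
    /* Check 1: looks only at the flags the caller is asking for. */
    if ((requested & EPOLLEXCLUSIVE) &&
        (op == EPOLL_CTL_MOD ||
         (op == EPOLL_CTL_ADD && target_is_epoll)))
        return -EINVAL;

    /* Check 2: looks only at the flags already stored on the item. */
    if (op == EPOLL_CTL_MOD && (existing & EPOLLEXCLUSIVE))
        return -EINVAL;

    return 0;
}
```

The second check is the one that catches an EPOLL_CTL_MOD whose new event
mask does not mention EPOLLEXCLUSIVE at all, which the first check cannot
see.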

Thanks,

-Jason

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

>> By the way, in the code you have
>>
>> case EPOLL_CTL_MOD:
>> if (epi) {
>> if (!(epi->event.events & EPOLLEXCLUSIVE)) {
>> epds.events |= POLLERR | POLLHUP;
>> error = ep_modify(ep, epi, &epds);
>> }
>>
>> I think the "if" here is redundant. IIUC, earlier in the code you
>> disallow EPOLL_CTL_MOD with EPOLLEXCLUSIVE.
>>
>> Cheers,
>>
>> Michael
>>
>>
>
> Hi Michael,
>
> So the previous check ensures that you can not add the EPOLLEXCLUSIVE
> flag to the events via an EPOLL_CTL_MOD operation, where EPOLLEXCLUSIVE
> may not be in the existing events set. While this check here ensures you
> can't modify an existing set that already has the EPOLLEXCLUSIVE flag.

Hmmm - I misread the code, it seems :-/. Could you please carefully
check the man page text I sent earlier in this thread? Maybe I
injected some errors into the text.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

[Restoring CC, which I see I accidentally dropped, one iteration back.]

Hi Jason,

Thanks for the review. I've tweaked one piece to respond to your
feedback. But I also have another new question below.

On 03/15/2016 03:55 AM, Jason Baron wrote:
> On 03/11/2016 06:25 PM, Michael Kerrisk (man-pages) wrote:
>> On 03/11/2016 09:51 PM, Jason Baron wrote:
>>> On 03/11/2016 03:30 PM, Michael Kerrisk (man-pages) wrote:

[...]

> Hi Michael,
>
> Looks good. One comment below.
>
> Thanks,
>
>> EPOLLEXCLUSIVE (since Linux 4.5)
>> Sets an exclusive wakeup mode for the epoll file
>> descriptor that is being attached to the target file
>> descriptor, fd. When a wakeup event occurs and multiple
>> epoll file descriptors are attached to the same target
>> file using EPOLLEXCLUSIVE, one or more of the epoll file
>> descriptors will receive an event with epoll_wait(2).
>> The default in this scenario (when EPOLLEXCLUSIVE is not
>> set) is for all epoll file descriptors to receive an
>> event. EPOLLEXCLUSIVE is thus useful for avoiding thun‐
>> dering herd problems in certain scenarios.
>>
>> If the same file descriptor is in multiple epoll
>> instances, some with the EPOLLEXCLUSIVE flag, and others
>> without, then events will be provided to all epoll
>> instances that did not specify EPOLLEXCLUSIVE, and at
>> least one of the epoll instances that did specify
>> EPOLLEXCLUSIVE.
>>
>> The following values may be specified in conjunction
>> with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
>> EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
>> but are ignored (as usual). Attempts to specify other
>
> I'm not sure 'ignored' is the right wording here. 'EPOLLHUP' and
> 'EPOLLERR' are always included in the set of events when something is
> added as EPOLLEXCLUSIVE. This is consistent with the non-EPOLLEXCLUSIVE
> add case.

Yes.

> So 'EPOLLHUP' and 'EPOLLERR' may be specified but will be
> included in the set of events on an add, whether they are specified or not.

Yes. I understand your discomfort with the word "ignored", but the
problem was that, because it made special mention of EPOLLHUP and EPOLLERR,
your proposed text made it sound as though EPOLLEXCLUSIVE somehow was
special with respect to these two flags. I wanted to clarify that it is not.
How about this:

The following values may be specified in conjunction
with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
but this is not required: as usual, these events are
always reported if they occur, regardless of whether
they are specified in events.
?
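
As a rough check of the wording above, the following sketch registers the
read end of a pipe with only EPOLLIN | EPOLLEXCLUSIVE and then closes the
write end. The helper name events_after_writer_close() is hypothetical, and
the fallback #define uses the flag value from the patch; the expectation
that EPOLLHUP shows up without being requested follows from the discussion
here rather than from anything this sketch proves.

```c
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value from the uapi header in the patch */
#endif

/*
 * Add the read end of a pipe with EPOLLIN | EPOLLEXCLUSIVE (no EPOLLHUP
 * requested), close the write end, and return the event mask reported
 * by epoll_wait(). Returns 0 if setup fails or nothing is reported.
 */
unsigned int events_after_writer_close(void)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    struct epoll_event got = { 0 };
    unsigned int seen = 0;
    int p[2];
    int epfd;

    if (pipe(p) < 0)
        return 0;

    epfd = epoll_create1(0);
    if (epfd >= 0) {
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
            /* Hang up: the last writer goes away. */
            close(p[1]);
            p[1] = -1;
            /* Wait up to one second for the hangup to be reported. */
            if (epoll_wait(epfd, &got, 1, 1000) == 1)
                seen = got.events;
        }
        close(epfd);
    }
    if (p[1] >= 0)
        close(p[1]);
    close(p[0]);
    return seen;
}
```

If the returned mask has EPOLLHUP set, the hangup was reported even though
only EPOLLIN was requested alongside EPOLLEXCLUSIVE.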

>> values in events yield an error. EPOLLEXCLUSIVE may be
>> used only in an EPOLL_CTL_ADD operation; attempts to
>> employ it with EPOLL_CTL_MOD yield an error. If
>> EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a
>> subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an
>> error. An epoll_ctl(2) that specifies EPOLLEXCLUSIVE in
>> events and specifies the target file descriptor fd as an
>> epoll instance will likewise fail. The error in all of
>> these cases is EINVAL.
>>
>> ERRORS
>> EINVAL An invalid event type was specified along with EPOLLEX‐
>> CLUSIVE in events.
>>
>> EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.
>>
>> EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
>> previously been applied to this epfd, fd pair.
>>
>> EINVAL EPOLLEXCLUSIVE was specified in event and fd refers
>> to an epoll instance.

Returning to the second sentence in this description:

When a wakeup event occurs and multiple epoll file descrip‐
tors are attached to the same target file using EPOLLEXCLU‐
SIVE, one or more of the epoll file descriptors will
receive an event with epoll_wait(2).

There is a point that is unclear to me: what does "target file" refer to?
Is it an open file description (aka open file table entry) or an inode?
I suspect the former, but it was not clear in your original text.

To make this point even clearer, here are two scenarios I'm thinking of.
In each case, we're talking of monitoring the read end of a FIFO.

===

Scenario 1:

We have three processes each of which
1. Creates an epoll instance
2. Opens the read end of the FIFO
3. Adds the read end of the FIFO to the epoll instance, specifying
EPOLLEXCLUSIVE

When input becomes available on the FIFO, how many processes
get a wakeup?

===

Scenario 2:

A parent process opens the read end of a FIFO and then calls
fork() three times to create three children. Each child then:

1. Creates an epoll instance
2. Adds the read end of the FIFO to the epoll instance, specifying
EPOLLEXCLUSIVE

When input becomes available on the FIFO, how many processes
get a wakeup?

===

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2016-03-14 19:32:23

by Jason Baron

Subject: Re: [PATCH] epoll: add exclusive wakeups flag



On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>
> Hi Jason,
>
> Thanks for the review. I've tweaked one piece to respond to your
> feedback. But I also have another new question below.
>
> On 03/15/2016 03:55 AM, Jason Baron wrote:
>> On 03/11/2016 06:25 PM, Michael Kerrisk (man-pages) wrote:
>>> On 03/11/2016 09:51 PM, Jason Baron wrote:
>>>> On 03/11/2016 03:30 PM, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
>> Hi Michael,
>>
>> Looks good. One comment below.
>>
>> Thanks,
>>
>>> EPOLLEXCLUSIVE (since Linux 4.5)
>>> Sets an exclusive wakeup mode for the epoll file
>>> descriptor that is being attached to the target file
>>> descriptor, fd. When a wakeup event occurs and multiple
>>> epoll file descriptors are attached to the same target
>>> file using EPOLLEXCLUSIVE, one or more of the epoll file
>>> descriptors will receive an event with epoll_wait(2).
>>> The default in this scenario (when EPOLLEXCLUSIVE is not
>>> set) is for all epoll file descriptors to receive an
>>> event. EPOLLEXCLUSIVE is thus useful for avoiding thun‐
>>> dering herd problems in certain scenarios.
>>>
>>> If the same file descriptor is in multiple epoll
>>> instances, some with the EPOLLEXCLUSIVE flag, and others
>>> without, then events will be provided to all epoll
>>> instances that did not specify EPOLLEXCLUSIVE, and at
>>> least one of the epoll instances that did specify
>>> EPOLLEXCLUSIVE.
>>>
>>> The following values may be specified in conjunction
>>> with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
>>> EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
>>> but are ignored (as usual). Attempts to specify other
>>
>> I'm not sure 'ignored' is the right wording here. 'EPOLLHUP' and
>> 'EPOLLERR' are always included in the set of events when something is
>> added as EPOLLEXCLUSIVE. This is consistent with the non-EPOLLEXCLUSIVE
>> add case.
>
> Yes.
>
>> So 'EPOLLHUP' and 'EPOLLERR' may be specified but will be
>> included in the set of events on an add, whether they are specified or not.
>
> Yes. I understand your discomfort with the word "ignored", but the
> problem was that, because it made special mention of EPOLLHUP and EPOLLERR,
> your proposed text made it sound as though EPOLLEXCLUSIVE somehow was
> special with respect to these two flags. I wanted to clarify that it is not.
> How about this:
>
> The following values may be specified in conjunction
> with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
> EPOLLET. EPOLLHUP and EPOLLERR can also be specified,
> but this is not required: as usual, these events are
> always reported if they occur, regardless of whether
> they are specified in events.
> ?

Yes, nothing special here with respect to EPOLLHUP and EPOLLERR. So this
looks fine to me.
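
As a quick illustration of the "always reported" point, the sketch below registers the read end of a pipe for a given event mask, closes the write end, and returns whatever epoll_wait() reports. The function name is made up, and a pipe is just a convenient way to provoke a hangup; this is not taken from the patch.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/*
 * Register the read end of a pipe with the requested event mask, close
 * the write end, and return the events that epoll_wait() reports.
 * Returns 0 if the setup fails or nothing is reported within a second.
 */
uint32_t events_after_writer_close(uint32_t requested)
{
    struct epoll_event ev = { .events = requested }, rev = { 0 };
    uint32_t reported = 0;
    int p[2], epfd;

    if (pipe(p) == -1)
        return 0;

    epfd = epoll_create1(0);
    if (epfd != -1 && epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        close(p[1]);            /* last writer goes away: hangup */
        p[1] = -1;
        if (epoll_wait(epfd, &rev, 1, 1000) == 1)
            reported = rev.events;
    }

    if (epfd != -1)
        close(epfd);
    close(p[0]);
    if (p[1] != -1)
        close(p[1]);
    return reported;
}
```

Passing EPOLLIN alone and EPOLLIN | EPOLLEXCLUSIVE, and checking the result for EPOLLHUP, covers both the ordinary and the exclusive add.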

>
>>> values in events yield an error. EPOLLEXCLUSIVE may be
>>> used only in an EPOLL_CTL_ADD operation; attempts to
>>> employ it with EPOLL_CTL_MOD yield an error. If
>>> EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a
>>> subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an
>>> error. An epoll_ctl(2) that specifies EPOLLEXCLUSIVE in
>>> events and specifies the target file descriptor fd as an
>>> epoll instance will likewise fail. The error in all of
>>> these cases is EINVAL.
>>>
>>> ERRORS
>>> EINVAL An invalid event type was specified along with EPOLLEX‐
>>> CLUSIVE in events.
>>>
>>> EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.
>>>
>>> EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
>>> previously been applied to this epfd, fd pair.
>>>
>>> EINVAL EPOLLEXCLUSIVE was specified in event and fd refers
>>> to an epoll instance.
>
> Returning to the second sentence in this description:
>
> When a wakeup event occurs and multiple epoll file descrip‐
> tors are attached to the same target file using EPOLLEXCLU‐
> SIVE, one or more of the epoll file descriptors will
> receive an event with epoll_wait(2).
>
> There is a point that is unclear to me: what does "target file" refer to?
> Is it an open file description (aka open file table entry) or an inode?
> I suspect the former, but it was not clear in your original text.
>

So from epoll's perspective, the wakeups are associated with a 'wait
queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
file->poll()) results in adding to the same 'wait queue' then we will
get 'exclusive' wakeup behavior.

So in general, I think the answer here is that it's associated with the
inode (I couldn't say with 100% certainty without really looking at all
file->poll() implementations). Certainly, with the 'FIFO' example below,
the two scenarios will have the same behavior with respect to
EPOLLEXCLUSIVE.

Also, the 'non-exclusive' mode would be subject to the same question of
which wait queue the epfd is associated with...

Thanks,

-Jason

> To make this point even clearer, here are two scenarios I'm thinking of.
> In each case, we're talking of monitoring the read end of a FIFO.
>
> ===
>
> Scenario 1:
>
> We have three processes each of which
> 1. Creates an epoll instance
> 2. Opens the read end of the FIFO
> 3. Adds the read end of the FIFO to the epoll instance, specifying
> EPOLLEXCLUSIVE
>
> When input becomes available on the FIFO, how many processes
> get a wakeup?
>
> ===
>
> Scenario 2:
>
> A parent process opens the read end of a FIFO and then calls
> fork() three times to create three children. Each child then:
>
> 1. Creates an epoll instance
> 2. Adds the read end of the FIFO to the epoll instance, specifying
> EPOLLEXCLUSIVE
>
> When input becomes available on the FIFO, how many processes
> get a wakeup?
>
> ===
>
> Cheers,
>
> Michael
>

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Jason,

On 03/15/2016 08:32 AM, Jason Baron wrote:
>
>
> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>> [Restoring CC, which I see I accidentally dropped, one iteration back.]

[...]

>>>> values in events yield an error. EPOLLEXCLUSIVE may be
>>>> used only in an EPOLL_CTL_ADD operation; attempts to
>>>> employ it with EPOLL_CTL_MOD yield an error. If
>>>> EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a
>>>> subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an
>>>> error. An epoll_ctl(2) that specifies EPOLLEXCLUSIVE in
>>>> events and specifies the target file descriptor fd as an
>>>> epoll instance will likewise fail. The error in all of
>>>> these cases is EINVAL.
>>>>
>>>> ERRORS
>>>> EINVAL An invalid event type was specified along with EPOLLEX‐
>>>> CLUSIVE in events.
>>>>
>>>> EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.
>>>>
>>>> EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
>>>> previously been applied to this epfd, fd pair.
>>>>
>>>> EINVAL EPOLLEXCLUSIVE was specified in event and fd refers
>>>> to an epoll instance.
>>
>> Returning to the second sentence in this description:
>>
>> When a wakeup event occurs and multiple epoll file descrip‐
>> tors are attached to the same target file using EPOLLEXCLU‐
>> SIVE, one or more of the epoll file descriptors will
>> receive an event with epoll_wait(2).
>>
>> There is a point that is unclear to me: what does "target file" refer to?
>> Is it an open file description (aka open file table entry) or an inode?
>> I suspect the former, but it was not clear in your original text.
>>
>
> So from epoll's perspective, the wakeups are associated with a 'wait
> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
> file->poll()) results in adding to the same 'wait queue' then we will
> get 'exclusive' wakeup behavior.
>
> So in general, I think the answer here is that its associated with the
> inode (I coudn't say with 100% certainty without really looking at all
> file->poll() implementations). Certainly, with the 'FIFO' example below,
> the two scenarios will have the same behavior with respect to
> EPOLLEXCLUSIVE.

So, in both scenarios, *one or more* processes will get a wakeup?
(I'll try to add something to the text to clarify the detail we're
discussing.)

> Also, the 'non-exclusive' mode would be subject to the same question of
> which wait queue is the epfd is associated with...

I'm not sure of the point you are trying to make here?

Cheers,

Michael


>> To make this point even clearer, here are two scenarios I'm thinking of.
>> In each case, we're talking of monitoring the read end of a FIFO.
>>
>> ===
>>
>> Scenario 1:
>>
>> We have three processes each of which
>> 1. Creates an epoll instance
>> 2. Opens the read end of the FIFO
>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>> EPOLLEXCLUSIVE
>>
>> When input becomes available on the FIFO, how many processes
>> get a wakeup?
>>
>> ===
>>
>> Scenario 2:
>>
>> A parent process opens the read end of a FIFO and then calls
>> fork() three times to create three children. Each child then:
>>
>> 1. Creates an epoll instance
>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>> EPOLLEXCLUSIVE
>>
>> When input becomes available on the FIFO, how many processes
>> get a wakeup?
>>
>> ===
>>
>> Cheers,
>>
>> Michael
>>
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Jason,

On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
> Hi Jason,
>
> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>
>>
>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]

[...]

>>> Returning to the second sentence in this description:
>>>
>>> When a wakeup event occurs and multiple epoll file descrip‐
>>> tors are attached to the same target file using EPOLLEXCLU‐
>>> SIVE, one or more of the epoll file descriptors will
>>> receive an event with epoll_wait(2).
>>>
>>> There is a point that is unclear to me: what does "target file" refer to?
>>> Is it an open file description (aka open file table entry) or an inode?
>>> I suspect the former, but it was not clear in your original text.
>>>
>>
>> So from epoll's perspective, the wakeups are associated with a 'wait
>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>> file->poll()) results in adding to the same 'wait queue' then we will
>> get 'exclusive' wakeup behavior.
>>
>> So in general, I think the answer here is that its associated with the
>> inode (I coudn't say with 100% certainty without really looking at all
>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>> the two scenarios will have the same behavior with respect to
>> EPOLLEXCLUSIVE.

So, I was actually a little surprised by this, and went away and tested
this point. It appears to me that the two scenarios described below
do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.

> So, in both scenarios, *one or more* processes will get a wakeup?
> (I'll try to add something to the text to clarify the detail we're
> discussing.)
>
>> Also, the 'non-exclusive' mode would be subject to the same question of
>> which wait queue is the epfd is associated with...
>
> I'm not sure of the point you are trying to make here?
>
> Cheers,
>
> Michael
>
>
>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>
>>> ===
>>>
>>> Scenario 1:
>>>
>>> We have three processes each of which
>>> 1. Creates an epoll instance
>>> 2. Opens the read end of the FIFO
>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>> EPOLLEXCLUSIVE
>>>
>>> When input becomes available on the FIFO, how many processes
>>> get a wakeup?

When I test this scenario, all three processes get a wakeup.

>>> ===
>>>
>>> Scenario 2:
>>>
>>> A parent process opens the read end of a FIFO and then calls
>>> fork() three times to create three children. Each child then:
>>>
>>> 1. Creates an epoll instance
>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>> EPOLLEXCLUSIVE
>>>
>>> When input becomes available on the FIFO, how many processes
>>> get a wakeup?

When I test this scenario, one process gets a wakeup.

In other words, "target file" appears to mean open file description
(aka open file table entry), not inode.

This is actually what I suspected might be the case, but now I am
puzzled. Given what I've discovered and what you suggest are the
semantics, is the implementation correct? (I suspect that it is,
but it is at odds with your statement above.) My test programs are
inline below.

Cheers,

Michael

============

/* t_EPOLLEXCLUSIVE_multipen.c

   Licensed under GNU GPLv2 or later.
*/
#include <sys/epoll.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

int
main(int argc, char *argv[])
{
    int fd, epfd, nready;
    struct epoll_event ev, rev;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s <FIFO>\n", argv[0]);

    epfd = epoll_create(2);
    if (epfd == -1)
        errExit("epoll_create");

    /* Each instance of this program opens the FIFO separately */
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");
    printf("Opened %s\n", argv[1]);

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        errExit("epoll_ctl");

    /* Block until the FIFO is reported as ready */
    nready = epoll_wait(epfd, &rev, 1, -1);
    if (nready == -1)
        errExit("epoll_wait");
    printf("epoll_wait() returned %d\n", nready);

    exit(EXIT_SUCCESS);
}

===============

/* t_EPOLLEXCLUSIVE_fork.c

   Licensed under GNU GPLv2 or later.
*/

#include <sys/epoll.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

int
main(int argc, char *argv[])
{
    int fd, epfd, nready;
    struct epoll_event ev, rev;
    int cnum;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s <FIFO>\n", argv[0]);

    /* Open once; the children inherit this open file description */
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");
    printf("Opened %s\n", argv[1]);

    for (cnum = 0; cnum < 3; cnum++) {
        switch (fork()) {
        case -1:
            errExit("fork");

        case 0: /* Child */
            epfd = epoll_create(2);
            if (epfd == -1)
                errExit("epoll_create");

            ev.events = EPOLLIN | EPOLLEXCLUSIVE;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
                errExit("epoll_ctl");

            nready = epoll_wait(epfd, &rev, 1, -1);
            if (nready == -1)
                errExit("epoll_wait");
            printf("Child %d: epoll_wait() returned %d\n", cnum, nready);
            exit(EXIT_SUCCESS);

        default:
            break;
        }
    }

    wait(NULL);
    wait(NULL);
    wait(NULL);

    exit(EXIT_SUCCESS);
}

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2016-03-14 22:35:14

by Jason Baron

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Michael,

On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote:
> Hi Jason,
>
> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
>> Hi Jason,
>>
>> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>>
>>>
>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>
> [...]
>
>>>> Returning to the second sentence in this description:
>>>>
>>>> When a wakeup event occurs and multiple epoll file descrip‐
>>>> tors are attached to the same target file using EPOLLEXCLU‐
>>>> SIVE, one or more of the epoll file descriptors will
>>>> receive an event with epoll_wait(2).
>>>>
>>>> There is a point that is unclear to me: what does "target file" refer to?
>>>> Is it an open file description (aka open file table entry) or an inode?
>>>> I suspect the former, but it was not clear in your original text.
>>>>
>>>
>>> So from epoll's perspective, the wakeups are associated with a 'wait
>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>>> file->poll()) results in adding to the same 'wait queue' then we will
>>> get 'exclusive' wakeup behavior.
>>>
>>> So in general, I think the answer here is that its associated with the
>>> inode (I coudn't say with 100% certainty without really looking at all
>>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>>> the two scenarios will have the same behavior with respect to
>>> EPOLLEXCLUSIVE.
>
> So, I was actually a little surprised by this, and went away and tested
> this point. It appears to me that that the two scenarios described below
> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.
>
>> So, in both scenarios, *one or more* processes will get a wakeup?
>> (I'll try to add something to the text to clarify the detail we're
>> discussing.)
>>
>>> Also, the 'non-exclusive' mode would be subject to the same question of
>>> which wait queue is the epfd is associated with...
>>
>> I'm not sure of the point you are trying to make here?
>>
>> Cheers,
>>
>> Michael
>>
>>
>>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>>
>>>> ===
>>>>
>>>> Scenario 1:
>>>>
>>>> We have three processes each of which
>>>> 1. Creates an epoll instance
>>>> 2. Opens the read end of the FIFO
>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>> EPOLLEXCLUSIVE
>>>>
>>>> When input becomes available on the FIFO, how many processes
>>>> get a wakeup?
>
> When I test this scenario, all three processes get a wakeup.
>
>>>> ===
>>>>
>>>> Scenario 2:
>>>>
>>>> A parent process opens the read end of a FIFO and then calls
>>>> fork() three times to create three children. Each child then:
>>>>
>>>> 1. Creates an epoll instance
>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>>> EPOLLEXCLUSIVE
>>>>
>>>> When input becomes available on the FIFO, how many processes
>>>> get a wakeup?
>
> When I test this scenario, one process gets a wakeup.
>
> In other words, "target file" appears to mean open file description
> (aka open file table entry), not inode.
>
> This is actually what I suspected might be the case, but now I am
> puzzled. Given what I've discovered and what you suggest are the
> semantics, is the implementation correct? (I suspect that it is,
> but it is at odds with your statement above. My test programs are
> inline below.
>
> Cheers,
>
> Michael
>

Thanks for the test cases. So in your first test case, you are exiting
immediately after the epoll_wait() returns. So this is actually causing
the next wakeup. And then the 2nd thread returns from epoll_wait() and
this causes the 3rd wakeup.

So the wakeups are actually not happening from the write directly, but
instead from the readers doing a close(). If you do some sort of sleep
after the epoll_wait() you can confirm the behavior. So I believe this
is working as expected.
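
Something along these lines is what I have in mind for the check; treat it as a sketch, not a reference test. The helper names, the /tmp path, and the timing constants are made up for illustration, and the delays are crude. Each child opens the FIFO separately (as in the first test case), and a woken child reports and then lingers so that its close() cannot be what wakes the others. The interface only promises "one or more" wakeups, so the count returned should be at least 1.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

#define WAIT_MS   1000  /* how long each child waits for an event */
#define LINGER_MS 1500  /* how long a woken child keeps its fds open */

/* Child: open the FIFO separately, wait, report a wakeup, then linger. */
static void waiter(const char *path, int report_fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE }, rev;
    int fd = open(path, O_RDONLY);
    int epfd = epoll_create1(0);

    if (fd == -1 || epfd == -1 ||
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        _exit(1);
    if (epoll_wait(epfd, &rev, 1, WAIT_MS) == 1) {
        if (write(report_fd, "w", 1) != 1)
            _exit(1);
        poll(NULL, 0, LINGER_MS);   /* no close() until the others gave up */
    }
    _exit(0);
}

/* Returns the number of waiters woken by a single write, or -1 on error. */
int count_wakeups(int nchildren)
{
    char path[64], c;
    int report[2], wfd, i, woken = 0;

    snprintf(path, sizeof(path), "/tmp/excl-fifo-%ld", (long)getpid());
    unlink(path);
    if (mkfifo(path, 0600) == -1 || pipe(report) == -1)
        return -1;
    /* O_RDWR keeps a writer present so the children's open() won't block. */
    wfd = open(path, O_RDWR);
    if (wfd == -1)
        return -1;

    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();

        if (pid == -1)
            return -1;
        if (pid == 0) {
            close(report[0]);
            waiter(path, report[1]);
        }
    }
    close(report[1]);

    poll(NULL, 0, 300);             /* crude: let the children block first */
    if (write(wfd, "x", 1) != 1)
        return -1;

    /* Each woken child sent one byte; EOF arrives once all have exited. */
    while (read(report[0], &c, 1) == 1)
        woken++;
    while (wait(NULL) > 0)
        ;
    close(report[0]);
    close(wfd);
    unlink(path);
    return woken;
}
```

Calling count_wakeups(3) and printing the result gives the number of children that returned from epoll_wait() with an event for the one write.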

Thanks,

-Jason


> ============
>
> /* t_EPOLLEXCLUSIVE_multipen.c
>
> Licensed under GNU GPLv2 or later.
> */
> #include <sys/epoll.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)
>
> #define usageErr(msg, progName) \
> do { fprintf(stderr, "Usage: "); \
> fprintf(stderr, msg, progName); \
> exit(EXIT_FAILURE); } while (0)
>
> #ifndef EPOLLEXCLUSIVE
> #define EPOLLEXCLUSIVE (1 << 28)
> #endif
>
> int
> main(int argc, char *argv[])
> {
> int fd, epfd, nready;
> struct epoll_event ev, rev;
>
> if (argc != 2 || strcmp(argv[1], "--help") == 0)
> usageErr("%s <FIFO>\n", argv[0]);
>
> epfd = epoll_create(2);
> if (epfd == -1)
> errExit("epoll_create");
>
> fd = open(argv[1], O_RDONLY);
> if (fd == -1)
> errExit("open");
> printf("Opened %s\n", argv[1]);
>
> ev.events = EPOLLIN | EPOLLEXCLUSIVE;
> if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
> errExit("epoll_ctl");
>
> nready = epoll_wait(epfd, &rev, 1, -1);
> if (nready == -1)
> errExit("epoll-wait");
> printf("epoll_wait() returned %d\n", nready);
>
> exit(EXIT_SUCCESS);
> }
>
> ===============
>
> /* t_EPOLLEXCLUSIVE_fork.c
>
> Licensed under GNU GPLv2 or later.
> */
>
> #include <sys/epoll.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)
>
> #define usageErr(msg, progName) \
> do { fprintf(stderr, "Usage: "); \
> fprintf(stderr, msg, progName); \
> exit(EXIT_FAILURE); } while (0)
>
> #ifndef EPOLLEXCLUSIVE
> #define EPOLLEXCLUSIVE (1 << 28)
> #endif
>
> int
> main(int argc, char *argv[])
> {
> int fd, epfd, nready;
> struct epoll_event ev, rev;
> int cnum;
>
> if (argc != 2 || strcmp(argv[1], "--help") == 0)
> usageErr("%s <FIFO>\n", argv[0]);
>
> fd = open(argv[1], O_RDONLY);
> if (fd == -1)
> errExit("open");
> printf("Opened %s\n", argv[1]);
>
> for (cnum = 0; cnum < 3; cnum++) {
> switch (fork()) {
> case -1:
> errExit("fork");
>
> case 0: /* Child */
> epfd = epoll_create(2);
> if (epfd == -1)
> errExit("epoll_create");
>
> ev.events = EPOLLIN | EPOLLEXCLUSIVE;
> if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
> errExit("epoll_ctl");
>
> nready = epoll_wait(epfd, &rev, 1, -1);
> if (nready == -1)
> errExit("epoll-wait");
> printf("Child %d: epoll_wait() returned %d\n", cnum, nready);
> exit(EXIT_SUCCESS);
>
> default:
> break;
> }
> }
>
> wait(NULL);
> wait(NULL);
> wait(NULL);
>
> exit(EXIT_SUCCESS);
> }
>

2016-03-14 23:22:16

by Madars Vitolins

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Jason and Michael,

Hmm... I tried playing with the pipe samples below, but even with a
sleep, all processes get woken up (maybe I'm missing something too). I also
tried with EPOLLIN.

On the same basis I created a sample using POSIX message queues with EPOLLIN |
EPOLLEXCLUSIVE, and the good news is that it works correctly.

file q.c:
==================
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <mqueue.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

#define MAX_SIZE 10

int
main (int argc, char *argv[])
{
  int epfd, nready;
  struct epoll_event ev, rev;
  mqd_t fd;
  struct mq_attr attr;
  char buffer[MAX_SIZE + 1];
  int cnum;

  /* initialize the queue attributes */
  attr.mq_flags = 0;
  attr.mq_maxmsg = 5;
  attr.mq_msgsize = MAX_SIZE;
  attr.mq_curmsgs = 0;

  /* cleanup for multiple runs... */
  mq_unlink ("/TESTQ");

  /* create the message queue */
  fd =
    mq_open ("/TESTQ", O_CREAT | O_RDWR | O_NONBLOCK, S_IWUSR | S_IRUSR,
             &attr);
  if (fd == -1)
    errExit ("mq_open");

  for (cnum = 0; cnum < 3; cnum++)
    {
      switch (fork ())
        {
        case -1:
          errExit ("fork");

        case 0:                 /* Child */
          epfd = epoll_create (2);
          if (epfd == -1)
            errExit ("epoll_create");

          ev.events = EPOLLIN | EPOLLEXCLUSIVE;
          if (epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
            errExit ("epoll_ctl");

          printf ("About to wait...\n");
          nready = epoll_wait (epfd, &rev, 1, -1);
          if (nready == -1)
            errExit ("epoll_wait");

          printf ("Child %d: epoll_wait() returned %d\n", cnum, nready);
          exit (EXIT_SUCCESS);

        default:
          break;
        }
    }
  sleep (1);
  /* send a msg to Q */
  memset (buffer, 0, MAX_SIZE);
  if (0 > mq_send (fd, buffer, MAX_SIZE, 0))
    errExit ("mq_send");
  printf ("msg sent ok...\n");

  wait (NULL);
  wait (NULL);
  wait (NULL);

  exit (EXIT_SUCCESS);
}
==================

$ gcc q.c -lrt
$ ./a.out
About to wait...
About to wait...
About to wait...
msg sent ok...
Child 2: epoll_wait() returned 1
^C
$



Best regards,
Madars


Jason Baron wrote on 2016-03-15 00:35:
> Hi Michael,
>
> On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote:
>> Hi Jason,
>>
>> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
>>> Hi Jason,
>>>
>>> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>>>
>>>>
>>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>>>> [Restoring CC, which I see I accidentally dropped, one iteration
>>>>> back.]
>>
>> [...]
>>
>>>>> Returning to the second sentence in this description:
>>>>>
>>>>> When a wakeup event occurs and multiple epoll file
>>>>> descrip‐
>>>>> tors are attached to the same target file using
>>>>> EPOLLEXCLU‐
>>>>> SIVE, one or more of the epoll file descriptors
>>>>> will
>>>>> receive an event with epoll_wait(2).
>>>>>
>>>>> There is a point that is unclear to me: what does "target file"
>>>>> refer to?
>>>>> Is it an open file description (aka open file table entry) or an
>>>>> inode?
>>>>> I suspect the former, but it was not clear in your original text.
>>>>>
>>>>
>>>> So from epoll's perspective, the wakeups are associated with a 'wait
>>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done
>>>> via
>>>> file->poll()) results in adding to the same 'wait queue' then we
>>>> will
>>>> get 'exclusive' wakeup behavior.
>>>>
>>>> So in general, I think the answer here is that its associated with
>>>> the
>>>> inode (I coudn't say with 100% certainty without really looking at
>>>> all
>>>> file->poll() implementations). Certainly, with the 'FIFO' example
>>>> below,
>>>> the two scenarios will have the same behavior with respect to
>>>> EPOLLEXCLUSIVE.
>>
>> So, I was actually a little surprised by this, and went away and
>> tested
>> this point. It appears to me that that the two scenarios described
>> below
>> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See
>> below.
>>
>>> So, in both scenarios, *one or more* processes will get a wakeup?
>>> (I'll try to add something to the text to clarify the detail we're
>>> discussing.)
>>>
>>>> Also, the 'non-exclusive' mode would be subject to the same question
>>>> of
>>>> which wait queue is the epfd is associated with...
>>>
>>> I'm not sure of the point you are trying to make here?
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>
>>>>> To make this point even clearer, here are two scenarios I'm
>>>>> thinking of.
>>>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>>>
>>>>> ===
>>>>>
>>>>> Scenario 1:
>>>>>
>>>>> We have three processes each of which
>>>>> 1. Creates an epoll instance
>>>>> 2. Opens the read end of the FIFO
>>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>>> EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, all three processes get a wakeup.
>>
>>>>> ===
>>>>>
>>>>> Scenario 2:
>>>>>
>>>>> A parent process opens the read end of a FIFO and then calls
>>>>> fork() three times to create three children. Each child then:
>>>>>
>>>>> 1. Creates an epoll instance
>>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>>>> EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, one process gets a wakeup.
>>
>> In other words, "target file" appears to mean open file description
>> (aka open file table entry), not inode.
>>
>> This is actually what I suspected might be the case, but now I am
>> puzzled. Given what I've discovered and what you suggest are the
>> semantics, is the implementation correct? (I suspect that it is,
>> but it is at odds with your statement above. My test programs are
>> inline below.
>>
>> Cheers,
>>
>> Michael
>>
>
> Thanks for the test cases. So in your first test case, you are exiting
> immediately after the epoll_wait() returns. So this is actually causing
> the next wakeup. And then the 2nd thread returns from epoll_wait() and
> this causes the 3rd wakeup.
>
> So the wakeups are actually not happening from the write directly, but
> instead from the readers doing a close(). If you do some sort of sleep
> after the epoll_wait() you can confirm the behavior. So I believe this
> is working as expected.
>
> Thanks,
>
> -Jason
>
>
>> ============
>>
>> /* t_EPOLLEXCLUSIVE_multipen.c
>>
>> Licensed under GNU GPLv2 or later.
>> */
>> #include <sys/epoll.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/types.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <string.h>
>>
>> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
>> } while (0)
>>
>> #define usageErr(msg, progName) \
>> do { fprintf(stderr, "Usage: "); \
>> fprintf(stderr, msg, progName); \
>> exit(EXIT_FAILURE); } while (0)
>>
>> #ifndef EPOLLEXCLUSIVE
>> #define EPOLLEXCLUSIVE (1 << 28)
>> #endif
>>
>> int
>> main(int argc, char *argv[])
>> {
>> int fd, epfd, nready;
>> struct epoll_event ev, rev;
>>
>> if (argc != 2 || strcmp(argv[1], "--help") == 0)
>> usageErr("%s <FIFO>\n", argv[0]);
>>
>> epfd = epoll_create(2);
>> if (epfd == -1)
>> errExit("epoll_create");
>>
>> fd = open(argv[1], O_RDONLY);
>> if (fd == -1)
>> errExit("open");
>> printf("Opened %s\n", argv[1]);
>>
>> ev.events = EPOLLIN | EPOLLEXCLUSIVE;
>> if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
>> errExit("epoll_ctl");
>>
>> nready = epoll_wait(epfd, &rev, 1, -1);
>> if (nready == -1)
>> errExit("epoll-wait");
>> printf("epoll_wait() returned %d\n", nready);
>>
>> exit(EXIT_SUCCESS);
>> }
>>
>> ===============
>>
>> /* t_EPOLLEXCLUSIVE_fork.c
>>
>> Licensed under GNU GPLv2 or later.
>> */
>>
>> #include <sys/epoll.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/types.h>
>> #include <sys/wait.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <string.h>
>>
>> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
>> } while (0)
>>
>> #define usageErr(msg, progName) \
>> do { fprintf(stderr, "Usage: "); \
>> fprintf(stderr, msg, progName); \
>> exit(EXIT_FAILURE); } while (0)
>>
>> #ifndef EPOLLEXCLUSIVE
>> #define EPOLLEXCLUSIVE (1 << 28)
>> #endif
>>
>> int
>> main(int argc, char *argv[])
>> {
>> int fd, epfd, nready;
>> struct epoll_event ev, rev;
>> int cnum;
>>
>> if (argc != 2 || strcmp(argv[1], "--help") == 0)
>> usageErr("%s <FIFO>\n", argv[0]);
>>
>> fd = open(argv[1], O_RDONLY);
>> if (fd == -1)
>> errExit("open");
>> printf("Opened %s\n", argv[1]);
>>
>> for (cnum = 0; cnum < 3; cnum++) {
>> switch (fork()) {
>> case -1:
>> errExit("fork");
>>
>> case 0: /* Child */
>> epfd = epoll_create(2);
>> if (epfd == -1)
>> errExit("epoll_create");
>>
>> ev.events = EPOLLIN | EPOLLEXCLUSIVE;
>> if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
>> errExit("epoll_ctl");
>>
>> nready = epoll_wait(epfd, &rev, 1, -1);
>> if (nready == -1)
>> errExit("epoll-wait");
>> printf("Child %d: epoll_wait() returned %d\n", cnum,
>> nready);
>> exit(EXIT_SUCCESS);
>>
>> default:
>> break;
>> }
>> }
>>
>> wait(NULL);
>> wait(NULL);
>> wait(NULL);
>>
>> exit(EXIT_SUCCESS);
>> }
>>

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Jason,

On 03/15/2016 11:35 AM, Jason Baron wrote:
> Hi Michael,
>
> On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote:
>> Hi Jason,
>>
>> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
>>> Hi Jason,
>>>
>>> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>>>
>>>>
>>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>>
>> [...]
>>
>>>>> Returning to the second sentence in this description:
>>>>>
>>>>> When a wakeup event occurs and multiple epoll file descriptors
>>>>> are attached to the same target file using EPOLLEXCLUSIVE,
>>>>> one or more of the epoll file descriptors will
>>>>> receive an event with epoll_wait(2).
>>>>>
>>>>> There is a point that is unclear to me: what does "target file" refer to?
>>>>> Is it an open file description (aka open file table entry) or an inode?
>>>>> I suspect the former, but it was not clear in your original text.
>>>>>
>>>>
>>>> So from epoll's perspective, the wakeups are associated with a 'wait
>>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>>>> file->poll()) results in adding to the same 'wait queue' then we will
>>>> get 'exclusive' wakeup behavior.
>>>>
>>>> So in general, I think the answer here is that it's associated with the
>>>> inode (I couldn't say with 100% certainty without really looking at all
>>>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>>>> the two scenarios will have the same behavior with respect to
>>>> EPOLLEXCLUSIVE.
>>
>> So, I was actually a little surprised by this, and went away and tested
>> this point. It appears to me that the two scenarios described below
>> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.
>>
>>> So, in both scenarios, *one or more* processes will get a wakeup?
>>> (I'll try to add something to the text to clarify the detail we're
>>> discussing.)
>>>
>>>> Also, the 'non-exclusive' mode would be subject to the same question of
>>>> which wait queue the epfd is associated with...
>>>
>>> I'm not sure of the point you are trying to make here?
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>
>>>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>>>
>>>>> ===
>>>>>
>>>>> Scenario 1:
>>>>>
>>>>> We have three processes each of which
>>>>> 1. Creates an epoll instance
>>>>> 2. Opens the read end of the FIFO
>>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>>> EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, all three processes get a wakeup.
>>
>>>>> ===
>>>>>
>>>>> Scenario 2:
>>>>>
>>>>> A parent process opens the read end of a FIFO and then calls
>>>>> fork() three times to create three children. Each child then:
>>>>>
>>>>> 1. Creates an epoll instance
>>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>>>> EPOLLEXCLUSIVE
>>>>>
>>>>> When input becomes available on the FIFO, how many processes
>>>>> get a wakeup?
>>
>> When I test this scenario, one process gets a wakeup.
>>
>> In other words, "target file" appears to mean open file description
>> (aka open file table entry), not inode.
>>
>> This is actually what I suspected might be the case, but now I am
>> puzzled. Given what I've discovered and what you suggest are the
>> semantics, is the implementation correct? (I suspect that it is,
>> but it is at odds with your statement above.) My test programs are
>> inline below.
>>
>> Cheers,
>>
>> Michael
>>
>
> Thanks for the test cases. So in your first test case, you are exiting
> immediately after the epoll_wait() returns. So this is actually causing
> the next wakeup.

Can I just check my understanding of the rationale for the preceding
point? The next process is getting woken up because the previous process
did not "consume" the event (that is, the input is still available on the
FIFO). Right?
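
To pin down what I mean by "consume", here is a minimal sketch for a
single epoll instance. It is not one of the test programs, it uses a pipe
rather than a FIFO, it registers with plain level-triggered EPOLLIN (no
EPOLLEXCLUSIVE), and the helper names ready_now() and consume_demo() are
hypothetical. It only shows that readiness persists until the byte is
read; whether unconsumed input also wakes *other* exclusive waiters is a
separate question.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: poll epfd without blocking and return the
   number of ready file descriptors (0 or 1 here), or -1 on error. */
static int ready_now(int epfd)
{
    struct epoll_event rev;

    return epoll_wait(epfd, &rev, 1, 0);
}

/* Hypothetical demo: record the ready count twice before the input is
   consumed and once after. Returns 0 on success, -1 on error. */
int consume_demo(int out[3])
{
    int pfd[2], epfd;
    struct epoll_event ev = { 0 };
    char c = 'x';

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;            /* level-triggered, not exclusive */
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        return -1;

    if (write(pfd[1], &c, 1) != 1)  /* make input available */
        return -1;

    out[0] = ready_now(epfd);       /* input pending */
    out[1] = ready_now(epfd);       /* still pending: nothing was read */
    if (read(pfd[0], &c, 1) != 1)   /* consume the event */
        return -1;
    out[2] = ready_now(epfd);       /* pipe is empty again */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return 0;
}
```

Calling consume_demo() with a three-element int array fills in the
three counts in the order described in the comments.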

> And then the 2nd thread returns from epoll_wait() and
> this causes the 3rd wakeup.

I added the sleep() calls, but still things don't seem to happen
quite as you suggest. In the first scenario, after the first process
terminates, *all* of the remaining processes wake from epoll_wait().
What's happening in this case? (This smells like a possible bug.)

In the second scenario (fork()), after the first process terminates
(without consuming the FIFO input), all of the other processes remain
blocked in epoll_wait(). (Note, I extended the test program here
to allow the number of child processes to be specified as a command-line
argument.) I think I can make sense of that: it's because the open
file descriptor for the read end of the FIFO has been duplicated
in all of the child processes, and closing the FD in one child
does not cause the corresponding open file description in other
processes to be torn down because there are other FDs that still
refer to it.
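
To illustrate the point about the shared open file description, here is
a minimal sketch that is separate from the test programs. It uses a pipe
rather than a FIFO, and the helper name read_end_survives_child_close()
is hypothetical. A forked child closes its inherited copy of the read
end; the parent then checks that a write to the pipe still succeeds,
which would fail with EPIPE if the read end had really been torn down.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the read end of a pipe is still
   open after a forked child has closed its duplicate descriptor,
   0 if it is not, and -1 on error. */
int read_end_survives_child_close(void)
{
    int pfd[2], ok;
    char c = 'x';
    pid_t pid;
    void (*old)(int);

    if (pipe(pfd) == -1)
        return -1;

    /* Ignore SIGPIPE so that a missing reader shows up as EPIPE
       from write() instead of killing the process. */
    old = signal(SIGPIPE, SIG_IGN);
    if (old == SIG_ERR)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: close both inherited descriptors and exit. This
           drops references to the open file descriptions, but the
           parent's descriptors still refer to them. */
        close(pfd[0]);
        close(pfd[1]);
        _exit(0);
    }

    if (waitpid(pid, NULL, 0) == -1)
        return -1;

    /* The write succeeds only if a read end is still open. */
    ok = (write(pfd[1], &c, 1) == 1);

    signal(SIGPIPE, old);
    close(pfd[0]);
    close(pfd[1]);
    return ok;
}
```

The multiopen case differs because each process calls open() itself and
so gets its own open file description, which is released when that
process exits.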

> So the wakeups are actually not happening from the write directly, but
> instead from the readers doing a close(). If you do some sort of sleep
> after the epoll_wait() you can confirm the behavior. So I believe this
> is working as expected.

As noted above, I'm still slightly puzzled.
Revised test programs pasted below.

Cheers,

Michael

==========

/* t_EPOLLEXCLUSIVE_multiopen.c

   Licensed under GNU GPLv2 or later.
*/

#include <sys/epoll.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

int
main(int argc, char *argv[])
{
    int fd, epfd, nready;
    struct epoll_event ev, rev;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s <FIFO>\n", argv[0]);

    epfd = epoll_create(2);
    if (epfd == -1)
        errExit("epoll_create");

    /* Each instance of this program opens the FIFO itself */
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");
    printf("Opened %s\n", argv[1]);

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        errExit("epoll_ctl");

    nready = epoll_wait(epfd, &rev, 1, -1);
    if (nready == -1)
        errExit("epoll_wait");
    printf("epoll_wait() returned %d\n", nready);

    /* Delay termination (and thus the close of the FIFO) */
    printf("sleeping\n");
    sleep(3);
    printf("Terminating\n");
    exit(EXIT_SUCCESS);
}

===================

/* t_EPOLLEXCLUSIVE_fork.c

   Licensed under GNU GPLv2 or later.
*/

#include <sys/epoll.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define usageErr(msg, progName) \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)
#endif

int
main(int argc, char *argv[])
{
    int fd, epfd, nready;
    struct epoll_event ev, rev;
    int cnum, cmax;

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s <FIFO> [num-children]\n", argv[0]);

    /* The FIFO is opened once; the children inherit the descriptor */
    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");
    printf("Opened %s\n", argv[1]);

    cmax = (argc > 2) ? atoi(argv[2]) : 3;

    for (cnum = 0; cnum < cmax; cnum++) {
        switch (fork()) {
        case -1:
            errExit("fork");

        case 0: /* Child */
            epfd = epoll_create(2);
            if (epfd == -1)
                errExit("epoll_create");

            ev.events = EPOLLIN | EPOLLEXCLUSIVE;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
                errExit("epoll_ctl");

            nready = epoll_wait(epfd, &rev, 1, -1);
            if (nready == -1)
                errExit("epoll_wait");
            printf("Child %d: epoll_wait() returned %d\n", cnum, nready);

            /* Delay termination (and thus the close of the FIFO) */
            printf("sleeping\n");
            sleep(3);
            printf("Child %d terminating\n", cnum);
            exit(EXIT_SUCCESS);

        default:
            break;
        }
    }

    /* Parent: reap all children */
    for (cnum = 0; cnum < cmax; cnum++)
        wait(NULL);

    exit(EXIT_SUCCESS);
}

2016-03-15 02:45:06

by Jason Baron

Subject: Re: [PATCH] epoll: add exclusive wakeups flag

Hi Michael,

On 03/14/2016 07:26 PM, Michael Kerrisk (man-pages) wrote:
> Hi Jason,
>
> On 03/15/2016 11:35 AM, Jason Baron wrote:
>> Hi Michael,
>>
>> On 03/14/2016 05:03 PM, Michael Kerrisk (man-pages) wrote:
>>> Hi Jason,
>>>
>>> On 03/15/2016 09:01 AM, Michael Kerrisk (man-pages) wrote:
>>>> Hi Jason,
>>>>
>>>> On 03/15/2016 08:32 AM, Jason Baron wrote:
>>>>>
>>>>>
>>>>> On 03/14/2016 01:47 PM, Michael Kerrisk (man-pages) wrote:
>>>>>> [Restoring CC, which I see I accidentally dropped, one iteration back.]
>>>
>>> [...]
>>>
>>>>>> Returning to the second sentence in this description:
>>>>>>
>>>>>> When a wakeup event occurs and multiple epoll file descriptors
>>>>>> are attached to the same target file using EPOLLEXCLUSIVE,
>>>>>> one or more of the epoll file descriptors will
>>>>>> receive an event with epoll_wait(2).
>>>>>>
>>>>>> There is a point that is unclear to me: what does "target file" refer to?
>>>>>> Is it an open file description (aka open file table entry) or an inode?
>>>>>> I suspect the former, but it was not clear in your original text.
>>>>>>
>>>>>
>>>>> So from epoll's perspective, the wakeups are associated with a 'wait
>>>>> queue'. So if the open() and subsequent EPOLL_CTL_ADD (which is done via
>>>>> file->poll()) results in adding to the same 'wait queue' then we will
>>>>> get 'exclusive' wakeup behavior.
>>>>>
>>>>> So in general, I think the answer here is that it's associated with the
>>>>> inode (I couldn't say with 100% certainty without really looking at all
>>>>> file->poll() implementations). Certainly, with the 'FIFO' example below,
>>>>> the two scenarios will have the same behavior with respect to
>>>>> EPOLLEXCLUSIVE.
>>>
>>> So, I was actually a little surprised by this, and went away and tested
>>> this point. It appears to me that the two scenarios described below
>>> do NOT have the same behavior with respect to EPOLLEXCLUSIVE. See below.
>>>
>>>> So, in both scenarios, *one or more* processes will get a wakeup?
>>>> (I'll try to add something to the text to clarify the detail we're
>>>> discussing.)
>>>>
>>>>> Also, the 'non-exclusive' mode would be subject to the same question of
>>>>> which wait queue the epfd is associated with...
>>>>
>>>> I'm not sure of the point you are trying to make here?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>>
>>>>>> To make this point even clearer, here are two scenarios I'm thinking of.
>>>>>> In each case, we're talking of monitoring the read end of a FIFO.
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> Scenario 1:
>>>>>>
>>>>>> We have three processes each of which
>>>>>> 1. Creates an epoll instance
>>>>>> 2. Opens the read end of the FIFO
>>>>>> 3. Adds the read end of the FIFO to the epoll instance, specifying
>>>>>> EPOLLEXCLUSIVE
>>>>>>
>>>>>> When input becomes available on the FIFO, how many processes
>>>>>> get a wakeup?
>>>
>>> When I test this scenario, all three processes get a wakeup.
>>>
>>>>>> ===
>>>>>>
>>>>>> Scenario 2:
>>>>>>
>>>>>> A parent process opens the read end of a FIFO and then calls
>>>>>> fork() three times to create three children. Each child then:
>>>>>>
>>>>>> 1. Creates an epoll instance
>>>>>> 2. Adds the read end of the FIFO to the epoll instance, specifying
>>>>>> EPOLLEXCLUSIVE
>>>>>>
>>>>>> When input becomes available on the FIFO, how many processes
>>>>>> get a wakeup?
>>>
>>> When I test this scenario, one process gets a wakeup.
>>>
>>> In other words, "target file" appears to mean open file description
>>> (aka open file table entry), not inode.
>>>
>>> This is actually what I suspected might be the case, but now I am
>>> puzzled. Given what I've discovered and what you suggest are the
>>> semantics, is the implementation correct? (I suspect that it is,
>>> but it is at odds with your statement above.) My test programs are
>>> inline below.
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>
>> Thanks for the test cases. So in your first test case, you are exiting
>> immediately after the epoll_wait() returns. So this is actually causing
>> the next wakeup.
>
> Can I just check my understanding of the rationale for the preceding
> point? The next process is getting woken up because the previous process
> did not "consume" the event (that is, the input is still available on the
> FIFO). Right?
>
>> And then the 2nd thread returns from epoll_wait() and
>> this causes the 3rd wakeup.
>
> I added the sleep() calls, but still things don't seem to happen
> quite as you suggest. In the first scenario, after the first process
> terminates, *all* of the remaining processes wake from epoll_wait().
> What's happening in this case? (This smells like a possible bug.)

Yes, you are right. When the first process exits and thus closes
the read side of the pipe, it wakes up all the other processes that
are blocked in epoll_wait(). When the file is closed, since there are no
more fds referencing it, the pipe_release() routine does a wakeup
specifying both POLLIN and POLLOUT. In this case, all of the epoll
exclusive waiters will get a wakeup. The combination of POLLIN and
POLLOUT is not expected to be the typical use case here. Normally, we
would have the threads in event loops, and just POLLIN would be set,
resulting in the exclusive wakeup behavior. So yes, there can be multiple
wakeups in some cases, but the *common* case of only POLLIN or only
POLLOUT set will yield exclusive wakeups. There is no guarantee here
that only 1 thread wakes up - only that *at least* one does.
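
As a rough sketch of the event-loop usage I have in mind (not code from
the patch; worker_loop() and run_workers() are hypothetical names, and
it uses a pipe rather than a FIFO): each forked worker creates its own
epoll instance, adds the shared read end with EPOLLIN | EPOLLEXCLUSIVE,
and consumes input with a non-blocking read() whenever epoll_wait()
returns. The 2 second timeout and the fallback to plain EPOLLIN on
EINVAL are assumptions added only to keep the sketch robust. It makes no
attempt to show that exactly one worker wakes per write, since only "at
least one" is guaranteed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/* Hypothetical worker: own epoll instance, inherited read end.
   Returns the number of bytes this worker consumed. */
static int worker_loop(int fd)
{
    struct epoll_event ev = { 0 }, rev;
    int count = 0, nready, epfd = epoll_create1(0);
    ssize_t n;
    char c;

    if (epfd == -1)
        return 0;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        /* Assumption: EINVAL means the kernel lacks EPOLLEXCLUSIVE;
           retry with a plain level-triggered registration. */
        ev.events = EPOLLIN;
        if (errno != EINVAL || epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
            return 0;
    }
    for (;;) {
        nready = epoll_wait(epfd, &rev, 1, 2000);   /* safety timeout */
        if (nready == -1 && errno == EINTR)
            continue;
        if (nready <= 0)
            break;                  /* error or idle timeout */
        n = read(fd, &c, 1);        /* consume the event */
        if (n == 1)
            count++;
        else if (n == 0)
            break;                  /* EOF: all writers have closed */
        else if (errno != EAGAIN && errno != EINTR)
            break;
        /* EAGAIN: another worker read the byte first; wait again */
    }
    close(epfd);
    return count;
}

/* Hypothetical driver: fork nworkers workers, write nbytes bytes, and
   return the total number of bytes the workers report (-1 on error). */
int run_workers(int nworkers, int nbytes)
{
    int data[2], result[2], total = 0, count, i;
    char c = 'x';
    pid_t pid;

    if (pipe(data) == -1 || pipe(result) == -1)
        return -1;
    /* Non-blocking read end: a worker that loses the race sees EAGAIN */
    if (fcntl(data[0], F_SETFL, O_NONBLOCK) == -1)
        return -1;

    for (i = 0; i < nworkers; i++) {
        pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {
            close(data[1]);         /* otherwise EOF is never seen */
            close(result[0]);
            count = worker_loop(data[0]);
            if (write(result[1], &count, sizeof(count)) != sizeof(count))
                _exit(1);
            _exit(0);
        }
    }
    close(result[1]);
    for (i = 0; i < nbytes; i++)
        if (write(data[1], &c, 1) != 1)
            return -1;
    close(data[1]);                 /* EOF once the pipe has drained */

    while (read(result[0], &count, sizeof(count)) == sizeof(count))
        total += count;
    while (wait(NULL) > 0)
        ;
    close(data[0]);
    close(result[0]);
    return total;
}
```

For example, run_workers(3, 5) starts three workers and writes five
bytes. How the bytes are split between the workers is not deterministic,
but every byte is read by exactly one of them.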

>
> In the second scenario (fork()), after the first process terminates
> (without consuming the FIFO input), all of the other processes remain
> blocked in epoll_wait(). (Note, I extended the test program here
> to allow the number of child processes to be specified as a command-line
> argument.) I think I can make sense of that: it's because the open
> file descriptor for the read end of the FIFO has been duplicated
> in all of the child processes, and closing the FD in one child
> does not cause the corresponding open file description in other
> processes to be torn down because there are other FDs that still
> refer to it.
>

Yes, exactly. The final process is going to invoke pipe_release(),
but at that point there is nobody left to wake up.

>> So the wakeups are actually not happening from the write directly, but
>> instead from the readers doing a close(). If you do some sort of sleep
>> after the epoll_wait() you can confirm the behavior. So I believe this
>> is working as expected.
>
> As noted above, I'm still slightly puzzled.
> Revised test programs pasted below.
>

Ok, hopefully this makes sense.

Thanks,

-Jason
