2019-07-12 00:13:11

by Carlo Wood

[permalink] [raw]
Subject: On

Hello everyone,

this is a follow up to a discussion from 2012 about EPOLL_CTL_DISABLE,
which ended up NOT being accepted in the kernel.

I believe however that this issue wasn't addressed.
Michael Kerrisk did an excellent job writing a summary of the problem:
https://lkml.org/lkml/2012/10/16/302 (the first four listed points).

In the past two years I've been working on "yet another" event driven
I/O library, exclusively for linux however, with the general aim to
be highly scalable for many cores. The I/O part (around epoll) is only
part of that library - but it is what I spent the last two years on.

I ran independently into the same problem as described by Paton Lewis
and only afterwards, when I couldn't think of a working solution and
started to do heavy research, found related articles about the problem.

Reading through the thread on lkml.org that given above, it still
isn't clear to me that this would be a non-issue that can be solved
in user space (without severe (unnecessary) performance penalties).

Let me summarize the problem in my own way (probably less general,
but more related to my personal case):

1. The epoll interest list stores a pointer in epoll_event::data.ptr
that points to a specific file descriptor related object, which
indeed can be seen as a user-space cache of data related to that
file descriptor. In my case this includes one mutex that protects
an unsigned int whose bits keep track of things like whether or
not that object (pointer) is in the epoll interest list (was added
with epoll_ctl) and which events are being watched specifically,
as well as whether or not the fd was closed (after removing it
from the interest list with EPOLL_CTL_DEL) etc.

2. At least one thread (only one, in my case) calls epoll_wait
(epoll_pwait) in a loop. I use no timeout because that seems too
inaccurate, my timers are signal based. When epoll_wait returns, the
pointer epoll_event::data.ptr is used to call a member function on
the dereferenced FileDescriptor class. For non-C++ coders, that
means a function is passed as first parameter the value of
epoll_event::data.ptr (the 'this' pointer) which is subsequently
dereferenced inside that function. The exact function depends on the
actual event. In most cases the handling of the event is passed on
to a thread pool queue for handling by another thread - but before
that happens the pointer was already dereferenced (which, among
others, allows me to increment a reference count for that object).
This obviously means that epoll_event::data.ptr may never point to
freed data.

3. As almost all processing happens by worker threads of a thread
pool - it will be some other thread that decides that an object
is done and needs to be freed. This thread first removes the object
from the epoll interest list, and then deletes the object.
Unfortunately this does not work -- and apart from adding a delay
before the object is really deleted (which isn't a real solution
as several of you already pointed out in 2012), I do not see any
possible alternative.

The reason it can't work is because the Event Loop Thread (that loops
around epoll_wait) may already be in the process of returning from
epoll_wait, lets say -- it already returned from epoll_wait, but
wasn't able to execute ANY code following it. And because this is
ALWAYS possible there is NEVER a safe moment to delete the object.

The scenario is, for example:

Event Loop Thread Worker thread

Returns from epoll_wait() passing
back to user space a data.ptr pointing
to a user allocated object.

Removes the object
from the epoll interest
list with EPOLL_CTL_DEL

<arbitrary delay here,
for example, no delay>

Free the object pointed
to by data.ptr.

Dereferences data.ptr and crashes.


The solution proposed by Andy Lutomirski
(https://lkml.org/lkml/2012/10/18/434) does not work here:
In order use RCU the ptr must be removed from the protected
"list" before the grace period is started, which must start
before a read-side critical area ends. But the "list" here
is the epoll interest list - and removing it from that list
requires the call to epoll_ctl(..., EPOLL_CTL_DEL, ...) to
finish. In other words, the Worker thread is the RCU "updater"
and the "arbitrary delay" must be the rcu grace period.
No problems there. The event loop thread however must call
rcu_read_lock() before accessing the epoll_event structure,
which is not possible because that happens inside epoll_wait(),
which doesn't provide a hook to add such call.

As far as the Worker Thread is concerned there are no readers,
and the grace period can finish instantly, simply because
there is no way to register that data.ptr was already copied.

If one tries to begin a read-side critical area after epoll_wait()
returns, then that won't work: in that case you should not
be able to access that ptr when it was already removed from
the interest list. The only way that RCU would work here is
when a reader subscribes *before* the kernel copies the
corresponding epoll_event structure to user space, in a way
that that will never happen when the EPOLL_CTL_DEL finished
before it got to that point.


I believe that the only safe solution is to let the Event Loop
Thread do the deleting. So, if all else fails I'll have to add
objects that a Worker Thread thinks need to be deleted to a
FIFO that is processed by the Event Loop Thread before entering
epoll_wait(). But that is a lot of extra code for something
that could be achieved with just a small change to epoll:


I propose to add a new EPOLL event EPOLLCLOSED that will cause
epoll_wait to return (for that event) whenever a file descriptor is
closed.

The Worker Thread then does not remove an object from the
interest list, but either adds (if it was removed before) or
modifies the event (using EPOLL_CTL_MOD) to watch that fd
for closing, and then closes it.

Aka,

Working Thread:

epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
close(fd);

where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
etc).

This must then guarantee the event EPOLLCLOSED to be reported
by exactly one epoll_wait(), the caller thread of which can then
proceed with deleting the resources.

Note that close(fd) must cause the removal from the interest list
of any epoll struct before causing the event - and that the
EPOLLCLOSED event may only be reported after all other events
for that fd have already been reported (although it would be
ok to report them at the same time: simply handle the other
events first).

I am not sure how this will work when more than one thread
calls epoll_wait and more than one watch the same fd: in
that case it is less trivial because then it seems always
possible that the EPOLLCLOSED event will be reported before
another event in another thread.


What are your thoughts? Is the addition of EPOLLCLOSED
event feasible?

--
Carlo Wood <[email protected]>


2019-07-12 00:33:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: On

On Thu, Jul 11, 2019 at 5:01 PM Carlo Wood <[email protected]> wrote:
>
> I believe that the only safe solution is to let the Event Loop
> Thread do the deleting. So, if all else fails I'll have to add
> objects that a Worker Thread thinks need to be deleted to a
> FIFO that is processed by the Event Loop Thread before entering
> epoll_wait(). But that is a lot of extra code for something
> that could be achieved with just a small change to epoll:

This doesn't seem so bad at all.

>
>
> I propose to add a new EPOLL event EPOLLCLOSED that will cause
> epoll_wait to return (for that event) whenever a file descriptor is
> closed.

This totally falls apart if you ever want to add a feature to your
library to detach the handler for a given fd without closing the fd.

>
> The Worker Thread then does not remove an object from the
> interest list, but either adds (if it was removed before) or
> modifies the event (using EPOLL_CTL_MOD) to watch that fd
> for closing, and then closes it.
>
> Aka,
>
> Working Thread:
>
> epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
> close(fd);
>
> where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
> etc).
>
> This must then guarantee the event EPOLLCLOSED to be reported
> by exactly one epoll_wait(), the caller thread of which can then
> proceed with deleting the resources.
>
> Note that close(fd) must cause the removal from the interest list
> of any epoll struct before causing the event - and that the
> EPOLLCLOSED event may only be reported after all other events
> for that fd have already been reported (although it would be
> ok to report them at the same time: simply handle the other
> events first).

This is a bunch of subtle semantics in the kernel to support your
particular use case.

>
> I am not sure how this will work when more than one thread
> calls epoll_wait and more than one watch the same fd: in
> that case it is less trivial because then it seems always
> possible that the EPOLLCLOSED event will be reported before
> another event in another thread.

But this case is fairly straightforward with the user mode approach --
for example, add it to the list for all threads calling epoll_wait.
Or otherwise defer the deletion until all epoll_wait threads have
woken.

2019-07-12 01:34:12

by Carlo Wood

[permalink] [raw]
Subject: Re: Is a new EPOLLCLOSED a solution to the problem that EPOLL_CTL_DISABLE tried to solve?

Hi Andy,

thank you for you quick reply.

On Thu, 11 Jul 2019 17:32:21 -0700
Andy Lutomirski <[email protected]> wrote:

> > I propose to add a new EPOLL event EPOLLCLOSED that will cause
> > epoll_wait to return (for that event) whenever a file descriptor is
> > closed.
>
> This totally falls apart if you ever want to add a feature to your
> library to detach the handler for a given fd without closing the fd.

Another way to cause epoll_wait() to wake up for that specific fd is
okay too, of course.

For example, since the new event basically will mean "resources
can now be deleted", the event could be called EPOLLDELETE.
It is just needed to have some easy way to trigger this event.

Nevertheless, in the more exceptional case that one wants to destroy
the object/rss that data.ptr points to without closing the fd it is
probably possible to first dup the fd and then close the old one.

> > The Worker Thread then does not remove an object from the
> > interest list, but either adds (if it was removed before) or
> > modifies the event (using EPOLL_CTL_MOD) to watch that fd
> > for closing, and then closes it.
> >
> > Aka,
> >
> > Working Thread:
> >
> > epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
> > close(fd);
> >
> > where event contains the new EPOLLCLOSED (compare EPOLLOUT, EPOLLIN
> > etc).
> >
> > This must then guarantee the event EPOLLCLOSED to be reported
> > by exactly one epoll_wait(), the caller thread of which can then
> > proceed with deleting the resources.
> >
> > Note that close(fd) must cause the removal from the interest list
> > of any epoll struct before causing the event - and that the
> > EPOLLCLOSED event may only be reported after all other events
> > for that fd have already been reported (although it would be
> > ok to report them at the same time: simply handle the other
> > events first).
>
> This is a bunch of subtle semantics in the kernel to support your
> particular use case.

My particular use case? How so? The problem I'm trying to address
is the fact that "It is not currently possible to reliably delete epoll
items when using the same epoll set from multiple threads", end quote
of Paton Lewis' email from 2012.

If there is a simple, generally accepted solution for this in user-space
then of course there is no reason to change the kernel; but despite
all my efforts to research the net for a solution for this, all can
find are people with the same question and no good answers.

If there was a way to pass a special event to the thread waiting in
epoll_wait() that it now is safe to free the memory that data.ptr is
pointing to, then problem would evaporate to something trivally simple.

Lets say we would not be using close(2), but instead some
epoll_destruct(epoll_fd, fd). Then the worker thread, instead of,

if (last reference to object has gone)
{
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, object->fd, NULL);
delete object; // Unsafe
}

could do,

if (last reference to object has gone)
epoll_destuct(epoll_fd, object->fd);
// Or [optionally dup() and] close(object->fd);

Whereas the thread that waits for epoll_fd would take care
of the deletion:

for (;;)
{
int ready = epoll_wait(epoll_fd, s_events, maxevents, -1);
while (ready > 0)
{
epoll_event& event(s_events[--ready]);
if ((event.events & EPOLLDELETE)) // Or EPOLLCLOSED, or
// whatever the name is.
{
delete event.data.ptr;
break;
}
// Handle other events.
}
}

In this case, if epoll_wait() had returned just prior to
the call to epoll_destruct()/close(), the object will not be deleted;
The returned events would be handled, epoll_wait() reentered,
and only once EPOLLDELETE is returned the object would be deleted.

The bunch of subtle requirements as you call it is just about how
to implement this in a way that it will do what it is supposed to do,
and in no way specific for my particular use case.

The requirements are, namely:

1. Only one epoll_fd may have added an epoll_event structure with
a pointer to the resource (if more than one need to point to it,
you could add a pointer to a smart pointer to the object to
the epoll_event structure instead). I'd be surprised if this
is not ALREADY a required for normal implementations.

2. The call to epoll_destruct() (as introduced as example above)
must remove the fd from the epoll's interest list. Of course
it doesn't HAVE to - but then the user MUST call epoll_ctl(epoll_fd,
EPOLL_CTL_DEL, ...) right before it, so why not? The reason
I opted close(2) is because that ALREADY has this behavior,
hence it seems a good candidate for epoll_destruct.

3. After an EPOLLDELETE (formly EPOLLCLOSED) has been returned by
an epoll_wait() no other events may be returned for that fd.
This is obvious, and should be easy to implement. I just added it for
completeness ;).

> But this case is fairly straightforward with the user mode approach --
> for example, add it to the list for all threads calling epoll_wait.
> Or otherwise defer the deletion until all epoll_wait threads have
> woken.

That really seems a hell of lot more complex (involving mutexes and
updating a queue that might grow till unknown sizes, hence requiring
possibly calls to malloc) then my proposed solution, for something that
basically every application that uses epoll needs.

--
Carlo Wood <[email protected]>