2009-09-16 23:44:15

by Gilad Benjamini

Subject: epoll and closed file descriptors

I keep running into a scenario where epoll notifies userland of
events on a closed file descriptor.
I am running a single-threaded application on a single-CPU machine, so
races between multiple threads aren't the issue.

A sample sequence of events that I have seen:
- File descriptor (13) for a socket is closed
- epoll_wait returns with no events.
- Several epoll related calls happen
- More than 20 seconds after the "close", epoll_wait finds an event on fd 13
with EPOLLIN|EPOLLERR|EPOLLHUP.
- epoll_wait continues to report this event

Running kernel 2.6.24. Some technical problems are preventing me from trying
a newer kernel at the moment.

One more thing worth mentioning: the application uses libcurl, leading to a
situation where the socket is closed before its descriptor is removed from
the epoll set. The code should be able to handle that AFAIK.

Any ideas would be appreciated.
Gilad


2009-09-17 00:07:55

by Davide Libenzi

Subject: Re: epoll and closed file descriptors

On Wed, 16 Sep 2009, Gilad Benjamini wrote:

> I am running repeatedly into a scenario where epoll notifies userland of
> events on a closed file descriptor.
> I am running a single thread application, on a single CPU machine so
> multiple threads isn't the issue.
>
> A sample set of events that I have seen
> - File descriptor (13) for a socket is closed
> - epoll_wait returns with no events.
> - Several epoll related calls happen
> - More than 20 seconds after the "close", epoll_wait finds an event on fd 13
> with EPOLLIN|EPOLLERR|EPOLLHUP.
> - epoll_wait continues to report this event

Epoll removes the fd from its container when the last instance of the
underlying kernel file pointer is released (or when you explicitly
remove it with epoll_ctl(EPOLL_CTL_DEL)).
If you continue to get the event, it means that someone else still has an
instance of the socket open (one that, judging by the events, saw a
shutdown), thereby keeping the kernel object alive.
If you don't want to see the events, just remove the socket from the epoll
set before closing it.
Alternatively, remove the socket the first time you see an EPOLLHUP.



- Davide
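
(A minimal sketch of the remove-before-close ordering suggested above, for
illustration only. The helper name unregister_and_close is made up here and
is not part of epoll or libcurl; it assumes the caller still owns a valid
fd at the time of the call.)

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: take fd out of the epoll set epfd, then close it.
 * Returns 0 if both steps succeed, -1 if either one fails. */
int unregister_and_close(int epfd, int fd)
{
    int rc = 0;

    /* The event argument is ignored for EPOLL_CTL_DEL, but kernels before
     * 2.6.9 required it to be non-NULL, so pass a dummy for portability. */
    struct epoll_event dummy = {0};

    /* Deregister first, while the fd number still names the file object. */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &dummy) == -1)
        rc = -1;

    /* Close even if the removal failed, so the descriptor is not leaked. */
    if (close(fd) == -1)
        rc = -1;

    return rc;
}
```

With this ordering the registration is gone before the fd number disappears,
so it no longer matters whether some other reference keeps the socket alive.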

2009-09-17 00:23:13

by Gilad Benjamini

Subject: RE: epoll and closed file descriptors

Davide wrote:
> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
>
> > I am running repeatedly into a scenario where epoll notifies
> userland of
> > events on a closed file descriptor.
> > I am running a single thread application, on a single CPU machine so
> > multiple threads isn't the issue.
> >
> > A sample set of events that I have seen
> > - File descriptor (13) for a socket is closed
> > - epoll_wait returns with no events.
> > - Several epoll related calls happen
> > - More than 20 seconds after the "close", epoll_wait finds an event
> on fd 13
> > with EPOLLIN|EPOLLERR|EPOLLHUP.
> > - epoll_wait continues to report this event
>
> Epoll removes the fd from its container, when the last instance of the
> underlying kernel file pointer is released (or when you explicitly
> remove it with epoll_ctl(EPOLL_CTL_DEL)).
> If you continue to get the event, it means that someone else has an
> instance of the socket (that, looking at the events, saw a shutdown)
> open,
> by hence keeping the kernel object alive.
> If you don't want to see the events, just remove the socket from the
> epoll
> set before closing.
> Or, you remove the socket the first time you see an EPOLLHUP.
>
>
>
> - Davide

I would, but epoll is preventing me from doing so.
Early in sys_epoll_ctl there are these lines

	file = fget(epfd);
	if (!file)
		goto error_return;

This leaves me in a kind of deadlock.

Gilad

2009-09-17 00:28:20

by Bryan Donlan

Subject: Re: epoll and closed file descriptors

On Wed, Sep 16, 2009 at 8:23 PM, Gilad Benjamini wrote:
> Davide wrote:
>> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
>>
>> > I am running repeatedly into a scenario where epoll notifies
>> userland of
>> > events on a closed file descriptor.
>> > I am running a single thread application, on a single CPU machine so
>> > multiple threads isn't the issue.
>> >
>> > A sample set of events that I have seen
>> > - File descriptor (13) for a socket is closed
>> > - epoll_wait returns with no events.
>> > - Several epoll related calls happen
>> > - More than 20 seconds after the "close", epoll_wait finds an event
>> on fd 13
>> > with EPOLLIN|EPOLLERR|EPOLLHUP.
>> > - epoll_wait continues to report this event
>>
>> Epoll removes the fd from its container, when the last instance of the
>> underlying kernel file pointer is released (or when you explicitly
>> remove it with epoll_ctl(EPOLL_CTL_DEL)).
>> If you continue to get the event, it means that someone else has an
>> instance of the socket (that, looking at the events, saw a shutdown)
>> open,
>> by hence keeping the kernel object alive.
>> If you don't want to see the events, just remove the socket from the
>> epoll
>> set before closing.
>> Or, you remove the socket the first time you see an EPOLLHUP.
>>
>>
>>
>> - Davide
>
> I would, but epoll is preventing me from doing so.
> Early in sys_epoll_ctl there are these lines
>
>   file = fget(epfd);
>   if (!file)
>     goto error_return;
>
> Leaving me in a kind of dead lock

You could dup() the file descriptor before handing it to curl, and use
your own copy for epoll operations.

2009-09-17 00:30:21

by Davide Libenzi

Subject: RE: epoll and closed file descriptors

On Wed, 16 Sep 2009, Gilad Benjamini wrote:

> I would, but epoll is preventing me from doing so.
> Early in sys_epoll_ctl there are these lines
>
> file = fget(epfd);
> if (!file)
> goto error_return;
>
> Leaving me in a kind of dead lock

The 'epfd' in there is the _epoll fd_; if fget() fails on it, that means
you closed the epoll fd itself.
What you are likely seeing fail is the 'tfile = fget(fd)' (of course, you
closed it). So if someone else keeps the socket open and you have no way of
telling it to drop it (really?), you need to remove the socket from the set
before closing it.



- Davide

2009-09-17 00:40:51

by Gilad Benjamini

Subject: RE: epoll and closed file descriptors

Davide wrote:
> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
>
> > I would, but epoll is preventing me from doing so.
> > Early in sys_epoll_ctl there are these lines
> >
> > file = fget(epfd);
> > if (!file)
> > goto error_return;
> >
> > Leaving me in a kind of dead lock
>
> The 'epfd' in there, is the _epoll fd_, which, if fget() fails, means
> you
> close it.
> You see likely failing the 'tfile = fget(fd)' (of course, you closed
> it),
> so if someone else keeps the socket open and you have no chance in
> telling
> it to drop it (really?), you need to remove the socket from the set
> before
> closing it.
>
>
>
> - Davide

My bad. I meant to quote the line that you mentioned.
I agree that the right thing to do is to remove the fd from epoll before
closing it.
However, due to the way curl works, I cannot do that. Changing the curl code
doesn't seem trivial.

Regardless, I still don't see how the kernel got into this situation, and if
this situation is valid, why it doesn't bail out of it.

2009-09-17 00:45:40

by Bryan Donlan

Subject: Re: epoll and closed file descriptors

On Wed, Sep 16, 2009 at 8:40 PM, Gilad Benjamini wrote:
> Davide wrote:
>> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
>>
>> > I would, but epoll is preventing me from doing so.
>> > Early in sys_epoll_ctl there are these lines
>> >
>> >   file = fget(epfd);
>> >   if (!file)
>> >     goto error_return;
>> >
>> > Leaving me in a kind of dead lock
>>
>> The 'epfd' in there, is the _epoll fd_, which, if fget() fails, means
>> you
>> close it.
>> You see likely failing the 'tfile = fget(fd)' (of course, you closed
>> it),
>> so if someone else keeps the socket open and you have no chance in
>> telling
>> it to drop it (really?), you need to remove the socket from the set
>> before
>> closing it.
>>
>>
>>
>> - Davide
>
> My bad. I meant to quote the line that you mentioned.
> I agree that the right thing to do is to remove the fd from epoll before
> closing it.
> However, due to the way curl works, I cannot do that. Changing the curl code
> doesn't seem trivial.
>
> Regardless, I still don't see how the kernel got into this situation, and if
> this situation is valid, why it doesn't bail out of it.

epoll references the underlying file object; the fd is used _only_ to
obtain this file object, and then never used again. Determining when
the fd goes away then requires iterating over all fds, and since epoll
was designed to avoid doing exactly that, it isn't an acceptable
solution.

As I mentioned in the other email, by dup()ing the file descriptor,
you can get control over when the copy is closed without changing
curl. Then just make sure to remove it from the epoll set before
closing your copy.
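
(A rough sketch of the dup() approach, for illustration only. The helper
names watch_private_copy and unwatch_private_copy are hypothetical, and how
the application learns which socket curl is using, or when curl closes it,
is outside the scope of the sketch.)

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: duplicate sock and register the duplicate with
 * epfd. Returns the private copy on success, -1 on failure. */
int watch_private_copy(int epfd, int sock, unsigned int events)
{
    struct epoll_event ev = {0};
    int copy = dup(sock);

    if (copy == -1)
        return -1;

    ev.events = events;
    ev.data.fd = copy;          /* events will carry the copy's number */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, copy, &ev) == -1) {
        close(copy);
        return -1;
    }
    return copy;
}

/* Hypothetical helper: remove the private copy from the set, then close
 * it. Returns 0 if both steps succeed, -1 otherwise. */
int unwatch_private_copy(int epfd, int copy)
{
    struct epoll_event dummy = {0};     /* ignored for EPOLL_CTL_DEL */
    int rc = epoll_ctl(epfd, EPOLL_CTL_DEL, copy, &dummy);

    if (close(copy) == -1)
        rc = -1;
    return rc;
}
```

One trade-off to keep in mind: the copy is itself a reference to the socket,
so the connection stays open until the copy is closed as well, even after
the library has closed its own descriptor.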

2009-09-17 00:53:13

by Gilad Benjamini

Subject: RE: epoll and closed file descriptors

> Subject: Re: epoll and closed file descriptors
>
> On Wed, Sep 16, 2009 at 8:40 PM, Gilad Benjamini wrote:
> > Davide wrote:
> >> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
> >>
> >> > I would, but epoll is preventing me from doing so.
> >> > Early in sys_epoll_ctl there are these lines
> >> >
> >> >   file = fget(epfd);
> >> >   if (!file)
> >> >     goto error_return;
> >> >
> >> > Leaving me in a kind of dead lock
> >>
> >> The 'epfd' in there, is the _epoll fd_, which, if fget() fails,
> means
> >> you
> >> close it.
> >> You see likely failing the 'tfile = fget(fd)' (of course, you closed
> >> it),
> >> so if someone else keeps the socket open and you have no chance in
> >> telling
> >> it to drop it (really?), you need to remove the socket from the set
> >> before
> >> closing it.
> >>
> >>
> >>
> >> - Davide
> >
> > My bad. I meant to quote the line that you mentioned.
> > I agree that the right thing to do is to remove the fd from epoll
> before
> > closing it.
> > However, due to the way curl works, I cannot do that. Changing the
> curl code
> > doesn't seem trivial.
> >
> > Regardless, I still don't see how the kernel got into this situation,
> and if
> > this situation is valid, why it doesn't bail out of it.
>
> epoll references the underlying file object; the fd is used _only_ to
> obtain this file object, and then never used again. Determining when
> the fd goes away then requires iterating over all fds, and since epoll
> was designed to avoid doing exactly that, it isn't an acceptable
> solution.

Regarding bailing out of the situation, I see the logic in your answer.
What about the first part? Any ideas how the kernel actually got into that
tight spot?
Looking into the code I can't find a path that can lead into this situation.



2009-09-17 00:55:23

by Davide Libenzi

Subject: RE: epoll and closed file descriptors

On Wed, 16 Sep 2009, Gilad Benjamini wrote:

> My bad. I meant to quote the line that you mentioned.
> I agree that the right thing to do is to remove the fd from epoll before
> closing it.
> However, due to the way curl works, I cannot do that. Changing the curl code
> doesn't seem trivial.

You probably need to explain how you use libcurl together with epoll, even
though this is not a help-userspace mailing list.


> Regardless, I still don't see how the kernel got into this situation, and if
> this situation is valid, why it doesn't bail out of it.

I think this was already explained.


- Davide

2009-09-17 00:57:38

by Bryan Donlan

Subject: Re: epoll and closed file descriptors

2009/9/16 Gilad Benjamini:
>> Subject: Re: epoll and closed file descriptors
>>
>> On Wed, Sep 16, 2009 at 8:40 PM, Gilad Benjamini wrote:
>> > Davide wrote:
>> >> On Wed, 16 Sep 2009, Gilad Benjamini wrote:
>> >>
>> >> > I would, but epoll is preventing me from doing so.
>> >> > Early in sys_epoll_ctl there are these lines
>> >> >
>> >> >   file = fget(epfd);
>> >> >   if (!file)
>> >> >     goto error_return;
>> >> >
>> >> > Leaving me in a kind of dead lock
>> >>
>> >> The 'epfd' in there, is the _epoll fd_, which, if fget() fails,
>> means
>> >> you
>> >> close it.
>> >> You see likely failing the 'tfile = fget(fd)' (of course, you closed
>> >> it),
>> >> so if someone else keeps the socket open and you have no chance in
>> >> telling
>> >> it to drop it (really?), you need to remove the socket from the set
>> >> before
>> >> closing it.
>> >>
>> >>
>> >>
>> >> - Davide
>> >
>> > My bad. I meant to quote the line that you mentioned.
>> > I agree that the right thing to do is to remove the fd from epoll
>> before
>> > closing it.
>> > However, due to the way curl works, I cannot do that. Changing the
>> curl code
>> > doesn't seem trivial.
>> >
>> > Regardless, I still don't see how the kernel got into this situation,
>> and if
>> > this situation is valid, why it doesn't bail out of it.
>>
>> epoll references the underlying file object; the fd is used _only_ to
>> obtain this file object, and then never used again. Determining when
>> the fd goes away then requires iterating over all fds, and since epoll
>> was designed to avoid doing exactly that, it isn't an acceptable
>> solution.
>
> Regarding bailing out of the situation, I see the logic in your answer.
> What about the first part ? Any ideas how the kernel actually got into that
> tight spot ?
> Looking into the code I can't find a path that can lead into this situation.

Userspace passes in a fd that is unused (closed). Kernel can't find
the file object corresponding to the fd (because the fd went away when
it wasn't looking), so it says to userspace, "Sorry, no such file!".
And it's completely correct.

So, the basic problem is userspace becomes unable to refer to the file
object. The kernel's doing just fine; it's not in a "tight spot" at
all. It's perfectly happy to refer to the file object by a direct
pointer to it - it's just that userspace is unable to tell the kernel to
remove it from the epoll set, because the only name userspace had for it
has been removed.
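
(To make the sequence concrete, here is a small sketch that tries to
reproduce it. It is an illustration under assumptions, not a diagnosis of
the original report: a pipe stands in for the socket, a dup() stands in for
whoever else holds the file object open, and the function name
reproduce_stale_registration is made up.)

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if the stale-registration behaviour is observed (the removal
 * fails with EBADF while epoll_wait still reports the file), 0 if it is
 * not observed, and -1 if the setup itself fails. */
int reproduce_stale_registration(void)
{
    int p[2] = { -1, -1 };
    int epfd = -1, keeper = -1, registered, del_rc, del_errno, n;
    int result = -1;
    struct epoll_event ev = {0}, got;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create(1);
    keeper = dup(p[0]);         /* second reference to the same file object */
    if (epfd == -1 || keeper == -1)
        goto cleanup;

    registered = p[0];
    ev.events = EPOLLIN;
    ev.data.fd = registered;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, registered, &ev) == -1)
        goto cleanup;
    if (write(p[1], "x", 1) != 1)       /* make the read end readable */
        goto cleanup;

    close(p[0]);                /* the only name epoll was given is gone */
    p[0] = -1;

    /* Userspace can no longer name the registration... */
    del_rc = epoll_ctl(epfd, EPOLL_CTL_DEL, registered, &ev);
    del_errno = errno;

    /* ...but the file object is alive, so it is still reported. */
    n = epoll_wait(epfd, &got, 1, 0);

    result = (del_rc == -1 && del_errno == EBADF && n == 1);

cleanup:
    if (p[0] != -1) close(p[0]);
    if (p[1] != -1) close(p[1]);
    if (keeper != -1) close(keeper);
    if (epfd != -1) close(epfd);
    return result;
}
```

Once the last reference (keeper in the sketch) is closed, the file object is
released and the registration goes away with it.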