2005-09-23 03:12:15

by Vadim Lobanov

[permalink] [raw]
Subject: [RFC] epoll

Hi,

Lately, I've been exploring the epoll code in more detail, and have come
across a few questions/issues (I wouldn't call them problems) that I'd
like to solicit comments on. If everyone agrees, some of these might
turn into TODOs as code changes. Without further ado:

1. Size
The size parameter to sys_epoll_create() is simply sanity-checked (has
to be greater than zero) and then ignored. I presume the current
implementation works perfectly without this parameter, so I am rather
curious why it is even passed in. Historical reasons? Future code
improvements? On the same note, I'd like to suggest that '0' also should
be an allowed value, for the case when the application really does not
know what the size estimate should be.

2. Timeout
It seems that the timeout parameter in sys_epoll_wait() is not handled
quite correctly. According to the manpages, a value of '-1' means
infinite timeout, but the effect of other negative values is left
undefined. In fact, if you run a userland program that calls
epoll_wait() with a timeout value of '-2', the kernel prints an error
into /var/log/messages from within schedule_timeout(), due to its
argument being negative. It seems there are two ways to correct this
behavior:
- Check the passed timeout for being less than '-1', and return an
error. A new errno value needs to be introduced into the epoll_wait()
API.
- Redefine the epoll_wait() API to accept any negative value as an
infinite timeout, and change the code appropriately.

3. Wakeup
As determined by testing with userland code, the sys_tgkill() and
sys_tkill() functions currently will NOT wake up a sleeping
epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
cancellation point. There are two potential issues with this:
- epoll_wait() meets the unofficial(?) definition of a "system call that
may block".
- epoll_wait() behaves differently from poll() and friends.

4. Code Duplication
As sys_tgkill() and sys_tkill() are currently written, a large portion
of the two functions is duplicated. It might make sense to pull that
equivalent code out into a separate function.

Comments please? In particular, the pthread cancellation issue is
worrysome. In the case that any of the above points turn into actual
code TODOs, I'll be more than happy to cook up and submit the patches.

Thanks for reading. :-)
-Vadim Lobanov


2005-09-23 05:33:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] epoll

On Thu, 22 Sep 2005, Vadim Lobanov wrote:

> 1. Size
> The size parameter to sys_epoll_create() is simply sanity-checked (has
> to be greater than zero) and then ignored. I presume the current
> implementation works perfectly without this parameter, so I am rather
> curious why it is even passed in. Historical reasons? Future code
> improvements? On the same note, I'd like to suggest that '0' also should
> be an allowed value, for the case when the application really does not
> know what the size estimate should be.

This born from the initial epoll implementation, that was using the size
parameter to size the hash. Since now epoll uses rbtrees, the parameter is
no more used. Changing the API to remove it, and break userspace, was/is
not a good idea.



> 2. Timeout
> It seems that the timeout parameter in sys_epoll_wait() is not handled
> quite correctly. According to the manpages, a value of '-1' means
> infinite timeout, but the effect of other negative values is left
> undefined. In fact, if you run a userland program that calls
> epoll_wait() with a timeout value of '-2', the kernel prints an error
> into /var/log/messages from within schedule_timeout(), due to its
> argument being negative. It seems there are two ways to correct this
> behavior:
> - Check the passed timeout for being less than '-1', and return an
> error. A new errno value needs to be introduced into the epoll_wait()
> API.
> - Redefine the epoll_wait() API to accept any negative value as an
> infinite timeout, and change the code appropriately.

Yes, it is nicer if epoll behaves like poll. (*)



> 3. Wakeup
> As determined by testing with userland code, the sys_tgkill() and
> sys_tkill() functions currently will NOT wake up a sleeping
> epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
> cancellation point. There are two potential issues with this:
> - epoll_wait() meets the unofficial(?) definition of a "system call that
> may block".
> - epoll_wait() behaves differently from poll() and friends.

The epoll_wait() wait loop is the standard one that even poll() uses (prep
wait, make interruptible, test signals, sched timeo). So if poll() is woke
up, so should epoll_wait(). A minimal code snippet that proves poll()
behing woke up, and epoll_wait() not, would help.



> 4. Code Duplication
> As sys_tgkill() and sys_tkill() are currently written, a large portion
> of the two functions is duplicated. It might make sense to pull that
> equivalent code out into a separate function.

This should be moved in "[RFC] something else" ;)



> Comments please? In particular, the pthread cancellation issue is
> worrysome. In the case that any of the above points turn into actual
> code TODOs, I'll be more than happy to cook up and submit the patches.

[*] No need, since it's a one liner.



- Davide


2005-09-23 06:00:14

by Vadim Lobanov

[permalink] [raw]
Subject: Re: [RFC] epoll

On Thu, 22 Sep 2005, Davide Libenzi wrote:

> On Thu, 22 Sep 2005, Vadim Lobanov wrote:
>
> > 1. Size
> > The size parameter to sys_epoll_create() is simply sanity-checked (has
> > to be greater than zero) and then ignored. I presume the current
> > implementation works perfectly without this parameter, so I am rather
> > curious why it is even passed in. Historical reasons? Future code
> > improvements? On the same note, I'd like to suggest that '0' also should
> > be an allowed value, for the case when the application really does not
> > know what the size estimate should be.
>
> This born from the initial epoll implementation, that was using the size
> parameter to size the hash. Since now epoll uses rbtrees, the parameter is
> no more used. Changing the API to remove it, and break userspace, was/is
> not a good idea.
>

Any thoughts on allowing a size of '0'?

>
> > 2. Timeout
> > It seems that the timeout parameter in sys_epoll_wait() is not handled
> > quite correctly. According to the manpages, a value of '-1' means
> > infinite timeout, but the effect of other negative values is left
> > undefined. In fact, if you run a userland program that calls
> > epoll_wait() with a timeout value of '-2', the kernel prints an error
> > into /var/log/messages from within schedule_timeout(), due to its
> > argument being negative. It seems there are two ways to correct this
> > behavior:
> > - Check the passed timeout for being less than '-1', and return an
> > error. A new errno value needs to be introduced into the epoll_wait()
> > API.
> > - Redefine the epoll_wait() API to accept any negative value as an
> > infinite timeout, and change the code appropriately.
>
> Yes, it is nicer if epoll behaves like poll. (*)
>

No patch needed from me here, then?

>
> > 3. Wakeup
> > As determined by testing with userland code, the sys_tgkill() and
> > sys_tkill() functions currently will NOT wake up a sleeping
> > epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
> > cancellation point. There are two potential issues with this:
> > - epoll_wait() meets the unofficial(?) definition of a "system call that
> > may block".
> > - epoll_wait() behaves differently from poll() and friends.
>
> The epoll_wait() wait loop is the standard one that even poll() uses (prep
> wait, make interruptible, test signals, sched timeo). So if poll() is woke
> up, so should epoll_wait(). A minimal code snippet that proves poll()
> behing woke up, and epoll_wait() not, would help.
>

Certainly. :-) See end of email for sample program.

>
> > 4. Code Duplication
> > As sys_tgkill() and sys_tkill() are currently written, a large portion
> > of the two functions is duplicated. It might make sense to pull that
> > equivalent code out into a separate function.
>
> This should be moved in "[RFC] something else" ;)
>

Alright.

>
> > Comments please? In particular, the pthread cancellation issue is
> > worrysome. In the case that any of the above points turn into actual
> > code TODOs, I'll be more than happy to cook up and submit the patches.
>
> [*] No need, since it's a one liner.
>

Here's a sample program that illustrates difference in pthread killing
between poll and epoll:
================================================================
#include <sys/epoll.h>
#include <sys/poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

//#define _E_

void * run (int * fd) {
struct epoll_event result;

printf("Wait forever while polling.\n");
#ifdef _E_
epoll_wait(*fd, &result, 1, -1);
#else
poll(NULL, 0, -1);
#endif
printf("Uhoh! Something is borked!\n");

return NULL;
}

int main (void) {
int events;
pthread_t thread;

#ifdef _E_
events = epoll_create(1);
#endif
pthread_create(&thread, NULL, (void * (*) (void *))run, &events);
getchar();
printf("Try to kill the thread.\n");
pthread_cancel(thread);
pthread_join(thread, NULL);
printf("Success.\n");
#ifdef _E_
close(events);
#endif

return 0;
}
================================================================
gcc -lpthread -Wall -o test test.c
================================================================

>
> - Davide
>

-Vadim Lobanov

2005-09-23 17:28:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] epoll

On Thu, 22 Sep 2005, Vadim Lobanov wrote:

> On Thu, 22 Sep 2005, Davide Libenzi wrote:
>
>> This born from the initial epoll implementation, that was using the size
>> parameter to size the hash. Since now epoll uses rbtrees, the parameter is
>> no more used. Changing the API to remove it, and break userspace, was/is
>> not a good idea.
>>
>
> Any thoughts on allowing a size of '0'?

We would need to change man pages for that, and this IMHO w/out a strong
reason of doing so.



>> Yes, it is nicer if epoll behaves like poll. (*)
>>
>
> No patch needed from me here, then?

No need thx. I'll be sending the one-liner to the list right now.



>>> 3. Wakeup
>>> As determined by testing with userland code, the sys_tgkill() and
>>> sys_tkill() functions currently will NOT wake up a sleeping
>>> epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
>>> cancellation point. There are two potential issues with this:
>>> - epoll_wait() meets the unofficial(?) definition of a "system call that
>>> may block".
>>> - epoll_wait() behaves differently from poll() and friends.
>>
>> The epoll_wait() wait loop is the standard one that even poll() uses (prep
>> wait, make interruptible, test signals, sched timeo). So if poll() is woke
>> up, so should epoll_wait(). A minimal code snippet that proves poll()
>> behing woke up, and epoll_wait() not, would help.
>>
>
> Certainly. :-) See end of email for sample program.

I'm afraid you need to bug the glibc guys, since I think they wrap
sys_poll(). Try the test program below, when defining _X_, that makes it
call sys_poll() directly. It will have the same epoll_wait() behaviour.




- Davide





#include <sys/epoll.h>
#include <sys/poll.h>
#include <pthread.h>
#include <stdio.h>
#include <linux/unistd.h>
#include <errno.h>
#include <unistd.h>


#define __NR_xpoll __NR_poll

#define __sys_xpoll(ufds, nfds, timeout) _syscall3(int, xpoll, struct pollfd *, ufds, \
unsigned int, nfds, long, timeout)

__sys_xpoll(ufds, nfds, timeout);


void * run (int * fd) {
struct epoll_event result;

printf("Wait forever while polling.\n");
#if defined(_E_)
epoll_wait(*fd, &result, 1, -1);
#elif defined(_X_)
xpoll(NULL, 0, -1);
#else
poll(NULL, 0, -1);
#endif
printf("Uhoh! Something is borked!\n");

return NULL;
}

int main (void) {
int events;
pthread_t thread;

#if defined(_E_)
events = epoll_create(1);
#endif
pthread_create(&thread, NULL, (void * (*) (void *))run, &events);
getchar();
printf("Try to kill the thread.\n");
pthread_cancel(thread);
pthread_join(thread, NULL);
printf("Success.\n");
#if defined(_E_)
close(events);
#endif

return 0;
}

2005-09-23 18:39:12

by Vadim Lobanov

[permalink] [raw]
Subject: Re: [RFC] epoll

On Fri, 23 Sep 2005, Davide Libenzi wrote:

> >>> 3. Wakeup
> >>> As determined by testing with userland code, the sys_tgkill() and
> >>> sys_tkill() functions currently will NOT wake up a sleeping
> >>> epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
> >>> cancellation point. There are two potential issues with this:
> >>> - epoll_wait() meets the unofficial(?) definition of a "system call that
> >>> may block".
> >>> - epoll_wait() behaves differently from poll() and friends.
> >>
> >> The epoll_wait() wait loop is the standard one that even poll() uses (prep
> >> wait, make interruptible, test signals, sched timeo). So if poll() is woke
> >> up, so should epoll_wait(). A minimal code snippet that proves poll()
> >> behing woke up, and epoll_wait() not, would help.
> >>
> >
> > Certainly. :-) See end of email for sample program.
>
> I'm afraid you need to bug the glibc guys, since I think they wrap
> sys_poll(). Try the test program below, when defining _X_, that makes it
> call sys_poll() directly. It will have the same epoll_wait() behaviour.

I'm still a bit confused by how the pthread implementation fits
together. Correct me if the following is wrong, please:
Whenever the user wants to cancel a pthread, glibc eventually calls
{sys-}tgkill() upon the given thread, causing the kernel to return EINTR
to the blocking system call, in this case epoll_wait(). It is glibc's
job to catch this return value and realize that the thread is ready to be
killed, which it is not doing in the case of epoll_wait().
Or is the "current thread has been cancelled and should be killed" check
happening elsewhere / in some other way?

By the way, I already brought this up on the glibc mailing list (before
I sent it to LKML), and it seems they couldn't care less.
(http://sources.redhat.com/ml/libc-alpha/2005-09/msg00071.html)

> - Davide
>

-Vadim Lobanov

>
> #include <sys/epoll.h>
> #include <sys/poll.h>
> #include <pthread.h>
> #include <stdio.h>
> #include <linux/unistd.h>
> #include <errno.h>
> #include <unistd.h>
>
>
> #define __NR_xpoll __NR_poll
>
> #define __sys_xpoll(ufds, nfds, timeout) _syscall3(int, xpoll, struct pollfd *, ufds, \
> unsigned int, nfds, long, timeout)
>
> __sys_xpoll(ufds, nfds, timeout);
>
>
> void * run (int * fd) {
> struct epoll_event result;
>
> printf("Wait forever while polling.\n");
> #if defined(_E_)
> epoll_wait(*fd, &result, 1, -1);
> #elif defined(_X_)
> xpoll(NULL, 0, -1);
> #else
> poll(NULL, 0, -1);
> #endif
> printf("Uhoh! Something is borked!\n");
>
> return NULL;
> }
>
> int main (void) {
> int events;
> pthread_t thread;
>
> #if defined(_E_)
> events = epoll_create(1);
> #endif
> pthread_create(&thread, NULL, (void * (*) (void *))run, &events);
> getchar();
> printf("Try to kill the thread.\n");
> pthread_cancel(thread);
> pthread_join(thread, NULL);
> printf("Success.\n");
> #if defined(_E_)
> close(events);
> #endif
>
> return 0;
> }
>
>

2005-09-23 19:06:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] epoll

On Fri, 23 Sep 2005, Vadim Lobanov wrote:

> On Fri, 23 Sep 2005, Davide Libenzi wrote:
>
>>>>> 3. Wakeup
>>>>> As determined by testing with userland code, the sys_tgkill() and
>>>>> sys_tkill() functions currently will NOT wake up a sleeping
>>>>> epoll_wait(). Effectively, this means that epoll_wait() is NOT a pthread
>>>>> cancellation point. There are two potential issues with this:
>>>>> - epoll_wait() meets the unofficial(?) definition of a "system call that
>>>>> may block".
>>>>> - epoll_wait() behaves differently from poll() and friends.
>>>>
>>>> The epoll_wait() wait loop is the standard one that even poll() uses (prep
>>>> wait, make interruptible, test signals, sched timeo). So if poll() is woke
>>>> up, so should epoll_wait(). A minimal code snippet that proves poll()
>>>> behing woke up, and epoll_wait() not, would help.
>>>>
>>>
>>> Certainly. :-) See end of email for sample program.
>>
>> I'm afraid you need to bug the glibc guys, since I think they wrap
>> sys_poll(). Try the test program below, when defining _X_, that makes it
>> call sys_poll() directly. It will have the same epoll_wait() behaviour.
>
> I'm still a bit confused by how the pthread implementation fits
> together. Correct me if the following is wrong, please:
> Whenever the user wants to cancel a pthread, glibc eventually calls
> {sys-}tgkill() upon the given thread, causing the kernel to return EINTR
> to the blocking system call, in this case epoll_wait(). It is glibc's
> job to catch this return value and realize that the thread is ready to be
> killed, which it is not doing in the case of epoll_wait().
> Or is the "current thread has been cancelled and should be killed" check
> happening elsewhere / in some other way?

Please do not make me look at glibc/pthread code since I do not have time
ATM. I can only speculate on what it is happening. The sys_poll() and
sys_epoll_wait() system calls, when called directly, have the same
behaviour (like you can see in the test code snippet). They both return
EINTR to the caller. When you call glibc's poll(), the behaviour changes
and function is explicitly made a pthread cancellation point. The glibc's
epoll_wait() is not wrapped by the same code, and this makes it unable to
be pthread-canceled. Try to post to glibc the code snippet, and see if
they want to make epoll_wait() pthread-cancel enabled too.



> By the way, I already brought this up on the glibc mailing list (before
> I sent it to LKML), and it seems they couldn't care less.
> (http://sources.redhat.com/ml/libc-alpha/2005-09/msg00071.html)

Yeah, that's Uli :)



- Davide