2003-01-22 06:46:14

by Lennert Buytenhek

[permalink] [raw]
Subject: {sys_,/dev/}epoll waiting timeout

Hi!

Both /dev/epoll EP_POLL and sys_epoll_wait, when converting the passed
timeout value in msec to jiffies, round down instead of up. This
occasionally causes these functions to return without any active fd's
before the given timeout period has passed.

This can cause fun situations like these:
epoll_wait(epfd, events, maxevents, timeout_until_next_timer_expiry)
[ returns too early ]

gettimeofday(&now, NULL)
[ notice that the first timer has not yet expired, do nothing ]

epoll_wait(epfd, events, maxevents, random_small_value_less_than_jiffy)
[ returns immediately ]

gettimeofday(&now, NULL)
[ notice that first timer still didn't expire yet, do nothing ]

etc.

Effectively causing busy-wait loops of on average half a jiffy.


nanosleep(2) always rounds timeout values up (I think it is required to do
so by some specification which says that this call should sleep _at_least_
the given amount of time), and this approach to me makes sense for
{sys_,/dev/}epoll also. See <linux/time.h>:timespec_to_jiffies and
kernel/time.c:sys_nanosleep.

Will you accept a patch to do this?


cheers,
Lennert


2003-01-22 07:54:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Lennert Buytenhek wrote:
> Both /dev/epoll EP_POLL and sys_epoll_wait, when converting the passed
> timeout value in msec to jiffies, round down instead of up. This
> occasionally causes these functions to return without any active fd's
> before the given timeout period has passed.

So you would like it to round up the timeout like poll(), select(),
and io_getevents() do? Fair enough, for the sake of consistency!

sys_poll() says:

if (timeout) {
/* Careful about overflow in the intermediate values */
if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
timeout = (unsigned long)(timeout*HZ+999)/1000+1;
else /* Negative or overflow */
timeout = MAX_SCHEDULE_TIMEOUT;
}

sys_io_getevents() does something more complicated in a function
called set_timeout(), but it essentially comes to the same thing. It
takes a value in nanseconds (which I prefer, btw, for future usefulness).

There is also a nasty overflow in ep_poll(). If HZ == 1000 and
you ask to wait for > approx. 2000 seconds (not unreasonable, say if
epoll is running an ftp server and the client timeout is set to 1
hour), then ep_poll gets the calculation wrong.

I suggest that this line in eventpoll.c:

jtimeout = timeout == -1 ? MAX_SCHEDULE_TIMEOUT: (timeout * HZ) / 1000;

be changed to this:

jtimeout = 0;
if (timeout) {
/* Careful about overflow in the intermediate values */
if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
else /* Negative or overflow */
jtimeout = MAX_SCHEDULE_TIMEOUT;
}

And that the prototypes for ep_poll() and sys_epoll_wait() be changed
to take a "long timeout" instead of an "int", just like sys_poll().

> Effectively causing busy-wait loops of on average half a jiffy.

Sometimes busy waiting is the right thing to do. With HZ == 100,
anything involving video animation needs to busy wait after select()
or equivalent, otherwise there is too much display jitter.

But all that sort of code needs to know how much select() et al. tend
to overrun by anyway, so they can just do the same if using epoll.

-- Jamie


ps. sys_* system-call functions should never return "int". They
should always return "long" or a pointer - even if the user-space
equivalent returns "int". Take a look at sys_open() for an example.
Technical requirement of the system call return path on 64-bit targets.

2003-01-22 12:37:37

by Ed Tomlinson

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Jamie Lokier wrote:

> jtimeout = 0;
> if (timeout) {
> /* Careful about overflow in the intermediate values */
> if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
> jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
> else /* Negative or overflow */
> jtimeout = MAX_SCHEDULE_TIMEOUT;
> }

Why assume HZ=1000? Would not:

timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;

make more sense?

2003-01-22 13:11:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Ed Tomlinson wrote:
> Jamie Lokier wrote:
>
> > jtimeout = 0;
> > if (timeout) {
> > /* Careful about overflow in the intermediate values */
> > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
> > jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
> > else /* Negative or overflow */
> > jtimeout = MAX_SCHEDULE_TIMEOUT;
> > }
>
> Why assume HZ=1000? Would not:
>
> timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
>
> make more sense?

No, that's silly. Why do you want to multiply by HZ and then divide by HZ?

-- Jamie

2003-01-22 19:10:25

by Randy.Dunlap

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Wed, 22 Jan 2003, Jamie Lokier wrote:

| Ed Tomlinson wrote:
| > Jamie Lokier wrote:
| >
| > > jtimeout = 0;
| > > if (timeout) {
| > > /* Careful about overflow in the intermediate values */
| > > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
| > > jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
| > > else /* Negative or overflow */
| > > jtimeout = MAX_SCHEDULE_TIMEOUT;
| > > }
| >
| > Why assume HZ=1000? Would not:
| >
| > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
| >
| > make more sense?
|
| No, that's silly. Why do you want to multiply by HZ and then divide by HZ?

OK, I don't get it. All Ed did was replace 1000 with HZ and
999 with (HZ-1). What's bad about that? Seems to me like
the right thing to do. Much more portable.

What if HZ changes? Who's going to audit the kernel for changes?

--
~Randy

2003-01-22 19:28:15

by Randy.Dunlap

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Wed, 22 Jan 2003, Jamie Lokier wrote:

| Randy.Dunlap wrote:
| > | > Why assume HZ=1000? Would not:
| > | >
| > | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
| > | >
| > | > make more sense?
| > |
| > | No, that's silly. Why do you want to multiply by HZ and then divide by HZ?
| >
| > OK, I don't get it. All Ed did was replace 1000 with HZ and
| > 999 with (HZ-1). What's bad about that? Seems to me like
| > the right thing to do. Much more portable.
| >
| > What if HZ changes? Who's going to audit the kernel for changes?
|
| You're being dense. The input timeout is measured in milliseconds;
| see poll(2). The calculated timeout is measured in jiffies. Hence
| multiply by jiffies and divide by milliseconds.

Like I said, I didn't get it. Now I do. Thanks.

--
~Randy

2003-01-22 19:25:42

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Randy.Dunlap wrote:
> | > Why assume HZ=1000? Would not:
> | >
> | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
> | >
> | > make more sense?
> |
> | No, that's silly. Why do you want to multiply by HZ and then divide by HZ?
>
> OK, I don't get it. All Ed did was replace 1000 with HZ and
> 999 with (HZ-1). What's bad about that? Seems to me like
> the right thing to do. Much more portable.
>
> What if HZ changes? Who's going to audit the kernel for changes?

You're being dense. The input timeout is measured in milliseconds;
see poll(2). The calculated timeout is measured in jiffies. Hence
multiply by jiffies and divide by milliseconds.

-- Jamie

2003-01-23 13:53:01

by Davide Libenzi

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Wed, 22 Jan 2003, Jamie Lokier wrote:

> So you would like it to round up the timeout like poll(), select(),
> and io_getevents() do? Fair enough, for the sake of consistency!
>
> sys_poll() says:
>
> if (timeout) {
> /* Careful about overflow in the intermediate values */
> if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
> timeout = (unsigned long)(timeout*HZ+999)/1000+1;
> else /* Negative or overflow */
> timeout = MAX_SCHEDULE_TIMEOUT;
> }
>
> sys_io_getevents() does something more complicated in a function
> called set_timeout(), but it essentially comes to the same thing. It
> takes a value in nanseconds (which I prefer, btw, for future usefulness).

>From a mathematical point of view this is a ceil(v)+1, so this is wrong.
It should be :

t = (t * HZ + 999) / 1000;

The +999 already gives you the round up. Different is if we want to be
sure to sleep at least that amount of jiffies ( the rounded up ), in that
case since the timer tick might arrive immediately after we go to sleep by
making us to lose immediately a jiffie, we need another +1. Anyway I'll do
the round up. Same for the overflow check.


> And that the prototypes for ep_poll() and sys_epoll_wait() be changed
> to take a "long timeout" instead of an "int", just like sys_poll().

I don't see why. The poll(2) timeout is an int.


> ps. sys_* system-call functions should never return "int". They
> should always return "long" or a pointer - even if the user-space
> equivalent returns "int". Take a look at sys_open() for an example.
> Technical requirement of the system call return path on 64-bit targets.

True, I'll change it.



- Davide

2003-01-23 15:34:00

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Davide Libenzi wrote:
> >From a mathematical point of view this is a ceil(v)+1, so this is wrong.
> It should be :
>
> t = (t * HZ + 999) / 1000;
>
> The +999 already gives you the round up. Different is if we want to be
> sure to sleep at least that amount of jiffies ( the rounded up ), in that
> case since the timer tick might arrive immediately after we go to sleep by
> making us to lose immediately a jiffie, we need another +1. Anyway I'll do
> the round up. Same for the overflow check.

I wonder if it's appropriate to copy sys_poll(), which has the +1, or
sys_select(), which doesn't!

> > And that the prototypes for ep_poll() and sys_epoll_wait() be changed
> > to take a "long timeout" instead of an "int", just like sys_poll().
>
> I don't see why. The poll(2) timeout is an int.

poll(2) takes an int, but sys_poll() takes a long.
I think everyone is confused :)

The reason I suggested "long timeout" for ep_poll is because the
multiply in the expression:

jtimeout = (unsigned long)(timeout*HZ+999)/1000;

can overflow if you don't. If you stick with the int, you'll need to
write:

jtimeout = (((unsigned long)timeout)*HZ+999)/1000;

-- Jamie

2003-01-23 17:10:19

by Mark Mielke

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Thu, Jan 23, 2003 at 03:43:04PM +0000, Jamie Lokier wrote:
> Davide Libenzi wrote:
> > >From a mathematical point of view this is a ceil(v)+1, so this is wrong.
> > It should be :
> > t = (t * HZ + 999) / 1000;
> > The +999 already gives you the round up. Different is if we want to be
> > sure to sleep at least that amount of jiffies ( the rounded up ), in that
> > case since the timer tick might arrive immediately after we go to sleep by
> > making us to lose immediately a jiffie, we need another +1. Anyway I'll do
> > the round up. Same for the overflow check.
> I wonder if it's appropriate to copy sys_poll(), which has the +1, or
> sys_select(), which doesn't!

Or, fix sys_poll(). With the +1, this means that sys_poll() would have
a 1 in 1001 chance per second of returning one jiffie too early.

> > > And that the prototypes for ep_poll() and sys_epoll_wait() be changed
> > > to take a "long timeout" instead of an "int", just like sys_poll().
> > I don't see why. The poll(2) timeout is an int.
> poll(2) takes an int, but sys_poll() takes a long.
> I think everyone is confused :)
> The reason I suggested "long timeout" for ep_poll is because the
> multiply in the expression:
> jtimeout = (unsigned long)(timeout*HZ+999)/1000;
> can overflow if you don't. If you stick with the int, you'll need to
> write:
> jtimeout = (((unsigned long)timeout)*HZ+999)/1000;

On a 16 bit platform, perhaps... :-)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2003-01-23 18:19:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Mark Mielke wrote:
> Or, fix sys_poll(). With the +1, this means that sys_poll() would have
> a 1 in 1001 chance per second of returning one jiffie too early.

Nope. Read the expression again.

> > poll(2) takes an int, but sys_poll() takes a long.
> > I think everyone is confused :)
> > The reason I suggested "long timeout" for ep_poll is because the
> > multiply in the expression:
> > jtimeout = (unsigned long)(timeout*HZ+999)/1000;
> > can overflow if you don't. If you stick with the int, you'll need to
> > write:
> > jtimeout = (((unsigned long)timeout)*HZ+999)/1000;
>
> On a 16 bit platform, perhaps... :-)

Nope. It will overflow on a _64_ bit platform, if you give it a value
of MAXINT for example.

-- Jamie

2003-01-23 20:23:29

by Mark Mielke

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Thu, Jan 23, 2003 at 06:28:31PM +0000, Jamie Lokier wrote:
> Mark Mielke wrote:
> > Or, fix sys_poll(). With the +1, this means that sys_poll() would have
> > a 1 in 1001 chance per second of returning one jiffie too early.
> Nope. Read the expression again.

Sorry... not 1 in 1001... almost 100% chance of returning of one
jiffie too many. In practice, even on a relatively idle system,
the process will not be able to wake up as frequently as it might
be able to expect.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2003-01-23 22:09:55

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Mark Mielke wrote:
> Sorry... not 1 in 1001... almost 100% chance of returning of one
> jiffie too many.

It's curious that select() is different from poll() - almost is if
completely different people wrote the code :)

We have to wonder whether it was a design decision.
Perhaps the unix specifications require it (see below).

> In practice, even on a relatively idle system, the process will not
> be able to wake up as frequently as it might be able to expect.

You're right - it's unfortunate that using poll() lets you sleep and
wake up no faster than every _two_ ticks. That's actually caused by
poll()'s treating zero differently though, not by +1 as such.

There's a strange contradiction between rounding up the waiting time
to the next number of jiffies, and then rounding it down (in a
time-dependent way) by waiting until the next N'th timer interrupt.

If, as someone said, the appropriate unix specification says that
"wait for 10ms" means to wait for _at minimum_ 10ms, then you do need
the +1.

(Davide), IMHO epoll should decide whether it means "at minimum" (in
which case the +1 is a requirement), or it means "at maximum" (in
which case rounding up is wrong).

The current method of rounding up and then effectively down means that
you get an unpredictable mixture of both.

-- Jamie

ps. I would always prefer an absolute wakeup time anyway - it avoids a
race condition too. What a shame none of the system calls work that way.

pps. To summarise, all the time APIs are a complete mess in unix, and
there's nothing you can do in user space to make up for the b0rken
system call interface. Except not duplicate past errors in new interfaces :)

2003-01-24 14:32:20

by Andreas Schwab

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Jamie Lokier <[email protected]> writes:

|> (Davide), IMHO epoll should decide whether it means "at minimum" (in
|> which case the +1 is a requirement), or it means "at maximum" (in
|> which case rounding up is wrong).

Since Linux isn't a RTOS you can't meet "at maximum". POSIX says "poll ()
shall wait at least timeout milliseconds"

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 N?rnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2003-01-25 00:53:38

by Davide Libenzi

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Thu, 23 Jan 2003, Jamie Lokier wrote:

> (Davide), IMHO epoll should decide whether it means "at minimum" (in
> which case the +1 is a requirement), or it means "at maximum" (in
> which case rounding up is wrong).
>
> The current method of rounding up and then effectively down means that
> you get an unpredictable mixture of both.

I think I'll go with :

Tj = (Tms * HZ + 999) / 1000

Somehow I feel it more correct :)



- Davide

2003-01-27 21:18:15

by Bill Rugolsky Jr.

[permalink] [raw]
Subject: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

On Thu, Jan 23, 2003 at 10:18:58PM +0000, Jamie Lokier wrote:
> If, as someone said, the appropriate unix specification says that
> "wait for 10ms" means to wait for _at minimum_ 10ms, then you do need
> the +1.
>
> (Davide), IMHO epoll should decide whether it means "at minimum" (in
> which case the +1 is a requirement), or it means "at maximum" (in
> which case rounding up is wrong).
>
> The current method of rounding up and then effectively down means that
> you get an unpredictable mixture of both.

Quite independent of this discussion, my boss came across this today
while looking at some strace output:

gettimeofday({1043689947, 402580}, NULL) = 0
select(4, [0], [], [], {1, 999658}) = 0 (Timeout)
gettimeofday({1043689949, 401857}, NULL) = 0
gettimeofday({1043689949, 401939}, NULL) = 0
select(4, [0], [], [], {0, 299}) = 0 (Timeout)
gettimeofday({1043689949, 403577}, NULL) = 0

Note that 1043689949.401857 - 1043689947.402580 = 1.999277.

The Single Unix Specification (v2 and v3), says of select():

Implementations may also place limitations on the granularity of
timeout intervals. If the requested timeout interval requires a finer
granularity than the implementation supports, the actual timeout
interval shall be rounded up to the next supported value.

That seems to indicate that a fix is required.

Regards,

Bill Rugolsky

2003-01-27 22:37:39

by Davide Libenzi

[permalink] [raw]
Subject: Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

On Mon, 27 Jan 2003, Bill Rugolsky Jr. wrote:

> Quite independent of this discussion, my boss came across this today
> while looking at some strace output:
>
> gettimeofday({1043689947, 402580}, NULL) = 0
> select(4, [0], [], [], {1, 999658}) = 0 (Timeout)
> gettimeofday({1043689949, 401857}, NULL) = 0
> gettimeofday({1043689949, 401939}, NULL) = 0
> select(4, [0], [], [], {0, 299}) = 0 (Timeout)
> gettimeofday({1043689949, 403577}, NULL) = 0
>
> Note that 1043689949.401857 - 1043689947.402580 = 1.999277.
>
> The Single Unix Specification (v2 and v3), says of select():
>
> Implementations may also place limitations on the granularity of
> timeout intervals. If the requested timeout interval requires a finer
> granularity than the implementation supports, the actual timeout
> interval shall be rounded up to the next supported value.
>
> That seems to indicate that a fix is required.

The problem is that the schedule_timeout() is not precise. So if you pass
N to such function, your sleep interval can be ]N-1, N+1[
This w/out considering other latencies but looking only at how timers are
updated ( you can call schedule_timeout() immediately before a timer tick
or immediately after ). So if we want to be sure to sleep at least the
rounded up number of jiffies, we might end up sleeping one/two jiffies
more ( and this w/out accounting other latencies ). And this will lead to
the formula ( used by poll() ) :

Tj = (Tms * HZ + 999) / 1000 + 1

( if Tms > 0 )



- Davide

2003-01-28 09:35:59

by Jamie Lokier

[permalink] [raw]
Subject: Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

Davide Libenzi wrote:
> ( if Tms > 0 )

Which is unfortunate, because that doesn't allow for a value of Tms ==
0 which is needed when you want to sleep and wake up on every jiffie
on systems where HZ >= 1000. Tms == 0 is taken already, to mean do
not wait at all.

-- Jamie

2003-01-28 10:34:36

by Mark Mielke

[permalink] [raw]
Subject: Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

On Tue, Jan 28, 2003 at 09:45:00AM +0000, Jamie Lokier wrote:
> Davide Libenzi wrote:
> > ( if Tms > 0 )
> Which is unfortunate, because that doesn't allow for a value of Tms ==
> 0 which is needed when you want to sleep and wake up on every jiffie
> on systems where HZ >= 1000. Tms == 0 is taken already, to mean do
> not wait at all.

To some degree, isn't this the equivalent of yield()?

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2003-01-28 19:38:50

by Randy.Dunlap

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Wed, 22 Jan 2003, Jamie Lokier wrote:

| ps. sys_* system-call functions should never return "int". They
| should always return "long" or a pointer - even if the user-space
| equivalent returns "int". Take a look at sys_open() for an example.
| Technical requirement of the system call return path on 64-bit targets.

Is this a blanket truism? For all architectures?

Should current (older/all) syscalls be modified, or should only new ones
(like epoll) be corrected?

Thanks,
--
~Randy

2003-01-28 21:27:10

by Jamie Lokier

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Randy.Dunlap wrote:
> On Wed, 22 Jan 2003, Jamie Lokier wrote:
>
> | ps. sys_* system-call functions should never return "int". They
> | should always return "long" or a pointer - even if the user-space
> | equivalent returns "int". Take a look at sys_open() for an example.
> | Technical requirement of the system call return path on 64-bit targets.
>
> Is this a blanket truism? For all architectures?

I believe so, for all architecture-independent syscall functions.
(And architecture-dependent on those that need it -- see end of this
message).

Linus mentioned it in passing in the recent x86 vsyscall threads. I
have never seen it mentioned before. I always assumed that returned
longs were due to sloppy programming :)

If you look at the syscall return paths on the 64-bit architectures,
some of them always check the 64-bit return value register to see if
it is negative. If so, they set an error flag, which is what
userspace uses to decide whether to return -1, rather than checking if
the return value is >= -4096 (or >= -125, architecture implementations vary).

This convoluted ABI allows those architectures to return full 64-bit
values as legitimate values, which is needed for... ptrace(), only on
those architecture of course.

The question is therefore are "int" values returned from C functions
sign-extended to 64 bits? I don't know, maybe it varies between
architectures -- for all I know it happens to work on all the
supported 64-bit architectures -- but I believe this is the reason why
"long" (or equivalent, e.g. "ssize_t") and pointer types are
_supposed_ to be the only permitted return types from syscall
functions.

> Should current (older/all) syscalls be modified, or should only new ones
> (like epoll) be corrected?

All of them.

Here's a partial list of functions which return int from 2.5.49, based
on parsing Alpha, ARM, CRIS, IA64 and x86_64 trees:

sys_set_tid_address
sys_futex
sys_sched_setaffinity
sys_sched_getaffinity
sys_remap_file_pages
sys_epoll_create
sys_epoll_ctl
sys_epoll_wait
sys_lookup_dcookie
sys_pause

As far as I can guess from reading the GCC sources, 32-bit return
values aren't sign extended on x86_64 and IA64, but they are on all
the other Linux-supported 64-bit architectures (I could be mistaken
about this). Of x86_64 and IA64, only the IA64 syscall return path
tests if the return value register is negative.

Which suggests that all the architectures are fine with all these
"int" returns, except IA64.

Curiously, IA64's own sys_perfmonctl() returns int.

-- Jamie

2003-01-28 21:30:42

by Jamie Lokier

[permalink] [raw]
Subject: Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

Mark Mielke wrote:
> On Tue, Jan 28, 2003 at 09:45:00AM +0000, Jamie Lokier wrote:
> > Davide Libenzi wrote:
> > > ( if Tms > 0 )
> > Which is unfortunate, because that doesn't allow for a value of Tms ==
> > 0 which is needed when you want to sleep and wake up on every jiffie
> > on systems where HZ >= 1000. Tms == 0 is taken already, to mean do
> > not wait at all.
>
> To some degree, isn't this the equivalent of yield()?

No. If you want a process to wake every HZ tick, do a little work and
then sleep again, yield() won't do that.

If HZ >= 1000, you simply can't use Linux poll() to do that; you have
to use select(). (Or epoll_wait()).

Even if select() is changed to do double-rounding-up like poll(), it
will still do this because select() times have microsecond
granularity.

-- Jamie

2003-01-28 21:36:06

by David Mosberger

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

>>>>> On Tue, 28 Jan 2003 21:36:21 +0000, Jamie Lokier <[email protected]> said:

Jamie> Curiously, IA64's own sys_perfmonctl() returns int.

Definitely a bug. Thanks for catching this. I'll let the maintainer know.

--david

2003-01-28 21:46:24

by Andi Kleen

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

Jamie Lokier <[email protected]> writes:
>
> Which suggests that all the architectures are fine with all these
> "int" returns, except IA64.

x86-64 needs long returns too.

I think I fixed all of them, if you noticed any missing please let me now.

-Andi

2003-01-28 22:00:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout)

On Tue, 28 Jan 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > ( if Tms > 0 )
>
> Which is unfortunate, because that doesn't allow for a value of Tms ==
> 0 which is needed when you want to sleep and wake up on every jiffie
> on systems where HZ >= 1000. Tms == 0 is taken already, to mean do
> not wait at all.

Waking up every jiffie does not make a lot of sense in most applications
since they probably prefer to deal with seconds and its derivates, to have
a predictable behavior on different systems. Functions like
poll/select/epoll are simply not the right solution if you want to cut the
microsecond on sleep times.



- Davide

2003-01-28 22:09:46

by Davide Libenzi

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Tue, 28 Jan 2003, Andi Kleen wrote:

> Jamie Lokier <[email protected]> writes:
> >
> > Which suggests that all the architectures are fine with all these
> > "int" returns, except IA64.
>
> x86-64 needs long returns too.
>
> I think I fixed all of them, if you noticed any missing please let me now.

#define __NR_epoll_create ???
#define __NR_epoll_ctl ???
#define __NR_epoll_wait ???

That in 2.5.59 return "int". I posted the patch to make them return
"long" to Linus ( Andrew got it ) this weekend.




- Davide

2003-01-28 22:29:49

by Andi Kleen

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Tue, Jan 28, 2003 at 02:24:52PM -0800, Davide Libenzi wrote:
> On Tue, 28 Jan 2003, Andi Kleen wrote:
>
> > Jamie Lokier <[email protected]> writes:
> > >
> > > Which suggests that all the architectures are fine with all these
> > > "int" returns, except IA64.
> >
> > x86-64 needs long returns too.
> >
> > I think I fixed all of them, if you noticed any missing please let me now.
>
> #define __NR_epoll_create ???
> #define __NR_epoll_ctl ???
> #define __NR_epoll_wait ???
>
> That in 2.5.59 return "int". I posted the patch to make them return
> "long" to Linus ( Andrew got it ) this weekend.

Sorry I meant I fixed all x86-64 specific syscalls returning int.

-Andi

2003-01-28 22:44:54

by Davide Libenzi

[permalink] [raw]
Subject: Re: {sys_,/dev/}epoll waiting timeout

On Tue, 28 Jan 2003, Andi Kleen wrote:

> On Tue, Jan 28, 2003 at 02:24:52PM -0800, Davide Libenzi wrote:
> > On Tue, 28 Jan 2003, Andi Kleen wrote:
> >
> > > Jamie Lokier <[email protected]> writes:
> > > >
> > > > Which suggests that all the architectures are fine with all these
> > > > "int" returns, except IA64.
> > >
> > > x86-64 needs long returns too.
> > >
> > > I think I fixed all of them, if you noticed any missing please let me now.
> >
> > #define __NR_epoll_create ???
> > #define __NR_epoll_ctl ???
> > #define __NR_epoll_wait ???
> >
> > That in 2.5.59 return "int". I posted the patch to make them return
> > "long" to Linus ( Andrew got it ) this weekend.
>
> Sorry I meant I fixed all x86-64 specific syscalls returning int.

I know Andi, I was trying to make you notice that those are missing from
the x86-64 arch :)



- Davide