2002-11-18 15:58:14

by Davide Libenzi

[permalink] [raw]
Subject: [rfc] epoll interface change and glibc bits ...


1) epoll's event structure extension

I received quite a few request to extend the event structure to have space
for an opaque user data object. The eventpoll event structure will turn to
be :

struct epollfd {
int fd;
unsigned short int events, revents;
unsigned long obj;
};

and the epoll_ctl() function will turn to :

int epoll_ctl(int epfd, int op, struct epollfd *pfd);



2) epoll bits in glibc

I was talking to Ulrich Drepper about adding epoll bits inside glibc. His
first objection was to store epoll bits inside poll.h, that IMHO is wrong
because epoll semantics are completely different from poll(). My idea of
the <sys/epoll.h> include file would be this :

#ifndef _SYS_EPOLL_H
#define _SYS_EPOLL_H 1

#include <bits/poll.h>

/* Valid opcodes to issue to sys_epoll_ctl() */
#define EP_CTL_ADD 1
#define EP_CTL_DEL 2
#define EP_CTL_MOD 3

struct epollfd {
int fd;
unsigned short int events, revents;
unsigned long obj;
};

__BEGIN_DECLS

extern int epoll_create(int size);
extern int epoll_ctl(int epfd, int op, struct epollfd *pfd) THROW;
extern int epoll_wait(int epfd, struct epollfd *events, int maxevents,
int timeout) THROW;

__END_DECLS

#endif /* sys/epoll.h */


But he does not like epoll to include <bits/poll.h> and he would like
epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT, ...
In my opinion it is right for epoll to include <bits/poll.h> because those
are bits that f_op->poll() returns, and renaming those bits inside another
include file will require more maintainance. If the kernel will be
extended to support more POLL* bits, they will have to go only inside
<bits/poll.h> w/out having another file to be updated IMHO.



I would like to receive feedback on those issues ...





- Davide




2002-11-18 16:06:01

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 08:05:32AM -0800, Davide Libenzi wrote:
>
> 1) epoll's event structure extension
>
> I received quite a few request to extend the event structure to have space
> for an opaque user data object. The eventpoll event structure will turn to
> be :
>
> struct epollfd {
> int fd;
> unsigned short int events, revents;
> unsigned long obj;

Cannot this be uint64_t obj; ?
Have mercy with 32<->64 bit translation layers in the kernel.

Jakub

2002-11-18 16:11:41

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 08:15:39AM -0800, Davide Libenzi wrote:
> > > 1) epoll's event structure extension
> > >
> > > I received quite a few request to extend the event structure to have space
> > > for an opaque user data object. The eventpoll event structure will turn to
> > > be :
> > >
> > > struct epollfd {
> > > int fd;
> > > unsigned short int events, revents;
> > > unsigned long obj;
> >
> > Cannot this be uint64_t obj; ?
> > Have mercy with 32<->64 bit translation layers in the kernel.
>
> Maybe this :
>
> typedef void *epoll_obj_t;

That is as bad as unsigned long - it is different between 32-bit and 64-bit
ABIs.
Unless you handle the translation in the generic epoll code, there will
need to be wrappers around it which will kmalloc (maxelements * sizeof (struct epollfd));
do the generic epoll into that array and then translate all the structures
in that array into smaller ones and copy to userspace.

Jakub

2002-11-18 16:08:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Jakub Jelinek wrote:

> On Mon, Nov 18, 2002 at 08:05:32AM -0800, Davide Libenzi wrote:
> >
> > 1) epoll's event structure extension
> >
> > I received quite a few request to extend the event structure to have space
> > for an opaque user data object. The eventpoll event structure will turn to
> > be :
> >
> > struct epollfd {
> > int fd;
> > unsigned short int events, revents;
> > unsigned long obj;
>
> Cannot this be uint64_t obj; ?
> Have mercy with 32<->64 bit translation layers in the kernel.

Maybe this :

typedef void *epoll_obj_t;




- Davide


2002-11-18 16:25:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Jakub Jelinek wrote:

> That is as bad as unsigned long - it is different between 32-bit and 64-bit
> ABIs.

Yeah, you right. I did thin about 32bit and 64bit as two diffferent
kernel-glibc environment, I did not think about 32-64 ABI compatibility.
Ouch, adding a 64bit object will double the size of the event structure :(



- Davide





2002-11-18 17:37:15

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 08:05:32AM -0800, Davide Libenzi wrote:
> 1) epoll's event structure extension
> I received quite a few request to extend the event structure to have space
> for an opaque user data object. The eventpoll event structure will turn to
> be :
> struct epollfd {
> int fd;
> unsigned short int events, revents;
> unsigned long obj;
> };

As long as people most people make use of 'obj', this
will be a good thing.

I would be tempted to go the read(), read64() route on this. It makes
everything a mess, but it probably means a more efficient
implementation.

> 2) epoll bits in glibc
> I was talking to Ulrich Drepper about adding epoll bits inside glibc. His
> first objection was to store epoll bits inside poll.h, that IMHO is wrong
> because epoll semantics are completely different from poll(). My idea of
> the <sys/epoll.h> include file would be this :
> ...
> But he does not like epoll to include <bits/poll.h> and he would like
> epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT, ...
> In my opinion it is right for epoll to include <bits/poll.h> because those
> are bits that f_op->poll() returns, and renaming those bits inside another
> include file will require more maintainance. If the kernel will be
> extended to support more POLL* bits, they will have to go only inside
> <bits/poll.h> w/out having another file to be updated IMHO.

If you can guarantee that epoll will always be compatible with poll() in
terms of objects that can be watched, and events that can be watched, I
would lean towards your preference. If this guarantee cannot be made, I
would lean toward Ulrich's preference.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-18 18:30:22

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Mark Mielke wrote:

> On Mon, Nov 18, 2002 at 08:05:32AM -0800, Davide Libenzi wrote:
> > 1) epoll's event structure extension
> > I received quite a few request to extend the event structure to have space
> > for an opaque user data object. The eventpoll event structure will turn to
> > be :
> > struct epollfd {
> > int fd;
> > unsigned short int events, revents;
> > unsigned long obj;
> > };
>
> As long as people most people make use of 'obj', this
> will be a good thing.
>
> I would be tempted to go the read(), read64() route on this. It makes
> everything a mess, but it probably means a more efficient
> implementation.

At that point I prefer to have it directly a 64bit int.



> > 2) epoll bits in glibc
> > I was talking to Ulrich Drepper about adding epoll bits inside glibc. His
> > first objection was to store epoll bits inside poll.h, that IMHO is wrong
> > because epoll semantics are completely different from poll(). My idea of
> > the <sys/epoll.h> include file would be this :
> > ...
> > But he does not like epoll to include <bits/poll.h> and he would like
> > epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT, ...
> > In my opinion it is right for epoll to include <bits/poll.h> because those
> > are bits that f_op->poll() returns, and renaming those bits inside another
> > include file will require more maintainance. If the kernel will be
> > extended to support more POLL* bits, they will have to go only inside
> > <bits/poll.h> w/out having another file to be updated IMHO.
>
> If you can guarantee that epoll will always be compatible with poll() in
> terms of objects that can be watched, and events that can be watched, I
> would lean towards your preference. If this guarantee cannot be made, I
> would lean toward Ulrich's preference.

epoll does hook f_op->poll() and hence uses the asm/poll.h bits.




- Davide


2002-11-18 18:34:40

by Grant Taylor

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

[ This is sort of a rehash of a private email or two; Davide asked me
to air my objections... ]

Davide Libenzi writes:

> I received quite a few request to extend the event structure to have
> space for an opaque user data object. The eventpoll event structure
> will turn to be:
>
> struct epollfd {
> int fd;
> unsigned short int events, revents;
> unsigned long obj;
> };

NOOOO!!!!!

As it stands, it's very easy to write an event dispatching layer that
runs over both epoll and poll. You write a little function that,
somehow, gathers events from the kernel in the form of a "struct
pollfd" array, and then the same code can process these events for
both poll and epoll. If epoll deviates from returning pollfd's, then
there needs to be mapping code or duplicate event dispatch logic for
two different yet substantively similar structures.

More importantly, the whole notion of userspace storing pointers (and
don't kid yourself, that's what people will use it for) in the kernel
is a little suspect. At best you get some slightly awkward caveats.
For example, when userspace decides to free or move something, it has
to tell the kernel via a syscall AND manually scrub the epollfd's
present in the userspace-owned part of the event mapping.

I also don't think there is a performance argument for this feature.
The usual thing is an array lookup indexed on the fd, and that should
be entirely workable for at least as long as the kernel keeps fd's in
an array.

I'm open to epoll returning non-pollfd records if we find that there
is useful information that should be passed to userspace that way, but
I don't think this is it.

> I was talking to Ulrich Drepper about adding epoll bits inside
> glibc. His first objection was to store epoll bits inside poll.h,
> that IMHO is wrong because epoll semantics are completely different
> from poll().

> My idea of the <sys/epoll.h> include file would be this:

> #ifndef _SYS_EPOLL_H
> #define _SYS_EPOLL_H 1
>
> #include <bits/poll.h>
> [...]

> But he does not like epoll to include <bits/poll.h> and he would
> like epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT,
> ...

> In my opinion it is right for epoll to include <bits/poll.h> because
> those are bits that f_op->poll() returns, and renaming those bits
> inside another include file will require more maintainance. If the
> kernel will be extended to support more POLL* bits, they will have
> to go only inside <bits/poll.h> w/out having another file to be
> updated IMHO.

I'm with you on this one, mostly for the same reasons I disagree with
the userdata thing. It just adds work to have to do things
differently.

The various flavors of signals for event delivery all give you normal
poll bits and not oddball "SIGPOLLFOO" mechanism-specific bits. If we
continue adding new mechanisms with new bits, we'll rapidly run out,
at least on 32 bit machines.

I appreciate the interface cleanliness direction the library people
are coming from, but I don't think it's useful here. The meaning of
the event is a function of both the event (the struct pollfd) and the
event arrival (via epoll, poll, sigio, sig32+, etc).

--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/

2002-11-18 19:52:35

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Davide Libenzi wrote:

> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.

It does today. We are talking about "you promise that this will be the
case ever after or we'll cut your head off". I have no idea why you're
so reluctant since you don't have to maintain any of the user-level
bits. And it is not you who has to deal with the fallout of a change
when it happens.

If epoll is so different from poll (and this is what I've been told frmo
Davide) then there should be a clear separation of the interfaces and
all those arguing to unify the data types and constants better should
rethink there understanding.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE92Uai2ijCOnn/RHQRAnxbAJ9grqqA0HgAJTkHSpJMAxEBu6QFDwCeONNv
UI85HprHxc+v+dYmMPEVzR8=
=hZm1
-----END PGP SIGNATURE-----

2002-11-18 21:23:42

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Ulrich Drepper wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Davide Libenzi wrote:
>
> > epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
>
> It does today. We are talking about "you promise that this will be the
> case ever after or we'll cut your head off". I have no idea why you're
> so reluctant since you don't have to maintain any of the user-level
> bits. And it is not you who has to deal with the fallout of a change
> when it happens.
>
> If epoll is so different from poll (and this is what I've been told frmo
> Davide) then there should be a clear separation of the interfaces and
> all those arguing to unify the data types and constants better should
> rethink there understanding.

Ok, the message has been post and noone argued about your solution. Like
it is used to say "speak now or forever hold your piece" :) So we will go
with it, no problem for me. We will define a struct epollfd and all POLL*
bits in EPOLL*




- Davide




2002-11-18 21:59:13

by Grant Taylor

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Ulrich Drepper writes:

>> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.

> It does today. We are talking about "you promise that this will be
> the case ever after or we'll cut your head off".
> [...]
> it is not you who has to deal with the fallout of a change when it

Maybe Davide wouldn't, but *I* do; my project at work runs over epoll,
and interface changes would require rework by me.

Sensible interface changes in the future won't bother me. I don't
expect anything in the future nearly as earth-shattering as this
current driver/ioctl->syscall transition.

> If epoll is so different from poll (and this is what I've been told
> frmo Davide) then there should be a clear separation of the
> interfaces and all those arguing to unify the data types and
> constants better should rethink there understanding.

The main call returns a subset of the information that poll returns.
What could be more natural than to name that subset the same thing?

Really, sys_epoll does two things:

1 It sets up epoll itself.

This interface is entirely epoll-specific, and has all EP-specific
constants etc, in <sys/epoll.h>, as well it should.


2 It returns the set of poll bits that have changed since the last
epoll.

This interface is defined largely in terms of poll, and that's OK.
How many changes do you expect in the poll interface?


In the end, I don't really feel all that strongly about this bits
issue (since truth be told the performance impact will be very small
at most) so it's up to you and Davide.


OTOH, I really hate the "user pointer in struct epollfd" thing...

--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/

2002-11-18 22:03:18

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Ulrich wrote:
>> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
>
> It does today. We are talking about "you promise that this will be the
> case ever after or we'll cut your head off". I have no idea why you're
> so reluctant since you don't have to maintain any of the user-level
> bits. And it is not you who has to deal with the fallout of a change
> when it happens.
>
> If epoll is so different from poll (and this is what I've been told frmo
> Davide) then there should be a clear separation of the interfaces and
> all those arguing to unify the data types and constants better should
> rethink there understanding.

epoll is not really that different from poll, is it?
It delivers edge-triggered versions of the same events poll uses.
Or is there something epoll does I'm not aware of?
- Dan

2002-11-18 22:19:31

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 05:04:07PM -0500, Grant Taylor wrote:
> OTOH, I really hate the "user pointer in struct epollfd" thing...

Are you saying you see no way of using the 'user pointer in struct epollfd'
to accelerate event dispatching?

For a perfectly good example of a use for it that has nothing to do
with pointers, consider the possibility that the structure could hold
a priority number. Sure the FD could be used as an index into an array
(taking up lots of cache memory, btw) or an index into a hash (more
expensive to process), but wouldn't it be useful to sort high priority
events before low priority events without having to dereference every
single fd? I would even tend to delay executing low priority events
until epoll_wait(0) stopped telling me about high priority events.

An opaque field gives me, the event loop designer, freedom. No opaque
field because a few event loop designers are convinced that it will be
used as a data pointer, and they believe this to be wrong, is a
limitation. epoll provides a very efficient alternative to poll. Forcing
epoll to look like poll somewhat defeats the purpose. I don't mind having
a bigger event loop, or two different event loops (one used when epoll is
available, and one used when epoll isn't).

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-18 22:14:32

by Matthew D. Hall

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:

>On Mon, 18 Nov 2002, Jakub Jelinek wrote:
>
>
>
>>That is as bad as unsigned long - it is different between 32-bit and 64-bit
>>ABIs.
>>
>>
>
>Yeah, you right. I did thin about 32bit and 64bit as two diffferent
>kernel-glibc environment, I did not think about 32-64 ABI compatibility.
>Ouch, adding a 64bit object will double the size of the event structure :(
>
>
>
>- Davide
>
>
>
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>
>
>Yeah, you right. I did thin about 32bit and 64bit as two diffferent
>kernel-glibc environment, I did not think about 32-64 ABI compatibility.
>Ouch, adding a 64bit object will double the size of the event structure :(
>
>
Pros of a 64-bit opaque user pointer:
- Simpler programming for userland.
- Fewer cache misses, because there is no need for accessing part of a
separate and large fd->pointer table in userland before being able to
act upon event notification (some implementations can get around this,
but it's very common practice).

Cons:
- 4 bytes memory wastage on 32-bit platforms per struct. Note that it
is not the full 8, because userland needs to have that table of pointers
anyway.

Notes:
- No memory wastage for nontrivial applications on 64-bit platforms.

Matthew D. Hall


2002-11-18 23:09:32

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> On Mon, 18 Nov 2002, Dan Kegel wrote:
>
>
>>Ulrich wrote:
>>
>>>>epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
>>>
>>>It does today. We are talking about "you promise that this will be the
>>>case ever after or we'll cut your head off". I have no idea why you're
>>>so reluctant since you don't have to maintain any of the user-level
>>>bits. And it is not you who has to deal with the fallout of a change
>>>when it happens.
>>>
>>>If epoll is so different from poll (and this is what I've been told frmo
>>>Davide) then there should be a clear separation of the interfaces and
>>>all those arguing to unify the data types and constants better should
>>>rethink there understanding.
>>
>>epoll is not really that different from poll, is it?
>>It delivers edge-triggered versions of the same events poll uses.
>>Or is there something epoll does I'm not aware of?
>
>
> The interface ( edge-triggered ) is quite different and we saw in the
> previous experience how this might lead to confusion for the user. Putting
> epoll bits inside poll.h will IMHO increase this.

The only difference is the edge-triggered nature, though, right?
- Dan



2002-11-18 23:13:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Dan Kegel wrote:

> > The interface ( edge-triggered ) is quite different and we saw in the
> > previous experience how this might lead to confusion for the user. Putting
> > epoll bits inside poll.h will IMHO increase this.
>
> The only difference is the edge-triggered nature, though, right?

Yes, but we have seen that it's enough :)



- Davide


2002-11-18 23:04:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Dan Kegel wrote:

> Ulrich wrote:
> >> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
> >
> > It does today. We are talking about "you promise that this will be the
> > case ever after or we'll cut your head off". I have no idea why you're
> > so reluctant since you don't have to maintain any of the user-level
> > bits. And it is not you who has to deal with the fallout of a change
> > when it happens.
> >
> > If epoll is so different from poll (and this is what I've been told frmo
> > Davide) then there should be a clear separation of the interfaces and
> > all those arguing to unify the data types and constants better should
> > rethink there understanding.
>
> epoll is not really that different from poll, is it?
> It delivers edge-triggered versions of the same events poll uses.
> Or is there something epoll does I'm not aware of?

The interface ( edge-triggered ) is quite different and we saw in the
previous experience how this might lead to confusion for the user. Putting
epoll bits inside poll.h will IMHO increase this.



- Davide


2002-11-18 22:49:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> Ok, the message has been post and noone argued about your solution. Like
> it is used to say "speak now or forever hold your piece" :) So we will go
> with it, no problem for me. We will define a struct epollfd and all POLL*
> bits in EPOLL*

1. Fwiw, I would really like to see epoll extended beyond fds, to a more
general edge-triggered event collector - signals, timers, aios. I've
written about this before as you know (but been too busy lately to
pursue the idea). I'm not going to say any more about this until I
have time to code something...

2. I don't like the "int64_t" proposal because there is no
language guarantee that 64 bits is enough to hold a pointer - and
of course, a pointer is what many applications will store in it.

"long" seems to be the de facto standard for holding a pointer
value in the kernel (e.g. in a futex). That's ugly too, but quite
consistent. Nothing clean presents itself.

-- Jamie

2002-11-18 23:00:21

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Grant Taylor wrote:

> Ulrich Drepper writes:
>
> >> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
>
> > It does today. We are talking about "you promise that this will be
> > the case ever after or we'll cut your head off".
> > [...]
> > it is not you who has to deal with the fallout of a change when it
>
> Maybe Davide wouldn't, but *I* do; my project at work runs over epoll,
> and interface changes would require rework by me.
>
> Sensible interface changes in the future won't bother me. I don't
> expect anything in the future nearly as earth-shattering as this
> current driver/ioctl->syscall transition.

epoll basically born now, and IMHO is very important to get the interface
stable right now. Later changes will be a real pain in the *ss. So I'm
gradually more convinced to have epoll to have its own bits and data
structure.




- Davide


2002-11-18 23:29:59

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Dan Kegel wrote:

> Davide Libenzi wrote:
> > On Mon, 18 Nov 2002, Dan Kegel wrote:
> >
> >>>The interface ( edge-triggered ) is quite different and we saw in the
> >>>previous experience how this might lead to confusion for the user. Putting
> >>>epoll bits inside poll.h will IMHO increase this.
> >>
> >>The only difference is the edge-triggered nature, though, right?
> >
> > Yes, but we have seen that it's enough :)
>
> I'm not so sure. If the epoll documentation were clear enough
> (which at the moment, frankly, it isn't), I think
> there's a good chance users would not be confused
> by the difference between level-triggered and edge-triggered
> events.
>
> I'd be happy to contribute better doc... has the man page
> for sys_epoll been written yet?

Yep, by a long time :)

http://www.xmailserver.org/linux-patches/epoll.2
http://www.xmailserver.org/linux-patches/epoll_create.2
http://www.xmailserver.org/linux-patches/epoll_ctl.2
http://www.xmailserver.org/linux-patches/epoll_wait.2

it is going to change though with the latest talks about the interface.



- Davide


2002-11-18 23:48:39

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Jamie Lokier wrote:

> 1. Fwiw, I would really like to see epoll extended beyond fds, to a more
> general edge-triggered event collector - signals, timers, aios. I've
> written about this before as you know (but been too busy lately to
> pursue the idea). I'm not going to say any more about this until I
> have time to code something...

Anything that "exposes" a file* interface that support f_op->poll() is
usable with epoll. File rocks !! :)



> 2. I don't like the "int64_t" proposal because there is no
> language guarantee that 64 bits is enough to hold a pointer - and
> of course, a pointer is what many applications will store in it.

I know, but we should be covered for another 8-10 years with 64 bits :)



- Davide


2002-11-18 23:22:36

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> On Mon, 18 Nov 2002, Dan Kegel wrote:
>
>>>The interface ( edge-triggered ) is quite different and we saw in the
>>>previous experience how this might lead to confusion for the user. Putting
>>>epoll bits inside poll.h will IMHO increase this.
>>
>>The only difference is the edge-triggered nature, though, right?
>
> Yes, but we have seen that it's enough :)

I'm not so sure. If the epoll documentation were clear enough
(which at the moment, frankly, it isn't), I think
there's a good chance users would not be confused
by the difference between level-triggered and edge-triggered
events.

I'd be happy to contribute better doc... has the man page
for sys_epoll been written yet?
- Dan

2002-11-19 00:18:08

by Grant Taylor

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

>>>>> Mark Mielke <[email protected]> writes:

>> OTOH, I really hate the "user pointer in struct epollfd" thing...

> Are you saying you see no way of using the 'user pointer in struct
> epollfd' to accelerate event dispatching?

No, I'm saying that's it's a tricky and unusual thing, and I'm not
convinced that eliminating an fd array will make a substantial
difference, cache or no.

> For a perfectly good example of a use for it that has nothing to do
> with pointers, consider the possibility that the structure could
> hold a priority number [...] to sort high priority events before low
> priority events without having to dereference every single fd?

Sure, but...

> I would even tend to delay executing low priority events until
> epoll_wait(0) stopped telling me about high priority events.

...for this to work, you have to stow events over epoll calls. The
sensible place to store these is in your per-fd structure. So you
still don't save the access to your per-event structure, just the one
array index lookup.

If you do priorities, you *must* do this; otherwise you will be
processing all events as they arrive in userspace. Merely doing them
in priority order will produce a slightly reduced but still O(n)
latency for high priority events, rather than roughly bounded latency
as is usually the intent.

BTW it is also possible to implement event prioritization in
kernelspace. You just [e]poll several sets of epolled fd's and take
the most interesting set of events each time. Unless, that is, the
new syscall interface broke this...

> An opaque field gives me, the event loop designer, freedom. No
> opaque field because a few event loop designers are convinced that
> it will be used as a data pointer, and they believe this to be
> wrong, is a limitation.

Bah. Freedom causes holes in feet.

Flexible interfaces are invariably a pain for the users or the
developers or both. Inflexible interfaces are less painful - they
either work or they don't. As long as you avoid the nonworking sort,
life is wonderful.

> epoll provides a very efficient alternative to poll. Forcing epoll
> to look like poll somewhat defeats the purpose. I don't mind having

No argument here; it might even be handy for epoll to do other things,
but this hinting feature just feels wrong. (Even "other things" would
probably be better implemented through standard poll or some other
non-specialized mechanism).

To the coder go the spoils; we'll see what Davide does. And what
Linus lets in...

--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/

2002-11-19 00:23:02

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
>>I'd be happy to contribute better doc... has the man page
>>for sys_epoll been written yet?
>
> http://www.xmailserver.org/linux-patches/epoll.2
> http://www.xmailserver.org/linux-patches/epoll_create.2
> http://www.xmailserver.org/linux-patches/epoll_ctl.2
> http://www.xmailserver.org/linux-patches/epoll_wait.2
>
> it is going to change though with the latest talks about the interface.

Hmm. Right off the bat, I see a terminology problem.
The man page says

.SH NAME
epoll \- edge triggered asynchronous I/O facility

That's going to confuse some users. They might think
epoll can actually initiate I/O. Better to say

epoll \- edge triggered I/O readiness notification facility

Second, epoll_ctl(2) doesn't define the meaning of the
event mask. It should give the allowed bits and define
their meanings. If we use the traditional POLLIN etc, we
can say
POLLIN - the fd has become ready for reading
POLLOUT - the fd has become ready for writing
Note: If epoll tells you e.g. POLLIN, it means that
poll will tell you the same thing,
since poll gives the current status,
and epoll gives changes in status.

- Dan


2002-11-19 01:26:49

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Dan Kegel wrote:

> Davide Libenzi wrote:
> >>I'd be happy to contribute better doc... has the man page
> >>for sys_epoll been written yet?
> >
> > http://www.xmailserver.org/linux-patches/epoll.2
> > http://www.xmailserver.org/linux-patches/epoll_create.2
> > http://www.xmailserver.org/linux-patches/epoll_ctl.2
> > http://www.xmailserver.org/linux-patches/epoll_wait.2
> >
> > it is going to change though with the latest talks about the interface.
>
> Hmm. Right off the bat, I see a terminology problem.
> The man page says
>
> .SH NAME
> epoll \- edge triggered asynchronous I/O facility
>
> That's going to confuse some users. They might think
> epoll can actually initiate I/O. Better to say
>
> epoll \- edge triggered I/O readiness notification facility

Yes, maybe sounds better ...


> Second, epoll_ctl(2) doesn't define the meaning of the
> event mask. It should give the allowed bits and define
> their meanings. If we use the traditional POLLIN etc, we
> can say
> POLLIN - the fd has become ready for reading
> POLLOUT - the fd has become ready for writing
> Note: If epoll tells you e.g. POLLIN, it means that
> poll will tell you the same thing,
> since poll gives the current status,
> and epoll gives changes in status.

I will have to change man pages also to fit EPOLL* definitions.



- Davide


2002-11-19 01:26:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> > 1. Fwiw, I would really like to see epoll extended beyond fds, to a more
> > general edge-triggered event collector - signals, timers, aios. I've
> > written about this before as you know (but been too busy lately to
> > pursue the idea). I'm not going to say any more about this until I
> > have time to code something...
>
> Anything that "exposes" a file* interface that support f_op->poll() is
> usable with epoll. File rocks !! :)

I agree, fds are pretty good, and it's nice that they work equally
well with poll/select/sigio as with epoll.

It's just that to write an ideal, general event loop, pollable fds
need to be created on the fly for futexes, signals, timers and aio
requests at least. Currently this is already done for futexes. In
addition, some kinds of event result need to return a few words of
data with each event (for example, SIGCHLD events).

Perhaps it's not a bad idea, but I do wonder whether fds created on
the fly for every requested event, and events than can only report
"readable" not anything richer than that are a good solution.

-- Jamie

2002-11-19 01:38:38

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> On Mon, 18 Nov 2002, Dan Kegel wrote:
>>Second, epoll_ctl(2) doesn't define the meaning of the
>>event mask. It should give the allowed bits and define
>>their meanings. If we use the traditional POLLIN etc, we
>>can say
>> POLLIN - the fd has become ready for reading
>> POLLOUT - the fd has become ready for writing
>> Note: If epoll tells you e.g. POLLIN, it means that
>> poll will tell you the same thing,
>> since poll gives the current status,
>> and epoll gives changes in status.
>
>
> I will have to change man pages also to fit EPOLL* definitions.

IMHO changing from using POLLIN etc. to EPOLLIN
will obscure the essential relationship between
epoll and poll (namely, that the epoll bits
are the time derivative of the poll bits).

I would prefer to continue using POLLIN, etc.
- Dan


2002-11-19 01:42:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Jamie Lokier wrote:

> > Anything that "exposes" a file* interface that support f_op->poll() is
> > usable with epoll. File rocks !! :)
>
> I agree, fds are pretty good, and it's nice that they work equally
> well with poll/select/sigio as with epoll.
>
> It's just that to write an ideal, general event loop, pollable fds
> need to be created on the fly for futexes, signals, timers and aio
> requests at least. Currently this is already done for futexes. In
> addition, some kinds of event result need to return a few words of
> data with each event (for example, SIGCHLD events).
>
> Perhaps it's not a bad idea, but I do wonder whether fds created on
> the fly for every requested event, and events than can only report
> "readable" not anything richer than that are a good solution.

IMHO it is very elegant to have these objects to be a file* internally.
Common event retrival and automatic cleanup are just the first two points
that come in my mind.



- Davide


2002-11-19 01:57:13

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Dan Kegel wrote:

> Davide Libenzi wrote:
> > On Mon, 18 Nov 2002, Dan Kegel wrote:
> >>Second, epoll_ctl(2) doesn't define the meaning of the
> >>event mask. It should give the allowed bits and define
> >>their meanings. If we use the traditional POLLIN etc, we
> >>can say
> >> POLLIN - the fd has become ready for reading
> >> POLLOUT - the fd has become ready for writing
> >> Note: If epoll tells you e.g. POLLIN, it means that
> >> poll will tell you the same thing,
> >> since poll gives the current status,
> >> and epoll gives changes in status.
> >
> >
> > I will have to change man pages also to fit EPOLL* definitions.
>
> IMHO changing from using POLLIN etc. to EPOLLIN
> will obscure the essential relationship between
> epoll and poll (namely, that the epoll bits
> are the time derivative of the poll bits).
>
> I would prefer to continue using POLLIN, etc.

Dan, at the beginning I had the same thought as yours. Now I sort of
changed my mind. The epoll interface is basically seeing the light in
these days and even if right now it uses f_op->poll(), tomorrow we don't
know. To avoid painful code changes later is IMHO better to define EPOLL*
bits right now. They'll be the same of the POLL* bits but will enable
epoll to be independent from poll.h bits. At least to the outside world.
Same for the event structure.



- Davide


2002-11-19 03:39:25

by Edgar Toernig

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
>
> http://www.xmailserver.org/linux-patches/epoll.2
> http://www.xmailserver.org/linux-patches/epoll_create.2
> http://www.xmailserver.org/linux-patches/epoll_ctl.2
> http://www.xmailserver.org/linux-patches/epoll_wait.2
>
> it is going to change though with the latest talks about the interface.

My 2 cents:

Remove the waitqueue stuff from epoll.2. It has meaning only to
linux kernel developers and noone else.

Please add more semantic details of the new facility. It is new
and these man pages are the only (and normative) reference. I.e.:

What about adding an fd twice to the epoll-set? Do you get an
error, will it override the previous settings for that fd, will
it be ignored, or is it registered twice and you get two results
for that fd? Can two epoll-sets wait for the same fd? Are events
reported to both epoll-fds?

Is the epoll-fd itself poll/epoll/selectable? Can I build cluster
of epoll-sets? What happens if the epollfd is put into its own
fd set? Can I send the epoll-fd over a unix-socket to another
process?

Then, please add more details of how events are generated. You
say, that an inactive-to-active transition causes an event. What
is the starting point of the collection? (I guess, all transitions
between two epoll_wait calls.) There could be a couple of transi-
tions on an fd between two epoll_wait calls. Are these events com-
bined into a single reported event or is each single edge reported?
Does an operation on an fd effect the already collected but not yet
reported events?

About epoll_wait: it looks like a "read with timeout" call. Is that
really necessary or wouldn't a normal read(2) work as well? Similar
for epoll_ctl: couldn't a write(2) to the epoll-fd do the same?

That are the things that come into my mind at the moment. I guess
there are more details I've missed...

Ciao, ET.

2002-11-19 03:39:33

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 03:52:45PM -0800, Dan Kegel wrote:
> I'm not so sure. If the epoll documentation were clear enough
> (which at the moment, frankly, it isn't), I think
> there's a good chance users would not be confused
> by the difference between level-triggered and edge-triggered
> events.

I would argue that it is a good thing that people that believe poll and
epoll to be almost exactly the same, find themselves confused.

I might not succeed, but it is certainly how I feel about the issue.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-19 03:50:16

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, Nov 18, 2002 at 07:23:35PM -0500, Grant Taylor wrote:
> >>>>> Mark Mielke <[email protected]> writes:
> > I would even tend to delay executing low priority events until
> > epoll_wait(0) stopped telling me about high priority events.
> ...for this to work, you have to stow events over epoll calls. The
> sensible place to store these is in your per-fd structure. So you
> still don't save the access to your per-event structure, just the one
> array index lookup.

This isn't necessarily true. For a minimal example a stack or list of
undispatched events can be kept. Also, if all high priority events are
processed, I would not be so worried about low priority events taking
a longer time to be dispatched. Is it not my freedom as an event loop
designer to make these decisions?

> If you do priorities, you *must* do this; otherwise you will be
> processing all events as they arrive in userspace. Merely doing them
> in priority order will produce a slightly reduced but still O(n)
> latency for high priority events, rather than roughly bounded latency
> as is usually the intent.

Overall throughput is not significant. Some events require lower latency
than other events. An event loop that assumes that all events should have
equal latency may be satisfactory for some applications, but will result
in problems for others.

> BTW it is also possible to implement event prioritization in
> kernelspace. You just [e]poll several sets of epolled fd's and take
> the most interesting set of events each time. Unless, that is, the
> new syscall interface broke this...

I'm asking for a little freedom to innovate when the times comes. You are
suggesting that I shouldn't have this freedom, and that my ability to
innovate should be limited to what you or I can imagine at the current
point in time. We're talking about one extra field to a data structure.

I could have asked for epoll to implement a priority scheme internally...
I didn't... :-)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-19 03:50:38

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Mon, 18 Nov 2002, Grant Taylor wrote:

> To the coder go the spoils; we'll see what Davide does. And what
> Linus lets in...

I'm really neutral here, I see pros and cons about the opaque data member.
What I should have pretty much decided right now is to go with EPOLL* bits
and the "struct epollfd" ( Ulrich request ).



- Davide



2002-11-19 04:07:31

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Edgar Toernig wrote:

> Davide Libenzi wrote:
> >
> > http://www.xmailserver.org/linux-patches/epoll.2
> > http://www.xmailserver.org/linux-patches/epoll_create.2
> > http://www.xmailserver.org/linux-patches/epoll_ctl.2
> > http://www.xmailserver.org/linux-patches/epoll_wait.2
> >
> > it is going to change though with the latest talks about the interface.

Yes, man pages needs some work after the interface will be fixed. I hope
to fix it this week.



> Remove the waitqueue stuff from epoll.2. It has meaning only to
> linux kernel developers and noone else.

Sigh, I liked it :)


> What about adding an fd twice to the epoll-set? Do you get an
> error, will it override the previous settings for that fd, will
> it be ignored, or is it registered twice and you get two results
> for that fd?

You get EEXIST
Well, there's the remote possibility, trying very badly from two threads,
to add the same fd twice. It is an harmless condition though.



> Can two epoll-sets wait for the same fd?

Yes. Not suggested though.


> Are events reported to both epoll-fds?

Yes.



> Is the epoll-fd itself poll/epoll/selectable?

Yes.



> Can I build cluster of epoll-sets?

Uh ?!



> What happens if the epollfd is put into its own fd set?

You might find your machine a little bit frozen :)
Either 1) I remove the read lock from poll() or 2) I check the condition
at insetion time to avoid it. I very much prefer 2)


> Can I send the epoll-fd over a unix-socket to another
> process?

I'd say yes. SCM_RIGHTS should simply do an in-kernel file* to remote task
descriptor mapping.



> Then, please add more details of how events are generated. You
> say, that an inactive-to-active transition causes an event. What
> is the starting point of the collection? (I guess, all transitions

The starting point are the bits found at insertion time.



> between two epoll_wait calls.) There could be a couple of transi-
> tions on an fd between two epoll_wait calls. Are these events com-
> bined into a single reported event or is each single edge reported?

Yes, they'll be combined.



> Does an operation on an fd effect the already collected but not yet
> reported events?

You can do two operations on an existing fd. Remove is meaninless for this
case. Modify will re-read available bits.



> About epoll_wait: it looks like a "read with timeout" call. Is that
> really necessary or wouldn't a normal read(2) work as well? Similar
> for epoll_ctl: couldn't a write(2) to the epoll-fd do the same?

IMHO distinc functions are more clear than magic read/write operations.



- Davide


2002-11-19 05:29:05

by Edgar Toernig

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
>
> On Tue, 19 Nov 2002, Edgar Toernig wrote:
> > What about adding an fd twice to the epoll-set? Do you get an
> > error, will it override the previous settings for that fd, will
> > it be ignored, or is it registered twice and you get two results
> > for that fd?
>
> You get EEXIST
> Well, there's the remote possibility, trying very badly from two threads,
> to add the same fd twice. It is an harmless condition though.

Just IMHO: I would prefer a different behaviour:

int epoll_ctl(int epfd, int fd, int events)

which registers interest for "events" on "fd" and retuns previous
registered events for that fd (implies that the fd is removed when
"events" is 0) or -1 for error.

If you don't like it, at least an EP_CTL_GET should be added though.

Btw, what errno for an invalid fd (not epfd)?

> > Is the epoll-fd itself poll/epoll/selectable?
>
> Yes.

Fine. I guess, only POLLIN/readable is generated.

>
> > Can I build cluster of epoll-sets?
>
> Uh ?!

The previous "yes" already answers this ;-) What I meant, ie three
fd-sets - low, normal, high priority fds - and a fourth set consisting
of the three epfds for these sets.

> > What happens if the epollfd is put into its own fd set?
>
> You might find your machine a little bit frozen :)
> Either 1) I remove the read lock from poll() or 2) I check the condition
> at insetion time to avoid it. I very much prefer 2)

Hehe, sure. But could become tricky: someone may build a circular chain
of epoll-fd-sets.

> > Can I send the epoll-fd over a unix-socket to another
> > process?
>
> I'd say yes. SCM_RIGHTS should simply do an in-kernel file* to remote task
> descriptor mapping.

And what happens then? Will the set refers to the fds from the sender
process or of fds of the receiving process (which may not even have
all those fds open)?

Another btw, what happens on close of an fd? Will it get removed from all
epoll-fd-sets automatically?

> > Then, please add more details of how events are generated. You
> > say, that an inactive-to-active transition causes an event. What
> > is the starting point of the collection? (I guess, all transitions
>
> The starting point are the bits found at insertion time.

... and then after each epoll_wait call, I assume?

> > Does an operation on an fd effect the already collected but not yet
> > reported events?
>
> You can do two operations on an existing fd. Remove is meaninless for this
> case. Modify will re-read available bits.

Huh, sorry. I meant read/write/poll style of operations.

Anyway, thanks for the information. I hope they will find their way
into the man-pages ;-) (Ok, they may become more like the posix docs
but IMHO new interfaces should be well documented.)

Ciao, ET.

2002-11-19 05:43:12

by Grant Taylor

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

>>>>> Mark Mielke <[email protected]> writes:

> We're talking about one extra field to a data structure.

Hmm. As it happens, it looks like Davide is going to err on the side
of the kitchen sink; that neatly makes us both more or less happy,
even if it's still ugly ;)


Meanwhile, in the more important caveat department (Dan, this will
appeal to you), I found a while back that signals cause pain with
epoll.

For example, sometimes TCP reads return EAGAIN when in fact they have
data. This seems to stem from the case where the signal is found
before the first segment copy (from tcp.c circa 1425, there's even a
handy FIXME note there). If you use epoll and get an EAGAIN, you have
no idea if it was a signal or a real empty socket unless you are also
very careful to notice when you got a signal during the read.

do {
struct sk_buff * skb;
u32 offset;

/* Are we at urgent data? Stop if we have read anything. */
if (copied && tp->urg_data && tp->urg_seq == *seq)
break;

/* We need to check signals first, to get correct SIGURG
* handling. FIXME: Need to check this doesnt impact 1003.1g
* and move it down to the bottom of the loop
*/
if (signal_pending(current)) {
if (copied)
break;
copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
break;
}

/* Next get a buffer. */

skb = skb_peek(&sk->receive_queue);
do {



--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/

2002-11-19 05:55:26

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, Nov 19, 2002 at 06:35:46AM +0100, Edgar Toernig wrote:
> Davide Libenzi wrote:
> > > What happens if the epollfd is put into its own fd set?
> > You might find your machine a little bit frozen :)
> > Either 1) I remove the read lock from poll() or 2) I check the condition
> > at insetion time to avoid it. I very much prefer 2)
> Hehe, sure. But could become tricky: someone may build a circular chain
> of epoll-fd-sets.

This could be an indication that epoll of an epoll fd should not be
allowed. This kind of sucks as epoll event hierarchies appear, at
least on the surface, very natural, and better than poll()/epoll.

Another option would be to allow a reasonable depth (2? 3?). The extra
checking (depth calculation) only needs to be performed if a file is
added or removed where f_op == epoll_f_op.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-19 06:07:49

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, Nov 19, 2002 at 12:49:16AM -0500, Grant Taylor wrote:
> For example, sometimes TCP reads return EAGAIN when in fact they have
> data. This seems to stem from the case where the signal is found
> before the first segment copy (from tcp.c circa 1425, there's even a
> handy FIXME note there). If you use epoll and get an EAGAIN, you have
> no idea if it was a signal or a real empty socket unless you are also
> very careful to notice when you got a signal during the read.

I hope this isn't a stupid question: Why doesn't the code you speak of
return EINTR instead of EAGAIN?

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-19 12:26:06

by bert hubert

[permalink] [raw]
Subject: having indistinguishable events halves epoll utility

On Mon, Nov 18, 2002 at 05:32:14PM -0500, Mark Mielke wrote:
> On Mon, Nov 18, 2002 at 05:04:07PM -0500, Grant Taylor wrote:
> > OTOH, I really hate the "user pointer in struct epollfd" thing...
>
> An opaque field gives me, the event loop designer, freedom. No opaque
> field because a few event loop designers are convinced that it will be
> used as a data pointer, and they believe this to be wrong, is a
> limitation. epoll provides a very efficient alternative to poll. Forcing

If epoll only returns 'bald events' that like electrons are
indistinguishable, userspace will have to do expensive accounting. In fact,
work on MTasker (http://ds9a.nl/mtasker) already necessitated additional
accounting in user space in order to report events back to the right task.

Such double accounting really hurts epoll scalability benefits and means
doing work in the wrong place. It is like we now have kallsyms - the kernel
is well suited to contain that data and do the work. A weaker example is the
in-kernel module loader.

Another solution might be to have the kernel assign an id on epoll_ctl
insert time and store that in the struct. As we are unlikely to have >2^31
events in flight, 32 bits would suffice easily.

But in any case - the cost of preventing 'malicious users' from storing a
single pointer in the kernel is massive amounts of double work. We are in
that case not offering a useful interface.

So from me as an application & library developer, please do the opaque
pointer thing. By the way, Davide, do you think the time is ripe to wrap up
the manpages now? I could finalize them if you want.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-11-19 15:17:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Mark Mielke wrote:
> On Tue, Nov 19, 2002 at 12:49:16AM -0500, Grant Taylor wrote:
> > For example, sometimes TCP reads return EAGAIN when in fact they have
> > data. This seems to stem from the case where the signal is found
> > before the first segment copy (from tcp.c circa 1425, there's even a
> > handy FIXME note there). If you use epoll and get an EAGAIN, you have
> > no idea if it was a signal or a real empty socket unless you are also
> > very careful to notice when you got a signal during the read.
>
> I hope this isn't a stupid question: Why doesn't the code you speak of
> return EINTR instead of EAGAIN?

Mark's right, it should be EINTR. EAGAIN shouldn't break any
single-thread user state machines using poll/select, as a non-blocking
read is always allowed to return EAGAIN for any transient reason.

I'm not sure if EAGAIN can cause a poll() wakeup event to be missed.
If so, that would be a TCP bug that breaks epoll, and it would also
break some user state machines using poll/select, when there are
multiple processes waiting on a socket.

-- Jamie

2002-11-19 16:58:35

by Dan Kegel

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Grant Taylor wrote:
> Meanwhile, in the more important caveat department (Dan, this will
> appeal to you), I found a while back that signals cause pain with
> epoll.
>
> For example, sometimes TCP reads return EAGAIN when in fact they have
> data. This seems to stem from the case where the signal is found
> before the first segment copy (from tcp.c circa 1425, there's even a
> handy FIXME note there). If you use epoll and get an EAGAIN, you have
> no idea if it was a signal or a real empty socket unless you are also
> very careful to notice when you got a signal during the read.
> ...
> /* We need to check signals first, to get correct SIGURG
> * handling. FIXME: Need to check this doesnt impact 1003.1g
> * and move it down to the bottom of the loop
> */
> if (signal_pending(current)) {
> if (copied)
> break;
> copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
> break;
> }

eek! Thanks for noticing that!

Jamie wrote:

> Mark's right, it should be EINTR. EAGAIN shouldn't break any
> single-thread user state machines using poll/select, as a non-blocking
> read is always allowed to return EAGAIN for any transient reason.
>
> I'm not sure if EAGAIN can cause a poll() wakeup event to be missed.
> If so, that would be a TCP bug that breaks epoll, and it would also
> break some user state machines using poll/select, when there are
> multiple processes waiting on a socket.

I guess we should scour the sources looking for ways read() and write()
can return EAGAIN, and make sure that there is no chance this causes
a hang in user state machines that rely on epoll. (Sure would be nice
if the Stanford Checker was up to this kind of thing.)
- Dan



2002-11-19 17:00:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Mark Mielke wrote:

> On Tue, Nov 19, 2002 at 06:35:46AM +0100, Edgar Toernig wrote:
> > Davide Libenzi wrote:
> > > > What happens if the epollfd is put into its own fd set?
> > > You might find your machine a little bit frozen :)
> > > Either 1) I remove the read lock from poll() or 2) I check the condition
> > > at insetion time to avoid it. I very much prefer 2)
> > Hehe, sure. But could become tricky: someone may build a circular chain
> > of epoll-fd-sets.
>
> This could be an indication that epoll of an epoll fd should not be
> allowed. This kind of sucks as epoll event hierarchies appear, at
> least on the surface, very natural, and better than poll()/epoll.
>
> Another option would be to allow a reasonable depth (2? 3?). The extra
> checking (depth calculation) only needs to be performed if a file is
> added or removed where f_op == epoll_f_op.

I will not enable epoll fds to be stored inside another epoll fd. With
limited numbers of fds poll scales good, and if you need to create a
priority hierarchy, you can use a small poll set of epoll fds. This is a
very corner case and I'm not willing to screw up the to to handle
something that made 0.1% of users are going to do. It _can_ be done in
other ways.



- Davide


2002-11-19 17:02:59

by Davide Libenzi

[permalink] [raw]
Subject: Re: having indistinguishable events halves epoll utility

On Tue, 19 Nov 2002, bert hubert wrote:

> So from me as an application & library developer, please do the opaque
> pointer thing. By the way, Davide, do you think the time is ripe to wrap up
> the manpages now? I could finalize them if you want.

Let's finalize this discussion and I'll give you ( togheter with whoever
would like it ) the task to work on finalizing man pages.



- Davide


2002-11-19 19:27:51

by Grant Taylor

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Dan Kegel writes:

>> For example, sometimes TCP reads return EAGAIN when in fact they
> eek! Thanks for noticing that!

We aim to please.

In fact, it took me a long time to track this sucker down, so it's
good that I remembered before someone else got baffled.

This wouldn't be an issue if nonblocking reads or writes were
guaranteed to be uninterruptible even if on a "slow" device. But
that's a deeper change than just a bit of caution wrt EAGAIN, and this
is an area to monkey around in only very carefully.

Unfortunately, I never looked elsewhere for this style of error; in
the old epoll you could only epoll TCP sockets (maybe pipes too?). It
might even be present in TCP sendmsg; I don't recall...


Davide writes:

> I will not enable epoll fds to be stored inside another epoll
> fd. With limited numbers of fds poll scales good, and if you need to
> create a priority hierarchy, you can use a small poll set of epoll
> fds. This is a very corner case and I'm not willing to screw up the
> to to handle something that made 0.1% of users are going to do. It
> _can_ be done in other ways.

I'd vote for the "three levels, tops" approach, myself, but I suspect
that for now you are right and it will be simpler to just be
restrictive until we find a need and/or a good way to let epoll epoll
epoll. So to speak ;)

If we're bound to poll small sets of epoll fds perhaps a bit of
improvement in poll would be worthwhile. I should go look at my
profiles again; I've got a program which works this way because some
library code requires poll semantics and it's no good rewriting the
world...

--
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
Linux Printing Website and HOWTO: http://www.linuxprinting.org/

2002-11-19 19:44:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Grant Taylor wrote:

> If we're bound to poll small sets of epoll fds perhaps a bit of
> improvement in poll would be worthwhile. I should go look at my

The current poll() with a number of fds you're talking about, scales
beautifully.



- Davide


2002-11-19 20:44:29

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, Nov 19, 2002 at 11:51:40AM -0800, Davide Libenzi wrote:
> On Tue, 19 Nov 2002, Grant Taylor wrote:
> > If we're bound to poll small sets of epoll fds perhaps a bit of
> > improvement in poll would be worthwhile. I should go look at my
> The current poll() with a number of fds you're talking about, scales
> beautifully.

For event loops that need minimal latency for high priority events
(even at the cost of higher latency for low priority events), poll()
of epoll fds may allow greater optimization opportunities, as the
epoll fd set is dynamically adjusted without extra system code
overhead.

Example: Code that expects that at least one high priority event may
be ready (for example, after dispatching events during the previous
iteration) can choose to first only poll(0) the epoll fds responsible
for the high priority event sets. Only if poll(0) returns no high
priority epoll fds, would poll(timeout) then be issued on all epoll
fds. This would result in a very short dispatch path for events for
high priority events.

What we really need is a few actual event loop implementations to test
all of these theories. I would find it especially nice if the code
could be ported to 2.4.x... :-)

For now, epoll of epoll fd is not necessary for our needs.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-19 20:47:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Mark Mielke wrote:

> On Tue, Nov 19, 2002 at 11:51:40AM -0800, Davide Libenzi wrote:
> > On Tue, 19 Nov 2002, Grant Taylor wrote:
> > > If we're bound to poll small sets of epoll fds perhaps a bit of
> > > improvement in poll would be worthwhile. I should go look at my
> > The current poll() with a number of fds you're talking about, scales
> > beautifully.
>
> For event loops that need minimal latency for high priority events
> (even at the cost of higher latency for low priority events), poll()
> of epoll fds may allow greater optimization opportunities, as the
> epoll fd set is dynamically adjusted without extra system code
> overhead.
>
> Example: Code that expects that at least one high priority event may
> be ready (for example, after dispatching events during the previous
> iteration) can choose to first only poll(0) the epoll fds responsible
> for the high priority event sets. Only if poll(0) returns no high
> priority epoll fds, would poll(timeout) then be issued on all epoll
> fds. This would result in a very short dispatch path for events for
> high priority events.
>
> What we really need is a few actual event loop implementations to test
> all of these theories. I would find it especially nice if the code
> could be ported to 2.4.x... :-)

I think it's better to wait to define the final interface ( and eventpoll
code bits ) before starting the 2.4.x backport patch.



- Davide


2002-11-20 01:52:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, 19 Nov 2002, Edgar Toernig wrote:

> > You get EEXIST
> > Well, there's the remote possibility, trying very badly from two threads,
> > to add the same fd twice. It is an harmless condition though.
>
> Just IMHO: I would prefer a different behaviour:
>
> int epoll_ctl(int epfd, int fd, int events)
>
> which registers interest for "events" on "fd" and retuns previous
> registered events for that fd (implies that the fd is removed when
> "events" is 0) or -1 for error.
>
> If you don't like it, at least an EP_CTL_GET should be added though.
>
> Btw, what errno for an invalid fd (not epfd)?

I don't like things to happen for magic bits conditions. EPOLL_CTL_ADD
requires the same code internally and is more clear for the user. If
someone won't shoot me before, I think the final interface will be :

#include <bits/poll.h>

#define EPOLLIN POLLIN
...


#define EPOLL_CTL_ADD 1
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3


struct epoll_fd {
int fd;
unsigned short events;
unsigned short revents;
__uint64_t obj;
};


int epoll_create(int size);
int epoll_ctl(int epfd, int op, struct epollfd *pfd);
int epoll_wait(int epfd, struct epollfd *events, int maxevents, int timeout);


Ulrich, is it right for you or do you want EPOLL* bits to be directly
defined as numbers instead of assuming the POLL* vaules ?


At the very end the opaque object might help in many cases while it won't
hurt other cases.



> > You might find your machine a little bit frozen :)
> > Either 1) I remove the read lock from poll() or 2) I check the condition
> > at insetion time to avoid it. I very much prefer 2)
>
> Hehe, sure. But could become tricky: someone may build a circular chain
> of epoll-fd-sets.

It'll be possible to add epfd1 inside epfd2, not epfd1 inside epfd1.



> > I'd say yes. SCM_RIGHTS should simply do an in-kernel file* to remote task
> > descriptor mapping.
>
> And what happens then? Will the set refers to the fds from the sender
> process or of fds of the receiving process (which may not even have
> all those fds open)?

Uhm !? It'll refer to files opened on the other process. To handle this
correctly we should prevent an epoll file to be passed with SCM_RIGHTS in
net/core/scm.c. I mean, no catastrophic things should happen, only the
interface won't work correctly. I don't know if the extra handling code is
worth, but we should definitely put this inside epoll(2).



> Another btw, what happens on close of an fd? Will it get removed from all
> epoll-fd-sets automatically?

Yes, obviously.




- Davide


2002-11-20 03:01:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

I have a question about:

> struct epoll_fd {
> int fd;
> unsigned short events;
> unsigned short revents;
> __uint64_t obj;
> };

What value does the `fd' field have when a file descriptor being
polled has been renumbered (by dup/close or dup2/close or
fcntl(F_DUPFD)/close or passing through a unix domain socket)?

If we are honest, the `obj' field is absolutely essential as its the
only value which uniquely identifies the file descriptor if you have
done anything unusual with the fds.

The `fd' field, on the other hand, is not guaranteed to correspond
with the correct file descriptor number. So.... perhaps the structure
should contain an `obj' field and _no_ `fd' field?

This doesn't affect applications. Those which use `obj' for something
interesting (i.e. a pointer) will have the `fd' value stored in the
pointed-to data structure, while simple applications can just store
the original `fd' value in `obj' in the first place.

> > And what happens then? Will the set refers to the fds from the sender
> > process or of fds of the receiving process (which may not even have
> > all those fds open)?
>
> Uhm !? It'll refer to files opened on the other process. To handle this
> correctly we should prevent an epoll file to be passed with SCM_RIGHTS in
> net/core/scm.c. I mean, no catastrophic things should happen, only the
> interface won't work correctly. I don't know if the extra handling code is
> worth, but we should definitely put this inside epoll(2).

See above - same problem occurs with dup() and variants. The problem
is that epoll is really reporting events on a file*, and the mapping
back to fd is not always meaningful.

SCM_RIGHTS is just one of the ways to renumber fds, and disallowing it
won't fix the exact same problem caused by dup(). Thus SCM_RIGHTS
should be allowed because there is no reason to disallow it - and it
is actually useful for some kinds of server implementation.

> > Hehe, sure. But could become tricky: someone may build a circular chain
> > of epoll-fd-sets.
>
> It'll be possible to add epfd1 inside epfd2, not epfd1 inside epfd1.

Beware of overflowing the kernel stack. If epfd4 becomes readable,
and wakes up epfd3, which wakes up epfd2, which wakes up epfd1... If
that is implemented recursively than I can write malicious code which
will crash the kernel. Note that this isn't a cycle. It's possible
to code the wakeups so this cannot happen but still have the expected
behaviour.

A circular arrangement should be fine, if silly. The semantics are
quite logical and don't require special cases: epfd2 becoming readable
will trigger epfd1 to become readable if it isn't already. If you
make a cycle, that's silly but still behaves as you'd expect. If
epfd1 becomes readable, it wakes up epfd1... which is already
readable so nothing further happens. Similarly with larger cycles.
Assuming you've avoided stack overflow for acyclic graphs, there won't
be any problem with cyclic ones.

-- Jamie

2002-11-20 03:39:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Jamie Lokier wrote:

> I have a question about:
>
> > struct epoll_fd {
> > int fd;
> > unsigned short events;
> > unsigned short revents;
> > __uint64_t obj;
> > };
>
> What value does the `fd' field have when a file descriptor being
> polled has been renumbered (by dup/close or dup2/close or
> fcntl(F_DUPFD)/close or passing through a unix domain socket)?
>
> If we are honest, the `obj' field is absolutely essential as its the
> only value which uniquely identifies the file descriptor if you have
> done anything unusual with the fds.
>
> The `fd' field, on the other hand, is not guaranteed to correspond
> with the correct file descriptor number. So.... perhaps the structure
> should contain an `obj' field and _no_ `fd' field?
>
> This doesn't affect applications. Those which use `obj' for something
> interesting (i.e. a pointer) will have the `fd' value stored in the
> pointed-to data structure, while simple applications can just store
> the original `fd' value in `obj' in the first place.

Even if I agree with you here, this will make the API asymmetrical. We
will have :

struct epoll_fd {
unsigned short events;
unsigned short revents;
__uint64_t obj;
};


int epoll_ctl(int epfd, int op, int fd, struct epoll_fd *pfd);

Where the "fd" is used only for EPOLL_CTL_ADD, and "obj" for EPOLL_CTL_DEL
and EPOLL_CTL_MOD.




> > It'll be possible to add epfd1 inside epfd2, not epfd1 inside epfd1.
>
> Beware of overflowing the kernel stack. If epfd4 becomes readable,
> and wakes up epfd3, which wakes up epfd2, which wakes up epfd1... If
> that is implemented recursively than I can write malicious code which
> will crash the kernel. Note that this isn't a cycle. It's possible
> to code the wakeups so this cannot happen but still have the expected
> behaviour.
>
> A circular arrangement should be fine, if silly. The semantics are
> quite logical and don't require special cases: epfd2 becoming readable
> will trigger epfd1 to become readable if it isn't already. If you
> make a cycle, that's silly but still behaves as you'd expect. If
> epfd1 becomes readable, it wakes up epfd1... which is already
> readable so nothing further happens. Similarly with larger cycles.
> Assuming you've avoided stack overflow for acyclic graphs, there won't
> be any problem with cyclic ones.

This is a problem with the new callback'd wake_up(). I'd be tempted to not
permit epoll fd inclusion inside other epoll fds.




- Davide


2002-11-20 03:56:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Jamie Lokier wrote:

> I have a question about:
>
> > struct epoll_fd {
> > int fd;
> > unsigned short events;
> > unsigned short revents;
> > __uint64_t obj;
> > };
>
> What value does the `fd' field have when a file descriptor being
> polled has been renumbered (by dup/close or dup2/close or
> fcntl(F_DUPFD)/close or passing through a unix domain socket)?
>
> If we are honest, the `obj' field is absolutely essential as its the
> only value which uniquely identifies the file descriptor if you have
> done anything unusual with the fds.
>
> The `fd' field, on the other hand, is not guaranteed to correspond
> with the correct file descriptor number. So.... perhaps the structure
> should contain an `obj' field and _no_ `fd' field?
>
> This doesn't affect applications. Those which use `obj' for something
> interesting (i.e. a pointer) will have the `fd' value stored in the
> pointed-to data structure, while simple applications can just store
> the original `fd' value in `obj' in the first place.

It's OK. I agree. We can remove the fd from inside the structure and have :

struct epoll_event {
unsigned short events;
unsigned short revents;
__uint64_t obj;
};

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
int timeout);


And the lower size of the structure will help to reduce the amount of
memory transfered to userspace. I just saw that adding the extra "obj"
member lowered performance of about 15% with crazy tests like Ben's
pipetest. This because it creates, on my machine, more than 400000 events
per second, and saving memory bandwidth on such conditions is a must. With
the "more human" http test performance are about the same.




- Davide


2002-11-20 07:34:34

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, Nov 20, 2002 at 03:09:19AM +0000, Jamie Lokier wrote:
> [ ... regarding 'struct epollfd' ('struct epoll_fd'?) ... ]
> What value does the `fd' field have when a file descriptor being
> polled has been renumbered (by dup/close or dup2/close or
> fcntl(F_DUPFD)/close or passing through a unix domain socket)?

As long as the kernel doesn't freeze up or leak memory, I believe it is
the responsibility of the application code to ensure that the correct
file descriptors are registered (and deregistered). If the application
code does screwy things, it should expect a little bit of undefined
behaviour.

> The `fd' field, on the other hand, is not guaranteed to correspond
> with the correct file descriptor number. So.... perhaps the structure
> should contain an `obj' field and _no_ `fd' field?

Whether the 'fd' field is in kernel space or user space or both does
not really affect the ability of the application code to be able to
perform dup() operations without needing to change references to the
'fd'. For the people that do not wish to use the 'obj' field, the 'fd'
field will be more natural. (Anybody who dups a file descriptor, and
then tells the event loop to watch both file descriptors, is asking for
trouble under any scheme...)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-20 07:47:20

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Tue, Nov 19, 2002 at 08:04:33PM -0800, Davide Libenzi wrote:
> On Wed, 20 Nov 2002, Jamie Lokier wrote:
> > The `fd' field, on the other hand, is not guaranteed to correspond
> > with the correct file descriptor number. So.... perhaps the structure
> > should contain an `obj' field and _no_ `fd' field?
> > ...
> It's OK. I agree. We can remove the fd from inside the structure and have :
> struct epoll_event {
> unsigned short events;
> unsigned short revents;
> __uint64_t obj;
> };

Forget any argument I had against removing 'fd'. This sounds good.

Perhaps 'obj' should be named 'userdata'?

struct epoll_event {
unsigned short events;
unsigned short revents;
__uint64_t userdata;
};

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-20 21:57:14

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> And the lower size of the structure will help to reduce the amount of
> memory transfered to userspace. I just saw that adding the extra "obj"
> member lowered performance of about 15% with crazy tests like Ben's
> pipetest. This because it creates, on my machine, more than 400000 events
> per second, and saving memory bandwidth on such conditions is a must. With
> the "more human" http test performance are about the same.

I'd be quite surprised if 400,000 word/sec of memory bandwidth can
explain a 15% time difference, especially considering all the other
things that are done to communicate over a pipe (wakeups etc).

Btw, I have seen variations of that magnitude in other network tests
that I've done that _appear_ to be explained by small changes to the
code like that, but the next day or week the variation stops
correlating with the change. Page colouring affecting benchmarks
again?

-- Jamie

2002-11-20 22:02:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > And the lower size of the structure will help to reduce the amount of
> > memory transfered to userspace. I just saw that adding the extra "obj"
> > member lowered performance of about 15% with crazy tests like Ben's
> > pipetest. This because it creates, on my machine, more than 400000 events
> > per second, and saving memory bandwidth on such conditions is a must. With
> > the "more human" http test performance are about the same.
>
> I'd be quite surprised if 400,000 word/sec of memory bandwidth can
> explain a 15% time difference, especially considering all the other
> things that are done to communicate over a pipe (wakeups etc).

Jamie, they were 16 bytes * 400000, and the token passed through the pipe
was 12 bytes.



- Davide


2002-11-20 23:11:50

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Mark Mielke wrote:

> On Tue, Nov 19, 2002 at 08:04:33PM -0800, Davide Libenzi wrote:
> > On Wed, 20 Nov 2002, Jamie Lokier wrote:
> > > The `fd' field, on the other hand, is not guaranteed to correspond
> > > with the correct file descriptor number. So.... perhaps the structure
> > > should contain an `obj' field and _no_ `fd' field?
> > > ...
> > It's OK. I agree. We can remove the fd from inside the structure and have :
> > struct epoll_event {
> > unsigned short events;
> > unsigned short revents;
> > __uint64_t obj;
> > };
>
> Forget any argument I had against removing 'fd'. This sounds good.
>
> Perhaps 'obj' should be named 'userdata'?
>
> struct epoll_event {
> unsigned short events;
> unsigned short revents;
> __uint64_t userdata;
> };

Do we want to have a union instead of a direct 64bit int ?




- Davide


2002-11-20 23:20:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> > > And the lower size of the structure will help to reduce the amount of
> > > memory transfered to userspace. I just saw that adding the extra "obj"
> > > member lowered performance of about 15% with crazy tests like Ben's
> > > pipetest. This because it creates, on my machine, more than 400000 events
> > > per second, and saving memory bandwidth on such conditions is a must. With
> > > the "more human" http test performance are about the same.
> >
> > I'd be quite surprised if 400,000 word/sec of memory bandwidth can
> > explain a 15% time difference, especially considering all the other
> > things that are done to communicate over a pipe (wakeups etc).
>
> Jamie, they were 16 bytes * 400000, and the token passed through the pipe
> was 12 bytes.

However, it's 4 bytes (1 word) * 400000 _difference_ between the two
tests, yes?

-- Jamie

2002-11-20 23:25:21

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > > And the lower size of the structure will help to reduce the amount of
> > > > memory transfered to userspace. I just saw that adding the extra "obj"
> > > > member lowered performance of about 15% with crazy tests like Ben's
> > > > pipetest. This because it creates, on my machine, more than 400000 events
> > > > per second, and saving memory bandwidth on such conditions is a must. With
> > > > the "more human" http test performance are about the same.
> > >
> > > I'd be quite surprised if 400,000 word/sec of memory bandwidth can
> > > explain a 15% time difference, especially considering all the other
> > > things that are done to communicate over a pipe (wakeups etc).
> >
> > Jamie, they were 16 bytes * 400000, and the token passed through the pipe
> > was 12 bytes.
>
> However, it's 4 bytes (1 word) * 400000 _difference_ between the two
> tests, yes?

Yep, the problem is that the "tool" used to measure ( Ben's pipetest ) on
my machine has a variance of about 35% and this makes every measure prety
fuzzy.



- Davide


2002-11-20 23:37:42

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, Nov 20, 2002 at 03:19:30PM -0800, Davide Libenzi wrote:
> On Wed, 20 Nov 2002, Mark Mielke wrote:
> > > struct epoll_event {
> > > unsigned short events;
> > > unsigned short revents;
> > > __uint64_t obj;
> > > };
> > Forget any argument I had against removing 'fd'. This sounds good.
> > Perhaps 'obj' should be named 'userdata'?
> > struct epoll_event {
> > unsigned short events;
> > unsigned short revents;
> > __uint64_t userdata;
> > };
> Do we want to have a union instead of a direct 64bit int ?

I was going to suggest this, except I couldn't figure out what to
suggest that it look like... I finally figured that the value could be
cast, or wrapped in a union by userspace (although theoretically, this
might mean more words than absolutely necessary to initialize on a
32-bit CPU...)

What were you thinking? 1X64 bit or 2X32 bit?

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-20 23:49:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Mark Mielke wrote:

> On Wed, Nov 20, 2002 at 03:19:30PM -0800, Davide Libenzi wrote:
> > On Wed, 20 Nov 2002, Mark Mielke wrote:
> > > > struct epoll_event {
> > > > unsigned short events;
> > > > unsigned short revents;
> > > > __uint64_t obj;
> > > > };
> > > Forget any argument I had against removing 'fd'. This sounds good.
> > > Perhaps 'obj' should be named 'userdata'?
> > > struct epoll_event {
> > > unsigned short events;
> > > unsigned short revents;
> > > __uint64_t userdata;
> > > };
> > Do we want to have a union instead of a direct 64bit int ?
>
> I was going to suggest this, except I couldn't figure out what to
> suggest that it look like... I finally figured that the value could be
> cast, or wrapped in a union by userspace (although theoretically, this
> might mean more words than absolutely necessary to initialize on a
> 32-bit CPU...)
>
> What were you thinking? 1X64 bit or 2X32 bit?

Something like :

typedef union epoll_obj {
void *ptr;
__uint32_t u32[2];
__uint64_t u64;
} epoll_obj_t;

I'm open to suggestions though. The "ptr" enable me to avoid wierd casts
to avoid gcc screaming.




- Davide



2002-11-21 00:20:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Davide Libenzi wrote:
> typedef union epoll_obj {
> void *ptr;
> __uint32_t u32[2];
> __uint64_t u64;
> } epoll_obj_t;
>
> I'm open to suggestions though. The "ptr" enable me to avoid wierd casts
> to avoid gcc screaming.

That makes more sense to me, because it will be fine to use `ptr' even
on 128-bit pointer machines when they arrive, yet preserves the
property that 64<->32 bit conversion functions don't need to reformat
the buffer when running 32-bit applications on a 64-bit CPU... even if
the 32-bit application uses the `ptr' field.

Did I just write that? :)

-- Jamie

2002-11-21 00:18:49

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, Nov 20, 2002 at 03:57:26PM -0800, Davide Libenzi wrote:
> > What were you thinking? 1X64 bit or 2X32 bit?
> typedef union epoll_obj {
> void *ptr;
> __uint32_t u32[2];
> __uint64_t u64;
> } epoll_obj_t;
> I'm open to suggestions though. The "ptr" enable me to avoid wierd casts
> to avoid gcc screaming.

It looks fine to me for as long as we can guarantee that sizeof(void *)
will be less than or equal to sizeof(__uint64_t) (relatively safe).

I still prefer 'userdata' over 'obj', but the name of thing is not very
important to me.

I'm not sure if this is wise or not, but an 'fd' member might be
useful as well:

typedef union epoll_obj {
void *ptr;
int fd;
__uint32_t u32[2];
__uint64_t u64;
} epoll_obj_t;

For applications that wish to store fd's (probably common due to
poll() roots), this would help them avoid casting magic as well. Also,
it allows for 64 bit file descriptors if that ever became necessary.

Since userspace applications are almost guaranteed to need to do casting
or union copying, using the union sounds like a good idea.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-21 00:46:49

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

Mark Mielke wrote:
> It looks fine to me for as long as we can guarantee that sizeof(void *)
> will be less than or equal to sizeof(__uint64_t) (relatively safe).

It is fine, for epoll itself, even if sizeof(void *) > sizeof(__uint64_t).

Conversion functions for legacy 32-bit code running on a 128-bit chip
will be more complex, but epoll is the _least_ of problems in that case...

> I still prefer 'userdata' over 'obj', but the name of thing is not very
> important to me.

The AIO subsystem has a very similar cookie, which is simply declared
as `__u64 aio_data' in user-supplied requests, and returned in the
responses. It's converted to a field called `ki_user_data' in the
kernel.

Just a few reference points... I like `user_data' myself.

> I'm not sure if this is wise or not, but an 'fd' member might be
> useful as well:
> [...]
> For applications that wish to store fd's (probably common due to
> poll() roots), this would help them avoid casting magic as well. Also,
> it allows for 64 bit file descriptors if that ever became necessary.

That is a good idea, if we go for the union.

-- Jamie

2002-11-21 00:57:00

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Thu, 21 Nov 2002, Jamie Lokier wrote:

> That is a good idea, if we go for the union.

This is the include file I'm thinking to submit to Ulrich for glibc
inclusion :



#ifndef _SYS_EPOLL_H
#define _SYS_EPOLL_H 1

#include <sys/types.h>


enum EPOLL_EVENTS {
EPOLLIN = 0x001,
#define EPOLLIN EPOLLIN

EPOLLPRI = 0x002,
#define EPOLLPRI EPOLLPRI

EPOLLOUT = 0x004,
#define EPOLLOUT EPOLLOUT

#ifdef __USE_XOPEN

EPOLLRDNORM = 0x040,
#define EPOLLRDNORM EPOLLRDNORM

EPOLLRDBAND = 0x080,
#define EPOLLRDBAND EPOLLRDBAND

EPOLLWRNORM = 0x100,
#define EPOLLWRNORM EPOLLWRNORM

EPOLLWRBAND = 0x200,
#define EPOLLWRBAND EPOLLWRBAND

#endif

#ifdef __USE_GNU
EPOLLMSG = 0x400,
#define EPOLLMSG EPOLLMSG
#endif

EPOLLERR = 0x008,
#define EPOLLERR EPOLLERR

EPOLLHUP = 0x010
#define EPOLLHUP EPOLLHUP

};



/* Valid opcodes to issue to epoll_ctl() */
#define EPOLL_CTL_ADD 1 /* Add a file decriptor to the interface */
#define EPOLL_CTL_DEL 2 /* Remove a file decriptor from the interface */
#define EPOLL_CTL_MOD 3 /* Change file decriptor epoll_event structure */


typedef union epoll_obj {
void *ptr;
int fd;
__uint32_t u32[2];
__uint64_t u64;
} epoll_obj_t;

struct epoll_event {
unsigned short events; /* Required events */
unsigned short revents; /* Returned events */
epoll_obj_t data; /* User data variable */
};


__BEGIN_DECLS

/*
* Creates an epoll interface by suggesting the requested size in terms
* of file descriptors that will be presumably stored inside the interface.
* The integer returned by epoll_create() should be closed with close().
*/
extern int epoll_create(int size);

/*
* Controller function for the epoll interface. Valid values for the "op"
* parameter are the EPOLL_CTL_* constant defined above. The "fd" parameter
* is the target of the operation while the "event" parameter describe which
* events the caller is interested in and the data ( obj ) he associates
* the the file "fd".
*/
extern int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) __THROW;

/*
* Wait for events on the epoll interface. The function will return the number
* of events that will be available inside the memory area pointed by "events".
* Up to "maxevents" events will be copied. The "timeout" parameter enable the
* caller to specify a miximum wait time in milliseconds ( -1 == infinite ).
*/
extern int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
int timeout) __THROW;

__END_DECLS


#endif /* #ifndef _SYS_EPOLL_H */



Comments ?






- Davide


2002-11-21 01:08:49

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Thu, Nov 21, 2002 at 12:28:16AM +0000, Jamie Lokier wrote:
> Davide Libenzi wrote:
> > typedef union epoll_obj {
> > void *ptr;
> > __uint32_t u32[2];
> > __uint64_t u64;
> > } epoll_obj_t;
> > I'm open to suggestions though. The "ptr" enable me to avoid wierd casts
> > to avoid gcc screaming.
> That makes more sense to me, because it will be fine to use `ptr' even
> on 128-bit pointer machines when they arrive, yet preserves the
> property that 64<->32 bit conversion functions don't need to reformat
> the buffer when running 32-bit applications on a 64-bit CPU... even if
> the 32-bit application uses the `ptr' field.
> Did I just write that? :)

The problem with sizeof(void *) being >= sizeof(__uint64_t) is that the
data structure is the wrong length. Binary compatibility would not be
maintained.

Still... I believe that the days of 128-bit pointers are comfortably
far enough away that it does not cause any concerns at all for me. (I
suspect the kernel will need quite a few more changes than just this
to support 128-bit poitners...)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-21 01:13:13

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Wed, 20 Nov 2002, Mark Mielke wrote:

> On Thu, Nov 21, 2002 at 12:28:16AM +0000, Jamie Lokier wrote:
> > Davide Libenzi wrote:
> > > typedef union epoll_obj {
> > > void *ptr;
> > > __uint32_t u32[2];
> > > __uint64_t u64;
> > > } epoll_obj_t;
> > > I'm open to suggestions though. The "ptr" enable me to avoid wierd casts
> > > to avoid gcc screaming.
> > That makes more sense to me, because it will be fine to use `ptr' even
> > on 128-bit pointer machines when they arrive, yet preserves the
> > property that 64<->32 bit conversion functions don't need to reformat
> > the buffer when running 32-bit applications on a 64-bit CPU... even if
> > the 32-bit application uses the `ptr' field.
> > Did I just write that? :)
>
> The problem with sizeof(void *) being >= sizeof(__uint64_t) is that the
> data structure is the wrong length. Binary compatibility would not be
> maintained.
>
> Still... I believe that the days of 128-bit pointers are comfortably
> far enough away that it does not cause any concerns at all for me. (I
> suspect the kernel will need quite a few more changes than just this
> to support 128-bit poitners...)

In theory it is possible to pass epoll_create() an extra parameter that
will set the size of the extra data, from 0 up to N. And the kernel will
return this data "as is". It'll become nasty in user space to access data
though.




- Davide


2002-11-21 15:30:14

by Denis Vlasenko

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On 20 November 2002 22:33, Mark Mielke wrote:
> On Wed, Nov 20, 2002 at 03:57:26PM -0800, Davide Libenzi wrote:
> > > What were you thinking? 1X64 bit or 2X32 bit?
> >
> > typedef union epoll_obj {
> > void *ptr;
> > __uint32_t u32[2];
> > __uint64_t u64;
> > } epoll_obj_t;
> > I'm open to suggestions though. The "ptr" enable me to avoid wierd
> > casts to avoid gcc screaming.
>
> It looks fine to me for as long as we can guarantee that sizeof(void
> *) will be less than or equal to sizeof(__uint64_t) (relatively
> safe).
>
> I still prefer 'userdata' over 'obj', but the name of thing is not
> very important to me.
>
> I'm not sure if this is wise or not, but an 'fd' member might be
> useful as well:
>
> typedef union epoll_obj {
> void *ptr;
> int fd;
> __uint32_t u32[2];
> __uint64_t u64;
> } epoll_obj_t;

u32 and u64 sounds more like type name. d32 / d64 ?
--
vda

2002-11-21 16:37:28

by Mark Mielke

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Thu, Nov 21, 2002 at 06:08:16PM -0200, Denis Vlasenko wrote:
> > typedef union epoll_obj {
> > void *ptr;
> > int fd;
> > __uint32_t u32[2];
> > __uint64_t u64;
> > } epoll_obj_t;
> u32 and u64 sounds more like type name. d32 / d64 ?

d32 isn't as descriptive as u32 (unsigned). Also if every field had a 'd'
in front of it, it would be 'dptr', 'dfd', ...

u32 is fine to me.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-21 17:37:43

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] epoll interface change and glibc bits ...

On Thu, 21 Nov 2002, Mark Mielke wrote:

> On Thu, Nov 21, 2002 at 06:08:16PM -0200, Denis Vlasenko wrote:
> > > typedef union epoll_obj {
> > > void *ptr;
> > > int fd;
> > > __uint32_t u32[2];
> > > __uint64_t u64;
> > > } epoll_obj_t;
> > u32 and u64 sounds more like type name. d32 / d64 ?
>
> d32 isn't as descriptive as u32 (unsigned). Also if every field had a 'd'
> in front of it, it would be 'dptr', 'dfd', ...
>
> u32 is fine to me.

I renamed epoll_obj_t to epoll_data_t ...



- Davide