2006-11-17 21:40:32

by Marc Snider

Subject: Read/Write multiple network FDs in a single syscall context switch?

I've searched long and hard prior to posting here, but have been unable to locate a kernel mechanism providing the ability to read or write multiple FDs in a single userspace to kernel context switch.

We've got a userspace network application that uses epoll to wait for packet arrival and then reads a single frame off of dozens of separate FDs (sockets), operates on the payload and then forwards along by writing to dozens of other separate FDs (sockets). At high loads we invariably have many dozens of socket FDs to read and write.
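For concreteness, the hot path today looks roughly like the sketch below (not our real code; egress[] stands in for our ingress-to-egress socket pairing, process_payload() for the per-frame work, and the epoll instance is assumed to carry the fd in data.fd):

#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64
#define FRAME_MAX  2048

extern int egress[];                                   /* ingress fd -> paired egress fd (placeholder) */
extern void process_payload(char *buf, ssize_t len);   /* placeholder for the per-frame work */

static void service_once(int epfd)
{
        struct epoll_event ev[MAX_EVENTS];
        char frame[FRAME_MAX];
        int i, n;

        n = epoll_wait(epfd, ev, MAX_EVENTS, -1);       /* one syscall to learn what's ready */
        for (i = 0; i < n; i++) {
                int in = ev[i].data.fd;
                ssize_t len = recv(in, frame, sizeof(frame), 0);   /* one syscall per ready fd */
                if (len <= 0)
                        continue;                       /* would-block, error, or close */
                process_payload(frame, len);
                send(egress[in], frame, len, 0);        /* and one more syscall per fd */
        }
}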

If 50 separate frames are received on 50 separate sockets then we are at present doing 50 separate reads and then 50 separate writes, thus resulting in over a hundred distinct (and seemingly unnecessary) user to kernel space and kernel to user space context switches. Is there a mechanism I've missed which allows many network FDs to be read or written in a single syscall? For example, something analogous to the recv() and send() calls but instead providing a vector for the parameters and return value?

I picture something like:

   ssize_t *recvMultiple(int *s, void **buf, size_t *len, int *flags)   and
   ssize_t *sendMultiple(int *s, void **buf, size_t *len, int *flags)
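To pin down the semantics I'm imagining (without claiming anything like this exists), here's a userspace stand-in built on plain recv(); it saves nothing, it just spells out the intended behaviour, and it adds an explicit count argument that the prototypes above would also need:

#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

/* results[i] is the byte count for s[i], 0 if that socket had no data
 * (in lieu of EWOULDBLOCK), or -1 on a real error. */
static void recv_multiple(int n, const int *s, void *const *buf,
                          const size_t *len, const int *flags, ssize_t *results)
{
        int i;

        for (i = 0; i < n; i++) {
                ssize_t r = recv(s[i], buf[i], len[i], flags[i] | MSG_DONTWAIT);
                if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                        r = 0;          /* "would have blocked" marker */
                results[i] = r;
        }
}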


The user would have to be careful about not using blocking sockets with these types of multiple FD operations, but it seems to me that such a kernel mechanism would allow a user space process to eliminate dozens or even hundreds of unnecessary context switches when servicing multiple network FDs... The cycle savings for an application like ours would be huge. I am confused about why I've been unable to locate such a mechanism considering the perceived performance advantages and ubiquitous nature of user applications that service many network FDs...

If it's not too much trouble then I'd appreciate if those answering could CC: me on any responses.


Regards,

Marc Snider
[email protected]



2006-11-17 21:46:14

by Stephen Hemminger

Subject: Re: Read/Write multiple network FDs in a single syscall context switch?

On Fri, 17 Nov 2006 16:40:30 -0500
"Marc Snider" <[email protected]> wrote:

> I've searched long and hard prior to posting here, but have been unable to locate a kernel mechanism providing the ability to read or write multiple FDs in a single userspace to kernel context switch.
>
> We've got a userspace network application that uses epoll to wait for packet arrival and then reads a single frame off of dozens of separate FDs (sockets), operates on the payload and then forwards along by writing to dozens of other separate FDs (sockets).   At high loads we invariably have many dozens of socket FDs to read and write.
>
> If 50 separate frames are received on 50 separate sockets then we are at present doing 50 separate reads and then 50 separate writes, thus resulting in over a hundred distinct (and seemingly unnecessary) user to kernel space and kernel to user space context switches.   Is there a mechanism I've missed which allows many network FDs to be read or written in a single syscall?   For example, something analogous to the recv() and send() calls but instead providing a vector for the parameters and return value?
>
> I picture something like:
>
>    ssize_t *recvMultiple(int *s, void **buf, size_t *len, int *flags)       and
>      ssize_t *sendMultiple(int *s, void **buf, size_t *len, int *flags)
>
>
> The user would have to be careful about not using blocking sockets with these types of multiple FD operations, but it seems to me that such a kernel mechanism would allow a user space process to eliminate dozens or even hundreds of unnecessary context switches when servicing multiple network FDs...    The cycle savings for an application like ours would be huge.   I am confused about why I've been unable to locate such a mechanism considering the perceived performance advantages and ubiquitous nature of user applications that service many network FDs...
>
> If it's not too much trouble then I'd appreciate if those answering could CC: me on any responses.
>
>
> Regards,
>
> Marc Snider
> [email protected]


No, there is no API like this. You would have all sorts of problems to consider, like what happens if there is no data on some of the sockets, or you are flow-blocked, or lots of other issues. If the data is all the same, then why not use multicast?


--
Stephen Hemminger <[email protected]>

2006-11-18 04:20:05

by Willy Tarreau

Subject: Re: Read/Write multiple network FDs in a single syscall context switch?

On Fri, Nov 17, 2006 at 04:40:30PM -0500, Marc Snider wrote:
> I've searched long and hard prior to posting here, but have been unable to locate a kernel mechanism providing the ability to read or write multiple FDs in a single userspace to kernel context switch.
>
> We've got a userspace network application that uses epoll to wait for packet arrival and then reads a single frame off of dozens of separate FDs (sockets), operates on the payload and then forwards along by writing to dozens of other separate FDs (sockets). At high loads we invariably have many dozens of socket FDs to read and write.
>
> If 50 separate frames are received on 50 separate sockets then we are at present doing 50 separate reads and then 50 separate writes, thus resulting in over a hundred distinct (and seemingly unnecessary) user to kernel space and kernel to user space context switches. Is there a mechanism I've missed which allows many network FDs to be read or written in a single syscall? For example, something analogous to the recv() and send() calls but instead providing a vector for the parameters and return value?
>
> I picture something like:
>
>    ssize_t *recvMultiple(int *s, void **buf, size_t *len, int *flags)   and
>    ssize_t *sendMultiple(int *s, void **buf, size_t *len, int *flags)
>
>
> The user would have to be careful about not using blocking sockets with these types of multiple FD operations, but it seems to me that such a kernel mechanism would allow a user space process to eliminate dozens or even hundreds of unnecessary context switches when servicing multiple network FDs... The cycle savings for an application like ours would be huge. I am confused about why I've been unable to locate such a mechanism considering the perceived performance advantages and ubiquitous nature of user applications that service many network FDs...

You should take a look at the "Kernel Mode Linux" patch. While it doesn't
provide the feature you want, it addresses this specific context switch
problem by making your app run in kernel space, thus considerably reducing
the cost of the system calls.

Regards,
Willy

2006-11-18 04:24:46

by Stephen Hemminger

Subject: Re: Read/Write multiple network FDs in a single syscall context switch?

On Fri, 17 Nov 2006 22:53:26 -0500
"Marc Snider" <[email protected]> wrote:

> Sorry, I must have given the wrong impression with respect to the data. It is not all the same. Each ingress socket is associated with an individual egress socket and the packet data being received and transmitted is unique across ingress/egress socket pairs...
>
> Guess I don't see the difficulties you alluded to below, Stephen. The userspace app would only ask to receive on sockets where there was already known data available as per Epoll reporting. I also think it a reasonable constraint to require in this multiple FD operation case that all sockets are mandated as nonblocking, thus a zero or some other unique return value could be provided for each socket that would have blocked in lieu of EWOULDBLOCK.
>
> Write sockets would only be written to when data was available, so there would be no ambiguity on write operations. Again, if the request could not be satisfied due to socket buffer overflow or some other aberration, a nonblocking return code would ensue.
>
> If all socket FDs referenced were required to be nonblocking then I'm having difficulty understanding how circumstances would differ for a vectorized recvMultiple() or sendMultiple() operation when contrasted with doing multiple individual recv() and/or send() calls on N nonblocking sockets...
>
> Forgive me if I'm missing something. It seems to me that the bang for the buck in drastically reducing the number of context switches required for a userspace application to service many network FDs makes a great deal of sense here...
>
> Regards,
> Marc Snider
> [email protected]
>

You forget that on Linux system calls are cheap, unlike on other OSes. A poll/select followed by a receive
is going to be as fast as any recv_any()-type interface. Unless you can reduce the number
of copies from kernel to user (or vice versa), there is no point in inventing yet another
network interface.
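
If you want to see how cheap, a rough micro-benchmark like the one below (my sketch; numbers vary by hardware) times recv() on an idle nonblocking UDP socket, which is close to pure kernel-entry overhead:

#include <stdio.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        char buf[64];
        struct timeval t0, t1;
        long i, iters = 1000000;
        double usec;

        fcntl(fd, F_SETFL, O_NONBLOCK);
        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
                recv(fd, buf, sizeof(buf), 0);          /* always EAGAIN: no data pending */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.0f ns per empty recv()\n", usec * 1000.0 / iters);
        return 0;
}

Compare that per-call cost against the cost of the data copy and your per-frame processing before inventing another interface.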