2006-03-06 06:22:40

by Dan Aloni

[permalink] [raw]
Subject: Status of AIO

Hello,

I'm trying to assert the status of AIO under the current version
of Linux 2.6. However by searching I wasn't able to find any
indication about its current state. Is there anyone using it
under a production environment?

I'd like to know how complete it is and whether socket AIO is
adequately supported.

Thanks,

--
Dan Aloni
[email protected], [email protected], [email protected], [email protected]


2006-03-06 15:06:39

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version

I think you mean ascertain.

> of Linux 2.6. However by searching I wasn't able to find any
> indication about its current state. Is there anyone using it
> under a production environment?
>
> I'd like to know how complete it is and whether socket AIO is
> adequately supported.
>
> Thanks,
>

AFAIK, it is not yet supported by the sockets layer, and the glibc posix
aio apis do NOT use the kernel aio support. I have done some
experimentation with it by hacking dd, but from what I can tell, it is
not used in any sort of production capacity.

2006-03-06 21:24:15

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version
> of Linux 2.6. However by searching I wasn't able to find any
> indication about its current state. Is there anyone using it
> under a production environment?

For O_DIRECT aio things are pretty stable (barring a patch to improve -EIO
handling). The functionality is used by the various databases, so it gets
a fair amount of exercise.

> I'd like to know how complete it is and whether socket AIO is
> adequately supported.

Socket AIO is not supported yet, but it is useful to get user requests so
that we know there is demand for it.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-06 22:53:10

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Status of AIO

On 3/6/06, Benjamin LaHaise <[email protected]> wrote:
> Socket AIO is not supported yet, but it is useful to get user requests so
> that we know there is demand for it.

I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
for sockets, at least the way we can expect network traffic to be
handled in the near future. Some more radical approaches are needed.
I'll have some proposals which will be part of the talk I have at OLS.

2006-03-06 23:17:16

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

Ulrich Drepper wrote:
>
> I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
> for sockets, at least the way we can expect network traffic to be
> handled in the near future. Some more radical approaches are needed.
> I'll have some proposals which will be part of the talk I have at OLS.


Why do you say it is not suitable? The kernel aio interfaces should
work very well, especially when combined with O_DIRECT.

2006-03-06 23:19:43

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

I'm sending this again because it looks like the original got lost. At
least, I've not seen it show up on the mailing list yet and I sent it 8
hours ago.

Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version

I think you mean ascertain.

> of Linux 2.6. However by searching I wasn't able to find any
> indication about its current state. Is there anyone using it
> under a production environment?
>
> I'd like to know how complete it is and whether socket AIO is
> adequately supported.
>
> Thanks,
>

AFAIK, it is not yet supported by the sockets layer, and the glibc posix
aio apis do NOT use the kernel aio support. I have done some
experimentation with it by hacking dd, but from what I can tell, it is
not used in any sort of production capacity.


2006-03-06 23:38:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 02:53:07PM -0800, Ulrich Drepper wrote:
> I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
> for sockets, at least the way we can expect network traffic to be
> handled in the near future. Some more radical approaches are needed.
> I'll have some proposals which will be part of the talk I have at OLS.

Oh? I've always envisioned that network AIO would be able to use O_DIRECT
style zero copy transmit, and something like I/O AT on the receive side.
The in kernel API provides a lightweight event mechanism that should work
ideally for this purpose.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-07 00:24:40

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Benjamin LaHaise <[email protected]>
Date: Mon, 6 Mar 2006 18:33:00 -0500

> On Mon, Mar 06, 2006 at 02:53:07PM -0800, Ulrich Drepper wrote:
> > I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
> > for sockets, at least the way we can expect network traffic to be
> > handled in the near future. Some more radical approaches are needed.
> > I'll have some proposals which will be part of the talk I have at OLS.
>
> Oh? I've always envisioned that network AIO would be able to use O_DIRECT
> style zero copy transmit, and something like I/O AT on the receive side.
> The in kernel API provides a lightweight event mechanism that should work
> ideally for this purpose.

I think something like net channels will be more effective on receive.

We shouldn't be designing things for the old and inefficient world
where the work is done in software and hardware interrupt context, it
should be moved as close as possible to the compute entities and that
means putting the work all the way into the app itself, if not very
close.

To me, it is not a matter of if we put the networking stack at least
partially into some userland library, but when.

Everyone has their brains wrapped around how OS support for networking
has always been done, and if that particular model is erroneous
(and net channels show good hard evidence that it is), this continued
thought process merely perpetuates the error.

I really dislike it when non-networking people work on these
interfaces. They've all frankly stunk, and they've had several
opportunities to try and get it right.

I want a bona fide networking person to work on any high-performance
networking API we ever decide to actually use.

This is why I am going to sit and wait patiently for Van Jacobson's work
to get published and mature, because it's the only light in the tunnel
since Multics.

Yes, since Multics, that's how bad the existing models for doing these
things are.

2006-03-07 00:47:56

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 04:24:44PM -0800, David S. Miller wrote:
> > Oh? I've always envisioned that network AIO would be able to use O_DIRECT
> > style zero copy transmit, and something like I/O AT on the receive side.
> > The in kernel API provides a lightweight event mechanism that should work
> > ideally for this purpose.
>
> I think something like net channels will be more effective on receive.

Perhaps, but we don't necessarily have to go to that extreme to get the
value of the approach. One way of doing network receive that would let
us keep the same userland API is to have the kernel perform the receive
portion of TCP processing in userspace as a vsyscall. The whole channel
would then be a concept internal to the kernel. Once that works and the
internals have settled down, it might make sense to export an API that
allows us to expose parts of the channel to the user.

Unfortunately, I think that the problem of getting the packets delivered
to the right user is Hard (especially with incoming filters and all the
other features of the stack).

...
> I want a bona fide networking person to work on any high-performance
> networking API we ever decide to actually use.

I'm open to suggestions. =-) So far my thoughts have mostly been limited
to how to make tx faster, at which point you have to go into the kernel
somehow to deal with the virtual => physical address translation (be it
with a locked buffer or whatever) and kicking the hardware. Rx has been
much less interesting simply because the hardware side doesn't offer as
much.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-07 00:51:23

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Benjamin LaHaise <[email protected]>
Date: Mon, 6 Mar 2006 19:42:37 -0500

> I'm open to suggestions. =-) So far my thoughts have mostly been limited
> to how to make tx faster, at which point you have to go into the kernel
> somehow to deal with the virtual => physical address translation (be it
> with a locked buffer or whatever) and kicking the hardware. Rx has been
> much less interesting simply because the hardware side doesn't offer as
> much.

I think any such VM tricks need serious thought. They have serious
consequences as far as cost goes, especially on SMP. Evgeniy has some data
that shows this, and chapter 5 of Network Algorithmics has a lot of
good analysis and paper references on this topic.

2006-03-07 01:29:23

by Dan Aloni

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 04:18:54PM -0500, Benjamin LaHaise wrote:
> On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> > Hello,
> >
> > I'm trying to assert the status of AIO under the current version
> > of Linux 2.6. However by searching I wasn't able to find any
> > indication about its current state. Is there anyone using it
> > under a production environment?
>
> For O_DIRECT aio things are pretty stable (barring a patch to improve -EIO
> handling). The functionality is used by the various databases, so it gets
> a fair amount of exercise.
>
> > I'd like to know how complete it is and whether socket AIO is
> > adequately supported.
>
> Socket AIO is not supported yet, but it is useful to get user requests so
> that we know there is demand for it.

Well, I've written a small test app to see if it works with network
sockets and apparently it did for that small test case (connect()
with aio_read(), loop with aio_error(), and aio_return()). I thought
perhaps the glibc implementation was running behind the scenes, so I
checked to see if a thread was created in the background, and there
wasn't any thread.
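
For reference, a minimal sketch of that kind of test: one read on a
connected TCP socket through glibc's POSIX AIO (<aio.h>), polled with
aio_error() until completion. The address, port, and buffer size are
arbitrary illustration values; compile with -lrt.

#include <aio.h>
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in sa;
	struct aiocb cb;
	char buf[4096];
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(7);			/* arbitrary example port */
	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf);

	if (aio_read(&cb) < 0)			/* queue the read */
		return 1;
	while (aio_error(&cb) == EINPROGRESS)	/* poll until it completes */
		usleep(1000);
	printf("aio_return: %ld\n", (long)aio_return(&cb));
	close(fd);
	return 0;
}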

--
Dan Aloni
[email protected], [email protected], [email protected], [email protected]

2006-03-07 01:35:20

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

David S. Miller wrote:
>
> I think something like net channels will be more effective on receive.
>

What is this "net channels"? I'll do some googling but if you have a
direct reference it would be helpful.

> We shouldn't be designing things for the old and inefficient world
> where the work is done in software and hardware interrupt context, it
> should be moved as close as possible to the compute entities and that
> means putting the work all the way into the app itself, if not very
> close.
>
> To me, it is not a matter of if we put the networking stack at least
> partially into some userland library, but when.
>

Maybe you should try using a microkernel like Mach, then? The Linux way
of doing things is to leave critical services that most user mode code
depends on, such as filesystems and the network stack, in the kernel. I
don't think that's going to change.

> Everyone has their brains wrapped around how OS support for networking
> has always been done, and if that particular model is erroneous
> (and net channels show good hard evidence that it is), this continued
> thought process merely perpetuates the error.
>

Have you taken a look at BSD's kqueue and NT's IO completion port
approach? They allow virtually all of the IO to be offloaded to
hardware DMA, and there's no reason Linux can't do the same with aio and
O_DIRECT. There's no need to completely throw out the stack and start
over, let alone in user mode, to get there.

> I really dislike it when non-networking people work on these
> interfaces. They've all frankly stunk, and they've had several
> opportunities to try and get it right.
>

I agree, the old (non-)blocking IO-style interfaces have all sucked,
which is why it's time to move on to aio. NT has been demonstrating for
10 years now (that's how long ago I wrote an FTPd using IOCPs on NT)
the benefits of async IO. It has been a long time coming, but once the
Linux kernel is capable of zero-copy aio, I will be quite happy.

> I want a bona fide networking person to work on any high-performance
> networking API we ever decide to actually use.
>
> This is why I am going to sit and wait patiently for Van Jacobson's work
> to get published and mature, because it's the only light in the tunnel
> since Multics.
>
> Yes, since Multics, that's how bad the existing models for doing these
> things are.

2006-03-07 01:37:19

by Nicholas Miell

[permalink] [raw]
Subject: Re: Status of AIO

On Tue, 2006-03-07 at 03:30 +0200, Dan Aloni wrote:
> On Mon, Mar 06, 2006 at 04:18:54PM -0500, Benjamin LaHaise wrote:
> > On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> > > Hello,
> > >
> > > I'm trying to assert the status of AIO under the current version
> > > of Linux 2.6. However by searching I wasn't able to find any
> > > indication about its current state. Is there anyone using it
> > > under a production environment?
> >
> > For O_DIRECT aio things are pretty stable (barring a patch to improve -EIO
> > handling). The functionality is used by the various databases, so it gets
> > a fair amount of exercise.
> >
> > > I'd like to know how complete it is and whether socket AIO is
> > > adequately supported.
> >
> > Socket AIO is not supported yet, but it is useful to get user requests so
> > that we know there is demand for it.
>
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scenes, so I
> checked to see if a thread was created in the background, and there
> wasn't any thread.

None of the aio_* functions use the kernel's AIO interface. They're
implemented entirely in userspace using a thread pool.

--
Nicholas Miell <[email protected]>

2006-03-07 01:38:00

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

The aio_* functions are library routines in glibc that are implemented by
spawning threads to use the normal kernel I/O syscalls. They don't use
real async IO in the kernel. I'm not sure why you didn't see the
thread, but if you look up the glibc sources you will see how it works.

To use kernel AIO you make calls to io_submit().
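
For example, a minimal sketch of that interface using the libaio wrappers
(io_setup()/io_submit()/io_getevents()); the file name and sizes are
arbitrary, error handling is omitted, and it links with -laio:

#define _GNU_SOURCE			/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);

	posix_memalign(&buf, 512, 4096);	/* O_DIRECT wants aligned buffers */
	io_setup(1, &ctx);			/* create an AIO context */
	io_prep_pread(&cb, fd, buf, 4096, 0);	/* describe one 4k read at offset 0 */
	io_submit(ctx, 1, cbs);			/* queue it */
	io_getevents(ctx, 1, 1, &ev, NULL);	/* wait for the completion event */
	printf("read returned %ld\n", (long)ev.res);
	io_destroy(ctx);
	close(fd);
	return 0;
}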

Dan Aloni wrote:
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scenes, so I
> checked to see if a thread was created in the background, and there
> wasn't any thread.
>

2006-03-07 01:44:34

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 04:51:29PM -0800, David S. Miller wrote:
> I think any such VM tricks need serious thought. They have serious
> consequences as far as cost goes, especially on SMP. Evgeniy has some data
> that shows this, and chapter 5 of Network Algorithmics has a lot of
> good analysis and paper references on this topic.

VM tricks do suck, so you just have to use the tricks that nobody else
is... My thinking is to do something like the following: have a structure
to reference a set of pages. When it is first created, it takes a reference
on the pages in question, and it is added to the vm_area_struct of the user
so that the vm can poke it for freeing when memory pressure occurs. The
sk_buff dataref also has to have a pointer to the pageref added. Now, the
trick to making it useful is as follows:

struct pageref {
	atomic_t	free_count;
	int		use_count;	/* protected by socket lock */
	...
	unsigned long	user_address;
	unsigned long	length;
	struct socket	*sock;		/* backref for VM */
	struct page	*pages[];
};

The fast path in network transmit becomes:

	if (sock->pageref->... overlaps buf) {
		for each packet built {
			use_count++;
			<add pageref to skb's dataref happily without atomics
			 or memory copying>
		}
	}

Then the kfree_skb() path does an atomic_dec() on pageref->free_count
instead of the page. (Or get rid of the atomic using knowledge about the
fact that a given pageref could only be freed by the network driver it was
given to.) That would make the transmit path bloody cheap, and the tx irq
context no more expensive than it already is.

It's probably easier to show this tx path with code that gets the details
right.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-07 01:45:45

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Tue, Mar 07, 2006 at 03:30:50AM +0200, Dan Aloni wrote:
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scenes, so I
> checked to see if a thread was created in the background, and there
> wasn't any thread.

Unfortunately, it will block in io_submit when it shouldn't.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-07 02:02:45

by Dan Aloni

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 08:39:15PM -0500, Benjamin LaHaise wrote:
> On Mon, Mar 06, 2006 at 04:51:29PM -0800, David S. Miller wrote:
> > I think any such VM tricks need serious thought. They have serious
> > consequences as far as cost goes, especially on SMP. Evgeniy has some data
> > that shows this, and chapter 5 of Network Algorithmics has a lot of
> > good analysis and paper references on this topic.
>
> VM tricks do suck, so you just have to use the tricks that nobody else
> is... My thinking is to do something like the following: have a structure
> to reference a set of pages. When it is first created, it takes a reference
> on the pages in question, and it is added to the vm_area_struct of the user
> so that the vm can poke it for freeing when memory pressure occurs. The
> sk_buff dataref also has to have a pointer to the pageref added. Now, the
> trick to making it useful is as follows:
>
> struct pageref {
> 	atomic_t	free_count;
> 	int		use_count;	/* protected by socket lock */
> 	...
> 	unsigned long	user_address;
> 	unsigned long	length;
> 	struct socket	*sock;		/* backref for VM */
> 	struct page	*pages[];
> };
[...]
>
> It's probably easier to show this tx path with code that gets the details
> right.

This somehow resembles the scatter-gather lists already used in some
subsystems such as the SCSI sg driver.

BTW, you have to make these pages Copy-On-Write before this procedure
starts because you wouldn't want it to accidentally fill the zero page,
i.e. the VM will have to supply a unique set of pages, otherwise it
messes up.

--
Dan Aloni
[email protected], [email protected], [email protected], [email protected]

2006-03-07 02:12:52

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> This somehow resembles the scatter-gather lists already used in some
> subsystems such as the SCSI sg driver.

None of the iovecs are particularly special. What's special here is that
the particulars of the container make the fast path *cheap*.

> BTW, you have to make these pages Copy-On-Write before this procedure
> starts because you wouldn't want it to accidentally fill the zero page,
> i.e. the VM will have to supply a unique set of pages, otherwise it
> messes up.

No, that would be insanely expensive. There's no way this would be done
transparently to the user unless we know that we're blocking until the
transmit is complete.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-07 03:04:13

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Phillip Susi <[email protected]>
Date: Mon, 06 Mar 2006 20:34:46 -0500

> What is this "net channels"? I'll do some googling but if you have a
> direct reference it would be helpful.

You didn't google hard enough; my blog entry on the topic
comes up as the first result when you google for "Van Jacobson
net channels".

> Maybe you should try using a microkernel like Mach, then? The Linux way
> of doing things is to leave critical services that most user mode code
> depends on, such as filesystems and the network stack, in the kernel. I
> don't think that's going to change.

Oh ye of little faith; we don't need to go to a microkernel
architecture to move things like parts of the TCP stack into
user space.

2006-03-07 03:06:14

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Benjamin LaHaise <[email protected]>
Date: Mon, 6 Mar 2006 20:39:15 -0500

> VM tricks do suck, so you just have to use the tricks that nobody else
> is... My thinking is to do something like the following: have a structure
> to reference a set of pages. When it is first created, it takes a reference
> on the pages in question, and it is added to the vm_area_struct of the user
> so that the vm can poke it for freeing when memory pressure occurs. The
> sk_buff dataref also has to have a pointer to the pageref added.

You've just reinvented fbufs, and they have their own known set of
issues.

Please read chapter 5 of Network Algorithmics or ask someone to
paraphrase the content for you. It really covers this completely, and
once you read it you will be able to avoid reinventing the wheel and
falling under the false notion of having invented something :-)

2006-03-07 03:11:03

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Benjamin LaHaise <[email protected]>
Date: Mon, 6 Mar 2006 21:07:36 -0500

> On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> > This somehow resembles the scatter-gather lists already used in some
> > subsystems such as the SCSI sg driver.
>
> None of the iovecs are particularly special. What's special here is that
> the particulars of the container make the fast path *cheap*.

Please read Druschel and Peterson's paper on fbufs and any follow-on
work before going down this path. Fbufs are exactly what you are
proposing as a workaround for the VM cost of page flipping, and the
idea has been around since 1993. :-)

As I mentioned, Chapter 5 of Network Algorithmics discusses this
in detail, and also covers many related attempts such as IO-Lite.

2006-03-07 04:07:37

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

David S. Miller wrote:
> You didn't google hard enough; my blog entry on the topic
> comes up as the first result when you google for "Van Jacobson
> net channels".
>

Thanks, I read the page... I find it to be a little extreme, and zero
copy aio can get the same benefits without all that hassle. Let me
write this as a reply to the article itself:


> With SMP systems this "end host" concept really should be extended to
> the computing entities within the system, that being cpus and threads
> within the box.


I agree; all threads and cpus should be able to concurrently process
network IO, and without wasting cpu cycles copying the data around 6
times. That does not, and should not mean moving the TCP/IP protocol to
user space.


> So, given all that, how do you implement network packet receive
> properly? Well, first of all, you stop doing so much work in interrupt
> (both hard and soft) context. Jamal Hadi Salim and others understood
> this quite well, and NAPI is a direct consequence of that understanding.
> But what Van is trying to show in his presentation is that you can take
> this further, in fact a _lot_ further.


I agree; a minimum of work should be done in interrupt context.
Specifically, the interrupt handler should simply insert and remove
packets from the queue and program the hardware registers for DMA access
to the packet buffer memory. If the hardware supports scatter/gather
DMA, then the upper layers can enqueue packet buffers to send/receive
into/from, and the interrupt handler just pulls packets off this queue
when the hardware raises an interrupt to indicate it has completed the
DMA transfer.


This is how NT and I believe BSD have been doing things for some time
now, and the direction the Linux kernel is moving in.

> A Van Jacobson channel is a path for network packets. It is
> implemented as an array'd queue of packets. There is state for the
> producer and the consumer, and it all sits in different cache lines so
> that it is never the case that both the consumer and producer write to
> shared cache lines. Network cards want to know purely about packets, yet
> for years we've been enforcing an OS determined model and abstraction
> for network packets upon the drivers for such cards. This has come in
> the form of "mbufs" in BSD and "SKBs" under Linux, but the channels are
> designed so that this is totally unnecessary. Drivers no longer need to
> know about what the OS packet buffers look like, channels just contain
> pointers to packet data.

I must admit, I am a bit confused by this. It sounds a lot like the pot
calling the kettle black to me. Aren't SKBs and mbufs already just a
form of the very queue of packets being advocated here? Don't they
simply list memory ranges for the driver to transfer to the nic as a
packet?

> The next step is to build channels to sockets. We need some
> intelligence in order to map packets to channels, and this comes in the
> form of a tiny packet classifier the drivers use on input. It reads the
> protocol, ports, and addresses to determine the flow ID and uses this to
> find a channel. If no matching flow is found, we fall back to the basic
> channel we created in the first step. As sockets are created, channel
> mappings are installed and thus the driver classifier can find them
> later. The socket wakes up, and does protocol input processing and
> copying into userspace directly out of the channel.

How is this any different from what we have now, other than bypassing
the kernel buffer? The tcp/ip layer looks at the incoming packet to
decide what socket it goes with, and copies it to the waiting buffer.
Right now that waiting buffer is a kernel buffer, because at the time
the packet arrives, the kernel does not have any user buffers.

If the user process uses aio though, it can hand the kernel a few
buffers to receive into ahead of time so when the packets have been
classified, they can be copied directly to the user buffer.

> And in the next step you can have the socket ask for a channel ID
> (with a getsockopt or something like that), have it mmap() a receive
> ring buffer into user space, and the mapped channel just tosses the
> packet data into that mmap()'d area and wakes up the process. The
> process has a mini TCP receive engine in user space.

There is no need to use mmap() and burden the user code with
implementing TCP itself ( which is quite a lot of work ). It can hand
the kernel buffers by queuing multiple O_DIRECT aio requests and the
kernel can directly dump the data stream there after stripping off the
headers. When sending it can program the hardware to directly
scatter/gather DMA from the user buffer attached to the aio request.
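
A rough sketch of the buffer-posting pattern described above follows; it
uses the POSIX aio_* calls purely for illustration (as noted earlier in
the thread, the kernel AIO interface does not cover sockets yet and
glibc's implementation is thread-based), and with a stream socket the
completion order also determines data order, so a real implementation
would need to track sequencing. Compile with -lrt.

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define NBUFS	4
#define BUFSZ	65536

static char bufs[NBUFS][BUFSZ];
static struct aiocb cbs[NBUFS];

void post_receives(int sockfd)
{
	int i;

	for (i = 0; i < NBUFS; i++) {
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = sockfd;
		cbs[i].aio_buf = bufs[i];
		cbs[i].aio_nbytes = BUFSZ;
		aio_read(&cbs[i]);		/* queue the read */
	}
}

void reap_and_repost(void)
{
	ssize_t n;
	int i;

	for (i = 0; i < NBUFS; i++) {
		if (aio_error(&cbs[i]) == EINPROGRESS)
			continue;		/* still pending */
		n = aio_return(&cbs[i]);	/* bytes received */
		(void)n;	/* ...hand the data to the application... */
		aio_read(&cbs[i]);		/* post the buffer again */
	}
}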

> And you can take this even further than that (think, remote systems).
> At each stage Van presents a table of profiled measurements for a normal
> bulk TCP data transfer. The final stage of channeling all the way to
> userspace is some 6 times faster than what we're doing today, yes I said
> 6 times faster that isn't a typo.


Yes, we can and should have a 6 times speed up, but as I've explained
above, NT has had that for 10 years without having to push TCP into user
space.

2006-03-07 06:02:24

by David Miller

[permalink] [raw]
Subject: Re: Status of AIO

From: Phillip Susi <[email protected]>
Date: Mon, 06 Mar 2006 23:07:05 -0500

> How is this any different from what we have now, other than bypassing
> the kernel buffer? The tcp/ip layer looks at the incoming packet to
> decide what socket it goes with, and copies it to the waiting buffer.
> Right now that waiting buffer is a kernel buffer, because at the time
> the packet arrives, the kernel does not have any user buffers.

The whole idea is to figure out what socket gets the packet
without going through the IP and TCP stack at all, in the
hardware interrupt handler, using a tiny classifier that's
very fast and can be implemented in hardware.

Please wrap your brain around the idea a little longer than
the 15 or so minutes you have spent on it thus far... thanks.

> Yes, we can and should have a 6 times speed up, but as I've explained
> above, NT has had that for 10 years without having to push TCP into user
> space.

That's complete BS.

2006-03-07 07:31:41

by Dan Aloni

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 09:07:36PM -0500, Benjamin LaHaise wrote:
> On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> > This somehow resembles the scatter-gather lists already used in some
> > subsystems such as the SCSI sg driver.
>
> None of the iovecs are particularly special. What's special here is that
> the particulars of the container make the fast path *cheap*.
>
> > BTW, you have to make these pages Copy-On-Write before this procedure
> > starts because you wouldn't want it to accidentally fill the zero page,
> > i.e. the VM will have to supply a unique set of pages, otherwise it
> > messes up.
>
> No, that would be insanely expensive. There's no way this would be done
> transparently to the user unless we know that we're blocking until the
> transmit is complete.

Sure it can't be transparent to the user, but you can just require the user
to perform mlock on the VMA and you get around this problem.

--
Dan Aloni
[email protected], [email protected], [email protected], [email protected]

2006-03-07 16:07:55

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

David S. Miller wrote:
> The whole idea is to figure out what socket gets the packet
> without going through the IP and TCP stack at all, in the
> hardware interrupt handler, using a tiny classifier that's
> very fast and can be implemented in hardware.
>

AFAIK, "going through the IP and TCP stack" just means passing a quick
packet classifier to locate the corresponding socket. It would be nice
to be able to possibly offload that to the hardware, but you don't need
to throw out the baby ( tcp/ip stack ) with the bathwater to get there.

Maybe an interface could be constructed to allow the higher
layers to pass down some sort of ASL-type bytecode classifier to the
NIC driver, which could either run it via a kernel software
interpreter, or convert it to firmware code to load into the hardware.
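
As an aside, classic BPF socket filters are one existing example of such a
bytecode classifier in the kernel; a minimal (accept-everything) sketch of
attaching one looks roughly like this, where a real classifier would match
protocol and ports as described above:

#include <linux/filter.h>
#include <sys/socket.h>

int attach_accept_all(int sockfd)
{
	struct sock_filter code[] = {
		BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept the whole packet */
	};
	struct sock_fprog prog = {
		.len = sizeof(code) / sizeof(code[0]),
		.filter = code,
	};

	return setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER,
			  &prog, sizeof(prog));
}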

> Please wrap your brain around the idea a little longer than
> the 15 or so minutes you have spent on it thus far... thanks.
>

I've had my brain wrapped around these sorts of networking problems for
over 10 years now, so I think I have a fair handle on things. Certainly
enough to carry on a discussion about it.

>> Yes, we can and should have a 6 times speed up, but as I've explained
>> above, NT has had that for 10 years without having to push TCP into user
>> space.
>
> That's complete BS.

Error, does not compute.

Your holier-than-thou attitude does not a healthy discussion make. I
explained the methods that have been in use on NT to achieve a 6-fold
decrease in CPU utilization for bulk network IO, and how they can be
applied to the Linux kernel without radical changes. If you don't
understand it, then ask sensible questions rather than just crying "That's
complete BS!"

We already have O_DIRECT aio for disk drives that can do zero copy;
there's no reason not to apply that to the network stack as well.


2006-03-07 16:40:44

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Status of AIO

On Mon, Mar 06, 2006 at 07:06:33PM -0800, David S. Miller wrote:
> You've just reinvented fbufs, and they have their own known set of
> issues.

> Please read chapter 5 of Network Algorithmics or ask someone to
> paraphrase the content for you. It really covers this completely, and
> once you read it you will be able to avoid reinventing the wheel and
> falling under the false notion of having invented something :-)

Nothing in software is particularly unique given the same set of
requirements. Unfortunately, none of the local bookstores have a copy
of Network Algorithmics in stock, so it will be a few days before it
arrives. What problems does this approach have? Aside from the fact that
it's useless unless implemented on top of AIO-type semantics, it looks
like a good way to improve performance.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-08 07:09:44

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Status of AIO

On 3/6/06, Phillip Susi <[email protected]> wrote:
> Why do you say it is not suitable? The kernel aio interfaces should
> work very well, especially when combined with O_DIRECT.

What has network I/O to do with O_DIRECT? I'm talking about async network I/O.

2006-03-08 15:59:41

by Phillip Susi

[permalink] [raw]
Subject: Re: Status of AIO

Ulrich Drepper wrote:
> What has network I/O to do with O_DIRECT? I'm talking about async network I/O.

O_DIRECT allows for zero copy IO, which saves a boatload of cpu cycles.
For disk IO it is possible to use O_DIRECT without aio, but there is
generally a loss of efficiency doing so. For network IO, O_DIRECT is
not even possible without aio.
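
For what it's worth, O_DIRECT on disk also comes with alignment
constraints; a minimal sketch (512-byte alignment assumed here, the exact
requirement depends on the filesystem and device, error handling minimal):

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int read_direct(const char *path)
{
	void *buf;
	ssize_t n;
	int fd = open(path, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	if (posix_memalign(&buf, 512, 4096)) {	/* 512-byte aligned buffer */
		close(fd);
		return -1;
	}
	n = read(fd, buf, 4096);		/* aligned length, offset 0 */
	free(buf);
	close(fd);
	return n < 0 ? -1 : 0;
}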

By using aio and O_DIRECT for network IO, you can achieve massive
performance and scalability gains.

You said before that the kernel aio interface is not suitable for
sockets. Why not?