As presented in our talk at this year's OLS, the Bensley platform, which
will be out in early 2006, will have an asynchronous DMA engine. It can be
used to offload copies from the CPU, such as the kernel copies of received
packets into the user buffer.
The code consists of the following sections:
1) The HW driver for the DMA engine device
2) The DMA subsystem, which abstracts the HW details from users of the
async DMA
3) Modifications to net/ to make use of the DMA engine for receive copy
offload:
3a) Code to register the net stack as a "DMA client"
3b) Code to pin and unpin pages associated with a user buffer
3c) Code to initiate async DMA transactions in the net receive path
Today we are releasing 2, 3a, and 3b, as well as "testclient", a throwaway
driver we wrote to demonstrate the DMA subsystem API. We will be releasing
3c shortly. We will be releasing 1 (the HW driver) when the platform ships
early next year. Until then, the code doesn't really *do* anything, but we
wanted to release what we could right away, and start getting some
feedback.
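To give a feel for the shape of the client-side API without digging through
the patches, here is a purely illustrative sketch of queuing an offloaded
copy. Every name below is made up for this mail and does not match the real
interface in include/linux/dmaengine.h; it only shows the idea (queue a copy,
wait on a cookie for completion):

/* illustrative only -- invented names, not the dmaengine.h interface */
#include <linux/string.h>
#include <linux/types.h>

struct example_dma_chan { int dummy; };  /* one channel of the copy engine */
typedef int example_dma_cookie_t;        /* identifies a queued copy */

/* queue dst <- src; in this stand-in stub the CPU simply does the copy */
static example_dma_cookie_t
example_dma_memcpy(struct example_dma_chan *chan, void *dst,
                   const void *src, size_t len)
{
        memcpy(dst, src, len);
        return 1;
}

/* wait for a previously queued copy to complete (a no-op in this stub) */
static void example_dma_wait(struct example_dma_chan *chan,
                             example_dma_cookie_t cookie)
{
}

/* roughly what the receive path does: queue the copy of skb data into a
 * pinned user buffer, then wait before returning to user space */
static void example_rx_copy(struct example_dma_chan *chan, void *pinned_ubuf,
                            const void *skb_data, size_t len)
{
        example_dma_wait(chan, example_dma_memcpy(chan, pinned_ubuf,
                                                  skb_data, len));
}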
Against 2.6.14:
patch 1: DMA engine
patch 2: iovec pin/unpin code; register net as a DMA client
patch 3: testclient
overall diffstat information:
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/dma/Kconfig | 40 ++
drivers/dma/Makefile | 5
drivers/dma/cb_list.h | 12
drivers/dma/dmaengine.c | 394 ++++++++++++++++++++++++
drivers/dma/testclient.c | 132 ++++++++
include/linux/dmaengine.h | 268 ++++++++++++++++
net/core/Makefile | 3
net/core/dev.c | 78 ++++
net/core/user_dma.c | 422 ++++++++++++++++++++++++++
11 files changed, 1356 insertions(+), 1 deletion(-)
Regards -- Andy and Chris
Andrew Grover wrote:
> As presented in our talk at this year's OLS, the Bensley platform, which
> will be out in early 2006, will have an asynchronous DMA engine. It can be
> used to offload copies from the CPU, such as the kernel copies of received
> packets into the user buffer.
IOAT is super-neat stuff.
In addition to helping speed up network RX, I would like to see how
possible it is to experiment with IOAT uses outside of networking.
Sample ideas: VM page pre-zeroing. ATA PIO data xfers (async copy to
static buffer, to dramatically shorten length of kmap+irqsave time).
Extremely large memcpy() calls.
Additionally, current IOAT is memory->memory. I would love to be able
to convince Intel to add transforms and checksums, to enable offload of
memory->transform->memory and memory->checksum->result operations like
sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common
operations. All of that could be made async.
Jeff
On Wed, 2005-11-23 at 17:06 -0500, Jeff Garzik wrote:
> Sample ideas: VM page pre-zeroing. ATA PIO data xfers (async copy to
> static buffer, to dramatically shorten length of kmap+irqsave time).
> Extremely large memcpy() calls.
ATA PIO copies are 512 bytes of memory per sector, and that is usually
already in cache and on cache line boundaries. You won't even be able to
measure the CPU doing it. I can't see the I/O engine sync cost being
worth it.
Might just about help large transfers, I guess, but you don't do
multisector PIO, which is the only case where you'd get perhaps 8K per I/O.
> Additionally, current IOAT is memory->memory. I would love to be able
> to convince Intel to add transforms and checksums,
Not just transforms but also masks and maybe even merges and textures
would be rather handy 8)
On Wed, 2005-11-23 at 12:26 -0800, Andrew Grover wrote:
> early next year. Until then, the code doesn't really *do* anything, but we
> wanted to release what we could right away, and start getting some
> feedback.
First comment, partly based on Jeff Garzik's comments: if you added an
"operation" to the base functions and an operation mask to the DMA
engines, it would become possible to support engines that can do other ops
(e.g. abusing an NCR53c8xx for both copy and clear).
Second one: you obviously tested this somehow. Was that all done by
simulation, or do you have a "CPU" memcpy test engine for use before the
hardware pops up?
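(Something as simple as the sketch below would do for bring-up - the names
here are invented and have nothing to do with the posted patches. It just
memcpy()s from a workqueue so the completion path can be exercised without
the hardware.)

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/workqueue.h>

struct fake_dma_req {
        struct work_struct work;
        void *dst;
        const void *src;
        size_t len;
        void (*done)(void *ctx);
        void *ctx;
};

/* runs in process context; the CPU stands in for the engine */
static void fake_dma_work(void *data)
{
        struct fake_dma_req *req = data;

        memcpy(req->dst, req->src, req->len);
        if (req->done)
                req->done(req->ctx);
        kfree(req);
}

static int fake_dma_memcpy(void *dst, const void *src, size_t len,
                           void (*done)(void *ctx), void *ctx)
{
        struct fake_dma_req *req = kmalloc(sizeof(*req), GFP_ATOMIC);

        if (!req)
                return -ENOMEM;
        req->dst = dst;
        req->src = src;
        req->len = len;
        req->done = done;
        req->ctx = ctx;
        INIT_WORK(&req->work, fake_dma_work, req);      /* old 3-arg form */
        schedule_work(&req->work);
        return 0;
}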
On Wed, Nov 23, 2005 at 05:06:42PM -0500, Jeff Garzik wrote:
> IOAT is super-neat stuff.
The main problem I see is that it'll likely only pay off when you can keep
the queue of copies long (to amortize the cost of
talking to an external chip). At least for the standard recvmsg
skb->user space, user space-> skb cases these queues are
likely short in most cases. That's because most applications
do relatively small recvmsg or sendmsgs.
It definitely will need a threshold under which it is disabled.
With bad luck the threshold will be high enough that it doesn't
help very often :/
Longer term the right way to handle this would be likely to use
POSIX AIO on sockets. With that interface it would be easier
to keep long queues of data in flight, which would be best for
the DMA engine.
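Roughly what I mean from the application side (interface shape only; current
glibc implements POSIX AIO with threads and the kernel does not yet do real
async socket reads):

#include <aio.h>
#include <string.h>

#define NREQ    16
#define BUFSZ   (64 * 1024)

/* queue NREQ reads on one socket in a single call, so the kernel
 * (and eventually the DMA engine) has a deep pipeline to work on */
static int queue_socket_reads(int sock, struct aiocb cb[NREQ],
                              char *buf[NREQ])
{
        struct aiocb *list[NREQ];
        int i;

        for (i = 0; i < NREQ; i++) {
                memset(&cb[i], 0, sizeof(cb[i]));
                cb[i].aio_fildes = sock;
                cb[i].aio_buf = buf[i];
                cb[i].aio_nbytes = BUFSZ;
                cb[i].aio_lio_opcode = LIO_READ;
                list[i] = &cb[i];
        }
        return lio_listio(LIO_NOWAIT, list, NREQ, NULL);
}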
> In addition to helping speed up network RX, I would like to see how
> possible it is to experiment with IOAT uses outside of networking.
> Sample ideas: VM page pre-zeroing. ATA PIO data xfers (async copy to
> static buffer, to dramatically shorten length of kmap+irqsave time).
> Extremely large memcpy() calls.
Another proposal was swiotlb.
But it's not clear it's a good idea: a lot of these applications prefer to
have the target in cache. And IOAT will force it out of cache.
> Additionally, current IOAT is memory->memory. I would love to be able
> to convince Intel to add transforms and checksums, to enable offload of
> memory->transform->memory and memory->checksum->result operations like
> sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common
> operations. All of that could be made async.
I remember the registers in the Amiga Blitter for this and I'm
still scared... Maybe it's better to keep it simple.
-Andi
Andrew Grover wrote:
> As presented in our talk at this year's OLS, the Bensley platform, which
> will be out in early 2006, will have an asynchronous DMA engine. It can be
> used to offload copies from the CPU, such as the kernel copies of received
> packets into the user buffer.
More than a one-paragraph description would be nice... URLs to OLS and
IDF presentations, other info?
Jeff
Andrew Grover wrote:
> overall diffstat information:
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/dma/Kconfig | 40 ++
> drivers/dma/Makefile | 5
> drivers/dma/cb_list.h | 12
> drivers/dma/dmaengine.c | 394 ++++++++++++++++++++++++
> drivers/dma/testclient.c | 132 ++++++++
> include/linux/dmaengine.h | 268 ++++++++++++++++
> net/core/Makefile | 3
> net/core/dev.c | 78 ++++
> net/core/user_dma.c | 422 ++++++++++++++++++++++++++
> 11 files changed, 1356 insertions(+), 1 deletion(-)
Overall, there was a distinct lack of any useful
description/documentation over and above the code itself.
Jeff
Alan Cox wrote:
>>Additionally, current IOAT is memory->memory. I would love to be able
>>to convince Intel to add transforms and checksums,
>
>
> Not just transforms but also masks and maybe even merges and textures
> would be rather handy 8)
Ah yes: I totally forgot to mention XOR.
Software RAID would love that.
Jeff
Andi Kleen wrote:
> Longer term the right way to handle this would be likely to use
> POSIX AIO on sockets. With that interface it would be easier
> to keep long queues of data in flight, which would be best for
> the DMA engine.
Agreed.
For my own userland projects, I'm starting to feel the need for network
AIO, since it is more natural: the hardware operations themselves are
asynchronous.
>>In addition to helping speed up network RX, I would like to see how
>>possible it is to experiment with IOAT uses outside of networking.
>>Sample ideas: VM page pre-zeroing. ATA PIO data xfers (async copy to
>>static buffer, to dramatically shorten length of kmap+irqsave time).
>>Extremely large memcpy() calls.
>
>
> Another proposal was swiotlb.
That's an interesting thought.
> But it's not clear it's a good idea: a lot of these applications prefer to
> have the target in cache. And IOAT will force it out of cache.
>
>
>>Additionally, current IOAT is memory->memory. I would love to be able
>>to convince Intel to add transforms and checksums, to enable offload of
>>memory->transform->memory and memory->checksum->result operations like
>>sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common
>>operations. All of that could be made async.
>
>
> I remember the registers in the Amiga Blitter for this and I'm
> still scared... Maybe it's better to keep it simple.
We're talking about CISC here! ;-) ;-)
[note: I'm the type of person who would stuff the kernel + glibc onto an
FPGA, if I could]
I would love to see Intel, AMD, VIA (others?) compete by adding selected
transforms/checksums/hashes to their chips, through this method. Just
provide a method to enumerate what transforms are supported on <this>
chip...
Jeff
On Wed, 2005-11-23 at 23:30 +0100, Andi Kleen wrote:
> Another proposal was swiotlb.
I was hoping Intel might have rediscovered the IOMMU by then and be back
on feature parity with the VAX 11/750
>
> But it's not clear it's a good idea: a lot of these applications prefer to
> have the target in cache. And IOAT will force it out of cache.
This is true for some cases but not all for iotlb
CPU-generated data going out that won't be rewritten immediately should
be a cheap path not needing the cache. Incoming data would invalidate
the cache anyway if it arrives by DMA, so the IOAT engine would move it
asynchronously from the CPU without cache harm.
Might also be interesting to use one half of a hyperthreaded CPU as a
copier using the streaming instructions; might be better than turning it
off to improve performance?
Alan
Alan Cox wrote:
> Might also be interesting to use one half of a hyperthreaded CPU as a
> copier using the streaming instructions; might be better than turning it
> off to improve performance?
That's pretty interesting too...
Jeff
On Wed, Nov 23, 2005 at 11:30:08PM +0100, Andi Kleen wrote:
> The main problem I see is that it'll likely only pay off when you can keep
> the queue of copies long (to amortize the cost of
> talking to an external chip). At least for the standard recvmsg
> skb->user space, user space-> skb cases these queues are
> likely short in most cases. That's because most applications
> do relatively small recvmsg or sendmsgs.
Don't forget that there are benefits of not polluting the cache with the
traffic for the incoming skbs.
> Longer term the right way to handle this would be likely to use
> POSIX AIO on sockets. With that interface it would be easier
> to keep long queues of data in flight, which would be best for
> the DMA engine.
Yes, that's something I'd like to try soon.
> But it's not clear it's a good idea: a lot of these applications prefer to
> have the target in cache. And IOAT will force it out of cache.
In the I/O AT case it might make sense to do a few prefetch()es of the
userland data on the return-to-userspace code path. Similarly, we should
make sure that network drivers prefetch the header at the earliest possible
time, too.
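On the driver side I mean something like this sketch; prefetch() is the real
helper from <linux/prefetch.h>, but the surrounding function is made up for
illustration:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* illustrative RX-loop fragment: start pulling the next packet's
 * headers into cache while the current one is being processed */
static void rx_handle_one(struct sk_buff *skb, struct sk_buff *next_skb)
{
        if (next_skb)
                prefetch(next_skb->data);

        skb->protocol = eth_type_trans(skb, skb->dev);
        netif_receive_skb(skb);
}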
> I remember the registers in the Amiga Blitter for this and I'm
> still scared... Maybe it's better to keep it simple.
*grin* but you could use it for such cool tasks as MFM en/decoding! =-)
-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler
Don't Email: <[email protected]>.
From: Benjamin LaHaise <[email protected]>
Date: Wed, 23 Nov 2005 19:17:01 -0500
> Similarly, we should make sure that network drivers prefetch the
> header at the earliest possible time, too.
Several do already, thankfully :) At last check skge, tg3, chelsio,
and ixgb do the necessary prefetching on receive.
On Wed, Nov 23, 2005 at 07:17:01PM -0500, Benjamin LaHaise wrote:
> On Wed, Nov 23, 2005 at 11:30:08PM +0100, Andi Kleen wrote:
> > The main problem I see is that it'll likely only pay off when you can keep
> > the queue of copies long (to amortize the cost of
> > talking to an external chip). At least for the standard recvmsg
> > skb->user space, user space-> skb cases these queues are
> > likely short in most cases. That's because most applications
> > do relatively small recvmsg or sendmsgs.
>
> Don't forget that there are benefits of not polluting the cache with the
> traffic for the incoming skbs.
Is that a general benefit outside benchmarks? I would expect
most real programs to actually do something with the data
- and that usually involves needing it in cache.
> > But it's not clear it's a good idea: a lot of these applications prefer to
> > have the target in cache. And IOAT will force it out of cache.
>
> In the I/O AT case it might make sense to do a few prefetch()es of the
> userland data on the return-to-userspace code path.
Some prefetches for user space might be a good idea, yes.
> Similarly, we should
> make sure that network drivers prefetch the header at the earliest possible
> time, too.
It's kind of done already, but it's tricky to get right because
the prefetch distances up to the point of use are not really long enough.
-Andi
Andi Kleen wrote:
>>Don't forget that there are benefits of not polluting the cache with the
>>traffic for the incoming skbs.
>>
>>
>
>Is that a general benefit outside benchmarks? I would expect
>most real programs to actually do something with the data
>- and that usually involves needing it in cache.
>
>
>
As an example, an NFS server reads some data pages using iSCSI and sends
them using NFS/TCP (or vice versa).
>>In the I/O AT case it might make sense to do a few prefetch()es of the
>>userland data on the return-to-userspace code path.
>>
>>
>
>Some prefetches for user space might be a good idea yes
>
>
>
As long as they can be turned off. Not all userspace applications want to
touch the data immediately.
On Thu, Nov 24, 2005 at 05:24:34PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
>
> >>Don't forget that there are benefits of not polluting the cache with the
> >>traffic for the incoming skbs.
> >>
> >>
> >
> >Is that a general benefit outside benchmarks? I would expect
> >most real programs to actually do something with the data
> >- and that usually involves needing it in cache.
> >
> >
> >
> As an example, an NFS server reads some data pages using iSCSI and sends
> them using NFS/TCP (or vice versa).
For TX this can be done zero-copy using a sendfile-like setup.
For RX it may help - but my point was that most applications
are not structured in this simple way.
> >>In the I/O AT case it might make sense to do a few prefetch()es of the
> >>userland data on the return-to-userspace code path.
> >>
> >>
> >
> >Some prefetches for user space might be a good idea yes
> >
> >
> >
> As long as they can be turned off. Not all userspace applications want to
> touch the data immediately.
Perhaps. And lots of others might. Of course the simple
network benchmarks don't, so the numbers on them look good.
Just pointing out that it's not clear it will always be a big help.
-Andi
Andi Kleen wrote:
>>>
>>>
>>As an example, an NFS server reads some data pages using iSCSI and sends
>>them using NFS/TCP (or vice versa).
>>
>>
>
>For TX this can be done zero copy using a sendfile like setup.
>
>
Yes, or with aio send for anonymous memory.
>For RX it may help - but my point was that most applications
>are not structured in this simple way.
>
>
>
Agreed. But those that do care, care very much. The data mover
applications, simply because they don't touch the data, expect very high
bandwidth.
>>As long as they can be turned off. Not all userspace applications want to
>>touch the data immediately.
>>
>>
>
>Perhaps. And lots of others might. Of course the simple
>network benchmarks don't, so the numbers on them look good.
>
>
>
There are very real non-benchmark applications that want this.
>Just pointing out that it's not clear it will always be a big help.
>
>
>
Agree it should default to in-cache.
> >Just pointing out that it's not clear it will always be a big help.
> >
> >
> >
> Agree it should default to in-cache.
This would mean no DMA engine by default.
Clearly there needs to be some heuristic to decide by default. We'll see how
effective it will be in the end.
-Andi
On Wed, Nov 23, 2005 at 05:45:36PM -0500, Jeff Garzik wrote:
> Andrew Grover wrote:
> >As presented in our talk at this year's OLS, the Bensley platform, which
> >will be out in early 2006, will have an asynchronous DMA engine. It can be
> >used to offload copies from the CPU, such as the kernel copies of received
> >packets into the user buffer.
>
> More than a one-paragraph description would be nice... URLs to OLS and
> IDF presentations, other info?
>
> Jeff
>
FYI,
OLS paper can be found at
http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf
Starting at page 281.
Other info can be found at
http://www.intel.com/technology/ioacceleration/index.htm
On Wed, 23 Nov 2005, Jeff Garzik wrote:
> Alan Cox wrote:
> >>Additionally, current IOAT is memory->memory. I would love to be able
> >>to convince Intel to add transforms and checksums,
> >
> >
> > Not just transforms but also masks and maybe even merges and textures
> > would be rather handy 8)
>
>
> Ah yes: I totally forgot to mention XOR.
>
> Software RAID would love that.
A number of embedded processors already have HW that does these kinds of
things. On Freescale PPC processors there have been general-purpose DMA
engines for mem<->mem and, more recently, additional crypto engines that
allow for hashing, XOR, and security.
I'm actually searching for any examples of drivers that deal with the
issues related to DMA'ing directly to and from user space memory.
I have an ioctl based driver that does copies back and forth between user
and kernel space and would like to remove that since the crypto engine has
full scatter/gather capability.
The only significant effort I've come across is Peter Chubb's work for
user mode drivers which has some code for handling pinning of the user
space memory and what looks like generation of a scatter list.
- kumar
Kumar> I'm actually searching for any examples of drivers that
Kumar> deal with the issues related to DMA'ing directly to and
Kumar> from user space memory.
It's not quite the same story as what you're doing with DMA engines
inside the CPU, but you could look at drivers/infiniband, particularly
drivers/infiniband/core/uverbs_mem.c. That handles pinning and
getting DMA addresses for user memory that will be used as a DMA
target in the future.
- R.
On Thu, 2005-12-08 at 16:13 -0600, Kumar Gala wrote:
> I'm actually searching for any examples of drivers that deal with the
> issues related to DMA'ing directly to and from user space memory.
Look at drivers/media/video for several examples. Essentially in 2.6
get_user_pages() gives you page structs and pins the pages you need.
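A minimal sketch of that pattern (the helper below and its simplifications
are mine, not lifted from any particular driver; error handling is mostly
omitted):

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <asm/scatterlist.h>

/* pin 'len' bytes of a user buffer and describe it in a scatterlist;
 * callers must later set_page_dirty_lock() pages the device wrote to
 * and drop each one with page_cache_release() */
static int pin_user_buffer(unsigned long uaddr, size_t len,
                           struct page **pages, struct scatterlist *sg)
{
        int nr_pages = (offset_in_page(uaddr) + len + PAGE_SIZE - 1)
                        >> PAGE_SHIFT;
        int i, got;

        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                             nr_pages, 1 /* write */, 0 /* force */,
                             pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (got < nr_pages)
                return -EFAULT; /* (should release the pages we did get) */

        for (i = 0; i < got; i++) {
                sg[i].page   = pages[i];
                sg[i].offset = i ? 0 : offset_in_page(uaddr);
                sg[i].length = min_t(size_t, len, PAGE_SIZE - sg[i].offset);
                len -= sg[i].length;
        }
        return got;
}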
On Thu, Dec 08, 2005 at 04:13:52PM -0600, Kumar Gala ([email protected]) wrote:
> On Wed, 23 Nov 2005, Jeff Garzik wrote:
>
> > Alan Cox wrote:
> > >>Additionally, current IOAT is memory->memory. I would love to be able
> > >>to convince Intel to add transforms and checksums,
> > >
> > >
> > > Not just transforms but also masks and maybe even merges and textures
> > > would be rather handy 8)
> >
> >
> > Ah yes: I totally forgot to mention XOR.
> >
> > Software RAID would love that.
>
> A number of embedded processors already have HW that does these kinds of
> things. On Freescale PPC processors there have been general-purpose DMA
> engines for mem<->mem and, more recently, additional crypto engines that
> allow for hashing, XOR, and security.
>
> I'm actually searching for any examples of drivers that deal with the
> issues related to DMA'ing directly to and from user space memory.
>
> I have an ioctl based driver that does copies back and forth between user
> and kernel space and would like to remove that since the crypto engine has
> full scatter/gather capability.
>
> The only significant effort I've come across is Peter Chubb's work for
> user mode drivers which has some code for handling pinning of the user
> space memory and what looks like generation of a scatter list.
Acrypto supports crypto processing directly in userspace pages.
In 2.6 it is quite easy using get_user_pages().
> - kumar
>
--
Evgeniy Polyakov