2009-06-03 21:33:05

by Leon Woestenberg

[permalink] [raw]
Subject: Re: splice methods in character device driver

Hello all,

On Wed, May 13, 2009 at 6:59 PM, Steve Rottinger <[email protected]> wrote:
> is passing in the pages into splice_to_pipe. ?The pages are associated
> with a PCI BAR, not main memory. ?I'm wondering if this could be a problem?
>
Good question; my newbie answer would be the pages need to be mapped
in kernel space.

I have a similar use case but with memory being DMA'd to host main
memory (instead of the data sitting in your PCI device) in a character
device driver. The driver is a complete rewrite from scratch from
what's currently sitting-butt-ugly in staging/altpcichdma.c
so-please-don't-look-there.

I have already implemented zero-latency overlapping transfers in the
DMA engine (i.e. it never sits idle if async I/O is performed through
threads), now it would be really cool to add zero-copy.

What is it my driver is expected to do?

.splice_read:

- Allocate a bunch of single pages
- Create a scatter-gather list
- "stuff the data pages in question into a struct page *pages[]." a la
"fs/splice.c:vmsplice_to_pipe()"
- Start the DMA from the device to the pages (i.e. the transfer)
- Return.

.splice_write:

- Create a scatter-gather list

interrupt handler / DMA service routine:
- device book keeping
- wake_up_interruptible(transfer_queue)

.confirm():

"then you need to provide a suitable ->confirm() hook that can wait on
this IO to complete if needed."
- wait_on_event_interruptibe(transfer_queue)

.release():

- release the pages

.steal():

unsure

.map

unsure

Regards,
--
Leon


2009-06-04 07:32:24

by Jens Axboe

[permalink] [raw]
Subject: Re: splice methods in character device driver

On Wed, Jun 03 2009, Leon Woestenberg wrote:
> Hello all,
>
> On Wed, May 13, 2009 at 6:59 PM, Steve Rottinger <[email protected]> wrote:
> > is passing in the pages into splice_to_pipe. ?The pages are associated
> > with a PCI BAR, not main memory. ?I'm wondering if this could be a problem?
> >
> Good question; my newbie answer would be the pages need to be mapped
> in kernel space.

That is what the ->map() hook is for.

> I have a similar use case but with memory being DMA'd to host main
> memory (instead of the data sitting in your PCI device) in a character
> device driver. The driver is a complete rewrite from scratch from
> what's currently sitting-butt-ugly in staging/altpcichdma.c
> so-please-don't-look-there.
>
> I have already implemented zero-latency overlapping transfers in the
> DMA engine (i.e. it never sits idle if async I/O is performed through
> threads), now it would be really cool to add zero-copy.
>
> What is it my driver is expected to do?
>
> .splice_read:
>
> - Allocate a bunch of single pages
> - Create a scatter-gather list
> - "stuff the data pages in question into a struct page *pages[]." a la
> "fs/splice.c:vmsplice_to_pipe()"
> - Start the DMA from the device to the pages (i.e. the transfer)
> - Return.
>
> .splice_write:
>
> - Create a scatter-gather list
>
> interrupt handler / DMA service routine:
> - device book keeping
> - wake_up_interruptible(transfer_queue)
>
> .confirm():
>
> "then you need to provide a suitable ->confirm() hook that can wait on
> this IO to complete if needed."
> - wait_on_event_interruptibe(transfer_queue)
>
> .release():
>
> - release the pages
>
> .steal():
>
> unsure

This is what allows zero copy throughout the pipe line. ->steal(), if
sucesful, should pass ownership of that page to the caller. The previous
owner must no longer modify it.

> .map
>
> unsure

See above :-)

--
Jens Axboe

2009-06-04 13:20:59

by Steve Rottinger

[permalink] [raw]
Subject: Re: splice methods in character device driver

Since I was working with a memory region that didn't have any "struct pages"
associated with it (or at least I wasn't able to find a way to retrieve
them for
this space), I took the approach of generating fake struct pages, which
I passed
through the pipe. Unfortunately, this also required me to make some
rather hanus
hacks to various kernel macros to get them to handle the fake pages; ie:
page_to_phys.
I'm not sure if this was the best way to do it, but it was the only way
that I could come
up with. ->map didn't help, since I am in O_DIRECT mode -- I wanted the
disk controller's
DMA to directly transfer from PCI memory.

As this point, I have proof of concept, since I am now able to transfer
some data directly from
PCI space to disk; however, I am still wrestling with some issues:

- I'm not sure at what point it is safe to free up the pages that I am
passing
through the pipe. I tried doing it in the "release" method, however,
this is apparently too
soon, since this results in a crash. How do I know when the system is
done with them?

- The performance is poor, and much slower than transferring directly from
main memory with O_DIRECT. I suspect that this has a lot to do with
large amount of
systems calls required to move the data, since each call moves only
64K. Maybe I'll
try increasing the pipe size, next.

Once I get past these issues, and I get the code in a better state, I'll
be happy to share what
I can.

-Steve



Jens Axboe wrote:
> On Wed, Jun 03 2009, Leon Woestenberg wrote:
>
>> Hello all,
>>
>> On Wed, May 13, 2009 at 6:59 PM, Steve Rottinger <[email protected]> wrote:
>>
>>> is passing in the pages into splice_to_pipe. The pages are associated
>>> with a PCI BAR, not main memory. I'm wondering if this could be a problem?
>>>
>>>
>> Good question; my newbie answer would be the pages need to be mapped
>> in kernel space.
>>
>
> That is what the ->map() hook is for.
>
>
>> I have a similar use case but with memory being DMA'd to host main
>> memory (instead of the data sitting in your PCI device) in a character
>> device driver. The driver is a complete rewrite from scratch from
>> what's currently sitting-butt-ugly in staging/altpcichdma.c
>> so-please-don't-look-there.
>>
>> I have already implemented zero-latency overlapping transfers in the
>> DMA engine (i.e. it never sits idle if async I/O is performed through
>> threads), now it would be really cool to add zero-copy.
>>
>> What is it my driver is expected to do?
>>
>> .splice_read:
>>
>> - Allocate a bunch of single pages
>> - Create a scatter-gather list
>> - "stuff the data pages in question into a struct page *pages[]." a la
>> "fs/splice.c:vmsplice_to_pipe()"
>> - Start the DMA from the device to the pages (i.e. the transfer)
>> - Return.
>>
>> .splice_write:
>>
>> - Create a scatter-gather list
>>
>> interrupt handler / DMA service routine:
>> - device book keeping
>> - wake_up_interruptible(transfer_queue)
>>
>> .confirm():
>>
>> "then you need to provide a suitable ->confirm() hook that can wait on
>> this IO to complete if needed."
>> - wait_on_event_interruptibe(transfer_queue)
>>
>> .release():
>>
>> - release the pages
>>
>> .steal():
>>
>> unsure
>>
>
> This is what allows zero copy throughout the pipe line. ->steal(), if
> sucesful, should pass ownership of that page to the caller. The previous
> owner must no longer modify it.
>
>
>> .map
>>
>> unsure
>>
>
> See above :-)
>
>

2009-06-12 19:22:03

by Leon Woestenberg

[permalink] [raw]
Subject: Re: splice methods in character device driver

Steve, Jens,

another few questions:

On Thu, Jun 4, 2009 at 3:20 PM, Steve Rottinger<[email protected]> wrote:
> ...
> - The performance is poor, and much slower than transferring directly from
> main memory with O_DIRECT. ?I suspect that this has a lot to do with
> large amount of systems calls required to move the data, since each call moves only
> 64K. ?Maybe I'll try increasing the pipe size, next.
>
> Once I get past these issues, and I get the code in a better state, I'll
> be happy to share what
> I can.
>
I've been experimenting a bit using mostly-empty functions to learn
understand the function call flow:

splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_device);
pipe_to_device(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)

So some back-of-a-coaster calculations:

If I understand correctly, a pipe_buffer never spans more than one
page (typically 4kB).

The SPLICE_BUFFERS is 16, thus splice_from_pipe() is called every 64kB.

The actor "pipe_to_device" is called on each pipe_buffer, so for every 4kB.

For my case, I have a DMA engine that does say 200 MB/s, resulting in
50000 actor calls per second.

As my use case would be to splice from an acquisition card to disk,
splice() made an interesting approach.

However, if the above is correct, I assume splice() is not meant for
my use-case?


Regards,

Leon.






/* the actor which takes pages from the pipe to the device
*
* it must move a single struct pipe_buffer to the desired destination
* Existing implementations are pipe_to_file, pipe_to_sendpage, pipe_to_user.
*/
static int pipe_to_device(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)
{
int rc;
printk(KERN_DEBUG "pipe_to_device(buf->offset=%d, sd->len=%d)\n",
buf->offset, sd->len);
/* make sure the data in this buffer is up-to-date */
rc = buf->ops->confirm(pipe, buf);
if (unlikely(rc))
return rc;
/* create a transfer for this buffer */

}
/* kernel wants to write from the pipe into our file at ppos */
ssize_t splice_write(struct pipe_inode_info *pipe, struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
int ret;
printk(KERN_DEBUG "splice_write(len=%d)\n", len);
ret = splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_device);
return 0;
}

--
Leon

2009-06-12 19:59:46

by Jens Axboe

[permalink] [raw]
Subject: Re: splice methods in character device driver

On Fri, Jun 12 2009, Leon Woestenberg wrote:
> Steve, Jens,
>
> another few questions:
>
> On Thu, Jun 4, 2009 at 3:20 PM, Steve Rottinger<[email protected]> wrote:
> > ...
> > - The performance is poor, and much slower than transferring directly from
> > main memory with O_DIRECT. ?I suspect that this has a lot to do with
> > large amount of systems calls required to move the data, since each call moves only
> > 64K. ?Maybe I'll try increasing the pipe size, next.
> >
> > Once I get past these issues, and I get the code in a better state, I'll
> > be happy to share what
> > I can.
> >
> I've been experimenting a bit using mostly-empty functions to learn
> understand the function call flow:
>
> splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_device);
> pipe_to_device(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> struct splice_desc *sd)
>
> So some back-of-a-coaster calculations:
>
> If I understand correctly, a pipe_buffer never spans more than one
> page (typically 4kB).

Correct

> The SPLICE_BUFFERS is 16, thus splice_from_pipe() is called every 64kB.

Also correct.

> The actor "pipe_to_device" is called on each pipe_buffer, so for every 4kB.

Ditto.

> For my case, I have a DMA engine that does say 200 MB/s, resulting in
> 50000 actor calls per second.
>
> As my use case would be to splice from an acquisition card to disk,
> splice() made an interesting approach.
>
> However, if the above is correct, I assume splice() is not meant for
> my use-case?

50000 function calls per second is not a lot. We do lots of things on a
per-page basis in the kernel. Batching would of course speed things up,
but it's not been a problem thus far. So I would not worry about 50k
function calls per second to begin with.

--
Jens Axboe

2009-06-12 20:45:53

by Steve Rottinger

[permalink] [raw]
Subject: Re: splice methods in character device driver

Hi Leon,

It does seem like a lot of code needs to be executed to move a small
chunk of data.
Although, I think that most of the overhead that I was experiencing
came from the cumulative
overhead of each splice system call. I increased my pipe size using
Jens' pipe size patch,
from 16 to 256 pages, and this had a huge effect -- the speed of my
transfers more than doubled.
Pipe sizes larger that 256 pages, cause my kernel to crash.

I'm doing about 300MB/s to my hardware RAID, running two instances of my
splice() copy application
(One on each RAID channel). I would like to combine the two RAID
channels using a software RAID 0;
however, splice, even from /dev/zero runs horribly slow to a software
RAID device. I'd be curious
to know if anyone else has tried this?

-Steve

Leon Woestenberg wrote:
> Steve, Jens,
>
> another few questions:
>
> On Thu, Jun 4, 2009 at 3:20 PM, Steve Rottinger<[email protected]> wrote:
>
>> ...
>> - The performance is poor, and much slower than transferring directly from
>> main memory with O_DIRECT. I suspect that this has a lot to do with
>> large amount of systems calls required to move the data, since each call moves only
>> 64K. Maybe I'll try increasing the pipe size, next.
>>
>> Once I get past these issues, and I get the code in a better state, I'll
>> be happy to share what
>> I can.
>>
>>
> I've been experimenting a bit using mostly-empty functions to learn
> understand the function call flow:
>
> splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_device);
> pipe_to_device(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> struct splice_desc *sd)
>
> So some back-of-a-coaster calculations:
>
> If I understand correctly, a pipe_buffer never spans more than one
> page (typically 4kB).
>
> The SPLICE_BUFFERS is 16, thus splice_from_pipe() is called every 64kB.
>
> The actor "pipe_to_device" is called on each pipe_buffer, so for every 4kB.
>
> For my case, I have a DMA engine that does say 200 MB/s, resulting in
> 50000 actor calls per second.
>
> As my use case would be to splice from an acquisition card to disk,
> splice() made an interesting approach.
>
> However, if the above is correct, I assume splice() is not meant for
> my use-case?
>
>
> Regards,
>
> Leon.
>
>
>
>
>
>
> /* the actor which takes pages from the pipe to the device
> *
> * it must move a single struct pipe_buffer to the desired destination
> * Existing implementations are pipe_to_file, pipe_to_sendpage, pipe_to_user.
> */
> static int pipe_to_device(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> struct splice_desc *sd)
> {
> int rc;
> printk(KERN_DEBUG "pipe_to_device(buf->offset=%d, sd->len=%d)\n",
> buf->offset, sd->len);
> /* make sure the data in this buffer is up-to-date */
> rc = buf->ops->confirm(pipe, buf);
> if (unlikely(rc))
> return rc;
> /* create a transfer for this buffer */
>
> }
> /* kernel wants to write from the pipe into our file at ppos */
> ssize_t splice_write(struct pipe_inode_info *pipe, struct file *out,
> loff_t *ppos, size_t len, unsigned int flags)
> {
> int ret;
> printk(KERN_DEBUG "splice_write(len=%d)\n", len);
> ret = splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_device);
> return 0;
> }
>
>

2009-06-16 11:59:23

by Jens Axboe

[permalink] [raw]
Subject: Re: splice methods in character device driver

On Fri, Jun 12 2009, Steve Rottinger wrote:
> Hi Leon,
>
> It does seem like a lot of code needs to be executed to move a small
> chunk of data.

It's really not, you should try and benchmark the function call overhead
:-).

> Although, I think that most of the overhead that I was experiencing
> came from the cumulative
> overhead of each splice system call. I increased my pipe size using
> Jens' pipe size patch,
> from 16 to 256 pages, and this had a huge effect -- the speed of my
> transfers more than doubled.
> Pipe sizes larger that 256 pages, cause my kernel to crash.

Yes, the system call is more expensive. Increasing the pipe size can
definitely help there.

> I'm doing about 300MB/s to my hardware RAID, running two instances of my
> splice() copy application
> (One on each RAID channel). I would like to combine the two RAID
> channels using a software RAID 0;
> however, splice, even from /dev/zero runs horribly slow to a software
> RAID device. I'd be curious
> to know if anyone else has tried this?

Did you trace it and find out why it was slow? It should not be. Moving
300MB/sec should not be making any machine sweat.

--
Jens Axboe

2009-06-16 15:06:33

by Steve Rottinger

[permalink] [raw]
Subject: Re: splice methods in character device driver

Hi Jens,

Jens Axboe wrote:
>
>> Although, I think that most of the overhead that I was experiencing
>> came from the cumulative
>> overhead of each splice system call. I increased my pipe size using
>> Jens' pipe size patch,
>> from 16 to 256 pages, and this had a huge effect -- the speed of my
>> transfers more than doubled.
>> Pipe sizes larger that 256 pages, cause my kernel to crash.
>>
>
> Yes, the system call is more expensive. Increasing the pipe size can
> definitely help there.
>
>
I know that you have been asked this before, but is there any chance
that we can
get the pipe size patch into the kernel mainline? It seems like it is
essential to
moving data fast using the splice interface.

>> I'm doing about 300MB/s to my hardware RAID, running two instances of my
>> splice() copy application
>> (One on each RAID channel). I would like to combine the two RAID
>> channels using a software RAID 0;
>> however, splice, even from /dev/zero runs horribly slow to a software
>> RAID device. I'd be curious
>> to know if anyone else has tried this?
>>
>
> Did you trace it and find out why it was slow? It should not be. Moving
> 300MB/sec should not be making any machine sweat.
>
>
I haven't dug into this too deeply, yet; however, I did discover
something interesting:
The splice runs much faster using the software raid, if I transfer to a
file on a mounted
filesystem, instead of the raw md block device.

-Steve

2009-06-16 18:25:07

by Jens Axboe

[permalink] [raw]
Subject: [RFC][PATCH] add support for shrinking/growing a pipe (Was "Re: splice methods in character device driver")

On Tue, Jun 16 2009, Steve Rottinger wrote:
> >> Although, I think that most of the overhead that I was experiencing
> >> came from the cumulative
> >> overhead of each splice system call. I increased my pipe size using
> >> Jens' pipe size patch,
> >> from 16 to 256 pages, and this had a huge effect -- the speed of my
> >> transfers more than doubled.
> >> Pipe sizes larger that 256 pages, cause my kernel to crash.
> >>
> >
> > Yes, the system call is more expensive. Increasing the pipe size can
> > definitely help there.
> >
> >
> I know that you have been asked this before, but is there any chance
> that we can
> get the pipe size patch into the kernel mainline? It seems like it is
> essential to
> moving data fast using the splice interface.

Sure, the only unresolved issue with it is what sort of interface to
export for changing the pipe size. I went with fcntl().

Linus, I think we discussed this years ago. The patch in question is
here:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f

I'd like to get it in now, there has been several requests for this in
the past. But I didn't want to push it before this was resolved.

I don't know whether other operating systems allow this functionality,
and if they do what interface they use. I suspect that our need is
somewhat special, since we have splice.

--
Jens Axboe

2009-06-16 18:28:18

by Jens Axboe

[permalink] [raw]
Subject: Re: splice methods in character device driver

On Tue, Jun 16 2009, Steve Rottinger wrote:
> >> I'm doing about 300MB/s to my hardware RAID, running two instances of my
> >> splice() copy application
> >> (One on each RAID channel). I would like to combine the two RAID
> >> channels using a software RAID 0;
> >> however, splice, even from /dev/zero runs horribly slow to a software
> >> RAID device. I'd be curious
> >> to know if anyone else has tried this?
> >>
> >
> > Did you trace it and find out why it was slow? It should not be. Moving
> > 300MB/sec should not be making any machine sweat.
> >
> I haven't dug into this too deeply, yet; however, I did discover
> something interesting: The splice runs much faster using the software
> raid, if I transfer to a file on a mounted filesystem, instead of the
> raw md block device.

OK, that's a least a starting point. I'll try this tomorrow (raw block
device vs fs file).

--
Jens Axboe