Hello,
I'm trying to measure the perf gain from using splice. For now I'm trying to copy a 1G file using splice. (In the real scenario, the driver will DMA the data into a buffer (which is mmap'd). The app will then write the newly-DMA'd data to the disk while some other thread is crunching the same buffer. The buffer is guaranteed not to be modified. To avoid copying, I was thinking of: splice-IN mmap'd-buffer->pipe and splice-OUT pipe->file.)
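For the real (mmap'd buffer) scenario, the shape I have in mind is roughly the untested sketch below: vmsplice() the already-DMA'd region into a pipe, then splice() the pipe out to the destination file. (buf, buf_len, fd_out and the offset handling are placeholders here, not the actual driver interface.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/types.h>

/* buf/buf_len: the mmap'd, already-DMA'd region; fd_out: destination file.
 * The pages must not be touched until the data has been written out. */
static int buffer_to_file(void *buf, size_t buf_len, int fd_out, loff_t *out_off)
{
    int pfd[2];
    struct iovec iov;
    ssize_t in, done;

    if (pipe(pfd) < 0)
        return -1;

    while (buf_len) {
        iov.iov_base = buf;
        iov.iov_len  = buf_len;

        /* map the user pages into the pipe (may take less than buf_len) */
        in = vmsplice(pfd[1], &iov, 1, 0);
        if (in < 0)
            return -1;

        /* push the pipe contents out to the file */
        for (done = 0; done < in; ) {
            ssize_t n = splice(pfd[0], NULL, fd_out, out_off,
                               in - done, SPLICE_F_MORE | SPLICE_F_MOVE);
            if (n < 0)
                return -1;
            done += n;
        }
        buf      = (char *)buf + in;
        buf_len -= in;
    }
    close(pfd[0]);
    close(pfd[1]);
    return 0;
}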
PS - I've inlined some sloppy code that I cooked up.
Case 1) read from the input file and write (O_DIRECT, so no buffer cache should be involved - but it doesn't work) to the destination file. We can talk about the buffer cache later.
(csh#)time ./splice_to_splice
0.004u 1.451s 0:02.16 67.1% 0+0k 2097152+2097152io 0pf+0w
#define KILO_BYTE (1024)
#define PIPE_SIZE (64 * KILO_BYTE)

int filedes[2];
loff_t from_offset = 0, to_offset = 0;
ssize_t ret;
size_t to_write;

pipe(filedes);
fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

to_write = 2048 * 512 * KILO_BYTE;      /* 1G file == 2048 * 512K */
while (to_write) {
    /* input file -> pipe */
    ret = splice(fd_from, &from_offset, filedes[1], NULL, PIPE_SIZE,
                 SPLICE_F_MORE | SPLICE_F_MOVE);
    if (ret < 0) {
        printf("Error: LINE:%d ret:%zd\n", __LINE__, ret);
        goto error;
    } else {
        /* pipe -> output file; splice out what actually went in (was PIPE_SIZE) */
        ret = splice(filedes[0], NULL, fd_to, &to_offset, ret,
                     SPLICE_F_MORE | SPLICE_F_MOVE);
        if (ret < 0) {
            printf("Error: LINE:%d ret:%zd\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}
Case 2) directly reading and writing:
Case 2.1) copy 64K blocks
(csh#)time ./file_to_file 64
0.015u 1.066s 0:04.04 26.4% 0+0k 2097152+2097152io 0pf+0w
#define KILO_BYTE (1024)
#define MEGA_BYTE (1024 * (KILO_BYTE))
#define BUFF_SIZE (64 * MEGA_BYTE)

char *buff;
ssize_t ret, copy_size;
size_t to_write;

posix_memalign((void **)&buff, 4096, BUFF_SIZE);
fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;
copy_size = cmd_line_input * KILO_BYTE; /* block size controlled from the command line */
while (to_write) {
    ret = read(fd_from, buff, copy_size);
    if (ret != copy_size) {
        printf("Error: LINE:%d ret:%zd\n", __LINE__, ret);
        goto error;
    } else {
        ret = write(fd_to, buff, copy_size);
        if (ret != copy_size) {
            printf("Error: LINE:%d ret:%zd\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}
Case 2.2) copy 512K blocks
(csh#)time ./file_to_file 512
0.004u 0.306s 0:01.86 16.1% 0+0k 2097152+2097152io 0pf+0w
Case 2.3) copy 1M blocks
time ./file_to_file 1024
0.000u 0.240s 0:01.88 12.7% 0+0k 2097152+2097152io 0pf+0w
Questions:
Q1) When using splice, why is the CPU consumption greater than with read/write (case 2.1)? What does this mean?
Q2) How do I confirm that memory bandwidth consumption does not spike when using splice in this case? By this I mean the (node) CPU<->memory path. The DMA-in/DMA-out will happen; you can't escape that, and the IOH bus will be utilized. But I want to keep the CPU(node)<->memory path free (well, minimize unnecessary copies).
Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data gets cached. I verified it using vmstat:
r b swpd free buff cache
1 0 0 9358820 116576 2100904
./splice_to_splice
r b swpd free buff cache
2 0 0 7228908 116576 4198164
I see the same caching behavior even if I vmsplice buffers (a simple malloc'd iovec) to a pipe and then splice the pipe to a file. Speed is still an issue with vmsplice too.
Q4) Also, using splice you can only transfer 64K worth of data (PIPE_BUFFERS * PAGE_SIZE) at a time, correct? With stock read/write I can go up to a 1MB buffer; beyond that I don't see any gain, but the reduction in system/CPU time is still significant.
I would appreciate any pointers.
thanks
Rick
Hi Rick,
On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data gets cached. I verified it using vmstat.
>
> r b swpd free buff cache
> 1 0 0 9358820 116576 2100904
>
> ./splice_to_splice
>
> r b swpd free buff cache
> 2 0 0 7228908 116576 4198164
>
> I see the same caching issue even if I vmsplice buffers(simple malloc'd iov) to a pipe and then splice the pipe to a file. The speed is still an issue with vmsplice too.
>
One thing is that O_DIRECT is a hint; not all filesystems bypass the
cache. I'm pretty sure ext2 does, and I know FAT doesn't.
Another variable is whether (and how) your filesystem implements the
splice_write file operation. The generic one (pipe_to_file) in
fs/splice.c copies data to pagecache. The default one goes out to
vfs_write() and might stand more of a chance of honoring O_DIRECT.
> Q4) Also, using splice, you can only transfer 64K worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But using stock read/write, I can go upto 1MB buffer. After that I don't see any gain. But still the reduction in system/cpu time is significant.
I'm not a splicing expert, but I did spend some time recently trying to
improve FTP reception by splicing from a TCP socket to a file. I found
that while splicing avoids copying packets to userland, that gain is
more than offset by a large increase in calls into the storage stack.
It's especially bad with TCP sockets because a typical packet has, say,
1460 bytes of data. Since splicing works on PIPE_BUFFERS pages at a
time, and packet pages are only about 35% utilized (1460 bytes of a
4096-byte page), each cycle to userland could only move at most about
23 KiB of data (16 pages * 1460 bytes). Some similar effect may be in
play in your case.
ftrace may be of some help in finding the bottleneck...
Regards,
------------------------------------------------------------------------
Steven J. Magnani "I claim this network for MARS!
http://www.digidescorp.com Earthling, return my space modulator!"
#include <standard.disclaimer>
Hello Jens - any assistance/pointers on 1) and 2) below
will be great. I'm willing to test out any sample patch.
Steve,
--- On Wed, 4/21/10, Steven J. Magnani <[email protected]> wrote:
> Hi Rick,
>
> On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > Q3) When using splice, even though the destination
> file is opened in O_DIRECT mode, the data gets cached. I
> verified it using vmstat.
> >
> > r b swpd free buff cache
> > 1 0 0 9358820 116576 2100904
> >
> > ./splice_to_splice
> >
> > r b swpd free buff cache
> > 2 0 0 7228908 116576 4198164
> >
> > I see the same caching issue even if I vmsplice
> buffers(simple malloc'd iov) to a pipe and then splice the
> pipe to a file. The speed is still an issue with vmsplice
> too.
> >
>
> One thing is that O_DIRECT is a hint; not all filesystems
> bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
>
> Another variable is whether (and how) your filesystem
> implements the splice_write file operation. The generic one (pipe_to_file)
> in fs/splice.c copies data to pagecache. The default one goes
> out to vfs_write() and might stand more of a chance of honoring
> O_DIRECT.
>
True. I guess I should have looked harder. It's XFS, and XFS's file_ops point to generic_file_splice_read/write. Last time I had to fdatasync() and then fadvise() to mimic O_DIRECT.
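Roughly what I did back then (from memory, so treat it as an untested sketch; fd/offset/len are whatever the last batch of writes covered):

#include <fcntl.h>
#include <unistd.h>

/* Flush the range we just wrote and drop it from the page cache,
 * to approximate O_DIRECT behaviour after the fact. */
static int flush_and_drop(int fd, off_t offset, off_t len)
{
    if (fdatasync(fd) < 0)
        return -1;
    /* tell the kernel we won't need these pages again */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}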
> > Q4) Also, using splice, you can only transfer 64K
> worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> using stock read/write, I can go upto 1MB buffer. After that
> I don't see any gain. But still the reduction in system/cpu
> time is significant.
>
> I'm not a splicing expert but I did spend some time
> recently trying to
> improve FTP reception by splicing from a TCP socket to a
> file. I found that while splicing avoids copying packets to userland,
> that gain is more than offset by a large increase in calls into the
> storage stack.It's especially bad with TCP sockets because a typical
> packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> pages at a time, and packet pages are only about 35% utilized, each
> cycle to userland I could only move 23 KiB of data at most. Some
> similar effect may be in play in your case.
>
Agreed, increasing the number of calls will offset the benefit.
But what if:
1) We were to increase PIPE_BUFFERS from 16 to 64 or some other value?
What would the implications be for other parts of the kernel?
2) There was a way to find out when the DMA-out/in on the buffers that were passed in is complete, so that we are free to recycle them? A callback would be helpful. Obviously the user-space app will have to manage its buffers, but at least we would be guaranteed that the buffers can be recycled (in other words, no worrying about modifying in-flight data that is still being DMA'd).
> Regards,
>  Steven J. Magnani
regards
++Rick
On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> Hello Jens - any assistance/pointers on 1) and 2) below
> will be great.I'm willing to test out any sample patch.
Recent mail from him has come from [email protected]; I cc'd it.
>
> Steve,
>
> --- On Wed, 4/21/10, Steven J. Magnani <[email protected]> wrote:
> > Hi Rick,
> >
> > On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > > Q3) When using splice, even though the destination
> > file is opened in O_DIRECT mode, the data gets cached. I
> > verified it using vmstat.
> > >
> > > r b swpd free buff cache
> > > 1 0 0 9358820 116576 2100904
> > >
> > > ./splice_to_splice
> > >
> > > r b swpd free buff cache
> > > 2 0 0 7228908 116576 4198164
> > >
> > > I see the same caching issue even if I vmsplice
> > buffers(simple malloc'd iov) to a pipe and then splice the
> > pipe to a file. The speed is still an issue with vmsplice
> > too.
> > >
> >
> > One thing is that O_DIRECT is a hint; not all filesystems
> > bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
> >
> > Another variable is whether (and how) your filesystem
> > implements the splice_write file operation. The generic one (pipe_to_file)
> > in fs/splice.c copies data to pagecache. The default one goes
> > out to vfs_write() and might stand more of a chance of honoring
> > O_DIRECT.
> >
>
> True.I guess I should have looked harder. It's xfs and xfs's->file_ops points to 'generic_file_splice_read[write]'.Last time I had to 'fdatasync' and then fadvise to mimic 'O_DIRECT'.
>
> > > Q4) Also, using splice, you can only transfer 64K
> > worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> > using stock read/write, I can go upto 1MB buffer. After that
> > I don't see any gain. But still the reduction in system/cpu
> > time is significant.
> >
> > I'm not a splicing expert but I did spend some time
> > recently trying to
> > improve FTP reception by splicing from a TCP socket to a
> > file. I found that while splicing avoids copying packets to userland,
> > that gain is more than offset by a large increase in calls into the
> > storage stack.It's especially bad with TCP sockets because a typical
> > packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> > pages at a time, and packet pages are only about 35% utilized, each
> > cycle to userland I could only move 23 KiB of data at most. Some
> > similar effect may be in play in your case.
> >
>
> Agreed,increasing number of calls will offset the benefit.
> But what if:
> 1)We were to increase the PIPE_BUFFERS from '16' to '64' or 'some value'?
> What are the implications in the other parts of the kernel?
This came up recently. One problem is that there are a couple of kernel
functions with up to 3 stack-based arrays of dimension PIPE_BUFFERS, so
the stack cost of increasing PIPE_BUFFERS can be quite high. I've
thought it might be nice if there were some mechanism for userland apps
to request larger PIPE_BUFFERS values, but I haven't pursued this line
of thought to see if it's practical.
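To put a rough number on it - userland illustration only, the struct
layout here is from memory of fs/splice.c, so treat it as approximate:

#include <stdio.h>

#define PIPE_BUFFERS 16                 /* current value; try 64 to see the cost */

struct partial_page {                   /* roughly mirrors the fs/splice.c helper */
    unsigned int  offset;
    unsigned int  len;
    unsigned long private;
};

int main(void)
{
    /* per-array stack cost in a splice read/actor path (64-bit) */
    printf("pages[]:   %zu bytes\n", PIPE_BUFFERS * sizeof(void *));
    printf("partial[]: %zu bytes\n", PIPE_BUFFERS * sizeof(struct partial_page));
    return 0;
}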
> 2)There was a way to find out if the DMA-out/in from the initial buffer's that were passed are complete so that we are free to recycle them? Callback would be helpful.Obviously, the user-space-app will have to manage it's buffers but atleast we are guranteed that the buffers can be recycled(in other words no worrying about modifying in-flight data that is being DMA'd).
It's a neat idea, but it would probably be much easier (and less
invasive) to try this sort of pipelining in userland using a ring buffer
or ping-pong approach. I'm actually in the middle of something like this
with FTP, where I will have a reader thread that puts data from the
network into a ring buffer, from which a writer thread moves it to a
file.
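The shape of it is something like the sketch below (untested; a
two-buffer ping-pong rather than a full ring, with net_fd/out_fd setup,
EOF, and short-write handling omitted):

#include <pthread.h>
#include <unistd.h>

#define NBUF     2                      /* ping-pong; a real ring would use more */
#define BUF_SIZE (512 * 1024)

struct slot {
    char            data[BUF_SIZE];
    size_t          len;                /* bytes valid in data[] */
    int             full;               /* 0 = writable, 1 = readable */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

static struct slot slots[NBUF];
static int net_fd, out_fd;              /* socket and file, set up elsewhere */

static void slots_init(void)
{
    for (int i = 0; i < NBUF; i++) {
        pthread_mutex_init(&slots[i].lock, NULL);
        pthread_cond_init(&slots[i].cond, NULL);
    }
}

static void *reader(void *arg)          /* network -> buffers */
{
    for (unsigned i = 0; ; i = (i + 1) % NBUF) {
        struct slot *s = &slots[i];

        pthread_mutex_lock(&s->lock);
        while (s->full)                 /* wait until the writer has drained it */
            pthread_cond_wait(&s->cond, &s->lock);
        s->len  = read(net_fd, s->data, BUF_SIZE);   /* error/EOF handling omitted */
        s->full = 1;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

static void *writer(void *arg)          /* buffers -> file */
{
    for (unsigned i = 0; ; i = (i + 1) % NBUF) {
        struct slot *s = &slots[i];

        pthread_mutex_lock(&s->lock);
        while (!s->full)                /* wait until the reader has filled it */
            pthread_cond_wait(&s->cond, &s->lock);
        write(out_fd, s->data, s->len); /* short-write handling omitted */
        s->full = 0;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}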
------------------------------------------------------------------------
Steven J. Magnani "I claim this network for MARS!
http://www.digidescorp.com Earthling, return my space modulator!"
#include <standard.disclaimer>
On Fri, Apr 23 2010, Steven J. Magnani wrote:
> On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> > Hello Jens - any assistance/pointers on 1) and 2) below
> > will be great.I'm willing to test out any sample patch.
>
> Recent mail from him has come from [email protected], I cc'd it.
Goes to the same inbox in the end, so no difference :-)
> > > On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > > > Q3) When using splice, even though the destination
> > > file is opened in O_DIRECT mode, the data gets cached. I
> > > verified it using vmstat.
> > > >
> > > > r b swpd free buff cache
> > > > 1 0 0 9358820 116576 2100904
> > > >
> > > > ./splice_to_splice
> > > >
> > > > r b swpd free buff cache
> > > > 2 0 0 7228908 116576 4198164
> > > >
> > > > I see the same caching issue even if I vmsplice
> > > buffers(simple malloc'd iov) to a pipe and then splice the
> > > pipe to a file. The speed is still an issue with vmsplice
> > > too.
> > > >
> > >
> > > One thing is that O_DIRECT is a hint; not all filesystems
> > > bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
> > >
> > > Another variable is whether (and how) your filesystem
> > > implements the splice_write file operation. The generic one (pipe_to_file)
> > > in fs/splice.c copies data to pagecache. The default one goes
> > > out to vfs_write() and might stand more of a chance of honoring
> > > O_DIRECT.
> > >
> >
> > True.I guess I should have looked harder. It's xfs and xfs's->file_ops points to 'generic_file_splice_read[write]'.Last time I had to 'fdatasync' and then fadvise to mimic 'O_DIRECT'.
> >
> > > > Q4) Also, using splice, you can only transfer 64K
> > > worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> > > using stock read/write, I can go upto 1MB buffer. After that
> > > I don't see any gain. But still the reduction in system/cpu
> > > time is significant.
> > >
> > > I'm not a splicing expert but I did spend some time
> > > recently trying to
> > > improve FTP reception by splicing from a TCP socket to a
> > > file. I found that while splicing avoids copying packets to userland,
> > > that gain is more than offset by a large increase in calls into the
> > > storage stack.It's especially bad with TCP sockets because a typical
> > > packet has, say,1460 bytes of data. Since splicing works on PIPE_BUFFERS
> > > pages at a time, and packet pages are only about 35% utilized, each
> > > cycle to userland I could only move 23 KiB of data at most. Some
> > > similar effect may be in play in your case.
> > >
> >
> > Agreed,increasing number of calls will offset the benefit.
> > But what if:
> > 1)We were to increase the PIPE_BUFFERS from '16' to '64' or 'some value'?
> > What are the implications in the other parts of the kernel?
>
> This came up recently, one problem is that there a couple of kernel
> functions having up to 3 stack-based arrays of dimension PIPE_BUFFER. So
> the stack cost of increasing PIPE_BUFFERS can be quite high. I've
> thought it might be nice if there was some mechanism for userland apps
> to be able to request larger PIPE_BUFFERS values, but I haven't pursued
> this line of thought to see if it's practical.
I still have patches pending for this, making the pipe buffer count
settable from user space:
http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f
Let me know if you want to give it a spin on a recent kernel, and I'll
update it.
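From the application side, usage would be along these lines (assuming
the fcntl-based interface from that patch, F_SETPIPE_SZ; the exact
constant and limits may still change before merge):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];

    if (pipe(pfd) < 0)
        return 1;
    /* ask for a 1MB pipe instead of the default 16 pages (64K) */
    if (fcntl(pfd[1], F_SETPIPE_SZ, 1024 * 1024) < 0)
        perror("F_SETPIPE_SZ");
    return 0;
}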
> > 2)There was a way to find out if the DMA-out/in from the initial buffer's that were passed are complete so that we are free to recycle them? Callback would be helpful.Obviously, the user-space-app will have to manage it's buffers but atleast we are guranteed that the buffers can be recycled(in other words no worrying about modifying in-flight data that is being DMA'd).
>
> It's a neat idea, but it would probably be much easier (and less
> invasive) to try this sort of pipelining in userland using a ring buffer
> or ping-pong approach. I'm actually in the middle of something like this
> with FTP, where I will have a reader thread that puts data from the
> network into a ring buffer, from which a writer thread moves it to a
> file.
See vmsplice.c from the splice test tools:
http://brick.kernel.dk/snaps/splice-git-latest.tar.gz
--
Jens Axboe
Hello Jens,
--- On Fri, 4/23/10, Jens Axboe <[email protected]> wrote:
> I still have patches pending for this, making the pipe
> buffer count
> settable form user space:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f
>
> Let me know if you want to give it a spin on a recent
> kernel, and I'll
> update it.
>
I think we need to adjust PIPE_BUFFERS in default_file_splice_read() as well, correct?
> Jens Axboe
Thanks