2006-11-16 18:09:08

by Jim Schutt

Subject: splice/vmsplice performance test results

Hi,

I've done some testing to see how splice/vmsplice perform
vs. other alternatives on transferring a large file across
a fast network. One option I tested was to use vmsplice
to get a 1-copy receive, but it didn't perform as well
as I had hoped. I was wondering if my results were at odds
with what other people have observed.

I've two systems, each with:
Tyan S2895 motherboard
2 ea. 2.6 GHz Opteron
1 GiB memory
Myricom Myri-10G 10 Gb/s NIC (PCIe x8)
2.6.19-rc5-g134a11f0 on FC4

In addition, one system has a 3ware 9590-8ML (PCIe) and a 3ware
9550SX-8LP (PCI-X), with 16 Seagate Barracuda 7200.10 SATA drives
(250 GB ea., NCQ enabled). Write caching is enabled on the 3ware
cards.

The Myricom cards are connected back-to-back using 9000 byte MTU.
I baseline the network performance with 'iperf -w 1M -l 64K'
and get 6.9 Gb/s.

After a fair amount of testing, I settled on a 4-way software
RAID0 on top of 4-way hardware RAID0 units as giving the best
streaming performance. The file system is XFS, with the stripe
unit set to the hardware RAID chunk size, and the stripe width
16 times that.

Disk tuning parameters in /sys/block/sd*/queue are default
values, except queue/nr_requests = 5 gives me best performance.
(It seems like the 3ware cards slow down a little if I feed them
too much data on the streaming write test I'm using.)

I baseline file write performance with
sync; time { dd if=/dev/zero of=./zero bs=32k count=512k; sync; }
and get 465-520 MB/s (highly variable).

I test baseline file read performance with
time dd if=./zero of=/dev/null bs=32k count=512k
and get 950 MB/s (fairly repeatable).

My test program can do one of the following:

send data:
A) read() from file into buffer, write() buffer into socket
B) mmap() section of file, write() that into socket, munmap()
C) splice() from file to pipe, splice() from pipe to socket

receive data:
1) read() from socket into buffer, write() buffer into file
2) ftruncate() to extend file, mmap() new extent, read()
from socket into new extent, munmap()
3) read() from socket into buffer, vmsplice() buffer to
pipe, splice() pipe to file (using the double-buffer trick)
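
As a minimal sketch of method C (splice from file to pipe, then pipe to
socket), assuming file_fd and sock_fd are already-open descriptors and a
libc that exposes splice(); the full test program, with error handling and
the syscall wrappers, appears later in this thread:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void splice_file_to_socket(int file_fd, int sock_fd)
{
	int pfd[2];
	ssize_t n, m;

	if (pipe(pfd) < 0)
		return;

	for (;;) {
		/* pull up to 64 KiB of file pages into the pipe */
		n = splice(file_fd, NULL, pfd[1], NULL, 64 * 1024,
			   SPLICE_F_MOVE);
		if (n <= 0)
			break;
		/* push everything in the pipe out to the socket */
		while (n > 0) {
			m = splice(pfd[0], NULL, sock_fd, NULL, n,
				   SPLICE_F_MOVE);
			if (m <= 0) {
				n = -1;
				break;
			}
			n -= m;
		}
		if (n < 0)
			break;
	}
	close(pfd[0]);
	close(pfd[1]);
}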

Here's the results, using:
- 64 KiB buffer, mmap extent, or splice
- 1 MiB TCP window
- 16 GiB data sent across network

A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)

A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)

A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)

I had (naively) hoped the read/vmsplice/splice combination would
run at the same speed I can write a file, i.e. at about 450 MB/s
on my setup. Do any of my numbers seem bogus, so I should look
harder at my test program?

Or is read+write really the fastest way to get data off a
socket and into a file?

-- Jim Schutt

(Please Cc: me, as I'm not subscribed to lkml.)



2006-11-16 20:26:03

by Jens Axboe

Subject: Re: splice/vmsplice performance test results

On Thu, Nov 16 2006, Jim Schutt wrote:
> Hi,
>
> I've done some testing to see how splice/vmsplice perform
> vs. other alternatives on transferring a large file across
> a fast network. One option I tested was to use vmsplice
> to get a 1-copy receive, but it didn't perform as well
> as I had hoped. I was wondering if my results were at odds
> with what other people have observed.
>
> I've two systems, each with:
> Tyan S2895 motherboard
> 2 ea. 2.6 GHz Opteron
> 1 GiB memory
> Myricom Myri-10G 10 Gb/s NIC (PCIe x8)
> 2.6.19-rc5-g134a11f0 on FC4
>
> In addition, one system has a 3ware 9590-8ML (PCIe) and a 3ware
> 9550SX-8LP (PCI-X), with 16 Seagate Barracuda 7200.10 SATA drives
> (250 GB ea., NCQ enabled). Write caching is enabled on the 3ware
> cards.
>
> The Myricom cards are connected back-to-back using 9000 byte MTU.
> I baseline the network performance with 'iperf -w 1M -l 64K'
> and get 6.9 Gb/s.
>
> After a fair amount of testing, I settled on a 4-way software
> RAID0 on top of 4-way hardware RAID0 units as giving the best
> streaming performance. The file system is XFS, with the stripe
> unit set to the hardware RAID chunk size, and the stripe width
> 16 times that.
>
> Disk tuning parameters in /sys/block/sd*/queue are default
> values, except queue/nr_requests = 5 gives me best performance.
> (It seems like the 3ware cards slow down a little if I feed them
> too much data on the streaming write test I'm using.)
>
> I baseline file write performance with
> sync; time { dd if=/dev/zero of=./zero bs=32k count=512k; sync; }
> and get 465-520 MB/s (highly variable).
>
> I test baseline file read performance with
> time dd if=./zero of=/dev/null bs=32k count=512k
> and get 950 MB/s (fairly repeatable).
>
> My test program can do one of the following:
>
> send data:
> A) read() from file into buffer, write() buffer into socket
> B) mmap() section of file, write() that into socket, munmap()
> C) splice() from file to pipe, splice() from pipe to socket
>
> receive data:
> 1) read() from socket into buffer, write() buffer into file
> 2) ftruncate() to extend file, mmap() new extent, read()
> from socket into new extent, munmap()
> 3) read() from socket into buffer, vmsplice() buffer to
> pipe, splice() pipe to file (using the double-buffer trick)
>
> Here's the results, using:
> - 64 KiB buffer, mmap extent, or splice
> - 1 MiB TCP window
> - 16 GiB data sent across network
>
> A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
>
> A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
>
> A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
>
> I had (naively) hoped the read/vmsplice/splice combination would
> run at the same speed I can write a file, i.e. at about 450 MB/s
> on my setup. Do any of my numbers seem bogus, so I should look
> harder at my test program?

Could be read-ahead playing in here, I'd have to take a closer look at
the generated io patterns to say more about that. Any chance you can
capture iostat or blktrace info for such a run to compare what goes to
the disk? Can you pass along the test program?

> Or is read+write really the fastest way to get data off a
> socket and into a file?

splice() should be just as fast of course, and more efficient. Not a lot
of real-life performance tuning has gone into it yet, so I would not be
surprised if we need to smoothen a few edges.

--
Jens Axboe

2006-11-16 20:52:25

by David Miller

Subject: Re: splice/vmsplice performance test results

From: "Jim Schutt" <[email protected]>
Date: Thu, 16 Nov 2006 11:08:59 -0700

> Or is read+write really the fastest way to get data off a
> socket and into a file?

There is still no explicit TCP support for splice/vmsplice so things
get copied around and most of the other advantages of splice/vmsplice
aren't obtained either. So perhaps that explains your numbers.

Jens Axboe tries to get things working, and others have looked into it
too, but adding TCP support is hard and for several reasons folks like
Alexey Kuznetsov and Evgeniy Polyakov believe that sys_receivefile()
is an interface much better suited for TCP receive.

splice/vmsplice has a lot of state connected to a transaction, and
perhaps that is part of why Evgeniy and Alexey have trouble wrapping
their brains around an efficient implementation.

2006-11-16 21:21:39

by Jens Axboe

Subject: Re: splice/vmsplice performance test results

On Thu, Nov 16 2006, David Miller wrote:
> From: "Jim Schutt" <[email protected]>
> Date: Thu, 16 Nov 2006 11:08:59 -0700
>
> > Or is read+write really the fastest way to get data off a
> > socket and into a file?
>
> There is still no explicit TCP support for splice/vmsplice so things
> get copied around and most of the other advantages of splice/vmsplice
> aren't obtained either. So perhaps that explains your numbers.

There should not be any copying for tcp send, at least no more than what
sendfile() did/does. Hmm?

> Jens Axboe tries to get things working, and others have looked into it
> too, but adding TCP support is hard and for several reasons folks like
> Alexey Kuznetsov and Evgeniy Polyakov believe that sys_receivefile()
> is an interface much better suited for TCP receive.
>
> splice/vmsplice has a lot of state connected to a transaction, and
> perhaps that is part of why Evgeniy and Alexey have trouble wrapping
> their brains around an efficient implementation.

I hope to try and see if I can help get some of that done, however I
need all the help I can get on the networking side. Not sure I
understand why it has to be so difficult, if we need to define a wrapper
container instead of passing down a pipe that is completely fine with
me. The networking code basically just needs to hang on to the
pipe_buffer and release it on ack for send, receive is somewhat more
involved (and I don't know enough about networking to voice any
half-intelligent opinion on that!).

I would just consider it a damn shame if we cannot complete the splice
family and need to punt to something else for net receive.

--
Jens Axboe

2006-11-16 21:24:23

by Jim Schutt

Subject: Re: splice/vmsplice performance test results

On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> >

> >
> > My test program can do one of the following:
> >
> > send data:
> > A) read() from file into buffer, write() buffer into socket
> > B) mmap() section of file, write() that into socket, munmap()
> > C) splice() from file to pipe, splice() from pipe to socket
> >
> > receive data:
> > 1) read() from socket into buffer, write() buffer into file
> > 2) ftruncate() to extend file, mmap() new extent, read()
> > from socket into new extent, munmap()
> > 3) read() from socket into buffer, vmsplice() buffer to
> > pipe, splice() pipe to file (using the double-buffer trick)
> >
> > Here's the results, using:
> > - 64 KiB buffer, mmap extent, or splice
> > - 1 MiB TCP window
> > - 16 GiB data sent across network
> >
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> >
> > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> >
> > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> >
> > I had (naively) hoped the read/vmsplice/splice combination would
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup. Do any of my numbers seem bogus, so I should look
> > harder at my test program?
>
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that. Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk?

I can try. Do you prefer iostat or blktrace, or would you like both?
Can you point me at some instructions?

> Can you pass along the test program?

Inserted inline below.

>
> > Or is read+write really the fastest way to get data off a
> > socket and into a file?
>
> splice() should be just as fast of course, and more efficient. Not a lot
> of real-life performance tuning has gone into it yet, so I would not be
> surprised if we need to smoothen a few edges.
>

I'm glad I can help a little here.

-- Jim

Here's a splice.h I'm using, based on your example:
-----------------------

/* Implement splice syscall support for glibc versions that don't
* have it.
*/

#ifndef __do_splice_syscall_h__
#define __do_splice_syscall_h__

#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__)

/* From kernel tree include/asm-i386/unistd.h
*/
#ifndef __NR_splice
#define __NR_splice 313
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif

#elif defined(__x86_64__)

/* From kernel tree include/asm-x86_64/unistd.h
*/
#ifndef __NR_splice
#define __NR_splice 275
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif

#else
#error unsupported architecture
#endif

/* From kernel tree include/linux/pipe_fs_i.h
*/
#define SPLICE_F_MOVE (0x01) /* move pages instead of copying */
#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
                                 /* we may still block on the fd we splice */
                                 /* from/to, of course */
#define SPLICE_F_MORE (0x04) /* expect more data */
#define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */

#ifndef SYS_splice
#define SYS_splice __NR_splice
#endif
#ifndef SYS_vmsplice
#define SYS_vmsplice __NR_vmsplice
#endif


#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif

static inline
int splice(int fd_in, off64_t *off_in, int fd_out, off64_t *off_out,
size_t len, unsigned int flags)
{
return syscall(SYS_splice, fd_in, off_in,
fd_out, off_out, len, flags);
}

struct iovec;

static inline
int vmsplice(int fd, const struct iovec *iov,
unsigned long nr_segs, unsigned int flags)
{
return syscall(SYS_vmsplice, fd, iov, nr_segs, flags);
}


#endif /* __do_splice_syscall_h__ */

------------------

And here's the test program itself:


/* Copyright 2006 Sandia Corporation.
*
* This file is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License,
* as published by the Free Software Foundation.
*/

/* Compile with -DHAVE_SPLICE_SYSCALL if you _know_ your kernel
* has splice/vmsplice support.
*/

#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif


#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/sendfile.h>

#include <netdb.h>
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <resolv.h>

#define ALIGN(value,size) (((value) + (size) - 1) & ~((size) - 1))

/* Once glibc implements splice(), SYS_splice will be defined in
* system headers. Until then we need to use our own stuff to access
* these syscalls in new kernels.
*/
#ifdef HAVE_SPLICE_SYSCALL
#ifndef SYS_splice
#include <splice.h>
#endif
#endif

#define OPTION_FLAGS "b:c:hl:m:op:tw:"

const char usage[] = "\n\
dnd [OPTION] ... <file> \n\
\n\
Performs a timed disk-network-disk data transfer, using TCP/IP \n\
as the network protocol. \n\
\n\
When dnd is invoked with \"-c <remote-host>\", it connects to \n\
<remote-host> and sends the contents of <file>. Otherwise, it \n\
accepts a connection and writes the data received over the \n\
connection into <file>. On the sender side, timing starts just \n\
before the first byte is read from the file, and stops just after \n\
the last byte of data is sent. On the receiver side, timing starts \n\
just before the first byte is received, and stops just after the \n\
last byte is synced to disk. \n\
\n\
-b <bsz> \n\
Use a buffer of size <bsz> bytes to move data. The default \n\
value is 65536 bytes. The value for <bsz> may be suffixed \n\
with one of the following multipliers: \n\
k *1000 \n\
M *1000*1000 \n\
G *1000*1000*1000 \n\
Ki *1024 \n\
Mi *1024*1024 \n\
Gi *1024*1024*1024 \n\
-c <remote-host> \n\
Connect to <remote-host> to send the data in <file>. \n\
If not specified, listen for connections and receive data \n\
into <file>. \n\
-h \n\
Print this message. \n\
-l <sz> \n\
Limit the transfer to at most <sz> bytes. The value for \n\
<sz> may be suffixed as for the '-b' option. Valid only \n\
if the '-c' option is also present.\n\
-m <method> \n\
Select one of the following methods: \n\
mmap Use mmap system call on the file descriptor and \n\
read/write system calls on the socket descriptor. \n\
rw Use read/write system calls on both the file \n\
descriptor and the socket descriptor. (Default) \n\
sendfile \n\
Use the sendfile system call to send data. \n\
splice Use the splice system call. Currently only supports \n\
a splice from the file to the socket. \n\
vmsplice \n\
Use the read system call to receive data from the \n\
socket into memory, and the vmsplice system call \n\
to move the data into the file. \n\
-o \n\
If writing to <file> and it already exists, overwrite its \n\
data and truncate it to the total number of bytes received.\n\
-p <port> \n\
Either listen on <port>, or attempt to connect to \n\
<remote_host>:<port>. The default port is 13931. \n\
-t \n\
If writing to <file> and it already exists, truncate it \n\
to zero length before writing to it.\n\
-w <wsz> \n\
Use a TCP window size of <wsz> bytes, which may be suffixed \n\
as for <bsz> above. The default value is 131072 bytes.\n\
\n\
";


enum method {MMAP, RW, SENDFILE, SPLICE, VMSPLICE};

struct options {
enum method use;

unsigned int truncate:1;
unsigned int overwrite:1;
unsigned int readfile:1;

uint64_t limit;
int win_size;
size_t buf_size;
unsigned short def_port;

char *host_str;
char *file;
char *buf;
void *private;
};

typedef uint64_t (*move_function)(const struct options *opts,
int from_fd, int to_fd);

union pipe_fd {
int fda[2];
struct {
int r;
int w;
} fd;
};

static struct options dnd_opts = {
.use = RW,
.buf_size = 64 * 1024,
.win_size = 128 * 1024,
.def_port = 13931
};

uint64_t dt_usec(struct timeval *start, struct timeval *stop)
{
uint64_t dt;

dt = stop->tv_sec - start->tv_sec;
dt *= 1000000;
dt += stop->tv_usec - start->tv_usec;
return dt;
}

static
uint64_t suffix(const char *str)
{
uint64_t s = 1;

switch (*str) {
case 'k':
s *= 1000;
break;
case 'K':
if (*(str+1) == 'i')
s *= 1024;
break;
case 'M':
if (*(str+1) == 'i')
s *= 1024*1024;
else
s *= 1000*1000;
break;
case 'G':
if (*(str+1) == 'i')
s *= 1024*1024*1024;
else
s *= 1000*1000*1000;
break;
}

return s;
}

int open_file(const struct options *opts)
{
int fd, err;
mode_t o_mode = 0644;
int o_flags = O_LARGEFILE;

if (opts->use == MMAP)
o_flags |= O_RDWR;
else {
if (opts->readfile)
o_flags |= O_RDONLY;
else
o_flags |= O_WRONLY;
}

if (!opts->readfile) {
o_flags |= O_CREAT;
if (opts->truncate)
o_flags |= O_TRUNC;
else if (!opts->overwrite)
o_flags |= O_EXCL;
}
fd = open64(opts->file, o_flags, o_mode);
if (fd < 0) {
perror("Open data file");
exit(EXIT_FAILURE);
}
return fd;
}

int connect_to(const struct options *opts, struct sockaddr_in6 *saddr)
{
int fd;
int optval;
socklen_t optlen;
int err;

struct hostent *tgt;

/* Turn on IPv6 resolver action - gethostbyname() will always
* return IPv6 addresses.
*/
res_init();
_res.options |= RES_USE_INET6;

tgt = gethostbyname(opts->host_str);
if (!tgt) {
herror("gethostbyname/IPv6");
exit(EXIT_FAILURE);
}
if (tgt->h_addrtype != AF_INET6) {
fprintf(stderr,
"Error: got non-IPv6 address from gethostbyname!\n");
exit(EXIT_FAILURE);
}
#if 1
{
char buf[INET6_ADDRSTRLEN+1] = {0,};
char **ptr;

printf("connecting to host: %s\n", opts->host_str);
printf(" canonical name: %s\n", tgt->h_name);
ptr = tgt->h_aliases;
while (*ptr) {
printf(" alias: %s\n", *ptr);
++ptr;
}
ptr = tgt->h_addr_list;
while (*ptr) {
if (!inet_ntop(tgt->h_addrtype, *ptr,
buf, INET6_ADDRSTRLEN+1)) {
perror("inet_ntop/IPv6");
exit(EXIT_FAILURE);
}
printf(" address: %s\n", buf);
++ptr;
}
}
#endif
fd = socket(PF_INET6, SOCK_STREAM, 0);
if (fd == -1) {
perror("Open IPv6 socket");
exit(EXIT_FAILURE);
}

optval = opts->win_size;
optlen = sizeof(optval);
err = setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &optval, optlen);
if (err) {
perror("Set IPv6 socket SO_SNDBUF");
exit(EXIT_FAILURE);
}
optlen = sizeof(optval);
err = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &optval, &optlen);
if (err) {
perror("Get IPv6 socket SO_SNDBUF");
exit(EXIT_FAILURE);
}
if (optval != opts->win_size)
printf("TCP send window size: requested %d actual %d\n",
opts->win_size, optval);

saddr->sin6_family = AF_INET6;
saddr->sin6_port = htons(opts->def_port);
saddr->sin6_addr = *(struct in6_addr *)tgt->h_addr_list[0];

err = connect(fd, (struct sockaddr *)saddr, sizeof(*saddr));
if (err) {
perror("Connect to remote host");
exit(EXIT_FAILURE);
}

optval = 1;
optlen = sizeof(optval);
err = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &optval, optlen);
if (err) {
perror("Set IPv6 socket TCP_NODELAY");
exit(EXIT_FAILURE);
}
return fd;
}

int listen_for(const struct options *opts, struct sockaddr_in6 *saddr)
{
int fd, lfd;
int optval;
socklen_t optlen;
int err;

lfd = socket(PF_INET6, SOCK_STREAM, 0);
if (lfd == -1) {
perror("Open IPv6 socket");
exit(EXIT_FAILURE);
}
optval = 1;
err = setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR,
&optval, sizeof(optval));
if (err) {
perror("Set IPv6 socket SO_REUSEADDR");
exit(EXIT_FAILURE);
}

optval = opts->win_size;
optlen = sizeof(optval);
err = setsockopt(lfd, SOL_SOCKET, SO_RCVBUF, &optval, optlen);
if (err) {
perror("Set IPv6 socket SO_RCVBUF");
exit(EXIT_FAILURE);
}
optlen = sizeof(optval);
err = getsockopt(lfd, SOL_SOCKET, SO_RCVBUF, &optval, &optlen);
if (err) {
perror("Get IPv6 socket SO_RCVBUF");
exit(EXIT_FAILURE);
}
if (optval != opts->win_size)
printf("TCP receive window size: requested %d actual %d\n",
opts->win_size, optval);

saddr->sin6_family = AF_INET6;
saddr->sin6_port = htons(opts->def_port);
saddr->sin6_addr = in6addr_any;

err = bind(lfd, (struct sockaddr *)saddr, sizeof(*saddr));
if (err) {
perror("Bind IPv6 address");
exit(EXIT_FAILURE);
}
err = listen(lfd, 1);
if (err) {
perror("Listen on IPv6 address");
exit(EXIT_FAILURE);
}
fd = accept(lfd, NULL, 0);
if (fd < 0) {
perror("Accept new connection");
exit(EXIT_FAILURE);
}
close(lfd);

optval = 1;
optlen = sizeof(optval);
err = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &optval, optlen);
if (err) {
perror("Set IPv6 socket TCP_NODELAY");
exit(EXIT_FAILURE);
}
return fd;
}

int wait_on_connection(const struct options *opts)
{
int sock_fd, err;
struct sockaddr_in6 saddr;
socklen_t saddr_len = sizeof(saddr);
char buf[INET6_ADDRSTRLEN+1];

if (opts->host_str)
sock_fd = connect_to(opts, &saddr);
else
sock_fd = listen_for(opts, &saddr);

err = getpeername(sock_fd, (struct sockaddr *)&saddr, &saddr_len);
if (err) {
perror("getpeername");
exit(EXIT_FAILURE);
}
if (saddr.sin6_family != AF_INET6) {
fprintf(stderr,
"Error: got non-IPv6 address from getpeername!\n");
exit(EXIT_FAILURE);
}
if (!inet_ntop(AF_INET6, &saddr.sin6_addr,
buf, INET6_ADDRSTRLEN+1)) {
perror("inet_ntop/IPv6");
exit(EXIT_FAILURE);
}
printf("Connected to %s port %d\n",
buf, (int)ntohs(saddr.sin6_port));

return sock_fd;
}

void setup_mmap(struct options *opts, int fd)
{
struct stat64 sb;
size_t pg_sz = sysconf(_SC_PAGESIZE);

opts->buf_size = ALIGN(opts->buf_size, pg_sz);

/* We'll just get the file size here so we don't time it later...
*/
if (fstat64(fd, &sb) < 0) {
perror("Stat data file");
exit(EXIT_FAILURE);
}
opts->private = malloc(sizeof(off64_t));
if (!opts->private) {
perror("Allocating private data");
exit(EXIT_FAILURE);
}
*((off64_t *)opts->private) = sb.st_size;
}

uint64_t mmap_send(const struct options *opts, int fd, int sd)
{
size_t bufl = opts->buf_size;
off64_t fsz = *((off64_t *)opts->private);

uint64_t bytes = 0;
ssize_t n;
size_t m, l;
char *mem;
int err;

if (opts->limit && opts->limit < (uint64_t)fsz)
fsz = opts->limit;

again:
mem = mmap64(NULL, bufl, PROT_READ, MAP_SHARED, fd, bytes);
if (mem == MAP_FAILED) {
fprintf(stderr, "mmap %llu @ offset %llu: %s\n",
(unsigned long long)bufl,
(unsigned long long)bytes, strerror(errno));
exit(EXIT_FAILURE);
}
#ifdef USE_MADVISE
err = madvise(mem, bufl, MADV_WILLNEED);
if (err && errno != EAGAIN) {
perror("madvise");
exit(EXIT_FAILURE);
}
#endif
if (bytes + bufl < (uint64_t)fsz)
l = bufl;
else
l = fsz - bytes;

m = 0;
while (l) {

again2:
n = write(sd, mem + m, l);
if (n < 0) {
if (errno == EINTR)
goto again2;
perror("write");
exit(EXIT_FAILURE);
}
bytes += n;
m += n;
l -= n;
}
err = munmap(mem, bufl);
if (err < 0) {
fprintf(stderr, "munmap %llu: %s\n",
(unsigned long long)bufl, strerror(errno));
exit(EXIT_FAILURE);
}
if (bytes == (uint64_t)fsz)
return bytes;

goto again;
}

uint64_t mmap_recv(const struct options *opts, int sd, int fd)
{
size_t bufl = opts->buf_size;

uint64_t bytes = 0;
ssize_t n;
size_t m, l;
char *mem;
int err;

again:
l = bufl;

err = ftruncate64(fd, bytes + bufl);
if (err < 0) {
fprintf(stderr, "ftruncate to %llu: %s\n",
(unsigned long long)bytes + bufl, strerror(errno));
exit(EXIT_FAILURE);
}
mem = mmap64(NULL, bufl, PROT_WRITE, MAP_SHARED, fd, bytes);
if (mem == MAP_FAILED) {
fprintf(stderr, "mmap %llu @ offset %llu: %s\n",
(unsigned long long)bufl,
(unsigned long long)bytes, strerror(errno));
exit(EXIT_FAILURE);
}

m = 0;
while (l) {

again2:
n = read(sd, mem + m, l);
if (n < 0) {
if (errno == EINTR)
goto again2;
perror("Read");
exit(EXIT_FAILURE);
}
if (n == 0) {
err = munmap(mem, bufl);
if (err < 0) {
fprintf(stderr, "munmap %llu: %s\n",
(unsigned long long)bufl,
strerror(errno));
exit(EXIT_FAILURE);
}
err = ftruncate64(fd, bytes);
if (err < 0) {
fprintf(stderr, "ftruncate to %llu: %s\n",
(unsigned long long)bytes,
strerror(errno));
exit(EXIT_FAILURE);
}
fdatasync(fd);
return bytes;
}
bytes += n;
m += n;
l -= n;
}
err = munmap(mem, bufl);
if (err < 0) {
fprintf(stderr, "munmap %llu: %s\n",
(unsigned long long)bufl, strerror(errno));
exit(EXIT_FAILURE);
}
goto again;
}

void setup_rw(struct options *opts)
{
opts->buf = malloc(opts->buf_size);
if (!opts->buf) {
perror("Allocating data buffer");
exit(EXIT_FAILURE);
}
}

uint64_t read_write(const struct options *opts, int rfd, int wfd)
{
char *buf = opts->buf;
size_t bufl = opts->buf_size;

uint64_t bytes = 0;
ssize_t n, m, l;

again:
if (opts->limit && bufl > opts->limit - bytes)
bufl = opts->limit - bytes;

l = read(rfd, buf, bufl);
if (l < 0) {
if (errno == EINTR)
goto again;
perror("Read");
exit(EXIT_FAILURE);
}
if (l == 0 || (opts->limit && opts->limit == bytes)) {
if (opts->readfile)
close(wfd);
else {
ftruncate64(wfd, bytes);
fdatasync(wfd);
}
return bytes;
}

m = 0;
while (l) {

again2:
n = write(wfd, buf + m, l);
if (n < 0) {
if (errno == EINTR)
goto again2;
perror("Write");
exit(EXIT_FAILURE);
}
m += n;
l -= n;
}
bytes += m;
goto again;
}

uint64_t sendfile_send(const struct options *opts, int fd, int sd)
{
size_t bufl = opts->buf_size;

uint64_t bytes = 0;
off64_t os = 0;
ssize_t l;

again:
if (opts->limit && bufl > opts->limit - bytes)
bufl = opts->limit - bytes;

l = sendfile64(sd, fd, &os, bufl);
if (l < 0) {
perror("sendfile from file");
exit(EXIT_FAILURE);
}
if (l == 0) {
close(sd);
return bytes;
}
bytes += l;

goto again;
}

/* At least for now (as of 2.6.18-rc5), splice seems to hang if you
* try to splice more data than the pipe can handle, rather than doing
* what it can and returning what that was. So, coerce user buffer down
* to maximum pipe buffer size.
*/
#ifndef MAX_SPLICE_SIZE
#define MAX_SPLICE_SIZE (64 * 1024)
#endif

void setup_splice(struct options *opts)
{
union pipe_fd *p;
size_t pg_sz = sysconf(_SC_PAGESIZE);

p = malloc(sizeof(union pipe_fd));
if (!p) {
perror("allocate pipe fds");
exit(EXIT_FAILURE);
}
if (pipe(p->fda) < 0) {
perror("opening pipe");
exit(EXIT_FAILURE);
}
opts->private = p;

if (opts->buf_size > MAX_SPLICE_SIZE)
opts->buf_size = MAX_SPLICE_SIZE;

opts->buf_size = ALIGN(opts->buf_size, pg_sz);
}

#ifdef HAVE_SPLICE_SYSCALL
uint64_t splice_send(const struct options *opts, int fd, int sd)
{
union pipe_fd *p = opts->private;
size_t bufl = opts->buf_size;

uint64_t bytes = 0;
ssize_t n, l;

again:
if (opts->limit && bufl > opts->limit - bytes)
bufl = opts->limit - bytes;

l = splice(fd, NULL, p->fd.w, NULL, bufl,
SPLICE_F_MORE | SPLICE_F_MOVE);
if (l < 0) {
perror("splice from file");
exit(EXIT_FAILURE);
}
if (l == 0) {
close(sd);
return bytes;
}

while (l) {

n = splice(p->fd.r, NULL, sd, NULL, l,
SPLICE_F_MORE | SPLICE_F_MOVE);
if (n < 0) {
perror("splice to socket");
exit(EXIT_FAILURE);
}
l -= n;
bytes += n;
}
goto again;
}
#else
move_function splice_send = NULL;
#endif

/* vmsplice moves pages backing a user address range to a pipe. However,
 * you don't want the application changing data in that address range
 * after the pages have been moved to the pipe, but before they have been
 * consumed at their destination.
 *
 * The solution is to double buffer:
 * load buffer A, vmsplice to pipe
 * load buffer B, vmsplice to pipe
 * When the B->splice->pipe call completes, there can no longer be any
 * references in the pipe to the pages backing buffer A, since it is now
 * filled with references to the pages backing buffer B. So, it is safe
 * to load new data into buffer A.
 */
void setup_vmsplice(struct options *opts)
{
union pipe_fd *p;
size_t pg_sz = sysconf(_SC_PAGESIZE);

p = malloc(sizeof(union pipe_fd));
if (!p) {
perror("allocate pipe fds");
exit(EXIT_FAILURE);
}
if (pipe(p->fda) < 0) {
perror("opening pipe");
exit(EXIT_FAILURE);
}
opts->private = p;

if (opts->buf_size > MAX_SPLICE_SIZE)
opts->buf_size = MAX_SPLICE_SIZE;

opts->buf_size = ALIGN(opts->buf_size, pg_sz);

opts->buf = malloc(2*opts->buf_size + pg_sz);
if (!opts->buf) {
perror("Allocating data buffer");
exit(EXIT_FAILURE);
}
opts->buf = (char *)ALIGN((unsigned long)opts->buf,
(unsigned long)pg_sz);
}

#ifdef HAVE_SPLICE_SYSCALL
uint64_t vmsplice_recv(const struct options *opts, int sd, int fd)
{
union pipe_fd *p = opts->private;
struct iovec iov;

uint64_t bytes = 0;
ssize_t n, m, l;
unsigned i = 1;

again:
i = (i + 1) & 1;
iov.iov_base = opts->buf + i * opts->buf_size;

again2:
l = read(sd, iov.iov_base, opts->buf_size);
if (l < 0) {
if (errno == EINTR)
goto again2;
perror("Read");
exit(EXIT_FAILURE);
}
if (l == 0) {
fdatasync(fd);
return bytes;
}

while (l) {
iov.iov_len = l;

n = vmsplice(p->fd.w, &iov, 1, 0);
if (n < 0) {
perror("vmsplice to pipe");
exit(EXIT_FAILURE);
}

while (n) {
m = splice(p->fd.r, NULL, fd, NULL, n, SPLICE_F_MORE);
if (m < 0) {
perror("splice to file");
exit(EXIT_FAILURE);
}
n -= m;

l -= m;
bytes += m;
iov.iov_base += m;
}
}
goto again;
}
#else
move_function vmsplice_recv = NULL;
#endif

uint64_t move(const struct options *opts, int fd, int sd,
move_function do_send, move_function do_recv)
{
uint64_t bytes;

if (opts->readfile) {
if (do_send)
bytes = do_send(opts, fd, sd);
else
goto no_support;
}
else {
if (do_recv)
bytes = do_recv(opts, sd, fd);
else
goto no_support;
}
return bytes;

no_support:
fprintf(stderr, "Sorry, method not implemented.\n");
exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
int sd, fd;

uint64_t byte_cnt;

struct timeval start;
struct timeval stop;
uint64_t et_usec;

move_function do_send;
move_function do_recv;

while (1) {
char *next_char;
int c = getopt(argc, argv, OPTION_FLAGS);
if (c == -1)
break;

switch (c) {
case 'b':
{
uint64_t sz = strtoull(optarg, &next_char, 0);
sz *= suffix(next_char);
dnd_opts.buf_size = sz;
if ((uint64_t)dnd_opts.buf_size != sz) {
fprintf(stderr,
"Error: invalid buffer size\n");
exit(EXIT_FAILURE);
}
}
break;
case 'c':
dnd_opts.host_str = strdup(optarg);
dnd_opts.readfile = 1;
break;
case 'h':
printf("%s", usage);
exit(EXIT_SUCCESS);
case 'l':
{
uint64_t sz = strtoull(optarg, &next_char, 0);
sz *= suffix(next_char);
dnd_opts.limit = sz;
}
break;
case 'm':
if (strncmp(optarg, "mmap", 32) == 0)
dnd_opts.use = MMAP;
else if (strncmp(optarg, "rw", 32) == 0)
dnd_opts.use = RW;
else if (strncmp(optarg, "sendfile", 32) == 0)
dnd_opts.use = SENDFILE;
else if (strncmp(optarg, "splice", 32) == 0)
dnd_opts.use = SPLICE;
else if (strncmp(optarg, "vmsplice", 32) == 0)
dnd_opts.use = VMSPLICE;
else {
fprintf(stderr,
"Error: unknown method '%s'\n",
optarg);
exit(EXIT_FAILURE);
}
break;
case 'o':
dnd_opts.overwrite = 1;
break;
case 'p':
{
unsigned long port = strtoul(optarg, NULL, 0);
dnd_opts.def_port = port;
if (dnd_opts.def_port == 0 ||
(unsigned long)dnd_opts.def_port != port) {
fprintf(stderr,
"Error: invalid port\n");
exit(EXIT_FAILURE);
}
}
break;
case 't':
dnd_opts.truncate = 1;
break;
case 'w':
{
uint64_t wsz = strtoull(optarg, &next_char, 0);
wsz *= suffix(next_char);
dnd_opts.win_size = wsz;
if ((uint64_t)dnd_opts.win_size != wsz) {
fprintf(stderr,
"Error: invalid window size\n");
exit(EXIT_FAILURE);
}
}
break;
}
}
if (dnd_opts.limit && !dnd_opts.readfile) {
fprintf(stderr, "Error: can only limit transfer as sender!\n"
" (I.e., when also using '-c' option.)\n");
exit(EXIT_FAILURE);
}
if (dnd_opts.truncate && dnd_opts.overwrite) {
fprintf(stderr, "Error: cannot both overwrite "
"and truncate data file!\n");
exit(EXIT_FAILURE);
}
if (argc == 1) {
printf("%s", usage);
exit(EXIT_SUCCESS);
}
if (optind+1 != argc) {
fprintf(stderr, "Error: need a filename\n");
exit(EXIT_FAILURE);
}
dnd_opts.file = strdup(argv[optind]);

fd = open_file(&dnd_opts);
sd = wait_on_connection(&dnd_opts);

switch (dnd_opts.use) {
case MMAP:
printf("Using mmap with read/write calls\n");
setup_mmap(&dnd_opts, fd);
do_send = mmap_send;
do_recv = mmap_recv;
break;
case RW:
printf("Using read/write calls\n");
setup_rw(&dnd_opts);
do_send = read_write;
do_recv = read_write;
break;
case SENDFILE:
printf("Using sendfile calls\n");
do_send = sendfile_send;
do_recv = NULL;
break;
case SPLICE:
printf("Using splice calls\n");
setup_splice(&dnd_opts);
do_send = splice_send;
do_recv = NULL;
break;
case VMSPLICE:
printf("Using vmsplice with read/write calls\n");
setup_vmsplice(&dnd_opts);
do_send = NULL;
do_recv = vmsplice_recv;
break;
}
printf("Buffer size %u bytes\n", (unsigned)dnd_opts.buf_size);

gettimeofday(&start, NULL);

byte_cnt = move(&dnd_opts, fd, sd, do_send, do_recv);

gettimeofday(&stop, NULL);
et_usec = dt_usec(&start, &stop);

printf("\n%s %llu KiB in %.3f sec: %.3f MB/s (%.3f Gb/s)\n\n",
(dnd_opts.readfile ? "Sent" : "Received"),
(unsigned long long)byte_cnt/1024, (1.e-6 * et_usec),
((float)byte_cnt/et_usec),
8*((float)byte_cnt/et_usec)/1000);

exit(EXIT_SUCCESS);
}




2006-11-16 21:27:50

by David Miller

Subject: Re: splice/vmsplice performance test results

From: Jens Axboe <[email protected]>
Date: Thu, 16 Nov 2006 22:21:07 +0100

> On Thu, Nov 16 2006, David Miller wrote:
> > From: "Jim Schutt" <[email protected]>
> > Date: Thu, 16 Nov 2006 11:08:59 -0700
> >
> > > Or is read+write really the fastest way to get data off a
> > > socket and into a file?
> >
> > There is still no explicit TCP support for splice/vmsplice so things
> > get copied around and most of the other advantages of splice/vmsplice
> > aren't obtained either. So perhaps that explains your numbers.
>
> There should not be any copying for tcp send, at least no more than what
> sendfile() did/does. Hmm?

That's true on send, correct.

> > Jens Axboe tries to get things working, and others have looked into it
> > too, but adding TCP support is hard and for several reasons folks like
> > Alexey Kuznetsov and Evgeniy Polyakov believe that sys_receivefile()
> > is an interface much better suited for TCP receive.
> >
> > splice/vmsplice has a lot of state connected to a transaction, and
> > perhaps that is part of why Evgeniy and Alexey have trouble wrapping
> > their brains around an efficient implementation.
>
> I hope to try and see if I can help get some of that done, however I
> need all the help I can get on the networking side. Not sure I
> understand why it has to be so difficult, if we need to define a wrapper
> container instead of passing down a pipe that is completely fine with
> me. The networking code basically just needs to hang on to the
> pipe_buffer and release it on ack for send, receive is somewhat more
> involved (and I don't know enough about networking to voice any
> half-intelligent opinion on that!).
>
> I would just consider it a damn shame if we cannot complete the splice
> family and need to punt to something else for net receive.

I'm sure that the folks on [email protected] are more than
willing to give you a hand in this area :-)

2006-11-17 17:22:15

by Jim Schutt

Subject: Re: splice/vmsplice performance test results

On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> >

> > My test program can do one of the following:
> >
> > send data:
> > A) read() from file into buffer, write() buffer into socket
> > B) mmap() section of file, write() that into socket, munmap()
> > C) splice() from file to pipe, splice() from pipe to socket
> >
> > receive data:
> > 1) read() from socket into buffer, write() buffer into file
> > 2) ftruncate() to extend file, mmap() new extent, read()
> > from socket into new extent, munmap()
> > 3) read() from socket into buffer, vmsplice() buffer to
> > pipe, splice() pipe to file (using the double-buffer trick)
> >
> > Here's the results, using:
> > - 64 KiB buffer, mmap extent, or splice
> > - 1 MiB TCP window
> > - 16 GiB data sent across network
> >
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> >
> > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> >
> > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> >
> > I had (naively) hoped the read/vmsplice/splice combination would
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup. Do any of my numbers seem bogus, so I should look
> > harder at my test program?
>
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that. Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk?

I've attached a file with iostat and vmstat results for the case
where I read from a socket and write a file, vs. the case where I
read from a socket and use vmsplice/splice to write the file.
(Sorry it's not inline - my mailer locks up when I try to
include the file.)

Would you still like blktrace info for these two cases?

-- Jim


Attachments:
iostat.txt (19.33 kB)

2006-11-20 07:59:59

by Jens Axboe

Subject: Re: splice/vmsplice performance test results

On Fri, Nov 17 2006, Jim Schutt wrote:
> On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > Hi,
> > >
>
> > > My test program can do one of the following:
> > >
> > > send data:
> > > A) read() from file into buffer, write() buffer into socket
> > > B) mmap() section of file, write() that into socket, munmap()
> > > C) splice() from file to pipe, splice() from pipe to socket
> > >
> > > receive data:
> > > 1) read() from socket into buffer, write() buffer into file
> > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > from socket into new extent, munmap()
> > > 3) read() from socket into buffer, vmsplice() buffer to
> > > pipe, splice() pipe to file (using the double-buffer trick)
> > >
> > > Here's the results, using:
> > > - 64 KiB buffer, mmap extent, or splice
> > > - 1 MiB TCP window
> > > - 16 GiB data sent across network
> > >
> > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > >
> > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > >
> > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > >
> > > I had (naively) hoped the read/vmsplice/splice combination would
> > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > on my setup. Do any of my numbers seem bogus, so I should look
> > > harder at my test program?
> >
> > Could be read-ahead playing in here, I'd have to take a closer look at
> > the generated io patterns to say more about that. Any chance you can
> > capture iostat or blktrace info for such a run to compare that goes to
> > the disk?
>
> I've attached a file with iostat and vmstat results for the case
> where I read from a socket and write a file, vs. the case where I
> read from a socket and use vmsplice/splice to write the file.
> (Sorry it's not inline - my mailer locks up when I try to
> include the file.)
>
> Would you still like blktrace info for these two cases?

No, I think the iostat data is fine, I don't think the blktrace info
would give me any more insight on this problem. I'll set up a test to
reproduce it here, looks like the write out path could be optimized some
more.

--
Jens Axboe

2006-11-20 08:24:44

by Jens Axboe

Subject: Re: splice/vmsplice performance test results

On Mon, Nov 20 2006, Jens Axboe wrote:
> On Fri, Nov 17 2006, Jim Schutt wrote:
> > On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > > Hi,
> > > >
> >
> > > > My test program can do one of the following:
> > > >
> > > > send data:
> > > > A) read() from file into buffer, write() buffer into socket
> > > > B) mmap() section of file, write() that into socket, munmap()
> > > > C) splice() from file to pipe, splice() from pipe to socket
> > > >
> > > > receive data:
> > > > 1) read() from socket into buffer, write() buffer into file
> > > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > > from socket into new extent, munmap()
> > > > 3) read() from socket into buffer, vmsplice() buffer to
> > > > pipe, splice() pipe to file (using the double-buffer trick)
> > > >
> > > > Here's the results, using:
> > > > - 64 KiB buffer, mmap extent, or splice
> > > > - 1 MiB TCP window
> > > > - 16 GiB data sent across network
> > > >
> > > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > > >
> > > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > > >
> > > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > > >
> > > > I had (naively) hoped the read/vmsplice/splice combination would
> > > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > > on my setup. Do any of my numbers seem bogus, so I should look
> > > > harder at my test program?
> > >
> > > Could be read-ahead playing in here, I'd have to take a closer look at
> > > the generated io patterns to say more about that. Any chance you can
> > > capture iostat or blktrace info for such a run to compare that goes to
> > > the disk?
> >
> > I've attached a file with iostat and vmstat results for the case
> > where I read from a socket and write a file, vs. the case where I
> > read from a socket and use vmsplice/splice to write the file.
> > (Sorry it's not inline - my mailer locks up when I try to
> > include the file.)
> >
> > Would you still like blktrace info for these two cases?
>
> No, I think the iostat data is fine, I don't think the blktrace info
> would give me any more insight on this problem. I'll set up a test to
> reproduce it here, looks like the write out path could be optimized some
> more.

While I get that setup, can you repeat your testing without using
SPLICE_F_MORE (you don't really use that flag correctly, but it does not
matter for your case afaict) and SPLICE_F_MOVE? The latter will cost
some CPU, but vmsplice/splice for network receive to a file is not
really optimal in the first place. When we get splice() from socket fd
support that'll improve, right now you are doing the best you can.
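
(For reference, SPLICE_F_MORE is just a hint that another splice to the
same descriptor will follow, much like MSG_MORE on a socket, so strictly it
should only be requested while more data is known to be coming. A
hypothetical helper for that, with made-up names sent/chunk/total, could
look like this sketch:

#define _GNU_SOURCE
#include <fcntl.h>	/* SPLICE_F_MORE (or use the splice.h wrapper above) */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: ask for SPLICE_F_MORE only while more data is
 * known to follow; the last chunk (or an unknown total) gets no flag. */
static unsigned int splice_more_flag(uint64_t sent, size_t chunk,
				     uint64_t total)
{
	if (total && sent + chunk < total)
		return SPLICE_F_MORE;
	return 0;
}

As noted, for your test case it should not matter either way.)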

--
Jens Axboe

2006-11-20 15:49:15

by Jim Schutt

Subject: Re: splice/vmsplice performance test results

On Mon, 2006-11-20 at 09:24 +0100, Jens Axboe wrote:
> On Mon, Nov 20 2006, Jens Axboe wrote:
> > On Fri, Nov 17 2006, Jim Schutt wrote:
> > > On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > > > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > > > Hi,
> > > > >
> > >
> > > > > My test program can do one of the following:
> > > > >
> > > > > send data:
> > > > > A) read() from file into buffer, write() buffer into socket
> > > > > B) mmap() section of file, write() that into socket, munmap()
> > > > > C) splice() from file to pipe, splice() from pipe to socket
> > > > >
> > > > > receive data:
> > > > > 1) read() from socket into buffer, write() buffer into file
> > > > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > > > from socket into new extent, munmap()
> > > > > 3) read() from socket into buffer, vmsplice() buffer to
> > > > > pipe, splice() pipe to file (using the double-buffer trick)
> > > > >
> > > > > Here's the results, using:
> > > > > - 64 KiB buffer, mmap extent, or splice
> > > > > - 1 MiB TCP window
> > > > > - 16 GiB data sent across network
> > > > >
> > > > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > > > >
> > > > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > > > >
> > > > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > > > >
> > > > > I had (naively) hoped the read/vmsplice/splice combination would
> > > > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > > > on my setup. Do any of my numbers seem bogus, so I should look
> > > > > harder at my test program?
> > > >
> > > > Could be read-ahead playing in here, I'd have to take a closer look at
> > > > the generated io patterns to say more about that. Any chance you can
> > > > capture iostat or blktrace info for such a run to compare that goes to
> > > > the disk?
> > >
> > > I've attached a file with iostat and vmstat results for the case
> > > where I read from a socket and write a file, vs. the case where I
> > > read from a socket and use vmsplice/splice to write the file.
> > > (Sorry it's not inline - my mailer locks up when I try to
> > > include the file.)
> > >
> > > Would you still like blktrace info for these two cases?
> >
> > No, I think the iostat data is fine, I don't think the blktrace info
> > would give me any more insight on this problem. I'll set up a test to
> > reproduce it here, looks like the write out path could be optimized some
> > more.

Great, let me know if you need testing from me.

>
> While I get that setup, can you repeat your testing without using
> SPLICE_F_MORE (you don't really use that flag correctly, but it does not
> matter for your case afaict) and SPLICE_F_MOVE?

Done. Removing these flags from any call to splice or vmsplice made
no difference to throughput for me, either when sending or receiving
a file.

> The latter will cost
> some CPU, but vmsplice/splice for network receive to a file is not
> really optimal in the first place. When we get splice() from socket fd
> support that'll improve, right now you are doing the best you can.
>
Thanks for reviewing my test program.

2006-11-21 13:55:08

by Jens Axboe

Subject: Re: splice/vmsplice performance test results

On Mon, Nov 20 2006, Jim Schutt wrote:
> On Mon, 2006-11-20 at 09:24 +0100, Jens Axboe wrote:
> > On Mon, Nov 20 2006, Jens Axboe wrote:
> > > On Fri, Nov 17 2006, Jim Schutt wrote:
> > > > On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > > > > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > > > > Hi,
> > > > > >
> > > >
> > > > > > My test program can do one of the following:
> > > > > >
> > > > > > send data:
> > > > > > A) read() from file into buffer, write() buffer into socket
> > > > > > B) mmap() section of file, write() that into socket, munmap()
> > > > > > C) splice() from file to pipe, splice() from pipe to socket
> > > > > >
> > > > > > receive data:
> > > > > > 1) read() from socket into buffer, write() buffer into file
> > > > > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > > > > from socket into new extent, munmap()
> > > > > > 3) read() from socket into buffer, vmsplice() buffer to
> > > > > > pipe, splice() pipe to file (using the double-buffer trick)
> > > > > >
> > > > > > Here's the results, using:
> > > > > > - 64 KiB buffer, mmap extent, or splice
> > > > > > - 1 MiB TCP window
> > > > > > - 16 GiB data sent across network
> > > > > >
> > > > > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > > > > >
> > > > > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > > > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > > > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > > > > >
> > > > > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > > > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > > > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > > > > >
> > > > > > I had (naively) hoped the read/vmsplice/splice combination would
> > > > > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > > > > on my setup. Do any of my numbers seem bogus, so I should look
> > > > > > harder at my test program?
> > > > >
> > > > > Could be read-ahead playing in here, I'd have to take a closer look at
> > > > > the generated io patterns to say more about that. Any chance you can
> > > > > capture iostat or blktrace info for such a run to compare that goes to
> > > > > the disk?
> > > >
> > > > I've attached a file with iostat and vmstat results for the case
> > > > where I read from a socket and write a file, vs. the case where I
> > > > read from a socket and use vmsplice/splice to write the file.
> > > > (Sorry it's not inline - my mailer locks up when I try to
> > > > include the file.)
> > > >
> > > > Would you still like blktrace info for these two cases?
> > >
> > > No, I think the iostat data is fine, I don't think the blktrace info
> > > would give me any more insight on this problem. I'll set up a test to
> > > reproduce it here, looks like the write out path could be optimized some
> > > more.
>
> Great, let me know if you need testing from me.

I found some suboptimal behaviour in your test app - you don't check for
short reads and splice would really like things to be aligned for the
best performance. I did some testing with the original app here, and I
get 114.769MB/s for read-from-socket -> write-to-file and 109.878MB/s
for read-from-socket -> vmsplice-splice-to-file. If I fix up the read to
always get the full buffer size before doing the vmsplice+splice, the
performance is up to the same as the read/write.
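
Something like the following sketch is what I mean by fixing up the read
(read_full is a made-up name; sd and buf/buf_size mirror your test
program): keep reading until the buffer is completely full, or EOF, before
handing it to vmsplice()+splice(), so every chunk stays buffer-size
aligned:

#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

/* Fill 'buf' completely from the socket before splicing it; a short
 * return only happens at EOF.  Returns bytes read, or -1 on error. */
static ssize_t read_full(int sd, char *buf, size_t buf_size)
{
	size_t got = 0;

	while (got < buf_size) {
		ssize_t n = read(sd, buf + got, buf_size - got);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;		/* EOF: return the partial tail */
		got += n;
	}
	return got;
}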

Since it's doing buffered writes, the results do vary a lot though (as
you also indicated). A raw /dev/zero -> /dev/null is 3 times faster with
vmsplice/splice.

--
Jens Axboe

2006-11-21 19:18:01

by Jim Schutt

Subject: Re: splice/vmsplice performance test results

On Tue, 2006-11-21 at 14:54 +0100, Jens Axboe wrote:
> On Mon, Nov 20 2006, Jim Schutt wrote:
> > On Mon, 2006-11-20 at 09:24 +0100, Jens Axboe wrote:
> > > On Mon, Nov 20 2006, Jens Axboe wrote:
> > > > On Fri, Nov 17 2006, Jim Schutt wrote:
> > > > > On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > > > > > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > > > > > Hi,
> > > > > > >
> > > > >
> > > > > > > My test program can do one of the following:
> > > > > > >
> > > > > > > send data:
> > > > > > > A) read() from file into buffer, write() buffer into socket
> > > > > > > B) mmap() section of file, write() that into socket, munmap()
> > > > > > > C) splice() from file to pipe, splice() from pipe to socket
> > > > > > >
> > > > > > > receive data:
> > > > > > > 1) read() from socket into buffer, write() buffer into file
> > > > > > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > > > > > from socket into new extent, munmap()
> > > > > > > 3) read() from socket into buffer, vmsplice() buffer to
> > > > > > > pipe, splice() pipe to file (using the double-buffer trick)
> > > > > > >
> > > > > > > Here's the results, using:
> > > > > > > - 64 KiB buffer, mmap extent, or splice
> > > > > > > - 1 MiB TCP window
> > > > > > > - 16 GiB data sent across network
> > > > > > >
> > > > > > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > > > > > >
> > > > > > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > > > > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > > > > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > > > > > >
> > > > > > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > > > > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > > > > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > > > > > >
> > > > > > > I had (naively) hoped the read/vmsplice/splice combination would
> > > > > > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > > > > > on my setup. Do any of my numbers seem bogus, so I should look
> > > > > > > harder at my test program?
> > > > > >
> > > > > > Could be read-ahead playing in here, I'd have to take a closer look at
> > > > > > the generated io patterns to say more about that. Any chance you can
> > > > > > capture iostat or blktrace info for such a run to compare that goes to
> > > > > > the disk?
> > > > >
> > > > > I've attached a file with iostat and vmstat results for the case
> > > > > where I read from a socket and write a file, vs. the case where I
> > > > > read from a socket and use vmsplice/splice to write the file.
> > > > > (Sorry it's not inline - my mailer locks up when I try to
> > > > > include the file.)
> > > > >
> > > > > Would you still like blktrace info for these two cases?
> > > >
> > > > No, I think the iostat data is fine, I don't think the blktrace info
> > > > would give me any more insight on this problem. I'll set up a test to
> > > > reproduce it here, looks like the write out path could be optimized some
> > > > more.
> >
> > Great, let me know if you need testing from me.
>
> I found some suboptimal behaviour in your test app - you don't check for
> short reads and splice would really like things to be aligned for the
> best performance. I did some testing with the original app here, and I
> get 114.769MB/s for read-from-socket -> write-to-file and 109.878MB/s
> for read-from-socket -> vmsplice-splice-to-file. If I fix up the read to
> always get the full buffer size before doing the vmsplice+splice, the
> performance is up to the same as the read/write.

Sorry - I had assumed my network was so much faster than my
disk subsystem I'd never get a short read from a socket except at
the end of the transfer. Pretty silly of me, in hindsight.

I can see now how even one short read early would screw up
the alignment for splicing into a file for the rest of the
transfer, right?

Here's some new results:

Run w/check for short read on socket in vmsplice case:
- /dev/zero -> /dev/null w/ socket read + file write: 1130 MB/s
(Man, my network is running fast today. I don't know why.)
- /dev/zero -> /dev/null w/ socket read + vmsplice/splice: 1028 MB/s
- /dev/zero -> file w/ socket read + vmsplice/splice: 336 MB/s

Rerun w/original:
- /dev/zero -> /dev/null w/ socket read + vmsplice/splice: 1026 MB/s
- /dev/zero -> file w/ socket read + vmsplice/splice: 285 MB/s
- /dev/zero -> file w/ socket read + file write: 382 MB/s

So I was losing 50 MB/s due to short reads on the socket
screwing up the alignment for splice. Sorry to waste your
time on that.

But, it looks like socket-read + file-write is still ~50 MB/s
faster than socket-read + vmsplice/splice (assuming I didn't
screw up my short read fix - see patch below). I assume that's
still unexpected?

>
> Since it's doing buffered writes, the results do vary a lot though (as
> you also indicated). A raw /dev/zero -> /dev/null is 3 times faster with
> vmsplice/splice.
>

Hmmm. Is it worth me trying to do some sort of kernel
profiling to see if there is anything unexpected with
my setup? If so, do you have a preference as to what
I would use?

Here's how I fixed my app to handle (I think) short reads.
Maybe I missed your point?

diff --git a/src/dnd.c b/src/dnd.c
index 01bd7b8..aa70102 100644
--- a/src/dnd.c
+++ b/src/dnd.c
@@ -773,18 +773,26 @@ uint64_t vmsplice_recv(const struct opti
again:
i = (i + 1) & 1;
iov.iov_base = opts->buf + i * opts->buf_size;
+ l = 0;

again2:
- l = read(sd, iov.iov_base, opts->buf_size);
- if (l < 0) {
+ m = read(sd, iov.iov_base + l, opts->buf_size - l);
+ if (m < 0) {
if (errno == EINTR)
goto again2;
perror("Read");
exit(EXIT_FAILURE);
}
- if (l == 0) {
- fdatasync(fd);
- return bytes;
+ if (m == 0) {
+ if (l == 0) {
+ fdatasync(fd);
+ return bytes;
+ }
+ }
+ else {
+ l += m;
+ if (l != opts->buf_size)
+ goto again2;
}

while (l) {
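
For completeness, here is a rough, untested sketch of what the whole
receive path looks like with that fix folded in: read a full buffer
from the socket, vmsplice() it into a pipe, splice() the pipe into the
file, and alternate between two buffer halves (the double-buffer trick)
so the kernel can still reference the previously spliced pages. The
names below are illustrative only, not the actual test-app code:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024)

/* Read exactly 'len' bytes unless EOF is hit first; returns bytes read. */
static ssize_t read_full(int fd, char *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t r = read(fd, buf + got, len - got);

        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break;              /* EOF */
        got += r;
    }
    return got;
}

/* Receive from socket 'sd' into file 'fd'; 'buf' holds 2 * BUF_SIZE bytes. */
static int splice_receive(int sd, int fd, char *buf)
{
    int pfd[2], idx = 0;

    if (pipe(pfd) < 0)
        return -1;

    for (;;) {
        struct iovec iov;
        ssize_t n, left;

        iov.iov_base = buf + idx * BUF_SIZE;
        n = read_full(sd, iov.iov_base, BUF_SIZE);
        if (n < 0)
            return -1;
        if (n == 0)
            break;              /* socket EOF */
        iov.iov_len = n;
        left = n;

        /* Map the just-filled user pages into the pipe... */
        while (iov.iov_len) {
            ssize_t v = vmsplice(pfd[1], &iov, 1, 0);

            if (v < 0)
                return -1;
            iov.iov_base = (char *)iov.iov_base + v;
            iov.iov_len -= v;
        }
        /* ...and move them from the pipe into the file. */
        while (left) {
            ssize_t s = splice(pfd[0], NULL, fd, NULL, left,
                               SPLICE_F_MOVE);

            if (s < 0)
                return -1;
            left -= s;
        }
        idx ^= 1;               /* switch to the other buffer half */
    }
    fdatasync(fd);
    return 0;
}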



2006-11-22 08:57:30

by Jens Axboe

[permalink] [raw]
Subject: Re: splice/vmsplice performance test results

On Tue, Nov 21 2006, Jim Schutt wrote:
> On Tue, 2006-11-21 at 14:54 +0100, Jens Axboe wrote:
> > On Mon, Nov 20 2006, Jim Schutt wrote:
> > > On Mon, 2006-11-20 at 09:24 +0100, Jens Axboe wrote:
> > > > On Mon, Nov 20 2006, Jens Axboe wrote:
> > > > > On Fri, Nov 17 2006, Jim Schutt wrote:
> > > > > > On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> > > > > > > On Thu, Nov 16 2006, Jim Schutt wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > >
> > > > > > > > My test program can do one of the following:
> > > > > > > >
> > > > > > > > send data:
> > > > > > > > A) read() from file into buffer, write() buffer into socket
> > > > > > > > B) mmap() section of file, write() that into socket, munmap()
> > > > > > > > C) splice() from file to pipe, splice() from pipe to socket
> > > > > > > >
> > > > > > > > receive data:
> > > > > > > > 1) read() from socket into buffer, write() buffer into file
> > > > > > > > 2) ftruncate() to extend file, mmap() new extent, read()
> > > > > > > > from socket into new extent, munmap()
> > > > > > > > 3) read() from socket into buffer, vmsplice() buffer to
> > > > > > > > pipe, splice() pipe to file (using the double-buffer trick)
> > > > > > > >
> > > > > > > > Here's the results, using:
> > > > > > > > - 64 KiB buffer, mmap extent, or splice
> > > > > > > > - 1 MiB TCP window
> > > > > > > > - 16 GiB data sent across network
> > > > > > > >
> > > > > > > > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > > > > > > >
> > > > > > > > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > > > > > > > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > > > > > > > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > > > > > > >
> > > > > > > > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > > > > > > > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > > > > > > > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> > > > > > > >
> > > > > > > > I had (naively) hoped the read/vmsplice/splice combination would
> > > > > > > > run at the same speed I can write a file, i.e. at about 450 MB/s
> > > > > > > > on my setup. Do any of my numbers seem bogus, so I should look
> > > > > > > > harder at my test program?
> > > > > > >
> > > > > > > Could be read-ahead playing in here, I'd have to take a closer look at
> > > > > > > the generated io patterns to say more about that. Any chance you can
> > > > > > > capture iostat or blktrace info for such a run to compare that goes to
> > > > > > > the disk?
> > > > > >
> > > > > > I've attached a file with iostat and vmstat results for the case
> > > > > > where I read from a socket and write a file, vs. the case where I
> > > > > > read from a socket and use vmsplice/splice to write the file.
> > > > > > (Sorry it's not inline - my mailer locks up when I try to
> > > > > > include the file.)
> > > > > >
> > > > > > Would you still like blktrace info for these two cases?
> > > > >
> > > > > No, I think the iostat data is fine, I don't think the blktrace info
> > > > > would give me any more insight on this problem. I'll set up a test to
> > > > > reproduce it here, looks like the write out path could be optimized some
> > > > > more.
> > >
> > > Great, let me know if you need testing from me.
> >
> > I found some suboptimal behaviour in your test app - you don't check for
> > short reads and splice would really like things to be aligned for the
> > best performance. I did some testing with the original app here, and I
> > get 114.769MB/s for read-from-socket -> write-to-file and 109.878MB/s
> > for read-from-socket -> vmsplice-splice-to-file. If I fix up the read to
> > always get the full buffer size before doing the vmsplice+splice, the
> > performance is up to the same as the read/write.
>
> Sorry - I had assumed my network was so much faster than my
> disk subsystem I'd never get a short read from a socket except at
> the end of the transfer. Pretty silly of me, in hindsight.

That may be true if you were doing sync io, but these tests don't.
Since the writes are buffered, they can be queued much quicker than you
can pull the data off the net, so short reads on the socket are to be
expected.

> I can see now how even one short read early would screw up
> the alignment for splicing into a file for the rest of the
> transfer, right?

Yep, it would hurt a lot. As your numbers show :-)

> Here's some new results:
>
> Run w/check for short read on socket in vmsplice case:
> - /dev/zero -> /dev/null w/ socket read + file write: 1130 MB/s
> (Man, my network is running fast today. I don't know why.)
> - /dev/zero -> /dev/null w/ socket read + vmsplice/splice: 1028 MB/s
> - /dev/zero -> file w/ socket read + vmsplice/splice: 336 MB/s
>
> Rerun w/original:
> - /dev/zero -> /dev/null w/ socket read + vmsplice/splice: 1026 MB/s
> - /dev/zero -> file w/ socket read + vmsplice/splice: 285 MB/s
> - /dev/zero -> file w/ socket read + file write: 382 MB/s
>
> So I was losing 50 MB/s due to short reads on the socket
> screwing up the alignment for splice. Sorry to waste your
> time on that.
>
> But, it looks like socket-read + file-write is still ~50 MB/s
> faster than socket-read + vmsplice/splice (assuming I didn't
> screw up my short read fix - see patch below). I assume that's
> still unexpected?

Maybe unexpected is not the right word, but it certainly would be nice
if it were faster. As I originally mentioned, your test case of socket
read + vmsplice + splice is not the ideal way to pull data off the
network, but it's the best we have until splice-from-socket works. Your
zero -> null transfer doesn't seem to take a large hit because of that,
so it's still very surprising that we lose 50MB/s on real io
performance.

> > Since it's doing buffered writes, the results do vary a lot though (as
> > you also indicated). A raw /dev/zero -> /dev/null is 3 times faster with
> > vmsplice/splice.
> >
>
> Hmmm. Is it worth me trying to do some sort of kernel
> profiling to see if there is anything unexpected with
> my setup? If so, do you have a preference as to what
> I would use?

Not sure that profiling would be that interesting, as the problem
probably lies in where we are _not_ spending the time. But it certainly
can't hurt. Try to oprofile the kernel for a 10-20 sec interval while
the test is running. Do 3 such runs for the two test cases
(write-to-file, vmsplice/splice-to-file).

See the bottom of Documentation/basic_profiling.txt for how to profile
the kernel easily.

> Here's how I changed my app to fix up (I think) short reads.
> Maybe I missed your point?

That looks about right.

--
Jens Axboe

2006-11-22 22:35:27

by Jim Schutt

[permalink] [raw]
Subject: Re: splice/vmsplice performance test results


On Wed, 2006-11-22 at 09:57 +0100, Jens Axboe wrote:
> On Tue, Nov 21 2006, Jim Schutt wrote:
[snip]
> >
> > Hmmm. Is it worth me trying to do some sort of kernel
> > profiling to see if there is anything unexpected with
> > my setup? If so, do you have a preference as to what
> > I would use?
>
> Not sure that profiling would be that interesting, as the problem
> probably lies in where we are _not_ spending the time. But it certainly
> can't hurt. Try to oprofile the kernel for a 10-20 sec interval while
> the test is running. Do 3 such runs for the two test cases
> (write-to-file, vmsplice/splice-to-file).
>

OK, I've attached results for 20 second profiles of three
runs of each test: read-from-socket + write-to-file, and
read-from-socket + vmsplice/splice-to-file.

The test case and throughput are in the name: e.g. rvs-1-306MBps
is trial 1 of the read/vmsplice/splice case, which ran at 306 MB/s.

Let me know if I can help with more testing, and thanks
again for looking into this.

-- Jim Schutt


Attachments:
oprofile-rvs-1-306MBps.txt.bz2 (4.40 kB)
oprofile-rvs-2-324MBps.txt.bz2 (4.41 kB)
oprofile-rvs-3-314MBps.txt.bz2 (4.49 kB)
oprofile-rw-1-334MBps.txt.bz2 (4.44 kB)
oprofile-rw-2-348MBps.txt.bz2 (4.57 kB)
oprofile-rw-3-361MBps.txt.bz2 (4.54 kB)

2006-11-23 11:24:35

by Jens Axboe

[permalink] [raw]
Subject: Re: splice/vmsplice performance test results

On Wed, Nov 22 2006, Jim Schutt wrote:
>
> On Wed, 2006-11-22 at 09:57 +0100, Jens Axboe wrote:
> > On Tue, Nov 21 2006, Jim Schutt wrote:
> [snip]
> > >
> > > Hmmm. Is it worth me trying to do some sort of kernel
> > > profiling to see if there is anything unexpected with
> > > my setup? If so, do you have a preference as to what
> > > I would use?
> >
> > Not sure that profiling would be that interesting, as the problem
> > probably lies in where we are _not_ spending the time. But it certainly
> > can't hurt. Try to oprofile the kernel for a 10-20 sec interval while
> > the test is running. Do 3 such runs for the two test cases
> > (write-to-file, vmsplice/splice-to-file).
> >
>
> OK, I've attached results for 20 second profiles of three
> runs of each test: read-from-socket + write-to-file, and
> read-from-socket + vmsplice/splice-to-file.
>
> The test case and throughput are in the name: e.g. rvs-1-306MBps
> is trial 1 of the read/vmsplice/splice case, which ran at 306 MB/s.
>
> Let me know if I can help with more testing, and thanks
> again for looking into this.

As I suspected, nothing sticks out in these logs, as the problem here
is not due to a maxed-out system. The traces look fairly identical,
apart from less time spent in copy_user with the splice approach.

Comparing the generic_file_buffered_write() and splice-to-file path,
there really isn't a whole lot of difference. It would be interesting to
try and eliminate some of the differences between the two approaches -
could you try and change the vmsplice to a write-to-pipe instead? And
add SPLICE_F_NONBLOCK to the splice-to-file as well. Basically I'm
interested in something that only really tests splice-to-file vs
write-to-file. Perhaps it's easier if you just run fio; I'm inlining a
job file that tests this specifically.

; -- start job file

[global]
bs=64k
rw=write
overwrite=0
size=16g
end_fsync=1
direct=0
unlink

[write]
ioengine=sync

[splice]
stonewall
ioengine=splice

; -- end job file

You can grab a fio snapshot here:

http://brick.kernel.dk/snaps/fio-git-20061123122325.tar.gz

You probably want to run that a few times to see how stable the results
are; buffered io is always a little problematic from a consistency point
of view in benchmark results.
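
For reference, a rough and untested sketch of the write-to-pipe +
splice-to-file step described above (all names here are illustrative,
not taken from the test app) might look like this:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Push 'len' bytes from 'buf' through pipe 'pfd' into file 'fd'. */
static int pipe_splice_write(int pfd[2], int fd, const char *buf, size_t len)
{
    while (len) {
        /* Don't write more than the default pipe size (64 KiB), or the
         * write would block with nobody draining the pipe yet. */
        size_t chunk = len < 65536 ? len : 65536;
        /* An ordinary copying write into the pipe... */
        ssize_t w = write(pfd[1], buf, chunk);

        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += w;
        len -= w;

        /* ...then move those pipe pages into the file. */
        while (w) {
            ssize_t s = splice(pfd[0], NULL, fd, NULL, w,
                               SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

            if (s < 0) {
                if (errno == EINTR || errno == EAGAIN)
                    continue;
                return -1;
            }
            w -= s;
        }
    }
    return 0;
}

The pipe would be created once with pipe(pfd), and this would replace
the vmsplice+splice step in the receiver, so the only thing that
differs from the plain read/write version is whether the last hop into
the file is a write() or a splice().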

--
Jens Axboe

2006-11-27 20:57:18

by Jim Schutt

[permalink] [raw]
Subject: Re: splice/vmsplice performance test results

On Thu, 2006-11-23 at 12:24 +0100, Jens Axboe wrote:
> On Wed, Nov 22 2006, Jim Schutt wrote:
> >
> > On Wed, 2006-11-22 at 09:57 +0100, Jens Axboe wrote:
> > > On Tue, Nov 21 2006, Jim Schutt wrote:
> > [snip]
> > > >
> > > > Hmmm. Is it worth me trying to do some sort of kernel
> > > > profiling to see if there is anything unexpected with
> > > > my setup? If so, do you have a preference as to what
> > > > I would use?
> > >
> > > Not sure that profiling would be that interesting, as the problem
> > > probably lies in where we are _not_ spending the time. But it certainly
> > > can't hurt. Try to oprofile the kernel for a 10-20 sec interval while
> > > the test is running. Do 3 such runs for the two test cases
> > > (write-to-file, vmsplice/splice-to-file).
> > >
> >
> > OK, I've attached results for 20 second profiles of three
> > runs of each test: read-from-socket + write-to-file, and
> > read-from-socket + vmsplice/splice-to-file.
> >
> > The test case and throughput is in the name: e.g. rvs-1-306MBps
> > is trial 1 of read/vmsplice/splice case, which ran at 306 MB/s.
> >
> > Let me know if I can help with more testing, and thanks
> > again for looking into this.
>
> As I suspected, nothing sticks out in these logs, as the problem here
> is not due to a maxed-out system. The traces look fairly identical,
> apart from less time spent in copy_user with the splice approach.
>
> Comparing the generic_file_buffered_write() and splice-to-file path,
> there really isn't a whole lot of difference. It would be interesting to
> try and eliminate some of the differences between the two approaches -
> could you try and change the vmsplice to a write-to-pipe instead? And
> add SPLICE_F_NONBLOCK to the splice-to-file as well. Basically I'm
> interested in something that only really tests splice-to-file vs
> write-to-file. Perhaps it's easier if you just run fio; I'm inlining a
> job file that tests this specifically.
>

Sorry for the delayed reply.

Here's results from three fio runs, using the job file
you gave me, and fio-git-20061124142507.tar.gz:

---
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (groupid=0): err= 0:
write: io= 16384MiB, bw=558404KiB/s, runt= 30766msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 35, avg= 0.00, dev= 0.08
bw (KiB/s) : min= 0, max=644481, per=98.90%, avg=552243.95,
dev=561417.03
cpu : usr=0.91%, sys=85.03%, ctx=14121
splice: (groupid=1): err= 0:
write: io= 16384MiB, bw=486144KiB/s, runt= 35339msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 30, avg= 0.00, dev= 0.10
bw (KiB/s) : min= 0, max=555745, per=99.06%, avg=481567.28,
dev=488565.60
cpu : usr=0.85%, sys=88.87%, ctx=12956

Run status group 0 (all jobs):
WRITE: io=16384MiB, aggrb=558404, minb=558404, maxb=558404,
mint=30766msec, maxt=30766msec

Run status group 1 (all jobs):
WRITE: io=16384MiB, aggrb=486144, minb=486144, maxb=486144,
mint=35339msec, maxt=35339msec

Disk stats (read/write):
md0: ios=20/270938, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
---
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (groupid=0): err= 0:
write: io= 16384MiB, bw=547234KiB/s, runt= 31394msec
slat (msec): min= 0, max= 1, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 57, avg= 0.00, dev= 0.14
bw (KiB/s) : min= 0, max=662568, per=98.94%, avg=541406.71,
dev=550958.67
cpu : usr=0.79%, sys=82.80%, ctx=16560
splice: (groupid=1): err= 0:
write: io= 16384MiB, bw=473313KiB/s, runt= 36297msec
slat (msec): min= 0, max= 1, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 27, avg= 0.00, dev= 0.11
bw (KiB/s) : min= 0, max=562298, per=99.12%, avg=469142.21,
dev=476426.36
cpu : usr=1.05%, sys=84.78%, ctx=16043

Run status group 0 (all jobs):
WRITE: io=16384MiB, aggrb=547234, minb=547234, maxb=547234,
mint=31394msec, maxt=31394msec

Run status group 1 (all jobs):
WRITE: io=16384MiB, aggrb=473313, minb=473313, maxb=473313,
mint=36297msec, maxt=36297msec

Disk stats (read/write):
md0: ios=17/270784, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
---
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (groupid=0): err= 0:
write: io= 16384MiB, bw=561140KiB/s, runt= 30616msec
slat (msec): min= 0, max= 1, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 26, avg= 0.00, dev= 0.08
bw (KiB/s) : min= 0, max=665976, per=98.94%, avg=555198.98,
dev=564632.52
cpu : usr=0.82%, sys=85.12%, ctx=14287
splice: (groupid=1): err= 0:
write: io= 16384MiB, bw=487192KiB/s, runt= 35263msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 29, avg= 0.00, dev= 0.09
bw (KiB/s) : min= 0, max=566099, per=99.08%, avg=482706.32,
dev=489726.36
cpu : usr=0.85%, sys=86.96%, ctx=13072

Run status group 0 (all jobs):
WRITE: io=16384MiB, aggrb=561140, minb=561140, maxb=561140,
mint=30616msec, maxt=30616msec

Run status group 1 (all jobs):
WRITE: io=16384MiB, aggrb=487192, minb=487192, maxb=487192,
mint=35263msec, maxt=35263msec

Disk stats (read/write):
md0: ios=18/270851, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
---

Do these results give you something to work with, or would
you like me to run some other tests?

Let me know.

Thanks -- Jim Schutt