2006-09-20 20:59:14

by Ashwini Kulkarni

Subject: [RFC 0/6] TCP socket splice


My name is Ashwini Kulkarni and I have been working at Intel Corporation for
the past 4 months as an engineering intern. I have been working on the 'TCP
socket splice' project with Chris Leech. This is a work-in-progress version
of the project with scope for further modifications.

TCP socket splicing:
It allows a TCP socket to be spliced to a file via a pipe buffer. First, to
splice data from a socket to a pipe buffer, up to 16 source pages are pulled
into the pipe buffer. Then, to splice data from the pipe buffer to a file,
those pages are migrated into the address space of the target file. The
transfer takes place entirely within the kernel and thus involves zero memory
copies. It is the receive-side complement to sendfile(), but unlike sendfile()
it makes it possible to splice from a socket, not just to one.

Current Method:
               +-----> Application Buffer -----+
               |                               |
 ______________|_______________________________|______________
               |                               |
   Receive or  |                               |  Write
   I/OAT DMA   |                               |
               |                               |
               |                               V
            Network                       File System
            Buffer                          Buffer
               ^                               |
               |                               |
 ______________|_______________________________|______________
          DMA  |                               |  DMA
               |                               |
    Hardware   |                               |
               |                               V
              NIC                            SATA

In the current method, the packet is DMA'd from the NIC into the network
buffer. A read on the socket then copies the packet data from the network
buffer into the application buffer in user space. A write operation moves the
data from the application buffer to the file system buffer, which is then
DMA'd to the disk. Thus the current method makes one full copy of all the
data through user space.

Using TCP socket splice:

      Application Control
               |
 ______________|______________________________________________
               |
               |        TCP socket splice
               |      +---------------------+
               |      |     Direct path     |
               V      |                     V
            Network                    File System
            Buffer                       Buffer
               ^                            |
               |                            |
 ______________|____________________________|_________________
          DMA  |                            |  DMA
               |                            |
    Hardware   |                            |
               |                            V
              NIC                         SATA

In this method, the objective is to use TCP socket splicing to create a direct
path in the kernel from the network buffer to the file system buffer via a pipe
buffer. The pages migrate from the network buffer (which is associated with the
socket) into the pipe buffer, and from the pipe buffer into the page cache of
the output file's address space. This makes it possible to build a
LAN-to-file-system API that avoids the memcpy operations in user space,
creating a fast path from the network buffer to the storage buffer.

Open Issues (currently being addressed):
There is a performance drop when transferring larger files (usually larger than
65536 bytes); the drop grows with the size of the file. Work is in progress to
identify the source of this issue.

We encourage the community to review our TCP socket splice project. Feedback
would be greatly appreciated.

--
Ashwini Kulkarni


2006-09-20 20:59:18

by Ashwini Kulkarni

Subject: [RFC 1/6] Make splice_to_pipe non-static and move structure definitions to a header file


---

fs/splice.c | 18 +-----------------
include/linux/pipe_fs_i.h | 18 ++++++++++++++++++
2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 684bca3..c6a880b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,22 +29,6 @@
#include <linux/syscalls.h>
#include <linux/uio.h>

-struct partial_page {
- unsigned int offset;
- unsigned int len;
-};
-
-/*
- * Passed to splice_to_pipe
- */
-struct splice_pipe_desc {
- struct page **pages; /* page map */
- struct partial_page *partial; /* pages[] may not be contig */
- int nr_pages; /* number of pages in map */
- unsigned int flags; /* splice flags */
- struct pipe_buf_operations *ops;/* ops associated with output pipe */
-};
-
/*
* Attempt to steal a page from a pipe buffer. This should perhaps go into
* a vm helper function, it's already simplified quite a bit by the
@@ -173,7 +157,7 @@ static struct pipe_buf_operations user_p
* Pipe output worker. This sets up our pipe format with the page cache
* pipe buffer operations. Otherwise very similar to the regular pipe_writev().
*/
-static ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
+ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
struct splice_pipe_desc *spd)
{
int ret, do_wakeup, page_nr;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index ea4f7cd..9067985 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -100,4 +100,22 @@ extern ssize_t splice_from_pipe(struct p
loff_t *, size_t, unsigned int,
splice_actor *);

+struct partial_page {
+ unsigned int offset;
+ unsigned int len;
+};
+
+/*
+ * Passed to splice_to_pipe
+ */
+struct splice_pipe_desc {
+ struct page **pages; /* page map */
+ struct partial_page *partial; /* pages[] may not be contig */
+ int nr_pages; /* number of pages in map */
+ unsigned int flags; /* splice flags */
+ struct pipe_buf_operations *ops;/* ops associated with output pipe */
+};
+
+ssize_t splice_to_pipe(struct pipe_inode_info *, struct splice_pipe_desc *);
+
#endif

2006-09-20 20:59:44

by Ashwini Kulkarni

Subject: [RFC 3/6] Add in TCP related part of splice read to ipv4


---

net/ipv4/af_inet.c | 1
net/ipv4/tcp.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 136 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c84a320..3c0d245 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -807,6 +807,7 @@ const struct proto_ops inet_stream_ops =
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
.sendpage = tcp_sendpage,
+ .splice_read = tcp_splice_read,
#ifdef CONFIG_COMPAT
.compat_setsockopt = compat_sock_common_setsockopt,
.compat_getsockopt = compat_sock_common_getsockopt,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..d4c02a1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -254,6 +254,10 @@
#include <linux/init.h>
#include <linux/smp_lock.h>
#include <linux/fs.h>
+#include <linux/skbuff.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/net.h>
+#include <linux/socket.h>
#include <linux/random.h>
#include <linux/bootmem.h>
#include <linux/cache.h>
@@ -264,6 +268,7 @@
#include <net/xfrm.h>
#include <net/ip.h>
#include <net/netdma.h>
+#include <net/sock.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -291,6 +296,23 @@ EXPORT_SYMBOL(tcp_memory_allocated);
EXPORT_SYMBOL(tcp_sockets_allocated);

/*
+ * Create a TCP splice context.
+ */
+struct tcp_splice_state {
+ struct pipe_inode_info *pipe;
+ void (*original_data_ready)(struct sock*, int);
+ size_t len;
+ size_t offset;
+ unsigned int flags;
+};
+
+int __tcp_splice_read(struct sock *sk, loff_t *ppos, struct pipe_inode_info *pipe,
+ size_t len, unsigned int flags, struct tcp_splice_state *tss);
+int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
+ unsigned int offset, size_t len);
+void tcp_splice_data_ready(struct sock *sk, int flag);
+
+/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* All the sk_stream_mem_schedule() is of this nature: accounting
@@ -499,6 +521,118 @@ static inline void tcp_push(struct sock
}
}

+/*
+ * tcp_splice_read - splice data from TCP socket to a pipe
+ * @sock: socket to splice from
+ * @pipe: pipe to splice to
+ * @len: number of bytes to splice
+ * @flags: splice modifier flags
+ *
+ * Will read pages from given socket and fill them into a pipe.
+ */
+ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags)
+{
+ struct tcp_splice_state tss = {
+ .pipe = pipe,
+ .len = len,
+ .flags = flags,
+ };
+ struct sock *sk = sock->sk;
+ ssize_t spliced;
+ int ret;
+
+ ret = 0;
+ spliced = 0;
+
+ if (*ppos != 0)
+ return -EINVAL;
+
+ while(tss.len) {
+ ret = __tcp_splice_read(sk, ppos, tss.pipe, tss.len, tss.flags, &tss);
+
+ if(ret < 0)
+ break;
+ else if (!ret) {
+ if (spliced)
+ break;
+ if (flags & SPLICE_F_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+ }
+ tss.len -= ret;
+ spliced += ret;
+ }
+ if (spliced)
+ return spliced;
+
+ return ret;
+}
+
+int __tcp_splice_read(struct sock *sk, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags, struct tcp_splice_state *tss)
+{
+ read_descriptor_t rd_desc;
+ int copied;
+
+ tss->original_data_ready = sk->sk_data_ready;
+
+ sk->sk_user_data = tss;
+
+ /* Store TCP splice context information in read_descriptor_t. */
+ rd_desc.arg.data = tss;
+
+ copied = tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+
+ if (copied != 0) {
+ if (flags & SPLICE_F_MORE) {
+ /* Setup new sk_data_ready as tcp_splice_data_ready. */
+ sk->sk_data_ready = tcp_splice_data_ready;
+ return sk_wait_data(sk, &sk->sk_rcvtimeo);
+ }
+ else if(flags & SPLICE_F_NONBLOCK)
+ return -EAGAIN;
+ else return copied;
+ }
+ else
+ return copied;
+}
+
+int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
+{
+ /*
+ * Restore TCP splice context from read_descriptor_t
+ */
+ struct tcp_splice_state *tss = rd_desc->arg.data;
+
+ return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+}
+
+void tcp_splice_data_ready(struct sock *sk, int flag)
+{
+ /*
+ * Restore splice context/ read_descriptor_t from sk->sk_user_data
+ */
+ struct tcp_splice_state *tss = sk->sk_user_data;
+ read_descriptor_t rd_desc;
+
+ read_lock(&sk->sk_callback_lock);
+
+ rd_desc.arg.data = tss;
+ rd_desc.count = 1;
+ tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+
+ read_unlock(&sk->sk_callback_lock);
+
+ if(tss->len == 0) {
+ /* Restore original sk_data_ready callback. */
+ sk->sk_data_ready = tss->original_data_ready;
+ /* Wakeup user thread. */
+ return sock_def_wakeup(sk);
+ }
+ else
+ return;
+}
+
static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
size_t psize, int flags)
{
@@ -2345,6 +2479,7 @@ EXPORT_SYMBOL(tcp_poll);
EXPORT_SYMBOL(tcp_read_sock);
EXPORT_SYMBOL(tcp_recvmsg);
EXPORT_SYMBOL(tcp_sendmsg);
+EXPORT_SYMBOL(tcp_splice_read);
EXPORT_SYMBOL(tcp_sendpage);
EXPORT_SYMBOL(tcp_setsockopt);
EXPORT_SYMBOL(tcp_shutdown);
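The patch above saves the socket's sk_data_ready callback, installs
tcp_splice_data_ready() for the duration of a blocking splice, and restores the
original once the requested length is consumed. A stand-alone model of that
save/install/restore pattern (the structures here are illustrative stand-ins,
not kernel types):

```c
#include <stddef.h>

/* A socket with a replaceable data-ready callback, modeled with plain
 * function pointers; illustrative stand-ins for struct sock. */
struct fake_sock;
typedef void (*data_ready_fn)(struct fake_sock *sk);

struct fake_sock {
	data_ready_fn sk_data_ready;
	void *sk_user_data;
};

struct splice_state {
	data_ready_fn original_data_ready;	/* saved handler */
	size_t len;				/* chunks left to splice */
};

static void default_data_ready(struct fake_sock *sk)
{
	(void)sk;	/* the stock wakeup path; nothing to model here */
}

static void splice_data_ready(struct fake_sock *sk)
{
	struct splice_state *tss = sk->sk_user_data;

	if (tss->len > 0)
		tss->len--;	/* model consuming one chunk of data */
	if (tss->len == 0)	/* done: put the saved handler back */
		sk->sk_data_ready = tss->original_data_ready;
}

static void start_splice(struct fake_sock *sk, struct splice_state *tss,
			 size_t len)
{
	tss->original_data_ready = sk->sk_data_ready;	/* save */
	tss->len = len;
	sk->sk_user_data = tss;
	sk->sk_data_ready = splice_data_ready;		/* install */
}
```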

2006-09-20 20:59:49

by Ashwini Kulkarni

Subject: [RFC 5/6] Add skb_splice_bits to skbuff.c


---

include/linux/skbuff.h | 2 +
net/core/skbuff.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 139 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 755e9cd..8f4b90e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1338,6 +1338,8 @@ extern unsigned int skb_checksum(cons
int len, unsigned int csum);
extern int skb_copy_bits(const struct sk_buff *skb, int offset,
void *to, int len);
+extern int skb_splice_bits(const struct sk_buff *skb, int offset,
+ struct pipe_inode_info *pipe, int len, unsigned int flags);
extern int skb_store_bits(const struct sk_buff *skb, int offset,
void *from, int len);
extern unsigned int skb_copy_and_csum_bits(const struct sk_buff *skb,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c54f366..a92d165 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -53,6 +53,7 @@
#endif
#include <linux/string.h>
#include <linux/skbuff.h>
+#include <linux/pipe_fs_i.h>
#include <linux/cache.h>
#include <linux/rtnetlink.h>
#include <linux/init.h>
@@ -70,6 +71,17 @@
static kmem_cache_t *skbuff_head_cache __read_mostly;
static kmem_cache_t *skbuff_fclone_cache __read_mostly;

+/* Pipe buffer operations for a socket. */
+static struct pipe_buf_operations sock_buf_ops = {
+ .can_merge = 0,
+ .map = generic_pipe_buf_map,
+ .unmap = generic_pipe_buf_unmap,
+ .pin = generic_pipe_buf_pin,
+ .release = generic_sock_buf_release,
+ .steal = generic_pipe_buf_steal,
+ .get = generic_pipe_buf_get,
+};
+
/*
* Keep out-of-line to prevent kernel bloat.
* __builtin_return_address is not used because it is not always
@@ -1148,6 +1160,131 @@ fault:
return -EFAULT;
}

+/* Move specified number of bytes from the source skb to the
+ * destination pipe buffer. This function even handles all the
+ * bits of traversing fragment lists.
+ */
+int skb_splice_bits(const struct sk_buff *skb, int offset, struct pipe_inode_info *pipe, int len, unsigned int flags)
+{
+ struct page *page;
+ struct partial_page partial[PIPE_BUFFERS];
+ struct page *pages[PIPE_BUFFERS];
+ int buflen, available_len;
+ int pg_nr = 0;
+ int i, nfrags;
+ void *address;
+ size_t ret = 0;
+ struct splice_pipe_desc spd = {
+ .pages = pages,
+ .partial = partial,
+ .flags = flags,
+ .ops = &sock_buf_ops,
+ };
+
+ buflen = skb_headlen(skb);
+
+ if ((available_len = buflen - offset) >0) {
+ if (available_len > len)
+ available_len = len;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ address = kmap(page);
+ memcpy(address, skb->data + offset, available_len);
+ /* Push page into splice pipe desc. */
+ spd.pages[pg_nr] = page;
+ pg_nr++;
+ kunmap(page);
+
+ /* If entire length has been consumed or number of pages pushed into
+ * splice pipe desc(pipe buffer) equals 16, then call splice_to_pipe.
+ */
+ if (((len -= available_len) == 0) || pg_nr == PIPE_BUFFERS) {
+ spd.nr_pages = pg_nr;
+ offset += available_len;
+ ret = splice_to_pipe(pipe, &spd);
+ if (ret == -EPIPE)
+ return -EPIPE;
+ else if (ret == -EAGAIN)
+ return -EAGAIN;
+ else if (ret == -ERESTARTSYS)
+ return -ERESTARTSYS;
+ else goto frags;
+ }
+ }
+ frags:
+ if (skb_shinfo(skb)->nr_frags != 0) {
+ nfrags = skb_shinfo(skb)->nr_frags;
+
+ for (i = 0; i < nfrags; i++) {
+ int total;
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ get_page(skb_shinfo(skb)->frags[i].page);
+
+ total = buflen + skb_shinfo(skb)->frags[i].size;
+
+ if ((available_len = total - offset) > 0) {
+
+ if (available_len > len)
+ available_len = len;
+
+ spd.pages[pg_nr] = frag->page;
+ spd.partial[pg_nr].offset = frag->page_offset;
+ spd.partial[pg_nr].len = frag->size;
+ pg_nr++;
+
+ if (((len -= available_len) == 0) || pg_nr == PIPE_BUFFERS) {
+ spd.nr_pages = pg_nr;
+ ret = splice_to_pipe(pipe, &spd);
+ goto out;
+ }
+
+ offset += available_len;
+ }
+ buflen = total;
+ }
+ spd.nr_pages = pg_nr;
+ ret = splice_to_pipe(pipe, &spd);
+ }
+ out:
+ if (ret == -EPIPE)
+ return -EPIPE;
+ if (ret == -EAGAIN)
+ return -EAGAIN;
+ if (ret == -ERESTARTSYS)
+ return -ERESTARTSYS;
+
+ if (skb_shinfo(skb)->frag_list) {
+ struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+ for(; list; list = list->next) {
+ int total, more;
+
+ total = buflen + list->len;
+ if ((available_len = total - offset) > 0) {
+
+ if (available_len > len)
+ available_len = len;
+
+ more = skb_splice_bits(list, offset - buflen, pipe, available_len, flags);
+ if (more >= 0)
+ ret += more;
+ else
+ return -EFAULT;
+
+ if ((len -= available_len) == 0)
+ return ret;
+
+ offset += available_len;
+ }
+ buflen = total;
+ }
+ }
+ return ret;
+}
+
/**
* skb_store_bits - store bits from kernel buffer to skb
* @skb: destination buffer

2006-09-20 21:00:27

by Ashwini Kulkarni

Subject: [RFC 6/6] Move i_size_read part from do_splice_to() to __generic_file_splice_read() in splice.c


---

fs/splice.c | 18 ++++++++----------
1 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3a4202d..2f8f42a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -271,7 +271,7 @@ __generic_file_splice_read(struct file *
struct partial_page partial[PIPE_BUFFERS];
struct page *page;
pgoff_t index, end_index;
- loff_t isize;
+ loff_t isize, left;
size_t total_len;
int error, page_nr;
struct splice_pipe_desc spd = {
@@ -421,6 +421,13 @@ __generic_file_splice_read(struct file *
* i_size must be checked after ->readpage().
*/
isize = i_size_read(mapping->host);
+ if (unlikely(*ppos >= isize))
+ return 0;
+
+ left = isize - *ppos;
+ if (unlikely(left < len))
+ len = left;
+
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(!isize || index > end_index))
break;
@@ -903,7 +910,6 @@ static long do_splice_to(struct file *in
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
- loff_t isize, left;
int ret;

if (unlikely(!in->f_op || !in->f_op->splice_read))
@@ -916,14 +922,6 @@ static long do_splice_to(struct file *in
if (unlikely(ret < 0))
return ret;

- isize = i_size_read(in->f_mapping->host);
- if (unlikely(*ppos >= isize))
- return 0;
-
- left = isize - *ppos;
- if (unlikely(left < len))
- len = left;
-
return in->f_op->splice_read(in, ppos, pipe, len, flags);
}


2006-09-20 21:00:42

by Ashwini Kulkarni

Subject: [RFC 4/6] Add TCP socket splicing (tcp_splice_read) support


---

fs/splice.c | 16 ++++++++++++++++
include/linux/net.h | 2 ++
include/linux/pipe_fs_i.h | 1 +
include/net/tcp.h | 3 +++
net/socket.c | 13 +++++++++++++
5 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index c6a880b..3a4202d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -123,6 +123,12 @@ error:
return err;
}

+void generic_sock_buf_release(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf)
+{
+ put_page(buf->page);
+}
+
static struct pipe_buf_operations page_cache_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
@@ -133,6 +139,16 @@ static struct pipe_buf_operations page_c
.get = generic_pipe_buf_get,
};

+static struct pipe_buf_operations sock_buf_ops = {
+ .can_merge = 0,
+ .map = generic_pipe_buf_map,
+ .unmap = generic_pipe_buf_unmap,
+ .pin = generic_pipe_buf_pin,
+ .release = generic_sock_buf_release,
+ .steal = generic_pipe_buf_steal,
+ .get = generic_pipe_buf_get,
+};
+
static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
diff --git a/include/linux/net.h b/include/linux/net.h
index b20c53c..65dfe0c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -164,6 +164,8 @@ struct proto_ops {
struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
int offset, size_t size, int flags);
+ ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

struct net_proto_family {
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 9067985..f7f439b 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,6 +72,7 @@ void generic_pipe_buf_get(struct pipe_in
int generic_pipe_buf_pin(struct pipe_inode_info *, struct pipe_buffer *);
int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);

+void generic_sock_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
/*
* splice is tied to pipes as a transport (at least for now), so we'll just
* add the splice flags here.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..5032501 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -300,6 +300,9 @@ extern void tcp_cleanup_rbuf(struct so
extern int tcp_twsk_unique(struct sock *sk,
struct sock *sktw, void *twp);

+extern ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+
static inline void tcp_dec_quickack_mode(struct sock *sk,
const unsigned int pkts)
{
diff --git a/net/socket.c b/net/socket.c
index 6d261bf..8a4f602 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -117,6 +117,8 @@ static ssize_t sock_writev(struct file *
unsigned long count, loff_t *ppos);
static ssize_t sock_sendpage(struct file *file, struct page *page,
int offset, size_t size, loff_t *ppos, int more);
+static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len, unsigned int flags);

/*
* Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
@@ -141,6 +143,7 @@ static struct file_operations socket_fil
.writev = sock_writev,
.sendpage = sock_sendpage,
.splice_write = generic_splice_sendpage,
+ .splice_read = sock_splice_read,
};

/*
@@ -701,6 +704,16 @@ static ssize_t sock_sendpage(struct file
return sock->ops->sendpage(sock, page, offset, size, flags);
}

+static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len, unsigned int flags)
+{
+ struct socket *sock;
+
+ sock = file->private_data;
+
+ return sock->ops->splice_read(sock, ppos, pipe, len, flags);
+}
+
static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
char __user *ubuf, size_t size, struct sock_iocb *siocb)
{

2006-09-20 20:59:21

by Ashwini Kulkarni

Subject: [RFC 2/6] Make sock_def_wakeup non-static


---

include/net/sock.h | 1 +
net/core/sock.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 324b3ea..3a64262 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -497,6 +497,7 @@ extern void sk_stream_wait_close(struct
extern int sk_stream_error(struct sock *sk, int flags, int err);
extern void sk_stream_kill_queues(struct sock *sk);

+extern void sock_def_wakeup(struct sock *sk);
extern int sk_wait_data(struct sock *sk, long *timeo);

struct request_sock_ops;
diff --git a/net/core/sock.c b/net/core/sock.c
index 51fcfbc..8496854 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1400,7 +1400,7 @@ ssize_t sock_no_sendpage(struct socket *
* Default Socket Callbacks
*/

-static void sock_def_wakeup(struct sock *sk)
+void sock_def_wakeup(struct sock *sk)
{
read_lock(&sk->sk_callback_lock);
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
@@ -1961,6 +1961,7 @@ EXPORT_SYMBOL(sock_no_poll);
EXPORT_SYMBOL(sock_no_recvmsg);
EXPORT_SYMBOL(sock_no_sendmsg);
EXPORT_SYMBOL(sock_no_sendpage);
+EXPORT_SYMBOL(sock_def_wakeup);
EXPORT_SYMBOL(sock_no_setsockopt);
EXPORT_SYMBOL(sock_no_shutdown);
EXPORT_SYMBOL(sock_no_socketpair);

2006-09-21 06:01:41

by Evgeniy Polyakov

Subject: Re: [RFC 0/6] TCP socket splice

On Wed, Sep 20, 2006 at 02:07:11PM -0700, Ashwini Kulkarni ([email protected]) wrote:
> Using TCP socket splice:
>
>       Application Control
>                |
>  ______________|______________________________________________
>                |
>                |        TCP socket splice
>                |      +---------------------+
>                |      |     Direct path     |
>                V      |                     V
>             Network                    File System
>             Buffer                       Buffer
>                ^                            |
>                |                            |
>  ______________|____________________________|_________________
>           DMA  |                            |  DMA
>                |                            |
>     Hardware   |                            |
>                |                            V
>               NIC                         SATA
>
> In this method, the objective is to use TCP socket splicing to create a direct
> path in the kernel from the network buffer to the file system buffer via a pipe
> buffer. The pages migrate from the network buffer (which is associated with the
> socket) into the pipe buffer, and from the pipe buffer into the page cache of
> the output file's address space. This makes it possible to build a
> LAN-to-file-system API that avoids the memcpy operations in user space,
> creating a fast path from the network buffer to the storage buffer.
>
> Open Issues (currently being addressed):
> There is a performance drop when transferring larger files (usually larger than
> 65536 bytes); the drop grows with the size of the file. Work is in progress to
> identify the source of this issue.
>
> We encourage the community to review our TCP socket splice project. Feedback
> would be greatly appreciated.

First of all, it is not zero-copy: most of the time, when the MTU is not
changed, the skb does not have fragments, which means that you need to copy,
and you do it in skb_splice_bits() after the skb_headlen() check.

In addition to the copy you add kmap/kunmap overhead, which can be very
noticeable. I would not be surprised if exactly that part introduces the
performance drop described above compared to the copy_*_user() approach.
Did you check it with (hacked) drivers which put data into the fragment
list? Could you post your benchmarks?

Your coding style is also noticeably broken...
Also, do not check for every possible error case: a negative return value
always means error; otherwise it is OK in your case to proceed.

> --
> Ashwini Kulkarni

--
Evgeniy Polyakov

2006-09-22 17:45:34

by Phillip Susi

[permalink] [raw]
Subject: Re: [RFC 0/6] TCP socket splice

How is this different from just having the application mmap() the file
and recv() into that buffer?

Ashwini Kulkarni wrote:
> My name is Ashwini Kulkarni and I have been working at Intel Corporation for
> the past 4 months as an engineering intern. I have been working on the 'TCP
> socket splice' project with Chris Leech. This is a work-in-progress version
> of the project with scope for further modifications.
>
> TCP socket splicing:
> It allows a TCP socket to be spliced to a file via a pipe buffer. First, to
> splice data from a socket to a pipe buffer, up to 16 source pages are pulled
> into the pipe buffer. Then, to splice data from the pipe buffer to a file,
> those pages are migrated into the address space of the target file. The
> transfer takes place entirely within the kernel and thus involves zero memory
> copies. It is the receive-side complement to sendfile(), but unlike sendfile()
> it makes it possible to splice from a socket, not just to one.
>
> Current Method:
>                +-----> Application Buffer -----+
>                |                               |
>  ______________|_______________________________|______________
>                |                               |
>    Receive or  |                               |  Write
>    I/OAT DMA   |                               |
>                |                               |
>                |                               V
>             Network                       File System
>             Buffer                          Buffer
>                ^                               |
>                |                               |
>  ______________|_______________________________|______________
>           DMA  |                               |  DMA
>                |                               |
>     Hardware   |                               |
>                |                               V
>               NIC                            SATA
>
> In the current method, the packet is DMA'd from the NIC into the network
> buffer. A read on the socket then copies the packet data from the network
> buffer into the application buffer in user space. A write operation moves the
> data from the application buffer to the file system buffer, which is then
> DMA'd to the disk. Thus the current method makes one full copy of all the
> data through user space.
>
> Using TCP socket splice:
>
>       Application Control
>                |
>  ______________|______________________________________________
>                |
>                |        TCP socket splice
>                |      +---------------------+
>                |      |     Direct path     |
>                V      |                     V
>             Network                    File System
>             Buffer                       Buffer
>                ^                            |
>                |                            |
>  ______________|____________________________|_________________
>           DMA  |                            |  DMA
>                |                            |
>     Hardware   |                            |
>                |                            V
>               NIC                         SATA
>
> In this method, the objective is to use TCP socket splicing to create a direct
> path in the kernel from the network buffer to the file system buffer via a pipe
> buffer. The pages migrate from the network buffer (which is associated with the
> socket) into the pipe buffer, and from the pipe buffer into the page cache of
> the output file's address space. This makes it possible to build a
> LAN-to-file-system API that avoids the memcpy operations in user space,
> creating a fast path from the network buffer to the storage buffer.
>
> Open Issues (currently being addressed):
> There is a performance drop when transferring larger files (usually larger than
> 65536 bytes); the drop grows with the size of the file. Work is in progress to
> identify the source of this issue.
>
> We encourage the community to review our TCP socket splice project. Feedback
> would be greatly appreciated.
>
> --
> Ashwini Kulkarni