2006-03-29 12:28:36

by Jens Axboe

[permalink] [raw]
Subject: [PATCH][RFC] splice support

Hi,

Since my initial posting back in December, I've had some private queries
about the state of splice support. The state was pretty much that it was
a little broken, if one attempted to do file | file splicing. The
original patch migrated pages from one file to another in this case,
which got vm ugly really quickly. And it wasn't always the right thing
to do, since it would mean that splicing file1 to file2 would move
file1's page cache to file2. Sometimes this is what you want, sometimes
it is not.

So that was removed to make things work fully. It can later be
reintroduced (and controlled with the splice flags passed in, whether to
'loan' or 'gift' source pages to use a McVoy term) if need be.

Apart from that change, I added splice to socket support. It then
becomes a full sendfile() replacement (unless I broke something). I'm
attaching the current patch against 2.6.16-git, and also three test apps
that you can use as a reference or just to play with this. The apps are:

splice-in file
Splice file to stdout

splice-out
Splice stdin to file

splice-net hostname port
Splice stdin to hostname:port

Examples - splice copying a file can be done as:

# splice-in file | splice-out new_file

Sending a file over the network

# cat file | splice-net hostname port

and then have the other end run eg netcat -l -p port to receive the
data.

The patch adds the syscalls for i386/x86_64/ppc. The patch can also (as
before) be found in the block git repo, splice branch.

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 326595f..ce3ef4f 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -312,3 +312,4 @@ ENTRY(sys_call_table)
.long sys_unshare /* 310 */
.long sys_set_robust_list
.long sys_get_robust_list
+ .long sys_splice
diff --git a/fs/Makefile b/fs/Makefile
index 080b386..f3a4f70 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table.
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
- ioprio.o pnode.o drop_caches.o
+ ioprio.o pnode.o drop_caches.o splice.o

obj-$(CONFIG_INOTIFY) += inotify.o
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 509ccec..23e2c7c 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -53,6 +53,8 @@ const struct file_operations ext2_file_o
.readv = generic_file_readv,
.writev = generic_file_writev,
.sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
};

#ifdef CONFIG_EXT2_FS_XIP
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 783a796..1efefb6 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -119,6 +119,8 @@ const struct file_operations ext3_file_o
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
};

struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/pipe.c b/fs/pipe.c
index e2f4f1d..aef9ad7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -15,6 +15,7 @@ #include <linux/mount.h>
#include <linux/pipe_fs_i.h>
#include <linux/uio.h>
#include <linux/highmem.h>
+#include <linux/pagemap.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -94,11 +95,20 @@ static void anon_pipe_buf_release(struct
{
struct page *page = buf->page;

- if (info->tmp_page) {
- __free_page(page);
+ /*
+ * If nobody else uses this page, and we don't already have a
+ * temporary page, let's keep track of it as a one-deep
+ * allocation cache
+ */
+ if (page_count(page) == 1 && !info->tmp_page) {
+ info->tmp_page = page;
return;
}
- info->tmp_page = page;
+
+ /*
+ * Otherwise just release our reference to it
+ */
+ page_cache_release(page);
}

static void *anon_pipe_buf_map(struct file *file, struct pipe_inode_info *info, struct pipe_buffer *buf)
diff --git a/fs/splice.c b/fs/splice.c
new file mode 100644
index 0000000..5ab41b7
--- /dev/null
+++ b/fs/splice.c
@@ -0,0 +1,529 @@
+/*
+ * "splice": joining two ropes together by interweaving their strands.
+ *
+ * This is the "extended pipe" functionality, where a pipe is used as
+ * an arbitrary in-memory buffer. Think of a pipe as a small kernel
+ * buffer that you can use to transfer data from one end to the other.
+ *
+ * The traditional unix read/write is extended with a "splice()" operation
+ * that transfers data buffers to or from a pipe buffer.
+ *
+ * Named by Larry McVoy, original implementation from Linus, extended by
+ * Jens to support splicing to files and fixing the initial implementation
+ * bugs.
+ *
+ * Copyright (C) 2005 Jens Axboe <[email protected]>
+ * Copyright (C) 2005 Linus Torvalds <[email protected]>
+ *
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/mm_inline.h>
+
+static void page_cache_pipe_buf_release(struct pipe_inode_info *info,
+ struct pipe_buffer *buf)
+{
+ page_cache_release(buf->page);
+ buf->page = NULL;
+}
+
+static void *page_cache_pipe_buf_map(struct file *file,
+ struct pipe_inode_info *info,
+ struct pipe_buffer *buf)
+{
+ struct page *page = buf->page;
+
+ lock_page(page);
+
+ if (!PageUptodate(page))
+ return NULL;
+
+
+ return kmap(buf->page);
+}
+
+static void page_cache_pipe_buf_unmap(struct pipe_inode_info *info,
+ struct pipe_buffer *buf)
+{
+ unlock_page(buf->page);
+ kunmap(buf->page);
+}
+
+static struct pipe_buf_operations page_cache_pipe_buf_ops = {
+ .can_merge = 0,
+ .map = page_cache_pipe_buf_map,
+ .unmap = page_cache_pipe_buf_unmap,
+ .release = page_cache_pipe_buf_release,
+};
+
+static ssize_t move_to_pipe(struct inode *inode, struct page **array, int pages,
+ unsigned long offset, unsigned long len)
+{
+ struct pipe_inode_info *info;
+ int ret, do_wakeup, i;
+
+ ret = 0;
+ do_wakeup = 0;
+ i = 0;
+
+ mutex_lock(PIPE_MUTEX(*inode));
+
+ info = inode->i_pipe;
+ for (;;) {
+ int bufs;
+
+ if (!PIPE_READERS(*inode)) {
+ send_sig(SIGPIPE, current, 0);
+ if (!ret)
+ ret = -EPIPE;
+ break;
+ }
+
+ bufs = info->nrbufs;
+ if (bufs < PIPE_BUFFERS) {
+ int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS - 1);
+ struct pipe_buffer *buf = info->bufs + newbuf;
+ struct page *page = array[i++];
+ unsigned long this_len;
+
+ this_len = PAGE_SIZE - offset;
+ if (this_len > len)
+ this_len = len;
+
+ buf->page = page;
+ buf->offset = offset;
+ buf->len = this_len;
+ buf->ops = &page_cache_pipe_buf_ops;
+ info->nrbufs = ++bufs;
+ do_wakeup = 1;
+
+ ret += this_len;
+ len -= this_len;
+ offset = 0;
+ if (!--pages)
+ break;
+ if (!len)
+ break;
+ if (bufs < PIPE_BUFFERS)
+ continue;
+
+ break;
+ }
+
+ if (signal_pending(current)) {
+ if (!ret)
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (do_wakeup) {
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO,
+ POLL_IN);
+ do_wakeup = 0;
+ }
+
+ PIPE_WAITING_WRITERS(*inode)++;
+ pipe_wait(inode);
+ PIPE_WAITING_WRITERS(*inode)--;
+ }
+
+ mutex_unlock(PIPE_MUTEX(*inode));
+
+ if (do_wakeup) {
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+ }
+
+ while (i < pages)
+ page_cache_release(array[i++]);
+
+ return ret;
+}
+
+static int __generic_file_splice_read(struct file *in, struct inode *pipe,
+ size_t len)
+{
+ struct address_space *mapping = in->f_mapping;
+ unsigned long index, offset, pages;
+ struct page *array[PIPE_BUFFERS];
+ int i;
+
+ index = in->f_pos >> PAGE_CACHE_SHIFT;
+ offset = in->f_pos & ~PAGE_CACHE_MASK;
+ pages = (len + offset + PAGE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+ if (pages > PIPE_BUFFERS)
+ pages = PIPE_BUFFERS;
+
+ /*
+ * Get as many pages from the page cache as possible..
+ * Start IO on the page cache entries we create (we
+ * can assume that any pre-existing ones we find have
+ * already had IO started on them).
+ */
+ i = find_get_pages(mapping, index, pages, array);
+
+ /*
+ * If not all pages were in the page-cache, we'll
+ * just assume that the rest haven't been read in,
+ * so we'll get the rest locked and start IO on
+ * them if we can..
+ */
+ while (i < pages) {
+ struct page *page;
+ int error;
+
+ page = find_or_create_page(mapping, index + i, GFP_USER);
+ if (!page)
+ break;
+
+ if (PageUptodate(page))
+ unlock_page(page);
+ else {
+ error = mapping->a_ops->readpage(in, page);
+ if (unlikely(error)) {
+ page_cache_release(page);
+ break;
+ }
+ }
+
+ array[i++] = page;
+ }
+
+ if (!i)
+ return 0;
+
+ /*
+ * Now we splice them into the pipe..
+ */
+ return move_to_pipe(pipe, array, i, offset, len);
+}
+
+ssize_t generic_file_splice_read(struct file *in, struct inode *pipe,
+ size_t len, unsigned long flags)
+{
+ ssize_t spliced;
+ int ret;
+
+ ret = 0;
+ spliced = 0;
+ while (len) {
+ ret = __generic_file_splice_read(in, pipe, len);
+
+ if (ret <= 0)
+ break;
+
+ in->f_pos += ret;
+ len -= ret;
+ spliced += ret;
+ }
+
+ if (spliced)
+ return spliced;
+
+ return ret;
+}
+
+/*
+ * Send buf to a socket through sendpage
+ */
+static int pipe_to_sendpage(struct pipe_inode_info *info,
+ struct pipe_buffer *buf,
+ struct file *file, unsigned int len,
+ unsigned long long pos)
+{
+ unsigned long offset;
+ unsigned int size;
+ ssize_t ret;
+
+ /*
+ * sub-optimal, but we are limited by the pipe ->map. we don't
+ * need a kmap'ed buffer here, we just want to make sure we
+ * have the page pinned if the pipe page originates from the
+ * page cache
+ */
+ if (!buf->ops->map(file, info, buf))
+ return -EIO;
+
+ size = len;
+ if (size > PAGE_SIZE)
+ size = PAGE_SIZE;
+
+ offset = pos & ~PAGE_CACHE_MASK;
+
+ ret = file->f_op->sendpage(file, buf->page, offset, size, &pos,
+ size < len);
+
+ buf->ops->unmap(info, buf);
+ if (ret == len)
+ return 0;
+
+ return -EIO;
+}
+
+/*
+ * This is a little more tricky than the file -> pipe splicing. There are
+ * basically three cases:
+ *
+ * - Destination page already exists in the address space and there
+ * are users of it. For that case we have no other option that
+ * copying the data. Tough luck.
+ * - Destination page already exists in the address space, but there
+ * are no users of it. Make sure it's uptodate, then drop it. Fall
+ * through to last case.
+ * - Destination page does not exist, we can add the pipe page to
+ * the page cache and avoid the copy.
+ *
+ * For now we just do the slower thing and always copy pages over, it's
+ * easier than migrating pages from the pipe to the target file. For the
+ * case of doing file | file splicing, the migrate approach had some LRU
+ * nastiness...
+ */
+static int pipe_to_file(struct pipe_inode_info *info, struct pipe_buffer *buf,
+ struct file *file, unsigned int len,
+ unsigned long long pos)
+{
+ struct address_space *mapping = file->f_mapping;
+ unsigned long long index;
+ unsigned int offset;
+ struct page *page;
+ char *src, *dst;
+ int ret;
+
+ /*
+ * after this, page will be locked and unmapped
+ */
+ src = buf->ops->map(file, info, buf);
+ if (!src)
+ return -EIO;
+
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
+
+ ret = -ENOMEM;
+ page = find_or_create_page(mapping, index, GFP_USER);
+ if (!page)
+ goto out;
+
+ /*
+ * If the page is uptodate, it is also locked. If it isn't
+ * uptodate, we can mark it uptodate if we are filling the
+ * full page. Otherwise we need to read it in first...
+ */
+ if (!PageUptodate(page)) {
+ if (len < PAGE_CACHE_SIZE) {
+ ret = mapping->a_ops->readpage(file, page);
+ if (unlikely(ret))
+ goto out;
+
+ lock_page(page);
+ } else {
+ WARN_ON(!PageLocked(page));
+ SetPageUptodate(page);
+ }
+ }
+
+ ret = page->mapping->a_ops->prepare_write(file, page, 0, len);
+ if (ret)
+ goto out;
+
+ dst = kmap_atomic(page, KM_USER0);
+ memcpy(dst + offset, src + buf->offset, len);
+ flush_dcache_page(page);
+ kunmap_atomic(dst, KM_USER0);
+
+ flush_dcache_page(page);
+
+ ret = page->mapping->a_ops->commit_write(file, page, 0, len);
+ if (ret)
+ goto out;
+
+ set_page_dirty(page);
+ ret = write_one_page(page, 0);
+out:
+ if (ret < 0)
+ unlock_page(page);
+ page_cache_release(page);
+ buf->ops->unmap(info, buf);
+ return ret;
+}
+
+typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
+ struct file *, unsigned int, unsigned long long);
+
+static ssize_t move_from_pipe(struct inode *inode, struct file *out,
+ size_t len, unsigned long flags,
+ splice_actor *actor)
+{
+ struct pipe_inode_info *info;
+ unsigned long long pos;
+ int ret, do_wakeup, err;
+
+ pos = out->f_pos;
+ ret = 0;
+ do_wakeup = 0;
+
+ mutex_lock(PIPE_MUTEX(*inode));
+
+ info = inode->i_pipe;
+ for (;;) {
+ int bufs = info->nrbufs;
+
+ if (bufs) {
+ int curbuf = info->curbuf;
+ struct pipe_buffer *buf = info->bufs + curbuf;
+ struct pipe_buf_operations *ops = buf->ops;
+ unsigned int this_len;
+
+ this_len = buf->len;
+ if (this_len > len)
+ this_len = len;
+
+ err = actor(info, buf, out, this_len, pos);
+ if (err) {
+ if (!ret)
+ ret = err;
+
+ break;
+ }
+
+ ret += this_len;
+ buf->offset += this_len;
+ buf->len -= this_len;
+ if (!buf->len) {
+ buf->ops = NULL;
+ ops->release(info, buf);
+ curbuf = (curbuf + 1) & (PIPE_BUFFERS - 1);
+ info->curbuf = curbuf;
+ info->nrbufs = --bufs;
+ do_wakeup = 1;
+ }
+
+ pos += this_len;
+ len -= this_len;
+ if (!len)
+ break;
+ }
+
+ if (bufs)
+ continue;
+ if (!PIPE_WRITERS(*inode))
+ break;
+ if (!PIPE_WAITING_WRITERS(*inode)) {
+ if (ret)
+ break;
+ }
+
+ if (signal_pending(current)) {
+ if (!ret)
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ if (do_wakeup) {
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_WRITERS(*inode),SIGIO,POLL_OUT);
+ do_wakeup = 0;
+ }
+
+ pipe_wait(inode);
+ }
+
+ mutex_unlock(PIPE_MUTEX(*inode));
+
+ if (do_wakeup) {
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
+ }
+
+ out->f_pos = pos;
+ return ret;
+
+}
+
+ssize_t generic_file_splice_write(struct inode *inode, struct file *out,
+ size_t len, unsigned long flags)
+{
+ return move_from_pipe(inode, out, len, flags, pipe_to_file);
+}
+
+ssize_t generic_splice_sendpage(struct inode *inode, struct file *out,
+ size_t len, unsigned long flags)
+{
+ return move_from_pipe(inode, out, len, flags, pipe_to_sendpage);
+}
+
+static long do_splice_from(struct inode *pipe, struct file *out, size_t len,
+ unsigned long flags)
+{
+ if (out->f_op && out->f_op->splice_write)
+ return out->f_op->splice_write(pipe, out, len, flags);
+
+ return -EINVAL;
+}
+
+static long do_splice_to(struct file *in, struct inode *pipe, size_t len,
+ unsigned long flags)
+{
+ if (in->f_op && in->f_op->splice_read) {
+ loff_t isize = i_size_read(in->f_mapping->host);
+ size_t left;
+
+ if (unlikely(in->f_pos >= isize))
+ return 0;
+
+ left = isize - in->f_pos;
+ if (left < len)
+ len = left;
+
+ return in->f_op->splice_read(in, pipe, len, flags);
+ }
+
+ return -EINVAL;
+}
+
+static long do_splice(struct file *in, struct file *out, size_t len,
+ unsigned long flags)
+{
+ struct inode *pipe;
+
+ pipe = in->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_from(pipe, out, len, flags);
+
+ pipe = out->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_to(in, pipe, len, flags);
+
+ return -EINVAL;
+}
+
+asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
+{
+ long error;
+ struct file *in, *out;
+ int fput_in, fput_out;
+
+ if (unlikely(!len))
+ return 0;
+
+ error = -EBADF;
+ in = fget_light(fdin, &fput_in);
+ if (in) {
+ if (in->f_mode & FMODE_READ) {
+ out = fget_light(fdout, &fput_out);
+ if (out) {
+ if (out->f_mode & FMODE_WRITE)
+ error = do_splice(in, out, len, flags);
+ fput_light(out, fput_out);
+ }
+ }
+
+ fput_light(in, fput_in);
+ }
+
+ return error;
+}
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index 014e356..789e9bd 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -318,8 +318,9 @@ #define __NR_ppoll 309
#define __NR_unshare 310
#define __NR_set_robust_list 311
#define __NR_get_robust_list 312
+#define __NR_sys_splice 313

-#define NR_syscalls 313
+#define NR_syscalls 314

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-powerpc/unistd.h b/include/asm-powerpc/unistd.h
index 3555699..69c9b19 100644
--- a/include/asm-powerpc/unistd.h
+++ b/include/asm-powerpc/unistd.h
@@ -301,8 +301,9 @@ #define __NR_spu_create 279
#define __NR_pselect6 280
#define __NR_ppoll 281
#define __NR_unshare 282
+#define __NR_splice 283

-#define __NR_syscalls 283
+#define __NR_syscalls 284

#ifdef __KERNEL__
#define __NR__exit __NR_exit
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index fcc5163..f21ff2c 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -609,8 +609,10 @@ #define __NR_set_robust_list 273
__SYSCALL(__NR_set_robust_list, sys_set_robust_list)
#define __NR_get_robust_list 274
__SYSCALL(__NR_get_robust_list, sys_get_robust_list)
+#define __NR_splice 275
+__SYSCALL(__NR_splice, sys_splice)

-#define __NR_syscall_max __NR_get_robust_list
+#define __NR_syscall_max __NR_splice

#ifndef __NO_STUBS

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 408fe89..9ea74aa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1032,6 +1032,8 @@ struct file_operations {
int (*check_flags)(int);
int (*dir_notify)(struct file *filp, unsigned long arg);
int (*flock) (struct file *, int, struct file_lock *);
+ ssize_t (*splice_write)(struct inode *, struct file *, size_t, unsigned long);
+ ssize_t (*splice_read)(struct file *, struct inode *, size_t, unsigned long);
};

struct inode_operations {
@@ -1609,6 +1611,8 @@ extern ssize_t generic_file_sendfile(str
extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, read_actor_t);
+extern ssize_t generic_file_splice_read(struct file *, struct inode *, size_t, unsigned long);
+extern ssize_t generic_file_splice_write(struct inode *, struct file *, size_t, unsigned long);
extern void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b12e59c..7be0f82 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -16,6 +16,7 @@ struct pipe_buf_operations {
void * (*map)(struct file *, struct pipe_inode_info *, struct pipe_buffer *);
void (*unmap)(struct pipe_inode_info *, struct pipe_buffer *);
void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
+ int (*claim)(struct pipe_inode_info *, struct pipe_buffer *);
};

struct pipe_inode_info {
diff --git a/net/socket.c b/net/socket.c
index fcd77ea..29094b9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -119,7 +119,10 @@ static ssize_t sock_writev(struct file *
static ssize_t sock_sendpage(struct file *file, struct page *page,
int offset, size_t size, loff_t *ppos, int more);

+extern ssize_t generic_splice_sendpage(struct inode *inode, struct file *out,
+ size_t len, unsigned long flags);

+
/*
* Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
* in the operation structures but are done directly via the socketcall() multiplexor.
@@ -141,7 +144,8 @@ #endif
.fasync = sock_fasync,
.readv = sock_readv,
.writev = sock_writev,
- .sendpage = sock_sendpage
+ .sendpage = sock_sendpage,
+ .splice_write = generic_splice_sendpage,
};

/*

--
Jens Axboe


2006-03-29 12:30:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Jens Axboe wrote:
> splice-in file
> Splice file to stdout
>
> splice-out
> Splice stdin to file
>
> splice-net hostname port
> Splice stdin to hostname:port
>
> Examples - splice copying a file can be done as:
>
> # splice-in file | splice-out new_file
>
> Sending a file over the network
>
> # cat file | splice-net hostname port
>
> and then have the other end run eg netcat -l -p port to receive the
> data.

Which I naturally forgot to attach, here they are.

--
Jens Axboe


Attachments:
(No filename) (533.00 B)
splice-in.c (986.00 B)
splice-out.c (903.00 B)
splice-net.c (1.44 kB)
Download all attachments

2006-03-29 13:15:51

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe wrote:

> index 509ccec..23e2c7c 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -53,6 +53,8 @@ const struct file_operations ext2_file_o
> .readv = generic_file_readv,
> .writev = generic_file_writev,
> .sendfile = generic_file_sendfile,
> + .splice_read = generic_file_splice_read,
> + .splice_write = generic_file_splice_write,
> };
>
> #ifdef CONFIG_EXT2_FS_XIP
> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index 783a796..1efefb6 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -119,6 +119,8 @@ const struct file_operations ext3_file_o
> .release = ext3_release_file,
> .fsync = ext3_sync_file,
> .sendfile = generic_file_sendfile,
> + .splice_read = generic_file_splice_read,
> + .splice_write = generic_file_splice_write,

> +static long do_splice_from(struct inode *pipe, struct file *out, size_t len,
> + unsigned long flags)
> +{
> + if (out->f_op && out->f_op->splice_write)
> + return out->f_op->splice_write(pipe, out, len, flags);

1) What are the consequences of doing

if (f_op->splice_write)
f_op->splice_write(...);
else
generic_file_splice_write(...);

to cause sys_splice() to default to supported?

2) Do you really have to test f_op itself for NULL? Is that a stealth
closed-file check or something? I would be surprised if f_op was ever
really NULL.

Jeff


2006-03-29 13:27:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Jeff Garzik wrote:
> Jens Axboe wrote:
>
> >index 509ccec..23e2c7c 100644
> >--- a/fs/ext2/file.c
> >+++ b/fs/ext2/file.c
> >@@ -53,6 +53,8 @@ const struct file_operations ext2_file_o
> > .readv = generic_file_readv,
> > .writev = generic_file_writev,
> > .sendfile = generic_file_sendfile,
> >+ .splice_read = generic_file_splice_read,
> >+ .splice_write = generic_file_splice_write,
> > };
> >
> > #ifdef CONFIG_EXT2_FS_XIP
> >diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> >index 783a796..1efefb6 100644
> >--- a/fs/ext3/file.c
> >+++ b/fs/ext3/file.c
> >@@ -119,6 +119,8 @@ const struct file_operations ext3_file_o
> > .release = ext3_release_file,
> > .fsync = ext3_sync_file,
> > .sendfile = generic_file_sendfile,
> >+ .splice_read = generic_file_splice_read,
> >+ .splice_write = generic_file_splice_write,
>
> >+static long do_splice_from(struct inode *pipe, struct file *out, size_t
> >len,
> >+ unsigned long flags)
> >+{
> >+ if (out->f_op && out->f_op->splice_write)
> >+ return out->f_op->splice_write(pipe, out, len, flags);
>
> 1) What are the consequences of doing
>
> if (f_op->splice_write)
> f_op->splice_write(...);
> else
> generic_file_splice_write(...);
>
> to cause sys_splice() to default to supported?

It should probably work, the fs guys should know more about that. Any fs
that ->prepare_write(), ->commit_write() works for can use
generic_file_splice_write(). I prefer to keep it sane for now, mason
tells me that eg xfs might need special care.

> 2) Do you really have to test f_op itself for NULL? Is that a stealth
> closed-file check or something? I would be surprised if f_op was ever
> really NULL.

Probably not, paranoia.

After posting I fixed another bug, essentially making the 'more' flag to
sendpage() be correct. Should improve throughput a little.

--
Jens Axboe

2006-03-29 20:06:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support



On Wed, 29 Mar 2006, Jeff Garzik wrote:
>
> 1) What are the consequences of doing
>
> if (f_op->splice_write)
> f_op->splice_write(...);
> else
> generic_file_splice_write(...);
>
> to cause sys_splice() to default to supported?

I'd actually much prefer a number of filesystems just adding he
"generic_file_splice_write()" thing. If it works for them (and it usually
will), it's a one-liner. And it won't do wrong things on filesystems that
have special rules (inode re-validate for networked filesystems etc).

> 2) Do you really have to test f_op itself for NULL? Is that a stealth
> closed-file check or something? I would be surprised if f_op was ever really
> NULL.

Hmm.. I agree that f_op probably should never be NULL (a struct file with
a NULL f_op is pretty useless), but it is a test that we historically have
had. So it's probably best to keep for consistency, and if somebody wants
to, they can clean up all the other tests too (in the read/write/lseek
paths).

I'm inclined to apply this patch (well, I'd like the fixed one). The whole
splice() thing has been rolling around in my head for years, and the pipe
support infrastructure for it has been around for over a year now in
preparation for this.

And the patch actually looks pretty clean to me.

Linus

2006-03-29 20:42:23

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Linus Torvalds wrote:
>
>
> On Wed, 29 Mar 2006, Jeff Garzik wrote:
> >
> > 1) What are the consequences of doing
> >
> > if (f_op->splice_write)
> > f_op->splice_write(...);
> > else
> > generic_file_splice_write(...);
> >
> > to cause sys_splice() to default to supported?
>
> I'd actually much prefer a number of filesystems just adding he
> "generic_file_splice_write()" thing. If it works for them (and it usually
> will), it's a one-liner. And it won't do wrong things on filesystems that
> have special rules (inode re-validate for networked filesystems etc).
>
> > 2) Do you really have to test f_op itself for NULL? Is that a stealth
> > closed-file check or something? I would be surprised if f_op was ever really
> > NULL.
>
> Hmm.. I agree that f_op probably should never be NULL (a struct file with
> a NULL f_op is pretty useless), but it is a test that we historically have
> had. So it's probably best to keep for consistency, and if somebody wants
> to, they can clean up all the other tests too (in the read/write/lseek
> paths).
>
> I'm inclined to apply this patch (well, I'd like the fixed one). The whole
> splice() thing has been rolling around in my head for years, and the pipe
> support infrastructure for it has been around for over a year now in
> preparation for this.
>
> And the patch actually looks pretty clean to me.

Go ahead, as mentioned there are a few little extra fixes in the git
repo. The remaining changes I had in mind don't require anything
massive, so...

--
Jens Axboe

2006-03-29 20:43:11

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Jens Axboe wrote:
> On Wed, Mar 29 2006, Linus Torvalds wrote:
> >
> >
> > On Wed, 29 Mar 2006, Jeff Garzik wrote:
> > >
> > > 1) What are the consequences of doing
> > >
> > > if (f_op->splice_write)
> > > f_op->splice_write(...);
> > > else
> > > generic_file_splice_write(...);
> > >
> > > to cause sys_splice() to default to supported?
> >
> > I'd actually much prefer a number of filesystems just adding he
> > "generic_file_splice_write()" thing. If it works for them (and it usually
> > will), it's a one-liner. And it won't do wrong things on filesystems that
> > have special rules (inode re-validate for networked filesystems etc).
> >
> > > 2) Do you really have to test f_op itself for NULL? Is that a stealth
> > > closed-file check or something? I would be surprised if f_op was ever really
> > > NULL.
> >
> > Hmm.. I agree that f_op probably should never be NULL (a struct file with
> > a NULL f_op is pretty useless), but it is a test that we historically have
> > had. So it's probably best to keep for consistency, and if somebody wants
> > to, they can clean up all the other tests too (in the read/write/lseek
> > paths).
> >
> > I'm inclined to apply this patch (well, I'd like the fixed one). The whole
> > splice() thing has been rolling around in my head for years, and the pipe
> > support infrastructure for it has been around for over a year now in
> > preparation for this.
> >
> > And the patch actually looks pretty clean to me.
>
> Go ahead, as mentioned there are a few little extra fixes in the git
> repo. The remaining changes I had in mind don't require anything
> massive, so...

git://brick.kernel.dk/data/git/linux-2.6-block.git splice

is the url, just in case.

--
Jens Axboe

2006-03-29 21:15:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support



On Wed, 29 Mar 2006, Jens Axboe wrote:
>
> git://brick.kernel.dk/data/git/linux-2.6-block.git splice
>
> is the url, just in case.

Btw, would you mind if I just re-created that as a single patch instead?
Especially with the first commit being slightly corrupt (look at the first
line of the commit message ;), and some of the later commits just fixing
things up further, it would appear to be cleaner to just merge it as a
single "initial splice support" commit..

Linus

2006-03-29 21:50:35

by Nathan Scott

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29, 2006 at 03:27:25PM +0200, Jens Axboe wrote:
> On Wed, Mar 29 2006, Jeff Garzik wrote:
> > to cause sys_splice() to default to supported?
>
> It should probably work, the fs guys should know more about that. Any fs
> that ->prepare_write(), ->commit_write() works for can use
> generic_file_splice_write(). I prefer to keep it sane for now, mason
> tells me that eg xfs might need special care.

It looks pretty straightforward to get XFS support done - we will
likely need routines that wrap around generic_file_splice_read /
write to handle offline files for an HSM, etc, but not too tricky.
I'll make the XFS changes once the generic support is in place.

cheers.

--
Nathan

2006-03-29 22:35:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
> Hi,
>
> Since my initial posting back in December, I've had some private queries
> about the state of splice support. The state was pretty much that it was
> a little broken, if one attempted to do file | file splicing. The
> original patch migrated pages from one file to another in this case,
> which got vm ugly really quickly. And it wasn't always the right thing
> to do, since it would mean that splicing file1 to file2 would move
> file1's page cache to file2. Sometimes this is what you want, sometimes
> it is not.
>
> So that was removed to make things work fully. It can later be
> reintroduced (and controlled with the splice flags passed in, whether to
> 'loan' or 'gift' source pages to use a McVoy term) if need be.
>
> Apart from that change, I added splice to socket support. It then
> becomes a full sendfile() replacement (unless I broke something). I'm
> attaching the current patch against 2.6.16-git, and also three test apps
> that you can use as a reference or just to play with this. The apps are:
>

- splice() take a size_t length. Should it be taking a 64-bit length?

- splice() doesn't check for (len < 0), like read() and write() do.
Should it?

- Please don't call it `len'. VFS has to deal with "lengths" which can
be in units of PAGE_CACHE_SIZE, fs blocksize, 512-bytes sectors or bytes,
and it gets confusing. Our liking for variable names like `len' and
`count' just makes it worse.

If it's in units of pages then call it `npages'. If it's in bytes then
call it `nbytes'.

What _is_ it in units of, anyway? I guess bytes, since it's size_t.

I assume all this lenning:

unsigned int this_len;

this_len = buf->len;
if (this_len > len)
this_len = len;

is dealing with bytes too. You'll be wanting a size_t in there.

- why is the `flags' arg to sys_splice() unsigned long? Can it be `int'?

- what does `flags' do, anyway? The whole thing is undocumented and
almost uncommented.

- the tmp_page trick in anon_pipe_buf_release() seems to be unrelated to
the splice() work. It should be a separate patch and any peformance
testing (needed, please) should be decoupled from that change.

- I think the `size_t left' in do_splice_to() can overflow if f_pos is
sufficiently different from i_size.

- All the operations do foo(in, out, ...). It's a bit more conventional
to do foo(out, in, ...).

- The logic in do_splice() hurts my brain. "if `in' is a pipe then
splice from `in-as-a-pipe' to `out' else if `out' is a pipe then splice
from `in' to 'out-as-a-pipe'. Make sense, I guess, but I do wonder "what
would happen if those tests were reversed?". Nothing, I guess.

- In pipe_to_file():

- Shouldn't it be using GFP_HIGHUSER()?

- local variable `index' should be unsigned long or, for clarity
value, pgoff_t.

- Incoming arg `pos' should be loff_t?

- It's racy against truncate(). After running ->readpage and
lock_page(), need to check for page->mapping == NULL.

- There's a duplicate flush_dcache_page().

- Why does it run write_one_page()??? (Don't tell me. I'll work it
out when I see the commented version ;))

- I worry a bit about the assumption in one place that a non-zero
return from commit_write() indicates an error, whereas another place
assumes that a negative return is an error. We had problems in the
past where some a_ops implementations decided to return small positive
numbers from prepare_write() or commit_write() a_ops, which broke
stuff. They shouldn't be doing that now, but it's a thing to watch out
for.

- Bug. If write_one_page() returned an error, it still unlocked the page.

- In pipe_to_sendpage():

- local variable `offset' is ulong, but elsewhere you've used uint.
The latter is better.

- Again, incoming arg `len' is confusing. I _think_ it's actually
"number of bytes to be moved from this page". A comment which explains
these things would be nice, and perhaps a better name (bytes_to_send?)

- Should incoming arg `pos' be loff_t? That would give it some meaning.

- Why does it use PAGE_SIZE and PAGE_SHIFT rather than PAGE_CACHE_*?

- In generic_file_splice_read():

- nonatomic modification of f_pos. Is i_mutex held? (see
generic_file_llseek())

- Darnit, we carried `flags' this far and ended up not using it.
(What _does_ flags do, anyway? Reads on..)

- In __generic_file_splice_read():

- local variable `index' is ulong, could be pgoff_t (for clarity)

- local variable `offset' could be uint (it is uint elsewhere, and
might generate better code).

A better name might be offset_in_page.

- local variable `pages' could be uint (but watch out for overflow!!).

A better name might be nr_pages (matches find_get_pages()). Then,
local variable `array' can be renamed to `pages', which is all much
better.

- While we're in the spirit, local var `i' would be better named
`page_nr' or something.

- Shouldn't it be using GFP_HIGHUSER?

- whoa. We move the pages into the pipe while they're still under
read I/O. Is that deliberate? (pls add nice comment).

- These pages can get truncated at any time they're unlocked. Does
the code cope with all that?

- hm. What happens if the pages which find_get_pages() returned are
not contiguous in pagecache? I think your `pages' array gets all
jumbled up.

- In move_to_pipe()

- gargh, another `offset' and `len'. No idea what they're doing, so
am unable to determine whether ulong is an appropriate type. Am keenly
looking forward to the commented version!

- Suggest you rename `pages' to `nr_pages', `array' to `pages'. And
`i' to `page_nr'.

- local var `bufs' could be renamed `nrbufs' to align with
pipe_inode_info and could be made uint.

- Do we actually need local var `bufs'? It seems to be caching info->nrbufs.

- release_pages() might be faster than one-at-a-time page_cache_release()


Anyway, that's all just low-level stuff.

What does the feature do? How would one use it in an application? Is it
intended that it be generalised to other kinds of address_spaces? If so,
which ones, and what implementation problems might we expect?

(And I still don't know what `flags' does!)

2006-03-30 00:50:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support



On Wed, 29 Mar 2006, Andrew Morton wrote:
>
> - splice() take a size_t length. Should it be taking a 64-bit length?

No. You can't splice more than the kernel buffers anyway (ie currently
PIPE_BUFFERS pages, ie ~64kB, although in theory somebody could use large
pages for it), so 64-bit would be total overkill.

> - splice() doesn't check for (len < 0), like read() and write() do.
> Should it?

Umm. More likely better to just do rw_verify_area() instead, which limits
it to MAX_INT. Although it probably doesn't matter, for the above obvious
reason anyway (ie we end up doing everything on a page-granular area
anyway).

> - Please don't call it `len'. VFS has to deal with "lengths" which can
> be in units of PAGE_CACHE_SIZE, fs blocksize, 512-bytes sectors or bytes,
> and it gets confusing. Our liking for variable names like `len' and
> `count' just makes it worse.

I don't see that being problematic. I agree that if "len" is in pages, we
should call it something else (like 'npages'), but the fact that we
commonly use "len" for byte-lengths just sounds sane.

> - why is the `flags' arg to sys_splice() unsigned long? Can it be `int'?

flags are always unsigned long, haven't you noticed? Besides, they should
never be signed, if you do bitmasks and shifting on them: "int" is
strictly worse than "unsigned" when we're talking flags.

> - what does `flags' do, anyway? The whole thing is undocumented and
> almost uncommented.

Right now "flags" doesn't do anything at all, and you should just pass in
zero.

But if we ever do a "move" vs "copy" hint, we'll want something.

> - the tmp_page trick in anon_pipe_buf_release() seems to be unrelated to
> the splice() work. It should be a separate patch and any peformance
> testing (needed, please) should be decoupled from that change.

It's not unrelated. Note the new "page_count() == 1" test.

> - All the operations do foo(in, out, ...). It's a bit more conventional
> to do foo(out, in, ...).

You think? What convention do we have? We don't say "cp dst src", we say
"cp src dst".

The "destination first" convention is insane. It only makes sense for
assignments, and these aren't assignments.

> - The logic in do_splice() hurts my brain. "if `in' is a pipe then
> splice from `in-as-a-pipe' to `out' else if `out' is a pipe then splice
> from `in' to 'out-as-a-pipe'. Make sense, I guess, but I do wonder "what
> would happen if those tests were reversed?". Nothing, I guess.

Why would it matter? If both are pipes, then one is as good as the other.
You just want to pick the version that is potentially more efficient, if
there is any difference (and there is).

However, I don't think Jens actually did the pipe->pipe case at all (ie
pipes don't have the "splice_read()" function yet).

>
> - In pipe_to_file():
>
> - Shouldn't it be using GFP_HIGHUSER()?

Some filesystems may not like having highpages.

I suspect it should be "mapping_gfp_mask(mapping)".

> What does the feature do? How would one use it in an application? Is it
> intended that it be generalised to other kinds of address_spaces? If so,
> which ones, and what implementation problems might we expect?

You'd use it instead of "sendfile()".

Think of it as a hell of a lot more capable than "sendfile()" can ever be.
It can take input from anything that just has "splice_read()", and move it
to anything else, without doing any extra copies.

So imagine a streaming media thing, with a DVB card. Imaging wanting to
push all that data around the system without doing a "read()" into user
space and then a "write()" out to a file.

Also, imagine doing a "tee()" system call that can duplicate the pages
from one pipe into two other pipes. By just incrementing the page count.
So that your media streaming application can stream to both a file _and_
to live video (or whatever else) without doubling the buffering and doing
memcpy.

Linus

2006-03-30 01:04:56

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Linus Torvalds wrote:
> The "destination first" convention is insane. It only makes sense for
> assignments, and these aren't assignments.

I agree.

But alas, sendfile(2) is defined as destination first:

> ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)

which begs the question, do we want to be different from sendfile(2),
and confuse a segment of the programmer populace? :)

Jeff


2006-03-30 01:18:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jeff Garzik <[email protected]> wrote:
>
> Linus Torvalds wrote:
> > The "destination first" convention is insane. It only makes sense for
> > assignments, and these aren't assignments.
>
> I agree.
>
> But alas, sendfile(2) is defined as destination first:
>
> > ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
>
> which begs the question, do we want to be different from sendfile(2),
> and confuse a segment of the programmer populace? :)
>

strcpy, memcpy...

Obviously copy(from, to) is the sane way to do things, but yeah, I _think_
a C programmer would expect copy(to, from).

I don't think it matters much at all, really. If they get it backwards
they'll notice pretty quickly ;)

2006-03-30 02:08:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Linus Torvalds <[email protected]> wrote:
>
> > - why is the `flags' arg to sys_splice() unsigned long? Can it be `int'?
>
> flags are always unsigned long, haven't you noticed?

<does `man 2 open', gets confused>

> Besides, they should
> never be signed, if you do bitmasks and shifting on them: "int" is
> strictly worse than "unsigned" when we're talking flags.

Sure, but is there any gain in making flags 64-bit on 64-bit machines when
we cannot use more than 32 bits in there anyway?

> Right now "flags" doesn't do anything at all, and you should just pass in
> zero.

In that case perhaps we should be enforcing flags==0 so that future
flags-using applications will reliably fail on old flags-not-understanding
kernels.

But that won't work if we later define a bit in flags to mean "behave like
old kernels used to". So perhaps we should require that bits 0-15 of
`flags' be zero and not care about bits 16-31.

IOW: it might be best to make `flags' just go away, and add new syscalls in
the future as appropriate.

2006-03-30 05:13:50

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Hi Jens,

Looks nice!

Jens Axboe wrote:

>Hi,
>
>Since my initial posting back in December, I've had some private queries
>about the state of splice support. The state was pretty much that it was
>a little broken, if one attempted to do file | file splicing. The
>original patch migrated pages from one file to another in this case,
>which got vm ugly really quickly. And it wasn't always the right thing
>to do, since it would mean that splicing file1 to file2 would move
>file1's page cache to file2. Sometimes this is what you want, sometimes
>it is
>

Page migration now generalised vmscan.c and introduced remove_mapping
function, which should help keep things clean.

Moving a page onto and off the LRU is an interesting problem, though.
But possibly you could just leave it on the LRU and transfer the pagecache
reference over to the pipe. vmscan would find extra pages on the LRU at
times, but they would go away when pipe releases the page.

Moving a page from a pipe to a filesystem might be harder, because you
don't know if it came from a filesystem (still on LRU) or not (in which
case you need to add it to LRU). If only you can keep track of this
information as the page gets passed around... hmm the PG_private will be
free to use because a filesystem must always drop its buffers before
remove_mapping can run. One would also need to take care of replacing
an existing page I guess.

Hmm... I think it can work, falling back to copies if we get stuck
anywhere.

Unless someone beats me to it, I'll try coding something up when I get
a bit more free time.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-30 06:18:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Linus Torvalds wrote:
>
>
> On Wed, 29 Mar 2006, Jens Axboe wrote:
> >
> > git://brick.kernel.dk/data/git/linux-2.6-block.git splice
> >
> > is the url, just in case.
>
> Btw, would you mind if I just re-created that as a single patch instead?
> Especially with the first commit being slightly corrupt (look at the first
> line of the commit message ;), and some of the later commits just fixing
> things up further, it would appear to be cleaner to just merge it as a
> single "initial splice support" commit..

I'll make the suggestions made while I was gone sleeping and rebase a
single patch on top of current HEAD.

--
Jens Axboe

2006-03-30 06:18:46

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Andrew Morton wrote:
> Jeff Garzik <[email protected]> wrote:
> >
> > Linus Torvalds wrote:
> > > The "destination first" convention is insane. It only makes sense for
> > > assignments, and these aren't assignments.
> >
> > I agree.
> >
> > But alas, sendfile(2) is defined as destination first:
> >
> > > ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
> >
> > which begs the question, do we want to be different from sendfile(2),
> > and confuse a segment of the programmer populace? :)
> >
>
> strcpy, memcpy...
>
> Obviously copy(from, to) is the sane way to do things, but yeah, I _think_
> a C programmer would expect copy(to, from).
>
> I don't think it matters much at all, really. If they get it backwards
> they'll notice pretty quickly ;)

Yeah, I've always thought that the sendfile() arguments are stupidly
transposed :-)

--
Jens Axboe

2006-03-30 06:20:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Linus Torvalds wrote:

>
>On Wed, 29 Mar 2006, Andrew Morton wrote:
>
>>- splice() take a size_t length. Should it be taking a 64-bit length?
>>
>
>No. You can't splice more than the kernel buffers anyway (ie currently
>PIPE_BUFFERS pages, ie ~64kB, although in theory somebody could use large
>pages for it), so 64-bit would be total overkill.
>
>

But in that case you'll just end up blocking on the pipe won't you? I
think Andrew's talking about the syscall itself, which should be 64-bit
surely.

Hmm, with no "offset" parameter, you cannot splice-sendfile >4GB files...
Or am I going crazy? I see, it uses f_pos. Should it use offsets instead?
I guess you covered this in your earlier sys_splice discussions. I'll do
some research.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-30 07:00:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Nick Piggin wrote:
> Hi Jens,
>
> Looks nice!
>
> Jens Axboe wrote:
>
> >Hi,
> >
> >Since my initial posting back in December, I've had some private queries
> >about the state of splice support. The state was pretty much that it was
> >a little broken, if one attempted to do file | file splicing. The
> >original patch migrated pages from one file to another in this case,
> >which got vm ugly really quickly. And it wasn't always the right thing
> >to do, since it would mean that splicing file1 to file2 would move
> >file1's page cache to file2. Sometimes this is what you want, sometimes
> >it is
> >
>
> Page migration now generalised vmscan.c and introduced remove_mapping
> function, which should help keep things clean.

Excellent.

> Moving a page onto and off the LRU is an interesting problem, though.
> But possibly you could just leave it on the LRU and transfer the pagecache
> reference over to the pipe. vmscan would find extra pages on the LRU at
> times, but they would go away when pipe releases the page.
>
> Moving a page from a pipe to a filesystem might be harder, because you
> don't know if it came from a filesystem (still on LRU) or not (in which
> case you need to add it to LRU). If only you can keep track of this

Well that, to me, is _the_ hard problem to solve for this. But you
sort-of do know, my plan is/was to add a ->steal() hook to the pipe
buffers that would 'unhook' the page so it was in a clean state to be
added to the LRU/page cache again. If stealing failed, just fall back to
copying (or hard error, let the flags decide).

> information as the page gets passed around... hmm the PG_private will be
> free to use because a filesystem must always drop its buffers before
> remove_mapping can run. One would also need to take care of replacing
> an existing page I guess.
>
> Hmm... I think it can work, falling back to copies if we get stuck
> anywhere.
>
> Unless someone beats me to it, I'll try coding something up when I get
> a bit more free time.

You are more than welcome, I hope to give it a little shot today and see
how it goes.

--
Jens Axboe

2006-03-30 07:15:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support


(So Linus basically handled everything here, I'll make some scattered
comments where I made changes).

On Wed, Mar 29 2006, Linus Torvalds wrote:
> > - splice() doesn't check for (len < 0), like read() and write() do.
> > Should it?
>
> Umm. More likely better to just do rw_verify_area() instead, which limits
> it to MAX_INT. Although it probably doesn't matter, for the above obvious
> reason anyway (ie we end up doing everything on a page-granular area
> anyway).

I've added rw_verify_area() calls now.

> > - what does `flags' do, anyway? The whole thing is undocumented and
> > almost uncommented.
>
> Right now "flags" doesn't do anything at all, and you should just pass in
> zero.
>
> But if we ever do a "move" vs "copy" hint, we'll want something.

Precisely. I already have something in progress for that...

> > - the tmp_page trick in anon_pipe_buf_release() seems to be unrelated to
> > the splice() work. It should be a separate patch and any peformance
> > testing (needed, please) should be decoupled from that change.
>
> It's not unrelated. Note the new "page_count() == 1" test.

Yes, this is needed to make migrating pages from a pipe to the page
cache possible.

> > - The logic in do_splice() hurts my brain. "if `in' is a pipe then
> > splice from `in-as-a-pipe' to `out' else if `out' is a pipe then splice
> > from `in' to 'out-as-a-pipe'. Make sense, I guess, but I do wonder "what
> > would happen if those tests were reversed?". Nothing, I guess.
>
> Why would it matter? If both are pipes, then one is as good as the other.
> You just want to pick the version that is potentially more efficient, if
> there is any difference (and there is).
>
> However, I don't think Jens actually did the pipe->pipe case at all (ie
> pipes don't have the "splice_read()" function yet).

No it's not there yet, coverage will increase soon :)

> >
> > - In pipe_to_file():
> >
> > - Shouldn't it be using GFP_HIGHUSER()?
>
> Some filesystems may not like having highpages.
>
> I suspect it should be "mapping_gfp_mask(mapping)".

I actually already made it GFP_HIGHUSER yesterday in a non-yet committed
patch, so I'll check up on this and make the change.

--
Jens Axboe

2006-03-30 07:20:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Andrew Morton wrote:

>Linus Torvalds <[email protected]> wrote:
>
>>Right now "flags" doesn't do anything at all, and you should just pass in
>>zero.
>>
>
>In that case perhaps we should be enforcing flags==0 so that future
>flags-using applications will reliably fail on old flags-not-understanding
>kernels.
>
>But that won't work if we later define a bit in flags to mean "behave like
>old kernels used to". So perhaps we should require that bits 0-15 of
>`flags' be zero and not care about bits 16-31.
>
>IOW: it might be best to make `flags' just go away, and add new syscalls in
>the future as appropriate.
>

Well it is always going to transfer data from infd to outfd, isn't it?
If something comes up that does not do that, then that should be a new
syscall rather than a new flag.

flags just modify the manner of the transfer I think. Things should still
work if some flag is not supported, perhaps just not with optimal
performance. That said, unsupported flags probably should fail, shouldn't
they? The application / library could then retry with flags = 0.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-30 07:21:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Andrew Morton wrote:
> > Besides, they should
> > never be signed, if you do bitmasks and shifting on them: "int" is
> > strictly worse than "unsigned" when we're talking flags.
>
> Sure, but is there any gain in making flags 64-bit on 64-bit machines when
> we cannot use more than 32 bits in there anyway?

unsigned int seems fine to me.

> > Right now "flags" doesn't do anything at all, and you should just pass in
> > zero.
>
> In that case perhaps we should be enforcing flags==0 so that future
> flags-using applications will reliably fail on old flags-not-understanding
> kernels.
>
> But that won't work if we later define a bit in flags to mean "behave like
> old kernels used to". So perhaps we should require that bits 0-15 of
> `flags' be zero and not care about bits 16-31.
>
> IOW: it might be best to make `flags' just go away, and add new syscalls in
> the future as appropriate.

Not if flags == 0 maintains the same behaviour. The only flag I can
think of right now is the 'move' or 'gift' flag, meaning that the caller
wants to migrate pages from the pipe instead of copying them. I'd
imagine we'd get that in way before 2.6.17 anyways, so I think we're
fine.

--
Jens Axboe

2006-03-30 07:30:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
> > > Right now "flags" doesn't do anything at all, and you should just pass in
> > > zero.
> >
> > In that case perhaps we should be enforcing flags==0 so that future
> > flags-using applications will reliably fail on old flags-not-understanding
> > kernels.
> >
> > But that won't work if we later define a bit in flags to mean "behave like
> > old kernels used to". So perhaps we should require that bits 0-15 of
> > `flags' be zero and not care about bits 16-31.
> >
> > IOW: it might be best to make `flags' just go away, and add new syscalls in
> > the future as appropriate.
>
> Not if flags == 0 maintains the same behaviour. The only flag I can
> think of right now is the 'move' or 'gift' flag, meaning that the caller
> wants to migrate pages from the pipe instead of copying them. I'd
> imagine we'd get that in way before 2.6.17 anyways, so I think we're
> fine.

OK.. Do you plan to make it reject unrecognised flags?

2006-03-30 07:33:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, Mar 29 2006, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > > > Right now "flags" doesn't do anything at all, and you should just pass in
> > > > zero.
> > >
> > > In that case perhaps we should be enforcing flags==0 so that future
> > > flags-using applications will reliably fail on old flags-not-understanding
> > > kernels.
> > >
> > > But that won't work if we later define a bit in flags to mean "behave like
> > > old kernels used to". So perhaps we should require that bits 0-15 of
> > > `flags' be zero and not care about bits 16-31.
> > >
> > > IOW: it might be best to make `flags' just go away, and add new syscalls in
> > > the future as appropriate.
> >
> > Not if flags == 0 maintains the same behaviour. The only flag I can
> > think of right now is the 'move' or 'gift' flag, meaning that the caller
> > wants to migrate pages from the pipe instead of copying them. I'd
> > imagine we'd get that in way before 2.6.17 anyways, so I think we're
> > fine.
>
> OK.. Do you plan to make it reject unrecognised flags?

Depends, not sure if eg a 'move' flag should be a hard or soft
indication. Say we can't move a page and the caller asked us to migrate,
we'd probably just do the sane thing and copy that one page. It would be
silly to fail that request entirely.

--
Jens Axboe

2006-03-30 07:45:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support


(going over the other comments, thanks a lot for taking a good look
Andrew!)

On Wed, Mar 29 2006, Andrew Morton wrote:
> - Please don't call it `len'. VFS has to deal with "lengths" which can
> be in units of PAGE_CACHE_SIZE, fs blocksize, 512-bytes sectors or bytes,
> and it gets confusing. Our liking for variable names like `len' and
> `count' just makes it worse.
>
> If it's in units of pages then call it `npages'. If it's in bytes then
> call it `nbytes'.

Hmm well I usually use 'len' or something like that when it's a base
unit, like bytes. If it was in units of pages or something I agree it
would be confusing.

> What _is_ it in units of, anyway? I guess bytes, since it's size_t.
>
> I assume all this lenning:
>
> unsigned int this_len;
>
> this_len = buf->len;
> if (this_len > len)
> this_len = len;
>
> is dealing with bytes too. You'll be wanting a size_t in there.

But buf->len is unsigned int to begin with.

> - I think the `size_t left' in do_splice_to() can overflow if f_pos is
> sufficiently different from i_size.

They're both loff_t.

> - In pipe_to_file():
>
> - Shouldn't it be using GFP_HIGHUSER()?
>
> - local variable `index' should be unsigned long or, for clarity
> value, pgoff_t.

Fixed.

> - Incoming arg `pos' should be loff_t?

Fixed

> - It's racy against truncate(). After running ->readpage and
> lock_page(), need to check for page->mapping == NULL.

Indeed, fixed.

> - There's a duplicate flush_dcache_page().

Doh, fixed.

> - Why does it run write_one_page()??? (Don't tell me. I'll work it
> out when I see the commented version ;))

Can probably go, will re-check!

> - I worry a bit about the assumption in one place that a non-zero
> return from commit_write() indicates an error, whereas another place
> assumes that a negative return is an error. We had problems in the
> past where some a_ops implementations decided to return small positive
> numbers from prepare_write() or commit_write() a_ops, which broke
> stuff. They shouldn't be doing that now, but it's a thing to watch out
> for.

I'll just check for < 0 and be safe.

> - Bug. If write_one_page() returned an error, it still unlocked the page.

Ok, I guess that could be a future but (it can't fail with !wait now).

> - In pipe_to_sendpage():
>
> - local variable `offset' is ulong, but elsewhere you've used uint.
> The latter is better.

Fixed.

> - Again, incoming arg `len' is confusing. I _think_ it's actually
> "number of bytes to be moved from this page". A comment which explains
> these things would be nice, and perhaps a better name (bytes_to_send?)

It's bytes, yes. Comment added.

> - Should incoming arg `pos' be loff_t? That would give it some meaning.

Yes changed with the pipe_to_file() change since it modified the actor
typedef anyways.

> - Why does it use PAGE_SIZE and PAGE_SHIFT rather than PAGE_CACHE_*?

Fixed, the other places used PAGE_CACHE_*.

> - In generic_file_splice_read():
>
> - nonatomic modification of f_pos. Is i_mutex held? (see
> generic_file_llseek())

Fixed.

> - Darnit, we carried `flags' this far and ended up not using it.
> (What _does_ flags do, anyway? Reads on..)

It'll be passed to the actor next! Will probably change some of the
actor args into a little struct instead of passing so many variables.

> - In __generic_file_splice_read():
>
> - local variable `index' is ulong, could be pgoff_t (for clarity)

Fixed.

> - local variable `offset' could be uint (it is uint elsewhere, and
> might generate better code).

Fixed.

> - local variable `pages' could be uint (but watch out for overflow!!).

Fixed.

> A better name might be nr_pages (matches find_get_pages()). Then,
> local variable `array' can be renamed to `pages', which is all much
> better.

Agree, fixed.

> - whoa. We move the pages into the pipe while they're still under
> read I/O. Is that deliberate? (pls add nice comment).

It's fine, you need to call ->map() to get at it anways, which will lock
the page.

> - These pages can get truncated at any time they're unlocked. Does
> the code cope with all that?

I guess page_cache_pipe_buf_map() needs the same ->mapping check?

> - hm. What happens if the pages which find_get_pages() returned are
> not contiguous in pagecache? I think your `pages' array gets all
> jumbled up.

Hmm please expand.

> - In move_to_pipe()
>
> - Suggest you rename `pages' to `nr_pages', `array' to `pages'. And
> `i' to `page_nr'.

Done (except 'i').

> - local var `bufs' could be renamed `nrbufs' to align with
> pipe_inode_info and could be made uint.
>
> - Do we actually need local var `bufs'? It seems to be caching info->nrbufs.

Cleaner that way, imho.

> - release_pages() might be faster than one-at-a-time page_cache_release()

We should not hit that case very often. Not sure how to handle the
'cold' right now, so I'll just leave it.

--
Jens Axboe

2006-03-30 08:02:25

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

>> OK.. Do you plan to make it reject unrecognised flags?
>
>Depends, not sure if eg a 'move' flag should be a hard or soft
>indication. Say we can't move a page and the caller asked us to migrate,
>we'd probably just do the sane thing and copy that one page. It would be
>silly to fail that request entirely.
>

Another bit in the flags:

enum {
SPLICE_COPY = 1 << 0,
SPLICE_MOVE = 1 << 1,
SPLICE_MOVEORCOPY = 1 << 2, // fallback to copy if move fails
};


Jan Engelhardt
--

2006-03-30 08:03:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
>
> ...
>
> > - I think the `size_t left' in do_splice_to() can overflow if f_pos is
> > sufficiently different from i_size.
>
> They're both loff_t.

Nope:

+static long do_splice_to(struct file *in, struct inode *pipe, size_t len,
+ unsigned long flags)
+{
+ if (in->f_op && in->f_op->splice_read) {
+ loff_t isize = i_size_read(in->f_mapping->host);
+ size_t left;
+
+ if (unlikely(in->f_pos >= isize))
+ return 0;
+
+ left = isize - in->f_pos;

It's doing

32bit = 64bit - 64bit;

>
> > - In generic_file_splice_read():
> >
> > - nonatomic modification of f_pos. Is i_mutex held? (see
> > generic_file_llseek())
>
> Fixed.

OK. In some ways I agree with Nick that a pwrite/pread-like interface is
nicer, so things are more stateless and threads don't have to fight over
f_pos. Dunno..

> > - These pages can get truncated at any time they're unlocked. Does
> > the code cope with all that?
>
> I guess page_cache_pipe_buf_map() needs the same ->mapping check?

That would seem appropriate.

btw, that function might have a problem I think - it returns NULL with
the page locked, but pipe_to_sendpage() and other callers don't appear to
unlock it.

> > - hm. What happens if the pages which find_get_pages() returned are
> > not contiguous in pagecache? I think your `pages' array gets all
> > jumbled up.
>
> Hmm please expand.

find_get_pages() does "find me the next N pages above `index' which are
presently in pagecache'. So it can return an array of page*'s which do not
represent contiguous pages in the file - there can be holes in there.

IOW: pages[n]->index !necessarily= pages[n+1]->index-1

Maybe the code handles that by making sure that all the pages in the range
are already in pagecache - I didn't check. But that would take some heroic
locking.

> > - release_pages() might be faster than one-at-a-time page_cache_release()
>
> We should not hit that case very often. Not sure how to handle the
> 'cold' right now, so I'll just leave it.

OK. ("cold" is a wild-ass guess as to whether you think those pages'
contents are likely to be be in CPU cache. I'd guess "yes", so cold=0).

2006-03-30 08:09:28

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

>>
>> - splice() take a size_t length. Should it be taking a 64-bit length?
>
>No. You can't splice more than the kernel buffers anyway (ie currently
>PIPE_BUFFERS pages, ie ~64kB, although in theory somebody could use large
>pages for it), so 64-bit would be total overkill.
>
So unsigned int would be enough, would not it?


>> - why is the `flags' arg to sys_splice() unsigned long? Can it be `int'?
>
>flags are always unsigned long, haven't you noticed? [...]

On x86, an unsigned long holds 32 bit, so we can at most put 32 flags in a
'flags' argument, even on x64 -- to stay compatible/deterministic/you know.
So it sounds like a good thing to make it 'unsigned int' rather than
'unsigned long'. Saves stack space.


>> - All the operations do foo(in, out, ...). It's a bit more conventional
>> to do foo(out, in, ...).

man bzero(3)
man swab(3)
man bcopy(3)

It's not (out,in) everywhere. But there are also other extremes, like
outb() where (val,port) is rather confusing compared to the rest of the
world (port,val).



Jan Engelhardt
--

2006-03-30 08:10:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> >
> > ...
> >
> > > - I think the `size_t left' in do_splice_to() can overflow if f_pos is
> > > sufficiently different from i_size.
> >
> > They're both loff_t.
>
> Nope:
>
> +static long do_splice_to(struct file *in, struct inode *pipe, size_t len,
> + unsigned long flags)
> +{
> + if (in->f_op && in->f_op->splice_read) {
> + loff_t isize = i_size_read(in->f_mapping->host);
> + size_t left;
> +
> + if (unlikely(in->f_pos >= isize))
> + return 0;
> +
> + left = isize - in->f_pos;
>
> It's doing
>
> 32bit = 64bit - 64bit;

My mistake, I looked at a different spot. Fixed.

> >
> > > - In generic_file_splice_read():
> > >
> > > - nonatomic modification of f_pos. Is i_mutex held? (see
> > > generic_file_llseek())
> >
> > Fixed.
>
> OK. In some ways I agree with Nick that a pwrite/pread-like interface is
> nicer, so things are more stateless and threads don't have to fight over
> f_pos. Dunno..

I can go either way, I tend to agree that passing in offset may be the
best solution.

> > > - These pages can get truncated at any time they're unlocked. Does
> > > the code cope with all that?
> >
> > I guess page_cache_pipe_buf_map() needs the same ->mapping check?
>
> That would seem appropriate.
>
> btw, that function might have a problem I think - it returns NULL with
> the page locked, but pipe_to_sendpage() and other callers don't appear to
> unlock it.

Will fix that up.

> > > - hm. What happens if the pages which find_get_pages() returned are
> > > not contiguous in pagecache? I think your `pages' array gets all
> > > jumbled up.
> >
> > Hmm please expand.
>
> find_get_pages() does "find me the next N pages above `index' which are
> presently in pagecache'. So it can return an array of page*'s which do not
> represent contiguous pages in the file - there can be holes in there.
>
> IOW: pages[n]->index !necessarily= pages[n+1]->index-1
>
> Maybe the code handles that by making sure that all the pages in the range
> are already in pagecache - I didn't check. But that would take some heroic
> locking.

It doesn't, I'm assuming that find_get_pages() returns consequtive pages
atm. Would seem like the sane interface :-)

We continue doing find_or_create_page() on the remaining, but using 'i'
as the 'index' addition. So if we had non-conseq pages, we'd be screwed.

Needs a little thinking, input welcome..

--
Jens Axboe

2006-03-30 08:27:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
> > find_get_pages() does "find me the next N pages above `index' which are
> > presently in pagecache'. So it can return an array of page*'s which do not
> > represent contiguous pages in the file - there can be holes in there.
> >
> > IOW: pages[n]->index !necessarily= pages[n+1]->index-1
> >
> > Maybe the code handles that by making sure that all the pages in the range
> > are already in pagecache - I didn't check. But that would take some heroic
> > locking.
>
> It doesn't, I'm assuming that find_get_pages() returns consequtive pages
> atm. Would seem like the sane interface :-)

Yeah, sorry. It's a "gather what's presently there" thing. For writeback.

Nick has some gang-lookup-slots code. So instead of populating an array of
page*'s you can populate an array of (effectively) page**'s. Then one
could walk that. All while holding ->tree_lock. This doesn't help ;)

Or you could walk the pages[] array until you hit an ->index which doesn't
match and then toss the rest away. That's a bit of extra work, but in the
common case all the pages will be good. Perhaps.

> We continue doing find_or_create_page() on the remaining, but using 'i'
> as the 'index' addition. So if we had non-conseq pages, we'd be screwed.

Yup.

Probably the simplest for now is an open-coded find_get_page() loop. Later
on we should optimise that into a find_get_contig_pages() which only takes
tree_lock a single time.

Doing it with a new radix_tree_gang_lookup_contig_name_me_longer() would be
relatively straightforward too. It would bale out as soon as it hit a
not-present slot.

2006-03-30 08:51:27

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > > find_get_pages() does "find me the next N pages above `index' which are
> > > presently in pagecache'. So it can return an array of page*'s which do not
> > > represent contiguous pages in the file - there can be holes in there.
> > >
> > > IOW: pages[n]->index !necessarily= pages[n+1]->index-1
> > >
> > > Maybe the code handles that by making sure that all the pages in the range
> > > are already in pagecache - I didn't check. But that would take some heroic
> > > locking.
> >
> > It doesn't, I'm assuming that find_get_pages() returns consequtive pages
> > atm. Would seem like the sane interface :-)
>
> Yeah, sorry. It's a "gather what's presently there" thing. For writeback.
>
> Nick has some gang-lookup-slots code. So instead of populating an array of
> page*'s you can populate an array of (effectively) page**'s. Then one
> could walk that. All while holding ->tree_lock. This doesn't help ;)
>
> Or you could walk the pages[] array until you hit an ->index which doesn't
> match and then toss the rest away. That's a bit of extra work, but in the
> common case all the pages will be good. Perhaps.
>
> > We continue doing find_or_create_page() on the remaining, but using 'i'
> > as the 'index' addition. So if we had non-conseq pages, we'd be screwed.
>
> Yup.
>
> Probably the simplest for now is an open-coded find_get_page() loop. Later
> on we should optimise that into a find_get_contig_pages() which only takes
> tree_lock a single time.
>
> Doing it with a new radix_tree_gang_lookup_contig_name_me_longer() would be
> relatively straightforward too. It would bale out as soon as it hit a
> not-present slot.

I'll go for the simple approach right now, going over the returned
find_get_pages() array and moving pages around and filling holes doesn't
sound too alluring. Thanks!

--
Jens Axboe

2006-03-30 08:53:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Wed, 29 Mar 2006 14:28:41 +0200
Jens Axboe <[email protected]> wrote:
>
> + /*
> + * Get as many pages from the page cache as possible..
> + * Start IO on the page cache entries we create (we
> + * can assume that any pre-existing ones we find have
> + * already had IO started on them).
> + */
> + i = find_get_pages(mapping, index, pages, array);
> +

It looks page caches in this array is hold by pipe until data is consumed.
So..this page cannot be reclaimd or migrated and hot-removed :).
I don't know about sendfile() but this looks client can hold server's memory,
when server uses sendfile() 64k/conn.

Is there a way to force these pages to be freed ? or page reclaimer can
know this page is held by splice ? (we need additional PG_flags to do this ?)

I think these pages are necessary to be held only when data in them is used.

--Kame


2006-03-30 09:15:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Jens Axboe wrote:
> On Thu, Mar 30 2006, Andrew Morton wrote:
> > Jens Axboe <[email protected]> wrote:
> > >
> > > > find_get_pages() does "find me the next N pages above `index' which are
> > > > presently in pagecache'. So it can return an array of page*'s which do not
> > > > represent contiguous pages in the file - there can be holes in there.
> > > >
> > > > IOW: pages[n]->index !necessarily= pages[n+1]->index-1
> > > >
> > > > Maybe the code handles that by making sure that all the pages in the range
> > > > are already in pagecache - I didn't check. But that would take some heroic
> > > > locking.
> > >
> > > It doesn't, I'm assuming that find_get_pages() returns consequtive pages
> > > atm. Would seem like the sane interface :-)
> >
> > Yeah, sorry. It's a "gather what's presently there" thing. For writeback.
> >
> > Nick has some gang-lookup-slots code. So instead of populating an array of
> > page*'s you can populate an array of (effectively) page**'s. Then one
> > could walk that. All while holding ->tree_lock. This doesn't help ;)
> >
> > Or you could walk the pages[] array until you hit an ->index which doesn't
> > match and then toss the rest away. That's a bit of extra work, but in the
> > common case all the pages will be good. Perhaps.
> >
> > > We continue doing find_or_create_page() on the remaining, but using 'i'
> > > as the 'index' addition. So if we had non-conseq pages, we'd be screwed.
> >
> > Yup.
> >
> > Probably the simplest for now is an open-coded find_get_page() loop. Later
> > on we should optimise that into a find_get_contig_pages() which only takes
> > tree_lock a single time.
> >
> > Doing it with a new radix_tree_gang_lookup_contig_name_me_longer() would be
> > relatively straightforward too. It would bale out as soon as it hit a
> > not-present slot.
>
> I'll go for the simple approach right now, going over the returned
> find_get_pages() array and moving pages around and filling holes doesn't
> sound too alluring. Thanks!

Actually it isn't so bad, how does this look?

diff --git a/fs/splice.c b/fs/splice.c
index 6327a7c..e7bb2ed 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -161,7 +161,8 @@ static int __generic_file_splice_read(st
struct address_space *mapping = in->f_mapping;
unsigned int offset, nr_pages;
struct page *pages[PIPE_BUFFERS];
- pgoff_t index;
+ struct page *page;
+ pgoff_t index, pidx;
int i;

index = in->f_pos >> PAGE_CACHE_SHIFT;
@@ -180,30 +181,48 @@ static int __generic_file_splice_read(st
i = find_get_pages(mapping, index, nr_pages, pages);

/*
- * If not all pages were in the page-cache, we'll
- * just assume that the rest haven't been read in,
- * so we'll get the rest locked and start IO on
- * them if we can..
+ * common case - we found all pages, kick it off
*/
- while (i < nr_pages) {
- struct page *page;
- int error;
-
- page = find_or_create_page(mapping, index + i, GFP_USER);
- if (!page)
- break;
+ if (i == nr_pages)
+ goto splice_them;

- if (PageUptodate(page))
- unlock_page(page);
- else {
- error = mapping->a_ops->readpage(in, page);
- if (unlikely(error)) {
- page_cache_release(page);
+ /*
+ * find_get_pages() may not return consecutive pages, so loop
+ * over the array moving pages and filling the rest, if need be.
+ */
+ for (i = 0, pidx = index; i < nr_pages; pidx++, i++) {
+ if (!pages[i]) {
+ int error;
+fill_page:
+ /*
+ * no page there, look one up / create it
+ */
+ page = find_or_create_page(mapping, pidx, GFP_HIGHUSER);
+ if (!page)
break;
+
+ if (PageUptodate(page))
+ unlock_page(page);
+ else {
+ error = mapping->a_ops->readpage(in, page);
+
+ if (unlikely(error)) {
+ page_cache_release(page);
+ break;
+ }
}
+ pages[i] = page;
+ } else if (pages[i]->index != pidx) {
+ page = pages[i];
+ /*
+ * page isn't in the right spot, move it and jump
+ * back to filling this one. we know that ->index
+ * is larger than pidx
+ */
+ pages[i + page->index - pidx] = page;
+ pages[i] = NULL;
+ goto fill_page;
}
-
- pages[i++] = page;
}

if (!i)
@@ -212,6 +231,7 @@ static int __generic_file_splice_read(st
/*
* Now we splice them into the pipe..
*/
+splice_them:
return move_to_pipe(pipe, pages, i, offset, len);
}


--
Jens Axboe

2006-03-30 09:40:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
> Actually it isn't so bad, how does this look?
>
> ...
>
> @@ -180,30 +181,48 @@ static int __generic_file_splice_read(st
> i = find_get_pages(mapping, index, nr_pages, pages);
>
> /*
> - * If not all pages were in the page-cache, we'll
> - * just assume that the rest haven't been read in,
> - * so we'll get the rest locked and start IO on
> - * them if we can..
> + * common case - we found all pages, kick it off
> */
> - while (i < nr_pages) {
> - struct page *page;
> - int error;
> -
> - page = find_or_create_page(mapping, index + i, GFP_USER);
> - if (!page)
> - break;
> + if (i == nr_pages)
> + goto splice_them;

The return value from find_get_pages() is "how many pages did I find" - it
doesn't tell us whether they were contiguous.

How about

if (i && (pages[i - 1]->index == index + i - 1))

<thinks>

So if we asked for N pages starting at index=10 and got

[11, 13]

i == 2
pages[i-1]->index == 13
index + i - 1 == 11.

So I think it's OK. Yeah, it has to be - any gap at all in the returned
page array will make pages[i-1]->index too big.


The one-at-a-time logic looks OK from a quick scan. Do we have logic in
there to check that we're not overrunning i_size? (See the pain
do_generic_mapping_read() goes through).


argh, readahead. Really we should be kicking the readahead engine in there
as well. That's fairly straightforward - see do_generic_mapping_read().

Also, the code here _might_ be able to use do_page_cache_readahead() just
to prepopulate the pages which you know you'll be needing. There are no
guarantees that the pages will still be there when you want them of course,
but it's a decent way of putting a block of pages into a single BIO and
speeding up the common case. But if the code is calling
page_cache_readahead() it won't need to do that.

2006-03-30 09:45:16

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > Actually it isn't so bad, how does this look?
> >
> > ...
> >
> > @@ -180,30 +181,48 @@ static int __generic_file_splice_read(st
> > i = find_get_pages(mapping, index, nr_pages, pages);
> >
> > /*
> > - * If not all pages were in the page-cache, we'll
> > - * just assume that the rest haven't been read in,
> > - * so we'll get the rest locked and start IO on
> > - * them if we can..
> > + * common case - we found all pages, kick it off
> > */
> > - while (i < nr_pages) {
> > - struct page *page;
> > - int error;
> > -
> > - page = find_or_create_page(mapping, index + i, GFP_USER);
> > - if (!page)
> > - break;
> > + if (i == nr_pages)
> > + goto splice_them;
>
> The return value from find_get_pages() is "how many pages did I find" - it
> doesn't tell us whether they were contiguous.

Oh right, so there's still a hole there. The above logic foolishly
thinks that if i == nr_pages, they must be contig. But I can see that
may not be the case. Additionally, I need to init pages[] from i and
forward, since we may be looking at that in the loop.

> How about
>
> if (i && (pages[i - 1]->index == index + i - 1))
>
> <thinks>
>
> So if we asked for N pages starting at index=10 and got
>
> [11, 13]
>
> i == 2
> pages[i-1]->index == 13
> index + i - 1 == 11.
>
> So I think it's OK. Yeah, it has to be - any gap at all in the returned
> page array will make pages[i-1]->index too big.

Agree, that should be sound. I'll adjust the code.

> The one-at-a-time logic looks OK from a quick scan. Do we have logic in
> there to check that we're not overrunning i_size? (See the pain
> do_generic_mapping_read() goes through).

do_splice_to() checks that, should I move that checking further down in
case the file is truncated?

> argh, readahead. Really we should be kicking the readahead engine in there
> as well. That's fairly straightforward - see do_generic_mapping_read().
>
> Also, the code here _might_ be able to use do_page_cache_readahead() just
> to prepopulate the pages which you know you'll be needing. There are no
> guarantees that the pages will still be there when you want them of course,
> but it's a decent way of putting a block of pages into a single BIO and
> speeding up the common case. But if the code is calling
> page_cache_readahead() it won't need to do that.

I'll look into readahead.

--
Jens Axboe

2006-03-30 09:56:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe <[email protected]> wrote:
>
> > The one-at-a-time logic looks OK from a quick scan. Do we have logic in
> > there to check that we're not overrunning i_size? (See the pain
> > do_generic_mapping_read() goes through).
>
> do_splice_to() checks that, should I move that checking further down in
> case the file is truncated?

Again, see do_generic_mapping_read()'s ghastly tricks - it checks i_size
after each readpage().

i_size can increase or decrease under our feet if we're not holding i_mutex
(and we don't want to). So userspace is being silly and the main things we
need to care about here are to not leak uninitialised data and to not oops.
A readpage() outside i_size will return either all-zeroes or some valid
data which isn't actually within i_size any more, so I guess we're OK.



2006-03-30 10:01:09

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > > The one-at-a-time logic looks OK from a quick scan. Do we have logic in
> > > there to check that we're not overrunning i_size? (See the pain
> > > do_generic_mapping_read() goes through).
> >
> > do_splice_to() checks that, should I move that checking further down in
> > case the file is truncated?
>
> Again, see do_generic_mapping_read()'s ghastly tricks - it checks i_size
> after each readpage().
>
> i_size can increase or decrease under our feet if we're not holding i_mutex
> (and we don't want to). So userspace is being silly and the main things we
> need to care about here are to not leak uninitialised data and to not oops.
> A readpage() outside i_size will return either all-zeroes or some valid
> data which isn't actually within i_size any more, so I guess we're OK.

So from the looks of things, worst case is returning zeroes if we
'raced' with someone truncating the file. I'll just leave it as-is for
now.

--
Jens Axboe

2006-03-30 13:53:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, KAMEZAWA Hiroyuki wrote:
> On Wed, 29 Mar 2006 14:28:41 +0200
> Jens Axboe <[email protected]> wrote:
> >
> > + /*
> > + * Get as many pages from the page cache as possible..
> > + * Start IO on the page cache entries we create (we
> > + * can assume that any pre-existing ones we find have
> > + * already had IO started on them).
> > + */
> > + i = find_get_pages(mapping, index, pages, array);
> > +
>
> It looks page caches in this array is hold by pipe until data is consumed.
> So..this page cannot be reclaimd or migrated and hot-removed :).

Right

> I don't know about sendfile() but this looks client can hold server's
> memory, when server uses sendfile() 64k/conn.

You mean when the server uses splice, 64kb (well 16 pages actually) /
connection? That's a correct observation, I wouldn't think that pinning
that small a number of pages is likely to cause any issues. At least I
can think of much worse pinning by just doing IO :-)

> Is there a way to force these pages to be freed ? or page reclaimer
> can know this page is held by splice ? (we need additional PG_flags to
> do this ?)
>
> I think these pages are necessary to be held only when data in them is
> used.

Not without tearing down the pipe.

--
Jens Axboe

2006-03-30 14:05:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, 30 Mar 2006 15:53:46 +0200
Jens Axboe <[email protected]> wrote:

> > I don't know about sendfile() but this looks client can hold server's
> > memory, when server uses sendfile() 64k/conn.
>
> You mean when the server uses splice, 64kb (well 16 pages actually) /
> connection? That's a correct observation, I wouldn't think that pinning
> that small a number of pages is likely to cause any issues. At least I
> can think of much worse pinning by just doing IO :-)
>
My point is consumer can sleep forever and pages are pinnded forever.
And people who use splice() will not notice they are pinning pages.

But as you say, it's not problem in usual situation.
Maybe I'm too pessimistic how my cusomers play with Linux ;)

-- Kame

2006-03-30 14:38:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, Mar 30 2006, KAMEZAWA Hiroyuki wrote:
> On Thu, 30 Mar 2006 15:53:46 +0200
> Jens Axboe <[email protected]> wrote:
>
> > > I don't know about sendfile() but this looks client can hold server's
> > > memory, when server uses sendfile() 64k/conn.
> >
> > You mean when the server uses splice, 64kb (well 16 pages actually) /
> > connection? That's a correct observation, I wouldn't think that pinning
> > that small a number of pages is likely to cause any issues. At least I
> > can think of much worse pinning by just doing IO :-)
> >
> My point is consumer can sleep forever and pages are pinnded forever.
> And people who use splice() will not notice they are pinning pages.
>
> But as you say, it's not problem in usual situation.
> Maybe I'm too pessimistic how my cusomers play with Linux ;)

It's a valid concern, however as mentioned there's a number of ways in
which a user can pin memory already. Perhaps this general problem should
be capped elsewhere?

--
Jens Axboe

2006-03-30 14:55:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

On Thu, 30 Mar 2006 16:38:10 +0200
Jens Axboe <[email protected]> wrote:

> On Thu, Mar 30 2006, KAMEZAWA Hiroyuki wrote:
> > On Thu, 30 Mar 2006 15:53:46 +0200
> > Jens Axboe <[email protected]> wrote:
> >
> > > > I don't know about sendfile() but this looks client can hold server's
> > > > memory, when server uses sendfile() 64k/conn.
> > >
> > > You mean when the server uses splice, 64kb (well 16 pages actually) /
> > > connection? That's a correct observation, I wouldn't think that pinning
> > > that small a number of pages is likely to cause any issues. At least I
> > > can think of much worse pinning by just doing IO :-)
> > >
> > My point is consumer can sleep forever and pages are pinnded forever.
> > And people who use splice() will not notice they are pinning pages.
> >
> > But as you say, it's not problem in usual situation.
> > Maybe I'm too pessimistic how my cusomers play with Linux ;)
>
> It's a valid concern, however as mentioned there's a number of ways in
> which a user can pin memory already.
Yes.

>Perhaps this general problem should be capped elsewhere?
>
I don't know. but this new one cannot be catched by overcommit_memory but
a user can consume not-reclaimable memory.

To be honest, I have to work with crash-dump. Sometimes cutomers request me to
find out "how pages is used and why memory cannot be reclaimed ?" from dump.
So, I don't like unknown "1" reference to page-cache from some codes.
splice can increase this unkonwn 1 reference to some extent.

But I like idea of splice itself :).

-Kame

2006-03-30 16:47:58

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe wrote:
> On Thu, Mar 30 2006, Nick Piggin wrote:

>>Moving a page from a pipe to a filesystem might be harder, because you
>>don't know if it came from a filesystem (still on LRU) or not (in which
>>case you need to add it to LRU). If only you can keep track of this
>
>
> Well that, to me, is _the_ hard problem to solve for this. But you
> sort-of do know, my plan is/was to add a ->steal() hook to the pipe
> buffers that would 'unhook' the page so it was in a clean state to be
> added to the LRU/page cache again. If stealing failed, just fall back to
> copying (or hard error, let the flags decide).

Yeah, that's probably best to start with. If 'gift' copies become
widely used and performance critical we can look at tricks to speed
it up further (not that this way would be particularly slow itself).

My first thought is that falling back to copying would be the best
idea, because otherwise you can get random failures for any number of
reasons (dirty pages, page reclaim, migration, get_user_pages, etc).
If a future usage wants an error, we could add an extra flag?

>>
>>Unless someone beats me to it, I'll try coding something up when I get
>>a bit more free time.
>
>
> You are more than welcome, I hope to give it a little shot today and see
> how it goes.

Sure, I wouldn't get around to it for a while yet anyway, but I'd be
happy to review your patch if you do it earlier.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-30 18:34:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:

>> It doesn't, I'm assuming that find_get_pages() returns consequtive pages
>> atm. Would seem like the sane interface :-)
>
>
> Yeah, sorry. It's a "gather what's presently there" thing. For writeback.
>
> Nick has some gang-lookup-slots code. So instead of populating an array of
> page*'s you can populate an array of (effectively) page**'s. Then one
> could walk that. All while holding ->tree_lock. This doesn't help ;)
>

Actually while we're on the subject, my gang_lookup_slot code is just
named to match lookup_slot and gang_lookup... it still only returns
slots that are presently populated. Suggestions for a better name
welcome?

> Probably the simplest for now is an open-coded find_get_page() loop. Later

Agreed. I think that's the best idea for now.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-30 18:58:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] splice support

Jens Axboe wrote:
> On Thu, Mar 30 2006, Andrew Morton wrote:

>>Maybe the code handles that by making sure that all the pages in the range
>>are already in pagecache - I didn't check. But that would take some heroic
>>locking.
>
>
> It doesn't, I'm assuming that find_get_pages() returns consequtive pages
> atm. Would seem like the sane interface :-)
>

It doesn't.

You could do a find_get_pages_and_holes (or something along those
lines), which would fetch contiguous pagecache and stick NULLs
if pages don't exist.

Would require a bit of radix-tree support to do it nicely, but you
could get started with a naive find_get_page loop.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com