LinuxLists.cc - [rfc patch 3/4] splice: remove confirm from pipe_buf

2008-06-21 15:48:52

Subject: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

From: Miklos Szeredi <[email protected]>

The 'confirm' operation was only used for splicing from page cache, to
wait for read on a page to finish. But generic_file_splice_read()
already blocks on readahead reads, so it seems logical to block on the
rare and slow single page reads too.

So wait for readpage to finish inside __generic_file_splice_read() and
remove the 'confirm' method.

This also fixes short return counts when the filesystem (e.g. fuse)
invalidates the page between insertation and removal.

Signed-off-by: Miklos Szeredi <[email protected]>
---
drivers/block/loop.c | 5 ---
fs/nfsd/vfs.c | 9 ------
fs/pipe.c | 27 ------------------
fs/splice.c | 69 +++++-----------------------------------------
include/linux/pipe_fs_i.h | 22 +-------------
kernel/relay.c | 1
net/core/skbuff.c | 1
7 files changed, 11 insertions(+), 123 deletions(-)

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c 2008-06-21 11:46:52.000000000 +0200
+++ linux-2.6/drivers/block/loop.c 2008-06-21 11:51:39.000000000 +0200
@@ -395,11 +395,6 @@ lo_splice_actor(struct pipe_inode_info *
struct page *page = buf->page;
sector_t IV;
size_t size;
- int ret;
-
- ret = buf->ops->confirm(pipe, buf);
- if (unlikely(ret))
- return ret;

IV = ((sector_t) page->index << (PAGE_CACHE_SHIFT - 9)) +
(buf->offset >> 9);
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c 2008-06-21 11:46:52.000000000 +0200
+++ linux-2.6/fs/pipe.c 2008-06-21 11:47:09.000000000 +0200
@@ -223,26 +223,10 @@ void generic_pipe_buf_get(struct pipe_in
page_cache_get(buf->page);
}

-/**
- * generic_pipe_buf_confirm - verify contents of the pipe buffer
- * @info: the pipe that the buffer belongs to
- * @buf: the buffer to confirm
- *
- * Description:
- * This function does nothing, because the generic pipe code uses
- * pages that are always good when inserted into the pipe.
- */
-int generic_pipe_buf_confirm(struct pipe_inode_info *info,
- struct pipe_buffer *buf)
-{
- return 0;
-}
-
static const struct pipe_buf_operations anon_pipe_buf_ops = {
.can_merge = 1,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = generic_pipe_buf_confirm,
.release = anon_pipe_buf_release,
.get = generic_pipe_buf_get,
};
@@ -281,13 +265,6 @@ pipe_read(struct kiocb *iocb, const stru
if (chars > total_len)
chars = total_len;

- error = ops->confirm(pipe, buf);
- if (error) {
- if (!ret)
- error = ret;
- break;
- }
-
atomic = !iov_fault_in_pages_write(iov, chars);
redo:
addr = ops->map(pipe, buf, atomic);
@@ -402,10 +379,6 @@ pipe_write(struct kiocb *iocb, const str
int error, atomic = 1;
void *addr;

- error = ops->confirm(pipe, buf);
- if (error)
- goto out;
-
iov_fault_in_pages_read(iov, chars);
redo1:
addr = ops->map(pipe, buf, atomic);
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c 2008-06-21 11:46:52.000000000 +0200
+++ linux-2.6/fs/splice.c 2008-06-21 11:49:52.000000000 +0200
@@ -37,53 +37,10 @@ static void page_cache_pipe_buf_release(
buf->flags &= ~PIPE_BUF_FLAG_LRU;
}

-/*
- * Check whether the contents of buf is OK to access. Since the content
- * is a page cache page, IO may be in flight.
- */
-static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf)
-{
- struct page *page = buf->page;
- int err;
-
- if (!PageUptodate(page)) {
- lock_page(page);
-
- /*
- * Page got truncated/unhashed. This will cause a 0-byte
- * splice, if this is the first page.
- */
- if (!page->mapping) {
- err = -ENODATA;
- goto error;
- }
-
- /*
- * Uh oh, read-error from disk.
- */
- if (!PageUptodate(page)) {
- err = -EIO;
- goto error;
- }
-
- /*
- * Page is ok afterall, we are done.
- */
- unlock_page(page);
- }
-
- return 0;
-error:
- unlock_page(page);
- return err;
-}
-
static const struct pipe_buf_operations page_cache_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = page_cache_pipe_buf_confirm,
.release = page_cache_pipe_buf_release,
.get = generic_pipe_buf_get,
};
@@ -92,7 +49,6 @@ static const struct pipe_buf_operations
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = generic_pipe_buf_confirm,
.release = page_cache_pipe_buf_release,
.get = generic_pipe_buf_get,
};
@@ -349,6 +305,11 @@ __generic_file_splice_read(struct file *

break;
}
+ wait_on_page_locked(page);
+ if (!PageUptodate(page)) {
+ error = -EIO;
+ break;
+ }
}
fill_it:
/*
@@ -451,13 +412,10 @@ static int pipe_to_sendpage(struct pipe_
loff_t pos = sd->pos;
int ret, more;

- ret = buf->ops->confirm(pipe, buf);
- if (!ret) {
- more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
+ more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;

- ret = file->f_op->sendpage(file, buf->page, buf->offset,
- sd->len, &pos, more);
- }
+ ret = file->f_op->sendpage(file, buf->page, buf->offset,
+ sd->len, &pos, more);

return ret;
}
@@ -492,13 +450,6 @@ static int pipe_to_file(struct pipe_inod
void *fsdata;
int ret;

- /*
- * make sure the data in this buffer is uptodate
- */
- ret = buf->ops->confirm(pipe, buf);
- if (unlikely(ret))
- return ret;
-
offset = sd->pos & ~PAGE_CACHE_MASK;

this_len = sd->len;
@@ -1231,10 +1182,6 @@ static int pipe_to_user(struct pipe_inod
char *src;
int ret;

- ret = buf->ops->confirm(pipe, buf);
- if (unlikely(ret))
- return ret;
-
/*
* See if we can use the atomic maps, by prefaulting in the
* pages and doing an atomic copy
Index: linux-2.6/include/linux/pipe_fs_i.h
===================================================================
--- linux-2.6.orig/include/linux/pipe_fs_i.h 2008-06-21 11:46:52.000000000 +0200
+++ linux-2.6/include/linux/pipe_fs_i.h 2008-06-21 11:47:09.000000000 +0200
@@ -58,16 +58,8 @@ struct pipe_inode_info {
};

/*
- * Note on the nesting of these functions:
- *
- * ->confirm()
- * ->map()
- * ...
- * ->unmap()
- *
- * That is, ->map() must be called on a confirmed buffer. See below
- * for the meaning of each operation. Also see kerneldoc in fs/pipe.c
- * for the pipe and generic variants of these hooks.
+ * Also see kerneldoc in fs/pipe.c for the pipe and generic variants
+ * of these hooks.
*/
struct pipe_buf_operations {
/*
@@ -97,15 +89,6 @@ struct pipe_buf_operations {
void (*unmap)(struct pipe_inode_info *, struct pipe_buffer *, void *);

/*
- * ->confirm() verifies that the data in the pipe buffer is there
- * and that the contents are good. If the pages in the pipe belong
- * to a file system, we may need to wait for IO completion in this
- * hook. Returns 0 for good, or a negative error value in case of
- * error.
- */
- int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);
-
- /*
* When the contents of this pipe buffer has been completely
* consumed by a reader, ->release() is called.
*/
@@ -132,6 +115,5 @@ void __free_pipe_info(struct pipe_inode_
void *generic_pipe_buf_map(struct pipe_inode_info *, struct pipe_buffer *, int);
void generic_pipe_buf_unmap(struct pipe_inode_info *, struct pipe_buffer *, void *);
void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
-int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);

#endif
Index: linux-2.6/kernel/relay.c
===================================================================
--- linux-2.6.orig/kernel/relay.c 2008-06-21 11:47:05.000000000 +0200
+++ linux-2.6/kernel/relay.c 2008-06-21 11:47:09.000000000 +0200
@@ -1079,7 +1079,6 @@ static struct pipe_buf_operations relay_
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = generic_pipe_buf_confirm,
.release = relay_pipe_buf_release,
.get = generic_pipe_buf_get,
};
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c 2008-06-21 11:46:52.000000000 +0200
+++ linux-2.6/net/core/skbuff.c 2008-06-21 11:47:09.000000000 +0200
@@ -93,7 +93,6 @@ static struct pipe_buf_operations sock_p
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = generic_pipe_buf_confirm,
.release = sock_pipe_buf_release,
.get = sock_pipe_buf_get,
};
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c 2008-06-19 14:58:10.000000000 +0200
+++ linux-2.6/fs/nfsd/vfs.c 2008-06-21 11:50:57.000000000 +0200
@@ -839,14 +839,7 @@ nfsd_splice_actor(struct pipe_inode_info
struct svc_rqst *rqstp = sd->u.data;
struct page **pp = rqstp->rq_respages + rqstp->rq_resused;
struct page *page = buf->page;
- size_t size;
- int ret;
-
- ret = buf->ops->confirm(pipe, buf);
- if (unlikely(ret))
- return ret;
-
- size = sd->len;
+ size_t size = sd->len;

if (rqstp->rq_res.page_len == 0) {
get_page(page);

--

2008-06-24 08:05:02

by Jens Axboe

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Sat, Jun 21 2008, Miklos Szeredi wrote:
> From: Miklos Szeredi <[email protected]>
>
> The 'confirm' operation was only used for splicing from page cache, to
> wait for read on a page to finish. But generic_file_splice_read()
> already blocks on readahead reads, so it seems logical to block on the
> rare and slow single page reads too.
>
> So wait for readpage to finish inside __generic_file_splice_read() and
> remove the 'confirm' method.
>
> This also fixes short return counts when the filesystem (e.g. fuse)
> invalidates the page between insertation and removal.

One of the basic goals of splice is to allow the pipe buffer to only be
consisten when a consumer asks for it, otherwise the filling will always
be sync. There should be no blocking on reads in the splice-in path,
only on consumption for splice-out.

So nack on this one, too.

--
Jens Axboe

2008-06-24 08:55:38

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> > The 'confirm' operation was only used for splicing from page cache, to
> > wait for read on a page to finish. But generic_file_splice_read()
> > already blocks on readahead reads, so it seems logical to block on the
> > rare and slow single page reads too.
> >
> > So wait for readpage to finish inside __generic_file_splice_read() and
> > remove the 'confirm' method.
> >
> > This also fixes short return counts when the filesystem (e.g. fuse)
> > invalidates the page between insertation and removal.
>
> One of the basic goals of splice is to allow the pipe buffer to only be
> consisten when a consumer asks for it, otherwise the filling will always
> be sync. There should be no blocking on reads in the splice-in path,
> only on consumption for splice-out.

What you are ignoring (and I've mentioned in the changelog) is that it
is *already* sync. Look at the code: this starts I/O:

page_cache_sync_readahead(mapping, &in->f_ra, in,
index, req_pages - spd.nr_pages);

And this waits for it to finish:

if (!PageUptodate(page)) {
...
lock_page(page);

The only way it will be async, is if there's no readahead. But do we
want to optmize that case?

Miklos

2008-06-24 11:19:28

by Jens Axboe

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, Jun 24 2008, Miklos Szeredi wrote:
> > > The 'confirm' operation was only used for splicing from page cache, to
> > > wait for read on a page to finish. But generic_file_splice_read()
> > > already blocks on readahead reads, so it seems logical to block on the
> > > rare and slow single page reads too.
> > >
> > > So wait for readpage to finish inside __generic_file_splice_read() and
> > > remove the 'confirm' method.
> > >
> > > This also fixes short return counts when the filesystem (e.g. fuse)
> > > invalidates the page between insertation and removal.
> >
> > One of the basic goals of splice is to allow the pipe buffer to only be
> > consisten when a consumer asks for it, otherwise the filling will always
> > be sync. There should be no blocking on reads in the splice-in path,
> > only on consumption for splice-out.
>
> What you are ignoring (and I've mentioned in the changelog) is that it
> is *already* sync. Look at the code: this starts I/O:
>
> page_cache_sync_readahead(mapping, &in->f_ra, in,
> index, req_pages - spd.nr_pages);
>
> And this waits for it to finish:
>
> if (!PageUptodate(page)) {
> ...
> lock_page(page);
>
> The only way it will be async, is if there's no readahead. But do we
> want to optmize that case?

It's an unfortunate side effect of the read-ahead, I'd much rather just
get rid of that. It _should_ behave like the non-ra case, when a page is
added it merely has IO started on it. So we want to have that be
something like

if (!PageUptodate(page) && !PageInFlight(page))
...

basically like PageWriteback(), but for read-in.

--
Jens Axboe

2008-06-24 11:36:48

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> It's an unfortunate side effect of the read-ahead, I'd much rather just
> get rid of that. It _should_ behave like the non-ra case, when a page is
> added it merely has IO started on it. So we want to have that be
> something like
>
> if (!PageUptodate(page) && !PageInFlight(page))
> ...
>
> basically like PageWriteback(), but for read-in.

OK it could be done, possibly at great pain. But why is it important?
What's the use case where it matters that splice-in should not block
on the read?

And note, after the pipe is full it will block no matter what, since
the consumer will have to wait until the page is brought uptodate, and
can only then commence with getting the data out from the pipe.

Miklos

2008-06-24 11:47:32

by Evgeniy Polyakov

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, Jun 24, 2008 at 01:36:35PM +0200, Miklos Szeredi ([email protected]) wrote:
> > basically like PageWriteback(), but for read-in.
>
> OK it could be done, possibly at great pain. But why is it important?

Maybe not that great if mark all readahead pages as, well, readahead,
and do the same for readpage (essnetially it is the same).

> What's the use case where it matters that splice-in should not block
> on the read?

To be able to transfer what was already read?

--
Evgeniy Polyakov

2008-06-24 12:02:47

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> > > basically like PageWriteback(), but for read-in.
> >
> > OK it could be done, possibly at great pain. But why is it important?
>
> Maybe not that great if mark all readahead pages as, well, readahead,
> and do the same for readpage (essnetially it is the same).

It isn't that easy. Readahead (->readpages()) is best effort, and is
allowed to not bring the page uptodate, since it will be retried with
->readpage(). I don't know whether any filesystems actually do that,
but it's allowed nonetheless.

> > What's the use case where it matters that splice-in should not block
> > on the read?
>
> To be able to transfer what was already read?

That needs the consumer to be non-blocking...

Umm, one more reason why the ->confirm() stuff is currently busted:
pipe_read() will block on such a buffer even if pipe file is marked
O_NONBLOCK. Fixing that would take a hell of a lot of added
complexity in pipe_poll(), etc...

Miklos

2008-06-24 12:17:19

by Nick Piggin

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tuesday 24 June 2008 21:36, Miklos Szeredi wrote:
> > It's an unfortunate side effect of the read-ahead, I'd much rather just
> > get rid of that. It _should_ behave like the non-ra case, when a page is
> > added it merely has IO started on it. So we want to have that be
> > something like
> >
> > if (!PageUptodate(page) && !PageInFlight(page))
> > ...
> >
> > basically like PageWriteback(), but for read-in.
>
> OK it could be done, possibly at great pain. But why is it important?

It has been considered, but adding atomic operations on these paths
always really hurts. Adding something like this would basically be
another at least 2 atomic operations that can never be removed again...

Provided that you've done the sync readahead earlier, it presumably
should be a very rare case to have to start new IO in the loop
below, right? In which case, I wonder if we couldn't move that 2nd
loop out of generic_file_splice_read and into
page_cache_pipe_buf_confirm.

> What's the use case where it matters that splice-in should not block
> on the read?

It just makes it generally less able to pipeline IO and computation,
doesn't it?

> And note, after the pipe is full it will block no matter what, since
> the consumer will have to wait until the page is brought uptodate, and
> can only then commence with getting the data out from the pipe.

True, but (especially with patches to variably size the pipe buffer)
I imagine programs could be designed fairly carefully to the size of
the buffer (and not just things that blast bulk data down the pipe...)

2008-06-24 12:17:35

by Evgeniy Polyakov

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, Jun 24, 2008 at 02:02:19PM +0200, Miklos Szeredi ([email protected]) wrote:
> > Maybe not that great if mark all readahead pages as, well, readahead,
> > and do the same for readpage (essnetially it is the same).
>
> It isn't that easy. Readahead (->readpages()) is best effort, and is
> allowed to not bring the page uptodate, since it will be retried with
> ->readpage(). I don't know whether any filesystems actually do that,
> but it's allowed nonetheless.

Yes, there is such filesystem :)
It is quite useful for network FS, since it does not bother to wait until
pages are in the cache and can try next request. Anyone who scheduled
readahead has full control over that pages and is allowed to set/clear
whatever flags it want (pages are locked), so it would be a great win to
set page as being read and unlocked. It can be a policy to clear read
bit when page is evicted from the cache by failed readahead/readpage(s).

> > > What's the use case where it matters that splice-in should not block
> > > on the read?
> >
> > To be able to transfer what was already read?
>
> That needs the consumer to be non-blocking...
>
> Umm, one more reason why the ->confirm() stuff is currently busted:
> pipe_read() will block on such a buffer even if pipe file is marked
> O_NONBLOCK. Fixing that would take a hell of a lot of added
> complexity in pipe_poll(), etc...

Yes, nonblocking splice is tricky and it covers only half of the users.

--
Evgeniy Polyakov

2008-06-24 12:23:17

by Jens Axboe

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, Jun 24 2008, Nick Piggin wrote:
> On Tuesday 24 June 2008 21:36, Miklos Szeredi wrote:
> > > It's an unfortunate side effect of the read-ahead, I'd much rather just
> > > get rid of that. It _should_ behave like the non-ra case, when a page is
> > > added it merely has IO started on it. So we want to have that be
> > > something like
> > >
> > > if (!PageUptodate(page) && !PageInFlight(page))
> > > ...
> > >
> > > basically like PageWriteback(), but for read-in.
> >
> > OK it could be done, possibly at great pain. But why is it important?
>
> It has been considered, but adding atomic operations on these paths
> always really hurts. Adding something like this would basically be
> another at least 2 atomic operations that can never be removed again...
>
> Provided that you've done the sync readahead earlier, it presumably
> should be a very rare case to have to start new IO in the loop
> below, right? In which case, I wonder if we couldn't move that 2nd
> loop out of generic_file_splice_read and into
> page_cache_pipe_buf_confirm.

That's a good point, moving those blocks of code to the other end makes
a lot of sense. Or just kill the read-ahead, or at least do it
differently. It's definitely an oversight/bug having splice from file
block on the pages it just issued read-ahead for.

> > What's the use case where it matters that splice-in should not block
> > on the read?
>
> It just makes it generally less able to pipeline IO and computation,
> doesn't it?

Precisely!

> > And note, after the pipe is full it will block no matter what, since
> > the consumer will have to wait until the page is brought uptodate, and
> > can only then commence with getting the data out from the pipe.
>
> True, but (especially with patches to variably size the pipe buffer)
> I imagine programs could be designed fairly carefully to the size of
> the buffer (and not just things that blast bulk data down the pipe...)

Yep, that's the whole premise for the dynpipe branch I've been carrying
around for some time.

--
Jens Axboe

2008-06-24 13:02:19

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> > > It's an unfortunate side effect of the read-ahead, I'd much rather just
> > > get rid of that. It _should_ behave like the non-ra case, when a page is
> > > added it merely has IO started on it. So we want to have that be
> > > something like
> > >
> > > if (!PageUptodate(page) && !PageInFlight(page))
> > > ...
> > >
> > > basically like PageWriteback(), but for read-in.
> >
> > OK it could be done, possibly at great pain. But why is it important?
>
> It has been considered, but adding atomic operations on these paths
> always really hurts. Adding something like this would basically be
> another at least 2 atomic operations that can never be removed again...
>
> Provided that you've done the sync readahead earlier, it presumably
> should be a very rare case to have to start new IO in the loop
> below, right? In which case, I wonder if we couldn't move that 2nd
> loop out of generic_file_splice_read and into
> page_cache_pipe_buf_confirm.

The problem with that second loop (which started this thing) is that
if a page is invalidated by the filesystem, then it doesn't redo the
lookup/read like the plain cached read does.

And that can't be done in page_cache_pipe_buf_confirm() at all.

> > What's the use case where it matters that splice-in should not block
> > on the read?
>
> It just makes it generally less able to pipeline IO and computation,
> doesn't it?

Maybe. I don't really see how splice might be used that would be
helped by this. Do you have a concrete example?

In fact I don't really know at all what splice is being used for
(other than the in kernel uses: nfsd, sendfile).

Thanks,
Miklos

2008-06-24 17:31:24

by Linus Torvalds

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, 24 Jun 2008, Miklos Szeredi wrote:
>
> OK it could be done, possibly at great pain. But why is it important?
> What's the use case where it matters that splice-in should not block
> on the read?

If you're splicing from one file to another, the _goal_ you should have is
that you want to have a mode where you can literally steal the page, and
never _ever_ be IO-synchronous (well, meta-data accesses will be, you
can't really avoid that sanely).

IOW, it should be possible to do a

- splice() file->pipe with SPLICE_STEAL
don't even wait for the read to finish!

- splice() pipe->file
insert the page into the destination page cache, mark it dirty

an no, we probably do not support that yet (for example, I wouldn't be
surprised if "dirty + !uptodate" is considered an error for the VM even
though the page should still be locked from the read), but it really was a
design goal.

Also, asynchronous is important even when you "just" want to overlap IO
with CPU, so even if it's going to the network, then if you can delay the
"wait for IO to complete" until the last possible moment (ie the _second_
splice, when you end up copying it into an SKB, then both your throughput
and your latency are likely going to be noticeably better, because you've
now been able to do a lot of the costly CPU work (system exit + entry at
the least, but hopefully a noticeable portion of the TCP stack too)
overlapped with the disk seeking.

So asynchronous ops was really one of the big goals for splice.

Linus

2008-06-24 18:24:40

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> >
> > OK it could be done, possibly at great pain. But why is it important?
> > What's the use case where it matters that splice-in should not block
> > on the read?
>
> If you're splicing from one file to another, the _goal_ you should have is
> that you want to have a mode where you can literally steal the page, and
> never _ever_ be IO-synchronous (well, meta-data accesses will be, you
> can't really avoid that sanely).
>
> IOW, it should be possible to do a
>
> - splice() file->pipe with SPLICE_STEAL
> don't even wait for the read to finish!
>
> - splice() pipe->file
> insert the page into the destination page cache, mark it dirty
>
> an no, we probably do not support that yet (for example, I wouldn't be
> surprised if "dirty + !uptodate" is considered an error for the VM even
> though the page should still be locked from the read), but it really was a
> design goal.

OK. But currently we have an implementation that

1) doesn't do any of this, unless readahead is disabled

2) if readhead is disabled, it does the async splice-in (file->pipe),
but blocks on splice-out (pipe->any)

3) it blocks on read(pipefd, ...) even if pipefd is set to O_NONBLOCK

And in addition, splice-in and splice-out can return a short count or
even zero count if the filesystem invalidates the cached pages during
the splicing (data became stale for example). Are these the right
semantics? I'm not sure.

> Also, asynchronous is important even when you "just" want to overlap IO
> with CPU, so even if it's going to the network, then if you can delay the
> "wait for IO to complete" until the last possible moment (ie the _second_
> splice, when you end up copying it into an SKB, then both your throughput
> and your latency are likely going to be noticeably better, because you've
> now been able to do a lot of the costly CPU work (system exit + entry at
> the least, but hopefully a noticeable portion of the TCP stack too)
> overlapped with the disk seeking.

My feeling is (and I'm not an expert in this area at all) is that disk
seeking will be many orders of magnitude slower than any CPU work
associated with getting the data out to the network.

> So asynchronous ops was really one of the big goals for splice.

Well, if it can be implemented right, I have nothing against that.
But what we currently have is very far from that, and it seems to me
there are very big hurdles to overcome yet.

Miklos

2008-06-24 18:32:14

by Linus Torvalds

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, 24 Jun 2008, Miklos Szeredi wrote:
>
> OK. But currently we have an implementation that
>
> 1) doesn't do any of this, unless readahead is disabled

Sure. But removing even the conceptual support? Not a good idea.

> And in addition, splice-in and splice-out can return a short count or
> even zero count if the filesystem invalidates the cached pages during
> the splicing (data became stale for example). Are these the right
> semantics? I'm not sure.

What does that really have with splice() and removing the features? Why
don't you just fix that issue?

Linus

2008-06-24 19:06:30

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> On Tue, 24 Jun 2008, Miklos Szeredi wrote:
> >
> > OK. But currently we have an implementation that
> >
> > 1) doesn't do any of this, unless readahead is disabled
>
> Sure. But removing even the conceptual support? Not a good idea.
>
> > And in addition, splice-in and splice-out can return a short count or
> > even zero count if the filesystem invalidates the cached pages during
> > the splicing (data became stale for example). Are these the right
> > semantics? I'm not sure.
>
> What does that really have with splice() and removing the features? Why
> don't you just fix that issue?

Because it's freakin' difficult, and I'm lazy, that's why :)

Let's start with page_cache_pipe_buf_confirm(). How should we deal
with finding an invalidated page (!PageUptodate(page) &&
!page->mapping)?

We could return zero to use the contents even though it was
invalidated, not good, but if the page was originally uptodate, then
it should be OK to use the stale data. But it could have been
invalidated before becoming uptodate, so the contents could be total
crap, and that's not good. So now we have to tweak page invalidation
to differentiate between was-uptodate and was-never-uptodate pages.

The other is __generic_file_splice_read(). Currently it just bails
out if it finds an invalidated page. That could be rewritten to throw
away the page, look it up again in the radix tree, etc, etc... Lots
of added complexity in an already not-too-simple function.

All for what? To be able to keep the async-on-no-readahead behavior
of generic_file_splice_read()? The current implementation is not even
close to what would be required to do the async splicing properly.

Conclusion: I think we are better off with a simple
do_generic_file_read() based implementation until someone gives this
the proper thought and effort, than to leave all the complex and dead
code to rot and cause people (me) headaches... :)

Miklos

2008-06-24 19:18:20

by Linus Torvalds

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, 24 Jun 2008, Miklos Szeredi wrote:
>
> Let's start with page_cache_pipe_buf_confirm(). How should we deal
> with finding an invalidated page (!PageUptodate(page) &&
> !page->mapping)?

I suspect we just have to use it. After all, it was valid when the read
was done. The fact that it got invalidated later is kind of immaterial.

splice() is an optimized read+write. The read reads it into a temporary
buffer. The fact that it's a zero-copy buffer and basically just re-uses
the source doesn't really change that.

Linus

2008-06-24 19:24:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> >
> > Let's start with page_cache_pipe_buf_confirm(). How should we deal
> > with finding an invalidated page (!PageUptodate(page) &&
> > !page->mapping)?
>
> I suspect we just have to use it. After all, it was valid when the read
> was done. The fact that it got invalidated later is kind of immaterial.

Right. But what if it's invalidated *before* becoming uptodate (if
you'd read my mail further, I discussed this).

Why does invalidate_complete_page2() do ClearPageUptodate()? Dunno,
maybe it shoulnd't. But that would need a rather thorough audit of
all code checking PageUptodate()...

> splice() is an optimized read+write. The read reads it into a temporary
> buffer. The fact that it's a zero-copy buffer and basically just re-uses
> the source doesn't really change that.

Agreed.

Miklos

2008-06-24 19:27:07

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> > >
> > > Let's start with page_cache_pipe_buf_confirm(). How should we deal
> > > with finding an invalidated page (!PageUptodate(page) &&
> > > !page->mapping)?
> >
> > I suspect we just have to use it. After all, it was valid when the read
> > was done. The fact that it got invalidated later is kind of immaterial.
>
> Right. But what if it's invalidated *before* becoming uptodate (if
> you'd read my mail further, I discussed this).

Please ignore, this can't happen of course due to page locking...

Miklos

2008-06-24 20:06:29

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> >
> > Or it can only happen if there was an I/O error on reading the page.
>
> Now, IO errors are something else. They should have the PG_error bit set,
> and we should just return EIO or something.

Linus, you're right (as always), but see where this is going? A rare
problem (splice() returning short count because of an invalidated
page) is becoming an even more rare problem (splice() returning
rubbish instead of an error, if ->readpage() failed, and filesystem
forgot to set PG_error). And it won't show up in any other paths,
because the generic_file_aio_read() path will just check
PageUptodate(), and return -EIO if not.

OK, maybe we should add a WARN_ON(!PageError()) for the
!PageUptodate() case in generic_file_aio_read(), but that could still
leave some filesystems broken for a long time which experience I/O
errors rarely.

So I think the only sane solution here is to remove
ClearPageUptodate(). But that's a VM people's call, I don't have
enough insight into that.

Miklos

2008-06-24 19:33:17

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

> > > >
> > > > Let's start with page_cache_pipe_buf_confirm(). How should we deal
> > > > with finding an invalidated page (!PageUptodate(page) &&
> > > > !page->mapping)?
> > >
> > > I suspect we just have to use it. After all, it was valid when the read
> > > was done. The fact that it got invalidated later is kind of immaterial.
> >
> > Right. But what if it's invalidated *before* becoming uptodate (if
> > you'd read my mail further, I discussed this).
>
> Please ignore, this can't happen of course due to page locking...

Or it can only happen if there was an I/O error on reading the page.

So it's an issue after all...

Miklos

2008-06-24 19:48:19

by Linus Torvalds

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, 24 Jun 2008, Miklos Szeredi wrote:
>
> Or it can only happen if there was an I/O error on reading the page.

Now, IO errors are something else. They should have the PG_error bit set,
and we should just return EIO or something.

That said, I'd not be surprised at all if error handling is broken. It's
often been broken even in the _normal_ paths (ie totally normal "read()"
etc).

Linus

2008-06-24 19:46:31

by Linus Torvalds

[permalink] [raw]

Subject: Re: [rfc patch 3/4] splice: remove confirm from pipe_buf_operations

On Tue, 24 Jun 2008, Miklos Szeredi wrote:
>
> Right. But what if it's invalidated *before* becoming uptodate (if
> you'd read my mail further, I discussed this).
>
> Why does invalidate_complete_page2() do ClearPageUptodate()? Dunno,
> maybe it shoulnd't. But that would need a rather thorough audit of
> all code checking PageUptodate()...

Quite frankly, that shouldn't happen in the first place. Nothing should
clear up-to-date on a page that is locked. Doesn't
invalidate_complete_page() wait for the page to be unlocked already?

That said, the VM people already discussed (I think for other reasons),
removing the ClearPageUptodate(), because it has problems. There's talk
about just unhashing the page or moving it to another radix tree or
something.

Linus