2008-07-30 09:44:14

by Miklos Szeredi

Subject: [patch v3] splice: fix race with page invalidation

Jens,

Please apply or ack this for 2.6.27.

[v3: respun against 2.6.27-rc1]

Thanks,
Miklos

----
From: Miklos Szeredi <[email protected]>

Brian Wang reported that a FUSE filesystem exported through NFS could return
I/O errors on read. This was traced to splice_direct_to_actor() returning a
short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific, other filesystems (notably NFS)
also call invalidate_inode_pages2() to purge stale data from the cache.

If this happens while such pages are sitting in a pipe buffer, then splice(2)
from the pipe can return zero, and read(2) from the pipe can return ENODATA.

The zero return is especially bad, since it implies end-of-file or
disconnected pipe/socket, and is documented as such for splice. But returning
an error for read() is also nasty, when in fact there was no error (data
becoming stale is not an error).

The same problems can be triggered by "hole punching" with
madvise(MADV_REMOVE).
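
For illustration only (this sketch is not from the original report): splice
a page of a tmpfs file into a pipe, punch a hole over the same range with
madvise(MADV_REMOVE), then read from the pipe.  The path is made up, the
file is assumed to already contain at least one page of data, and error
handling is omitted; on affected kernels the final read from the pipe can
fail with ENODATA, and an equivalent splice from the pipe can return 0,
even though the file still has data.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int pfd[2];
	/* hypothetical tmpfs file, assumed to hold at least 4096 bytes */
	int fd = open("/dev/shm/testfile", O_RDWR);

	pipe(pfd);

	/* move one page of file data into the pipe (takes page references) */
	splice(fd, NULL, pfd[1], NULL, 4096, 0);

	/* invalidate that page while it still sits in the pipe buffer */
	void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	madvise(map, 4096, MADV_REMOVE);	/* "hole punching" */

	/* with the race, this read can fail with errno set to ENODATA */
	ssize_t n = read(pfd[0], buf, sizeof(buf));
	printf("read from pipe returned %zd\n", n);
	return 0;
}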

This patch simply reuses the do_generic_file_read() infrastructure to collect
pages to be spliced into the pipe buffer. This has the advantage of using
very well tested codepaths. Other than fixing the above problem it also fixes
smaller issues with the previous generic_file_splice_read() implementation:

- error handling bugs for some corner cases
- AOP_TRUNCATED_PAGE handling

There are no real disadvantages: splice() from a file was originally meant to
be asynchronous, but in reality it only did that for non-readahead pages,
which happen rarely.

Signed-off-by: Miklos Szeredi <[email protected]>
---

fs/splice.c | 295 ++++++-----------------------------------------------
include/linux/fs.h | 2
mm/filemap.c | 2
3 files changed, 41 insertions(+), 258 deletions(-)

Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c 2008-07-30 10:06:11.000000000 +0200
+++ linux-2.6/fs/splice.c 2008-07-30 11:26:23.000000000 +0200
@@ -87,53 +87,11 @@ static void page_cache_pipe_buf_release(
buf->flags &= ~PIPE_BUF_FLAG_LRU;
}

-/*
- * Check whether the contents of buf is OK to access. Since the content
- * is a page cache page, IO may be in flight.
- */
-static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf)
-{
- struct page *page = buf->page;
- int err;
-
- if (!PageUptodate(page)) {
- lock_page(page);
-
- /*
- * Page got truncated/unhashed. This will cause a 0-byte
- * splice, if this is the first page.
- */
- if (!page->mapping) {
- err = -ENODATA;
- goto error;
- }
-
- /*
- * Uh oh, read-error from disk.
- */
- if (!PageUptodate(page)) {
- err = -EIO;
- goto error;
- }
-
- /*
- * Page is ok afterall, we are done.
- */
- unlock_page(page);
- }
-
- return 0;
-error:
- unlock_page(page);
- return err;
-}
-
static const struct pipe_buf_operations page_cache_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
- .confirm = page_cache_pipe_buf_confirm,
+ .confirm = generic_pipe_buf_confirm,
.release = page_cache_pipe_buf_release,
.steal = page_cache_pipe_buf_steal,
.get = generic_pipe_buf_get,
@@ -265,212 +223,26 @@ static void spd_release_page(struct spli
page_cache_release(spd->pages[i]);
}

-static int
-__generic_file_splice_read(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
+static int file_splice_read_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size)
{
- struct address_space *mapping = in->f_mapping;
- unsigned int loff, nr_pages, req_pages;
- struct page *pages[PIPE_BUFFERS];
- struct partial_page partial[PIPE_BUFFERS];
- struct page *page;
- pgoff_t index, end_index;
- loff_t isize;
- int error, page_nr;
- struct splice_pipe_desc spd = {
- .pages = pages,
- .partial = partial,
- .flags = flags,
- .ops = &page_cache_pipe_buf_ops,
- .spd_release = spd_release_page,
- };
-
- index = *ppos >> PAGE_CACHE_SHIFT;
- loff = *ppos & ~PAGE_CACHE_MASK;
- req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- nr_pages = min(req_pages, (unsigned)PIPE_BUFFERS);
-
- /*
- * Lookup the (hopefully) full range of pages we need.
- */
- spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, pages);
- index += spd.nr_pages;
-
- /*
- * If find_get_pages_contig() returned fewer pages than we needed,
- * readahead/allocate the rest and fill in the holes.
- */
- if (spd.nr_pages < nr_pages)
- page_cache_sync_readahead(mapping, &in->f_ra, in,
- index, req_pages - spd.nr_pages);
-
- error = 0;
- while (spd.nr_pages < nr_pages) {
- /*
- * Page could be there, find_get_pages_contig() breaks on
- * the first hole.
- */
- page = find_get_page(mapping, index);
- if (!page) {
- /*
- * page didn't exist, allocate one.
- */
- page = page_cache_alloc_cold(mapping);
- if (!page)
- break;
-
- error = add_to_page_cache_lru(page, mapping, index,
- mapping_gfp_mask(mapping));
- if (unlikely(error)) {
- page_cache_release(page);
- if (error == -EEXIST)
- continue;
- break;
- }
- /*
- * add_to_page_cache() locks the page, unlock it
- * to avoid convoluting the logic below even more.
- */
- unlock_page(page);
- }
-
- pages[spd.nr_pages++] = page;
- index++;
- }
-
- /*
- * Now loop over the map and see if we need to start IO on any
- * pages, fill in the partial map, etc.
- */
- index = *ppos >> PAGE_CACHE_SHIFT;
- nr_pages = spd.nr_pages;
- spd.nr_pages = 0;
- for (page_nr = 0; page_nr < nr_pages; page_nr++) {
- unsigned int this_len;
-
- if (!len)
- break;
-
- /*
- * this_len is the max we'll use from this page
- */
- this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
- page = pages[page_nr];
-
- if (PageReadahead(page))
- page_cache_async_readahead(mapping, &in->f_ra, in,
- page, index, req_pages - page_nr);
-
- /*
- * If the page isn't uptodate, we may need to start io on it
- */
- if (!PageUptodate(page)) {
- /*
- * If in nonblock mode then dont block on waiting
- * for an in-flight io page
- */
- if (flags & SPLICE_F_NONBLOCK) {
- if (TestSetPageLocked(page)) {
- error = -EAGAIN;
- break;
- }
- } else
- lock_page(page);
-
- /*
- * Page was truncated, or invalidated by the
- * filesystem. Redo the find/create, but this time the
- * page is kept locked, so there's no chance of another
- * race with truncate/invalidate.
- */
- if (!page->mapping) {
- unlock_page(page);
- page = find_or_create_page(mapping, index,
- mapping_gfp_mask(mapping));
-
- if (!page) {
- error = -ENOMEM;
- break;
- }
- page_cache_release(pages[page_nr]);
- pages[page_nr] = page;
- }
- /*
- * page was already under io and is now done, great
- */
- if (PageUptodate(page)) {
- unlock_page(page);
- goto fill_it;
- }
-
- /*
- * need to read in the page
- */
- error = mapping->a_ops->readpage(in, page);
- if (unlikely(error)) {
- /*
- * We really should re-lookup the page here,
- * but it complicates things a lot. Instead
- * lets just do what we already stored, and
- * we'll get it the next time we are called.
- */
- if (error == AOP_TRUNCATED_PAGE)
- error = 0;
-
- break;
- }
- }
-fill_it:
- /*
- * i_size must be checked after PageUptodate.
- */
- isize = i_size_read(mapping->host);
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(!isize || index > end_index))
- break;
-
- /*
- * if this is the last page, see if we need to shrink
- * the length and stop
- */
- if (end_index == index) {
- unsigned int plen;
-
- /*
- * max good bytes in this page
- */
- plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (plen <= loff)
- break;
+ struct splice_pipe_desc *spd = desc->arg.data;
+ unsigned long count = desc->count;

- /*
- * force quit after adding this page
- */
- this_len = min(this_len, plen - loff);
- len = this_len;
- }
+ BUG_ON(spd->nr_pages >= PIPE_BUFFERS);

- partial[page_nr].offset = loff;
- partial[page_nr].len = this_len;
- len -= this_len;
- loff = 0;
- spd.nr_pages++;
- index++;
- }
+ if (size > count)
+ size = count;

- /*
- * Release any pages at the end, if we quit early. 'page_nr' is how far
- * we got, 'nr_pages' is how many pages are in the map.
- */
- while (page_nr < nr_pages)
- page_cache_release(pages[page_nr++]);
- in->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT;
+ page_cache_get(page);
+ spd->pages[spd->nr_pages] = page;
+ spd->partial[spd->nr_pages].offset = offset;
+ spd->partial[spd->nr_pages].len = size;
+ spd->nr_pages++;

- if (spd.nr_pages)
- return splice_to_pipe(pipe, &spd);
+ desc->count = count - size;

- return error;
+ return size;
}

/**
@@ -491,24 +263,33 @@ ssize_t generic_file_splice_read(struct
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
- loff_t isize, left;
- int ret;
-
- isize = i_size_read(in->f_mapping->host);
- if (unlikely(*ppos >= isize))
- return 0;
-
- left = isize - *ppos;
- if (unlikely(left < len))
- len = left;
+ ssize_t ret;
+ loff_t pos = *ppos;
+ size_t offset = pos & ~PAGE_CACHE_MASK;
+ struct page *pages[PIPE_BUFFERS];
+ struct partial_page partial[PIPE_BUFFERS];
+ struct splice_pipe_desc spd = {
+ .pages = pages,
+ .partial = partial,
+ .flags = flags,
+ .ops = &page_cache_pipe_buf_ops,
+ .spd_release = spd_release_page,
+ };
+ read_descriptor_t desc = {
+ .count = min(len, (PIPE_BUFFERS << PAGE_CACHE_SHIFT) - offset),
+ .arg.data = &spd,
+ };

- ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
- if (ret > 0)
- *ppos += ret;
+ do_generic_file_read(in, &pos, &desc, file_splice_read_actor);
+ ret = desc.error;
+ if (spd.nr_pages) {
+ ret = splice_to_pipe(pipe, &spd);
+ if (ret > 0)
+ *ppos += ret;
+ }

return ret;
}
-
EXPORT_SYMBOL(generic_file_splice_read);

/*
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2008-07-30 10:06:11.000000000 +0200
+++ linux-2.6/include/linux/fs.h 2008-07-30 11:26:23.000000000 +0200
@@ -1873,6 +1873,8 @@ extern ssize_t do_sync_read(struct file
extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
extern int generic_segment_checks(const struct iovec *iov,
unsigned long *nr_segs, size_t *count, int access_flags);
+extern void do_generic_file_read(struct file *filp, loff_t *ppos,
+ read_descriptor_t *desc, read_actor_t actor);

/* fs/splice.c */
extern ssize_t generic_file_splice_read(struct file *, loff_t *,
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2008-07-30 10:06:11.000000000 +0200
+++ linux-2.6/mm/filemap.c 2008-07-30 11:26:23.000000000 +0200
@@ -982,7 +982,7 @@ static void shrink_readahead_size_eio(st
* This is really ugly. But the goto's actually try to clarify some
* of the logic when it comes to error handling etc.
*/
-static void do_generic_file_read(struct file *filp, loff_t *ppos,
+void do_generic_file_read(struct file *filp, loff_t *ppos,
read_descriptor_t *desc, read_actor_t actor)
{
struct address_space *mapping = filp->f_mapping;


2008-07-30 17:04:06

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Miklos Szeredi wrote:
>
> There are no real disadvantages: splice() from a file was originally meant to
> be asynchronous, but in reality it only did that for non-readahead pages,
> which happen rarely.

I still don't like this. I still don't see the point, and I still think
there is something fundamentally wrong elsewhere.

I also object to just dismissing the async nature as unimportant. Fix it
instead. Make it use generic_file_readahead() or something. This is fixing
things in all the wrong places, imnsho.

Linus

2008-07-30 17:30:30

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, 30 Jul 2008, Linus Torvalds wrote:
> On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> >
> > There are no real disadvantages: splice() from a file was
> > originally meant to be asynchronous, but in reality it only did
> > that for non-readahead pages, which happen rarely.
>
> I still don't like this. I still don't see the point, and I still
> think there is something fundamentally wrong elsewhere.

We discussed the possible solutions with Nick, and came to the
conclusion, that short term (i.e. 2.6.27) this is probably the best
solution.

Long term sure, I have no problem with implementing async splice.

In fact, I may even have personal interest in looking at splice,
because people are asking for a zero-copy interface for fuse.

But that is definitely not 2.6.27, so I think you should reconsider
taking this patch, which is obviously correct due to its simplicity,
and won't cause any performance regressions either.

Thanks,
Miklos

2008-07-30 17:54:21

by Jens Axboe

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, Jul 30 2008, Miklos Szeredi wrote:
> On Wed, 30 Jul 2008, Linus Torvalds wrote:
> > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> > >
> > > There are no real disadvantages: splice() from a file was
> > > originally meant to be asynchronous, but in reality it only did
> > > that for non-readahead pages, which happen rarely.
> >
> > I still don't like this. I still don't see the point, and I still
> > think there is something fundamentally wrong elsewhere.

You snipped the part where Linus objected to dismissing the async
nature, I fully agree with that part.

> We discussed the possible solutions with Nick, and came to the
> conclusion, that short term (i.e. 2.6.27) this is probably the best
> solution.

Ehm where? Nick also said that he didn't like removing the ->confirm()
bits as they are completely related to the async nature of splice. You
already submitted this exact patch earlier and it was nak'ed.

> Long term sure, I have no problem with implementing async splice.
>
> In fact, I may even have personal interest in looking at splice,
> because people are asking for a zero-copy interface for fuse.
>
> But that is definitely not 2.6.27, so I think you should reconsider
> taking this patch, which is obviously correct due to its simplicity,
> and won't cause any performance regressions either.

Then please just fix the issue, instead of removing the bits that make
this possible.

--
Jens Axboe

2008-07-30 18:33:27

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, 30 Jul 2008, Jens Axboe wrote:
> On Wed, Jul 30 2008, Miklos Szeredi wrote:
> > On Wed, 30 Jul 2008, Linus Torvalds wrote:
> > > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> > > >
> > > > There are no real disadvantages: splice() from a file was
> > > > originally meant to be asynchronous, but in reality it only did
> > > > that for non-readahead pages, which happen rarely.
> > >
> > > I still don't like this. I still don't see the point, and I still
> > > think there is something fundamentally wrong elsewhere.
>
> You snipped the part where Linus objected to dismissing the async
> nature, I fully agree with that part.
>
> > We discussed the possible solutions with Nick, and came to the
> > conclusion, that short term (i.e. 2.6.27) this is probably the best
> > solution.
>
> Ehm where?

http://lkml.org/lkml/2008/7/7/476

> Nick also said that he didn't like removing the ->confirm()
> bits as they are completely related to the async nature of splice. You
> already submitted this exact patch earlier and it was nak'ed.

That's not true. The resubmitted patch didn't remove the ->confirm()
calls, which is what Nick objected to, I think.

> > Long term sure, I have no problem with implementing async splice.
> >
> > In fact, I may even have personal interest in looking at splice,
> > because people are asking for a zero-copy interface for fuse.
> >
> > But that is definitely not 2.6.27, so I think you should reconsider
> > taking this patch, which is obviously correct due to its simplicity,
> > and won't cause any performance regressions either.
>
> Then please just fix the issue, instead of removing the bits that make
> this possible.

I tried to fix it, but Nick didn't like my fix. Ideas are of course
welcome.

Thanks,
Miklos

2008-07-30 18:45:11

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

> On Wed, 30 Jul 2008, Jens Axboe wrote:
> > You snipped the part where Linus objected to dismissing the async
> > nature, I fully agree with that part.

And also note in what Nick said in the referenced mail: it would be
nice if someone actually _cared_ about the async nature. The fact
that it has been broken from the start, and nobody noticed is a strong
hint that currently there isn't anybody who does.

Maybe fuse will be the first one to actually care, and then I'll
bother with putting a lot of effort into it. But until someone cares,
nobody will bother, and that's how it should be. That's very much in
line with the evolutionary nature of kernel development: unused
features will just get broken and eventually removed.

Miklos

2008-07-30 19:45:41

by Jens Axboe

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, Jul 30 2008, Miklos Szeredi wrote:
> > On Wed, 30 Jul 2008, Jens Axboe wrote:
> > > You snipped the part where Linus objected to dismissing the async
> > > nature, I fully agree with that part.
>
> And also note in what Nick said in the referenced mail: it would be
> nice if someone actually _cared_ about the async nature. The fact
> that it has been broken from the start, and nobody noticed is a strong
> hint that currently there isn't anybody who does.

That's largely due to the (still) lack of direct splice users. It's a
clear part of the design and benefit of using splice. I very much care
about this, and as soon as there are available cycles for this, I'll get
it into better shape in this respect. Taking a step backwards is not the
right way forward, imho.

> Maybe fuse will be the first one to actually care, and then I'll
> bother with putting a lot of effort into it. But until someone cares,
> nobody will bother, and that's how it should be. That's very much in
> line with the evoultionary nature of kernel developments: unused
> features will just get broken and eventually removed.

People always say then, and the end result is it never gets done. Not to
point the finger at Nick, but removing the steal part of the splice
design was something I objected to a lot. Yet it was acked on the
premise that it would eventually get resubmitted in a fixed manner, but
we are no further along that path.

--
Jens Axboe

2008-07-30 20:06:59

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, 30 Jul 2008, Jens Axboe wrote:
> On Wed, Jul 30 2008, Miklos Szeredi wrote:
> > > On Wed, 30 Jul 2008, Jens Axboe wrote:
> > > > You snipped the part where Linus objected to dismissing the async
> > > > nature, I fully agree with that part.
> >
> > And also note in what Nick said in the referenced mail: it would be
> > nice if someone actually _cared_ about the async nature. The fact
> > that it has been broken from the start, and nobody noticed is a strong
> > hint that currently there isn't anybody who does.
>
> That's largely due to the (still) lack of direct splice users. It's a
> clear part of the design and benefit of using splice. I very much care
> about this, and as soon as there are available cycles for this, I'll get
> it into better shape in this respect. Taking a step backwards is not the
> right way forward, imho.

So what is? The only way forward is to put those cycles into it,
which nobody seems to have available.

Take this patch as a bugfix. It's not in any way showing the way
forward: as soon as you have the time, you can revert it and start
from the current state.

Hmm?

Miklos

2008-07-30 20:17:21

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Miklos Szeredi wrote:
>
> Take this patch as a bugfix. It's not in any way showing the way
> forward: as soon as you have the time, you can revert it and start
> from the current state.
>
> Hmm?

I dislike that mentality.

The fact is, it's not a bug-fix, it's just papering over the real problem.

And by papering it over, it then just makes people less likely to bother
with the real issue.

For example, and I talked about this earlier - what makes you think that
the FUSE/NFSD behaviour you don't like is at all valid in the first place?

If you depend on data not being truncated because you have it "in flight",
tjhere's already something wrong there. It's _not_ just that people can
see zero bytes in the reply - apparently they can see the file shrink
before they see the read return. That kind of thing just worries me. And
it might be a general NFS issue, not necessarily a FUSE one.

So I think your whole approach stinks. I don't agree with the "bug-fix".
It really smells like a "bug-paper-over".

Linus

2008-07-30 20:46:29

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, 30 Jul 2008, Linus Torvalds wrote:
> On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> >
> > Take this patch as a bugfix. It's not in any way showing the way
> > forward: as soon as you have the time, you can revert it and start
> > from the current state.
> >
> > Hmm?
>
> I dislike that mentality.
>
> The fact is, it's not a bug-fix, it's just papering over the real problem.

It _is_ a bug fix. See here, from man 2 splice:

RETURN VALUE
       Upon successful completion, splice() returns the number of bytes
       spliced to or from the pipe. A return value of 0 means that there was
       no data to transfer, and it would not make sense to block, because
       there are no writers connected to the write end of the pipe referred to
       by fd_in.

Currently splice on NFS, FUSE and a few other filesystems doesn't
conform to that clause: splice can return zero even if there's still
data to be read, just because the data happened to be invalidated
during the splicing. That's a plain and clear bug, which has
absolutely nothing to do with NFS _exporting_.

> And by papering it over, it then just makes people less likely to bother
> with the real issue.

I think you are talking about a totally separate issue: that NFSD's
use of splice can result in strange things if the file is truncated
while being read. But this is an NFSD issue and I don't see that it
has _anything_ to do with the above bug in splice. I think you are
just confusing the two things.

Miklos

2008-07-30 20:55:35

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Miklos Szeredi wrote:
>
> It _is_ a bug fix.

No it's not.

It's still papering over a totally unrelated bug. It's not a "fix", it's a
"paper-over". It doesn't matter if _behaviour_ changes.

Can you really not see the difference?

Afaik, everybody pretty much agreed what the real fix should be (don't
mark the page not up-to-date, just remove it from the radix tree), I'm not
seeing why you continue to push the patch that ALREADY GOT NAK'ed.

Linus

2008-07-30 21:16:46

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

> On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> >
> > It _is_ a bug fix.
>
> No it's not.
>
> It's still papering over a totally unrelated bug. It's not a "fix", it's a
> "paper-over". It doesn't matter if _behaviour_ changes.
>
> Can you really not see the difference?
>
> Afaik, everybody pretty much agreed what the real fix should be (don't
> mark the page not up-to-date, just remove it from the radix tree)

Huh? I did exactly that, and that patch got NAK'ed by you and Nick:

http://lkml.org/lkml/2008/6/25/230
http://lkml.org/lkml/2008/7/7/21

Confused.

> I'm not seeing why you continue to push the patch that ALREADY GOT
> NAK'ed.

And later ACK'ed by Nick.

There's everything here but agreement about a solution :)

Miklos

2008-07-30 21:26:06

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Miklos Szeredi wrote:
>
> Huh? I did exactly that, and that patch got NAK'ed by you and Nick:

Umm. That was during the late -rc sequence (what, -rc8?) where I wanted a
minimal "let's fix it". Not that anybody then apparently cared, which is
probably just as well.

Then you did NOTHING AT ALL about it, now the merge window is over, and
you complain again, with what looks like basically the same patch that was
already rejected earlier.

Hellooo..

Linus

2008-07-30 21:46:30

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wed, 30 Jul 2008, Linus Torvalds wrote:
> On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> >
> > Huh? I did exactly that, and that patch got NAK'ed by you and Nick:
>
> Umm. That was during the late -rc sequence (what, -rc8?) where I wanted a
> minimal "let's fix it". Not that anybody then apparently cared, which is
> probably just as well.
>
> Then you did NOTHING AT ALL about it, now the merge window is over, and
> you complain again, with what looks like basically the same patch that was
> already rejected earlier.
>
> Hellooo..

You are being unfair: after having talked it over with Nick I
resubmitted this patch (not the same), which was added to -mm and
nobody complained then. Then it got thrown out of -mm during the merge
window because of a conflict, and then now I got around to
resubmitting it again.

But you're the boss, if you don't like it it won't go in. I'll have a
look at the "don't clear PG_uptodate" solution again and see if we can
get an agreement there.

Miklos

2008-07-30 22:00:30

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Miklos Szeredi wrote:
>
> You are being unfair: after having talked it over with Nick I
> resubmitted this patch (not the same), which was added to -mm and
> nobody complained then. Then it got thrown out of -mm during the merge
> window because of a conflict, and then now I got around to
> resubmitting it again.

Ok, fair enough. I don't follow -mm myself (since as far as I'm concerned,
a lot of the point of -mm is that Andrew takes a lot of load off me). So
yes, it was unfair and yes, I'd never have reacted to it in -mm.

But I'd really like to get that PG_uptodate bit just fixed - both wrt
writeout errors and wrt truncate/holepunch. We had some similar issues wrt
ext3 (?) inode buffers, where removing the uptodate bit actually ended up
being a mistake.

Linus

2008-07-31 00:12:34

by Jamie Lokier

Subject: Re: [patch v3] splice: fix race with page invalidation

> > And by papering it over, it then just makes people less likely to bother
> > with the real issue.
>
> I think you are talking about a totally separate issue: that NFSD's
> use of splice can result in strange things if the file is truncated
> while being read. But this is an NFSD issue and I don't see that it
> has _anything_ to do with the above bug in splice. I think you are
> just confusing the two things.

I'm more concerned by sendfile() users like Apache, Samba, FTPd. In
an earlier thread on this topic, I asked if the splice bug can also
result in sendfile() sending blocks of zeros, when a file is truncated
after it has been sent, and the answer was yes probably.

Not that I checked or anything. But if it affects sendfile() it's a
bigger deal - that has many users.

Assuming it does affect sendfile(), it's exacerbated by not being able
to tell when a sendfile() has finished with the pages its sending.
E.g. you can't lock the file or otherwise synchronise with another
program which wants to modify the file.

-- Jamie

2008-07-31 00:42:42

by Jamie Lokier

Subject: Re: [patch v3] splice: fix race with page invalidation

Jamie Lokier wrote:
> not being able to tell when a sendfile() has finished with the pages
> its sending.

(Except by the socket fully closing or a handshake from the other end,
obviously.)

-- Jamie

2008-07-31 00:55:31

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Thu, 31 Jul 2008, Jamie Lokier wrote:
>
> Jamie Lokier wrote:
> > not being able to tell when a sendfile() has finished with the pages
> > its sending.
>
> (Except by the socket fully closing or a handshake from the other end,
> obviously.)

Well, people should realize that this is pretty fundamental to zero-copy
schemes. It's why zero-copy is often much less useful than doing a copy in
the first place. How do you know how far in a splice buffer some random
'struct page' has gotten? Especially with splicing to splicing to tee to
splice...

You'd have to have some kind of barrier model (which would be really
complex), or perhaps a "wait for this page to no longer be shared" (which
has issues all its own).

IOW, splice() is very closely related to a magic kind of "mmap()+write()"
in another thread. That's literally what it does internally (except the
"mmap" is just a small magic kernel buffer rather than virtual address
space), and exactly as with mmap, if you modify the file, the other thread
will see it, even though it did it long ago.
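
As a rough illustration of that point (hypothetical file name and initial
contents, error handling left out): the pipe only holds references to the
page-cache pages, so a write to the file after the splice can still show
up in what is later read from the pipe.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[5] = "";
	int pfd[2];
	/* hypothetical file, assumed to start with the bytes "AAAA" */
	int fd = open("/tmp/demo", O_RDWR);

	pipe(pfd);
	splice(fd, NULL, pfd[1], NULL, 4, 0);	/* pipe now references the page */

	pwrite(fd, "BBBB", 4, 0);		/* modify the file afterwards */

	read(pfd[0], buf, 4);			/* may well see "BBBB", not "AAAA" */
	printf("%s\n", buf);
	return 0;
}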

Personally, I think the right approach is to just realize that splice() is
_not_ a write() system call, and never will be. If you need synchronous
writing, you simply shouldn't use splice().

Linus

2008-07-31 00:58:27

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Wed, 30 Jul 2008, Linus Torvalds wrote:
>
> Personally, I think the right approach is to just realize that splice() is
> _not_ a write() system call, and never will be. If you need synchronous
> writing, you simply shouldn't use splice().

Side note: in-kernel users could probably do something about this. IOW, if
there's some in-kernel usage (and yes, knfsd would be a prime example),
that one may actually be able to do things that a _user_level user of
splice() could never do.

That includes things like getting the inode semaphore over a write (so
that you can guarantee that pages that are in flight are not modified,
except again possibly by other mmap users), and/or a per-page callback for
when splice() is done with a page (so that you could keep the page locked
while it's getting spliced, for example).

And no, we don't actually have that either, of course.

Linus

2008-07-31 02:16:36

by Nick Piggin

Subject: Re: [patch v3] splice: fix race with page invalidation

On Thursday 31 July 2008 03:54, Jens Axboe wrote:
> On Wed, Jul 30 2008, Miklos Szeredi wrote:
> > On Wed, 30 Jul 2008, Linus Torvalds wrote:
> > > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
> > > > There are no real disadvantages: splice() from a file was
> > > > originally meant to be asynchronous, but in reality it only did
> > > > that for non-readahead pages, which happen rarely.
> > >
> > > I still don't like this. I still don't see the point, and I still
> > > think there is something fundamentally wrong elsewhere.
>
> You snipped the part where Linus objected to dismissing the async
> nature, I fully agree with that part.
>
> > We discussed the possible solutions with Nick, and came to the
> > conclusion, that short term (i.e. 2.6.27) this is probably the best
> > solution.
>
> Ehm where? Nick also said that he didn't like removing the ->confirm()
> bits as they are completely related to the async nature of splice. You
> already submitted this exact patch earlier and it was nak'ed.
>
> > Long term sure, I have no problem with implementing async splice.
> >
> > In fact, I may even have personal interest in looking at splice,
> > because people are asking for a zero-copy interface for fuse.
> >
> > But that is definitely not 2.6.27, so I think you should reconsider
> > taking this patch, which is obviously correct due to its simplicity,
> > and won't cause any performance regressions either.
>
> Then please just fix the issue, instead of removing the bits that make
> this possible.

The only "real" objection I had to avoiding the ClearPageUptodate there
I guess is that it would weaken some assertions. I was more concerned
about the unidentified problems... but there probably shouldn't be
too many places in the VM that really care that much anymore (and those
that do might already be racy).

Now it seems to be perfectly fine to use the actual page itself that may
have been truncated, and we have been doing that for a long time (see
get_user_pages). So I'm not so worried about a bad data corruption or
anything but just the VM getting confused, which we could fix anyway.

I guess that kind of patch could sit in -mm for a while then get merged.
Linus probably wouldn't think highly of a post-rc1 merge, but if it
really is a bugfix, maybe?

2008-07-31 06:12:25

by Jamie Lokier

Subject: Re: [patch v3] splice: fix race with page invalidation

Linus Torvalds wrote:
> > Jamie Lokier wrote:
> > > not being able to tell when a sendfile() has finished with the pages
> > > its sending.
> >
> > (Except by the socket fully closing or a handshake from the other end,
> > obviously.)
>
> Well, people should realize that this is pretty fundamental to zero-copy
> schemes. It's why zero-copy is often much less useful than doing a copy in
> the first place. How do you know how far in a splice buffer some random
> 'struct page' has gotten? Especially with splicing to splicing to tee to
> splice...

Having implemented an equivalent zero-copy thing in userspace, I can
confidently say it's not fundamental at all.

What is fundamental is that you either (a) treat sendfile as an async
operation, and get a notification when it's finished with the data,
just like any other async operation, or (b) while sendfile claims those
pages, they are marked COW.

(b) is *much* more useful for the things you actually want to use
sendfile for, namely a faster copy-file-to-socket with no weird
complications. Since you're sending files which you don't *expect* to
change (but want to behave sensibly if they do), and the pages
probably aren't mapped into any process, COW would not cost anything.

Right now, sendfile is used by servers of all kinds: http, ftp, file
servers, you name it. They all want to believe it's purely a
performance optimisation, equivalent to write. On many operating
systems, it is. (I count sendfile equivalents on: Windows NT, SCO
Unixware, Solaris, FreeBSD, Dragonfly, HP-UX, Tru64, AIX and S/390 in
addition to Linux :-)

> You'd have to have some kind of barrier model (which would be really
> complex), or perhaps a "wait for this page to no longer be shared" (which
> has issues all its own).
>
> IOW, splice() is very closely related to a magic kind of "mmap()+write()"
> in another thread. That's literally what it does internally (except the
> "mmap" is just a small magic kernel buffer rather than virtual address
> space), and exactly as with mmap, if you modify the file, the other thread
> will see it, even though it did it long ago.

That's fine. But if you use a thread, the thread can tell you when
it's done. Then you know what you're sending not an infinite time in
the future :-)

> Personally, I think the right approach is to just realize that splice() is
> _not_ a write() system call, and never will be. If you need synchronous
> writing, you simply shouldn't use splice().

People want zero-copy, and no weirdness like sending blocks of zeros
which the file never contained, and (if you lock the file) knowing
when to release locks for someone else to edit the file.

Sync or async doesn't matter so much; that's API stuff.

The obvious mechanism for completion notifications is the AIO event
interface. I.e. aio_sendfile that reports completion when it's safe
to modify data it was using. aio_splice would be logical for similar
reasons. Note it doesn't mean when the data has reached a particular
place, it means when the pages it's holding are released. Pity AIO
still sucks ;-)

Btw, Windows had this since forever, it's called overlapped
TransmitFile with an I/O completion event. Don't know if it's any
good though ;-)

-- Jamie

2008-07-31 07:31:20

by Miklos Szeredi

Subject: Re: [patch v3] splice: fix race with page invalidation

On Thu, 31 Jul 2008, Jamie Lokier wrote:
> I'm more concerned by sendfile() users like Apache, Samba, FTPd. In
> an earlier thread on this topic, I asked if the splice bug can also
> result in sendfile() sending blocks of zeros, when a file is truncated
> after it has been sent, and the answer was yes probably.
>
> Not that I checked or anything. But if it affects sendfile() it's a
> bigger deal - that has many users.

Nick also pointed out that it also affects plain read(2), albeit only
with a tiny window.

But partial truncates are _rare_ (we don't even have a UNIX utility
for that ;), so in practice all this may not actually matter very
much.

Miklos

2008-07-31 10:27:47

by Evgeniy Polyakov

Subject: Re: [patch v3] splice: fix race with page invalidation

On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier ([email protected]) wrote:
> The obvious mechanism for completion notifications is the AIO event
> interface. I.e. aio_sendfile that reports completion when it's safe
> to modify data it was using. aio_splice would be logical for similar
> reasons. Note it doesn't mean when the data has reached a particular
> place, it means when the pages it's holding are released. Pity AIO
> still sucks ;-)

It is not that simple: page can be held in hardware or tcp queues for
a long time, and the only possible way to know, that system finished
with it, is receiving ack from the remote side. There is a project to
implement such a callback at skb destruction time (it is freed after ack
from the other peer), but do we really need it? System which does care
about transmit will implement own ack mechanism, so page can be unlocked
at higher layer. Actually page can be locked during transfer and
unlocked after rpc reply received, so underlying page invalidation will
be postponed and will not affect sendfile/splice.

> Btw, Windows had this since forever, it's called overlapped
> TransmitFile with an I/O completion event. Don't know if it's any
> good though ;-)

There was a linux aio_sendfile() too. Google still knows about its
numbers, graphs and so on... :)

--
Evgeniy Polyakov

2008-07-31 12:34:27

by Jamie Lokier

Subject: Re: [patch v3] splice: fix race with page invalidation

Evgeniy Polyakov wrote:
> On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier ([email protected]) wrote:
> > The obvious mechanism for completion notifications is the AIO event
> > interface. I.e. aio_sendfile that reports completion when it's safe
> > to modify data it was using. aio_splice would be logical for similar
> > reasons. Note it doesn't mean when the data has reached a particular
> > place, it means when the pages it's holding are released. Pity AIO
> > still sucks ;-)
>
> It is not that simple: page can be held in hardware or tcp queues for
> a long time, and the only possible way to know, that system finished
> with it, is receiving ack from the remote side. There is a project to
> implement such a callback at skb destruction time (it is freed after ack
> from the other peer), but do we really need it? System which does care
> about transmit will implement own ack mechanism, so page can be unlocked
> at higher layer. Actually page can be locked during transfer and
> unlocked after rpc reply received, so underlying page invalidation will
> be postponed and will not affect sendfile/splice.

This is why marking the pages COW would be better. Automatic!
There's no need for a notification, merely letting go of the page
references - yes, the hardware / TCP acks already do that, no locking
or anything! :-) The last reference is nothing special, it just means
the next file write/truncate sees the count is 1 and doesn't need to
COW the page.


Two reason for being mildly curious about sendfile page releases in an
application though:

- Sendfile on tmpfs files: zero copy sending of calculated data.
Only trouble is when can you reuse the pages? Current solution
is use a set of files, consume the pages in sequential order, delete
files at some point, let the kernel hold the pages. Works for
sequentially generated and transmitted data, but not good for
userspace caches where different parts expire separately. Also,
may pin a lot of page cache; not sure if that's accounted.

- Sendfile on real large data contained in a userspace
database-come-filesystem (the future!). App wants to send big
blobs, and with COW it can forget about them, but for performance
it would rather allocate new writes in the file to areas that are
not sendfile-hot. It can approximate with heuristics though.

> > Btw, Windows had this since forever, it's called overlapped
> > TransmitFile with an I/O completion event. Don't know if it's any
> > good though ;-)
>
> There was a linux aio_sendfile() too. Google still knows about its
> numbers, graphs and so on... :)

I vaguely remember its performance didn't seem that good.

One of the problems is you don't really want AIO all the time, just
when a process would block because the data isn't in cache. You
really don't want to be sending *all* ops to worker threads, even
kernel threads. And you preferably don't want the AIO interface
overhead for ops satisfied from cache.

Syslets got some of the way there, and maybe that's why they were
faster than AIO for some things. There are user-space hacks which are
a bit like syslets. (Bind two processes to the same CPU, process 1
wakes process 2 just before 1 does a syscall, and puts 2 back to sleep
if 2 didn't wake and do an atomic op to prove it's awake). I haven't
tested their performance, it could suck.

Look up LAIO, Lazy Asynchronous I/O. Apparently FreeBSD, NetBSD,
Solaris, Tru64, and Windows, have the capability to call a synchronous
I/O op and if it's satisfied from cache, simply return a result, if
not, either queue it and return an AIO event later (Windows style (it
does some cleverer thread balancing too)), or wake another thread to
handle it (FreeBSD style). I believe Linus suggested something like
the latter approach some time ago.

-- Jamie

2008-07-31 12:49:24

by Nick Piggin

Subject: Re: [patch v3] splice: fix race with page invalidation

On Thursday 31 July 2008 22:33, Jamie Lokier wrote:
> Evgeniy Polyakov wrote:
> > On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier ([email protected]) wrote:
> > > The obvious mechanism for completion notifications is the AIO event
> > > interface. I.e. aio_sendfile that reports completion when it's safe
> > > to modify data it was using. aio_splice would be logical for similar
> > > reasons. Note it doesn't mean when the data has reached a particular
> > > place, it means when the pages it's holding are released. Pity AIO
> > > still sucks ;-)
> >
> > It is not that simple: page can be held in hardware or tcp queues for
> > a long time, and the only possible way to know, that system finished
> > with it, is receiving ack from the remote side. There is a project to
> > implement such a callback at skb destruction time (it is freed after ack
> > from the other peer), but do we really need it? System which does care
> > about transmit will implement own ack mechanism, so page can be unlocked
> > at higher layer. Actually page can be locked during transfer and
> > unlocked after rpc reply received, so underlying page invalidation will
> > be postponed and will not affect sendfile/splice.
>
> This is why marking the pages COW would be better. Automatic!
> There's no need for a notification, merely letting go of the page
> references - yes, the hardware / TCP acks already do that, no locking
> or anything! :-) The last reference is nothing special, it just means
> the next file write/truncate sees the count is 1 and doesn't need to
> COW the page.

Better == more bloat and complexity and corner cases in the VM?

If the app wants to ensure some specific data is sent, then it has
to wait until the receiver receives it before changing it, simple.

And you still don't avoid the fundamental problem that the receiver
may not get exactly what the sender has put in flight if we do things
asynchronously.

2008-07-31 13:00:13

by Nick Piggin

Subject: Re: [patch v3] splice: fix race with page invalidation

On Wednesday 30 July 2008 19:43, Miklos Szeredi wrote:
> Jens,
>
> Please apply or ack this for 2.6.27.
>
> [v3: respun against 2.6.27-rc1]
>
> Thanks,
> Miklos
>
> ----
> From: Miklos Szeredi <[email protected]>
>
> Brian Wang reported that a FUSE filesystem exported through NFS could
> return I/O errors on read. This was traced to splice_direct_to_actor()
> returning a short or zero count when racing with page invalidation.
>
> However this is not FUSE or NFSD specific, other filesystems (notably NFS)
> also call invalidate_inode_pages2() to purge stale data from the cache.
>
> If this happens while such pages are sitting in a pipe buffer, then
> splice(2) from the pipe can return zero, and read(2) from the pipe can
> return ENODATA.
>
> The zero return is especially bad, since it implies end-of-file or
> disconnected pipe/socket, and is documented as such for splice. But
> returning an error for read() is also nasty, when in fact there was no
> error (data becoming stale is not an error).

Hmm, the PageError case is a similar one which cannot be avoided, so
it kind of indicates to me that the splice async API is slightly
lacking (and provides me with some confirmation about my dislike of
removing ClearPageUptodate from invalidate...)

Returning -EIO at the pipe read I don't think quite makes sense because
it is conceptually an IO error for the splicer, not the reader (who
is reading from a pipe, not from the file causing the error).

It seems like the right way to fix this would be to allow the splicing
process to be notified of a short read, in which case it could try to
refill the pipe with the unread bytes...

2008-07-31 13:31:58

by Evgeniy Polyakov

Subject: Re: [patch v3] splice: fix race with page invalidation

On Thu, Jul 31, 2008 at 01:33:50PM +0100, Jamie Lokier ([email protected]) wrote:
> This is why marking the pages COW would be better. Automatic!
> There's no need for a notification, merely letting go of the page
> references - yes, the hardware / TCP acks already do that, no locking
> or anything! :-) The last reference is nothing special, it just means
> the next file write/truncate sees the count is 1 and doesn't need to
> COW the page.

It depends... COW can DoS the system: consider attacker who sends a
page, writes there, sends again and so on in lots of threads. Depending
on link capacity eventually COW will eat the whole RAM.

> > There was a linux aio_sendfile() too. Google still knows about its
> > numbers, graphs and so on... :)
>
> I vaguely remember its performance didn't seem that good.

<q>
Benchmark of the 100 1MB files transfer (files are in VFS already) using
sync sendfile() against aio_sendfile_path() shows about 10MB/sec
performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
sending server is one-way AMD Athlon 64 3500+) for aio_sendfile_path().
</q>

So, it was really better than sync sendfile :)

> One of the problems is you don't really want AIO all the time, just
> when a process would block because the data isn't in cache. You
> really don't want to be sending *all* ops to worker threads, even
> kernel threads. And you preferably don't want the AIO interface
> overhead for ops satisfied from cache.

That's how all AIO should work of course. We are getting a bit
offtopic, but aio_sendfile() worked that way, as did syslets, although
the former did allocate some structures before trying to send the
data.

> Syslets got some of the way there, and maybe that's why they were
> faster than AIO for some things. There are user-space hacks which are
> a bit like syslets. (Bind two processes to the same CPU, process 1
> wakes process 2 just before 1 does a syscall, and puts 2 back to sleep
> if 2 didn't wake and do an atomic op to prove it's awake). I haven't
> tested their performance, it could suck.

Looks scary :)
Thread allocation in userspace is a rather costly operation compared to
syslet threads in kernelspace. But depending on IO pattern this may or
may not be a noticeable factor... It requires testing and numbers.

--
Evgeniy Polyakov

2008-07-31 16:38:41

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Thu, 31 Jul 2008, Jamie Lokier wrote:
>
> Having implemented an equivalent zero-copy thing in userspace, I can
> confidently say it's not fundamental at all.

Oh yes it is.

Doing it in user space is _trivial_, because you control everything, and
there are no barriers.

> What is fundamental is that you either (a) treat sendfile as an async
> operation, and get a notification when it's finished with the data,
> just like any other async operation

Umm. And that's exactly what I *described*.

But it's trivial to do inside one program (either all in user space, or
all in kernel space).

It's very difficult indeed to do across two totally different domains.

Have you _looked_ at the complexities of async IO in UNIX? They are
horrible. The overhead to even just _track_ the notifiers basically undoes
all relevant optimizations for doing zero-copy.

IOW, AIO is useful not because of zero-copy, but because it allows
_overlapping_ IO. Anybody who confuses the two is seriously misguided.

> , or (b) while sendfile claims those
> pages, they are marked COW.

.. and this one shows that you have no clue about performance of a memcpy.

Once you do that COW, you're actually MUCH BETTER OFF just copying.

Really.

Copying a page is much cheaper than doing COW on it. Doing a "write()"
really isn't that expensive. People think that memory is slow, but memory
isn't all that slow, and caches work really well. Yes, memory is slow
compared to a few reference count increments, but memory is absolutely
*not* slow when compared to the overhead of TLB invalidates across CPUs
etc.

So don't do it. If you think you need it, you should not be using
zero-copy in the first place.

In other words, let me repeat:

- use splice() when you *understand* that it's just taking a refcount and
you don't care.

- use read()/write() when you can't be bothered.

There's nothing wrong with read/write. The _normal_ situation should be
that 99.9% of all IO is done using the regular interfaces. Splice() (and
sendpage() before it) is a special case. You should be using splice if you
have a DVR and you can do all the DMA from the tuner card into buffers
that you can then split up and send off to show real-time at the same time
as you copy them to disk.

THAT is when zero-copy is useful. If you think you need to play games with
async notifiers, you're already off the deep end.

Linus

2008-07-31 17:01:02

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Thu, 31 Jul 2008, Evgeniy Polyakov wrote:
>
> It depends... COW can DoS the system: consider attacker who sends a
> page, writes there, sends again and so on in lots of threads. Depending
> on link capacity eventually COW will eat the whole RAM.

Yes, COW is complex, and the complexity would be part of the cost. But the
much bigger cost is the fact that COW is simply more costly than copying
the data in the first place.

A _single_ page fault is usually much much more expensive than copying a
page, especially if you can do the copy well wrt caches. For example, a
very common case is that the data you're writing is already in the CPU
caches.

In fact, even if you can avoid the fault, the cost of doing all the
locking and looking up the pages for COW is likely already bigger than the
memcpy. The memcpy() is a nice linear access which both the CPU and the
memory controller can optimize and can get almost perfect CPU throughput
for. In contrast, doing a COW implies a lot of random walking over
multiple data structures. And _if_ it's all in cache, it's probably ok,
but you're totally screwed if you need to send an IPI to another CPU to
actually flush the TLB (to make the COW actually take effect!).

So yes, you can win by COW'ing, but it's rare, and it mainly happens in
benchmarks.

For example, I had a trial patch long long ago (I think ten years by now)
to do pipe transfers as copying pages around with COW. It was absolutely
_beautiful_ in benchmarks. I could transfer gigabytes per second, and this
was on something like a Pentium/MMX which had what, 7-10MB/s memcpy
performance?

In other words, I don't dispute at all that COW schemes can perform really
really well.

HOWEVER - the COW scheme actually performed _worse_ in any real world
benchmark, including just compiling the kernel (we used to use -pipe to
gcc to transfer data between cc/as).

The reason? The benchmark worked _really_ well, because what it did was
basically to do a trivial microbenchmark that did

	for (;;) {
		write(fd, buffer, bufsize);
	}

and do you see something unrealistic there? Right: it never actually
touched the buffer itself, so it would not ever actually trigger the COW
case, and as a result, the memory got marked read-only on the first time,
and it never ever took a fault, and in fact the TLB never ever needed to
be flushed after the first one because the page was already marked
read-only.

That's simply not _realistic_. It's hardly how any real load works.

> > > There was a linux aio_sendfile() too. Google still knows about its
> > > numbers, graphs and so on... :)
> >
> > I vaguely remember its performance didn't seem that good.
>
> <q>
> Benchmark of the 100 1MB files transfer (files are in VFS already) using
> sync sendfile() against aio_sendfile_path() shows about 10MB/sec
> performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
> sending server is one-way AMD Athlon 64 3500+) for aio_sendfile_path().
> </q>
>
> So, it was really better than sync sendfile :)

I suspect it wasn't any better with small files and small transfers.

Yes, some people do big files. Physicists have special things where they
get a few terabytes per second from some high-energy experiment. The
various people spying on you have special setups where they move gigabytes
of satellite map data around to visualize it.

So there are undeniably cases like that, but they are also usually so
special that they really don't even care about COW, because they sure as
hell don't care about somebody else modifying the file they're sending at
the same time.

In fact the whole point is that they don't touch the data at the CPU
_at_all_, and the reason they want zero-copy sending is that they
literally want to DMA from disk buffers to memory, and then from memory to
a network interface, and they don't want the CPU _ever_ seeing it with all
the cache invalidations etc.

And _that_ is the case where you should use sendfile(). If your CPU has
actually touched the data, you should probably just use read()/write().

Of course, one of the really nice things about splice() (which was not
true about sendfile()) is that you can actually mix-and-match. You can
splice data from kernel buffers, but you can also splice data from user
VM, or you can do regular "write()" calls to fill (or read) the data from
the splice pipe.

This is useful in ways that sendfile() never was. You can write() headers
into the pipe buffer, and then splice() the file data into it, and the
user only sees a pipe (and can either read from it or splice it or tee it
or whatever). IOW, splice very much has the UNIX model of "everything is
a pipe", taken to one (admittedly odd) extreme.
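
A rough sketch of that header-plus-file pattern (descriptor names and
lengths are made up, error handling and the loops needed for short
transfers are omitted): write() the header into the pipe, splice() the
file body in behind it, then splice() the pipe out to the socket.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * file_fd is assumed to be an open regular file and sock_fd a connected
 * socket; both names are hypothetical.
 */
static void send_with_header(int file_fd, int sock_fd,
			     const char *hdr, size_t hdr_len, size_t body_len)
{
	int pfd[2];

	pipe(pfd);

	/* an ordinary write() puts the header at the front of the pipe... */
	write(pfd[1], hdr, hdr_len);

	/* ...then splice() appends references to the file's page-cache pages... */
	splice(file_fd, NULL, pfd[1], NULL, body_len, SPLICE_F_MORE);

	/* ...and the pipe contents are spliced on to the socket */
	splice(pfd[0], NULL, sock_fd, NULL, hdr_len + body_len, SPLICE_F_MORE);

	close(pfd[0]);
	close(pfd[1]);
}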

Anyway, the correct way to use splice() is to either just know the data is
"safe" (either because you are _ok_ with people modifying it after the
splice op took place, or because you know nobody will). The alternative is
to expect an acknowledgement from the other side, because then you know
the buffer is done.

Linus

2008-07-31 17:04:21

by Linus Torvalds

Subject: Re: [patch v3] splice: fix race with page invalidation



On Thu, 31 Jul 2008, Nick Piggin wrote:
>
> It seems like the right way to fix this would be to allow the splicing
> process to be notified of a short read, in which case it could try to
> refill the pipe with the unread bytes...

Hmm. That should certainly work with the splice model. The users of the
data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
data", and stop. And it would be up to the other side to handle it (it
can see the state of the pipe, we can make it both wake up POLL_ERR _and_
return an error if somebody tries to write to a "blocked" pipe).

So yes, that's very possible, but it obviously requires splice() users to
be able to handle more cases. I'm not sure it's realistic to expect users
to be that advanced.

Linus

2008-07-31 17:21:35

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

Linus Torvalds wrote:
> > , or (b) while sendfile claims those
> > pages, they are marked COW.
>
> .. and this one shows that you have no clue about performance of a memcpy.
>
> Once you do that COW, you're actually MUCH BETTER OFF just copying.
> Copying a page is much cheaper than doing COW on it.

That sounds familiar :-)

But did you miss the bit where you DON'T COPY ANYTHING EVER*? COW is
able to provide _correctness_ for the rare corner cases which you're not
optimising for. You don't actually copy more than 0.0% (*approx).

Copying is supposed to be _so_ rare, in this, that it doesn't count.

Correctness is so you can say "I've written that, when the syscall
returns I know what data the other end will receive, if it succeeds".
Knowing what can happen in what order is bread and butter around here,
you know how useful that can be.

The cost of COW is TLB flushes*. But for splice, there ARE NO TLB
FLUSHES because such files are not mapped writable! And you don't
intend to write the file soon either. A program would be daft to use
splice _intending_ to do those things; it obviously would be a poor use
of the interface. The kernel may as well copy the data if it did
(and it's in a good position to decide).

> Doing a "write()" really isn't that expensive. People think that
> memory is slow, but memory isn't all that slow, and caches work
> really well. Yes, memory is slow compared to a few reference count
> increments, but memory is absolutely *not* slow when compared to the
> overhead of TLB invalidates across CPUs etc.

You're missing the real point of network splice().

It's not just for speed.

It's for sharing data. Your TCP buffers can share data when the same
big lump is in flight to lots of clients. Think static file / web /
FTP server, the kind with 80% of hits going to 0.01% of the files,
which together are roughly the size of your RAM.

You want network splice() for the same reason you want shared
libraries. So that memory use scales better with some loads**.

You don't know how much good that will do, only, like shared
libraries, that it's intrinsically good if it doesn't cost anything.
And I'm suggesting that it shouldn't cost anything: no TLB flushes or
COW copies are expected, and you can just copy at sendfile time if the
page is already write-mapped anywhere, so the page references aren't
complicated.

** - Admittedly this is rather load dependent. But it's potentially
O(c*d) memory for write vs. O(d) for sendfile, hand-wavingly, where c is
the number of connections using d bytes of data. (Then again, as I work
out the figures, RAM is getting cheaper faster than bandwidth-latency
products are getting bigger... It's not a motivator except for
cheapskates. But it doesn't detract from the intrinsic goodness.)
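
To put illustrative, entirely made-up numbers on that hand-waving: with
c = 10,000 connections each part-way through sending the same d = 1 MB
file, per-connection copies could in the worst case (every connection
with the whole file queued) pin c*d = 10 GB of buffer memory, while
shared page-cache references keep it at d = 1 MB of data plus
per-connection bookkeeping.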

-- Jamie

2008-07-31 18:13:42

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

On Thu, 31 Jul 2008, Linus Torvalds wrote:
> On Thu, 31 Jul 2008, Nick Piggin wrote:
> >
> > It seems like the right way to fix this would be to allow the splicing
> > process to be notified of a short read, in which case it could try to
> > refill the pipe with the unread bytes...
>
> Hmm. That should certainly work with the splice model. The users of the
> data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
> data", and stop. And it would be up to the other side to handle it (it
> can see the state of the pipe, we can make it both wake up POLL_ERR _and_
> return an error if somebody tries to write to a "blocked" pipe).
>
> So yes, that's very possible, but it obviously requires splice() users to
> be able to handle more cases. I'm not sure it's realistic to expect users
> to be that advanced.

Worse, it's not guaranteed to make any progress. E.g. on an NFS server
with data being continually updated, the cache on the client will
basically always be invalid, so the above scheme might just repeat the
splices forever without success.

Miklos

2008-07-31 18:59:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation



On Thu, 31 Jul 2008, Jamie Lokier wrote:
>
> But did you miss the bit where you DON'T COPY ANYTHING EVER*? COW is
> able to provide _correctness_ for the rare corner cases which you're not
> optimising for. You don't actually copy more than 0.0% (*approx).

The thing is, just even _marking_ things COW is the expensive part. If we
have to walk page tables - we're screwed.

> The cost of COW is TLB flushes*. But for splice, there ARE NO TLB
> FLUSHES because such files are not mapped writable!

For splice, there are also no flags to set, no extra tracking costs, etc
etc.

But yes, we could make splice (from a file) do something like

- just fall back to copy if the page is already mapped (page->mapcount
gives us that)

- set a bit ("splicemapped") when we splice it in, and increment
page->mapcount for each splice copy.

- if a "splicemapped" page is ever mmap'ed or written to (either through
write or truncate), we COW it then (and actually move the page cache
page - it would be a "woc": a reverse cow, not a normal one).

- do all of this with page lock held, to make sure that there are no
writers or new mappers happening.

So it's probably doable.

(We could have a separate "splicecount", and actually allow non-writable
mappings, but I suspect we cannot afford the space in the "struct page"
for a whole new count).

> You're missing the real point of network splice().
>
> It's not just for speed.
>
> It's for sharing data. Your TCP buffers can share data when the same
> big lump is in flight to lots of clients. Think static file / web /
> FTP server, the kind with 80% of hits going to 0.01% of the files,
> which together are roughly the size of your RAM.

Maybe. Does it really show up as a big thing?

Linus

2008-08-01 01:23:21

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

On Friday 01 August 2008 04:13, Miklos Szeredi wrote:
> On Thu, 31 Jul 2008, Linus Torvalds wrote:
> > On Thu, 31 Jul 2008, Nick Piggin wrote:
> > > It seems like the right way to fix this would be to allow the splicing
> > > process to be notified of a short read, in which case it could try to
> > > refill the pipe with the unread bytes...
> >
> > Hmm. That should certainly work with the splice model. The users of the
> > data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
> > data", and stop. And it would be up to the other side to handle it (it
> > can see the state of the pipe, we can make it both wake up POLL_ERR _and_
> > return an error if somebody tries to write to a "blocked" pipe).
> >
> > So yes, that's very possible, but it obviously requires splice() users to
> > be able to handle more cases. I'm not sure it's realistic to expect users
> > to be that advanced.
>
> Worse, it's not guaranteed to make any progress. E.g. on NFS server
> with data being continually updated, cache on the client will
> basically always be invalid, so the above scheme might just repeat the
> splices forever without success.

Well, a) it probably makes sense in that case to provide another mode
of operation which fills the data synchronously from the sender and
copies it to the pipe (although the sender might just use read/write).
And b) we could *also* look at clearing PG_uptodate as an optimisation
iff that is found to help.

But I think it is kind of needed. The data comes from the sender, and
so only the sender may really know what to do in case of failure. I
think it is quite reasonable for an asynchronous interface to have some
kind of completion/error check and I think users should be that
advanced... if they aren't that advanced, they could use the synchronous,
copying flag to splice outlined in a), and then they wouldn't have to
care.

2008-08-01 18:29:21

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

On Fri, 1 Aug 2008, Nick Piggin wrote:
> Well, a) it probably makes sense in that case to provide another mode
> of operation which fills the data synchronously from the sender and
> copies it to the pipe (although the sender might just use read/write)
> And b) we could *also* look at clearing PG_uptodate as an optimisation
> iff that is found to help.

IMO it's not worth it to complicate the API just for the sake of
correctness in the so-very-rare read error case. Users of the splice
API will simply ignore this requirement, because things will work fine
on ext3 and friends, and will break only rarely on NFS and FUSE.

So I think it's much better to make the API simple: invalid pages are
OK, and for I/O errors we return -EIO on the pipe. It's not 100%
correct, but all in all it will result in less buggy programs.

Thanks,
Miklos
----

Subject: mm: dont clear PG_uptodate on truncate/invalidate

From: Miklos Szeredi <[email protected]>

Brian Wang reported that a FUSE filesystem exported through NFS could return
I/O errors on read. This was traced to splice_direct_to_actor() returning a
short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific, other filesystems (notably NFS)
also call invalidate_inode_pages2() to purge stale data from the cache.

If this happens while such pages are sitting in a pipe buffer, then splice(2)
from the pipe can return zero, and read(2) from the pipe can return ENODATA.

The zero return is especially bad, since it implies end-of-file or
disconnected pipe/socket, and is documented as such for splice. But returning
an error for read() is also nasty, when in fact there was no error (data
becoming stale is not an error).

The same problems can be triggered by "hole punching" with
madvise(MADV_REMOVE).

Fix this by not clearing the PG_uptodate flag on truncation and
invalidation.

Signed-off-by: Miklos Szeredi <[email protected]>
---
mm/truncate.c | 2 --
1 file changed, 2 deletions(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c 2008-07-28 17:45:02.000000000 +0200
+++ linux-2.6/mm/truncate.c 2008-08-01 20:18:51.000000000 +0200
@@ -104,7 +104,6 @@ truncate_complete_page(struct address_sp
cancel_dirty_page(page, PAGE_CACHE_SIZE);

remove_from_page_cache(page);
- ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
}
@@ -356,7 +355,6 @@ invalidate_complete_page2(struct address
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
- ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:

2008-08-01 18:35:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation



On Fri, 1 Aug 2008, Miklos Szeredi wrote:
>
> Subject: mm: dont clear PG_uptodate on truncate/invalidate
> From: Miklos Szeredi <[email protected]>

Ok, this one I have no problems with whatsoever. I'd like Acks for this
kind of change (and obviously hope that it's tested), but it looks clean
and I think the new rules are better (not just for this case).

Linus

2008-08-02 04:27:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
> On Fri, 1 Aug 2008, Nick Piggin wrote:
> > Well, a) it probably makes sense in that case to provide another mode
> > of operation which fills the data synchronously from the sender and
> > copies it to the pipe (although the sender might just use read/write)
> > And b) we could *also* look at clearing PG_uptodate as an optimisation
> > iff that is found to help.
>
> IMO it's not worth it to complicate the API just for the sake of
> correctness in the so-very-rare read error case. Users of the splice
> API will simply ignore this requirement, because things will work fine
> on ext3 and friends, and will break only rarely on NFS and FUSE.
>
> So I think it's much better to make the API simple: invalid pages are
> OK, and for I/O errors we return -EIO on the pipe. It's not 100%
> correct, but all in all it will result in less buggy programs.

That's true, but I hate how we always (in the VM, at least) just brush
error handling under the carpet because it is too hard :(

I guess your patch is OK, though. I don't see any reasons it could cause
problems...

2008-08-04 15:30:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

Nick Piggin wrote:
> On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
> > On Fri, 1 Aug 2008, Nick Piggin wrote:
> > > Well, a) it probably makes sense in that case to provide another mode
> > > of operation which fills the data synchronously from the sender and
> > > copies it to the pipe (although the sender might just use read/write)
> > > And b) we could *also* look at clearing PG_uptodate as an optimisation
> > > iff that is found to help.
> >
> > IMO it's not worth it to complicate the API just for the sake of
> > correctness in the so-very-rare read error case. Users of the splice
> > API will simply ignore this requirement, because things will work fine
> > on ext3 and friends, and will break only rarely on NFS and FUSE.
> >
> > So I think it's much better to make the API simple: invalid pages are
> > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
> > correct, but all in all it will result in less buggy programs.
>
> That's true, but I hate how we always (in the VM, at least) just brush
> error handling under the carpet because it is too hard :(
>
> I guess your patch is OK, though. I don't see any reasons it could cause
> problems...

At least, if there are situations where the data received is not what
a common sense programmer would expect (e.g. blocks of zeros, data
from an unexpected time in syscall sequence, or something, or just
"reliable except with FUSE and NFS"), please ensure it's documented in
splice.txt or wherever.

-- Jamie

2008-08-05 03:04:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch v3] splice: fix race with page invalidation

On Tuesday 05 August 2008 01:29, Jamie Lokier wrote:
> Nick Piggin wrote:
> > On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
> > > On Fri, 1 Aug 2008, Nick Piggin wrote:
> > > > Well, a) it probably makes sense in that case to provide another mode
> > > > of operation which fills the data synchronously from the sender and
> > > > copies it to the pipe (although the sender might just use read/write)
> > > > And b) we could *also* look at clearing PG_uptodate as an
> > > > optimisation iff that is found to help.
> > >
> > > IMO it's not worth it to complicate the API just for the sake of
> > > correctness in the so-very-rare read error case. Users of the splice
> > > API will simply ignore this requirement, because things will work fine
> > > on ext3 and friends, and will break only rarely on NFS and FUSE.
> > >
> > > So I think it's much better to make the API simple: invalid pages are
> > > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
> > > correct, but all in all it will result in less buggy programs.
> >
> > That's true, but I hate how we always (in the VM, at least) just brush
> > error handling under the carpet because it is too hard :(
> >
> > I guess your patch is OK, though. I don't see any reasons it could cause
> > problems...
>
> At least, if there are situations where the data received is not what
> a common sense programmer would expect (e.g. blocks of zeros, data
> from an unexpected time in syscall sequence, or something, or just
> "reliable except with FUSE and NFS"), please ensure it's documented in
> splice.txt or wherever.

Not quite true. Many filesystems can return -EIO, and truncate can
partially zero pages.

Basically the man page should note that until the splice API is
improved, a) -EIO errors will be seen at the receiver, b) the pages
can contain transient zeroes (this is the case with read(2) as well,
but splice has a much bigger window), and c) the sender does not send
a snapshot of the data, because it can still be modified until it is
received.

c is not too surprising for an asynchronous interface, but it is
nice to document in case people are expecting COW or something.
b and c can more or less be worked around by not doing silly things
like truncating or scribbling on data until the receiver really has it.
a, I argue, should be fixed in the API.
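
As a rough sketch of what (a) means in practice for the process
consuming the pipe (illustrative only; 'pipe_fd' and 'sock' are assumed
to have been set up elsewhere, the pipe having been filled by splicing
from a file):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int drain_pipe_to_socket(int pipe_fd, int sock, size_t len)
{
	while (len > 0) {
		ssize_t n = splice(pipe_fd, NULL, sock, NULL, len, SPLICE_F_MORE);
		if (n > 0) {
			len -= n;
			continue;
		}
		if (n == 0)	/* write side closed with less data than expected */
			return -1;
		if (errno == EINTR)
			continue;
		if (errno == EIO) {
			/* a read error on a page that was already spliced into
			 * the pipe surfaces here, at the consumer, as -EIO */
			fprintf(stderr, "splice: I/O error on buffered data\n");
			return -1;
		}
		return -1;
	}
	return 0;
}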

Subject: Re: [patch v3] splice: fix race with page invalidation

Nick, Jens

On Tue, Aug 5, 2008 at 4:57 AM, Nick Piggin <[email protected]> wrote:
> On Tuesday 05 August 2008 01:29, Jamie Lokier wrote:
>> Nick Piggin wrote:
>> > On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
>> > > On Fri, 1 Aug 2008, Nick Piggin wrote:
>> > > > Well, a) it probably makes sense in that case to provide another mode
>> > > > of operation which fills the data synchronously from the sender and
>> > > > copies it to the pipe (although the sender might just use read/write)
>> > > > And b) we could *also* look at clearing PG_uptodate as an
>> > > > optimisation iff that is found to help.
>> > >
>> > > IMO it's not worth it to complicate the API just for the sake of
>> > > correctness in the so-very-rare read error case. Users of the splice
>> > > API will simply ignore this requirement, because things will work fine
>> > > on ext3 and friends, and will break only rarely on NFS and FUSE.
>> > >
>> > > So I think it's much better to make the API simple: invalid pages are
>> > > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
>> > > correct, but all in all it will result in less buggy programs.
>> >
>> > That's true, but I hate how we always (in the VM, at least) just brush
>> > error handling under the carpet because it is too hard :(
>> >
>> > I guess your patch is OK, though. I don't see any reasons it could cause
>> > problems...
>>
>> At least, if there are situations where the data received is not what
>> a common sense programmer would expect (e.g. blocks of zeros, data
>> from an unexpected time in syscall sequence, or something, or just
>> "reliable except with FUSE and NFS"), please ensure it's documented in
>> splice.txt or wherever.
>
> Not quite true. Many filesystems can return -EIO, and truncate can
> partially zero pages.
>
> Basically the man page should note that until the splice API is
> improved, a) -EIO errors will be seen at the receiver, b) the pages
> can contain transient zeroes (this is the case with read(2) as well,
> but splice has a much bigger window), and c) the sender does not send
> a snapshot of the data, because it can still be modified until it is
> received.
>
> c is not too surprising for an asynchronous interface, but it is
> nice to document in case people are expecting COW or something.
> b and c can more or less be worked around by not doing silly things
> like truncating or scribbling on data until the receiver really has it.
> a, I argue, should be fixed in the API.

Nick, could you come up with a patch to the man page for this?
Something that's ACKable by Jens?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html