2008-06-25 12:41:40

by Miklos Szeredi

[permalink] [raw]
Subject: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

From: Miklos Szeredi <[email protected]>

Clearing the uptodate page flag will cause page_cache_pipe_buf_confirm()
to return -ENODATA if that page was in the buffer. This in turn will cause
splice() to return a short or zero count.

This manifested itself in rare I/O errors seen on nfs exported fuse
filesystems. This is because nfsd uses splice_direct_to_actor() to
read files, and fuse uses invalidate_inode_pages2() to invalidate
stale data on open.

Fix this by not clearing PG_uptodate on page invalidation. This will
result in the old, invalid page contents being copied. But that's OK,
the contents were valid at splice-in time (which is when the the
"copy" was conceptually done).

I haven't done an audit of all code that checks the PG_uptodate flags,
but I suspect, that this change won't have any harmful effects. Most
code checks page->mapping to see if the page was truncated or
invalidated, before using it, and retries the find/read on the page if
it wasn't. The page_cache_pipe_buf_confirm() code is an exception in
this regard.

Signed-off-by: Miklos Szeredi <[email protected]>
---
mm/truncate.c | 1 -
1 file changed, 1 deletion(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c 2008-06-24 20:49:25.000000000 +0200
+++ linux-2.6/mm/truncate.c 2008-06-24 23:28:32.000000000 +0200
@@ -356,7 +356,6 @@ invalidate_complete_page2(struct address
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:

--


2008-06-25 13:12:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 02:40:39PM +0200, Miklos Szeredi ([email protected]) wrote:
> From: Miklos Szeredi <[email protected]>
>
> Clearing the uptodate page flag will cause page_cache_pipe_buf_confirm()
> to return -ENODATA if that page was in the buffer. This in turn will cause
> splice() to return a short or zero count.
>
> This manifested itself in rare I/O errors seen on nfs exported fuse
> filesystems. This is because nfsd uses splice_direct_to_actor() to
> read files, and fuse uses invalidate_inode_pages2() to invalidate
> stale data on open.
>
> Fix this by not clearing PG_uptodate on page invalidation. This will
> result in the old, invalid page contents being copied. But that's OK,
> the contents were valid at splice-in time (which is when the the
> "copy" was conceptually done).
>
> I haven't done an audit of all code that checks the PG_uptodate flags,
> but I suspect, that this change won't have any harmful effects. Most
> code checks page->mapping to see if the page was truncated or
> invalidated, before using it, and retries the find/read on the page if
> it wasn't. The page_cache_pipe_buf_confirm() code is an exception in
> this regard.

What about writing path, when page is written after some previous write?
Like __block_prepare_write()?

> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> mm/truncate.c | 1 -
> 1 file changed, 1 deletion(-)
>
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c 2008-06-24 20:49:25.000000000 +0200
> +++ linux-2.6/mm/truncate.c 2008-06-24 23:28:32.000000000 +0200
> @@ -356,7 +356,6 @@ invalidate_complete_page2(struct address
> BUG_ON(PagePrivate(page));
> __remove_from_page_cache(page);
> write_unlock_irq(&mapping->tree_lock);
> - ClearPageUptodate(page);
> page_cache_release(page); /* pagecache ref */
> return 1;
> failed:

Don't do that, add new function instead which will do exactly that, if
you do need exactly this behaviour.

Also why isn't invalidate_complete_page() enough, if you want to have
that page to be half invalidated?

--
Evgeniy Polyakov

2008-06-25 13:33:26

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > I haven't done an audit of all code that checks the PG_uptodate flags,
> > but I suspect, that this change won't have any harmful effects. Most
> > code checks page->mapping to see if the page was truncated or
> > invalidated, before using it, and retries the find/read on the page if
> > it wasn't. The page_cache_pipe_buf_confirm() code is an exception in
> > this regard.
>
> What about writing path, when page is written after some previous write?

page->mapping should be checked in the write paths as well.

> Like __block_prepare_write()?

That's called with the page locked and page->mapping verified.

> > Signed-off-by: Miklos Szeredi <[email protected]>
> > ---
> > mm/truncate.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > Index: linux-2.6/mm/truncate.c
> > ===================================================================
> > --- linux-2.6.orig/mm/truncate.c 2008-06-24 20:49:25.000000000 +0200
> > +++ linux-2.6/mm/truncate.c 2008-06-24 23:28:32.000000000 +0200
> > @@ -356,7 +356,6 @@ invalidate_complete_page2(struct address
> > BUG_ON(PagePrivate(page));
> > __remove_from_page_cache(page);
> > write_unlock_irq(&mapping->tree_lock);
> > - ClearPageUptodate(page);
> > page_cache_release(page); /* pagecache ref */
> > return 1;
> > failed:
>
> Don't do that, add new function instead which will do exactly that, if
> you do need exactly this behaviour.

I don't see any point in doing that.

> Also why isn't invalidate_complete_page() enough, if you want to have
> that page to be half invalidated?

I want the page fully invalidated, and I also want splice and nfs
exporting to work as for other filesystems.

Miklos

2008-06-25 14:18:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 03:32:55PM +0200, Miklos Szeredi ([email protected]) wrote:
> > What about writing path, when page is written after some previous write?
>
> page->mapping should be checked in the write paths as well.

All we need from mapping here is where to put this page and inode
pointer.

> > Like __block_prepare_write()?
>
> That's called with the page locked and page->mapping verified.

Only when called via standard codepath. If page was grabbed and page
unlocked and subsequently 'invalidated' via invalidate_complete_page2(),
it still relies on uptodate bit to be set to correctly work.

After all we do not need page mapping to write into given page, that's
why __block_prepare_write() does not check it.

> > > Signed-off-by: Miklos Szeredi <[email protected]>
> > > ---
> > > mm/truncate.c | 1 -
> > > 1 file changed, 1 deletion(-)
> > >
> > > Index: linux-2.6/mm/truncate.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/truncate.c 2008-06-24 20:49:25.000000000 +0200
> > > +++ linux-2.6/mm/truncate.c 2008-06-24 23:28:32.000000000 +0200
> > > @@ -356,7 +356,6 @@ invalidate_complete_page2(struct address
> > > BUG_ON(PagePrivate(page));
> > > __remove_from_page_cache(page);
> > > write_unlock_irq(&mapping->tree_lock);
> > > - ClearPageUptodate(page);
> > > page_cache_release(page); /* pagecache ref */
> > > return 1;
> > > failed:
> >
> > Don't do that, add new function instead which will do exactly that, if
> > you do need exactly this behaviour.
>
> I don't see any point in doing that.
>
> > Also why isn't invalidate_complete_page() enough, if you want to have
> > that page to be half invalidated?
>
> I want the page fully invalidated, and I also want splice and nfs
> exporting to work as for other filesystems.

Fully invalidated page can not be uptodate, doesnt' it? :)

You destroy existing functionality just because there are some obscure
places, where it is used, so instead of fixing that places, you treat
the symptom. After writing previous mail I found a way to workaround it
even with your changes, but the whole approach of changing
invalidate_complete_page2() is not correct imho.

Your note:
>Let's start with page_cache_pipe_buf_confirm(). How should we deal
>with finding an invalidated page (!PageUptodate(page) &&
>!page->mapping)?

>We could return zero to use the contents even though it was
>invalidated, not good, but if the page was originally uptodate, then
>it should be OK to use the stale data. But it could have been
>invalidated before becoming uptodate, so the contents could be total
>crap, and that's not good. So now we have to tweak page invalidation
>to differentiate between was-uptodate and was-never-uptodate pages.

Is this nfs/fuse problem you described:
http://marc.info/?l=linux-fsdevel&m=121396920822693&w=2

Instead of returning error when reading from invalid page, now you
return old content of it? From description on above link it is not the
case, when user reads data into splice pipe and suddenly it becomes
invalidated, which you try to fix with this patch, but it may be
completely different problem though.

--
Evgeniy Polyakov

2008-06-25 14:45:01

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > > Like __block_prepare_write()?
> >
> > That's called with the page locked and page->mapping verified.
>
> Only when called via standard codepath.

It would be a grave error to call it without the page lock.

> > I want the page fully invalidated, and I also want splice and nfs
> > exporting to work as for other filesystems.
>
> Fully invalidated page can not be uptodate, doesnt' it? :)

That's just a question of how we interpret PG_uptodate. If it means:
the page contains data that is valid, or was valid at some point in
time, then an invalidated or truncated page can be uptodate.

> You destroy existing functionality just because there are some obscure
> places, where it is used, so instead of fixing that places, you treat
> the symptom. After writing previous mail I found a way to workaround it
> even with your changes, but the whole approach of changing
> invalidate_complete_page2() is not correct imho.

You rely on page being !PageUptodate() after being invalidated? Why
can't you check page->mapping instead (as everything else does)?

> Is this nfs/fuse problem you described:
> http://marc.info/?l=linux-fsdevel&m=121396920822693&w=2

Yes.

> Instead of returning error when reading from invalid page, now you
> return old content of it?

No, instead of returning a short count, it is now returning old
content.

Miklos

2008-06-25 15:12:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()



On Wed, 25 Jun 2008, Miklos Szeredi wrote:
>
> Clearing the uptodate page flag will cause page_cache_pipe_buf_confirm()
> to return -ENODATA if that page was in the buffer. This in turn will cause
> splice() to return a short or zero count.

I really think we should just change splice itself at this point.

The VM people may end up removing the clearing of the uptodate bit for
other reasons, but there's no way I'll do this kind of thing which affects
unknown number of cases for just the splice kind of reason, when we could
just change the one place you care about right now - which is just
FUSE/NFSD/page_cache_pipe_buf_confirm.

I also really don't think this even fixes the problems you have with
FUSE/NFSD - because you'll still be reading zeroes for a truncated file.
Yes, you get the rigth counts, but you don't get the right data.

(And no, your previous patch that removed the asnchronous stuff and the
pipe_buf_confirm entirely didn't fix it _either_ - it's simply unavoidable
when you just pass unlocked pages around. There is no serialization with
other people doing truncates etc)

That's "correct" from a splice() kind of standpoint (it's essentially a
temporary mmap() with MAP_PRIVATE), but the thing is, it just sounds like
the whole "page went away" thing is a more fundamental issue. It sounds
like nfds should hold a read-lock on the file while it has any IO in
flight, or something like that.

IOW, I don't think your patch really is in the right area. I *do* agree
that page_cache_pipe_buf_confirm() itself probably is worth changing, but
I think you're trying to treat some of the individual symptoms here (and
not even all), and there's a more fundamental underlying issue that you're
not loooking at.

Maybe.

Linus

2008-06-25 15:30:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > Clearing the uptodate page flag will cause page_cache_pipe_buf_confirm()
> > to return -ENODATA if that page was in the buffer. This in turn will cause
> > splice() to return a short or zero count.
>
> I really think we should just change splice itself at this point.

We discussed this yesterday. My conclusion was (which I still think
is true) that it can't be fixed in page_cache_pipe_buf_confirm(),
because due to current practice of not setting PG_error for I/O errors
for read, it is impossible to distinguish between a never-been-uptodate
page and a was-uptodate-before-invalidation page.

And it's not just an nfsd issue. Userspace might also expect that if
a zero count is returned, that means it went beyond EOF, and not that
it should retry the splice, maybe it has better luck this time.

So no, this is not just a fuse/nfsd issue, it applies to all
filesystems that do invalidate_inode_pages2 (there are 4-5 of them I
think).

And I don't see what I would be ignoring. This is _not_ about
truncate(2), that is shared by all filesystems, and bugs wrt splice
would affect not just fuse.

Miklos

2008-06-25 15:37:01

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 04:41:10PM +0200, Miklos Szeredi ([email protected]) wrote:
> > > > Like __block_prepare_write()?
> > >
> > > That's called with the page locked and page->mapping verified.
> >
> > Only when called via standard codepath.
>
> It would be a grave error to call it without the page lock.

Page is locked of course, but invalidated, removed from all trees and
caches, i.e. grab, lock, check, unlock... invalidate, write into that
page should fail, but it will not, since page is uptodate and
prepare_write does not check mapping at all.

> > > I want the page fully invalidated, and I also want splice and nfs
> > > exporting to work as for other filesystems.
> >
> > Fully invalidated page can not be uptodate, doesnt' it? :)
>
> That's just a question of how we interpret PG_uptodate. If it means:
> the page contains data that is valid, or was valid at some point in
> time, then an invalidated or truncated page can be uptodate.
>
> > You destroy existing functionality just because there are some obscure
> > places, where it is used, so instead of fixing that places, you treat
> > the symptom. After writing previous mail I found a way to workaround it
> > even with your changes, but the whole approach of changing
> > invalidate_complete_page2() is not correct imho.
>
> You rely on page being !PageUptodate() after being invalidated? Why
> can't you check page->mapping instead (as everything else does)?

I already found a solution, but I worry about splice code, which usage
results in hidden error now.

> > Is this nfs/fuse problem you described:
> > http://marc.info/?l=linux-fsdevel&m=121396920822693&w=2
>
> Yes.
>
> > Instead of returning error when reading from invalid page, now you
> > return old content of it?
>
> No, instead of returning a short count, it is now returning old
> content.

Or instead of returning error or zero and relookup page eventually,
which can already contain new data, we get old data.

--
Evgeniy Polyakov

2008-06-25 15:47:45

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 04:41:10PM +0200, Miklos Szeredi ([email protected]) wrote:
> > Is this nfs/fuse problem you described:
> > http://marc.info/?l=linux-fsdevel&m=121396920822693&w=2
>
> Yes.

I do not know fuse good enough, but it looks like if your patch only
fixes issue for pages which are in splice buffer during invalidation,
any subsequent call for splice() will get correct new data (at least
invoke readpage(s)), but in the description of error there was a
long timeout between reading and writing, so it was a fresh splice()
call, which suddenly started to return errors.

Is it possible that by having uptodate bit in place, but page was
actually freed but not removed from all trees and so on, so it
masked read errors? This may be not the right conclusion though :)

--
Evgeniy Polyakov

2008-06-25 15:59:41

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > > > > Like __block_prepare_write()?
> > > >
> > > > That's called with the page locked and page->mapping verified.
> > >
> > > Only when called via standard codepath.
> >
> > It would be a grave error to call it without the page lock.
>
> Page is locked of course, but invalidated, removed from all trees and
> caches, i.e. grab, lock, check, unlock... invalidate, write into that
> page should fail, but it will not, since page is uptodate and
> prepare_write does not check mapping at all.

But callers do check after having locked the page.

> > > Instead of returning error when reading from invalid page, now you
> > > return old content of it?
> >
> > No, instead of returning a short count, it is now returning old
> > content.
>
> Or instead of returning error or zero and relookup page eventually,
> which can already contain new data, we get old data.

Umm, it doesn't make any sense to try to always get fresh data. If
you do read() on a file, the data may become old and invalid a
millisecond after the read finished. We can't and needn't do anything
about this.

Miklos

2008-06-25 16:03:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > > Is this nfs/fuse problem you described:
> > > http://marc.info/?l=linux-fsdevel&m=121396920822693&w=2
> >
> > Yes.
>
> I do not know fuse good enough, but it looks like if your patch only
> fixes issue for pages which are in splice buffer during invalidation,
> any subsequent call for splice() will get correct new data (at least
> invoke readpage(s)), but in the description of error there was a
> long timeout between reading and writing, so it was a fresh splice()
> call, which suddenly started to return errors.
>
> Is it possible that by having uptodate bit in place, but page was
> actually freed but not removed from all trees and so on, so it
> masked read errors? This may be not the right conclusion though :)

No, the mechanics of the problem are well understood. Only the
solution is problematic :)

Miklos

2008-06-25 16:19:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 05:59:14PM +0200, Miklos Szeredi ([email protected]) wrote:
> > Page is locked of course, but invalidated, removed from all trees and
> > caches, i.e. grab, lock, check, unlock... invalidate, write into that
> > page should fail, but it will not, since page is uptodate and
> > prepare_write does not check mapping at all.
>
> But callers do check after having locked the page.

Yes, it is possible to check mapping, but it does not exist and it is
correct, that there is no mapping - we are just writing into page in
ram, kind of loop device, but without binding page into mapping.
And mapping itself is used just for its operations.

> > > > Instead of returning error when reading from invalid page, now you
> > > > return old content of it?
> > >
> > > No, instead of returning a short count, it is now returning old
> > > content.
> >
> > Or instead of returning error or zero and relookup page eventually,
> > which can already contain new data, we get old data.
>
> Umm, it doesn't make any sense to try to always get fresh data. If
> you do read() on a file, the data may become old and invalid a
> millisecond after the read finished. We can't and needn't do anything
> about this.

Page reading from disk is atomic in respect that page is always locked,
now readpage(s) may not be called in some cases...

--
Evgeniy Polyakov

2008-06-25 16:20:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Wed, Jun 25, 2008 at 06:02:43PM +0200, Miklos Szeredi ([email protected]) wrote:
> > I do not know fuse good enough, but it looks like if your patch only
> > fixes issue for pages which are in splice buffer during invalidation,
> > any subsequent call for splice() will get correct new data (at least
> > invoke readpage(s)), but in the description of error there was a
> > long timeout between reading and writing, so it was a fresh splice()
> > call, which suddenly started to return errors.
> >
> > Is it possible that by having uptodate bit in place, but page was
> > actually freed but not removed from all trees and so on, so it
> > masked read errors? This may be not the right conclusion though :)
>
> No, the mechanics of the problem are well understood. Only the
> solution is problematic :)

Then why not fix filesystems to set error bit on all broken reads, if
you sure it is the case?

--
Evgeniy Polyakov

2008-06-25 16:31:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()



On Wed, 25 Jun 2008, Miklos Szeredi wrote:
>
> We discussed this yesterday. My conclusion was (which I still think
> is true) that it can't be fixed in page_cache_pipe_buf_confirm(),
> because due to current practice of not setting PG_error for I/O errors
> for read, it is impossible to distinguish between a never-been-uptodate
> page and a was-uptodate-before-invalidation page.

Umm. The regular read does this quite well. If something isn't up-to-date,
it tries a synchronous read. Once.

> And it's not just an nfsd issue. Userspace might also expect that if
> a zero count is returned, that means it went beyond EOF, and not that
> it should retry the splice, maybe it has better luck this time.

You're totally ignoring the real issue - user space that uses splice()
*knows* that it uses splice(). It's a private mmap().

NFSD, on the other hand, is supposed to act as NFSD. I think that
currently it assumes that nobody else modifies the files, which is
reasonable, but breaks with FUSE.

But do you see? That's a NFSD/FUSE issue, not a splice one!

Linus

2008-06-25 16:42:22

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> >
> > We discussed this yesterday. My conclusion was (which I still think
> > is true) that it can't be fixed in page_cache_pipe_buf_confirm(),
> > because due to current practice of not setting PG_error for I/O errors
> > for read, it is impossible to distinguish between a never-been-uptodate
> > page and a was-uptodate-before-invalidation page.
>
> Umm. The regular read does this quite well. If something isn't up-to-date,
> it tries a synchronous read. Once.

Exactly. And if page_cache_pipe_buf_confirm() could do a synchronous
re-read of the page, that would work. But it can't, because it only
has the page and not the file.

> > And it's not just an nfsd issue. Userspace might also expect that if
> > a zero count is returned, that means it went beyond EOF, and not that
> > it should retry the splice, maybe it has better luck this time.
>
> You're totally ignoring the real issue - user space that uses splice()
> *knows* that it uses splice(). It's a private mmap().
>
> NFSD, on the other hand, is supposed to act as NFSD. I think that
> currently it assumes that nobody else modifies the files, which is
> reasonable, but breaks with FUSE.

Not so. Why couldn't someone modify an ext3 file, while nfsd is
holding the page? Is that wrong? I don't know, but it's not fuse
specific.

> But do you see? That's a NFSD/FUSE issue, not a splice one!

No, I think you are wrong.

Miklos

2008-06-25 17:38:54

by Jamie Lokier

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

Linus Torvalds wrote:
> I also really don't think this even fixes the problems you have with
> FUSE/NFSD - because you'll still be reading zeroes for a truncated file.
> Yes, you get the rigth counts, but you don't get the right data.
...
> That's "correct" from a splice() kind of standpoint (it's essentially a
> temporary mmap() with MAP_PRIVATE), but the thing is, it just sounds like
> the whole "page went away" thing is a more fundamental issue. It sounds
> like nfds should hold a read-lock on the file while it has any IO in
> flight, or something like that.

I'm thinking any kind of user-space server using splice() will not
want to transmit zeros either, when another process truncates the file.
E.g. Apache, Samba, etc.

Does this problem affect sendfile() users?

-- Jamie

2008-06-25 18:36:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

> > I also really don't think this even fixes the problems you have with
> > FUSE/NFSD - because you'll still be reading zeroes for a truncated file.
> > Yes, you get the rigth counts, but you don't get the right data.
> ...
> > That's "correct" from a splice() kind of standpoint (it's essentially a
> > temporary mmap() with MAP_PRIVATE), but the thing is, it just sounds like
> > the whole "page went away" thing is a more fundamental issue. It sounds
> > like nfds should hold a read-lock on the file while it has any IO in
> > flight, or something like that.
>
> I'm thinking any kind of user-space server using splice() will not
> want to transmit zeros either, when another process truncates the file.
> E.g. Apache, Samba, etc.
>
> Does this problem affect sendfile() users?

AFAICS it does.

And I agree, that splice should handle truncation better. But as I
said, this affects every single filesystem out there. And yes, my
original patch wouldn't solve this (although it wouldn't make it
harder to solve either).

However the page invalidation issue is completely orthogonal, and only
affects a few filesystems which call invalidate_complete_page2(),
namely: 9p, afs, fuse and nfs.

Miklos

2008-07-07 06:39:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Thursday 26 June 2008 04:35, Miklos Szeredi wrote:
> > > I also really don't think this even fixes the problems you have with
> > > FUSE/NFSD - because you'll still be reading zeroes for a truncated
> > > file. Yes, you get the rigth counts, but you don't get the right data.
> >
> > ...
> >
> > > That's "correct" from a splice() kind of standpoint (it's essentially a
> > > temporary mmap() with MAP_PRIVATE), but the thing is, it just sounds
> > > like the whole "page went away" thing is a more fundamental issue. It
> > > sounds like nfds should hold a read-lock on the file while it has any
> > > IO in flight, or something like that.
> >
> > I'm thinking any kind of user-space server using splice() will not
> > want to transmit zeros either, when another process truncates the file.
> > E.g. Apache, Samba, etc.
> >
> > Does this problem affect sendfile() users?
>
> AFAICS it does.
>
> And I agree, that splice should handle truncation better. But as I
> said, this affects every single filesystem out there. And yes, my
> original patch wouldn't solve this (although it wouldn't make it
> harder to solve either).
>
> However the page invalidation issue is completely orthogonal, and only
> affects a few filesystems which call invalidate_complete_page2(),
> namely: 9p, afs, fuse and nfs.

I don't know what became of this thread, but I agree with everyone else
you should not skip clearing PG_uptodate here. If nothing else, it
weakens some important assertions in the VM. But I agree that splice
should really try harder to work with it and we should be a little
careful about just changing things like this.

2008-07-07 09:22:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Mon, 7 Jul 2008, Nick Piggin wrote:
> I don't know what became of this thread, but I agree with everyone else
> you should not skip clearing PG_uptodate here. If nothing else, it
> weakens some important assertions in the VM. But I agree that splice
> should really try harder to work with it and we should be a little
> careful about just changing things like this.

Sure, that's why I rfc'ed.

But I'd still like to know, what *are* those assumptions in the VM
that would be weakened by this?

Thanks,
Miklos

2008-07-07 10:13:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Mon, 07 Jul 2008, Miklos Szeredi wrote:
> On Mon, 7 Jul 2008, Nick Piggin wrote:
> > I don't know what became of this thread, but I agree with everyone else
> > you should not skip clearing PG_uptodate here. If nothing else, it
> > weakens some important assertions in the VM. But I agree that splice
> > should really try harder to work with it and we should be a little
> > careful about just changing things like this.
>
> Sure, that's why I rfc'ed.
>
> But I'd still like to know, what *are* those assumptions in the VM
> that would be weakened by this?

For one, currently some of the generic VM code assumes that after
synchronously reading in a page (i.e. ->readpage() then lock_page())
!PageUptodate() necessarily means an I/O error:

/**
* read_cache_page - read into page cache, fill it if needed
...
* If the page does not get brought uptodate, return -EIO.
*/

Which is wrong, the page could be invalidated between being broough
uptodate and being examined for being uptodate. Then we'd be
returning EIO, which is definitely wrong.

AFAICS this could be a real (albeit rare) bug in NFS's readdir().

This is easily fixable in read_cache_page(), but what I'm trying to
say is that assumptions about PG_uptodate aren't all that clear to
begin with, so it would perhaps be useful to first think about this a
bit more.

Miklos

2008-07-07 10:43:51

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Monday 07 July 2008 19:21, Miklos Szeredi wrote:
> On Mon, 7 Jul 2008, Nick Piggin wrote:
> > I don't know what became of this thread, but I agree with everyone else
> > you should not skip clearing PG_uptodate here. If nothing else, it
> > weakens some important assertions in the VM. But I agree that splice
> > should really try harder to work with it and we should be a little
> > careful about just changing things like this.
>
> Sure, that's why I rfc'ed.
>
> But I'd still like to know, what *are* those assumptions in the VM
> that would be weakened by this?

Not assumptions (that I know of, but there could be some) but
assertions. For example we assert that pages in page tables are
always uptodate. We'd miss warning if we had an invalidated page
in the pagetables after this change.

2008-07-07 11:02:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Monday 07 July 2008 20:12, Miklos Szeredi wrote:
> On Mon, 07 Jul 2008, Miklos Szeredi wrote:
> > On Mon, 7 Jul 2008, Nick Piggin wrote:
> > > I don't know what became of this thread, but I agree with everyone else
> > > you should not skip clearing PG_uptodate here. If nothing else, it
> > > weakens some important assertions in the VM. But I agree that splice
> > > should really try harder to work with it and we should be a little
> > > careful about just changing things like this.
> >
> > Sure, that's why I rfc'ed.
> >
> > But I'd still like to know, what *are* those assumptions in the VM
> > that would be weakened by this?
>
> For one, currently some of the generic VM code assumes that after
> synchronously reading in a page (i.e. ->readpage() then lock_page())
> !PageUptodate() necessarily means an I/O error:

Yes, the error paths in the vm/fs layer can be pretty crappy.


> /**
> * read_cache_page - read into page cache, fill it if needed
> ...
> * If the page does not get brought uptodate, return -EIO.
> */
>
> Which is wrong, the page could be invalidated between being broough
> uptodate and being examined for being uptodate. Then we'd be
> returning EIO, which is definitely wrong.
>
> AFAICS this could be a real (albeit rare) bug in NFS's readdir().

Actually this bug is known for a long time and exists in the
generic mapping read code too. And it doesn't even need to be
invalidated as such, but even truncated. It can be hard to get
people excited about "theoretical" bugs :(


> This is easily fixable in read_cache_page(), but what I'm trying to
> say is that assumptions about PG_uptodate aren't all that clear to
> begin with, so it would perhaps be useful to first think about this a
> bit more.

PG_uptodate should be set if we can return data to userspace
from it. I wouldn't worry about the error path bugs like this:
they should all be testing PG_error for -EIOness rather than
!PageUptodate. However I don't want to skip clearing PG_uptodate
in invalidate just yet if possible.

It seems to be a documented and known issue from day 0, so if
we can't see a really easy way to fix it without leave PG_uptodate
hanging around, can we put the burden on the callers to handle
the case correctly rather than put it on the VM to handle it?
(which we will then have to support for T < infinity)

And it isn't just a fuse problem is it? Other places can invalidate
and truncate pages which might be spliced into a pipe, can't they?

2008-07-07 12:03:56

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Mon, 7 Jul 2008, Nick Piggin wrote:
> And it isn't just a fuse problem is it? Other places can invalidate
> and truncate pages which might be spliced into a pipe, can't they?

Yes. There are several different problems here actually, and it's not
completely clear how splice should handle them:

Case 1: page is completely truncated while in the pipe

Currently splice() will detect this and return a short read count on
the splice-out. Which sounds sane, and consistent with the fact
that a zero return value can happen on EOF.

Case 2: page is partially truncated while in the pipe

Splice doesn't detect this, and returns the contents of the whole
page on splice-out, which will contain the zeroed-out part as well.
This is not so nice, but other than some elaborate COW schemes, I
don't see how this could be fixed.

Case 3: page is invalidated while in the pipe

This can happen on pages in the middle of the file, and splice-out
can return a zero count. This is *BAD*, callers of splice really
should be able to assume, that a zero return means EOF.

Page invalidation (the invalidate_inode_pages2 kind) is done by only a
few filesystems (FUSE, NFS, AFS, 9P), and by O_DIRECT hackery. So
case 3 only affects these, and only fuse can be re-exported by nfsd
(and that's only in -mm yet), hence this is very unlikely to be hit
for any of the others.

But it's bad despite being rare, because once it hits it can cause
data corruption (like with fuse/nfsd) and could be very hard to debug.
OK, case 2 could also cause corruption if caller is not expecting it.

Splice is a cool concept, but it isn't easy to get the implementation
right...

Miklos

2008-07-07 12:18:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Monday 07 July 2008 22:03, Miklos Szeredi wrote:
> On Mon, 7 Jul 2008, Nick Piggin wrote:
> > And it isn't just a fuse problem is it? Other places can invalidate
> > and truncate pages which might be spliced into a pipe, can't they?
>
> Yes. There are several different problems here actually, and it's not
> completely clear how splice should handle them:
>
> Case 1: page is completely truncated while in the pipe
>
> Currently splice() will detect this and return a short read count on
> the splice-out. Which sounds sane, and consistent with the fact
> that a zero return value can happen on EOF.

OK.


> Case 2: page is partially truncated while in the pipe
>
> Splice doesn't detect this, and returns the contents of the whole
> page on splice-out, which will contain the zeroed-out part as well.
> This is not so nice, but other than some elaborate COW schemes, I
> don't see how this could be fixed.

There is a race window in the read(2) path where this can happen too.
The window for splice is larger, but I don't know if it is worth a
song and dance about if we're not bothering with the read(2) problem.

Maybe a note in the man page.


> Case 3: page is invalidated while in the pipe
>
> This can happen on pages in the middle of the file, and splice-out
> can return a zero count. This is *BAD*, callers of splice really
> should be able to assume, that a zero return means EOF.
>
> Page invalidation (the invalidate_inode_pages2 kind) is done by only a
> few filesystems (FUSE, NFS, AFS, 9P), and by O_DIRECT hackery. So
> case 3 only affects these, and only fuse can be re-exported by nfsd
> (and that's only in -mm yet), hence this is very unlikely to be hit
> for any of the others.

Things that are using invalidate_complete_page2 are probably
also subtly broken if they allow mmap of the same pages, BTW.
It is easy to get wrong. If they have to handle the case of
invalidation failure _anyway_, then we really should have them
just use the safe invalidate...

That would "solve" the splice issue... Although if they handle
failure with a wait/retry loop, then it probably opens a window
to DoS by leaving your pipe filled. In theory one could have a
slowpath function triggered when invalidate fails which copies
the page data and then replaces them with copies in the pipe.
The hard part I suspect is to walk through everybodies pipes and
going through all pages. Probably not realistically solveable.


> But it's bad despite being rare, because once it hits it can cause
> data corruption (like with fuse/nfsd) and could be very hard to debug.
> OK, case 2 could also cause corruption if caller is not expecting it.
>
> Splice is a cool concept, but it isn't easy to get the implementation
> right...

Indeed.

2008-07-07 12:53:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Mon, 7 Jul 2008, Nick Piggin wrote:
> On Monday 07 July 2008 22:03, Miklos Szeredi wrote:
> > Case 3: page is invalidated while in the pipe
> >
> > This can happen on pages in the middle of the file, and splice-out
> > can return a zero count. This is *BAD*, callers of splice really
> > should be able to assume, that a zero return means EOF.
> >
> > Page invalidation (the invalidate_inode_pages2 kind) is done by only a
> > few filesystems (FUSE, NFS, AFS, 9P), and by O_DIRECT hackery. So
> > case 3 only affects these, and only fuse can be re-exported by nfsd
> > (and that's only in -mm yet), hence this is very unlikely to be hit
> > for any of the others.
>
> Things that are using invalidate_complete_page2 are probably
> also subtly broken if they allow mmap of the same pages, BTW.
> It is easy to get wrong. If they have to handle the case of
> invalidation failure _anyway_, then we really should have them
> just use the safe invalidate...

No, if the file changed remotely, then we really want to invalidate
_all_ cached pages.

The only way invalidate_complete_page2() can fail is if the page is
dirty. But we call ->launder_page() for exactly that reason. Now if
->launder_page() leaves the page dirty, that's bad, but that shouldn't
normally happen.

> That would "solve" the splice issue... Although if they handle
> failure with a wait/retry loop, then it probably opens a window
> to DoS by leaving your pipe filled. In theory one could have a
> slowpath function triggered when invalidate fails which copies
> the page data and then replaces them with copies in the pipe.
> The hard part I suspect is to walk through everybodies pipes and
> going through all pages. Probably not realistically solveable.

Right. I think leaving PG_uptodate on invalidation is actually a
rather clean solution compared to the alternatives.

Well, other than my original proposal, which would just have reused
the do_generic_file_read() infrastructure for splice. I still don't
see why we shouldn't use that, until the whole async splice-in thing
is properly figured out.

Miklos

2008-07-07 14:28:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Monday 07 July 2008 22:52, Miklos Szeredi wrote:
> On Mon, 7 Jul 2008, Nick Piggin wrote:
> > On Monday 07 July 2008 22:03, Miklos Szeredi wrote:
> > > Case 3: page is invalidated while in the pipe
> > >
> > > This can happen on pages in the middle of the file, and splice-out
> > > can return a zero count. This is *BAD*, callers of splice really
> > > should be able to assume, that a zero return means EOF.
> > >
> > > Page invalidation (the invalidate_inode_pages2 kind) is done by only a
> > > few filesystems (FUSE, NFS, AFS, 9P), and by O_DIRECT hackery. So
> > > case 3 only affects these, and only fuse can be re-exported by nfsd
> > > (and that's only in -mm yet), hence this is very unlikely to be hit
> > > for any of the others.
> >
> > Things that are using invalidate_complete_page2 are probably
> > also subtly broken if they allow mmap of the same pages, BTW.
> > It is easy to get wrong. If they have to handle the case of
> > invalidation failure _anyway_, then we really should have them
> > just use the safe invalidate...
>
> No, if the file changed remotely, then we really want to invalidate
> _all_ cached pages.
>
> The only way invalidate_complete_page2() can fail is if the page is
> dirty. But we call ->launder_page() for exactly that reason. Now if
> ->launder_page() leaves the page dirty, that's bad, but that shouldn't
> normally happen.

If dirty can't happen, the caller should just use the truncate.
The creation of this "invalidate 2" thing was just papering over
problems in the callers.

But anyway your point is taken -- caller doesn't really handle failure.


> > That would "solve" the splice issue... Although if they handle
> > failure with a wait/retry loop, then it probably opens a window
> > to DoS by leaving your pipe filled. In theory one could have a
> > slowpath function triggered when invalidate fails which copies
> > the page data and then replaces them with copies in the pipe.
> > The hard part I suspect is to walk through everybodies pipes and
> > going through all pages. Probably not realistically solveable.
>
> Right. I think leaving PG_uptodate on invalidation is actually a
> rather clean solution compared to the alternatives.

Note that files can be truncated in the middle too, so you can't
just fix one case that happens to hit you, you'd have to fix things
consistently.

But...


> Well, other than my original proposal, which would just have reused
> the do_generic_file_read() infrastructure for splice. I still don't
> see why we shouldn't use that, until the whole async splice-in thing
> is properly figured out.

Given the alternatives, perhaps this is for the best, at least for
now.

2008-07-07 15:09:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Tue, 8 Jul 2008, Nick Piggin wrote:
> If dirty can't happen, the caller should just use the truncate.
> The creation of this "invalidate 2" thing was just papering over
> problems in the callers.

Dirty *can* happen. The difference between truncate_inode_pages() and
invalidate_inode_pages2() is that the former just throws away dirty
pages, while the latter can do something about them through
->launder_page().

> But anyway your point is taken -- caller doesn't really handle failure.

Yes.

> > Right. I think leaving PG_uptodate on invalidation is actually a
> > rather clean solution compared to the alternatives.
>
> Note that files can be truncated in the middle too, so you can't
> just fix one case that happens to hit you, you'd have to fix things
> consistently.

Hmm, OK.

> But...
>
>
> > Well, other than my original proposal, which would just have reused
> > the do_generic_file_read() infrastructure for splice. I still don't
> > see why we shouldn't use that, until the whole async splice-in thing
> > is properly figured out.
>
> Given the alternatives, perhaps this is for the best, at least for
> now.

Yeah. I'm not at all opposed to improving splice to be able to do all
sorts of fancy things like async splice-in, and stealing of pages.
But it's unlikely that I will have the motivation to implement any of
them just to fix this bug.

Miklos

2008-07-08 02:22:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/2] mm: dont clear PG_uptodate in invalidate_complete_page2()

On Tuesday 08 July 2008 01:08, Miklos Szeredi wrote:
> On Tue, 8 Jul 2008, Nick Piggin wrote:

> > > Well, other than my original proposal, which would just have reused
> > > the do_generic_file_read() infrastructure for splice. I still don't
> > > see why we shouldn't use that, until the whole async splice-in thing
> > > is properly figured out.
> >
> > Given the alternatives, perhaps this is for the best, at least for
> > now.
>
> Yeah. I'm not at all opposed to improving splice to be able to do all
> sorts of fancy things like async splice-in, and stealing of pages.
> But it's unlikely that I will have the motivation to implement any of
> them just to fix this bug.

Yeah. Well then, would you mind having another cut at the patch to
do that? I guess it might help if you don't remove the ->confirm
code -- after fixing the bug then we could discuss what to do with
that code and how we could implement async.

I guess it would be nice to find something that gets a lot of
benefit with the async splicing. Luckily the existing scheme is
workable enough that it would be easy to test a hunch just by
patching it back in...