2005-12-22 17:56:00

by Andrea Arcangeli

[permalink] [raw]
Subject: nfs invalidates lose pte dirty bits

Hello Andrew,

this is yet another problem exposed by the new invalidate_inode_pages2
semantics.

If a mapping has dirty data (pte dirty and page clean), and an
invalidate triggers, the filemap_write_and_wait will do nothing, but the
invalidate_inode_pages2 will destroy the pte dirty bit and discard the
pagecache.

This is an attempted untested fix, however I'm not sure if it's enough
because we'd be still losing some dirty bit if the application keeps
touching the mapping in between the unmap_mapping_range and the
invalidate_inode_pages2. So perhaps this should be combined with a non
destructive version of invalidate_inode_pages2 that will simply skip
over any dirty page (because we have the guarantee that the dirty
information has been generated by userspace after the
filemap_write_and_wait). Invalidates may be spurious, it's not like with
O_DIRECT were we're guaranteed we overwritten the stuff in the atomic
context of the write() syscall.

However for the application being tested I think this should be enough,
since it looked single threaded.

An incremental patch could change the call to invalidate_inode_pages2 to
something else like invalidate_inode_clean_pages so that multithreaded
programs won't risk to lose dirty bits in the ptes too.

Or you may change your mind and we can go back to a much simpler relaxed
semantic of invalidates, where people notices changes in mapping only
after an munmap (like 2.4), that only requires allowing PG_uptodate
pages to be mapped in userspace like 2.4 (you know I tend to prefer
those semantics, even though they're not even close to distributed
shared memory ;).

I'm also wondering why you added this -EIO in invalidate_inode_pages2:

if (!invalidate_complete_page(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
ret = -EIO;
}

When -EIO failures triggers because of a racy invalidate, the db write
will get -EIO as retval. Especially the PageDirty check in
invalidate_complete_page could return 0, but it doesn't make sense to
check the PageDirty bit inside invalidate_inode_pages2, and we're not
required to do the set_page_dirty either, the whole point of
invalidate_inode_pages2 is to destroy dirty bits. We should go ahead and
never returned -EIO there if the page was found dirty again, right? I
guess it's ok but it doesn't make much sense and I'd rather avoid -EIO
unless they're strictly necessary, because every time a db gets a -EIO
we get a critical bugreport.

Only a potential invalidate_inode_clean_pages for nfs invalidates, would
have to preserve the dirty bits and never call test_clear_page_dirty in
the first place.

Thanks!

Signed-off-by: Andrea Arcangeli <[email protected]>

diff -r 994d5b26cde2 fs/nfs/inode.c
--- a/fs/nfs/inode.c Mon Dec 19 23:00:04 2005 +0000
+++ b/fs/nfs/inode.c Thu Dec 22 18:37:25 2005 +0100
@@ -1180,8 +1180,9 @@

if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
if (S_ISREG(inode->i_mode)) {
- if (filemap_fdatawrite(mapping) == 0)
- filemap_fdatawait(mapping);
+ if (mapping_mapped(inode->i_mapping))
+ unmap_mapping_range(inode->i_mapping, 0, -1, 0);
+ filemap_write_and_wait(inode->i_mapping);
nfs_wb_all(inode);
}
invalidate_inode_pages2(mapping);


2005-12-22 23:30:57

by Trond Myklebust

[permalink] [raw]
Subject: Re: nfs invalidates lose pte dirty bits

On Thu, 2005-12-22 at 18:55 +0100, Andrea Arcangeli wrote:
> Hello Andrew,
>
> this is yet another problem exposed by the new invalidate_inode_pages2
> semantics.
>
> If a mapping has dirty data (pte dirty and page clean), and an
> invalidate triggers, the filemap_write_and_wait will do nothing, but the
> invalidate_inode_pages2 will destroy the pte dirty bit and discard the
> pagecache.

See the latest git release where we introduce the nfs_sync_mapping()
helper.

Cheers,
Trond

2005-12-23 02:36:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: nfs invalidates lose pte dirty bits

On Thu, Dec 22, 2005 at 06:30:49PM -0500, Trond Myklebust wrote:
> See the latest git release where we introduce the nfs_sync_mapping()
> helper.

So you also still break completely with threaded programs, did you
consider that while fixing the most obvious problem? Isn't that a
problem too? What about my suggestion of invalidate_inode_clean_pages?

Still I wonder if messing with ptes for invalidates is worth it.

2005-12-23 08:42:08

by Trond Myklebust

[permalink] [raw]
Subject: Re: nfs invalidates lose pte dirty bits

On Fri, 2005-12-23 at 03:36 +0100, Andrea Arcangeli wrote:
> On Thu, Dec 22, 2005 at 06:30:49PM -0500, Trond Myklebust wrote:
> > See the latest git release where we introduce the nfs_sync_mapping()
> > helper.
>
> So you also still break completely with threaded programs, did you
> consider that while fixing the most obvious problem? Isn't that a
> problem too? What about my suggestion of invalidate_inode_clean_pages?

It is only a problem when doing mmap writes. In the case of ordinary
writes, NFS has to do its own tracking of dirty pages, and so it can
ensure that no race exists.

However if the user is doing mmap writes while the file is in the
process of being modified on the server, then they are doing something
wrong anyway. The small race between nfs_sync_mapping() and
invalidate_inode_pages2() is the least of their problems.

Cheers,
Trond

2005-12-23 15:17:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: nfs invalidates lose pte dirty bits

On Fri, Dec 23, 2005 at 09:41:55AM +0100, Trond Myklebust wrote:
> On Fri, 2005-12-23 at 03:36 +0100, Andrea Arcangeli wrote:
> > On Thu, Dec 22, 2005 at 06:30:49PM -0500, Trond Myklebust wrote:
> > > See the latest git release where we introduce the nfs_sync_mapping()
> > > helper.
> >
> > So you also still break completely with threaded programs, did you
> > consider that while fixing the most obvious problem? Isn't that a
> > problem too? What about my suggestion of invalidate_inode_clean_pages?
>
> It is only a problem when doing mmap writes. In the case of ordinary

Yes, those changes are all about mmap writes.

> However if the user is doing mmap writes while the file is in the
> process of being modified on the server, then they are doing something
> wrong anyway. The small race between nfs_sync_mapping() and
> invalidate_inode_pages2() is the least of their problems.

I'm talking about spurious revalidates, I don't think the testcase I'm
dealing with is really needing an invalidate, it's a spurious one
(perhaps triggered by flock), but I'm lucky it's single threaded so
current fix will work for them.

2005-12-23 21:03:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: nfs invalidates lose pte dirty bits

On Fri, 2005-12-23 at 16:17 +0100, Andrea Arcangeli wrote:
> > However if the user is doing mmap writes while the file is in the
> > process of being modified on the server, then they are doing something
> > wrong anyway. The small race between nfs_sync_mapping() and
> > invalidate_inode_pages2() is the least of their problems.
>
> I'm talking about spurious revalidates, I don't think the testcase I'm
> dealing with is really needing an invalidate, it's a spurious one
> (perhaps triggered by flock), but I'm lucky it's single threaded so
> current fix will work for them.

We should perhaps think of setting up an
"unmap_write_dirty_and_invalidate" helper in order to deal with this
sort of race. As long as we have to first unmap all the pages, then
write the dirty ones, then invalidate them, we will be open to races.

Cheers,
Trond