2006-02-17 10:58:00

by Olaf Kirch

[permalink] [raw]
Subject: Problems with mmap consistency

Hi,

we are currently investigating a customer problem with NFS mmap
consistency. They have an application (binary only, oh joy :) that
makes heavy use of a shared mmapped file (it's basically used as a work
queue for a local service).

They experienced frequent data corruption on the file. What happens is
that they establish a lock when updating a portion of the file, which
invalidates all cached data. Now until recently, nfs_revalidate_inode did
not make any attempt to flush out dirty mmapped pages. Andrea Arcangeli
recently submitted a fix to mainline with changes __nfs_revalidate_inode
to do this:

/* Do the page cache invalidation */
if (flags & NFS_INO_INVALID_DATA) {
if (S_ISREG(inode->i_mode)) {
if (mapping_mapped(inode->i_mapping))
unmap_mapping_range(inode->i_mapping, 0, -1, 0);
filemap_write_and_wait(inode->i_mapping);
nfs_wb_all(inode);
}
nfsi->flags &= ~NFS_INO_INVALID_DATA;
invalidate_inode_pages2(inode->i_mapping);

This improved the customer's situation slightly, but they are still
seeing corruption. I put together a test case that demonstrates this
problem within a few seconds.

It seems the problem is that there is still a fairly big race window
in this code - after unmap_mapping_range returns, we're waiting for
all dirty pages to be flushed to the server. A lot can happen while
we wait - for instance, some other process may modify a page that
has already been written. This modification will be lost during
the subsequent invalidate_inode_pages2 call.

Is there a simple VM mechanism that could be used to prevent this,
or does it need some extra logic to block a process in nfs_readpage
while we're revalidating?

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


Attachments:
(No filename) (1.77 kB)
mmapper.c (6.74 kB)
Download all attachments

2006-02-25 17:27:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Sat, Feb 25, 2006 at 09:55:01AM -0500, Trond Myklebust wrote:
> If we don't clear PG_dirty in writepage(), then the VM gets all huffy.
> However, we do not want to start actual writeback until we've got enough
> pages to make it worth our while. For that reason, we track the dirty
> pages in the NFS layer, and start writeout when writepages() is done.

Isn't ->writepages API enough to coalesce writes together? Why do you
assume you will get more data to write by waiting more time? pagecache
layer already waits long enough before calling ->writepages.

I'm unsure if you really gain anything. If you weren't using
->writepages already then your local hack would look more worthwhile to
me.

> How does this differ from a page that has buffers but is "clean"?

That those clean buffers can be discarded without data loss?

Andrew stated the invariant: "a clean page with dirty buffers is
illegal.".

Now with this last flush on try_to_release_page perhaps you can get away
with it, but your problem was going way beyond the VM race.

My fix fixed the VM race only (note that the same race would happen on
ext2 too if you called randomly invalidate_inode_pages2 on any inode,
which btw, surprise is what the new ioctl FIPUNCHMEM does... So let's
say my fix is for FIPUNCHMEM and nfs on VM side.

Your patch fixes an entirely different matter, that is, it tries to
close the holes that you opened by breaking the invariant that a page
can be dirty still but with PG_dirty not set.

So again, those two patches are orthogonal. In the current state of
things both looks to me, though I would like to understand if you really
get a benefit from the local nfs hack around the pagecache invariants.

Thanks.


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 18:42:31

by Trond Myklebust

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Sat, 2006-02-25 at 18:27 +0100, Andrea Arcangeli wrote:
> On Sat, Feb 25, 2006 at 09:55:01AM -0500, Trond Myklebust wrote:
> > If we don't clear PG_dirty in writepage(), then the VM gets all huffy.
> > However, we do not want to start actual writeback until we've got enough
> > pages to make it worth our while. For that reason, we track the dirty
> > pages in the NFS layer, and start writeout when writepages() is done.
>
> Isn't ->writepages API enough to coalesce writes together? Why do you
> assume you will get more data to write by waiting more time? pagecache
> layer already waits long enough before calling ->writepages.
>
> I'm unsure if you really gain anything. If you weren't using
> ->writepages already then your local hack would look more worthwhile to
> me.

We don't wait beyond writepages(). However writepages doesn't protect
you against concurrent truncate calls: the vm seems to assume that the
page lock will. Since our flush mechanism cannot lock the pages (we
need to be able to call it in contexts like "read" where the page is
already locked) we have a potential race.

> > How does this differ from a page that has buffers but is "clean"?
>
> That those clean buffers can be discarded without data loss?
>
> Andrew stated the invariant: "a clean page with dirty buffers is
> illegal.".
>
> Now with this last flush on try_to_release_page perhaps you can get away
> with it, but your problem was going way beyond the VM race.
>
> My fix fixed the VM race only (note that the same race would happen on
> ext2 too if you called randomly invalidate_inode_pages2 on any inode,
> which btw, surprise is what the new ioctl FIPUNCHMEM does... So let's
> say my fix is for FIPUNCHMEM and nfs on VM side.
>
> Your patch fixes an entirely different matter, that is, it tries to
> close the holes that you opened by breaking the invariant that a page
> can be dirty still but with PG_dirty not set.
>
> So again, those two patches are orthogonal. In the current state of
> things both looks to me, though I would like to understand if you really
> get a benefit from the local nfs hack around the pagecache invariants.

In 99% of cases we cannot, and will not use the page cache dirtying
mechanisms since those mechanisms discard all trace of the user context
in which the page was dirtied. In consequence, writepage() simply does
not allow us to track all the context data that we need in order to be
able to safely write out data (i.e. user credentials, file state
context, ...).

The only case where we allow ourselves to rely on writepage() is for
mmapped files, where the whole concept of "user context" is all pretty
blurry anyway.

Cheers,
Trond



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-28 01:04:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Sat, Feb 25, 2006 at 01:42:07PM -0500, Trond Myklebust wrote:
> The only case where we allow ourselves to rely on writepage() is for
> mmapped files, where the whole concept of "user context" is all pretty
> blurry anyway.

Ok. I trust you on the nfs lowlevel part and I think Neil's comment on
this is interesting.

However I want to clarify a few things on those two patches. While you
currently don't relay on the dirty bitflag when it's clear, you
definitely relay on the dirty bitflag when it's set (because your
.set_page_dirty callback is set to __set_page_dirty_nobuffers).


I mean, you don't relay on the dirty bitflag after writepage has been
called and the page is clean, but like every other kernel fs (including
ext2) you relay on it _before_ writepage is invoked. The race my patch
wants to address is a race where a page becomes from dirty to clean,
without writepage being invoked and without the lowlevel fs ever
noticing that the page has been marked dirty.

If you had a lowlevel implementation of .set_page_dirty, you had a
slight chance to fix the race locally to the fs, but you didn't use it,
not even Neil's patch modified set_page_dirty.

So no matter how you change the code inside linux/fs/*, unless you hack
more around .set_page_dirty there's no chance for you to fix the
fs-corruption source that my patch fixed in the VM (in linux/mm/*).

So again those two patches are completely orthogonal, both are needed
and they should be considered separately: they fixes two completely
different bugs. The fact both bugs can lead to a similar kind of fs
corruption now seems just a pure coincidence.

The fact my fix fixed the testcase fully, seems to mean that the
lowlevel race has a much smaller window to trigger and the highlevel
race in the VM was the biggest offender (at least for the testcase that
Olaf posted here, dunno about other workloads).

Thanks.


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-17 15:16:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Fri, 2006-02-17 at 11:57 +0100, Olaf Kirch wrote:
> It seems the problem is that there is still a fairly big race window
> in this code - after unmap_mapping_range returns, we're waiting for
> all dirty pages to be flushed to the server. A lot can happen while
> we wait - for instance, some other process may modify a page that
> has already been written. This modification will be lost during
> the subsequent invalidate_inode_pages2 call.
>
> Is there a simple VM mechanism that could be used to prevent this,
> or does it need some extra logic to block a process in nfs_readpage
> while we're revalidating?

No. I believe I have discussed this with Andrea and others before.
We need to add a VM helper that will invalidate the page cache and flush
dirty pages atomically (such a thing does not yet exist).

Cheers,
Trond



-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-24 04:02:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

Hello,

On Fri, Feb 17, 2006 at 10:15:30AM -0500, Trond Myklebust wrote:
> On Fri, 2006-02-17 at 11:57 +0100, Olaf Kirch wrote:
> > It seems the problem is that there is still a fairly big race window
> > in this code - after unmap_mapping_range returns, we're waiting for
> > all dirty pages to be flushed to the server. A lot can happen while
> > we wait - for instance, some other process may modify a page that
> > has already been written. This modification will be lost during
> > the subsequent invalidate_inode_pages2 call.
> >
> > Is there a simple VM mechanism that could be used to prevent this,
> > or does it need some extra logic to block a process in nfs_readpage
> > while we're revalidating?
>
> No. I believe I have discussed this with Andrea and others before.

Yep.

> We need to add a VM helper that will invalidate the page cache and flush
> dirty pages atomically (such a thing does not yet exist).

Such thing will hardly exist if page faults can mark pages dirty at any
time mostly lockless. Things may change for the better if we start
taking the page lock in do_no_page but I doubt we'll serialize
atomically against the whole disk-bound fdatawriteandwait.

I think it's enough to make invalidate_inode_pages2 non destructive.
Then we can wonder if we should join
unmap_mapping*+fdatasyncandwait+invalidate_inode_pages2 in a single
function (so they look more atomic even if they're not internally), or
something like that, but currently there are reasons why they're split
(notably O_DIRECT needs the invalidate only for writes, for o_direct
reads not).

If a buffered write happens at the same time of a direct write and it's
making pages dirty again after our fdatasyncandwait, there's no
guarantee anyway because the two syscalls are parallel. So I guess we
could just change invalidate_inode_pages2 not to be destructive, then
the race will go away, and an nfs invalidate won't nuke the newly
written data anymore (even if the newly written data is written between
the fdatasyncwait and invalidate_inode_pages2).

Basically this way dirty pages gets higher prio, and they can't be
dropped anymore. They can only be flushed safely to disk. That's enough
for O_DIRECT coherency protocol and it should fix nfs.

I seem to remember I asked explanations on the "was_dirty" code in my
first email on the subject (before noticing you also worked on the nfs
part about at the same time ;), but I didn't get any answer.

I didn't check if this fixes the testcase, patch still untested sorry.

Thanks.

Signed-off-by: Andrea Arcangeli <[email protected]>

Index: sp3/mm/truncate.c
--- sp3/mm/truncate.c.~1~ 2006-02-23 06:31:14.000000000 +0100
+++ sp3/mm/truncate.c 2006-02-24 04:36:02.000000000 +0100
@@ -246,9 +246,14 @@ EXPORT_SYMBOL(invalidate_inode_pages);
* @start: the page offset 'from' which to invalidate
* @end: the page offset 'to' which to invalidate (inclusive)
* Any pages which are found to be mapped into pagetables are unmapped prior to
- * invalidation.
+ * invalidation. invalidate_inode_pages2_range is non destructive and it can't
+ * lose dirty data (if dirty data exists -EIO will be returned). It's up to
+ * the caller to call unmap_mapping_range and filemap_write_and_wait before
+ * invalidate_inode_pages2 if needed.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIO if any pages could not be invalidated. Before returning -EIO
+ * it tries invalidating all pages in the range, it doesn't stop at the first
+ * page invalidation failure.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -262,13 +267,12 @@ int invalidate_inode_pages2_range(struct

pagevec_init(&pvec, 0);
next = start;
- while (next <= end && !ret && !wrapped &&
+ while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
- for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;

lock_page(page);
if (page->mapping != mapping) {
@@ -304,12 +308,9 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+
+ if (!invalidate_complete_page(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-24 06:16:05

by NeilBrown

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Friday February 24, [email protected] wrote:
> Hello,
>
> On Fri, Feb 17, 2006 at 10:15:30AM -0500, Trond Myklebust wrote:
> > On Fri, 2006-02-17 at 11:57 +0100, Olaf Kirch wrote:
> > > It seems the problem is that there is still a fairly big race window
> > > in this code - after unmap_mapping_range returns, we're waiting for
> > > all dirty pages to be flushed to the server. A lot can happen while
> > > we wait - for instance, some other process may modify a page that
> > > has already been written. This modification will be lost during
> > > the subsequent invalidate_inode_pages2 call.

This seems very closely related to a problem I have been chasing. In
my case invalidate_inode_page2 discard pages that are in the process
of writeback, and writeback oops because it expects that page to still
be in the inodes radix_tree, but it isn't.

The core problem is a race between invalidate_inode_pages2 trying to
invalidate pages, and other code making those pages dirty.

> > >
> > > Is there a simple VM mechanism that could be used to prevent this,
> > > or does it need some extra logic to block a process in nfs_readpage
> > > while we're revalidating?
> >
> > No. I believe I have discussed this with Andrea and others before.
>
> Yep.
>
> > We need to add a VM helper that will invalidate the page cache and flush
> > dirty pages atomically (such a thing does not yet exist).

While this would be nice, I don't think it can be done reliably (as
Andrea indicates) because protecting against a page being dirtied is
not possible.

What is *needed* is that we invalidate any part of a page that isn't
dirty. Dirty parts need to be left for write-back, but the non-dirty
parts should be marked as invalid so they will be refreshed from the
server before responding to a read.
The 'discard the page from the page_cache' that
invalidate_inode_pages2 does through invalidate_complete_page isn't
needed. We just need to mark the page as not containing valid data.

I suspect this could be done through a ->releasepage callback.

>
> Such thing will hardly exist if page faults can mark pages dirty at any
> time mostly lockless. Things may change for the better if we start
> taking the page lock in do_no_page but I doubt we'll serialize
> atomically against the whole disk-bound fdatawriteandwait.
>
> I think it's enough to make invalidate_inode_pages2 non destructive.
> Then we can wonder if we should join
> unmap_mapping*+fdatasyncandwait+invalidate_inode_pages2 in a single
> function (so they look more atomic even if they're not internally), or
> something like that, but currently there are reasons why they're split
> (notably O_DIRECT needs the invalidate only for writes, for o_direct
> reads not).

I'm not sure exactly what you mean by 'non destructive'.
Do you mean that pages are not removed but are just marked 'invalid',
or do you mean that if a page is dirty it doesn't get removed.

The later, by itself, is dangerous I think. As a page can be partly
dirty and partly not (for 'write' requests, nfs knows which parts are
dirty. For mmap, it has to assume that it is all dirty), the part that
is not dirty really needs to be invalidated somehow.


A ->readpage already calls nfs_wb_page before starting a read, so if
we define an invalidate_page callback (see patch below from Trond,
modified by me) which calls ClearPageUptodate appropriately, we might
get the desired result. However I'm not entirely sure what the code should
look like as I don't know exactly when a page is marked dirty and when
marked clean as it flows through nfs....

>
> Basically this way dirty pages gets higher prio, and they can't be
> dropped anymore. They can only be flushed safely to disk. That's enough
> for O_DIRECT coherency protocol and it should fix nfs.
>
> I seem to remember I asked explanations on the "was_dirty" code in my
> first email on the subject (before noticing you also worked on the nfs
> part about at the same time ;), but I didn't get any answer.
>
> I didn't check if this fixes the testcase, patch still untested
> sorry.

That patch makes sense to me ... at least in this context. I haven't
looked at other users of the code.

NeilBrown

----------------------------

NFS: Avoid races between writebacks and truncation

Author: Trond Myklebust <[email protected]>

Currently, there is no serialisation between NFS asynchronous writebacks
and truncation at the page level due to the fact that nfs_sync_inode()
cannot lock the pages that it is about to write out.

This means that it is possible to be flushing out data (and calling something
like set_page_writeback()) while the page cache is busy evicting the page.
Oops...

Use the hooks provided in try_to_release_page() to ensure that dirty pages
are always written back to storage before we evict them.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Neil Brown <[email protected]>

Index: linux-2.6.15/fs/nfs/file.c
===================================================================
--- linux-2.6.15.orig/fs/nfs/file.c 2006-02-24 14:56:42.000000000 +1100
+++ linux-2.6.15/fs/nfs/file.c 2006-02-24 14:57:08.000000000 +1100
@@ -316,6 +316,19 @@
return status;
}

+static int nfs_invalidate_page(struct page *page, unsigned long offset)
+{
+ /* FIXME: we really should cancel any unstarted writes on this page */
+ BUG_ON(PagePrivate(page));
+ return 1;
+}
+
+static int nfs_release_page(struct page *page, gfp_t gfp)
+{
+ nfs_wb_page(page->mapping->host, page);
+ return !PagePrivate(page);
+}
+
struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -324,6 +337,8 @@
.writepages = nfs_writepages,
.prepare_write = nfs_prepare_write,
.commit_write = nfs_commit_write,
+ .invalidatepage = nfs_invalidate_page,
+ .releasepage = nfs_release_page,
#ifdef CONFIG_NFS_DIRECTIO
.direct_IO = nfs_direct_IO,
#endif
Index: linux-2.6.15/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.15.orig/fs/nfs/pagelist.c 2006-02-24 14:56:42.000000000 +1100
+++ linux-2.6.15/fs/nfs/pagelist.c 2006-02-24 14:57:08.000000000 +1100
@@ -85,6 +85,7 @@
atomic_set(&req->wb_complete, 0);
req->wb_index = page->index;
page_cache_get(page);
+ SetPagePrivate(page);
req->wb_offset = offset;
req->wb_pgbase = offset;
req->wb_bytes = count;
@@ -147,8 +148,10 @@
*/
void nfs_clear_request(struct nfs_page *req)
{
- if (req->wb_page) {
- page_cache_release(req->wb_page);
+ struct page *page = req->wb_page;
+ if (page != NULL) {
+ ClearPagePrivate(page);
+ page_cache_release(page);
req->wb_page = NULL;
}
}


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-24 07:12:33

by Olaf Kirch

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Fri, Feb 24, 2006 at 05:01:42AM +0100, Andrea Arcangeli wrote:
> > We need to add a VM helper that will invalidate the page cache and flush
> > dirty pages atomically (such a thing does not yet exist).
>
> Such thing will hardly exist if page faults can mark pages dirty at any
> time mostly lockless. Things may change for the better if we start
> taking the page lock in do_no_page but I doubt we'll serialize
> atomically against the whole disk-bound fdatawriteandwait.

One thing I've been thinking about, which would be rather heavy-handed
though, would be to have an rwsem in the mapping; do_no_page would take
a readlock, while nfs would take the write lock when invalidating.
Bu that's really ugly because we'd add this lock for NFS only.

> I think it's enough to make invalidate_inode_pages2 non destructive.

That would be worth trying. I'll give your patch a try.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-24 16:08:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

on fri, feb 24, 2006 at 05:15:25pm +1100, neil brown wrote:
> what is *needed* is that we invalidate any part of a page that isn't
> dirty. dirty parts need to be left for write-back, but the non-dirty

I agree, this is exactly what my patch does. never destry the dirty
info.

> i'm not sure exactly what you mean by 'non destructive'.

Non destructive means you can call it anytime on anything and you're
always safe. that basically means not destroying dirty bits (everything
else can always be discarded in a loss-less way, only discarding dirty
bits is a destructive operation).

> do you mean that pages are not removed but are just marked 'invalid',
> or do you mean that if a page is dirty it doesn't get removed.

Clean pages are still freed hard (from pagecache to freelist), only
pages dirty are left in the cache i.e. the non-destructive behaviour
(while previously they would have been destroyed too, i.e. the
destructive behaviour).

> the later, by itself, is dangerous i think. as a page can be partly
> dirty and partly not (for 'write' requests, nfs knows which parts are
> dirty. for mmap, it has to assume that it is all dirty), the part that
> is not dirty really needs to be invalidated somehow.

That's why we have to call fdatasyncandwait before
invalidate_inode_pages2. that's what is taking care to flush them before
invalidating them. furthermore the lowlevel nfs caches are being taken
care of by the above check in try_to_release_page.

Note also that the dirty bit must be set on the pagecache too, when a
page is dirty and needs writeback. It's not enough to set the lowlevel
caches dirty or pdflush will never see them and periodically flush them.
The writeback layer of the pagecache has no clue on the lowlevel caches
of the pageprivate pages, so the dirty bits must be set in the pagecache
too when writeback data exists in the lowlevel and writepage has to be
called.

The highlevel only makes sure to call writepage on pages with dirty bit
(so then the lowlevel caches can be flushed) and then
try_to_release_page is guaranteed to be called before discarding a page
(just in case the page is busy when we try to free it: should never
happen with nfs unless the dirty bit is set too).

> a ->readpage already calls nfs_wb_page before starting a read, so if
> we define an invalidate_page callback (see patch below from trond,
> modified by me) which calls clearpageuptodate appropriately, we might
> get the desired result. however i'm not entirely sure what the code should
> look like as i don't know exactly when a page is marked dirty and when
> marked clean as it flows through nfs....

I see what your patch does. It's quite equivalent to mine except I
solved the problem in a more VM generic way then a nfs lowlevel way.

Note also that I'm not convinced that try_to_release_page is really a
do_writeback call. Basically the way you workaround the problem without
changing the VM function is to flush the dirty lowlevel cache
inside try_to_release_page. But I suspect try_to_release_page should be
more a "try_to_release" then a "writepage_and_try_to_release". As long
as the dirty bit is always set, it's the VM that takes care of not
freeing the page and calling writepage on it again, you don't have to do
that before the page is destroyed to avoid corruption.

Still implementing the releaspage callback is a good idea, but I think
it would be enough to bugcheck against any data being dirty inside it
with PageDirty not set. As long as PageDirty is set, you don't have to
do anything in releasepage callback.

Overall I think my patch is obviously backwards compatible, and I would
like to see your patch modified with only bugchecks inside releasepage
and invalidate page (or whatever else action needed to free clean
buffers).

While reading the default releasepage callback I was also shoked by this:

spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
if (ret) {
/*
* If the filesystem writes its buffers by hand (eg
* ext3)
* then we can have clean buffers against a dirty page.
* We
* clean the page here; otherwise later reattachment of
* buffers
* could encounter a non-uptodate page, which is
* unresolvable.
* This only applies in the rare case where
* try_to_free_buffers
* succeeds but the page is not freed.
*/
clear_page_dirty(page);
^^^^^^^^^^^^^^^^
}

I can't possibly see how the above is safe. The above even collided with
the previous was_dirty logic. What was the point of was_dirty logic
(besides the fact we don't need to be destructive in the first place),
when try_to_release_page can clear the dirty bit? I really can't see how
the fact we managed to release the bh because they were not busy, has
anything to do with the page being clean (like after a munmap). I think
the above is a lonstanding fs corruption bug and the above
clear_page_dirty should be nuked, but perhaps I'm overlooking
something.... The above path is quite related to our code for the nfs
matter since it's invoked here:

if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;

write_lock_irq(&mapping->tree_lock);
if (PageDirty(page)) {

If the above bogus clear_page_dirty clears the dirty bitflag, what's the
point of checking PageDirty after returning from try_to_release_page?


I'm CC'ing Andrew since this stuff goes way beyond nfs and he knows this
better ;)

Here the relevant links including patches for Andrew:

http://sourceforge.net/mailarchive/message.php?msg_id=14906347
http://sourceforge.net/mailarchive/message.php?msg_id=14906867

PS. I'm offlist so please keep me in CC, thanks.


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-24 23:37:27

by Andrew Morton

[permalink] [raw]
Subject: Re: Problems with mmap consistency

Andrea Arcangeli <[email protected]> wrote:
>
> While reading the default releasepage callback I was also shoked by this:
>
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> if (ret) {
> /*
> * If the filesystem writes its buffers by hand (eg
> * ext3)
> * then we can have clean buffers against a dirty page.
> * We
> * clean the page here; otherwise later reattachment of
> * buffers
> * could encounter a non-uptodate page, which is
> * unresolvable.
> * This only applies in the rare case where
> * try_to_free_buffers
> * succeeds but the page is not freed.
> */
> clear_page_dirty(page);
> ^^^^^^^^^^^^^^^^
> }
>
> I can't possibly see how the above is safe.

The protocol is: a clean page with dirty buffers is illegal. A dirty page
with clean buffers is legal (filesystem cleaned the bh's independently).

This is enforced by:

- If an address_space uses buffer_heads, it must use
a_ops.set_page_dirty=__set_page_dirty_buffers. And
__set_page_dirty_buffers() will dirty the buffer_heads when someone runs
set_page_dirty().

- try_to_free_buffers() is hence allowed to assume that clean buffers
means a fully written-back page, hence try_to_free_buffers() is allowed
to clear the page's dirty bit.

Arguably the fs should be doing this, but the core "knows" about
filesystems which write the buffers directly.

Synchronisation for the page-vs-buffer dirty state is via
mapping->private_lock. try_to_free_buffers() versus
__set_page_dirty_buffers().


I _think_ it's all solid? It's pretty old code now..

We used to have zillions of assertions in there to check that this was all
functioning correctly. I took them out a year or two ago.

I guess we really should retain a check for clean page with dirty
buffers. That's a memory leak and potentially a data loss.

> The above even collided with
> the previous was_dirty logic.

What's the "was_dirty logic"?

> What was the point of was_dirty logic
> (besides the fact we don't need to be destructive in the first place),
> when try_to_release_page can clear the dirty bit? I really can't see how
> the fact we managed to release the bh because they were not busy, has
> anything to do with the page being clean (like after a munmap). I think
> the above is a lonstanding fs corruption bug and the above
> clear_page_dirty should be nuked, but perhaps I'm overlooking
> something.... The above path is quite related to our code for the nfs
> matter since it's invoked here:
>
> if (PagePrivate(page) && !try_to_release_page(page, 0))
> return 0;
>
> write_lock_irq(&mapping->tree_lock);
> if (PageDirty(page)) {

try_to_free_buffers() doesn't check - it assume that there's a valid bh* at
page_private(). It will oops if called with a page which doesn't have
buffers, so NFS cannot get in here at all.



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 00:34:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Fri, Feb 24, 2006 at 03:39:31PM -0800, Andrew Morton wrote:
> a_ops.set_page_dirty=__set_page_dirty_buffers. And
> __set_page_dirty_buffers() will dirty the buffer_heads when someone runs

Ok this is what I temporarily forgot. set_page_dirty doesn't only tags
the tree etc.. it also calls into the lowlevel through
a_ops.set_page_dirty.

> I _think_ it's all solid? It's pretty old code now..

yep sorry.

> > The above even collided with
> > the previous was_dirty logic.
>
> What's the "was_dirty logic"?

this:

- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+
+ if (!invalidate_complete_page(mapping, page))
ret = -EIO;
- }


but everything is ok, there was no collision since it is right calling
clear_page_dirty.

> > (besides the fact we don't need to be destructive in the first place),
> > when try_to_release_page can clear the dirty bit? I really can't see how
> > the fact we managed to release the bh because they were not busy, has
> > anything to do with the page being clean (like after a munmap). I think
> > the above is a lonstanding fs corruption bug and the above
> > clear_page_dirty should be nuked, but perhaps I'm overlooking
> > something.... The above path is quite related to our code for the nfs
> > matter since it's invoked here:
> >
> > if (PagePrivate(page) && !try_to_release_page(page, 0))
> > return 0;
> >
> > write_lock_irq(&mapping->tree_lock);
> > if (PageDirty(page)) {
>
> try_to_free_buffers() doesn't check - it assume that there's a valid bh* at
> page_private(). It will oops if called with a page which doesn't have
> buffers, so NFS cannot get in here at all.

Yes, I meant our code in the sense on the stuff my patch modified.

I got positive confirmation that my patch that makes
invalidate_inode_pages2 non-destructive fixed the problem. At the top of
this thread you can find the testcase used to reproduce the race posted
by Olaf. I'm unsure if Neil's patch is needed, but it certainly could
co-exist. I feel his patch should not execute a writepage inside
try_to_release_page, it's not needed anymore with my fix in place.

Thanks for the enlightenment!


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 00:59:44

by Trond Myklebust

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Sat, 2006-02-25 at 01:33 +0100, Andrea Arcangeli wrote:

> I got positive confirmation that my patch that makes
> invalidate_inode_pages2 non-destructive fixed the problem. At the top of
> this thread you can find the testcase used to reproduce the race posted
> by Olaf. I'm unsure if Neil's patch is needed, but it certainly could
> co-exist. I feel his patch should not execute a writepage inside
> try_to_release_page, it's not needed anymore with my fix in place.

Yes it still is needed. The page doesn't need to have the dirty bit set
in order to actually need writing out (sort of the equivalent of having
buffers in the blocks case).
My patch just ensures that try_to_release_page() actually flushes those
buffers and waits for completion.

Cheers,
Trond



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 01:15:50

by Andrew Morton

[permalink] [raw]
Subject: Re: Problems with mmap consistency

Trond Myklebust <[email protected]> wrote:
>
> On Sat, 2006-02-25 at 01:33 +0100, Andrea Arcangeli wrote:
>
> > I got positive confirmation that my patch that makes
> > invalidate_inode_pages2 non-destructive fixed the problem. At the top of
> > this thread you can find the testcase used to reproduce the race posted
> > by Olaf. I'm unsure if Neil's patch is needed, but it certainly could
> > co-exist. I feel his patch should not execute a writepage inside
> > try_to_release_page, it's not needed anymore with my fix in place.
>
> Yes it still is needed. The page doesn't need to have the dirty bit set
> in order to actually need writing out (sort of the equivalent of having
> buffers in the blocks case).

eh? I don't know what's being talked about here, but that doesn't sound
right. Taking the block-backed protocol as an example, if a page isn't
dirty it shouldn't be written out. What _is_ legal is a page which is
dirty but doesn't need writing out. You've gone and invented the converse,
illegal case here.

> My patch just ensures that try_to_release_page() actually flushes those
> buffers and waits for completion.

<what patch?>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 05:00:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Fri, Feb 24, 2006 at 05:17:52PM -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > On Sat, 2006-02-25 at 01:33 +0100, Andrea Arcangeli wrote:
> >
> > > I got positive confirmation that my patch that makes
> > > invalidate_inode_pages2 non-destructive fixed the problem. At the top of
> > > this thread you can find the testcase used to reproduce the race posted
> > > by Olaf. I'm unsure if Neil's patch is needed, but it certainly could
> > > co-exist. I feel his patch should not execute a writepage inside
> > > try_to_release_page, it's not needed anymore with my fix in place.
> >
> > Yes it still is needed. The page doesn't need to have the dirty bit set
> > in order to actually need writing out (sort of the equivalent of having
> > buffers in the blocks case).
>
> eh? I don't know what's being talked about here, but that doesn't sound
> right. Taking the block-backed protocol as an example, if a page isn't
> dirty it shouldn't be written out. What _is_ legal is a page which is
> dirty but doesn't need writing out. You've gone and invented the converse,
> illegal case here.

Agreed, this is what I was trying to say too in the previous emails in
answer to Neil (and it was confirmed by your last email too). It doesn't
sound sane that a page has dirty-lowlevel data but PG_dirty is not set.
try_to_release_page isn't ->writepage and if PG_dirty isn't set it would
be hiding those pages to pdflush (pdflush wouldn't see the dirty bit and
it wouldn't writeback periodically).

try_to_release_page can return failure without doing writeback if it
finds dirty data in the lowlevel caches (but even that isn't strictly
necessary as long as the PG_dirty bit is set correctly).

Now the fact my patch fixed the problem, makes me quite optimistic nfs
is currently doing the right thing in setting PG_dirty, but perhaps the
testcase isn't stressing all races...

> > My patch just ensures that try_to_release_page() actually flushes those
> > buffers and waits for completion.
>
> <what patch?>

The one that Neil posted, it was the second URL (the first url is my
proposed patch):

http://sourceforge.net/mailarchive/message.php?msg_id=14906347
http://sourceforge.net/mailarchive/message.php?msg_id=14906867


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-02-25 14:55:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: Problems with mmap consistency

On Sat, 2006-02-25 at 05:59 +0100, Andrea Arcangeli wrote:
> > eh? I don't know what's being talked about here, but that doesn't sound
> > right. Taking the block-backed protocol as an example, if a page isn't
> > dirty it shouldn't be written out. What _is_ legal is a page which is
> > dirty but doesn't need writing out. You've gone and invented the converse,
> > illegal case here.
>
> Agreed, this is what I was trying to say too in the previous emails in
> answer to Neil (and it was confirmed by your last email too). It doesn't
> sound sane that a page has dirty-lowlevel data but PG_dirty is not set.
> try_to_release_page isn't ->writepage and if PG_dirty isn't set it would
> be hiding those pages to pdflush (pdflush wouldn't see the dirty bit and
> it wouldn't writeback periodically).

If we don't clear PG_dirty in writepage(), then the VM gets all huffy.
However, we do not want to start actual writeback until we've got enough
pages to make it worth our while. For that reason, we track the dirty
pages in the NFS layer, and start writeout when writepages() is done.

How does this differ from a page that has buffers but is "clean"?

Cheers
Trond



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs