Date: Fri, 22 Oct 2004 18:17:44 +0200
From: Andrea Arcangeli
To: Andrew Morton
Cc: shaggy@austin.ibm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty

On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> Yeah.  I think the only 100% reliable way to do this is:
>
>	lock_page()
>	unmap_mapping_range(page)
>	ClearPageUptodate(page);
>	invalidate(page);	/* Try to free the thing */
>	unlock_page(page);
>
> (Can do it for a whole range of pages if we always agree to lock pages in
> file-index-ascending order).
>
> but hrm, we don't even have the locking there to prevent do_no_page() from
> reinstantiating the pte immediately after the unmap_mapping_range().
>
> So what to do?
>
> - The patch you originally sent has a race window which can be made
>   nonfatal by removing the BUG check in mpage_writepage().
>
> - NFS should probably be taught to use unmap_mapping_range() regardless
>   of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
>   up.
>
> - invalidate_inode_pages2() isn't successfully invalidating the page if
>   it has buffers.

I did this right in 2.4; somebody obviously broke it in 2.6.  W.r.t.
ext2/ext3 I never heard of a problem there, and 2.4 nukes the buffers
regardless of whether the page is mapped or not (so this isn't a corner
case):

	#define block_invalidate_page(page)	discard_bh_page(page, 0, 0)

		} else {
			if (page->buffers) {
				/* Restart after this page */
				list_del(head);
				list_add_tail(head, curr);

				page_cache_get(page);
				spin_unlock(&pagecache_lock);
				block_invalidate_page(page);
			} else
				unlocked = 0;

			ClearPageDirty(page);

The guarantee is total in 2.4 w.r.t. buffers: at worst it can give ext3 a
heart attack, but it won't risk losing coherency.  Only 2.6 broke the
O_DIRECT vs buffered I/O coherency protocol.  The semantics of
not-uptodate pages mapped into the address space are likewise well
understood and work fine in 2.4.

> This is the biggest problem, because the pages can trivially have
> buffers - just write()ing to them will attach buffers, if they're ext2-
> or ext3-backed.
> It means that a direct-IO write to a section of a file which is mmapped
> causes pagecache and disk to get out of sync.  Question is: do we care,
> or do we say "don't do that"?  Nobody seems to have noticed thus far and
> it's a pretty dopey thing to be doing.

This is a supported feature.  People write with O_DIRECT and read with
buffered reads, and we must support that (think of a database writing and
a tar.gz backup reading).  Coherency is a must after the O_DIRECT write
has completed.  If people run the two operations at the same time the
result is undefined, but when they serialize the I/O at the userspace
level (stop_db(); then take the tar.gz) we must provide the guarantee.

> If we _do_ want to fix it, it seems the simplest approach would be to
> nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
> raced - try again" loop in there to handle the "do_no_page()
> reinstantiated a pte" race.

do_no_page doesn't need to be coherent.  The pte is the last thing we
should care about: the API only guarantees that read/write syscalls are
coherent, mmap is not.  The only bug is in the bh handling, and it has to
be fixed as easily as I did it in 2.4.

> Something like this?  Slow as a dog, but possibly correct ;)
>
> void invalidate_inode_pages2(struct address_space *mapping)
> {
>	struct pagevec pvec;
>	pgoff_t next = 0;
>	int i;
>
>	pagevec_init(&pvec, 0);
>	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>		for (i = 0; i < pagevec_count(&pvec); i++) {
>			struct page *page = pvec.pages[i];
>
>			lock_page(page);
>			if (page->mapping == mapping) {	/* truncate race? */
>				wait_on_page_writeback(page);
>				next = page->index + 1;
>				while (page_mapped(page)) {
>					unmap_mapping_range(mapping,
>						page->index << PAGE_CACHE_SHIFT,
>						page->index << PAGE_CACHE_SHIFT+1,
>						0);
>					clear_page_dirty(page);
>				}
>				invalidate_complete_page(mapping, page);
>			}
>			unlock_page(page);
>		}
>		pagevec_release(&pvec);
>		cond_resched();
>	}
> }

You know, the page_mapped path isn't really very important as long as we
execute both clear_page_dirty and ClearPageUptodate.  We must provide the
guarantee to the syscall.  Leaving not-uptodate pages in the ptes is fine;
the VM can deal with that.  The semantics of that condition are that the
page changed on disk under us via O_DIRECT, so the page isn't uptodate
anymore.  If anything, one can argue that ext3 is not capable of dealing
with PG_uptodate going away from under it, but the VM is capable of that:
there is only one spot that must be aware of it (the zap fix we posted to
start this thread).

BTW, we have these fixes in our tree to make it work:

--- sles/mm/truncate.c.~1~	2004-05-18 19:24:40.000000000 +0200
+++ sles/mm/truncate.c	2004-05-19 02:09:28.311781864 +0200
@@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
 		if (page->mapping == mapping) {	/* truncate race? */
 			wait_on_page_writeback(page);
 			next = page->index + 1;
-			if (page_mapped(page))
+			if (page_mapped(page)) {
+				ClearPageUptodate(page);
 				clear_page_dirty(page);
-			else
+			} else
 				invalidate_complete_page(mapping, page);
 		}
 		unlock_page(page);

and then this fix for BLKFLSBUF:

diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
--- linux-2.6.8/mm/truncate.c	2004-09-21 15:39:37.850379202 +0200
+++ linux-2.6.8.SUSE/mm/truncate.c	2004-09-21 15:39:54.733668309 +0200
@@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
 		write_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
+
+	BUG_ON(PagePrivate(page));
+	if (page_count(page) != 2) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
@@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
 				clear_page_dirty(page);
 				ClearPageUptodate(page);
 			} else {
-				invalidate_complete_page(mapping, page);
+				if (!invalidate_complete_page(mapping,
+							      page)) {
+					clear_page_dirty(page);
+					ClearPageUptodate(page);
+				}
 			}
 		}
 		unlock_page(page);

Apparently those weren't submitted to mainline yet; that may have created
some confusion about why we submitted the third patch, which avoids the
BUG in pdflush and avoids writing to disk a page that is no longer
uptodate.  Do you see why it's wrong to pass down to pdflush a page that
isn't uptodate?  That page must be re-read; it certainly must not be
written out.  The semantics of mapped, not-uptodate pages are very well
defined and they pose no risk to the VM (modulo the bug in pdflush and
this clear improvement in zap).

> The unmapping from pagetables means that invalidate_complete_page() will
> generally successfully remove the page from pagecache.  That mostly fixes
> the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> in pagecache any more so we'll surely redo physical I/O.

The direct-write-of-mmapped-data coherency is the last thing we should
care about.  Even before thinking about direct-write-of-mmapped-data we
must fix direct-write-of-cached-data (assuming the cache is not mmapped).
That is a showstopper bug; the direct-write-of-mmapped-data case is
"undefined" by the Linux API.  Garbage can end up in the pagecache as far
as the kernel is concerned (all we guarantee is that no random page
contents are delivered to userspace).

> But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> it really sucks that we're offering behaviour which only works most of the
> time :(

That doesn't need to ever work; I don't think we even need to add a pte
shootdown, since that behaviour has never been provided by design.  All we
need to do is fix the bh layer, fix up the spots in zap that bug out on
dirty not-uptodate pages, and apply the above patches from this email;
then it'll work as well as 2.4 and it will avoid breaking apps.

Note that this is a very serious matter: what happens is that oracle
writes with O_DIRECT, and when you take a tar.gz of the database the last
block may be corrupt if it's not 512-byte aligned, i.e. if it was a
partial page with partial bhs.  mysql uses O_DIRECT too.

I'm not particularly concerned about ext3; that can be fixed too, or we
can add a new bitflag as you suggested to avoid having to fix the
filesystems to deal with it.  But for the VM I'm comfortable that clearing
the uptodate bit is much simpler and poses no risk.  For the fs you may
very well be right that the BH_reread_from_disk flag might be needed
instead.
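To make the pdflush point concrete, here is a minimal sketch of the kind of
guard a writeback path would need (the function name is made up and this is
not the patch we posted; it only assumes the usual rule that ->writepage is
handed a locked page, and uses clear_page_dirty() as in the code above):

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	/*
	 * Sketch only: if the uptodate bit was cleared by a concurrent
	 * O_DIRECT invalidation, the on-disk copy is newer than the
	 * pagecache copy, so writing the page out would clobber good data
	 * with stale data.  Drop the dirty bit and let a later read or
	 * fault re-read the page instead of BUGging or writing it.
	 */
	static int writepage_skip_not_uptodate(struct page *page,
					       struct writeback_control *wbc)
	{
		if (!PageUptodate(page)) {
			clear_page_dirty(page);
			unlock_page(page);
			return 0;	/* nothing written, not an error */
		}
		/* ... otherwise fall through to the normal writepage path ... */
		unlock_page(page);
		return 0;
	}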
So let's forget the mmapped case and fix the bh screwup in 2.6 first.
Then we can think about the mmapped case.  And to fix the mmapped case,
IMHO all we have to do is:

	clear_page_dirty
	ClearPageUptodate
	unmap_page_range

not the other way around.  The uptodate bit must be cleared before
zapping the pte, to force a reload.
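As a rough sketch of that ordering (a hypothetical helper, not the patch we
submitted; it assumes the caller holds the page lock and uses
unmap_mapping_range() as in the code quoted earlier in this thread):

	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/*
	 * Sketch of the ordering argued for above: the dirty and uptodate
	 * bits are cleared *before* the ptes are zapped, so a fault that
	 * reinstantiates a mapping is forced to re-read the page from disk
	 * instead of reusing stale pagecache contents.
	 */
	static void invalidate_mapped_page(struct address_space *mapping,
					   struct page *page)
	{
		BUG_ON(!PageLocked(page));

		clear_page_dirty(page);		/* nothing here is worth writing back */
		ClearPageUptodate(page);	/* force a re-read on the next access */
		unmap_mapping_range(mapping,
				    (loff_t)page->index << PAGE_CACHE_SHIFT,
				    PAGE_CACHE_SIZE, 0);
	}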