2006-10-03 17:51:59

by Chip Coldwell

[permalink] [raw]
Subject: direct IO regression in 2.6.18

Hi,

The following commit caused a regression in direct IO:

commit 016eb4a0ed06a3677d67a584da901f0e9a63c666
Author: Andrew Morton <[email protected]>
Date: Fri Sep 8 09:48:38 2006 -0700

[PATCH] invalidate_complete_page() race fix

If a CPU faults this page into pagetables after invalidate_mapping_pages()
checked page_mapped(), invalidate_complete_page() will still proceed to remove
the page from pagecache. This leaves the page-faulting process with a
detached page. If it was MAP_SHARED then file data loss will ensue.

Fix that up by checking the page's refcount after taking tree_lock.

Cc: Nick Piggin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/mm/truncate.c b/mm/truncate.c
index cf1b015..c6ab55e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -68,10 +68,10 @@ invalidate_complete_page(struct address_
return 0;

write_lock_irq(&mapping->tree_lock);
- if (PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
- return 0;
- }
+ if (PageDirty(page))
+ goto failed;
+ if (page_count(page) != 2) /* caller's ref + pagecache ref */
+ goto failed;

BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
@@ -79,6 +79,9 @@ invalidate_complete_page(struct address_
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
+failed:
+ write_unlock_irq(&mapping->tree_lock);
+ return 0;
}

/**


The issue is that invalidate_complete_page is in two different code
paths, the normal one (prune_icache calls invalidate_inode_pages calls
invalidate_mapping_pages calls invalidate_complete_page) and also via
direct IO (generic_file_direct_write calls generic_file_direct_IO
calls invalidate_inode_pages2_range calls invalidate_complete_page).
In the latter case, the page is (usually) not even in the page cache
and the refcount can legitimately != 2.

Chip

--
Charles M. "Chip" Coldwell
Senior Software Engineer
Red Hat, Inc
978-392-2426


2006-10-03 18:04:21

by Andrew Morton

[permalink] [raw]
Subject: Re: direct IO regression in 2.6.18

On Tue, 3 Oct 2006 13:51:49 -0400 (EDT)
Chip Coldwell <[email protected]> wrote:

> Hi,
>
> The following commit caused a regression in direct IO:
>
> commit 016eb4a0ed06a3677d67a584da901f0e9a63c666
> Author: Andrew Morton <[email protected]>
> Date: Fri Sep 8 09:48:38 2006 -0700
>
> [PATCH] invalidate_complete_page() race fix
>
> ...
>
> The issue is that invalidate_complete_page is in two different code
> paths, the normal one (prune_icache calls invalidate_inode_pages calls
> invalidate_mapping_pages calls invalidate_complete_page) and also via
> direct IO (generic_file_direct_write calls generic_file_direct_IO
> calls invalidate_inode_pages2_range calls invalidate_complete_page).
> In the latter case, the page is (usually) not even in the page cache
> and the refcount can legitimately != 2.


The below was merged earlier this week and is tagged for 2.6.18.x.
You just falsified my changelog ;)



From: Andrew Morton <[email protected]>

The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to
unfix invalidate_inode_pages2().

The problem is that various bits of code in the kernel can take transient refs
on pages: the page scanner will do this when inspecting a batch of pages, and
the lru_cache_add() batching pagevecs also hold a ref.

Net result is transient failures in invalidate_inode_pages2(). This affects
NFS directory invalidation (observed) and presumably also block-backed
direct-io (not yet reported).

Fix it by reverting invalidate_inode_pages2() back to the old version which
ignores the page refcounts.

We may come up with something more clever later, but for now we need a 2.6.18
fix for NFS.

Cc: Chuck Lever <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/truncate.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)

diff -puN mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts mm/truncate.c
--- a/mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts
+++ a/mm/truncate.c
@@ -287,9 +287,39 @@ unsigned long invalidate_inode_pages(str
{
return invalidate_mapping_pages(mapping, 0, ~0UL);
}
-
EXPORT_SYMBOL(invalidate_inode_pages);

+/*
+ * This is like invalidate_complete_page(), except it ignores the page's
+ * refcount. We do this because invalidate_inode_pages2() needs stronger
+ * invalidation guarantees, and cannot afford to leave pages behind because
+ * shrink_list() has a temp ref on them, or because they're transiently sitting
+ * in the lru_cache_add() pagevecs.
+ */
+static int
+invalidate_complete_page2(struct address_space *mapping, struct page *page)
+{
+ if (page->mapping != mapping)
+ return 0;
+
+ if (PagePrivate(page) && !try_to_release_page(page, 0))
+ return 0;
+
+ write_lock_irq(&mapping->tree_lock);
+ if (PageDirty(page))
+ goto failed;
+
+ BUG_ON(PagePrivate(page));
+ __remove_from_page_cache(page);
+ write_unlock_irq(&mapping->tree_lock);
+ ClearPageUptodate(page);
+ page_cache_release(page); /* pagecache ref */
+ return 1;
+failed:
+ write_unlock_irq(&mapping->tree_lock);
+ return 0;
+}
+
/**
* invalidate_inode_pages2_range - remove range of pages from an address_space
* @mapping: the address_space
@@ -356,7 +386,7 @@ int invalidate_inode_pages2_range(struct
}
}
was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page(mapping, page)) {
+ if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
ret = -EIO;
_