2007-02-10 02:31:27

by Nick Piggin

[permalink] [raw]
Subject: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem (try 3)

OK, I have got rid of SetPageUptodate_nowarn, and removed the atomic op
from SetNewPageUptodate. Made PageUptodate_NoLock only issue the memory
barrier is the page was uptodate (hopefully the compiler can thread the
branch into the caller's branch).

SetNewPageUptodate does not do the S390 page_test_and_clear_dirty, so
I'd like to make sure that's OK.

Rearranged the patch series so we don't have the first patch introducing
a lot of WARN_ONs that are solved in the next two patches (rather, solve
those issues first).

Thanks,
Nick

--
SuSE Labs


2007-02-10 02:31:48

by Nick Piggin

[permalink] [raw]
Subject: [patch 1/3] mm: make read_cache_page synchronous

Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.

I didn't have a great look down the call chains, but this appears to fixes 7
possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs,
1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd.
All depending on whether the filler is async and/or can return with a !uptodate
page.

Signed-off-by: Nick Piggin <[email protected]>

drivers/mtd/devices/block2mtd.c | 3 --
fs/afs/dir.c | 3 --
fs/afs/mntpt.c | 11 +++------
fs/cramfs/inode.c | 3 +-
fs/ecryptfs/mmap.c | 10 --------
fs/ext2/dir.c | 3 --
fs/freevxfs/vxfs_subr.c | 3 --
fs/minix/dir.c | 1
fs/namei.c | 12 ----------
fs/nfs/dir.c | 5 ----
fs/nfs/symlink.c | 6 -----
fs/ntfs/aops.h | 3 --
fs/ntfs/attrib.c | 18 +--------------
fs/ntfs/file.c | 3 --
fs/ntfs/super.c | 30 +++----------------------
fs/ocfs2/symlink.c | 7 -----
fs/partitions/check.c | 3 --
fs/reiserfs/xattr.c | 4 ---
fs/sysv/dir.c | 10 --------
fs/ufs/dir.c | 6 -----
fs/ufs/util.c | 6 +----
include/linux/pagemap.h | 11 +++++++++
mm/filemap.c | 48 ++++++++++++++++++++++++++++++----------
mm/swapfile.c | 3 --
24 files changed, 68 insertions(+), 144 deletions(-)

Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c
+++ linux-2.6/fs/afs/dir.c
@@ -187,10 +187,7 @@ static struct page *afs_dir_get_page(str

page = read_mapping_page(dir->i_mapping, index, NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
- if (!PageUptodate(page))
- goto fail;
if (!PageChecked(page))
afs_dir_check_page(dir, page);
if (PageError(page))
Index: linux-2.6/fs/afs/mntpt.c
===================================================================
--- linux-2.6.orig/fs/afs/mntpt.c
+++ linux-2.6/fs/afs/mntpt.c
@@ -77,13 +77,11 @@ int afs_mntpt_check_symlink(struct afs_v
}

ret = -EIO;
- wait_on_page_locked(page);
- buf = kmap(page);
- if (!PageUptodate(page))
- goto out_free;
if (PageError(page))
goto out_free;

+ buf = kmap(page);
+
/* examine the symlink's contents */
size = vnode->status.size;
_debug("symlink to %*.*s", size, (int) size, buf);
@@ -100,8 +98,8 @@ int afs_mntpt_check_symlink(struct afs_v

ret = 0;

- out_free:
kunmap(page);
+ out_free:
page_cache_release(page);
out:
_leave(" = %d", ret);
@@ -184,8 +182,7 @@ static struct vfsmount *afs_mntpt_do_aut
}

ret = -EIO;
- wait_on_page_locked(page);
- if (!PageUptodate(page) || PageError(page))
+ if (PageError(page))
goto error;

buf = kmap(page);
Index: linux-2.6/fs/cramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/cramfs/inode.c
+++ linux-2.6/fs/cramfs/inode.c
@@ -180,7 +180,8 @@ static void *cramfs_read(struct super_bl
struct page *page = NULL;

if (blocknr + i < devsize) {
- page = read_mapping_page(mapping, blocknr + i, NULL);
+ page = read_mapping_page_async(mapping, blocknr + i,
+ NULL);
/* synchronous error? */
if (IS_ERR(page))
page = NULL;
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -161,10 +161,7 @@ static struct page * ext2_get_page(struc
struct address_space *mapping = dir->i_mapping;
struct page *page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
- if (!PageUptodate(page))
- goto fail;
if (!PageChecked(page))
ext2_check_page(page);
if (PageError(page))
Index: linux-2.6/fs/freevxfs/vxfs_subr.c
===================================================================
--- linux-2.6.orig/fs/freevxfs/vxfs_subr.c
+++ linux-2.6/fs/freevxfs/vxfs_subr.c
@@ -74,10 +74,7 @@ vxfs_get_page(struct address_space *mapp
pp = read_mapping_page(mapping, n, NULL);

if (!IS_ERR(pp)) {
- wait_on_page_locked(pp);
kmap(pp);
- if (!PageUptodate(pp))
- goto fail;
/** if (!PageChecked(pp)) **/
/** vxfs_check_page(pp); **/
if (PageError(pp))
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1746,7 +1746,7 @@ int generic_file_readonly_mmap(struct fi
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);

-static inline struct page *__read_cache_page(struct address_space *mapping,
+static struct page *__read_cache_page(struct address_space *mapping,
unsigned long index,
int (*filler)(void *,struct page*),
void *data)
@@ -1783,17 +1783,11 @@ repeat:
return page;
}

-/**
- * read_cache_page - read into page cache, fill it if needed
- * @mapping: the page's address_space
- * @index: the page index
- * @filler: function to perform the read
- * @data: destination for read data
- *
- * Read into the page cache. If a page already exists,
- * and PageUptodate() is not set, try to fill the page.
+/*
+ * Same as read_cache_page, but don't wait for page to become unlocked
+ * after submitting it to the filler.
*/
-struct page *read_cache_page(struct address_space *mapping,
+struct page *read_cache_page_async(struct address_space *mapping,
unsigned long index,
int (*filler)(void *,struct page*),
void *data)
@@ -1825,6 +1819,37 @@ retry:
page = ERR_PTR(err);
}
out:
+ mark_page_accessed(page);
+ return page;
+}
+EXPORT_SYMBOL(read_cache_page_async);
+
+/**
+ * read_cache_page - read into page cache, fill it if needed
+ * @mapping: the page's address_space
+ * @index: the page index
+ * @filler: function to perform the read
+ * @data: destination for read data
+ *
+ * Read into the page cache. If a page already exists,
+ * and PageUptodate() is not set, try to fill the page.
+ */
+struct page *read_cache_page(struct address_space *mapping,
+ unsigned long index,
+ int (*filler)(void *,struct page*),
+ void *data)
+{
+ struct page *page;
+
+ page = read_cache_page_async(mapping, index, filler, data);
+ if (IS_ERR(page))
+ goto out;
+ wait_on_page_locked(page);
+ if (!PageUptodate_NoLock(page)) {
+ page_cache_release(page);
+ page = ERR_PTR(-EIO);
+ }
+ out:
return page;
}
EXPORT_SYMBOL(read_cache_page);
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -62,7 +62,6 @@ static struct page * dir_get_page(struct
struct address_space *mapping = dir->i_mapping;
struct page *page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
if (!PageUptodate(page))
goto fail;
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2639,19 +2639,9 @@ static char *page_getlink(struct dentry
struct address_space *mapping = dentry->d_inode->i_mapping;
page = read_mapping_page(mapping, 0, NULL);
if (IS_ERR(page))
- goto sync_fail;
- wait_on_page_locked(page);
- if (!PageUptodate(page))
- goto async_fail;
+ return (char*)page;
*ppage = page;
return kmap(page);
-
-async_fail:
- page_cache_release(page);
- return ERR_PTR(-EIO);
-
-sync_fail:
- return (char*)page;
}

int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
Index: linux-2.6/fs/ntfs/aops.h
===================================================================
--- linux-2.6.orig/fs/ntfs/aops.h
+++ linux-2.6/fs/ntfs/aops.h
@@ -89,9 +89,8 @@ static inline struct page *ntfs_map_page
struct page *page = read_mapping_page(mapping, index, NULL);

if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
- if (PageUptodate(page) && !PageError(page))
+ if (!PageError(page))
return page;
ntfs_unmap_page(page);
return ERR_PTR(-EIO);
Index: linux-2.6/fs/ntfs/attrib.c
===================================================================
--- linux-2.6.orig/fs/ntfs/attrib.c
+++ linux-2.6/fs/ntfs/attrib.c
@@ -2532,14 +2532,7 @@ int ntfs_attr_set(ntfs_inode *ni, const
page = read_mapping_page(mapping, idx, NULL);
if (IS_ERR(page)) {
ntfs_error(vol->sb, "Failed to read first partial "
- "page (sync error, index 0x%lx).", idx);
- return PTR_ERR(page);
- }
- wait_on_page_locked(page);
- if (unlikely(!PageUptodate(page))) {
- ntfs_error(vol->sb, "Failed to read first partial page "
- "(async error, index 0x%lx).", idx);
- page_cache_release(page);
+ "page (error, index 0x%lx).", idx);
return PTR_ERR(page);
}
/*
@@ -2602,14 +2595,7 @@ int ntfs_attr_set(ntfs_inode *ni, const
page = read_mapping_page(mapping, idx, NULL);
if (IS_ERR(page)) {
ntfs_error(vol->sb, "Failed to read last partial page "
- "(sync error, index 0x%lx).", idx);
- return PTR_ERR(page);
- }
- wait_on_page_locked(page);
- if (unlikely(!PageUptodate(page))) {
- ntfs_error(vol->sb, "Failed to read last partial page "
- "(async error, index 0x%lx).", idx);
- page_cache_release(page);
+ "(error, index 0x%lx).", idx);
return PTR_ERR(page);
}
kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c
+++ linux-2.6/fs/ntfs/file.c
@@ -236,8 +236,7 @@ do_non_resident_extend:
err = PTR_ERR(page);
goto init_err_out;
}
- wait_on_page_locked(page);
- if (unlikely(!PageUptodate(page) || PageError(page))) {
+ if (unlikely(PageError(page))) {
page_cache_release(page);
err = -EIO;
goto init_err_out;
Index: linux-2.6/fs/ocfs2/symlink.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/symlink.c
+++ linux-2.6/fs/ocfs2/symlink.c
@@ -67,16 +67,9 @@ static char *ocfs2_page_getlink(struct d
page = read_mapping_page(mapping, 0, NULL);
if (IS_ERR(page))
goto sync_fail;
- wait_on_page_locked(page);
- if (!PageUptodate(page))
- goto async_fail;
*ppage = page;
return kmap(page);

-async_fail:
- page_cache_release(page);
- return ERR_PTR(-EIO);
-
sync_fail:
return (char*)page;
}
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c
+++ linux-2.6/fs/partitions/check.c
@@ -561,9 +561,6 @@ unsigned char *read_dev_sector(struct bl
page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
- if (!PageUptodate(page))
- goto fail;
if (PageError(page))
goto fail;
p->v = page;
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c
+++ linux-2.6/fs/reiserfs/xattr.c
@@ -454,11 +454,7 @@ static struct page *reiserfs_get_page(st
mapping_set_gfp_mask(mapping, GFP_NOFS);
page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
- if (!PageUptodate(page))
- goto fail;
-
if (PageError(page))
goto fail;
}
Index: linux-2.6/fs/sysv/dir.c
===================================================================
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -54,17 +54,9 @@ static struct page * dir_get_page(struct
{
struct address_space *mapping = dir->i_mapping;
struct page *page = read_mapping_page(mapping, n, NULL);
- if (!IS_ERR(page)) {
- wait_on_page_locked(page);
+ if (!IS_ERR(page))
kmap(page);
- if (!PageUptodate(page))
- goto fail;
- }
return page;
-
-fail:
- dir_put_page(page);
- return ERR_PTR(-EIO);
}

static int sysv_readdir(struct file * filp, void * dirent, filldir_t filldir)
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1531,9 +1531,6 @@ asmlinkage long sys_swapon(const char __
error = PTR_ERR(page);
goto bad_swap;
}
- wait_on_page_locked(page);
- if (!PageUptodate(page))
- goto bad_swap;
kmap(page);
swap_header = page_address(page);

Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -88,9 +88,8 @@ static void cache_readahead(struct addre

static struct page* page_readahead(struct address_space *mapping, int index)
{
- filler_t *filler = (filler_t*)mapping->a_ops->readpage;
cache_readahead(mapping, index);
- return read_cache_page(mapping, index, filler, NULL);
+ return read_mapping_page(mapping, index, NULL);
}


Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -54,14 +54,7 @@ static struct page *ecryptfs_get1page(st
dentry = file->f_path.dentry;
inode = dentry->d_inode;
mapping = inode->i_mapping;
- page = read_cache_page(mapping, index,
- (filler_t *)mapping->a_ops->readpage,
- (void *)file);
- if (IS_ERR(page))
- goto out;
- wait_on_page_locked(page);
-out:
- return page;
+ return read_mapping_page(mapping, index, (void *)file);
}

static
@@ -233,7 +226,6 @@ int ecryptfs_do_readpage(struct file *fi
ecryptfs_printk(KERN_ERR, "Error reading from page cache\n");
goto out;
}
- wait_on_page_locked(lower_page);
page_data = (char *)kmap(page);
if (!page_data) {
rc = -ENOMEM;
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -322,8 +322,6 @@ int find_dirent_page(nfs_readdir_descrip
status = PTR_ERR(page);
goto out;
}
- if (!PageUptodate(page))
- goto read_error;

/* NOTE: Someone else may have changed the READDIRPLUS flag */
desc->page = page;
@@ -337,9 +335,6 @@ int find_dirent_page(nfs_readdir_descrip
out:
dfprintk(DIRCACHE, "NFS: %s: returns %d\n", __FUNCTION__, status);
return status;
- read_error:
- page_cache_release(page);
- return -EIO;
}

/*
Index: linux-2.6/fs/nfs/symlink.c
===================================================================
--- linux-2.6.orig/fs/nfs/symlink.c
+++ linux-2.6/fs/nfs/symlink.c
@@ -61,15 +61,9 @@ static void *nfs_follow_link(struct dent
err = page;
goto read_failed;
}
- if (!PageUptodate(page)) {
- err = ERR_PTR(-EIO);
- goto getlink_read_error;
- }
nd_set_link(nd, kmap(page));
return page;

-getlink_read_error:
- page_cache_release(page);
read_failed:
nd_set_link(nd, err);
return NULL;
Index: linux-2.6/fs/ntfs/super.c
===================================================================
--- linux-2.6.orig/fs/ntfs/super.c
+++ linux-2.6/fs/ntfs/super.c
@@ -2471,7 +2471,6 @@ static s64 get_nr_free_clusters(ntfs_vol
s64 nr_free = vol->nr_clusters;
u32 *kaddr;
struct address_space *mapping = vol->lcnbmp_ino->i_mapping;
- filler_t *readpage = (filler_t*)mapping->a_ops->readpage;
struct page *page;
pgoff_t index, max_index;

@@ -2494,24 +2493,14 @@ static s64 get_nr_free_clusters(ntfs_vol
* Read the page from page cache, getting it from backing store
* if necessary, and increment the use count.
*/
- page = read_cache_page(mapping, index, (filler_t*)readpage,
- NULL);
+ page = read_mapping_page(mapping, index, NULL);
/* Ignore pages which errored synchronously. */
if (IS_ERR(page)) {
- ntfs_debug("Sync read_cache_page() error. Skipping "
+ ntfs_debug("read_mapping_page() error. Skipping "
"page (index 0x%lx).", index);
nr_free -= PAGE_CACHE_SIZE * 8;
continue;
}
- wait_on_page_locked(page);
- /* Ignore pages which errored asynchronously. */
- if (!PageUptodate(page)) {
- ntfs_debug("Async read_cache_page() error. Skipping "
- "page (index 0x%lx).", index);
- page_cache_release(page);
- nr_free -= PAGE_CACHE_SIZE * 8;
- continue;
- }
kaddr = (u32*)kmap_atomic(page, KM_USER0);
/*
* For each 4 bytes, subtract the number of set bits. If this
@@ -2562,7 +2551,6 @@ static unsigned long __get_nr_free_mft_r
{
u32 *kaddr;
struct address_space *mapping = vol->mftbmp_ino->i_mapping;
- filler_t *readpage = (filler_t*)mapping->a_ops->readpage;
struct page *page;
pgoff_t index;

@@ -2576,21 +2564,11 @@ static unsigned long __get_nr_free_mft_r
* Read the page from page cache, getting it from backing store
* if necessary, and increment the use count.
*/
- page = read_cache_page(mapping, index, (filler_t*)readpage,
- NULL);
+ page = read_mapping_page(mapping, index, NULL);
/* Ignore pages which errored synchronously. */
if (IS_ERR(page)) {
- ntfs_debug("Sync read_cache_page() error. Skipping "
- "page (index 0x%lx).", index);
- nr_free -= PAGE_CACHE_SIZE * 8;
- continue;
- }
- wait_on_page_locked(page);
- /* Ignore pages which errored asynchronously. */
- if (!PageUptodate(page)) {
- ntfs_debug("Async read_cache_page() error. Skipping "
+ ntfs_debug("read_mapping_page() error. Skipping "
"page (index 0x%lx).", index);
- page_cache_release(page);
nr_free -= PAGE_CACHE_SIZE * 8;
continue;
}
Index: linux-2.6/fs/ufs/dir.c
===================================================================
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -180,13 +180,9 @@ fail:
static struct page *ufs_get_page(struct inode *dir, unsigned long n)
{
struct address_space *mapping = dir->i_mapping;
- struct page *page = read_cache_page(mapping, n,
- (filler_t*)mapping->a_ops->readpage, NULL);
+ struct page *page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
- wait_on_page_locked(page);
kmap(page);
- if (!PageUptodate(page))
- goto fail;
if (!PageChecked(page))
ufs_check_page(page);
if (PageError(page))
Index: linux-2.6/fs/ufs/util.c
===================================================================
--- linux-2.6.orig/fs/ufs/util.c
+++ linux-2.6/fs/ufs/util.c
@@ -251,13 +251,11 @@ struct page *ufs_get_locked_page(struct

page = find_lock_page(mapping, index);
if (!page) {
- page = read_cache_page(mapping, index,
- (filler_t*)mapping->a_ops->readpage,
- NULL);
+ page = read_mapping_page(mapping, index, NULL);

if (IS_ERR(page)) {
printk(KERN_ERR "ufs_change_blocknr: "
- "read_cache_page error: ino %lu, index: %lu\n",
+ "read_mapping_page error: ino %lu, index: %lu\n",
mapping->host->i_ino, index);
goto out;
}
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -97,12 +97,23 @@ static inline struct page *grab_cache_pa

extern struct page * grab_cache_page_nowait(struct address_space *mapping,
unsigned long index);
+extern struct page * read_cache_page_async(struct address_space *mapping,
+ unsigned long index, filler_t *filler,
+ void *data);
extern struct page * read_cache_page(struct address_space *mapping,
unsigned long index, filler_t *filler,
void *data);
extern int read_cache_pages(struct address_space *mapping,
struct list_head *pages, filler_t *filler, void *data);

+static inline struct page *read_mapping_page_async(
+ struct address_space *mapping,
+ unsigned long index, void *data)
+{
+ filler_t *filler = (filler_t *)mapping->a_ops->readpage;
+ return read_cache_page_async(mapping, index, filler, data);
+}
+
static inline struct page *read_mapping_page(struct address_space *mapping,
unsigned long index, void *data)
{

2007-02-10 02:31:50

by Nick Piggin

[permalink] [raw]
Subject: [patch 2/3] fs: buffer don't PageUptodate without page locked

__block_write_full_page is calling SetPageUptodate without the page locked.
This is unusual, but not incorrect, as PG_writeback is still set.

However the next patch will require that SetPageUptodate always be called
with the page locked. Simply don't bother setting the page uptodate in this
case (it is unusual that the write path does such a thing anyway). Instead
just leave it to the read side to bring the page uptodate when it notices
that all buffers are uptodate.

Signed-off-by: Nick Piggin <[email protected]>

fs/buffer.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1698,17 +1698,8 @@ done:
* clean. Someone wrote them back by hand with
* ll_rw_block/submit_bh. A rare case.
*/
- int uptodate = 1;
- do {
- if (!buffer_uptodate(bh)) {
- uptodate = 0;
- break;
- }
- bh = bh->b_this_page;
- } while (bh != head);
- if (uptodate)
- SetPageUptodate(page);
end_page_writeback(page);
+
/*
* The page and buffer_heads can be released at any time from
* here on.

2007-02-10 02:32:24

by Nick Piggin

[permalink] [raw]
Subject: [patch 3/3] mm: fix PageUptodate memorder

After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the page
uptodate.

Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.

Fix this by ensuring SetPageUptodate is always called with the page locked
(except in the case of a new page that cannot be visible to other CPUs), and
requiring PageUptodate be checked only when the page is locked.

To facilitate lockless checks, SetPageUptodate contains an smp_wmb to order
preceeding stores before the store to page flags, and a new PageUptodate_NoLock
is introduced, which issues a smp_rmb after the page flags are loaded for the
test.

I'm still not sure that a DMA memory barrier is not required, however I think
the logical place to put such a barrier would be in the IO completion routines,
when they come back to tell us that they have succeeded. (Help? Anyone?)

One thing I like about it is that it unifies the anonymous page handling
with the rest of the page management, by marking anon pages as uptodate
when they _are_ uptodate, rather than when our implementation requires
that they be marked as such. Doing this let me get rid of the smp_wmb's
in the page copying functions, which were specially added for anonymous
pages for a closely related issue, always vaguely troubled me.

Convert core code and some filesystems to use PageUptodate_NoLock, just for
reference (a more complete patch follows).

Signed-off-by: Nick Piggin <[email protected]>

fs/ext2/dir.c | 2 -
fs/namei.c | 2 -
fs/partitions/check.c | 2 -
fs/splice.c | 4 +--
include/linux/highmem.h | 4 ---
include/linux/page-flags.h | 57 +++++++++++++++++++++++++++++++++++++++++----
mm/filemap.c | 28 +++++++++++++++-------
mm/hugetlb.c | 2 +
mm/memory.c | 9 +++----
mm/page_io.c | 2 -
mm/swap_state.c | 2 -
mm/swapfile.c | 2 -
12 files changed, 86 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h
+++ linux-2.6/include/linux/highmem.h
@@ -57,8 +57,6 @@ static inline void clear_user_highpage(s
void *addr = kmap_atomic(page, KM_USER0);
clear_user_page(addr, vaddr, page);
kunmap_atomic(addr, KM_USER0);
- /* Make sure this page is cleared on other CPU's too before using it */
- smp_wmb();
}

#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
@@ -108,8 +106,6 @@ static inline void copy_user_highpage(st
copy_user_page(vto, vfrom, vaddr, to);
kunmap_atomic(vfrom, KM_USER0);
kunmap_atomic(vto, KM_USER1);
- /* Make sure this page is cleared on other CPU's too before using it */
- smp_wmb();
}

#endif
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -126,16 +126,60 @@
#define ClearPageReferenced(page) clear_bit(PG_referenced, &(page)->flags)
#define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, &(page)->flags)

-#define PageUptodate(page) test_bit(PG_uptodate, &(page)->flags)
-#ifdef CONFIG_S390
+static inline int PageUptodate(struct page *page)
+{
+ WARN_ON(!PageLocked(page));
+ return test_bit(PG_uptodate, &(page)->flags);
+}
+
+/*
+ * PageUptodate to be used when not holding the page lock.
+ */
+static inline int PageUptodate_NoLock(struct page *page)
+{
+ int ret = test_bit(PG_uptodate, &(page)->flags);
+
+ /*
+ * Must ensure that the data we read out of the page is loaded
+ * _after_ we've loaded page->flags and found that it is uptodate.
+ * See SetPageUptodate() for the other side of the story.
+ */
+ if (ret)
+ smp_rmb();
+
+ return ret;
+}
+
static inline void SetPageUptodate(struct page *page)
{
+ WARN_ON(!PageLocked(page));
+#ifdef CONFIG_S390
if (!test_and_set_bit(PG_uptodate, &page->flags))
page_test_and_clear_dirty(page);
-}
#else
-#define SetPageUptodate(page) set_bit(PG_uptodate, &(page)->flags)
+ /*
+ * Memory barrier must be issued before setting the PG_uptodate bit,
+ * so all previous writes that served to bring the page uptodate are
+ * visible before PageUptodate becomes true.
+ *
+ * S390 is guaranteed to have a barrier in the test_and_set operation
+ * (see Documentation/atomic_ops.txt).
+ *
+ * This memory barrier should not need to provide ordering against
+ * DMA writes into the page, because the IO completion should really
+ * be doing that.
+ */
+ smp_wmb();
+ set_bit(PG_uptodate, &(page)->flags);
#endif
+}
+
+static inline void SetNewPageUptodate(struct page *page)
+{
+ smp_wmb();
+ __set_bit(PG_uptodate, &(page)->flags);
+}
+
#define ClearPageUptodate(page) clear_bit(PG_uptodate, &(page)->flags)

#define PageDirty(page) test_bit(PG_dirty, &(page)->flags)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -443,6 +443,7 @@ static int hugetlb_cow(struct mm_struct

spin_unlock(&mm->page_table_lock);
copy_huge_page(new_page, old_page, address, vma);
+ SetNewPageUptodate(new_page);
spin_lock(&mm->page_table_lock);

ptep = huge_pte_offset(mm, address & HPAGE_MASK);
@@ -506,6 +507,7 @@ retry:
} else
lock_page(page);
}
+ SetNewPageUptodate(page);

spin_lock(&mm->page_table_lock);
size = i_size_read(mapping->host) >> HPAGE_SHIFT;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1463,10 +1463,8 @@ static inline void cow_user_page(struct
memset(kaddr, 0, PAGE_SIZE);
kunmap_atomic(kaddr, KM_USER0);
flush_dcache_page(dst);
- return;
-
- }
- copy_user_highpage(dst, src, va, vma);
+ } else
+ copy_user_highpage(dst, src, va, vma);
}

/*
@@ -1579,6 +1577,7 @@ gotten:
goto oom;
cow_user_page(new_page, old_page, address, vma);
}
+ SetNewPageUptodate(new_page);

/*
* Re-check the pte - we dropped the lock
@@ -2097,6 +2096,7 @@ static int do_anonymous_page(struct mm_s
page = alloc_zeroed_user_highpage(vma, address);
if (!page)
goto oom;
+ SetNewPageUptodate(page);

entry = mk_pte(page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2203,6 +2203,7 @@ retry:
copy_user_highpage(page, new_page, address, vma);
page_cache_release(new_page);
new_page = page;
+ SetNewPageUptodate(new_page);
anon = 1;

} else {
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -932,7 +932,7 @@ find_page:
handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate_NoLock(page))
goto page_not_up_to_date;
page_ok:

@@ -1000,7 +1000,7 @@ readpage:
goto readpage_error;
}

- if (!PageUptodate(page)) {
+ if (!PageUptodate_NoLock(page)) {
lock_page(page);
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
@@ -1417,7 +1417,7 @@ retry_find:
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
*/
- if (!PageUptodate(page))
+ if (!PageUptodate_NoLock(page))
goto page_not_uptodate;

success:
@@ -1484,7 +1484,7 @@ page_not_uptodate:
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
- if (PageUptodate(page))
+ if (PageUptodate_NoLock(page))
goto success;
} else if (error == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
@@ -1515,7 +1515,7 @@ page_not_uptodate:
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
- if (PageUptodate(page))
+ if (PageUptodate_NoLock(page))
goto success;
} else if (error == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
@@ -1554,7 +1554,7 @@ retry_find:
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
*/
- if (!PageUptodate(page)) {
+ if (!PageUptodate_NoLock(page)) {
if (nonblock) {
page_cache_release(page);
return NULL;
@@ -1605,7 +1605,7 @@ page_not_uptodate:
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
- if (PageUptodate(page))
+ if (PageUptodate_NoLock(page))
goto success;
} else if (error == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
@@ -1635,7 +1635,7 @@ page_not_uptodate:
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
- if (PageUptodate(page))
+ if (PageUptodate_NoLock(page))
goto success;
} else if (error == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
@@ -1800,7 +1800,7 @@ retry:
if (IS_ERR(page))
goto out;
mark_page_accessed(page);
- if (PageUptodate(page))
+ if (PageUptodate_NoLock(page))
goto out;

lock_page(page);
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -134,7 +134,7 @@ int swap_readpage(struct file *file, str
int ret = 0;

BUG_ON(!PageLocked(page));
- ClearPageUptodate(page);
+ BUG_ON(PageUptodate(page));
bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
end_swap_bio_read);
if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -149,6 +149,7 @@ int add_to_swap(struct page * page, gfp_
int err;

BUG_ON(!PageLocked(page));
+ BUG_ON(!PageUptodate(page));

for (;;) {
entry = get_swap_page();
@@ -171,7 +172,6 @@ int add_to_swap(struct page * page, gfp_

switch (err) {
case 0: /* Success */
- SetPageUptodate(page);
SetPageDirty(page);
INC_CACHE_INFO(add_total);
return 1;
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -107,7 +107,7 @@ static int page_cache_pipe_buf_pin(struc
struct page *page = buf->page;
int err;

- if (!PageUptodate(page)) {
+ if (!PageUptodate_NoLock(page)) {
lock_page(page);

/*
@@ -373,7 +373,7 @@ __generic_file_splice_read(struct file *
/*
* If the page isn't uptodate, we may need to start io on it
*/
- if (!PageUptodate(page)) {
+ if (!PageUptodate_NoLock(page)) {
/*
* If in nonblock mode then dont block on waiting
* for an in-flight io page

2007-02-10 22:44:54

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem (try 3)

On Sat, 2007-02-10 at 03:31 +0100, Nick Piggin wrote:
> SetNewPageUptodate does not do the S390 page_test_and_clear_dirty, so
> I'd like to make sure that's OK.

An I/O operation on s390 will set the dirty bit for a page. That is the
reason to have SetPageUptodate clear the per page dirty bit when the
page is made uptodate the first time. Otherwise we end up writing each
page back to its backing device at least once. If SetNewPageUptodate is
used on new anonymous pages exclusively I don't see a problem in
omitting the page_test_clear_dirty.

--
blue skies,
Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.


2007-02-13 05:52:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem (try 3)

On Sat, Feb 10, 2007 at 11:44:55PM +0100, Martin Schwidefsky wrote:
> On Sat, 2007-02-10 at 03:31 +0100, Nick Piggin wrote:
> > SetNewPageUptodate does not do the S390 page_test_and_clear_dirty, so
> > I'd like to make sure that's OK.
>
> An I/O operation on s390 will set the dirty bit for a page. That is the

Oh, OK.

> reason to have SetPageUptodate clear the per page dirty bit when the
> page is made uptodate the first time. Otherwise we end up writing each
> page back to its backing device at least once. If SetNewPageUptodate is
> used on new anonymous pages exclusively I don't see a problem in
> omitting the page_test_clear_dirty.

Thanks for the confirmation.

I'll obviously have to resend a new patchset because I made a silly
paper-bag bug with this one. May I say that the s390 specific part of
the change is acked-by: you?

2007-02-13 09:25:32

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem (try 3)

On Tue, 2007-02-13 at 06:52 +0100, Nick Piggin wrote:
> Thanks for the confirmation.
>
> I'll obviously have to resend a new patchset because I made a silly
> paper-bag bug with this one. May I say that the s390 specific part of
> the change is acked-by: you?

Yes.

--
blue skies,
Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.