2004-01-14 18:48:07

by Chuck Lever

[permalink] [raw]
Subject: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

hi trond-

it has been observed via fsx testing with NFS O_DIRECT that fsync and
msync (and their relatives) do not flush dirty mmap'd pages in NFS files
to the server. these pages are flushed later by other operations that
flush all dirty pages on a mount point. this can result in file data
corruption because write ordering is compromised.

the nfs_writepage interface causes dirty mmap'd pages to be queued on an
NFS inode's dirty page queue. these pages are queued with a NULL filp
because a filp is not available via the nfs_writepage interface. when
nfs_fsync is invoked (either via fsync() or via msync()), it is looking
specifically for dirty pages associated with a given filp, so it skips the
dirty pages with a NULL filp.

this patch removes the check in nfs_scan_list for pages with a specific
filp. this is a three line change -- the rest of the patch removes the
filp argument from a number of function declarations where it is no longer
needed.

patch is against 2.6.0, equivalent patch for 2.4 is already in your inbox.


diff -X /home/cel/src/linux/dont-diff -Naurp 04-async-read/fs/nfs/inode.c 05-msync/fs/nfs/inode.c
--- 04-async-read/fs/nfs/inode.c 2003-12-17 21:59:18.000000000 -0500
+++ 05-msync/fs/nfs/inode.c 2004-01-14 12:18:51.558979000 -0500
@@ -118,7 +118,7 @@ nfs_write_inode(struct inode *inode, int
{
int flags = sync ? FLUSH_WAIT : 0;

- nfs_commit_file(inode, NULL, 0, 0, flags);
+ nfs_commit_file(inode, 0, 0, flags);
}

static void
diff -X /home/cel/src/linux/dont-diff -Naurp 04-async-read/fs/nfs/pagelist.c 05-msync/fs/nfs/pagelist.c
--- 04-async-read/fs/nfs/pagelist.c 2004-01-14 12:14:56.485979000 -0500
+++ 05-msync/fs/nfs/pagelist.c 2004-01-14 12:18:51.654976000 -0500
@@ -247,7 +247,6 @@ nfs_coalesce_requests(struct list_head *
* nfs_scan_list - Scan a list for matching requests
* @head: One of the NFS inode request lists
* @dst: Destination list
- * @file: if set, ensure we match requests from this file
* @idx_start: lower bound of page->index to scan
* @npages: idx_start + npages sets the upper bound to scan.
*
@@ -259,7 +258,6 @@ nfs_coalesce_requests(struct list_head *
*/
int
nfs_scan_list(struct list_head *head, struct list_head *dst,
- struct file *file,
unsigned long idx_start, unsigned int npages)
{
struct list_head *pos, *tmp;
@@ -277,9 +275,6 @@ nfs_scan_list(struct list_head *head, st

req = nfs_list_entry(pos);

- if (file && req->wb_file != file)
- continue;
-
if (req->wb_index < idx_start)
continue;
if (req->wb_index > idx_end)
diff -X /home/cel/src/linux/dont-diff -Naurp 04-async-read/fs/nfs/write.c 05-msync/fs/nfs/write.c
--- 04-async-read/fs/nfs/write.c 2003-12-17 21:59:16.000000000 -0500
+++ 05-msync/fs/nfs/write.c 2004-01-14 12:18:51.706976000 -0500
@@ -272,7 +272,7 @@ nfs_writepages(struct address_space *map
err = generic_writepages(mapping, wbc);
if (err)
goto out;
- err = nfs_flush_file(inode, NULL, 0, 0, 0);
+ err = nfs_flush_file(inode, 0, 0, 0);
if (err < 0)
goto out;
if (wbc->sync_mode == WB_SYNC_HOLD)
@@ -280,7 +280,7 @@ nfs_writepages(struct address_space *map
if (is_sync && wbc->sync_mode == WB_SYNC_ALL) {
err = nfs_wb_all(inode);
} else
- nfs_commit_file(inode, NULL, 0, 0, 0);
+ nfs_commit_file(inode, 0, 0, 0);
out:
return err;
}
@@ -450,7 +450,6 @@ nfs_wait_on_requests(struct inode *inode
* nfs_scan_dirty - Scan an inode for dirty requests
* @inode: NFS inode to scan
* @dst: destination list
- * @file: if set, ensure we match requests from this file
* @idx_start: lower bound of page->index to scan.
* @npages: idx_start + npages sets the upper bound to scan.
*
@@ -458,11 +457,12 @@ nfs_wait_on_requests(struct inode *inode
* The requests are *not* checked to ensure that they form a contiguous set.
*/
static int
-nfs_scan_dirty(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
+nfs_scan_dirty(struct inode *inode, struct list_head *dst,
+ unsigned long idx_start, unsigned int npages)
{
struct nfs_inode *nfsi = NFS_I(inode);
int res;
- res = nfs_scan_list(&nfsi->dirty, dst, file, idx_start, npages);
+ res = nfs_scan_list(&nfsi->dirty, dst, idx_start, npages);
nfsi->ndirty -= res;
sub_page_state(nr_dirty,res);
if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
@@ -475,7 +475,6 @@ nfs_scan_dirty(struct inode *inode, stru
* nfs_scan_commit - Scan an inode for commit requests
* @inode: NFS inode to scan
* @dst: destination list
- * @file: if set, ensure we collect requests from this file only.
* @idx_start: lower bound of page->index to scan.
* @npages: idx_start + npages sets the upper bound to scan.
*
@@ -483,11 +482,12 @@ nfs_scan_dirty(struct inode *inode, stru
* The requests are *not* checked to ensure that they form a contiguous set.
*/
static int
-nfs_scan_commit(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
+nfs_scan_commit(struct inode *inode, struct list_head *dst,
+ unsigned long idx_start, unsigned int npages)
{
struct nfs_inode *nfsi = NFS_I(inode);
int res;
- res = nfs_scan_list(&nfsi->commit, dst, file, idx_start, npages);
+ res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
nfsi->ncommit -= res;
if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
@@ -617,12 +617,12 @@ nfs_strategy(struct inode *inode)
#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
if (NFS_PROTO(inode)->version == 2) {
if (dirty >= NFS_STRATEGY_PAGES * wpages)
- nfs_flush_file(inode, NULL, 0, 0, 0);
+ nfs_flush_file(inode, 0, 0, 0);
} else if (dirty >= wpages)
- nfs_flush_file(inode, NULL, 0, 0, 0);
+ nfs_flush_file(inode, 0, 0, 0);
#else
if (dirty >= NFS_STRATEGY_PAGES * wpages)
- nfs_flush_file(inode, NULL, 0, 0, 0);
+ nfs_flush_file(inode, 0, 0, 0);
#endif
}

@@ -1047,7 +1047,7 @@ nfs_commit_done(struct rpc_task *task)
}
#endif

-int nfs_flush_file(struct inode *inode, struct file *file, unsigned long idx_start,
+int nfs_flush_file(struct inode *inode, unsigned long idx_start,
unsigned int npages, int how)
{
LIST_HEAD(head);
@@ -1055,7 +1055,7 @@ int nfs_flush_file(struct inode *inode,
error = 0;

spin_lock(&nfs_wreq_lock);
- res = nfs_scan_dirty(inode, &head, file, idx_start, npages);
+ res = nfs_scan_dirty(inode, &head, idx_start, npages);
spin_unlock(&nfs_wreq_lock);
if (res)
error = nfs_flush_list(&head, NFS_SERVER(inode)->wpages, how);
@@ -1065,7 +1065,7 @@ int nfs_flush_file(struct inode *inode,
}

#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
-int nfs_commit_file(struct inode *inode, struct file *file, unsigned long idx_start,
+int nfs_commit_file(struct inode *inode, unsigned long idx_start,
unsigned int npages, int how)
{
LIST_HEAD(head);
@@ -1073,7 +1073,7 @@ int nfs_commit_file(struct inode *inode,
error = 0;

spin_lock(&nfs_wreq_lock);
- res = nfs_scan_commit(inode, &head, file, idx_start, npages);
+ res = nfs_scan_commit(inode, &head, idx_start, npages);
spin_unlock(&nfs_wreq_lock);
if (res)
error = nfs_commit_list(&head, how);
@@ -1100,10 +1100,10 @@ int nfs_sync_file(struct inode *inode, s
if (wait)
error = nfs_wait_on_requests(inode, file, idx_start, npages);
if (error == 0)
- error = nfs_flush_file(inode, file, idx_start, npages, how);
+ error = nfs_flush_file(inode, idx_start, npages, how);
#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
if (error == 0)
- error = nfs_commit_file(inode, file, idx_start, npages, how);
+ error = nfs_commit_file(inode, idx_start, npages, how);
#endif
} while (error > 0);
return error;
diff -X /home/cel/src/linux/dont-diff -Naurp 04-async-read/include/linux/nfs_fs.h 05-msync/include/linux/nfs_fs.h
--- 04-async-read/include/linux/nfs_fs.h 2003-12-17 21:58:40.000000000 -0500
+++ 05-msync/include/linux/nfs_fs.h 2004-01-14 12:18:51.776976000 -0500
@@ -310,14 +310,14 @@ extern void nfs_commit_done(struct rpc_t
* return value!)
*/
extern int nfs_sync_file(struct inode *, struct file *, unsigned long, unsigned int, int);
-extern int nfs_flush_file(struct inode *, struct file *, unsigned long, unsigned int, int);
+extern int nfs_flush_file(struct inode *, unsigned long, unsigned int, int);
extern int nfs_flush_list(struct list_head *, int, int);
#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
-extern int nfs_commit_file(struct inode *, struct file *, unsigned long, unsigned int, int);
+extern int nfs_commit_file(struct inode *, unsigned long, unsigned int, int);
extern int nfs_commit_list(struct list_head *, int);
#else
static inline int
-nfs_commit_file(struct inode *inode, struct file *file, unsigned long offset,
+nfs_commit_file(struct inode *inode, unsigned long offset,
unsigned int len, int flags)
{
return 0;
diff -X /home/cel/src/linux/dont-diff -Naurp 04-async-read/include/linux/nfs_page.h 05-msync/include/linux/nfs_page.h
--- 04-async-read/include/linux/nfs_page.h 2004-01-14 12:14:56.750976000 -0500
+++ 05-msync/include/linux/nfs_page.h 2004-01-14 12:18:52.058985000 -0500
@@ -55,7 +55,7 @@ extern void nfs_release_request(struct n
extern void nfs_list_add_request(struct nfs_page *, struct list_head *);

extern int nfs_scan_list(struct list_head *, struct list_head *,
- struct file *, unsigned long, unsigned int);
+ unsigned long, unsigned int);
extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
unsigned int);
extern int nfs_wait_on_request(struct nfs_page *);

- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>



-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-01-20 12:03:28

by Olaf Kirch

[permalink] [raw]
Subject: Re: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

Hi,

On Wed, Jan 14, 2004 at 01:46:27PM -0500, Chuck Lever wrote:
> the nfs_writepage interface causes dirty mmap'd pages to be queued on an
> NFS inode's dirty page queue. these pages are queued with a NULL filp
> because a filp is not available via the nfs_writepage interface. when
> nfs_fsync is invoked (either via fsync() or via msync()), it is looking
> specifically for dirty pages associated with a given filp, so it skips the
> dirty pages with a NULL filp.

Given that we have the inode for these dirty pages, couldn't we at least
compare the req->wb_inode to filp->f_dentry->d_inode? With your patch,
an fsync degenerates to a global sync on the mounted file system, if
I'm not mistaken.

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-01-20 13:31:15

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

> Given that we have the inode for these dirty pages, couldn't we at least
> compare the req->wb_inode to filp->f_dentry->d_inode?

I believe that is indeed what Chuck's patch does. Previously we would in
addition check that the filp matches. For ordinary files that is fine, but
in the special case of msync(MS_SYNC), it was causing the fsync() method
to skip precisely those dirty pages that we wanted to sync to disk.

Cheers,
Trond




-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-01-20 13:56:10

by Lever, Charles

[permalink] [raw]
Subject: RE: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

> > Given that we have the inode for these dirty pages,=20
> > couldn't we at least
> > compare the req->wb_inode to filp->f_dentry->d_inode?
>=20
> I believe that is indeed what Chuck's patch does.

no, my patch simply removes the filp check. the inode
is not available in nfs_scan_list, and there was never
a check there to see if the inode pointers match.

> > With your patch, an fsync degenerates to a global sync
> > on the mounted file system, if I'm not mistaken.

the patch does not change which list is passed to nfs_scan_list.
nfs_scan_list is used by nfs_scan_dirty and nfs_scan_commit --
so it always works on a single file at a time.


-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-01-20 14:37:56

by Trond Myklebust

[permalink] [raw]
Subject: RE: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

> no, my patch simply removes the filp check. the inode
> is not available in nfs_scan_list, and there was never
> a check there to see if the inode pointers match.

An explicit check would be superfluous. The point is that there's an
implicit check given that nfs_flush_file() and nfs_commit_file() both take
an inode argument in order to be able to pass the per-inode dirtyand
commit request lists respectively as arguments to nfs_scan_list().

Note: this would be more transparent if you would also convert all
remaining nfs_wb_file() calls to nfs_wb_all(), and then kill the former.

Cheers,
Trond




-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-01-20 14:58:37

by Olaf Kirch

[permalink] [raw]
Subject: Re: [PATCH] NFS client doesn't flush dirty mmap'd pages during fsync/msync

On Tue, Jan 20, 2004 at 05:55:53AM -0800, Lever, Charles wrote:
> > > With your patch, an fsync degenerates to a global sync
> > > on the mounted file system, if I'm not mistaken.
>
> the patch does not change which list is passed to nfs_scan_list.
> nfs_scan_list is used by nfs_scan_dirty and nfs_scan_commit --
> so it always works on a single file at a time.

Ah, that was the point I missed.

Thanks,
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs