2001-12-06 02:01:26

by Xeno

Subject: 2.4: NFS client race causes data loss when appending

We've been observing intermittent data loss when appending to files over
NFS, caused by a race in the NFS client (we lose less than 4K a couple of
times a day). Only one process appends to each file. We're using 2.4.9,
but the race appears to be present in other 2.4 versions as well. It
happens most frequently when nfs_file_write calls nfs_revalidate_inode
while there are lots of writebacks pending. Here's the sequence of events
we're seeing:

1. getattr request goes out to get file size. Value will be
stale compared to inode->i_size, since writes are happening.
2. All writebacks for the inode complete.
3. getattr response returns with stale file size value.
4. __nfs_refresh_inode checks writebacks, finds none,
overwrites inode->i_size.
5. generic_file_write resets file position (O_APPEND) with
stale file size, overwriting previously written data.
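
The five steps above can be replayed in a small userspace model. This is only a sketch for illustration, not kernel code: model_inode, model_refresh, model_append and replay_race are made-up names, and the sizes are arbitrary.

```c
#include <stdint.h>

/* Hypothetical userspace model of the client state; not a kernel type. */
struct model_inode {
    int64_t i_size;     /* client's idea of the file size */
    int     writebacks; /* write requests not yet acknowledged */
};

/* Step 4: the refresh keeps the local size only while writebacks are pending. */
static void model_refresh(struct model_inode *inode, int64_t reply_size)
{
    if (inode->writebacks == 0)
        inode->i_size = reply_size;
}

/* O_APPEND write: position is reset to i_size. Returns the offset used. */
static int64_t model_append(struct model_inode *inode, int64_t len)
{
    int64_t pos = inode->i_size;

    inode->i_size = pos + len;
    inode->writebacks++;
    return pos;
}

/*
 * Replays steps 1-5 and returns how far before the true end of file
 * the final append landed (0 would mean no data was overwritten).
 */
static int64_t replay_race(void)
{
    struct model_inode inode = { 0, 0 };
    int64_t server_size = 0;
    int64_t stale, pos;

    model_append(&inode, 4096);      /* data queued, server not yet updated */

    stale = server_size;             /* 1. getattr samples the old size */

    server_size = inode.i_size;      /* 2. all writebacks complete */
    inode.writebacks = 0;

    model_refresh(&inode, stale);    /* 3-4. stale reply overwrites i_size */

    pos = model_append(&inode, 100); /* 5. append starts at the stale size */
    return server_size - pos;
}
```

In this model replay_race() reports that the second append starts 4096 bytes before the real end of file, which is the overwrite described in step 5.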

The race is theoretically present in 2.2 as well, but the delay-based
flushing in 2.2 rarely flushes all writes away, so there are usually
requests on the writeback list. It only became a problem in 2.4, where
more aggressive flushing clears out all writebacks.

Here's a patch (against 2.4.16) that works for us; it eliminates the race
in the most common case. BKL and NFS_REVALIDATING prevent new writes from
coming in while the getattr is happening, so we just need to check for
writebacks before starting the getattr. It's not a perfect solution:
there's still a potential for races with direct calls to nfs_refresh_inode
that bypass nfs_revalidate_inode. In practice, we're not triggering those
operations with any frequency.
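
The idea behind the fix (snapshot the writeback state before the getattr, then refuse to let the reply shrink the size) can be sketched in userspace as follows. This is an illustrative model only: model_inode and model_revalidate are hypothetical names, and the drain_in_flight flag merely stands in for writebacks completing while the request is outstanding.

```c
#include <stdint.h>

/* Hypothetical userspace model of the client state; not a kernel type. */
struct model_inode {
    int64_t i_size;     /* client's idea of the file size */
    int     writebacks; /* write requests not yet acknowledged */
};

/*
 * Models one revalidation. reply_size is the size the (possibly stale)
 * getattr reply carries. Returns the resulting i_size.
 */
static int64_t model_revalidate(struct model_inode *inode,
                                int64_t reply_size, int drain_in_flight)
{
    /* Snapshot taken before the getattr goes out. */
    int had_writebacks = inode->writebacks > 0;

    /* While the request is in flight, pending writes may all complete. */
    if (drain_in_flight)
        inode->writebacks = 0;

    /* Writes were pending when we asked, so a smaller size must be stale. */
    if (had_writebacks && reply_size < inode->i_size)
        reply_size = inode->i_size;

    /* Refresh step: the reply is applied only if nothing is pending now. */
    if (inode->writebacks == 0)
        inode->i_size = reply_size;
    return inode->i_size;
}
```

With i_size at 4096 and one writeback pending, a stale reply of 0 leaves i_size at 4096 even if the writeback drains during the call. With no writebacks pending at the start, a smaller reply is still accepted, so a genuine truncation on the server is not masked.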

--- linux/fs/nfs/inode.c Fri Nov 9 14:28:15 2001
+++ linux-nfsappendrace/fs/nfs/inode.c Wed Dec 5 17:12:28 2001
@@ -868,8 +868,9 @@
__nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
{
int status = -ESTALE;
struct nfs_fattr fattr;
+ int writebacks;

dfprintk(PAGECACHE, "NFS: revalidating (%x/%Ld)\n",
inode->i_dev, (long long)NFS_FILEID(inode));

@@ -889,8 +890,9 @@
}
}
NFS_FLAGS(inode) |= NFS_INO_REVALIDATING;

+ writebacks = nfs_have_writebacks(inode);
status = NFS_PROTO(inode)->getattr(inode, &fattr);
if (status) {
dfprintk(PAGECACHE, "nfs_revalidate_inode: (%x/%Ld) getattr failed, error=%d\n",
inode->i_dev, (long long)NFS_FILEID(inode), status);
@@ -900,8 +902,11 @@
remove_inode_hash(inode);
}
goto out;
}
+
+ if ( writebacks && nfs_size_to_loff_t(fattr.size) < inode->i_size )
+ fattr.size = (__u64) inode->i_size;

status = nfs_refresh_inode(inode, &fattr);
if (status) {
dfprintk(PAGECACHE, "nfs_revalidate_inode: (%x/%Ld) refresh failed, error=%d\n",

Please cc on responses, thanks!

Xeno


2001-12-06 14:30:29

by Trond Myklebust

Subject: Re: 2.4: NFS client race causes data loss when appending

>>>>> " " == xeno writes:

> 1. getattr request goes out to get file size. Value will be
> stale compared to inode->i_size, since writes are happening.
> 2. All writebacks for the inode complete.
> 3. getattr response returns with stale file size value.
> 4. __nfs_refresh_inode checks writebacks, finds none,
> overwrites inode->i_size.
> 5. generic_file_write resets file position (O_APPEND) with
> stale file size, overwriting previously written data.

<snip>
> --- linux/fs/nfs/inode.c Fri Nov 9 14:28:15 2001
> +++ linux-nfsappendrace/fs/nfs/inode.c Wed Dec 5 17:12:28 2001
> @@ -868,8 +868,9 @@
> __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
> {
> int status = -ESTALE;
> struct nfs_fattr fattr;
> + int writebacks;
>
> dfprintk(PAGECACHE, "NFS: revalidating (%x/%Ld)\n",
> inode->i_dev, (long long)NFS_FILEID(inode));
>
> @@ -889,8 +890,9 @@
> }
> }
> NFS_FLAGS(inode) |= NFS_INO_REVALIDATING;
>
> + writebacks = nfs_have_writebacks(inode);
> status = NFS_PROTO(inode)->getattr(inode, &fattr);
> if (status) {
> dfprintk(PAGECACHE, "nfs_revalidate_inode: (%x/%Ld) getattr failed, error=%d\n",
> inode->i_dev, (long long)NFS_FILEID(inode), status);
> @@ -900,8 +902,11 @@
> remove_inode_hash(inode);
> }
> goto out;
> }
> +
> + if ( writebacks && nfs_size_to_loff_t(fattr.size) < inode->i_size )
> + fattr.size = (__u64) inode->i_size;
>
> status = nfs_refresh_inode(inode, &fattr);
> if (status) {
> dfprintk(PAGECACHE, "nfs_revalidate_inode: (%x/%Ld) refresh failed, error=%d\n",

The above is clearly insufficient to fix the race: you've only
addressed the problem of getattr. NFS is crawling with stuff that
returns fattrs (read/getattr/lookup/...). Each and every one of them
can race in the way you describe.
It will also fail to prevent a race occurring if the writeback is
scheduled and written while we are in the getattr() call (rare but
possible)...

What we really want is to prevent nfs_refresh_inode() from
overwriting newer attribute information with older information. How
therefore about something like the appended patch, that uses the ctime
field to determine which attribute information is obsolete?
I'm afraid it's not going to work too well for Linux servers because
of the shitty 1 second resolution we have on (a|m|c)time, but it will
help against most non-Linux servers.

Cheers,
Trond

--- linux-2.4.17-pre4/fs/nfs/inode.c.orig Thu Dec 6 02:27:46 2001
+++ linux-2.4.17-pre4/fs/nfs/inode.c Thu Dec 6 15:26:07 2001
@@ -1007,6 +1007,10 @@
new_size = fattr->size;
new_isize = nfs_size_to_loff_t(fattr->size);

+ if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)) &&
+ (s64)NFS_CACHE_CTIME(inode) - (s64)fattr->ctime < 0)
+ return 0;
+
/*
* Update the read time so we don't revalidate too often.
*/

2001-12-06 21:24:53

by Trond Myklebust

Subject: Re: 2.4: NFS client race causes data loss when appending

>>>>> " " == Trond Myklebust writes:

> What we really want is to prevent nfs_refresh_inode() from
> overwriting newer attribute information with older
> information. How therefore about something like the appended
> patch, that uses the ctime field to determine which attribute
> information is obsolete? I'm afraid it's not going to work too
> well for Linux servers because of the shitty 1 second
> resolution we have on (a|m|c)time, but it will help against
> most non-Linux servers.

Hah... I of course managed to get the sign wrong on the last
patch. The following should work better, including with Linux servers.

Cheers,
Trond

--- linux-2.4.17-pre4/fs/nfs/inode.c.orig Thu Dec 6 02:27:46 2001
+++ linux-2.4.17-pre4/fs/nfs/inode.c Thu Dec 6 21:45:29 2001
@@ -655,20 +655,8 @@
inode->i_op = &nfs_symlink_inode_operations;
else
init_special_inode(inode, inode->i_mode, fattr->rdev);
- /*
- * Preset the size and mtime, as there's no need
- * to invalidate the caches.
- */
- inode->i_size = nfs_size_to_loff_t(fattr->size);
- inode->i_mtime = nfs_time_to_secs(fattr->mtime);
- inode->i_atime = nfs_time_to_secs(fattr->atime);
- inode->i_ctime = nfs_time_to_secs(fattr->ctime);
- NFS_CACHE_CTIME(inode) = fattr->ctime;
- NFS_CACHE_MTIME(inode) = fattr->mtime;
- NFS_CACHE_ISIZE(inode) = fattr->size;
- NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
- NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+ NFS_CACHEINV(inode);
}
nfs_refresh_inode(inode, fattr);
}
@@ -966,6 +954,37 @@
}

/*
+ * nfs_fattr_obsolete - Test if attribute data is newer than cached data
+ * @inode: inode
+ * @fattr: attributes to test
+ *
+ * Avoid stuffing the attribute cache with obsolete information.
+ * We always accept updates if the attribute cache timed out, or if
+ * fattr->ctime is newer than our cached value.
+ * If fattr->ctime matches the cached value, we still accept the update
+ * if there also is a reasonable match for the Weak Cache Consistency
+ * data. This is in order to cope with NFS servers with crap time
+ * resolution...
+ */
+static inline
+int nfs_fattr_obsolete(struct inode *inode, struct nfs_fattr *fattr)
+{
+ s64 cdif;
+
+ if (time_after(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)))
+ goto out_valid;
+ if ((cdif = (s64)fattr->ctime - (s64)NFS_CACHE_CTIME(inode)) > 0)
+ goto out_valid;
+ /* Ugh... */
+ if (cdif == 0 && (fattr->valid & NFS_ATTR_WCC)
+ && (s64)fattr->pre_ctime - (s64)NFS_CACHE_CTIME(inode) >= 0)
+ goto out_valid;
+ return -1;
+ out_valid:
+ return 0;
+}
+
+/*
* Many nfs protocol calls return the new file attributes after
* an operation. Here we update the inode to reflect the state
* of the server's inode.
@@ -1003,6 +1022,10 @@
if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
goto out_changed;

+ /* Avoid races */
+ if (nfs_fattr_obsolete(inode, fattr))
+ return 0;
+
new_mtime = fattr->mtime;
new_size = fattr->size;
new_isize = nfs_size_to_loff_t(fattr->size);
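
The decision made by nfs_fattr_obsolete in the patch above can be exercised outside the kernel with a small model. This is a sketch under assumptions: model_cache, model_fattr and model_fattr_obsolete are hypothetical names, the timed_out flag replaces the jiffies comparison, and plain 64-bit integers stand in for the ctime values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cached attribute state of an inode. */
struct model_cache {
    int64_t cached_ctime; /* models NFS_CACHE_CTIME(inode) */
    bool    timed_out;    /* models the attribute cache timeout test */
};

/* Hypothetical stand-in for the attributes carried by a reply. */
struct model_fattr {
    int64_t ctime;     /* post-operation ctime */
    bool    has_wcc;   /* models fattr->valid & NFS_ATTR_WCC */
    int64_t pre_ctime; /* pre-operation ctime, meaningful if has_wcc */
};

/* Returns true when the reply should be ignored as obsolete. */
static bool model_fattr_obsolete(const struct model_cache *c,
                                 const struct model_fattr *f)
{
    int64_t cdif = f->ctime - c->cached_ctime;

    if (c->timed_out)
        return false; /* cache expired: always accept */
    if (cdif > 0)
        return false; /* strictly newer ctime: accept */
    /* Equal ctime: accept only if the WCC pre-op ctime is not older. */
    if (cdif == 0 && f->has_wcc && f->pre_ctime - c->cached_ctime >= 0)
        return false;
    return true;      /* older, or equal without usable WCC data */
}
```

In this model, a reply whose ctime equals the cached value and carries no WCC data is treated as obsolete while the cache is still fresh, which mirrors the fall-through to return -1 in the patch.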