2012-04-16 16:14:27

by Jan Kara

Subject: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

Hello,

here is the fifth iteration of my patches to improve filesystem freezing.
No serious changes since last time. Mostly I rebased the patches and merged this
series with the series moving file_update_time() to ->page_mkwrite() to simplify
testing and merging.

Filesystem freezing is currently racy and thus we can end up with dirty data on
a frozen filesystem (see the changelog of patch 13 for a detailed race
description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved the filesystem
file_update_time() call to the ->page_mkwrite callback (patches 01-07) and put
freeze handling in mnt_want_write() / mnt_drop_write(). That however required
some code shuffling and changes to kern_path_create() (see patches 09-12). I
think the result is OK but opinions may differ ;). Another advantage of this
change is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.
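
As a rough illustration of the invariant the series is after (this is not code
from the patches), the sketch below models freeze protection in userspace C.
The names sb_model, sb_try_want_write(), sb_drop_write(), sb_try_freeze() and
sb_thaw() are hypothetical. Where the real mnt_want_write() would presumably
sleep until the filesystem is thawed, the model just returns -EBUSY so that it
stays single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a superblock; not a kernel structure. */
struct sb_model {
	int writers;	/* callers currently between want_write and drop_write */
	bool frozen;	/* set once a freeze has succeeded */
};

/* Model of taking write access: refused while frozen (the kernel would block). */
int sb_try_want_write(struct sb_model *sb)
{
	if (sb->frozen)
		return -EBUSY;
	sb->writers++;
	return 0;
}

/* Model of dropping write access taken by sb_try_want_write(). */
void sb_drop_write(struct sb_model *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Model of freezing: refused while writers are in flight (the kernel would wait). */
int sb_try_freeze(struct sb_model *sb)
{
	if (sb->writers > 0)
		return -EBUSY;
	sb->frozen = true;
	return 0;
}

/* Model of thawing: writers are admitted again. */
void sb_thaw(struct sb_model *sb)
{
	sb->frozen = false;
}
```

The point of the model is only that a successful freeze implies no writer is
in flight, and that no new writer gets in until the thaw.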

Another potentially contentious point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files, which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such a filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on a frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v4:
* added a couple of Acked-by's
* added some comments & doc update
* added patches from series "Push file_update_time() into .page_mkwrite"
since it doesn't make much sense to keep them separate anymore
* rebased on top of 3.4-rc2

Changes since v3:
* added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
* removed racy i_size check from filemap_mkwrite()

Changes since v2:
* completely rewritten
* freezing is now blocked at VFS entry points
* two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
* have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
* use percpu counters instead of own percpu structure
* added documentation fixes from the old fs freezing series
* converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder <[email protected]>
CC: Anton Altaparmakov <[email protected]>
CC: Ben Myers <[email protected]>
CC: Chris Mason <[email protected]>
CC: [email protected]
CC: "David S. Miller" <[email protected]>
CC: [email protected]
CC: "J. Bruce Fields" <[email protected]>
CC: Joel Becker <[email protected]>
CC: KONISHI Ryusuke <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: Mark Fasheh <[email protected]>
CC: Miklos Szeredi <[email protected]>
CC: [email protected]
CC: OGAWA Hirofumi <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: "Theodore Ts'o" <[email protected]>
CC: [email protected]


2012-04-16 16:14:23

by Jan Kara

Subject: [PATCH 12/27] nfsd: Push mnt_want_write() outside of i_mutex

When mnt_want_write() starts to handle freezing it will get full lock
semantics requiring proper lock ordering. So push the mnt_want_write() call
consistently outside of i_mutex.

CC: [email protected]
CC: "J. Bruce Fields" <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/nfsd/nfs4recover.c | 9 +++--
fs/nfsd/nfsfh.c | 1 +
fs/nfsd/nfsproc.c | 9 ++++-
fs/nfsd/vfs.c | 79 ++++++++++++++++++++++---------------------
fs/nfsd/vfs.h | 11 +++++-
include/linux/nfsd/nfsfh.h | 1 +
6 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 4767429..efa7574 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -154,6 +154,10 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
if (status < 0)
return;

+ status = mnt_want_write_file(rec_file);
+ if (status)
+ return;
+
dir = rec_file->f_path.dentry;
/* lock the parent */
mutex_lock(&dir->d_inode->i_mutex);
@@ -173,11 +177,7 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
* as well be forgiving and just succeed silently.
*/
goto out_put;
- status = mnt_want_write_file(rec_file);
- if (status)
- goto out_put;
status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
- mnt_drop_write_file(rec_file);
out_put:
dput(dentry);
out_unlock:
@@ -189,6 +189,7 @@ out_unlock:
" (err %d); please check that %s exists"
" and is writeable", status,
user_recovery_dirname);
+ mnt_drop_write_file(rec_file);
nfs4_reset_creds(original_cred);
}

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 68454e7..8b93353 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
fhp->fh_post_saved = 0;
#endif
}
+ fh_drop_write(fhp);
if (exp) {
cache_put(&exp->h, &svc_export_cache);
fhp->fh_export = NULL;
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index e15dc45..aad6d45 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
struct dentry *dchild;
int type, mode;
__be32 nfserr;
+ int hosterr;
dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);

dprintk("nfsd: CREATE %s %.*s\n",
@@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
nfserr = nfserr_exist;
if (isdotent(argp->name, argp->len))
goto done;
+ hosterr = fh_want_write(dirfhp);
+ if (hosterr) {
+ nfserr = nfserrno(hosterr);
+ goto done;
+ }
+
fh_lock_nested(dirfhp, I_MUTEX_PARENT);
dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
if (IS_ERR(dchild)) {
@@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
out_unlock:
/* We don't really need to unlock, as fh_put does it. */
fh_unlock(dirfhp);
-
+ fh_drop_write(dirfhp);
done:
fh_put(dirfhp);
return nfsd_return_dirop(nfserr, resp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 296d671..b8bb649 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1276,6 +1276,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
* If it has, the parent directory should already be locked.
*/
if (!resfhp->fh_dentry) {
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
/* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
fh_lock_nested(fhp, I_MUTEX_PARENT);
dchild = lookup_one_len(fname, dentry, flen);
@@ -1319,14 +1323,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
goto out;
}

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
-
/*
* Get the dir op function pointer.
*/
err = 0;
+ host_err = 0;
switch (type) {
case S_IFREG:
host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
@@ -1343,10 +1344,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
break;
}
- if (host_err < 0) {
- fh_drop_write(fhp);
+ if (host_err < 0)
goto out_nfserr;
- }

err = nfsd_create_setattr(rqstp, resfhp, iap);

@@ -1358,7 +1357,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
err2 = nfserrno(commit_metadata(fhp));
if (err2)
err = err2;
- fh_drop_write(fhp);
/*
* Update the file handle to get the new inode info.
*/
@@ -1417,6 +1415,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = nfserr_notdir;
if (!dirp->i_op->lookup)
goto out;
+
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock_nested(fhp, I_MUTEX_PARENT);

/*
@@ -1449,9 +1452,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
v_atime = verifier[1]&0x7fffffff;
}

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
if (dchild->d_inode) {
err = 0;

@@ -1522,7 +1522,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (!err)
err = nfserrno(commit_metadata(fhp));

- fh_drop_write(fhp);
/*
* Update the filehandle to get the new inode info.
*/
@@ -1533,6 +1532,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
fh_unlock(fhp);
if (dchild && !IS_ERR(dchild))
dput(dchild);
+ fh_drop_write(fhp);
return err;

out_nfserr:
@@ -1613,6 +1613,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
if (err)
goto out;
+
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock(fhp);
dentry = fhp->fh_dentry;
dnew = lookup_one_len(fname, dentry, flen);
@@ -1620,10 +1625,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (IS_ERR(dnew))
goto out_nfserr;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
-
if (unlikely(path[plen] != 0)) {
char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
if (path_alloced == NULL)
@@ -1683,6 +1684,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (isdotent(name, len))
goto out;

+ host_err = fh_want_write(tfhp);
+ if (host_err) {
+ err = nfserrno(host_err);
+ goto out;
+ }
+
fh_lock_nested(ffhp, I_MUTEX_PARENT);
ddir = ffhp->fh_dentry;
dirp = ddir->d_inode;
@@ -1694,18 +1701,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,

dold = tfhp->fh_dentry;

- host_err = fh_want_write(tfhp);
- if (host_err) {
- err = nfserrno(host_err);
- goto out_dput;
- }
err = nfserr_noent;
if (!dold->d_inode)
- goto out_drop_write;
+ goto out_dput;
host_err = nfsd_break_lease(dold->d_inode);
if (host_err) {
err = nfserrno(host_err);
- goto out_drop_write;
+ goto out_dput;
}
host_err = vfs_link(dold, dirp, dnew);
if (!host_err) {
@@ -1718,12 +1720,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
else
err = nfserrno(host_err);
}
-out_drop_write:
- fh_drop_write(tfhp);
out_dput:
dput(dnew);
out_unlock:
fh_unlock(ffhp);
+ fh_drop_write(tfhp);
out:
return err;

@@ -1766,6 +1767,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
goto out;

+ host_err = fh_want_write(ffhp);
+ if (host_err) {
+ err = nfserrno(host_err);
+ goto out;
+ }
+
/* cannot use fh_lock as we need deadlock protective ordering
* so do it by hand */
trap = lock_rename(tdentry, fdentry);
@@ -1796,17 +1803,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = -EXDEV;
if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
goto out_dput_new;
- host_err = fh_want_write(ffhp);
- if (host_err)
- goto out_dput_new;

host_err = nfsd_break_lease(odentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_dput_new;
if (ndentry->d_inode) {
host_err = nfsd_break_lease(ndentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_dput_new;
}
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err) {
@@ -1814,8 +1818,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (!host_err)
host_err = commit_metadata(ffhp);
}
-out_drop_write:
- fh_drop_write(ffhp);
out_dput_new:
dput(ndentry);
out_dput_old:
@@ -1831,6 +1833,7 @@ out_drop_write:
fill_post_wcc(tfhp);
unlock_rename(tdentry, fdentry);
ffhp->fh_locked = tfhp->fh_locked = 0;
+ fh_drop_write(ffhp);

out:
return err;
@@ -1856,6 +1859,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (err)
goto out;

+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock_nested(fhp, I_MUTEX_PARENT);
dentry = fhp->fh_dentry;
dirp = dentry->d_inode;
@@ -1874,21 +1881,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (!type)
type = rdentry->d_inode->i_mode & S_IFMT;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_put;
-
host_err = nfsd_break_lease(rdentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_put;
if (type != S_IFDIR)
host_err = vfs_unlink(dirp, rdentry);
else
host_err = vfs_rmdir(dirp, rdentry);
if (!host_err)
host_err = commit_metadata(fhp);
-out_drop_write:
- fh_drop_write(fhp);
out_put:
dput(rdentry);

diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index ec0611b..359594c 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -110,12 +110,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);

static inline int fh_want_write(struct svc_fh *fh)
{
- return mnt_want_write(fh->fh_export->ex_path.mnt);
+ int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
+
+ if (!ret)
+ fh->fh_want_write = 1;
+ return ret;
}

static inline void fh_drop_write(struct svc_fh *fh)
{
- mnt_drop_write(fh->fh_export->ex_path.mnt);
+ if (fh->fh_want_write) {
+ fh->fh_want_write = 0;
+ mnt_drop_write(fh->fh_export->ex_path.mnt);
+ }
}

#endif /* LINUX_NFSD_VFS_H */
diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
index ce4743a..fa63048 100644
--- a/include/linux/nfsd/nfsfh.h
+++ b/include/linux/nfsd/nfsfh.h
@@ -143,6 +143,7 @@ typedef struct svc_fh {
int fh_maxsize; /* max size for fh_handle */

unsigned char fh_locked; /* inode locked by us */
+ unsigned char fh_want_write; /* remount protection taken */

#ifdef CONFIG_NFSD_V3
unsigned char fh_post_saved; /* post-op attrs saved */
--
1.7.1
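
The fh_want_write() / fh_drop_write() change in fs/nfsd/vfs.h above makes the
drop idempotent by recording in the file handle whether write access is held,
which is what lets fh_put() call fh_drop_write() unconditionally. A minimal
userspace sketch of that pattern follows. Note that mount_model, fh_model and
the model_* helpers are hypothetical stand-ins for the mount writer count and
struct svc_fh, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mount and its writer count. */
struct mount_model {
	int writers;
	bool read_only;
};

/* Hypothetical stand-in for struct svc_fh. */
struct fh_model {
	struct mount_model *mnt;
	unsigned char want_write;	/* write access currently held via this handle */
};

/* Model of mnt_want_write(): fails on a read-only mount. */
static int model_want_write(struct mount_model *m)
{
	if (m->read_only)
		return -EROFS;
	m->writers++;
	return 0;
}

/* Model of mnt_drop_write(): must be paired with a successful want. */
static void model_drop_write(struct mount_model *m)
{
	assert(m->writers > 0);
	m->writers--;
}

/* Take write access and remember that we did, as the patched helper does. */
int fh_model_want_write(struct fh_model *fh)
{
	int ret = model_want_write(fh->mnt);

	if (!ret)
		fh->want_write = 1;
	return ret;
}

/* Drop write access only if this handle holds it; extra calls are no-ops. */
void fh_model_drop_write(struct fh_model *fh)
{
	if (fh->want_write) {
		fh->want_write = 0;
		model_drop_write(fh->mnt);
	}
}
```

With the flag in place, an error path that drops write access explicitly and a
later cleanup that drops it again leave the writer count balanced.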


2012-04-16 16:16:22

by Jan Kara

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

The subject should have been [PATCH 00/27]... Sorry for the mistake.

Honza

On Mon 16-04-12 18:13:38, Jan Kara wrote:
> Hello,
>
> here is the fifth iteration of my patches to improve filesystem freezing.
> No serious changes since last time. Mostly I rebased patches and merged this
> series with series moving file_update_time() to ->page_mkwrite() to simplify
> testing and merging.
>
> Filesystem freezing is currently racy and thus we can end up with dirty data on
> frozen filesystem (see changelog patch 13 for detailed race description). This
> patch series aims at fixing this.
>
> To be able to block all places where inodes get dirtied, I've moved filesystem
> file_update_time() call to ->page_mkwrite callback (patches 01-07) and put
> freeze handling in mnt_want_write() / mnt_drop_write(). That however required
> some code shuffling and changes to kern_path_create() (see patches 09-12). I
> think the result is OK but opinions may differ ;). The advantage of this change
> also is that all filesystems get freeze protection almost for free - even ext2
> can handle freezing well now.
>
> Another potential contention point might be patch 19. In that patch we make
> freeze_super() refuse to freeze the filesystem when there are open but unlinked
> files which may be impractical in some cases. The main reason for this is the
> problem with handling of file deletion from fput() called with mmap_sem held
> (e.g. from munmap(2)), and then there's the fact that we cannot really force
> such filesystem into a consistent state... But if people think that freezing
> with open but unlinked files should happen, then I have some possible
> solutions in mind (maybe as a separate patchset since this is large enough).
>
> I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
> filesystem despite beating it with fsstress and bash-shared-mapping while
> freezing and unfreezing for several hours (using ext4 and xfs) so I'm
> reasonably confident this could finally be the right solution.
>
> Changes since v4:
> * added a couple of Acked-by's
> * added some comments & doc update
> * added patches from series "Push file_update_time() into .page_mkwrite"
> since it doesn't make much sense to keep them separate anymore
> * rebased on top of 3.4-rc2
>
> Changes since v3:
> * added third level of freezing for fs internal purposes - hooked some
> filesystems to use it (XFS, nilfs2)
> * removed racy i_size check from filemap_mkwrite()
>
> Changes since v2:
> * completely rewritten
> * freezing is now blocked at VFS entry points
> * two stage freezing to handle both mmapped writes and other IO
>
> The biggest changes since v1:
> * have two counters to provide safe state transitions for SB_FREEZE_WRITE
> and SB_FREEZE_TRANS states
> * use percpu counters instead of own percpu structure
> * added documentation fixes from the old fs freezing series
> * converted XFS to use SB_FREEZE_TRANS counter instead of its private
> m_active_trans counter
>
> Honza
>
> CC: Alex Elder <[email protected]>
> CC: Anton Altaparmakov <[email protected]>
> CC: Ben Myers <[email protected]>
> CC: Chris Mason <[email protected]>
> CC: [email protected]
> CC: "David S. Miller" <[email protected]>
> CC: [email protected]
> CC: "J. Bruce Fields" <[email protected]>
> CC: Joel Becker <[email protected]>
> CC: KONISHI Ryusuke <[email protected]>
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: Mark Fasheh <[email protected]>
> CC: Miklos Szeredi <[email protected]>
> CC: [email protected]
> CC: OGAWA Hirofumi <[email protected]>
> CC: Steven Whitehouse <[email protected]>
> CC: "Theodore Ts'o" <[email protected]>
> CC: [email protected]
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-04-18 00:55:34

by Chris Samuel

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On 17/04/12 15:10, Andreas Dilger wrote:

> (which IMHO would prevent nearly every Gnome system from freezing unless these
> apps have changed their behaviour in more recent releases).

It would also affect current KDE desktops as they tend to use MySQL for
Akonadi & Nepomuk, plus anyone using Chromium (and presumably Chrome):

samuel@eris:/tmp$ sudo lsof | grep deleted | awk '{print $1}' | sort |
uniq -c
[sudo] password for samuel:
32 chromium-
1 dovecot
5 imap-logi
5 mysqld

I would be surprised if you could find many systems that didn't have
files in this situation.
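
For readers who have not run into them, an open-but-unlinked file is easy to
produce. The helper below is a small sketch (the function name and the file
name template are made up for illustration): it creates a temporary file in
the given directory, unlinks it while keeping the descriptor open, and returns
the link count that fstat() reports. On typical local Linux filesystems that
count is 0 until the last descriptor is closed.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file in 'dir', unlink it while it is still open, and return the
 * link count seen through the open descriptor, or -1 on error.
 */
long open_unlinked_nlink(const char *dir)
{
	char path[4096];
	struct stat st;
	long nlink = -1;
	int fd;

	assert(dir != NULL);
	if (snprintf(path, sizeof(path), "%s/open-unlinked-XXXXXX", dir)
	    >= (int)sizeof(path))
		return -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* The name goes away here, but the inode lives on while fd is open. */
	if (unlink(path) == 0 && fstat(fd, &st) == 0)
		nlink = (long)st.st_nlink;

	close(fd);	/* last reference dropped: now the inode can be freed */
	return nlink;
}
```

While the descriptor is open, this is the kind of file that shows up as
"(deleted)" in lsof output.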

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

2012-04-17 00:48:38

by Dave Chinner

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On Mon, Apr 16, 2012 at 03:02:50PM -0700, Andreas Dilger wrote:
> On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > Another potential contention point might be patch 19. In that patch
> > we make freeze_super() refuse to freeze the filesystem when there
> > are open but unlinked files which may be impractical in some cases.
> > The main reason for this is the problem with handling of file deletion
> > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > then there's the fact that we cannot really force such filesystem
> > into a consistent state... But if people think that freezing with
> > open but unlinked files should happen, then I have some possible
> > solutions in mind (maybe as a separate patchset since this is
> > large enough).
>
> Looking at a desktop system, I think it is very typical that there
> are open-unlinked files present, so I don't know if this is really
> an acceptable solution. It isn't clear from your comments whether
> this is a blanket refusal for all open-unlinked files, or only in
> some particular cases...
>
> lsof | grep deleted
> nautilus 25393 adilger 19r REG 253,0 340 253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
> nautilus 25393 adilger 20r REG 253,0 32768 253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> gnome-ter 25623 adilger 22u REG 0,18 17841 2717846 /tmp/vtePIRJCW (deleted)
> gnome-ter 25623 adilger 23u REG 0,18 5568 2717847 /tmp/vteDCSJCW (deleted)
> gnome-ter 25623 adilger 29u REG 0,18 480 2728484 /tmp/vte6C1TCW (deleted)

Unlinked-but-open files are the reason that XFS dirties the log
after the freeze process is complete. This ensures that if the
system crashes while the filesystem is frozen then log recovery
during the next mount will process the unlinked (orphaned) inodes
and free them correctly. i.e. you can still freeze a filesystem with
inodes in this state successfully and have everything behave as
you'd expect.

I'm not sure how other filesystems handle this problem, but perhaps
pushing this check down into filesystem specific code or adding a
superblock feature flag might be a way to allow filesystems to
handle this case in the way they think is best...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-17 19:35:12

by Joel Becker

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On Tue, Apr 17, 2012 at 11:32:46AM +0200, Jan Kara wrote:
> On Mon 16-04-12 15:02:50, Andreas Dilger wrote:
> > On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > > Another potential contention point might be patch 19. In that patch
> > > we make freeze_super() refuse to freeze the filesystem when there
> > > are open but unlinked files which may be impractical in some cases.
> > > The main reason for this is the problem with handling of file deletion
> > > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > > then there's the fact that we cannot really force such filesystem
> > > into a consistent state... But if people think that freezing with
> > > open but unlinked files should happen, then I have some possible
> > > solutions in mind (maybe as a separate patchset since this is
> > > large enough).
> >
> > Looking at a desktop system, I think it is very typical that there
> > are open-unlinked files present, so I don't know if this is really
> > an acceptable solution. It isn't clear from your comments whether
> > this is a blanket refusal for all open-unlinked files, or only in
> > some particular cases...
> Thanks for looking at this. It is currently a blanket refusal. And I
> agree it's problematic. There are two problems with open but unlinked
> files.

Let me add my name to the chorus of "we have to handle freezing
with open+unlinked, we cannot assume they don't exist."

> One is that some old filesystems cannot get in a consistent state in
> presence of open but unlinked files but for filesystems we really care
> about - xfs, ext4, ext3, btrfs, or even ocfs2, gfs2 - that is not a real
> issue (these filesystems will delete those inodes on next mount read-write).

Others have pointed out that we can flag the safe filesystems.
I'd even be willing to say you can't freeze the unsafe filesystems.

> The other problem is with what should happen when you put last inode
> reference on a frozen filesystem. Two possibilities I see are:
>
> a) block the iput() call - that is inconvenient because it can be
> called in various contexts. I think we could possibly use the same level of
> freeze protection as for page fault (this has changed since I originally
> thought about this and that would make things simpler) but I'm not
> completely sure.

Given that frozen filesystems can stay that way for a while,
couldn't that lead to a million frozen df(1)s? It's like your average
NFS network failure.

> b) let the iput finish but filesystem will keep inode on its orphan list
> (or its equivalent) and the inode will be deleted after the filesystem is
> thawed. The advantage of this is we don't have to block iput(), the
> disadvantage is we have to have filesystem support and not all filesystems
> can do this.

Perhaps we handle iput() like unlinked. If the filesystem can
handle it, we allow it, otherwise we block.

Joel

>
> Any thoughts?
>
> Honza
> >
> > lsof | grep deleted
> > nautilus 25393 adilger 19r REG 253,0 340 253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
> > nautilus 25393 adilger 20r REG 253,0 32768 253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> > gnome-ter 25623 adilger 22u REG 0,18 17841 2717846 /tmp/vtePIRJCW (deleted)
> > gnome-ter 25623 adilger 23u REG 0,18 5568 2717847 /tmp/vteDCSJCW (deleted)
> > gnome-ter 25623 adilger 29u REG 0,18 480 2728484 /tmp/vte6C1TCW (deleted)
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

--

"The first requisite of a good citizen in this republic of ours
is that he shall be able and willing to pull his weight."
- Theodore Roosevelt

http://www.jlbec.org/
[email protected]

2012-04-17 09:32:59

by Jan Kara

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On Mon 16-04-12 15:02:50, Andreas Dilger wrote:
> On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> > Another potential contention point might be patch 19. In that patch
> > we make freeze_super() refuse to freeze the filesystem when there
> > are open but unlinked files which may be impractical in some cases.
> > The main reason for this is the problem with handling of file deletion
> > from fput() called with mmap_sem held (e.g. from munmap(2)), and
> > then there's the fact that we cannot really force such filesystem
> > into a consistent state... But if people think that freezing with
> > open but unlinked files should happen, then I have some possible
> > solutions in mind (maybe as a separate patchset since this is
> > large enough).
>
> Looking at a desktop system, I think it is very typical that there
> are open-unlinked files present, so I don't know if this is really
> an acceptable solution. It isn't clear from your comments whether
> this is a blanket refusal for all open-unlinked files, or only in
> some particular cases...
Thanks for looking at this. It is currently a blanket refusal. And I
agree it's problematic. There are two problems with open but unlinked
files.

  One is that some old filesystems cannot get into a consistent state in the
presence of open but unlinked files, but for filesystems we really care
about - xfs, ext4, ext3, btrfs, or even ocfs2, gfs2 - that is not a real
issue (these filesystems will delete those inodes on next mount read-write).

  The other problem is with what should happen when you put the last inode
reference on a frozen filesystem. Two possibilities I see are:

a) block the iput() call - that is inconvenient because it can be
called in various contexts. I think we could possibly use the same level of
freeze protection as for page fault (this has changed since I originally
thought about this and that would make things simpler) but I'm not
completely sure.

  b) let the iput() finish but the filesystem will keep the inode on its orphan
list (or its equivalent) and the inode will be deleted after the filesystem is
thawed. The advantage of this is we don't have to block iput(), the
disadvantage is we have to have filesystem support and not all filesystems
can do this.
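
Option b) can be modelled in a few lines of userspace C. This is only a sketch
of the idea under simplifying assumptions: fs_model, model_iput_final(),
model_freeze() and model_thaw() are hypothetical names, the orphan list is a
fixed-size in-memory array, and a real filesystem would presumably keep the
list on disk so that it survives a crash while frozen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORPHANS 16

/* Hypothetical model of a filesystem with an orphan list. */
struct fs_model {
	bool frozen;
	unsigned long orphans[MAX_ORPHANS];	/* inodes awaiting deletion */
	size_t n_orphans;
	size_t n_deleted;	/* deletions actually carried out */
};

/*
 * Last reference to an unlinked inode is dropped. When not frozen the inode
 * is deleted right away; when frozen it is only queued, so nothing is
 * modified. Returns 0, or -1 if the model's list is full.
 */
int model_iput_final(struct fs_model *fs, unsigned long ino)
{
	if (!fs->frozen) {
		fs->n_deleted++;
		return 0;
	}
	if (fs->n_orphans == MAX_ORPHANS)
		return -1;
	fs->orphans[fs->n_orphans++] = ino;
	return 0;
}

void model_freeze(struct fs_model *fs)
{
	fs->frozen = true;
}

/* Thaw and carry out the deferred deletions; returns how many there were. */
size_t model_thaw(struct fs_model *fs)
{
	size_t processed = fs->n_orphans;

	assert(fs->frozen);
	fs->frozen = false;
	fs->n_deleted += processed;
	fs->n_orphans = 0;
	return processed;
}
```

The caller of model_iput_final() never blocks. The cost is that every
filesystem has to provide something playing the role of the orphan list.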

Any thoughts?

Honza
>
> lsof | grep deleted
> nautilus 25393 adilger 19r REG 253,0 340 253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
> nautilus 25393 adilger 20r REG 253,0 32768 253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
> gnome-ter 25623 adilger 22u REG 0,18 17841 2717846 /tmp/vtePIRJCW (deleted)
> gnome-ter 25623 adilger 23u REG 0,18 5568 2717847 /tmp/vteDCSJCW (deleted)
> gnome-ter 25623 adilger 29u REG 0,18 480 2728484 /tmp/vte6C1TCW (deleted)

--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-04-16 21:59:58

by Andreas Dilger

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> Another potential contention point might be patch 19. In that patch
> we make freeze_super() refuse to freeze the filesystem when there
> are open but unlinked files which may be impractical in some cases.
> The main reason for this is the problem with handling of file deletion
> from fput() called with mmap_sem held (e.g. from munmap(2)), and
> then there's the fact that we cannot really force such filesystem
> into a consistent state... But if people think that freezing with
> open but unlinked files should happen, then I have some possible
> solutions in mind (maybe as a separate patchset since this is
> large enough).

Looking at a desktop system, I think it is very typical that there
are open-unlinked files present, so I don't know if this is really
an acceptable solution. It isn't clear from your comments whether
this is a blanket refusal for all open-unlinked files, or only in
some particular cases...

lsof | grep deleted
nautilus 25393 adilger 19r REG 253,0 340 253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
nautilus 25393 adilger 20r REG 253,0 32768 253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
gnome-ter 25623 adilger 22u REG 0,18 17841 2717846 /tmp/vtePIRJCW (deleted)
gnome-ter 25623 adilger 23u REG 0,18 5568 2717847 /tmp/vteDCSJCW (deleted)
gnome-ter 25623 adilger 29u REG 0,18 480 2728484 /tmp/vte6C1TCW (deleted)

Cheers, Andreas

2012-04-17 08:19:24

by Jan Kara

Subject: Re: [PATCH 12/27] nfsd: Push mnt_want_write() outside of i_mutex

On Mon 16-04-12 14:25:33, J. Bruce Fields wrote:
> On Mon, Apr 16, 2012 at 06:13:50PM +0200, Jan Kara wrote:
> > When mnt_want_write() starts to handle freezing it will get full lock
> > semantics requiring proper lock ordering. So push the mnt_want_write() call
> > consistently outside of i_mutex.
>
> How are you testing this?
  That's a good question :). When I wrote this, I ran an NFS server,
mounted it, and checked that I could do basic operations without any lockdep
warning being emitted. If you have a better idea of what test I should run, I
can do that.

> And do you want this particular track merged for 3.5 through the nfsd
> tree, or should it go some other way?
  My original intent was to let Al merge everything to make it easy. But
if you wish to take this patch via your tree, that should be possible (it does
not depend on anything, just the rest of the series depends on it due to
lock ordering constraints).

Honza

> > CC: [email protected]
> > CC: "J. Bruce Fields" <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/nfsd/nfs4recover.c | 9 +++--
> > fs/nfsd/nfsfh.c | 1 +
> > fs/nfsd/nfsproc.c | 9 ++++-
> > fs/nfsd/vfs.c | 79 ++++++++++++++++++++++---------------------
> > fs/nfsd/vfs.h | 11 +++++-
> > include/linux/nfsd/nfsfh.h | 1 +
> > 6 files changed, 64 insertions(+), 46 deletions(-)
> >
> > diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> > index 4767429..efa7574 100644
> > --- a/fs/nfsd/nfs4recover.c
> > +++ b/fs/nfsd/nfs4recover.c
> > @@ -154,6 +154,10 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
> > if (status < 0)
> > return;
> >
> > + status = mnt_want_write_file(rec_file);
> > + if (status)
> > + return;
> > +
> > dir = rec_file->f_path.dentry;
> > /* lock the parent */
> > mutex_lock(&dir->d_inode->i_mutex);
> > @@ -173,11 +177,7 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
> > * as well be forgiving and just succeed silently.
> > */
> > goto out_put;
> > - status = mnt_want_write_file(rec_file);
> > - if (status)
> > - goto out_put;
> > status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
> > - mnt_drop_write_file(rec_file);
> > out_put:
> > dput(dentry);
> > out_unlock:
> > @@ -189,6 +189,7 @@ out_unlock:
> > " (err %d); please check that %s exists"
> > " and is writeable", status,
> > user_recovery_dirname);
> > + mnt_drop_write_file(rec_file);
> > nfs4_reset_creds(original_cred);
> > }
> >
> > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > index 68454e7..8b93353 100644
> > --- a/fs/nfsd/nfsfh.c
> > +++ b/fs/nfsd/nfsfh.c
> > @@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
> > fhp->fh_post_saved = 0;
> > #endif
> > }
> > + fh_drop_write(fhp);
> > if (exp) {
> > cache_put(&exp->h, &svc_export_cache);
> > fhp->fh_export = NULL;
> > diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> > index e15dc45..aad6d45 100644
> > --- a/fs/nfsd/nfsproc.c
> > +++ b/fs/nfsd/nfsproc.c
> > @@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> > struct dentry *dchild;
> > int type, mode;
> > __be32 nfserr;
> > + int hosterr;
> > dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);
> >
> > dprintk("nfsd: CREATE %s %.*s\n",
> > @@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> > nfserr = nfserr_exist;
> > if (isdotent(argp->name, argp->len))
> > goto done;
> > + hosterr = fh_want_write(dirfhp);
> > + if (hosterr) {
> > + nfserr = nfserrno(hosterr);
> > + goto done;
> > + }
> > +
> > fh_lock_nested(dirfhp, I_MUTEX_PARENT);
> > dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
> > if (IS_ERR(dchild)) {
> > @@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> > out_unlock:
> > /* We don't really need to unlock, as fh_put does it. */
> > fh_unlock(dirfhp);
> > -
> > + fh_drop_write(dirfhp);
> > done:
> > fh_put(dirfhp);
> > return nfsd_return_dirop(nfserr, resp);
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 296d671..b8bb649 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1276,6 +1276,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > * If it has, the parent directory should already be locked.
> > */
> > if (!resfhp->fh_dentry) {
> > + host_err = fh_want_write(fhp);
> > + if (host_err)
> > + goto out_nfserr;
> > +
> > /* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
> > fh_lock_nested(fhp, I_MUTEX_PARENT);
> > dchild = lookup_one_len(fname, dentry, flen);
> > @@ -1319,14 +1323,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > goto out;
> > }
> >
> > - host_err = fh_want_write(fhp);
> > - if (host_err)
> > - goto out_nfserr;
> > -
> > /*
> > * Get the dir op function pointer.
> > */
> > err = 0;
> > + host_err = 0;
> > switch (type) {
> > case S_IFREG:
> > host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
> > @@ -1343,10 +1344,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
> > break;
> > }
> > - if (host_err < 0) {
> > - fh_drop_write(fhp);
> > + if (host_err < 0)
> > goto out_nfserr;
> > - }
> >
> > err = nfsd_create_setattr(rqstp, resfhp, iap);
> >
> > @@ -1358,7 +1357,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > err2 = nfserrno(commit_metadata(fhp));
> > if (err2)
> > err = err2;
> > - fh_drop_write(fhp);
> > /*
> > * Update the file handle to get the new inode info.
> > */
> > @@ -1417,6 +1415,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > err = nfserr_notdir;
> > if (!dirp->i_op->lookup)
> > goto out;
> > +
> > + host_err = fh_want_write(fhp);
> > + if (host_err)
> > + goto out_nfserr;
> > +
> > fh_lock_nested(fhp, I_MUTEX_PARENT);
> >
> > /*
> > @@ -1449,9 +1452,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > v_atime = verifier[1]&0x7fffffff;
> > }
> >
> > - host_err = fh_want_write(fhp);
> > - if (host_err)
> > - goto out_nfserr;
> > if (dchild->d_inode) {
> > err = 0;
> >
> > @@ -1522,7 +1522,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > if (!err)
> > err = nfserrno(commit_metadata(fhp));
> >
> > - fh_drop_write(fhp);
> > /*
> > * Update the filehandle to get the new inode info.
> > */
> > @@ -1533,6 +1532,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > fh_unlock(fhp);
> > if (dchild && !IS_ERR(dchild))
> > dput(dchild);
> > + fh_drop_write(fhp);
> > return err;
> >
> > out_nfserr:
> > @@ -1613,6 +1613,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> > if (err)
> > goto out;
> > +
> > + host_err = fh_want_write(fhp);
> > + if (host_err)
> > + goto out_nfserr;
> > +
> > fh_lock(fhp);
> > dentry = fhp->fh_dentry;
> > dnew = lookup_one_len(fname, dentry, flen);
> > @@ -1620,10 +1625,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > if (IS_ERR(dnew))
> > goto out_nfserr;
> >
> > - host_err = fh_want_write(fhp);
> > - if (host_err)
> > - goto out_nfserr;
> > -
> > if (unlikely(path[plen] != 0)) {
> > char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
> > if (path_alloced == NULL)
> > @@ -1683,6 +1684,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> > if (isdotent(name, len))
> > goto out;
> >
> > + host_err = fh_want_write(tfhp);
> > + if (host_err) {
> > + err = nfserrno(host_err);
> > + goto out;
> > + }
> > +
> > fh_lock_nested(ffhp, I_MUTEX_PARENT);
> > ddir = ffhp->fh_dentry;
> > dirp = ddir->d_inode;
> > @@ -1694,18 +1701,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> >
> > dold = tfhp->fh_dentry;
> >
> > - host_err = fh_want_write(tfhp);
> > - if (host_err) {
> > - err = nfserrno(host_err);
> > - goto out_dput;
> > - }
> > err = nfserr_noent;
> > if (!dold->d_inode)
> > - goto out_drop_write;
> > + goto out_dput;
> > host_err = nfsd_break_lease(dold->d_inode);
> > if (host_err) {
> > err = nfserrno(host_err);
> > - goto out_drop_write;
> > + goto out_dput;
> > }
> > host_err = vfs_link(dold, dirp, dnew);
> > if (!host_err) {
> > @@ -1718,12 +1720,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> > else
> > err = nfserrno(host_err);
> > }
> > -out_drop_write:
> > - fh_drop_write(tfhp);
> > out_dput:
> > dput(dnew);
> > out_unlock:
> > fh_unlock(ffhp);
> > + fh_drop_write(tfhp);
> > out:
> > return err;
> >
> > @@ -1766,6 +1767,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> > if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
> > goto out;
> >
> > + host_err = fh_want_write(ffhp);
> > + if (host_err) {
> > + err = nfserrno(host_err);
> > + goto out;
> > + }
> > +
> > /* cannot use fh_lock as we need deadlock protective ordering
> > * so do it by hand */
> > trap = lock_rename(tdentry, fdentry);
> > @@ -1796,17 +1803,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> > host_err = -EXDEV;
> > if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
> > goto out_dput_new;
> > - host_err = fh_want_write(ffhp);
> > - if (host_err)
> > - goto out_dput_new;
> >
> > host_err = nfsd_break_lease(odentry->d_inode);
> > if (host_err)
> > - goto out_drop_write;
> > + goto out_dput_new;
> > if (ndentry->d_inode) {
> > host_err = nfsd_break_lease(ndentry->d_inode);
> > if (host_err)
> > - goto out_drop_write;
> > + goto out_dput_new;
> > }
> > host_err = vfs_rename(fdir, odentry, tdir, ndentry);
> > if (!host_err) {
> > @@ -1814,8 +1818,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> > if (!host_err)
> > host_err = commit_metadata(ffhp);
> > }
> > -out_drop_write:
> > - fh_drop_write(ffhp);
> > out_dput_new:
> > dput(ndentry);
> > out_dput_old:
> > @@ -1831,6 +1833,7 @@ out_drop_write:
> > fill_post_wcc(tfhp);
> > unlock_rename(tdentry, fdentry);
> > ffhp->fh_locked = tfhp->fh_locked = 0;
> > + fh_drop_write(ffhp);
> >
> > out:
> > return err;
> > @@ -1856,6 +1859,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> > if (err)
> > goto out;
> >
> > + host_err = fh_want_write(fhp);
> > + if (host_err)
> > + goto out_nfserr;
> > +
> > fh_lock_nested(fhp, I_MUTEX_PARENT);
> > dentry = fhp->fh_dentry;
> > dirp = dentry->d_inode;
> > @@ -1874,21 +1881,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> > if (!type)
> > type = rdentry->d_inode->i_mode & S_IFMT;
> >
> > - host_err = fh_want_write(fhp);
> > - if (host_err)
> > - goto out_put;
> > -
> > host_err = nfsd_break_lease(rdentry->d_inode);
> > if (host_err)
> > - goto out_drop_write;
> > + goto out_put;
> > if (type != S_IFDIR)
> > host_err = vfs_unlink(dirp, rdentry);
> > else
> > host_err = vfs_rmdir(dirp, rdentry);
> > if (!host_err)
> > host_err = commit_metadata(fhp);
> > -out_drop_write:
> > - fh_drop_write(fhp);
> > out_put:
> > dput(rdentry);
> >
> > diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> > index ec0611b..359594c 100644
> > --- a/fs/nfsd/vfs.h
> > +++ b/fs/nfsd/vfs.h
> > @@ -110,12 +110,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);
> >
> > static inline int fh_want_write(struct svc_fh *fh)
> > {
> > - return mnt_want_write(fh->fh_export->ex_path.mnt);
> > + int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
> > +
> > + if (!ret)
> > + fh->fh_want_write = 1;
> > + return ret;
> > }
> >
> > static inline void fh_drop_write(struct svc_fh *fh)
> > {
> > - mnt_drop_write(fh->fh_export->ex_path.mnt);
> > + if (fh->fh_want_write) {
> > + fh->fh_want_write = 0;
> > + mnt_drop_write(fh->fh_export->ex_path.mnt);
> > + }
> > }
> >
> > #endif /* LINUX_NFSD_VFS_H */
> > diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
> > index ce4743a..fa63048 100644
> > --- a/include/linux/nfsd/nfsfh.h
> > +++ b/include/linux/nfsd/nfsfh.h
> > @@ -143,6 +143,7 @@ typedef struct svc_fh {
> > int fh_maxsize; /* max size for fh_handle */
> >
> > unsigned char fh_locked; /* inode locked by us */
> > + unsigned char fh_want_write; /* remount protection taken */
> >
> > #ifdef CONFIG_NFSD_V3
> > unsigned char fh_post_saved; /* post-op attrs saved */
> > --
> > 1.7.1
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-04-16 18:25:42

by J. Bruce Fields

Subject: Re: [PATCH 12/27] nfsd: Push mnt_want_write() outside of i_mutex

On Mon, Apr 16, 2012 at 06:13:50PM +0200, Jan Kara wrote:
> When mnt_want_write() starts to handle freezing it will get a full lock
> semantics requiring proper lock ordering. So push mnt_want_write() call
> consistently outside of i_mutex.
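The ordering described in the changelog above, together with the flag the patch adds to struct svc_fh, can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: mount_sim, handle_sim and all the functions below are hypothetical stand-ins, not kernel interfaces, and the "lock" is a plain flag rather than a real mutex.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace stand-ins for a mount and an nfsd file handle. */
struct mount_sim {
    int readonly;   /* 1 if the mount refuses writers */
    int writers;    /* number of write references currently held */
};

struct handle_sim {
    struct mount_sim *mnt;
    unsigned char want_write;   /* mirrors fh_want_write in the patch */
    unsigned char locked;       /* stands in for i_mutex being held */
};

/* Take a write reference on the mount, or fail if it is read-only. */
static int mount_want_write(struct mount_sim *m)
{
    if (m->readonly)
        return -EROFS;
    m->writers++;
    return 0;
}

static void mount_drop_write(struct mount_sim *m)
{
    assert(m->writers > 0);
    m->writers--;
}

/* Remember in the handle that write access was taken. */
static int handle_want_write(struct handle_sim *h)
{
    int ret = mount_want_write(h->mnt);

    if (!ret)
        h->want_write = 1;
    return ret;
}

/* Safe to call on any exit path: drops the reference at most once. */
static void handle_drop_write(struct handle_sim *h)
{
    if (h->want_write) {
        h->want_write = 0;
        mount_drop_write(h->mnt);
    }
}

/* Write access is taken before the lock and dropped after unlocking. */
static int handle_dir_op(struct handle_sim *h)
{
    int err = handle_want_write(h);

    if (err)
        return err;
    h->locked = 1;
    /* ... the directory modification would happen here ... */
    h->locked = 0;
    handle_drop_write(h);
    return 0;
}
```

Because the drop is guarded by the flag, a common exit path (such as fh_put() in the patch) can call it unconditionally without unbalancing the writer count.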

How are you testing this? And do you want this particular patch merged
for 3.5 through the nfsd tree, or should it go some other way?

--b.

>
> CC: [email protected]
> CC: "J. Bruce Fields" <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/nfsd/nfs4recover.c | 9 +++--
> fs/nfsd/nfsfh.c | 1 +
> fs/nfsd/nfsproc.c | 9 ++++-
> fs/nfsd/vfs.c | 79 ++++++++++++++++++++++---------------------
> fs/nfsd/vfs.h | 11 +++++-
> include/linux/nfsd/nfsfh.h | 1 +
> 6 files changed, 64 insertions(+), 46 deletions(-)
>
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 4767429..efa7574 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -154,6 +154,10 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
> if (status < 0)
> return;
>
> + status = mnt_want_write_file(rec_file);
> + if (status)
> + return;
> +
> dir = rec_file->f_path.dentry;
> /* lock the parent */
> mutex_lock(&dir->d_inode->i_mutex);
> @@ -173,11 +177,7 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
> * as well be forgiving and just succeed silently.
> */
> goto out_put;
> - status = mnt_want_write_file(rec_file);
> - if (status)
> - goto out_put;
> status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
> - mnt_drop_write_file(rec_file);
> out_put:
> dput(dentry);
> out_unlock:
> @@ -189,6 +189,7 @@ out_unlock:
> " (err %d); please check that %s exists"
> " and is writeable", status,
> user_recovery_dirname);
> + mnt_drop_write_file(rec_file);
> nfs4_reset_creds(original_cred);
> }
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 68454e7..8b93353 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
> fhp->fh_post_saved = 0;
> #endif
> }
> + fh_drop_write(fhp);
> if (exp) {
> cache_put(&exp->h, &svc_export_cache);
> fhp->fh_export = NULL;
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index e15dc45..aad6d45 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> struct dentry *dchild;
> int type, mode;
> __be32 nfserr;
> + int hosterr;
> dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);
>
> dprintk("nfsd: CREATE %s %.*s\n",
> @@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> nfserr = nfserr_exist;
> if (isdotent(argp->name, argp->len))
> goto done;
> + hosterr = fh_want_write(dirfhp);
> + if (hosterr) {
> + nfserr = nfserrno(hosterr);
> + goto done;
> + }
> +
> fh_lock_nested(dirfhp, I_MUTEX_PARENT);
> dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
> if (IS_ERR(dchild)) {
> @@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
> out_unlock:
> /* We don't really need to unlock, as fh_put does it. */
> fh_unlock(dirfhp);
> -
> + fh_drop_write(dirfhp);
> done:
> fh_put(dirfhp);
> return nfsd_return_dirop(nfserr, resp);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 296d671..b8bb649 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1276,6 +1276,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * If it has, the parent directory should already be locked.
> */
> if (!resfhp->fh_dentry) {
> + host_err = fh_want_write(fhp);
> + if (host_err)
> + goto out_nfserr;
> +
> /* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
> fh_lock_nested(fhp, I_MUTEX_PARENT);
> dchild = lookup_one_len(fname, dentry, flen);
> @@ -1319,14 +1323,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> goto out;
> }
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> -
> /*
> * Get the dir op function pointer.
> */
> err = 0;
> + host_err = 0;
> switch (type) {
> case S_IFREG:
> host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
> @@ -1343,10 +1344,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
> break;
> }
> - if (host_err < 0) {
> - fh_drop_write(fhp);
> + if (host_err < 0)
> goto out_nfserr;
> - }
>
> err = nfsd_create_setattr(rqstp, resfhp, iap);
>
> @@ -1358,7 +1357,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> err2 = nfserrno(commit_metadata(fhp));
> if (err2)
> err = err2;
> - fh_drop_write(fhp);
> /*
> * Update the file handle to get the new inode info.
> */
> @@ -1417,6 +1415,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> err = nfserr_notdir;
> if (!dirp->i_op->lookup)
> goto out;
> +
> + host_err = fh_want_write(fhp);
> + if (host_err)
> + goto out_nfserr;
> +
> fh_lock_nested(fhp, I_MUTEX_PARENT);
>
> /*
> @@ -1449,9 +1452,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> v_atime = verifier[1]&0x7fffffff;
> }
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> if (dchild->d_inode) {
> err = 0;
>
> @@ -1522,7 +1522,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (!err)
> err = nfserrno(commit_metadata(fhp));
>
> - fh_drop_write(fhp);
> /*
> * Update the filehandle to get the new inode info.
> */
> @@ -1533,6 +1532,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> fh_unlock(fhp);
> if (dchild && !IS_ERR(dchild))
> dput(dchild);
> + fh_drop_write(fhp);
> return err;
>
> out_nfserr:
> @@ -1613,6 +1613,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> if (err)
> goto out;
> +
> + host_err = fh_want_write(fhp);
> + if (host_err)
> + goto out_nfserr;
> +
> fh_lock(fhp);
> dentry = fhp->fh_dentry;
> dnew = lookup_one_len(fname, dentry, flen);
> @@ -1620,10 +1625,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (IS_ERR(dnew))
> goto out_nfserr;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> -
> if (unlikely(path[plen] != 0)) {
> char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
> if (path_alloced == NULL)
> @@ -1683,6 +1684,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> if (isdotent(name, len))
> goto out;
>
> + host_err = fh_want_write(tfhp);
> + if (host_err) {
> + err = nfserrno(host_err);
> + goto out;
> + }
> +
> fh_lock_nested(ffhp, I_MUTEX_PARENT);
> ddir = ffhp->fh_dentry;
> dirp = ddir->d_inode;
> @@ -1694,18 +1701,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
>
> dold = tfhp->fh_dentry;
>
> - host_err = fh_want_write(tfhp);
> - if (host_err) {
> - err = nfserrno(host_err);
> - goto out_dput;
> - }
> err = nfserr_noent;
> if (!dold->d_inode)
> - goto out_drop_write;
> + goto out_dput;
> host_err = nfsd_break_lease(dold->d_inode);
> if (host_err) {
> err = nfserrno(host_err);
> - goto out_drop_write;
> + goto out_dput;
> }
> host_err = vfs_link(dold, dirp, dnew);
> if (!host_err) {
> @@ -1718,12 +1720,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> else
> err = nfserrno(host_err);
> }
> -out_drop_write:
> - fh_drop_write(tfhp);
> out_dput:
> dput(dnew);
> out_unlock:
> fh_unlock(ffhp);
> + fh_drop_write(tfhp);
> out:
> return err;
>
> @@ -1766,6 +1767,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
> goto out;
>
> + host_err = fh_want_write(ffhp);
> + if (host_err) {
> + err = nfserrno(host_err);
> + goto out;
> + }
> +
> /* cannot use fh_lock as we need deadlock protective ordering
> * so do it by hand */
> trap = lock_rename(tdentry, fdentry);
> @@ -1796,17 +1803,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> host_err = -EXDEV;
> if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
> goto out_dput_new;
> - host_err = fh_want_write(ffhp);
> - if (host_err)
> - goto out_dput_new;
>
> host_err = nfsd_break_lease(odentry->d_inode);
> if (host_err)
> - goto out_drop_write;
> + goto out_dput_new;
> if (ndentry->d_inode) {
> host_err = nfsd_break_lease(ndentry->d_inode);
> if (host_err)
> - goto out_drop_write;
> + goto out_dput_new;
> }
> host_err = vfs_rename(fdir, odentry, tdir, ndentry);
> if (!host_err) {
> @@ -1814,8 +1818,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> if (!host_err)
> host_err = commit_metadata(ffhp);
> }
> -out_drop_write:
> - fh_drop_write(ffhp);
> out_dput_new:
> dput(ndentry);
> out_dput_old:
> @@ -1831,6 +1833,7 @@ out_drop_write:
> fill_post_wcc(tfhp);
> unlock_rename(tdentry, fdentry);
> ffhp->fh_locked = tfhp->fh_locked = 0;
> + fh_drop_write(ffhp);
>
> out:
> return err;
> @@ -1856,6 +1859,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> if (err)
> goto out;
>
> + host_err = fh_want_write(fhp);
> + if (host_err)
> + goto out_nfserr;
> +
> fh_lock_nested(fhp, I_MUTEX_PARENT);
> dentry = fhp->fh_dentry;
> dirp = dentry->d_inode;
> @@ -1874,21 +1881,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> if (!type)
> type = rdentry->d_inode->i_mode & S_IFMT;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_put;
> -
> host_err = nfsd_break_lease(rdentry->d_inode);
> if (host_err)
> - goto out_drop_write;
> + goto out_put;
> if (type != S_IFDIR)
> host_err = vfs_unlink(dirp, rdentry);
> else
> host_err = vfs_rmdir(dirp, rdentry);
> if (!host_err)
> host_err = commit_metadata(fhp);
> -out_drop_write:
> - fh_drop_write(fhp);
> out_put:
> dput(rdentry);
>
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index ec0611b..359594c 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -110,12 +110,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);
>
> static inline int fh_want_write(struct svc_fh *fh)
> {
> - return mnt_want_write(fh->fh_export->ex_path.mnt);
> + int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
> +
> + if (!ret)
> + fh->fh_want_write = 1;
> + return ret;
> }
>
> static inline void fh_drop_write(struct svc_fh *fh)
> {
> - mnt_drop_write(fh->fh_export->ex_path.mnt);
> + if (fh->fh_want_write) {
> + fh->fh_want_write = 0;
> + mnt_drop_write(fh->fh_export->ex_path.mnt);
> + }
> }
>
> #endif /* LINUX_NFSD_VFS_H */
> diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
> index ce4743a..fa63048 100644
> --- a/include/linux/nfsd/nfsfh.h
> +++ b/include/linux/nfsd/nfsfh.h
> @@ -143,6 +143,7 @@ typedef struct svc_fh {
> int fh_maxsize; /* max size for fh_handle */
>
> unsigned char fh_locked; /* inode locked by us */
> + unsigned char fh_want_write; /* remount protection taken */
>
> #ifdef CONFIG_NFSD_V3
> unsigned char fh_post_saved; /* post-op attrs saved */
> --
> 1.7.1
>

2012-04-17 05:10:02

by Andreas Dilger

Subject: Re: [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On 2012-04-16, at 5:43 PM, Dave Chinner wrote:
> On Mon, Apr 16, 2012 at 03:02:50PM -0700, Andreas Dilger wrote:
>> On 2012-04-16, at 9:13 AM, Jan Kara wrote:
>>> Another potential contention point might be patch 19. In that patch
>>> we make freeze_super() refuse to freeze the filesystem when there
>>> are open but unlinked files which may be impractical in some cases.
>>> The main reason for this is the problem with handling of file
>>> deletion from fput() called with mmap_sem held (e.g. from munmap(2)),
>>> and then there's the fact that we cannot really force such filesystem
>>> into a consistent state... But if people think that freezing with
>>> open but unlinked files should happen, then I have some possible
>>> solutions in mind (maybe as a separate patchset since this is
>>> large enough).
>>
>> Looking at a desktop system, I think it is very typical that there
>> are open-unlinked files present, so I don't know if this is really
>> an acceptable solution. It isn't clear from your comments whether
>> this is a blanket refusal for all open-unlinked files, or only in
>> some particular cases...
>
> Unlinked-but-open files are the reason that XFS dirties the log
> after the freeze process is complete. This ensures that if the
> system crashes while the filesystem is frozen then log recovery
> during the next mount will process the unlinked (orphaned) inodes
> and free them correctly, i.e. you can still freeze a filesystem with
> inodes in this state successfully and have everything behave as
> you'd expect.
>
> I'm not sure how other filesystems handle this problem, but perhaps
> pushing this check down into filesystem specific code or adding a
> superblock feature flag might be a way to allow filesystems to
> handle this case in the way they think is best...

The ext3/4 code has long been able to handle open-unlinked files
properly after a crash (they are put into a singly-linked list from
the superblock on disk that is processed after journal recovery).
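
To illustrate the idea (not the actual ext3/4 on-disk format), here is a minimal userspace sketch of a singly-linked orphan list headed in the superblock. All names (sim_sb, sim_inode, orphan_add, orphan_cleanup) are hypothetical, and inode number 0 is assumed to be the list terminator.

```c
#include <assert.h>

#define SIM_NINODES 16  /* hypothetical tiny inode table */

struct sim_inode {
    unsigned next_orphan;   /* next orphan inode number, 0 ends the list */
    int in_use;             /* 1 while the inode still owns blocks */
};

struct sim_sb {
    unsigned last_orphan;   /* head of the orphan list, 0 if empty */
    struct sim_inode itable[SIM_NINODES];
};

/* Called when an open file is unlinked: push the inode on the list. */
static void orphan_add(struct sim_sb *sb, unsigned ino)
{
    assert(ino > 0 && ino < SIM_NINODES);
    sb->itable[ino].next_orphan = sb->last_orphan;
    sb->last_orphan = ino;
}

/*
 * Called after (simulated) journal recovery: walk the list from the
 * superblock and free every orphan. Returns the number of inodes freed.
 */
static int orphan_cleanup(struct sim_sb *sb)
{
    int freed = 0;

    while (sb->last_orphan != 0) {
        struct sim_inode *inode = &sb->itable[sb->last_orphan];

        sb->last_orphan = inode->next_orphan;
        inode->next_orphan = 0;
        inode->in_use = 0;
        freed++;
    }
    return freed;
}
```

The point of the sketch is that the list head lives in the superblock, so a mount after a crash can find and release the orphans without scanning the whole inode table.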

The issue here is that blocking freeze from succeeding with open-
unlinked files is an unreasonable assumption of this patch, and
I don't think it is acceptable to land this patchset (which IMHO
would prevent nearly every Gnome system from freezing unless these
apps have changed their behaviour in more recent releases).

Like you suggest, filesystems that handle this correctly should be
able to flag or otherwise indicate that this is OK, and allow the
freeze to continue. For other filesystems that do not handle
open-unlinked file consistency during a filesystem freeze/snapshot,
the question is whether this should even be considered a new case,
or is something that has existed for ages already.
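
As a rough sketch of the flag idea, assuming a hypothetical per-filesystem flag (SIM_FS_FREEZE_WITH_ORPHANS) and a hypothetical count of open-unlinked files in the superblock, the freeze path could make the check conditional like this. None of these names exist in the kernel, and returning -EBUSY for the refusal is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag: filesystem can recover open-unlinked files itself. */
#define SIM_FS_FREEZE_WITH_ORPHANS 0x1u

struct sim_super {
    unsigned flags;      /* filesystem capability flags */
    int open_unlinked;   /* number of open-but-unlinked files */
    int frozen;          /* 1 once the freeze has succeeded */
};

/*
 * Refuse to freeze with open-unlinked files unless the filesystem has
 * declared that it handles them (e.g. via an on-disk orphan list).
 */
static int sim_freeze(struct sim_super *sb)
{
    if (sb->frozen)
        return -EBUSY;
    if (sb->open_unlinked > 0 &&
        !(sb->flags & SIM_FS_FREEZE_WITH_ORPHANS))
        return -EBUSY;
    sb->frozen = 1;
    return 0;
}
```

With this shape the refusal becomes an opt-out default rather than a blanket rule, which is the behaviour argued for above.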

The other question is whether this is still a problem even for
filesystems handling the consistency issue, but from Jan's comment
above there is a new locking issue related to mmap_sem being added?

Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Lustre Engineer http://www.whamcloud.com/