From: "J. Bruce Fields" <[email protected]>
Older NFS clients can only find out about changes to a file by polling.
NFSv4 instead allows a client to get a "delegation" on a file and then
operate on a file without talking to the server. Consistency is
maintained by requiring the server to recall the delegation before
allowing any conflicting operation.
But our kfsd doesn't always recall delegations on conflicting operations
from local (non-NFS) users of an exported filesystem.
This patch series fixes that by defining a new lock type to represent a
delegation. This new lock type isn't available to userspace for
now--nfsd is the only user.
Delegations come in both "read" and "write" flavors, but I'm only
implementing read delegations for now.
Delegations are similar to leases. The main difference is that
delegations need to be broken on any operation that changes a file's
attributes or the set of links pointing to it (like link, unlink, and
rename).
Such operations take several locks (including at least one i_mutex on a
directory). Delegations can take a long time (about a minute) to recall
when NFS clients are unresponsive. To avoid blocking a lot of unrelated
operations, this version of the patches drops all of those locks before
waiting.
To that end, I'm passing an extra inode ** to functions like vfs_unlink.
When vfs_unlink finds a delegation it can then pass back the offending
inode so that the caller can wait for the delegation recall. If the
caller passes in NULL, then it instead gets -EWOULDBLOCK on encountering
a delegation.
In fact, callers outside the vfs are mostly passing in NULL:
- nfsd wants to imediately return NFS4ERR_DELAY to callers on
encountering a delegation, instead of waiting.
- I assume that anyone exporting the fileystem underlying an
ecryptfs mount is making a mistake, and that it's better to
return an error than to wait.
- Ditto for fscache.
But those other callers could be taught to wait as well if necessary.
At least in the link case, these patches may currently have the same
problem Jeff Layton's ESTALE patches hit (some code paths that look safe
to retry aren't, due to some audit code).
I think this is about the 4th version posted. The previous version
didn't drop locks before waiting.
--b.
J. Bruce Fields (12):
vfs: pull ext4's double-i_mutex-locking into common code
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: take i_mutex on renamed file
locks: introduce new FL_DELEG lock flag
locks: implement delegations
namei: minor vfs_unlink cleanup
locks: break delegations on unlink
locks: helper functions for delegation breaking
locks: break delegations on rename
locks: break delegations on link
locks: break delegations on any attribute modification
Jan Kara (1):
gfs2: Get rid of I_MUTEX_QUOTA usage
Documentation/filesystems/directory-locking | 30 +++++++----
drivers/base/devtmpfs.c | 6 +--
fs/attr.c | 5 +-
fs/cachefiles/interface.c | 4 +-
fs/cachefiles/namei.c | 4 +-
fs/ecryptfs/inode.c | 6 +--
fs/ext4/move_extent.c | 21 +-------
fs/gfs2/ops_fstype.c | 8 +++
fs/gfs2/quota.c | 2 +-
fs/hpfs/namei.c | 2 +-
fs/inode.c | 42 ++++++++++++++-
fs/locks.c | 51 ++++++++++++++----
fs/namei.c | 74 ++++++++++++++++++++-------
fs/nfsd/nfs4state.c | 2 +-
fs/nfsd/vfs.c | 14 +++--
fs/open.c | 21 ++++++--
fs/utimes.c | 9 +++-
include/linux/fs.h | 72 +++++++++++++++++++++-----
ipc/mqueue.c | 2 +-
19 files changed, 280 insertions(+), 95 deletions(-)
--
1.7.9.5
On Mon, Sep 10, 2012 at 02:37:24PM +0800, Guo Chao wrote:
> On Mon, Sep 10, 2012 at 01:10:37PM +0800, Ram Pai wrote:
> > On Mon, Sep 10, 2012 at 10:40:51AM +0800, Guo Chao wrote:
> > >
> > > Hard to say whether it's a bug or what's problems of being able to rename
> > > mountpoint.
> >
> > 'man 2 rename' says it is ok to rename a directory that is already
> > mounted.
> >
> > EBUSY The rename fails because oldpath or newpath is a directory that is
> > in use by some process (perhaps as current working directory, or as root
> > directory, or beacuse it was open for reading) or is in use by the
> > system (for example as mount point), while the system considers this an
> > error. (Note that there is no requirement to return EBUSY in such
> > cases-- there is nothing wrong with doing the rename anyway -- but is
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > allowed to return EBUSY if the system cannot otherwise handle such
> > situations)
> >
> > RP
>
> Erhh, good point.
>
> However, current implementation seems to return EBUSY unconditionally,
> instead of considering whether it can handle this situation. It can be
> descripted as 'it may return EBUSY or not, depends on whether you are
> luck enough to rush into the race'.
>
> This seems still conform to the statement in the manual, in a weird way
> though.
I think the code that checks if the new dentry or the old dentry is a
mount point can be safely removed. It does not serve any purpose AFAICT.
RP
On Fri, Sep 07, 2012 at 10:27:05AM +0800, Guo Chao wrote:
> On Thu, Sep 06, 2012 at 01:51:18PM -0400, J. Bruce Fields wrote:
> > On Thu, Sep 06, 2012 at 11:05:26AM +0800, Guo Chao wrote:
> > > On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> > > > From: "J. Bruce Fields" <[email protected]>
> > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > index 1b46439..6156135 100644
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > struct inode *new_dir, struct dentry *new_dentry)
> > > > {
> > > > struct inode *target = new_dentry->d_inode;
> > > > + struct inode *source = old_dentry->d_inode;
> > > > int error;
> > > >
> > > > error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> > > > @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > return error;
> > > >
> > > > dget(new_dentry);
> > > > - if (target)
> > > > - mutex_lock(&target->i_mutex);
> > > > + lock_two_nondirectories(source, target);
> > > >
> > > > error = -EBUSY;
> > > > if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> > > > @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> > > > d_move(old_dentry, new_dentry);
> > > > out:
> > > > - if (target)
> > > > - mutex_unlock(&target->i_mutex);
> > > > + unlock_two_nondirectories(source, target);
> > > > dput(new_dentry);
> > > > return error;
> > > > }
> > > >
> > >
> > > This change also fixes a race between rename and mount.
> > >
> > > Apparently we avoid to rename source or target if they are
> > > mountpoint. But nothing prevents source being the mountpoint
> > > after d_mountpoint check because we do not hold its i_mutex.
> > >
> > > Thus we are actually able to rename a mountpoint.
> > >
> > > Rename directory should also need this care.
> >
> > Do you have any practical way to reproduce that race?
> >
> > --b.
>
> Rename a mountpoint? Of course.
>
> One script ...
>
> #!/bin/bash
> while true
> do
> mount -t sysfs nodev mnt && umount mnt
> done
>
>
>
> The other ...
>
> #!/bin/bash
> while true
> do
> mv mnt mnt2 && mv mnt2 mnt
> done
>
>
>
> Run them simultaneously in two consoles. When mount keeps reporting
> 'mount point mnt does not exist', stop them, then you will see the
> familiar sysfs under mnt2.
Oh, thanks, for some reason I assumed it would be more difficult to
reproduce.
I think we can do this--I don't think it even requires any care to the
locking order of the renamed vs the victim directory, though I can't
completely convince myself of that.
Is it necessary to fix this, though? Does it cause any problems other
than unexpected behavior?
--b.
On Thu, Sep 06, 2012 at 11:05:26AM +0800, Guo Chao wrote:
> On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <[email protected]>
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 1b46439..6156135 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > struct inode *new_dir, struct dentry *new_dentry)
> > {
> > struct inode *target = new_dentry->d_inode;
> > + struct inode *source = old_dentry->d_inode;
> > int error;
> >
> > error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> > @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > return error;
> >
> > dget(new_dentry);
> > - if (target)
> > - mutex_lock(&target->i_mutex);
> > + lock_two_nondirectories(source, target);
> >
> > error = -EBUSY;
> > if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> > @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> > d_move(old_dentry, new_dentry);
> > out:
> > - if (target)
> > - mutex_unlock(&target->i_mutex);
> > + unlock_two_nondirectories(source, target);
> > dput(new_dentry);
> > return error;
> > }
> >
>
> This change also fixes a race between rename and mount.
>
> Apparently we avoid to rename source or target if they are
> mountpoint. But nothing prevents source being the mountpoint
> after d_mountpoint check because we do not hold its i_mutex.
>
> Thus we are actually able to rename a mountpoint.
>
> Rename directory should also need this care.
Do you have any practical way to reproduce that race?
--b.
On Thu, Sep 06, 2012 at 01:51:18PM -0400, J. Bruce Fields wrote:
> On Thu, Sep 06, 2012 at 11:05:26AM +0800, Guo Chao wrote:
> > On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> > > From: "J. Bruce Fields" <[email protected]>
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 1b46439..6156135 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > struct inode *new_dir, struct dentry *new_dentry)
> > > {
> > > struct inode *target = new_dentry->d_inode;
> > > + struct inode *source = old_dentry->d_inode;
> > > int error;
> > >
> > > error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> > > @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > return error;
> > >
> > > dget(new_dentry);
> > > - if (target)
> > > - mutex_lock(&target->i_mutex);
> > > + lock_two_nondirectories(source, target);
> > >
> > > error = -EBUSY;
> > > if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> > > @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> > > d_move(old_dentry, new_dentry);
> > > out:
> > > - if (target)
> > > - mutex_unlock(&target->i_mutex);
> > > + unlock_two_nondirectories(source, target);
> > > dput(new_dentry);
> > > return error;
> > > }
> > >
> >
> > This change also fixes a race between rename and mount.
> >
> > Apparently we avoid to rename source or target if they are
> > mountpoint. But nothing prevents source being the mountpoint
> > after d_mountpoint check because we do not hold its i_mutex.
> >
> > Thus we are actually able to rename a mountpoint.
> >
> > Rename directory should also need this care.
>
> Do you have any practical way to reproduce that race?
>
> --b.
Rename a mountpoint? Of course.
One script ...
#!/bin/bash
while true
do
mount -t sysfs nodev mnt && umount mnt
done
The other ...
#!/bin/bash
while true
do
mv mnt mnt2 && mv mnt2 mnt
done
Run them simultaneously in two consoles. When mount keeps reporting
'mount point mnt does not exist', stop them, then you will see the
familiar sysfs under mnt2.
From: "J. Bruce Fields" <[email protected]>
NFSv4 uses leases to guarantee that clients can cache metadata as well
as data.
Cc: Mikulas Patocka <[email protected]>
Cc: David Howells <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Dustin Kirkland <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
drivers/base/devtmpfs.c | 4 ++--
fs/attr.c | 5 ++++-
fs/cachefiles/interface.c | 4 ++--
fs/ecryptfs/inode.c | 2 +-
fs/hpfs/namei.c | 2 +-
fs/inode.c | 6 +++++-
fs/nfsd/vfs.c | 8 ++++++--
fs/open.c | 21 +++++++++++++++++----
fs/utimes.c | 9 ++++++++-
include/linux/fs.h | 2 +-
10 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 24162ec..3a21418 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -210,7 +210,7 @@ static int handle_create(const char *nodename, umode_t mode, struct device *dev)
newattrs.ia_mode = mode;
newattrs.ia_valid = ATTR_MODE;
mutex_lock(&dentry->d_inode->i_mutex);
- notify_change(dentry, &newattrs);
+ notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
/* mark as kernel-created inode */
@@ -315,7 +315,7 @@ static int handle_remove(const char *nodename, struct device *dev)
newattrs.ia_valid =
ATTR_UID|ATTR_GID|ATTR_MODE;
mutex_lock(&dentry->d_inode->i_mutex);
- notify_change(dentry, &newattrs);
+ notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
err = vfs_unlink(parent.dentry->d_inode, dentry, NULL);
if (!err || err == -ENOENT)
diff --git a/fs/attr.c b/fs/attr.c
index 29e38a1..1d4178b 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -163,7 +163,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
}
EXPORT_SYMBOL(setattr_copy);
-int notify_change(struct dentry * dentry, struct iattr * attr)
+int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
{
struct inode *inode = dentry->d_inode;
umode_t mode = inode->i_mode;
@@ -239,6 +239,9 @@ int notify_change(struct dentry * dentry, struct iattr * attr)
error = security_inode_setattr(dentry, attr);
if (error)
return error;
+ error = try_break_deleg(inode, delegated_inode);
+ if (error)
+ return error;
if (inode->i_op->setattr)
error = inode->i_op->setattr(dentry, attr);
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 67bef6d..67674b4 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -417,14 +417,14 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
_debug("discard tail %llx", oi_size);
newattrs.ia_valid = ATTR_SIZE;
newattrs.ia_size = oi_size & PAGE_MASK;
- ret = notify_change(object->backer, &newattrs);
+ ret = notify_change(object->backer, &newattrs, NULL);
if (ret < 0)
goto truncate_failed;
}
newattrs.ia_valid = ATTR_SIZE;
newattrs.ia_size = ni_size;
- ret = notify_change(object->backer, &newattrs);
+ ret = notify_change(object->backer, &newattrs, NULL);
truncate_failed:
mutex_unlock(&object->backer->d_inode->i_mutex);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 5c5f5df..8adc8e7 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -987,7 +987,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
lower_ia.ia_valid &= ~ATTR_MODE;
mutex_lock(&lower_dentry->d_inode->i_mutex);
- rc = notify_change(lower_dentry, &lower_ia);
+ rc = notify_change(lower_dentry, &lower_ia, NULL);
mutex_unlock(&lower_dentry->d_inode->i_mutex);
out:
fsstack_copy_attr_all(inode, lower_inode);
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index bc90824..7b2b97e 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -407,7 +407,7 @@ again:
/*printk("HPFS: truncating file before delete.\n");*/
newattrs.ia_size = 0;
newattrs.ia_valid = ATTR_SIZE | ATTR_CTIME;
- err = notify_change(dentry, &newattrs);
+ err = notify_change(dentry, &newattrs, NULL);
put_write_access(inode);
if (!err)
goto again;
diff --git a/fs/inode.c b/fs/inode.c
index 4ab9516..7e39593 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1633,7 +1633,11 @@ static int __remove_suid(struct dentry *dentry, int kill)
struct iattr newattrs;
newattrs.ia_valid = ATTR_FORCE | kill;
- return notify_change(dentry, &newattrs);
+ /*
+ * Note we call this on write, so notify_change will not
+ * encounter any conflicting delegations:
+ */
+ return notify_change(dentry, &newattrs, NULL);
}
int file_remove_suid(struct file *file)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2a77b28..7264c52 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -426,7 +426,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
goto out_nfserr;
fh_lock(fhp);
- host_err = notify_change(dentry, iap);
+ host_err = notify_change(dentry, iap, NULL);
err = nfserrno(host_err);
fh_unlock(fhp);
}
@@ -961,7 +961,11 @@ static void kill_suid(struct dentry *dentry)
ia.ia_valid = ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
mutex_lock(&dentry->d_inode->i_mutex);
- notify_change(dentry, &ia);
+ /*
+ * Note we call this on write, so notify_change will not
+ * encounter any conflicting delegations:
+ */
+ notify_change(dentry, &ia, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
}
diff --git a/fs/open.c b/fs/open.c
index bc132e1..35a478b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -56,7 +56,7 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
newattrs.ia_valid |= ret | ATTR_FORCE;
mutex_lock(&dentry->d_inode->i_mutex);
- ret = notify_change(dentry, &newattrs);
+ ret = notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
return ret;
}
@@ -455,21 +455,28 @@ out:
static int chmod_common(struct path *path, umode_t mode)
{
struct inode *inode = path->dentry->d_inode;
+ struct inode *delegated_inode = NULL;
struct iattr newattrs;
int error;
error = mnt_want_write(path->mnt);
if (error)
return error;
+retry_deleg:
mutex_lock(&inode->i_mutex);
error = security_path_chmod(path, mode);
if (error)
goto out_unlock;
newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
- error = notify_change(path->dentry, &newattrs);
+ error = notify_change(path->dentry, &newattrs, &delegated_inode);
out_unlock:
mutex_unlock(&inode->i_mutex);
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
mnt_drop_write(path->mnt);
return error;
}
@@ -509,6 +516,7 @@ SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
static int chown_common(struct path *path, uid_t user, gid_t group)
{
struct inode *inode = path->dentry->d_inode;
+ struct inode *delegated_inode = NULL;
int error;
struct iattr newattrs;
kuid_t uid;
@@ -533,12 +541,17 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
if (!S_ISDIR(inode->i_mode))
newattrs.ia_valid |=
ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
+retry_deleg:
mutex_lock(&inode->i_mutex);
error = security_path_chown(path, user, group);
if (!error)
- error = notify_change(path->dentry, &newattrs);
+ error = notify_change(path->dentry, &newattrs, &delegated_inode);
mutex_unlock(&inode->i_mutex);
-
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
return error;
}
diff --git a/fs/utimes.c b/fs/utimes.c
index fa4dbe4..e0fda92 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -53,6 +53,7 @@ static int utimes_common(struct path *path, struct timespec *times)
int error;
struct iattr newattrs;
struct inode *inode = path->dentry->d_inode;
+ struct inode *delegated_inode = NULL;
error = mnt_want_write(path->mnt);
if (error)
@@ -101,9 +102,15 @@ static int utimes_common(struct path *path, struct timespec *times)
goto mnt_drop_write_and_out;
}
}
+retry_deleg:
mutex_lock(&inode->i_mutex);
- error = notify_change(path->dentry, &newattrs);
+ error = notify_change(path->dentry, &newattrs, &delegated_inode);
mutex_unlock(&inode->i_mutex);
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
mnt_drop_write_and_out:
mnt_drop_write(path->mnt);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 52b8c67..9a3a890 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2450,7 +2450,7 @@ extern void emergency_remount(void);
#ifdef CONFIG_BLOCK
extern sector_t bmap(struct inode *, sector_t);
#endif
-extern int notify_change(struct dentry *, struct iattr *);
+extern int notify_change(struct dentry *, struct iattr *, struct inode **);
extern int inode_permission(struct inode *, int);
extern int generic_permission(struct inode *, int);
--
1.7.9.5
On Thu, Sep 06, 2012 at 07:01:22AM -0400, Jeff Layton wrote:
> On Wed, 5 Sep 2012 17:02:06 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > Oops, I meant to cc this to Jeff:
> >
> > On Wed, Sep 05, 2012 at 04:55:22PM -0400, J. Bruce Fields wrote:
> > > @@ -3556,6 +3560,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> > > if (error)
> > > return error;
> > >
> > > +retry_deleg:
> > > new_dentry = user_path_create(newdfd, newname, &new_path, 0);
> > > error = PTR_ERR(new_dentry);
> > > if (IS_ERR(new_dentry))
> > > @@ -3570,9 +3575,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> > > error = security_path_link(old_path.dentry, &new_path, new_dentry);
> > > if (error)
> > > goto out_dput;
> > > - error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
> > > + error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
> > > out_dput:
> > > done_path_create(&new_path, new_dentry);
> > > + if (delegated_inode) {
> > > + error = break_deleg_wait(&delegated_inode);
> > > + if (!error)
> > > + goto retry_deleg;
> > > + }
> > > out:
> > > path_put(&old_path);
> > >
> >
> > I think this will get me into trouble with the audit code, right?
> >
> > (On a quick skim I think I'm retrying from a point after the getname()
> > in rename and unlink, so I think those are OK, but I may be missing
> > something....)
> >
> > --b.
> >
>
> tl;dr: this patch is OK, but the unlink and rename patches will be
> problematic.
>
> Longer explanation:
>
> The big problem is multiple calls into audit_inode_child(), which will
> give you duplicate audit records when you retry the call. That function
> mostly gets called from the fsnotify_* calls, but is also called from
> may_delete().
>
> In the create case (like this one), you're safe since we only call
> fsnotify_* after the create succeeds. In the may_delete() case, we have
> to call that function before the actual operation. A quick look at your
> patches looks like you'll end up with multiple calls into may_delete on
> a retry. That'll give you multiple audit records for each.
>
> The good news is that if and when my audit_names overhaul patches go
> in, that problem should go away. At that point audit_inode_child() will
> know how to update existing records instead of creating new ones in
> this case. I'm hoping those will go in for 3.7, so you may just want to
> try to ensure that these patches go in after those.
OK, thanks for the explanation, I'll keep an eye on the audit_names
overhaul.
--b.
>
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index c8f5e74..2a77b28 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -1717,7 +1717,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> > > err = nfserrno(host_err);
> > > goto out_dput;
> > > }
> > > - host_err = vfs_link(dold, dirp, dnew);
> > > + host_err = vfs_link(dold, dirp, dnew, NULL);
> > > if (!host_err) {
> > > err = nfserrno(commit_metadata(ffhp));
> > > if (!err)
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index ad9df65e..52b8c67 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1721,7 +1721,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
> > > extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
> > > extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
> > > extern int vfs_symlink(struct inode *, struct dentry *, const char *);
> > > -extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> > > +extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
> > > extern int vfs_rmdir(struct inode *, struct dentry *);
> > > extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
> > > extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **);
> > > --
> > > 1.7.9.5
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> --
> Jeff Layton <[email protected]>
On Fri, Sep 07, 2012 at 05:39:01PM -0400, J. Bruce Fields wrote:
> On Fri, Sep 07, 2012 at 10:27:05AM +0800, Guo Chao wrote:
> > On Thu, Sep 06, 2012 at 01:51:18PM -0400, J. Bruce Fields wrote:
> > > On Thu, Sep 06, 2012 at 11:05:26AM +0800, Guo Chao wrote:
> > > > On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> > > > > From: "J. Bruce Fields" <[email protected]>
> > > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > > index 1b46439..6156135 100644
> > > > > --- a/fs/namei.c
> > > > > +++ b/fs/namei.c
> > > > > @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > struct inode *new_dir, struct dentry *new_dentry)
> > > > > {
> > > > > struct inode *target = new_dentry->d_inode;
> > > > > + struct inode *source = old_dentry->d_inode;
> > > > > int error;
> > > > >
> > > > > error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> > > > > @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > return error;
> > > > >
> > > > > dget(new_dentry);
> > > > > - if (target)
> > > > > - mutex_lock(&target->i_mutex);
> > > > > + lock_two_nondirectories(source, target);
> > > > >
> > > > > error = -EBUSY;
> > > > > if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> > > > > @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> > > > > d_move(old_dentry, new_dentry);
> > > > > out:
> > > > > - if (target)
> > > > > - mutex_unlock(&target->i_mutex);
> > > > > + unlock_two_nondirectories(source, target);
> > > > > dput(new_dentry);
> > > > > return error;
> > > > > }
> > > > >
> > > >
> > > > This change also fixes a race between rename and mount.
> > > >
> > > > Apparently we avoid to rename source or target if they are
> > > > mountpoint. But nothing prevents source being the mountpoint
> > > > after d_mountpoint check because we do not hold its i_mutex.
> > > >
> > > > Thus we are actually able to rename a mountpoint.
> > > >
> > > > Rename directory should also need this care.
> > >
> > > Do you have any practical way to reproduce that race?
> > >
> > > --b.
> >
> > Rename a mountpoint? Of course.
> >
> > One script ...
> >
> > #!/bin/bash
> > while true
> > do
> > mount -t sysfs nodev mnt && umount mnt
> > done
> >
> >
> >
> > The other ...
> >
> > #!/bin/bash
> > while true
> > do
> > mv mnt mnt2 && mv mnt2 mnt
> > done
> >
> >
> >
> > Run them simultaneously in two consoles. When mount keeps reporting
> > 'mount point mnt does not exist', stop them, then you will see the
> > familiar sysfs under mnt2.
>
> Oh, thanks, for some reason I assumed it would be more difficult to
> reproduce.
>
> I think we can do this--I don't think it even requires any care to the
> locking order of the renamed vs the victim directory, though I can't
> completely convince myself of that.
>
> Is it necessary to fix this, though? Does it cause any problems other
> than unexpected behavior?
>
> --b.
> --
Hard to say whether it's a bug or what's problems of being able to rename
mountpoint.
Anyway, this patch closes this race when mountpoint is a file. Thus we get
different behaviour when deal with files and directories. It's apparently
not well-defined, but again, is it a problem? Not sure ... ...
On Wed, 5 Sep 2012 17:02:06 -0400
"J. Bruce Fields" <[email protected]> wrote:
> Oops, I meant to cc this to Jeff:
>
> On Wed, Sep 05, 2012 at 04:55:22PM -0400, J. Bruce Fields wrote:
> > @@ -3556,6 +3560,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> > if (error)
> > return error;
> >
> > +retry_deleg:
> > new_dentry = user_path_create(newdfd, newname, &new_path, 0);
> > error = PTR_ERR(new_dentry);
> > if (IS_ERR(new_dentry))
> > @@ -3570,9 +3575,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> > error = security_path_link(old_path.dentry, &new_path, new_dentry);
> > if (error)
> > goto out_dput;
> > - error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
> > + error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
> > out_dput:
> > done_path_create(&new_path, new_dentry);
> > + if (delegated_inode) {
> > + error = break_deleg_wait(&delegated_inode);
> > + if (!error)
> > + goto retry_deleg;
> > + }
> > out:
> > path_put(&old_path);
> >
>
> I think this will get me into trouble with the audit code, right?
>
> (On a quick skim I think I'm retrying from a point after the getname()
> in rename and unlink, so I think those are OK, but I may be missing
> something....)
>
> --b.
>
tl;dr: this patch is OK, but the unlink and rename patches will be
problematic.
Longer explanation:
The big problem is multiple calls into audit_inode_child(), which will
give you duplicate audit records when you retry the call. That function
mostly gets called from the fsnotify_* calls, but is also called from
may_delete().
In the create case (like this one), you're safe since we only call
fsnotify_* after the create succeeds. In the may_delete() case, we have
to call that function before the actual operation. A quick look at your
patches looks like you'll end up with multiple calls into may_delete on
a retry. That'll give you multiple audit records for each.
The good news is that if and when my audit_names overhaul patches go
in, that problem should go away. At that point audit_inode_child() will
know how to update existing records instead of creating new ones in
this case. I'm hoping those will go in for 3.7, so you may just want to
try to ensure that these patches go in after those.
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index c8f5e74..2a77b28 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1717,7 +1717,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> > err = nfserrno(host_err);
> > goto out_dput;
> > }
> > - host_err = vfs_link(dold, dirp, dnew);
> > + host_err = vfs_link(dold, dirp, dnew, NULL);
> > if (!host_err) {
> > err = nfserrno(commit_metadata(ffhp));
> > if (!err)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index ad9df65e..52b8c67 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1721,7 +1721,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
> > extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
> > extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
> > extern int vfs_symlink(struct inode *, struct dentry *, const char *);
> > -extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> > +extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
> > extern int vfs_rmdir(struct inode *, struct dentry *);
> > extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
> > extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **);
> > --
> > 1.7.9.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jeff Layton <[email protected]>
From: "J. Bruce Fields" <[email protected]>
Implement NFSv4 delegations at the vfs level using the new FL_DELEG lock
type.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/locks.c | 49 +++++++++++++++++++++++++++++++++++++++----------
include/linux/fs.h | 18 +++++++++++++++---
2 files changed, 54 insertions(+), 13 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 6ee1943..24bb57d 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1176,28 +1176,40 @@ static void time_out_leases(struct inode *inode)
}
}
+static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
+{
+ if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
+ return false;
+ return locks_conflict(breaker, lease);
+}
+
/**
* __break_lease - revoke all outstanding leases on file
* @inode: the inode of the file to return
- * @mode: the open mode (read or write)
+ * @mode: O_RDONLY: break only write leases; O_WRONLY or O_RDWR:
+ * break all leases
+ * @type: FL_LEASE: break leases and delegations; FL_DELEG: break
+ * only delegations
*
* break_lease (inlined for speed) has checked there already is at least
* some kind of lock (maybe a lease) on this file. Leases are broken on
* a call to open() or truncate(). This function can sleep unless you
* specified %O_NONBLOCK to your open().
*/
-int __break_lease(struct inode *inode, unsigned int mode)
+int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
{
int error = 0;
struct file_lock *new_fl, *flock;
struct file_lock *fl;
unsigned long break_time;
int i_have_this_lease = 0;
+ bool lease_conflict = false;
int want_write = (mode & O_ACCMODE) != O_RDONLY;
new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK);
if (IS_ERR(new_fl))
return PTR_ERR(new_fl);
+ new_fl->fl_flags = type;
lock_flocks();
@@ -1207,13 +1219,16 @@ int __break_lease(struct inode *inode, unsigned int mode)
if ((flock == NULL) || !IS_LEASE(flock))
goto out;
- if (!locks_conflict(flock, new_fl))
+ for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
+ if (leases_conflict(fl, new_fl)) {
+ lease_conflict = true;
+ if (fl->fl_owner == current->files)
+ i_have_this_lease = 1;
+ }
+ }
+ if (!lease_conflict)
goto out;
- for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next)
- if (fl->fl_owner == current->files)
- i_have_this_lease = 1;
-
break_time = 0;
if (lease_break_time > 0) {
break_time = jiffies + lease_break_time * HZ;
@@ -1222,6 +1237,8 @@ int __break_lease(struct inode *inode, unsigned int mode)
}
for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
+ if (!leases_conflict(fl, new_fl))
+ continue;
if (want_write) {
if (fl->fl_flags & FL_UNLOCK_PENDING)
continue;
@@ -1263,7 +1280,7 @@ restart:
*/
for (flock = inode->i_flock; flock && IS_LEASE(flock);
flock = flock->fl_next) {
- if (locks_conflict(new_fl, flock))
+ if (leases_conflict(new_fl, flock))
goto restart;
}
error = 0;
@@ -1343,9 +1360,20 @@ int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
struct file_lock *fl, **before, **my_before = NULL, *lease;
struct dentry *dentry = filp->f_path.dentry;
struct inode *inode = dentry->d_inode;
+ bool is_deleg = (*flp)->fl_flags & FL_DELEG;
int error;
lease = *flp;
+ /*
+ * In the delegation case we need mutual exclusion with
+ * a number of operations that take the i_mutex. We trylock
+ * because delegations are an optional optimization, and if
+ * there's some chance of a conflict--we'd rather not
+ * bother, maybe that's a sign this just isn't a good file to
+ * hand out a delegation on.
+ */
+ if (is_deleg && !mutex_trylock(&inode->i_mutex))
+ return -EAGAIN;
error = -EAGAIN;
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
@@ -1397,9 +1425,10 @@ int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
goto out;
locks_insert_lock(before, lease);
- return 0;
-
+ error = 0;
out:
+ if (is_deleg)
+ mutex_unlock(&inode->i_mutex);
return error;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c43c893..73157f7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1261,7 +1261,7 @@ extern int vfs_test_lock(struct file *, struct file_lock *);
extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
extern int flock_lock_file_wait(struct file *filp, struct file_lock *fl);
-extern int __break_lease(struct inode *inode, unsigned int flags);
+extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type);
extern void lease_get_mtime(struct inode *, struct timespec *time);
extern int generic_setlease(struct file *, long, struct file_lock **);
extern int vfs_setlease(struct file *, long, struct file_lock **);
@@ -1374,7 +1374,7 @@ static inline int flock_lock_file_wait(struct file *filp,
return -ENOLCK;
}
-static inline int __break_lease(struct inode *inode, unsigned int mode)
+static inline int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
{
return 0;
}
@@ -2151,9 +2151,17 @@ static inline int locks_verify_truncate(struct inode *inode,
static inline int break_lease(struct inode *inode, unsigned int mode)
{
if (inode->i_flock)
- return __break_lease(inode, mode);
+ return __break_lease(inode, mode, FL_LEASE);
return 0;
}
+
+static inline int break_deleg(struct inode *inode, unsigned int mode)
+{
+ if (inode->i_flock)
+ return __break_lease(inode, mode, FL_DELEG);
+ return 0;
+}
+
#else /* !CONFIG_FILE_LOCKING */
static inline int locks_mandatory_locked(struct inode *inode)
{
@@ -2193,6 +2201,10 @@ static inline int break_lease(struct inode *inode, unsigned int mode)
return 0;
}
+static inline int break_deleg(struct inode *inode, unsigned int mode)
+{
+ return 0;
+}
#endif /* CONFIG_FILE_LOCKING */
/* fs/open.c */
--
1.7.9.5
On Mon, Sep 10, 2012 at 10:40:51AM +0800, Guo Chao wrote:
> On Fri, Sep 07, 2012 at 05:39:01PM -0400, J. Bruce Fields wrote:
> > On Fri, Sep 07, 2012 at 10:27:05AM +0800, Guo Chao wrote:
> > > One script ...
> > >
> > > #!/bin/bash
> > > while true
> > > do
> > > mount -t sysfs nodev mnt && umount mnt
> > > done
> > >
> > >
> > >
> > > The other ...
> > >
> > > #!/bin/bash
> > > while true
> > > do
> > > mv mnt mnt2 && mv mnt2 mnt
> > > done
> > >
> > >
> > >
> > > Run them simultaneously in two consoles. When mount keeps reporting
> > > 'mount point mnt does not exist', stop them, then you will see the
> > > familiar sysfs under mnt2.
> >
> > Oh, thanks, for some reason I assumed it would be more difficult to
> > reproduce.
> >
> > I think we can do this--I don't think it even requires any care to the
> > locking order of the renamed vs the victim directory, though I can't
> > completely convince myself of that.
> >
> > Is it necessary to fix this, though? Does it cause any problems other
> > than unexpected behavior?
> >
> > --b.
> > --
>
> Hard to say whether it's a bug or what's problems of being able to rename
> mountpoint.
>
> Anyway, this patch closes this race when mountpoint is a file. Thus we get
> different behaviour when deal with files and directories. It's apparently
> not well-defined, but again, is it a problem? Not sure ... ...
OK. I'm not seeing a strong argument to change the current behavior, so
inclined to leave it alone for now.
--b.
On Thu, Sep 06, 2012 at 10:53:58AM +0800, Guo Chao wrote:
> On Wed, Sep 05, 2012 at 04:55:12PM -0400, J. Bruce Fields wrote:
> > +/**
> > + * lock_two_nondirectories - release locks from lock_two_nondirectories()
> unlock_two_nondirectories
>
> > + * @inode1: first inode to unlock
> > + * @inode2: second inode to lock
> unlock
>
Both fixed, thanks!
--b.
commit d190216f0b6c4623f9c23e2c7fb2e2fa9009891f
Author: J. Bruce Fields <[email protected]>
Date: Wed Apr 18 15:16:33 2012 -0400
vfs: pull ext4's double-i_mutex-locking into common code
We want to do this elsewhere as well.
Also, compare pointers instead of inode numbers.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..b87d94a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1086,19 +1086,7 @@ mext_inode_double_lock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;
- if (inode1 == inode2) {
- mutex_lock(&inode1->i_mutex);
- goto out;
- }
-
- if (inode1->i_ino < inode2->i_ino) {
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
- } else {
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
- }
-
+ lock_two_nondirectories(inode1, inode2);
out:
return ret;
}
@@ -1123,12 +1111,7 @@ mext_inode_double_unlock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;
- if (inode1)
- mutex_unlock(&inode1->i_mutex);
-
- if (inode2 && inode2 != inode1)
- mutex_unlock(&inode2->i_mutex);
-
+ unlock_two_nondirectories(inode1, inode2);
out:
return ret;
}
diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..275f571 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -969,6 +969,42 @@ void unlock_new_inode(struct inode *inode)
EXPORT_SYMBOL(unlock_new_inode);
/**
+ * lock_two_nondirectories - take two i_mutexes on non-directory objects
+ * @inode1: first inode to lock; must be non-NULL
+ * @inode2: second inode to lock; optional, may equal first or be NULL.
+ */
+void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ if (inode1 == inode2 || inode2 == NULL)
+ mutex_lock(&inode1->i_mutex);
+ else if (inode1 < inode2) {
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+
+ } else {
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+ }
+}
+EXPORT_SYMBOL(lock_two_nondirectories);
+
+/**
+ * unlock_two_nondirectories - release locks from lock_two_nondirectories()
+ * @inode1: first inode to unlock
+ * @inode2: second inode to unlock
+ *
+ * Arguments must be same as those given to corresponding
+ * lock_two_nondirectories() call.
+ */
+void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ mutex_unlock(&inode1->i_mutex);
+ if (inode2 && inode2 != inode1)
+ mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(unlock_two_nondirectories);
+
+/**
* iget5_locked - obtain an inode from a mounted file system
* @sb: super block of file system
* @hashval: hash value (usually inode number) to get
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aa11047..a179b69 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -895,6 +895,9 @@ enum inode_i_mutex_lock_class
I_MUTEX_QUOTA
};
+void lock_two_nondirectories(struct inode *, struct inode*);
+void unlock_two_nondirectories(struct inode *, struct inode*);
+
/*
* NOTE: in a 32bit arch with a preemptable kernel and
* an UP compile the i_size_read/write must be atomic
On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <[email protected]>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1b46439..6156135 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> struct inode *new_dir, struct dentry *new_dentry)
> {
> struct inode *target = new_dentry->d_inode;
> + struct inode *source = old_dentry->d_inode;
> int error;
>
> error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> return error;
>
> dget(new_dentry);
> - if (target)
> - mutex_lock(&target->i_mutex);
> + lock_two_nondirectories(source, target);
>
> error = -EBUSY;
> if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> d_move(old_dentry, new_dentry);
> out:
> - if (target)
> - mutex_unlock(&target->i_mutex);
> + unlock_two_nondirectories(source, target);
> dput(new_dentry);
> return error;
> }
>
This change also fixes a race between rename and mount.
Apparently we avoid to rename source or target if they are
mountpoint. But nothing prevents source being the mountpoint
after d_mountpoint check because we do not hold its i_mutex.
Thus we are actually able to rename a mountpoint.
Rename directory should also need this care.
From: "J. Bruce Fields" <[email protected]>
Reserve I_MUTEX_PARENT and I_MUTEX_CHILD for locking of actual
directories.
(Also I_MUTEX_QUOTA isn't really a meaningful name for this locking
class any more; fixed in a later patch.)
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/inode.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 582843d..44179ce 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -978,12 +978,12 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
if (inode1 == inode2 || inode2 == NULL)
mutex_lock(&inode1->i_mutex);
else if (inode1 < inode2) {
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+ mutex_lock(&inode1->i_mutex);
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);
} else {
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+ mutex_lock(&inode2->i_mutex);
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_QUOTA);
}
}
EXPORT_SYMBOL(lock_two_nondirectories);
--
1.7.9.5
Oops, I meant to cc this to Jeff:
On Wed, Sep 05, 2012 at 04:55:22PM -0400, J. Bruce Fields wrote:
> @@ -3556,6 +3560,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> if (error)
> return error;
>
> +retry_deleg:
> new_dentry = user_path_create(newdfd, newname, &new_path, 0);
> error = PTR_ERR(new_dentry);
> if (IS_ERR(new_dentry))
> @@ -3570,9 +3575,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> error = security_path_link(old_path.dentry, &new_path, new_dentry);
> if (error)
> goto out_dput;
> - error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
> + error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
> out_dput:
> done_path_create(&new_path, new_dentry);
> + if (delegated_inode) {
> + error = break_deleg_wait(&delegated_inode);
> + if (!error)
> + goto retry_deleg;
> + }
> out:
> path_put(&old_path);
>
I think this will get me into trouble with the audit code, right?
(On a quick skim I think I'm retrying from a point after the getname()
in rename and unlink, so I think those are OK, but I may be missing
something....)
--b.
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index c8f5e74..2a77b28 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1717,7 +1717,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> err = nfserrno(host_err);
> goto out_dput;
> }
> - host_err = vfs_link(dold, dirp, dnew);
> + host_err = vfs_link(dold, dirp, dnew, NULL);
> if (!host_err) {
> err = nfserrno(commit_metadata(ffhp));
> if (!err)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ad9df65e..52b8c67 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,7 +1721,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
> extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
> extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
> extern int vfs_symlink(struct inode *, struct dentry *, const char *);
> -extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> +extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
> extern int vfs_rmdir(struct inode *, struct dentry *);
> extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
> extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **);
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 05, 2012 at 04:55:12PM -0400, J. Bruce Fields wrote:
> +/**
> + * lock_two_nondirectories - release locks from lock_two_nondirectories()
unlock_two_nondirectories
> + * @inode1: first inode to unlock
> + * @inode2: second inode to lock
unlock
On Mon, Sep 10, 2012 at 10:40:51AM +0800, Guo Chao wrote:
> On Fri, Sep 07, 2012 at 05:39:01PM -0400, J. Bruce Fields wrote:
> > On Fri, Sep 07, 2012 at 10:27:05AM +0800, Guo Chao wrote:
> > > On Thu, Sep 06, 2012 at 01:51:18PM -0400, J. Bruce Fields wrote:
> > > > On Thu, Sep 06, 2012 at 11:05:26AM +0800, Guo Chao wrote:
> > > > > On Wed, Sep 05, 2012 at 04:55:15PM -0400, J. Bruce Fields wrote:
> > > > > > From: "J. Bruce Fields" <[email protected]>
> > > > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > > > index 1b46439..6156135 100644
> > > > > > --- a/fs/namei.c
> > > > > > +++ b/fs/namei.c
> > > > > > @@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > > struct inode *new_dir, struct dentry *new_dentry)
> > > > > > {
> > > > > > struct inode *target = new_dentry->d_inode;
> > > > > > + struct inode *source = old_dentry->d_inode;
> > > > > > int error;
> > > > > >
> > > > > > error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
> > > > > > @@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > > return error;
> > > > > >
> > > > > > dget(new_dentry);
> > > > > > - if (target)
> > > > > > - mutex_lock(&target->i_mutex);
> > > > > > + lock_two_nondirectories(source, target);
> > > > > >
> > > > > > error = -EBUSY;
> > > > > > if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> > > > > > @@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> > > > > > if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> > > > > > d_move(old_dentry, new_dentry);
> > > > > > out:
> > > > > > - if (target)
> > > > > > - mutex_unlock(&target->i_mutex);
> > > > > > + unlock_two_nondirectories(source, target);
> > > > > > dput(new_dentry);
> > > > > > return error;
> > > > > > }
> > > > > >
> > > > >
> > > > > This change also fixes a race between rename and mount.
> > > > >
> > > > > Apparently we avoid to rename source or target if they are
> > > > > mountpoint. But nothing prevents source being the mountpoint
> > > > > after d_mountpoint check because we do not hold its i_mutex.
> > > > >
> > > > > Thus we are actually able to rename a mountpoint.
> > > > >
> > > > > Rename directory should also need this care.
> > > >
> > > > Do you have any practical way to reproduce that race?
> > > >
> > > > --b.
> > >
> > > Rename a mountpoint? Of course.
> > >
> > > One script ...
> > >
> > > #!/bin/bash
> > > while true
> > > do
> > > mount -t sysfs nodev mnt && umount mnt
> > > done
> > >
> > >
> > >
> > > The other ...
> > >
> > > #!/bin/bash
> > > while true
> > > do
> > > mv mnt mnt2 && mv mnt2 mnt
> > > done
> > >
> > >
> > >
> > > Run them simultaneously in two consoles. When mount keeps reporting
> > > 'mount point mnt does not exist', stop them, then you will see the
> > > familiar sysfs under mnt2.
> >
> > Oh, thanks, for some reason I assumed it would be more difficult to
> > reproduce.
> >
> > I think we can do this--I don't think it even requires any care to the
> > locking order of the renamed vs the victim directory, though I can't
> > completely convince myself of that.
> >
> > Is it necessary to fix this, though? Does it cause any problems other
> > than unexpected behavior?
> >
> > --b.
> > --
>
> Hard to say whether it's a bug or what's problems of being able to rename
> mountpoint.
'man 2 rename' says it is ok to rename a directory that is already
mounted.
EBUSY The rename fails because oldpath or newpath is a directory that is
in use by some process (perhaps as current working directory, or as root
directory, or beacuse it was open for reading) or is in use by the
system (for example as mount point), while the system considers this an
error. (Note that there is no requirement to return EBUSY in such
cases-- there is nothing wrong with doing the rename anyway -- but is
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
allowed to return EBUSY if the system cannot otherwise handle such
situations)
RP
From: "J. Bruce Fields" <[email protected]>
For now FL_DELEG is just a synonym for FL_LEASE. So this patch doesn't
change behavior.
Next we'll modify break_lease to treat FL_DELEG leases differently, to
account for the fact that NFSv4 delegations should be broken in more
situations than Windows oplocks.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/locks.c | 2 +-
fs/nfsd/nfs4state.c | 2 +-
include/linux/fs.h | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 669911e..6ee1943 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -131,7 +131,7 @@
#define IS_POSIX(fl) (fl->fl_flags & FL_POSIX)
#define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK)
-#define IS_LEASE(fl) (fl->fl_flags & FL_LEASE)
+#define IS_LEASE(fl) (fl->fl_flags & (FL_LEASE|FL_DELEG))
static bool lease_breaking(struct file_lock *fl)
{
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5f1a91a..015267a 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2811,7 +2811,7 @@ static struct file_lock *nfs4_alloc_init_lease(struct nfs4_delegation *dp, int f
return NULL;
locks_init_lock(fl);
fl->fl_lmops = &nfsd_lease_mng_ops;
- fl->fl_flags = FL_LEASE;
+ fl->fl_flags = FL_DELEG;
fl->fl_type = flag == NFS4_OPEN_DELEGATE_READ? F_RDLCK: F_WRLCK;
fl->fl_end = OFFSET_MAX;
fl->fl_owner = (fl_owner_t)(dp->dl_file);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 017070e..c43c893 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1139,6 +1139,7 @@ static inline int file_check_writeable(struct file *filp)
#define FL_POSIX 1
#define FL_FLOCK 2
+#define FL_DELEG 4 /* NFSv4 delegation */
#define FL_ACCESS 8 /* not trying to lock, just looking */
#define FL_EXISTS 16 /* when unlocking, test for existence */
#define FL_LEASE 32 /* lease held on this file */
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/cachefiles/namei.c | 2 +-
fs/namei.c | 26 ++++++++++++++++++++++----
fs/nfsd/vfs.c | 2 +-
include/linux/fs.h | 2 +-
4 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 10cf5c6..ef23d2a 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -397,7 +397,7 @@ try_again:
cachefiles_io_error(cache, "Rename security error %d", ret);
} else {
ret = vfs_rename(dir->d_inode, rep,
- cache->graveyard->d_inode, grave);
+ cache->graveyard->d_inode, grave, NULL);
if (ret != 0 && ret != -ENOMEM)
cachefiles_io_error(cache,
"Rename failed with error %d", ret);
diff --git a/fs/namei.c b/fs/namei.c
index eef4806..c7318ae 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3666,7 +3666,8 @@ out:
}
static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct inode **delegated_inode)
{
struct inode *target = new_dentry->d_inode;
struct inode *source = old_dentry->d_inode;
@@ -3683,6 +3684,14 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
goto out;
+ error = try_break_deleg(source, delegated_inode);
+ if (error)
+ goto out;
+ if (target) {
+ error = try_break_deleg(target, delegated_inode);
+ if (error)
+ goto out;
+ }
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (error)
goto out;
@@ -3698,7 +3707,8 @@ out:
}
int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct inode *new_dir, struct dentry *new_dentry,
+ struct inode **delegated_inode)
{
int error;
int is_dir = S_ISDIR(old_dentry->d_inode->i_mode);
@@ -3726,7 +3736,7 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
if (is_dir)
error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
else
- error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);
+ error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry,delegated_inode);
if (!error)
fsnotify_move(old_dir, new_dir, old_name, is_dir,
new_dentry->d_inode, old_dentry);
@@ -3742,6 +3752,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
struct dentry *old_dentry, *new_dentry;
struct dentry *trap;
struct nameidata oldnd, newnd;
+ struct inode *delegated_inode = NULL;
char *from;
char *to;
int error;
@@ -3775,6 +3786,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
newnd.flags &= ~LOOKUP_PARENT;
newnd.flags |= LOOKUP_RENAME_TARGET;
+retry_deleg:
trap = lock_rename(new_dir, old_dir);
old_dentry = lookup_hash(&oldnd);
@@ -3811,13 +3823,19 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
if (error)
goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
- new_dir->d_inode, new_dentry);
+ new_dir->d_inode, new_dentry,
+ &delegated_inode);
exit5:
dput(new_dentry);
exit4:
dput(old_dentry);
exit3:
unlock_rename(new_dir, old_dir);
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
mnt_drop_write(oldnd.path.mnt);
exit2:
path_put(&newnd.path);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9c39d26..c8f5e74 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1820,7 +1820,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (host_err)
goto out_dput_new;
}
- host_err = vfs_rename(fdir, odentry, tdir, ndentry);
+ host_err = vfs_rename(fdir, odentry, tdir, ndentry, NULL);
if (!host_err) {
host_err = commit_metadata(tfhp);
if (!host_err)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4c0fa4f..ad9df65e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1724,7 +1724,7 @@ extern int vfs_symlink(struct inode *, struct dentry *, const char *);
extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
-extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **);
/*
* VFS dentry helper functions.
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
We want to do this elsewhere as well.
Also, compare pointers instead of inode numbers.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/ext4/move_extent.c | 21 ++-------------------
fs/inode.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +++
3 files changed, 41 insertions(+), 19 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..b87d94a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1086,19 +1086,7 @@ mext_inode_double_lock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;
- if (inode1 == inode2) {
- mutex_lock(&inode1->i_mutex);
- goto out;
- }
-
- if (inode1->i_ino < inode2->i_ino) {
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
- } else {
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
- }
-
+ lock_two_nondirectories(inode1, inode2);
out:
return ret;
}
@@ -1123,12 +1111,7 @@ mext_inode_double_unlock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;
- if (inode1)
- mutex_unlock(&inode1->i_mutex);
-
- if (inode2 && inode2 != inode1)
- mutex_unlock(&inode2->i_mutex);
-
+ unlock_two_nondirectories(inode1, inode2);
out:
return ret;
}
diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..582843d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -969,6 +969,42 @@ void unlock_new_inode(struct inode *inode)
EXPORT_SYMBOL(unlock_new_inode);
/**
+ * lock_two_nondirectories - take two i_mutexes on non-directory objects
+ * @inode1: first inode to lock; must be non-NULL
+ * @inode2: second inode to lock; optional, may equal first or be NULL.
+ */
+void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ if (inode1 == inode2 || inode2 == NULL)
+ mutex_lock(&inode1->i_mutex);
+ else if (inode1 < inode2) {
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+
+ } else {
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+ }
+}
+EXPORT_SYMBOL(lock_two_nondirectories);
+
+/**
+ * lock_two_nondirectories - release locks from lock_two_nondirectories()
+ * @inode1: first inode to unlock
+ * @inode2: second inode to lock
+ *
+ * Arguments must be same as those given to corresponding
+ * lock_two_nondirectories() call.
+ */
+void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ mutex_unlock(&inode1->i_mutex);
+ if (inode2 && inode2 != inode1)
+ mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(unlock_two_nondirectories);
+
+/**
* iget5_locked - obtain an inode from a mounted file system
* @sb: super block of file system
* @hashval: hash value (usually inode number) to get
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aa11047..a179b69 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -895,6 +895,9 @@ enum inode_i_mutex_lock_class
I_MUTEX_QUOTA
};
+void lock_two_nondirectories(struct inode *, struct inode*);
+void unlock_two_nondirectories(struct inode *, struct inode*);
+
/*
* NOTE: in a 32bit arch with a preemptable kernel and
* an UP compile the i_size_read/write must be atomic
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
I_MUTEX_QUOTA is now just being used whenever we want to lock two
non-directories. So the name isn't right. I_MUTEX_NONDIR2 isn't
especially elegant but it's the best I could think of.
Also fix some outdated documentation.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/inode.c | 4 ++--
include/linux/fs.h | 9 ++++++---
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 44179ce..4ab9516 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -979,11 +979,11 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
mutex_lock(&inode1->i_mutex);
else if (inode1 < inode2) {
mutex_lock(&inode1->i_mutex);
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_NONDIR2);
} else {
mutex_lock(&inode2->i_mutex);
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_QUOTA);
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_NONDIR2);
}
}
EXPORT_SYMBOL(lock_two_nondirectories);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a179b69..017070e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -881,10 +881,13 @@ static inline int inode_unhashed(struct inode *inode)
* 0: the object of the current VFS operation
* 1: parent
* 2: child/target
- * 3: quota file
+ * 3: xattr
+ * 4: second non-directory
+ * The last is for certain operations (such as rename) which lock two
+ * non-directories at once.
*
* The locking order between these classes is
- * parent -> child -> normal -> xattr -> quota
+ * parent -> child -> normal -> xattr -> second non-directory
*/
enum inode_i_mutex_lock_class
{
@@ -892,7 +895,7 @@ enum inode_i_mutex_lock_class
I_MUTEX_PARENT,
I_MUTEX_CHILD,
I_MUTEX_XATTR,
- I_MUTEX_QUOTA
+ I_MUTEX_NONDIR2
};
void lock_two_nondirectories(struct inode *, struct inode*);
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
We'll be using dentry->d_inode in one more place.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/namei.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6156135..10d6622 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3307,6 +3307,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
+ struct inode *target = dentry->d_inode;
int error = may_delete(dir, dentry, 0);
if (error)
@@ -3315,7 +3316,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
if (!dir->i_op->unlink)
return -EPERM;
- mutex_lock(&dentry->d_inode->i_mutex);
+ mutex_lock(&target->i_mutex);
if (d_mountpoint(dentry))
error = -EBUSY;
else {
@@ -3326,11 +3327,11 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
dont_mount(dentry);
}
}
- mutex_unlock(&dentry->d_inode->i_mutex);
+ mutex_unlock(&target->i_mutex);
/* We don't d_delete() NFS sillyrenamed files--they still exist. */
if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
- fsnotify_link_count(dentry->d_inode);
+ fsnotify_link_count(target);
d_delete(dentry);
}
--
1.7.9.5
From: Jan Kara <[email protected]>
GFS2 uses i_mutex on its system quota inode to synchronize writes to
quota file. Since this is an internal inode to GFS2 (not part of directory
hiearchy or visible by user) we are safe to define locking rules for it. So
let's just get it its own locking class to make it clear.
CC: Steven Whitehouse <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/gfs2/ops_fstype.c | 8 ++++++++
fs/gfs2/quota.c | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e5af9dc..e443966 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/gfs2_ondisk.h>
#include <linux/quotaops.h>
+#include <linux/lockdep.h>
#include "gfs2.h"
#include "incore.h"
@@ -766,6 +767,7 @@ fail:
return error;
}
+static struct lock_class_key gfs2_quota_imutex_key;
static int init_inodes(struct gfs2_sbd *sdp, int undo)
{
@@ -803,6 +805,12 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
fs_err(sdp, "can't get quota file inode: %d\n", error);
goto fail_rindex;
}
+ /*
+ * i_mutex on quota files is special. Since this inode is hidden system
+ * file, we are safe to define locking ourselves.
+ */
+ lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
+ &gfs2_quota_imutex_key);
error = gfs2_rindex_update(sdp);
if (error)
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index a3bde91..3d9c989 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -781,7 +781,7 @@ static int do_sync(unsigned int num_qd, struct gfs2_quota_data **qda)
return -ENOMEM;
sort(qda, num_qd, sizeof(struct gfs2_quota_data *), sort_qd, NULL);
- mutex_lock_nested(&ip->i_inode.i_mutex, I_MUTEX_QUOTA);
+ mutex_lock(&ip->i_inode.i_mutex);
for (qx = 0; qx < num_qd; qx++) {
error = gfs2_glock_nq_init(qda[qx]->qd_gl, LM_ST_EXCLUSIVE,
GL_NOCACHE, &ghs[qx]);
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
A read delegation is used by NFSv4 as a guarantee that a client can
perform local read opens without informing the server.
The open operation takes the last component of the pathname as an
argument, thus is also a lookup operation, and giving the client the
above guarantee means informing the client before we allow anything that
would change the set of names pointing to the inode.
Therefore, we need to break delegations on rename, link, and unlink.
We also need to prevent new delegations from being acquired while one of
these operations is in progress.
We could add some completely new locking for that purpose, but it's
simpler to use the i_mutex, since that's already taken by all the
operations we care about.
The single exception is rename. So, modify rename to take the i_mutex
on the file that is being renamed.
Also fix up lockdep and Documentation/filesystems/directory-locking to
reflect the change.
Signed-off-by: J. Bruce Fields <[email protected]>
---
Documentation/filesystems/directory-locking | 30 +++++++++++++++++++--------
fs/namei.c | 7 +++----
2 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
index ff7b611..9e8a629 100644
--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
@@ -2,6 +2,10 @@
kinds of locks - per-inode (->i_mutex) and per-filesystem
(->s_vfs_rename_mutex).
+ When taking the i_mutex on multiple non-directory objects, we
+always acquire the locks in order by increasing address. We'll call
+that "inode pointer" order in the following.
+
For our purposes all operations fall in 5 classes:
1) read access. Locking rules: caller locks directory we are accessing.
@@ -12,8 +16,9 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem
locks victim and calls the method.
4) rename() that is _not_ cross-directory. Locking rules: caller locks
-the parent, finds source and target, if target already exists - locks it
-and then calls the method.
+the parent and finds source and target. If source and target both
+exist, they are locked in inode pointer order. Otherwise lock just
+source. Then call method.
5) link creation. Locking rules:
* lock parent
@@ -30,7 +35,8 @@ rules:
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
- * if target exists - lock it.
+ * If target exists, lock both source and target, in inode
+ pointer order. Otherwise lock just source.
* call the method.
@@ -56,9 +62,11 @@ objects - A < B iff A is an ancestor of B.
renames will be blocked on filesystem lock and we don't start changing
the order until we had acquired all locks).
-(3) any operation holds at most one lock on non-directory object and
- that lock is acquired after all other locks. (Proof: see descriptions
- of operations).
+(3) locks on non-directory objects are acquired only after locks on
+ directory objects, and are acquired in inode pointer order.
+ (Proof: all operations but renames take lock on at most one
+ non-directory object, except renames, which take locks on source and
+ target in inode pointer order.)
Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
@@ -66,9 +74,13 @@ consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_mutex.
- Non-directory objects are not contended due to (3). Thus link
-creation can't be a part of deadlock - it can't be blocked on source
-and it means that it doesn't hold any locks.
+ By (3), any process holding a non-directory lock can only be
+waiting on another non-directory lock with a larger address. Therefore
+the process holding the "largest" such lock can always make progress, and
+non-directory objects are not included in the set of contended locks.
+
+ Thus link creation can't be a part of deadlock - it can't be
+blocked on source and it means that it doesn't hold any locks.
Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by
diff --git a/fs/namei.c b/fs/namei.c
index 1b46439..6156135 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3658,6 +3658,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry)
{
struct inode *target = new_dentry->d_inode;
+ struct inode *source = old_dentry->d_inode;
int error;
error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
@@ -3665,8 +3666,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
return error;
dget(new_dentry);
- if (target)
- mutex_lock(&target->i_mutex);
+ lock_two_nondirectories(source, target);
error = -EBUSY;
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
@@ -3681,8 +3681,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
- if (target)
- mutex_unlock(&target->i_mutex);
+ unlock_two_nondirectories(source, target);
dput(new_dentry);
return error;
}
--
1.7.9.5
On Mon, Sep 10, 2012 at 01:10:37PM +0800, Ram Pai wrote:
> On Mon, Sep 10, 2012 at 10:40:51AM +0800, Guo Chao wrote:
> >
> > Hard to say whether it's a bug or what's problems of being able to rename
> > mountpoint.
>
> 'man 2 rename' says it is ok to rename a directory that is already
> mounted.
>
> EBUSY The rename fails because oldpath or newpath is a directory that is
> in use by some process (perhaps as current working directory, or as root
> directory, or beacuse it was open for reading) or is in use by the
> system (for example as mount point), while the system considers this an
> error. (Note that there is no requirement to return EBUSY in such
> cases-- there is nothing wrong with doing the rename anyway -- but is
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> allowed to return EBUSY if the system cannot otherwise handle such
> situations)
>
> RP
Erhh, good point.
However, current implementation seems to return EBUSY unconditionally,
instead of considering whether it can handle this situation. It can be
descripted as 'it may return EBUSY or not, depends on whether you are
luck enough to rush into the race'.
This seems still conform to the statement in the manual, in a weird way
though.
On Wed, Sep 05, 2012 at 04:55:11PM -0400, J. Bruce Fields wrote:
> From: Jan Kara <[email protected]>
>
> GFS2 uses i_mutex on its system quota inode to synchronize writes to
> quota file. Since this is an internal inode to GFS2 (not part of directory
> hiearchy or visible by user) we are safe to define locking rules for it. So
> let's just get it its own locking class to make it clear.
By the way, Steve, is there any reason you haven't taken this through
the gfs2 tree?
--b.
>
> CC: Steven Whitehouse <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> fs/gfs2/ops_fstype.c | 8 ++++++++
> fs/gfs2/quota.c | 2 +-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index e5af9dc..e443966 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -19,6 +19,7 @@
> #include <linux/mount.h>
> #include <linux/gfs2_ondisk.h>
> #include <linux/quotaops.h>
> +#include <linux/lockdep.h>
>
> #include "gfs2.h"
> #include "incore.h"
> @@ -766,6 +767,7 @@ fail:
> return error;
> }
>
> +static struct lock_class_key gfs2_quota_imutex_key;
>
> static int init_inodes(struct gfs2_sbd *sdp, int undo)
> {
> @@ -803,6 +805,12 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
> fs_err(sdp, "can't get quota file inode: %d\n", error);
> goto fail_rindex;
> }
> + /*
> + * i_mutex on quota files is special. Since this inode is hidden system
> + * file, we are safe to define locking ourselves.
> + */
> + lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
> + &gfs2_quota_imutex_key);
>
> error = gfs2_rindex_update(sdp);
> if (error)
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index a3bde91..3d9c989 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -781,7 +781,7 @@ static int do_sync(unsigned int num_qd, struct gfs2_quota_data **qda)
> return -ENOMEM;
>
> sort(qda, num_qd, sizeof(struct gfs2_quota_data *), sort_qd, NULL);
> - mutex_lock_nested(&ip->i_inode.i_mutex, I_MUTEX_QUOTA);
> + mutex_lock(&ip->i_inode.i_mutex);
> for (qx = 0; qx < num_qd; qx++) {
> error = gfs2_glock_nq_init(qda[qx]->qd_gl, LM_ST_EXCLUSIVE,
> GL_NOCACHE, &ghs[qx]);
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 06, 2012 at 03:27:46PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On Wed, 2012-09-05 at 16:59 -0400, J. Bruce Fields wrote:
> > On Wed, Sep 05, 2012 at 04:55:11PM -0400, J. Bruce Fields wrote:
> > > From: Jan Kara <[email protected]>
> > >
> > > GFS2 uses i_mutex on its system quota inode to synchronize writes to
> > > quota file. Since this is an internal inode to GFS2 (not part of directory
> > > hiearchy or visible by user) we are safe to define locking rules for it. So
> > > let's just get it its own locking class to make it clear.
> >
> > By the way, Steve, is there any reason you haven't taken this through
> > the gfs2 tree?
> >
> > --b.
> >
> I don't remember now, but I've pushed it into my tree for the next merge
> window,
Thanks!
--b.
From: "J. Bruce Fields" <[email protected]>
We'll need the same logic for rename and link.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/namei.c | 13 +++----------
include/linux/fs.h | 33 +++++++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index bdf1dca..eef4806 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3322,14 +3322,9 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
else {
error = security_inode_unlink(dir, dentry);
if (!error) {
- error = break_deleg(target, O_WRONLY|O_NONBLOCK);
- if (error) {
- if (error == -EWOULDBLOCK && delegated_inode) {
- *delegated_inode = target;
- ihold(target);
- }
+ error = try_break_deleg(target, delegated_inode);
+ if (error)
goto out;
- }
error = dir->i_op->unlink(dir, dentry);
if (!error)
dont_mount(dentry);
@@ -3397,9 +3392,7 @@ exit2:
if (inode)
iput(inode); /* truncate the inode here */
if (delegated_inode) {
- error = break_deleg(delegated_inode, O_WRONLY);
- iput(delegated_inode);
- delegated_inode = NULL;
+ error = break_deleg_wait(&delegated_inode);
if (!error)
goto retry_deleg;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6eb70a7..4c0fa4f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2094,6 +2094,9 @@ extern bool our_mnt(struct vfsmount *mnt);
extern int current_umask(void);
+extern void ihold(struct inode * inode);
+extern void iput(struct inode *);
+
/* /sys/fs */
extern struct kobject *fs_kobj;
@@ -2162,6 +2165,28 @@ static inline int break_deleg(struct inode *inode, unsigned int mode)
return 0;
}
+static inline int try_break_deleg(struct inode *inode, struct inode **delegated_inode)
+{
+ int ret;
+
+ ret = break_deleg(inode, O_WRONLY|O_NONBLOCK);
+ if (ret == -EWOULDBLOCK && delegated_inode) {
+ *delegated_inode = inode;
+ ihold(inode);
+ }
+ return ret;
+}
+
+static inline int break_deleg_wait(struct inode **delegated_inode)
+{
+ int ret;
+
+ ret = break_deleg(*delegated_inode, O_WRONLY);
+ iput(*delegated_inode);
+ *delegated_inode = NULL;
+ return ret;
+}
+
#else /* !CONFIG_FILE_LOCKING */
static inline int locks_mandatory_locked(struct inode *inode)
{
@@ -2205,6 +2230,12 @@ static inline int break_deleg(struct inode *inode, unsigned int mode)
{
return 0;
}
+
+static inline int try_break_deleg(struct inode *inode, struct delegated_inode **inode)
+{
+ return 0;
+}
+
#endif /* CONFIG_FILE_LOCKING */
/* fs/open.c */
@@ -2502,8 +2533,6 @@ extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin);
extern int inode_init_always(struct super_block *, struct inode *);
extern void inode_init_once(struct inode *);
extern void address_space_init_once(struct address_space *mapping);
-extern void ihold(struct inode * inode);
-extern void iput(struct inode *);
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
--
1.7.9.5
Hi,
On Wed, 2012-09-05 at 16:59 -0400, J. Bruce Fields wrote:
> On Wed, Sep 05, 2012 at 04:55:11PM -0400, J. Bruce Fields wrote:
> > From: Jan Kara <[email protected]>
> >
> > GFS2 uses i_mutex on its system quota inode to synchronize writes to
> > quota file. Since this is an internal inode to GFS2 (not part of directory
> > hiearchy or visible by user) we are safe to define locking rules for it. So
> > let's just get it its own locking class to make it clear.
>
> By the way, Steve, is there any reason you haven't taken this through
> the gfs2 tree?
>
> --b.
>
I don't remember now, but I've pushed it into my tree for the next merge
window,
Steve.
From: "J. Bruce Fields" <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Dustin Kirkland <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/ecryptfs/inode.c | 2 +-
fs/namei.c | 18 ++++++++++++++----
fs/nfsd/vfs.c | 2 +-
include/linux/fs.h | 2 +-
4 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 17b917f..5c5f5df 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -475,7 +475,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
dget(lower_new_dentry);
lower_dir_dentry = lock_parent(lower_new_dentry);
rc = vfs_link(lower_old_dentry, lower_dir_dentry->d_inode,
- lower_new_dentry);
+ lower_new_dentry, NULL);
if (rc || !lower_new_dentry->d_inode)
goto out_lock;
rc = ecryptfs_interpose(lower_new_dentry, new_dentry, dir->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index c7318ae..a3a915e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3475,7 +3475,7 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
return sys_symlinkat(oldname, AT_FDCWD, newname);
}
-int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry, struct inode **delegated_inode)
{
struct inode *inode = old_dentry->d_inode;
unsigned max_links = dir->i_sb->s_max_links;
@@ -3511,8 +3511,11 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
error = -ENOENT;
else if (max_links && inode->i_nlink >= max_links)
error = -EMLINK;
- else
- error = dir->i_op->link(old_dentry, dir, new_dentry);
+ else {
+ error = try_break_deleg(inode, delegated_inode);
+ if (!error)
+ error = dir->i_op->link(old_dentry, dir, new_dentry);
+ }
mutex_unlock(&inode->i_mutex);
if (!error)
fsnotify_link(dir, inode, new_dentry);
@@ -3533,6 +3536,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct path old_path, new_path;
+ struct inode *delegated_inode = NULL;
int how = 0;
int error;
@@ -3556,6 +3560,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
return error;
+retry_deleg:
new_dentry = user_path_create(newdfd, newname, &new_path, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
@@ -3570,9 +3575,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
error = security_path_link(old_path.dentry, &new_path, new_dentry);
if (error)
goto out_dput;
- error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
+ error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
out_dput:
done_path_create(&new_path, new_dentry);
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
out:
path_put(&old_path);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index c8f5e74..2a77b28 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1717,7 +1717,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
err = nfserrno(host_err);
goto out_dput;
}
- host_err = vfs_link(dold, dirp, dnew);
+ host_err = vfs_link(dold, dirp, dnew, NULL);
if (!host_err) {
err = nfserrno(commit_metadata(ffhp));
if (!err)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ad9df65e..52b8c67 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,7 +1721,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
extern int vfs_symlink(struct inode *, struct dentry *, const char *);
-extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
+extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **);
--
1.7.9.5
From: "J. Bruce Fields" <[email protected]>
We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.
Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:
acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retry
It is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.
The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.
Cc: David Howells <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Dustin Kirkland <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
drivers/base/devtmpfs.c | 2 +-
fs/cachefiles/namei.c | 2 +-
fs/ecryptfs/inode.c | 2 +-
fs/namei.c | 23 ++++++++++++++++++++---
fs/nfsd/vfs.c | 2 +-
include/linux/fs.h | 2 +-
ipc/mqueue.c | 2 +-
7 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index deb4a45..24162ec 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -317,7 +317,7 @@ static int handle_remove(const char *nodename, struct device *dev)
mutex_lock(&dentry->d_inode->i_mutex);
notify_change(dentry, &newattrs);
mutex_unlock(&dentry->d_inode->i_mutex);
- err = vfs_unlink(parent.dentry->d_inode, dentry);
+ err = vfs_unlink(parent.dentry->d_inode, dentry, NULL);
if (!err || err == -ENOENT)
deleted = 1;
}
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index b0b5f7c..10cf5c6 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -295,7 +295,7 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache,
if (ret < 0) {
cachefiles_io_error(cache, "Unlink security error");
} else {
- ret = vfs_unlink(dir->d_inode, rep);
+ ret = vfs_unlink(dir->d_inode, rep, NULL);
if (preemptive)
cachefiles_mark_object_buried(cache, rep);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 534b129..17b917f 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -153,7 +153,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
dget(lower_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
- rc = vfs_unlink(lower_dir_inode, lower_dentry);
+ rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
if (rc) {
printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
goto out_unlock;
diff --git a/fs/namei.c b/fs/namei.c
index 10d6622..bdf1dca 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3305,7 +3305,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
return do_rmdir(AT_FDCWD, pathname);
}
-int vfs_unlink(struct inode *dir, struct dentry *dentry)
+int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)
{
struct inode *target = dentry->d_inode;
int error = may_delete(dir, dentry, 0);
@@ -3322,11 +3322,20 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
else {
error = security_inode_unlink(dir, dentry);
if (!error) {
+ error = break_deleg(target, O_WRONLY|O_NONBLOCK);
+ if (error) {
+ if (error == -EWOULDBLOCK && delegated_inode) {
+ *delegated_inode = target;
+ ihold(target);
+ }
+ goto out;
+ }
error = dir->i_op->unlink(dir, dentry);
if (!error)
dont_mount(dentry);
}
}
+out:
mutex_unlock(&target->i_mutex);
/* We don't d_delete() NFS sillyrenamed files--they still exist. */
@@ -3351,6 +3360,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
struct dentry *dentry;
struct nameidata nd;
struct inode *inode = NULL;
+ struct inode *delegated_inode = NULL;
error = user_path_parent(dfd, pathname, &nd, &name);
if (error)
@@ -3364,7 +3374,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit1;
-
+retry_deleg:
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
error = PTR_ERR(dentry);
@@ -3379,13 +3389,20 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = security_path_unlink(&nd.path, dentry);
if (error)
goto exit2;
- error = vfs_unlink(nd.path.dentry->d_inode, dentry);
+ error = vfs_unlink(nd.path.dentry->d_inode, dentry, &delegated_inode);
exit2:
dput(dentry);
}
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
if (inode)
iput(inode); /* truncate the inode here */
+ if (delegated_inode) {
+ error = break_deleg(delegated_inode, O_WRONLY);
+ iput(delegated_inode);
+ delegated_inode = NULL;
+ if (!error)
+ goto retry_deleg;
+ }
mnt_drop_write(nd.path.mnt);
exit1:
path_put(&nd.path);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a9269f1..9c39d26 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1893,7 +1893,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (host_err)
goto out_put;
if (type != S_IFDIR)
- host_err = vfs_unlink(dirp, rdentry);
+ host_err = vfs_unlink(dirp, rdentry, NULL);
else
host_err = vfs_rmdir(dirp, rdentry);
if (!host_err)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73157f7..6eb70a7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1723,7 +1723,7 @@ extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
extern int vfs_symlink(struct inode *, struct dentry *, const char *);
extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
-extern int vfs_unlink(struct inode *, struct dentry *);
+extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
/*
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index f8e54f5..910b5ea 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -881,7 +881,7 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
err = mnt_want_write(ipc_ns->mq_mnt);
if (err)
goto out_err;
- err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ err = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
mnt_drop_write(ipc_ns->mq_mnt);
out_err:
dput(dentry);
--
1.7.9.5
On Mon, Sep 10, 2012 at 03:27:32PM +0800, Ram Pai wrote:
> > This seems still conform to the statement in the manual, in a weird way
> > though.
>
> I think the code that checks if the new dentry or the old dentry is a
> mount point can be safely removed. It does not serve any purpose AFAICT.
Yes, it does. It's a _very_ traditional behaviour, on a lot of Unices,
and changing it can screw all kinds of scripts. Worse yet, screw them
in rarely hit cases.
BTW, what do you expect to happen when the target is a mountpoint? Suppose
the mountpoint is a directory that happens to be empty; you are asking to
rename another directory over it...
Anyway, userland API stability considerations are sufficient to NAK any
such change. Sorry.