2022-06-13 23:21:42

by NeilBrown

[permalink] [raw]
Subject: [PATCH RFC 00/12] Allow concurrent directory updates.

VFS currently holds an exclusive lock on a directory during create,
unlink, rename. This imposes serialisation on all filesystems though
some may not benefit from it, and some may be able to provide finer
grained locking internally, thus reducing contention.

This series allows the filesystem to request that the inode lock be
shared rather than exclusive. In that case an exclusive lock will be
held on the dentry instead, much as is done for parallel lookup.

The NFS filesystem can easily support concurrent updates (server does
any needed serialiation) so it is converted.

This series also converts nfsd to use the new interfaces so concurrent
incoming NFS requests in the one directory can be handled concurrently.

As a net result, if an NFS mounted filesystem is reexported over NFS,
then multiple clients can create files in a single directory and all
synchronisation will be handled on the final server. This helps hid
latency on link from client to server.

I include a few nfsd patches that aren't strictly needed for this work,
but seem to be a logical consequence of the changes that I did have to
make.

I have only tested this lightly. In particular the rename support is
quite new and I haven't tried to break it yet.

I post this for general review, and hopefully extra testing... Daire
Byrne has expressed interest in the NFS re-export parallelism.

NeilBrown


---

NeilBrown (12):
VFS: support parallel updates in the one directory.
VFS: move EEXIST and ENOENT tests into lookup_hash_update()
VFS: move want_write checks into lookup_hash_update()
VFS: move dput() and mnt_drop_write() into done_path_update()
VFS: export done_path_update()
VFS: support concurrent renames.
NFS: support parallel updates in the one directory.
nfsd: allow parallel creates from nfsd
nfsd: support concurrent renames.
nfsd: reduce locking in nfsd_lookup()
nfsd: use (un)lock_inode instead of fh_(un)lock
nfsd: discard fh_locked flag and fh_lock/fh_unlock


fs/dcache.c | 59 ++++-
fs/namei.c | 578 ++++++++++++++++++++++++++++++++---------
fs/nfs/dir.c | 29 ++-
fs/nfs/inode.c | 2 +
fs/nfs/unlink.c | 5 +-
fs/nfsd/nfs2acl.c | 6 +-
fs/nfsd/nfs3acl.c | 4 +-
fs/nfsd/nfs3proc.c | 37 +--
fs/nfsd/nfs4acl.c | 7 +-
fs/nfsd/nfs4proc.c | 61 ++---
fs/nfsd/nfs4state.c | 8 +-
fs/nfsd/nfsfh.c | 10 +-
fs/nfsd/nfsfh.h | 58 +----
fs/nfsd/nfsproc.c | 31 +--
fs/nfsd/vfs.c | 243 ++++++++---------
fs/nfsd/vfs.h | 8 +-
include/linux/dcache.h | 27 ++
include/linux/fs.h | 1 +
include/linux/namei.h | 30 ++-
19 files changed, 791 insertions(+), 413 deletions(-)

--
Signature


2022-06-13 23:21:43

by NeilBrown

[permalink] [raw]
Subject: [PATCH 02/12] VFS: move EEXIST and ENOENT tests into lookup_hash_update()

Moving common error handling into lookup_hash_update() simplifies
callers.
A future patch will export this functionality to nfsd, and the more code
we put in the interface, the less code will be needed in nfsd.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 55 +++++++++++++++++++++++++++++++------------------------
1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9a2ca515c219..e7a56d6e2472 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1563,7 +1563,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
{
struct dentry *dentry = d_lookup(dir, name);
if (dentry) {
- int error = d_revalidate(dentry, flags);
+ int error;
+ /* Some filesystems assume EXCL -> CREATE, so make
+ * sure it does.
+ */
+ if (!(flags & LOOKUP_CREATE))
+ flags &= ~LOOKUP_EXCL;
+ error = d_revalidate(dentry, flags);
if (unlikely(error <= 0)) {
if (!error)
d_invalidate(dentry);
@@ -1621,6 +1627,8 @@ static struct dentry *__lookup_hash(const struct qstr *name,
* or shared lock depending on the fs preference, then do a lookup,
* and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
* was taken on the parent.
+ * If LOOKUP_EXCL, name should not already exist, else -EEXIST
+ * If not LOOKUP_CREATE, name should already exist, else -ENOENT
*/
static struct dentry *lookup_hash_update(const struct qstr *name,
struct dentry *base, unsigned int flags,
@@ -1644,6 +1652,20 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
err = PTR_ERR(dentry);
goto out_err;
}
+ if (flags & LOOKUP_EXCL) {
+ if (d_is_positive(dentry)) {
+ dput(dentry);
+ err = -EEXIST;
+ goto out_err;
+ }
+ }
+ if (!(flags & LOOKUP_CREATE)) {
+ if (!dentry->d_inode) {
+ dput(dentry);
+ err = -ENOENT;
+ goto out_err;
+ }
+ }
if (wq && !d_lock_update(dentry, base, name)) {
/*
* Failed to get lock due to race with unlink or rename
@@ -3838,7 +3860,7 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
- unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
+ unsigned int create_flag = LOOKUP_CREATE;
int err2;
int error;

@@ -3849,26 +3871,17 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
* '/', and a directory wasn't requested.
*/
if (last->name[last->len] && !want_dir)
- create_flags = 0;
+ /* Name was foo/bar/ but not creating a directory, so
+ * we won't try to create - result with be either -ENOENT
+ * or -EEXIST.
+ */
+ create_flag = 0;
dentry = lookup_hash_update(last, path->dentry,
- reval_flag | create_flags, wq);
+ reval_flag | create_flag | LOOKUP_EXCL,
+ wq);
if (IS_ERR(dentry))
goto drop_write;

- error = -EEXIST;
- if (d_is_positive(dentry))
- goto fail;
-
- /*
- * Special case - lookup gave negative, but... we had foo/bar/
- * From the vfs_mknod() POV we just have a negative dentry -
- * all is fine. Let's be bastards - you had / on the end, you've
- * been asking for (non-existent) directory. -ENOENT for you.
- */
- if (unlikely(!create_flags)) {
- error = -ENOENT;
- goto fail;
- }
if (unlikely(err2)) {
error = err2;
goto fail;
@@ -4264,10 +4277,6 @@ int do_rmdir(int dfd, struct filename *name)
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit3;
- if (!dentry->d_inode) {
- error = -ENOENT;
- goto exit4;
- }
error = security_path_rmdir(&path, dentry);
if (error)
goto exit4;
@@ -4407,8 +4416,6 @@ int do_unlinkat(int dfd, struct filename *name)
if (last.name[last.len])
goto slashes;
inode = dentry->d_inode;
- if (d_is_negative(dentry))
- goto slashes;
ihold(inode);
error = security_path_unlink(&path, dentry);
if (error)


2022-06-13 23:21:44

by NeilBrown

[permalink] [raw]
Subject: [PATCH 01/12] VFS: support parallel updates in the one directory.

From: NeilBrown <[email protected]>

Some filesystems can support parallel modifications to a directory,
either because the modification happen on a remote server which does its
own locking (e.g. NFS) or because they can internally lock just a part
of a directory (e.g. many local filesystems, with a bit of work - the
lustre project has patches for ext4 to support concurrent updates).

To allow this, we introduce VFS support for parallel modification:
unlink (including rmdir) and create. Parallel rename is added in a
later patch.

If a filesystem supports parallel modification in a given directory, it
sets S_PAR_UPDATE on the inode for that directory. lookup_open() and
the new lookup_hash_update() notice the flag and take a shared lock on
the directory, and rely on a lock-bit in d_flags, much like parallel
lookup relies on DCACHE_PAR_LOOKUP.

__lookup_hash() is enhanced to use d_alloc_parallel() if it was told
that a shared lock was taken - by being given a non-NULL *wq.

The new DCACHE_PAR_UPDATE is the most significant bit in d_flags, so we
have nearly run out of flags. 0x08000000 is still unused .... for now.

Once a dentry for the target name has been obtained, DCACHE_PAR_UPDATE
is set on it, waiting if necessary if it is already set. Once this is
set, the thread has exclusive access to the name and can call into the
filesystem to perform the required action.

We use wait_var_event() to wait for DCACHE_PAR_UPDATE to be cleared rather
than trying to overload the wq used for DCACHE_PAR_LOOKUP.

Some filesystems do *not* complete the lookup that precedes a create,
but leave the dentry d_in_lookup() and unhashed, so often a dentry will
have both DCACHE_PAR_LOOKUP and DCACHE_PAR_UPDATE set at the same time.
To allow for this, we need the 'wq' that is passed to d_alloc_parallel()
(and used when DCACHE_PAR_LOOKUP is cleared), to continue to exist until
the creation is complete, which may be after filename_create()
completes. So a wq now needs to be passed to filename_create() when
parallel modification is attempted.

Waiting for DCACHE_PAR_UPDATE can sometimes result in the dentry
changing - either through unlink or rename. This requires that the
lookup be retried. This in turn requires that the wq be reused. The
use for DECLARE_WAITQUEUE() in d_wait_lookup() and the omission of
remove_wait_queue() means that after the wake_up_all() in
__d_lookup_down(), the wait_queue_head list will be undefined. So we
change d_wait_lookup() to use DEFINE_WAIT(), so that
autoremove_wake_function() is used for wakeups, and the wait_queue_head
remains well defined.

Signed-off-by: NeilBrown <[email protected]>
---
fs/dcache.c | 59 ++++++++++
fs/namei.c | 277 ++++++++++++++++++++++++++++++++++++++----------
include/linux/dcache.h | 27 +++++
include/linux/fs.h | 1
include/linux/namei.h | 13 ++
5 files changed, 317 insertions(+), 60 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bf..1b523b512575 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1765,6 +1765,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
struct dentry *dentry;
char *dname;
int err;
+ static struct lock_class_key __key;

dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru,
GFP_KERNEL);
@@ -1820,6 +1821,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_LIST_HEAD(&dentry->d_child);
d_set_d_op(dentry, dentry->d_sb->s_d_op);

+ lockdep_init_map(&dentry->d_update_map, "DENTRY_PAR_UPDATE", &__key, 0);
+
if (dentry->d_op && dentry->d_op->d_init) {
err = dentry->d_op->d_init(dentry);
if (err) {
@@ -2580,7 +2583,7 @@ static inline void end_dir_add(struct inode *dir, unsigned n)
static void d_wait_lookup(struct dentry *dentry)
{
if (d_in_lookup(dentry)) {
- DECLARE_WAITQUEUE(wait, current);
+ DEFINE_WAIT(wait);
add_wait_queue(dentry->d_wait, &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
@@ -3209,6 +3212,60 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
}
EXPORT_SYMBOL(d_tmpfile);

+/**
+ * d_lock_update_nested - lock a dentry before updating
+ * @dentry: the dentry to be locked
+ * @base: the parent, or %NULL
+ * @name: the name in that parent, or %NULL
+ * @subclass: lockdep locking class.
+ *
+ * Lock a dentry in a directory on which a shared-lock is held, and
+ * on which parallel updates are permitted.
+ * If the base and name are given, then on success the dentry will still
+ * have that base and name - it will not have raced with rename.
+ * On success, a positive dentry will still be hashed, ensuring there
+ * was no race with unlink.
+ * If they are not given, then on success the dentry will be negative,
+ * which again ensures no race with rename, or unlink.
+ */
+bool d_lock_update_nested(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name,
+ int subclass)
+{
+ bool ret = true;
+
+ lock_acquire_exclusive(&dentry->d_update_map, subclass,
+ 0, NULL, _THIS_IP_);
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_flags & DCACHE_PAR_UPDATE)
+ ___wait_var_event(&dentry->d_flags,
+ !(dentry->d_flags & DCACHE_PAR_UPDATE),
+ TASK_UNINTERRUPTIBLE, 0, 0,
+ (spin_unlock(&dentry->d_lock),
+ schedule(),
+ spin_lock(&dentry->d_lock))
+ );
+ if (d_unhashed(dentry) && d_is_positive(dentry)) {
+ /* name was unlinked while we waited */
+ ret = false;
+ } else if (base &&
+ (dentry->d_parent != base ||
+ dentry->d_name.hash != name->hash ||
+ !d_same_name(dentry, base, name))) {
+ /* dentry was renamed - possibly silly-rename */
+ ret = false;
+ } else if (!base && d_is_positive(dentry)) {
+ ret = false;
+ } else {
+ dentry->d_flags |= DCACHE_PAR_UPDATE;
+ }
+ spin_unlock(&dentry->d_lock);
+ if (!ret)
+ lock_map_release(&dentry->d_update_map);
+ return ret;
+}
+EXPORT_SYMBOL(d_lock_update_nested);
+
static __initdata unsigned long dhash_entries;
static int __init set_dhash_entries(char *str)
{
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..9a2ca515c219 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1575,14 +1575,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
}

/*
- * Parent directory has inode locked exclusive. This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked exclusive, or shared if wq is
+ * given. In the later case the fs has explicitly allowed concurrent
+ * creates in this directory.
*/
static struct dentry *__lookup_hash(const struct qstr *name,
- struct dentry *base, unsigned int flags)
+ struct dentry *base, unsigned int flags,
+ wait_queue_head_t *wq)
{
struct dentry *dentry = lookup_dcache(name, base, flags);
struct dentry *old;
@@ -1595,18 +1594,107 @@ static struct dentry *__lookup_hash(const struct qstr *name,
if (unlikely(IS_DEADDIR(dir)))
return ERR_PTR(-ENOENT);

- dentry = d_alloc(base, name);
+ if (wq)
+ dentry = d_alloc_parallel(base, name, wq);
+ else
+ dentry = d_alloc(base, name);
if (unlikely(!dentry))
return ERR_PTR(-ENOMEM);
+ if (IS_ERR(dentry))
+ return dentry;
+
+ if (wq && d_in_lookup(dentry))
+ /* Must have raced with another thread doing the lookup */
+ return dentry;

old = dir->i_op->lookup(dir, dentry, flags);
if (unlikely(old)) {
+ d_lookup_done(dentry);
dput(dentry);
dentry = old;
}
return dentry;
}

+/*
+ * Parent directory (base) is not locked. We take either an exclusive
+ * or shared lock depending on the fs preference, then do a lookup,
+ * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
+ * was taken on the parent.
+ */
+static struct dentry *lookup_hash_update(const struct qstr *name,
+ struct dentry *base, unsigned int flags,
+ wait_queue_head_t *wq)
+{
+ struct dentry *dentry;
+ struct inode *dir = base->d_inode;
+ int err;
+
+ if (!(dir->i_flags & S_PAR_UPDATE))
+ wq = NULL;
+
+ if (wq)
+ inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(dir, I_MUTEX_PARENT);
+
+retry:
+ dentry = __lookup_hash(name, base, flags, wq);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out_err;
+ }
+ if (wq && !d_lock_update(dentry, base, name)) {
+ /*
+ * Failed to get lock due to race with unlink or rename
+ * - try again
+ */
+ d_lookup_done(dentry);
+ dput(dentry);
+ goto retry;
+ }
+ return dentry;
+
+out_err:
+ if (wq)
+ inode_unlock_shared(dir);
+ else
+ inode_unlock(dir);
+ return ERR_PTR(err);
+}
+
+static int lookup_one_common(struct user_namespace *mnt_userns,
+ const char *name, struct dentry *base, int len,
+ struct qstr *this);
+
+struct dentry *lookup_hash_update_len(const char *name, int nlen,
+ struct path *path, unsigned int flags,
+ wait_queue_head_t *wq)
+{
+ struct qstr this;
+ int err = lookup_one_common(mnt_user_ns(path->mnt), name,
+ path->dentry, nlen, &this);
+ if (err)
+ return ERR_PTR(err);
+ return lookup_hash_update(&this, path->dentry, flags, wq);
+}
+EXPORT_SYMBOL(lookup_hash_update_len);
+
+static void done_path_update(struct path *path, struct dentry *dentry,
+ wait_queue_head_t *wq)
+{
+ struct inode *dir = path->dentry->d_inode;
+ bool shared = (dir->i_flags & S_PAR_UPDATE) && wq;
+
+ if (shared) {
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
+ inode_unlock_shared(dir);
+ } else {
+ inode_unlock(dir);
+ }
+}
+
static struct dentry *lookup_fast(struct nameidata *nd,
struct inode **inode,
unsigned *seqp)
@@ -2589,7 +2677,7 @@ static struct dentry *__kern_path_locked(struct filename *name, struct path *pat
return ERR_PTR(-EINVAL);
}
inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- d = __lookup_hash(&last, path->dentry, 0);
+ d = __lookup_hash(&last, path->dentry, 0, NULL);
if (IS_ERR(d)) {
inode_unlock(path->dentry->d_inode);
path_put(path);
@@ -3226,16 +3314,32 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
{
struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
struct inode *dir = nd->path.dentry->d_inode;
+ bool have_par_update = false;
int error;

if (nd->flags & LOOKUP_DIRECTORY)
open_flag |= O_DIRECTORY;

+ /*
+ * dentry is negative or d_in_lookup(). If this is a
+ * shared-lock create we need to set DCACHE_PAR_UPDATE to ensure
+ * exclusion.
+ */
+ if ((open_flag & O_CREAT) &&
+ (dir->i_flags & S_PAR_UPDATE)) {
+ if (!d_lock_update(dentry, NULL, NULL))
+ /* already exists, non-atomic open */
+ return dentry;
+ have_par_update = true;
+ }
+
file->f_path.dentry = DENTRY_NOT_SET;
file->f_path.mnt = nd->path.mnt;
error = dir->i_op->atomic_open(dir, dentry, file,
open_to_namei_flags(open_flag), mode);
d_lookup_done(dentry);
+ if (have_par_update)
+ d_unlock_update(dentry);
if (!error) {
if (file->f_mode & FMODE_OPENED) {
if (unlikely(dentry != file->f_path.dentry)) {
@@ -3287,6 +3391,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
int error, create_error = 0;
umode_t mode = op->mode;
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ bool have_par_update = false;

if (unlikely(IS_DEADDIR(dir_inode)))
return ERR_PTR(-ENOENT);
@@ -3361,6 +3466,14 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
dentry = res;
}
}
+ /*
+ * If dentry is negative and this is a shared-lock
+ * create we need to get DCACHE_PAR_UPDATE to ensure exclusion
+ */
+ if ((open_flag & O_CREAT) &&
+ !dentry->d_inode &&
+ (dir_inode->i_flags & S_PAR_UPDATE))
+ have_par_update = d_lock_update(dentry, NULL, NULL);

/* Negative dentry, just create the file */
if (!dentry->d_inode && (open_flag & O_CREAT)) {
@@ -3380,9 +3493,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
error = create_error;
goto out_dput;
}
+ if (have_par_update)
+ d_unlock_update(dentry);
return dentry;

out_dput:
+ if (have_par_update)
+ d_unlock_update(dentry);
dput(dentry);
return ERR_PTR(error);
}
@@ -3397,6 +3514,7 @@ static const char *open_last_lookups(struct nameidata *nd,
struct inode *inode;
struct dentry *dentry;
const char *res;
+ bool shared;

nd->flags |= op->intent;

@@ -3437,14 +3555,15 @@ static const char *open_last_lookups(struct nameidata *nd,
* dropping this one anyway.
*/
}
- if (open_flag & O_CREAT)
+ shared = !!(dir->d_inode->i_flags & S_PAR_UPDATE);
+ if ((open_flag & O_CREAT) && !shared)
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
dentry = lookup_open(nd, file, op, got_write);
if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
fsnotify_create(dir->d_inode, dentry);
- if (open_flag & O_CREAT)
+ if ((open_flag & O_CREAT) && !shared)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);
@@ -3712,41 +3831,29 @@ struct file *do_file_open_root(const struct path *root,
return file;
}

-static struct dentry *filename_create(int dfd, struct filename *name,
- struct path *path, unsigned int lookup_flags)
+static struct dentry *filename_create_one(struct qstr *last, struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq)
{
- struct dentry *dentry = ERR_PTR(-EEXIST);
- struct qstr last;
+ struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
- int type;
int err2;
int error;

- error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
- if (error)
- return ERR_PTR(error);
-
- /*
- * Yucky last component or no last component at all?
- * (foo/., foo/.., /////)
- */
- if (unlikely(type != LAST_NORM))
- goto out;
-
/* don't fail immediately if it's r/o, at least try to report other errors */
err2 = mnt_want_write(path->mnt);
/*
* Do the final lookup. Suppress 'create' if there is a trailing
* '/', and a directory wasn't requested.
*/
- if (last.name[last.len] && !want_dir)
+ if (last->name[last->len] && !want_dir)
create_flags = 0;
- inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path->dentry, reval_flag | create_flags);
+ dentry = lookup_hash_update(last, path->dentry,
+ reval_flag | create_flags, wq);
if (IS_ERR(dentry))
- goto unlock;
+ goto drop_write;

error = -EEXIST;
if (d_is_positive(dentry))
@@ -3768,14 +3875,56 @@ static struct dentry *filename_create(int dfd, struct filename *name,
}
return dentry;
fail:
+ done_path_update(path, dentry, wq);
dput(dentry);
dentry = ERR_PTR(error);
-unlock:
- inode_unlock(path->dentry->d_inode);
+drop_write:
if (!err2)
mnt_drop_write(path->mnt);
+ return dentry;
+}
+
+struct dentry *filename_create_one_len(const char *name, int nlen,
+ struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq)
+{
+ struct qstr this;
+ int err;
+
+ err = lookup_one_common(mnt_user_ns(path->mnt), name,
+ path->dentry, nlen, &this);
+ if (err)
+ return ERR_PTR(err);
+ return filename_create_one(&this, path, lookup_flags, wq);
+}
+EXPORT_SYMBOL(filename_create_one_len);
+
+static struct dentry *filename_create(int dfd, struct filename *name,
+ struct path *path, unsigned int lookup_flags,
+ wait_queue_head_t *wq)
+{
+ struct dentry *dentry = ERR_PTR(-EEXIST);
+ struct qstr last;
+ unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
+ int type;
+ int error;
+
+ error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
+ if (error)
+ return ERR_PTR(error);
+
+ /*
+ * Yucky last component or no last component at all?
+ * (foo/., foo/.., /////)
+ */
+ if (unlikely(type != LAST_NORM))
+ goto out;
+
+ dentry = filename_create_one(&last, path, lookup_flags, wq);
out:
- path_put(path);
+ if (IS_ERR(dentry))
+ path_put(path);
return dentry;
}

@@ -3783,27 +3932,29 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
struct path *path, unsigned int lookup_flags)
{
struct filename *filename = getname_kernel(pathname);
- struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+ struct dentry *res = filename_create(dfd, filename, path, lookup_flags,
+ NULL);

putname(filename);
return res;
}
EXPORT_SYMBOL(kern_path_create);

-void done_path_create(struct path *path, struct dentry *dentry)
+void done_path_create_wq(struct path *path, struct dentry *dentry,
+ wait_queue_head_t *wq)
{
+ done_path_update(path, dentry, wq);
dput(dentry);
- inode_unlock(path->dentry->d_inode);
mnt_drop_write(path->mnt);
path_put(path);
}
-EXPORT_SYMBOL(done_path_create);
+EXPORT_SYMBOL(done_path_create_wq);

inline struct dentry *user_path_create(int dfd, const char __user *pathname,
- struct path *path, unsigned int lookup_flags)
+ struct path *path, unsigned int lookup_flags)
{
struct filename *filename = getname(pathname);
- struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+ struct dentry *res = filename_create(dfd, filename, path, lookup_flags, NULL);

putname(filename);
return res;
@@ -3882,12 +4033,13 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
struct path path;
int error;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

error = may_mknod(mode);
if (error)
goto out1;
retry:
- dentry = filename_create(dfd, name, &path, lookup_flags);
+ dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out1;
@@ -3916,7 +4068,7 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
break;
}
out2:
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, &wq);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -3985,9 +4137,10 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_DIRECTORY;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

retry:
- dentry = filename_create(dfd, name, &path, lookup_flags);
+ dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putname;
@@ -4001,7 +4154,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
mode);
}
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, &wq);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -4085,6 +4238,7 @@ int do_rmdir(int dfd, struct filename *name)
struct qstr last;
int type;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
if (error)
@@ -4106,8 +4260,7 @@ int do_rmdir(int dfd, struct filename *name)
if (error)
goto exit2;

- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+ dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit3;
@@ -4121,9 +4274,9 @@ int do_rmdir(int dfd, struct filename *name)
mnt_userns = mnt_user_ns(path.mnt);
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
exit4:
+ done_path_update(&path, dentry, &wq);
dput(dentry);
exit3:
- inode_unlock(path.dentry->d_inode);
mnt_drop_write(path.mnt);
exit2:
path_put(&path);
@@ -4148,13 +4301,14 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
* @dentry: victim
* @delegated_inode: returns victim inode, if the inode is delegated.
*
- * The caller must hold dir->i_mutex.
+ * The caller must either hold a write-lock on dir->i_rwsem, or
+ * a read-lock having atomically set DCACHE_PAR_UPDATE.
*
* If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
* return a reference to the inode in delegated_inode. The caller
* should then break the delegation on that inode and retry. Because
* breaking a delegation may take a long time, the caller should drop
- * dir->i_mutex before doing so.
+ * dir->i_rwsem before doing so.
*
* Alternatively, a caller may pass NULL for delegated_inode. This may
* be appropriate for callers that expect the underlying filesystem not
@@ -4213,9 +4367,11 @@ EXPORT_SYMBOL(vfs_unlink);

/*
* Make sure that the actual truncation of the file will occur outside its
- * directory's i_mutex. Truncate can take a long time if there is a lot of
+ * directory's i_rwsem. Truncate can take a long time if there is a lot of
* writeout happening, and we don't want to prevent access to the directory
* while waiting on the I/O.
+ * If the directory permits (S_PAR_UPDATE), we take a shared lock on the
+ * directory DCACHE_PAR_UPDATE to get exclusive access to the dentry.
*/
int do_unlinkat(int dfd, struct filename *name)
{
@@ -4227,6 +4383,7 @@ int do_unlinkat(int dfd, struct filename *name)
struct inode *inode = NULL;
struct inode *delegated_inode = NULL;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
if (error)
@@ -4239,9 +4396,9 @@ int do_unlinkat(int dfd, struct filename *name)
error = mnt_want_write(path.mnt);
if (error)
goto exit2;
+
retry_deleg:
- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+ dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
struct user_namespace *mnt_userns;
@@ -4260,9 +4417,9 @@ int do_unlinkat(int dfd, struct filename *name)
error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
&delegated_inode);
exit3:
+ done_path_update(&path, dentry, &wq);
dput(dentry);
}
- inode_unlock(path.dentry->d_inode);
if (inode)
iput(inode); /* truncate the inode here */
inode = NULL;
@@ -4351,13 +4508,14 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
struct dentry *dentry;
struct path path;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (IS_ERR(from)) {
error = PTR_ERR(from);
goto out_putnames;
}
retry:
- dentry = filename_create(newdfd, to, &path, lookup_flags);
+ dentry = filename_create(newdfd, to, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putnames;
@@ -4370,7 +4528,7 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
error = vfs_symlink(mnt_userns, path.dentry->d_inode, dentry,
from->name);
}
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, &wq);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -4499,6 +4657,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
struct inode *delegated_inode = NULL;
int how = 0;
int error;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
error = -EINVAL;
@@ -4522,7 +4681,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
goto out_putnames;

new_dentry = filename_create(newdfd, new, &new_path,
- (how & LOOKUP_REVAL));
+ (how & LOOKUP_REVAL), &wq);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_putpath;
@@ -4540,7 +4699,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
new_dentry, &delegated_inode);
out_dput:
- done_path_create(&new_path, new_dentry);
+ done_path_create_wq(&new_path, new_dentry, &wq);
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
if (!error) {
@@ -4810,7 +4969,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
retry_deleg:
trap = lock_rename(new_path.dentry, old_path.dentry);

- old_dentry = __lookup_hash(&old_last, old_path.dentry, lookup_flags);
+ old_dentry = __lookup_hash(&old_last, old_path.dentry,
+ lookup_flags, NULL);
error = PTR_ERR(old_dentry);
if (IS_ERR(old_dentry))
goto exit3;
@@ -4818,7 +4978,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
error = -ENOENT;
if (d_is_negative(old_dentry))
goto exit4;
- new_dentry = __lookup_hash(&new_last, new_path.dentry, lookup_flags | target_flags);
+ new_dentry = __lookup_hash(&new_last, new_path.dentry,
+ lookup_flags | target_flags, NULL);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto exit4;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f5bba51480b2..6331cf4ac87a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -13,7 +13,9 @@
#include <linux/rcupdate.h>
#include <linux/lockref.h>
#include <linux/stringhash.h>
+#include <linux/sched.h>
#include <linux/wait.h>
+#include <linux/wait_bit.h>

struct path;
struct vfsmount;
@@ -96,6 +98,8 @@ struct dentry {
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */

+ /* lockdep tracking of DCACHE_PAR_UPDATE locks */
+ struct lockdep_map d_update_map;
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
@@ -211,6 +215,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_PAR_UPDATE 0x80000000 /* Being created or unlinked with shared lock */

extern seqlock_t rename_lock;

@@ -599,4 +604,26 @@ struct name_snapshot {
void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
void release_dentry_name_snapshot(struct name_snapshot *);

+bool d_lock_update_nested(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name,
+ int subclass);
+static inline bool d_lock_update(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name)
+{
+ return d_lock_update_nested(dentry, base, name, 0);
+}
+
+static inline void d_unlock_update(struct dentry *dentry)
+{
+ if (IS_ERR_OR_NULL(dentry))
+ return;
+ if (dentry->d_flags & DCACHE_PAR_UPDATE) {
+ lock_map_release(&dentry->d_update_map);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+ spin_unlock(&dentry->d_lock);
+ wake_up_var(&dentry->d_flags);
+ }
+}
+
#endif /* __LINUX_DCACHE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..96a2a23927e1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2146,6 +2146,7 @@ struct super_operations {
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PAR_UPDATE (1 << 18) /* Parallel directory operations supported */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/namei.h b/include/linux/namei.h
index caeb08a98536..f75c6639dd1a 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -61,7 +61,14 @@ extern int kern_path(const char *, unsigned, struct path *);

extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int);
extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int);
-extern void done_path_create(struct path *, struct dentry *);
+extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
+ struct path *path, unsigned int flags,
+ wait_queue_head_t *wq);
+extern void done_path_create_wq(struct path *, struct dentry *, wait_queue_head_t *wq);
+static inline void done_path_create(struct path *path, struct dentry *dentry)
+{
+ done_path_create_wq(path, dentry, NULL);
+}
extern struct dentry *kern_path_locked(const char *, struct path *);

extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
@@ -75,6 +82,10 @@ struct dentry *lookup_one_unlocked(struct user_namespace *mnt_userns,
struct dentry *lookup_one_positive_unlocked(struct user_namespace *mnt_userns,
const char *name,
struct dentry *base, int len);
+struct dentry *filename_create_one_len(const char *name, int nlen,
+ struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq);

extern int follow_down_one(struct path *);
extern int follow_down(struct path *);


2022-06-13 23:21:58

by NeilBrown

[permalink] [raw]
Subject: [PATCH 03/12] VFS: move want_write checks into lookup_hash_update()

mnt_want_write() is always called before lookup_hash_update(), so we can
simplify the code by moving the call into that function.

If lookup_hash_update() succeeds, it now will have claimed the
want_write lock. If it fails, the want_write lock isn't held.

Also lookup_hash_update() now receives a 'struct path'.

Note that when creating a name, any error from mnt_want_write() does not
get reported unless there is no other error. For unlink/rmdir though,
an error from mnt_want_write() is immediately fatal - overriding ENOENT
for example. This behaviour seems strange, but this patch is careful to
preserve it.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 71 +++++++++++++++++++++++++-----------------------------------
1 file changed, 29 insertions(+), 42 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e7a56d6e2472..83ce2f7083be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1631,12 +1631,22 @@ static struct dentry *__lookup_hash(const struct qstr *name,
* If not LOOKUP_CREATE, name should already exist, else -ENOENT
*/
static struct dentry *lookup_hash_update(const struct qstr *name,
- struct dentry *base, unsigned int flags,
+ struct path *path, unsigned int flags,
wait_queue_head_t *wq)
{
struct dentry *dentry;
+ struct dentry *base = path->dentry;
struct inode *dir = base->d_inode;
- int err;
+ int err, err2;
+
+ /* For create, don't fail immediately if it's r/o,
+ * at least try to report other errors.
+ * For unlink/rmdir where LOOKUP_REVAl is the only
+ * flag, fail immediately if r/o.
+ */
+ err2 = mnt_want_write(path->mnt);
+ if (err2 && (flags & ~LOOKUP_REVAL) == 0)
+ return ERR_PTR(err2);

if (!(dir->i_flags & S_PAR_UPDATE))
wq = NULL;
@@ -1675,6 +1685,11 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
dput(dentry);
goto retry;
}
+ if (err2) {
+ err = err2;
+ dput(dentry);
+ goto out_err;
+ }
return dentry;

out_err:
@@ -1682,6 +1697,8 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
inode_unlock_shared(dir);
else
inode_unlock(dir);
+ if (!err2)
+ mnt_drop_write(path->mnt);
return ERR_PTR(err);
}

@@ -1698,7 +1715,7 @@ struct dentry *lookup_hash_update_len(const char *name, int nlen,
path->dentry, nlen, &this);
if (err)
return ERR_PTR(err);
- return lookup_hash_update(&this, path->dentry, flags, wq);
+ return lookup_hash_update(&this, path, flags, wq);
}
EXPORT_SYMBOL(lookup_hash_update_len);

@@ -3857,15 +3874,10 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
unsigned int lookup_flags,
wait_queue_head_t *wq)
{
- struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
unsigned int create_flag = LOOKUP_CREATE;
- int err2;
- int error;

- /* don't fail immediately if it's r/o, at least try to report other errors */
- err2 = mnt_want_write(path->mnt);
/*
* Do the final lookup. Suppress 'create' if there is a trailing
* '/', and a directory wasn't requested.
@@ -3876,25 +3888,9 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
* or -EEXIST.
*/
create_flag = 0;
- dentry = lookup_hash_update(last, path->dentry,
- reval_flag | create_flag | LOOKUP_EXCL,
- wq);
- if (IS_ERR(dentry))
- goto drop_write;
-
- if (unlikely(err2)) {
- error = err2;
- goto fail;
- }
- return dentry;
-fail:
- done_path_update(path, dentry, wq);
- dput(dentry);
- dentry = ERR_PTR(error);
-drop_write:
- if (!err2)
- mnt_drop_write(path->mnt);
- return dentry;
+ return lookup_hash_update(last, path,
+ reval_flag | create_flag | LOOKUP_EXCL,
+ wq);
}

struct dentry *filename_create_one_len(const char *name, int nlen,
@@ -4269,23 +4265,18 @@ int do_rmdir(int dfd, struct filename *name)
goto exit2;
}

- error = mnt_want_write(path.mnt);
- if (error)
- goto exit2;
-
- dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
+ dentry = lookup_hash_update(&last, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
- goto exit3;
+ goto exit2;
error = security_path_rmdir(&path, dentry);
if (error)
- goto exit4;
+ goto exit3;
mnt_userns = mnt_user_ns(path.mnt);
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
-exit4:
+exit3:
done_path_update(&path, dentry, &wq);
dput(dentry);
-exit3:
mnt_drop_write(path.mnt);
exit2:
path_put(&path);
@@ -4402,12 +4393,8 @@ int do_unlinkat(int dfd, struct filename *name)
if (type != LAST_NORM)
goto exit2;

- error = mnt_want_write(path.mnt);
- if (error)
- goto exit2;
-
retry_deleg:
- dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
+ dentry = lookup_hash_update(&last, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
struct user_namespace *mnt_userns;
@@ -4426,6 +4413,7 @@ int do_unlinkat(int dfd, struct filename *name)
exit3:
done_path_update(&path, dentry, &wq);
dput(dentry);
+ mnt_drop_write(path.mnt);
}
if (inode)
iput(inode); /* truncate the inode here */
@@ -4435,7 +4423,6 @@ int do_unlinkat(int dfd, struct filename *name)
if (!error)
goto retry_deleg;
}
- mnt_drop_write(path.mnt);
exit2:
path_put(&path);
if (retry_estale(error, lookup_flags)) {


2022-06-13 23:22:08

by NeilBrown

[permalink] [raw]
Subject: [PATCH 04/12] VFS: move dput() and mnt_drop_write() into done_path_update()

All calls to done_path_update() are followed by the same two calls, so
merge them in.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 83ce2f7083be..f13bff877e30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1732,6 +1732,8 @@ static void done_path_update(struct path *path, struct dentry *dentry,
} else {
inode_unlock(dir);
}
+ dput(dentry);
+ mnt_drop_write(path->mnt);
}

static struct dentry *lookup_fast(struct nameidata *nd,
@@ -3953,8 +3955,6 @@ void done_path_create_wq(struct path *path, struct dentry *dentry,
wait_queue_head_t *wq)
{
done_path_update(path, dentry, wq);
- dput(dentry);
- mnt_drop_write(path->mnt);
path_put(path);
}
EXPORT_SYMBOL(done_path_create_wq);
@@ -4276,8 +4276,6 @@ int do_rmdir(int dfd, struct filename *name)
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
exit3:
done_path_update(&path, dentry, &wq);
- dput(dentry);
- mnt_drop_write(path.mnt);
exit2:
path_put(&path);
if (retry_estale(error, lookup_flags)) {
@@ -4412,8 +4410,6 @@ int do_unlinkat(int dfd, struct filename *name)
&delegated_inode);
exit3:
done_path_update(&path, dentry, &wq);
- dput(dentry);
- mnt_drop_write(path.mnt);
}
if (inode)
iput(inode); /* truncate the inode here */


2022-06-13 23:22:23

by NeilBrown

[permalink] [raw]
Subject: [PATCH 06/12] VFS: support concurrent renames.

An object can now be renamed from or to a directory in which a create or
unlink is concurrently happening.
Two or more renames with the one directory can also be concurrent.

There is still only one cross-directory rename permitted at a time.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 230 ++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/namei.h | 9 ++
2 files changed, 216 insertions(+), 23 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8ce7aa16b704..31ba4dbedfcf 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3172,6 +3172,197 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
}
EXPORT_SYMBOL(unlock_rename);

+struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ struct qstr *last1, struct qstr *last2,
+ unsigned int flags1, unsigned int flags2)
+{
+ struct dentry *p;
+ struct dentry *d1, *d2;
+
+ if (p1 == p2) {
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ d1 = __lookup_hash(last1, p1, flags1, NULL);
+ if (IS_ERR(d1))
+ goto out_unlock_1;
+ d2 = __lookup_hash(last2, p2, flags2, NULL);
+ if (IS_ERR(d2))
+ goto out_unlock_2;
+ *d1p = d1; *d2p = d2;
+ return NULL;
+ out_unlock_2:
+ dput(d1);
+ d1 = d2;
+ out_unlock_1:
+ inode_unlock(p1->d_inode);
+ return d1;
+ }
+
+ mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
+
+ if ((p = d_ancestor(p2, p1)) != NULL) {
+ inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p1->d_inode, I_MUTEX_CHILD);
+ } else if ((p = d_ancestor(p1, p2)) != NULL) {
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p2->d_inode, I_MUTEX_CHILD);
+ } else {
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
+ }
+ d1 = __lookup_hash(last1, p1, flags1, NULL);
+ if (IS_ERR(d1))
+ goto unlock_out_3;
+ d2 = __lookup_hash(last2, p2, flags2, NULL);
+ if (IS_ERR(d2))
+ goto unlock_out_4;
+
+ *d1p = d1;
+ *d2p = d2;
+ return p;
+unlock_out_4:
+ dput(d1);
+ d1 = d2;
+unlock_out_3:
+ inode_unlock(p1->d_inode);
+ inode_unlock(p2->d_inode);
+ mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+ return d1;
+}
+
+static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ struct qstr *last1, struct qstr *last2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq)
+{
+ struct dentry *p;
+ struct dentry *d1, *d2;
+ bool ok1, ok2;
+
+ if (!wq || (p1->d_inode->i_flags & S_PAR_UPDATE) == 0)
+ return lock_rename_lookup_excl(p1, p2, d1p, d2p, last1, last2,
+ flags1, flags2);
+
+ if (p1 == p2) {
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ retry:
+ d1 = __lookup_hash(last1, p1, flags1, wq);
+ if (IS_ERR(d1))
+ goto out_unlock_1;
+ d2 = __lookup_hash(last2, p2, flags2, wq);
+ if (IS_ERR(d2))
+ goto out_unlock_2;
+ *d1p = d1; *d2p = d2;
+
+ if (d1 < d2) {
+ ok1 = d_lock_update(d1, p1, last1);
+ ok2 = d_lock_update(d2, p2, last2);
+ } else {
+ ok2 = d_lock_update(d2, p2, last2);
+ ok1 = d_lock_update(d1, p1, last1);
+ }
+ if (!ok1 || !ok2) {
+ if (ok1)
+ d_unlock_update(d1);
+ if (ok2)
+ d_unlock_update(d2);
+ dput(d1);
+ dput(d2);
+ goto retry;
+ }
+ return NULL;
+ out_unlock_2:
+ dput(d1);
+ d1 = d2;
+ out_unlock_1:
+ inode_unlock_shared(p1->d_inode);
+ return d1;
+ }
+
+ mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
+
+ if ((p = d_ancestor(p2, p1)) != NULL) {
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_CHILD);
+ } else if ((p = d_ancestor(p1, p2)) != NULL) {
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_CHILD);
+ } else {
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT2);
+ }
+retry2:
+ d1 = __lookup_hash(last1, p1, flags1, wq);
+ if (IS_ERR(d1))
+ goto unlock_out_3;
+ d2 = __lookup_hash(last2, p2, flags2, wq);
+ if (IS_ERR(d2))
+ goto unlock_out_4;
+
+ ok1 = d_lock_update(d1, p1, last1);
+ ok2 = d_lock_update(d2, p2, last2);
+ if (!ok1 || !ok2) {
+ if (ok1)
+ d_unlock_update(d1);
+ if (ok2)
+ d_unlock_update(d2);
+ dput(d1);
+ dput(d2);
+ goto retry2;
+ }
+ *d1p = d1;
+ *d2p = d2;
+ return p;
+unlock_out_4:
+ dput(d1);
+ d1 = d2;
+unlock_out_3:
+ inode_unlock_shared(p1->d_inode);
+ inode_unlock_shared(p2->d_inode);
+ mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+ return d1;
+}
+
+struct dentry *lock_rename_lookup_one(struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ const char *name1, int nlen1,
+ const char *name2, int nlen2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq)
+{
+ struct qstr this1, this2;
+ int err;
+
+ err = lookup_one_common(&init_user_ns, name1, p1, nlen1, &this1);
+ if (err)
+ return ERR_PTR(err);
+ err = lookup_one_common(&init_user_ns, name2, p2, nlen2, &this2);
+ if (err)
+ return ERR_PTR(err);
+ return lock_rename_lookup(p1, p2, d1p, d2p, &this1, &this2,
+ flags1, flags2, wq);
+}
+EXPORT_SYMBOL(lock_rename_lookup_one);
+
+void unlock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry *d1, struct dentry *d2)
+{
+ if (d1->d_flags & DCACHE_PAR_UPDATE) {
+ d_lookup_done(d1);
+ d_lookup_done(d2);
+ inode_unlock_shared(p1->d_inode);
+ if (p1 != p2) {
+ inode_unlock_shared(p2->d_inode);
+ mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+ }
+ } else
+ unlock_rename(p1, p2);
+ dput(d1);
+ dput(d2);
+}
+EXPORT_SYMBOL(unlock_rename_lookup);
+
/**
* vfs_create - create new file
* @mnt_userns: user namespace of the mount the inode was found from
@@ -4910,6 +5101,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
bool should_retry = false;
int error = -EINVAL;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
goto put_names;
@@ -4950,58 +5142,53 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
goto exit2;

retry_deleg:
- trap = lock_rename(new_path.dentry, old_path.dentry);
-
- old_dentry = __lookup_hash(&old_last, old_path.dentry,
- lookup_flags, NULL);
- error = PTR_ERR(old_dentry);
- if (IS_ERR(old_dentry))
+ trap = lock_rename_lookup(new_path.dentry, old_path.dentry,
+ &new_dentry, &old_dentry,
+ &new_last, &old_last,
+ lookup_flags | target_flags, lookup_flags,
+ &wq);
+ if (IS_ERR(trap))
goto exit3;
/* source must exist */
error = -ENOENT;
if (d_is_negative(old_dentry))
goto exit4;
- new_dentry = __lookup_hash(&new_last, new_path.dentry,
- lookup_flags | target_flags, NULL);
- error = PTR_ERR(new_dentry);
- if (IS_ERR(new_dentry))
- goto exit4;
error = -EEXIST;
if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
- goto exit5;
+ goto exit4;
if (flags & RENAME_EXCHANGE) {
error = -ENOENT;
if (d_is_negative(new_dentry))
- goto exit5;
+ goto exit4;

if (!d_is_dir(new_dentry)) {
error = -ENOTDIR;
if (new_last.name[new_last.len])
- goto exit5;
+ goto exit4;
}
}
/* unless the source is a directory trailing slashes give -ENOTDIR */
if (!d_is_dir(old_dentry)) {
error = -ENOTDIR;
if (old_last.name[old_last.len])
- goto exit5;
+ goto exit4;
if (!(flags & RENAME_EXCHANGE) && new_last.name[new_last.len])
- goto exit5;
+ goto exit4;
}
/* source should not be ancestor of target */
error = -EINVAL;
if (old_dentry == trap)
- goto exit5;
+ goto exit4;
/* target should not be an ancestor of source */
if (!(flags & RENAME_EXCHANGE))
error = -ENOTEMPTY;
if (new_dentry == trap)
- goto exit5;
+ goto exit4;

error = security_path_rename(&old_path, old_dentry,
&new_path, new_dentry, flags);
if (error)
- goto exit5;
+ goto exit4;

rd.old_dir = old_path.dentry->d_inode;
rd.old_dentry = old_dentry;
@@ -5012,12 +5199,9 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
rd.delegated_inode = &delegated_inode;
rd.flags = flags;
error = vfs_rename(&rd);
-exit5:
- dput(new_dentry);
exit4:
- dput(old_dentry);
+ unlock_rename_lookup(new_path.dentry, old_path.dentry, new_dentry, old_dentry);
exit3:
- unlock_rename(new_path.dentry, old_path.dentry);
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
if (!error)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 217aa6de9f25..b1ea89568de8 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -101,6 +101,15 @@ extern int follow_up(struct path *);

extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
+extern struct dentry *lock_rename_lookup_one(
+ struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ const char *name1, int nlen1,
+ const char *name2, int nlen2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq);
+extern void unlock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry *d1, struct dentry *d2);

extern int __must_check nd_jump_link(struct path *path);



2022-06-13 23:22:28

by NeilBrown

[permalink] [raw]
Subject: [PATCH 07/12] NFS: support parallel updates in the one directory.

NFS can easily support parallel updates as the locking is done on the
server, so this patch enables parallel updates for NFS.

NFS unlink needs to block concurrent opens() once it decides to actually
unlink the file, rather than rename it to .nfsXXXX (aka sillyrename).
It currently does this by temporarily unhashing the dentry and relying
on the exclusive lock on the directory to block a ->lookup(). That
doesn't work now that unlink uses a shared lock, so an alternate
approach is needed.

__nfs_lookup_revalidate (->d_revalidate) now blocks if DCACHE_PAR_UPDATE
is set, and if nfs_unlink() happens to be called with an exclusive lock
and DCACHE_PAR_UPDATE is not set, it get set during the potential race window.

I'd rather use some other indicator in the dentry to tell
_nfs_lookup_revalidate() to wait, but we are nearly out of d_flags bits,
and NFS doesn't have a general-purpose d_fsdata.

NFS "silly-rename" may now be called with only a shared lock on the
directory, so it needs a bit of extra care to get exclusive access to
the new name. d_lock_update_nested() and d_unlock_update() help here.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/dir.c | 29 +++++++++++++++++++++++------
fs/nfs/inode.c | 2 ++
fs/nfs/unlink.c | 5 ++++-
3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a8ecdd527662..54c2c7adcd56 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1778,6 +1778,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
int ret;

if (flags & LOOKUP_RCU) {
+ if (dentry->d_flags & DCACHE_PAR_UPDATE)
+ /* Pending unlink */
+ return -ECHILD;
parent = READ_ONCE(dentry->d_parent);
dir = d_inode_rcu(parent);
if (!dir)
@@ -1786,6 +1789,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
if (parent != READ_ONCE(dentry->d_parent))
return -ECHILD;
} else {
+ /* Wait for unlink to complete */
+ wait_var_event(&dentry->d_flags,
+ !(dentry->d_flags & DCACHE_PAR_UPDATE));
parent = dget_parent(dentry);
ret = reval(d_inode(parent), dentry, flags);
dput(parent);
@@ -2453,7 +2459,7 @@ static int nfs_safe_remove(struct dentry *dentry)
int nfs_unlink(struct inode *dir, struct dentry *dentry)
{
int error;
- int need_rehash = 0;
+ bool did_set_par_update = false;

dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
dir->i_ino, dentry);
@@ -2468,15 +2474,26 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
error = nfs_sillyrename(dir, dentry);
goto out;
}
- if (!d_unhashed(dentry)) {
- __d_drop(dentry);
- need_rehash = 1;
+ /* We must prevent any concurrent open until the unlink
+ * completes. ->d_revalidate will wait for DCACHE_PAR_UPDATE
+ * to clear, but if this happens to a non-parallel update, we
+ * still want to block opens. So set DCACHE_PAR_UPDATE
+ * temporarily.
+ */
+ if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
+ /* Must have exclusive lock on parent */
+ did_set_par_update = true;
+ dentry->d_flags |= DCACHE_PAR_UPDATE;
}
+
spin_unlock(&dentry->d_lock);
error = nfs_safe_remove(dentry);
nfs_dentry_remove_handle_error(dir, dentry, error);
- if (need_rehash)
- d_rehash(dentry);
+ if (did_set_par_update) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+ spin_unlock(&dentry->d_lock);
+ }
out:
trace_nfs_unlink_exit(dir, dentry, error);
return error;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b4e46b0ffa2d..cea2554710d2 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -481,6 +481,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)

/* We can't support update_atime(), since the server will reset it */
inode->i_flags |= S_NOATIME|S_NOCMTIME;
+ /* Parallel updates to directories are trivial */
+ inode->i_flags |= S_PAR_UPDATE;
inode->i_mode = fattr->mode;
nfsi->cache_validity = 0;
if ((fattr->valid & NFS_ATTR_FATTR_MODE) == 0
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 9697cd5d2561..52a20eb6131c 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -462,6 +462,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
sdentry = NULL;
do {
int slen;
+ d_unlock_update(sdentry);
dput(sdentry);
sillycounter++;
slen = scnprintf(silly, sizeof(silly),
@@ -479,7 +480,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
*/
if (IS_ERR(sdentry))
goto out;
- } while (d_inode(sdentry) != NULL); /* need negative lookup */
+ } while (!d_lock_update_nested(sdentry, NULL, NULL,
+ SINGLE_DEPTH_NESTING));

ihold(inode);

@@ -524,6 +526,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
rpc_put_task(task);
out_dput:
iput(inode);
+ d_unlock_update(sdentry);
dput(sdentry);
out:
return error;


2022-06-13 23:22:40

by NeilBrown

[permalink] [raw]
Subject: [PATCH 08/12] nfsd: allow parallel creates from nfsd

Rather than getting write access, locking the directory, and performing
a lookup, use
filename_create_one_len() for create, or
lookup_hash_update_len() for delete
which combines all these steps and handles shared locking for concurrent
updates where appropriate.

As we don't use fh_lock() we need to call fh_fill_pre_attrs()
explicitly. However if we only get a shared lock, then the pre/post
attributes won't be atomic, so we need to ensure that fh know that,
and doesn't try to encode wcc attrs either.

Note that there is only one filesystem that allows shared locks for
create/unlink and that is NFS (for re-export). NFS does support atomic
pre/post attributes, so there is no loss in not providing them.

When some other filesystem supports concurrent updates, we might need to
consider if the pre/post attributes are more important than the
parallelism.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfs3proc.c | 31 +++++-------
fs/nfsd/nfs4proc.c | 32 +++++-------
fs/nfsd/nfsfh.c | 7 ++-
fs/nfsd/nfsfh.h | 4 +-
fs/nfsd/nfsproc.c | 29 +++++------
fs/nfsd/vfs.c | 134 +++++++++++++++++++++++-----------------------------
6 files changed, 105 insertions(+), 132 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 981a3a7a6e16..0fdbb9504a87 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -231,12 +231,14 @@ static __be32
nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct svc_fh *resfhp, struct nfsd3_createargs *argp)
{
+ struct path path;
struct iattr *iap = &argp->attrs;
- struct dentry *parent, *child;
+ struct dentry *child;
__u32 v_mtime, v_atime;
struct inode *inode;
__be32 status;
int host_err;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (isdotent(argp->name, argp->len))
return nfserr_exist;
@@ -247,20 +249,15 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (status != nfs_ok)
return status;

- parent = fhp->fh_dentry;
- inode = d_inode(parent);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ inode = d_inode(path.dentry);

- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ child = filename_create_one_len(argp->name, argp->len,
+ &path, 0, &wq);

- fh_lock_nested(fhp, I_MUTEX_PARENT);
-
- child = lookup_one_len(argp->name, parent, argp->len);
- if (IS_ERR(child)) {
- status = nfserrno(PTR_ERR(child));
- goto out;
- }
+ if (IS_ERR(child))
+ return nfserrno(PTR_ERR(child));

if (d_really_is_negative(child)) {
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
@@ -311,6 +308,7 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,

if (!IS_POSIXACL(inode))
iap->ia_mode &= ~current_umask();
+ fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);

host_err = vfs_create(&init_user_ns, inode, child, iap->ia_mode, true);
if (host_err < 0) {
@@ -332,12 +330,9 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,

set_attr:
status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
-
+ fh_fill_post_attrs(fhp);
out:
- fh_unlock(fhp);
- if (child && !IS_ERR(child))
- dput(child);
- fh_drop_write(fhp);
+ done_path_update(&path, child, &wq);
return status;
}

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 3895eb52d2b1..71a4b8ef77f0 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -285,12 +285,13 @@ static __be32
nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct svc_fh *resfhp, struct nfsd4_open *open)
{
+ struct path path;
struct iattr *iap = &open->op_iattr;
- struct dentry *parent, *child;
+ struct dentry *child;
__u32 v_mtime, v_atime;
struct inode *inode;
__be32 status;
- int host_err;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (isdotent(open->op_fname, open->op_fnamelen))
return nfserr_exist;
@@ -300,20 +301,17 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_EXEC);
if (status != nfs_ok)
return status;
- parent = fhp->fh_dentry;
- inode = d_inode(parent);

- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ inode = d_inode(path.dentry);

- fh_lock_nested(fhp, I_MUTEX_PARENT);
+ child = filename_create_one_len(open->op_fname,
+ open->op_fnamelen,
+ &path, 0, &wq);

- child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
- if (IS_ERR(child)) {
- status = nfserrno(PTR_ERR(child));
- goto out;
- }
+ if (IS_ERR(child))
+ return nfserrno(PTR_ERR(child));

if (d_really_is_negative(child)) {
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
@@ -386,6 +384,7 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (!IS_POSIXACL(inode))
iap->ia_mode &= ~current_umask();

+ fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);
status = nfsd4_vfs_create(fhp, child, open);
if (status != nfs_ok)
goto out;
@@ -405,12 +404,9 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,

set_attr:
status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
-
+ fh_fill_post_attrs(fhp);
out:
- fh_unlock(fhp);
- if (child && !IS_ERR(child))
- dput(child);
- fh_drop_write(fhp);
+ done_path_update(&path, child, &wq);
return status;
}

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c29baa03dfaf..a50db688c60d 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -616,7 +616,7 @@ fh_update(struct svc_fh *fhp)
* @fhp: file handle to be updated
*
*/
-void fh_fill_pre_attrs(struct svc_fh *fhp)
+void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic)
{
bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
struct inode *inode;
@@ -626,6 +626,11 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
if (fhp->fh_no_wcc || fhp->fh_pre_saved)
return;

+ if (!atomic) {
+ fhp->fh_no_atomic_attr = true;
+ fhp->fh_no_wcc = true;
+ }
+
inode = d_inode(fhp->fh_dentry);
err = fh_getattr(fhp, &stat);
if (err) {
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index fb9d358a267e..ecc57fd3fd67 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -320,7 +320,7 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
return time_to_chattr(&stat->ctime);
}

-extern void fh_fill_pre_attrs(struct svc_fh *fhp);
+extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
extern void fh_fill_post_attrs(struct svc_fh *fhp);


@@ -347,7 +347,7 @@ fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)

inode = d_inode(dentry);
inode_lock_nested(inode, subclass);
- fh_fill_pre_attrs(fhp);
+ fh_fill_pre_attrs(fhp, true);
fhp->fh_locked = true;
}

diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index fcdab8a8a41f..2dccf77634e8 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -255,6 +255,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
static __be32
nfsd_proc_create(struct svc_rqst *rqstp)
{
+ struct path path;
struct nfsd_createargs *argp = rqstp->rq_argp;
struct nfsd_diropres *resp = rqstp->rq_resp;
svc_fh *dirfhp = &argp->fh;
@@ -263,8 +264,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
struct inode *inode;
struct dentry *dchild;
int type, mode;
- int hosterr;
dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

dprintk("nfsd: CREATE %s %.*s\n",
SVCFH_fmt(dirfhp), argp->len, argp->name);
@@ -279,17 +280,13 @@ nfsd_proc_create(struct svc_rqst *rqstp)
resp->status = nfserr_exist;
if (isdotent(argp->name, argp->len))
goto done;
- hosterr = fh_want_write(dirfhp);
- if (hosterr) {
- resp->status = nfserrno(hosterr);
- goto done;
- }

- fh_lock_nested(dirfhp, I_MUTEX_PARENT);
- dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
+ path.dentry = dirfhp->fh_dentry;
+ path.mnt = dirfhp->fh_export->ex_path.mnt;
+ dchild = filename_create_one_len(argp->name, argp->len, &path, 0, &wq);
if (IS_ERR(dchild)) {
resp->status = nfserrno(PTR_ERR(dchild));
- goto out_unlock;
+ goto out_done;
}
fh_init(newfhp, NFS_FHSIZE);
resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
@@ -298,7 +295,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
dput(dchild);
if (resp->status) {
if (resp->status != nfserr_noent)
- goto out_unlock;
+ goto out_done;
/*
* If the new file handle wasn't verified, we can't tell
* whether the file exists or not. Time to bail ...
@@ -307,7 +304,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
if (!newfhp->fh_dentry) {
printk(KERN_WARNING
"nfsd_proc_create: file handle not verified\n");
- goto out_unlock;
+ goto out_done;
}
}

@@ -341,7 +338,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
newfhp->fh_dentry,
NFSD_MAY_WRITE|NFSD_MAY_LOCAL_ACCESS);
if (resp->status && resp->status != nfserr_rofs)
- goto out_unlock;
+ goto out_done;
}
} else
type = S_IFREG;
@@ -378,7 +375,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
/* Make sure the type and device matches */
resp->status = nfserr_exist;
if (inode && inode_wrong_type(inode, type))
- goto out_unlock;
+ goto out_done;
}

resp->status = nfs_ok;
@@ -400,10 +397,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
(time64_t)0);
}

-out_unlock:
- /* We don't really need to unlock, as fh_put does it. */
- fh_unlock(dirfhp);
- fh_drop_write(dirfhp);
+out_done:
+ done_path_update(&path, dchild, &wq);
done:
fh_put(dirfhp);
if (resp->status != nfs_ok)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 840e3af63a6f..6cdd5e407600 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1274,12 +1274,6 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
dirp = d_inode(dentry);

dchild = dget(resfhp->fh_dentry);
- if (!fhp->fh_locked) {
- WARN_ONCE(1, "nfsd_create: parent %pd2 not locked!\n",
- dentry);
- err = nfserr_io;
- goto out;
- }

err = nfsd_permission(rqstp, fhp->fh_export, dentry, NFSD_MAY_CREATE);
if (err)
@@ -1362,9 +1356,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
char *fname, int flen, struct iattr *iap,
int type, dev_t rdev, struct svc_fh *resfhp)
{
- struct dentry *dentry, *dchild = NULL;
+ struct path path;
+ struct dentry *dchild = NULL;
__be32 err;
int host_err;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (isdotent(fname, flen))
return nfserr_exist;
@@ -1373,27 +1369,22 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (err)
return err;

- dentry = fhp->fh_dentry;
-
- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;

- fh_lock_nested(fhp, I_MUTEX_PARENT);
- dchild = lookup_one_len(fname, dentry, flen);
+ dchild = filename_create_one_len(fname, flen, &path, 0, &wq);
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild))
return nfserrno(host_err);
err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
- /*
- * We unconditionally drop our ref to dchild as fh_compose will have
- * already grabbed its own ref for it.
- */
- dput(dchild);
- if (err)
- return err;
- return nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
- rdev, resfhp);
+ if (!err) {
+ fh_fill_pre_attrs(fhp, (dchild->d_flags & DCACHE_PAR_UPDATE) == 0);
+ err = nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
+ rdev, resfhp);
+ fh_fill_post_attrs(fhp);
+ }
+ done_path_update(&path, dchild, &wq);
+ return err;
}

/*
@@ -1441,15 +1432,17 @@ nfsd_readlink(struct svc_rqst *rqstp, struct svc_fh *fhp, char *buf, int *lenp)
__be32
nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
char *fname, int flen,
- char *path,
+ char *lpath,
struct svc_fh *resfhp)
{
- struct dentry *dentry, *dnew;
+ struct path path;
+ struct dentry *dnew;
__be32 err, cerr;
int host_err;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

err = nfserr_noent;
- if (!flen || path[0] == '\0')
+ if (!flen || lpath[0] == '\0')
goto out;
err = nfserr_exist;
if (isdotent(fname, flen))
@@ -1459,28 +1452,28 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (err)
goto out;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;

- fh_lock(fhp);
- dentry = fhp->fh_dentry;
- dnew = lookup_one_len(fname, dentry, flen);
+ dnew = filename_create_one_len(fname, flen, &path, 0, &wq);
host_err = PTR_ERR(dnew);
if (IS_ERR(dnew))
goto out_nfserr;

- host_err = vfs_symlink(&init_user_ns, d_inode(dentry), dnew, path);
+ fh_fill_pre_attrs(fhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
+ host_err = vfs_symlink(mnt_user_ns(path.mnt), d_inode(path.dentry),
+ dnew, lpath);
err = nfserrno(host_err);
- fh_unlock(fhp);
if (!err)
err = nfserrno(commit_metadata(fhp));

- fh_drop_write(fhp);
+ fh_fill_post_attrs(fhp);

cerr = fh_compose(resfhp, fhp->fh_export, dnew, fhp);
- dput(dnew);
- if (err==0) err = cerr;
+ if (err==0)
+ err = cerr;
+
+ done_path_update(&path, dnew, &wq);
out:
return err;

@@ -1497,10 +1490,12 @@ __be32
nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
char *name, int len, struct svc_fh *tfhp)
{
- struct dentry *ddir, *dnew, *dold;
+ struct path path;
+ struct dentry *dold, *dnew;
struct inode *dirp;
__be32 err;
int host_err;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_CREATE);
if (err)
@@ -1518,17 +1513,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (isdotent(name, len))
goto out;

- host_err = fh_want_write(tfhp);
- if (host_err) {
- err = nfserrno(host_err);
- goto out;
- }
-
- fh_lock_nested(ffhp, I_MUTEX_PARENT);
- ddir = ffhp->fh_dentry;
- dirp = d_inode(ddir);
+ path.dentry = ffhp->fh_dentry;
+ path.mnt = ffhp->fh_export->ex_path.mnt;
+ dirp = d_inode(path.dentry);

- dnew = lookup_one_len(name, ddir, len);
+ dnew = filename_create_one_len(name, len, &path, 0, &wq);
host_err = PTR_ERR(dnew);
if (IS_ERR(dnew))
goto out_nfserr;
@@ -1537,9 +1526,10 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,

err = nfserr_noent;
if (d_really_is_negative(dold))
- goto out_dput;
+ goto out_done;
+ fh_fill_pre_attrs(ffhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
host_err = vfs_link(dold, &init_user_ns, dirp, dnew, NULL);
- fh_unlock(ffhp);
+
if (!host_err) {
err = nfserrno(commit_metadata(ffhp));
if (!err)
@@ -1550,17 +1540,15 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
else
err = nfserrno(host_err);
}
-out_dput:
- dput(dnew);
-out_unlock:
- fh_unlock(ffhp);
- fh_drop_write(tfhp);
+out_done:
+ fh_fill_post_attrs(ffhp);
+ done_path_update(&path, dnew, &wq);
out:
return err;

out_nfserr:
err = nfserrno(host_err);
- goto out_unlock;
+ goto out;
}

static void
@@ -1625,8 +1613,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
* so do it by hand */
trap = lock_rename(tdentry, fdentry);
ffhp->fh_locked = tfhp->fh_locked = true;
- fh_fill_pre_attrs(ffhp);
- fh_fill_pre_attrs(tfhp);
+ fh_fill_pre_attrs(ffhp, true);
+ fh_fill_pre_attrs(tfhp, true);

odentry = lookup_one_len(fname, fdentry, flen);
host_err = PTR_ERR(odentry);
@@ -1717,11 +1705,13 @@ __be32
nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
char *fname, int flen)
{
- struct dentry *dentry, *rdentry;
+ struct dentry *rdentry;
struct inode *dirp;
struct inode *rinode;
__be32 err;
int host_err;
+ struct path path;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

err = nfserr_acces;
if (!flen || isdotent(fname, flen))
@@ -1730,24 +1720,18 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (err)
goto out;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ path.dentry = fhp->fh_dentry;

- fh_lock_nested(fhp, I_MUTEX_PARENT);
- dentry = fhp->fh_dentry;
- dirp = d_inode(dentry);
+ rdentry = lookup_hash_update_len(fname, flen, &path, 0, &wq);
+ dirp = d_inode(path.dentry);

- rdentry = lookup_one_len(fname, dentry, flen);
host_err = PTR_ERR(rdentry);
if (IS_ERR(rdentry))
- goto out_drop_write;
+ goto out_nfserr;
+
+ fh_fill_pre_attrs(fhp, (rdentry->d_flags & DCACHE_PAR_UPDATE) == 0);

- if (d_really_is_negative(rdentry)) {
- dput(rdentry);
- host_err = -ENOENT;
- goto out_drop_write;
- }
rinode = d_inode(rdentry);
ihold(rinode);

@@ -1761,15 +1745,13 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
} else {
host_err = vfs_rmdir(&init_user_ns, dirp, rdentry);
}
+ fh_fill_post_attrs(fhp);

- fh_unlock(fhp);
+ done_path_update(&path, rdentry, &wq);
if (!host_err)
host_err = commit_metadata(fhp);
- dput(rdentry);
iput(rinode); /* truncate the inode here */

-out_drop_write:
- fh_drop_write(fhp);
out_nfserr:
if (host_err == -EBUSY) {
/* name is mounted-on. There is no perfect


2022-06-13 23:23:39

by NeilBrown

[permalink] [raw]
Subject: [PATCH 11/12] nfsd: use (un)lock_inode instead of fh_(un)lock

Except for the xattr changes, these callers don't need to save pre/post
attrs, so use a simple lock/unlock.

For the xattr changes, make the saving of pre/post explicit rather than
requiring a comment.

Also many fh_unlock() are not needed.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfs2acl.c | 6 +++---
fs/nfsd/nfs3acl.c | 4 ++--
fs/nfsd/nfs3proc.c | 4 ----
fs/nfsd/nfs4acl.c | 7 +++----
fs/nfsd/nfs4proc.c | 2 --
fs/nfsd/nfs4state.c | 8 ++++----
fs/nfsd/nfsfh.c | 1 -
fs/nfsd/nfsfh.h | 8 --------
fs/nfsd/vfs.c | 26 ++++++++++++--------------
9 files changed, 24 insertions(+), 42 deletions(-)

diff --git a/fs/nfsd/nfs2acl.c b/fs/nfsd/nfs2acl.c
index b5760801d377..9edd3c1a30fb 100644
--- a/fs/nfsd/nfs2acl.c
+++ b/fs/nfsd/nfs2acl.c
@@ -111,7 +111,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
if (error)
goto out_errno;

- fh_lock(fh);
+ inode_lock(inode);

error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS,
argp->acl_access);
@@ -122,7 +122,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
if (error)
goto out_drop_lock;

- fh_unlock(fh);
+ inode_unlock(inode);

fh_drop_write(fh);

@@ -136,7 +136,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
return rpc_success;

out_drop_lock:
- fh_unlock(fh);
+ inode_unlock(inode);
fh_drop_write(fh);
out_errno:
resp->status = nfserrno(error);
diff --git a/fs/nfsd/nfs3acl.c b/fs/nfsd/nfs3acl.c
index 35b2ebda14da..9446c6743664 100644
--- a/fs/nfsd/nfs3acl.c
+++ b/fs/nfsd/nfs3acl.c
@@ -101,7 +101,7 @@ static __be32 nfsd3_proc_setacl(struct svc_rqst *rqstp)
if (error)
goto out_errno;

- fh_lock(fh);
+ inode_lock(inode);

error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS,
argp->acl_access);
@@ -111,7 +111,7 @@ static __be32 nfsd3_proc_setacl(struct svc_rqst *rqstp)
argp->acl_default);

out_drop_lock:
- fh_unlock(fh);
+ inode_unlock(inode);
fh_drop_write(fh);
out_errno:
resp->status = nfserrno(error);
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index d85b110d58dd..7bb07c7e6ee8 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -374,7 +374,6 @@ nfsd3_proc_mkdir(struct svc_rqst *rqstp)
fh_init(&resp->fh, NFS3_FHSIZE);
resp->status = nfsd_create(rqstp, &resp->dirfh, argp->name, argp->len,
&argp->attrs, S_IFDIR, 0, &resp->fh);
- fh_unlock(&resp->dirfh);
return rpc_success;
}

@@ -449,7 +448,6 @@ nfsd3_proc_mknod(struct svc_rqst *rqstp)
type = nfs3_ftypes[argp->ftype];
resp->status = nfsd_create(rqstp, &resp->dirfh, argp->name, argp->len,
&argp->attrs, type, rdev, &resp->fh);
- fh_unlock(&resp->dirfh);
out:
return rpc_success;
}
@@ -472,7 +470,6 @@ nfsd3_proc_remove(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_unlink(rqstp, &resp->fh, -S_IFDIR,
argp->name, argp->len);
- fh_unlock(&resp->fh);
return rpc_success;
}

@@ -493,7 +490,6 @@ nfsd3_proc_rmdir(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_unlink(rqstp, &resp->fh, S_IFDIR,
argp->name, argp->len);
- fh_unlock(&resp->fh);
return rpc_success;
}

diff --git a/fs/nfsd/nfs4acl.c b/fs/nfsd/nfs4acl.c
index eaa3a0cf38f1..7bcc6dc0f762 100644
--- a/fs/nfsd/nfs4acl.c
+++ b/fs/nfsd/nfs4acl.c
@@ -779,19 +779,18 @@ nfsd4_set_nfs4_acl(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (host_error < 0)
goto out_nfserr;

- fh_lock(fhp);
+ inode_lock(inode);

host_error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS, pacl);
if (host_error < 0)
goto out_drop_lock;

- if (S_ISDIR(inode->i_mode)) {
+ if (S_ISDIR(inode->i_mode))
host_error = set_posix_acl(&init_user_ns, inode,
ACL_TYPE_DEFAULT, dpacl);
- }

out_drop_lock:
- fh_unlock(fhp);
+ inode_unlock(inode);

posix_acl_release(pacl);
posix_acl_release(dpacl);
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 79434e29b63f..d6defaf5a77a 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -860,7 +860,6 @@ nfsd4_create(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
do_set_nfs4_acl(rqstp, &resfh, create->cr_acl,
create->cr_bmval);

- fh_unlock(&cstate->current_fh);
set_change_info(&create->cr_cinfo, &cstate->current_fh);
fh_dup2(&cstate->current_fh, &resfh);
out:
@@ -1040,7 +1039,6 @@ nfsd4_remove(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfsd_unlink(rqstp, &cstate->current_fh, 0,
remove->rm_name, remove->rm_namelen);
if (!status) {
- fh_unlock(&cstate->current_fh);
set_change_info(&remove->rm_cinfo, &cstate->current_fh);
}
return status;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 9409a0dc1b76..cb664c61b546 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7321,21 +7321,21 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
static __be32 nfsd_test_lock(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file_lock *lock)
{
struct nfsd_file *nf;
+ struct inode *inode = fhp->fh_dentry->d_inode;
__be32 err;

err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
if (err)
return err;
- fh_lock(fhp); /* to block new leases till after test_lock: */
- err = nfserrno(nfsd_open_break_lease(fhp->fh_dentry->d_inode,
- NFSD_MAY_READ));
+ inode_lock(inode); /* to block new leases till after test_lock: */
+ err = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
if (err)
goto out;
lock->fl_file = nf->nf_file;
err = nfserrno(vfs_test_lock(nf->nf_file, lock));
lock->fl_file = NULL;
out:
- fh_unlock(fhp);
+ inode_unlock(inode);
nfsd_file_put(nf);
return err;
}
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index a50db688c60d..ae270e4f921f 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -685,7 +685,6 @@ fh_put(struct svc_fh *fhp)
struct dentry * dentry = fhp->fh_dentry;
struct svc_export * exp = fhp->fh_export;
if (dentry) {
- fh_unlock(fhp);
fhp->fh_dentry = NULL;
dput(dentry);
fh_clear_pre_post_attrs(fhp);
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index ecc57fd3fd67..c5061cdb1016 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -323,14 +323,6 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
extern void fh_fill_post_attrs(struct svc_fh *fhp);

-
-/*
- * Lock a file handle/inode
- * NOTE: both fh_lock and fh_unlock are done "by hand" in
- * vfs.c:nfsd_rename as it needs to grab 2 i_mutex's at once
- * so, any changes here should be reflected there.
- */
-
static inline void
fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
{
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 4c2e431100ba..f2f4868618bb 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -253,7 +253,6 @@ nfsd_lookup_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned. Otherwise the covered directory is returned.
* NOTE: this mountpoint crossing is not supported properly by all
* clients and is explicitly disallowed for NFSv3
- * NeilBrown <[email protected]>
*/
__be32
nfsd_lookup(struct svc_rqst *rqstp, struct svc_fh *fhp, const char *name,
@@ -434,7 +433,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
return err;
}

- fh_lock(fhp);
+ inode_lock(inode);
if (size_change) {
/*
* RFC5661, Section 18.30.4:
@@ -470,7 +469,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
host_err = notify_change(&init_user_ns, dentry, iap, NULL);

out_unlock:
- fh_unlock(fhp);
+ inode_unlock(inode);
if (size_change)
put_write_access(inode);
out:
@@ -2123,12 +2122,8 @@ nfsd_listxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char **bufp,
}

/*
- * Removexattr and setxattr need to call fh_lock to both lock the inode
- * and set the change attribute. Since the top-level vfs_removexattr
- * and vfs_setxattr calls already do their own inode_lock calls, call
- * the _locked variant. Pass in a NULL pointer for delegated_inode,
- * and let the client deal with NFS4ERR_DELAY (same as with e.g.
- * setattr and remove).
+ * Pass in a NULL pointer for delegated_inode, and let the client deal
+ * with NFS4ERR_DELAY (same as with e.g. setattr and remove).
*/
__be32
nfsd_removexattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name)
@@ -2144,12 +2139,14 @@ nfsd_removexattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name)
if (ret)
return nfserrno(ret);

- fh_lock(fhp);
+ inode_lock(fhp->fh_dentry->d_inode);
+ fh_fill_pre_attrs(fhp, true);

ret = __vfs_removexattr_locked(&init_user_ns, fhp->fh_dentry,
name, NULL);

- fh_unlock(fhp);
+ fh_fill_post_attrs(fhp);
+ inode_unlock(fhp->fh_dentry->d_inode);
fh_drop_write(fhp);

return nfsd_xattr_errno(ret);
@@ -2169,12 +2166,13 @@ nfsd_setxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
ret = fh_want_write(fhp);
if (ret)
return nfserrno(ret);
- fh_lock(fhp);
+ inode_lock(fhp->fh_dentry->d_inode);
+ fh_fill_pre_attrs(fhp, true);

ret = __vfs_setxattr_locked(&init_user_ns, fhp->fh_dentry, name, buf,
len, flags, NULL);
-
- fh_unlock(fhp);
+ fh_fill_post_attrs(fhp);
+ inode_unlock(fhp->fh_dentry->d_inode);
fh_drop_write(fhp);

return nfsd_xattr_errno(ret);


2022-06-13 23:23:41

by NeilBrown

[permalink] [raw]
Subject: [PATCH 05/12] VFS: export done_path_update()

This function will be be useful to nfsd, so export it.

done_path_create_wq() is now simple enough to be "static inline"
rather than an explicit export.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 11 ++---------
include/linux/namei.h | 10 +++++++++-
2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f13bff877e30..8ce7aa16b704 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1719,7 +1719,7 @@ struct dentry *lookup_hash_update_len(const char *name, int nlen,
}
EXPORT_SYMBOL(lookup_hash_update_len);

-static void done_path_update(struct path *path, struct dentry *dentry,
+void done_path_update(struct path *path, struct dentry *dentry,
wait_queue_head_t *wq)
{
struct inode *dir = path->dentry->d_inode;
@@ -1735,6 +1735,7 @@ static void done_path_update(struct path *path, struct dentry *dentry,
dput(dentry);
mnt_drop_write(path->mnt);
}
+EXPORT_SYMBOL(done_path_update);

static struct dentry *lookup_fast(struct nameidata *nd,
struct inode **inode,
@@ -3951,14 +3952,6 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
}
EXPORT_SYMBOL(kern_path_create);

-void done_path_create_wq(struct path *path, struct dentry *dentry,
- wait_queue_head_t *wq)
-{
- done_path_update(path, dentry, wq);
- path_put(path);
-}
-EXPORT_SYMBOL(done_path_create_wq);
-
inline struct dentry *user_path_create(int dfd, const char __user *pathname,
struct path *path, unsigned int lookup_flags)
{
diff --git a/include/linux/namei.h b/include/linux/namei.h
index f75c6639dd1a..217aa6de9f25 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -64,11 +64,19 @@ extern struct dentry *user_path_create(int, const char __user *, struct path *,
extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
struct path *path, unsigned int flags,
wait_queue_head_t *wq);
-extern void done_path_create_wq(struct path *, struct dentry *, wait_queue_head_t *wq);
+extern void done_path_update(struct path *, struct dentry *, wait_queue_head_t *);
+static inline void done_path_create_wq(struct path *path, struct dentry *dentry,
+ wait_queue_head_t *wq)
+{
+ done_path_update(path, dentry, wq);
+ path_put(path);
+}
+
static inline void done_path_create(struct path *path, struct dentry *dentry)
{
done_path_create_wq(path, dentry, NULL);
}
+
extern struct dentry *kern_path_locked(const char *, struct path *);

extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);


2022-06-13 23:23:42

by NeilBrown

[permalink] [raw]
Subject: [PATCH 09/12] nfsd: support concurrent renames.

If the filesystem supports it, renames can now be concurrent with other
updates.
We use lock_rename_lookup_one() to do the appropriate locking in the
right order and to look up the names.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/vfs.c | 49 +++++++++++++++++++------------------------------
1 file changed, 19 insertions(+), 30 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6cdd5e407600..b0df216ab3e4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1584,6 +1584,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
__be32 err;
int host_err;
bool close_cached = false;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_REMOVE);
if (err)
@@ -1611,41 +1612,37 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,

/* cannot use fh_lock as we need deadlock protective ordering
* so do it by hand */
- trap = lock_rename(tdentry, fdentry);
- ffhp->fh_locked = tfhp->fh_locked = true;
- fh_fill_pre_attrs(ffhp, true);
- fh_fill_pre_attrs(tfhp, true);
-
- odentry = lookup_one_len(fname, fdentry, flen);
- host_err = PTR_ERR(odentry);
- if (IS_ERR(odentry))
+ trap = lock_rename_lookup_one(tdentry, fdentry, &ndentry, &odentry,
+ tname, tlen, fname, flen, 0, 0, &wq);
+ host_err = PTR_ERR(trap);
+ if (IS_ERR(trap))
goto out_nfserr;
+ ffhp->fh_locked = tfhp->fh_locked = true;
+ fh_fill_pre_attrs(ffhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
+ fh_fill_pre_attrs(tfhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);

host_err = -ENOENT;
if (d_really_is_negative(odentry))
- goto out_dput_old;
+ goto out_unlock;
host_err = -EINVAL;
if (odentry == trap)
- goto out_dput_old;
+ goto out_unlock;

- ndentry = lookup_one_len(tname, tdentry, tlen);
- host_err = PTR_ERR(ndentry);
- if (IS_ERR(ndentry))
- goto out_dput_old;
host_err = -ENOTEMPTY;
if (ndentry == trap)
- goto out_dput_new;
+ goto out_unlock;

host_err = -EXDEV;
if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
- goto out_dput_new;
+ goto out_unlock;
if (ffhp->fh_export->ex_path.dentry != tfhp->fh_export->ex_path.dentry)
- goto out_dput_new;
+ goto out_unlock;

if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
nfsd_has_cached_files(ndentry)) {
close_cached = true;
- goto out_dput_old;
+ dget(ndentry);
+ goto out_unlock;
} else {
struct renamedata rd = {
.old_mnt_userns = &init_user_ns,
@@ -1662,23 +1659,15 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = commit_metadata(ffhp);
}
}
- out_dput_new:
- dput(ndentry);
- out_dput_old:
- dput(odentry);
- out_nfserr:
- err = nfserrno(host_err);
- /*
- * We cannot rely on fh_unlock on the two filehandles,
- * as that would do the wrong thing if the two directories
- * were the same, so again we do it by hand.
- */
if (!close_cached) {
fh_fill_post_attrs(ffhp);
fh_fill_post_attrs(tfhp);
}
- unlock_rename(tdentry, fdentry);
+ out_unlock:
+ unlock_rename_lookup(tdentry, fdentry, ndentry, odentry);
ffhp->fh_locked = tfhp->fh_locked = false;
+ out_nfserr:
+ err = nfserrno(host_err);
fh_drop_write(ffhp);

/*


2022-06-13 23:24:14

by NeilBrown

[permalink] [raw]
Subject: [PATCH 12/12] nfsd: discard fh_locked flag and fh_lock/fh_unlock

fh_lock() and fh_unlock() are no longer used, so discard them.
They are the only real users of ->fh_locked, so discard that too.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfsfh.c | 2 +-
fs/nfsd/nfsfh.h | 48 ++++--------------------------------------------
fs/nfsd/vfs.c | 4 ----
3 files changed, 5 insertions(+), 49 deletions(-)

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index ae270e4f921f..a3dbe9f34c0e 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -548,7 +548,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
if (ref_fh == fhp)
fh_put(ref_fh);

- if (fhp->fh_locked || fhp->fh_dentry) {
+ if (fhp->fh_dentry) {
printk(KERN_ERR "fh_compose: fh %pd2 not initialized!\n",
dentry);
}
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index c5061cdb1016..559912b1d794 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -81,7 +81,6 @@ typedef struct svc_fh {
struct dentry * fh_dentry; /* validated dentry */
struct svc_export * fh_export; /* export pointer */

- bool fh_locked; /* inode locked by us */
bool fh_want_write; /* remount protection taken */
bool fh_no_wcc; /* no wcc data needed */
bool fh_no_atomic_attr;
@@ -93,7 +92,7 @@ typedef struct svc_fh {
bool fh_post_saved; /* post-op attrs saved */
bool fh_pre_saved; /* pre-op attrs saved */

- /* Pre-op attributes saved during fh_lock */
+ /* Pre-op attributes saved when inode exclusively locked */
__u64 fh_pre_size; /* size before operation */
struct timespec64 fh_pre_mtime; /* mtime before oper */
struct timespec64 fh_pre_ctime; /* ctime before oper */
@@ -103,7 +102,7 @@ typedef struct svc_fh {
*/
u64 fh_pre_change;

- /* Post-op attributes saved in fh_unlock */
+ /* Post-op attributes saved in fh_fill_post_attrs() */
struct kstat fh_post_attr; /* full attrs after operation */
u64 fh_post_change; /* nfsv4 change; see above */
} svc_fh;
@@ -223,8 +222,8 @@ void fh_put(struct svc_fh *);
static __inline__ struct svc_fh *
fh_copy(struct svc_fh *dst, struct svc_fh *src)
{
- WARN_ON(src->fh_dentry || src->fh_locked);
-
+ WARN_ON(src->fh_dentry);
+
*dst = *src;
return dst;
}
@@ -323,43 +322,4 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
extern void fh_fill_post_attrs(struct svc_fh *fhp);

-static inline void
-fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
-{
- struct dentry *dentry = fhp->fh_dentry;
- struct inode *inode;
-
- BUG_ON(!dentry);
-
- if (fhp->fh_locked) {
- printk(KERN_WARNING "fh_lock: %pd2 already locked!\n",
- dentry);
- return;
- }
-
- inode = d_inode(dentry);
- inode_lock_nested(inode, subclass);
- fh_fill_pre_attrs(fhp, true);
- fhp->fh_locked = true;
-}
-
-static inline void
-fh_lock(struct svc_fh *fhp)
-{
- fh_lock_nested(fhp, I_MUTEX_NORMAL);
-}
-
-/*
- * Unlock a file handle/inode
- */
-static inline void
-fh_unlock(struct svc_fh *fhp)
-{
- if (fhp->fh_locked) {
- fh_fill_post_attrs(fhp);
- inode_unlock(d_inode(fhp->fh_dentry));
- fhp->fh_locked = false;
- }
-}
-
#endif /* _LINUX_NFSD_NFSFH_H */
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f2f4868618bb..0e07b19a0289 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1623,14 +1623,11 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
goto out;
}

- /* cannot use fh_lock as we need deadlock protective ordering
- * so do it by hand */
trap = lock_rename_lookup_one(tdentry, fdentry, &ndentry, &odentry,
tname, tlen, fname, flen, 0, 0, &wq);
host_err = PTR_ERR(trap);
if (IS_ERR(trap))
goto out_nfserr;
- ffhp->fh_locked = tfhp->fh_locked = true;
fh_fill_pre_attrs(ffhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
fh_fill_pre_attrs(tfhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);

@@ -1678,7 +1675,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
}
out_unlock:
unlock_rename_lookup(tdentry, fdentry, ndentry, odentry);
- ffhp->fh_locked = tfhp->fh_locked = false;
out_nfserr:
err = nfserrno(host_err);
fh_drop_write(ffhp);


2022-06-14 04:40:36

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 06/12] VFS: support concurrent renames.

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19-rc2 next-20220610]
[cannot apply to trondmy-nfs/linux-next viro-vfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-concurrent-directory-updates/20220614-072355
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220614/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review NeilBrown/Allow-concurrent-directory-updates/20220614-072355
git checkout 46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> fs/namei.c:3175:16: warning: no previous prototype for 'lock_rename_lookup_excl' [-Wmissing-prototypes]
3175 | struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
| ^~~~~~~~~~~~~~~~~~~~~~~


vim +/lock_rename_lookup_excl +3175 fs/namei.c

3174
> 3175 struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
3176 struct dentry **d1p, struct dentry **d2p,
3177 struct qstr *last1, struct qstr *last2,
3178 unsigned int flags1, unsigned int flags2)
3179 {
3180 struct dentry *p;
3181 struct dentry *d1, *d2;
3182
3183 if (p1 == p2) {
3184 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3185 d1 = __lookup_hash(last1, p1, flags1, NULL);
3186 if (IS_ERR(d1))
3187 goto out_unlock_1;
3188 d2 = __lookup_hash(last2, p2, flags2, NULL);
3189 if (IS_ERR(d2))
3190 goto out_unlock_2;
3191 *d1p = d1; *d2p = d2;
3192 return NULL;
3193 out_unlock_2:
3194 dput(d1);
3195 d1 = d2;
3196 out_unlock_1:
3197 inode_unlock(p1->d_inode);
3198 return d1;
3199 }
3200
3201 mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
3202
3203 if ((p = d_ancestor(p2, p1)) != NULL) {
3204 inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
3205 inode_lock_nested(p1->d_inode, I_MUTEX_CHILD);
3206 } else if ((p = d_ancestor(p1, p2)) != NULL) {
3207 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3208 inode_lock_nested(p2->d_inode, I_MUTEX_CHILD);
3209 } else {
3210 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3211 inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
3212 }
3213 d1 = __lookup_hash(last1, p1, flags1, NULL);
3214 if (IS_ERR(d1))
3215 goto unlock_out_3;
3216 d2 = __lookup_hash(last2, p2, flags2, NULL);
3217 if (IS_ERR(d2))
3218 goto unlock_out_4;
3219
3220 *d1p = d1;
3221 *d2p = d2;
3222 return p;
3223 unlock_out_4:
3224 dput(d1);
3225 d1 = d2;
3226 unlock_out_3:
3227 inode_unlock(p1->d_inode);
3228 inode_unlock(p2->d_inode);
3229 mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
3230 return d1;
3231 }
3232

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-06-14 12:42:57

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 06/12] VFS: support concurrent renames.

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19-rc2 next-20220614]
[cannot apply to trondmy-nfs/linux-next viro-vfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-concurrent-directory-updates/20220614-072355
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
config: i386-randconfig-s001-20220613 (https://download.01.org/0day-ci/archive/20220614/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-30-g92122700-dirty
# https://github.com/intel-lab-lkp/linux/commit/46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review NeilBrown/Allow-concurrent-directory-updates/20220614-072355
git checkout 46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)
>> fs/namei.c:3175:15: sparse: sparse: symbol 'lock_rename_lookup_excl' was not declared. Should it be static?
fs/namei.c: note: in included file (through include/linux/rbtree.h, include/linux/mm_types.h, include/linux/mmzone.h, ...):
include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'terminate_walk' - unexpected unlock
include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'try_to_unlazy' - unexpected unlock
include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'try_to_unlazy_next' - unexpected unlock
fs/namei.c:2492:19: sparse: sparse: context imbalance in 'path_init' - different lock contexts for basic block

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-06-14 13:41:47

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 06/12] VFS: support concurrent renames.

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19-rc2 next-20220614]
[cannot apply to trondmy-nfs/linux-next viro-vfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-concurrent-directory-updates/20220614-072355
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
config: i386-buildonly-randconfig-r005-20220613 (https://download.01.org/0day-ci/archive/20220614/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c97436f8b6e2718286e8496faf53a2c800e281cf)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review NeilBrown/Allow-concurrent-directory-updates/20220614-072355
git checkout 46a2afd9f68f24a42f38f3a8afebafe7e494e9d8
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> fs/namei.c:3175:16: warning: no previous prototype for function 'lock_rename_lookup_excl' [-Wmissing-prototypes]
struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
^
fs/namei.c:3175:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
^
static
1 warning generated.


vim +/lock_rename_lookup_excl +3175 fs/namei.c

3174
> 3175 struct dentry *lock_rename_lookup_excl(struct dentry *p1, struct dentry *p2,
3176 struct dentry **d1p, struct dentry **d2p,
3177 struct qstr *last1, struct qstr *last2,
3178 unsigned int flags1, unsigned int flags2)
3179 {
3180 struct dentry *p;
3181 struct dentry *d1, *d2;
3182
3183 if (p1 == p2) {
3184 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3185 d1 = __lookup_hash(last1, p1, flags1, NULL);
3186 if (IS_ERR(d1))
3187 goto out_unlock_1;
3188 d2 = __lookup_hash(last2, p2, flags2, NULL);
3189 if (IS_ERR(d2))
3190 goto out_unlock_2;
3191 *d1p = d1; *d2p = d2;
3192 return NULL;
3193 out_unlock_2:
3194 dput(d1);
3195 d1 = d2;
3196 out_unlock_1:
3197 inode_unlock(p1->d_inode);
3198 return d1;
3199 }
3200
3201 mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
3202
3203 if ((p = d_ancestor(p2, p1)) != NULL) {
3204 inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
3205 inode_lock_nested(p1->d_inode, I_MUTEX_CHILD);
3206 } else if ((p = d_ancestor(p1, p2)) != NULL) {
3207 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3208 inode_lock_nested(p2->d_inode, I_MUTEX_CHILD);
3209 } else {
3210 inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
3211 inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
3212 }
3213 d1 = __lookup_hash(last1, p1, flags1, NULL);
3214 if (IS_ERR(d1))
3215 goto unlock_out_3;
3216 d2 = __lookup_hash(last2, p2, flags2, NULL);
3217 if (IS_ERR(d2))
3218 goto unlock_out_4;
3219
3220 *d1p = d1;
3221 *d2p = d2;
3222 return p;
3223 unlock_out_4:
3224 dput(d1);
3225 d1 = d2;
3226 unlock_out_3:
3227 inode_unlock(p1->d_inode);
3228 inode_unlock(p2->d_inode);
3229 mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
3230 return d1;
3231 }
3232

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-06-15 13:52:55

by Daire Byrne

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

Neil,

Firstly, thank you for your work on this. I'm probably the main
beneficiary of this (NFSD) effort atm so I feel extra special and
lucky!

I have done some quick artificial tests similar to before where I am
using a NFS server and client separated by an (extreme) 200ms of
latency (great for testing parallelism). I am only using NFSv3 due to
the NFSD_CACHE_SIZE_SLOTS_PER_SESSION parallelism limitations for
NFSv4.

Firstly, a client direct to server (VFS) with 10 simultaneous create
processes hitting the same directory:

client1 # for x in {1..1000}; do
echo /srv/server1/data/touch.$x
done | xargs -n1 -P 10 -iX -t touch X 2>&1 | pv -l -a >|/dev/null

Without the patch ( on the client), this reaches a steady state of 2.4
creates/s and increasing the number of parallel create processes does
not change this aggregate performance.

With the patch, the creation rate increases to 15 creates/s and with
100 processes, it further scales up to 121 creates/s.

Now for the re-export case (NFSD) where an intermediary server
re-exports the originating server (200ms away) to clients on it's
local LAN, there is no noticeable improvement for a single (not
patched) client. But we do see an aggregate improvement when we use
multiple clients at once.

# pdsh -Rssh -w 'client[1-10]' 'for x in {1..1000}; do echo
/srv/reexport1/data/$(hostname -s).$x; done | xargs -n1 -P 10 -iX -t
touch X 2>&1' | pv -l -a >|/dev/null

Without the patch applied to the reexport server, the aggregate is
around 2.2 create/s which is similar to doing it directly to the
originating server from a single client (above).

With the patch, the aggregate increases to 15 creates/s for 10 clients
which again matches the results of a single patched client. Not quite
a x10 increase but a healthy improvement nonetheless.

However, it is at this point that I started to experience some
stability issues with the re-export server that are not present with
the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd
threads start to lock up with stack traces like this:

[ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
[ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
[ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1234.463227] task:nfsd state:D stack: 0 pid: 5514
ppid: 2 flags:0x00004000
[ 1234.464212] Call Trace:
[ 1234.464677] <TASK>
[ 1234.465104] __schedule+0x2a9/0x8a0
[ 1234.465663] schedule+0x55/0xc0
[ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[ 1234.467732] ? put_prev_task_stop+0x170/0x170
[ 1234.468374] nfs_lookup_revalidate+0x15/0x20 [nfs]
[ 1234.469073] lookup_dcache+0x5a/0x80
[ 1234.469639] lookup_one_unlocked+0x59/0xa0
[ 1234.470244] lookup_one_len_unlocked+0x1d/0x20
[ 1234.470951] nfsd_lookup_dentry+0x190/0x470 [nfsd]
[ 1234.471663] nfsd_lookup+0x88/0x1b0 [nfsd]
[ 1234.472294] nfsd3_proc_lookup+0xb4/0x100 [nfsd]
[ 1234.473012] nfsd_dispatch+0x161/0x290 [nfsd]
[ 1234.473689] svc_process_common+0x48a/0x620 [sunrpc]
[ 1234.474402] ? nfsd_svc+0x330/0x330 [nfsd]
[ 1234.475038] ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[ 1234.475772] svc_process+0xbc/0xf0 [sunrpc]
[ 1234.476408] nfsd+0xda/0x190 [nfsd]
[ 1234.477011] kthread+0xf0/0x120
[ 1234.477522] ? kthread_complete_and_exit+0x20/0x20
[ 1234.478199] ret_from_fork+0x22/0x30
[ 1234.478755] </TASK>

For whatever reason, they seem to affect our Netapp mounts and
re-exports rather than our originating Linux NFS servers (against
which all tests were done). This may be related to the fact that those
Netapps serve our home directories so there could be some unique
locking patterns going on there.

This issue made things a bit too unstable to test at larger scales or
with our production workloads.

So all in all, the performance improvements in the knfsd re-export
case is looking great and we have real world use cases that this helps
with (batch processing workloads with latencies >10ms). If we can
figure out the hanging knfsd threads, then I can test it more heavily.

Many thanks,

Daire

On Tue, 14 Jun 2022 at 00:19, NeilBrown <[email protected]> wrote:
>
> VFS currently holds an exclusive lock on a directory during create,
> unlink, rename. This imposes serialisation on all filesystems though
> some may not benefit from it, and some may be able to provide finer
> grained locking internally, thus reducing contention.
>
> This series allows the filesystem to request that the inode lock be
> shared rather than exclusive. In that case an exclusive lock will be
> held on the dentry instead, much as is done for parallel lookup.
>
> The NFS filesystem can easily support concurrent updates (server does
> any needed serialiation) so it is converted.
>
> This series also converts nfsd to use the new interfaces so concurrent
> incoming NFS requests in the one directory can be handled concurrently.
>
> As a net result, if an NFS mounted filesystem is reexported over NFS,
> then multiple clients can create files in a single directory and all
> synchronisation will be handled on the final server. This helps hid
> latency on link from client to server.
>
> I include a few nfsd patches that aren't strictly needed for this work,
> but seem to be a logical consequence of the changes that I did have to
> make.
>
> I have only tested this lightly. In particular the rename support is
> quite new and I haven't tried to break it yet.
>
> I post this for general review, and hopefully extra testing... Daire
> Byrne has expressed interest in the NFS re-export parallelism.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (12):
> VFS: support parallel updates in the one directory.
> VFS: move EEXIST and ENOENT tests into lookup_hash_update()
> VFS: move want_write checks into lookup_hash_update()
> VFS: move dput() and mnt_drop_write() into done_path_update()
> VFS: export done_path_update()
> VFS: support concurrent renames.
> NFS: support parallel updates in the one directory.
> nfsd: allow parallel creates from nfsd
> nfsd: support concurrent renames.
> nfsd: reduce locking in nfsd_lookup()
> nfsd: use (un)lock_inode instead of fh_(un)lock
> nfsd: discard fh_locked flag and fh_lock/fh_unlock
>
>
> fs/dcache.c | 59 ++++-
> fs/namei.c | 578 ++++++++++++++++++++++++++++++++---------
> fs/nfs/dir.c | 29 ++-
> fs/nfs/inode.c | 2 +
> fs/nfs/unlink.c | 5 +-
> fs/nfsd/nfs2acl.c | 6 +-
> fs/nfsd/nfs3acl.c | 4 +-
> fs/nfsd/nfs3proc.c | 37 +--
> fs/nfsd/nfs4acl.c | 7 +-
> fs/nfsd/nfs4proc.c | 61 ++---
> fs/nfsd/nfs4state.c | 8 +-
> fs/nfsd/nfsfh.c | 10 +-
> fs/nfsd/nfsfh.h | 58 +----
> fs/nfsd/nfsproc.c | 31 +--
> fs/nfsd/vfs.c | 243 ++++++++---------
> fs/nfsd/vfs.h | 8 +-
> include/linux/dcache.h | 27 ++
> include/linux/fs.h | 1 +
> include/linux/namei.h | 30 ++-
> 19 files changed, 791 insertions(+), 413 deletions(-)
>
> --
> Signature
>

2022-06-16 10:52:57

by Daire Byrne

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

On Thu, 16 Jun 2022 at 01:56, NeilBrown <[email protected]> wrote:
>
> On Wed, 15 Jun 2022, Daire Byrne wrote:
> ..
> > However, it is at this point that I started to experience some
> > stability issues with the re-export server that are not present with
> > the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd
> > threads start to lock up with stack traces like this:
> >
> > [ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
> > [ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
> > [ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1234.463227] task:nfsd state:D stack: 0 pid: 5514
> > ppid: 2 flags:0x00004000
> > [ 1234.464212] Call Trace:
> > [ 1234.464677] <TASK>
> > [ 1234.465104] __schedule+0x2a9/0x8a0
> > [ 1234.465663] schedule+0x55/0xc0
> > [ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
> > [ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
>
> I can see the cause of this - I forget a wakeup. This patch should fix
> it, though I hope to find a better solution.
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 54c2c7adcd56..072130d000c4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -2483,17 +2483,16 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
> /* Must have exclusive lock on parent */
> did_set_par_update = true;
> + lock_acquire_exclusive(&dentry->d_update_map, 0,
> + 0, NULL, _THIS_IP_);
> dentry->d_flags |= DCACHE_PAR_UPDATE;
> }
>
> spin_unlock(&dentry->d_lock);
> error = nfs_safe_remove(dentry);
> nfs_dentry_remove_handle_error(dir, dentry, error);
> - if (did_set_par_update) {
> - spin_lock(&dentry->d_lock);
> - dentry->d_flags &= ~DCACHE_PAR_UPDATE;
> - spin_unlock(&dentry->d_lock);
> - }
> + if (did_set_par_update)
> + d_unlock_update(dentry);
> out:
> trace_nfs_unlink_exit(dir, dentry, error);
> return error;
>
> >
> > So all in all, the performance improvements in the knfsd re-export
> > case is looking great and we have real world use cases that this helps
> > with (batch processing workloads with latencies >10ms). If we can
> > figure out the hanging knfsd threads, then I can test it more heavily.
>
> Hopefully the above patch will allow the more heavy testing to continue.
> In any case, thanks a lot for the testing so far,

Patch applied but unfortunately I'm still getting the same trace, but
this time I also captured a preceding stack for a hung process local
to the reexport server - I wonder if it's happening somewhere in the
VFS changes rather than nfsd which then exports the path?

[ 373.930506] INFO: task XXXX:5072 blocked for more than 122 seconds.
[ 373.931410] Tainted: G W E 5.19.0-3.dneg.x86_64 #1
[ 373.932313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 373.933442] task:XXXX state:D stack: 0 pid: 5072 ppid:
1 flags:0x00000000
[ 373.934639] Call Trace:
[ 373.935007] <TASK>
[ 373.935306] __schedule+0x2a9/0x8a0
[ 373.935844] schedule+0x55/0xc0
[ 373.936294] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[ 373.937137] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[ 373.937875] ? put_prev_task_stop+0x170/0x170
[ 373.938525] nfs_lookup_revalidate+0x15/0x20 [nfs]
[ 373.939226] lookup_fast+0xda/0x150
[ 373.939756] path_openat+0x12a/0x1090
[ 373.940293] ? __filemap_fdatawrite_range+0x54/0x70
[ 373.941100] do_filp_open+0xb2/0x120
[ 373.941635] ? hashlen_string+0xd0/0xd0
[ 373.942190] ? _raw_spin_unlock+0xe/0x30
[ 373.942766] do_sys_openat2+0x245/0x320
[ 373.943305] do_sys_open+0x46/0x80
[ 373.943839] __x64_sys_open+0x21/0x30
[ 373.944428] do_syscall_64+0x3b/0x90
[ 373.944979] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 373.945688] RIP: 0033:0x7fcd80ceeeb0
[ 373.946226] RSP: 002b:00007fff90fd8298 EFLAGS: 00000246 ORIG_RAX:
0000000000000002
[ 373.947330] RAX: ffffffffffffffda RBX: 00007fcd81d6e981 RCX: 00007fcd80ceeeb0
[ 373.947333] RDX: 00000000000001b6 RSI: 0000000000000000 RDI: 00007fff90fd8360
[ 373.947334] RBP: 00007fff90fd82f0 R08: 00007fcd81d6e986 R09: 0000000000000000
[ 373.947335] R10: 0000000000000024 R11: 0000000000000246 R12: 0000000000cd6110
[ 373.947337] R13: 0000000000000008 R14: 00007fff90fd8360 R15: 00007fff90fdb580
[ 373.947339] </TASK>
[ 373.947421] INFO: task nfsd:5696 blocked for more than 122 seconds.
[ 373.947423] Tainted: G W E 5.19.0-3.dneg.x86_64 #1
[ 373.947424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 373.947425] task:nfsd state:D stack: 0 pid: 5696
ppid: 2 flags:0x00004000
[ 373.947428] Call Trace:
[ 373.947429] <TASK>
[ 373.947430] __schedule+0x2a9/0x8a0
[ 373.947434] schedule+0x55/0xc0
[ 373.947436] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[ 373.947451] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[ 373.947464] ? put_prev_task_stop+0x170/0x170
[ 373.947466] nfs_lookup_revalidate+0x15/0x20 [nfs]
[ 373.947478] lookup_dcache+0x5a/0x80
[ 373.947481] lookup_one_unlocked+0x59/0xa0
[ 373.947484] lookup_one_len_unlocked+0x1d/0x20
[ 373.947487] nfsd_lookup_dentry+0x190/0x470 [nfsd]
[ 373.947509] nfsd_lookup+0x88/0x1b0 [nfsd]
[ 373.947522] nfsd3_proc_lookup+0xb4/0x100 [nfsd]
[ 373.947537] nfsd_dispatch+0x161/0x290 [nfsd]
[ 373.947551] svc_process_common+0x48a/0x620 [sunrpc]
[ 373.947589] ? nfsd_svc+0x330/0x330 [nfsd]
[ 373.947602] ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[ 373.947621] svc_process+0xbc/0xf0 [sunrpc]
[ 373.951088] nfsd+0xda/0x190 [nfsd]
[ 373.951136] kthread+0xf0/0x120
[ 373.951138] ? kthread_complete_and_exit+0x20/0x20
[ 373.951140] ret_from_fork+0x22/0x30
[ 373.951149] </TASK>

I double checked that the patch had been applied and I hadn't made a
mistake with installation.

I could perhaps try running with just the VFS patches to see if I can
still reproduce the "local" VFS hang without the nfsd patches? Your
previous VFS only patchset was stable for me.

Daire

2022-06-16 13:56:14

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

Hi Neil,

On Wed, Jun 15, 2022 at 9:23 PM NeilBrown <[email protected]> wrote:
>
> On Wed, 15 Jun 2022, Daire Byrne wrote:
> ...
> > With the patch, the aggregate increases to 15 creates/s for 10 clients
> > which again matches the results of a single patched client. Not quite
> > a x10 increase but a healthy improvement nonetheless.
>
> Great!
>
> >
> > However, it is at this point that I started to experience some
> > stability issues with the re-export server that are not present with
> > the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd
> > threads start to lock up with stack traces like this:
> >
> > [ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
> > [ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
> > [ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1234.463227] task:nfsd state:D stack: 0 pid: 5514
> > ppid: 2 flags:0x00004000
> > [ 1234.464212] Call Trace:
> > [ 1234.464677] <TASK>
> > [ 1234.465104] __schedule+0x2a9/0x8a0
> > [ 1234.465663] schedule+0x55/0xc0
> > [ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
> > [ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
>
> I can see the cause of this - I forget a wakeup. This patch should fix
> it, though I hope to find a better solution.

I've applied as far as the NFS client patch plus the extra fix and I'm
seeing this stack trace on my client when I try to run cthon tests (I
was seeing it without the extra change as well):

anna@gouda ~ % virsh console client
Connected to domain 'client'
Escape character is ^] (Ctrl + ])
[ 148.977712] BUG: unable to handle page fault for address: ffffa56a03e2ff70
[ 148.978892] #PF: supervisor read access in kernel mode
[ 148.979504] #PF: error_code(0x0000) - not-present page
[ 148.980071] PGD 100000067 P4D 100000067 PUD 1001bb067 PMD 103164067 PTE 0
[ 148.980859] Oops: 0000 [#1] PREEMPT SMP PTI
[ 148.981323] CPU: 2 PID: 683 Comm: test7 Tainted: G W
5.19.0-rc1-g3de25d30cb97-dirty #33479
5f05aa11df373b898aeb9103b7a17e0ee3946022
[ 148.982824] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 0.0.0 02/06/2015
[ 148.984016] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0
[ 148.984609] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d
31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e
03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d
b5 50
[ 148.986724] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286
[ 148.987302] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000
[ 148.988084] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480
[ 148.988978] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180
[ 148.989752] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480
[ 148.990891] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8
[ 148.991741] FS: 00007f0c00660540(0000) GS:ffff9101b9d00000(0000)
knlGS:0000000000000000
[ 148.992995] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 148.993686] CR2: ffffa56a03e2ff70 CR3: 000000010eaa2004 CR4: 0000000000370ee0
[ 148.994494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 148.995638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 148.996462] Call Trace:
[ 148.996752] <TASK>
[ 148.997036] path_openat+0x290/0xd80
[ 148.997472] do_filp_open+0xbe/0x160
[ 148.998136] do_sys_openat2+0x8e/0x160
[ 148.998609] __x64_sys_creat+0x47/0x70
[ 148.999057] do_syscall_64+0x31/0x50
[ 148.999479] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 149.000040] RIP: 0033:0x7f0c00502487
[ 149.000444] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00
00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 55 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 74 24 0c 48 89 3c
24 e8
[ 149.002576] RSP: 002b:00007ffc01181a98 EFLAGS: 00000246 ORIG_RAX:
0000000000000055
[ 149.003610] RAX: ffffffffffffffda RBX: 00007ffc01184d28 RCX: 00007f0c00502487
[ 149.004511] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 00007ffc01181ae0
[ 149.005351] RBP: 00007ffc01182af0 R08: 0000000000000000 R09: 0000000000000064
[ 149.006160] R10: 00007f0c00406c68 R11: 0000000000000246 R12: 0000000000000000
[ 149.006939] R13: 00007ffc01184d40 R14: 0000000000000000 R15: 00007f0c0078b000
[ 149.007882] </TASK>
[ 149.008135] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs
fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill
8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm
snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul
snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg
intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat
snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm
intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd
soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl
lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg
ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2
9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon
virtio_net virtio_blk net_failover virtio_console virtio_scsi failover
crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev
xhci_pci_renesas virtio_pci_modern_dev serio
[ 149.020778] CR2: ffffa56a03e2ff70
[ 149.021803] ---[ end trace 0000000000000000 ]---
[ 149.021810] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0
[ 149.021815] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d
31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e
03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d
b5 50
[ 149.021817] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286
[ 149.021819] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000
[ 149.021820] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480
[ 149.021821] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180
[ 149.021822] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480
[ 149.021823] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8
[ 149.021824] FS: 00007f0c00660540(0000) GS:ffff9101b9d00000(0000)
knlGS:0000000000000000
[ 149.021826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.021827] CR2: ffffa56a03e2ff70 CR3: 000000010eaa2004 CR4: 0000000000370ee0
[ 149.021831] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.021832] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 149.021833] note: test7[683] exited with preempt_count 1
[ 149.026665] BUG: unable to handle page fault for address: ffffa56a03e2ff70
[ 149.026668] #PF: supervisor read access in kernel mode
[ 149.026669] #PF: error_code(0x0000) - not-present page
[ 149.026670] PGD 100000067 P4D 100000067 PUD 1001bb067 PMD 103164067 PTE 0
[ 149.026674] Oops: 0000 [#2] PREEMPT SMP PTI
[ 149.026674] Oops: 0000 [#2] PREEMPT SMP PTI
[ 149.026676] CPU: 0 PID: 687 Comm: cthon.zsh Tainted: G D W
5.19.0-rc1-g3de25d30cb97-dirty #33479
5f05aa11df373b898aeb9103b7a17e0ee3946022
[ 149.026679] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 0.0.0 02/06/2015
[ 149.026680] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0
[ 149.026684] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d
31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e
03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d
b5 50
[ 149.026685] RSP: 0018:ffffa56a03e3bbf0 EFLAGS: 00010286
[ 149.026687] RAX: 000000000000002a RBX: 000000000000002a RCX: ffff9101b9400000
[ 149.026688] RDX: 000000047dd98e98 RSI: ffff9100c1da4030 RDI: ffff9100c5ea7540
[ 149.026689] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb481600
[ 149.026690] R10: 000000007dd98e98 R11: ffffffff846e7640 R12: ffff9100c5ea7540
[ 149.026691] R13: ffffa56a03e30000 R14: ffff9100c5ea7598 R15: ffffa56a03e3bda0
[ 149.026692] FS: 00007fa4232f9000(0000) GS:ffff9101b9c00000(0000)
knlGS:0000000000000000
[ 149.026693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.026694] CR2: ffffa56a03e2ff70 CR3: 0000000104b90002 CR4: 0000000000370ef0
[ 149.026698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.026699] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 149.026700] Call Trace:
[ 149.026702] <TASK>
[ 149.026704] __lookup_slow+0x61/0x160
[ 149.026707] ? try_to_unlazy+0x14d/0x1f0
[ 149.026709] lookup_slow+0x33/0x50
[ 149.026711] walk_component+0x132/0x150
[ 149.026714] path_lookupat+0x4d/0x100
[ 149.026716] filename_lookup+0xeb/0x200
[ 149.026718] user_path_at_empty+0x36/0x90
[ 149.026721] do_faccessat+0x124/0x290
[ 149.026723] do_syscall_64+0x31/0x50
[ 149.026726] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 149.026729] RIP: 0033:0x7fa423101ceb
[ 149.026731] Code: 77 05 c3 0f 1f 40 00 48 8b 15 a9 b0 0f 00 f7 d8
64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 15 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 79 b0 0f 00
f7 d8
[ 149.026732] RSP: 002b:00007ffd7374d6a8 EFLAGS: 00000202 ORIG_RAX:
0000000000000015
[ 149.026734] RAX: ffffffffffffffda RBX: 00007ffd7374d760 RCX: 00007fa423101ceb
[ 149.026735] RDX: 000000000000002f RSI: 0000000000000001 RDI: 00007ffd7374d760
[ 149.026736] RBP: 00007ffd7374d760 R08: 0000000000000001 R09: 0000000000000000
[ 149.026737] R10: 0000000000000001 R11: 0000000000000202 R12: 000055b0c36db9d8
[ 149.026738] R13: 00007ffd7374e760 R14: 00007ffd7374d770 R15: 00007fa42326ebb0
[ 149.026740] </TASK>
[ 149.026740] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs
fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill
8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm
snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul
snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg
intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat
snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm
intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd
soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl
lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg
ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2
9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon
virtio_net virtio_blk net_failover virtio_console virtio_scsi failover
crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev
xhci_pci_renesas virtio_pci_modern_dev serio
[ 149.026793] CR2: ffffa56a03e2ff70
[ 149.026795] ---[ end trace 0000000000000000 ]---
[ 149.026795] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0
[ 149.026797] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d
31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e
03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d
b5 50
[ 149.026798] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286
[ 149.026799] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000
[ 149.026800] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480
[ 149.026801] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180
[ 149.026802] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480
[ 149.026803] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8
[ 149.026803] FS: 00007fa4232f9000(0000) GS:ffff9101b9c00000(0000)
knlGS:0000000000000000
[ 149.026805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.026806] CR2: ffffa56a03e2ff70 CR3: 0000000104b90002 CR4: 0000000000370ef0
[ 149.026807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.026807] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 149.026808] note: cthon.zsh[687] exited with preempt_count 1
[ 149.039191] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[ 149.039194] BUG: unable to handle page fault for address: ffff9100cb775fb0
[ 149.039196] #PF: supervisor instruction fetch in kernel mode
[ 149.039197] #PF: error_code(0x0011) - permissions violation
[ 149.039199] PGD 1cfe05067 P4D 1cfe05067 PUD 100b7a063 PMD 800000010b6001e3
[ 149.039203] Oops: 0011 [#3] PREEMPT SMP PTI
[ 149.039206] CPU: 1 PID: 15 Comm: pr/tty0 Tainted: G D W
5.19.0-rc1-g3de25d30cb97-dirty #33479
5f05aa11df373b898aeb9103b7a17e0ee3946022
[ 149.039210] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 0.0.0 02/06/2015
[ 149.039212] RIP: 0010:0xffff9100cb775fb0
[ 149.039215] Code: ff ff 80 5f 77 cb 00 91 ff ff d0 5b 77 cb 00 91
ff ff 20 35 52 cb 00 91 ff ff a0 5f 77 cb 00 91 ff ff a0 5f 77 cb 00
91 ff ff <30> b5 d7 cb 00 91 ff ff b0 de 9b 82 ff ff ff ff 00 00 00 00
00 00
[ 149.039217] RSP: 0018:ffffa56a00110f28 EFLAGS: 00010286
[ 149.039219] RAX: ffff9100c04ab530 RBX: 0000000000000004 RCX: 0000000000000004
[ 149.039220] RDX: ffff9100c01c4c10 RSI: 0000000000000000 RDI: ffff9100c04ab530
[ 149.039221] RBP: 0000000000000015 R08: 0000000080100006 R09: 0000000000100006
[ 149.039223] R10: 7fffffffffffffff R11: ffff9100cb775fb0 R12: ffff9101b9cb25f8
[ 149.039224] R13: ffffa56a00110f38 R14: ffff9101b9cb2580 R15: ffff9100c0318000
[ 149.039226] FS: 0000000000000000(0000) GS:ffff9101b9c80000(0000)
knlGS:0000000000000000
[ 149.039227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.039229] CR2: ffff9100cb775fb0 CR3: 0000000103698005 CR4: 0000000000370ee0
[ 149.039234] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.039235] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 149.039236] Call Trace:
[ 149.039238] <IRQ>
[ 149.039240] ? rcu_core+0x2f7/0x770
[ 149.039245] ? __do_softirq+0x152/0x2d1
[ 149.039249] ? irq_exit_rcu+0x6c/0xb0
[ 149.039252] ? sysvec_apic_timer_interrupt+0x6d/0x80
[ 149.039255] </IRQ>
[ 149.039256] <TASK>
[ 149.039257] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 149.039263] ? bit_putcs+0x503/0x7c0
[ 149.039271] ? fbcon_putcs+0x251/0x280
[ 149.039273] ? bit_clear+0x100/0x100
[ 149.039276] ? fbcon_redraw+0x152/0x220
[ 149.039279] ? fbcon_scroll+0xd4/0x1b0
[ 149.039282] ? con_scroll+0x1b1/0x240
[ 149.039286] ? vt_console_print+0x1f7/0x450
[ 149.039289] ? __console_emit_next_record+0x30b/0x3a0
[ 149.039293] ? printk_kthread_func+0x4a4/0x600
[ 149.039295] ? wake_bit_function+0x70/0x70
[ 149.039298] ? console_cpu_notify+0x80/0x80
[ 149.039300] ? kthread+0xd8/0xf0
[ 149.039303] ? kthread_blkcg+0x30/0x30
[ 149.039305] ? ret_from_fork+0x22/0x30
[ 149.039310] </TASK>
[ 149.039311] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs
fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill
8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm
snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul
snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg
intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat
snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm
intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd
soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl
lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg
ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2
9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon
virtio_net virtio_blk net_failover virtio_console virtio_scsi failover
crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev
xhci_pci_renesas virtio_pci_modern_dev serio
[ 149.039392] CR2: ffff9100cb775fb0
[ 149.039402] ---[ end trace 0000000000000000 ]---
[ 149.039403] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0
[ 149.039407] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d
31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e
03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d
b5 50
[ 149.039408] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286
[ 149.039410] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000
[ 149.039411] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480
[ 149.039412] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180
[ 149.039413] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480
[ 149.039414] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8
[ 149.039415] FS: 0000000000000000(0000) GS:ffff9101b9c80000(0000)
knlGS:0000000000000000
[ 149.039417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.039418] CR2: ffff9100cb775fb0 CR3: 0000000103698005 CR4: 0000000000370ee0
[ 149.039421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.039422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 149.039423] Kernel panic - not syncing: Fatal exception in interrupt
[ 149.039611] Kernel Offset: 0x1600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)


Hopefully something in there helps you figure out what's going on!

Anna

>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 54c2c7adcd56..072130d000c4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -2483,17 +2483,16 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
> /* Must have exclusive lock on parent */
> did_set_par_update = true;
> + lock_acquire_exclusive(&dentry->d_update_map, 0,
> + 0, NULL, _THIS_IP_);
> dentry->d_flags |= DCACHE_PAR_UPDATE;
> }
>
> spin_unlock(&dentry->d_lock);
> error = nfs_safe_remove(dentry);
> nfs_dentry_remove_handle_error(dir, dentry, error);
> - if (did_set_par_update) {
> - spin_lock(&dentry->d_lock);
> - dentry->d_flags &= ~DCACHE_PAR_UPDATE;
> - spin_unlock(&dentry->d_lock);
> - }
> + if (did_set_par_update)
> + d_unlock_update(dentry);
> out:
> trace_nfs_unlink_exit(dir, dentry, error);
> return error;
>
> >
> > So all in all, the performance improvements in the knfsd re-export
> > case is looking great and we have real world use cases that this helps
> > with (batch processing workloads with latencies >10ms). If we can
> > figure out the hanging knfsd threads, then I can test it more heavily.
>
> Hopefully the above patch will allow the more heavy testing to continue.
> In any case, thanks a lot for the testing so far,
>
> NeilBrown

2022-06-17 05:59:01

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

On Thu, 16 Jun 2022, Daire Byrne wrote:
>
> I double checked that the patch had been applied and I hadn't made a
> mistake with installation.

:-) always worth double checking...

>
> I could perhaps try running with just the VFS patches to see if I can
> still reproduce the "local" VFS hang without the nfsd patches? Your
> previous VFS only patchset was stable for me.

I've made quite a few changes since that VFS-only patches. Almost
certainly the problem is not in the nfsd code.

I think that following has a reasonable chance of making things better,
both for the problem you hit and the problem Anna hit. I haven't tested
it at all yet so no promises - up to you if you try it.

Thanks to both of you for the help with testing.

NeilBrown


diff --git a/fs/namei.c b/fs/namei.c
index 31ba4dbedfcf..6d0c955d407a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1609,7 +1609,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
if (IS_ERR(dentry))
return dentry;

- if (wq && d_in_lookup(dentry))
+ if (wq && !d_in_lookup(dentry))
/* Must have raced with another thread doing the lookup */
return dentry;

@@ -1664,6 +1664,7 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
}
if (flags & LOOKUP_EXCL) {
if (d_is_positive(dentry)) {
+ d_lookup_done(dentry);
dput(dentry);
err = -EEXIST;
goto out_err;
@@ -1671,6 +1672,7 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
}
if (!(flags & LOOKUP_CREATE)) {
if (!dentry->d_inode) {
+ d_lookup_done(dentry);
dput(dentry);
err = -ENOENT;
goto out_err;
@@ -1687,6 +1689,8 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
}
if (err2) {
err = err2;
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
dput(dentry);
goto out_err;
}
@@ -3273,6 +3277,7 @@ static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
}
return NULL;
out_unlock_2:
+ d_lookup_done(d1);
dput(d1);
d1 = d2;
out_unlock_1:
@@ -3315,6 +3320,7 @@ static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
*d2p = d2;
return p;
unlock_out_4:
+ d_lookup_done(d1);
dput(d1);
d1 = d2;
unlock_out_3:

2022-06-17 15:29:30

by Daire Byrne

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

On Fri, 17 Jun 2022 at 06:49, NeilBrown <[email protected]> wrote:
>
> On Thu, 16 Jun 2022, Daire Byrne wrote:
> >
> > I double checked that the patch had been applied and I hadn't made a
> > mistake with installation.
>
> :-) always worth double checking...
>
> >
> > I could perhaps try running with just the VFS patches to see if I can
> > still reproduce the "local" VFS hang without the nfsd patches? Your
> > previous VFS only patchset was stable for me.
>
> I've made quite a few changes since that VFS-only patches. Almost
> certainly the problem is not in the nfsd code.
>
> I think that following has a reasonable chance of making things better,
> both for the problem you hit and the problem Anna hit. I haven't tested
> it at all yet so no promises - up to you if you try it.
>
> Thanks to both of you for the help with testing.
>
> NeilBrown

This patch does the job for me - no more stack traces and things have
been stable all day. I'm going to run some production loads over the
weekend and then I'll do some more artificial scale testing next week.

Thanks again for this work! Improving the parallelism anywhere we can
for single clients and then nfsd is great for reexport servers
(especially once you add some "cloud" latency).

Cheers,

Daire

2022-06-20 10:23:31

by Daire Byrne

[permalink] [raw]
Subject: Re: [PATCH RFC 00/12] Allow concurrent directory updates.

On Fri, 17 Jun 2022 at 16:27, Daire Byrne <[email protected]> wrote:
> This patch does the job for me - no more stack traces and things have
> been stable all day. I'm going to run some production loads over the
> weekend and then I'll do some more artificial scale testing next week.
>
> Thanks again for this work! Improving the parallelism anywhere we can
> for single clients and then nfsd is great for reexport servers
> (especially once you add some "cloud" latency).
>
> Cheers,
>
> Daire

The patch ran without incident with our production re-export workloads
over the weekend (which also helps audit v5.19-rc2).

I ran a couple more synthetic tests and got up to 100 clients of a
re-export server and ~113 creates/s aggregate to a single directory
with 200ms latency to the originating server. This compares well with
the ~121 create/s when using 100 threads on a single patched client
direct to the remote NFS server.

In other words, the NFSD portion of this patch is delivering almost
the same performance as the underlying VFS NFS client performance when
re-exporting that path to hundreds of clients.

Again, without this patch we can only sustain just under 3 create/s in
both cases (VFS/NFSD) with 200ms latency.

This is a great improvement for our batch workloads with varying
amounts of latency >10ms (cloud networking).

Tested-by: Daire Byrne <[email protected]>

Daire

2022-06-24 14:47:42

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 08/12] nfsd: allow parallel creates from nfsd



> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>
> Rather than getting write access, locking the directory, and performing
> a lookup, use
> filename_create_one_len() for create, or
> lookup_hash_update_len() for delete
> which combines all these steps and handles shared locking for concurrent
> updates where appropriate.
>
> As we don't use fh_lock() we need to call fh_fill_pre_attrs()
> explicitly. However if we only get a shared lock, then the pre/post
> attributes won't be atomic, so we need to ensure that fh know that,
> and doesn't try to encode wcc attrs either.
>
> Note that there is only one filesystem that allows shared locks for
> create/unlink and that is NFS (for re-export). NFS does support atomic
> pre/post attributes, so there is no loss in not providing them.

Should this be "NFS does /not/ support atomic pre/post ..." ?


> When some other filesystem supports concurrent updates, we might need to
> consider if the pre/post attributes are more important than the
> parallelism.

My impression is that pre/post attributes in NFSv3 have not
turned out to be as useful as their inventors predicted.

I think this is mostly a "switch to the new API" patch, so

Reviewed-by: Chuck Lever <[email protected]>


> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/nfs3proc.c | 31 +++++-------
> fs/nfsd/nfs4proc.c | 32 +++++-------
> fs/nfsd/nfsfh.c | 7 ++-
> fs/nfsd/nfsfh.h | 4 +-
> fs/nfsd/nfsproc.c | 29 +++++------
> fs/nfsd/vfs.c | 134 +++++++++++++++++++++++-----------------------------
> 6 files changed, 105 insertions(+), 132 deletions(-)
>
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index 981a3a7a6e16..0fdbb9504a87 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -231,12 +231,14 @@ static __be32
> nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct svc_fh *resfhp, struct nfsd3_createargs *argp)
> {
> + struct path path;
> struct iattr *iap = &argp->attrs;
> - struct dentry *parent, *child;
> + struct dentry *child;
> __u32 v_mtime, v_atime;
> struct inode *inode;
> __be32 status;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Nit: I'd prefer sticking with the reverse christmas tree
style for variable declarations. Same in other patches
that touch NFSD.


> if (isdotent(argp->name, argp->len))
> return nfserr_exist;
> @@ -247,20 +249,15 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (status != nfs_ok)
> return status;
>
> - parent = fhp->fh_dentry;
> - inode = d_inode(parent);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + inode = d_inode(path.dentry);
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + child = filename_create_one_len(argp->name, argp->len,
> + &path, 0, &wq);
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> -
> - child = lookup_one_len(argp->name, parent, argp->len);
> - if (IS_ERR(child)) {
> - status = nfserrno(PTR_ERR(child));
> - goto out;
> - }
> + if (IS_ERR(child))
> + return nfserrno(PTR_ERR(child));
>
> if (d_really_is_negative(child)) {
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> @@ -311,6 +308,7 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> if (!IS_POSIXACL(inode))
> iap->ia_mode &= ~current_umask();
> + fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> host_err = vfs_create(&init_user_ns, inode, child, iap->ia_mode, true);
> if (host_err < 0) {
> @@ -332,12 +330,9 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> set_attr:
> status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
> -
> + fh_fill_post_attrs(fhp);
> out:
> - fh_unlock(fhp);
> - if (child && !IS_ERR(child))
> - dput(child);
> - fh_drop_write(fhp);
> + done_path_update(&path, child, &wq);
> return status;
> }
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 3895eb52d2b1..71a4b8ef77f0 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -285,12 +285,13 @@ static __be32
> nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct svc_fh *resfhp, struct nfsd4_open *open)
> {
> + struct path path;
> struct iattr *iap = &open->op_iattr;
> - struct dentry *parent, *child;
> + struct dentry *child;
> __u32 v_mtime, v_atime;
> struct inode *inode;
> __be32 status;
> - int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


> if (isdotent(open->op_fname, open->op_fnamelen))
> return nfserr_exist;
> @@ -300,20 +301,17 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_EXEC);
> if (status != nfs_ok)
> return status;
> - parent = fhp->fh_dentry;
> - inode = d_inode(parent);
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + inode = d_inode(path.dentry);
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> + child = filename_create_one_len(open->op_fname,
> + open->op_fnamelen,
> + &path, 0, &wq);
>
> - child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
> - if (IS_ERR(child)) {
> - status = nfserrno(PTR_ERR(child));
> - goto out;
> - }
> + if (IS_ERR(child))
> + return nfserrno(PTR_ERR(child));
>
> if (d_really_is_negative(child)) {
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> @@ -386,6 +384,7 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (!IS_POSIXACL(inode))
> iap->ia_mode &= ~current_umask();
>
> + fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);
> status = nfsd4_vfs_create(fhp, child, open);
> if (status != nfs_ok)
> goto out;
> @@ -405,12 +404,9 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> set_attr:
> status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
> -
> + fh_fill_post_attrs(fhp);
> out:
> - fh_unlock(fhp);
> - if (child && !IS_ERR(child))
> - dput(child);
> - fh_drop_write(fhp);
> + done_path_update(&path, child, &wq);
> return status;
> }
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index c29baa03dfaf..a50db688c60d 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -616,7 +616,7 @@ fh_update(struct svc_fh *fhp)
> * @fhp: file handle to be updated
> *
> */
> -void fh_fill_pre_attrs(struct svc_fh *fhp)
> +void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic)
> {
> bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
> struct inode *inode;
> @@ -626,6 +626,11 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> if (fhp->fh_no_wcc || fhp->fh_pre_saved)
> return;
>
> + if (!atomic) {
> + fhp->fh_no_atomic_attr = true;
> + fhp->fh_no_wcc = true;
> + }
> +
> inode = d_inode(fhp->fh_dentry);
> err = fh_getattr(fhp, &stat);
> if (err) {
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index fb9d358a267e..ecc57fd3fd67 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -320,7 +320,7 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
> return time_to_chattr(&stat->ctime);
> }
>
> -extern void fh_fill_pre_attrs(struct svc_fh *fhp);
> +extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
>
>
> @@ -347,7 +347,7 @@ fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
>
> inode = d_inode(dentry);
> inode_lock_nested(inode, subclass);
> - fh_fill_pre_attrs(fhp);
> + fh_fill_pre_attrs(fhp, true);
> fhp->fh_locked = true;
> }
>
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index fcdab8a8a41f..2dccf77634e8 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -255,6 +255,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
> static __be32
> nfsd_proc_create(struct svc_rqst *rqstp)
> {
> + struct path path;
> struct nfsd_createargs *argp = rqstp->rq_argp;
> struct nfsd_diropres *resp = rqstp->rq_resp;
> svc_fh *dirfhp = &argp->fh;
> @@ -263,8 +264,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> struct inode *inode;
> struct dentry *dchild;
> int type, mode;
> - int hosterr;
> dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


> dprintk("nfsd: CREATE %s %.*s\n",
> SVCFH_fmt(dirfhp), argp->len, argp->name);
> @@ -279,17 +280,13 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> resp->status = nfserr_exist;
> if (isdotent(argp->name, argp->len))
> goto done;
> - hosterr = fh_want_write(dirfhp);
> - if (hosterr) {
> - resp->status = nfserrno(hosterr);
> - goto done;
> - }
>
> - fh_lock_nested(dirfhp, I_MUTEX_PARENT);
> - dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
> + path.dentry = dirfhp->fh_dentry;
> + path.mnt = dirfhp->fh_export->ex_path.mnt;
> + dchild = filename_create_one_len(argp->name, argp->len, &path, 0, &wq);
> if (IS_ERR(dchild)) {
> resp->status = nfserrno(PTR_ERR(dchild));
> - goto out_unlock;
> + goto out_done;
> }
> fh_init(newfhp, NFS_FHSIZE);
> resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
> @@ -298,7 +295,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> dput(dchild);
> if (resp->status) {
> if (resp->status != nfserr_noent)
> - goto out_unlock;
> + goto out_done;
> /*
> * If the new file handle wasn't verified, we can't tell
> * whether the file exists or not. Time to bail ...
> @@ -307,7 +304,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> if (!newfhp->fh_dentry) {
> printk(KERN_WARNING
> "nfsd_proc_create: file handle not verified\n");
> - goto out_unlock;
> + goto out_done;
> }
> }
>
> @@ -341,7 +338,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> newfhp->fh_dentry,
> NFSD_MAY_WRITE|NFSD_MAY_LOCAL_ACCESS);
> if (resp->status && resp->status != nfserr_rofs)
> - goto out_unlock;
> + goto out_done;
> }
> } else
> type = S_IFREG;
> @@ -378,7 +375,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> /* Make sure the type and device matches */
> resp->status = nfserr_exist;
> if (inode && inode_wrong_type(inode, type))
> - goto out_unlock;
> + goto out_done;
> }
>
> resp->status = nfs_ok;
> @@ -400,10 +397,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> (time64_t)0);
> }
>
> -out_unlock:
> - /* We don't really need to unlock, as fh_put does it. */
> - fh_unlock(dirfhp);
> - fh_drop_write(dirfhp);
> +out_done:
> + done_path_update(&path, dchild, &wq);
> done:
> fh_put(dirfhp);
> if (resp->status != nfs_ok)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 840e3af63a6f..6cdd5e407600 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1274,12 +1274,6 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
> dirp = d_inode(dentry);
>
> dchild = dget(resfhp->fh_dentry);
> - if (!fhp->fh_locked) {
> - WARN_ONCE(1, "nfsd_create: parent %pd2 not locked!\n",
> - dentry);
> - err = nfserr_io;
> - goto out;
> - }
>
> err = nfsd_permission(rqstp, fhp->fh_export, dentry, NFSD_MAY_CREATE);
> if (err)
> @@ -1362,9 +1356,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> char *fname, int flen, struct iattr *iap,
> int type, dev_t rdev, struct svc_fh *resfhp)
> {
> - struct dentry *dentry, *dchild = NULL;
> + struct path path;
> + struct dentry *dchild = NULL;
> __be32 err;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> if (isdotent(fname, flen))
> return nfserr_exist;
> @@ -1373,27 +1369,22 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (err)
> return err;
>
> - dentry = fhp->fh_dentry;
> -
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> - dchild = lookup_one_len(fname, dentry, flen);
> + dchild = filename_create_one_len(fname, flen, &path, 0, &wq);
> host_err = PTR_ERR(dchild);
> if (IS_ERR(dchild))
> return nfserrno(host_err);
> err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
> - /*
> - * We unconditionally drop our ref to dchild as fh_compose will have
> - * already grabbed its own ref for it.
> - */
> - dput(dchild);
> - if (err)
> - return err;
> - return nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
> - rdev, resfhp);
> + if (!err) {
> + fh_fill_pre_attrs(fhp, (dchild->d_flags & DCACHE_PAR_UPDATE) == 0);
> + err = nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
> + rdev, resfhp);
> + fh_fill_post_attrs(fhp);
> + }
> + done_path_update(&path, dchild, &wq);
> + return err;
> }
>
> /*
> @@ -1441,15 +1432,17 @@ nfsd_readlink(struct svc_rqst *rqstp, struct svc_fh *fhp, char *buf, int *lenp)
> __be32
> nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> char *fname, int flen,
> - char *path,
> + char *lpath,
> struct svc_fh *resfhp)
> {
> - struct dentry *dentry, *dnew;
> + struct path path;
> + struct dentry *dnew;
> __be32 err, cerr;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


> err = nfserr_noent;
> - if (!flen || path[0] == '\0')
> + if (!flen || lpath[0] == '\0')
> goto out;
> err = nfserr_exist;
> if (isdotent(fname, flen))
> @@ -1459,28 +1452,28 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (err)
> goto out;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
>
> - fh_lock(fhp);
> - dentry = fhp->fh_dentry;
> - dnew = lookup_one_len(fname, dentry, flen);
> + dnew = filename_create_one_len(fname, flen, &path, 0, &wq);
> host_err = PTR_ERR(dnew);
> if (IS_ERR(dnew))
> goto out_nfserr;
>
> - host_err = vfs_symlink(&init_user_ns, d_inode(dentry), dnew, path);
> + fh_fill_pre_attrs(fhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
> + host_err = vfs_symlink(mnt_user_ns(path.mnt), d_inode(path.dentry),
> + dnew, lpath);
> err = nfserrno(host_err);
> - fh_unlock(fhp);
> if (!err)
> err = nfserrno(commit_metadata(fhp));
>
> - fh_drop_write(fhp);
> + fh_fill_post_attrs(fhp);
>
> cerr = fh_compose(resfhp, fhp->fh_export, dnew, fhp);
> - dput(dnew);
> - if (err==0) err = cerr;
> + if (err==0)
> + err = cerr;
> +
> + done_path_update(&path, dnew, &wq);
> out:
> return err;
>
> @@ -1497,10 +1490,12 @@ __be32
> nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> char *name, int len, struct svc_fh *tfhp)
> {
> - struct dentry *ddir, *dnew, *dold;
> + struct path path;
> + struct dentry *dold, *dnew;
> struct inode *dirp;
> __be32 err;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


> err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_CREATE);
> if (err)
> @@ -1518,17 +1513,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> if (isdotent(name, len))
> goto out;
>
> - host_err = fh_want_write(tfhp);
> - if (host_err) {
> - err = nfserrno(host_err);
> - goto out;
> - }
> -
> - fh_lock_nested(ffhp, I_MUTEX_PARENT);
> - ddir = ffhp->fh_dentry;
> - dirp = d_inode(ddir);
> + path.dentry = ffhp->fh_dentry;
> + path.mnt = ffhp->fh_export->ex_path.mnt;
> + dirp = d_inode(path.dentry);
>
> - dnew = lookup_one_len(name, ddir, len);
> + dnew = filename_create_one_len(name, len, &path, 0, &wq);
> host_err = PTR_ERR(dnew);
> if (IS_ERR(dnew))
> goto out_nfserr;
> @@ -1537,9 +1526,10 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
>
> err = nfserr_noent;
> if (d_really_is_negative(dold))
> - goto out_dput;
> + goto out_done;
> + fh_fill_pre_attrs(ffhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
> host_err = vfs_link(dold, &init_user_ns, dirp, dnew, NULL);
> - fh_unlock(ffhp);
> +
> if (!host_err) {
> err = nfserrno(commit_metadata(ffhp));
> if (!err)
> @@ -1550,17 +1540,15 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> else
> err = nfserrno(host_err);
> }
> -out_dput:
> - dput(dnew);
> -out_unlock:
> - fh_unlock(ffhp);
> - fh_drop_write(tfhp);
> +out_done:
> + fh_fill_post_attrs(ffhp);
> + done_path_update(&path, dnew, &wq);
> out:
> return err;
>
> out_nfserr:
> err = nfserrno(host_err);
> - goto out_unlock;
> + goto out;
> }
>
> static void
> @@ -1625,8 +1613,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> * so do it by hand */
> trap = lock_rename(tdentry, fdentry);
> ffhp->fh_locked = tfhp->fh_locked = true;
> - fh_fill_pre_attrs(ffhp);
> - fh_fill_pre_attrs(tfhp);
> + fh_fill_pre_attrs(ffhp, true);
> + fh_fill_pre_attrs(tfhp, true);
>
> odentry = lookup_one_len(fname, fdentry, flen);
> host_err = PTR_ERR(odentry);
> @@ -1717,11 +1705,13 @@ __be32
> nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> char *fname, int flen)
> {
> - struct dentry *dentry, *rdentry;
> + struct dentry *rdentry;
> struct inode *dirp;
> struct inode *rinode;
> __be32 err;
> int host_err;
> + struct path path;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


> err = nfserr_acces;
> if (!flen || isdotent(fname, flen))
> @@ -1730,24 +1720,18 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> if (err)
> goto out;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + path.dentry = fhp->fh_dentry;
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> - dentry = fhp->fh_dentry;
> - dirp = d_inode(dentry);
> + rdentry = lookup_hash_update_len(fname, flen, &path, 0, &wq);
> + dirp = d_inode(path.dentry);
>
> - rdentry = lookup_one_len(fname, dentry, flen);
> host_err = PTR_ERR(rdentry);
> if (IS_ERR(rdentry))
> - goto out_drop_write;
> + goto out_nfserr;
> +
> + fh_fill_pre_attrs(fhp, (rdentry->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> - if (d_really_is_negative(rdentry)) {
> - dput(rdentry);
> - host_err = -ENOENT;
> - goto out_drop_write;
> - }
> rinode = d_inode(rdentry);
> ihold(rinode);
>
> @@ -1761,15 +1745,13 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> } else {
> host_err = vfs_rmdir(&init_user_ns, dirp, rdentry);
> }
> + fh_fill_post_attrs(fhp);
>
> - fh_unlock(fhp);
> + done_path_update(&path, rdentry, &wq);
> if (!host_err)
> host_err = commit_metadata(fhp);
> - dput(rdentry);
> iput(rinode); /* truncate the inode here */
>
> -out_drop_write:
> - fh_drop_write(fhp);
> out_nfserr:
> if (host_err == -EBUSY) {
> /* name is mounted-on. There is no perfect
>
>

--
Chuck Lever



2022-06-24 14:49:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 09/12] nfsd: support concurrent renames.



> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>
> If the filesystem supports it, renames can now be concurrent with other
> updates.
> We use lock_rename_lookup_one() to do the appropriate locking in the
> right order and to look up the names.
>
> Signed-off-by: NeilBrown <[email protected]>

Reviewed-by: Chuck Lever <[email protected]>


> ---
> fs/nfsd/vfs.c | 49 +++++++++++++++++++------------------------------
> 1 file changed, 19 insertions(+), 30 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 6cdd5e407600..b0df216ab3e4 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1584,6 +1584,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> __be32 err;
> int host_err;
> bool close_cached = false;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

Ditto.


>
> err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_REMOVE);
> if (err)
> @@ -1611,41 +1612,37 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
>
> /* cannot use fh_lock as we need deadlock protective ordering
> * so do it by hand */
> - trap = lock_rename(tdentry, fdentry);
> - ffhp->fh_locked = tfhp->fh_locked = true;
> - fh_fill_pre_attrs(ffhp, true);
> - fh_fill_pre_attrs(tfhp, true);
> -
> - odentry = lookup_one_len(fname, fdentry, flen);
> - host_err = PTR_ERR(odentry);
> - if (IS_ERR(odentry))
> + trap = lock_rename_lookup_one(tdentry, fdentry, &ndentry, &odentry,
> + tname, tlen, fname, flen, 0, 0, &wq);
> + host_err = PTR_ERR(trap);
> + if (IS_ERR(trap))
> goto out_nfserr;
> + ffhp->fh_locked = tfhp->fh_locked = true;
> + fh_fill_pre_attrs(ffhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
> + fh_fill_pre_attrs(tfhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> host_err = -ENOENT;
> if (d_really_is_negative(odentry))
> - goto out_dput_old;
> + goto out_unlock;
> host_err = -EINVAL;
> if (odentry == trap)
> - goto out_dput_old;
> + goto out_unlock;
>
> - ndentry = lookup_one_len(tname, tdentry, tlen);
> - host_err = PTR_ERR(ndentry);
> - if (IS_ERR(ndentry))
> - goto out_dput_old;
> host_err = -ENOTEMPTY;
> if (ndentry == trap)
> - goto out_dput_new;
> + goto out_unlock;
>
> host_err = -EXDEV;
> if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
> - goto out_dput_new;
> + goto out_unlock;
> if (ffhp->fh_export->ex_path.dentry != tfhp->fh_export->ex_path.dentry)
> - goto out_dput_new;
> + goto out_unlock;
>
> if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
> nfsd_has_cached_files(ndentry)) {
> close_cached = true;
> - goto out_dput_old;
> + dget(ndentry);
> + goto out_unlock;
> } else {
> struct renamedata rd = {
> .old_mnt_userns = &init_user_ns,
> @@ -1662,23 +1659,15 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> host_err = commit_metadata(ffhp);
> }
> }
> - out_dput_new:
> - dput(ndentry);
> - out_dput_old:
> - dput(odentry);
> - out_nfserr:
> - err = nfserrno(host_err);
> - /*
> - * We cannot rely on fh_unlock on the two filehandles,
> - * as that would do the wrong thing if the two directories
> - * were the same, so again we do it by hand.
> - */
> if (!close_cached) {
> fh_fill_post_attrs(ffhp);
> fh_fill_post_attrs(tfhp);
> }
> - unlock_rename(tdentry, fdentry);
> + out_unlock:
> + unlock_rename_lookup(tdentry, fdentry, ndentry, odentry);
> ffhp->fh_locked = tfhp->fh_locked = false;
> + out_nfserr:
> + err = nfserrno(host_err);
> fh_drop_write(ffhp);
>
> /*
>
>

--
Chuck Lever



2022-06-24 14:49:39

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 11/12] nfsd: use (un)lock_inode instead of fh_(un)lock



> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>
> Except for the xattr changes, these callers don't need to save pre/post
> attrs, so use a simple lock/unlock.
>
> For the xattr changes, make the saving of pre/post explicit rather than
> requiring a comment.
>
> Also many fh_unlock() are not needed.

This patch does three different (though related) things. I prefer to
instead see three patches:

- One that changes the xattr code, as described
- One that cleans up unneeded fh_unlock calls, with an explanation of
why that is safe to do
- One that replaces fh_lock() with inode_lock (or inode_lock_nested,
see below).


> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/nfs2acl.c | 6 +++---
> fs/nfsd/nfs3acl.c | 4 ++--
> fs/nfsd/nfs3proc.c | 4 ----
> fs/nfsd/nfs4acl.c | 7 +++----
> fs/nfsd/nfs4proc.c | 2 --
> fs/nfsd/nfs4state.c | 8 ++++----
> fs/nfsd/nfsfh.c | 1 -
> fs/nfsd/nfsfh.h | 8 --------
> fs/nfsd/vfs.c | 26 ++++++++++++--------------
> 9 files changed, 24 insertions(+), 42 deletions(-)
>
> diff --git a/fs/nfsd/nfs2acl.c b/fs/nfsd/nfs2acl.c
> index b5760801d377..9edd3c1a30fb 100644
> --- a/fs/nfsd/nfs2acl.c
> +++ b/fs/nfsd/nfs2acl.c
> @@ -111,7 +111,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
> if (error)
> goto out_errno;
>
> - fh_lock(fh);
> + inode_lock(inode);

fh_lock() uses inode_lock_nested(). If the above hunk is the
correct substitution, the patch description should explain why
that is correct. Otherwise, shall we use inode_lock_nested()
here too?

Likewise throughout.


> error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS,
> argp->acl_access);
> @@ -122,7 +122,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
> if (error)
> goto out_drop_lock;
>
> - fh_unlock(fh);
> + inode_unlock(inode);
>
> fh_drop_write(fh);
>
> @@ -136,7 +136,7 @@ static __be32 nfsacld_proc_setacl(struct svc_rqst *rqstp)
> return rpc_success;
>
> out_drop_lock:
> - fh_unlock(fh);
> + inode_unlock(inode);
> fh_drop_write(fh);
> out_errno:
> resp->status = nfserrno(error);
> diff --git a/fs/nfsd/nfs3acl.c b/fs/nfsd/nfs3acl.c
> index 35b2ebda14da..9446c6743664 100644
> --- a/fs/nfsd/nfs3acl.c
> +++ b/fs/nfsd/nfs3acl.c
> @@ -101,7 +101,7 @@ static __be32 nfsd3_proc_setacl(struct svc_rqst *rqstp)
> if (error)
> goto out_errno;
>
> - fh_lock(fh);
> + inode_lock(inode);
>
> error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS,
> argp->acl_access);
> @@ -111,7 +111,7 @@ static __be32 nfsd3_proc_setacl(struct svc_rqst *rqstp)
> argp->acl_default);
>
> out_drop_lock:
> - fh_unlock(fh);
> + inode_unlock(inode);
> fh_drop_write(fh);
> out_errno:
> resp->status = nfserrno(error);
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index d85b110d58dd..7bb07c7e6ee8 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -374,7 +374,6 @@ nfsd3_proc_mkdir(struct svc_rqst *rqstp)
> fh_init(&resp->fh, NFS3_FHSIZE);
> resp->status = nfsd_create(rqstp, &resp->dirfh, argp->name, argp->len,
> &argp->attrs, S_IFDIR, 0, &resp->fh);
> - fh_unlock(&resp->dirfh);
> return rpc_success;
> }
>
> @@ -449,7 +448,6 @@ nfsd3_proc_mknod(struct svc_rqst *rqstp)
> type = nfs3_ftypes[argp->ftype];
> resp->status = nfsd_create(rqstp, &resp->dirfh, argp->name, argp->len,
> &argp->attrs, type, rdev, &resp->fh);
> - fh_unlock(&resp->dirfh);
> out:
> return rpc_success;
> }
> @@ -472,7 +470,6 @@ nfsd3_proc_remove(struct svc_rqst *rqstp)
> fh_copy(&resp->fh, &argp->fh);
> resp->status = nfsd_unlink(rqstp, &resp->fh, -S_IFDIR,
> argp->name, argp->len);
> - fh_unlock(&resp->fh);
> return rpc_success;
> }
>
> @@ -493,7 +490,6 @@ nfsd3_proc_rmdir(struct svc_rqst *rqstp)
> fh_copy(&resp->fh, &argp->fh);
> resp->status = nfsd_unlink(rqstp, &resp->fh, S_IFDIR,
> argp->name, argp->len);
> - fh_unlock(&resp->fh);
> return rpc_success;
> }
>
> diff --git a/fs/nfsd/nfs4acl.c b/fs/nfsd/nfs4acl.c
> index eaa3a0cf38f1..7bcc6dc0f762 100644
> --- a/fs/nfsd/nfs4acl.c
> +++ b/fs/nfsd/nfs4acl.c
> @@ -779,19 +779,18 @@ nfsd4_set_nfs4_acl(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (host_error < 0)
> goto out_nfserr;
>
> - fh_lock(fhp);
> + inode_lock(inode);
>
> host_error = set_posix_acl(&init_user_ns, inode, ACL_TYPE_ACCESS, pacl);
> if (host_error < 0)
> goto out_drop_lock;
>
> - if (S_ISDIR(inode->i_mode)) {
> + if (S_ISDIR(inode->i_mode))
> host_error = set_posix_acl(&init_user_ns, inode,
> ACL_TYPE_DEFAULT, dpacl);
> - }
>
> out_drop_lock:
> - fh_unlock(fhp);
> + inode_unlock(inode);
>
> posix_acl_release(pacl);
> posix_acl_release(dpacl);
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 79434e29b63f..d6defaf5a77a 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -860,7 +860,6 @@ nfsd4_create(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> do_set_nfs4_acl(rqstp, &resfh, create->cr_acl,
> create->cr_bmval);
>
> - fh_unlock(&cstate->current_fh);
> set_change_info(&create->cr_cinfo, &cstate->current_fh);
> fh_dup2(&cstate->current_fh, &resfh);
> out:
> @@ -1040,7 +1039,6 @@ nfsd4_remove(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> status = nfsd_unlink(rqstp, &cstate->current_fh, 0,
> remove->rm_name, remove->rm_namelen);
> if (!status) {
> - fh_unlock(&cstate->current_fh);
> set_change_info(&remove->rm_cinfo, &cstate->current_fh);
> }
> return status;
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 9409a0dc1b76..cb664c61b546 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7321,21 +7321,21 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> static __be32 nfsd_test_lock(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file_lock *lock)
> {
> struct nfsd_file *nf;
> + struct inode *inode = fhp->fh_dentry->d_inode;
> __be32 err;
>
> err = nfsd_file_acquire(rqstp, fhp, NFSD_MAY_READ, &nf);
> if (err)
> return err;
> - fh_lock(fhp); /* to block new leases till after test_lock: */
> - err = nfserrno(nfsd_open_break_lease(fhp->fh_dentry->d_inode,
> - NFSD_MAY_READ));
> + inode_lock(inode); /* to block new leases till after test_lock: */
> + err = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
> if (err)
> goto out;
> lock->fl_file = nf->nf_file;
> err = nfserrno(vfs_test_lock(nf->nf_file, lock));
> lock->fl_file = NULL;
> out:
> - fh_unlock(fhp);
> + inode_unlock(inode);
> nfsd_file_put(nf);
> return err;
> }
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index a50db688c60d..ae270e4f921f 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -685,7 +685,6 @@ fh_put(struct svc_fh *fhp)
> struct dentry * dentry = fhp->fh_dentry;
> struct svc_export * exp = fhp->fh_export;
> if (dentry) {
> - fh_unlock(fhp);
> fhp->fh_dentry = NULL;
> dput(dentry);
> fh_clear_pre_post_attrs(fhp);
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index ecc57fd3fd67..c5061cdb1016 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -323,14 +323,6 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
> extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
>
> -
> -/*
> - * Lock a file handle/inode
> - * NOTE: both fh_lock and fh_unlock are done "by hand" in
> - * vfs.c:nfsd_rename as it needs to grab 2 i_mutex's at once
> - * so, any changes here should be reflected there.
> - */
> -
> static inline void
> fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
> {
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 4c2e431100ba..f2f4868618bb 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -253,7 +253,6 @@ nfsd_lookup_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp,
> * returned. Otherwise the covered directory is returned.
> * NOTE: this mountpoint crossing is not supported properly by all
> * clients and is explicitly disallowed for NFSv3
> - * NeilBrown <[email protected]>
> */
> __be32
> nfsd_lookup(struct svc_rqst *rqstp, struct svc_fh *fhp, const char *name,
> @@ -434,7 +433,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
> return err;
> }
>
> - fh_lock(fhp);
> + inode_lock(inode);
> if (size_change) {
> /*
> * RFC5661, Section 18.30.4:
> @@ -470,7 +469,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
> host_err = notify_change(&init_user_ns, dentry, iap, NULL);
>
> out_unlock:
> - fh_unlock(fhp);
> + inode_unlock(inode);
> if (size_change)
> put_write_access(inode);
> out:
> @@ -2123,12 +2122,8 @@ nfsd_listxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char **bufp,
> }
>
> /*
> - * Removexattr and setxattr need to call fh_lock to both lock the inode
> - * and set the change attribute. Since the top-level vfs_removexattr
> - * and vfs_setxattr calls already do their own inode_lock calls, call
> - * the _locked variant. Pass in a NULL pointer for delegated_inode,
> - * and let the client deal with NFS4ERR_DELAY (same as with e.g.
> - * setattr and remove).
> + * Pass in a NULL pointer for delegated_inode, and let the client deal
> + * with NFS4ERR_DELAY (same as with e.g. setattr and remove).
> */
> __be32
> nfsd_removexattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name)
> @@ -2144,12 +2139,14 @@ nfsd_removexattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name)
> if (ret)
> return nfserrno(ret);
>
> - fh_lock(fhp);
> + inode_lock(fhp->fh_dentry->d_inode);
> + fh_fill_pre_attrs(fhp, true);
>
> ret = __vfs_removexattr_locked(&init_user_ns, fhp->fh_dentry,
> name, NULL);
>
> - fh_unlock(fhp);
> + fh_fill_post_attrs(fhp);
> + inode_unlock(fhp->fh_dentry->d_inode);
> fh_drop_write(fhp);
>
> return nfsd_xattr_errno(ret);
> @@ -2169,12 +2166,13 @@ nfsd_setxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
> ret = fh_want_write(fhp);
> if (ret)
> return nfserrno(ret);
> - fh_lock(fhp);
> + inode_lock(fhp->fh_dentry->d_inode);
> + fh_fill_pre_attrs(fhp, true);
>
> ret = __vfs_setxattr_locked(&init_user_ns, fhp->fh_dentry, name, buf,
> len, flags, NULL);
> -
> - fh_unlock(fhp);
> + fh_fill_post_attrs(fhp);
> + inode_unlock(fhp->fh_dentry->d_inode);
> fh_drop_write(fhp);
>
> return nfsd_xattr_errno(ret);
>
>

--
Chuck Lever



2022-06-24 14:49:44

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 12/12] nfsd: discard fh_locked flag and fh_lock/fh_unlock



> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>
> fh_lock() and fh_unlock() are no longer used, so discard them.
> They are the only real users of ->fh_locked, so discard that too.
>
> Signed-off-by: NeilBrown <[email protected]>

Reviewed-by: Chuck Lever <[email protected]>


> ---
> fs/nfsd/nfsfh.c | 2 +-
> fs/nfsd/nfsfh.h | 48 ++++--------------------------------------------
> fs/nfsd/vfs.c | 4 ----
> 3 files changed, 5 insertions(+), 49 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index ae270e4f921f..a3dbe9f34c0e 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -548,7 +548,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
> if (ref_fh == fhp)
> fh_put(ref_fh);
>
> - if (fhp->fh_locked || fhp->fh_dentry) {
> + if (fhp->fh_dentry) {
> printk(KERN_ERR "fh_compose: fh %pd2 not initialized!\n",
> dentry);
> }
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index c5061cdb1016..559912b1d794 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -81,7 +81,6 @@ typedef struct svc_fh {
> struct dentry * fh_dentry; /* validated dentry */
> struct svc_export * fh_export; /* export pointer */
>
> - bool fh_locked; /* inode locked by us */
> bool fh_want_write; /* remount protection taken */
> bool fh_no_wcc; /* no wcc data needed */
> bool fh_no_atomic_attr;
> @@ -93,7 +92,7 @@ typedef struct svc_fh {
> bool fh_post_saved; /* post-op attrs saved */
> bool fh_pre_saved; /* pre-op attrs saved */
>
> - /* Pre-op attributes saved during fh_lock */
> + /* Pre-op attributes saved when inode exclusively locked */
> __u64 fh_pre_size; /* size before operation */
> struct timespec64 fh_pre_mtime; /* mtime before oper */
> struct timespec64 fh_pre_ctime; /* ctime before oper */
> @@ -103,7 +102,7 @@ typedef struct svc_fh {
> */
> u64 fh_pre_change;
>
> - /* Post-op attributes saved in fh_unlock */
> + /* Post-op attributes saved in fh_fill_post_attrs() */
> struct kstat fh_post_attr; /* full attrs after operation */
> u64 fh_post_change; /* nfsv4 change; see above */
> } svc_fh;
> @@ -223,8 +222,8 @@ void fh_put(struct svc_fh *);
> static __inline__ struct svc_fh *
> fh_copy(struct svc_fh *dst, struct svc_fh *src)
> {
> - WARN_ON(src->fh_dentry || src->fh_locked);
> -
> + WARN_ON(src->fh_dentry);
> +
> *dst = *src;
> return dst;
> }
> @@ -323,43 +322,4 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
> extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
>
> -static inline void
> -fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
> -{
> - struct dentry *dentry = fhp->fh_dentry;
> - struct inode *inode;
> -
> - BUG_ON(!dentry);
> -
> - if (fhp->fh_locked) {
> - printk(KERN_WARNING "fh_lock: %pd2 already locked!\n",
> - dentry);
> - return;
> - }
> -
> - inode = d_inode(dentry);
> - inode_lock_nested(inode, subclass);
> - fh_fill_pre_attrs(fhp, true);
> - fhp->fh_locked = true;
> -}
> -
> -static inline void
> -fh_lock(struct svc_fh *fhp)
> -{
> - fh_lock_nested(fhp, I_MUTEX_NORMAL);
> -}
> -
> -/*
> - * Unlock a file handle/inode
> - */
> -static inline void
> -fh_unlock(struct svc_fh *fhp)
> -{
> - if (fhp->fh_locked) {
> - fh_fill_post_attrs(fhp);
> - inode_unlock(d_inode(fhp->fh_dentry));
> - fhp->fh_locked = false;
> - }
> -}
> -
> #endif /* _LINUX_NFSD_NFSFH_H */
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f2f4868618bb..0e07b19a0289 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1623,14 +1623,11 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> goto out;
> }
>
> - /* cannot use fh_lock as we need deadlock protective ordering
> - * so do it by hand */
> trap = lock_rename_lookup_one(tdentry, fdentry, &ndentry, &odentry,
> tname, tlen, fname, flen, 0, 0, &wq);
> host_err = PTR_ERR(trap);
> if (IS_ERR(trap))
> goto out_nfserr;
> - ffhp->fh_locked = tfhp->fh_locked = true;
> fh_fill_pre_attrs(ffhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
> fh_fill_pre_attrs(tfhp, (ndentry->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> @@ -1678,7 +1675,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> }
> out_unlock:
> unlock_rename_lookup(tdentry, fdentry, ndentry, odentry);
> - ffhp->fh_locked = tfhp->fh_locked = false;
> out_nfserr:
> err = nfserrno(host_err);
> fh_drop_write(ffhp);
>
>

--
Chuck Lever



2022-06-28 22:50:12

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 08/12] nfsd: allow parallel creates from nfsd



> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>
> Rather than getting write access, locking the directory, and performing
> a lookup, use
> filename_create_one_len() for create, or
> lookup_hash_update_len() for delete
> which combines all these steps and handles shared locking for concurrent
> updates where appropriate.
>
> As we don't use fh_lock() we need to call fh_fill_pre_attrs()
> explicitly. However if we only get a shared lock, then the pre/post
> attributes won't be atomic, so we need to ensure that fh know that,
> and doesn't try to encode wcc attrs either.
>
> Note that there is only one filesystem that allows shared locks for
> create/unlink and that is NFS (for re-export). NFS does support atomic
> pre/post attributes, so there is no loss in not providing them.
>
> When some other filesystem supports concurrent updates, we might need to
> consider if the pre/post attributes are more important than the
> parallelism.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/nfs3proc.c | 31 +++++-------
> fs/nfsd/nfs4proc.c | 32 +++++-------
> fs/nfsd/nfsfh.c | 7 ++-
> fs/nfsd/nfsfh.h | 4 +-
> fs/nfsd/nfsproc.c | 29 +++++------
> fs/nfsd/vfs.c | 134 +++++++++++++++++++++++-----------------------------
> 6 files changed, 105 insertions(+), 132 deletions(-)
>
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index 981a3a7a6e16..0fdbb9504a87 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -231,12 +231,14 @@ static __be32
> nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct svc_fh *resfhp, struct nfsd3_createargs *argp)
> {
> + struct path path;
> struct iattr *iap = &argp->attrs;
> - struct dentry *parent, *child;
> + struct dentry *child;
> __u32 v_mtime, v_atime;
> struct inode *inode;
> __be32 status;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> if (isdotent(argp->name, argp->len))
> return nfserr_exist;
> @@ -247,20 +249,15 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (status != nfs_ok)
> return status;
>
> - parent = fhp->fh_dentry;
> - inode = d_inode(parent);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + inode = d_inode(path.dentry);
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + child = filename_create_one_len(argp->name, argp->len,
> + &path, 0, &wq);
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> -
> - child = lookup_one_len(argp->name, parent, argp->len);
> - if (IS_ERR(child)) {
> - status = nfserrno(PTR_ERR(child));
> - goto out;
> - }
> + if (IS_ERR(child))
> + return nfserrno(PTR_ERR(child));
>
> if (d_really_is_negative(child)) {
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> @@ -311,6 +308,7 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> if (!IS_POSIXACL(inode))
> iap->ia_mode &= ~current_umask();
> + fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> host_err = vfs_create(&init_user_ns, inode, child, iap->ia_mode, true);
> if (host_err < 0) {
> @@ -332,12 +330,9 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> set_attr:
> status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
> -
> + fh_fill_post_attrs(fhp);
> out:
> - fh_unlock(fhp);
> - if (child && !IS_ERR(child))
> - dput(child);
> - fh_drop_write(fhp);
> + done_path_update(&path, child, &wq);
> return status;
> }
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 3895eb52d2b1..71a4b8ef77f0 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -285,12 +285,13 @@ static __be32
> nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct svc_fh *resfhp, struct nfsd4_open *open)
> {
> + struct path path;
> struct iattr *iap = &open->op_iattr;
> - struct dentry *parent, *child;
> + struct dentry *child;
> __u32 v_mtime, v_atime;
> struct inode *inode;
> __be32 status;
> - int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> if (isdotent(open->op_fname, open->op_fnamelen))
> return nfserr_exist;
> @@ -300,20 +301,17 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_EXEC);
> if (status != nfs_ok)
> return status;
> - parent = fhp->fh_dentry;
> - inode = d_inode(parent);
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + inode = d_inode(path.dentry);
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> + child = filename_create_one_len(open->op_fname,
> + open->op_fnamelen,
> + &path, 0, &wq);
>
> - child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
> - if (IS_ERR(child)) {
> - status = nfserrno(PTR_ERR(child));
> - goto out;
> - }
> + if (IS_ERR(child))
> + return nfserrno(PTR_ERR(child));
>
> if (d_really_is_negative(child)) {
> status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
> @@ -386,6 +384,7 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (!IS_POSIXACL(inode))
> iap->ia_mode &= ~current_umask();
>
> + fh_fill_pre_attrs(fhp, (child->d_flags & DCACHE_PAR_UPDATE) == 0);
> status = nfsd4_vfs_create(fhp, child, open);
> if (status != nfs_ok)
> goto out;
> @@ -405,12 +404,9 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> set_attr:
> status = nfsd_create_setattr(rqstp, fhp, resfhp, iap);
> -
> + fh_fill_post_attrs(fhp);
> out:
> - fh_unlock(fhp);
> - if (child && !IS_ERR(child))
> - dput(child);
> - fh_drop_write(fhp);
> + done_path_update(&path, child, &wq);
> return status;
> }
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index c29baa03dfaf..a50db688c60d 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -616,7 +616,7 @@ fh_update(struct svc_fh *fhp)
> * @fhp: file handle to be updated
> *
> */
> -void fh_fill_pre_attrs(struct svc_fh *fhp)
> +void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic)

Hi Neil, just noticed this:

CC [M] fs/nfsd/nfsfh.o
CHECK /home/cel/src/linux/linux/fs/nfsd/nfsfh.c
/home/cel/src/linux/linux/fs/nfsd/nfsfh.c:621: warning: Function parameter or member 'atomic' not described in 'fh_fill_pre_attrs'

And... do you intend to repost this series with the supplemental
fixes applied?

Should we come up with a plan to merge these during the next
window, or do you feel more work is needed?


> {
> bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
> struct inode *inode;
> @@ -626,6 +626,11 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> if (fhp->fh_no_wcc || fhp->fh_pre_saved)
> return;
>
> + if (!atomic) {
> + fhp->fh_no_atomic_attr = true;
> + fhp->fh_no_wcc = true;
> + }
> +
> inode = d_inode(fhp->fh_dentry);
> err = fh_getattr(fhp, &stat);
> if (err) {
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index fb9d358a267e..ecc57fd3fd67 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -320,7 +320,7 @@ static inline u64 nfsd4_change_attribute(struct kstat *stat,
> return time_to_chattr(&stat->ctime);
> }
>
> -extern void fh_fill_pre_attrs(struct svc_fh *fhp);
> +extern void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
>
>
> @@ -347,7 +347,7 @@ fh_lock_nested(struct svc_fh *fhp, unsigned int subclass)
>
> inode = d_inode(dentry);
> inode_lock_nested(inode, subclass);
> - fh_fill_pre_attrs(fhp);
> + fh_fill_pre_attrs(fhp, true);
> fhp->fh_locked = true;
> }
>
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index fcdab8a8a41f..2dccf77634e8 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -255,6 +255,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
> static __be32
> nfsd_proc_create(struct svc_rqst *rqstp)
> {
> + struct path path;
> struct nfsd_createargs *argp = rqstp->rq_argp;
> struct nfsd_diropres *resp = rqstp->rq_resp;
> svc_fh *dirfhp = &argp->fh;
> @@ -263,8 +264,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> struct inode *inode;
> struct dentry *dchild;
> int type, mode;
> - int hosterr;
> dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> dprintk("nfsd: CREATE %s %.*s\n",
> SVCFH_fmt(dirfhp), argp->len, argp->name);
> @@ -279,17 +280,13 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> resp->status = nfserr_exist;
> if (isdotent(argp->name, argp->len))
> goto done;
> - hosterr = fh_want_write(dirfhp);
> - if (hosterr) {
> - resp->status = nfserrno(hosterr);
> - goto done;
> - }
>
> - fh_lock_nested(dirfhp, I_MUTEX_PARENT);
> - dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
> + path.dentry = dirfhp->fh_dentry;
> + path.mnt = dirfhp->fh_export->ex_path.mnt;
> + dchild = filename_create_one_len(argp->name, argp->len, &path, 0, &wq);
> if (IS_ERR(dchild)) {
> resp->status = nfserrno(PTR_ERR(dchild));
> - goto out_unlock;
> + goto out_done;
> }
> fh_init(newfhp, NFS_FHSIZE);
> resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
> @@ -298,7 +295,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> dput(dchild);
> if (resp->status) {
> if (resp->status != nfserr_noent)
> - goto out_unlock;
> + goto out_done;
> /*
> * If the new file handle wasn't verified, we can't tell
> * whether the file exists or not. Time to bail ...
> @@ -307,7 +304,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> if (!newfhp->fh_dentry) {
> printk(KERN_WARNING
> "nfsd_proc_create: file handle not verified\n");
> - goto out_unlock;
> + goto out_done;
> }
> }
>
> @@ -341,7 +338,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> newfhp->fh_dentry,
> NFSD_MAY_WRITE|NFSD_MAY_LOCAL_ACCESS);
> if (resp->status && resp->status != nfserr_rofs)
> - goto out_unlock;
> + goto out_done;
> }
> } else
> type = S_IFREG;
> @@ -378,7 +375,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> /* Make sure the type and device matches */
> resp->status = nfserr_exist;
> if (inode && inode_wrong_type(inode, type))
> - goto out_unlock;
> + goto out_done;
> }
>
> resp->status = nfs_ok;
> @@ -400,10 +397,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
> (time64_t)0);
> }
>
> -out_unlock:
> - /* We don't really need to unlock, as fh_put does it. */
> - fh_unlock(dirfhp);
> - fh_drop_write(dirfhp);
> +out_done:
> + done_path_update(&path, dchild, &wq);
> done:
> fh_put(dirfhp);
> if (resp->status != nfs_ok)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 840e3af63a6f..6cdd5e407600 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1274,12 +1274,6 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
> dirp = d_inode(dentry);
>
> dchild = dget(resfhp->fh_dentry);
> - if (!fhp->fh_locked) {
> - WARN_ONCE(1, "nfsd_create: parent %pd2 not locked!\n",
> - dentry);
> - err = nfserr_io;
> - goto out;
> - }
>
> err = nfsd_permission(rqstp, fhp->fh_export, dentry, NFSD_MAY_CREATE);
> if (err)
> @@ -1362,9 +1356,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> char *fname, int flen, struct iattr *iap,
> int type, dev_t rdev, struct svc_fh *resfhp)
> {
> - struct dentry *dentry, *dchild = NULL;
> + struct path path;
> + struct dentry *dchild = NULL;
> __be32 err;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> if (isdotent(fname, flen))
> return nfserr_exist;
> @@ -1373,27 +1369,22 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (err)
> return err;
>
> - dentry = fhp->fh_dentry;
> -
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - return nfserrno(host_err);
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> - dchild = lookup_one_len(fname, dentry, flen);
> + dchild = filename_create_one_len(fname, flen, &path, 0, &wq);
> host_err = PTR_ERR(dchild);
> if (IS_ERR(dchild))
> return nfserrno(host_err);
> err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
> - /*
> - * We unconditionally drop our ref to dchild as fh_compose will have
> - * already grabbed its own ref for it.
> - */
> - dput(dchild);
> - if (err)
> - return err;
> - return nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
> - rdev, resfhp);
> + if (!err) {
> + fh_fill_pre_attrs(fhp, (dchild->d_flags & DCACHE_PAR_UPDATE) == 0);
> + err = nfsd_create_locked(rqstp, fhp, fname, flen, iap, type,
> + rdev, resfhp);
> + fh_fill_post_attrs(fhp);
> + }
> + done_path_update(&path, dchild, &wq);
> + return err;
> }
>
> /*
> @@ -1441,15 +1432,17 @@ nfsd_readlink(struct svc_rqst *rqstp, struct svc_fh *fhp, char *buf, int *lenp)
> __be32
> nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> char *fname, int flen,
> - char *path,
> + char *lpath,
> struct svc_fh *resfhp)
> {
> - struct dentry *dentry, *dnew;
> + struct path path;
> + struct dentry *dnew;
> __be32 err, cerr;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> err = nfserr_noent;
> - if (!flen || path[0] == '\0')
> + if (!flen || lpath[0] == '\0')
> goto out;
> err = nfserr_exist;
> if (isdotent(fname, flen))
> @@ -1459,28 +1452,28 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (err)
> goto out;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> + path.dentry = fhp->fh_dentry;
> + path.mnt = fhp->fh_export->ex_path.mnt;
>
> - fh_lock(fhp);
> - dentry = fhp->fh_dentry;
> - dnew = lookup_one_len(fname, dentry, flen);
> + dnew = filename_create_one_len(fname, flen, &path, 0, &wq);
> host_err = PTR_ERR(dnew);
> if (IS_ERR(dnew))
> goto out_nfserr;
>
> - host_err = vfs_symlink(&init_user_ns, d_inode(dentry), dnew, path);
> + fh_fill_pre_attrs(fhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
> + host_err = vfs_symlink(mnt_user_ns(path.mnt), d_inode(path.dentry),
> + dnew, lpath);
> err = nfserrno(host_err);
> - fh_unlock(fhp);
> if (!err)
> err = nfserrno(commit_metadata(fhp));
>
> - fh_drop_write(fhp);
> + fh_fill_post_attrs(fhp);
>
> cerr = fh_compose(resfhp, fhp->fh_export, dnew, fhp);
> - dput(dnew);
> - if (err==0) err = cerr;
> + if (err==0)
> + err = cerr;
> +
> + done_path_update(&path, dnew, &wq);
> out:
> return err;
>
> @@ -1497,10 +1490,12 @@ __be32
> nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> char *name, int len, struct svc_fh *tfhp)
> {
> - struct dentry *ddir, *dnew, *dold;
> + struct path path;
> + struct dentry *dold, *dnew;
> struct inode *dirp;
> __be32 err;
> int host_err;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_CREATE);
> if (err)
> @@ -1518,17 +1513,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> if (isdotent(name, len))
> goto out;
>
> - host_err = fh_want_write(tfhp);
> - if (host_err) {
> - err = nfserrno(host_err);
> - goto out;
> - }
> -
> - fh_lock_nested(ffhp, I_MUTEX_PARENT);
> - ddir = ffhp->fh_dentry;
> - dirp = d_inode(ddir);
> + path.dentry = ffhp->fh_dentry;
> + path.mnt = ffhp->fh_export->ex_path.mnt;
> + dirp = d_inode(path.dentry);
>
> - dnew = lookup_one_len(name, ddir, len);
> + dnew = filename_create_one_len(name, len, &path, 0, &wq);
> host_err = PTR_ERR(dnew);
> if (IS_ERR(dnew))
> goto out_nfserr;
> @@ -1537,9 +1526,10 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
>
> err = nfserr_noent;
> if (d_really_is_negative(dold))
> - goto out_dput;
> + goto out_done;
> + fh_fill_pre_attrs(ffhp, (dnew->d_flags & DCACHE_PAR_UPDATE) == 0);
> host_err = vfs_link(dold, &init_user_ns, dirp, dnew, NULL);
> - fh_unlock(ffhp);
> +
> if (!host_err) {
> err = nfserrno(commit_metadata(ffhp));
> if (!err)
> @@ -1550,17 +1540,15 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> else
> err = nfserrno(host_err);
> }
> -out_dput:
> - dput(dnew);
> -out_unlock:
> - fh_unlock(ffhp);
> - fh_drop_write(tfhp);
> +out_done:
> + fh_fill_post_attrs(ffhp);
> + done_path_update(&path, dnew, &wq);
> out:
> return err;
>
> out_nfserr:
> err = nfserrno(host_err);
> - goto out_unlock;
> + goto out;
> }
>
> static void
> @@ -1625,8 +1613,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> * so do it by hand */
> trap = lock_rename(tdentry, fdentry);
> ffhp->fh_locked = tfhp->fh_locked = true;
> - fh_fill_pre_attrs(ffhp);
> - fh_fill_pre_attrs(tfhp);
> + fh_fill_pre_attrs(ffhp, true);
> + fh_fill_pre_attrs(tfhp, true);
>
> odentry = lookup_one_len(fname, fdentry, flen);
> host_err = PTR_ERR(odentry);
> @@ -1717,11 +1705,13 @@ __be32
> nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> char *fname, int flen)
> {
> - struct dentry *dentry, *rdentry;
> + struct dentry *rdentry;
> struct inode *dirp;
> struct inode *rinode;
> __be32 err;
> int host_err;
> + struct path path;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>
> err = nfserr_acces;
> if (!flen || isdotent(fname, flen))
> @@ -1730,24 +1720,18 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> if (err)
> goto out;
>
> - host_err = fh_want_write(fhp);
> - if (host_err)
> - goto out_nfserr;
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + path.dentry = fhp->fh_dentry;
>
> - fh_lock_nested(fhp, I_MUTEX_PARENT);
> - dentry = fhp->fh_dentry;
> - dirp = d_inode(dentry);
> + rdentry = lookup_hash_update_len(fname, flen, &path, 0, &wq);
> + dirp = d_inode(path.dentry);
>
> - rdentry = lookup_one_len(fname, dentry, flen);
> host_err = PTR_ERR(rdentry);
> if (IS_ERR(rdentry))
> - goto out_drop_write;
> + goto out_nfserr;
> +
> + fh_fill_pre_attrs(fhp, (rdentry->d_flags & DCACHE_PAR_UPDATE) == 0);
>
> - if (d_really_is_negative(rdentry)) {
> - dput(rdentry);
> - host_err = -ENOENT;
> - goto out_drop_write;
> - }
> rinode = d_inode(rdentry);
> ihold(rinode);
>
> @@ -1761,15 +1745,13 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> } else {
> host_err = vfs_rmdir(&init_user_ns, dirp, rdentry);
> }
> + fh_fill_post_attrs(fhp);
>
> - fh_unlock(fhp);
> + done_path_update(&path, rdentry, &wq);
> if (!host_err)
> host_err = commit_metadata(fhp);
> - dput(rdentry);
> iput(rinode); /* truncate the inode here */
>
> -out_drop_write:
> - fh_drop_write(fhp);
> out_nfserr:
> if (host_err == -EBUSY) {
> /* name is mounted-on. There is no perfect
>
>

--
Chuck Lever



2022-06-28 23:21:40

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 08/12] nfsd: allow parallel creates from nfsd

On Wed, 29 Jun 2022, Chuck Lever III wrote:
>
> > On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
> > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > index c29baa03dfaf..a50db688c60d 100644
> > --- a/fs/nfsd/nfsfh.c
> > +++ b/fs/nfsd/nfsfh.c
> > @@ -616,7 +616,7 @@ fh_update(struct svc_fh *fhp)
> > * @fhp: file handle to be updated
> > *
> > */
> > -void fh_fill_pre_attrs(struct svc_fh *fhp)
> > +void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic)
>
> Hi Neil, just noticed this:
>
> CC [M] fs/nfsd/nfsfh.o
> CHECK /home/cel/src/linux/linux/fs/nfsd/nfsfh.c
> /home/cel/src/linux/linux/fs/nfsd/nfsfh.c:621: warning: Function parameter or member 'atomic' not described in 'fh_fill_pre_attrs'

Thanks. I"ll address that, and also the other issues that you raised in
your patch-by-patch review. Thanks for those.

>
> And... do you intend to repost this series with the supplemental
> fixes applied?

I've been a bit distracted this week, but my current plan is to
reorganise the patches to put as many of the NFS and NFSD patches as
possible before the VFS patches. There should then just be one each for
NFS and NFSD after the VFS changes. I hope to post that series early
next week.

>
> Should we come up with a plan to merge these during the next
> window, or do you feel more work is needed?

I think it would be reasonable to merge the preliminary NFS and NFSD
patches in the next window. I'd really like to hear from Al before
pushing the rest too hard. Probably after rc1 I'll post the remainder
of the series and include Linus if we haven't heard from Al. I'd be
perfectly happy if the main content of the series landed in the
subsequent merge window.

Thanks,
NeilBrown

2022-07-04 17:19:44

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 08/12] nfsd: allow parallel creates from nfsd



> On Jun 28, 2022, at 7:09 PM, NeilBrown <[email protected]> wrote:
>
> On Wed, 29 Jun 2022, Chuck Lever III wrote:
>>
>>> On Jun 13, 2022, at 7:18 PM, NeilBrown <[email protected]> wrote:
>>> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
>>> index c29baa03dfaf..a50db688c60d 100644
>>> --- a/fs/nfsd/nfsfh.c
>>> +++ b/fs/nfsd/nfsfh.c
>>> @@ -616,7 +616,7 @@ fh_update(struct svc_fh *fhp)
>>> * @fhp: file handle to be updated
>>> *
>>> */
>>> -void fh_fill_pre_attrs(struct svc_fh *fhp)
>>> +void fh_fill_pre_attrs(struct svc_fh *fhp, bool atomic)
>>
>> Hi Neil, just noticed this:
>>
>> CC [M] fs/nfsd/nfsfh.o
>> CHECK /home/cel/src/linux/linux/fs/nfsd/nfsfh.c
>> /home/cel/src/linux/linux/fs/nfsd/nfsfh.c:621: warning: Function parameter or member 'atomic' not described in 'fh_fill_pre_attrs'
>
> Thanks. I"ll address that, and also the other issues that you raised in
> your patch-by-patch review. Thanks for those.
>
>>
>> And... do you intend to repost this series with the supplemental
>> fixes applied?
>
> I've been a bit distracted this week, but my current plan is to
> reorganise the patches to put as many of the NFS and NFSD patches as
> possible before the VFS patches. There should then just be one each for
> NFS and NFSD after the VFS changes. I hope to post that series early
> next week.
>
>>
>> Should we come up with a plan to merge these during the next
>> window, or do you feel more work is needed?
>
> I think it would be reasonable to merge the preliminary NFS and NFSD
> patches in the next window. I'd really like to hear from Al before
> pushing the rest too hard. Probably after rc1 I'll post the remainder
> of the series and include Linus if we haven't heard from Al. I'd be
> perfectly happy if the main content of the series landed in the
> subsequent merge window.

Please base your patches on the "for-next" branch in this repo:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

Thanks!


--
Chuck Lever