2022-08-26 02:20:27

by NeilBrown

Subject: [PATCH/RFC 00/10 v5] Improve scalability of directory operations

[I made up "v5" - I haven't been counting]

VFS currently holds an exclusive lock on the directory while making
changes: add, remove, rename.
When multiple threads make changes in the one directory, the contention
can be noticeable.
In the case of NFS with a high latency link, this can easily be
demonstrated. NFS doesn't really need VFS locking as the server ensures
correctness.

Lustre uses a single(?) directory for object storage, and has patches
for ext4 to support concurrent updates (Lustre accesses ext4 directly,
not via the VFS).

XFS (it is claimed) does its own locking and doesn't need the VFS to
help at all.

This patch series allows filesystems to request a shared lock on
directories and provides serialisation on just the affected name, not the
whole directory. It changes both the VFS and NFSD to use shared locks
when appropriate, and changes NFS to request shared locks.

The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
which acts as a bit-lock. The ->d_lock spinlock is taken to set/clear
it, and wait_var_event() is used for waiting. This flag is set on all
dentries that are part of a directory update, not just when a shared
lock is taken.

When a shared lock is taken we must use d_alloc_parallel(), which needs
a wq that must remain available until the update is completed. To make
use of concurrent create, kern_path_create() would need to be passed a
wq. Rather than the churn required for that, we use exclusive locking
when no wq is provided.

One interesting consequence of this is that silly-rename becomes a
little more complex. As the directory may not be exclusively locked,
the new silly-name needs to be locked (DCACHE_PAR_UPDATE) as well.
A new LOOKUP_SILLY_RENAME is added which helps implement this using
common code.

While testing I found some odd behaviour that was caused by
d_revalidate() racing with rename(). To resolve this I used
DCACHE_PAR_UPDATE to ensure they cannot race any more.

Testing, review, or other comments would be most welcome,

NeilBrown


---

NeilBrown (10):
VFS: support parallel updates in the one directory.
VFS: move EEXIST and ENOENT tests into lookup_hash_update()
VFS: move want_write checks into lookup_hash_update()
VFS: move dput() and mnt_drop_write() into done_path_update()
VFS: export done_path_update()
VFS: support concurrent renames.
VFS: hold DCACHE_PAR_UPDATE lock across d_revalidate()
NFSD: allow parallel creates from nfsd
VFS: add LOOKUP_SILLY_RENAME
NFS: support parallel updates in the one directory.


fs/dcache.c | 72 ++++-
fs/namei.c | 616 ++++++++++++++++++++++++++++++++---------
fs/nfs/dir.c | 28 +-
fs/nfs/fs_context.c | 6 +-
fs/nfs/internal.h | 3 +-
fs/nfs/unlink.c | 51 +++-
fs/nfsd/nfs3proc.c | 28 +-
fs/nfsd/nfs4proc.c | 29 +-
fs/nfsd/nfsfh.c | 9 +
fs/nfsd/nfsproc.c | 29 +-
fs/nfsd/vfs.c | 177 +++++-------
include/linux/dcache.h | 28 ++
include/linux/fs.h | 5 +-
include/linux/namei.h | 39 ++-
14 files changed, 799 insertions(+), 321 deletions(-)



2022-08-26 02:20:27

by NeilBrown

Subject: [PATCH 01/10] VFS: support parallel updates in the one directory.

From: NeilBrown <[email protected]>

Some filesystems can support parallel modifications to a directory,
either because the modifications happen on a remote server which does
its own locking (e.g. NFS) or because they can internally lock just a
part of a directory (e.g. many local filesystems, with a bit of work -
the Lustre project has patches for ext4 to support concurrent updates).

To allow this, we introduce VFS support for parallel modification:
unlink (including rmdir) and create. Parallel rename is added in a
later patch.

If a filesystem supports parallel modification in directories, it sets
FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new
lookup_hash_update() notice the flag and take a shared lock on the
directory, and rely on a lock-bit in d_flags, much like parallel lookup
relies on DCACHE_PAR_LOOKUP. The lock bit is always set even if an
exclusive lock was taken, though in that case it will always be
uncontended.
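
A filesystem opts in simply by setting the flag in its file_system_type.
As a rough illustration (the "examplefs" name and its callbacks are
hypothetical; only FS_PAR_DIR_UPDATE itself is added by this patch):

static struct file_system_type examplefs_fs_type = {
	.owner		 = THIS_MODULE,
	.name		 = "examplefs",
	.init_fs_context = examplefs_init_fs_context,
	.kill_sb	 = kill_anon_super,
	/* directory updates are serialised per-name internally,
	 * so the VFS may take only a shared lock on the directory.
	 */
	.fs_flags	 = FS_PAR_DIR_UPDATE,
};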

__lookup_hash() is enhanced to use d_alloc_parallel() if it was told
that a shared lock was taken - by being given a non-NULL *wq. This is
needed as the shared lock does not prevent races with lookups.

The new DCACHE_PAR_UPDATE is the most significant bit in d_flags, so we
have nearly run out of flags. 0x08000000 is still unused .... for now.

Once a dentry for the target name has been obtained, DCACHE_PAR_UPDATE
is set on it, waiting if necessary if it is already set. Once this is
set, the thread has exclusive access to the name and can call into the
filesystem to perform the required action.

We use wait_var_event() to wait for DCACHE_PAR_UPDATE to be cleared rather
than trying to overload the wq used for DCACHE_PAR_LOOKUP.

Some filesystems do *not* complete the lookup that precedes a create,
but leave the dentry d_in_lookup() and unhashed, so often a dentry will
have both DCACHE_PAR_LOOKUP and DCACHE_PAR_UPDATE set at the same time.
To allow for this, we need the 'wq' that is passed to d_alloc_parallel()
(and used when DCACHE_PAR_LOOKUP is cleared), to continue to exist until
the creation is complete, which may be after filename_create()
completes. So a wq now needs to be passed to filename_create() when
parallel modification is attempted.

Some callers, such as kern_path(), perform a lookup for create but don't
pass in a wq and changing all call sites to do so would be churn for
little gain. So if no wq is provided an exclusive lock is taken and
d_alloc() is used to allocate the dentry. A new done_path_create_wq()
can be told if a wq was provided for create, so the correct unlock can
be used.
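
The resulting caller pattern is sketched below, modelled on the
do_mknodat() changes later in this patch (the function name is
illustrative and error handling, security checks and ESTALE retry are
abbreviated):

static int example_mknod(int dfd, struct filename *name, umode_t mode,
			 unsigned int lookup_flags)
{
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct path path;
	struct dentry *dentry;
	int error;

	/* Passing &wq allows a shared lock on the parent when the
	 * filesystem sets FS_PAR_DIR_UPDATE; otherwise the lock is
	 * exclusive as before.
	 */
	dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);

	error = vfs_mknod(mnt_user_ns(path.mnt), path.dentry->d_inode,
			  dentry, mode, 0);

	/* 'true': a wq was provided, so a shared unlock may be needed */
	done_path_create_wq(&path, dentry, true);
	return error;
}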

Waiting for DCACHE_PAR_UPDATE can sometimes result in the dentry
changing - either through unlink or rename. This requires that the
lookup be retried. This in turn requires that the wq be reused. The
use of DECLARE_WAITQUEUE() in d_wait_lookup() and the omission of
remove_wait_queue() mean that after the wake_up_all() in
__d_lookup_done(), the wait_queue_head list will be undefined. So we
change d_wait_lookup() to use DEFINE_WAIT(), so that
autoremove_wake_function() is used for wakeups, and the wait_queue_head
remains well defined.

Signed-off-by: NeilBrown <[email protected]>
---
fs/dcache.c | 66 ++++++++++++
fs/namei.c | 268 ++++++++++++++++++++++++++++++++++++++----------
include/linux/dcache.h | 28 +++++
include/linux/fs.h | 5 +
include/linux/namei.h | 16 +++
5 files changed, 321 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index bb0c4d0038db..d6bfa49b143b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1765,6 +1765,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
struct dentry *dentry;
char *dname;
int err;
+ static struct lock_class_key __key;

dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru,
GFP_KERNEL);
@@ -1820,6 +1821,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_LIST_HEAD(&dentry->d_child);
d_set_d_op(dentry, dentry->d_sb->s_d_op);

+ lockdep_init_map(&dentry->d_update_map, "DENTRY_PAR_UPDATE", &__key, 0);
+
if (dentry->d_op && dentry->d_op->d_init) {
err = dentry->d_op->d_init(dentry);
if (err) {
@@ -2626,7 +2629,7 @@ static inline void end_dir_add(struct inode *dir, unsigned int n,
static void d_wait_lookup(struct dentry *dentry)
{
if (d_in_lookup(dentry)) {
- DECLARE_WAITQUEUE(wait, current);
+ DEFINE_WAIT(wait);
add_wait_queue(dentry->d_wait, &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
@@ -3274,6 +3277,67 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
}
EXPORT_SYMBOL(d_tmpfile);

+/**
+ * d_lock_update_nested - lock a dentry before updating
+ * @dentry: the dentry to be locked
+ * @base: the parent, or %NULL
+ * @name: the name in that parent, or %NULL
+ * @subclass: lockdep locking class.
+ *
+ * Lock a dentry in a directory on which a shared-lock may be held, and
+ * on which parallel updates are permitted.
+ * If the base and name are given, then on success the dentry will still
+ * have that base and name - it will not have raced with rename.
+ * On success, a positive dentry will still be hashed, ensuring there
+ * was no race with unlink.
+ * If they are not given, then on success the dentry will be negative,
+ * which again ensures no race with rename, or unlink.
+ *
+ * @subclass uses the I_MUTEX_ classes.
+ * If the parent directory is locked with I_MUTEX_NORMAL, use I_MUTEX_NORMAL.
+ * If the parent is locked with I_MUTEX_PARENT, I_MUTEX_PARENT2 or
+ * I_MUTEX_CHILD, use I_MUTEX_PARENT or, for the second in a rename,
+ * I_MUTEX_PARENT2.
+ */
+bool d_lock_update_nested(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name,
+ int subclass)
+{
+ bool ret = true;
+
+ lock_acquire_exclusive(&dentry->d_update_map, subclass,
+ 0, NULL, _THIS_IP_);
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_flags & DCACHE_PAR_UPDATE)
+ ___wait_var_event(&dentry->d_flags,
+ !(dentry->d_flags & DCACHE_PAR_UPDATE),
+ TASK_UNINTERRUPTIBLE, 0, 0,
+ (spin_unlock(&dentry->d_lock),
+ schedule(),
+ spin_lock(&dentry->d_lock))
+ );
+ rcu_read_lock(); /* for d_same_name() */
+ if (d_unhashed(dentry) && d_is_positive(dentry)) {
+ /* name was unlinked while we waited */
+ ret = false;
+ } else if (base &&
+ (dentry->d_parent != base ||
+ dentry->d_name.hash != name->hash ||
+ !d_same_name(dentry, base, name))) {
+ /* dentry was renamed - possibly silly-rename */
+ ret = false;
+ } else if (!base && d_is_positive(dentry)) {
+ ret = false;
+ } else {
+ dentry->d_flags |= DCACHE_PAR_UPDATE;
+ }
+ rcu_read_unlock();
+ spin_unlock(&dentry->d_lock);
+ if (!ret)
+ lock_map_release(&dentry->d_update_map);
+ return ret;
+}
+
static __initdata unsigned long dhash_entries;
static int __init set_dhash_entries(char *str)
{
diff --git a/fs/namei.c b/fs/namei.c
index 53b4bc094db2..c008dfd01e30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1574,14 +1574,14 @@ static struct dentry *lookup_dcache(const struct qstr *name,
}

/*
- * Parent directory has inode locked exclusive. This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked exclusive, or possibly shared if wq
+ * is given. In the latter case the fs has explicitly allowed concurrent
+ * updates in this directory. This is the one and only case
+ * when ->lookup() may be called on a non in-lookup dentry.
*/
static struct dentry *__lookup_hash(const struct qstr *name,
- struct dentry *base, unsigned int flags)
+ struct dentry *base, unsigned int flags,
+ wait_queue_head_t *wq)
{
struct dentry *dentry = lookup_dcache(name, base, flags);
struct dentry *old;
@@ -1594,18 +1594,103 @@ static struct dentry *__lookup_hash(const struct qstr *name,
if (unlikely(IS_DEADDIR(dir)))
return ERR_PTR(-ENOENT);

- dentry = d_alloc(base, name);
+ if (wq)
+ dentry = d_alloc_parallel(base, name, wq);
+ else
+ dentry = d_alloc(base, name);
if (unlikely(!dentry))
return ERR_PTR(-ENOMEM);
+ if (IS_ERR(dentry))
+ return dentry;
+
+ if (wq && !d_in_lookup(dentry))
+ /* Must have raced with another thread doing the lookup */
+ return dentry;

old = dir->i_op->lookup(dir, dentry, flags);
if (unlikely(old)) {
+ d_lookup_done(dentry);
dput(dentry);
dentry = old;
}
return dentry;
}

+/*
+ * Parent directory (base) is not locked. We take either an exclusive
+ * or shared lock depending on the fs preference, then do a lookup,
+ * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
+ * was taken on the parent.
+ */
+static struct dentry *lookup_hash_update(
+ const struct qstr *name,
+ struct dentry *base, unsigned int flags,
+ wait_queue_head_t *wq)
+{
+ struct dentry *dentry;
+ struct inode *dir = base->d_inode;
+ int err;
+
+ if (wq && IS_PAR_UPDATE(dir))
+ inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(dir, I_MUTEX_PARENT);
+
+retry:
+ dentry = __lookup_hash(name, base, flags, wq);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto out_err;
+ }
+ if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
+ /*
+ * Failed to get lock due to race with unlink or rename
+ * - try again
+ */
+ d_lookup_done(dentry);
+ dput(dentry);
+ goto retry;
+ }
+ return dentry;
+
+out_err:
+ if (wq && IS_PAR_UPDATE(dir))
+ inode_unlock_shared(dir);
+ else
+ inode_unlock(dir);
+ return ERR_PTR(err);
+}
+
+static int lookup_one_common(struct user_namespace *mnt_userns,
+ const char *name, struct dentry *base, int len,
+ struct qstr *this);
+
+struct dentry *lookup_hash_update_len(const char *name, int nlen,
+ struct path *path, unsigned int flags,
+ wait_queue_head_t *wq)
+{
+ struct qstr this;
+ int err = lookup_one_common(mnt_user_ns(path->mnt), name,
+ path->dentry, nlen, &this);
+ if (err)
+ return ERR_PTR(err);
+ return lookup_hash_update(&this, path->dentry, flags, wq);
+}
+EXPORT_SYMBOL(lookup_hash_update_len);
+
+static void done_path_update(struct path *path, struct dentry *dentry,
+ bool with_wq)
+{
+ struct inode *dir = path->dentry->d_inode;
+
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
+ if (IS_PAR_UPDATE(dir) && with_wq)
+ inode_unlock_shared(dir);
+ else
+ inode_unlock(dir);
+}
+
static struct dentry *lookup_fast(struct nameidata *nd)
{
struct dentry *dentry, *parent = nd->path.dentry;
@@ -2570,7 +2655,7 @@ static struct dentry *__kern_path_locked(struct filename *name, struct path *pat
return ERR_PTR(-EINVAL);
}
inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- d = __lookup_hash(&last, path->dentry, 0);
+ d = __lookup_hash(&last, path->dentry, 0, NULL);
if (IS_ERR(d)) {
inode_unlock(path->dentry->d_inode);
path_put(path);
@@ -3271,11 +3356,24 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
if (nd->flags & LOOKUP_DIRECTORY)
open_flag |= O_DIRECTORY;

+ /*
+ * dentry is negative or d_in_lookup(). In case this is a
+ * shared-lock create we need to set DCACHE_PAR_UPDATE to ensure
+ * exclusion.
+ */
+ if (open_flag & O_CREAT) {
+ if (!d_lock_update(dentry, NULL, NULL))
+ /* already exists, non-atomic open */
+ return dentry;
+ }
+
file->f_path.dentry = DENTRY_NOT_SET;
file->f_path.mnt = nd->path.mnt;
error = dir->i_op->atomic_open(dir, dentry, file,
open_to_namei_flags(open_flag), mode);
d_lookup_done(dentry);
+ if ((open_flag & O_CREAT))
+ d_unlock_update(dentry);
if (!error) {
if (file->f_mode & FMODE_OPENED) {
if (unlikely(dentry != file->f_path.dentry)) {
@@ -3327,6 +3425,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
int error, create_error = 0;
umode_t mode = op->mode;
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ bool have_par_update = false;

if (unlikely(IS_DEADDIR(dir_inode)))
return ERR_PTR(-ENOENT);
@@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
dentry = res;
}
}
+ /*
+ * If dentry is negative and this is a create we need to get
+ * DCACHE_PAR_UPDATE.
+ */
+ if ((open_flag & O_CREAT) && !dentry->d_inode)
+ have_par_update = d_lock_update(dentry, NULL, NULL);

/* Negative dentry, just create the file */
if (!dentry->d_inode && (open_flag & O_CREAT)) {
@@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
error = create_error;
goto out_dput;
}
+ if (have_par_update)
+ d_unlock_update(dentry);
return dentry;

out_dput:
+ if (have_par_update)
+ d_unlock_update(dentry);
dput(dentry);
return ERR_PTR(error);
}
@@ -3434,6 +3543,7 @@ static const char *open_last_lookups(struct nameidata *nd,
bool got_write = false;
struct dentry *dentry;
const char *res;
+ bool shared;

nd->flags |= op->intent;

@@ -3474,14 +3584,15 @@ static const char *open_last_lookups(struct nameidata *nd,
* dropping this one anyway.
*/
}
- if (open_flag & O_CREAT)
+ shared = !(open_flag & O_CREAT) || IS_PAR_UPDATE(dir->d_inode);
+ if (!shared)
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
dentry = lookup_open(nd, file, op, got_write);
if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
fsnotify_create(dir->d_inode, dentry);
- if (open_flag & O_CREAT)
+ if (!shared)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);
@@ -3750,41 +3861,29 @@ struct file *do_file_open_root(const struct path *root,
return file;
}

-static struct dentry *filename_create(int dfd, struct filename *name,
- struct path *path, unsigned int lookup_flags)
+static struct dentry *filename_create_one(struct qstr *last, struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq)
{
- struct dentry *dentry = ERR_PTR(-EEXIST);
- struct qstr last;
+ struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
- int type;
int err2;
int error;

- error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
- if (error)
- return ERR_PTR(error);
-
- /*
- * Yucky last component or no last component at all?
- * (foo/., foo/.., /////)
- */
- if (unlikely(type != LAST_NORM))
- goto out;
-
/* don't fail immediately if it's r/o, at least try to report other errors */
err2 = mnt_want_write(path->mnt);
/*
* Do the final lookup. Suppress 'create' if there is a trailing
* '/', and a directory wasn't requested.
*/
- if (last.name[last.len] && !want_dir)
+ if (last->name[last->len] && !want_dir)
create_flags = 0;
- inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path->dentry, reval_flag | create_flags);
+ dentry = lookup_hash_update(last, path->dentry,
+ reval_flag | create_flags, wq);
if (IS_ERR(dentry))
- goto unlock;
+ goto drop_write;

error = -EEXIST;
if (d_is_positive(dentry))
@@ -3806,14 +3905,56 @@ static struct dentry *filename_create(int dfd, struct filename *name,
}
return dentry;
fail:
+ done_path_update(path, dentry, !!wq);
dput(dentry);
dentry = ERR_PTR(error);
-unlock:
- inode_unlock(path->dentry->d_inode);
+drop_write:
if (!err2)
mnt_drop_write(path->mnt);
+ return dentry;
+}
+
+struct dentry *filename_create_one_len(const char *name, int nlen,
+ struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq)
+{
+ struct qstr this;
+ int err;
+
+ err = lookup_one_common(mnt_user_ns(path->mnt), name,
+ path->dentry, nlen, &this);
+ if (err)
+ return ERR_PTR(err);
+ return filename_create_one(&this, path, lookup_flags, wq);
+}
+EXPORT_SYMBOL(filename_create_one_len);
+
+static struct dentry *filename_create(int dfd, struct filename *name,
+ struct path *path, unsigned int lookup_flags,
+ wait_queue_head_t *wq)
+{
+ struct dentry *dentry = ERR_PTR(-EEXIST);
+ struct qstr last;
+ unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
+ int type;
+ int error;
+
+ error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
+ if (error)
+ return ERR_PTR(error);
+
+ /*
+ * Yucky last component or no last component at all?
+ * (foo/., foo/.., /////)
+ */
+ if (unlikely(type != LAST_NORM))
+ goto out;
+
+ dentry = filename_create_one(&last, path, lookup_flags, wq);
out:
- path_put(path);
+ if (IS_ERR(dentry))
+ path_put(path);
return dentry;
}

@@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
struct path *path, unsigned int lookup_flags)
{
struct filename *filename = getname_kernel(pathname);
- struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+ struct dentry *res = filename_create(dfd, filename, path, lookup_flags,
+ NULL);

putname(filename);
return res;
}
EXPORT_SYMBOL(kern_path_create);

-void done_path_create(struct path *path, struct dentry *dentry)
+void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
{
+ done_path_update(path, dentry, with_wq);
dput(dentry);
- inode_unlock(path->dentry->d_inode);
mnt_drop_write(path->mnt);
path_put(path);
}
-EXPORT_SYMBOL(done_path_create);
+EXPORT_SYMBOL(done_path_create_wq);

inline struct dentry *user_path_create(int dfd, const char __user *pathname,
- struct path *path, unsigned int lookup_flags)
+ struct path *path, unsigned int lookup_flags)
{
struct filename *filename = getname(pathname);
- struct dentry *res = filename_create(dfd, filename, path, lookup_flags);
+ struct dentry *res = filename_create(dfd, filename, path, lookup_flags, NULL);

putname(filename);
return res;
@@ -3921,12 +4063,13 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
struct path path;
int error;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

error = may_mknod(mode);
if (error)
goto out1;
retry:
- dentry = filename_create(dfd, name, &path, lookup_flags);
+ dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out1;
@@ -3954,7 +4097,7 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
break;
}
out2:
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, true);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -4023,9 +4166,10 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_DIRECTORY;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

retry:
- dentry = filename_create(dfd, name, &path, lookup_flags);
+ dentry = filename_create(dfd, name, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putname;
@@ -4038,7 +4182,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
mode);
}
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, true);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -4122,6 +4266,7 @@ int do_rmdir(int dfd, struct filename *name)
struct qstr last;
int type;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
if (error)
@@ -4143,8 +4288,7 @@ int do_rmdir(int dfd, struct filename *name)
if (error)
goto exit2;

- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+ dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit3;
@@ -4158,9 +4302,9 @@ int do_rmdir(int dfd, struct filename *name)
mnt_userns = mnt_user_ns(path.mnt);
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
exit4:
+ done_path_update(&path, dentry, true);
dput(dentry);
exit3:
- inode_unlock(path.dentry->d_inode);
mnt_drop_write(path.mnt);
exit2:
path_put(&path);
@@ -4185,13 +4329,14 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
* @dentry: victim
* @delegated_inode: returns victim inode, if the inode is delegated.
*
- * The caller must hold dir->i_mutex.
+ * The caller must either hold a write-lock on dir->i_rwsem, or
+ * have atomically set DCACHE_PAR_UPDATE, or both.
*
* If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
* return a reference to the inode in delegated_inode. The caller
* should then break the delegation on that inode and retry. Because
* breaking a delegation may take a long time, the caller should drop
- * dir->i_mutex before doing so.
+ * dir->i_rwsem before doing so.
*
* Alternatively, a caller may pass NULL for delegated_inode. This may
* be appropriate for callers that expect the underlying filesystem not
@@ -4250,9 +4395,11 @@ EXPORT_SYMBOL(vfs_unlink);

/*
* Make sure that the actual truncation of the file will occur outside its
- * directory's i_mutex. Truncate can take a long time if there is a lot of
+ * directory's i_rwsem. Truncate can take a long time if there is a lot of
* writeout happening, and we don't want to prevent access to the directory
* while waiting on the I/O.
+ * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
+ * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.
*/
int do_unlinkat(int dfd, struct filename *name)
{
@@ -4264,6 +4411,7 @@ int do_unlinkat(int dfd, struct filename *name)
struct inode *inode = NULL;
struct inode *delegated_inode = NULL;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
if (error)
@@ -4276,9 +4424,9 @@ int do_unlinkat(int dfd, struct filename *name)
error = mnt_want_write(path.mnt);
if (error)
goto exit2;
+
retry_deleg:
- inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
- dentry = __lookup_hash(&last, path.dentry, lookup_flags);
+ dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
struct user_namespace *mnt_userns;
@@ -4297,9 +4445,9 @@ int do_unlinkat(int dfd, struct filename *name)
error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
&delegated_inode);
exit3:
+ done_path_update(&path, dentry, true);
dput(dentry);
}
- inode_unlock(path.dentry->d_inode);
if (inode)
iput(inode); /* truncate the inode here */
inode = NULL;
@@ -4388,13 +4536,14 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
struct dentry *dentry;
struct path path;
unsigned int lookup_flags = 0;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (IS_ERR(from)) {
error = PTR_ERR(from);
goto out_putnames;
}
retry:
- dentry = filename_create(newdfd, to, &path, lookup_flags);
+ dentry = filename_create(newdfd, to, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putnames;
@@ -4407,7 +4556,7 @@ int do_symlinkat(struct filename *from, int newdfd, struct filename *to)
error = vfs_symlink(mnt_userns, path.dentry->d_inode, dentry,
from->name);
}
- done_path_create(&path, dentry);
+ done_path_create_wq(&path, dentry, true);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
goto retry;
@@ -4536,6 +4685,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
struct inode *delegated_inode = NULL;
int how = 0;
int error;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
error = -EINVAL;
@@ -4559,7 +4709,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
goto out_putnames;

new_dentry = filename_create(newdfd, new, &new_path,
- (how & LOOKUP_REVAL));
+ (how & LOOKUP_REVAL), &wq);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_putpath;
@@ -4577,7 +4727,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
new_dentry, &delegated_inode);
out_dput:
- done_path_create(&new_path, new_dentry);
+ done_path_create_wq(&new_path, new_dentry, true);
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
if (!error) {
@@ -4847,7 +4997,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
retry_deleg:
trap = lock_rename(new_path.dentry, old_path.dentry);

- old_dentry = __lookup_hash(&old_last, old_path.dentry, lookup_flags);
+ old_dentry = __lookup_hash(&old_last, old_path.dentry,
+ lookup_flags, NULL);
error = PTR_ERR(old_dentry);
if (IS_ERR(old_dentry))
goto exit3;
@@ -4855,7 +5006,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
error = -ENOENT;
if (d_is_negative(old_dentry))
goto exit4;
- new_dentry = __lookup_hash(&new_last, new_path.dentry, lookup_flags | target_flags);
+ new_dentry = __lookup_hash(&new_last, new_path.dentry,
+ lookup_flags | target_flags, NULL);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto exit4;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 92c78ed02b54..d71c12907617 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -13,7 +13,9 @@
#include <linux/rcupdate.h>
#include <linux/lockref.h>
#include <linux/stringhash.h>
+#include <linux/sched.h>
#include <linux/wait.h>
+#include <linux/wait_bit.h>

struct path;
struct vfsmount;
@@ -96,6 +98,8 @@ struct dentry {
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */

+ /* lockdep tracking of DCACHE_PAR_UPDATE locks */
+ struct lockdep_map d_update_map;
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
@@ -211,6 +215,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_PAR_UPDATE 0x80000000 /* Being created or unlinked with shared lock */

extern seqlock_t rename_lock;

@@ -598,4 +603,27 @@ struct name_snapshot {
void take_dentry_name_snapshot(struct name_snapshot *, struct dentry *);
void release_dentry_name_snapshot(struct name_snapshot *);

+bool d_lock_update_nested(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name,
+ int subclass);
+static inline bool d_lock_update(struct dentry *dentry,
+ struct dentry *base, const struct qstr *name)
+{
+ /* 0 is I_MUTEX_NORMAL, but fs.h might not be included */
+ return d_lock_update_nested(dentry, base, name, 0);
+}
+
+static inline void d_unlock_update(struct dentry *dentry)
+{
+ if (IS_ERR_OR_NULL(dentry))
+ return;
+ if (dentry->d_flags & DCACHE_PAR_UPDATE) {
+ lock_map_release(&dentry->d_update_map);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+ spin_unlock(&dentry->d_lock);
+ wake_up_var(&dentry->d_flags);
+ }
+}
+
#endif /* __LINUX_DCACHE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..a9a48306806a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2524,12 +2524,13 @@ int sync_inode_metadata(struct inode *inode, int wait);
struct file_system_type {
const char *name;
int fs_flags;
-#define FS_REQUIRES_DEV 1
+#define FS_REQUIRES_DEV 1
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
-#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
+#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
+#define FS_PAR_DIR_UPDATE BIT(6) /* Allow parallel directory updates */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
int (*init_fs_context)(struct fs_context *);
const struct fs_parameter_spec *parameters;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index caeb08a98536..b7a123b489b1 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -61,7 +61,14 @@ extern int kern_path(const char *, unsigned, struct path *);

extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int);
extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int);
-extern void done_path_create(struct path *, struct dentry *);
+extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
+ struct path *path, unsigned int flags,
+ wait_queue_head_t *wq);
+extern void done_path_create_wq(struct path *, struct dentry *, bool);
+static inline void done_path_create(struct path *path, struct dentry *dentry)
+{
+ done_path_create_wq(path, dentry, false);
+}
extern struct dentry *kern_path_locked(const char *, struct path *);

extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
@@ -75,6 +82,13 @@ struct dentry *lookup_one_unlocked(struct user_namespace *mnt_userns,
struct dentry *lookup_one_positive_unlocked(struct user_namespace *mnt_userns,
const char *name,
struct dentry *base, int len);
+struct dentry *filename_create_one_len(const char *name, int nlen,
+ struct path *path,
+ unsigned int lookup_flags,
+ wait_queue_head_t *wq);
+static inline bool IS_PAR_UPDATE(struct inode *dir) {
+ return !!(dir->i_sb->s_type->fs_flags & FS_PAR_DIR_UPDATE);
+}

extern int follow_down_one(struct path *);
extern int follow_down(struct path *);


2022-08-26 02:21:05

by NeilBrown

Subject: [PATCH 02/10] VFS: move EEXIST and ENOENT tests into lookup_hash_update()

Moving common error handling into lookup_hash_update() simplifies
callers.
A future patch will export this functionality to nfsd, and the more code
we put in the interface, the less code will be needed in nfsd.

EEXIST is returned if LOOKUP_EXCL is set and the dentry is positive.
ENOENT is returned if LOOKUP_CREATE is NOT set and the dentry is negative.

This involves setting LOOKUP_EXCL in cases where it wasn't set before.
In particular, when creating a non-dir named "foo/", we want the
EEXIST error, but don't want to actually create anything.
Some filesystems assume LOOKUP_EXCL implies LOOKUP_CREATE, so ensure
that implication holds when calling ->lookup().
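
In pseudo-C, the checks this patch adds to lookup_hash_update() (after
the dentry has been locked with DCACHE_PAR_UPDATE) amount to:

	if ((flags & LOOKUP_EXCL) && d_is_positive(dentry))
		return ERR_PTR(-EEXIST);  /* exclusive create, name exists */
	if (!(flags & LOOKUP_CREATE) && !dentry->d_inode)
		return ERR_PTR(-ENOENT);  /* unlink/rmdir, name doesn't exist */

(in the real code below the dentry is of course unlocked and dput()
before the error is returned).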

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 58 ++++++++++++++++++++++++++++++++++------------------------
1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c008dfd01e30..09c2d007814a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1562,7 +1562,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
{
struct dentry *dentry = d_lookup(dir, name);
if (dentry) {
- int error = d_revalidate(dentry, flags);
+ int error;
+ /* Some filesystems assume EXCL -> CREATE, so make
+ * sure it does.
+ */
+ if (!(flags & LOOKUP_CREATE))
+ flags &= ~LOOKUP_EXCL;
+ error = d_revalidate(dentry, flags);
if (unlikely(error <= 0)) {
if (!error)
d_invalidate(dentry);
@@ -1621,6 +1627,8 @@ static struct dentry *__lookup_hash(const struct qstr *name,
* or shared lock depending on the fs preference, then do a lookup,
* and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
* was taken on the parent.
+ * If LOOKUP_EXCL, name should not already exist, else -EEXIST
+ * If not LOOKUP_CREATE, name should already exist, else -ENOENT
*/
static struct dentry *lookup_hash_update(
const struct qstr *name,
@@ -1651,6 +1659,24 @@ static struct dentry *lookup_hash_update(
dput(dentry);
goto retry;
}
+ if (flags & LOOKUP_EXCL) {
+ if (d_is_positive(dentry)) {
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
+ dput(dentry);
+ err = -EEXIST;
+ goto out_err;
+ }
+ }
+ if (!(flags & LOOKUP_CREATE)) {
+ if (!dentry->d_inode) {
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
+ dput(dentry);
+ err = -ENOENT;
+ goto out_err;
+ }
+ }
return dentry;

out_err:
@@ -3868,7 +3894,7 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
- unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
+ unsigned int create_flag = LOOKUP_CREATE;
int err2;
int error;

@@ -3879,26 +3905,16 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
* '/', and a directory wasn't requested.
*/
if (last->name[last->len] && !want_dir)
- create_flags = 0;
+ /* Name was foo/bar/ but not creating a directory, so
+ * we won't try to create - result will be either -ENOENT
+ * or -EEXIST.
+ */
+ create_flag = 0;
dentry = lookup_hash_update(last, path->dentry,
- reval_flag | create_flags, wq);
+ reval_flag | create_flag | LOOKUP_EXCL, wq);
if (IS_ERR(dentry))
goto drop_write;

- error = -EEXIST;
- if (d_is_positive(dentry))
- goto fail;
-
- /*
- * Special case - lookup gave negative, but... we had foo/bar/
- * From the vfs_mknod() POV we just have a negative dentry -
- * all is fine. Let's be bastards - you had / on the end, you've
- * been asking for (non-existent) directory. -ENOENT for you.
- */
- if (unlikely(!create_flags)) {
- error = -ENOENT;
- goto fail;
- }
if (unlikely(err2)) {
error = err2;
goto fail;
@@ -4292,10 +4308,6 @@ int do_rmdir(int dfd, struct filename *name)
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit3;
- if (!dentry->d_inode) {
- error = -ENOENT;
- goto exit4;
- }
error = security_path_rmdir(&path, dentry);
if (error)
goto exit4;
@@ -4435,8 +4447,6 @@ int do_unlinkat(int dfd, struct filename *name)
if (last.name[last.len])
goto slashes;
inode = dentry->d_inode;
- if (d_is_negative(dentry))
- goto slashes;
ihold(inode);
error = security_path_unlink(&path, dentry);
if (error)


2022-08-26 02:21:16

by NeilBrown

Subject: [PATCH 03/10] VFS: move want_write checks into lookup_hash_update()

mnt_want_write() is always called before lookup_hash_update(), so
we can simplify the code by moving the call into that function.

If lookup_hash_update() succeeds, it will now have claimed the
want_write lock. If it fails, the want_write lock isn't held.

To allow this, lookup_hash_update() now receives a 'struct path *'
instead of a 'struct dentry *'.

Note that when creating a name, any error from mnt_want_write() does not
get reported unless there is no other error. For unlink/rmdir though,
an error from mnt_want_write() is immediately fatal - overriding ENOENT
for example. This behaviour seems strange, but this patch is careful to
preserve it.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 72 ++++++++++++++++++++++++++----------------------------------
1 file changed, 31 insertions(+), 41 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 09c2d007814a..73c3319a1703 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1632,12 +1632,22 @@ static struct dentry *__lookup_hash(const struct qstr *name,
*/
static struct dentry *lookup_hash_update(
const struct qstr *name,
- struct dentry *base, unsigned int flags,
+ struct path *path, unsigned int flags,
wait_queue_head_t *wq)
{
struct dentry *dentry;
+ struct dentry *base = path->dentry;
struct inode *dir = base->d_inode;
- int err;
+ int err, err2;
+
+ /* For create, don't fail immediately if it's r/o,
+ * at least try to report other errors.
+ * For unlink/rmdir where LOOKUP_REVAl is the only
+ * flag, fail immediately if r/o.
+ */
+ err2 = mnt_want_write(path->mnt);
+ if (err2 && (flags & ~LOOKUP_REVAL) == 0)
+ return ERR_PTR(err2);

if (wq && IS_PAR_UPDATE(dir))
inode_lock_shared_nested(dir, I_MUTEX_PARENT);
@@ -1677,6 +1687,13 @@ static struct dentry *lookup_hash_update(
goto out_err;
}
}
+ if (err2) {
+ err = err2;
+ d_lookup_done(dentry);
+ d_unlock_update(dentry);
+ dput(dentry);
+ goto out_err;
+ }
return dentry;

out_err:
@@ -1684,6 +1701,8 @@ static struct dentry *lookup_hash_update(
inode_unlock_shared(dir);
else
inode_unlock(dir);
+ if (!err2)
+ mnt_drop_write(path->mnt);
return ERR_PTR(err);
}

@@ -1700,7 +1719,7 @@ struct dentry *lookup_hash_update_len(const char *name, int nlen,
path->dentry, nlen, &this);
if (err)
return ERR_PTR(err);
- return lookup_hash_update(&this, path->dentry, flags, wq);
+ return lookup_hash_update(&this, path, flags, wq);
}
EXPORT_SYMBOL(lookup_hash_update_len);

@@ -3891,15 +3910,10 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
unsigned int lookup_flags,
wait_queue_head_t *wq)
{
- struct dentry *dentry;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
unsigned int create_flag = LOOKUP_CREATE;
- int err2;
- int error;

- /* don't fail immediately if it's r/o, at least try to report other errors */
- err2 = mnt_want_write(path->mnt);
/*
* Do the final lookup. Suppress 'create' if there is a trailing
* '/', and a directory wasn't requested.
@@ -3910,24 +3924,9 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
* or -EEXIST.
*/
create_flag = 0;
- dentry = lookup_hash_update(last, path->dentry,
- reval_flag | create_flag | LOOKUP_EXCL, wq);
- if (IS_ERR(dentry))
- goto drop_write;
-
- if (unlikely(err2)) {
- error = err2;
- goto fail;
- }
- return dentry;
-fail:
- done_path_update(path, dentry, !!wq);
- dput(dentry);
- dentry = ERR_PTR(error);
-drop_write:
- if (!err2)
- mnt_drop_write(path->mnt);
- return dentry;
+ return lookup_hash_update(last, path,
+ reval_flag | create_flag | LOOKUP_EXCL,
+ wq);
}

struct dentry *filename_create_one_len(const char *name, int nlen,
@@ -4300,23 +4299,18 @@ int do_rmdir(int dfd, struct filename *name)
goto exit2;
}

- error = mnt_want_write(path.mnt);
- if (error)
- goto exit2;
-
- dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
+ dentry = lookup_hash_update(&last, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
- goto exit3;
+ goto exit2;
error = security_path_rmdir(&path, dentry);
if (error)
- goto exit4;
+ goto exit3;
mnt_userns = mnt_user_ns(path.mnt);
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
-exit4:
+exit3:
done_path_update(&path, dentry, true);
dput(dentry);
-exit3:
mnt_drop_write(path.mnt);
exit2:
path_put(&path);
@@ -4433,12 +4427,8 @@ int do_unlinkat(int dfd, struct filename *name)
if (type != LAST_NORM)
goto exit2;

- error = mnt_want_write(path.mnt);
- if (error)
- goto exit2;
-
retry_deleg:
- dentry = lookup_hash_update(&last, path.dentry, lookup_flags, &wq);
+ dentry = lookup_hash_update(&last, &path, lookup_flags, &wq);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
struct user_namespace *mnt_userns;
@@ -4457,6 +4447,7 @@ int do_unlinkat(int dfd, struct filename *name)
exit3:
done_path_update(&path, dentry, true);
dput(dentry);
+ mnt_drop_write(path.mnt);
}
if (inode)
iput(inode); /* truncate the inode here */
@@ -4466,7 +4457,6 @@ int do_unlinkat(int dfd, struct filename *name)
if (!error)
goto retry_deleg;
}
- mnt_drop_write(path.mnt);
exit2:
path_put(&path);
if (retry_estale(error, lookup_flags)) {


2022-08-26 02:21:47

by NeilBrown

Subject: [PATCH 04/10] VFS: move dput() and mnt_drop_write() into done_path_update()

All calls to done_path_update() are followed by the same two calls, so
merge them in.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 73c3319a1703..8df09c19f2b0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1734,6 +1734,8 @@ static void done_path_update(struct path *path, struct dentry *dentry,
inode_unlock_shared(dir);
else
inode_unlock(dir);
+ dput(dentry);
+ mnt_drop_write(path->mnt);
}

static struct dentry *lookup_fast(struct nameidata *nd)
@@ -3988,8 +3990,6 @@ EXPORT_SYMBOL(kern_path_create);
void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
{
done_path_update(path, dentry, with_wq);
- dput(dentry);
- mnt_drop_write(path->mnt);
path_put(path);
}
EXPORT_SYMBOL(done_path_create_wq);
@@ -4310,8 +4310,6 @@ int do_rmdir(int dfd, struct filename *name)
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
exit3:
done_path_update(&path, dentry, true);
- dput(dentry);
- mnt_drop_write(path.mnt);
exit2:
path_put(&path);
if (retry_estale(error, lookup_flags)) {
@@ -4446,8 +4444,6 @@ int do_unlinkat(int dfd, struct filename *name)
&delegated_inode);
exit3:
done_path_update(&path, dentry, true);
- dput(dentry);
- mnt_drop_write(path.mnt);
}
if (inode)
iput(inode); /* truncate the inode here */


2022-08-26 02:22:32

by NeilBrown

Subject: [PATCH 05/10] VFS: export done_path_update()

This function will be useful to nfsd, so export it. As most callers
will want "with_wq" to be true, rename the current done_path_update() to
__done_path_update() and add a new done_path_update() which sets with_wq
to true.

done_path_create_wq() is now simple enough to be "static inline"
rather than an explicit export.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 16 +++++-----------
include/linux/namei.h | 16 ++++++++++++++--
2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8df09c19f2b0..13f8ac9721be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1723,8 +1723,8 @@ struct dentry *lookup_hash_update_len(const char *name, int nlen,
}
EXPORT_SYMBOL(lookup_hash_update_len);

-static void done_path_update(struct path *path, struct dentry *dentry,
- bool with_wq)
+void __done_path_update(struct path *path, struct dentry *dentry,
+ bool with_wq)
{
struct inode *dir = path->dentry->d_inode;

@@ -1737,6 +1737,7 @@ static void done_path_update(struct path *path, struct dentry *dentry,
dput(dentry);
mnt_drop_write(path->mnt);
}
+EXPORT_SYMBOL(__done_path_update);

static struct dentry *lookup_fast(struct nameidata *nd)
{
@@ -3987,13 +3988,6 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
}
EXPORT_SYMBOL(kern_path_create);

-void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
-{
- done_path_update(path, dentry, with_wq);
- path_put(path);
-}
-EXPORT_SYMBOL(done_path_create_wq);
-
inline struct dentry *user_path_create(int dfd, const char __user *pathname,
struct path *path, unsigned int lookup_flags)
{
@@ -4309,7 +4303,7 @@ int do_rmdir(int dfd, struct filename *name)
mnt_userns = mnt_user_ns(path.mnt);
error = vfs_rmdir(mnt_userns, path.dentry->d_inode, dentry);
exit3:
- done_path_update(&path, dentry, true);
+ done_path_update(&path, dentry);
exit2:
path_put(&path);
if (retry_estale(error, lookup_flags)) {
@@ -4443,7 +4437,7 @@ int do_unlinkat(int dfd, struct filename *name)
error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
&delegated_inode);
exit3:
- done_path_update(&path, dentry, true);
+ done_path_update(&path, dentry);
}
if (inode)
iput(inode); /* truncate the inode here */
diff --git a/include/linux/namei.h b/include/linux/namei.h
index b7a123b489b1..b1a210a51210 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -64,11 +64,23 @@ extern struct dentry *user_path_create(int, const char __user *, struct path *,
extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
struct path *path, unsigned int flags,
wait_queue_head_t *wq);
-extern void done_path_create_wq(struct path *, struct dentry *, bool);
+extern void __done_path_update(struct path *, struct dentry *, bool);
+static inline void done_path_update(struct path *path, struct dentry *dentry)
+{
+ __done_path_update(path, dentry, true);
+}
static inline void done_path_create(struct path *path, struct dentry *dentry)
{
- done_path_create_wq(path, dentry, false);
+ __done_path_update(path, dentry, false);
+ path_put(path);
}
+static inline void done_path_create_wq(struct path *path, struct dentry *dentry,
+ bool with_wq)
+{
+ __done_path_update(path, dentry, with_wq);
+ path_put(path);
+}
+
extern struct dentry *kern_path_locked(const char *, struct path *);

extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);


2022-08-26 02:24:11

by NeilBrown

Subject: [PATCH 07/10] VFS: hold DCACHE_PAR_UPDATE lock across d_revalidate()

->d_revalidate() functions do not, as a rule, block renames (though 9p
does). Several of them access the parent of the dentry, often (but not
always) taking a reference to be sure it doesn't disappear.
It is in general possible for d_revalidate to race with rename, so the
directory that d_revalidate works with may not be the parent of the
dentry by the time the call finishes.

The exact consequences of this will vary between filesystems and may
not be problematic. However, it seems safest to provide exclusion
between d_revalidate and other operations that change the dentry, such
as rename, when doing so is inexpensive.

The introduction of DCACHE_PAR_UPDATE does make this easy. d_revalidate
can set this flag, or wait if it is already set. This ensures there
will be no race, without needing to lock the whole directory.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a7c458cc787c..ef994239fa7c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -852,10 +852,37 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)

static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
{
- if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
- return dentry->d_op->d_revalidate(dentry, flags);
- else
+ int status;
+
+ if (!unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
return 1;
+
+ lock_acquire_exclusive(&dentry->d_update_map, I_MUTEX_NORMAL,
+ 0, NULL, _THIS_IP_);
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_flags & DCACHE_PAR_UPDATE) {
+ /* Some other thread owns this dentry */
+ if (flags & LOOKUP_RCU) {
+ spin_unlock(&dentry->d_lock);
+ lock_map_release(&dentry->d_update_map);
+ return -ECHILD;
+ }
+ ___wait_var_event(&dentry->d_flags,
+ !(dentry->d_flags & DCACHE_PAR_UPDATE),
+ TASK_UNINTERRUPTIBLE, 0, 0,
+ (spin_unlock(&dentry->d_lock),
+ schedule(),
+ spin_lock(&dentry->d_lock))
+ );
+ }
+ dentry->d_flags |= DCACHE_PAR_UPDATE;
+ spin_unlock(&dentry->d_lock);
+
+ status = dentry->d_op->d_revalidate(dentry, flags);
+
+ d_unlock_update(dentry);
+
+ return status;
}

/**


2022-08-26 02:25:07

by NeilBrown

Subject: [PATCH 06/10] VFS: support concurrent renames.

An object can now be renamed from or to a directory in which a create
or unlink is concurrently happening.

Two or more renames within the one directory can also be concurrent.
s_vfs_rename_mutex still serialises lookups for cross-directory renames,
but the renames themselves can proceed concurrently.

A core part of this change is introducing lock_rename_lookup()
which both locks the directories and performs the lookups.
If the filesystem supports shared-lock updates and a wq is provided,
shared locks are used on directories, otherwise exclusive locks.
DCACHE_PAR_UPDATE is always set on the found dentries.

unlock_rename_lookup() performs appropriate unlocking. It needs to be
told if a wq was provided to lock_rename_lookup().

As we may use d_alloc_parallel(), which can block, we need to be
careful of the case where both names are the same, in the same
directory. If the first ->lookup() chooses not to complete the lookup -
as may happen with LOOKUP_RENAME_TARGET - then the second will block.
LOOKUP_RENAME_TARGET is only expected on the first name listed, so we
make sure to look up the second name given first.
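
Since lock_rename_lookup_one() is exported, an external caller (e.g.
nfsd) would use it roughly as sketched below, modelled on the
do_renameat2() changes in this patch (p_new/p_old and the name strings
are assumed to exist already; error handling is abbreviated):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct dentry *trap, *d_new, *d_old;

	trap = lock_rename_lookup_one(p_new, p_old, &d_new, &d_old,
				      new_name, strlen(new_name),
				      old_name, strlen(old_name),
				      LOOKUP_RENAME_TARGET, 0, &wq);
	if (IS_ERR(trap))
		return PTR_ERR(trap);
	if (d_old == trap || d_new == trap) {
		/* source or target is an ancestor - cannot rename */
		error = -EINVAL;
		goto out_unlock;
	}
	/* ... fill in struct renamedata and call vfs_rename() ... */
out_unlock:
	unlock_rename_lookup(p_new, p_old, d_new, d_old, true);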

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 221 ++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/namei.h | 10 ++
2 files changed, 208 insertions(+), 23 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 13f8ac9721be..a7c458cc787c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3156,6 +3156,187 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
}
EXPORT_SYMBOL(unlock_rename);

+static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ struct qstr *last1, struct qstr *last2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq)
+{
+ struct dentry *p;
+ struct dentry *d1, *d2;
+ bool ok1, ok2;
+ bool shared = wq && IS_PAR_UPDATE(p1->d_inode);
+
+ if (p1 == p2) {
+ if (shared)
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ retry:
+ /* last1 is expected to be the target and so might be looked up
+ * lazily. So look up last2 first to avoid the second look up
+ * waiting for the first.
+ */
+ d2 = __lookup_hash(last2, p2, flags2, wq);
+ if (IS_ERR(d2))
+ goto out_unlock_2;
+ d1 = __lookup_hash(last1, p1, flags1, wq);
+ if (IS_ERR(d1))
+ goto out_unlock_1;
+ *d1p = d1; *d2p = d2;
+
+ if (d1 < d2) {
+ ok1 = d_lock_update_nested(d1, p1, last1,
+ I_MUTEX_PARENT);
+ ok2 = d_lock_update_nested(d2, p2, last2,
+ I_MUTEX_PARENT2);
+ } else if (d1 > d2) {
+ ok2 = d_lock_update_nested(d2, p2, last2,
+ I_MUTEX_PARENT);
+ ok1 = d_lock_update_nested(d1, p1, last1,
+ I_MUTEX_PARENT2);
+ } else {
+ /* d1 == d2 !! */
+ ok1 = d_lock_update_nested(d1, p1, last1,
+ I_MUTEX_PARENT);
+ ok2 = ok1;
+ }
+ if (!ok1 || !ok2) {
+ if (ok1)
+ d_unlock_update(d1);
+ if (ok2)
+ d_unlock_update(d2);
+ dput(d1);
+ dput(d2);
+ goto retry;
+ }
+ return NULL;
+ out_unlock_1:
+ d_lookup_done(d2);
+ dput(d2);
+ d2 = d1;
+ out_unlock_2:
+ if (shared)
+ inode_unlock_shared(p1->d_inode);
+ else
+ inode_unlock(p1->d_inode);
+ return d1;
+ }
+
+ mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
+
+ if ((p = d_ancestor(p2, p1)) != NULL) {
+ if (shared) {
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_CHILD);
+ } else {
+ inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p1->d_inode, I_MUTEX_CHILD);
+ }
+ } else if ((p = d_ancestor(p1, p2)) != NULL) {
+ if (shared) {
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_CHILD);
+ } else {
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p2->d_inode, I_MUTEX_CHILD);
+ }
+ } else {
+ if (shared) {
+ inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT2);
+ } else {
+ inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+ inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
+ }
+ }
+retry2:
+ d1 = __lookup_hash(last1, p1, flags1, wq);
+ if (IS_ERR(d1))
+ goto unlock_out_3;
+ d2 = __lookup_hash(last2, p2, flags2, wq);
+ if (IS_ERR(d2))
+ goto unlock_out_4;
+
+ if (d1 < d2) {
+ ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT);
+ ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT2);
+ } else {
+ ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT);
+ ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT2);
+ }
+ if (!ok1 || !ok2) {
+ if (ok1)
+ d_unlock_update(d1);
+ if (ok2)
+ d_unlock_update(d2);
+ dput(d1);
+ dput(d2);
+ goto retry2;
+ }
+ *d1p = d1;
+ *d2p = d2;
+ return p;
+unlock_out_4:
+ d_lookup_done(d1);
+ dput(d1);
+ d1 = d2;
+unlock_out_3:
+ if (shared) {
+ inode_unlock_shared(p1->d_inode);
+ inode_unlock_shared(p2->d_inode);
+ } else {
+ inode_unlock(p1->d_inode);
+ inode_unlock(p2->d_inode);
+ }
+ mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+ return d1;
+}
+
+struct dentry *lock_rename_lookup_one(struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ const char *name1, int nlen1,
+ const char *name2, int nlen2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq)
+{
+ struct qstr this1, this2;
+ int err;
+
+ err = lookup_one_common(&init_user_ns, name1, p1, nlen1, &this1);
+ if (err)
+ return ERR_PTR(err);
+ err = lookup_one_common(&init_user_ns, name2, p2, nlen2, &this2);
+ if (err)
+ return ERR_PTR(err);
+ return lock_rename_lookup(p1, p2, d1p, d2p, &this1, &this2,
+ flags1, flags2, wq);
+}
+EXPORT_SYMBOL(lock_rename_lookup_one);
+
+void unlock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry *d1, struct dentry *d2,
+ bool with_wq)
+{
+ bool shared = with_wq && IS_PAR_UPDATE(p1->d_inode);
+ d_lookup_done(d1);
+ d_lookup_done(d2);
+ d_unlock_update(d1);
+ if (d2 != d1)
+ d_unlock_update(d2);
+ if (shared) {
+ inode_unlock_shared(p1->d_inode);
+ if (p1 != p2) {
+ inode_unlock_shared(p2->d_inode);
+ mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+ }
+ } else
+ unlock_rename(p1, p2);
+ dput(d1);
+ dput(d2);
+}
+EXPORT_SYMBOL(unlock_rename_lookup);
+
/**
* mode_strip_umask - handle vfs umask stripping
* @dir: parent directory of the new inode
@@ -4945,6 +5126,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
bool should_retry = false;
int error = -EINVAL;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
goto put_names;
@@ -4985,58 +5167,53 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
goto exit2;

retry_deleg:
- trap = lock_rename(new_path.dentry, old_path.dentry);
-
- old_dentry = __lookup_hash(&old_last, old_path.dentry,
- lookup_flags, NULL);
- error = PTR_ERR(old_dentry);
- if (IS_ERR(old_dentry))
+ trap = lock_rename_lookup(new_path.dentry, old_path.dentry,
+ &new_dentry, &old_dentry,
+ &new_last, &old_last,
+ lookup_flags | target_flags, lookup_flags,
+ &wq);
+ if (IS_ERR(trap))
goto exit3;
/* source must exist */
error = -ENOENT;
if (d_is_negative(old_dentry))
goto exit4;
- new_dentry = __lookup_hash(&new_last, new_path.dentry,
- lookup_flags | target_flags, NULL);
- error = PTR_ERR(new_dentry);
- if (IS_ERR(new_dentry))
- goto exit4;
error = -EEXIST;
if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
- goto exit5;
+ goto exit4;
if (flags & RENAME_EXCHANGE) {
error = -ENOENT;
if (d_is_negative(new_dentry))
- goto exit5;
+ goto exit4;

if (!d_is_dir(new_dentry)) {
error = -ENOTDIR;
if (new_last.name[new_last.len])
- goto exit5;
+ goto exit4;
}
}
/* unless the source is a directory trailing slashes give -ENOTDIR */
if (!d_is_dir(old_dentry)) {
error = -ENOTDIR;
if (old_last.name[old_last.len])
- goto exit5;
+ goto exit4;
if (!(flags & RENAME_EXCHANGE) && new_last.name[new_last.len])
- goto exit5;
+ goto exit4;
}
/* source should not be ancestor of target */
error = -EINVAL;
if (old_dentry == trap)
- goto exit5;
+ goto exit4;
/* target should not be an ancestor of source */
if (!(flags & RENAME_EXCHANGE))
error = -ENOTEMPTY;
if (new_dentry == trap)
- goto exit5;
+ goto exit4;

error = security_path_rename(&old_path, old_dentry,
&new_path, new_dentry, flags);
if (error)
- goto exit5;
+ goto exit4;

rd.old_dir = old_path.dentry->d_inode;
rd.old_dentry = old_dentry;
@@ -5047,12 +5224,10 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
rd.delegated_inode = &delegated_inode;
rd.flags = flags;
error = vfs_rename(&rd);
-exit5:
- dput(new_dentry);
exit4:
- dput(old_dentry);
+ unlock_rename_lookup(new_path.dentry, old_path.dentry, new_dentry, old_dentry,
+ true);
exit3:
- unlock_rename(new_path.dentry, old_path.dentry);
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
if (!error)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index b1a210a51210..29756921f69b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -108,6 +108,16 @@ extern int follow_up(struct path *);

extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
+extern struct dentry *lock_rename_lookup_one(
+ struct dentry *p1, struct dentry *p2,
+ struct dentry **d1p, struct dentry **d2p,
+ const char *name1, int nlen1,
+ const char *name2, int nlen2,
+ unsigned int flags1, unsigned int flags2,
+ wait_queue_head_t *wq);
+extern void unlock_rename_lookup(struct dentry *p1, struct dentry *p2,
+ struct dentry *d1, struct dentry *d2,
+ bool withwq);

extern int __must_check nd_jump_link(struct path *path);



2022-08-26 02:25:31

by NeilBrown

[permalink] [raw]
Subject: [PATCH 09/10] VFS: add LOOKUP_SILLY_RENAME

When performing a "silly rename" to avoid removing a file that is still
open, we need to perform a lookup in a directory that is already locked.

In order to allow common functions to be used for this lookup, introduce
LOOKUP_SILLY_RENAME which affirms that the directory is already locked
and that the vfsmnt is already writable.

When LOOKUP_SILLY_RENAME is set, path->mnt can be NULL. As
i_op->rename() doesn't make the vfsmnt available, this is unavoidable.
So we ensure that a NULL ->mnt isn't fatal.
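
For illustration, a silly-rename path then looks roughly like this
(a condensed sketch of the NFS conversion later in the series;
"parent", "silly" and "slen" are placeholders):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct path path = { .dentry = parent };  /* ->mnt may be NULL */
	struct dentry *sdentry;

	/* parent is already locked: don't lock or mnt_want_write() again */
	sdentry = filename_create_one_len(silly, slen, &path,
					  LOOKUP_CREATE | LOOKUP_EXCL |
					  LOOKUP_SILLY_RENAME, &wq);
	if (IS_ERR(sdentry))
		return PTR_ERR(sdentry);

	/* ... perform the rename onto the silly name ... */

	/* release the per-name lock; wq must stay live until here */
	__done_path_update(&path, sdentry, true, LOOKUP_SILLY_RENAME);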

Signed-off-by: NeilBrown <[email protected]>
---
fs/dcache.c | 3 +-
fs/namei.c | 88 +++++++++++++++++++++++++++++--------------------
include/linux/namei.h | 9 +++--
3 files changed, 59 insertions(+), 41 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d6bfa49b143b..9bf346a9de52 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3297,7 +3297,8 @@ EXPORT_SYMBOL(d_tmpfile);
* If the parent directory is locked with I_MUTEX_NORMAL, use I_MUTEX_NORMAL.
* If the parent is locked with I_MUTEX_PARENT, I_MUTEX_PARENT2 or
* I_MUTEX_CHILD, use I_MUTEX_PARENT or, for the second in a rename,
- * I_MUTEX_PARENT2.
+ * I_MUTEX_PARENT2. When a third name is needed, as with "silly-rename"
+ * I_MUTEX_CHILD is used.
*/
bool d_lock_update_nested(struct dentry *dentry,
struct dentry *base, const struct qstr *name,
diff --git a/fs/namei.c b/fs/namei.c
index ef994239fa7c..c9bbff120bf9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1650,12 +1650,14 @@ static struct dentry *__lookup_hash(const struct qstr *name,
}

/*
- * Parent directory (base) is not locked. We take either an exclusive
- * or shared lock depending on the fs preference, then do a lookup,
- * and then set the DCACHE_PAR_UPDATE bit on the child if a shared lock
- * was taken on the parent.
+ * Without LOOKUP_SILLY_RENAME parent directory (base) is not locked.
+ * We take either an exclusive or shared lock depending on the fs
+ * preference, then do a lookup, and then set the DCACHE_PAR_UPDATE bit
+ * on the child if a shared lock was taken on the parent.
* If LOOKUP_EXCL, name should not already exist, else -EEXIST
* If not LOOKUP_CREATE, name should already exist, else -ENOENT
+ * If LOOKUP_SILLY_RENAME, don't require write access. In this case
+ * path->mnt may be NULL.
*/
static struct dentry *lookup_hash_update(
const struct qstr *name,
@@ -1665,21 +1667,26 @@ static struct dentry *lookup_hash_update(
struct dentry *dentry;
struct dentry *base = path->dentry;
struct inode *dir = base->d_inode;
- int err, err2;
-
- /* For create, don't fail immediately if it's r/o,
- * at least try to report other errors.
- * For unlink/rmdir where LOOKUP_REVAl is the only
- * flag, fail immediately if r/o.
- */
- err2 = mnt_want_write(path->mnt);
- if (err2 && (flags & ~LOOKUP_REVAL) == 0)
- return ERR_PTR(err2);
+ int err, err2 = 0;
+ int class;
+
+ if (!(flags & LOOKUP_SILLY_RENAME)) {
+ /* For create, don't fail immediately if it's r/o,
+ * at least try to report other errors.
+ * For unlink/rmdir where LOOKUP_REVAl is the only
+ * flag, fail immediately if r/o.
+ */
+ err2 = mnt_want_write(path->mnt);
+ if (err2 && (flags & ~LOOKUP_REVAL) == 0)
+ return ERR_PTR(err2);

- if (wq && IS_PAR_UPDATE(dir))
- inode_lock_shared_nested(dir, I_MUTEX_PARENT);
- else
- inode_lock_nested(dir, I_MUTEX_PARENT);
+ if (wq && IS_PAR_UPDATE(dir))
+ inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(dir, I_MUTEX_PARENT);
+ class = I_MUTEX_PARENT;
+ } else
+ class = I_MUTEX_CHILD;

retry:
dentry = __lookup_hash(name, base, flags, wq);
@@ -1687,7 +1694,7 @@ static struct dentry *lookup_hash_update(
err = PTR_ERR(dentry);
goto out_err;
}
- if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
+ if (!d_lock_update_nested(dentry, base, name, class)) {
/*
* Failed to get lock due to race with unlink or rename
* - try again
@@ -1724,12 +1731,14 @@ static struct dentry *lookup_hash_update(
return dentry;

out_err:
- if (wq && IS_PAR_UPDATE(dir))
- inode_unlock_shared(dir);
- else
- inode_unlock(dir);
- if (!err2)
- mnt_drop_write(path->mnt);
+ if (!(flags & LOOKUP_SILLY_RENAME)) {
+ if (wq && IS_PAR_UPDATE(dir))
+ inode_unlock_shared(dir);
+ else
+ inode_unlock(dir);
+ if (!err2)
+ mnt_drop_write(path->mnt);
+ }
return ERR_PTR(err);
}

@@ -1751,18 +1760,22 @@ struct dentry *lookup_hash_update_len(const char *name, int nlen,
EXPORT_SYMBOL(lookup_hash_update_len);

void __done_path_update(struct path *path, struct dentry *dentry,
- bool with_wq)
+ bool with_wq, unsigned int flags)
{
struct inode *dir = path->dentry->d_inode;

d_lookup_done(dentry);
d_unlock_update(dentry);
- if (IS_PAR_UPDATE(dir) && with_wq)
- inode_unlock_shared(dir);
- else
- inode_unlock(dir);
- dput(dentry);
- mnt_drop_write(path->mnt);
+ if (flags & LOOKUP_SILLY_RENAME) {
+ dput(dentry);
+ } else {
+ if (IS_PAR_UPDATE(dir) && with_wq)
+ inode_unlock_shared(dir);
+ else
+ inode_unlock(dir);
+ dput(dentry);
+ mnt_drop_write(path->mnt);
+ }
}
EXPORT_SYMBOL(__done_path_update);

@@ -4122,7 +4135,8 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
wait_queue_head_t *wq)
{
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
- unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
+ unsigned int flags = lookup_flags & (LOOKUP_REVAL |
+ LOOKUP_SILLY_RENAME);
unsigned int create_flag = LOOKUP_CREATE;

/*
@@ -4136,7 +4150,7 @@ static struct dentry *filename_create_one(struct qstr *last, struct path *path,
*/
create_flag = 0;
return lookup_hash_update(last, path,
- reval_flag | create_flag | LOOKUP_EXCL,
+ flags | create_flag | LOOKUP_EXCL,
wq);
}

@@ -4146,10 +4160,12 @@ struct dentry *filename_create_one_len(const char *name, int nlen,
wait_queue_head_t *wq)
{
struct qstr this;
+ struct user_namespace *uns = &init_user_ns;
int err;

- err = lookup_one_common(mnt_user_ns(path->mnt), name,
- path->dentry, nlen, &this);
+ if (path->mnt)
+ uns = mnt_user_ns(path->mnt);
+ err = lookup_one_common(uns, name, path->dentry, nlen, &this);
if (err)
return ERR_PTR(err);
return filename_create_one(&this, path, lookup_flags, wq);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 29756921f69b..92a62b04a83d 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -21,6 +21,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
#define LOOKUP_FOLLOW 0x0001 /* follow links at the end */
#define LOOKUP_DIRECTORY 0x0002 /* require a directory */
#define LOOKUP_AUTOMOUNT 0x0004 /* force terminal automount */
+#define LOOKUP_SILLY_RENAME 0x0008 /* Directory already locked, don't lock again */
#define LOOKUP_EMPTY 0x4000 /* accept empty path [user_... only] */
#define LOOKUP_DOWN 0x8000 /* follow mounts in the starting point */
#define LOOKUP_MOUNTPOINT 0x0080 /* follow mounts in the end */
@@ -64,20 +65,20 @@ extern struct dentry *user_path_create(int, const char __user *, struct path *,
extern struct dentry *lookup_hash_update_len(const char *name, int nlen,
struct path *path, unsigned int flags,
wait_queue_head_t *wq);
-extern void __done_path_update(struct path *, struct dentry *, bool);
+extern void __done_path_update(struct path *, struct dentry *, bool, unsigned int);
static inline void done_path_update(struct path *path, struct dentry *dentry)
{
- __done_path_update(path, dentry, true);
+ __done_path_update(path, dentry, true, 0);
}
static inline void done_path_create(struct path *path, struct dentry *dentry)
{
- __done_path_update(path, dentry, false);
+ __done_path_update(path, dentry, false, 0);
path_put(path);
}
static inline void done_path_create_wq(struct path *path, struct dentry *dentry,
bool with_wq)
{
- __done_path_update(path, dentry, with_wq);
+ __done_path_update(path, dentry, with_wq, 0);
path_put(path);
}



2022-08-26 02:26:20

by NeilBrown

[permalink] [raw]
Subject: [PATCH 10/10] NFS: support parallel updates in the one directory.

NFS can easily support parallel updates as the locking is done on the
server, so this patch enables parallel updates for NFS.

Handling of silly-rename - both for unlink and for the rename target -
requires some care: we need to get an exclusive lock on the chosen
silly name, and for rename we also need to keep the original target name
locked after it has been renamed to the silly name.

So nfs_sillyrename() now uses d_exchange() to swap the target and the
silly name after the silly-rename has happened on the server, and the
silly dentry - which now has the name of the target - is returned.

For unlink(), this is immediately unlocked and discarded with a call to
nfs_sillyrename_finish(). For rename it is kept locked until the
originally requested rename completes.
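
For unlink the resulting call pattern is small (condensed from the
nfs_unlink() change below):

	struct dentry *silly;

	/* silly-rename on the server; on success the returned dentry
	 * carries the original name and is still locked
	 */
	silly = nfs_sillyrename(dir, dentry);
	error = PTR_ERR_OR_ZERO(silly);
	/* unlink has no further use for it: unlock and drop it now */
	nfs_sillyrename_finish(dir, silly);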

Signed-off-by: NeilBrown <[email protected]>
---
fs/dcache.c | 5 ++++-
fs/nfs/dir.c | 28 ++++++++++++++++------------
fs/nfs/fs_context.c | 6 ++++--
fs/nfs/internal.h | 3 ++-
fs/nfs/unlink.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
5 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9bf346a9de52..a5eaab16d39f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3056,7 +3056,9 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
write_seqlock(&rename_lock);

WARN_ON(!dentry1->d_inode);
- WARN_ON(!dentry2->d_inode);
+ /* Allow dentry2 to be negative, so we can do a rename
+ * but keep both names locked (DCACHE_PAR_UPDATE)
+ */
WARN_ON(IS_ROOT(dentry1));
WARN_ON(IS_ROOT(dentry2));

@@ -3064,6 +3066,7 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)

write_sequnlock(&rename_lock);
}
+EXPORT_SYMBOL(d_exchange);

/**
* d_ancestor - search for an ancestor
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 5d6c2ddc7ea6..fbb608fbe6bf 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1935,8 +1935,12 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
/*
* If we're doing an exclusive create, optimize away the lookup
* but don't hash the dentry.
+ * A silly_rename is marked exclusive, but we need to do an
+ * explicit lookup.
*/
- if (nfs_is_exclusive_create(dir, flags) || flags & LOOKUP_RENAME_TARGET)
+ if ((nfs_is_exclusive_create(dir, flags) ||
+ flags & LOOKUP_RENAME_TARGET) &&
+ !(flags & LOOKUP_SILLY_RENAME))
return NULL;

res = ERR_PTR(-ENOMEM);
@@ -2472,10 +2476,14 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
spin_lock(&dentry->d_lock);
if (d_count(dentry) > 1 && !test_bit(NFS_INO_PRESERVE_UNLINKED,
&NFS_I(d_inode(dentry))->flags)) {
+ struct dentry *silly;
+
spin_unlock(&dentry->d_lock);
/* Start asynchronous writeout of the inode */
write_inode_now(d_inode(dentry), 0);
- error = nfs_sillyrename(dir, dentry);
+ silly = nfs_sillyrename(dir, dentry);
+ error = PTR_ERR_OR_ZERO(silly);
+ nfs_sillyrename_finish(dir, silly);
goto out;
}
/* We must prevent any concurrent open until the unlink
@@ -2685,16 +2693,12 @@ int nfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,

spin_unlock(&new_dentry->d_lock);

- /* copy the target dentry's name */
- dentry = d_alloc(new_dentry->d_parent,
- &new_dentry->d_name);
- if (!dentry)
- goto out;
-
/* silly-rename the existing target ... */
- err = nfs_sillyrename(new_dir, new_dentry);
- if (err)
+ dentry = nfs_sillyrename(new_dir, new_dentry);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
goto out;
+ }

new_dentry = dentry;
new_inode = NULL;
@@ -2750,9 +2754,9 @@ int nfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
} else if (error == -ENOENT)
nfs_dentry_handle_enoent(old_dentry);

- /* new dentry created? */
if (dentry)
- dput(dentry);
+ nfs_sillyrename_finish(new_dir, dentry);
+
return error;
}
EXPORT_SYMBOL_GPL(nfs_rename);
diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 4da701fd1424..7133ca9433d2 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -1577,7 +1577,8 @@ struct file_system_type nfs_fs_type = {
.init_fs_context = nfs_init_fs_context,
.parameters = nfs_fs_parameters,
.kill_sb = nfs_kill_super,
- .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+ .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+ FS_PAR_DIR_UPDATE,
};
MODULE_ALIAS_FS("nfs");
EXPORT_SYMBOL_GPL(nfs_fs_type);
@@ -1589,7 +1590,8 @@ struct file_system_type nfs4_fs_type = {
.init_fs_context = nfs_init_fs_context,
.parameters = nfs_fs_parameters,
.kill_sb = nfs_kill_super,
- .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
+ .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
+ FS_PAR_DIR_UPDATE,
};
MODULE_ALIAS_FS("nfs4");
MODULE_ALIAS("nfs4");
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 27c720d71b4e..3a7fd30a8e29 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -611,7 +611,8 @@ extern struct rpc_task *
nfs_async_rename(struct inode *old_dir, struct inode *new_dir,
struct dentry *old_dentry, struct dentry *new_dentry,
void (*complete)(struct rpc_task *, struct nfs_renamedata *));
-extern int nfs_sillyrename(struct inode *dir, struct dentry *dentry);
+extern struct dentry *nfs_sillyrename(struct inode *dir, struct dentry *dentry);
+extern void nfs_sillyrename_finish(struct inode *dir, struct dentry *dentry);

/* direct.c */
void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 9697cd5d2561..c8a718f09fe6 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -428,6 +428,10 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
*
* The final cleanup is done during dentry_iput.
*
+ * We exchange the original with the new (silly) dentries, and return
+ * the new dentry which will have the original name. This ensures that
+ * the target name remains locked until the rename completes.
+ *
* (Note: NFSv4 is stateful, and has opens, so in theory an NFSv4 server
* could take responsibility for keeping open files referenced. The server
* would also need to ensure that opened-but-deleted files were kept over
@@ -436,7 +440,7 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
* use to advertise that it does this; some day we may take advantage of
* it.))
*/
-int
+struct dentry *
nfs_sillyrename(struct inode *dir, struct dentry *dentry)
{
static unsigned int sillycounter;
@@ -445,6 +449,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
struct dentry *sdentry;
struct inode *inode = d_inode(dentry);
struct rpc_task *task;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ struct path path = {};
int error = -EBUSY;

dfprintk(VFS, "NFS: silly-rename(%pd2, ct=%d)\n",
@@ -459,10 +465,11 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)

fileid = NFS_FILEID(d_inode(dentry));

+ path.dentry = d_find_alias(dir);
sdentry = NULL;
do {
int slen;
- dput(sdentry);
+
sillycounter++;
slen = scnprintf(silly, sizeof(silly),
SILLYNAME_PREFIX "%0*llx%0*x",
@@ -472,14 +479,19 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
dfprintk(VFS, "NFS: trying to rename %pd to %s\n",
dentry, silly);

- sdentry = lookup_one_len(silly, dentry->d_parent, slen);
- /*
- * N.B. Better to return EBUSY here ... it could be
- * dangerous to delete the file while it's in use.
- */
- if (IS_ERR(sdentry))
- goto out;
- } while (d_inode(sdentry) != NULL); /* need negative lookup */
+ sdentry = filename_create_one_len(silly, slen,
+ &path,
+ LOOKUP_CREATE | LOOKUP_EXCL |
+ LOOKUP_SILLY_RENAME,
+ &wq);
+ } while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST);
+ dput(path.dentry);
+ /*
+ * N.B. Better to return EBUSY here ... it could be
+ * dangerous to delete the file while it's in use.
+ */
+ if (IS_ERR(sdentry))
+ goto out;

ihold(inode);

@@ -513,7 +525,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NFS_INO_INVALID_CTIME |
NFS_INO_REVAL_FORCED);
spin_unlock(&inode->i_lock);
- d_move(dentry, sdentry);
+ d_exchange(dentry, sdentry);
break;
case -ERESTARTSYS:
/* The result of the rename is unknown. Play it safe by
@@ -524,7 +536,20 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
rpc_put_task(task);
out_dput:
iput(inode);
- dput(sdentry);
+ if (!error)
+ return sdentry;
+
+ d_lookup_done(sdentry);
+ __done_path_update(&path, sdentry, true, LOOKUP_SILLY_RENAME);
out:
- return error;
+ return ERR_PTR(error);
+}
+
+void nfs_sillyrename_finish(struct inode *dir, struct dentry *dentry)
+{
+ struct path path = { .dentry = d_find_alias(dir) };
+
+ if (!IS_ERR(dentry))
+ __done_path_update(&path, dentry, true, LOOKUP_SILLY_RENAME);
+ dput(path.dentry);
}


2022-08-26 02:26:38

by NeilBrown

[permalink] [raw]
Subject: [PATCH 08/10] NFSD: allow parallel creates from nfsd

Rather than getting write access, locking the directory, and performing
a lookup, use
filename_create_one_len() for create, or
lookup_hash_update_len() for delete, or
lock_rename_lookup_one() for rename
which combine all these steps and handle shared locking for concurrent
updates where appropriate.

Functions which call these need to allocate a wait_queue_head on the
stack, which will be used if the name does not yet exist in the dcache.
Any other thread that wants to look up the name will wait on this
wait_queue_head until the NFS operation completes.

As the directory lock is sometimes a shared lock, fh_fill_pre_attrs()
needs to determine whether the lock was exclusive, so that when only a
shared lock was held the attributes returned to an NFSv4 client can be
marked as not atomic.

Note that there is only one filesystem that will, in this patch series,
allow shared locks for create/unlink and that is NFS (for re-export).
NFS does not support atomic pre/post attributes, so there is no loss in
not providing them.

When some other filesystem supports concurrent updates, we might need to
consider if the pre/post attributes are more important than the
parallelism.
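
The resulting call pattern in the create paths is roughly as follows
(condensed from the nfsd3_create_file() change below, with error
handling trimmed):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct path path;
	struct dentry *child;

	path.dentry = fhp->fh_dentry;
	path.mnt = fhp->fh_export->ex_path.mnt;

	/* gets write access, takes a (possibly shared) lock on the
	 * parent, then looks up and locks the one name
	 */
	child = filename_create_one_len(argp->name, argp->len, &path, 0, &wq);
	if (IS_ERR(child))
		return nfserrno(PTR_ERR(child));

	/* ... create the file and set attributes ... */

	/* unlocks the name and the parent, drops write access */
	done_path_update(&path, child);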

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfs3proc.c | 28 +++-----
fs/nfsd/nfs4proc.c | 29 ++++-----
fs/nfsd/nfsfh.c | 9 +++
fs/nfsd/nfsproc.c | 29 ++++-----
fs/nfsd/vfs.c | 177 ++++++++++++++++++++--------------------------------
5 files changed, 112 insertions(+), 160 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index a41cca619338..05e7907940ea 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -234,13 +234,15 @@ static __be32
nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct svc_fh *resfhp, struct nfsd3_createargs *argp)
{
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct iattr *iap = &argp->attrs;
- struct dentry *parent, *child;
struct nfsd_attrs attrs = {
.na_iattr = iap,
};
__u32 v_mtime, v_atime;
+ struct dentry *child;
struct inode *inode;
+ struct path path;
__be32 status;
int host_err;

@@ -253,20 +255,15 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (status != nfs_ok)
return status;

- parent = fhp->fh_dentry;
- inode = d_inode(parent);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ inode = d_inode(path.dentry);

- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ child = filename_create_one_len(argp->name, argp->len,
+ &path, 0, &wq);

- inode_lock_nested(inode, I_MUTEX_PARENT);
-
- child = lookup_one_len(argp->name, parent, argp->len);
- if (IS_ERR(child)) {
- status = nfserrno(PTR_ERR(child));
- goto out;
- }
+ if (IS_ERR(child))
+ return nfserrno(PTR_ERR(child));

if (d_really_is_negative(child)) {
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
@@ -342,10 +339,7 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
status = nfsd_create_setattr(rqstp, fhp, resfhp, &attrs);

out:
- inode_unlock(inode);
- if (child && !IS_ERR(child))
- dput(child);
- fh_drop_write(fhp);
+ done_path_update(&path, child);
return status;
}

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a72ab97f77ef..78e20b8fb2b4 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -235,16 +235,17 @@ static __be32
nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct svc_fh *resfhp, struct nfsd4_open *open)
{
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct iattr *iap = &open->op_iattr;
struct nfsd_attrs attrs = {
.na_iattr = iap,
.na_seclabel = &open->op_label,
};
- struct dentry *parent, *child;
__u32 v_mtime, v_atime;
+ struct dentry *child;
struct inode *inode;
+ struct path path;
__be32 status;
- int host_err;

if (isdotent(open->op_fname, open->op_fnamelen))
return nfserr_exist;
@@ -254,23 +255,20 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_EXEC);
if (status != nfs_ok)
return status;
- parent = fhp->fh_dentry;
- inode = d_inode(parent);

- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;

if (is_create_with_attrs(open))
nfsd4_acl_to_attr(NF4REG, open->op_acl, &attrs);
+ inode = d_inode(path.dentry);

- inode_lock_nested(inode, I_MUTEX_PARENT);
+ child = filename_create_one_len(open->op_fname,
+ open->op_fnamelen,
+ &path, 0, &wq);

- child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
- if (IS_ERR(child)) {
- status = nfserrno(PTR_ERR(child));
- goto out;
- }
+ if (IS_ERR(child))
+ return nfserrno(PTR_ERR(child));

if (d_really_is_negative(child)) {
status = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
@@ -375,11 +373,8 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (attrs.na_aclerr)
open->op_bmval[0] &= ~FATTR4_WORD0_ACL;
out:
- inode_unlock(inode);
+ done_path_update(&path, child);
nfsd_attrs_free(&attrs);
- if (child && !IS_ERR(child))
- dput(child);
- fh_drop_write(fhp);
return status;
}

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index a5b71526cee0..5f024e6649d9 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -11,6 +11,7 @@
#include <linux/exportfs.h>

#include <linux/sunrpc/svcauth_gss.h>
+#include <linux/namei.h>
#include "nfsd.h"
#include "vfs.h"
#include "auth.h"
@@ -627,6 +628,14 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
if (fhp->fh_no_wcc || fhp->fh_pre_saved)
return;

+ if (!IS_PAR_UPDATE(fhp->fh_dentry->d_inode) &&
+ inode_trylock_shared(fhp->fh_dentry->d_inode)) {
+ /* only have a shared lock */
+ inode_unlock_shared(fhp->fh_dentry->d_inode);
+ fhp->fh_no_atomic_attr = true;
+ fhp->fh_no_wcc = true;
+ }
+
inode = d_inode(fhp->fh_dentry);
err = fh_getattr(fhp, &stat);
if (err) {
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 7381972f1677..e91aff4949e8 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -260,6 +260,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
{
struct nfsd_createargs *argp = rqstp->rq_argp;
struct nfsd_diropres *resp = rqstp->rq_resp;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
svc_fh *dirfhp = &argp->fh;
svc_fh *newfhp = &resp->fh;
struct iattr *attr = &argp->attrs;
@@ -268,8 +269,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
};
struct inode *inode;
struct dentry *dchild;
+ struct path path;
int type, mode;
- int hosterr;
dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);

dprintk("nfsd: CREATE %s %.*s\n",
@@ -285,26 +286,21 @@ nfsd_proc_create(struct svc_rqst *rqstp)
resp->status = nfserr_exist;
if (isdotent(argp->name, argp->len))
goto done;
- hosterr = fh_want_write(dirfhp);
- if (hosterr) {
- resp->status = nfserrno(hosterr);
- goto done;
- }

- inode_lock_nested(dirfhp->fh_dentry->d_inode, I_MUTEX_PARENT);
- dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
+ path.dentry = dirfhp->fh_dentry;
+ path.mnt = dirfhp->fh_export->ex_path.mnt;
+ dchild = filename_create_one_len(argp->name, argp->len, &path, 0, &wq);
if (IS_ERR(dchild)) {
resp->status = nfserrno(PTR_ERR(dchild));
- goto out_unlock;
+ goto done;
}
fh_init(newfhp, NFS_FHSIZE);
resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
if (!resp->status && d_really_is_negative(dchild))
resp->status = nfserr_noent;
- dput(dchild);
if (resp->status) {
if (resp->status != nfserr_noent)
- goto out_unlock;
+ goto out_done;
/*
* If the new file handle wasn't verified, we can't tell
* whether the file exists or not. Time to bail ...
@@ -313,7 +309,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
if (!newfhp->fh_dentry) {
printk(KERN_WARNING
"nfsd_proc_create: file handle not verified\n");
- goto out_unlock;
+ goto out_done;
}
}

@@ -347,7 +343,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
newfhp->fh_dentry,
NFSD_MAY_WRITE|NFSD_MAY_LOCAL_ACCESS);
if (resp->status && resp->status != nfserr_rofs)
- goto out_unlock;
+ goto out_done;
}
} else
type = S_IFREG;
@@ -384,7 +380,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
/* Make sure the type and device matches */
resp->status = nfserr_exist;
if (inode && inode_wrong_type(inode, type))
- goto out_unlock;
+ goto out_done;
}

resp->status = nfs_ok;
@@ -406,9 +402,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
(time64_t)0);
}

-out_unlock:
- inode_unlock(dirfhp->fh_dentry->d_inode);
- fh_drop_write(dirfhp);
+out_done:
+ done_path_update(&path, dchild);
done:
fh_put(dirfhp);
if (resp->status != nfs_ok)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9f486b788ed0..9b1e633977e0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1346,7 +1346,9 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
char *fname, int flen, struct nfsd_attrs *attrs,
int type, dev_t rdev, struct svc_fh *resfhp)
{
- struct dentry *dentry, *dchild = NULL;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ struct dentry *dchild = NULL;
+ struct path path;
__be32 err;
int host_err;

@@ -1357,33 +1359,24 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (err)
return err;

- dentry = fhp->fh_dentry;
-
- host_err = fh_want_write(fhp);
- if (host_err)
- return nfserrno(host_err);
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;

- inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
- dchild = lookup_one_len(fname, dentry, flen);
+ dchild = filename_create_one_len(fname, flen, &path, 0, &wq);
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild)) {
err = nfserrno(host_err);
- goto out_unlock;
+ goto out;
}
err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
- /*
- * We unconditionally drop our ref to dchild as fh_compose will have
- * already grabbed its own ref for it.
- */
- dput(dchild);
- if (err)
- goto out_unlock;
- fh_fill_pre_attrs(fhp);
- err = nfsd_create_locked(rqstp, fhp, fname, flen, attrs, type,
- rdev, resfhp);
- fh_fill_post_attrs(fhp);
-out_unlock:
- inode_unlock(dentry->d_inode);
+ if (!err) {
+ fh_fill_pre_attrs(fhp);
+ err = nfsd_create_locked(rqstp, fhp, fname, flen, attrs, type,
+ rdev, resfhp);
+ fh_fill_post_attrs(fhp);
+ }
+ done_path_update(&path, dchild);
+out:
return err;
}

@@ -1442,15 +1435,17 @@ nfsd_readlink(struct svc_rqst *rqstp, struct svc_fh *fhp, char *buf, int *lenp)
__be32
nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
char *fname, int flen,
- char *path, struct nfsd_attrs *attrs,
+ char *lpath, struct nfsd_attrs *attrs,
struct svc_fh *resfhp)
{
- struct dentry *dentry, *dnew;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ struct dentry *dnew;
+ struct path path;
__be32 err, cerr;
int host_err;

err = nfserr_noent;
- if (!flen || path[0] == '\0')
+ if (!flen || lpath[0] == '\0')
goto out;
err = nfserr_exist;
if (isdotent(fname, flen))
@@ -1460,33 +1455,27 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (err)
goto out;

- host_err = fh_want_write(fhp);
- if (host_err) {
- err = nfserrno(host_err);
- goto out;
- }
+ path.dentry = fhp->fh_dentry;
+ path.mnt = fhp->fh_export->ex_path.mnt;

- dentry = fhp->fh_dentry;
- inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
- dnew = lookup_one_len(fname, dentry, flen);
+ dnew = filename_create_one_len(fname, flen, &path, 0, &wq);
if (IS_ERR(dnew)) {
err = nfserrno(PTR_ERR(dnew));
- inode_unlock(dentry->d_inode);
- goto out_drop_write;
+ goto out;
}
fh_fill_pre_attrs(fhp);
- host_err = vfs_symlink(&init_user_ns, d_inode(dentry), dnew, path);
+ host_err = vfs_symlink(mnt_user_ns(path.mnt), d_inode(path.dentry),
+ dnew, lpath);
err = nfserrno(host_err);
cerr = fh_compose(resfhp, fhp->fh_export, dnew, fhp);
if (!err)
nfsd_create_setattr(rqstp, fhp, resfhp, attrs);
fh_fill_post_attrs(fhp);
- inode_unlock(dentry->d_inode);
+ done_path_update(&path, dnew);
if (!err)
err = nfserrno(commit_metadata(fhp));
- dput(dnew);
- if (err==0) err = cerr;
-out_drop_write:
+ if (err==0)
+ err = cerr;
fh_drop_write(fhp);
out:
return err;
@@ -1500,8 +1489,10 @@ __be32
nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
char *name, int len, struct svc_fh *tfhp)
{
- struct dentry *ddir, *dnew, *dold;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ struct dentry *dold, *dnew;
struct inode *dirp;
+ struct path path;
__be32 err;
int host_err;

@@ -1521,31 +1512,24 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (isdotent(name, len))
goto out;

- host_err = fh_want_write(tfhp);
- if (host_err) {
- err = nfserrno(host_err);
- goto out;
- }
-
- ddir = ffhp->fh_dentry;
- dirp = d_inode(ddir);
- inode_lock_nested(dirp, I_MUTEX_PARENT);
+ path.dentry = ffhp->fh_dentry;
+ path.mnt = ffhp->fh_export->ex_path.mnt;
+ dirp = d_inode(path.dentry);

- dnew = lookup_one_len(name, ddir, len);
+ dnew = filename_create_one_len(name, len, &path, 0, &wq);
if (IS_ERR(dnew)) {
err = nfserrno(PTR_ERR(dnew));
- goto out_unlock;
+ goto out;
}

dold = tfhp->fh_dentry;

err = nfserr_noent;
if (d_really_is_negative(dold))
- goto out_dput;
+ goto out_done;
fh_fill_pre_attrs(ffhp);
host_err = vfs_link(dold, &init_user_ns, dirp, dnew, NULL);
- fh_fill_post_attrs(ffhp);
- inode_unlock(dirp);
+
if (!host_err) {
err = nfserrno(commit_metadata(ffhp));
if (!err)
@@ -1556,17 +1540,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
else
err = nfserrno(host_err);
}
- dput(dnew);
-out_drop_write:
- fh_drop_write(tfhp);
+out_done:
+ fh_fill_post_attrs(ffhp);
+ done_path_update(&path, dnew);
out:
return err;
-
-out_dput:
- dput(dnew);
-out_unlock:
- inode_unlock(dirp);
- goto out_drop_write;
}

static void
@@ -1602,6 +1580,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
__be32 err;
int host_err;
bool close_cached = false;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_REMOVE);
if (err)
@@ -1627,40 +1606,36 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
goto out;
}

- trap = lock_rename(tdentry, fdentry);
+ trap = lock_rename_lookup_one(tdentry, fdentry, &ndentry, &odentry,
+ tname, tlen, fname, flen, 0, 0, &wq);
+ host_err = PTR_ERR(trap);
+ if (IS_ERR(trap))
+ goto out_nfserr;
fh_fill_pre_attrs(ffhp);
fh_fill_pre_attrs(tfhp);

- odentry = lookup_one_len(fname, fdentry, flen);
- host_err = PTR_ERR(odentry);
- if (IS_ERR(odentry))
- goto out_nfserr;
-
host_err = -ENOENT;
if (d_really_is_negative(odentry))
- goto out_dput_old;
+ goto out_unlock;
host_err = -EINVAL;
if (odentry == trap)
- goto out_dput_old;
+ goto out_unlock;

- ndentry = lookup_one_len(tname, tdentry, tlen);
- host_err = PTR_ERR(ndentry);
- if (IS_ERR(ndentry))
- goto out_dput_old;
host_err = -ENOTEMPTY;
if (ndentry == trap)
- goto out_dput_new;
+ goto out_unlock;

host_err = -EXDEV;
if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
- goto out_dput_new;
+ goto out_unlock;
if (ffhp->fh_export->ex_path.dentry != tfhp->fh_export->ex_path.dentry)
- goto out_dput_new;
+ goto out_unlock;

if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
nfsd_has_cached_files(ndentry)) {
close_cached = true;
- goto out_dput_old;
+ dget(ndentry);
+ goto out_unlock;
} else {
struct renamedata rd = {
.old_mnt_userns = &init_user_ns,
@@ -1677,18 +1652,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = commit_metadata(ffhp);
}
}
- out_dput_new:
- dput(ndentry);
- out_dput_old:
- dput(odentry);
- out_nfserr:
- err = nfserrno(host_err);
-
if (!close_cached) {
fh_fill_post_attrs(ffhp);
fh_fill_post_attrs(tfhp);
}
- unlock_rename(tdentry, fdentry);
+ out_unlock:
+ unlock_rename_lookup(tdentry, fdentry, ndentry, odentry, true);
+ out_nfserr:
+ err = nfserrno(host_err);
fh_drop_write(ffhp);

/*
@@ -1715,9 +1686,11 @@ __be32
nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
char *fname, int flen)
{
- struct dentry *dentry, *rdentry;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+ struct dentry *rdentry;
struct inode *dirp;
struct inode *rinode;
+ struct path path;
__be32 err;
int host_err;

@@ -1728,24 +1701,16 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (err)
goto out;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ path.dentry = fhp->fh_dentry;

- dentry = fhp->fh_dentry;
- dirp = d_inode(dentry);
- inode_lock_nested(dirp, I_MUTEX_PARENT);
+ rdentry = lookup_hash_update_len(fname, flen, &path, 0, &wq);
+ dirp = d_inode(path.dentry);

- rdentry = lookup_one_len(fname, dentry, flen);
host_err = PTR_ERR(rdentry);
if (IS_ERR(rdentry))
- goto out_unlock;
+ goto out_nfserr;

- if (d_really_is_negative(rdentry)) {
- dput(rdentry);
- host_err = -ENOENT;
- goto out_unlock;
- }
rinode = d_inode(rdentry);
ihold(rinode);

@@ -1762,14 +1727,11 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
}
fh_fill_post_attrs(fhp);

- inode_unlock(dirp);
+ done_path_update(&path, rdentry);
if (!host_err)
host_err = commit_metadata(fhp);
- dput(rdentry);
iput(rinode); /* truncate the inode here */

-out_drop_write:
- fh_drop_write(fhp);
out_nfserr:
if (host_err == -EBUSY) {
/* name is mounted-on. There is no perfect
@@ -1784,9 +1746,6 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
}
out:
return err;
-out_unlock:
- inode_unlock(dirp);
- goto out_drop_write;
}

/*


2022-08-26 14:53:35

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH/RFC 00/10 v5] Improve scalability of directory operations

>>>>> "NeilBrown" == NeilBrown <[email protected]> writes:

NeilBrown> [I made up "v5" - I haven't been counting]

My first comments, but I'm not a serious developer...

NeilBrown> VFS currently holds an exclusive lock on the directory while making
NeilBrown> changes: add, remove, rename.
NeilBrown> When multiple threads make changes in the one directory, the contention
NeilBrown> can be noticeable.
NeilBrown> In the case of NFS with a high latency link, this can easily be
NeilBrown> demonstrated. NFS doesn't really need VFS locking as the server ensures
NeilBrown> correctness.

NeilBrown> Lustre uses a single(?) directory for object storage, and has patches
NeilBrown> for ext4 to support concurrent updates (Lustre accesses ext4 directly,
NeilBrown> not via the VFS).

NeilBrown> XFS (it is claimed) doesn't its own locking and doesn't need the VFS to
NeilBrown> help at all.

This sentence makes no sense to me... I assume you meant to say "...does
its own locking..."

NeilBrown> This patch series allows filesystems to request a shared lock on
NeilBrown> directories and provides serialisation on just the affected name, not the
NeilBrown> whole directory. It changes both the VFS and NFSD to use shared locks
NeilBrown> when appropriate, and changes NFS to request shared locks.

Are there any performance results? Why wouldn't we just do a shared
lock across all VFS-based filesystems?

NeilBrown> The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
NeilBrown> which acts as a bit-lock. The ->d_lock spinlock is taken to set/clear
NeilBrown> it, and wait_var_event() is used for waiting. This flag is set on all
NeilBrown> dentries that are part of a directory update, not just when a shared
NeilBrown> lock is taken.

NeilBrown> When a shared lock is taken we must use alloc_dentry_parallel() which
NeilBrown> needs a wq which must remain until the update is completed. To make use
NeilBrown> of concurrent create, kern_path_create() would need to be passed a wq.
NeilBrown> Rather than the churn required for that, we use exclusive locking when
NeilBrown> no wq is provided.

Is this a per-operation wq or a per-directory wq? Can there be issues
if someone does something silly like having 1,000 directories, all of
which have multiple processes making parallel changes?

Does it degrade gracefully if a wq can't be allocated?

NeilBrown> One interesting consequence of this is that silly-rename becomes a
NeilBrown> little more complex. As the directory may not be exclusively locked,
NeilBrown> the new silly-name needs to be locked (DCACHE_PAR_UPDATE) as well.
NeilBrown> A new LOOKUP_SILLY_RENAME is added which helps implement this using
NeilBrown> common code.

NeilBrown> While testing I found some odd behaviour that was caused by
NeilBrown> d_revalidate() racing with rename(). To resolve this I used
NeilBrown> DCACHE_PAR_UPDATE to ensure they cannot race any more.

NeilBrown> Testing, review, or other comments would be most welcome,

2022-08-26 15:44:53

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH 10/10] NFS: support parallel updates in the one directory.

>>>>> "NeilBrown" == NeilBrown <[email protected]> writes:

NeilBrown> NFS can easily support parallel updates as the locking is done on the
NeilBrown> server, so this patch enables parallel updates for NFS.

Just curious: how is this handled if I have a server with an EXT#
filesystem which is exported via NFS to multiple clients? If those
NFS clients are doing lots of changes in a single directory, I can see
how the NFS server handles that, but what if at the same time someone
makes multiple changes on the server in that EXT# filesystem behind
NFSD's back? How are the conflicts handled?

It would seem that this all really needs to purely happen at the VFS
layer, but I'm probably missing something.

I ask this because my home server exports /home to a couple of other
systems via NFS, but I still work in /home on the server.


NeilBrown> Handling of silly-rename - both for unlink and for the rename target -
NeilBrown> requires some care as we need to get an exclusive lock on the chosen
NeilBrown> silly name and for rename we need to keep the original target name
NeilBrown> locked after it has been renamed to the silly name.

NeilBrown> So nfs_sillyrename() now uses d_exchange() to swap the target and the
NeilBrown> silly name after the silly-rename has happened on the server, and the
NeilBrown> silly dentry - which now has the name of the target - is returned.

NeilBrown> For unlink(), this is immediately unlocked and discarded with a call to
NeilBrown> nfs_sillyrename_finish(). For rename it is kept locked until the
NeilBrown> originally requested rename completes.

NeilBrown> Signed-off-by: NeilBrown <[email protected]>
NeilBrown> ---
NeilBrown> fs/dcache.c | 5 ++++-
NeilBrown> fs/nfs/dir.c | 28 ++++++++++++++++------------
NeilBrown> fs/nfs/fs_context.c | 6 ++++--
NeilBrown> fs/nfs/internal.h | 3 ++-
NeilBrown> fs/nfs/unlink.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
NeilBrown> 5 files changed, 64 insertions(+), 29 deletions(-)

NeilBrown> diff --git a/fs/dcache.c b/fs/dcache.c
NeilBrown> index 9bf346a9de52..a5eaab16d39f 100644
NeilBrown> --- a/fs/dcache.c
NeilBrown> +++ b/fs/dcache.c
NeilBrown> @@ -3056,7 +3056,9 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
NeilBrown> write_seqlock(&rename_lock);

NeilBrown> WARN_ON(!dentry1->d_inode);
NeilBrown> - WARN_ON(!dentry2->d_inode);
NeilBrown> + /* Allow dentry2 to be negative, so we can do a rename
NeilBrown> + * but keep both names locked (DCACHE_PAR_UPDATE)
NeilBrown> + */
NeilBrown> WARN_ON(IS_ROOT(dentry1));
NeilBrown> WARN_ON(IS_ROOT(dentry2));

NeilBrown> @@ -3064,6 +3066,7 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)

NeilBrown> write_sequnlock(&rename_lock);
NeilBrown> }
NeilBrown> +EXPORT_SYMBOL(d_exchange);

NeilBrown> /**
NeilBrown> * d_ancestor - search for an ancestor
NeilBrown> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
NeilBrown> index 5d6c2ddc7ea6..fbb608fbe6bf 100644
NeilBrown> --- a/fs/nfs/dir.c
NeilBrown> +++ b/fs/nfs/dir.c
NeilBrown> @@ -1935,8 +1935,12 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
NeilBrown> /*
NeilBrown> * If we're doing an exclusive create, optimize away the lookup
NeilBrown> * but don't hash the dentry.
NeilBrown> + * A silly_rename is marked exclusive, but we need to do an
NeilBrown> + * explicit lookup.
NeilBrown> */
NeilBrown> - if (nfs_is_exclusive_create(dir, flags) || flags & LOOKUP_RENAME_TARGET)
NeilBrown> + if ((nfs_is_exclusive_create(dir, flags) ||
NeilBrown> + flags & LOOKUP_RENAME_TARGET) &&
NeilBrown> + !(flags & LOOKUP_SILLY_RENAME))
NeilBrown> return NULL;

NeilBrown> res = ERR_PTR(-ENOMEM);
NeilBrown> @@ -2472,10 +2476,14 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
NeilBrown> spin_lock(&dentry->d_lock);
NeilBrown> if (d_count(dentry) > 1 && !test_bit(NFS_INO_PRESERVE_UNLINKED,
NeilBrown> &NFS_I(d_inode(dentry))->flags)) {
NeilBrown> + struct dentry *silly;
NeilBrown> +
NeilBrown> spin_unlock(&dentry->d_lock);
NeilBrown> /* Start asynchronous writeout of the inode */
NeilBrown> write_inode_now(d_inode(dentry), 0);
NeilBrown> - error = nfs_sillyrename(dir, dentry);
NeilBrown> + silly = nfs_sillyrename(dir, dentry);
NeilBrown> + error = PTR_ERR_OR_ZERO(silly);
NeilBrown> + nfs_sillyrename_finish(dir, silly);
NeilBrown> goto out;
NeilBrown> }
NeilBrown> /* We must prevent any concurrent open until the unlink
NeilBrown> @@ -2685,16 +2693,12 @@ int nfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,

NeilBrown> spin_unlock(&new_dentry->d_lock);

NeilBrown> - /* copy the target dentry's name */
NeilBrown> - dentry = d_alloc(new_dentry->d_parent,
NeilBrown> - &new_dentry->d_name);
NeilBrown> - if (!dentry)
NeilBrown> - goto out;
NeilBrown> -
NeilBrown> /* silly-rename the existing target ... */
NeilBrown> - err = nfs_sillyrename(new_dir, new_dentry);
NeilBrown> - if (err)
NeilBrown> + dentry = nfs_sillyrename(new_dir, new_dentry);
NeilBrown> + if (IS_ERR(dentry)) {
NeilBrown> + err = PTR_ERR(dentry);
NeilBrown> goto out;
NeilBrown> + }

NeilBrown> new_dentry = dentry;
NeilBrown> new_inode = NULL;
NeilBrown> @@ -2750,9 +2754,9 @@ int nfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
NeilBrown> } else if (error == -ENOENT)
NeilBrown> nfs_dentry_handle_enoent(old_dentry);

NeilBrown> - /* new dentry created? */
NeilBrown> if (dentry)
NeilBrown> - dput(dentry);
NeilBrown> + nfs_sillyrename_finish(new_dir, dentry);
NeilBrown> +
NeilBrown> return error;
NeilBrown> }
NeilBrown> EXPORT_SYMBOL_GPL(nfs_rename);
NeilBrown> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
NeilBrown> index 4da701fd1424..7133ca9433d2 100644
NeilBrown> --- a/fs/nfs/fs_context.c
NeilBrown> +++ b/fs/nfs/fs_context.c
NeilBrown> @@ -1577,7 +1577,8 @@ struct file_system_type nfs_fs_type = {
NeilBrown> .init_fs_context = nfs_init_fs_context,
NeilBrown> .parameters = nfs_fs_parameters,
NeilBrown> .kill_sb = nfs_kill_super,
NeilBrown> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
NeilBrown> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
NeilBrown> + FS_PAR_DIR_UPDATE,
NeilBrown> };
NeilBrown> MODULE_ALIAS_FS("nfs");
NeilBrown> EXPORT_SYMBOL_GPL(nfs_fs_type);
NeilBrown> @@ -1589,7 +1590,8 @@ struct file_system_type nfs4_fs_type = {
NeilBrown> .init_fs_context = nfs_init_fs_context,
NeilBrown> .parameters = nfs_fs_parameters,
NeilBrown> .kill_sb = nfs_kill_super,
NeilBrown> - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
NeilBrown> + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
NeilBrown> + FS_PAR_DIR_UPDATE,
NeilBrown> };
NeilBrown> MODULE_ALIAS_FS("nfs4");
NeilBrown> MODULE_ALIAS("nfs4");
NeilBrown> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
NeilBrown> index 27c720d71b4e..3a7fd30a8e29 100644
NeilBrown> --- a/fs/nfs/internal.h
NeilBrown> +++ b/fs/nfs/internal.h
NeilBrown> @@ -611,7 +611,8 @@ extern struct rpc_task *
NeilBrown> nfs_async_rename(struct inode *old_dir, struct inode *new_dir,
NeilBrown> struct dentry *old_dentry, struct dentry *new_dentry,
NeilBrown> void (*complete)(struct rpc_task *, struct nfs_renamedata *));
NeilBrown> -extern int nfs_sillyrename(struct inode *dir, struct dentry *dentry);
NeilBrown> +extern struct dentry *nfs_sillyrename(struct inode *dir, struct dentry *dentry);
NeilBrown> +extern void nfs_sillyrename_finish(struct inode *dir, struct dentry *dentry);

NeilBrown> /* direct.c */
NeilBrown> void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
NeilBrown> diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
NeilBrown> index 9697cd5d2561..c8a718f09fe6 100644
NeilBrown> --- a/fs/nfs/unlink.c
NeilBrown> +++ b/fs/nfs/unlink.c
NeilBrown> @@ -428,6 +428,10 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
NeilBrown> *
NeilBrown> * The final cleanup is done during dentry_iput.
NeilBrown> *
NeilBrown> + * We exchange the original with the new (silly) dentries, and return
NeilBrown> + * the new dentry which will have the original name. This ensures that
NeilBrown> + * the target name remains locked until the rename completes.
NeilBrown> + *
NeilBrown> * (Note: NFSv4 is stateful, and has opens, so in theory an NFSv4 server
NeilBrown> * could take responsibility for keeping open files referenced. The server
NeilBrown> * would also need to ensure that opened-but-deleted files were kept over
NeilBrown> @@ -436,7 +440,7 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
NeilBrown> * use to advertise that it does this; some day we may take advantage of
NeilBrown> * it.))
NeilBrown> */
NeilBrown> -int
NeilBrown> +struct dentry *
NeilBrown> nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NeilBrown> {
NeilBrown> static unsigned int sillycounter;
NeilBrown> @@ -445,6 +449,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NeilBrown> struct dentry *sdentry;
NeilBrown> struct inode *inode = d_inode(dentry);
NeilBrown> struct rpc_task *task;
NeilBrown> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
NeilBrown> + struct path path = {};
NeilBrown> int error = -EBUSY;

NeilBrown> dfprintk(VFS, "NFS: silly-rename(%pd2, ct=%d)\n",
NeilBrown> @@ -459,10 +465,11 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)

NeilBrown> fileid = NFS_FILEID(d_inode(dentry));

NeilBrown> + path.dentry = d_find_alias(dir);
NeilBrown> sdentry = NULL;
NeilBrown> do {
NeilBrown> int slen;
NeilBrown> - dput(sdentry);
NeilBrown> +
NeilBrown> sillycounter++;
NeilBrown> slen = scnprintf(silly, sizeof(silly),
NeilBrown> SILLYNAME_PREFIX "%0*llx%0*x",
NeilBrown> @@ -472,14 +479,19 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NeilBrown> dfprintk(VFS, "NFS: trying to rename %pd to %s\n",
NeilBrown> dentry, silly);

NeilBrown> - sdentry = lookup_one_len(silly, dentry->d_parent, slen);
NeilBrown> - /*
NeilBrown> - * N.B. Better to return EBUSY here ... it could be
NeilBrown> - * dangerous to delete the file while it's in use.
NeilBrown> - */
NeilBrown> - if (IS_ERR(sdentry))
NeilBrown> - goto out;
NeilBrown> - } while (d_inode(sdentry) != NULL); /* need negative lookup */
NeilBrown> + sdentry = filename_create_one_len(silly, slen,
NeilBrown> + &path,
NeilBrown> + LOOKUP_CREATE | LOOKUP_EXCL |
NeilBrown> + LOOKUP_SILLY_RENAME,
NeilBrown> + &wq);
NeilBrown> + } while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST);
NeilBrown> + dput(path.dentry);
NeilBrown> + /*
NeilBrown> + * N.B. Better to return EBUSY here ... it could be
NeilBrown> + * dangerous to delete the file while it's in use.
NeilBrown> + */
NeilBrown> + if (IS_ERR(sdentry))
NeilBrown> + goto out;

NeilBrown> ihold(inode);

NeilBrown> @@ -513,7 +525,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NeilBrown> NFS_INO_INVALID_CTIME |
NeilBrown> NFS_INO_REVAL_FORCED);
NeilBrown> spin_unlock(&inode->i_lock);
NeilBrown> - d_move(dentry, sdentry);
NeilBrown> + d_exchange(dentry, sdentry);
NeilBrown> break;
NeilBrown> case -ERESTARTSYS:
NeilBrown> /* The result of the rename is unknown. Play it safe by
NeilBrown> @@ -524,7 +536,20 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
NeilBrown> rpc_put_task(task);
NeilBrown> out_dput:
NeilBrown> iput(inode);
NeilBrown> - dput(sdentry);
NeilBrown> + if (!error)
NeilBrown> + return sdentry;
NeilBrown> +
NeilBrown> + d_lookup_done(sdentry);
NeilBrown> + __done_path_update(&path, sdentry, true, LOOKUP_SILLY_RENAME);
NeilBrown> out:
NeilBrown> - return error;
NeilBrown> + return ERR_PTR(error);
NeilBrown> +}
NeilBrown> +
NeilBrown> +void nfs_sillyrename_finish(struct inode *dir, struct dentry *dentry)
NeilBrown> +{
NeilBrown> + struct path path = { .dentry = d_find_alias(dir) };
NeilBrown> +
NeilBrown> + if (!IS_ERR(dentry))
NeilBrown> + __done_path_update(&path, dentry, true, LOOKUP_SILLY_RENAME);
NeilBrown> + dput(path.dentry);
NeilBrown> }


2022-08-26 19:13:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <[email protected]> wrote:
>
> If a filesystem supports parallel modification in directories, it sets
> FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new
> lookup_hash_update() notice the flag and take a shared lock on the
> directory, and rely on a lock-bit in d_flags, much like parallel lookup
> relies on DCACHE_PAR_LOOKUP.

Ugh.

I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
updates" being important, but I *despise* locking code like this

+ if (wq && IS_PAR_UPDATE(dir))
+ inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(dir, I_MUTEX_PARENT);

and I really really hope there's some better model for this.

That "wq" test in particular is just absolutely disgusting. So now it
doesn't just depend on whether the directory supports parallel
updates, now the caller can choose whether to do the parallel thing or
not, and that's how "create" is different from "rename".

And that last difference is, I think, the one that makes me go "No. HELL NO".

Instead of it being up to the filesystem to say "I can do parallel
creates, but I need to serialize renames", this whole thing has been
set up to be about the caller making that decision.

That's just feels horribly horribly wrong.

Yes, I realize that to you that's just a "later patches will do
renames", but what if it really is a filesystem issue where the
filesystem can easily handle new names, but needs something else for
renames because it has various cross-directory issues, perhaps?

So I feel this is fundamentally wrong, and this ugliness is a small
effect of that wrongness.

I think we should strive for

(a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

(b) aim for the inode lock being taken *after* the _lookup_hash(),
since the VFS layer side has to be able to handle the concurrency on
the dcache side anyway

(c) at that point, the serialization really ends up being about the
call into the filesystem, and aim to simply move the
inode_lock[_shared]_nested() into the filesystem so that there's no
need for a flag and related conditional locking at all.

Because right now I think the main reason we cannot move the lock into
the filesystem is literally that we've made the lock cover not just
the filesystem part, but the "lookup and create dentry" part too.

But once you have that "DCACHE_PAR_LOOKUP" bit and the
d_alloc_parallel() logic to serialize a _particular_ dentry being
created (as opposed to serializing all the sleeping ops to that
directory), I really think we should strive to move the locking - that
no longer helps the VFS dcache layer - closer to the filesystem call
and eventually into it.

Please? I think these patches are "mostly going in the right
direction", but at the same time I feel like there's some serious
mis-design going on.

Linus

2022-08-26 23:13:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sat, 27 Aug 2022, Linus Torvalds wrote:
> On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <[email protected]> wrote:
> >
> > If a filesystem supports parallel modification in directories, it sets
> > FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new
> > lookup_hash_update() notice the flag and take a shared lock on the
> > directory, and rely on a lock-bit in d_flags, much like parallel lookup
> > relies on DCACHE_PAR_LOOKUP.
>
> Ugh.

Thanks :-) - no, really - thanks for the high-level review!

>
> I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
> updates" being important, but I *despise* locking code like this
>
> + if (wq && IS_PAR_UPDATE(dir))
> + inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> + else
> + inode_lock_nested(dir, I_MUTEX_PARENT);
>
> and I really really hope there's some better model for this.
>
> That "wq" test in particular is just absolutely disgusting. So now it
> doesn't just depend on whether the directory supports parallel
> updates, now the caller can choose whether to do the parallel thing or
> not, and that's how "create" is different from "rename".

As you note, by the end of the series "create" is no more different
from "rename" than it already is. I only broke up the patches to make
review more manageable.

The "wq" can be removed. There are two options.
One is to change every kern_path_create() or user_path_create() caller
to pass in a wq. Then we can assume that a wq is always available.
There are about a dozen of these calls, so not an enormous change, but
one that I didn't want to think about just yet. I could add a patch at
the front of the series which did this.

The alternative is to never pass in a wq for create operations, and to use
var_waitqueue() (or something similar) to provide a global shared wait
queue (which is essentially what I am using to wait for
DCACHE_PAR_UPDATE to clear).
The more I think about it, the more I think this is the best way
forward. Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how
to measure how much contention there is on these shared queues.
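
For example, the fallback could be as small as (sketch only -
__var_waitqueue() is the existing helper behind wait_var_event(),
and "parent" is just whatever pointer we choose to hash on):

	/* no wq supplied: fall back to the global hashed table */
	if (!wq)
		wq = __var_waitqueue(parent);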

>
> And that last difference is, I think, the one that makes me go "No. HELL NO".
>
> Instead of it being up to the filesystem to say "I can do parallel
> creates, but I need to serialize renames", this whole thing has been
> set up to be about the caller making that decision.

I think that is a misunderstanding. The caller isn't making a decision
- except the IS_PAR_UPDATE() test which is simply acting on the fs
request. What you are seeing is a misguided attempt to leave in place
some existing interfaces which assumed exclusive locking and didn't
provide wqs.

>
> That's just feels horribly horribly wrong.
>
> Yes, I realize that to you that's just a "later patches will do
> renames", but what if it really is a filesystem issue where the
> filesystem can easily handle new names, but needs something else for
> renames because it has various cross-directory issues, perhaps?

Obviously a filesystem can add its own locking - and they will have to,
though at a finer grain than the VFS can provide.


>
> So I feel this is fundamentally wrong, and this ugliness is a small
> effect of that wrongness.
>
> I think we should strive for
>
> (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I
realised that you wouldn't like that :-)

>
> (b) aim for the inode lock being taken *after* the _lookup_hash(),
> since the VFS layer side has to be able to handle the concurrency on
> the dcache side anyway

I think you are suggesting that we change ->lookup call to NOT
require i_rwsem be held. That is not a small change.
I agree that it makes sense in the long term. Getting there .... won't
be as quick as I'd hoped.

>
> (c) at that point, the serialization really ends up being about the
> call into the filesystem, and aim to simply move the
> inode_lock{_shared]_nested() into the filesystem so that there's no
> need for a flag and related conditional locking at all.

It might be nice to take a shared lock in VFS, and let the FS upgrade it
to exclusive if needed, but we don't have upgrade_read() ... maybe it
would be deadlock-prone.

>
> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.
>
> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.
>
> Please? I think these patches are "mostly going in the right
> direction", but at the same time I feel like there's some serious
> mis-design going on.

Hmmm.... I'll dig more deeply into ->lookup and see if I can understand
the locking well enough to feel safe removing i_rwsem from it.

Thanks,
NeilBrown

2022-08-26 23:16:02

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 10/10] NFS: support parallel updates in the one directory.

On Sat, 27 Aug 2022, John Stoffel wrote:
> >>>>> "NeilBrown" == NeilBrown <[email protected]> writes:
>
> NeilBrown> NFS can easily support parallel updates as the locking is done on the
> NeilBrown> server, so this patch enables parallel updates for NFS.
>
> Just curious, how is this handled if I have a server with an EXT#
> filesystem which is exported via NFS to multiple clients. If those
> NFS clients are doing lots of changes in a single directory, I can see
> how the NFS server handles that, but what about if at the same time
> someone does multiple changes on the server in that EXT# behind the
> NFSd's back, how are the conflicts handled?

Currently, an exclusive lock is taken by the VFS or NFSD before any
create request gets to the EXT# filesystem. That doesn't change (yet).

When multiple NFS clients try to create files in a directory, the nfsd
threads handling those requests will be serialized by the exclusive
lock.

Before the patch, if there are multiple threads on a single NFS client,
then they are all serialized on the client, and then again on the
server.
With my patch, the threads on the one client are NOT serialized, but
once the requests get to the server they will be. So an exclusive lock
is only held for a smaller part of the total time.

>
> It would seem that this all really needs to purely happen at the VFS
> layer, but I'm probably missing something.

Serialization has to happen somewhere. It doesn't have to be at the VFS
layer. Doing locking at the VFS layer is easy and slow. Doing it deep
in the filesystem is difficult and fast. This exercise is really just
pushing locking deeper into the filesystem.

>
> I ask this because my home server exports /home to a couple of other
> systems via NFS, but I still work in /home on the server.
>

Thanks,
NeilBrown

2022-08-27 00:00:54

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC 00/10 v5] Improve scalability of directory operations

On Sat, 27 Aug 2022, John Stoffel wrote:
> >>>>> "NeilBrown" == NeilBrown <[email protected]> writes:
>
> NeilBrown> [I made up "v5" - I haven't been counting]
>
> My first comments, but I'm not a serious developer...
>
> NeilBrown> VFS currently holds an exclusive lock on the directory while making
> NeilBrown> changes: add, remove, rename.
> NeilBrown> When multiple threads make changes in the one directory, the contention
> NeilBrown> can be noticeable.
> NeilBrown> In the case of NFS with a high latency link, this can easily be
> NeilBrown> demonstrated. NFS doesn't really need VFS locking as the server ensures
> NeilBrown> correctness.
>
> NeilBrown> Lustre uses a single(?) directory for object storage, and has patches
> NeilBrown> for ext4 to support concurrent updates (Lustre accesses ext4 directly,
> NeilBrown> not via the VFS).
>
> NeilBrown> XFS (it is claimed) doesn't its own locking and doesn't need the VFS to
> NeilBrown> help at all.
>
> This sentence makes no sense to me... I assume you meant to say "...does
> it's own locking..."

Thanks - you are correct. "does its own locking".

>
> NeilBrown> This patch series allows filesystems to request a shared lock on
> NeilBrown> directories and provides serialisation on just the affected name, not the
> NeilBrown> whole directory. It changes both the VFS and NFSD to use shared locks
> NeilBrown> when appropriate, and changes NFS to request shared locks.
>
> Are there any performance results? Why wouldn't we just do a shared
> locked across all VFS based filesystems?

Daire Byrne has done some tests with NFS clients to an NFS server which
re-exports mounts from another server - so there are a couple of levels
of locking that can be removed. At least one of these levels has
significant network latency (100ms or so, I think). The results are much
what you would expect. Many more file creations per second are
possible. 15 creates-per-second up to 121 creates-per-second in one
test.
https://lore.kernel.org/linux-nfs/CAPt2mGNjWXad6e7nSUTu=0ez1qU1wBNegrntgHKm5hOeBs5gQA@mail.gmail.com/


>
> NeilBrown> The central enabling feature is a new dentry flag DCACHE_PAR_UPDATE
> NeilBrown> which acts as a bit-lock. The ->d_lock spinlock is taken to set/clear
> NeilBrown> it, and wait_var_event() is used for waiting. This flag is set on all
> NeilBrown> dentries that are part of a directory update, not just when a shared
> NeilBrown> lock is taken.
>
> NeilBrown> When a shared lock is taken we must use alloc_dentry_parallel() which
> NeilBrown> needs a wq which must remain until the update is completed. To make use
> NeilBrown> of concurrent create, kern_path_create() would need to be passed a wq.
> NeilBrown> Rather than the churn required for that, we use exclusive locking when
> NeilBrown> no wq is provided.
>
> Is this a per-operation wq or a per-directory wq? Can there be issues
> if someone does something silly like having 1,000 directories, all of
> which have multiple processes making parallel changes?

It is per-operation, though I expect to change that to be taken from a
pool of shared wait queues.

Wait queues can be shared quite cheaply. There is spin-lock contention
when multiple threads add/remove waiters to/from the queues. Having
more queues in a pool than cores, and randomly selecting queues from the
pool can keep that under control.
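
As a rough sketch of what such a pool could look like (names, sizing,
and the use of a hash of the dentry rather than random selection are all
just illustrative):

    #include <linux/dcache.h>
    #include <linux/hash.h>
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/wait.h>

    #define PAR_UPDATE_WQ_BITS 8    /* more queues than cores on most boxes */

    static wait_queue_head_t par_update_wqs[1 << PAR_UPDATE_WQ_BITS];

    /* Pick a shared wait queue for this dentry.  Collisions only cost
     * spurious wakeups, never correctness. */
    static wait_queue_head_t *par_update_wq(const struct dentry *dentry)
    {
            return &par_update_wqs[hash_ptr(dentry, PAR_UPDATE_WQ_BITS)];
    }

    static int __init par_update_wq_init(void)
    {
            int i;

            for (i = 0; i < ARRAY_SIZE(par_update_wqs); i++)
                    init_waitqueue_head(&par_update_wqs[i]);
            return 0;
    }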

If there are dozens of waiters or more, then a wakeup might run more
slowly (and hold the lock for longer), but in this case wakeup should be
rare.

Most filesystem operations are uncontended at the name level. e.g. it is
rare that two threads will try to create the same name at the same time,
or that one thread looks up a name that another is unlinking. These are the only
times that wakeups would happen, so sharing a pool among all filesystem
accesses is unlikely to be a problem.

>
> Does it degrade gracefully if a wq can't be allocated?

In the current code, the wq is allocated on the stack. I'm probably
going to change to a global allocation. In either case, there is no
risk of allocation failure during a filesystem operation.

Thanks for the questions,
NeilBrown


>
> NeilBrown> One interesting consequence of this is that silly-rename becomes a
> NeilBrown> little more complex. As the directory may not be exclusively locked,
> NeilBrown> the new silly-name needs to be locked (DCACHE_PAR_UPDATE) as well.
> NeilBrown> A new LOOKUP_SILLY_RENAME is added which helps implement this using
> NeilBrown> common code.
>
> NeilBrown> While testing I found some odd behaviour that was caused by
> NeilBrown> d_revalidate() racing with rename(). To resolve this I used
> NeilBrown> DCACHE_PAR_UPDATE to ensure they cannot race any more.
>
> NeilBrown> Testing, review, or other comments would be most welcome,
>
>

2022-08-27 00:15:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <[email protected]> wrote:
>
> As you note, by the end of the series "create" is not more different
> from "rename" than it already is. I only broke up the patches to make
> review more manageable.

Yes, I understand. But I'm saying that maybe a filesystem actually
might want to treat them differently.

That said, the really nasty part was that 'wq' thing that meant that
different paths had different directory locking not because of
low-level filesystem issues, but because of caller issues.

So that's the one I _really_ disliked, and that I don't think should
exist even as a partial first step.

The "tie every operation together with one flag" I can live with, in
case it turns out that yes, that one flag is all anybody ever really
wants.

> Alternate option is to never pass in a wq for create operation, and use
> var_waitqueue() (or something similar) to provide a global shared wait
> queue (which is essentially what I am using to wait for
> DCACHE_PAR_UPDATE to clear).

I _think_ this is what I would prefer.

I say that I _think_ I prefer that, because maybe there are issues
with it. But since you basically do that DCACHE_PAR_UPDATE thing
anyway, and it's one of the main users of this var_waitqueue, it feels
right to me.

But then if it just ends up not working well for some practical
reason, at that point maybe I'd just say "I was wrong, I thought it
would work, but it's better to spread it out to be a per-thread
wait-queue on the stack".

IOW, my preference would be to simply just try it, knowing that you
*can* do the "pass explicit wait-queue down" thing if we need to.

Hmm?

> > Instead of it being up to the filesystem to say "I can do parallel
> > creates, but I need to serialize renames", this whole thing has been
> > set up to be about the caller making that decision.
>
> I think that is a misunderstanding. The caller isn't making a decision
> - except the IS_PAR_UPDATE() test which is simply acting on the fs
> request. What you are seeing is a misguided attempt to leave in place
> some existing interfaces which assumed exclusive locking and didn't
> provide wqs.

Ok. I still would prefer to have unified locking, not that "do this
for one filesystem, do that for another" conditional one.

> > (b) aim for the inode lock being taken *after* the _lookup_hash(),
> > since the VFS layer side has to be able to handle the concurrency on
> > the dcache side anyway
>
> I think you are suggesting that we change ->lookup call to NOT
> require i_rwsem be held.

Yes and no.

One issue for me is that with your change as-is, for the 99% of all
people who don't happen to use NFS, the inode lock gives all that VFS
code mutual exclusion.

Take that lookup_hash_update() function as a practical case: all the
*common* filesystems will be running with that function basically 100%
serialized per directory, because they'll be doing that

inode_lock_nested(dir);
...
inode_unlock(dir);

around it all.

At the same time, all that code is supposed to work even *without* the
lock, because once it's an IS_PAR_UPDATE() filesystem, there's
effectively no locking at all. What exclusive directory locks even
remain at that point?

IOW, to me it feels like you are trying to make the code go towards a
future with basically no locking at all as far as the VFS layer is
concerned (because once all the directory modifications take the inode
lock as shared, *all* the inode locking is shared, and is basically a
no-op).

BUT you are doing so while not having most people even *test* that situation.

See what I'm trying to say (but possibly expressing very badly)?

So I feel like if the VFS code cannot rely on locking *anyway* in the
general case, and should work without it, then we really shouldn't
have any locking around any of the VFS operations.

The logical conclusion of that would be to push it all down into the
filesystem (possibly with the help of a coccinelle script).

Now it doesn't have to go that far - at least not initially - but I do
think we should at least make sure that as much as possible of the
actual VFS code sees that "new world order" of no directory locking,
so that that situation gets *tested* as widely as possible.

> That is not a small change.

Now, that I agree with. I guess we won't get there soon (or ever). But
see above what I dislike about the directory locking model change.

> It might be nice to take a shared lock in VFS, and let the FS upgrade it
> to exclusive if needed, but we don't have upgrade_read() ... maybe it
> would be deadlock-prone.

Yes, upgrading a read lock is fundamentally impossible and will
deadlock trivially (think just two readers that both want to do the
upgrade - they'll block each other from doing so).

So it's not actually a possible operation.

Linus

2022-08-27 00:28:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote:

> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.

How about rename loop prevention? mount vs. rmdir? d_splice_alias()
fun on tree reconnects?

> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.

See above.

2022-08-27 01:35:09

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 09/10] VFS: add LOOKUP_SILLY_RENAME

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> When performing a "silly rename" to avoid removing a file that is still
> open, we need to perform a lookup in a directory that is already locked.
>
> In order to allow common functions to be used for this lookup, introduce
> LOOKUP_SILLY_RENAME which affirms that the directory is already locked
> and that the vfsmnt is already writable.
>
> When LOOKUP_SILLY_RENAME is set, path->mnt can be NULL. As
> i_op->rename() doesn't make the vfsmnt available, this is unavoidable.
> So we ensure that a NULL ->mnt isn't fatal.

This one is really disgusting. Flag-dependent locking is a pretty much
guaranteed source of PITA and "magical" struct path is, again, asking for
trouble.

You seem to be trying for simpler call graph and you end up paying with
control flow that is much harder to reason about.

2022-08-27 03:55:42

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:

> +/**
> + * d_lock_update_nested - lock a dentry before updating
> + * @dentry: the dentry to be locked
> + * @base: the parent, or %NULL
> + * @name: the name in that parent, or %NULL
> + * @subclass: lockdep locking class.
> + *
> + * Lock a dentry in a directory on which a shared-lock may be held, and
> + * on which parallel updates are permitted.
> + * If the base and name are given, then on success the dentry will still
> + * have that base and name - it will not have raced with rename.
> + * On success, a positive dentry will still be hashed, ensuring there
> + * was no race with unlink.
> + * If they are not given, then on success the dentry will be negative,
> + * which again ensures no race with rename, or unlink.

I'm not sure it's a good idea to have that in one function, TBH.
Looking at the callers, there are
* lookup_hash_update()
lookup_hash_update_len()
nfsd shite
filename_create_one()
filename_create_one_len()
nfsd shite
filename_create()
kern_path_create()
user_path_create()
do_mknodat()
do_mkdirat()
do_symlinkat()
do_linkat()
do_rmdir()
do_unlinkat()
* fuckloads of callers in lock_rename_lookup()
* d_lock_update()
atomic_open()
lookup_open()

Only the last two can get NULL base or name. Already interesting,
isn't it? What's more, splitup between O_CREATE open() on one
side and everything else that might create, remove or rename on
the other looks really weird.

> + rcu_read_lock(); /* for d_same_name() */
> + if (d_unhashed(dentry) && d_is_positive(dentry)) {
> + /* name was unlinked while we waited */
> + ret = false;

BTW, what happens if somebody has ->lookup() returning a positive
unhashed? Livelock on attempt to hit it with any directory-modifying
syscall? Right now such behaviour is permitted; I don't know if
anything in the tree actually does it, but at the very least it
would need to be documented.

Note that *negative* unhashed is not just permitted, it's
actively used e.g. by autofs. That really needs to be well
commented...

> + } else if (base &&
> + (dentry->d_parent != base ||
> + dentry->d_name.hash != name->hash ||
> + !d_same_name(dentry, base, name))) {
> + /* dentry was renamed - possibly silly-rename */
> + ret = false;
> + } else if (!base && d_is_positive(dentry)) {
> + ret = false;
> + } else {
> + dentry->d_flags |= DCACHE_PAR_UPDATE;
> + }

> + * Parent directory has inode locked exclusive, or possibly shared if wq
> + * is given. In the later case the fs has explicitly allowed concurrent
> + * updates in this directory. This is the one and only case
> + * when ->lookup() may be called on a non in-lookup dentry.

What Linus already said about wq... To add a reason he hadn't mentioned,
the length of call chain one needs to track to figure out whether it's
NULL or not is... excessive. And I don't mean just "greater than 0".
We have places like that, and sometimes we have to, but it's never a good
thing...

> static struct dentry *__lookup_hash(const struct qstr *name,
> - struct dentry *base, unsigned int flags)
> + struct dentry *base, unsigned int flags,
> + wait_queue_head_t *wq)

> - dentry = d_alloc(base, name);
> + if (wq)
> + dentry = d_alloc_parallel(base, name, wq);
> + else
> + dentry = d_alloc(base, name);
> if (unlikely(!dentry))
> return ERR_PTR(-ENOMEM);
> + if (IS_ERR(dentry))
> + return dentry;

BTW, considering the fact that we have 12 callers of d_alloc()
(including this one) and 28 callers of its wrapper (d_alloc_name()), I
would seriously consider converting d_alloc() from "NULL or new dentry"
to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of
those callers will be happier that way. Grep and you'll see... As a
side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).
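
I.e. the hunk above would collapse to something like this (sketch only,
assuming d_alloc() gained that ERR_PTR convention):

    dentry = wq ? d_alloc_parallel(base, name, wq)
                : d_alloc(base, name);
    if (IS_ERR(dentry))
            return dentry;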

> +static struct dentry *lookup_hash_update(
> + const struct qstr *name,
> + struct dentry *base, unsigned int flags,
> + wait_queue_head_t *wq)
> +{
> + struct dentry *dentry;
> + struct inode *dir = base->d_inode;
> + int err;
> +
> + if (wq && IS_PAR_UPDATE(dir))
> + inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> + else
> + inode_lock_nested(dir, I_MUTEX_PARENT);
> +
> +retry:
> + dentry = __lookup_hash(name, base, flags, wq);
> + if (IS_ERR(dentry)) {
> + err = PTR_ERR(dentry);
> + goto out_err;
> + }
> + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> + /*
> + * Failed to get lock due to race with unlink or rename
> + * - try again
> + */
> + d_lookup_done(dentry);

When would we get out of __lookup_hash() with in-lookup dentry?
Confused...

> +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> + struct path *path, unsigned int flags,

const struct path *, damnit...

> + wait_queue_head_t *wq)
> +{
> + struct qstr this;
> + int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> + path->dentry, nlen, &this);
> + if (err)
> + return ERR_PTR(err);
> + return lookup_hash_update(&this, path->dentry, flags, wq);
> +}
> +EXPORT_SYMBOL(lookup_hash_update_len);

Frankly, the calling conventions of the "..._one_len" family is something
I've kept regretting for a long time. Oh, well...

> +static void done_path_update(struct path *path, struct dentry *dentry,
> + bool with_wq)
> +{
> + struct inode *dir = path->dentry->d_inode;
> +
> + d_lookup_done(dentry);
> + d_unlock_update(dentry);
> + if (IS_PAR_UPDATE(dir) && with_wq)
> + inode_unlock_shared(dir);
> + else
> + inode_unlock(dir);
> +}

const struct path *, again...

> @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> dentry = res;
> }
> }
> + /*
> + * If dentry is negative and this is a create we need to get
> + * DCACHE_PAR_UPDATE.
> + */
> + if ((open_flag & O_CREAT) && !dentry->d_inode)
> + have_par_update = d_lock_update(dentry, NULL, NULL);
>
> /* Negative dentry, just create the file */
> if (!dentry->d_inode && (open_flag & O_CREAT)) {

Fold the above here, please. What's more, losing the flag would've
made it much easier to follow...

> @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> error = create_error;
> goto out_dput;
> }
> + if (have_par_update)
> + d_unlock_update(dentry);
> return dentry;
>
> out_dput:
> + if (have_par_update)
> + d_unlock_update(dentry);


> @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
> struct path *path, unsigned int lookup_flags)

BTW, there's 9 callers of that sucker in the entire tree, along with
3 callers of user_path_create() and 14 callers of done_path_create().
Not a big deal to add the wq in those, especially since it can be easily
split into preparatory patch (with wq passed, but being unused).

> -void done_path_create(struct path *path, struct dentry *dentry)
> +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)

Why "with_wq" and not the wq itself?

> - * The caller must hold dir->i_mutex.
> + * The caller must either hold a write-lock on dir->i_rwsem, or
> + * a have atomically set DCACHE_PAR_UPDATE, or both.

???

> + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.

The latter happens unconditionally here, unless I'm misreading the code (as well
as your cover letter). It does *NOT* happen on rename(), though, contrary to
the same. And while your later patch adds it in lock_rename_lookup(), existing
lock_rename() callers do not get that at all. Likely to be a problem...

> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -13,7 +13,9 @@
> #include <linux/rcupdate.h>
> #include <linux/lockref.h>
> #include <linux/stringhash.h>
> +#include <linux/sched.h>

Bloody hell, man...

> +static inline void d_unlock_update(struct dentry *dentry)
> +{
> + if (IS_ERR_OR_NULL(dentry))
> + return;

Do explain... When could we ever get NULL or ERR_PTR() passed to that?


BTW, I would seriously look into splitting the "let's add a helper
that combines locking parent with __lookup_hash()" into a preliminary
patch. Would be easier to follow.

2022-08-27 03:56:01

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 03/10] VFS: move want_write checks into lookup_hash_update()

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> mnt_want_write() is always called before lookup_hash_update(), so
> we can simplify the code by moving the call into that function.
>
> If lookup_hash_update() succeeds, it now will have claimed the
> want_write lock. If it fails, the want_write lock isn't held.
>
> To allow this, lookup_hash_update() now receives a 'struct path *'
> instead of 'struct dentry *'
>
> Note that when creating a name, any error from mnt_want_write() does not
> get reported unless there is no other error. For unlink/rmdir though,
> an error from mnt_want_write() is immediately fatal - overriding ENOENT
> for example. This behaviour seems strange, but this patch is careful to
> preserve it.

Would be a nice idea, if not for the mess with LOOKUP_SILLY_RENAME later
in the series...

2022-08-27 04:24:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 06/10] VFS: support concurrent renames.

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> An object can now be renamed from or to a directory in which a create
> or unlink is concurrently happening.
>
> Two or more renames within the one directory can also be concurrent.
> s_vfs_rename_mutex still serialises lookups for cross-directory renames,
> but the renames themselves can proceed concurrently.

Wha...? <checks>
Not true, fortunately - you *do* hold ->s_vfs_rename_mutex over the
rename itself. If not for that, it would be utterly broken.
And I don't care for NFS server rejecting that - we are *NOT* taking
loop prevention logics into every filesystem. It's highly non-local
and trying to handle it with your per-dentry flags is going to be
painful as hell, if at all possible.

> + if (d1 < d2) {
> + ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT);
> + ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT2);
> + } else {
> + ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT);
> + ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT2);
> + }

Explain, please. What's that ordering about?

2022-08-27 04:40:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/10] NFSD: allow parallel creates from nfsd

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:

> if (is_create_with_attrs(open))
> nfsd4_acl_to_attr(NF4REG, open->op_acl, &attrs);
> + inode = d_inode(path.dentry);
>
> - inode_lock_nested(inode, I_MUTEX_PARENT);
> + child = filename_create_one_len(open->op_fname,
> + open->op_fnamelen,
> + &path, 0, &wq);
>
> - child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
> - if (IS_ERR(child)) {
> - status = nfserrno(PTR_ERR(child));
> - goto out;
> - }
> + if (IS_ERR(child))
> + return nfserrno(PTR_ERR(child));

Leaks acls, by the look of it?

> + if (!IS_PAR_UPDATE(fhp->fh_dentry->d_inode) &&
> + inode_trylock_shared(fhp->fh_dentry->d_inode)) {
> + /* only have a shared lock */
> + inode_unlock_shared(fhp->fh_dentry->d_inode);
> + fhp->fh_no_atomic_attr = true;
> + fhp->fh_no_wcc = true;

Er... Shouldn't that be IS_PAR_UPDATE() && ... ?

2022-08-27 21:33:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Fri, Aug 26, 2022 at 05:13:38PM -0700, Linus Torvalds wrote:
> On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <[email protected]> wrote:
> >
> > As you note, by the end of the series "create" is not more different
> > from "rename" than it already is. I only broke up the patches to make
> > review more manageable.
>
> Yes, I understand. But I'm saying that maybe a filesystem actually
> might want to treat them differently.
>
> That said, the really nasty part was that 'wq' thing that meant that
> different paths had different directory locking not because of
> low-level filesystem issues, but because of caller issues.
>
> So that's the one I _really_ disliked, and that I don't think should
> exist even as a partial first step.
>
> The "tie every operation together with one flag" I can live with, in
> case it turns out that yes, that one flag is all anybody ever really
> wants.

FWIW, what's really missing is the set of rules describing what the
methods can expect from their arguments.

Things like "oh, we can safely use ->d_parent here - we know that
foo_rmdir(dir, child) is called only with dir held exclusive and
child that had been observed to be a child of dentry alias of
dir after dir had been locked, while all places that might change
child->d_parent will be doing that only with child->d_parent->d_inode
held at least shared" rely upon the current locking scheme.

Change that 'held exclusive' to 'held shared' and we need something
different, presumably 'this new bitlock on the child is held by the caller'.
That's nice, but... What's to guarantee that we won't be hit by
__d_unalias()? It won't care about the bitlock on existing alias,
would it? And it only holds the old parent shared, so...

My comments had been along the lines of "doing that would make the
series easier to reason about"; I don't hate the approach, but
* in the current form it's hard to read; there might be
problems I hadn't even noticed yet
* it's much easier to verify that stated assertions are
guaranteed by the callers and sufficient for safety of callees
if they *ARE* stated. Spelling them out is on the patch series
authors, and IME doing that helps a lot when writing a series
like that. At least on the level of internal notes... Especially
since NFS is... special (or, as they say in New York, "sophisticated" -
sorry). There's plenty of things that are true for it, but do
not hold for filesystems in general. And without explicitly
spelled-out warranties it's very easy to end up with a mess that
would be hell to apply to other filesystems. I really don't want
to see an explosion of cargo-culted logics that might or might
not remain valid for NFS by the time it gets copied around...


2022-08-29 02:11:12

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
>
> > +/**
> > + * d_lock_update_nested - lock a dentry before updating
> > + * @dentry: the dentry to be locked
> > + * @base: the parent, or %NULL
> > + * @name: the name in that parent, or %NULL
> > + * @subclass: lockdep locking class.
> > + *
> > + * Lock a dentry in a directory on which a shared-lock may be held, and
> > + * on which parallel updates are permitted.
> > + * If the base and name are given, then on success the dentry will still
> > + * have that base and name - it will not have raced with rename.
> > + * On success, a positive dentry will still be hashed, ensuring there
> > + * was no race with unlink.
> > + * If they are not given, then on success the dentry will be negative,
> > + * which again ensures no race with rename, or unlink.
>
> I'm not sure it's a good idea to have that in one function, TBH.
> Looking at the callers, there are
> * lookup_hash_update()
> lookup_hash_update_len()
> nfsd shite
> filename_create_one()
> filename_create_one_len()
> nfsd shite
> filename_create()
> kern_path_create()
> user_path_create()
> do_mknodat()
> do_mkdirat()
> do_symlinkat()
> do_linkat()
> do_rmdir()
> do_unlinkat()
> * fuckloads of callers in lock_rename_lookup()
> * d_lock_update()
> atomic_open()
> lookup_open()
>
> Only the last two can get NULL base or name. Already interesting,
> isn't it? What's more, splitup between O_CREATE open() on one
> side and everything else that might create, remove or rename on
> the other looks really weird.

Well, O_CREATE is a bit weird.
But I can see that it would be cleaner to pass the dir/name into those
two calls that currently get NULL - then remove the handling of NULL.
Thanks.

>
> > + rcu_read_lock(); /* for d_same_name() */
> > + if (d_unhashed(dentry) && d_is_positive(dentry)) {
> > + /* name was unlinked while we waited */
> > + ret = false;
>
> BTW, what happens if somebody has ->lookup() returning a positive
> unhashed? Livelock on attempt to hit it with any directory-modifying
> syscall? Right now such behaviour is permitted; I don't know if
> anything in the tree actually does it, but at the very least it
> would need to be documented.
>
> Note that *negative* unhashed is not just permitted, it's
> actively used e.g. by autofs. That really needs to be well
> commented...

I hadn't thought that ->lookup() could return anything unhashed. I'll
look into that - thanks.

>
> > + } else if (base &&
> > + (dentry->d_parent != base ||
> > + dentry->d_name.hash != name->hash ||
> > + !d_same_name(dentry, base, name))) {
> > + /* dentry was renamed - possibly silly-rename */
> > + ret = false;
> > + } else if (!base && d_is_positive(dentry)) {
> > + ret = false;
> > + } else {
> > + dentry->d_flags |= DCACHE_PAR_UPDATE;
> > + }
>
> > + * Parent directory has inode locked exclusive, or possibly shared if wq
> > + * is given. In the later case the fs has explicitly allowed concurrent
> > + * updates in this directory. This is the one and only case
> > + * when ->lookup() may be called on a non in-lookup dentry.
>
> What Linus already said about wq... To add a reason he hadn't mentioned,
> the length of call chain one needs to track to figure out whether it's
> NULL or not is... excessive. And I don't mean just "greater than 0".
> We have places like that, and sometimes we have to, but it's never a good
> thing...
>
> > static struct dentry *__lookup_hash(const struct qstr *name,
> > - struct dentry *base, unsigned int flags)
> > + struct dentry *base, unsigned int flags,
> > + wait_queue_head_t *wq)
>
> > - dentry = d_alloc(base, name);
> > + if (wq)
> > + dentry = d_alloc_parallel(base, name, wq);
> > + else
> > + dentry = d_alloc(base, name);
> > if (unlikely(!dentry))
> > return ERR_PTR(-ENOMEM);
> > + if (IS_ERR(dentry))
> > + return dentry;
>
> BTW, considering the fact that we have 12 callers of d_alloc()
> (including this one) and 28 callers of its wrapper (d_alloc_name()), I
> would seriously consider converting d_alloc() from "NULL or new dentry"
> to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of
> those callers will be happier that way. Grep and you'll see... As a
> side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).
>
> > +static struct dentry *lookup_hash_update(
> > + const struct qstr *name,
> > + struct dentry *base, unsigned int flags,
> > + wait_queue_head_t *wq)
> > +{
> > + struct dentry *dentry;
> > + struct inode *dir = base->d_inode;
> > + int err;
> > +
> > + if (wq && IS_PAR_UPDATE(dir))
> > + inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> > + else
> > + inode_lock_nested(dir, I_MUTEX_PARENT);
> > +
> > +retry:
> > + dentry = __lookup_hash(name, base, flags, wq);
> > + if (IS_ERR(dentry)) {
> > + err = PTR_ERR(dentry);
> > + goto out_err;
> > + }
> > + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> > + /*
> > + * Failed to get lock due to race with unlink or rename
> > + * - try again
> > + */
> > + d_lookup_done(dentry);
>
> When would we get out of __lookup_hash() with in-lookup dentry?
> Confused...

Whenever wq is passed in and ->lookup() decides, based on the flags, to do
nothing.
NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET

>
> > +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> > + struct path *path, unsigned int flags,
>
> const struct path *, damnit...

Sorry....

>
> > + wait_queue_head_t *wq)
> > +{
> > + struct qstr this;
> > + int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> > + path->dentry, nlen, &this);
> > + if (err)
> > + return ERR_PTR(err);
> > + return lookup_hash_update(&this, path->dentry, flags, wq);
> > +}
> > +EXPORT_SYMBOL(lookup_hash_update_len);
>
> Frankly, the calling conventions of the "..._one_len" family is something
> I've kept regretting for a long time. Oh, well...
>
> > +static void done_path_update(struct path *path, struct dentry *dentry,
> > + bool with_wq)
> > +{
> > + struct inode *dir = path->dentry->d_inode;
> > +
> > + d_lookup_done(dentry);
> > + d_unlock_update(dentry);
> > + if (IS_PAR_UPDATE(dir) && with_wq)
> > + inode_unlock_shared(dir);
> > + else
> > + inode_unlock(dir);
> > +}
>
> const struct path *, again...
>
> > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > dentry = res;
> > }
> > }
> > + /*
> > + * If dentry is negative and this is a create we need to get
> > + * DCACHE_PAR_UPDATE.
> > + */
> > + if ((open_flag & O_CREAT) && !dentry->d_inode)
> > + have_par_update = d_lock_update(dentry, NULL, NULL);
> >
> > /* Negative dentry, just create the file */
> > if (!dentry->d_inode && (open_flag & O_CREAT)) {
>
> Fold the above here, please. What's more, losing the flag would've
> made it much easier to follow...

Yes, that can certainly be tidied up - thanks.

>
> > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > error = create_error;
> > goto out_dput;
> > }
> > + if (have_par_update)
> > + d_unlock_update(dentry);
> > return dentry;
> >
> > out_dput:
> > + if (have_par_update)
> > + d_unlock_update(dentry);
>
>
> > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
> > struct path *path, unsigned int lookup_flags)
>
> BTW, there's 9 callers of that sucker in the entire tree, along with
> 3 callers of user_path_create() and 14 callers of done_path_create().
> Not a big deal to add the wq in those, especially since it can be easily
> split into preparatory patch (with wq passed, but being unused).
>
> > -void done_path_create(struct path *path, struct dentry *dentry)
> > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
>
> Why "with_wq" and not the wq itself?
>

I did that at first. However, when I did silly-rename in NFS I found
that I wanted to call the 'done' thing when I no longer had the wq.
...which might mean I have a bug.

And the done_path_create_wq() doesn't do anything with the wq so ...

I'll look at that again - thanks.


> > - * The caller must hold dir->i_mutex.
> > + * The caller must either hold a write-lock on dir->i_rwsem, or
> > + * a have atomically set DCACHE_PAR_UPDATE, or both.
>
> ???

That's a hangover from an earlier version where I didn't set
DCACHE_PAR_UPDATE when we had an exclusive lock. Will fix.


>
> > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.
>
> The latter happens unconditionally here, unless I'm misreading the code (as well
> as your cover letter). It does *NOT* happen on rename(), though, contrary to
> the same. And while your later patch adds it in lock_rename_lookup(), existing
> lock_rename() callers do not get that at all. Likely to be a problem...

Between the patch where DCACHE_PAR_UPDATE is added and the patch where
lock_rename_lookup() is added, all filesystems use exclusive locks and
none of them check DCACHE_PAR_UPDATE. So how can there be a problem?


>
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -13,7 +13,9 @@
> > #include <linux/rcupdate.h>
> > #include <linux/lockref.h>
> > #include <linux/stringhash.h>
> > +#include <linux/sched.h>
>
> Bloody hell, man...
>

I wonder what that was for .... removed.


> > +static inline void d_unlock_update(struct dentry *dentry)
> > +{
> > + if (IS_ERR_OR_NULL(dentry))
> > + return;
>
> Do explain... When could we ever get NULL or ERR_PTR() passed to that?

Another hangover from earlier iterations - removed. Thanks.

>
>
> BTW, I would seriously look into splitting the "let's add a helper
> that combines locking parent with __lookup_hash()" into a preliminary
> patch. Would be easier to follow.
>
Will look into that.

Thanks a lot for the thorough review!

NeilBrown

2022-08-29 03:11:34

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 06/10] VFS: support concurrent renames.

On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> > An object can now be renamed from or to a directory in which a create
> > or unlink is concurrently happening.
> >
> > Two or more renames within the one directory can also be concurrent.
> > s_vfs_rename_mutex still serialises lookups for cross-directory renames,
> > but the renames themselves can proceed concurrently.
>
> Wha...? <checks>
> Not true, fortunately - you *do* hold ->s_vfs_rename_mutex over the
> rename itself. If not for that, it would be utterly broken.
> And I don't care for NFS server rejecting that - we are *NOT* taking
> loop prevention logics into every filesystem. It's highly non-local
> and trying to handle it with your per-dentry flags is going to be
> painful as hell, if at all possible.
>

I don't know what happened there - I let myself get confused somewhere
in the process. You are of course right that s_vfs_rename_mutex is held
the whole time. I wasn't intending to try to change that.

> > + if (d1 < d2) {
> > + ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT);
> > + ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT2);
> > + } else {
> > + ok2 = d_lock_update_nested(d2, p2, last2, I_MUTEX_PARENT);
> > + ok1 = d_lock_update_nested(d1, p1, last1, I_MUTEX_PARENT2);
> > + }
>
> Explain, please. What's that ordering about?
>
Deadlock avoidance, just like in the same-directory case.

But I guess as s_vfs_rename_mutex is held, ordering cannot matter.
I'll remove the ordering.
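
(For the record, the intent was the usual ABBA-avoidance pattern: take
the two locks in a fixed order, e.g. by dentry address, so that two
tasks working on the same pair agree on which to take first. A generic
sketch of that pattern, not the patch code:)

    #include <linux/dcache.h>
    #include <linux/kernel.h>
    #include <linux/spinlock.h>

    /* Lock two distinct dentries' ->d_lock in address order so that any
     * two tasks locking the same pair cannot deadlock against each other. */
    static void lock_two_d_locks(struct dentry *d1, struct dentry *d2)
    {
            if (d1 > d2)
                    swap(d1, d2);
            spin_lock(&d1->d_lock);
            spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
    }

    static void unlock_two_d_locks(struct dentry *d1, struct dentry *d2)
    {
            spin_unlock(&d1->d_lock);
            spin_unlock(&d2->d_lock);
    }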

Thanks,
NeilBrown

2022-08-29 03:28:53

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 08/10] NFSD: allow parallel creates from nfsd

On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
>
> > if (is_create_with_attrs(open))
> > nfsd4_acl_to_attr(NF4REG, open->op_acl, &attrs);
> > + inode = d_inode(path.dentry);
> >
> > - inode_lock_nested(inode, I_MUTEX_PARENT);
> > + child = filename_create_one_len(open->op_fname,
> > + open->op_fnamelen,
> > + &path, 0, &wq);
> >
> > - child = lookup_one_len(open->op_fname, parent, open->op_fnamelen);
> > - if (IS_ERR(child)) {
> > - status = nfserrno(PTR_ERR(child));
> > - goto out;
> > - }
> > + if (IS_ERR(child))
> > + return nfserrno(PTR_ERR(child));
>
> Leaks acls, by the look of it?

Yep - I think that fell through a crack when I reordered patches to get
some clean-ups into nfsd before this repost.

>
> > + if (!IS_PAR_UPDATE(fhp->fh_dentry->d_inode) &&
> > + inode_trylock_shared(fhp->fh_dentry->d_inode)) {
> > + /* only have a shared lock */
> > + inode_unlock_shared(fhp->fh_dentry->d_inode);
> > + fhp->fh_no_atomic_attr = true;
> > + fhp->fh_no_wcc = true;
>
> Er... Shouldn't that be IS_PAR_UPDATE() && ... ?
>
Definitely. Thanks!

NeilBrown

2022-08-29 03:40:58

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 09/10] VFS: add LOOKUP_SILLY_RENAME

On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
> > When performing a "silly rename" to avoid removing a file that is still
> > open, we need to perform a lookup in a directory that is already locked.
> >
> > In order to allow common functions to be used for this lookup, introduce
> > LOOKUP_SILLY_RENAME which affirms that the directory is already locked
> > and that the vfsmnt is already writable.
> >
> > When LOOKUP_SILLY_RENAME is set, path->mnt can be NULL. As
> > i_op->rename() doesn't make the vfsmnt available, this is unavoidable.
> > So we ensure that a NULL ->mnt isn't fatal.
>
> This one is really disgusting. Flag-dependent locking is a pretty much
> guaranteed source of PITA and "magical" struct path is, again, asking for
> trouble.
>
> You seem to be trying for simpler call graph and you end up paying with
> control flow that is much harder to reason about.
>
It was mostly about avoiding code duplication.
I'll see if I can find a cleaner way.

Thanks,
NeilBrown

2022-09-01 00:35:50

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:06:55PM -0700, Linus Torvalds wrote:
>
> > Because right now I think the main reason we cannot move the lock into
> > the filesystem is literally that we've made the lock cover not just
> > the filesystem part, but the "lookup and create dentry" part too.
>
> How about rename loop prevention? mount vs. rmdir? d_splice_alias()
> fun on tree reconnects?

Thanks for this list.

I think the "mount vs. rmdir" usage of inode_lock() is independent of
the usage for directory operations, so we can change the latter as much
as we like without materially affecting the former.

The lock we take on the directory being removed DOES ensure no new
objects are linked into the directory, so for that reason we still need
at least a shared lock when adding links to a directory.
Moving that lock into the filesystem would invert the locking order in
rmdir between the child being removed and the parent being locked. That
would require some consideration.

d_splice_alias() happens at ->lookup time so it is already under a
shared lock. I don't see that it depends on i_rwsem - it uses i_lock
for the important locking.

Rename loop prevention is largely managed by s_vfs_rename_mutex. Once
that is taken, nothing can be moved to a different directory. That
means 'trap' will keep any relationship it had to new_path and old_path.
It could be renamed within its parent, but as long as it isn't removed
the comparisons with old_dentry and new_dentry should still be reliable.
As 'trap' clearly isn't empty, we trust that the filesystem won't allow
an rmdir to succeed.

What have I missed?

Thanks,
NeilBrown

>
> > But once you have that "DCACHE_PAR_LOOKUP" bit and the
> > d_alloc_parallel() logic to serialize a _particular_ dentry being
> > created (as opposed to serializing all the sleeping ops to that
> > directly), I really think we should strive to move the locking - that
> > no longer helps the VFS dcache layer - closer to the filesystem call
> > and eventually into it.
>
> See above.
>

2022-09-01 03:51:06

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Thu, Sep 01, 2022 at 10:31:10AM +1000, NeilBrown wrote:

> Thanks for this list.

Keep in mind that this list is just off the top of my head - it's
nowhere near complete.

> d_splice_alias() happens at ->lookup time so it is already under a
> shared lock. I don't see that it depends on i_rwsem - it uses i_lock
> for the important locking.

Nope.

Process 1:
rmdir foo/bar
foo found and locked exclusive [*]
dentry of bar found
->rmdir() instance called
Process 2:
stat foo/splat
foo found and locked shared [*]
dentry of splat does not exist anywhere
dentry allocated, marked in-lookup
->lookup() instance called
inode found and passed to d_splice_alias() ...
... which finds that it's a directory inode ...
... and foo/bar refers to it. E.g. it's on NFS and another
client has just done mv bar splat
__d_unalias() is called, to try and move existing alias (foo/bar)
into the right place. It sees that no change of parent is involved,
so it can just proceed to __d_move().
Process 1:
forms an rmdir request to the server, using ->d_name (and possibly
->d_parent) of dentry of foo/bar. It knows that ->d_name is stable,
since the caller holds foo locked exclusive and all callers of __d_move()
hold the old parent at least shared.

In mainline, process 2 will block (or, in case it deals with a different
parent, try to grab the old parent of the existing alias shared and fail
with -ESTALE). With your changes process 1 will be holding
foo/ locked shared, so process 2 will succeed and proceed to __d_move(),
right under the nose of process 1 accessing ->d_name. If the names involved
had been longer than 32 characters, it would risk accessing kfreed memory.
Or fetching the length from old name and pointer from new one, walking
past the end of kmalloc'ed object, etc.

Sure, assuming that we are talking about NFS, server would have probably
failed the RMDIR request - if you managed to form that request without
oopsing, that is.

2022-09-03 00:15:58

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote:

> > When would we get out of __lookup_hash() with in-lookup dentry?
> > Confused...
>
> Whenever wq is passed in and ->lookup() decides, based on the flags, to do
> nothing.
> NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET

Frankly, I would rather do what all other callers of ->lookup() do and
just follow it with d_lookup_done(dentry), no matter what it returns.
It's cheap enough...

2022-09-03 01:42:18

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sat, 03 Sep 2022, Al Viro wrote:
> On Mon, Aug 29, 2022 at 11:59:02AM +1000, NeilBrown wrote:
>
> > > When would we get out of __lookup_hash() with in-lookup dentry?
> > > Confused...
> >
> > Whenever wq is passed in and ->lookup() decides, based on the flags, to do
> > nothing.
> > NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET
>
> Frankly, I would rather do what all other callers of ->lookup() do and
> just follow it with d_lookup_done(dentry), no matter what it returns.
> It's cheap enough...
>

I don't think that is a good idea. Once you call d_lookup_done()
(without having first called d_add() or similar) the dentry becomes
invisible to normal path lookup, so another might be created. But the
dentry will still be used for the 'create' or 'rename' and may then be
added to the dcache - at which point you could have two dentries with the
same name.

When ->lookup() returns success without d_add()ing the dentry, that
means that something else will complete the d_add() if/when necessary.
For NFS, it specifically means that the lookup is effectively being
combined with the following CREATE or RENAME. In this case there is no
d_lookup_done() until the full operation is complete.

For autofs (thanks for pointing me to that) the operation is completed
when d_automount() signals the daemon to create the directory or
symlink. In that case there IS a d_lookup_done() call and autofs needs
some extra magic (the internal 'active' list) to make sure subsequent
->lookup requests can see that dentry which is still in the process of
being set up.

It might be nice if the dentry passed to autofs_lookup() could remain
"d_inlookup()" until after d_automount has completed. Then autofs
wouldn't need that active list. However I haven't yet looked at how
disruptive such a change might be.

Thanks,
NeilBrown

2022-09-03 18:16:54

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote:

> Very much so. You are starting to invent new rules for ->lookup() that
> just never had been there, basing on nothing better than a couple of
> examples. They are nowhere near everything there is.

A few examples besides NFS and autofs:

ext4, f2fs and xfs might bloody well return NULL without hashing - happens
on negative lookups with 'casefolding' crap.

kernfs - treatment of inactive nodes.

afs_dynroot_lookup() treatment of @cell... names.

afs_lookup() treatment of @sys... names.

There might very well be more - both merged into mainline and in
development trees of various filesystems (including devel branches
of in-tree ones - I'm not talking about out-of-tree projects).

Note, BTW, that with the current rules it's perfectly possible to
have this kind of fun:
a name that resolves to different files for different processes
unlink(2) is allowed and results depend upon the calling process

All it takes is ->lookup() deliberately *NOT* hashing the sucker and
->unlink() acting according to dentry it has gotten from the caller.
unlink(2) from different callers are serialized and none of that
stuff is ever going to be hashed. d_alloc_parallel() might pick an
in-lookup dentry from another caller of e.g. stat(2), but it will
wait for in-lookup state ending, notice that the sucker is not hashed,
drop it and retry. Suboptimal, but it works.

Nothing in the mainline currently does that. Nothing that I know of,
that is. Sure, it could be made work with the changes you seem to
imply (if I'm not misreading you) - all it takes is lookup
calling d_lookup_done() on its argument before returning NULL.
But that's subtle, non-obvious and not documented anywhere...

Another interesting question is the rules for unhashing dentries.
What is needed for somebody to do temporary unhash, followed by
rehashing?

2022-09-05 00:25:50

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

On Sun, 04 Sep 2022, Al Viro wrote:
> On Sat, Sep 03, 2022 at 03:12:26AM +0100, Al Viro wrote:
>
> > Very much so. You are starting to invent new rules for ->lookup() that
> > just never had been there, basing on nothing better than a couple of
> > examples. They are nowhere near everything there is.
>
> A few examples besides NFS and autofs:

Hi Al,
thanks for these - very helpful. I will give them due consideration
when I write relevant documentation to include in the next posting of
the series.

Thanks a lot,
NeilBrown


>
> ext4, f2fs and xfs might bloody well return NULL without hashing - happens
> on negative lookups with 'casefolding' crap.
>
> kernfs - treatment of inactive nodes.
>
> afs_dynroot_lookup() treatment of @cell... names.
>
> afs_lookup() treatment of @sys... names.
>
> There might very well be more - both merged into mainline and in
> development trees of various filesystems (including devel branches
> of in-tree ones - I'm not talking about out-of-tree projects).
>
> Note, BTW, that with the current rules it's perfectly possible to
> have this kind of fun:
> a name that resolves to different files for different processes
> unlink(2) is allowed and results depend upon the calling process
>
> All it takes is ->lookup() deliberately *NOT* hashing the sucker and
> ->unlink() acting according to dentry it has gotten from the caller.
> unlink(2) from different callers are serialized and none of that
> stuff is ever going to be hashed. d_alloc_parallel() might pick an
> in-lookup dentry from another caller of e.g. stat(2), but it will
> wait for in-lookup state ending, notice that the sucker is not hashed,
> drop it and retry. Suboptimal, but it works.
>
> Nothing in the mainline currently does that. Nothing that I know of,
> that is. Sure, it could be made work with the changes you seem to
> imply (if I'm not misreading you) - all it takes is lookup
> calling d_lookup_done() on its argument before returning NULL.
> But that's subtle, non-obvious and not documented anywhere...
>
> Another interesting question is the rules for unhashing dentries.
> What is needed for somebody to do temporary unhash, followed by
> rehashing?
>