dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
renaming-over a file to ensure that no open succeeds while the NFS
operation progressed on the server.
Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock
after checking the refcount is not elevated. Any attempt to open the
file (through that name) will go through lookp_open() which will take
->d_lock while incrementing the refcount, we can be sure that once the
new value is set, __nfs_lookup_revalidate() *will* see the new value and
will block.
We don't have any locking guarantee that when we set ->d_fsdata to NULL,
the wait_var_event() in __nfs_lookup_revalidate() will notice.
wait/wake primitives do NOT provide barriers to guarantee order. We
must use smp_load_acquire() in wait_var_event() to ensure we look at an
up-to-date value, and must use smp_store_release() before wake_up_var().
This patch adds those barrier functions and factors out
block_revalidate() and unblock_revalidate() far clarity.
There is also a hypothetical bug in that if memory allocation fails
(which never happens in practice) we might leave ->d_fsdata locked.
This patch adds the missing call to unblock_revalidate().
Reported-and-tested-by: Richard Kojedzinszky <[email protected]>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename")
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 15 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ac505671efbd..c91dc36d41cc 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
if (parent != READ_ONCE(dentry->d_parent))
return -ECHILD;
} else {
- /* Wait for unlink to complete */
+ /* Wait for unlink to complete - see unblock_revalidate() */
wait_var_event(&dentry->d_fsdata,
- dentry->d_fsdata != NFS_FSDATA_BLOCKED);
+ smp_load_acquire(&dentry->d_fsdata)
+ != NFS_FSDATA_BLOCKED);
parent = dget_parent(dentry);
ret = reval(d_inode(parent), dentry, flags);
dput(parent);
@@ -1817,6 +1818,26 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
return __nfs_lookup_revalidate(dentry, flags, nfs_do_lookup_revalidate);
}
+static void block_revalidate(struct dentry *dentry)
+{
+ /* old devname - just in case */
+ kfree(dentry->d_fsdata);
+
+ /* Any new reference that could lead to an open
+ * will take ->d_lock in lookup_open() -> d_lookup().
+ */
+ lockdep_assert_held(&dentry->d_lock);
+
+ dentry->d_fsdata = NULL;
+}
+
+static void unblock_revalidate(struct dentry *dentry)
+{
+ /* store_release ensures wait_var_event() sees the update */
+ smp_store_release(&dentry->d_fsdata, NULL);
+ wake_up_var(&dentry->d_fsdata);
+}
+
/*
* A weaker form of d_revalidate for revalidating just the d_inode(dentry)
* when we don't really care about the dentry name. This is called when a
@@ -2501,15 +2522,12 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
spin_unlock(&dentry->d_lock);
goto out;
}
- /* old devname */
- kfree(dentry->d_fsdata);
- dentry->d_fsdata = NFS_FSDATA_BLOCKED;
+ block_revalidate(dentry);
spin_unlock(&dentry->d_lock);
error = nfs_safe_remove(dentry);
nfs_dentry_remove_handle_error(dir, dentry, error);
- dentry->d_fsdata = NULL;
- wake_up_var(&dentry->d_fsdata);
+ unblock_revalidate(dentry);
out:
trace_nfs_unlink_exit(dir, dentry, error);
return error;
@@ -2616,8 +2634,7 @@ nfs_unblock_rename(struct rpc_task *task, struct nfs_renamedata *data)
{
struct dentry *new_dentry = data->new_dentry;
- new_dentry->d_fsdata = NULL;
- wake_up_var(&new_dentry->d_fsdata);
+ unblock_revalidate(new_dentry);
}
/*
@@ -2679,11 +2696,6 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
if (WARN_ON(new_dentry->d_flags & DCACHE_NFSFS_RENAMED) ||
WARN_ON(new_dentry->d_fsdata == NFS_FSDATA_BLOCKED))
goto out;
- if (new_dentry->d_fsdata) {
- /* old devname */
- kfree(new_dentry->d_fsdata);
- new_dentry->d_fsdata = NULL;
- }
spin_lock(&new_dentry->d_lock);
if (d_count(new_dentry) > 2) {
@@ -2705,7 +2717,7 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
new_dentry = dentry;
new_inode = NULL;
} else {
- new_dentry->d_fsdata = NFS_FSDATA_BLOCKED;
+ block_revalidate(new_dentry);
must_unblock = true;
spin_unlock(&new_dentry->d_lock);
}
@@ -2717,6 +2729,8 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
task = nfs_async_rename(old_dir, new_dir, old_dentry, new_dentry,
must_unblock ? nfs_unblock_rename : NULL);
if (IS_ERR(task)) {
+ if (must_unblock)
+ unblock_revalidate(new_dentry);
error = PTR_ERR(task);
goto out;
}
--
2.44.0
On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
>
> dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> renaming-over a file to ensure that no open succeeds while the NFS
> operation progressed on the server.
>
> Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock
> after checking the refcount is not elevated. Any attempt to open the
> file (through that name) will go through lookp_open() which will take
> ->d_lock while incrementing the refcount, we can be sure that once
> the
> new value is set, __nfs_lookup_revalidate() *will* see the new value
> and
> will block.
>
> We don't have any locking guarantee that when we set ->d_fsdata to
> NULL,
> the wait_var_event() in __nfs_lookup_revalidate() will notice.
> wait/wake primitives do NOT provide barriers to guarantee order. We
> must use smp_load_acquire() in wait_var_event() to ensure we look at
> an
> up-to-date value, and must use smp_store_release() before
> wake_up_var().
>
> This patch adds those barrier functions and factors out
> block_revalidate() and unblock_revalidate() far clarity.
>
> There is also a hypothetical bug in that if memory allocation fails
> (which never happens in practice) we might leave ->d_fsdata locked.
> This patch adds the missing call to unblock_revalidate().
>
> Reported-and-tested-by: Richard Kojedzinszky
> <[email protected]>
> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename")
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> 1 file changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index ac505671efbd..c91dc36d41cc 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry *dentry,
> unsigned int flags,
> if (parent != READ_ONCE(dentry->d_parent))
> return -ECHILD;
> } else {
> - /* Wait for unlink to complete */
> + /* Wait for unlink to complete - see
> unblock_revalidate() */
> wait_var_event(&dentry->d_fsdata,
> - dentry->d_fsdata !=
> NFS_FSDATA_BLOCKED);
> + smp_load_acquire(&dentry->d_fsdata)
> + != NFS_FSDATA_BLOCKED);
Doesn't this end up being a reversed ACQUIRE+RELEASE as described in
the "LOCK ACQUISITION FUNCTIONS" section of Documentation/memory-
barriers.txt?
IOW: Shouldn't the above rather be using READ_ONCE()?
> parent = dget_parent(dentry);
> ret = reval(d_inode(parent), dentry, flags);
> dput(parent);
> @@ -1817,6 +1818,26 @@ static int nfs_lookup_revalidate(struct dentry
> *dentry, unsigned int flags)
> return __nfs_lookup_revalidate(dentry, flags,
> nfs_do_lookup_revalidate);
> }
>
> +static void block_revalidate(struct dentry *dentry)
> +{
> + /* old devname - just in case */
> + kfree(dentry->d_fsdata);
> +
> + /* Any new reference that could lead to an open
> + * will take ->d_lock in lookup_open() -> d_lookup().
> + */
> + lockdep_assert_held(&dentry->d_lock);
> +
> + dentry->d_fsdata = NULL;
Why are you doing a barrier free change to dentry->d_fsdata here when
you have the memory barrier protected change in unblock_revalidate()?
> +}
> +
> +static void unblock_revalidate(struct dentry *dentry)
> +{
> + /* store_release ensures wait_var_event() sees the update */
> + smp_store_release(&dentry->d_fsdata, NULL);
Shouldn't this be a WRITE_ONCE(), for the same reason as above?
> + wake_up_var(&dentry->d_fsdata);
> +}
> +
> /*
> * A weaker form of d_revalidate for revalidating just the
> d_inode(dentry)
> * when we don't really care about the dentry name. This is called
> when a
> @@ -2501,15 +2522,12 @@ int nfs_unlink(struct inode *dir, struct
> dentry *dentry)
> spin_unlock(&dentry->d_lock);
> goto out;
> }
> - /* old devname */
> - kfree(dentry->d_fsdata);
> - dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> + block_revalidate(dentry);
>
> spin_unlock(&dentry->d_lock);
> error = nfs_safe_remove(dentry);
> nfs_dentry_remove_handle_error(dir, dentry, error);
> - dentry->d_fsdata = NULL;
> - wake_up_var(&dentry->d_fsdata);
> + unblock_revalidate(dentry);
> out:
> trace_nfs_unlink_exit(dir, dentry, error);
> return error;
> @@ -2616,8 +2634,7 @@ nfs_unblock_rename(struct rpc_task *task,
> struct nfs_renamedata *data)
> {
> struct dentry *new_dentry = data->new_dentry;
>
> - new_dentry->d_fsdata = NULL;
> - wake_up_var(&new_dentry->d_fsdata);
> + unblock_revalidate(new_dentry);
> }
>
> /*
> @@ -2679,11 +2696,6 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> inode *old_dir,
> if (WARN_ON(new_dentry->d_flags &
> DCACHE_NFSFS_RENAMED) ||
> WARN_ON(new_dentry->d_fsdata ==
> NFS_FSDATA_BLOCKED))
> goto out;
> - if (new_dentry->d_fsdata) {
> - /* old devname */
> - kfree(new_dentry->d_fsdata);
> - new_dentry->d_fsdata = NULL;
> - }
>
> spin_lock(&new_dentry->d_lock);
> if (d_count(new_dentry) > 2) {
> @@ -2705,7 +2717,7 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> inode *old_dir,
> new_dentry = dentry;
> new_inode = NULL;
> } else {
> - new_dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> + block_revalidate(new_dentry);
> must_unblock = true;
> spin_unlock(&new_dentry->d_lock);
> }
> @@ -2717,6 +2729,8 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> inode *old_dir,
> task = nfs_async_rename(old_dir, new_dir, old_dentry,
> new_dentry,
> must_unblock ? nfs_unblock_rename :
> NULL);
> if (IS_ERR(task)) {
> + if (must_unblock)
> + unblock_revalidate(new_dentry);
> error = PTR_ERR(task);
> goto out;
> }
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, 28 May 2024, Trond Myklebust wrote:
> On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
> >
> > dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> > renaming-over a file to ensure that no open succeeds while the NFS
> > operation progressed on the server.
> >
> > Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock
> > after checking the refcount is not elevated. Any attempt to open the
> > file (through that name) will go through lookp_open() which will take
> > ->d_lock while incrementing the refcount, we can be sure that once
> > the
> > new value is set, __nfs_lookup_revalidate() *will* see the new value
> > and
> > will block.
> >
> > We don't have any locking guarantee that when we set ->d_fsdata to
> > NULL,
> > the wait_var_event() in __nfs_lookup_revalidate() will notice.
> > wait/wake primitives do NOT provide barriers to guarantee order. We
> > must use smp_load_acquire() in wait_var_event() to ensure we look at
> > an
> > up-to-date value, and must use smp_store_release() before
> > wake_up_var().
> >
> > This patch adds those barrier functions and factors out
> > block_revalidate() and unblock_revalidate() far clarity.
> >
> > There is also a hypothetical bug in that if memory allocation fails
> > (which never happens in practice) we might leave ->d_fsdata locked.
> > This patch adds the missing call to unblock_revalidate().
> >
> > Reported-and-tested-by: Richard Kojedzinszky
> > <[email protected]>
> > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> > Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename")
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> > 1 file changed, 29 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index ac505671efbd..c91dc36d41cc 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry *dentry,
> > unsigned int flags,
> > if (parent != READ_ONCE(dentry->d_parent))
> > return -ECHILD;
> > } else {
> > - /* Wait for unlink to complete */
> > + /* Wait for unlink to complete - see
> > unblock_revalidate() */
> > wait_var_event(&dentry->d_fsdata,
> > - dentry->d_fsdata !=
> > NFS_FSDATA_BLOCKED);
> > + smp_load_acquire(&dentry->d_fsdata)
> > + != NFS_FSDATA_BLOCKED);
>
> Doesn't this end up being a reversed ACQUIRE+RELEASE as described in
> the "LOCK ACQUISITION FUNCTIONS" section of Documentation/memory-
> barriers.txt?
I don't think so. That section is talking about STORE operations which
can move backwards through ACQUIRE and forwards through RELEASE.
Above we have a LOAD operation which mustn't move backwards through
ACQUIRE. Below there is a STORE operation which mustn't move forwards
through a RELEASE. Both of those are guaranteed.
>
> IOW: Shouldn't the above rather be using READ_ONCE()?
>
> > parent = dget_parent(dentry);
> > ret = reval(d_inode(parent), dentry, flags);
> > dput(parent);
> > @@ -1817,6 +1818,26 @@ static int nfs_lookup_revalidate(struct dentry
> > *dentry, unsigned int flags)
> > return __nfs_lookup_revalidate(dentry, flags,
> > nfs_do_lookup_revalidate);
> > }
> >
> > +static void block_revalidate(struct dentry *dentry)
> > +{
> > + /* old devname - just in case */
> > + kfree(dentry->d_fsdata);
> > +
> > + /* Any new reference that could lead to an open
> > + * will take ->d_lock in lookup_open() -> d_lookup().
> > + */
> > + lockdep_assert_held(&dentry->d_lock);
> > +
> > + dentry->d_fsdata = NULL;
>
> Why are you doing a barrier free change to dentry->d_fsdata here when
> you have the memory barrier protected change in unblock_revalidate()?
Ouch. This should be
dentry->d_fsdata = NFS_FSDATA_BLOCKED;
I messed that up when rearranging the code after testing.
This doesn't need a barrier because a spinlock is held. We check the
refcount under the spinlock and only proceed if there are no other
references. So if __nfs_lookup_revalidate gets called concurrently, it
must have got a new reference, and that requires the same spinlock.
So if it is called after this assignment, the spinlock will provide all
needed barriers.
>
> > +}
> > +
> > +static void unblock_revalidate(struct dentry *dentry)
> > +{
> > + /* store_release ensures wait_var_event() sees the update */
> > + smp_store_release(&dentry->d_fsdata, NULL);
>
> Shouldn't this be a WRITE_ONCE(), for the same reason as above?
No, for the same reason as above. WRITE_ONCE() doesn't provide any
memory barriers, it only avoid compiler optimisations. Here we really
need the barrier on some CPUs.
Thanks for the review.
NeilBrown
>
> > + wake_up_var(&dentry->d_fsdata);
> > +}
> > +
> > /*
> > * A weaker form of d_revalidate for revalidating just the
> > d_inode(dentry)
> > * when we don't really care about the dentry name. This is called
> > when a
> > @@ -2501,15 +2522,12 @@ int nfs_unlink(struct inode *dir, struct
> > dentry *dentry)
> > spin_unlock(&dentry->d_lock);
> > goto out;
> > }
> > - /* old devname */
> > - kfree(dentry->d_fsdata);
> > - dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> > + block_revalidate(dentry);
> >
> > spin_unlock(&dentry->d_lock);
> > error = nfs_safe_remove(dentry);
> > nfs_dentry_remove_handle_error(dir, dentry, error);
> > - dentry->d_fsdata = NULL;
> > - wake_up_var(&dentry->d_fsdata);
> > + unblock_revalidate(dentry);
> > out:
> > trace_nfs_unlink_exit(dir, dentry, error);
> > return error;
> > @@ -2616,8 +2634,7 @@ nfs_unblock_rename(struct rpc_task *task,
> > struct nfs_renamedata *data)
> > {
> > struct dentry *new_dentry = data->new_dentry;
> >
> > - new_dentry->d_fsdata = NULL;
> > - wake_up_var(&new_dentry->d_fsdata);
> > + unblock_revalidate(new_dentry);
> > }
> >
> > /*
> > @@ -2679,11 +2696,6 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> > inode *old_dir,
> > if (WARN_ON(new_dentry->d_flags &
> > DCACHE_NFSFS_RENAMED) ||
> > WARN_ON(new_dentry->d_fsdata ==
> > NFS_FSDATA_BLOCKED))
> > goto out;
> > - if (new_dentry->d_fsdata) {
> > - /* old devname */
> > - kfree(new_dentry->d_fsdata);
> > - new_dentry->d_fsdata = NULL;
> > - }
> >
> > spin_lock(&new_dentry->d_lock);
> > if (d_count(new_dentry) > 2) {
> > @@ -2705,7 +2717,7 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> > inode *old_dir,
> > new_dentry = dentry;
> > new_inode = NULL;
> > } else {
> > - new_dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> > + block_revalidate(new_dentry);
> > must_unblock = true;
> > spin_unlock(&new_dentry->d_lock);
> > }
> > @@ -2717,6 +2729,8 @@ int nfs_rename(struct mnt_idmap *idmap, struct
> > inode *old_dir,
> > task = nfs_async_rename(old_dir, new_dir, old_dentry,
> > new_dentry,
> > must_unblock ? nfs_unblock_rename :
> > NULL);
> > if (IS_ERR(task)) {
> > + if (must_unblock)
> > + unblock_revalidate(new_dentry);
> > error = PTR_ERR(task);
> > goto out;
> > }
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
>
On Tue, 2024-05-28 at 11:19 +1000, NeilBrown wrote:
> On Tue, 28 May 2024, Trond Myklebust wrote:
> > On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
> > >
> > > dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> > > renaming-over a file to ensure that no open succeeds while the
> > > NFS
> > > operation progressed on the server.
> > >
> > > Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under -
> > > >d_lock
> > > after checking the refcount is not elevated. Any attempt to open
> > > the
> > > file (through that name) will go through lookp_open() which will
> > > take
> > > ->d_lock while incrementing the refcount, we can be sure that
> > > once
> > > the
> > > new value is set, __nfs_lookup_revalidate() *will* see the new
> > > value
> > > and
> > > will block.
> > >
> > > We don't have any locking guarantee that when we set ->d_fsdata
> > > to
> > > NULL,
> > > the wait_var_event() in __nfs_lookup_revalidate() will notice.
> > > wait/wake primitives do NOT provide barriers to guarantee order.
> > > We
> > > must use smp_load_acquire() in wait_var_event() to ensure we look
> > > at
> > > an
> > > up-to-date value, and must use smp_store_release() before
> > > wake_up_var().
> > >
> > > This patch adds those barrier functions and factors out
> > > block_revalidate() and unblock_revalidate() far clarity.
> > >
> > > There is also a hypothetical bug in that if memory allocation
> > > fails
> > > (which never happens in practice) we might leave ->d_fsdata
> > > locked.
> > > This patch adds the missing call to unblock_revalidate().
> > >
> > > Reported-and-tested-by: Richard Kojedzinszky
> > > <[email protected]>
> > > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> > > Fixes: 3c59366c207e ("NFS: don't unhash dentry during
> > > unlink/rename")
> > > Signed-off-by: NeilBrown <[email protected]>
> > > ---
> > > fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> > > 1 file changed, 29 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > index ac505671efbd..c91dc36d41cc 100644
> > > --- a/fs/nfs/dir.c
> > > +++ b/fs/nfs/dir.c
> > > @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry
> > > *dentry,
> > > unsigned int flags,
> > > if (parent != READ_ONCE(dentry->d_parent))
> > > return -ECHILD;
> > > } else {
> > > - /* Wait for unlink to complete */
> > > + /* Wait for unlink to complete - see
> > > unblock_revalidate() */
> > > wait_var_event(&dentry->d_fsdata,
> > > - dentry->d_fsdata !=
> > > NFS_FSDATA_BLOCKED);
> > > + smp_load_acquire(&dentry-
> > > >d_fsdata)
> > > + != NFS_FSDATA_BLOCKED);
> >
> > Doesn't this end up being a reversed ACQUIRE+RELEASE as described
> > in
> > the "LOCK ACQUISITION FUNCTIONS" section of Documentation/memory-
> > barriers.txt?
>
> I don't think so. That section is talking about STORE operations
> which
> can move backwards through ACQUIRE and forwards through RELEASE.
>
> Above we have a LOAD operation which mustn't move backwards through
> ACQUIRE. Below there is a STORE operation which mustn't move
> forwards
> through a RELEASE. Both of those are guaranteed.
It isn't necessary for the LOAD to move backwards through the ACQUIRE.
As I understand it, the point is that the ACQUIRE can move through the
RELEASE as per the following paragraph in that document:
Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
not imply a full memory barrier. Therefore, the CPU's execution of the
critical sections corresponding to the RELEASE and the ACQUIRE can cross,
so that:
*A = a;
RELEASE M
ACQUIRE N
*B = b;
could occur as:
ACQUIRE N, STORE *B, STORE *A, RELEASE M
This would presumably be why the function clear_and_wake_up_bit() needs
a full memory barrier on most architectures, instead of being just an
smp_wmb(). Is my understanding of this wrong?
>
> >
> > IOW: Shouldn't the above rather be using READ_ONCE()?
> >
> > > parent = dget_parent(dentry);
> > > ret = reval(d_inode(parent), dentry, flags);
> > > dput(parent);
> > > @@ -1817,6 +1818,26 @@ static int nfs_lookup_revalidate(struct
> > > dentry
> > > *dentry, unsigned int flags)
> > > return __nfs_lookup_revalidate(dentry, flags,
> > > nfs_do_lookup_revalidate);
> > > }
> > >
> > > +static void block_revalidate(struct dentry *dentry)
> > > +{
> > > + /* old devname - just in case */
> > > + kfree(dentry->d_fsdata);
> > > +
> > > + /* Any new reference that could lead to an open
> > > + * will take ->d_lock in lookup_open() -> d_lookup().
> > > + */
> > > + lockdep_assert_held(&dentry->d_lock);
> > > +
> > > + dentry->d_fsdata = NULL;
> >
> > Why are you doing a barrier free change to dentry->d_fsdata here
> > when
> > you have the memory barrier protected change in
> > unblock_revalidate()?
>
> Ouch. This should be
>
> dentry->d_fsdata = NFS_FSDATA_BLOCKED;
>
> I messed that up when rearranging the code after testing.
>
> This doesn't need a barrier because a spinlock is held. We check the
> refcount under the spinlock and only proceed if there are no other
> references. So if __nfs_lookup_revalidate gets called concurrently,
> it
> must have got a new reference, and that requires the same spinlock.
> So if it is called after this assignment, the spinlock will provide
> all
> needed barriers.
>
> >
> > > +}
> > > +
> > > +static void unblock_revalidate(struct dentry *dentry)
> > > +{
> > > + /* store_release ensures wait_var_event() sees the
> > > update */
> > > + smp_store_release(&dentry->d_fsdata, NULL);
> >
> > Shouldn't this be a WRITE_ONCE(), for the same reason as above?
>
> No, for the same reason as above. WRITE_ONCE() doesn't provide any
> memory barriers, it only avoid compiler optimisations. Here we
> really
> need the barrier on some CPUs.
>
> Thanks for the review.
>
> NeilBrown
>
> >
> > > + wake_up_var(&dentry->d_fsdata);
> > > +}
> > > +
> > > /*
> > > * A weaker form of d_revalidate for revalidating just the
> > > d_inode(dentry)
> > > * when we don't really care about the dentry name. This is
> > > called
> > > when a
> > > @@ -2501,15 +2522,12 @@ int nfs_unlink(struct inode *dir, struct
> > > dentry *dentry)
> > > spin_unlock(&dentry->d_lock);
> > > goto out;
> > > }
> > > - /* old devname */
> > > - kfree(dentry->d_fsdata);
> > > - dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> > > + block_revalidate(dentry);
> > >
> > > spin_unlock(&dentry->d_lock);
> > > error = nfs_safe_remove(dentry);
> > > nfs_dentry_remove_handle_error(dir, dentry, error);
> > > - dentry->d_fsdata = NULL;
> > > - wake_up_var(&dentry->d_fsdata);
> > > + unblock_revalidate(dentry);
> > > out:
> > > trace_nfs_unlink_exit(dir, dentry, error);
> > > return error;
> > > @@ -2616,8 +2634,7 @@ nfs_unblock_rename(struct rpc_task *task,
> > > struct nfs_renamedata *data)
> > > {
> > > struct dentry *new_dentry = data->new_dentry;
> > >
> > > - new_dentry->d_fsdata = NULL;
> > > - wake_up_var(&new_dentry->d_fsdata);
> > > + unblock_revalidate(new_dentry);
> > > }
> > >
> > > /*
> > > @@ -2679,11 +2696,6 @@ int nfs_rename(struct mnt_idmap *idmap,
> > > struct
> > > inode *old_dir,
> > > if (WARN_ON(new_dentry->d_flags &
> > > DCACHE_NFSFS_RENAMED) ||
> > > WARN_ON(new_dentry->d_fsdata ==
> > > NFS_FSDATA_BLOCKED))
> > > goto out;
> > > - if (new_dentry->d_fsdata) {
> > > - /* old devname */
> > > - kfree(new_dentry->d_fsdata);
> > > - new_dentry->d_fsdata = NULL;
> > > - }
> > >
> > > spin_lock(&new_dentry->d_lock);
> > > if (d_count(new_dentry) > 2) {
> > > @@ -2705,7 +2717,7 @@ int nfs_rename(struct mnt_idmap *idmap,
> > > struct
> > > inode *old_dir,
> > > new_dentry = dentry;
> > > new_inode = NULL;
> > > } else {
> > > - new_dentry->d_fsdata =
> > > NFS_FSDATA_BLOCKED;
> > > + block_revalidate(new_dentry);
> > > must_unblock = true;
> > > spin_unlock(&new_dentry->d_lock);
> > > }
> > > @@ -2717,6 +2729,8 @@ int nfs_rename(struct mnt_idmap *idmap,
> > > struct
> > > inode *old_dir,
> > > task = nfs_async_rename(old_dir, new_dir, old_dentry,
> > > new_dentry,
> > > must_unblock ?
> > > nfs_unblock_rename :
> > > NULL);
> > > if (IS_ERR(task)) {
> > > + if (must_unblock)
> > > + unblock_revalidate(new_dentry);
> > > error = PTR_ERR(task);
> > > goto out;
> > > }
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, 29 May 2024, Trond Myklebust wrote:
> On Tue, 2024-05-28 at 11:19 +1000, NeilBrown wrote:
> > On Tue, 28 May 2024, Trond Myklebust wrote:
> > > On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
> > > >
> > > > dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> > > > renaming-over a file to ensure that no open succeeds while the
> > > > NFS
> > > > operation progressed on the server.
> > > >
> > > > Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under -
> > > > >d_lock
> > > > after checking the refcount is not elevated. Any attempt to open
> > > > the
> > > > file (through that name) will go through lookp_open() which will
> > > > take
> > > > ->d_lock while incrementing the refcount, we can be sure that
> > > > once
> > > > the
> > > > new value is set, __nfs_lookup_revalidate() *will* see the new
> > > > value
> > > > and
> > > > will block.
> > > >
> > > > We don't have any locking guarantee that when we set ->d_fsdata
> > > > to
> > > > NULL,
> > > > the wait_var_event() in __nfs_lookup_revalidate() will notice.
> > > > wait/wake primitives do NOT provide barriers to guarantee order.
> > > > We
> > > > must use smp_load_acquire() in wait_var_event() to ensure we look
> > > > at
> > > > an
> > > > up-to-date value, and must use smp_store_release() before
> > > > wake_up_var().
> > > >
> > > > This patch adds those barrier functions and factors out
> > > > block_revalidate() and unblock_revalidate() far clarity.
> > > >
> > > > There is also a hypothetical bug in that if memory allocation
> > > > fails
> > > > (which never happens in practice) we might leave ->d_fsdata
> > > > locked.
> > > > This patch adds the missing call to unblock_revalidate().
> > > >
> > > > Reported-and-tested-by: Richard Kojedzinszky
> > > > <[email protected]>
> > > > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> > > > Fixes: 3c59366c207e ("NFS: don't unhash dentry during
> > > > unlink/rename")
> > > > Signed-off-by: NeilBrown <[email protected]>
> > > > ---
> > > > fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> > > > 1 file changed, 29 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > index ac505671efbd..c91dc36d41cc 100644
> > > > --- a/fs/nfs/dir.c
> > > > +++ b/fs/nfs/dir.c
> > > > @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry
> > > > *dentry,
> > > > unsigned int flags,
> > > > if (parent != READ_ONCE(dentry->d_parent))
> > > > return -ECHILD;
> > > > } else {
> > > > - /* Wait for unlink to complete */
> > > > + /* Wait for unlink to complete - see
> > > > unblock_revalidate() */
> > > > wait_var_event(&dentry->d_fsdata,
> > > > - dentry->d_fsdata !=
> > > > NFS_FSDATA_BLOCKED);
> > > > + smp_load_acquire(&dentry-
> > > > >d_fsdata)
> > > > + != NFS_FSDATA_BLOCKED);
> > >
> > > Doesn't this end up being a reversed ACQUIRE+RELEASE as described
> > > in
> > > the "LOCK ACQUISITION FUNCTIONS" section of Documentation/memory-
> > > barriers.txt?
> >
> > I don't think so. That section is talking about STORE operations
> > which
> > can move backwards through ACQUIRE and forwards through RELEASE.
> >
> > Above we have a LOAD operation which mustn't move backwards through
> > ACQUIRE. Below there is a STORE operation which mustn't move
> > forwards
> > through a RELEASE. Both of those are guaranteed.
>
> It isn't necessary for the LOAD to move backwards through the ACQUIRE.
> As I understand it, the point is that the ACQUIRE can move through the
> RELEASE as per the following paragraph in that document:
>
> Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
> not imply a full memory barrier. Therefore, the CPU's execution of the
> critical sections corresponding to the RELEASE and the ACQUIRE can cross,
> so that:
>
> *A = a;
> RELEASE M
> ACQUIRE N
> *B = b;
>
> could occur as:
>
> ACQUIRE N, STORE *B, STORE *A, RELEASE M
On the wakeup side we have:
STORE ->d_fsdata
RELEASE
spin_lock
LOAD: check is waitq is empty
and on the wait side we have
STORE: add to waitq
spin_unlock
ACQUIRE
LOAD ->d_fsdata
I believe that spin_lock is an ACQUIRE operation and spin_unlock is a
RELEASE operation. So both of these have "ACQUIRE ; RELEASE" which
creates a full barrier.
>
> This would presumably be why the function clear_and_wake_up_bit() needs
> a full memory barrier on most architectures, instead of being just an
> smp_wmb(). Is my understanding of this wrong?
clear_and_wake_up_bit() calls __wake_up_bit() which calls
waitqueue_active() without taking a lock. So there is no ACQUIRE after
the RELEASE in clear_bit_unlock() and before testing for list-empty. So
we need to explicitly add a barrier.
At least that is my understanding just now. I hadn't worked through all
the details before, but I'm glad you prompted me to - thanks.
NeilBrown