2022-07-06 05:21:07

by NeilBrown

[permalink] [raw]
Subject: [PATCH] NFS: don't unhash dentry during unlink.


NFS unlink() must determine if the file is open, and must perform a
"silly rename" instead of an unlink if it is. Otherwise the client
might hold a file open which has been removed on the server.

Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink has been completed on the server.

This is currently achieved by unhashing the dentry. This forces any
open attempt to the slow-path for lookup which will block on i_sem on
the directory until the unlink completes. A proposed patch will change
the VFS to only get a shared lock on i_sem for unlink, so this will no
longer work.

Instead we introduce an explicit interlock. A flag is set on the dentry
while the unlink is running and ->d_revalidate blocks while that flag is
set. This closes the race without requiring exclusion on i_sem.
unlink will still have exclusion on the dentry being unlinked, so it
will be safe to set and then clear the flag without any risk of another
thread touching the flag.

There is little room for adding new dentry flags, so instead of adding a
new flag, we overload an existing flag which is not used by NFS.

DCACHE_DONTCACHE is only set for filesystems which call
d_mark_dontcache() and NFS never calls this, so it is currently unused
in NFS.
DCACHE_DONTCACHE is only tested when the last reference on a dentry has
been dropped, so it is safe for NFS to set and then clear the flag while
holding a reference - the setting of the flag cannot cause a
misunderstanding.

So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE
and add a definition to nfs_fs.h so that if NFS ever does find a need to
call d_mark_dontcache() the build will fail with a suitable error.

Signed-off-by: NeilBrown <[email protected]>
---

Hi Trond/Anna,
this patch is a precursor for my parallel-directory-updates patch set.
I would be particularly helpful if this (and the nfsd patches I
recently sent) could land for the next merge window. Then I could post
a substantially reduced series to implement parallel directory
updates, which would then be easier for other to review.

Thanks,
NeilBrown


fs/nfs/dir.c | 23 ++++++++++++++++-------
include/linux/nfs_fs.h | 14 ++++++++++++++
2 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 0c4e8dd6aa96..695bb057cbd2 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
int ret;

if (flags & LOOKUP_RCU) {
+ if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)
+ return -ECHILD;
parent = READ_ONCE(dentry->d_parent);
dir = d_inode_rcu(parent);
if (!dir)
@@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
if (parent != READ_ONCE(dentry->d_parent))
return -ECHILD;
} else {
+ /* Wait for unlink to complete */
+ wait_var_event(&dentry->d_flags,
+ !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK));
parent = dget_parent(dentry);
ret = reval(d_inode(parent), dentry, flags);
dput(parent);
@@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry)
int nfs_unlink(struct inode *dir, struct dentry *dentry)
{
int error;
- int need_rehash = 0;

dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
dir->i_ino, dentry);
@@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
error = nfs_sillyrename(dir, dentry);
goto out;
}
- if (!d_unhashed(dentry)) {
- __d_drop(dentry);
- need_rehash = 1;
- }
+ /* We must prevent any concurrent open until the unlink
+ * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK
+ * to clear. We set it here to ensure no lookup succeeds until
+ * the unlink is complete on the server.
+ */
+ dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK;
+
spin_unlock(&dentry->d_lock);
error = nfs_safe_remove(dentry);
nfs_dentry_remove_handle_error(dir, dentry, error);
- if (need_rehash)
- d_rehash(dentry);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK;
+ spin_unlock(&dentry->d_lock);
+ wake_up_var(&dentry->d_flags);
out:
trace_nfs_unlink_exit(dir, dentry, error);
return error;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a17c337dbdf1..041a6076e045 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid)

#define NFS_JUKEBOX_RETRY_TIME (5 * HZ)

+/* We need to block new opens while a file is being unlinked.
+ * If it is opened *before* we decide to unlink, we will silly-rename
+ * instead. If it is opened *after*, then we the open to fail unless it creates
+ * a new file.
+ * If we allow the open and unlink to race, we could end up with a file that is
+ * open but deleted on the server resulting in ESTALE.
+ * So overload DCACHE_DONTCACHE to record when the unlink is happening
+ * and block dentry revalidation while it is set.
+ * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache()
+ * which NFS never calls. It is only tested on a dentry on which all references
+ * have been dropped, so it is safe for NFS to set it while holding a reference.
+ */
+#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE
+#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()")

# undef ifdebug
# ifdef NFS_DEBUG
--
2.36.1


2022-07-06 14:31:52

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] NFS: don't unhash dentry during unlink.

On Wed, 2022-07-06 at 15:10 +1000, NeilBrown wrote:
> NFS unlink() must determine if the file is open, and must perform a
> "silly rename" instead of an unlink if it is. Otherwise the client
> might hold a file open which has been removed on the server.
>
> Consequently if it determines that the file isn't open, it must block
> any subsequent opens until the unlink has been completed on the server.
>
> This is currently achieved by unhashing the dentry. This forces any
> open attempt to the slow-path for lookup which will block on i_sem on
> the directory until the unlink completes. A proposed patch will change
> the VFS to only get a shared lock on i_sem for unlink, so this will no
> longer work.
>
> Instead we introduce an explicit interlock. A flag is set on the dentry
> while the unlink is running and ->d_revalidate blocks while that flag is
> set. This closes the race without requiring exclusion on i_sem.
> unlink will still have exclusion on the dentry being unlinked, so it
> will be safe to set and then clear the flag without any risk of another
> thread touching the flag.
>
> There is little room for adding new dentry flags, so instead of adding a
> new flag, we overload an existing flag which is not used by NFS.
>
> DCACHE_DONTCACHE is only set for filesystems which call
> d_mark_dontcache() and NFS never calls this, so it is currently unused
> in NFS.
> DCACHE_DONTCACHE is only tested when the last reference on a dentry has
> been dropped, so it is safe for NFS to set and then clear the flag while
> holding a reference - the setting of the flag cannot cause a
> misunderstanding.
>
> So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE
> and add a definition to nfs_fs.h so that if NFS ever does find a need to
> call d_mark_dontcache() the build will fail with a suitable error.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> Hi Trond/Anna,
> this patch is a precursor for my parallel-directory-updates patch set.
> I would be particularly helpful if this (and the nfsd patches I
> recently sent) could land for the next merge window. Then I could post
> a substantially reduced series to implement parallel directory
> updates, which would then be easier for other to review.
>
> Thanks,
> NeilBrown
>
>
> fs/nfs/dir.c | 23 ++++++++++++++++-------
> include/linux/nfs_fs.h | 14 ++++++++++++++
> 2 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 0c4e8dd6aa96..695bb057cbd2 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> int ret;
>
> if (flags & LOOKUP_RCU) {
> + if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)
> + return -ECHILD;
> parent = READ_ONCE(dentry->d_parent);
> dir = d_inode_rcu(parent);
> if (!dir)
> @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> if (parent != READ_ONCE(dentry->d_parent))
> return -ECHILD;
> } else {
> + /* Wait for unlink to complete */
> + wait_var_event(&dentry->d_flags,
> + !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK));
> parent = dget_parent(dentry);
> ret = reval(d_inode(parent), dentry, flags);
> dput(parent);
> @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry)
> int nfs_unlink(struct inode *dir, struct dentry *dentry)
> {
> int error;
> - int need_rehash = 0;
>
> dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
> dir->i_ino, dentry);
> @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> error = nfs_sillyrename(dir, dentry);
> goto out;
> }
> - if (!d_unhashed(dentry)) {
> - __d_drop(dentry);
> - need_rehash = 1;
> - }
> + /* We must prevent any concurrent open until the unlink
> + * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK
> + * to clear. We set it here to ensure no lookup succeeds until
> + * the unlink is complete on the server.
> + */
> + dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK;
> +
> spin_unlock(&dentry->d_lock);
> error = nfs_safe_remove(dentry);
> nfs_dentry_remove_handle_error(dir, dentry, error);
> - if (need_rehash)
> - d_rehash(dentry);
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK;
> + spin_unlock(&dentry->d_lock);
> + wake_up_var(&dentry->d_flags);
> out:
> trace_nfs_unlink_exit(dir, dentry, error);
> return error;
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index a17c337dbdf1..041a6076e045 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid)
>
> #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
>
> +/* We need to block new opens while a file is being unlinked.
> + * If it is opened *before* we decide to unlink, we will silly-rename
> + * instead. If it is opened *after*, then we the open to fail unless it creates

"then we allow the open to fail"

> + * a new file.
> + * If we allow the open and unlink to race, we could end up with a file that is
> + * open but deleted on the server resulting in ESTALE.
> + * So overload DCACHE_DONTCACHE to record when the unlink is happening
> + * and block dentry revalidation while it is set.
> + * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache()
> + * which NFS never calls. It is only tested on a dentry on which all references
> + * have been dropped, so it is safe for NFS to set it while holding a reference.
> + */
> +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE
> +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()")
>
> # undef ifdebug
> # ifdef NFS_DEBUG

Wow, we really are out of dentry flags. I wonder if some of them are no
longer needed?

This overloading is a bit klunky but it's probably OK. AFAICT,
0x80000000 is still available though if this turns out to be too nasty.
It looks like 0x08000000 may also be free


Reviewed-by: Jeff Layton <[email protected]>

2022-07-06 23:37:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFS: don't unhash dentry during unlink.

On Thu, 07 Jul 2022, Jeff Layton wrote:
> On Wed, 2022-07-06 at 15:10 +1000, NeilBrown wrote:
> > NFS unlink() must determine if the file is open, and must perform a
> > "silly rename" instead of an unlink if it is. Otherwise the client
> > might hold a file open which has been removed on the server.
> >
> > Consequently if it determines that the file isn't open, it must block
> > any subsequent opens until the unlink has been completed on the server.
> >
> > This is currently achieved by unhashing the dentry. This forces any
> > open attempt to the slow-path for lookup which will block on i_sem on
> > the directory until the unlink completes. A proposed patch will change
> > the VFS to only get a shared lock on i_sem for unlink, so this will no
> > longer work.
> >
> > Instead we introduce an explicit interlock. A flag is set on the dentry
> > while the unlink is running and ->d_revalidate blocks while that flag is
> > set. This closes the race without requiring exclusion on i_sem.
> > unlink will still have exclusion on the dentry being unlinked, so it
> > will be safe to set and then clear the flag without any risk of another
> > thread touching the flag.
> >
> > There is little room for adding new dentry flags, so instead of adding a
> > new flag, we overload an existing flag which is not used by NFS.
> >
> > DCACHE_DONTCACHE is only set for filesystems which call
> > d_mark_dontcache() and NFS never calls this, so it is currently unused
> > in NFS.
> > DCACHE_DONTCACHE is only tested when the last reference on a dentry has
> > been dropped, so it is safe for NFS to set and then clear the flag while
> > holding a reference - the setting of the flag cannot cause a
> > misunderstanding.
> >
> > So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE
> > and add a definition to nfs_fs.h so that if NFS ever does find a need to
> > call d_mark_dontcache() the build will fail with a suitable error.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> >
> > Hi Trond/Anna,
> > this patch is a precursor for my parallel-directory-updates patch set.
> > I would be particularly helpful if this (and the nfsd patches I
> > recently sent) could land for the next merge window. Then I could post
> > a substantially reduced series to implement parallel directory
> > updates, which would then be easier for other to review.
> >
> > Thanks,
> > NeilBrown
> >
> >
> > fs/nfs/dir.c | 23 ++++++++++++++++-------
> > include/linux/nfs_fs.h | 14 ++++++++++++++
> > 2 files changed, 30 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 0c4e8dd6aa96..695bb057cbd2 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> > int ret;
> >
> > if (flags & LOOKUP_RCU) {
> > + if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)
> > + return -ECHILD;
> > parent = READ_ONCE(dentry->d_parent);
> > dir = d_inode_rcu(parent);
> > if (!dir)
> > @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> > if (parent != READ_ONCE(dentry->d_parent))
> > return -ECHILD;
> > } else {
> > + /* Wait for unlink to complete */
> > + wait_var_event(&dentry->d_flags,
> > + !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK));
> > parent = dget_parent(dentry);
> > ret = reval(d_inode(parent), dentry, flags);
> > dput(parent);
> > @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry)
> > int nfs_unlink(struct inode *dir, struct dentry *dentry)
> > {
> > int error;
> > - int need_rehash = 0;
> >
> > dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
> > dir->i_ino, dentry);
> > @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> > error = nfs_sillyrename(dir, dentry);
> > goto out;
> > }
> > - if (!d_unhashed(dentry)) {
> > - __d_drop(dentry);
> > - need_rehash = 1;
> > - }
> > + /* We must prevent any concurrent open until the unlink
> > + * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK
> > + * to clear. We set it here to ensure no lookup succeeds until
> > + * the unlink is complete on the server.
> > + */
> > + dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK;
> > +
> > spin_unlock(&dentry->d_lock);
> > error = nfs_safe_remove(dentry);
> > nfs_dentry_remove_handle_error(dir, dentry, error);
> > - if (need_rehash)
> > - d_rehash(dentry);
> > + spin_lock(&dentry->d_lock);
> > + dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK;
> > + spin_unlock(&dentry->d_lock);
> > + wake_up_var(&dentry->d_flags);
> > out:
> > trace_nfs_unlink_exit(dir, dentry, error);
> > return error;
> > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > index a17c337dbdf1..041a6076e045 100644
> > --- a/include/linux/nfs_fs.h
> > +++ b/include/linux/nfs_fs.h
> > @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid)
> >
> > #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
> >
> > +/* We need to block new opens while a file is being unlinked.
> > + * If it is opened *before* we decide to unlink, we will silly-rename
> > + * instead. If it is opened *after*, then we the open to fail unless it creates
>
> "then we allow the open to fail"

Actually it is "then we need the open to fail".
I should probably do a complete re-write of that para.

>
> > + * a new file.
> > + * If we allow the open and unlink to race, we could end up with a file that is
> > + * open but deleted on the server resulting in ESTALE.
> > + * So overload DCACHE_DONTCACHE to record when the unlink is happening
> > + * and block dentry revalidation while it is set.
> > + * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache()
> > + * which NFS never calls. It is only tested on a dentry on which all references
> > + * have been dropped, so it is safe for NFS to set it while holding a reference.
> > + */
> > +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE
> > +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()")
> >
> > # undef ifdebug
> > # ifdef NFS_DEBUG
>
> Wow, we really are out of dentry flags. I wonder if some of them are no
> longer needed?
>
> This overloading is a bit klunky but it's probably OK. AFAICT,
> 0x80000000 is still available though if this turns out to be too nasty.
> It looks like 0x08000000 may also be free

I need one of those two in a subsequent patch to lock a dentry while the
name/link is being created. If I used the other for NFS_PENDING_UNLINK
we would be completely out.

This flag really should be completely private to nfs. d_fsdata would be
the best place to put it.
But NFS doesn't have a permanent d_fsdata in which I can store a bit.
Nor does it leave d_fsdata untouched, so I cannot store a magic value in
there.
There are two different uses of d_fsdata. I don't fully understand when
they are active, so I don't know if it is safe to add another
independent use - I suspect not though.

>
>
> Reviewed-by: Jeff Layton <[email protected]>
>

Thanks,
NeilBrown

2022-07-26 04:41:34

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFS: don't unhash dentry during unlink.


Hi Trond,
did you have a change to look at this patch?

Thanks,
NeilBrown

On Wed, 06 Jul 2022, NeilBrown wrote:
> NFS unlink() must determine if the file is open, and must perform a
> "silly rename" instead of an unlink if it is. Otherwise the client
> might hold a file open which has been removed on the server.
>
> Consequently if it determines that the file isn't open, it must block
> any subsequent opens until the unlink has been completed on the server.
>
> This is currently achieved by unhashing the dentry. This forces any
> open attempt to the slow-path for lookup which will block on i_sem on
> the directory until the unlink completes. A proposed patch will change
> the VFS to only get a shared lock on i_sem for unlink, so this will no
> longer work.
>
> Instead we introduce an explicit interlock. A flag is set on the dentry
> while the unlink is running and ->d_revalidate blocks while that flag is
> set. This closes the race without requiring exclusion on i_sem.
> unlink will still have exclusion on the dentry being unlinked, so it
> will be safe to set and then clear the flag without any risk of another
> thread touching the flag.
>
> There is little room for adding new dentry flags, so instead of adding a
> new flag, we overload an existing flag which is not used by NFS.
>
> DCACHE_DONTCACHE is only set for filesystems which call
> d_mark_dontcache() and NFS never calls this, so it is currently unused
> in NFS.
> DCACHE_DONTCACHE is only tested when the last reference on a dentry has
> been dropped, so it is safe for NFS to set and then clear the flag while
> holding a reference - the setting of the flag cannot cause a
> misunderstanding.
>
> So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE
> and add a definition to nfs_fs.h so that if NFS ever does find a need to
> call d_mark_dontcache() the build will fail with a suitable error.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> Hi Trond/Anna,
> this patch is a precursor for my parallel-directory-updates patch set.
> I would be particularly helpful if this (and the nfsd patches I
> recently sent) could land for the next merge window. Then I could post
> a substantially reduced series to implement parallel directory
> updates, which would then be easier for other to review.
>
> Thanks,
> NeilBrown
>
>
> fs/nfs/dir.c | 23 ++++++++++++++++-------
> include/linux/nfs_fs.h | 14 ++++++++++++++
> 2 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 0c4e8dd6aa96..695bb057cbd2 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> int ret;
>
> if (flags & LOOKUP_RCU) {
> + if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)
> + return -ECHILD;
> parent = READ_ONCE(dentry->d_parent);
> dir = d_inode_rcu(parent);
> if (!dir)
> @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> if (parent != READ_ONCE(dentry->d_parent))
> return -ECHILD;
> } else {
> + /* Wait for unlink to complete */
> + wait_var_event(&dentry->d_flags,
> + !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK));
> parent = dget_parent(dentry);
> ret = reval(d_inode(parent), dentry, flags);
> dput(parent);
> @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry)
> int nfs_unlink(struct inode *dir, struct dentry *dentry)
> {
> int error;
> - int need_rehash = 0;
>
> dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id,
> dir->i_ino, dentry);
> @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> error = nfs_sillyrename(dir, dentry);
> goto out;
> }
> - if (!d_unhashed(dentry)) {
> - __d_drop(dentry);
> - need_rehash = 1;
> - }
> + /* We must prevent any concurrent open until the unlink
> + * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK
> + * to clear. We set it here to ensure no lookup succeeds until
> + * the unlink is complete on the server.
> + */
> + dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK;
> +
> spin_unlock(&dentry->d_lock);
> error = nfs_safe_remove(dentry);
> nfs_dentry_remove_handle_error(dir, dentry, error);
> - if (need_rehash)
> - d_rehash(dentry);
> + spin_lock(&dentry->d_lock);
> + dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK;
> + spin_unlock(&dentry->d_lock);
> + wake_up_var(&dentry->d_flags);
> out:
> trace_nfs_unlink_exit(dir, dentry, error);
> return error;
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index a17c337dbdf1..041a6076e045 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid)
>
> #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
>
> +/* We need to block new opens while a file is being unlinked.
> + * If it is opened *before* we decide to unlink, we will silly-rename
> + * instead. If it is opened *after*, then we the open to fail unless it creates
> + * a new file.
> + * If we allow the open and unlink to race, we could end up with a file that is
> + * open but deleted on the server resulting in ESTALE.
> + * So overload DCACHE_DONTCACHE to record when the unlink is happening
> + * and block dentry revalidation while it is set.
> + * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache()
> + * which NFS never calls. It is only tested on a dentry on which all references
> + * have been dropped, so it is safe for NFS to set it while holding a reference.
> + */
> +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE
> +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()")
>
> # undef ifdebug
> # ifdef NFS_DEBUG
> --
> 2.36.1
>
>

2022-07-27 22:51:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] NFS: don't unhash dentry during unlink.

On Tue, 2022-07-26 at 14:40 +1000, NeilBrown wrote:
>
> Hi Trond,
>  did you have a change to look at this patch?

I did take a look, and the patch does seem acceptable to me as it
stands.

However without the full context of the other patches you're writing,
I'm not able to ascertain whether or not a better approach than
overloading the meaning of DCACHE_DONTCACHE might exist. I'm wondering
if perhaps co-opting struct nfs_unlinkdata and/or creating something
similar for the non-sillyrename case might not work just as well?

>
> Thanks,
> NeilBrown
>
> On Wed, 06 Jul 2022, NeilBrown wrote:
> > NFS unlink() must determine if the file is open, and must perform a
> > "silly rename" instead of an unlink if it is.  Otherwise the client
> > might hold a file open which has been removed on the server.
> >
> > Consequently if it determines that the file isn't open, it must
> > block
> > any subsequent opens until the unlink has been completed on the
> > server.
> >
> > This is currently achieved by unhashing the dentry.  This forces
> > any
> > open attempt to the slow-path for lookup which will block on i_sem
> > on
> > the directory until the unlink completes.  A proposed patch will
> > change
> > the VFS to only get a shared lock on i_sem for unlink, so this will
> > no
> > longer work.
> >
> > Instead we introduce an explicit interlock.  A flag is set on the
> > dentry
> > while the unlink is running and ->d_revalidate blocks while that
> > flag is
> > set.  This closes the race without requiring exclusion on i_sem.
> > unlink will still have exclusion on the dentry being unlinked, so
> > it
> > will be safe to set and then clear the flag without any risk of
> > another
> > thread touching the flag.
> >
> > There is little room for adding new dentry flags, so instead of
> > adding a
> > new flag, we overload an existing flag which is not used by NFS.
> >
> > DCACHE_DONTCACHE is only set for filesystems which call
> > d_mark_dontcache() and NFS never calls this, so it is currently
> > unused
> > in NFS.
> > DCACHE_DONTCACHE is only tested when the last reference on a dentry
> > has
> > been dropped, so it is safe for NFS to set and then clear the flag
> > while
> > holding a reference - the setting of the flag cannot cause a
> > misunderstanding.
> >
> > So we define DCACHE_NFS_PENDING_UNLINK as an alias for
> > DCACHE_DONTCACHE
> > and add a definition to nfs_fs.h so that if NFS ever does find a
> > need to
> > call d_mark_dontcache() the build will fail with a suitable error.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> >
> > Hi Trond/Anna,
> >  this patch is a precursor for my parallel-directory-updates patch
> > set.
> >  I would be particularly helpful if this (and the nfsd patches I
> >  recently sent) could land for the next merge window.  Then I could
> > post
> >  a substantially reduced series to implement parallel directory
> >  updates, which would then be easier for other to review.
> >
> > Thanks,
> > NeilBrown
> >
> >
> >  fs/nfs/dir.c           | 23 ++++++++++++++++-------
> >  include/linux/nfs_fs.h | 14 ++++++++++++++
> >  2 files changed, 30 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 0c4e8dd6aa96..695bb057cbd2 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry
> > *dentry, unsigned int flags,
> >         int ret;
> >  
> >         if (flags & LOOKUP_RCU) {
> > +               if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)
> > +                       return -ECHILD;
> >                 parent = READ_ONCE(dentry->d_parent);
> >                 dir = d_inode_rcu(parent);
> >                 if (!dir)
> > @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry
> > *dentry, unsigned int flags,
> >                 if (parent != READ_ONCE(dentry->d_parent))
> >                         return -ECHILD;
> >         } else {
> > +               /* Wait for unlink to complete */
> > +               wait_var_event(&dentry->d_flags,
> > +                              !(dentry->d_flags &
> > DCACHE_NFS_PENDING_UNLINK));
> >                 parent = dget_parent(dentry);
> >                 ret = reval(d_inode(parent), dentry, flags);
> >                 dput(parent);
> > @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry
> > *dentry)
> >  int nfs_unlink(struct inode *dir, struct dentry *dentry)
> >  {
> >         int error;
> > -       int need_rehash = 0;
> >  
> >         dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb-
> > >s_id,
> >                 dir->i_ino, dentry);
> > @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct
> > dentry *dentry)
> >                 error = nfs_sillyrename(dir, dentry);
> >                 goto out;
> >         }
> > -       if (!d_unhashed(dentry)) {
> > -               __d_drop(dentry);
> > -               need_rehash = 1;
> > -       }
> > +       /* We must prevent any concurrent open until the unlink
> > +        * completes.  ->d_revalidate will wait for
> > DCACHE_NFS_PENDING_UNLINK
> > +        * to clear.  We set it here to ensure no lookup succeeds
> > until
> > +        * the unlink is complete on the server.
> > +        */
> > +       dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK;
> > +
> >         spin_unlock(&dentry->d_lock);
> >         error = nfs_safe_remove(dentry);
> >         nfs_dentry_remove_handle_error(dir, dentry, error);
> > -       if (need_rehash)
> > -               d_rehash(dentry);
> > +       spin_lock(&dentry->d_lock);
> > +       dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK;
> > +       spin_unlock(&dentry->d_lock);
> > +       wake_up_var(&dentry->d_flags);
> >  out:
> >         trace_nfs_unlink_exit(dir, dentry, error);
> >         return error;
> > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > index a17c337dbdf1..041a6076e045 100644
> > --- a/include/linux/nfs_fs.h
> > +++ b/include/linux/nfs_fs.h
> > @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid)
> >  
> >  #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
> >  
> > +/* We need to block new opens while a file is being unlinked.
> > + * If it is opened *before* we decide to unlink, we will silly-
> > rename
> > + * instead. If it is opened *after*, then we the open to fail
> > unless it creates
> > + * a new file.
> > + * If we allow the open and unlink to race, we could end up with a
> > file that is
> > + * open but deleted on the server resulting in ESTALE.
> > + * So overload DCACHE_DONTCACHE to record when the unlink is
> > happening
> > + * and block dentry revalidation while it is set.
> > + * DCACHE_DONTCACHE is only used by filesystems which call
> > d_mark_dontcache()
> > + * which NFS never calls.  It is only tested on a dentry on which
> > all references
> > + * have been dropped, so it is safe for NFS to set it while
> > holding a reference.
> > + */
> > +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE
> > +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use
> > d_mark_dontcache()")
> >  
> >  # undef ifdebug
> >  # ifdef NFS_DEBUG
> > --
> > 2.36.1
> >
> >

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-07-28 02:33:00

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFS: don't unhash dentry during unlink.

On Thu, 28 Jul 2022, Trond Myklebust wrote:
> On Tue, 2022-07-26 at 14:40 +1000, NeilBrown wrote:
> >
> > Hi Trond,
> >  did you have a change to look at this patch?
>
> I did take a look, and the patch does seem acceptable to me as it
> stands.

Thanks.

>
> However without the full context of the other patches you're writing,
> I'm not able to ascertain whether or not a better approach than
> overloading the meaning of DCACHE_DONTCACHE might exist. I'm wondering
> if perhaps co-opting struct nfs_unlinkdata and/or creating something
> similar for the non-sillyrename case might not work just as well?

The significant part of the context is that i_rwsem will no longer
provide exclusion between non-cached lookup and unlink, so the
d_unhash() trick won't work any more.

Using d_fsdata somehow would be a better solution providing we can make
it work.
I've looked again through the use of d_fsdata and I see now that an
nfs_unlinkdata is only stored there when DCACHE_NFS_RENAMED is set, and
in that case we won't be called to unlink. It is sometimes set to a
"devname", but that is only for directories (I think) and only relevant
while the dentry IS_ROOT() - if it is being unlinked, it cannot be
IS_ROOT().

So d_fsdata can only be NULL at the time of interest.
So I can set it to some other value ((void*)1 ??) to block d_revalidate
while the unlink is progressing.
I also noticed that I need to make the same change in nfs_rename() when
the target exists.

I'll let you know when I have something suitable along those lines.

Thanks,
NeilBrown