2020-11-30 21:27:14

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target

From: Jeff Layton <[email protected]>

It's not uncommon for some workloads to do a bunch of I/O to a file and
delete it just afterward. If knfsd has a cached open file however, then
the file may still be open when the dentry is unlinked. If the
underlying filesystem is nfs, then that could trigger it to do a
sillyrename.

On a REMOVE or RENAME scan the nfsd_file cache for open files that
correspond to the inode, and proactively unhash and put their
references. This should prevent any delete-on-last-close activity from
occurring, solely due to knfsd's open file cache.

This must be done synchronously though so we use the variants that call
flush_delayed_fput. There are deadlock possibilities if you call
flush_delayed_fput while holding locks, however. In the case of
nfsd_rename, we don't even do the lookups of the dentries to be renamed
until we've locked for rename.

Once we've figured out what the target dentry is for a rename, check to
see whether there are cached open files associated with it. If there
are, then unwind all of the locking, close them all, and then reattempt
the rename.

None of this is really necessary for "typical" filesystems though. It's
mostly of use for NFS, so declare a new export op flag and use that to
determine whether to close the files beforehand.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Lance Shelton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---
Documentation/filesystems/nfs/exporting.rst | 13 +++++++++++++
fs/nfs/export.c | 2 +-
fs/nfsd/vfs.c | 16 +++++++++-------
include/linux/exportfs.h | 5 +++--
4 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 960be64446cb..0e98edd353b5 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -202,3 +202,16 @@ following flags are defined:
This flag exempts the filesystem from subtree checking and causes
exportfs to get back an error if it tries to enable subtree checking
on it.
+
+ EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+ On some exportable filesystems (such as NFS) unlinking a file that
+ is still open can cause a fair bit of extra work. For instance,
+ the NFS client will do a "sillyrename" to ensure that the file
+ sticks around while it's still open. When reexporting, that open
+ file is held by nfsd so we usually end up doing a sillyrename, and
+ then immediately deleting the sillyrenamed file just afterward when
+ the link count actually goes to zero. Sometimes this delete can race
+ with other operations (for instance an rmdir of the parent directory).
+ This flag causes nfsd to close any open files for this inode _before_
+ calling into the vfs to do an unlink or a rename that would replace
+ an existing file.
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index b9ba306bf912..5428713af5fe 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,5 +171,5 @@ const struct export_operations nfs_export_ops = {
.encode_fh = nfs_encode_fh,
.fh_to_dentry = nfs_fh_to_dentry,
.get_parent = nfs_get_parent,
- .flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK,
+ .flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|EXPORT_OP_CLOSE_BEFORE_UNLINK,
};
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecaceebee13..79cba942087e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1724,7 +1724,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
struct inode *fdir, *tdir;
__be32 err;
int host_err;
- bool has_cached = false;
+ bool close_cached = false;

err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_REMOVE);
if (err)
@@ -1783,8 +1783,9 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (ffhp->fh_export->ex_path.dentry != tfhp->fh_export->ex_path.dentry)
goto out_dput_new;

- if (nfsd_has_cached_files(ndentry)) {
- has_cached = true;
+ if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
+ nfsd_has_cached_files(ndentry)) {
+ close_cached = true;
goto out_dput_old;
} else {
host_err = vfs_rename(fdir, odentry, tdir, ndentry, NULL, 0);
@@ -1805,7 +1806,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
* as that would do the wrong thing if the two directories
* were the same, so again we do it by hand.
*/
- if (!has_cached) {
+ if (!close_cached) {
fill_post_wcc(ffhp);
fill_post_wcc(tfhp);
}
@@ -1819,8 +1820,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
* shouldn't be done with locks held however, so we delay it until this
* point and then reattempt the whole shebang.
*/
- if (has_cached) {
- has_cached = false;
+ if (close_cached) {
+ close_cached = false;
nfsd_close_cached_files(ndentry);
dput(ndentry);
goto retry;
@@ -1872,7 +1873,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
type = d_inode(rdentry)->i_mode & S_IFMT;

if (type != S_IFDIR) {
- nfsd_close_cached_files(rdentry);
+ if (rdentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK)
+ nfsd_close_cached_files(rdentry);
host_err = vfs_unlink(dirp, rdentry, NULL);
} else {
host_err = vfs_rmdir(dirp, rdentry);
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 2fcbab0f6b61..d829403ffd3b 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,8 +213,9 @@ struct export_operations {
bool write, u32 *device_generation);
int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
int nr_iomaps, struct iattr *iattr);
-#define EXPORT_OP_NOWCC (0x1) /* Don't collect wcc data for NFSv3 replies */
-#define EXPORT_OP_NOSUBTREECHK (0x2) /* Subtree checking is not supported! */
+#define EXPORT_OP_NOWCC (0x1) /* don't collect v3 wcc data */
+#define EXPORT_OP_NOSUBTREECHK (0x2) /* no subtree checking */
+#define EXPORT_OP_CLOSE_BEFORE_UNLINK (0x4) /* close files before unlink */
unsigned long flags;
};

--
2.28.0


2020-11-30 21:27:59

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH 4/6] exportfs: Add a function to return the raw output from fh_to_dentry()

From: Trond Myklebust <[email protected]>

In order to allow nfsd to accept return values that are not
acceptable to overlayfs and others, add a new function.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/exportfs/expfs.c | 32 ++++++++++++++++++++++++--------
include/linux/exportfs.h | 5 +++++
2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 2dd55b172d57..0106eba46d5a 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -417,9 +417,11 @@ int exportfs_encode_fh(struct dentry *dentry, struct fid *fid, int *max_len,
}
EXPORT_SYMBOL_GPL(exportfs_encode_fh);

-struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
- int fh_len, int fileid_type,
- int (*acceptable)(void *, struct dentry *), void *context)
+struct dentry *
+exportfs_decode_fh_raw(struct vfsmount *mnt, struct fid *fid, int fh_len,
+ int fileid_type,
+ int (*acceptable)(void *, struct dentry *),
+ void *context)
{
const struct export_operations *nop = mnt->mnt_sb->s_export_op;
struct dentry *result, *alias;
@@ -432,10 +434,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
if (!nop || !nop->fh_to_dentry)
return ERR_PTR(-ESTALE);
result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
- if (PTR_ERR(result) == -ENOMEM)
- return ERR_CAST(result);
if (IS_ERR_OR_NULL(result))
- return ERR_PTR(-ESTALE);
+ return result;

/*
* If no acceptance criteria was specified by caller, a disconnected
@@ -561,10 +561,26 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,

err_result:
dput(result);
- if (err != -ENOMEM)
- err = -ESTALE;
return ERR_PTR(err);
}
+EXPORT_SYMBOL_GPL(exportfs_decode_fh_raw);
+
+struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
+ int fh_len, int fileid_type,
+ int (*acceptable)(void *, struct dentry *),
+ void *context)
+{
+ struct dentry *ret;
+
+ ret = exportfs_decode_fh_raw(mnt, fid, fh_len, fileid_type,
+ acceptable, context);
+ if (IS_ERR_OR_NULL(ret)) {
+ if (ret == ERR_PTR(-ENOMEM))
+ return ret;
+ return ERR_PTR(-ESTALE);
+ }
+ return ret;
+}
EXPORT_SYMBOL_GPL(exportfs_decode_fh);

MODULE_LICENSE("GPL");
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index d829403ffd3b..846df3c96730 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -223,6 +223,11 @@ extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
int *max_len, struct inode *parent);
extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
int *max_len, int connectable);
+extern struct dentry *exportfs_decode_fh_raw(struct vfsmount *mnt,
+ struct fid *fid, int fh_len,
+ int fileid_type,
+ int (*acceptable)(void *, struct dentry *),
+ void *context);
extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
void *context);
--
2.28.0

2020-11-30 21:31:05

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE

From: Trond Myklebust <[email protected]>

If the underlying filesystem times out, then we want knfsd to return
NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfsd/nfsfh.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 0c2ee65e46f3..46c86f7bc429 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
if (fileid_type == FILEID_ROOT)
dentry = dget(exp->ex_path.dentry);
else {
- dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
- data_left, fileid_type,
- nfsd_acceptable, exp);
- if (IS_ERR_OR_NULL(dentry))
+ dentry = exportfs_decode_fh_raw(exp->ex_path.mnt, fid,
+ data_left, fileid_type,
+ nfsd_acceptable, exp);
+ if (IS_ERR_OR_NULL(dentry)) {
trace_nfsd_set_fh_dentry_badhandle(rqstp, fhp,
dentry ? PTR_ERR(dentry) : -ESTALE);
+ switch (PTR_ERR(dentry)) {
+ case -ENOMEM:
+ case -ETIMEDOUT:
+ break;
+ default:
+ dentry = ERR_PTR(-ESTALE);
+ }
+ }
}
if (dentry == NULL)
goto out;
--
2.28.0

2020-11-30 23:14:14

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE

On Mon, Nov 30, 2020 at 04:24:54PM -0500, [email protected] wrote:
> From: Trond Myklebust <[email protected]>
>
> If the underlying filesystem times out, then we want knfsd to return
> NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.

Out of curiosity, what was causing ETIMEDOUT in practice?

--b.

>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfsd/nfsfh.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 0c2ee65e46f3..46c86f7bc429 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
> if (fileid_type == FILEID_ROOT)
> dentry = dget(exp->ex_path.dentry);
> else {
> - dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
> - data_left, fileid_type,
> - nfsd_acceptable, exp);
> - if (IS_ERR_OR_NULL(dentry))
> + dentry = exportfs_decode_fh_raw(exp->ex_path.mnt, fid,
> + data_left, fileid_type,
> + nfsd_acceptable, exp);
> + if (IS_ERR_OR_NULL(dentry)) {
> trace_nfsd_set_fh_dentry_badhandle(rqstp, fhp,
> dentry ? PTR_ERR(dentry) : -ESTALE);
> + switch (PTR_ERR(dentry)) {
> + case -ENOMEM:
> + case -ETIMEDOUT:
> + break;
> + default:
> + dentry = ERR_PTR(-ESTALE);
> + }
> + }
> }
> if (dentry == NULL)
> goto out;
> --
> 2.28.0

2020-12-01 00:41:32

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE

On Mon, 2020-11-30 at 18:05 -0500, J. Bruce Fields wrote:
> On Mon, Nov 30, 2020 at 04:24:54PM -0500, [email protected] wrote:
> > From: Trond Myklebust <[email protected]>
> >
> > If the underlying filesystem times out, then we want knfsd to
> > return
> > NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
>
> Out of curiosity, what was causing ETIMEDOUT in practice?
>

If you're only re-exporting NFS from a single server, then it is OK to
use hard mounts. However if you are exporting from multiple servers, or
you have local filesystems that are also being exported by the same
knfsd server, then you usually want to use softerr mounts for NFS so
that operations that take an inordinate amount of time due to temporary
server outages get converted into JUKEBOX/DELAY errors. Otherwise, it
is really simple to cause all the nfsd threads to hang on that one
delinquent server.

> --b.
>
> >
> > Signed-off-by: Trond Myklebust <[email protected]>
> > ---
> >  fs/nfsd/nfsfh.c | 16 ++++++++++++----
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > index 0c2ee65e46f3..46c86f7bc429 100644
> > --- a/fs/nfsd/nfsfh.c
> > +++ b/fs/nfsd/nfsfh.c
> > @@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct
> > svc_rqst *rqstp, struct svc_fh *fhp)
> >         if (fileid_type == FILEID_ROOT)
> >                 dentry = dget(exp->ex_path.dentry);
> >         else {
> > -               dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
> > -                               data_left, fileid_type,
> > -                               nfsd_acceptable, exp);
> > -               if (IS_ERR_OR_NULL(dentry))
> > +               dentry = exportfs_decode_fh_raw(exp->ex_path.mnt,
> > fid,
> > +                                               data_left,
> > fileid_type,
> > +                                               nfsd_acceptable,
> > exp);
> > +               if (IS_ERR_OR_NULL(dentry)) {
> >                         trace_nfsd_set_fh_dentry_badhandle(rqstp,
> > fhp,
> >                                         dentry ?  PTR_ERR(dentry) :
> > -ESTALE);
> > +                       switch (PTR_ERR(dentry)) {
> > +                       case -ENOMEM:
> > +                       case -ETIMEDOUT:
> > +                               break;
> > +                       default:
> > +                               dentry = ERR_PTR(-ESTALE);
> > +                       }
> > +               }
> >         }
> >         if (dentry == NULL)
> >                 goto out;
> > --
> > 2.28.0

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-12-01 02:44:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE

On Tue, Dec 01, 2020 at 12:39:11AM +0000, Trond Myklebust wrote:
> On Mon, 2020-11-30 at 18:05 -0500, J. Bruce Fields wrote:
> > On Mon, Nov 30, 2020 at 04:24:54PM -0500, [email protected]?wrote:
> > > From: Trond Myklebust <[email protected]>
> > >
> > > If the underlying filesystem times out, then we want knfsd to
> > > return
> > > NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
> >
> > Out of curiosity, what was causing ETIMEDOUT in practice?
> >
>
> If you're only re-exporting NFS from a single server, then it is OK to
> use hard mounts. However if you are exporting from multiple servers, or
> you have local filesystems that are also being exported by the same
> knfsd server, then you usually want to use softerr mounts for NFS so
> that operations that take an inordinate amount of time due to temporary
> server outages get converted into JUKEBOX/DELAY errors. Otherwise, it
> is really simple to cause all the nfsd threads to hang on that one
> delinquent server.

Makes sense, thanks.

In theory the same thing could happen with block devices; longer term I
wonder if it'd make sense to limit how many threads are waiting on a
single backend.

(ACK to the patch, though, that'd be a project for another day.)

--b.