From: Yu Hsiang Huang <[email protected]>
Truncation of an unlinked inode may take a long time for I/O waiting, and
it doesn't have to prevent access to the directory. Thus, let truncation
occur outside the directory's mutex, just like do_unlinkat() does.
Signed-off-by: Yu Hsiang Huang <[email protected]>
Signed-off-by: Bing Jing Chang <[email protected]>
Signed-off-by: Robbie Ko <[email protected]>
---
fs/nfsd/vfs.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 15adf1f6ab21..39948f130712 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1859,6 +1859,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
{
struct dentry *dentry, *rdentry;
struct inode *dirp;
+ struct inode *rinode;
__be32 err;
int host_err;
@@ -1887,6 +1888,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
host_err = -ENOENT;
goto out_drop_write;
}
+ rinode = d_inode(rdentry);
+ ihold(rinode);
if (!type)
type = d_inode(rdentry)->i_mode & S_IFMT;
@@ -1902,6 +1905,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (!host_err)
host_err = commit_metadata(fhp);
dput(rdentry);
+ fh_unlock(fhp);
+ iput(rinode); /* truncate the inode here */
out_drop_write:
fh_drop_write(fhp);
--
2.25.1
On Fri, May 14, 2021 at 11:58:29AM +0800, Nick Huang wrote:
> From: Yu Hsiang Huang <[email protected]>
>
> Truncation of an unlinked inode may take a long time for I/O waiting, and
> it doesn't have to prevent access to the directory. Thus, let truncation
> occur outside the directory's mutex, just like do_unlinkat() does.
Makes sense to me, thanks. I'll queue it up for 5.14.
Just out of curiosity: was this found just by inspection, or were you
hitting this in a real workload? I'd be interested in what it took to
reproduce if so.
--b.
>
> Signed-off-by: Yu Hsiang Huang <[email protected]>
> Signed-off-by: Bing Jing Chang <[email protected]>
> Signed-off-by: Robbie Ko <[email protected]>
> ---
> fs/nfsd/vfs.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 15adf1f6ab21..39948f130712 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1859,6 +1859,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> {
> struct dentry *dentry, *rdentry;
> struct inode *dirp;
> + struct inode *rinode;
> __be32 err;
> int host_err;
>
> @@ -1887,6 +1888,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> host_err = -ENOENT;
> goto out_drop_write;
> }
> + rinode = d_inode(rdentry);
> + ihold(rinode);
>
> if (!type)
> type = d_inode(rdentry)->i_mode & S_IFMT;
> @@ -1902,6 +1905,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> if (!host_err)
> host_err = commit_metadata(fhp);
> dput(rdentry);
> + fh_unlock(fhp);
> + iput(rinode); /* truncate the inode here */
>
> out_drop_write:
> fh_drop_write(fhp);
> --
> 2.25.1
On Fri, 2021-05-14 at 11:58 +0800, Nick Huang wrote:
> From: Yu Hsiang Huang <[email protected]>
>
> Truncation of an unlinked inode may take a long time for I/O waiting,
> and
> it doesn't have to prevent access to the directory. Thus, let
> truncation
> occur outside the directory's mutex, just like do_unlinkat() does.
>
> Signed-off-by: Yu Hsiang Huang <[email protected]>
> Signed-off-by: Bing Jing Chang <[email protected]>
> Signed-off-by: Robbie Ko <[email protected]>
> ---
> fs/nfsd/vfs.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 15adf1f6ab21..39948f130712 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1859,6 +1859,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> svc_fh *fhp, int type,
> {
> struct dentry *dentry, *rdentry;
> struct inode *dirp;
> + struct inode *rinode;
> __be32 err;
> int host_err;
>
> @@ -1887,6 +1888,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> svc_fh *fhp, int type,
> host_err = -ENOENT;
> goto out_drop_write;
> }
> + rinode = d_inode(rdentry);
> + ihold(rinode);
>
> if (!type)
> type = d_inode(rdentry)->i_mode & S_IFMT;
> @@ -1902,6 +1905,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> svc_fh *fhp, int type,
> if (!host_err)
> host_err = commit_metadata(fhp);
Why leave the commit_metadata() call under the lock? If you're
concerned about latency, then it makes more sense to call fh_unlock()
before flushing those metadata updates to disk.
This is, BTW, an optimisation that appears to be possible in several
other cases in fs/nfsd/vfs.c.
> dput(rdentry);
> + fh_unlock(fhp);
> + iput(rinode); /* truncate the inode here */
>
> out_drop_write:
> fh_drop_write(fhp);
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> Why leave the commit_metadata() call under the lock? If you're
> concerned about latency, then it makes more sense to call fh_unlock()
> before flushing those metadata updates to disk.
Also I'm not sure why the extra inode reference is needed. What speaks
against just moving the dput out of the locked section?
On Sat, May 15, 2021 at 08:02:39AM +0100, Christoph Hellwig wrote:
> On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > Why leave the commit_metadata() call under the lock? If you're
> > concerned about latency, then it makes more sense to call fh_unlock()
> > before flushing those metadata updates to disk.
>
> Also I'm not sure why the extra inode reference is needed. What speaks
> against just moving the dput out of the locked section?
I don't know. Do you know why do_unlinkat() is doing the same thing?
--b.
On Mon, May 17, 2021 at 02:56:59PM -0400, [email protected] wrote:
> On Sat, May 15, 2021 at 08:02:39AM +0100, Christoph Hellwig wrote:
> > On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > > Why leave the commit_metadata() call under the lock? If you're
> > > concerned about latency, then it makes more sense to call fh_unlock()
> > > before flushing those metadata updates to disk.
> >
> > Also I'm not sure why the extra inode reference is needed. What speaks
> > against just moving the dput out of the locked section?
>
> I don't know. Do you know why do_unlinkat() is doing the same thing?
No. Al, any idea why unlink does the final dput under i_rwsem?
On Sat, 2021-05-15 at 08:02 +0100, Christoph Hellwig wrote:
> On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > Why leave the commit_metadata() call under the lock? If you're
> > concerned about latency, then it makes more sense to call
> > fh_unlock()
> > before flushing those metadata updates to disk.
>
> Also I'm not sure why the extra inode reference is needed. What
> speaks
> against just moving the dput out of the locked section?
Isn't the inode reference taken just in order to ensure that the call
to iput_final() (and in particular the call to
truncate_inode_pages_final()) is performed outside the lock?
The dput() is presumably usually not particularly expensive, since the
dentry is just a completely ordinary negative dentry at this point.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, May 18, 2021 at 05:53:47PM +0000, Trond Myklebust wrote:
> On Sat, 2021-05-15 at 08:02 +0100, Christoph Hellwig wrote:
> > On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > > Why leave the commit_metadata() call under the lock? If you're
> > > concerned about latency, then it makes more sense to call
> > > fh_unlock()
> > > before flushing those metadata updates to disk.
> >
> > Also I'm not sure why the extra inode reference is needed. What
> > speaks
> > against just moving the dput out of the locked section?
>
> Isn't the inode reference taken just in order to ensure that the call
> to iput_final() (and in particular the call to
> truncate_inode_pages_final()) is performed outside the lock?
>
> The dput() is presumably usually not particularly expensive, since the
> dentry is just a completely ordinary negative dentry at this point.
Right, but why bother with the extra ihold/iput instead of just
postponing the dput?
diff --git a/fs/namei.c b/fs/namei.c
index 79b0ff9b151e..aeed93c9874c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4084,7 +4084,6 @@ long do_unlinkat(int dfd, struct filename *name)
inode = dentry->d_inode;
if (d_is_negative(dentry))
goto slashes;
- ihold(inode);
error = security_path_unlink(&path, dentry);
if (error)
goto exit2;
@@ -4092,11 +4091,10 @@ long do_unlinkat(int dfd, struct filename *name)
error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
&delegated_inode);
exit2:
- dput(dentry);
}
inode_unlock(path.dentry->d_inode);
- if (inode)
- iput(inode); /* truncate the inode here */
+ if (!IS_ERR(dentry))
+ dput(dentry); /* truncate the inode here */
inode = NULL;
if (delegated_inode) {
error = break_deleg_wait(&delegated_inode);
?
--b.
On Tue, May 18, 2021 at 06:36:23PM +0000, Trond Myklebust wrote:
> On Tue, 2021-05-18 at 14:21 -0400, [email protected] wrote:
> > On Tue, May 18, 2021 at 05:53:47PM +0000, Trond Myklebust wrote:
> > > On Sat, 2021-05-15 at 08:02 +0100, Christoph Hellwig wrote:
> > > > On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > > > > Why leave the commit_metadata() call under the lock? If you're
> > > > > concerned about latency, then it makes more sense to call
> > > > > fh_unlock()
> > > > > before flushing those metadata updates to disk.
> > > >
> > > > Also I'm not sure why the extra inode reference is needed. What
> > > > speaks
> > > > against just moving the dput out of the locked section?
> > >
> > > Isn't the inode reference taken just in order to ensure that the
> > > call
> > > to iput_final() (and in particular the call to
> > > truncate_inode_pages_final()) is performed outside the lock?
> > >
> > > The dput() is presumably usually not particularly expensive, since
> > > the
> > > dentry is just a completely ordinary negative dentry at this point.
> >
> > Right, but why bother with the extra ihold/iput instead of just
> > postponing the dput?
>
> As I said above, the dentry is negative at this point. It doesn't carry
> a reference to the inode any more, so that wouldn't defer the inode
> truncation.
Oh, of course, thanks!
--b.
>
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 79b0ff9b151e..aeed93c9874c 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -4084,7 +4084,6 @@ long do_unlinkat(int dfd, struct filename
> > *name)
> > inode = dentry->d_inode;
> > if (d_is_negative(dentry))
> > goto slashes;
> > - ihold(inode);
> > error = security_path_unlink(&path, dentry);
> > if (error)
> > goto exit2;
> > @@ -4092,11 +4091,10 @@ long do_unlinkat(int dfd, struct filename
> > *name)
> > error = vfs_unlink(mnt_userns, path.dentry->d_inode,
> > dentry,
> > &delegated_inode);
> > exit2:
> > - dput(dentry);
> > }
> > inode_unlock(path.dentry->d_inode);
> > - if (inode)
> > - iput(inode); /* truncate the inode here */
> > + if (!IS_ERR(dentry))
> > + dput(dentry); /* truncate the inode here */
> > inode = NULL;
> > if (delegated_inode) {
> > error = break_deleg_wait(&delegated_inode);
> >
> > ?
> >
> > --b.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
On Tue, 2021-05-18 at 14:21 -0400, [email protected] wrote:
> On Tue, May 18, 2021 at 05:53:47PM +0000, Trond Myklebust wrote:
> > On Sat, 2021-05-15 at 08:02 +0100, Christoph Hellwig wrote:
> > > On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> > > > Why leave the commit_metadata() call under the lock? If you're
> > > > concerned about latency, then it makes more sense to call
> > > > fh_unlock()
> > > > before flushing those metadata updates to disk.
> > >
> > > Also I'm not sure why the extra inode reference is needed. What
> > > speaks
> > > against just moving the dput out of the locked section?
> >
> > Isn't the inode reference taken just in order to ensure that the
> > call
> > to iput_final() (and in particular the call to
> > truncate_inode_pages_final()) is performed outside the lock?
> >
> > The dput() is presumably usually not particularly expensive, since
> > the
> > dentry is just a completely ordinary negative dentry at this point.
>
> Right, but why bother with the extra ihold/iput instead of just
> postponing the dput?
As I said above, the dentry is negative at this point. It doesn't carry
a reference to the inode any more, so that wouldn't defer the inode
truncation.
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 79b0ff9b151e..aeed93c9874c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4084,7 +4084,6 @@ long do_unlinkat(int dfd, struct filename
> *name)
> inode = dentry->d_inode;
> if (d_is_negative(dentry))
> goto slashes;
> - ihold(inode);
> error = security_path_unlink(&path, dentry);
> if (error)
> goto exit2;
> @@ -4092,11 +4091,10 @@ long do_unlinkat(int dfd, struct filename
> *name)
> error = vfs_unlink(mnt_userns, path.dentry->d_inode,
> dentry,
> &delegated_inode);
> exit2:
> - dput(dentry);
> }
> inode_unlock(path.dentry->d_inode);
> - if (inode)
> - iput(inode); /* truncate the inode here */
> + if (!IS_ERR(dentry))
> + dput(dentry); /* truncate the inode here */
> inode = NULL;
> if (delegated_inode) {
> error = break_deleg_wait(&delegated_inode);
>
> ?
>
> --b.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Fri, May 14, 2021 at 03:46:57PM +0000, Trond Myklebust wrote:
> On Fri, 2021-05-14 at 11:58 +0800, Nick Huang wrote:
> > From: Yu Hsiang Huang <[email protected]>
> >
> > Truncation of an unlinked inode may take a long time for I/O waiting,
> > and
> > it doesn't have to prevent access to the directory. Thus, let
> > truncation
> > occur outside the directory's mutex, just like do_unlinkat() does.
> >
> > Signed-off-by: Yu Hsiang Huang <[email protected]>
> > Signed-off-by: Bing Jing Chang <[email protected]>
> > Signed-off-by: Robbie Ko <[email protected]>
> > ---
> > fs/nfsd/vfs.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 15adf1f6ab21..39948f130712 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1859,6 +1859,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> > svc_fh *fhp, int type,
> > {
> > struct dentry *dentry, *rdentry;
> > struct inode *dirp;
> > + struct inode *rinode;
> > __be32 err;
> > int host_err;
> >
> > @@ -1887,6 +1888,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> > svc_fh *fhp, int type,
> > host_err = -ENOENT;
> > goto out_drop_write;
> > }
> > + rinode = d_inode(rdentry);
> > + ihold(rinode);
> >
> > if (!type)
> > type = d_inode(rdentry)->i_mode & S_IFMT;
> > @@ -1902,6 +1905,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct
> > svc_fh *fhp, int type,
> > if (!host_err)
> > host_err = commit_metadata(fhp);
>
> Why leave the commit_metadata() call under the lock? If you're
> concerned about latency, then it makes more sense to call fh_unlock()
> before flushing those metadata updates to disk.
>
> This is, BTW, an optimisation that appears to be possible in several
> other cases in fs/nfsd/vfs.c.
I'm tentatively applying the original patch plus the following.
Create and rename code are two other places where we have
commit_metadata() calls that could probably be moved out from under
locks but they looked slightly more complicated.
--b.
commit ec79990df716
Author: J. Bruce Fields <[email protected]>
Date: Fri May 14 18:21:37 2021 -0400
nfsd: move some commit_metadata()s outside the inode lock
The commit may be time-consuming and there's no need to hold the lock
for it.
More of these are possible, these were just some easy ones.
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 39948f130712..d73d3c9126fc 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1613,9 +1613,9 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
host_err = vfs_symlink(&init_user_ns, d_inode(dentry), dnew, path);
err = nfserrno(host_err);
+ fh_unlock(fhp);
if (!err)
err = nfserrno(commit_metadata(fhp));
- fh_unlock(fhp);
fh_drop_write(fhp);
@@ -1680,6 +1680,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (d_really_is_negative(dold))
goto out_dput;
host_err = vfs_link(dold, &init_user_ns, dirp, dnew, NULL);
+ fh_unlock(ffhp);
if (!host_err) {
err = nfserrno(commit_metadata(ffhp));
if (!err)
@@ -1902,10 +1903,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
host_err = vfs_rmdir(&init_user_ns, dirp, rdentry);
}
+ fh_unlock(fhp);
if (!host_err)
host_err = commit_metadata(fhp);
dput(rdentry);
- fh_unlock(fhp);
iput(rinode); /* truncate the inode here */
out_drop_write: