Return-Path: Received: from mail-pa0-f46.google.com ([209.85.220.46]:33904 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753476AbbGEL1m (ORCPT ); Sun, 5 Jul 2015 07:27:42 -0400 Message-ID: <5599149D.7080106@gmail.com> Date: Sun, 05 Jul 2015 19:27:25 +0800 From: Kinglong Mee MIME-Version: 1.0 To: "J. Bruce Fields" , NeilBrown , Al Viro CC: "linux-nfs@vger.kernel.org" , linux-fsdevel@vger.kernel.org, kinglongmee@gmail.com Subject: Re: [PATCH] nfsd: don't hold i_mutex over userspace upcalls References: <20150430075225.21a71056@notabene.brown> <20150430213602.GB9509@fieldses.org> <20150501115326.51f5613a@notabene.brown> <20150504220130.GB16827@fieldses.org> <5548CBA3.9080608@gmail.com> <20150505141801.GB27106@fieldses.org> <20150505155203.GD27106@fieldses.org> <20150506082628.0dd049c7@notabene.brown> <20150508161558.GA18851@fieldses.org> <20150508200133.GC18851@fieldses.org> <20150603151819.GA8441@fieldses.org> In-Reply-To: <20150603151819.GA8441@fieldses.org> Content-Type: text/plain; charset=windows-1252 Sender: linux-nfs-owner@vger.kernel.org List-ID: Ping... What's the state of this patch ? Without modify nfs-utils, this one is make sense. thanks, Kinglong Mee On 6/3/2015 23:18, J. Bruce Fields wrote: > This passes my review, but it needs an ACK from Al or someone for the > addition of the new lookup_one_len_unlocked (which is the same as > lookup_one_len except that it takes the i_mutex itself when required > instead of requiring the caller to). > > --b. > > On Fri, May 08, 2015 at 04:01:33PM -0400, J. Bruce Fields wrote: >> From: NeilBrown >> >> We need information about exports when crossing mountpoints during >> lookup or NFSv4 readdir. If we don't already have that information >> cached, we may have to ask (and wait for) rpc.mountd. >> >> In both cases we currently hold the i_mutex on the parent of the >> directory we're asking rpc.mountd about. We've seen situations where >> rpc.mountd performs some operation on that directory that tries to take >> the i_mutex again, resulting in deadlock. >> >> With some care, we may be able to avoid that in rpc.mountd. But it >> seems better just to avoid holding a mutex while waiting on userspace. >> >> It appears that lookup_one_len is pretty much the only operation that >> needs the i_mutex. So we could just drop the i_mutex elsewhere and do >> something like >> >> mutex_lock() >> lookup_one_len() >> mutex_unlock() >> >> In many cases though the lookup would have been cached and not required >> the i_mutex, so it's more efficient to create a lookup_one_len() variant >> that only takes the i_mutex when necessary. >> >> Signed-off-by: J. Bruce Fields >> --- >> fs/namei.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/nfsd/nfs3xdr.c | 2 +- >> fs/nfsd/nfs4xdr.c | 8 +++--- >> fs/nfsd/vfs.c | 23 +++++++--------- >> include/linux/namei.h | 1 + >> 5 files changed, 89 insertions(+), 19 deletions(-) >> >> Here's an updated patch. >> >> diff --git a/fs/namei.c b/fs/namei.c >> index 4a8d998b7274..8b866d79c5b7 100644 >> --- a/fs/namei.c >> +++ b/fs/namei.c >> @@ -2139,6 +2139,8 @@ static struct dentry *lookup_hash(struct nameidata *nd) >> * >> * Note that this routine is purely a helper for filesystem usage and should >> * not be called by generic code. >> + * >> + * The caller must hold base->i_mutex. >> */ >> struct dentry *lookup_one_len(const char *name, struct dentry *base, int len) >> { >> @@ -2182,6 +2184,78 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len) >> } >> EXPORT_SYMBOL(lookup_one_len); >> >> +/** >> + * lookup_one_len_unlocked - filesystem helper to lookup single pathname component >> + * @name: pathname component to lookup >> + * @base: base directory to lookup from >> + * @len: maximum length @len should be interpreted to >> + * >> + * Note that this routine is purely a helper for filesystem usage and should >> + * not be called by generic code. >> + * >> + * Unlike lookup_one_len, it should be called without the parent >> + * i_mutex held, and will take the i_mutex itself if necessary. >> + */ >> +struct dentry *lookup_one_len_unlocked(const char *name, >> + struct dentry *base, int len) >> +{ >> + struct qstr this; >> + unsigned int c; >> + int err; >> + struct dentry *ret; >> + >> + this.name = name; >> + this.len = len; >> + this.hash = full_name_hash(name, len); >> + if (!len) >> + return ERR_PTR(-EACCES); >> + >> + if (unlikely(name[0] == '.')) { >> + if (len < 2 || (len == 2 && name[1] == '.')) >> + return ERR_PTR(-EACCES); >> + } >> + >> + while (len--) { >> + c = *(const unsigned char *)name++; >> + if (c == '/' || c == '\0') >> + return ERR_PTR(-EACCES); >> + } >> + /* >> + * See if the low-level filesystem might want >> + * to use its own hash.. >> + */ >> + if (base->d_flags & DCACHE_OP_HASH) { >> + int err = base->d_op->d_hash(base, &this); >> + if (err < 0) >> + return ERR_PTR(err); >> + } >> + >> + err = inode_permission(base->d_inode, MAY_EXEC); >> + if (err) >> + return ERR_PTR(err); >> + >> + ret = __d_lookup(base, &this); >> + if (ret) >> + return ret; >> + /* >> + * __d_lookup() is used to try to get a quick answer and avoid the >> + * mutex. A false-negative does no harm. >> + */ >> + ret = __d_lookup(base, &this); >> + if (ret && ret->d_flags & DCACHE_OP_REVALIDATE) { >> + dput(ret); >> + ret = NULL; >> + } >> + if (ret) >> + return ret; >> + >> + mutex_lock(&base->d_inode->i_mutex); >> + ret = __lookup_hash(&this, base, 0); >> + mutex_unlock(&base->d_inode->i_mutex); >> + return ret; >> +} >> +EXPORT_SYMBOL(lookup_one_len_unlocked); >> + >> int user_path_at_empty(int dfd, const char __user *name, unsigned flags, >> struct path *path, int *empty) >> { >> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c >> index f6e7cbabac5a..01dcd494f781 100644 >> --- a/fs/nfsd/nfs3xdr.c >> +++ b/fs/nfsd/nfs3xdr.c >> @@ -823,7 +823,7 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp, >> } else >> dchild = dget(dparent); >> } else >> - dchild = lookup_one_len(name, dparent, namlen); >> + dchild = lookup_one_len_unlocked(name, dparent, namlen); >> if (IS_ERR(dchild)) >> return rv; >> if (d_mountpoint(dchild)) >> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >> index 158badf945df..2c1adaa0bd2f 100644 >> --- a/fs/nfsd/nfs4xdr.c >> +++ b/fs/nfsd/nfs4xdr.c >> @@ -2804,14 +2804,14 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd, >> __be32 nfserr; >> int ignore_crossmnt = 0; >> >> - dentry = lookup_one_len(name, cd->rd_fhp->fh_dentry, namlen); >> + dentry = lookup_one_len_unlocked(name, cd->rd_fhp->fh_dentry, namlen); >> if (IS_ERR(dentry)) >> return nfserrno(PTR_ERR(dentry)); >> if (d_really_is_negative(dentry)) { >> /* >> - * nfsd_buffered_readdir drops the i_mutex between >> - * readdir and calling this callback, leaving a window >> - * where this directory entry could have gone away. >> + * we're not holding the i_mutex here, so there's >> + * a window where this directory entry could have gone >> + * away. >> */ >> dput(dentry); >> return nfserr_noent; >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> index a30e79900086..6d5b33458e91 100644 >> --- a/fs/nfsd/vfs.c >> +++ b/fs/nfsd/vfs.c >> @@ -217,10 +217,16 @@ nfsd_lookup_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp, >> host_err = PTR_ERR(dentry); >> if (IS_ERR(dentry)) >> goto out_nfserr; >> - /* >> - * check if we have crossed a mount point ... >> - */ >> if (nfsd_mountpoint(dentry, exp)) { >> + /* >> + * We don't need the i_mutex after all. It's >> + * still possible we could open this (regular >> + * files can be mountpoints too), but the >> + * i_mutex is just there to prevent renames of >> + * something that we might be about to delegate, >> + * and a mountpoint won't be renamed: >> + */ >> + fh_unlock(fhp); >> if ((host_err = nfsd_cross_mnt(rqstp, &dentry, &exp))) { >> dput(dentry); >> goto out_nfserr; >> @@ -1876,7 +1882,6 @@ static __be32 nfsd_buffered_readdir(struct file *file, nfsd_filldir_t func, >> offset = *offsetp; >> >> while (1) { >> - struct inode *dir_inode = file_inode(file); >> unsigned int reclen; >> >> cdp->err = nfserr_eof; /* will be cleared on successful read */ >> @@ -1895,15 +1900,6 @@ static __be32 nfsd_buffered_readdir(struct file *file, nfsd_filldir_t func, >> if (!size) >> break; >> >> - /* >> - * Various filldir functions may end up calling back into >> - * lookup_one_len() and the file system's ->lookup() method. >> - * These expect i_mutex to be held, as it would within readdir. >> - */ >> - host_err = mutex_lock_killable(&dir_inode->i_mutex); >> - if (host_err) >> - break; >> - >> de = (struct buffered_dirent *)buf.dirent; >> while (size > 0) { >> offset = de->offset; >> @@ -1920,7 +1916,6 @@ static __be32 nfsd_buffered_readdir(struct file *file, nfsd_filldir_t func, >> size -= reclen; >> de = (struct buffered_dirent *)((char *)de + reclen); >> } >> - mutex_unlock(&dir_inode->i_mutex); >> if (size > 0) /* We bailed out early */ >> break; >> >> diff --git a/include/linux/namei.h b/include/linux/namei.h >> index c8990779f0c3..bb3a2f7cca67 100644 >> --- a/include/linux/namei.h >> +++ b/include/linux/namei.h >> @@ -62,6 +62,7 @@ extern struct dentry *kern_path_locked(const char *, struct path *); >> extern int kern_path_mountpoint(int, const char *, struct path *, unsigned int); >> >> extern struct dentry *lookup_one_len(const char *, struct dentry *, int); >> +extern struct dentry *lookup_one_len_unlocked(const char *, struct dentry *, int); >> >> extern int follow_down_one(struct path *); >> extern int follow_down(struct path *); >> -- >> 1.9.3 >> >