Subject: Re: [PATCH RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()
From: Jeff Layton
To: Chuck Lever III
Cc: Linux NFS Mailing List
Date: Mon, 23 May 2022 11:26:14 -0400
In-Reply-To: <510282CB-38D3-438A-AF8A-9AC2519FCEF7@oracle.com>
References: <165323344948.2381.7808135229977810927.stgit@bazille.1015granger.net>
 <510282CB-38D3-438A-AF8A-9AC2519FCEF7@oracle.com>
X-Mailing-List: linux-nfs@vger.kernel.org

On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 9:40 AM, Jeff Layton wrote:
> > 
> > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> > > / nfsd_file_put() to access the backing inode's flc_posix list, and
> > > nfsd_file_put() can sleep if the inode was recently removed.
> > 
> > It might be good to add a might_sleep() to nfsd_file_put?
> 
> I intend to include the patch you reviewed last week that
> adds the might_sleep(), as part of this series.
> 
> > > Let's instead rely on the stateowner's reference count to gate
> > > whether the release is permitted. This should be a reliable
> > > indication of locks-in-use since file lock operations and
> > > ->lm_get_owner take appropriate references, which are released
> > > appropriately when file locks are removed.
> > > 
> > > Reported-by: Dai Ngo
> > > Signed-off-by: Chuck Lever
> > > Cc: stable@vger.kernel.org
> > > ---
> > >  fs/nfsd/nfs4state.c |    9 +++------
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > This might be a naive approach, but let's start with it.
> > > 
> > > This passes light testing, but it's not clear how much our existing
> > > fleet of tests exercises this area. I've locally built a couple of
> > > pynfs tests (one is based on the one Dai posted last week) and they
> > > pass too.
> > > 
> > > I don't believe that FREE_STATEID needs the same simplification.
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index a280256cbb03..b77894e668a4 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > 
> > >  	/* see if there are still any locks associated with it */
> > >  	lo = lockowner(sop);
> > > -	list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > -		if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > -			status = nfserr_locks_held;
> > > -			spin_unlock(&clp->cl_lock);
> > > -			return status;
> > > -		}
> > > +	if (atomic_read(&sop->so_count) > 1) {
> > > +		spin_unlock(&clp->cl_lock);
> > > +		return nfserr_locks_held;
> > >  	}
> > > 
> > >  	nfs4_get_stateowner(sop);
> > > 
> > 
> > lm_get_owner is called from locks_copy_conflock, so if someone else
> > happens to be doing a LOCKT or F_GETLK call at the same time that
> > RELEASE_LOCKOWNER gets called, then this may end up returning an error
> > inappropriately.
> 
> IMO releasing the lockowner while it's being used for _anything_
> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> the client is still using the lockowner for any reason, a
> subsequent error will occur if the client tries to use it again.
> Heck, I can see the server failing in mid-COMPOUND with this kind
> of race. Better I think to just leave the lockowner in place if
> there's any ambiguity.
> 

The problem here is not the client itself calling RELEASE_LOCKOWNER
while the lockowner is still in use, but rather a different client
altogether calling LOCKT (or a local process doing an F_GETLK) on an
inode where a lock is held by a client. The LOCKT takes a reference to
the lockowner (for the conflock), while the client that owns it
releases the lock and then the lockowner while the refcount is still
elevated.

The race window for this is probably quite small, but I think it's
theoretically possible.
The point is that an elevated refcount on the lockowner doesn't
necessarily mean that locks are actually being held by it.

> The spec language does not say RELEASE_LOCKOWNER must not return
> LOCKS_HELD for other reasons, and it does say that there is no
> choice of using another NFSERR value (RFC 7530 Section 13.2).
> 

What recourse does the client have if this happens? It released all of
its locks and tried to release the lockowner, but the server says
"locks held". Should it just give up at that point?
RELEASE_LOCKOWNER is a sort of a courtesy by the client, I suppose...

> 
> > My guess is that it would be pretty hard to hit the
> > timing right, but not impossible.
> > 
> > What we may want to do is have the kernel do this check and only if
> > it comes back >1 do the actual check for locks. That won't fix the
> > original problem though.
> > 
> > In other places in nfsd, we've plumbed in a dispose_list head and
> > deferred the sleeping functions until the spinlock can be dropped. I
> > haven't looked closely at whether that's possible here, but it may
> > be a more reliable approach.
> 
> That was proposed by Dai last week.
> 
> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> 
> Trond pointed out that if two separate clients were releasing a
> lockowner on the same inode, there is nothing that protects the
> dispose_list, and it would get corrupted.
> 
> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> 

Yeah, that doesn't look like what's needed.

What I was going to suggest is an nfsd_file_put variant that takes a
list_head. If the refcount goes to zero and the thing ends up being
unhashed, then you put it on the dispose list rather than doing the
blocking operations, and then clean it up later.

That said, nfsd_file_put has grown significantly in complexity over
the years, so maybe that's not simple to do now.
-- 
Jeff Layton