Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
From:   "NeilBrown" <neilb@suse.de>
To:     Trond Myklebust <trondmy@kernel.org>,
        Anna Schumaker <anna@kernel.org>
Cc:     Olga Kornievskaia <aglo@umich.edu>,
        Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: [PATCH] NFS: avoid infinite NFS4ERR_OLD_STATEID loops
Date:   Tue, 14 Feb 2023 09:57:09 +1100
Message-id: <167632902904.1896.16364452992981515041@noble.neil.brown.name>
Precedence: bulk


Linux-NFS responds to NFS4ERR_OLD_STATEID by simply retrying the
request, hoping to make use of an updated stateid that might have
arrived from the server.  This is usually successful.

However if the client and server get out-of-sync for some reason and if
there is no new stateid to try, this can result in an indefinite loop
which looks a bit like a DoS attack.

This can particularly happen when a server replies with success to an
OPEN request, but fails a subsequent GETATTR.  This has been observed
with Netapp and Hitachi servers when a concurrent unlink from a
different client removes the file between the OPEN and the GETATTR.  The
GETATTR returns NFS4ERR_STALE.

In this situation Linux-NFS ignores the successful open and in
particular does not record the new stateid.  If the same file was
already open, those active opens will now have an OLD_STATEID.

When the Netapp/Hitachi servers receive subseqent WRITE requests with
the STALE file handle and OLD stateid, they choose to report the
OLD_STATEID error rather than the STALE error.  This causes Linux-NFS to
loop indefinitely.

To protect against this or other scenarios that might result in old
stateids being used, this patch limits the number of retry attempts when
OLD_STATEID is received.  The first 4 retries are sent immediately after
an error.  After that the standard timed back-off retry scheme is used
until a 2-second delay is reached.  After than the OLD_STATEID error is
mapped to ESTALE so the write fails rather than hanging indefinitely.

This pattern of server behaviour can be demonstrated - and so the patch
can be tested - by modifying the Linux NFS server as follows:

1/ The second time a file is opened, unlink it.  This simulates
   a race with another client, without needing to have a race:

>    --- a/fs/nfsd/nfs4proc.c
>    +++ b/fs/nfsd/nfs4proc.c
>    @@ -594,6 +594,12 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_com=
pound_state *cstate,
>     	if (reclaim && !status)
>     		nn->somebody_reclaimed =3D true;
>     out:
>    +	if (!status && open->op_stateid.si_generation > 1) {
>    +		printk("Opening gen %d\n", (int)open->op_stateid.si_generation);
>    +		vfs_unlink(mnt_user_ns(resfh->fh_export->ex_path.mnt),
>    +			   resfh->fh_dentry->d_parent->d_inode,
>    +			   resfh->fh_dentry, NULL);
>    +	}
>     	if (open->op_filp) {
>     		fput(open->op_filp);
>     		open->op_filp =3D NULL;

2/ When a GETATTR op is attempted on an unlinked file, return ESTALE

>    @@ -852,6 +858,11 @@ nfsd4_getattr(struct svc_rqst *rqstp, struct nfsd4_=
compound_state *cstate,
>     	if (status)
>     		return status;
>
>    +	if (cstate->current_fh.fh_dentry->d_inode->i_nlink =3D=3D 0) {
>    +		printk("Return Estale for unlinked file\n");
>    +		return nfserr_stale;
>    +	}
>    +
>     	if (getattr->ga_bmval[1] & NFSD_WRITEONLY_ATTRS_WORD1)
>     		return nfserr_inval;

Then mount the filesystem and

  Thread 1            Thread 2
  open a file
                      open the same file (will fail)
  write to that file

I use this shell fragment, using 'sleep' for synchronisation.
The use of /bin/echo ensures the write is flushed when /bin/echo closes
the fd on exit.

    (
        /bin/echo hello
        sleep 3
        /bin/echo there
    ) > /import/A/afile &
    sleep 3
    cat /import/A/afile

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/nfs4proc.c       | 21 +++++++++++++++++++--
 include/linux/nfs_xdr.h |  2 ++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 40d749f29ed3..ee77bf42ec38 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -408,7 +408,7 @@ static long nfs4_update_delay(long *timeout)
 	long ret;
 	if (!timeout)
 		return NFS4_POLL_RETRY_MAX;
-	if (*timeout <=3D 0)
+	if (*timeout <=3D NFS4_POLL_RETRY_MIN)
 		*timeout =3D NFS4_POLL_RETRY_MIN;
 	if (*timeout > NFS4_POLL_RETRY_MAX)
 		*timeout =3D NFS4_POLL_RETRY_MAX;
@@ -562,9 +562,24 @@ static int nfs4_do_handle_exception(struct nfs_server *s=
erver,
 			return 0;
=20
 		case -NFS4ERR_RETRY_UNCACHED_REP:
-		case -NFS4ERR_OLD_STATEID:
 			exception->retry =3D 1;
 			break;
+		case -NFS4ERR_OLD_STATEID:
+			/* Simple retry is usually safe, but buggy
+			 * servers can cause DoS - so be careful.
+			 */
+			if (exception->timeout > HZ*2) {
+				ret =3D -ESTALE;
+			} else if (exception->timeout < 5) {
+				/* 5 tries with no delay */
+				exception->retry =3D 1;
+				exception->timeout +=3D 1;
+				ret =3D 0;
+			} else {
+				exception->delay =3D 1;
+				ret =3D 0;
+			}
+			break;
 		case -NFS4ERR_BADOWNER:
 			/* The following works around a Linux server bug! */
 		case -NFS4ERR_BADNAME:
@@ -5506,10 +5521,12 @@ static int nfs4_write_done_cb(struct rpc_task *task,
 			.inode =3D hdr->inode,
 			.state =3D hdr->args.context->state,
 			.stateid =3D &hdr->args.stateid,
+			.timeout =3D hdr->timeout,
 		};
 		task->tk_status =3D nfs4_async_handle_exception(task,
 				NFS_SERVER(inode), task->tk_status,
 				&exception);
+		hdr->timeout =3D exception.timeout;
 		if (exception.retry) {
 			rpc_restart_call_prepare(task);
 			return -EAGAIN;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index e86cf6642d21..0c791e5b213c 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1625,6 +1625,8 @@ struct nfs_pgio_header {
 	unsigned int		good_bytes;	/* boundary of good data */
 	unsigned long		flags;
=20
+	long			timeout;
+
 	/*
 	 * rpc data
 	 */
--=20
2.39.1