LinuxLists.cc - [PATCH v2 0/9] Various NFSv4 state error handling fixes

2019-09-16 22:35:04

Subject: [PATCH v2 0/9] Various NFSv4 state error handling fixes

Various NFSv4 fixes to ensure we handle state errors correctly. In
particular, we need to ensure that for COMPOUNDs like CLOSE and
DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
layout state errors so that a retry of either the LAYOUTRETURN, or
the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
reply.

Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
best to still try to destroy the state on the server, in order to
avoid causing state leakage.

v2: Fix bug reports from Olga
- Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
doing fully serialised NFSv4.0.
- Ensure LOCKU initialises the stateid correctly.

Trond Myklebust (9):
pNFS: Ensure we do clear the return-on-close layout stateid on fatal
errors
NFSv4: Clean up pNFS return-on-close error handling
NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
NFSv4: Handle RPC level errors in LAYOUTRETURN
NFSv4: Add a helper to increment stateid seqids
pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state
seqid
NFSv4: Fix OPEN_DOWNGRADE error handling
NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU

fs/nfs/nfs4_fs.h | 11 ++-
fs/nfs/nfs4proc.c | 204 ++++++++++++++++++++++++++++++---------------
fs/nfs/nfs4state.c | 16 ----
fs/nfs/pnfs.c | 71 ++++++++++++++--
fs/nfs/pnfs.h | 17 +++-
5 files changed, 229 insertions(+), 90 deletions(-)

--
2.21.0

2019-09-16 22:35:59

by Trond Myklebust

[permalink] [raw]

Subject: [PATCH v2 1/9] pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors

IF the server rejected our layout return with a state error such as
NFS4ERR_BAD_STATEID, or even a stale inode error, then we do want
to clear out all the remaining layout segments and mark that stateid
as invalid.

Fixes: 1c5bd76d17cca ("pNFS: Enable layoutreturn operation for...")
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pnfs.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 4525d5acae38..0418b198edd3 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1449,10 +1449,15 @@ void pnfs_roc_release(struct nfs4_layoutreturn_args *args,
const nfs4_stateid *res_stateid = NULL;
struct nfs4_xdr_opaque_data *ld_private = args->ld_private;

- if (ret == 0) {
- arg_stateid = &args->stateid;
+ switch (ret) {
+ case -NFS4ERR_NOMATCHING_LAYOUT:
+ break;
+ case 0:
if (res->lrs_present)
res_stateid = &res->stateid;
+ /* Fallthrough */
+ default:
+ arg_stateid = &args->stateid;
}
pnfs_layoutreturn_free_lsegs(lo, arg_stateid, &args->range,
res_stateid);
--
2.21.0

2019-09-16 22:36:37

by Trond Myklebust

[permalink] [raw]

Subject: [PATCH v2 2/9] NFSv4: Clean up pNFS return-on-close error handling

Both close and delegreturn have identical code to handle pNFS
return-on-close. This patch refactors that code and places it
in pnfs.c

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs4proc.c | 66 +++++++----------------------------------------
fs/nfs/pnfs.c | 27 +++++++++++++++++++
fs/nfs/pnfs.h | 13 ++++++++++
3 files changed, 50 insertions(+), 56 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 1406858bae6c..fcdfddfd3ab4 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -3358,32 +3358,11 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
trace_nfs4_close(state, &calldata->arg, &calldata->res, task->tk_status);

/* Handle Layoutreturn errors */
- if (calldata->arg.lr_args && task->tk_status != 0) {
- switch (calldata->res.lr_ret) {
- default:
- calldata->res.lr_ret = -NFS4ERR_NOMATCHING_LAYOUT;
- break;
- case 0:
- calldata->arg.lr_args = NULL;
- calldata->res.lr_res = NULL;
- break;
- case -NFS4ERR_OLD_STATEID:
- if (nfs4_layoutreturn_refresh_stateid(&calldata->arg.lr_args->stateid,
- &calldata->arg.lr_args->range,
- calldata->inode))
- goto lr_restart;
- /* Fallthrough */
- case -NFS4ERR_ADMIN_REVOKED:
- case -NFS4ERR_DELEG_REVOKED:
- case -NFS4ERR_EXPIRED:
- case -NFS4ERR_BAD_STATEID:
- case -NFS4ERR_UNKNOWN_LAYOUTTYPE:
- case -NFS4ERR_WRONG_CRED:
- calldata->arg.lr_args = NULL;
- calldata->res.lr_res = NULL;
- goto lr_restart;
- }
- }
+ if (pnfs_roc_done(task, calldata->inode,
+ &calldata->arg.lr_args,
+ &calldata->res.lr_res,
+ &calldata->res.lr_ret) == -EAGAIN)
+ goto out_restart;

/* hmm. we are done with the inode, and in the process of freeing
* the state_owner. we keep this around to process errors
@@ -3430,8 +3409,6 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
nfs_refresh_inode(calldata->inode, &calldata->fattr);
dprintk("%s: done, ret = %d!\n", __func__, task->tk_status);
return;
-lr_restart:
- calldata->res.lr_ret = 0;
out_restart:
task->tk_status = 0;
rpc_restart_call_prepare(task);
@@ -6129,32 +6106,11 @@ static void nfs4_delegreturn_done(struct rpc_task *task, void *calldata)
trace_nfs4_delegreturn_exit(&data->args, &data->res, task->tk_status);

/* Handle Layoutreturn errors */
- if (data->args.lr_args && task->tk_status != 0) {
- switch(data->res.lr_ret) {
- default:
- data->res.lr_ret = -NFS4ERR_NOMATCHING_LAYOUT;
- break;
- case 0:
- data->args.lr_args = NULL;
- data->res.lr_res = NULL;
- break;
- case -NFS4ERR_OLD_STATEID:
- if (nfs4_layoutreturn_refresh_stateid(&data->args.lr_args->stateid,
- &data->args.lr_args->range,
- data->inode))
- goto lr_restart;
- /* Fallthrough */
- case -NFS4ERR_ADMIN_REVOKED:
- case -NFS4ERR_DELEG_REVOKED:
- case -NFS4ERR_EXPIRED:
- case -NFS4ERR_BAD_STATEID:
- case -NFS4ERR_UNKNOWN_LAYOUTTYPE:
- case -NFS4ERR_WRONG_CRED:
- data->args.lr_args = NULL;
- data->res.lr_res = NULL;
- goto lr_restart;
- }
- }
+ if (pnfs_roc_done(task, data->inode,
+ &data->args.lr_args,
+ &data->res.lr_res,
+ &data->res.lr_ret) == -EAGAIN)
+ goto out_restart;

switch (task->tk_status) {
case 0:
@@ -6192,8 +6148,6 @@ static void nfs4_delegreturn_done(struct rpc_task *task, void *calldata)
}
data->rpc_status = task->tk_status;
return;
-lr_restart:
- data->res.lr_ret = 0;
out_restart:
task->tk_status = 0;
rpc_restart_call_prepare(task);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 0418b198edd3..8769422a12f5 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1440,6 +1440,33 @@ bool pnfs_roc(struct inode *ino,
return false;
}

+int pnfs_roc_done(struct rpc_task *task, struct inode *inode,
+ struct nfs4_layoutreturn_args **argpp,
+ struct nfs4_layoutreturn_res **respp,
+ int *ret)
+{
+ struct nfs4_layoutreturn_args *arg = *argpp;
+ int retval = -EAGAIN;
+
+ if (!arg)
+ return 0;
+ /* Handle Layoutreturn errors */
+ switch (*ret) {
+ case 0:
+ retval = 0;
+ break;
+ case -NFS4ERR_OLD_STATEID:
+ if (!nfs4_layoutreturn_refresh_stateid(&arg->stateid,
+ &arg->range, inode))
+ break;
+ *ret = -NFS4ERR_NOMATCHING_LAYOUT;
+ return -EAGAIN;
+ }
+ *argpp = NULL;
+ *respp = NULL;
+ return retval;
+}
+
void pnfs_roc_release(struct nfs4_layoutreturn_args *args,
struct nfs4_layoutreturn_res *res,
int ret)
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index f15609c003d8..3ef3756d437c 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -282,6 +282,10 @@ bool pnfs_roc(struct inode *ino,
struct nfs4_layoutreturn_args *args,
struct nfs4_layoutreturn_res *res,
const struct cred *cred);
+int pnfs_roc_done(struct rpc_task *task, struct inode *inode,
+ struct nfs4_layoutreturn_args **argpp,
+ struct nfs4_layoutreturn_res **respp,
+ int *ret);
void pnfs_roc_release(struct nfs4_layoutreturn_args *args,
struct nfs4_layoutreturn_res *res,
int ret);
@@ -701,6 +705,15 @@ pnfs_roc(struct inode *ino,
return false;
}

+static inline int
+pnfs_roc_done(struct rpc_task *task, struct inode *inode,
+ struct nfs4_layoutreturn_args **argpp,
+ struct nfs4_layoutreturn_res **respp,
+ int *ret)
+{
+ return 0;
+}
+
static inline void
pnfs_roc_release(struct nfs4_layoutreturn_args *args,
struct nfs4_layoutreturn_res *res,
--
2.21.0

2019-09-16 22:36:41

by Trond Myklebust

[permalink] [raw]

Subject: [PATCH v2 3/9] NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close

If the server sends a NFS4ERR_DELAY, then allow the caller to retry.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pnfs.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 8769422a12f5..6436047dc999 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1455,6 +1455,10 @@ int pnfs_roc_done(struct rpc_task *task, struct inode *inode,
case 0:
retval = 0;
break;
+ case -NFS4ERR_DELAY:
+ /* Let the caller handle the retry */
+ *ret = -NFS4ERR_NOMATCHING_LAYOUT;
+ return 0;
case -NFS4ERR_OLD_STATEID:
if (!nfs4_layoutreturn_refresh_stateid(&arg->stateid,
&arg->range, inode))
--
2.21.0

2019-09-18 19:58:56

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: [PATCH v2 0/9] Various NFSv4 state error handling fixes

Hi Trond,

These set of patches do not address the locking problem. It's actually
not the locking patch (which I thought it was as I reverted it and
still had the issue). Without the whole patch series the unlock works
fine so something in these new patches. Something is up with the 2
patches:
NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU

If I remove either one separately, unlock fails but if I remove both
unlock works.

On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <[email protected]> wrote:
>
> Various NFSv4 fixes to ensure we handle state errors correctly. In
> particular, we need to ensure that for COMPOUNDs like CLOSE and
> DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
> layout state errors so that a retry of either the LAYOUTRETURN, or
> the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> reply.
>
> Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
> best to still try to destroy the state on the server, in order to
> avoid causing state leakage.
>
> v2: Fix bug reports from Olga
> - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
> doing fully serialised NFSv4.0.
> - Ensure LOCKU initialises the stateid correctly.
>
> Trond Myklebust (9):
> pNFS: Ensure we do clear the return-on-close layout stateid on fatal
> errors
> NFSv4: Clean up pNFS return-on-close error handling
> NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> NFSv4: Handle RPC level errors in LAYOUTRETURN
> NFSv4: Add a helper to increment stateid seqids
> pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state
> seqid
> NFSv4: Fix OPEN_DOWNGRADE error handling
> NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
>
> fs/nfs/nfs4_fs.h | 11 ++-
> fs/nfs/nfs4proc.c | 204 ++++++++++++++++++++++++++++++---------------
> fs/nfs/nfs4state.c | 16 ----
> fs/nfs/pnfs.c | 71 ++++++++++++++--
> fs/nfs/pnfs.h | 17 +++-
> 5 files changed, 229 insertions(+), 90 deletions(-)
>
> --
> 2.21.0
>

2019-09-19 01:50:42

by Trond Myklebust

[permalink] [raw]

Subject: Re: [PATCH v2 0/9] Various NFSv4 state error handling fixes

Hi Olga

On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> These set of patches do not address the locking problem. It's
> actually
> not the locking patch (which I thought it was as I reverted it and
> still had the issue). Without the whole patch series the unlock works
> fine so something in these new patches. Something is up with the 2
> patches:
> NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
>
> If I remove either one separately, unlock fails but if I remove both
> unlock works.

Can you describe how you are testing this, and perhaps provide
wireshark traces that show how we're triggering these problems?

Thanks!
Trond

> On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <[email protected]>
> wrote:
> > Various NFSv4 fixes to ensure we handle state errors correctly. In
> > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
> > layout state errors so that a retry of either the LAYOUTRETURN, or
> > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > reply.
> >
> > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
> > best to still try to destroy the state on the server, in order to
> > avoid causing state leakage.
> >
> > v2: Fix bug reports from Olga
> > - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
> > doing fully serialised NFSv4.0.
> > - Ensure LOCKU initialises the stateid correctly.
> >
> > Trond Myklebust (9):
> > pNFS: Ensure we do clear the return-on-close layout stateid on
> > fatal
> > errors
> > NFSv4: Clean up pNFS return-on-close error handling
> > NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> > NFSv4: Handle RPC level errors in LAYOUTRETURN
> > NFSv4: Add a helper to increment stateid seqids
> > pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the
> > state
> > seqid
> > NFSv4: Fix OPEN_DOWNGRADE error handling
> > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> >
> > fs/nfs/nfs4_fs.h | 11 ++-
> > fs/nfs/nfs4proc.c | 204 ++++++++++++++++++++++++++++++-----------
> > ----
> > fs/nfs/nfs4state.c | 16 ----
> > fs/nfs/pnfs.c | 71 ++++++++++++++--
> > fs/nfs/pnfs.h | 17 +++-
> > 5 files changed, 229 insertions(+), 90 deletions(-)
> >
> > --
> > 2.21.0
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2019-09-20 14:33:02

by Trond Myklebust

[permalink] [raw]

Subject: Re: [PATCH v2 0/9] Various NFSv4 state error handling fixes

Hi Olga

On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> [email protected]> wrote:
> > Hi Olga
> >
> > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > Hi Trond,
> > >
> > > These set of patches do not address the locking problem. It's
> > > actually
> > > not the locking patch (which I thought it was as I reverted it
> > > and
> > > still had the issue). Without the whole patch series the unlock
> > > works
> > > fine so something in these new patches. Something is up with the
> > > 2
> > > patches:
> > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > >
> > > If I remove either one separately, unlock fails but if I remove
> > > both
> > > unlock works.
> >
> > Can you describe how you are testing this, and perhaps provide
> > wireshark traces that show how we're triggering these problems?
>
> I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> btest01" against either linux or ontap servers (while the test
> doesn't
> fail but on the network trace you can see unlock failing with
> bad_stateid). Network trace attached.
>
> But actually a simple test open, lock, unlock does the trick (network
> trace attached).
> fd1 = open(RDWR)
> fctl(fd1) (lock /unlock)

These traces really do not mesh with what I'm seeing using a simple
Connectathon lock test run. When I look at the wireshark output from
that, I see exadtly two cases where the stateid arguments are both
zero, and those are both SETATTR, so expected.

All the LOCKU are showing up as non-zero stateids, and so I'm seeing no
BAD_STATEID or OLD_STATEID errors at all.

Is there something special about how your test is running?

Cheers
Trond

PS: Note: I do think I need a v3 of the LOCKU patch in order to add a
spinlock around the new copy of the lock stateid in
nfs4_alloc_unlockdata(). However I don't see how the missing spinlocks
could cause you to consistently be seeing an all-zero stateid argument.

>
> > Thanks!
> > Trond
> >
> > > On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <
> > > [email protected]>
> > > wrote:
> > > > Various NFSv4 fixes to ensure we handle state errors correctly.
> > > > In
> > > > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > > > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle
> > > > the
> > > > layout state errors so that a retry of either the LAYOUTRETURN,
> > > > or
> > > > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > > > reply.
> > > >
> > > > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do
> > > > our
> > > > best to still try to destroy the state on the server, in order
> > > > to
> > > > avoid causing state leakage.
> > > >
> > > > v2: Fix bug reports from Olga
> > > > - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE
> > > > when
> > > > doing fully serialised NFSv4.0.
> > > > - Ensure LOCKU initialises the stateid correctly.
> > > >
> > > > Trond Myklebust (9):
> > > > pNFS: Ensure we do clear the return-on-close layout stateid
> > > > on
> > > > fatal
> > > > errors
> > > > NFSv4: Clean up pNFS return-on-close error handling
> > > > NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> > > > NFSv4: Handle RPC level errors in LAYOUTRETURN
> > > > NFSv4: Add a helper to increment stateid seqids
> > > > pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping
> > > > the
> > > > state
> > > > seqid
> > > > NFSv4: Fix OPEN_DOWNGRADE error handling
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > >
> > > > fs/nfs/nfs4_fs.h | 11 ++-
> > > > fs/nfs/nfs4proc.c | 204 ++++++++++++++++++++++++++++++-------
> > > > ----
> > > > ----
> > > > fs/nfs/nfs4state.c | 16 ----
> > > > fs/nfs/pnfs.c | 71 ++++++++++++++--
> > > > fs/nfs/pnfs.h | 17 +++-
> > > > 5 files changed, 229 insertions(+), 90 deletions(-)
> > > >
> > > > --
> > > > 2.21.0
> > > >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
http://www.hammer.space

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2019-09-21 18:19:02

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: [PATCH v2 0/9] Various NFSv4 state error handling fixes

Hi Trond,

On Thu, Sep 19, 2019 at 7:42 PM Trond Myklebust <[email protected]> wrote:
>
> Hi Olga
>
> On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> > Hi Trond,
> >
> > On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> > [email protected]> wrote:
> > > Hi Olga
> > >
> > > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > > Hi Trond,
> > > >
> > > > These set of patches do not address the locking problem. It's
> > > > actually
> > > > not the locking patch (which I thought it was as I reverted it
> > > > and
> > > > still had the issue). Without the whole patch series the unlock
> > > > works
> > > > fine so something in these new patches. Something is up with the
> > > > 2
> > > > patches:
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > >
> > > > If I remove either one separately, unlock fails but if I remove
> > > > both
> > > > unlock works.
> > >
> > > Can you describe how you are testing this, and perhaps provide
> > > wireshark traces that show how we're triggering these problems?
> >
> > I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> > btest01" against either linux or ontap servers (while the test
> > doesn't
> > fail but on the network trace you can see unlock failing with
> > bad_stateid). Network trace attached.
> >
> > But actually a simple test open, lock, unlock does the trick (network
> > trace attached).
> > fd1 = open(RDWR)
> > fctl(fd1) (lock /unlock)
>
>
> These traces really do not mesh with what I'm seeing using a simple
> Connectathon lock test run. When I look at the wireshark output from
> that, I see exadtly two cases where the stateid arguments are both
> zero, and those are both SETATTR, so expected.
>
> All the LOCKU are showing up as non-zero stateids, and so I'm seeing no
> BAD_STATEID or OLD_STATEID errors at all.
>
> Is there something special about how your test is running?

There is nothing special that I can think of about my setup or how
test run. I pull from your testing branch, build it (no extra
patches). Run tests over 4.1 (default mount opts) against a linux
server (typically same kernel).

Is this patch series somewhere in your git branches? I've been testing
your testing branch (as I could see v2 changes were in the testing
branch). It's not obvious to me what was changed in v3 to see if the
testing branch has the right code.

>
> Cheers
> Trond
>
> PS: Note: I do think I need a v3 of the LOCKU patch in order to add a
> spinlock around the new copy of the lock stateid in
> nfs4_alloc_unlockdata(). However I don't see how the missing spinlocks
> could cause you to consistently be seeing an all-zero stateid argument.

I'm not testing with v3 so I don't think that's it.

>
>
> >
> > > Thanks!
> > > Trond
> > >
> > > > On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <
> > > > [email protected]>
> > > > wrote:
> > > > > Various NFSv4 fixes to ensure we handle state errors correctly.
> > > > > In
> > > > > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > > > > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle
> > > > > the
> > > > > layout state errors so that a retry of either the LAYOUTRETURN,
> > > > > or
> > > > > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > > > > reply.
> > > > >
> > > > > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do
> > > > > our
> > > > > best to still try to destroy the state on the server, in order
> > > > > to
> > > > > avoid causing state leakage.
> > > > >
> > > > > v2: Fix bug reports from Olga
> > > > > - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE
> > > > > when
> > > > > doing fully serialised NFSv4.0.
> > > > > - Ensure LOCKU initialises the stateid correctly.
> > > > >
> > > > > Trond Myklebust (9):
> > > > > pNFS: Ensure we do clear the return-on-close layout stateid
> > > > > on
> > > > > fatal
> > > > > errors
> > > > > NFSv4: Clean up pNFS return-on-close error handling
> > > > > NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> > > > > NFSv4: Handle RPC level errors in LAYOUTRETURN
> > > > > NFSv4: Add a helper to increment stateid seqids
> > > > > pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping
> > > > > the
> > > > > state
> > > > > seqid
> > > > > NFSv4: Fix OPEN_DOWNGRADE error handling
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > > >
> > > > > fs/nfs/nfs4_fs.h | 11 ++-
> > > > > fs/nfs/nfs4proc.c | 204 ++++++++++++++++++++++++++++++-------
> > > > > ----
> > > > > ----
> > > > > fs/nfs/nfs4state.c | 16 ----
> > > > > fs/nfs/pnfs.c | 71 ++++++++++++++--
> > > > > fs/nfs/pnfs.h | 17 +++-
> > > > > 5 files changed, 229 insertions(+), 90 deletions(-)
> > > > >
> > > > > --
> > > > > 2.21.0
> > > > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > [email protected]
> > >
> > >
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> http://www.hammer.space
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2019-09-22 18:46:32

by Trond Myklebust

[permalink] [raw]

Subject: Re: [PATCH v2 0/9] Various NFSv4 state error handling fixes

On Fri, 2019-09-20 at 10:25 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> On Thu, Sep 19, 2019 at 7:42 PM Trond Myklebust <
> [email protected]> wrote:
> > Hi Olga
> >
> > On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> > > Hi Trond,
> > >
> > > On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > > Hi Olga
> > > >
> > > > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > > > Hi Trond,
> > > > >
> > > > > These set of patches do not address the locking problem. It's
> > > > > actually
> > > > > not the locking patch (which I thought it was as I reverted
> > > > > it
> > > > > and
> > > > > still had the issue). Without the whole patch series the
> > > > > unlock
> > > > > works
> > > > > fine so something in these new patches. Something is up with
> > > > > the
> > > > > 2
> > > > > patches:
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > > >
> > > > > If I remove either one separately, unlock fails but if I
> > > > > remove
> > > > > both
> > > > > unlock works.
> > > >
> > > > Can you describe how you are testing this, and perhaps provide
> > > > wireshark traces that show how we're triggering these problems?
> > >
> > > I'm triggering by running "nfstest_lock --nfsversion 4.1 --
> > > runtest
> > > btest01" against either linux or ontap servers (while the test
> > > doesn't
> > > fail but on the network trace you can see unlock failing with
> > > bad_stateid). Network trace attached.
> > >
> > > But actually a simple test open, lock, unlock does the trick
> > > (network
> > > trace attached).
> > > fd1 = open(RDWR)
> > > fctl(fd1) (lock /unlock)
> >
> > These traces really do not mesh with what I'm seeing using a simple
> > Connectathon lock test run. When I look at the wireshark output
> > from
> > that, I see exadtly two cases where the stateid arguments are both
> > zero, and those are both SETATTR, so expected.
> >
> > All the LOCKU are showing up as non-zero stateids, and so I'm
> > seeing no
> > BAD_STATEID or OLD_STATEID errors at all.
> >
> > Is there something special about how your test is running?
>
> There is nothing special that I can think of about my setup or how
> test run. I pull from your testing branch, build it (no extra
> patches). Run tests over 4.1 (default mount opts) against a linux
> server (typically same kernel).
>
> Is this patch series somewhere in your git branches? I've been
> testing
> your testing branch (as I could see v2 changes were in the testing
> branch). It's not obvious to me what was changed in v3 to see if the
> testing branch has the right code.

I hadn't yet updated the testing branch with the v3 code. Pushed out
now as a forced-update.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]