LinuxLists.cc - Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE

2020-03-25 10:31:16

Subject: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

We're seeing a number of Linux (CentOS 7.5) clients getting 'nfs:
server isilon not responding, still trying' from various exports from
an Isilon filer.

I appreciate we're using a vendor's (out-of-date) Linux kernel and a
third-party filer, but if anyone can give me any pointers on how to
debug this issue, I would be grateful (we also have a support case
open with the Isilon vendor).

Running tshark on a client when this issue happens (taken several
hours after the issue happened), we get repeating:

1 12:18:11 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
2 12:18:11 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 1) RENEW Status: NFS4ERR_STALE_CLIENTID
4 12:18:16 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
5 12:18:16 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 4) RENEW Status: NFS4ERR_STALE_CLIENTID
7 12:18:21 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68
8 12:18:21 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 7) RENEW Status: NFS4ERR_STALE_CLIENTID
...
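
For anyone wanting to check a longer capture for the same pattern, here
is a minimal Python sketch (not part of the original report). It counts
calls and replies in tshark one-line summaries like the ones above and
flags the case where RENEW keeps failing with NFS4ERR_STALE_CLIENTID
while no SETCLIENTID is ever sent. The function names and regular
expressions are illustrative assumptions, and it expects each packet
summary to be on a single line.

```python
import re
from collections import Counter

# Patterns for the tshark summary text shown in this thread (assumed format).
CALL_RE = re.compile(r"V4 Call (\w+)")
REPLY_RE = re.compile(r"V4 Reply \(Call In \d+\) (\w+) Status: (\w+)")


def summarize_trace(lines):
    """Count NFSv4 calls per operation and replies per (operation, status)."""
    calls, replies = Counter(), Counter()
    for line in lines:
        reply = REPLY_RE.search(line)
        if reply:
            replies[(reply.group(1), reply.group(2))] += 1
            continue
        call = CALL_RE.search(line)
        if call:
            calls[call.group(1)] += 1
    return calls, replies


def stuck_on_stale_clientid(lines):
    """True if RENEW fails with NFS4ERR_STALE_CLIENTID and no SETCLIENTID call is seen."""
    calls, replies = summarize_trace(lines)
    stale = replies[("RENEW", "NFS4ERR_STALE_CLIENTID")]
    return stale > 0 and calls["SETCLIENTID"] == 0


# Two request/reply pairs copied from the capture above, one packet per line.
SAMPLE = [
    "1 12:18:11 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68",
    "2 12:18:11 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 1) RENEW Status: NFS4ERR_STALE_CLIENTID",
    "4 12:18:16 10.78.201.95 -> 10.78.196.184 NFS 194 V4 Call RENEW CID: 0xde68",
    "5 12:18:16 10.78.196.184 -> 10.78.201.95 NFS 114 V4 Reply (Call In 4) RENEW Status: NFS4ERR_STALE_CLIENTID",
]

if __name__ == "__main__":
    print(summarize_trace(SAMPLE))
    print("stuck:", stuck_on_stale_clientid(SAMPLE))
```

In practice the lines would come from a saved tshark text dump rather
than the inline SAMPLE list.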

My knowledge of NFSv4 is sketchy, but from my (partial) reading of
RFC 7530, shouldn't the client be sending a SETCLIENTID in response to
an NFS4ERR_STALE_CLIENTID? That doesn't appear to be happening here.

However, the server hasn't rebooted since the client mounted the file
system, so I'm not sure what might be going on.
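
To make the expected behaviour concrete, here is a rough Python sketch
of the recovery sequence I would expect from that reading of RFC 7530:
on NFS4ERR_STALE_CLIENTID the client stops retrying RENEW with the old
client ID and establishes a new one with SETCLIENTID and
SETCLIENTID_CONFIRM. FakeServer and renew_with_recovery are hypothetical
names made up for illustration. This is not the Linux client code, and
it leaves out state reclaim entirely.

```python
class FakeServer:
    """Stand-in server that forgets every client ID when restart() is called."""

    def __init__(self):
        self.confirmed = set()
        self.pending = set()
        self.next_id = 1
        self.log = []  # operations received, in order

    def restart(self):
        self.confirmed.clear()
        self.pending.clear()

    def renew(self, clientid):
        self.log.append("RENEW")
        if clientid in self.confirmed:
            return "NFS4_OK"
        return "NFS4ERR_STALE_CLIENTID"

    def setclientid(self):
        self.log.append("SETCLIENTID")
        clientid = self.next_id
        self.next_id += 1
        self.pending.add(clientid)
        return clientid

    def setclientid_confirm(self, clientid):
        self.log.append("SETCLIENTID_CONFIRM")
        self.pending.remove(clientid)
        self.confirmed.add(clientid)


def renew_with_recovery(server, clientid):
    """Renew the lease; on a stale client ID, establish a new one and retry once."""
    status = server.renew(clientid)
    if status == "NFS4ERR_STALE_CLIENTID":
        clientid = server.setclientid()
        server.setclientid_confirm(clientid)
        # A real client would reclaim its opens and locks at this point.
        status = server.renew(clientid)
    return clientid, status


if __name__ == "__main__":
    server = FakeServer()
    old = server.setclientid()
    server.setclientid_confirm(old)
    server.restart()
    print(renew_with_recovery(server, old), server.log)
```

The capture above shows only the first step of this sequence repeating,
which is what prompted the question.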

We are upgrading clients to the latest CentOS (RHEL) 7.7 to see if
that 'fixes' the issue, but we would appreciate any other pointers.

Thanks

James Pearson

2020-03-25 12:23:35

by Trond Myklebust

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

WAG: the clients all have the default hostname 'localhost.localdomain'
and are using that to identify themselves in the SETCLIENTID call? If
so, that would cause them to cancel each other's leases by declaring
client reboots of the client with name 'localhost.localdomain'.
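
A toy Python model of the guess above, for illustration only: the server
keys its records on the identifier string sent in SETCLIENTID, so a
second client sending the same string with a different boot verifier
looks like a reboot of the first, and the first client's ID stops being
valid. ToyServer and its methods are hypothetical names, not Isilon or
Linux server code. The model skips SETCLIENTID_CONFIRM, and the error a
real server returns for a superseded client ID may differ; the toy
returns NFS4ERR_STALE_CLIENTID to match the capture.

```python
import itertools


class ToyServer:
    """Toy NFSv4.0 client ID bookkeeping, keyed on the client's identifier string."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._by_name = {}   # identifier string -> (boot verifier, clientid)
        self._live = set()   # client IDs that still hold a lease

    def setclientid(self, name, verifier):
        old = self._by_name.get(name)
        if old is not None and old[0] == verifier:
            return old[1]  # same client instance, keep its client ID
        if old is not None:
            # Same name, different verifier: treated as a client reboot,
            # so the previous client ID and its lease are discarded.
            self._live.discard(old[1])
        clientid = next(self._ids)
        self._by_name[name] = (verifier, clientid)
        self._live.add(clientid)
        return clientid

    def renew(self, clientid):
        if clientid in self._live:
            return "NFS4_OK"
        return "NFS4ERR_STALE_CLIENTID"


if __name__ == "__main__":
    server = ToyServer()
    # Two different machines that both identify as 'localhost.localdomain'.
    a = server.setclientid("localhost.localdomain", "boot-A")
    b = server.setclientid("localhost.localdomain", "boot-B")
    print("client A:", server.renew(a))
    print("client B:", server.renew(b))
```

With distinct identifier strings the two clients get independent records
and neither RENEW fails in this model.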

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace

2020-03-25 13:25:51

by Kevin Vasko

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

James,

Just curious: are your symptoms anything similar to mine, where if you
transfer a file (200MB+) to the NFS server, the transfer will just
lock up and never complete? Are you using Kerberos as well? If so...

I had a problem on a Dell Unity box where, on a transfer to the NFS
server, the sequence number would get out of order and lock up NFS on
the Dell Unity box, so the transfer would never complete. Dell was not
aware of this bug and had to have engineering look at the issue.
After about 3 months they got back to me and had me change two
parameters.

svc_nas ALL -param -facility nfs -modify rpcgss.discardReplay -value 0

svc_nas ALL -param -facility nfs -modify rpcgss.discardOld -value 0

I’m doubtful that this would be the same way you would change these
settings on the Isilon but just figured if it’s related it might help.

-Kevin

2020-03-25 13:54:08

by James Pearson

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

On Wed, 25 Mar 2020 at 12:22, Trond Myklebust wrote:
> WAG: the clients all have the default hostname 'localhost.localdomain'
> and are using that to identify themselves in the SETCLIENTID call? If
> so, that would cause them to cancel each other's leases by declaring
> client reboots of the client with name 'localhost.localdomain'.

All the clients have their hostname set as something like
hostname.our.domain

The Isilon has the NFSv4 domain set as 'our.domain'

Is it possible to find out (after the fact) what the host/domain name
the client used to connect to the server?

Thanks

James Pearson

2020-03-25 14:02:36

by James Pearson

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

On Wed, 25 Mar 2020 at 13:20, Kevin Vasko wrote:
> James,
>
> Just curious are your symptoms anything similar to mine where if you transfer a file (200MB+) to the NFS server, the transfer will just lock up and never complete? Are you using Kerberos as well? If so...

We're not using Kerberos, just AUTH_SYS ...

James

2020-03-25 16:10:15

by Olga Kornievskaia

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

Hi James,

What's in your /var/log/messages on the client that's looping?

If your network traces had some interleaved SETCLIENTIDs I would have
said that it's possible you are hitting this RHEL 7.5 problem: Bug
1592911 - Fix how client ID in SETCLIENTID is constructed to prevent
lease tampering. https://bugzilla.redhat.com/show_bug.cgi?id=1592911
but if the only thing that you see is RENEWs then it might not be it.

2020-03-25 17:11:13

by James Pearson

Subject: Re: Stuck NFSv4 mounts of Isilon filer with repeated NFS4ERR_STALE_CLIENTID errors

On Wed, 25 Mar 2020 at 16:09, Olga Kornievskaia wrote:
>
> Hi James,
>
> What's in your /var/log/messages on the client that's looping?

Nothing but 'nfs: server isilon not responding, still trying'

> If your network traces had some interleaved SETCLIENTIDs I would have
> said that it's possible you are hitting this RHEL 7.5 problem: Bug
> 1592911 - Fix how client ID in SETCLIENTID is constructed to prevent
> lease tampering. https://bugzilla.redhat.com/show_bug.cgi?id=1592911
> but if the only thing that you see is RENEWs then it might not be it.

It is quite possible whatever the issue is has been fixed in a more
recent RHEL/CentOS 7 kernel - we haven't (yet?) seen the issue on
machines that have been upgraded to CentOS 7.7 ...

Thanks

James