2014-03-07 09:48:10

by Ben Taylor

Subject: NFS4 patch 08/20 (BAD_SEQID recovery)

Hi

We've been getting weird occasional failures on our NFS systems where
our processing gridnodes will gradually grind to a halt (we lose a
couple of machines a day requiring a reboot - hard reboot if left long
enough). Hunting through Wireshark dumps, the problem is that the NFS
client is making repeated requests to open the same file on our
fileserver and every one has the same owner ID and a sequence ID of 0
(which the server throws out again as a bad sequence ID). I've got a
dump I can give you if you want it.
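
For readers less familiar with NFSv4.0 sequencing, the retry loop described above can be modelled roughly as follows. This is a toy sketch, not the kernel client or any real server: `ToyServer`, `open_file`, `retry_open` and the owner name are hypothetical, and the rule shown (accept last + 1, treat an equal seqid as a replay, reject anything else) is a simplification of the open-owner seqid check in the NFSv4.0 specification.

```python
from dataclasses import dataclass, field

NFS4_OK = "NFS4_OK"
NFS4ERR_BAD_SEQID = "NFS4ERR_BAD_SEQID"


@dataclass
class ToyServer:
    """Hypothetical stand-in for a server's open-owner table."""
    # Maps open-owner name -> last seqid the server accepted from it.
    last_seqid: dict = field(default_factory=dict)

    def open_file(self, owner: str, seqid: int) -> str:
        if owner not in self.last_seqid:
            # Unknown owner: start tracking from whatever seqid it sent.
            self.last_seqid[owner] = seqid
            return NFS4_OK
        if seqid == self.last_seqid[owner] + 1:
            # The next request in sequence.
            self.last_seqid[owner] = seqid
            return NFS4_OK
        if seqid == self.last_seqid[owner]:
            # Same seqid as the last request: treated as a replay.
            return NFS4_OK
        return NFS4ERR_BAD_SEQID


def retry_open(server: ToyServer, owner: str, attempts: int) -> list:
    """Retry an OPEN with the same owner and seqid 0, as seen in the trace."""
    return [server.open_file(owner, 0) for _ in range(attempts)]


server = ToyServer()
# Assumption: the server still remembers this owner from earlier activity.
server.last_seqid["owner-1"] = 7
results = retry_open(server, "owner-1", 3)
print(results)
```

In this model every retry fails the same way, because nothing about the request changes between attempts.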

I am convinced that the problem is that described in patch 08/20 from
Chuck Lever (see http://www.spinics.net/lists/linux-nfs/msg29413.html),
where in this case the client gets the same open owner ID from the
server and retries with that, which makes the server think it's the same
request and throw it out again. In that patch Chuck added a uniquifier to
the owner ID to avoid this problem.
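
The idea of a uniquifier can be illustrated with a minimal sketch. The name format, `make_owner_name` and the counter below are hypothetical and are not the format used by Chuck's patch or by the kernel; the only point is that a retried open gets a name the server has never seen.

```python
import itertools

# Hypothetical per-client counter; each new open owner gets the next value.
_uniquifier = itertools.count(1)


def make_owner_name(client_id: str, uid: int) -> str:
    """Build an open-owner name that this client never hands out twice."""
    return f"open id:{client_id}:{uid}:{next(_uniquifier)}"


first = make_owner_name("client-a", 1000)
second = make_owner_name("client-a", 1000)
print(first)
print(second)
```

Because the counter only moves forward, two owners created for the same client and user still end up with distinct names, so a server cannot mistake the second for a stale copy of the first.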

The problem is that we can't find any kernel versions that include that
patch. An easy way to check is to look for the comment containing
"therefore safely retry using a new one. We should still warn the user
though...": if the "warn the user" part is there, the code has not been
patched (we did check other bits of the patch too). We're running both
Fedora 17 and Fedora 19 at the moment (yes, I know 17 is EOL), neither of
which includes the patch. We also can't see it in the NFS client or
server trees at

http://git.linux-nfs.org/?p=trondmy/nfs-2.6.git;a=blob;f=fs/nfs/nfs4proc.c;h=2da6a698b8f7719c14eefec65e6148a48d030bb3;hb=HEAD#l2327

...and nor does Chuck appear to have it in his merging tree:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=blob;f=fs/nfs/nfs4proc.c;h=15052b81df4245e4f797adb0d0b2e523338b23cc;hb=HEAD#l2327
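
The comment check described above can be scripted. The sketch below is one way to do it under stated assumptions: `contains_marker` and `normalise` are hypothetical helpers, the demonstration runs on a throwaway file rather than a real kernel tree, and the comment is assumed to be wrapped across lines with ` * ` prefixes.

```python
import re
import tempfile
from pathlib import Path

# Phrase quoted in the message above; finding it suggests the unpatched code.
MARKER = "We should still warn the user though"


def normalise(text: str) -> str:
    """Drop C comment line prefixes and collapse whitespace runs."""
    text = re.sub(r"^\s*\*\s?", "", text, flags=re.MULTILINE)
    return " ".join(text.split())


def contains_marker(path: Path, marker: str = MARKER) -> bool:
    """Return True if the marker phrase appears in the file, ignoring wrapping."""
    source = path.read_text(encoding="utf-8", errors="replace")
    return normalise(marker) in normalise(source)


# Demonstration on a temporary file, not an actual copy of nfs4proc.c.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "nfs4proc.c"
    sample.write_text("/*\n * therefore safely retry using a new one. We\n"
                      " * should still warn the user though...\n */\n")
    found = contains_marker(sample)
print(found)
```

Normalising whitespace first matters because the phrase is split across two comment lines in the quoted form, so a plain substring search on the raw file could miss it.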

Can anyone tell me what happened to this patch please? Was it lost or
superseded?

TIA
Ben

--
Ben Taylor <[email protected]>, http://rsg.pml.ac.uk/
Remote Sensing Group, Plymouth Marine Laboratory
Tel: +44 (0)1752 633432, Fax: +44 (0)1752 633101


Please visit our new website at http://www.pml.ac.uk and follow us on Twitter @PlymouthMarine

Plymouth Marine Laboratory (PML) is a company limited by guarantee registered in England & Wales, company number 4178503. Registered Charity No. 1091222. Registered Office: Prospect Place, The Hoe, Plymouth PL1 3DH, UK.

This message is private and confidential. If you have received this message in error, please notify the sender and remove it from your system. You are reminded that e-mail communications are not secure and may contain viruses; PML accepts no liability for any loss or damage which may be caused by viruses.



2014-03-10 23:23:46

by Trond Myklebust

Subject: Re: NFS4 patch 08/20 (BAD_SEQID recovery)


On Mar 10, 2014, at 13:26, Ben Taylor <[email protected]> wrote:

> Hi Trond
>
> On 07/03/14 13:05, Trond Myklebust wrote:
>>> Can anyone tell me what happened to this patch please? Was it lost or
>>> superseded?
>> It was superseded by commit 95b72eb0bdef6 (NFSv4: Ensure we do not reuse open owner names), which is available in linux 3.4 and newer.
>
> Many thanks. That's a puzzle then, because we're running 3.9 and up and
> already have that patch (I've checked).
>
> I've attached my Wireshark dump (or at least a subset of it -
> unfortunately I don't have the original call, the dump I've got is all
> the same) - don't know if this tells you anything it doesn't tell me?
> I'm not exactly experienced at reading these things!
>

It looks as if the client is trying to convert a delegation into an open stateid as part of returning that delegation, but the server is disputing the sequence id value.
What server is this?

_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2014-03-11 08:49:44

by Ben Taylor

Subject: Re: NFS4 patch 08/20 (BAD_SEQID recovery)

On 10/03/14 23:23, Trond Myklebust wrote:
>
> On Mar 10, 2014, at 13:26, Ben Taylor <[email protected]> wrote:
>
>> Hi Trond
>>
>> On 07/03/14 13:05, Trond Myklebust wrote:
>>>> Can anyone tell me what happened to this patch please? Was it lost or
>>>> superseded?
>>> It was superseded by commit 95b72eb0bdef6 (NFSv4: Ensure we do not reuse open owner names), which is available in linux 3.4 and newer.
>>
>> Many thanks. That's a puzzle then, because we're running 3.9 and up and
>> already have that patch (I've checked).
>>
>> I've attached my Wireshark dump (or at least a subset of it -
>> unfortunately I don't have the original call, the dump I've got is all
>> the same) - don't know if this tells you anything it doesn't tell me?
>> I'm not exactly experienced at reading these things!
>>
>
> It looks as if the client is trying to convert a delegation into an open stateid as part of returning that delegation, but the server is disputing the sequence id value.
> What server is this?

It's our main user-space file server, running CentOS 6.5, kernel
version... ah. Kernel version 2.6, I only checked the client version
previously.

That's probably the issue then. Sorry, thanks for your help!
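
Checking both ends of the connection can be done with a short script. This is a sketch under assumptions: `parse_kernel_release` and `at_least` are hypothetical helpers, the (3, 4) threshold is the mainline version mentioned earlier in the thread, and the sample strings are illustrative rather than taken from the machines discussed here. A version number alone cannot show whether a distribution has backported a given fix.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '3.9.10-100.fc17'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_least(release: str, minimum: tuple = (3, 4)) -> bool:
    """Compare the parsed (major, minor) pair against a minimum version."""
    return parse_kernel_release(release) >= minimum


# platform.release() reports the running kernel on Linux (like `uname -r`).
print(platform.release())
# Illustrative strings only; not taken from the machines in this thread.
print(at_least("3.9.10-100.fc17.x86_64"), at_least("2.6.32-431.el6.x86_64"))
```

Running the same check on the client and on the server avoids the situation above, where only one side had been verified.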

Regards
Ben

--
Ben Taylor <[email protected]>, http://rsg.pml.ac.uk/
Remote Sensing Group, Plymouth Marine Laboratory
Tel: +44 (0)1752 633432, Fax: +44 (0)1752 633101



2014-03-07 13:05:50

by Trond Myklebust

Subject: Re: NFS4 patch 08/20 (BAD_SEQID recovery)


On Mar 7, 2014, at 4:41, Ben Taylor <[email protected]> wrote:

> Hi
>
> We've been getting weird occasional failures on our NFS systems where
> our processing gridnodes will gradually grind to a halt (we lose a
> couple of machines a day requiring a reboot - hard reboot if left long
> enough). Hunting through Wireshark dumps, the problem is that the NFS
> client is making repeated requests to open the same file on our
> fileserver and every one has the same owner ID and a sequence ID of 0
> (which the server throws out again as a bad sequence ID). I've got a
> dump I can give you if you want it.
>
> I am convinced that the problem is that described in patch 08/20 from
> Chuck Lever (see http://www.spinics.net/lists/linux-nfs/msg29413.html),
> where in this case the client gets the same open owner ID from the
> server and retries with that, which makes the server think it's the same
> request and throw it out again. In that patch Chuck added a uniquifier to
> the owner ID to avoid this problem.
>
> The problem is that we can't find any kernel versions that include that
> patch. An easy way to check is to look for the comment containing
> "therefore safely retry using a new one. We should still warn the user
> though...": if the "warn the user" part is there, the code has not been
> patched (we did check other bits of the patch too). We're running both
> Fedora 17 and Fedora 19 at the moment (yes, I know 17 is EOL), neither of
> which includes the patch. We also can't see it in the NFS client or
> server trees at
>
> http://git.linux-nfs.org/?p=trondmy/nfs-2.6.git;a=blob;f=fs/nfs/nfs4proc.c;h=2da6a698b8f7719c14eefec65e6148a48d030bb3;hb=HEAD#l2327
>
> ...and nor does Chuck appear to have it in his merging tree:
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=blob;f=fs/nfs/nfs4proc.c;h=15052b81df4245e4f797adb0d0b2e523338b23cc;hb=HEAD#l2327
>
> Can anyone tell me what happened to this patch please? Was it lost or
> superseded?

It was superseded by commit 95b72eb0bdef6 (NFSv4: Ensure we do not reuse open owner names), which is available in linux 3.4 and newer.
_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2014-03-10 17:32:58

by Ben Taylor

Subject: Re: NFS4 patch 08/20 (BAD_SEQID recovery)

Hi Trond

On 07/03/14 13:05, Trond Myklebust wrote:
>> Can anyone tell me what happened to this patch please? Was it lost or
>> superseded?
> It was superseded by commit 95b72eb0bdef6 (NFSv4: Ensure we do not reuse open owner names), which is available in linux 3.4 and newer.

Many thanks. That's a puzzle then, because we're running 3.9 and up and
already have that patch (I've checked).

I've attached my Wireshark dump (or at least a subset of it -
unfortunately I don't have the original call, the dump I've got is all
the same) - don't know if this tells you anything it doesn't tell me?
I'm not exactly experienced at reading these things!

Thanks
Ben

--
Ben Taylor <[email protected]>, http://rsg.pml.ac.uk/
Remote Sensing Group, Plymouth Marine Laboratory
Tel: +44 (0)1752 633432, Fax: +44 (0)1752 633101



Attachments:
nfs_bad_seqid_trimmed.dmp (1.42 kB)