2019-09-12 08:31:24

by Leon Kyneur

Subject: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

Hi

I'm experiencing an issue on NFS 4.0 and 4.1 where we cannot take fcntl
locks on any file on the share. The problem goes away if the share is
fully remounted with umount && mount (mount -o remount does not resolve it).

Client:
EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64

Server:
EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64

I can't figure this out, but the client reports the bad-sequence-id
error in duplicate in the logs:
Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881c52286220!
Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881c52286220!
Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff8810889cb020!
Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff8810889cb020!
Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881b414b2620!

A Wireshark capture shows only one BAD_SEQID reply from the server:
$ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e
nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d

The system call I have identified as triggering it is:
fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
len=1}) = -1 EIO (Input/output error)
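For anyone trying to reproduce this, the failing call can be replayed with a small sketch like the one below (this is my own minimal stand-in, not the application's actual code; point it at a file on the affected mount):

```python
import errno
import fcntl
import os

def try_rdlock(path, start=1 << 30, length=1):
    """Attempt the same byte-range read lock the strace shows:
    fcntl(fd, F_SETLK, {F_RDLCK, SEEK_SET, start=1073741824, len=1}).
    Returns True if the lock was granted (and then releases it), False
    if another process holds a conflicting lock; anything else -- such
    as the EIO the broken mounts return -- is re-raised."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # On Linux, lockf() with LOCK_SH | LOCK_NB issues F_SETLK/F_RDLCK
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB, length, start,
                    os.SEEK_SET)
        fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return True
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise
    finally:
        os.close(fd)
```

On a healthy mount this returns True; on an affected client the first lockf() raises OSError with EIO, matching the strace above.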

The server filesystem is ZFS, though NFS sharing is turned off in the
ZFS options and the share is exported via /etc/exports and nfsd.

The BAD_SEQID error seems to be fairly random. We have over 2000
machines connected to the share, and it's experienced frequently but
randomly across our clients.

It's worth mentioning that the majority of the clients are mounting
4.0; we did try 4.1 everywhere but hit this:
https://access.redhat.com/solutions/3146191

mount options are:
server:/data on /d/data type nfs4
(rw,noatime,nodiratime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.27.10.45,local_lock=none,addr=172.27.255.28,_netdev)
or:
server:/data on /d/data type nfs4
(rw,noatime,nodiratime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.27.30.129,local_lock=none,addr=172.27.255.28,_netdev)

I'm at a bit of a loss as to where to look next; I've tried to
reproduce the locking/unlocking under threading but cannot seem to
create a test case that triggers it.

Thanks

Leon


2019-09-17 11:29:27

by Benjamin Coddington

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

On 12 Sep 2019, at 4:27, Leon Kyneur wrote:

> Hi
>
> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call fcntl
> locks on any file on the share. The problem goes away if the share is
> umount && mount (mount -o remount does not resolve the issue)
>
> Client:
> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> Server:
> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> I can't figure this out but the client reports bad-sequence-id in
> dupicate in the logs:
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881b414b2620!
>
> wireshark capture shows only 1 BAD_SEQID reply from the server:
> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e
> nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
> 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d
>
> system call I have identified as triggering it is:
> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> len=1}) = -1 EIO (Input/output error)

Can you simplify the trigger into something repeatable? Can you determine
if the client or the server has lost track of the sequence?

> The server filesystem is ZFS though NFS sharing is turned off via ZFS
> options and it's exported using /etc/exports / nfsd...
>
> The BAD_SEQID error seems to be fairly random, we have over 2000
> machines connected to the share and it's experienced frequently but
> randomly accross our clients.
>
> It's worth mentioning that the majority of the clients are mounting
> 4.0 we did try 4.1 everywhere but hit this
> https://access.redhat.com/solutions/3146191

This was fixed in kernel-3.10.0-735.el7, FWIW.

Ben

2019-09-18 02:44:24

by Leon Kyneur

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington <[email protected]> wrote:
>
> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
>
> > Hi
> >
> > I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call fcntl
> > locks on any file on the share. The problem goes away if the share is
> > umount && mount (mount -o remount does not resolve the issue)
> >
> > Client:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > Server:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > I can't figure this out but the client reports bad-sequence-id in
> > dupicate in the logs:
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881b414b2620!
> >
> > wireshark capture shows only 1 BAD_SEQID reply from the server:
> > $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> > proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> > tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> > 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> > 0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e
> > nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
> > 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> > In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d
> >
> > system call I have identified as triggering it is:
> > fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> > len=1}) = -1 EIO (Input/output error)
>
> Can you simplify the trigger into something repeatable? Can you determine
> if the client or the server has lost track of the sequence?
>

I have tried: I wrote some code to perform the fcntl F_RDLCK the same
way and ran it across thousands of machines without reproducing it. I
am quite sure this is a symptom of something else, not the cause.

Is there a better way of tracking sequences other than monitoring the
network traffic?

> > The server filesystem is ZFS though NFS sharing is turned off via ZFS
> > options and it's exported using /etc/exports / nfsd...
> >
> > The BAD_SEQID error seems to be fairly random, we have over 2000
> > machines connected to the share and it's experienced frequently but
> > randomly accross our clients.
> >
> > It's worth mentioning that the majority of the clients are mounting
> > 4.0 we did try 4.1 everywhere but hit this
> > https://access.redhat.com/solutions/3146191
>
> This was fixed in kernel-3.10.0-735.el7, FWIW..
>
> Ben

Thanks, good to know; I am planning an update soon but have been stuck
on 3.10.0-693 for other reasons.

Leon

2019-09-18 13:07:18

by Benjamin Coddington

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

On 17 Sep 2019, at 22:20, Leon Kyneur wrote:

> On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington
> <[email protected]> wrote:
>>
>> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
>>
>>> Hi
>>>
>>> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call
>>> fcntl
>>> locks on any file on the share. The problem goes away if the share
>>> is
>>> umount && mount (mount -o remount does not resolve the issue)
>>>
>>> Client:
>>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>>>
>>> Server:
>>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>>>
>>> I can't figure this out but the client reports bad-sequence-id in
>>> dupicate in the logs:
>>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881c52286220!
>>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881c52286220!
>>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
>>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
>>> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881b414b2620!
>>>
>>> wireshark capture shows only 1 BAD_SEQID reply from the server:
>>> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
>>> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
>>> tshark: -R without -2 is deprecated. For single-pass filtering use
>>> -Y.
>>> 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK
>>> FH:
>>> 0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e
>>> nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
>>> 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
>>> In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d
>>>
>>> system call I have identified as triggering it is:
>>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
>>> len=1}) = -1 EIO (Input/output error)
>>
>> Can you simplify the trigger into something repeatable? Can you
>> determine
>> if the client or the server has lost track of the sequence?
>>
>
> I have tried, I wrote some code to perform the fcntl RDKLCK the same
> way and ran it accross
> thousands of machines without any success. I am quite sure this is a
> symptom of something
> not the cause.
>
> Is there a better way of tracking sequences other than monitoring the
> network traffic?

I think that's the best way, right now. We do have tracepoints for
nfs4 open and close that show the sequence numbers on the client, but
I'm not sure how to get that from the server side. I don't think we
have the seqid for locks in tracepoints... I could be missing
something. Not only that, but you might not get tracepoint output
showing the sequence numbers if you're in an error-handling path.
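Those client-side tracepoints can be switched on from tracefs; a sketch follows. The event names and the tracefs path are assumptions based on the mainline layout (verify with `ls /sys/kernel/debug/tracing/events/nfs4/`; newer kernels also mount tracefs at /sys/kernel/tracing):

```python
import os

# Assumed mainline tracefs layout and event names -- adjust for your kernel
TRACEFS = "/sys/kernel/debug/tracing"
EVENTS = ("nfs4/nfs4_open_file", "nfs4/nfs4_close")

def enable_nfs4_trace(tracefs=TRACEFS, events=EVENTS):
    """Write '1' into each tracepoint's enable file; afterwards the
    client-side open/close seqids show up in <tracefs>/trace_pipe."""
    for ev in events:
        with open(os.path.join(tracefs, "events", ev, "enable"), "w") as f:
            f.write("1")
```

Run as root on the affected client, then read trace_pipe while the workload reproduces the error.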

If you have a wire capture of the event, you should be able to go
backwards from the error and figure out what the sequence number on the
state should be for the operation that received BAD_SEQID, by finding
the last sequence-mutating (OPEN, CLOSE, LOCK) operation for that
stateid that did not return an error.
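That walk-backwards procedure can be scripted once the capture is decoded into per-operation records. A sketch, assuming you have already parsed the capture into dicts (the field names here are the sketch's own, not tshark's):

```python
MUTATING = {"OPEN", "CLOSE", "LOCK"}  # ops that advance the owner's seqid

def expected_seqid(ops, bad_frame):
    """Given the ops decoded from a capture (oldest first), return the
    seqid the operation in frame `bad_frame` should have carried: one
    more than the seqid of the last sequence-mutating op on the same
    state that did not return an error.  Each op is a dict like
    {"frame": 68, "op": "OPEN", "owner": "o1", "seqid": 0x3b,
     "status": "NFS4_OK"}."""
    bad = next(o for o in ops if o["frame"] == bad_frame)
    for o in reversed([o for o in ops if o["frame"] < bad_frame]):
        if (o["op"] in MUTATING and o["owner"] == bad["owner"]
                and o["status"] == "NFS4_OK"):
            return o["seqid"] + 1
    return None
```

Comparing that expected value against what the failing call actually carried tells you which side lost track of the sequence.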

Ben

2019-09-18 14:34:39

by Olga Kornievskaia

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

Hi folks,

The bad_seqid error could have been the bug in 7.4
https://access.redhat.com/solutions/3354251. It's been fixed in
kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
helps? The bug was that the client was sending a double CLOSE,
throwing off the seqid use.


2019-09-19 05:45:39

by Leon Kyneur

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

On Wed, Sep 18, 2019 at 10:32 PM Olga Kornievskaia <[email protected]> wrote:
>
> Hi folks,
>
> The bad_seqid error could have been the bug in 7.4
> https://access.redhat.com/solutions/3354251. It's been fixed in
> kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
> helps? The bug was client was sending a double close throwing off the
> seqid use.
>

Hi Olga

I did see that but discounted it, as I am not seeing a duplicate CLOSE
for the same StateID when the client is affected. Could this have
happened at some point earlier in time to trigger the condition? I
don't understand how I can reproduce it.

When it's affected I don't see the duplicate close:

$ tshark -n -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
|grep -E "CLOSE|BAD|LOCK|OPEN|ACCESS"
14 10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
FH: 0xfd9d2fb5, [Check: RD LU MD XT DL] rpc.xid == 0x6590c61d
15 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 14) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6590c61d
19 10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
FH: 0xfd9d2fb5, [Check: RD LU MD XT DL] rpc.xid == 0x6790c61d
20 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 19) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU] rpc.xid ==
0x6790c61d
21 10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0x272b8d23, [Check: RD LU MD XT DL] rpc.xid == 0x6890c61d
22 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 21) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6890c61d
24 10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0x54d7a1df, [Check: RD LU MD XT DL] rpc.xid == 0x6990c61d
25 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 24) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6990c61d
33 11 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
0x272b8d23/.bash_history rpc.xid == 0x6c90c61d
34 11 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
In 33) OPEN StateID: 0x0f49 rpc.xid == 0x6c90c61d
36 11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call
OPEN_CONFIRM rpc.xid == 0x6d90c61d
37 11 172.27.255.28 -> 172.27.30.129 NFS 140 V4 Reply (Call
In 36) OPEN_CONFIRM rpc.xid == 0x6d90c61d
38 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x01be rpc.xid == 0x6e90c61d
39 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 38) CLOSE rpc.xid == 0x6e90c61d
44 11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0x272b8d23/.bashrc rpc.xid == 0x7090c61d
45 11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 44) OPEN StateID: 0x13f2 rpc.xid == 0x7090c61d
46 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x13f2 rpc.xid == 0x7190c61d
47 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 46) CLOSE rpc.xid == 0x7190c61d
48 11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xfb7c5154, [Check: RD MD XT XE] rpc.xid == 0x7290c61d
49 11 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 48) ACCESS, [Access Denied: XE], [Allowed: RD MD XT] rpc.xid ==
0x7290c61d
50 11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0x272b8d23/.bashenv rpc.xid == 0x7390c61d
51 11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 50) OPEN StateID: 0xca9b rpc.xid == 0x7390c61d
52 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xca9b rpc.xid == 0x7490c61d
53 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 52) CLOSE rpc.xid == 0x7490c61d
55 16 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
0x272b8d23/.bash_history rpc.xid == 0x7590c61d
56 16 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 55) OPEN StateID: 0xf3ed rpc.xid == 0x7590c61d
58 16 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xf3ed rpc.xid == 0x7690c61d
59 16 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 58) CLOSE rpc.xid == 0x7690c61d
61 21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xa77c94de, [Check: RD LU MD XT DL] rpc.xid == 0x7790c61d
62 21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 61) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x7790c61d
64 21 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0xa77c94de/a.out rpc.xid == 0x7890c61d
65 21 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 64) OPEN StateID: 0xb877 rpc.xid == 0x7890c61d
66 21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xbfe01adc, [Check: RD LU MD XT DL] rpc.xid == 0x7990c61d
67 21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 66) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x7990c61d
68 21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
0xbfe01adc/m.db rpc.xid == 0x7a90c61d
69 21 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
In 68) OPEN StateID: 0x8101 rpc.xid == 0x7a90c61d
70 21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File> rpc.xid == 0x7b90c61d
71 21 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
In 70) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x7b90c61d
72 21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xb877 rpc.xid == 0x7c90c61d
73 21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 72) CLOSE rpc.xid == 0x7c90c61d
74 21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x8101 rpc.xid == 0x7d90c61d
75 21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 74) CLOSE rpc.xid == 0x7d90c61d
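That duplicate-close check over the tshark summary can be scripted; a sketch (the regex targets the one-line summary format above, and the stateids shown are Wireshark's truncated form):

```python
import re
from collections import Counter

CLOSE_RE = re.compile(r"Call CLOSE StateID: (0x[0-9a-fA-F]+)")

def duplicate_closes(tshark_output):
    """Count 'V4 Call CLOSE StateID: 0x....' occurrences per (truncated)
    stateid in tshark one-line summaries and report any stateid closed
    more than once -- the wire signature of the double-CLOSE bug."""
    # The summaries wrap across lines in mail, so normalise whitespace first
    flat = re.sub(r"\s+", " ", tshark_output)
    counts = Counter(CLOSE_RE.findall(flat))
    return {sid: n for sid, n in counts.items() if n > 1}
```

Running this over the capture above returns an empty dict, consistent with not seeing the duplicate CLOSE while the client is in the broken state.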



>
> On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <[email protected]> wrote:
> >


> > > Is there a better way of tracking sequences other than monitoring the
> > > network traffic?
> >
> > I think that's the best way, right now. We do have tracepoints for
> > nfs4 open and close that show the sequence numbers on the client, but
> > I'm
> > not sure about how to get that from the server side. I don't think we
> > have
> > seqid for locks in tracepoints.. I could be missing something. Not only
> > that, but you might not get tracepoint output showing the sequence
> > numbers
> > if you're in an error-handling path.
> >
> > If you have a wire capture of the event, you should be able to go
> > backwards
> > from the error and figure out what the sequence number on the state
> > should
> > be for the operation that received BAD_SEQID by finding the last
> > sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid that did
> > not
> > return an error.
> >
> > Ben


Thanks Ben, I'll persist with the network monitoring for now. The
sequence ids do appear to be sequential when it's in the error state; I
don't see it skipping any, if that's what the error implies.


Client
~~~~~

Here is the sequence-id that gets returned as bad:

$ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003c && frame.number <= 70'
70 21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File> rpc.xid == 0x7b90c61d

prior sequence is the OPEN as expected:

$ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003b && frame.number <= 70'
68 21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
0xbfe01adc/m.db rpc.xid == 0x7a90c61d

Server
~~~~~
bad-sequence

$ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003c && frame.number <= 70'
1 65 172.27.30.129 -> 172.27.255.28 NFS 350 V4 Call (Reply
In 2) LOCK FH: 0x80589398 Offset: 0 Length: <End of File> rpc.xid ==
0x7b90c61d
2 65 172.27.30.129 -> 172.27.255.28 NFS 302 V4 Call (Reply
In 3) CLOSE StateID: 0xb877 rpc.xid == 0x7c90c61d

prior sequence

$ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003b && frame.number <= 70'
1 65 172.27.30.129 -> 172.27.255.28 NFS 366 V4 Call (Reply
In 2) OPEN DH: 0xbfe01adc/m.db rpc.xid == 0x7a90c61d

If it is indeed Red Hat KB 3354251 then there is greater motivation to
try a kernel upgrade again, though I am just trying to confirm I have
not hit something new here.

Leon

2019-09-19 13:55:35

by Olga Kornievskaia

Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

Hi Leon,

Given that you have a network trace with the BAD_SEQID error, can you
filter on the file handle and trace the operations? It should be
possible to tell at that point whether it's the client or the server
that's sending an incorrect seqid. I'm assuming we can trace the
locking stateid either to a previous (successful) use or to an open.
The tshark output provided isn't enough to see what was inside the
packets.

If the LOCK that's failing used a seqid that wasn't bootstrapped from
the open seqid, then it can't be caused by a double-CLOSE problem. But
if this LOCK was bootstrapped from the open stateid, we need to trace
the open-owner sequencing, and a double CLOSE could have been the
cause.
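The two cases come from the LOCK4args "locker" union in RFC 7530: a new lock-owner is bootstrapped from the open state, a repeat lock is not. A small classifier sketch (the dict layout is this sketch's own stand-in for the XDR union, not a Wireshark or kernel structure):

```python
def lock_bootstrap_kind(locker):
    """Classify a LOCK4args locker arm: a new lock-owner
    (open_to_lock_owner4) carries open_seqid + open_stateid plus an
    initial lock_seqid, so open-owner sequencing is in play; a repeat
    lock (exist_lock_owner4) carries only lock_stateid + lock_seqid."""
    if locker["new_lock_owner"]:
        # An earlier double CLOSE on this open-owner could explain a
        # BAD_SEQID on this LOCK
        return ("bootstrapped-from-open", locker["open_seqid"])
    # Only the lock-owner's own sequence matters for this LOCK
    return ("existing-lock-owner", locker["lock_seqid"])
```

For what it's worth, the failing LOCK in the first capture carried two seqids (nfs.seqid == 0x4e and 0x02), which is what the new-lock-owner arm looks like on the wire, so the open-owner sequencing path would be the one to examine.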

On Thu, Sep 19, 2019 at 12:23 AM Leon Kyneur <[email protected]> wrote:
>
> On Wed, Sep 18, 2019 at 10:32 PM Olga Kornievskaia <[email protected]> wrote:
> >
> > Hi folks,
> >
> > The bad_seqid error could have been the bug in 7.4
> > https://access.redhat.com/solutions/3354251. It's been fixed in
> > kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
> > helps? The bug was client was sending a double close throwing off the
> > seqid use.
> >
>
> Hi Olga
>
> I did see that but discounted it as I am not seeing duplicate CLOSE
> for the same StateID when the client is affected, could this happened
> at some point earlier in time to trigger the condition? I don't
> understand how I can reproduce it.
>
> When it's affected I don't see the duplicate close:
>
> $ tshark -n -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> |grep -E "CLOSE|BAD|LOCK|OPEN|ACCESS"
> 14 10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL] rpc.xid == 0x6590c61d
> 15 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 14) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6590c61d
> 19 10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL] rpc.xid == 0x6790c61d
> 20 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 19) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU] rpc.xid ==
> 0x6790c61d
> 21 10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x272b8d23, [Check: RD LU MD XT DL] rpc.xid == 0x6890c61d
> 22 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 21) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6890c61d
> 24 10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x54d7a1df, [Check: RD LU MD XT DL] rpc.xid == 0x6990c61d
> 25 10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 24) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x6990c61d
> 33 11 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history rpc.xid == 0x6c90c61d
> 34 11 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 33) OPEN StateID: 0x0f49 rpc.xid == 0x6c90c61d
> 36 11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call
> OPEN_CONFIRM rpc.xid == 0x6d90c61d
> 37 11 172.27.255.28 -> 172.27.30.129 NFS 140 V4 Reply (Call
> In 36) OPEN_CONFIRM rpc.xid == 0x6d90c61d
> 38 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x01be rpc.xid == 0x6e90c61d
> 39 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 38) CLOSE rpc.xid == 0x6e90c61d
> 44 11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashrc rpc.xid == 0x7090c61d
> 45 11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 44) OPEN StateID: 0x13f2 rpc.xid == 0x7090c61d
> 46 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x13f2 rpc.xid == 0x7190c61d
> 47 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 46) CLOSE rpc.xid == 0x7190c61d
> 48 11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xfb7c5154, [Check: RD MD XT XE] rpc.xid == 0x7290c61d
> 49 11 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 48) ACCESS, [Access Denied: XE], [Allowed: RD MD XT] rpc.xid ==
> 0x7290c61d
> 50 11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashenv rpc.xid == 0x7390c61d
> 51 11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 50) OPEN StateID: 0xca9b rpc.xid == 0x7390c61d
> 52 11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xca9b rpc.xid == 0x7490c61d
> 53 11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 52) CLOSE rpc.xid == 0x7490c61d
> 55 16 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history rpc.xid == 0x7590c61d
> 56 16 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 55) OPEN StateID: 0xf3ed rpc.xid == 0x7590c61d
> 58 16 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xf3ed rpc.xid == 0x7690c61d
> 59 16 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 58) CLOSE rpc.xid == 0x7690c61d
> 61 21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xa77c94de, [Check: RD LU MD XT DL] rpc.xid == 0x7790c61d
> 62 21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 61) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x7790c61d
> 64 21 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0xa77c94de/a.out rpc.xid == 0x7890c61d
> 65 21 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 64) OPEN StateID: 0xb877 rpc.xid == 0x7890c61d
> 66 21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xbfe01adc, [Check: RD LU MD XT DL] rpc.xid == 0x7990c61d
> 67 21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 66) ACCESS, [Allowed: RD LU MD XT DL] rpc.xid == 0x7990c61d
> 68 21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db rpc.xid == 0x7a90c61d
> 69 21 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 68) OPEN StateID: 0x8101 rpc.xid == 0x7a90c61d
> 70 21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File> rpc.xid == 0x7b90c61d
> 71 21 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 70) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x7b90c61d
> 72 21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xb877 rpc.xid == 0x7c90c61d
> 73 21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 72) CLOSE rpc.xid == 0x7c90c61d
> 74 21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x8101 rpc.xid == 0x7d90c61d
> 75 21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 74) CLOSE rpc.xid == 0x7d90c61d
>
>
>
> >
> > On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <[email protected]> wrote:
> > >
>
>
> > > > Is there a better way of tracking sequences other than monitoring the
> > > > network traffic?
> > >
> > > I think that's the best way, right now. We do have tracepoints for
> > > nfs4 open and close that show the sequence numbers on the client, but
> > > I'm
> > > not sure about how to get that from the server side. I don't think we
> > > have
> > > seqid for locks in tracepoints.. I could be missing something. Not only
> > > that, but you might not get tracepoint output showing the sequence
> > > numbers
> > > if you're in an error-handling path.
> > >
> > > If you have a wire capture of the event, you should be able to go
> > > backwards
> > > from the error and figure out what the sequence number on the state
> > > should
> > > be for the operation that received BAD_SEQID by finding the last
> > > sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid that did
> > > not
> > > return an error.
> > >
> > > Ben
>
>
> Thanks Ben, I'll persist with the network monitoring for now. The
> sequence ids do appear to be sequential when it's in the error
> state; I don't see it skipping any, if that's what the error
> implies.
>
>
> Client
> ~~~~~
>
> Here is the sequence-id that gets returned as bad:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
> 70 21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File> rpc.xid == 0x7b90c61d
>
> prior sequence is the OPEN as expected:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
> 68 21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db rpc.xid == 0x7a90c61d
>
> Server
> ~~~~~
> bad-sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
> 1 65 172.27.30.129 -> 172.27.255.28 NFS 350 V4 Call (Reply
> In 2) LOCK FH: 0x80589398 Offset: 0 Length: <End of File> rpc.xid ==
> 0x7b90c61d
> 2 65 172.27.30.129 -> 172.27.255.28 NFS 302 V4 Call (Reply
> In 3) CLOSE StateID: 0xb877 rpc.xid == 0x7c90c61d
>
> prior sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
> 1 65 172.27.30.129 -> 172.27.255.28 NFS 366 V4 Call (Reply
> In 2) OPEN DH: 0xbfe01adc/m.db rpc.xid == 0x7a90c61d
>
> If it is indeed Red Hat KB 3354251 then there is greater motivation
> to try a kernel upgrade again, though I'm just trying to confirm I
> have not hit something new here.
>
> Leon

2019-09-26 00:39:36

by Leon Kyneur

[permalink] [raw]
Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

On Thu, Sep 19, 2019 at 9:36 PM Olga Kornievskaia <[email protected]> wrote:
>
> Hi Leon,
>
> Given that you have a network trace with the BAD_SEQID error, can you
> filter on the file handle and trace the operations? It should be
> possible to tell at that point whether it's the client that's sending
> an incorrect seq or the server. I'm assuming we can either trace the
> locking stateid to a previous (successful) use or to an open. The
> provided tshark output isn't enough to see what was inside the
> packets.
>
> If the LOCK that's failing used a seqid that wasn't bootstrapped from
> the open seqid, then it can't be caused by a double CLOSE problem.
> But if this LOCK was bootstrapped from the open stateid, we need to
> trace open-owner sequencing, and a double close could have been the
> cause.


Hi Olga

I don't ever see a double close in my packet captures. Here's a more
verbose dump. I'm not quite sure what you mean re LOCK "wasn't
bootstrapped from the open seqid", but that does seem to explain it,
as I don't see the 0x00000001 seqid from the OPEN call later in the
LOCK FH call. The working example (also below) does show this.

So I'm still wondering if this is the double close issue or something else?
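As a sanity check on the open-owner sequencing, here is a small sketch of my own (not from any tool in this thread) that walks the open-owner seqids the client sends in the broken dump below and reports the first step that fails to advance by one; a gapless stream supports the client being correct:

```python
# Sketch: verify the client's open-owner seqid stream advances by
# exactly one per sequence-mutating call, as NFSv4.0 requires.
# Values are hand-copied from the broken dump below (frames 22-41:
# OPEN 0x22, OPEN_CONFIRM 0x23, OPEN 0x24, OPEN 0x25, CLOSE 0x26,
# LOCK 0x27); a real tool would parse tshark field output instead.
def first_gap(seqids):
    """Index of the first entry that is not previous+1, else None."""
    for i in range(1, len(seqids)):
        if seqids[i] != seqids[i - 1] + 1:
            return i
    return None

owner_seqids = [0x22, 0x23, 0x24, 0x25, 0x26, 0x27]
print(first_gap(owner_seqids))  # None: no gap, the client looks consistent
```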

# tshark -2 -n -r ../nfs.pcap -z
proto,colinfo,nfs.stateid4.other,nfs.stateid4.other
...
22 3 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 23) OPEN DH: 0xbfe01adc/m.db nfs.seqid == 0x00000022
23 3 172.27.255.28 -> 172.27.50.222 NFS 394 V4 Reply (Call
In 22) OPEN StateID: 0x7999 nfs.seqid == 0x00000001
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
24 3 172.27.50.222 -> 172.27.255.28 NFS 290 V4 Call (Reply
In 25) OPEN_CONFIRM nfs.seqid == 0x00000001 nfs.seqid == 0x00000023
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
25 3 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 24) OPEN_CONFIRM nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
26 3 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 27) OPEN DH: 0x272b8d23/.sqliterc nfs.seqid == 0x00000024
27 3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 26) OPEN Status: NFS4ERR_NOENT
28 3 172.27.50.222 -> 172.27.255.28 NFS 374 V4 Call (Reply
In 29) OPEN DH: 0x272b8d23/.sqlite_history nfs.seqid == 0x00000025
29 3 172.27.255.28 -> 172.27.50.222 NFS 430 V4 Reply (Call
In 28) OPEN StateID: 0x3203 nfs.seqid == 0x00000001 nfs.seqid ==
0x00000001 nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e6:02:00:00
30 3 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 31) CLOSE StateID: 0x3203 nfs.seqid == 0x00000026 nfs.seqid ==
0x00000001 nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
31 3 172.27.255.28 -> 172.27.50.222 NFS 202 V4 Reply (Call
In 30) CLOSE nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
32 3 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 33) LOOKUP DH: 0x272b8d23/.terminfo
33 3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 32) LOOKUP Status: NFS4ERR_NOENT
34 3 172.27.50.222 -> 172.27.255.28 NFS 290 V4 Call (Reply
In 35) ACCESS FH: 0x272b8d23, [Check: RD LU MD XT DL]
35 3 172.27.255.28 -> 172.27.50.222 NFS 194 V4 Reply (Call
In 34) ACCESS, [Allowed: RD LU MD XT DL]
36 3 172.27.50.222 -> 172.27.255.28 NFS 282 V4 Call (Reply
In 37) GETATTR FH: 0x272b8d23
37 3 172.27.255.28 -> 172.27.50.222 NFS 266 V4 Reply (Call
In 36) GETATTR
38 3 172.27.50.222 -> 172.27.255.28 NFS 302 V4 Call (Reply
In 39) LOOKUP DH: 0x272b8d23/.inputrc
39 3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 38) LOOKUP Status: NFS4ERR_NOENT
40 3 172.27.50.222 -> 172.27.255.28 TCP 66 822 > 2049 [ACK]
Seq=3950 Ack=2617 Win=3077 Len=0 TSval=177154789 TSecr=3474577216
41 5 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 42) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1 nfs.seqid ==
0x00000027 nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
42 5 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 41) LOCK Status: NFS4ERR_BAD_SEQID
43 5 172.27.50.222 -> 172.27.255.28 TCP 66 822 > 2049 [ACK]
Seq=4238 Ack=2673 Win=3077 Len=0 TSval=177156701 TSecr=3474579167


working:

7 2 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 8) OPEN DH: 0xbfe01adc/m.db nfs.seqid == 0x00000017
8 2 172.27.255.28 -> 172.27.50.222 NFS 394 V4 Reply (Call
In 7) OPEN StateID: 0x08c1 nfs.seqid == 0x00000001
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
9 2 172.27.50.222 -> 172.27.255.28 NFS 374 V4 Call (Reply
In 10) OPEN DH: 0x272b8d23/.sqlite_history nfs.seqid == 0x00000018
10 2 172.27.255.28 -> 172.27.50.222 NFS 430 V4 Reply (Call
In 9) OPEN StateID: 0x147a nfs.seqid == 0x00000001 nfs.seqid ==
0x00000001 nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fe:02:00:00
11 2 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 12) CLOSE StateID: 0x147a nfs.seqid == 0x00000019 nfs.seqid ==
0x00000001 nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
12 2 172.27.255.28 -> 172.27.50.222 NFS 202 V4 Reply (Call
In 11) CLOSE nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
13 2 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=1293 Ack=1157 Win=40 Len=0 TSval=177379723 TSecr=3474802150
14 4 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 15) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1 nfs.seqid ==
0x0000001a nfs.seqid == 0x00000001 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
15 4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 14) LOCK nfs.seqid == 0x00000001 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
16 4 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=1581 Ack=1229 Win=40 Len=0 TSval=177381305 TSecr=3474803772
17 4 172.27.50.222 -> 172.27.255.28 NFS 318 V4 Call (Reply
In 18) LOCK FH: 0x80589398 Offset: 1073741826 Length: 510 nfs.seqid
== 0x00000001 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
18 4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 17) LOCK nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
19 4 172.27.50.222 -> 172.27.255.28 NFS 310 V4 Call (Reply
In 20) LOCKU FH: 0x80589398 Offset: 1073741824 Length: 1 nfs.seqid
== 0x00000002 nfs.seqid == 0x00000002 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
20 4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 19) LOCKU nfs.seqid == 0x00000003 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
21 4 172.27.50.222 -> 172.27.255.28 NFS 298 V4 Call (Reply
In 22) READ StateID: 0x2656 Offset: 0 Len: 6144 nfs.seqid ==
0x00000003 nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
22 4 172.27.255.28 -> 172.27.50.222 NFS 6274 V4 Reply (Call
In 21) READ
23 4 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=2309 Ack=7581 Win=40 Len=0 TSval=177381309 TSecr=3474803775
24 4 172.27.50.222 -> 172.27.255.28 NFS 310 V4 Call (Reply
In 25) LOCKU FH: 0x80589398 Offset: 0 Length: <End of File> nfs.seqid
== 0x00000003 nfs.seqid == 0x00000003 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
25 4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 24) LOCKU nfs.seqid == 0x00000004 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
26 4 172.27.50.222 -> 172.27.255.28 NFS 190 V4 Call (Reply
In 28) RELEASE_LOCKOWNER
27 4 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 29) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1 nfs.seqid ==
0x0000001b nfs.seqid == 0x00000001 nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
28 4 172.27.255.28 -> 172.27.50.222 NFS 114 V4 Reply (Call
In 26) RELEASE_LOCKOWNER
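For what it's worth, the lock-stateid bookkeeping visible in the working dump above can be replayed as a toy sketch (my own illustration, not wire format): the first LOCK for a new lock-owner is bootstrapped from the open state, its reply mints a lock stateid with seqid 1, and each later successful LOCK/LOCKU reply advances that seqid by one, matching the reply seqids 1, 2, 3, 4 in frames 15, 18, 20 and 25:

```python
# Replay of the lock-stateid seqid progression in the working dump
# (frames 14-25). Purely illustrative bookkeeping, not the NFS wire
# format: the first LOCK bootstraps a new lock stateid (reply seqid
# 1); each subsequent successful LOCK/LOCKU reply advances it by one.
lock_seq = None
for op, reply_seq in [("LOCK", 1), ("LOCK", 2), ("LOCKU", 3), ("LOCKU", 4)]:
    expected = 1 if lock_seq is None else lock_seq + 1
    assert reply_seq == expected, f"{op}: got {reply_seq}, want {expected}"
    lock_seq = reply_seq
print(lock_seq)  # 4, the seqid on the final LOCKU reply (frame 25)
```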

2019-09-26 08:16:28

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

Hi Leon,

Given this trace, I don't see anything that the client is doing wrong.
I'd say this points to the server losing track of the seqid.
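To picture one way a server can "lose track", here is a toy model of my own (purely illustrative, not knfsd code, and not necessarily the mechanism behind the Red Hat KB mentioned earlier): if the server's stored open-owner seqid advances even when it rejects an operation, it ends up one ahead of a perfectly well-behaved client.

```python
# Toy model only (not knfsd): a server whose seqid bookkeeping is
# advanced even on a rejected operation drifts ahead of the client,
# so the client's next, correct seqid also draws NFS4ERR_BAD_SEQID.
# RFC 7530 requires that a BAD_SEQID reply NOT advance the seqid.
class BuggySeqidTracker:
    def __init__(self):
        self.expected = 1

    def apply(self, seqid):
        ok = seqid == self.expected
        self.expected += 1            # bug: bumped even on failure
        return "NFS4_OK" if ok else "NFS4ERR_BAD_SEQID"

srv = BuggySeqidTracker()
srv.apply(1)            # OPEN, accepted
srv.apply(2)            # CLOSE, accepted
srv.apply(2)            # duplicate CLOSE, rejected but still bumps
print(srv.apply(3))     # client's correct next seqid is now refused
```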

On Tue, Sep 24, 2019 at 12:18 AM Leon Kyneur <[email protected]> wrote:
>
> On Thu, Sep 19, 2019 at 9:36 PM Olga Kornievskaia <[email protected]> wrote:
> >
> > [snip]
>
>
> Hi Olga
>
> I don't ever see a double close in my packet captures. Here's a more
> verbose dump, I'm not quite sure what you mean re LOCK "wasn't
> bootstrapped from the open seqid" that does seem to explain it as I
> don't see the 0x00000001 seq id on the from the OPEN call later in the
> LOCK FH call. The working example (also below) does show this.
>
> So I'm still wondering if this is the double close issue or something else?
>
> [tshark dumps snipped; quoted in full in the previous message]