Hello,
we have a FreeBSD 9.1 fileserver and several clients running kernel
3.8.4-102.fc17.x86_64. Everything works fine till we reboot the
server. A fraction (1/10) of the clients don't resume the NFS session
correctly. The server sends a NFS4ERR_STALE_STATEID. The client sends
a RENEW to the server but no SETCLIENTID. (this should be the correct
action from my very quick look at RFC 3530). After that the client
continues with a few READ call and the process starts again with the
NFS4ERR_STALE_STATEID response from the server. It generates a lot of
useless network traffic.
I enclosed a log of tshark and a kernel log with NFS debugging enabled
(one iteration).
Any suggestions how to solve this problem?
Thanks,
Bram.
> Just to clarify/correct what I posted yesterday...
> The boot instance is the first 4 bytes of the clientid and the first
> 4 bytes of the stateid.other. (Basically, for the FreeBSD server, a
> stateid.other is just the clientid + 4 additional bytes that identify
> which stateid related to the clientid that it is.)
>
> Those first 4 bytes should be the same for all clientids/stateid.others
> issued during a server boot cycle. Any clientid/stateid.other with a
> different first 4 bytes will get the NFS4ERR_STALE_CLIENTID/STATEID
> reply.
Thanks for the clarification. I tried to reproduce the problem using a
test setup, but so far I haven't succeeded. It's clearly not a problem
that happens all the time, and not all the clients lock up in the
production system, only a fraction of them (~1 in 10).
I checked the packets again. The stateid in a READ operation is:
9a:b6:5d:51:bc:07:00:00:24:23:00:00
The client id:
af:c1:63:51:8b:01:00:00
It seems we ended up with a stale stateid but with a valid client id.
I am going to do some more tests with multiple clients to try to
reproduce the problem. If that doesn't succeed, I'll try to get the data
from the production server the next time we have to reboot it (but
this can take a while).
Thanks,
Bram
[ Adding Rick Macklem ]
On Apr 9, 2013, at 3:08 PM, J. Bruce Fields <[email protected]> wrote:
> On Tue, Apr 09, 2013 at 05:51:40PM +0200, Bram Vandoren wrote:
>> Hello,
>> we have a FreeBSD 9.1 fileserver and several clients running kernel
>> 3.8.4-102.fc17.x86_64. Everything works fine till we reboot the
>> server. A fraction (1/10) of the clients don't resume the NFS session
>> correctly. The server sends a NFS4ERR_STALE_STATEID. The client sends
>> a RENEW to the server but no SETCLIENTID. (this should be the correct
>> action from my very quick look at RFC 3530). After that the client
>> continues with a few READ call and the process starts again with the
>> NFS4ERR_STALE_STATEID response from the server. It generates a lot of
>> useless network traffic.
>
> 0.003754 a.b.c.2 -> a.b.c.120 NFS 122 V4 Reply (Call In 49) READ Status: NFS4ERR_STALE_STATEID
> 0.003769 a.b.c.2 -> a.b.c.120 NFS 114 V4 Reply (Call In 71) RENEW
>
> I don't normally use tshark, so I don't know--does the lack of a status
> on that second line indicate that the RENEW succeeded?
>
> Assuming the RENEW is for the same clientid that the read stateid's are
> associated with--that's definitely a server bug. The RENEW should be
> returning STALE_CLIENTID.
The server is returning NFS4_OK to that RENEW and we appear to be out of the server's grace period. Thus we can assume that state recovery has already been performed following the server reboot, and a fresh client ID has been correctly established. One possible explanation for NFS4ERR_STALE_STATEID is that the client skipped recovering these state IDs for some reason.
A full network capture in pcap format, started before the server reboot occurs, would be needed for us to analyze the issue properly.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> [ Adding Rick Macklem ]
>
> On Apr 9, 2013, at 3:08 PM, J. Bruce Fields <[email protected]>
> wrote:
>
> > On Tue, Apr 09, 2013 at 05:51:40PM +0200, Bram Vandoren wrote:
> >> Hello,
> >> we have a FreeBSD 9.1 fileserver and several clients running kernel
> >> 3.8.4-102.fc17.x86_64. Everything works fine till we reboot the
> >> server. A fraction (1/10) of the clients don't resume the NFS
> >> session
> >> correctly. The server sends a NFS4ERR_STALE_STATEID. The client
> >> sends
> >> a RENEW to the server but no SETCLIENTID. (this should be the
> >> correct
> >> action from my very quick look at RFC 3530). After that the client
> >> continues with a few READ call and the process starts again with
> >> the
> >> NFS4ERR_STALE_STATEID response from the server. It generates a lot
> >> of
> >> useless network traffic.
> >
> > 0.003754 a.b.c.2 -> a.b.c.120 NFS 122 V4 Reply (Call In 49) READ
> > Status: NFS4ERR_STALE_STATEID
> > 0.003769 a.b.c.2 -> a.b.c.120 NFS 114 V4 Reply (Call In 71) RENEW
> >
> > I don't normally use tshark, so I don't know--does the lack of a
> > status
> > on that second line indicate that the RENEW succeeded?
> >
> > Assuming the RENEW is for the same clientid that the read stateid's
> > are
> > associated with--that's definitely a server bug. The RENEW should be
> > returning STALE_CLIENTID.
>
> The server is returning NFS4_OK to that RENEW and we appear to be out
> of the server's grace period. Thus we can assume that state recovery
> has already been performed following the server reboot, and a fresh
> client ID has been correctly established. One possible explanation for
> NFS4ERR_STALE_STATEID is that the client skipped recovering these
> state IDs for some reason.
>
This is a possible explanation. There is a "boottime" field in both
the clientid and stateid for a FreeBSD NFSv4 server, which is checked
and generates an NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID, depending
on whether it is a clientid or a stateid. In other words, the code
generates the stale error for all clientids/stateids that were
issued before the server reboot. (Also, the FreeBSD server delays the
end of grace until it hasn't seen any reclaim ops for a while, so the
client shouldn't simply run out of grace period unless it delays for
seconds without completing a recovery.)
If you have a capture of the above that you can look at in Wireshark,
you could look at the bits that make up the stateid/clientid. I don't
have the code in front of me, but I think the first 32 bits are
the "boottime" and should be different for the stateid vs. the clientid
in the above two RPCs.
> A full network capture in pcap format, started before the server
> reboot occurs, would be needed for us to analyze the issue properly.
>
I agree. A network capture on the server from when it reboots would
be needed to analyse this. You could stick a line that starts a tcpdump
capture into the /etc/rc.d/nfsd script, or disable starting the nfsd
in /etc/rc.conf and then start it manually after starting a
"tcpdump -w reboot.pcap -s 0" or similar.
If you get such a <file>.pcap, I could probably take a look at it,
although it might not happen until May.
rick
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
On Fri, Apr 12, 2013 at 11:19:39AM +0200, Bram Vandoren wrote:
> > Just to clarify/correct what I posted yesterday...
> > The boot instance is the first 4 bytes of the clientid and the first
> > 4 bytes of the stateid.other. (Basically, for the FreeBSD server, a
> > stateid.other is just the clientid + 4 additional bytes that identify
> > which stateid related to the clientid that it is.)
> >
> > Those first 4 bytes should be the same for all clientids/stateid.others
> > issued during a server boot cycle. Any clientid/stateid.other with a
> > different first 4 bytes will get the NFS4ERR_STALE_CLIENTID/STATEID
> > reply.
>
> Thanks for the clarification. I tried to reproduce the problem using a
> test setup but so far I didn't succeed. It's clearly not a problem
> that happens all the time. Also not all the clients lock up in the
> production system. Only a fraction of them (~ 1 in 10).
>
> I checked the packets again. The Stateid in a read operation is:
> 9a:b6:5d:51:bc:07:00:00:24:23:00:00
> The client id:
> af:c1:63:51:8b:01:00:00
Yeah, so 9a:b6:5d:51 and af:c1:63:51 if you take them as little-endian
integers look like pretty reasonable recent dates in unix time, so that
agrees with Rick's description of the FreeBSD server's stateid and
clientid.
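E.g., a quick check (a minimal sketch, assuming the first 4 bytes really are
a little-endian time_t boot instance as Rick described):

    import datetime, struct

    stateid_other = bytes.fromhex("9a:b6:5d:51:bc:07:00:00:24:23:00:00".replace(":", ""))
    clientid      = bytes.fromhex("af:c1:63:51:8b:01:00:00".replace(":", ""))

    for name, blob in (("stateid  boot instance:", stateid_other),
                       ("clientid boot instance:", clientid)):
        boottime, = struct.unpack("<I", blob[:4])   # little-endian uint32
        print(name, datetime.datetime.utcfromtimestamp(boottime))

They decode to dates a few days apart in early April 2013, i.e. two
different boot instances: the stateid predates the reboot, the clientid
was issued after it.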
So that does sound like the client failing to toss an old stateid after
renew?
--b.
>
> It seems we ended up with a stale stateid but with a valid client id.
>
> I am going to do some more tests with mutiple clients to try to
> reproduce the problem. If that doesn't succeed I try to get the data
> from the production server when we have to reboot it next time (but
> this can take a while).
>
> Thanks,
> Bram
Hi Bram-
I'm going to leave linux-nfs on the cc: line while we debug this. If there is a privacy issue, you can send raw pcap files directly to us. The mail archive does not save attachments, IIRC.
On Apr 12, 2013, at 5:26 AM, Bram Vandoren <[email protected]> wrote:
> Hi Rick, Chuck, Bruce,
> in attachment is a small pcap when a client is in the locked.
> Hopefully I can reproduce the problem so I can send you a capture
> during a reboot cycle.
The pcap file confirms that the state IDs and client ID do not appear to match, and do appear on the same TCP connection (in different operations). I think the presence of the RENEW operations here suggests that the client believes it has not been able to renew its lease using stateful operations like READ. IMO this is evidence in favor of the theory that the client neglected to recover these state IDs for some reason.
We'll need to see the actual reboot recovery traffic to analyze further, and that occurs just after the server reboots. Even better would be to see the initial OPEN of the file where the READ operations are failing. I recognize this is a non-deterministic problem that will be a challenge to capture properly.
Rather than capturing the trace on the server, you should be able to capture it on your clients in order to capture traffic before, during, and after the server reboot. To avoid capturing an enormous amount of data, both tcpdump and tshark provide options to save the captured network data into a small ring of files (see their man pages). Once a client mount point has locked, you can stop the capture, and hopefully the ring will have everything we need.
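For example (a sketch; interface, server name and sizes are placeholders), on a client something like

    tcpdump -i eth0 -s 0 -C 100 -W 10 -w /var/tmp/nfs-ring.pcap host nfs-server and port 2049

rotates through 10 files of roughly 100 MB each; the tshark equivalent is

    tshark -i eth0 -s 0 -b filesize:102400 -b files:10 -w /var/tmp/nfs-ring.pcap host nfs-server and port 2049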
>
> Thanks,
> Bram.
>
> On Fri, Apr 12, 2013 at 11:19 AM, Bram Vandoren <[email protected]> wrote:
>>> Just to clarify/correct what I posted yesterday...
>>> The boot instance is the first 4 bytes of the clientid and the first
>>> 4 bytes of the stateid.other. (Basically, for the FreeBSD server, a
>>> stateid.other is just the clientid + 4 additional bytes that identify
>>> which stateid related to the clientid that it is.)
>>>
>>> Those first 4 bytes should be the same for all clientids/stateid.others
>>> issued during a server boot cycle. Any clientid/stateid.other with a
>>> different first 4 bytes will get the NFS4ERR_STALE_CLIENTID/STATEID
>>> reply.
>>
>> Thanks for the clarification. I tried to reproduce the problem using a
>> test setup but so far I didn't succeed. It's clearly not a problem
>> that happens all the time. Also not all the clients lock up in the
>> production system. Only a fraction of them (~ 1 in 10).
>>
>> I checked the packets again. The Stateid in a read operation is:
>> 9a:b6:5d:51:bc:07:00:00:24:23:00:00
>> The client id:
>> af:c1:63:51:8b:01:00:00
>>
>> It seems we ended up with a stale stateid but with a valid client id.
>>
>> I am going to do some more tests with mutiple clients to try to
>> reproduce the problem. If that doesn't succeed I try to get the data
>> from the production server when we have to reboot it next time (but
>> this can take a while).
>>
>> Thanks,
>> Bram
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Tue, Apr 09, 2013 at 05:51:40PM +0200, Bram Vandoren wrote:
> Hello,
> we have a FreeBSD 9.1 fileserver and several clients running kernel
> 3.8.4-102.fc17.x86_64. Everything works fine till we reboot the
> server. A fraction (1/10) of the clients don't resume the NFS session
> correctly. The server sends a NFS4ERR_STALE_STATEID. The client sends
> a RENEW to the server but no SETCLIENTID. (this should be the correct
> action from my very quick look at RFC 3530). After that the client
> continues with a few READ call and the process starts again with the
> NFS4ERR_STALE_STATEID response from the server. It generates a lot of
> useless network traffic.
0.003754 a.b.c.2 -> a.b.c.120 NFS 122 V4 Reply (Call In 49) READ Status: NFS4ERR_STALE_STATEID
0.003769 a.b.c.2 -> a.b.c.120 NFS 114 V4 Reply (Call In 71) RENEW
I don't normally use tshark, so I don't know--does the lack of a status
on that second line indicate that the RENEW succeeded?
Assuming the RENEW is for the same clientid that the READ stateids are
associated with--that's definitely a server bug. The RENEW should be
returning STALE_CLIENTID.
--b.
Chuck Lever wrote:
> [ Adding Rick Macklem ]
>
> On Apr 9, 2013, at 3:08 PM, J. Bruce Fields <[email protected]>
> wrote:
>
> > On Tue, Apr 09, 2013 at 05:51:40PM +0200, Bram Vandoren wrote:
> >> Hello,
> >> we have a FreeBSD 9.1 fileserver and several clients running kernel
> >> 3.8.4-102.fc17.x86_64. Everything works fine till we reboot the
> >> server. A fraction (1/10) of the clients don't resume the NFS
> >> session
> >> correctly. The server sends a NFS4ERR_STALE_STATEID. The client
> >> sends
> >> a RENEW to the server but no SETCLIENTID. (this should be the
> >> correct
> >> action from my very quick look at RFC 3530). After that the client
> >> continues with a few READ call and the process starts again with
> >> the
> >> NFS4ERR_STALE_STATEID response from the server. It generates a lot
> >> of
> >> useless network traffic.
> >
> > 0.003754 a.b.c.2 -> a.b.c.120 NFS 122 V4 Reply (Call In 49) READ
> > Status: NFS4ERR_STALE_STATEID
> > 0.003769 a.b.c.2 -> a.b.c.120 NFS 114 V4 Reply (Call In 71) RENEW
> >
> > I don't normally use tshark, so I don't know--does the lack of a
> > status
> > on that second line indicate that the RENEW succeeded?
> >
> > Assuming the RENEW is for the same clientid that the read stateid's
> > are
> > associated with--that's definitely a server bug. The RENEW should be
> > returning STALE_CLIENTID.
>
> The server is returning NFS4_OK to that RENEW and we appear to be out
> of the server's grace period. Thus we can assume that state recovery
> has already been performed following the server reboot, and a fresh
> client ID has been correctly established. One possible explanation for
> NFS4ERR_STALE_STATEID is that the client skipped recovering these
> state IDs for some reason.
>
Just to clarify/correct what I posted yesterday...
The boot instance is the first 4 bytes of the clientid and the first
4 bytes of the stateid.other. (Basically, for the FreeBSD server, a
stateid.other is just the clientid plus 4 additional bytes that identify
which of that clientid's stateids it is.)
Those first 4 bytes should be the same for all clientids/stateid.others
issued during a server boot cycle. Any clientid/stateid.other with a
different first 4 bytes will get the NFS4ERR_STALE_CLIENTID/STATEID
reply.
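In code form, the check is essentially this (a minimal sketch of the rule
above, not the actual server code):

    def is_stale(id_bytes: bytes, current_boot_instance: bytes) -> bool:
        # clientids and stateid.others both start with the server's
        # 4-byte boot instance; a mismatch gets the NFS4ERR_STALE_* reply
        return id_bytes[:4] != current_boot_instance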
rick
> A full network capture in pcap format, started before the server
> reboot occurs, would be needed for us to analyze the issue properly.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> On May 28, 2013, at 9:04 PM, Chuck Lever <[email protected]>
> wrote:
>
> >
> > On May 28, 2013, at 7:30 PM, Rick Macklem <[email protected]>
> > wrote:
> >
> >> Bram Vandoren wrote:
> >>>>> Hi Rick, Chuck, Bruce,
> >>>>> in attachment is a small pcap when a client is in the locked.
> >>>>> Hopefully I can reproduce the problem so I can send you a
> >>>>> capture
> >>>>> during a reboot cycle.
> >>>>
> >>>> The pcap file confirms that the state IDs and client ID do not
> >>>> appear to match, and do appear on the same TCP connection (in
> >>>> different operations). I think the presence of the RENEW
> >>>> operations
> >>>> here suggest that the client believes it has not been able to
> >>>> renew
> >>>> its lease using stateful operations like READ. IMO this is
> >>>> evidence
> >>>> in favor of the theory that the client neglected to recover these
> >>>> state IDs for some reason.
> >>>>
> >>>> We'll need to see the actual reboot recovery traffic to analyze
> >>>> further, and that occurs just after the server reboots. Even
> >>>> better
> >>>> would be to see the initial OPEN of the file where the READ
> >>>> operations are failing. I recognize this is a non-determinstic
> >>>> problem that will be a challenge to capture properly.
> >>>>
> >>>> Rather than capturing the trace on the server, you should be able
> >>>> to
> >>>> capture it on your clients in order to capture traffic before,
> >>>> during, and after the server reboot. To avoid capturing an
> >>>> enormous
> >>>> amount of data, both tcpdump and tshark provide options to save
> >>>> the
> >>>> captured network data into a small ring of files (see their man
> >>>> pages). Once a client mount point has locked, you can stop the
> >>>> capture, and hopefully the ring will have everything we need.
> >>>
> >>> Hi All,
> >>> I managed to capture the packets after a reboot. I send the pcap
> >>> file
> >>> to the people in cc (privacy issue, contact me if someone on the
> >>> list
> >>> wants a copy). This is a summary of what happens after a reboot
> >>> (perhaps a missed some relevant information):
> >>>
> >>> 38:
> >>> - client -> server: client executes 3 writes with a stale clientid
> >>> (A)
> >>> - client -> server: RENEW
> >>> 44:
> >>> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
> >>> 45:
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 65:
> >>> - client -> server: RENEW
> >>> 66
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 67,85,87,93:
> >>> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> >>> 78,79:
> >>> NFS4STALE_STATEID (reponse to the other 2 writes in A)
> >>>
> >>> 98: OPEN with CLAIM_PREVIOUS
> >>> 107: response to open: NFS4ERR_NO_GRACE (strange?)
> >>> after that the client re-opens the files without CLAIM_PREVIOUS
> >>> option
> >>> and they are all succesful.
> >>>
> >>> The client starts using the new stateids except for the files in
> >>> A.
> >>> The server returns a NFS4_STALE_STATEID, the client executes a
> >>> RENEW
> >>> (IMO this should be an OPEN request) and retries the WRITE, the
> >>> server
> >>> returns a NFS4_STALE_STATEID
> >>>
> >>> Server: FreeBSD 9.1 with new NFS server implementation
> >>> Client: Fedora 17, 3.8.11-100.fc17.x86_64
> >>>
> >>> Any clues?
> >>>
> >>> Thanks,
> >>> Bram
> >> I just took a look at the packet capture (I hadn't realized you had
> >> posted one when I did the last reply).
> >>
> >> I think the problem is a TCP layer one, from looking at the first 8
> >> packets captured amd their timestamps. However, my TCP is old and
> >> tired, so I'm not sure. From the first 8 packets:
> >> Time(sec)
> >> 0 - client sends a SYN
> >> 0 - server replies [RST, ACK]
> >> Since Reset is in the reply, is this when the server rebooted?
> >> 6 - client sends a SYN with same port#s. Wireshark thinks this is
> >> a new connection using same port#s.
> >> 6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
> >> of the first packet and still considers it the old Reset
> >> connection.
> >> 15 - Repeat of what happened at 6sec.
> >> 30,31 - client sends SYN twice, still from the same port#s
> >> - no reply from FreeBSD
> >> 4393 - client sends SYN again with same port#s
> >> 4393 - server replies with [SYN, ACK] establishing the new TCP
> >> connection on the same port#s, but over an hour later.
> >
> > To be clear, this is very early in the capture, just frame 9 and 10.
> > The Linux client attempts to re-use the previous connection's source
> > port to preserve the server's duplicate reply cache.
>
I just found something on the net that suggests that Cisco Intrusion Detection (IDS)
will sometimes inject RSTs to shut down connections.
If you happen to have Cisco gear between the client and server, you might
want to check whether IDS is enabled. (Don't ask me how you do that; I've never
touched a Cisco switch. I also saw something suggesting that Juniper
switches do something similar, but it didn't indicate how to disable the
capability.)
Good luck with it again, rick
> And by the way, if the server actually did reboot, it shouldn't have
> noticed the client's port re-use. TCP connection state should have
> been wiped by the system restart.
>
> > But it's not clear why the client is not retrying SYN during the
> > period between frame 8 and frame 9. The client should retry SYN
> > after just another minute or two, and by that time the server's TCP
> > connection state should allow a connection on that port.
> >
> > It's not typical for a client with an active workload to wait 4400
> > seconds to send a fresh SYN. Bram, can you shed any light on this?
> >
> >> Since this is now over an hour after packet 0, it isn't
> >> surprising that Grace is over.
> >
> > Fair enough. However, it does not explain why the client fails to
> > recover the open state for one file.
> >
> >> I'd guess that the problem occurs because the client reconnect
> >> uses the same client side port# even though the server did a
> >> Reset for that port#, but I'm not enough of a TCP guy to know
> >> what should be correct TCP behaviour.
> >>
> >> If you don't mind the first part of this packet capture being
> >> posted (email me to say if it ok to post it), I can post it to
> >> [email protected], since there are a bunch of TCP savy
> >> types that might be able to help. (If you give the ok for this,
> >> please also email me what version of FreeBSD the server is
> >> running.)
> >
> > Thanks for the extra set of eyes.
> >
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
> >
> >
> >
> >
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
On May 28, 2013, at 7:30 PM, Rick Macklem <[email protected]> wrote:
> Bram Vandoren wrote:
>>>> Hi Rick, Chuck, Bruce,
>>>> in attachment is a small pcap when a client is in the locked.
>>>> Hopefully I can reproduce the problem so I can send you a capture
>>>> during a reboot cycle.
>>>
>>> The pcap file confirms that the state IDs and client ID do not
>>> appear to match, and do appear on the same TCP connection (in
>>> different operations). I think the presence of the RENEW operations
>>> here suggest that the client believes it has not been able to renew
>>> its lease using stateful operations like READ. IMO this is evidence
>>> in favor of the theory that the client neglected to recover these
>>> state IDs for some reason.
>>>
>>> We'll need to see the actual reboot recovery traffic to analyze
>>> further, and that occurs just after the server reboots. Even better
>>> would be to see the initial OPEN of the file where the READ
>>> operations are failing. I recognize this is a non-determinstic
>>> problem that will be a challenge to capture properly.
>>>
>>> Rather than capturing the trace on the server, you should be able to
>>> capture it on your clients in order to capture traffic before,
>>> during, and after the server reboot. To avoid capturing an enormous
>>> amount of data, both tcpdump and tshark provide options to save the
>>> captured network data into a small ring of files (see their man
>>> pages). Once a client mount point has locked, you can stop the
>>> capture, and hopefully the ring will have everything we need.
>>
>> Hi All,
>> I managed to capture the packets after a reboot. I send the pcap file
>> to the people in cc (privacy issue, contact me if someone on the list
>> wants a copy). This is a summary of what happens after a reboot
>> (perhaps a missed some relevant information):
>>
>> 38:
>> - client -> server: client executes 3 writes with a stale clientid (A)
>> - client -> server: RENEW
>> 44:
>> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
>> 45:
>> - server -> client: NFS4ERR_STALE_CLIENTID
>> 65:
>> - client -> server: RENEW
>> 66
>> - server -> client: NFS4ERR_STALE_CLIENTID
>> 67,85,87,93:
>> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
>> 78,79:
>> NFS4STALE_STATEID (reponse to the other 2 writes in A)
>>
>> 98: OPEN with CLAIM_PREVIOUS
>> 107: response to open: NFS4ERR_NO_GRACE (strange?)
>> after that the client re-opens the files without CLAIM_PREVIOUS option
>> and they are all succesful.
>>
>> The client starts using the new stateids except for the files in A.
>> The server returns a NFS4_STALE_STATEID, the client executes a RENEW
>> (IMO this should be an OPEN request) and retries the WRITE, the server
>> returns a NFS4_STALE_STATEID
>>
>> Server: FreeBSD 9.1 with new NFS server implementation
>> Client: Fedora 17, 3.8.11-100.fc17.x86_64
>>
>> Any clues?
>>
>> Thanks,
>> Bram
> I just took a look at the packet capture (I hadn't realized you had
> posted one when I did the last reply).
>
> I think the problem is a TCP layer one, from looking at the first 8
> packets captured amd their timestamps. However, my TCP is old and
> tired, so I'm not sure. From the first 8 packets:
> Time(sec)
> 0 - client sends a SYN
> 0 - server replies [RST, ACK]
> Since Reset is in the reply, is this when the server rebooted?
> 6 - client sends a SYN with same port#s. Wireshark thinks this is
> a new connection using same port#s.
> 6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
> of the first packet and still considers it the old Reset connection.
> 15 - Repeat of what happened at 6sec.
> 30,31 - client sends SYN twice, still from the same port#s
> - no reply from FreeBSD
> 4393 - client sends SYN again with same port#s
> 4393 - server replies with [SYN, ACK] establishing the new TCP
> connection on the same port#s, but over an hour later.
To be clear, this is very early in the capture, just frames 9 and 10. The Linux client attempts to re-use the previous connection's source port to preserve the server's duplicate reply cache.
But it's not clear why the client is not retrying SYN during the period between frame 8 and frame 9. The client should retry SYN after just another minute or two, and by that time the server's TCP connection state should allow a connection on that port.
It's not typical for a client with an active workload to wait 4400 seconds to send a fresh SYN. Bram, can you shed any light on this?
> Since this is now over an hour after packet 0, it isn't
> surprising that Grace is over.
Fair enough. However, it does not explain why the client fails to recover the open state for one file.
> I'd guess that the problem occurs because the client reconnect
> uses the same client side port# even though the server did a
> Reset for that port#, but I'm not enough of a TCP guy to know
> what should be correct TCP behaviour.
>
> If you don't mind the first part of this packet capture being
> posted (email me to say if it ok to post it), I can post it to
> [email protected], since there are a bunch of TCP savy
> types that might be able to help. (If you give the ok for this,
> please also email me what version of FreeBSD the server is
> running.)
Thanks for the extra set of eyes.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Bram Vandoren wrote:
> > Did both the client and server have the same IP addresses before the
> > reboot?
>
> Yes.
>
> > If not, the Linux client's nfs_client_id4.id SetClientID argument
> > will be different (it has the client/side IP# in it).
> > nfs_client_id4.id
> > isn't supposed to change for a given client when it is rebooted.
> > That will make the FreeBSD NFSv4 server see "new client" (which is
> > not in the
> > stablerestart file used to avoid certain reboot edge conditions) and
> > will not give it a grace period.
> > This is the only explanation I can think of for the NFS4ERR_NO_GRACE
> > reply shortly after the reboot.
>
> I checked some other clients and they all receive the
> NFS4ERR_NO_GRACE response from the server. It's not unique for the
> clients that hang. I was unable to reproduce this is a minimal test
> configuration. Perhaps the nfs-stablerestart file is corrupt on the
> server?
>
> I checked
> strings nfs-stablerestart
> and I see a lot of duplicate entries. In total there are ~10000 lines
> but we only have ~50 clients.
> Most clients have 3 types of entries:
> Linux NFSv4.0 a.b.c.d/e.f.g.h tcp
> Linux NFSv4.0 a.b.c.d/e.f.g.h tcp*
> Linux NFSv4.0 a.b.c.d/e.f.g.h tcp+
>
I'll take a look. I wrote that code about 10 years ago, so I don't remember
all the details w.r.t. the records in the stable restart file. If you truncate
the file, there won't be any recovery on the next reboot, so you need to
unmount all the NFSv4 mounts on it before rebooting for that case.
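If it helps, a quick way to see how many distinct entries are in there
(assuming the default /var/db/nfs-stablerestart location):

    strings /var/db/nfs-stablerestart | sort | uniq -c | sort -rn | head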
What your packet trace didn't indicate was when the server was rebooted vs
when the client sent it a SYN that started a new connection. During the
approx. 4400 sec the server was down, there should have been repeated attempts
to connect to it (basically a TCP packet with SYN in it) at least once every
30 sec. Basically, after the server reboots, the client must establish a TCP
connection and attempt recovery within 2 minutes or it just isn't going to
work.
Btw, server reboot recovery doesn't get a lot of testing. Some of that is
logistics (no one pays for FreeBSD NFS development, etc) and the rest is that
most assume a server will remain up for months/years at a time. If the FreeBSD
server is crashing, you need to try and resolve that. If the approx. 4400 sec
downtime was a scheduled maintenance type of thing, you should consider unmounting
the volumes before the server is shut down and doing fresh mounts after it
is rebooted.
rick
> Again, thanks a lot for looking into this.
>
> Bram.
> It's not typical for a client with an active workload to wait 4400 seconds to send a fresh SYN. Bram, can you shed any light on this?
The server was down for ~4400 seconds (I started the capture before
the maintenance period). I captured the packets using a switch mirror
port (it's possible some packets were lost during the capture). After a
while an ARP timeout occurs and we no longer capture the SYNs from the
client (I doubt the client sends the SYNs at all, because it no longer
knows where to send them).
If it helps, I can filter extra information from the big pcap file
(~20 GB), e.g. traffic from other clients, or from the same client just
before the server reboot.
Thanks,
Bram.
Chuck Lever wrote:
> Hi-
>
> On May 28, 2013, at 8:31 AM, Bram Vandoren <[email protected]> wrote:
>
> >>> Hi Rick, Chuck, Bruce,
> >>> in attachment is a small pcap when a client is in the locked.
> >>> Hopefully I can reproduce the problem so I can send you a capture
> >>> during a reboot cycle.
> >>
> >> The pcap file confirms that the state IDs and client ID do not
> >> appear to match, and do appear on the same TCP connection (in
> >> different operations). I think the presence of the RENEW operations
> >> here suggest that the client believes it has not been able to renew
> >> its lease using stateful operations like READ. IMO this is evidence
> >> in favor of the theory that the client neglected to recover these
> >> state IDs for some reason.
> >>
> >> We'll need to see the actual reboot recovery traffic to analyze
> >> further, and that occurs just after the server reboots. Even better
> >> would be to see the initial OPEN of the file where the READ
> >> operations are failing. I recognize this is a non-determinstic
> >> problem that will be a challenge to capture properly.
> >>
> >> Rather than capturing the trace on the server, you should be able
> >> to capture it on your clients in order to capture traffic before,
> >> during, and after the server reboot. To avoid capturing an enormous
> >> amount of data, both tcpdump and tshark provide options to save the
> >> captured network data into a small ring of files (see their man
> >> pages). Once a client mount point has locked, you can stop the
> >> capture, and hopefully the ring will have everything we need.
> >
> > Hi All,
> > I managed to capture the packets after a reboot. I send the pcap
> > file
> > to the people in cc (privacy issue, contact me if someone on the
> > list
> > wants a copy). This is a summary of what happens after a reboot
> > (perhaps a missed some relevant information):
> >
> > 38:
> > - client -> server: client executes 3 writes with a stale clientid
> > (A)
> > - client -> server: RENEW
> > 44:
> > - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
> > 45:
> > - server -> client: NFS4ERR_STALE_CLIENTID
> > 65:
> > - client -> server: RENEW
> > 66
> > - server -> client: NFS4ERR_STALE_CLIENTID
> > 67,85,87,93:
> > SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> > 78,79:
> > NFS4STALE_STATEID (reponse to the other 2 writes in A)
> >
> > 98: OPEN with CLAIM_PREVIOUS
> > 107: response to open: NFS4ERR_NO_GRACE (strange?)
> > after that the client re-opens the files without CLAIM_PREVIOUS
> > option
> > and they are all succesful.
>
> That means the server is not in its grace period. I'm not familiar
> enough to know if that's typical for FreeBSD servers after a reboot,
> Rick will have to respond to that. The client responds correctly by
> switching to CLAIM_NULL OPENs.
>
The grace period is somewhat greater than the lease duration (the lease
duration is 2 minutes). Without looking at the code, I can't remember how
much more than 2 minutes it is.
You should look at the timestamps on the packets to see how much time
elapsed between 38 and 107. If it is more than 2 minutes, recovery is simply
not happening quickly enough. If not, then the client might not be noticing
that it needs to do a TCP reconnect + RPC retry quickly, and that could mean
that 38 is happening long after the server rebooted, which is when the
server's grace period starts. (NFSv4.1 "fixed" the "when should grace end"
problem, but for NFSv4.0 all a server can do is wait at least 1 lease period
and then end grace, unless it is still seeing reclaim operations.)
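For instance, something like this prints the relative timestamps of those
two frames (use -Y instead of -R on newer tshark versions):

    tshark -r reboot.pcap -R "frame.number == 38 || frame.number == 107" \
           -T fields -e frame.number -e frame.time_relative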
rick
> One thing I don't see is an OPEN for the file whose WRITE operations
> fail with NFS4ERR_STALE_STATEID. That looks like a client problem.
> Bram, would you send your pcap file to Trond (cc'd) ?
>
> > The client starts using the new stateids except for the files in A.
> > The server returns a NFS4_STALE_STATEID, the client executes a RENEW
> > (IMO this should be an OPEN request) and retries the WRITE, the
> > server
> > returns a NFS4_STALE_STATEID
>
> RENEW is an allowable response in this case. The client is trying to
> detect a server reboot before it continues with OPEN state recovery.
>
> > Server: FreeBSD 9.1 with new NFS server implementation
> > Client: Fedora 17, 3.8.11-100.fc17.x86_64
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
On Wed, May 29, 2013 at 2:49 PM, Rick Macklem <[email protected]> wrote:
>
>>
> Good point. Which leads to the question, "When did the server reboot
> relative to the packet capture?". Was it before the capture started or
> sometime during the 4400 seconds when the client wasn't sending anything or ???
The server rebooted at the beginning of the capture. I think the first
packets indicate the server was up but the NFS daemon was no longer
running. After ~4400 seconds the server was available again.
Thanks,
Bram.
Hi-
On May 28, 2013, at 8:31 AM, Bram Vandoren <[email protected]> wrote:
>>> Hi Rick, Chuck, Bruce,
>>> in attachment is a small pcap when a client is in the locked.
>>> Hopefully I can reproduce the problem so I can send you a capture
>>> during a reboot cycle.
>>
>> The pcap file confirms that the state IDs and client ID do not appear to match, and do appear on the same TCP connection (in different operations). I think the presence of the RENEW operations here suggest that the client believes it has not been able to renew its lease using stateful operations like READ. IMO this is evidence in favor of the theory that the client neglected to recover these state IDs for some reason.
>>
>> We'll need to see the actual reboot recovery traffic to analyze further, and that occurs just after the server reboots. Even better would be to see the initial OPEN of the file where the READ operations are failing. I recognize this is a non-determinstic problem that will be a challenge to capture properly.
>>
>> Rather than capturing the trace on the server, you should be able to capture it on your clients in order to capture traffic before, during, and after the server reboot. To avoid capturing an enormous amount of data, both tcpdump and tshark provide options to save the captured network data into a small ring of files (see their man pages). Once a client mount point has locked, you can stop the capture, and hopefully the ring will have everything we need.
>
> Hi All,
> I managed to capture the packets after a reboot. I send the pcap file
> to the people in cc (privacy issue, contact me if someone on the list
> wants a copy). This is a summary of what happens after a reboot
> (perhaps a missed some relevant information):
>
> 38:
> - client -> server: client executes 3 writes with a stale clientid (A)
> - client -> server: RENEW
> 44:
> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
> 45:
> - server -> client: NFS4ERR_STALE_CLIENTID
> 65:
> - client -> server: RENEW
> 66
> - server -> client: NFS4ERR_STALE_CLIENTID
> 67,85,87,93:
> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> 78,79:
> NFS4STALE_STATEID (reponse to the other 2 writes in A)
>
> 98: OPEN with CLAIM_PREVIOUS
> 107: response to open: NFS4ERR_NO_GRACE (strange?)
> after that the client re-opens the files without CLAIM_PREVIOUS option
> and they are all succesful.
That means the server is not in its grace period. I'm not familiar enough to know if that's typical for FreeBSD servers after a reboot; Rick will have to respond to that. The client responds correctly by switching to CLAIM_NULL OPENs.
One thing I don't see is an OPEN for the file whose WRITE operations fail with NFS4ERR_STALE_STATEID. That looks like a client problem. Bram, would you send your pcap file to Trond (cc'd) ?
> The client starts using the new stateids except for the files in A.
> The server returns a NFS4_STALE_STATEID, the client executes a RENEW
> (IMO this should be an OPEN request) and retries the WRITE, the server
> returns a NFS4_STALE_STATEID
RENEW is an allowable response in this case. The client is trying to detect a server reboot before it continues with OPEN state recovery.
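For reference, the recovery sequence expected here, roughly per RFC 3530 (a
pseudo-Python paraphrase, not the Linux client code; all names are placeholders):

    class NFS4ERR_NO_GRACE(Exception):
        pass

    def recover_after_server_reboot(client):
        client.setclientid()                      # SETCLIENTID
        client.setclientid_confirm()              # SETCLIENTID_CONFIRM
        for f in client.open_files:               # reclaim every open file
            try:
                f.reopen(claim="CLAIM_PREVIOUS")  # reclaim while in grace
            except NFS4ERR_NO_GRACE:
                f.reopen(claim="CLAIM_NULL")      # grace over: ordinary open
            # later READs/WRITEs must use the stateid from the new OPEN

The failing WRITEs in group A look like this reclaim step was skipped for
those files, so they still carry stateids from before the reboot.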
> Server: FreeBSD 9.1 with new NFS server implementation
> Client: Fedora 17, 3.8.11-100.fc17.x86_64
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Bram Vandoren wrote:
> > It's not typical for a client with an active workload to wait 4400
> > seconds to send a fresh SYN. Bram, can you shed any light on this?
>
> The server was down for ~4400 seconds (I started the capture before
> the maintanance period). I captured the packets using a switch mirror
> port (it's possible some packets are lost during capture). After a
> while an arp timeout occurs and we don't capture the SYNs from the
> client any more (I doubt the client sends the SYN because it doesn't
> know where to send it).
>
> If it helps, I can filter extra information from the big pcap file
> (~20G) (eg capture from other clients or from the same client just
> before the server reboot).
>
> Thanks,
> Bram.
Did both the client and server have the same IP addresses before the
reboot?
If not, the Linux client's nfs_client_id4.id SetClientID argument
will be different (it has the client-side IP# in it). nfs_client_id4.id
isn't supposed to change for a given client when it is rebooted.
That will make the FreeBSD NFSv4 server see "new client" (which is not in the
stablerestart file used to avoid certain reboot edge conditions) and
will not give it a grace period.
This is the only explanation I can think of for the NFS4ERR_NO_GRACE
reply shortly after the reboot.
I'd guess the RSTs are because the nfsd has terminated (nothing on
port# 2049) just before the system was shut down, so I suspect the
RSTs are not an issue? (I have assumed the packet trace started after
the server rebooted.)
rick
Hi Rick,
> If you happen to have Cisco gear between the client and server, you might
> want to check for IDS enabled? (Don't ask me how you do that. I've never
> touched a Cisco switch. I also saw something suggestion that Juniper
> switches do something similar, but it didn't indicate how to disable the
> capability.)
We don't use an IDS. We use HP ProCurve switches; I doubt they inject
RSTs (and certainly hope they don't).
Thanks,
Bram.
On May 28, 2013, at 9:04 PM, Chuck Lever <[email protected]> wrote:
>
> On May 28, 2013, at 7:30 PM, Rick Macklem <[email protected]> wrote:
>
>> Bram Vandoren wrote:
>>>>> Hi Rick, Chuck, Bruce,
>>>>> in attachment is a small pcap when a client is in the locked.
>>>>> Hopefully I can reproduce the problem so I can send you a capture
>>>>> during a reboot cycle.
>>>>
>>>> The pcap file confirms that the state IDs and client ID do not
>>>> appear to match, and do appear on the same TCP connection (in
>>>> different operations). I think the presence of the RENEW operations
>>>> here suggest that the client believes it has not been able to renew
>>>> its lease using stateful operations like READ. IMO this is evidence
>>>> in favor of the theory that the client neglected to recover these
>>>> state IDs for some reason.
>>>>
>>>> We'll need to see the actual reboot recovery traffic to analyze
>>>> further, and that occurs just after the server reboots. Even better
>>>> would be to see the initial OPEN of the file where the READ
>>>> operations are failing. I recognize this is a non-determinstic
>>>> problem that will be a challenge to capture properly.
>>>>
>>>> Rather than capturing the trace on the server, you should be able to
>>>> capture it on your clients in order to capture traffic before,
>>>> during, and after the server reboot. To avoid capturing an enormous
>>>> amount of data, both tcpdump and tshark provide options to save the
>>>> captured network data into a small ring of files (see their man
>>>> pages). Once a client mount point has locked, you can stop the
>>>> capture, and hopefully the ring will have everything we need.
>>>
>>> Hi All,
>>> I managed to capture the packets after a reboot. I send the pcap file
>>> to the people in cc (privacy issue, contact me if someone on the list
>>> wants a copy). This is a summary of what happens after a reboot
>>> (perhaps a missed some relevant information):
>>>
>>> 38:
>>> - client -> server: client executes 3 writes with a stale clientid (A)
>>> - client -> server: RENEW
>>> 44:
>>> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
>>> 45:
>>> - server -> client: NFS4ERR_STALE_CLIENTID
>>> 65:
>>> - client -> server: RENEW
>>> 66
>>> - server -> client: NFS4ERR_STALE_CLIENTID
>>> 67,85,87,93:
>>> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
>>> 78,79:
>>> NFS4STALE_STATEID (reponse to the other 2 writes in A)
>>>
>>> 98: OPEN with CLAIM_PREVIOUS
>>> 107: response to open: NFS4ERR_NO_GRACE (strange?)
>>> after that the client re-opens the files without CLAIM_PREVIOUS option
>>> and they are all succesful.
>>>
>>> The client starts using the new stateids except for the files in A.
>>> The server returns a NFS4_STALE_STATEID, the client executes a RENEW
>>> (IMO this should be an OPEN request) and retries the WRITE, the server
>>> returns a NFS4_STALE_STATEID
>>>
>>> Server: FreeBSD 9.1 with new NFS server implementation
>>> Client: Fedora 17, 3.8.11-100.fc17.x86_64
>>>
>>> Any clues?
>>>
>>> Thanks,
>>> Bram
>> I just took a look at the packet capture (I hadn't realized you had
>> posted one when I did the last reply).
>>
>> I think the problem is a TCP layer one, from looking at the first 8
>> packets captured amd their timestamps. However, my TCP is old and
>> tired, so I'm not sure. From the first 8 packets:
>> Time(sec)
>> 0 - client sends a SYN
>> 0 - server replies [RST, ACK]
>> Since Reset is in the reply, is this when the server rebooted?
>> 6 - client sends a SYN with same port#s. Wireshark thinks this is
>> a new connection using same port#s.
>> 6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
>> of the first packet and still considers it the old Reset connection.
>> 15 - Repeat of what happened at 6sec.
>> 30,31 - client sends SYN twice, still from the same port#s
>> - no reply from FreeBSD
>> 4393 - client sends SYN again with same port#s
>> 4393 - server replies with [SYN, ACK] establishing the new TCP
>> connection on the same port#s, but over an hour later.
>
> To be clear, this is very early in the capture, just frame 9 and 10. The Linux client attempts to re-use the previous connection's source port to preserve the server's duplicate reply cache.
And by the way, if the server actually did reboot, it shouldn't have noticed the client's port re-use. TCP connection state should have been wiped by the system restart.
> But it's not clear why the client is not retrying SYN during the period between frame 8 and frame 9. The client should retry SYN after just another minute or two, and by that time the server's TCP connection state should allow a connection on that port.
>
> It's not typical for a client with an active workload to wait 4400 seconds to send a fresh SYN. Bram, can you shed any light on this?
>
>> Since this is now over an hour after packet 0, it isn't
>> surprising that Grace is over.
>
> Fair enough. However, it does not explain why the client fails to recover the open state for one file.
>
>> I'd guess that the problem occurs because the client reconnect
>> uses the same client side port# even though the server did a
>> Reset for that port#, but I'm not enough of a TCP guy to know
>> what should be correct TCP behaviour.
>>
>> If you don't mind the first part of this packet capture being
>> posted (email me to say if it ok to post it), I can post it to
>> [email protected], since there are a bunch of TCP savy
>> types that might be able to help. (If you give the ok for this,
>> please also email me what version of FreeBSD the server is
>> running.)
>
> Thanks for the extra set of eyes.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> On May 28, 2013, at 9:04 PM, Chuck Lever <[email protected]>
> wrote:
>
> >
> > On May 28, 2013, at 7:30 PM, Rick Macklem <[email protected]>
> > wrote:
> >
> >> Bram Vandoren wrote:
> >>>>> Hi Rick, Chuck, Bruce,
> >>>>> in attachment is a small pcap when a client is in the locked.
> >>>>> Hopefully I can reproduce the problem so I can send you a
> >>>>> capture
> >>>>> during a reboot cycle.
> >>>>
> >>>> The pcap file confirms that the state IDs and client ID do not
> >>>> appear to match, and do appear on the same TCP connection (in
> >>>> different operations). I think the presence of the RENEW
> >>>> operations
> >>>> here suggest that the client believes it has not been able to
> >>>> renew
> >>>> its lease using stateful operations like READ. IMO this is
> >>>> evidence
> >>>> in favor of the theory that the client neglected to recover these
> >>>> state IDs for some reason.
> >>>>
> >>>> We'll need to see the actual reboot recovery traffic to analyze
> >>>> further, and that occurs just after the server reboots. Even
> >>>> better
> >>>> would be to see the initial OPEN of the file where the READ
> >>>> operations are failing. I recognize this is a non-determinstic
> >>>> problem that will be a challenge to capture properly.
> >>>>
> >>>> Rather than capturing the trace on the server, you should be able
> >>>> to
> >>>> capture it on your clients in order to capture traffic before,
> >>>> during, and after the server reboot. To avoid capturing an
> >>>> enormous
> >>>> amount of data, both tcpdump and tshark provide options to save
> >>>> the
> >>>> captured network data into a small ring of files (see their man
> >>>> pages). Once a client mount point has locked, you can stop the
> >>>> capture, and hopefully the ring will have everything we need.
> >>>
> >>> Hi All,
> >>> I managed to capture the packets after a reboot. I send the pcap
> >>> file
> >>> to the people in cc (privacy issue, contact me if someone on the
> >>> list
> >>> wants a copy). This is a summary of what happens after a reboot
> >>> (perhaps a missed some relevant information):
> >>>
> >>> 38:
> >>> - client -> server: client executes 3 writes with a stale clientid
> >>> (A)
> >>> - client -> server: RENEW
> >>> 44:
> >>> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
> >>> 45:
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 65:
> >>> - client -> server: RENEW
> >>> 66
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 67,85,87,93:
> >>> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> >>> 78,79:
> >>> NFS4STALE_STATEID (reponse to the other 2 writes in A)
> >>>
> >>> 98: OPEN with CLAIM_PREVIOUS
> >>> 107: response to open: NFS4ERR_NO_GRACE (strange?)
> >>> after that the client re-opens the files without CLAIM_PREVIOUS
> >>> option
> >>> and they are all succesful.
> >>>
> >>> The client starts using the new stateids except for the files in
> >>> A.
> >>> The server returns a NFS4_STALE_STATEID, the client executes a
> >>> RENEW
> >>> (IMO this should be an OPEN request) and retries the WRITE, the
> >>> server
> >>> returns a NFS4_STALE_STATEID
> >>>
> >>> Server: FreeBSD 9.1 with new NFS server implementation
> >>> Client: Fedora 17, 3.8.11-100.fc17.x86_64
> >>>
> >>> Any clues?
> >>>
> >>> Thanks,
> >>> Bram
> >> I just took a look at the packet capture (I hadn't realized you had
> >> posted one when I did the last reply).
> >>
> >> I think the problem is a TCP layer one, from looking at the first 8
> >> packets captured amd their timestamps. However, my TCP is old and
> >> tired, so I'm not sure. From the first 8 packets:
> >> Time(sec)
> >> 0 - client sends a SYN
> >> 0 - server replies [RST, ACK]
> >> Since Reset is in the reply, is this when the server rebooted?
> >> 6 - client sends a SYN with same port#s. Wireshark thinks this is
> >> a new connection using same port#s.
> >> 6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
> >> of the first packet and still considers it the old Reset
> >> connection.
> >> 15 - Repeat of what happened at 6sec.
> >> 30,31 - client sends SYN twice, still from the same port#s
> >> - no reply from FreeBSD
> >> 4393 - client sends SYN again with same port#s
> >> 4393 - server replies with [SYN, ACK] establishing the new TCP
> >> connection on the same port#s, but over an hour later.
> >
> > To be clear, this is very early in the capture, just frame 9 and 10.
> > The Linux client attempts to re-use the previous connection's source
> > port to preserve the server's duplicate reply cache.
>
Btw, I think that a DRC that assumes the client will use the same port#
is broken. I'm not sure if all clients use the same port# for UDP (the
FreeBSD client does a connect() for UDP, so I think it will), but I
definitely think that a server can't assume "same port#" for TCP.
However, this is off the current topic/issue.
> And by the way, if the server actually did reboot, it shouldn't have
> noticed the client's port re-use. TCP connection state should have
> been wiped by the system restart.
>
Good point. Which leads to the question, "When did the server reboot
relative to the packet capture?". Was it before the capture started or
sometime during the 4400 seconds when the client wasn't sending anything or ???
> > But it's not clear why the client is not retrying SYN during the
> > period between frame 8 and frame 9. The client should retry SYN
> > after just another minute or two, and by that time the server's TCP
> > connection state should allow a connection on that port.
> >
> > It's not typical for a client with an active workload to wait 4400
> > seconds to send a fresh SYN. Bram, can you shed any light on this?
> >
> >> Since this is now over an hour after packet 0, it isn't
> >> surprising that Grace is over.
> >
> > Fair enough. However, it does not explain why the client fails to
> > recover the open state for one file.
> >
> >> I'd guess that the problem occurs because the client reconnect
> >> uses the same client side port# even though the server did a
> >> Reset for that port#, but I'm not enough of a TCP guy to know
> >> what should be correct TCP behaviour.
> >>
> >> If you don't mind the first part of this packet capture being
> >> posted (email me to say if it ok to post it), I can post it to
> >> [email protected], since there are a bunch of TCP savy
> >> types that might be able to help. (If you give the ok for this,
> >> please also email me what version of FreeBSD the server is
> >> running.)
> >
> > Thanks for the extra set of eyes.
> >
Well, I am waiting for permission from Bram to post it. However, I
suppose I could post a summary instead (leaving IP#s out, etc). I
don't think there would be anything in it other than IP#s that he
might not want advertised?
rick
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
> >
> >
> >
> >
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> On May 28, 2013, at 9:04 PM, Chuck Lever <[email protected]>
> wrote:
>
> >
> > On May 28, 2013, at 7:30 PM, Rick Macklem <[email protected]>
> > wrote:
> >
> >> Bram Vandoren wrote:
> >>>>> Hi Rick, Chuck, Bruce,
> >>>>> in attachment is a small pcap when a client is in the locked.
> >>>>> Hopefully I can reproduce the problem so I can send you a
> >>>>> capture
> >>>>> during a reboot cycle.
> >>>>
> >>>> The pcap file confirms that the state IDs and client ID do not
> >>>> appear to match, and do appear on the same TCP connection (in
> >>>> different operations). I think the presence of the RENEW
> >>>> operations
> >>>> here suggest that the client believes it has not been able to
> >>>> renew
> >>>> its lease using stateful operations like READ. IMO this is
> >>>> evidence
> >>>> in favor of the theory that the client neglected to recover these
> >>>> state IDs for some reason.
> >>>>
> >>>> We'll need to see the actual reboot recovery traffic to analyze
> >>>> further, and that occurs just after the server reboots. Even
> >>>> better
> >>>> would be to see the initial OPEN of the file where the READ
> >>>> operations are failing. I recognize this is a non-determinstic
> >>>> problem that will be a challenge to capture properly.
> >>>>
> >>>> Rather than capturing the trace on the server, you should be able
> >>>> to
> >>>> capture it on your clients in order to capture traffic before,
> >>>> during, and after the server reboot. To avoid capturing an
> >>>> enormous
> >>>> amount of data, both tcpdump and tshark provide options to save
> >>>> the
> >>>> captured network data into a small ring of files (see their man
> >>>> pages). Once a client mount point has locked, you can stop the
> >>>> capture, and hopefully the ring will have everything we need.
> >>>
> >>> Hi All,
> >>> I managed to capture the packets after a reboot. I send the pcap
> >>> file
> >>> to the people in cc (privacy issue, contact me if someone on the
> >>> list
> >>> wants a copy). This is a summary of what happens after a reboot
> >>> (perhaps a missed some relevant information):
> >>>
> >>> 38:
> >>> - client -> server: client executes 3 writes with a stale clientid
> >>> (A)
> >>> - client -> server: RENEW
> >>> 44:
> >>> - server -> client: NFS4ERR_STALE_STATEID (in reponse to A)
> >>> 45:
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 65:
> >>> - client -> server: RENEW
> >>> 66
> >>> - server -> client: NFS4ERR_STALE_CLIENTID
> >>> 67,85,87,93:
> >>> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> >>> 78,79:
> >>> NFS4STALE_STATEID (reponse to the other 2 writes in A)
> >>>
> >>> 98: OPEN with CLAIM_PREVIOUS
> >>> 107: response to open: NFS4ERR_NO_GRACE (strange?)
> >>> after that the client re-opens the files without CLAIM_PREVIOUS
> >>> option
> >>> and they are all succesful.
> >>>
> >>> The client starts using the new stateids except for the files in A.
> >>> The server returns NFS4ERR_STALE_STATEID, the client executes a RENEW
> >>> (IMO this should be an OPEN request) and retries the WRITE, and the
> >>> server returns NFS4ERR_STALE_STATEID again.
> >>>
> >>> Server: FreeBSD 9.1 with new NFS server implementation
> >>> Client: Fedora 17, 3.8.11-100.fc17.x86_64
> >>>
> >>> Any clues?
> >>>
> >>> Thanks,
> >>> Bram
> >> I just took a look at the packet capture (I hadn't realized you had
> >> posted one when I did the last reply).
> >>
> >> I think the problem is a TCP layer one, from looking at the first 8
> >> packets captured and their timestamps. However, my TCP is old and
> >> tired, so I'm not sure. From the first 8 packets:
> >> Time(sec)
> >> 0 - client sends a SYN
> >> 0 - server replies [RST, ACK]
> >> Since Reset is in the reply, is this when the server rebooted?
> >> 6 - client sends a SYN with same port#s. Wireshark thinks this is
> >> a new connection using same port#s.
> >> 6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
> >> of the first packet and still considers it the old Reset
> >> connection.
> >> 15 - Repeat of what happened at 6sec.
> >> 30,31 - client sends SYN twice, still from the same port#s
> >> - no reply from FreeBSD
> >> 4393 - client sends SYN again with same port#s
> >> 4393 - server replies with [SYN, ACK] establishing the new TCP
> >> connection on the same port#s, but over an hour later.
> >
> > To be clear, this is very early in the capture, just frame 9 and 10.
> > The Linux client attempts to re-use the previous connection's source
> > port to preserve the server's duplicate reply cache.
>
Btw, I have seen a case a few years ago where the network fabric between
client and server injected RSTs in the packet stream, causing havoc for
NFS TCP (re)connects. (I was never able to isolate exactly what did the
injection, but I caught it by doing packet captures at both the client
and server ends and found them identical, except for the RSTs which were
not sent by the server end but came out of the network port on the client
end.)
I believe some network firewalls/switches will do the RST injection if
they think a SYN flood DOS is being attempted.
I wasn't able to get much information on the network fabric, but it was
mainly made up of 3 Cisco switches (although Cisco doesn't claim to do RST
injection, as far as I know).
Here's the thread, in case you want to read it:
http://lists.freebsd.org/pipermail/freebsd-current/2009-November/013154.html
So, this just might be the source of your problem. It might be worth the
effort to capture packets at both the client and server ends and compare
them, to see whether every RST shows up in both captures.
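Roughly the sort of thing I have in mind (just a sketch -- the interface
name, file names and the port 2049 filter below are assumptions, adjust
for your setup; older tshark takes -R instead of -Y for the display
filter):

    # capture into a small ring of files on BOTH the client and the server,
    # so the capture can stay running across the server reboot
    # (tcpdump numbers the files: ...pcap00, ...pcap01, and so on)
    tcpdump -i em0 -s 0 -C 100 -W 20 -w /var/tmp/nfs-reboot.pcap port 2049

    # afterwards, pull just the RSTs out of each side's capture and compare
    tshark -r /var/tmp/nfs-reboot.pcap00 -Y 'tcp.flags.reset == 1' \
        -T fields -e frame.time_relative -e ip.src -e ip.dst -e tcp.seq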
Good luck with it, rick
> And by the way, if the server actually did reboot, it shouldn't have
> noticed the client's port re-use. TCP connection state should have
> been wiped by the system restart.
>
> > But it's not clear why the client is not retrying SYN during the
> > period between frame 8 and frame 9. The client should retry SYN
> > after just another minute or two, and by that time the server's TCP
> > connection state should allow a connection on that port.
> >
> > It's not typical for a client with an active workload to wait 4400
> > seconds to send a fresh SYN. Bram, can you shed any light on this?
> >
> >> Since this is now over an hour after packet 0, it isn't
> >> surprising that Grace is over.
> >
> > Fair enough. However, it does not explain why the client fails to
> > recover the open state for one file.
> >
> >> I'd guess that the problem occurs because the client reconnect
> >> uses the same client side port# even though the server did a
> >> Reset for that port#, but I'm not enough of a TCP guy to know
> >> what should be correct TCP behaviour.
> >>
> >> If you don't mind the first part of this packet capture being
> >> posted (email me to say if it is ok to post it), I can post it to
> >> [email protected], since there are a bunch of TCP savvy
> >> types that might be able to help. (If you give the ok for this,
> >> please also email me what version of FreeBSD the server is
> >> running.)
> >
> > Thanks for the extra set of eyes.
> >
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Bram Vandoren wrote:
> >> Hi Rick, Chuck, Bruce,
> >> attached is a small pcap from when a client is in the locked state.
> >> Hopefully I can reproduce the problem so I can send you a capture
> >> during a reboot cycle.
> >
> > The pcap file confirms that the state IDs and client ID do not
> > appear to match, and do appear on the same TCP connection (in
> > different operations). I think the presence of the RENEW operations
> > here suggest that the client believes it has not been able to renew
> > its lease using stateful operations like READ. IMO this is evidence
> > in favor of the theory that the client neglected to recover these
> > state IDs for some reason.
> >
> > We'll need to see the actual reboot recovery traffic to analyze
> > further, and that occurs just after the server reboots. Even better
> > would be to see the initial OPEN of the file where the READ
> > operations are failing. I recognize this is a non-deterministic
> > problem that will be a challenge to capture properly.
> >
> > Rather than capturing the trace on the server, you should be able to
> > capture it on your clients in order to capture traffic before,
> > during, and after the server reboot. To avoid capturing an enormous
> > amount of data, both tcpdump and tshark provide options to save the
> > captured network data into a small ring of files (see their man
> > pages). Once a client mount point has locked, you can stop the
> > capture, and hopefully the ring will have everything we need.
>
> Hi All,
> I managed to capture the packets after a reboot. I sent the pcap file
> to the people in cc (privacy issue; contact me if someone on the list
> wants a copy). This is a summary of what happens after a reboot
> (perhaps I missed some relevant information):
>
> 38:
> - client -> server: client executes 3 writes with a stale clientid (A)
> - client -> server: RENEW
> 44:
> - server -> client: NFS4ERR_STALE_STATEID (in response to A)
> 45:
> - server -> client: NFS4ERR_STALE_CLIENTID
> 65:
> - client -> server: RENEW
> 66:
> - server -> client: NFS4ERR_STALE_CLIENTID
> 67,85,87,93:
> SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
> 78,79:
> NFS4ERR_STALE_STATEID (response to the other 2 writes in A)
>
> 98: OPEN with CLAIM_PREVIOUS
> 107: response to open: NFS4ERR_NO_GRACE (strange?)
> after that the client re-opens the files without the CLAIM_PREVIOUS option
> and they are all successful.
>
> The client starts using the new stateids except for the files in A.
> The server returns NFS4ERR_STALE_STATEID, the client executes a RENEW
> (IMO this should be an OPEN request) and retries the WRITE, and the server
> returns NFS4ERR_STALE_STATEID again.
>
> Server: FreeBSD 9.1 with new NFS server implementation
> Client: Fedora 17, 3.8.11-100.fc17.x86_64
>
> Any clues?
>
> Thanks,
> Bram
I just took a look at the packet capture (I hadn't realized you had
posted one when I did the last reply).
I think the problem is a TCP layer one, from looking at the first 8
packets captured and their timestamps. However, my TCP is old and
tired, so I'm not sure. From the first 8 packets:
Time(sec)
0 - client sends a SYN
0 - server replies [RST, ACK]
Since Reset is in the reply, is this when the server rebooted?
6 - client sends a SYN with same port#s. Wireshark thinks this is
a new connection using same port#s.
6 - FreeBSD replies with [RST, ACK] again, thinking it is a resend
of the first packet and still considers it the old Reset connection.
15 - Repeat of what happened at 6sec.
30,31 - client sends SYN twice, still from the same port#s
- no reply from FreeBSD
4393 - client sends SYN again with same port#s
4393 - server replies with [SYN, ACK] establishing the new TCP
connection on the same port#s, but over an hour later.
Since this is now over an hour after packet 0, it isn't
surprising that Grace is over.
I'd guess that the problem occurs because the client reconnect
uses the same client side port# even though the server did a
Reset for that port#, but I'm not enough of a TCP guy to know
what should be correct TCP behaviour.
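If it helps to double-check that timeline, something like the following
should dump every SYN and RST in the capture with relative timestamps --
just a sketch, and older tshark versions want -R instead of -Y:

    tshark -r reboot-capture.pcap \
        -Y 'tcp.flags.syn == 1 || tcp.flags.reset == 1' \
        -T fields -e frame.time_relative -e ip.src -e tcp.srcport \
        -e ip.dst -e tcp.dstport -e tcp.flags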
If you don't mind the first part of this packet capture being
posted (email me to say if it is ok to post it), I can post it to
[email protected], since there are a bunch of TCP savvy
types that might be able to help. (If you give the ok for this,
please also email me what version of FreeBSD the server is
running.)
rick
> Did both the client and server have the same IP addresses before the
> reboot?
Yes.
> If not, the Linux client's nfs_client_id4.id SetClientID argument
> will be different (it has the client-side IP# in it). nfs_client_id4.id
> isn't supposed to change for a given client when it is rebooted.
> That will make the FreeBSD NFSv4 server see "new client" (which is not in the
> stablerestart file used to avoid certain reboot edge conditions) and
> will not give it a grace period.
> This is the only explanation I can think of for the NFS4ERR_NO_GRACE
> reply shortly after the reboot.
I checked some other clients and they all receive the
NFS4ERR_NO_GRACE response from the server. It's not unique for the
clients that hang. I was unable to reproduce this in a minimal test
configuration. Perhaps the nfs-stablerestart file is corrupt on the
server?
I checked the file with
    strings nfs-stablerestart
and I see a lot of duplicate entries. In total there are ~10000 lines
but we only have ~50 clients (a quick way to count the duplicates is
sketched below the entry list).
Most clients have 3 types of entries:
Linux NFSv4.0 a.b.c.d/e.f.g.h tcp
Linux NFSv4.0 a.b.c.d/e.f.g.h tcp*
Linux NFSv4.0 a.b.c.d/e.f.g.h tcp+
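To put a rough number on the duplication, something like this should show
the most repeated entries and how often each occurs (assuming the file is
at the default /var/db/nfs-stablerestart; adjust the path otherwise):

    strings /var/db/nfs-stablerestart | sort | uniq -c | sort -rn | head -20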
Again, thanks a lot for looking into this.
Bram.
>> Hi Rick, Chuck, Bruce,
>> attached is a small pcap from when a client is in the locked state.
>> Hopefully I can reproduce the problem so I can send you a capture
>> during a reboot cycle.
>
> The pcap file confirms that the state IDs and client ID do not appear to match, and do appear on the same TCP connection (in different operations). I think the presence of the RENEW operations here suggest that the client believes it has not been able to renew its lease using stateful operations like READ. IMO this is evidence in favor of the theory that the client neglected to recover these state IDs for some reason.
>
> We'll need to see the actual reboot recovery traffic to analyze further, and that occurs just after the server reboots. Even better would be to see the initial OPEN of the file where the READ operations are failing. I recognize this is a non-deterministic problem that will be a challenge to capture properly.
>
> Rather than capturing the trace on the server, you should be able to capture it on your clients in order to capture traffic before, during, and after the server reboot. To avoid capturing an enormous amount of data, both tcpdump and tshark provide options to save the captured network data into a small ring of files (see their man pages). Once a client mount point has locked, you can stop the capture, and hopefully the ring will have everything we need.
Hi All,
I managed to capture the packets after a reboot. I sent the pcap file
to the people in cc (privacy issue; contact me if someone on the list
wants a copy). This is a summary of what happens after a reboot
(perhaps I missed some relevant information):
38:
- client -> server: client executes 3 writes with a stale clientid (A)
- client -> server: RENEW
44:
- server -> client: NFS4ERR_STALE_STATEID (in response to A)
45:
- server -> client: NFS4ERR_STALE_CLIENTID
65:
- client -> server: RENEW
66:
- server -> client: NFS4ERR_STALE_CLIENTID
67,85,87,93:
SETCLIENTID/SETCLIENTID_CONFIRM sequence (ok)
78,79:
NFS4ERR_STALE_STATEID (response to the other 2 writes in A)
98: OPEN with CLAIM_PREVIOUS
107: response to open: NFS4ERR_NO_GRACE (strange?)
after that the client re-opens the files without the CLAIM_PREVIOUS option
and they are all successful.
The client starts using the new stateids except for the files in A.
The server returns NFS4ERR_STALE_STATEID, the client executes a RENEW
(IMO this should be an OPEN request) and retries the WRITE, and the server
returns NFS4ERR_STALE_STATEID again.
Server: FreeBSD 9.1 with new NFS server implementation
Client: Fedora 17, 3.8.11-100.fc17.x86_64
Any clues?
Thanks,
Bram
On Sat, Jun 1, 2013 at 1:24 AM, Rick Macklem <[email protected]> wrote:
> I'll take a look. I wrote that code about 10 years ago, so I don't remember
> all the details w.r.t. the records in the stable restart file. If you truncate
> the file, there won't be any recovery on the next reboot, so in that case
> you need to unmount all the NFSv4 mounts against the server before rebooting.
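For reference, a minimal sketch of the workaround Rick describes above --
clearing the stable-storage file before a reboot so that no (possibly
corrupt) recovery records carry over. The path is assumed to be the
FreeBSD default /var/db/nfs-stablerestart, and this throws away all
reboot-recovery state, so it is only safe once every NFSv4 client has
unmounted:

    # on the FreeBSD server, after all NFSv4 clients have unmounted
    : > /var/db/nfs-stablerestart
    shutdown -r now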
Any news on this subject? I'm getting the same issue.
--
William