Hello,
I am debugging an issue for a customer whereby an NFS mount undergoes an
unexpected timeout and throws an EIO.
Using synchronous mounts results in no problems whatsoever, whereas
using asynchronous mounts causes the problem to occur.
The problem exists only on 10GiB networks: I can't repro on slower
networks at all, even if it is the same 10GiB network link limited to
1GiB. It reproduces on multiple versions of XenServer (both Xen 3 and
Xen 4) with a 2.6.32 based dom0 kernel multiple different 10GiB network
cards (be2net and bnx2x specifically tested). It also reproduces
against multiple different netapp servers running different firmware
versions. It also appears to reproduces with NFSv3 and NFSv4, although
the customer setup uses NFSv3
The difference between mount commands are "noac" for the synchronous
case as opposed to "actimeo=0" for the asynchronous case.
The /proc/mounts entry for async is:
10.70.148.3:/vol/camb5e3b16tmp/82ab4a02-c10f-144a-e8c2-2ae04f213b2a
/var/run/sr-mount/82ab4a02-c10f-144a-e8c2-2ae04f213b2a nfs
rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
0 0
while the entry for sync is:
10.70.148.3:/vol/camb5e3b15tmp/bc11aad3-b345-519f-1661-50e16bed6cdb
/var/run/sr-mount/bc11aad3-b345-519f-1661-50e16bed6cdb nfs
rw,sync,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
0 0
The bug itself feels very like a timing issue to me. It becomes
substantially easier to reproduce if the kernel has a high workload:
Different options involve disabling things like TSO offload on the
network card or dd'ing from /dev/urandom to /tmp, but the problem cant
be reproduced occasionally without any artificial extra workload. The
problem itself can be reprodued with a simple dd from /dev/zero onto the
nfs mount, which proceeds to transfer at ~220MBps until the EIO is hit.
I have analyzed the TCP stream using TCPdump and Wireshark. Everything
appears fine, with no NFS errors at all. However, the client sticks a
TCP FIN in the final continuation of one of its 64k block writes, in the
middle of writing the file.
The server ACKs the outstanding packets, and sends a Write Reply
(without error), but does not send its own FIN. Then, 15 seconds later,
the client sends a TCP RST, and throws an EIO back to userspace.
I am aware that sending a RST after 15 seconds is expected behavior
following changeset 7d1e8255cf959fba7ee2317550dfde39f0b936ae, in an
attempt to prevent deadlocks for a half closed TCP connection. I have
searched through the commit log all the way to tip, but can not find a
commit making any reference to problems similar to this.
What I am struggling to work out is why the client is sending a FIN in
the middle of an otherwise fine and error-free NFS stream. using
"rpcdebug -m rpc -s trans" only says "disconnected" without giving a
reason for disconnecting.
Is there a different debug flag I can use to find out why NFS thinks it
needs to disconnect, or does anyone have any suggestions as to where I
should start debugging the code? I suspect the problem is in the sunrpc
subsystem but I am not very familiar with code this high level in the
kernel.
Any help greatly appreciated
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On 10/11/11 20:43, Trond Myklebust wrote:
> On Thu, 2011-11-10 at 15:52 +0000, Andrew Cooper wrote:
>> On 10/11/11 15:29, Chuck Lever wrote:
>>> On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
>>>
>>>> On 09/11/11 22:36, Chuck Lever wrote:
>>>>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
>> Sorry. I am not sure I was clear. An EIO does not present itself with
>> a hard mount, but a TCP FIN is still injected into the stream by the
>> client, causing 15 seconds of deadlock, eventually fixed by sending a
>> RST and restarting with a new TCP stream. At this point, softmounts
>> throw an EIO while hardmounts restart and continue successfully.
>>
>> My problem is not the EIO on softmount or lack of EIO for hardmout, but
>> the fact that the client sees fit to try and close the TCP stream while
>> an apparently otherwise healthy NFS session is ongoing.
> The client will attempt to close the TCP connection on any RPC level
> error. That can happen, e.g., if the server sends a faulty RPC/TCP
> record fragment header or some other garbage data.
>
> I'm assuming that you've checked that the TCP parameters are set to sane
> values for a 10GigE connection (i.e. tcp_timestamps is on) so that there
> is no corruption happening at that level?
>
> Cheers
> Trond
I have a TCPdump/wireshark analysis of the entire packet stream (4GiB).
I cant see any RPC level errors (rpc.replystat != 0 yields no matches).
What specifically would I be looking for? Wireshark seems not to have
any problem decoding any of the RPC packets, so I hope that indicates no
RPC level corruption.
There is one case where the server sends a double write reply for the
same write, with different length fields. However, this is a good 20
seconds before the FIN is sent, so I was hoping that it was unrelated.
Might it not be?
As for TCP timestamps; I have a Timestamp option in each TCP packet.
Nothing appears corrupted. What would I be looking for with corrupted
timestamps?
Thanks,
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On 11/11/11 22:38, Trond Myklebust wrote:
> On Fri, 2011-11-11 at 10:31 +0000, Andrew Cooper wrote:
>> On 10/11/11 20:43, Trond Myklebust wrote:
>>> On Thu, 2011-11-10 at 15:52 +0000, Andrew Cooper wrote:
>>>> On 10/11/11 15:29, Chuck Lever wrote:
>>>>> On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
>>>>>
>>>>>> On 09/11/11 22:36, Chuck Lever wrote:
>>>>>>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
>>>> Sorry. I am not sure I was clear. An EIO does not present itself with
>>>> a hard mount, but a TCP FIN is still injected into the stream by the
>>>> client, causing 15 seconds of deadlock, eventually fixed by sending a
>>>> RST and restarting with a new TCP stream. At this point, softmounts
>>>> throw an EIO while hardmounts restart and continue successfully.
>>>>
>>>> My problem is not the EIO on softmount or lack of EIO for hardmout, but
>>>> the fact that the client sees fit to try and close the TCP stream while
>>>> an apparently otherwise healthy NFS session is ongoing.
>>> The client will attempt to close the TCP connection on any RPC level
>>> error. That can happen, e.g., if the server sends a faulty RPC/TCP
>>> record fragment header or some other garbage data.
>>>
>>> I'm assuming that you've checked that the TCP parameters are set to sane
>>> values for a 10GigE connection (i.e. tcp_timestamps is on) so that there
>>> is no corruption happening at that level?
>>>
>>> Cheers
>>> Trond
>> I have a TCPdump/wireshark analysis of the entire packet stream (4GiB).
>> I cant see any RPC level errors (rpc.replystat != 0 yields no matches).
>> What specifically would I be looking for? Wireshark seems not to have
>> any problem decoding any of the RPC packets, so I hope that indicates no
>> RPC level corruption.
>>
>> There is one case where the server sends a double write reply for the
>> same write, with different length fields. However, this is a good 20
>> seconds before the FIN is sent, so I was hoping that it was unrelated.
>> Might it not be?
> Can you send us just that portion of the trace so that we can have a look?
Attached is a small extract from the stream. It starts with the final
NFS write, and continues through the FIN, RST and until the TCP stream
gets reopened. Is this what you want?
>> As for TCP timestamps; I have a Timestamp option in each TCP packet.
>> Nothing appears corrupted. What would I be looking for with corrupted
>> timestamps?
> I just meant that you should check that you've enabled tcp window
> scaling and timestamps in order to avoid problems with wrapped sequence
> numbers (See http://tools.ietf.org/html/rfc1323 for details).
>
> On Linux this means that you need to check
>
> sysctl net.ipv4.tcp_timestamps
> and
> sysctl net.ipv4.tcp_window_scaling
>
> They should both be set to the value '1'.
They both are.
> Cheers
> Trond
>
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On 10/11/11 15:29, Chuck Lever wrote:
> On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
>
>> On 09/11/11 22:36, Chuck Lever wrote:
>>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
>>>
>>>> Hello,
>>>>
>>>> I am debugging an issue for a customer whereby an NFS mount undergoes an
>>>> unexpected timeout and throws an EIO.
>>>>
>>>> Using synchronous mounts results in no problems whatsoever, whereas
>>>> using asynchronous mounts causes the problem to occur.
>>>>
>>>> The problem exists only on 10GiB networks: I can't repro on slower
>>>> networks at all, even if it is the same 10GiB network link limited to
>>>> 1GiB. It reproduces on multiple versions of XenServer (both Xen 3 and
>>>> Xen 4) with a 2.6.32 based dom0 kernel multiple different 10GiB network
>>>> cards (be2net and bnx2x specifically tested). It also reproduces
>>>> against multiple different netapp servers running different firmware
>>>> versions. It also appears to reproduces with NFSv3 and NFSv4, although
>>>> the customer setup uses NFSv3
>>>>
>>>> The difference between mount commands are "noac" for the synchronous
>>>> case as opposed to "actimeo=0" for the asynchronous case.
>>>>
>>>> The /proc/mounts entry for async is:
>>>> 10.70.148.3:/vol/camb5e3b16tmp/82ab4a02-c10f-144a-e8c2-2ae04f213b2a
>>>> /var/run/sr-mount/82ab4a02-c10f-144a-e8c2-2ae04f213b2a nfs
>>>> rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>>>> 0 0
>>>>
>>>> while the entry for sync is:
>>>> 10.70.148.3:/vol/camb5e3b15tmp/bc11aad3-b345-519f-1661-50e16bed6cdb
>>>> /var/run/sr-mount/bc11aad3-b345-519f-1661-50e16bed6cdb nfs
>>>> rw,sync,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>>>> 0 0
>>> Using the "soft" mount option means that any NFS level timeout is communicated to applications as EIO. The client appears to be working as expected. Do you see this problem if you replace the "soft" mount option with the "hard" mount option?
>> I am aware of soft mounts and EIOs. I forgot to say that it also
>> appears with hard mounts, with the difference that the client eventually
>> retries and continues.
> Yeah, that's a pretty important omission! :-) A transport timeout should not cause an EIO on a hard mount. More below.
>
Sorry. I am not sure I was clear. An EIO does not present itself with
a hard mount, but a TCP FIN is still injected into the stream by the
client, causing 15 seconds of deadlock, eventually fixed by sending a
RST and restarting with a new TCP stream. At this point, softmounts
throw an EIO while hardmounts restart and continue successfully.
My problem is not the EIO on softmount or lack of EIO for hardmout, but
the fact that the client sees fit to try and close the TCP stream while
an apparently otherwise healthy NFS session is ongoing.
>>> Also, timeo=133 is actually a 13.3 second timeout setting. The value of the timeo= mount option is in deciseconds. This is unreasonably short for RPC over TCP, especially if the server becomes loaded. A more reasonable timeout setting for RPC over TCP is 600.
>> I am aware of this as well. The decisions about NFS options are done by
>> a separate team who have justified their decision to use softmounts with
>> a short timeo. In our use case, the userland process needs to be
>> informed very quickly if the server goes down so it can fail over to the
>> backup. 13 seconds was considered the longest acceptable time for
>> virtual machines to be without their disks.
>>
>> The problem here is that some bug in the kernel is causing userland to
>> think the server has gone down while it is perfectly healthy and running
>> normally.
>>
>>> The difference between what you describe as a "synchronous" mount and an "asynchronous" mount is that the asynchronous mount can send multiple NFS WRITE operations for one file on the wire to the server concurrently. The "synchronous" mount throttles write traffic, and requires each WRITE to complete at the server before the client sends the next WRITE. You can see that the "asynchronous" mount can send a lot more traffic per unit time, and thus may result in the client waiting longer for WRITEs to complete.
>> I did an analysis of the latency between write requests and replies. At
>> any one time in the stream, there are no more than 4 oustanding
>> requests, and the maximum latency between request and reply (as seen by
>> the client) is 50ms
>>
>>>> The bug itself feels very like a timing issue to me. It becomes
>>>> substantially easier to reproduce if the kernel has a high workload:
>>>> Different options involve disabling things like TSO offload on the
>>>> network card or dd'ing from /dev/urandom to /tmp, but the problem cant
>>>> be reproduced occasionally without any artificial extra workload. The
>>>> problem itself can be reprodued with a simple dd from /dev/zero onto the
>>>> nfs mount, which proceeds to transfer at ~220MBps until the EIO is hit.
>>>>
>>>> I have analyzed the TCP stream using TCPdump and Wireshark. Everything
>>>> appears fine, with no NFS errors at all. However, the client sticks a
>>>> TCP FIN in the final continuation of one of its 64k block writes, in the
>>>> middle of writing the file.
>>>>
>>>> The server ACKs the outstanding packets, and sends a Write Reply
>>>> (without error), but does not send its own FIN. Then, 15 seconds later,
>>>> the client sends a TCP RST, and throws an EIO back to userspace.
>>>>
>>>> I am aware that sending a RST after 15 seconds is expected behavior
>>>> following changeset 7d1e8255cf959fba7ee2317550dfde39f0b936ae, in an
>>>> attempt to prevent deadlocks for a half closed TCP connection. I have
>>>> searched through the commit log all the way to tip, but can not find a
>>>> commit making any reference to problems similar to this.
>>>>
>>>> What I am struggling to work out is why the client is sending a FIN in
>>>> the middle of an otherwise fine and error-free NFS stream. using
>>>> "rpcdebug -m rpc -s trans" only says "disconnected" without giving a
>>>> reason for disconnecting.
>>>>
>>>> Is there a different debug flag I can use to find out why NFS thinks it
>>>> needs to disconnect, or does anyone have any suggestions as to where I
>>>> should start debugging the code? I suspect the problem is in the sunrpc
>>>> subsystem but I am not very familiar with code this high level in the
>>>> kernel.
> "trans" is the obvious place to start, but you have done that already, and the instrumentation was inadequate. You could enable all the RPC client-side debugging flags. The usual problem with debugging messages is that they can mask timing problems.
It sadly does change the behavior. I enabled all debugging by accident
to start with and the problem failed to occur, but throughput was an
order of magnitude slower.
> I don't think it's even possible for the NFS client (above the RPC client) to cause a transport connection close in the middle of an RPC. So I doubt there would be much use in enabling NFS debugging flags. More RPC debugging might tell us if indeed a retransmit timeout is occurring, or if it's some other problem (which seems to me increasingly likely).
Thats useful to know.
> If you think the bug was introduced by a recent commit, you could try a "git bisect" to pin down which one. Or try later kernels (2.6.39 is the usual candidate) to see if the problem is addressed.
>
I am not aware of a changeset introducing this behavior. It has
appeared as a result of switching from synchronous mounts to
asynchronous mounts in an attempt to increase aggregate throughput.
(There are suitable precautions in place to mitigate potential data
loss/corruption with async mounts and a genuinely fallen-over server).
Unfortunately, I only have ssh access to the servers in question so dont
have a huge scope for installing different setups. I am trying to get a
3.x kernel up and running to attempt to reproduce the problem.
> Otherwise, I think someone is going to have to sprinkle xprtsock.c with dprintks and try to narrow down the misbehavior.
I think this is probably what I am going to have to do; Its nice to
have a suggested starting point.
I will report back as and when I get the 3.x kernel running and tested.
Thanks,
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Sorry for a slow reply - this is unfortunately not the only bug I am
working on.
After further testing, this problem does actually reproduce with
synchronous mounts as well as asynchronous mounts. It just takes some
extreme stress testing to reproduce with synchronous mounts.
After some debugging in xs_tcp_shutdown() (a cheeky dump_stack()), it
appears that periodically xprt_autoclose() is closing the TCP connection.
It appears that some of the time, the server correctly FIN+ACKs the
first FIN, at which point the TCP connection is torn down and set back
up, with no interruption to the NFS session. However, some of the time,
the server does not FIN+ACK the clients FIN, at which point the client
waits 15 seconds and RST's the TCP connection, leading to the errors seen.
What is the purpose of xprt_autoclose() ? I assume it is to
automatically close idle connections. Am I correct in assuming that it
should not be attempting to close an active connection?
Thanks,
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On Fri, 2011-11-11 at 10:31 +0000, Andrew Cooper wrote:
> On 10/11/11 20:43, Trond Myklebust wrote:
> > On Thu, 2011-11-10 at 15:52 +0000, Andrew Cooper wrote:
> >> On 10/11/11 15:29, Chuck Lever wrote:
> >>> On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
> >>>
> >>>> On 09/11/11 22:36, Chuck Lever wrote:
> >>>>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
> >> Sorry. I am not sure I was clear. An EIO does not present itself with
> >> a hard mount, but a TCP FIN is still injected into the stream by the
> >> client, causing 15 seconds of deadlock, eventually fixed by sending a
> >> RST and restarting with a new TCP stream. At this point, softmounts
> >> throw an EIO while hardmounts restart and continue successfully.
> >>
> >> My problem is not the EIO on softmount or lack of EIO for hardmout, but
> >> the fact that the client sees fit to try and close the TCP stream while
> >> an apparently otherwise healthy NFS session is ongoing.
> > The client will attempt to close the TCP connection on any RPC level
> > error. That can happen, e.g., if the server sends a faulty RPC/TCP
> > record fragment header or some other garbage data.
> >
> > I'm assuming that you've checked that the TCP parameters are set to sane
> > values for a 10GigE connection (i.e. tcp_timestamps is on) so that there
> > is no corruption happening at that level?
> >
> > Cheers
> > Trond
>
> I have a TCPdump/wireshark analysis of the entire packet stream (4GiB).
> I cant see any RPC level errors (rpc.replystat != 0 yields no matches).
> What specifically would I be looking for? Wireshark seems not to have
> any problem decoding any of the RPC packets, so I hope that indicates no
> RPC level corruption.
>
> There is one case where the server sends a double write reply for the
> same write, with different length fields. However, this is a good 20
> seconds before the FIN is sent, so I was hoping that it was unrelated.
> Might it not be?
Can you send us just that portion of the trace so that we can have a look?
> As for TCP timestamps; I have a Timestamp option in each TCP packet.
> Nothing appears corrupted. What would I be looking for with corrupted
> timestamps?
I just meant that you should check that you've enabled tcp window
scaling and timestamps in order to avoid problems with wrapped sequence
numbers (See http://tools.ietf.org/html/rfc1323 for details).
On Linux this means that you need to check
sysctl net.ipv4.tcp_timestamps
and
sysctl net.ipv4.tcp_window_scaling
They should both be set to the value '1'.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 09/11/11 22:36, Chuck Lever wrote:
> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
>
>> Hello,
>>
>> I am debugging an issue for a customer whereby an NFS mount undergoes an
>> unexpected timeout and throws an EIO.
>>
>> Using synchronous mounts results in no problems whatsoever, whereas
>> using asynchronous mounts causes the problem to occur.
>>
>> The problem exists only on 10GiB networks: I can't repro on slower
>> networks at all, even if it is the same 10GiB network link limited to
>> 1GiB. It reproduces on multiple versions of XenServer (both Xen 3 and
>> Xen 4) with a 2.6.32 based dom0 kernel multiple different 10GiB network
>> cards (be2net and bnx2x specifically tested). It also reproduces
>> against multiple different netapp servers running different firmware
>> versions. It also appears to reproduces with NFSv3 and NFSv4, although
>> the customer setup uses NFSv3
>>
>> The difference between mount commands are "noac" for the synchronous
>> case as opposed to "actimeo=0" for the asynchronous case.
>>
>> The /proc/mounts entry for async is:
>> 10.70.148.3:/vol/camb5e3b16tmp/82ab4a02-c10f-144a-e8c2-2ae04f213b2a
>> /var/run/sr-mount/82ab4a02-c10f-144a-e8c2-2ae04f213b2a nfs
>> rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>> 0 0
>>
>> while the entry for sync is:
>> 10.70.148.3:/vol/camb5e3b15tmp/bc11aad3-b345-519f-1661-50e16bed6cdb
>> /var/run/sr-mount/bc11aad3-b345-519f-1661-50e16bed6cdb nfs
>> rw,sync,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>> 0 0
> Using the "soft" mount option means that any NFS level timeout is communicated to applications as EIO. The client appears to be working as expected. Do you see this problem if you replace the "soft" mount option with the "hard" mount option?
I am aware of soft mounts and EIOs. I forgot to say that it also
appears with hard mounts, with the difference that the client eventually
retries and continues.
> Also, timeo=133 is actually a 13.3 second timeout setting. The value of the timeo= mount option is in deciseconds. This is unreasonably short for RPC over TCP, especially if the server becomes loaded. A more reasonable timeout setting for RPC over TCP is 600.
I am aware of this as well. The decisions about NFS options are done by
a separate team who have justified their decision to use softmounts with
a short timeo. In our use case, the userland process needs to be
informed very quickly if the server goes down so it can fail over to the
backup. 13 seconds was considered the longest acceptable time for
virtual machines to be without their disks.
The problem here is that some bug in the kernel is causing userland to
think the server has gone down while it is perfectly healthy and running
normally.
> The difference between what you describe as a "synchronous" mount and an "asynchronous" mount is that the asynchronous mount can send multiple NFS WRITE operations for one file on the wire to the server concurrently. The "synchronous" mount throttles write traffic, and requires each WRITE to complete at the server before the client sends the next WRITE. You can see that the "asynchronous" mount can send a lot more traffic per unit time, and thus may result in the client waiting longer for WRITEs to complete.
I did an analysis of the latency between write requests and replies. At
any one time in the stream, there are no more than 4 oustanding
requests, and the maximum latency between request and reply (as seen by
the client) is 50ms
>> The bug itself feels very like a timing issue to me. It becomes
>> substantially easier to reproduce if the kernel has a high workload:
>> Different options involve disabling things like TSO offload on the
>> network card or dd'ing from /dev/urandom to /tmp, but the problem cant
>> be reproduced occasionally without any artificial extra workload. The
>> problem itself can be reprodued with a simple dd from /dev/zero onto the
>> nfs mount, which proceeds to transfer at ~220MBps until the EIO is hit.
>>
>> I have analyzed the TCP stream using TCPdump and Wireshark. Everything
>> appears fine, with no NFS errors at all. However, the client sticks a
>> TCP FIN in the final continuation of one of its 64k block writes, in the
>> middle of writing the file.
>>
>> The server ACKs the outstanding packets, and sends a Write Reply
>> (without error), but does not send its own FIN. Then, 15 seconds later,
>> the client sends a TCP RST, and throws an EIO back to userspace.
>>
>> I am aware that sending a RST after 15 seconds is expected behavior
>> following changeset 7d1e8255cf959fba7ee2317550dfde39f0b936ae, in an
>> attempt to prevent deadlocks for a half closed TCP connection. I have
>> searched through the commit log all the way to tip, but can not find a
>> commit making any reference to problems similar to this.
>>
>> What I am struggling to work out is why the client is sending a FIN in
>> the middle of an otherwise fine and error-free NFS stream. using
>> "rpcdebug -m rpc -s trans" only says "disconnected" without giving a
>> reason for disconnecting.
>>
>> Is there a different debug flag I can use to find out why NFS thinks it
>> needs to disconnect, or does anyone have any suggestions as to where I
>> should start debugging the code? I suspect the problem is in the sunrpc
>> subsystem but I am not very familiar with code this high level in the
>> kernel.
>>
>> Any help greatly appreciated
>>
>> --
>> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
>> T: +44 (0)1223 225 900, http://www.citrix.com
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On Thu, 2011-11-10 at 15:52 +0000, Andrew Cooper wrote:
>
> On 10/11/11 15:29, Chuck Lever wrote:
> > On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
> >
> >> On 09/11/11 22:36, Chuck Lever wrote:
> >>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
> Sorry. I am not sure I was clear. An EIO does not present itself with
> a hard mount, but a TCP FIN is still injected into the stream by the
> client, causing 15 seconds of deadlock, eventually fixed by sending a
> RST and restarting with a new TCP stream. At this point, softmounts
> throw an EIO while hardmounts restart and continue successfully.
>
> My problem is not the EIO on softmount or lack of EIO for hardmout, but
> the fact that the client sees fit to try and close the TCP stream while
> an apparently otherwise healthy NFS session is ongoing.
The client will attempt to close the TCP connection on any RPC level
error. That can happen, e.g., if the server sends a faulty RPC/TCP
record fragment header or some other garbage data.
I'm assuming that you've checked that the TCP parameters are set to sane
values for a 10GigE connection (i.e. tcp_timestamps is on) so that there
is no corruption happening at that level?
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Nov 10, 2011, at 6:15 AM, Andrew Cooper wrote:
> On 09/11/11 22:36, Chuck Lever wrote:
>> On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
>>
>>> Hello,
>>>
>>> I am debugging an issue for a customer whereby an NFS mount undergoes an
>>> unexpected timeout and throws an EIO.
>>>
>>> Using synchronous mounts results in no problems whatsoever, whereas
>>> using asynchronous mounts causes the problem to occur.
>>>
>>> The problem exists only on 10GiB networks: I can't repro on slower
>>> networks at all, even if it is the same 10GiB network link limited to
>>> 1GiB. It reproduces on multiple versions of XenServer (both Xen 3 and
>>> Xen 4) with a 2.6.32 based dom0 kernel multiple different 10GiB network
>>> cards (be2net and bnx2x specifically tested). It also reproduces
>>> against multiple different netapp servers running different firmware
>>> versions. It also appears to reproduces with NFSv3 and NFSv4, although
>>> the customer setup uses NFSv3
>>>
>>> The difference between mount commands are "noac" for the synchronous
>>> case as opposed to "actimeo=0" for the asynchronous case.
>>>
>>> The /proc/mounts entry for async is:
>>> 10.70.148.3:/vol/camb5e3b16tmp/82ab4a02-c10f-144a-e8c2-2ae04f213b2a
>>> /var/run/sr-mount/82ab4a02-c10f-144a-e8c2-2ae04f213b2a nfs
>>> rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>>> 0 0
>>>
>>> while the entry for sync is:
>>> 10.70.148.3:/vol/camb5e3b15tmp/bc11aad3-b345-519f-1661-50e16bed6cdb
>>> /var/run/sr-mount/bc11aad3-b345-519f-1661-50e16bed6cdb nfs
>>> rw,sync,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
>>> 0 0
>> Using the "soft" mount option means that any NFS level timeout is communicated to applications as EIO. The client appears to be working as expected. Do you see this problem if you replace the "soft" mount option with the "hard" mount option?
>
> I am aware of soft mounts and EIOs. I forgot to say that it also
> appears with hard mounts, with the difference that the client eventually
> retries and continues.
Yeah, that's a pretty important omission! :-) A transport timeout should not cause an EIO on a hard mount. More below.
>> Also, timeo=133 is actually a 13.3 second timeout setting. The value of the timeo= mount option is in deciseconds. This is unreasonably short for RPC over TCP, especially if the server becomes loaded. A more reasonable timeout setting for RPC over TCP is 600.
>
> I am aware of this as well. The decisions about NFS options are done by
> a separate team who have justified their decision to use softmounts with
> a short timeo. In our use case, the userland process needs to be
> informed very quickly if the server goes down so it can fail over to the
> backup. 13 seconds was considered the longest acceptable time for
> virtual machines to be without their disks.
>
> The problem here is that some bug in the kernel is causing userland to
> think the server has gone down while it is perfectly healthy and running
> normally.
>
>> The difference between what you describe as a "synchronous" mount and an "asynchronous" mount is that the asynchronous mount can send multiple NFS WRITE operations for one file on the wire to the server concurrently. The "synchronous" mount throttles write traffic, and requires each WRITE to complete at the server before the client sends the next WRITE. You can see that the "asynchronous" mount can send a lot more traffic per unit time, and thus may result in the client waiting longer for WRITEs to complete.
>
> I did an analysis of the latency between write requests and replies. At
> any one time in the stream, there are no more than 4 oustanding
> requests, and the maximum latency between request and reply (as seen by
> the client) is 50ms
>
>>> The bug itself feels very like a timing issue to me. It becomes
>>> substantially easier to reproduce if the kernel has a high workload:
>>> Different options involve disabling things like TSO offload on the
>>> network card or dd'ing from /dev/urandom to /tmp, but the problem cant
>>> be reproduced occasionally without any artificial extra workload. The
>>> problem itself can be reprodued with a simple dd from /dev/zero onto the
>>> nfs mount, which proceeds to transfer at ~220MBps until the EIO is hit.
>>>
>>> I have analyzed the TCP stream using TCPdump and Wireshark. Everything
>>> appears fine, with no NFS errors at all. However, the client sticks a
>>> TCP FIN in the final continuation of one of its 64k block writes, in the
>>> middle of writing the file.
>>>
>>> The server ACKs the outstanding packets, and sends a Write Reply
>>> (without error), but does not send its own FIN. Then, 15 seconds later,
>>> the client sends a TCP RST, and throws an EIO back to userspace.
>>>
>>> I am aware that sending a RST after 15 seconds is expected behavior
>>> following changeset 7d1e8255cf959fba7ee2317550dfde39f0b936ae, in an
>>> attempt to prevent deadlocks for a half closed TCP connection. I have
>>> searched through the commit log all the way to tip, but can not find a
>>> commit making any reference to problems similar to this.
>>>
>>> What I am struggling to work out is why the client is sending a FIN in
>>> the middle of an otherwise fine and error-free NFS stream. using
>>> "rpcdebug -m rpc -s trans" only says "disconnected" without giving a
>>> reason for disconnecting.
>>>
>>> Is there a different debug flag I can use to find out why NFS thinks it
>>> needs to disconnect, or does anyone have any suggestions as to where I
>>> should start debugging the code? I suspect the problem is in the sunrpc
>>> subsystem but I am not very familiar with code this high level in the
>>> kernel.
"trans" is the obvious place to start, but you have done that already, and the instrumentation was inadequate. You could enable all the RPC client-side debugging flags. The usual problem with debugging messages is that they can mask timing problems.
I don't think it's even possible for the NFS client (above the RPC client) to cause a transport connection close in the middle of an RPC. So I doubt there would be much use in enabling NFS debugging flags. More RPC debugging might tell us if indeed a retransmit timeout is occurring, or if it's some other problem (which seems to me increasingly likely).
If you think the bug was introduced by a recent commit, you could try a "git bisect" to pin down which one. Or try later kernels (2.6.39 is the usual candidate) to see if the problem is addressed.
Otherwise, I think someone is going to have to sprinkle xprtsock.c with dprintks and try to narrow down the misbehavior.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Nov 9, 2011, at 1:38 PM, Andrew Cooper wrote:
> Hello,
>
> I am debugging an issue for a customer whereby an NFS mount undergoes an
> unexpected timeout and throws an EIO.
>
> Using synchronous mounts results in no problems whatsoever, whereas
> using asynchronous mounts causes the problem to occur.
>
> The problem exists only on 10GiB networks: I can't repro on slower
> networks at all, even if it is the same 10GiB network link limited to
> 1GiB. It reproduces on multiple versions of XenServer (both Xen 3 and
> Xen 4) with a 2.6.32 based dom0 kernel multiple different 10GiB network
> cards (be2net and bnx2x specifically tested). It also reproduces
> against multiple different netapp servers running different firmware
> versions. It also appears to reproduces with NFSv3 and NFSv4, although
> the customer setup uses NFSv3
>
> The difference between mount commands are "noac" for the synchronous
> case as opposed to "actimeo=0" for the asynchronous case.
>
> The /proc/mounts entry for async is:
> 10.70.148.3:/vol/camb5e3b16tmp/82ab4a02-c10f-144a-e8c2-2ae04f213b2a
> /var/run/sr-mount/82ab4a02-c10f-144a-e8c2-2ae04f213b2a nfs
> rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
> 0 0
>
> while the entry for sync is:
> 10.70.148.3:/vol/camb5e3b15tmp/bc11aad3-b345-519f-1661-50e16bed6cdb
> /var/run/sr-mount/bc11aad3-b345-519f-1661-50e16bed6cdb nfs
> rw,sync,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,proto=tcp,port=65535,timeo=133,retrans=0,sec=sys,mountport=65535,addr=10.70.148.3
> 0 0
Using the "soft" mount option means that any NFS level timeout is communicated to applications as EIO. The client appears to be working as expected. Do you see this problem if you replace the "soft" mount option with the "hard" mount option?
Also, timeo=133 is actually a 13.3 second timeout setting. The value of the timeo= mount option is in deciseconds. This is unreasonably short for RPC over TCP, especially if the server becomes loaded. A more reasonable timeout setting for RPC over TCP is 600.
The difference between what you describe as a "synchronous" mount and an "asynchronous" mount is that the asynchronous mount can send multiple NFS WRITE operations for one file on the wire to the server concurrently. The "synchronous" mount throttles write traffic, and requires each WRITE to complete at the server before the client sends the next WRITE. You can see that the "asynchronous" mount can send a lot more traffic per unit time, and thus may result in the client waiting longer for WRITEs to complete.
> The bug itself feels very like a timing issue to me. It becomes
> substantially easier to reproduce if the kernel has a high workload:
> Different options involve disabling things like TSO offload on the
> network card or dd'ing from /dev/urandom to /tmp, but the problem cant
> be reproduced occasionally without any artificial extra workload. The
> problem itself can be reprodued with a simple dd from /dev/zero onto the
> nfs mount, which proceeds to transfer at ~220MBps until the EIO is hit.
>
> I have analyzed the TCP stream using TCPdump and Wireshark. Everything
> appears fine, with no NFS errors at all. However, the client sticks a
> TCP FIN in the final continuation of one of its 64k block writes, in the
> middle of writing the file.
>
> The server ACKs the outstanding packets, and sends a Write Reply
> (without error), but does not send its own FIN. Then, 15 seconds later,
> the client sends a TCP RST, and throws an EIO back to userspace.
>
> I am aware that sending a RST after 15 seconds is expected behavior
> following changeset 7d1e8255cf959fba7ee2317550dfde39f0b936ae, in an
> attempt to prevent deadlocks for a half closed TCP connection. I have
> searched through the commit log all the way to tip, but can not find a
> commit making any reference to problems similar to this.
>
> What I am struggling to work out is why the client is sending a FIN in
> the middle of an otherwise fine and error-free NFS stream. using
> "rpcdebug -m rpc -s trans" only says "disconnected" without giving a
> reason for disconnecting.
>
> Is there a different debug flag I can use to find out why NFS thinks it
> needs to disconnect, or does anyone have any suggestions as to where I
> should start debugging the code? I suspect the problem is in the sunrpc
> subsystem but I am not very familiar with code this high level in the
> kernel.
>
> Any help greatly appreciated
>
> --
> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> T: +44 (0)1223 225 900, http://www.citrix.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Andrew Cooper wrote:
As for TCP timestamps; I have a Timestamp option in each TCP packet.
Nothing appears corrupted. What would I be looking for with corrupted
timestamps?
I had a faulty NAT once that was changing the ts ecr in the syn ack packet,
causing the other end to send rst. I can't remember if wireshark flagged
this as bad or I found it manually. I do remember that there was no log
message from the linux kernel. I'll bet there's some tool that will read a
pcap and verify the timestamps.
Further debugging shows that the FINs are being inserted because of a
call to xs_tcp_release_xprt(), where req->rq_bytes_sent !=
req->rq_snd_buf.len
Some of the time, the netapp server FIN+ACKs and the TCP connection goes
down and back up without adversely affecting the NFS session. However,
some of the time, the server does not FIN+ACK the clients FIN, causing a
15 second timeout before the client RSTs the TCP connection, causing the
visible problems to the NFS session.
I would say that the netapp not FIN+ACKing is a bug in itself, but I
would also say that it is a bug for the client to not be able to send
all of its send buffer.
Are there cases where not sending its send buffer is expected, or is it
a state which should be avoided?
~Andrew
On 15/11/11 14:36, Andrew Cooper wrote:
> Sorry for a slow reply - this is unfortunately not the only bug I am
> working on.
>
> After further testing, this problem does actually reproduce with
> synchronous mounts as well as asynchronous mounts. It just takes some
> extreme stress testing to reproduce with synchronous mounts.
>
> After some debugging in xs_tcp_shutdown() (a cheeky dump_stack()), it
> appears that periodically xprt_autoclose() is closing the TCP connection.
>
> It appears that some of the time, the server correctly FIN+ACKs the
> first FIN, at which point the TCP connection is torn down and set back
> up, with no interruption to the NFS session. However, some of the time,
> the server does not FIN+ACK the clients FIN, at which point the client
> waits 15 seconds and RST's the TCP connection, leading to the errors seen.
>
> What is the purpose of xprt_autoclose() ? I assume it is to
> automatically close idle connections. Am I correct in assuming that it
> should not be attempting to close an active connection?
>
> Thanks,
>
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com