We have a couple of legacy CentOS (RHEL)-based appliances with slightly
dated NFS implementations.
Server (CentOS 4 based):
nfs-utils-1.0.6-70.EL4
Kernel 2.6.9-42.0.10.plus.c4smp
Client (CentOS 5 based):
nfs-utils-1.0.9-42.el5.x86_64
Kernel 2.6.18-164.15.1.el5
The client has a long-lived NFSv3 mount to the server that sometimes
stops responding (blocks). We can lazy unmount it, but subsequent
mount requests hang and the following is observed via tcpdump:
1. Client GETPORT for NFS service succeeds.
2. Client GETPORT for MOUNT succeeds.
3. Client MNT call succeeds (server gives a valid response, including
   a file handle).
4. Client sends a SYN packet to the NFS port on the server.
5. Server responds with ACK *only*.
When we bounce the NFS daemon on the server, everything starts working
and in step 5 above, we get a SYN,ACK as expected in response to #4,
and everything proceeds along nicely.
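
For anyone comparing captures, the difference between the healthy and
the broken step 5 comes down to the flag bits in the server's reply.
Here is a minimal Python sketch, assuming the TCP flags byte has already
been pulled out of the tcpdump output; the function name
classify_syn_reply is made up for illustration:

```python
# TCP flag bits as laid out in the flags byte of the TCP header (RFC 793).
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def classify_syn_reply(flags):
    """Classify the server's reply to a client SYN from its TCP flags byte.

    Hypothetical helper for illustration only.
    """
    if flags & RST:
        return "reset"      # server refused or aborted the attempt
    if (flags & SYN) and (flags & ACK):
        return "syn-ack"    # normal second step of the handshake
    if flags & ACK:
        return "bare-ack"   # what step 5 shows in the hung case
    return "other"
```

A reply with flags 0x12 is the normal SYN,ACK; 0x10 is the ACK-only
reply described in step 5.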
Does this jog anybody's memory of a long-ago fixed bug? I'm thinking
that updating the kernel and nfs-utils on the server will likely help,
but I would love to find where behavior like the above is documented as
a bug.
Thanks,
Ray
On Wed, Jul 20, 2011 at 12:23:42PM -0700, Ray Van Dolson wrote:
After thinking on this a bit more, I'm wondering if perhaps the server
side still had the old connection "open" (I didn't check with netstat)
and thus sent back only the ACK.
Maybe in this case the client should respond with a RST or something
else to indicate that we need to start from scratch?
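
For reference, RFC 793 describes this situation: a segment arriving in
SYN-SENT with an ACK that does not cover our SYN should be answered with
a RST whose sequence number is the offending ACK value. Here is a small
Python model of those SYN-SENT checks, as I read the RFC. It ignores
32-bit sequence wraparound and says nothing about what the CentOS 5
kernel actually does. The function name is hypothetical.

```python
SYN, RST, ACK = 0x02, 0x04, 0x10


def syn_sent_response(iss, snd_nxt, seg_flags, seg_ack=None):
    """Model RFC 793 segment processing for a socket in SYN-SENT.

    Returns (action, rst_seq). Simplified: no sequence-number wraparound.
    """
    has_ack = bool(seg_flags & ACK)
    # First check: an ACK that does not acknowledge our SYN is unacceptable.
    if has_ack and not (iss < seg_ack <= snd_nxt):
        if seg_flags & RST:
            return ("drop", None)
        return ("send-rst", seg_ack)      # <SEQ=SEG.ACK><CTL=RST>
    # Second check: a RST only counts if it came with an acceptable ACK.
    if seg_flags & RST:
        return ("connection-reset", None) if has_ack else ("drop", None)
    # Fourth check: a SYN completes or continues the handshake.
    if seg_flags & SYN:
        return ("established", None) if has_ack else ("syn-received", None)
    # Neither SYN nor RST: drop the segment.
    return ("drop", None)
```

With a client ISS of 1000, a bare ACK carrying some old acknowledgment
number such as 555 yields ("send-rst", 555), which is the "start from
scratch" signal the server would need to discard its stale state.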
Is there a way, on the server side, to kill an ESTABLISHED TCP
connection (specifically an NFS connection)? Probably by setting a
connection timeout value via /proc ...
I'm thinking that on the client side I could inject a RST packet to the
server to clean things up?
Ray