We have a situation now where one of our NFS clients running a particular
process, which does a lot of filesystem interaction over a long time
period (10+ hours), hangs somewhere in the middle of the process, and
renders the mount unusable from then on. That is, the process goes into
the D state, and all processes that attempt to access the mount also go
into the D state and never return anything.
I've sent SIGKILL to all processes on the mount (which is intr) and all
processes can be killed except for the originating process which caused
the hang. That one will not die; I have to reboot to get it back.
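For anyone trying to reproduce this, a minimal sketch of one way to list
the processes stuck in the D state by reading /proc/<pid>/stat directly.
The function names are made up for illustration, and the script assumes a
Linux /proc and a Python interpreter on the client; it is a diagnostic
aid, not something used in the original report.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in the D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited (or was unreadable) during the scan
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Reading /proc/<pid>/stat does not touch the NFS mount, so the scan itself
should not block on the hung filesystem.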
Client: Linux 2.4.20+NFS_ALL (from 29-Nov-2002)
Server: Netapp 840, running 6.2R2
mount options: rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr,hard
network: 100Mb switched
This has happened a few times in a row now running this process, so it is
repeatable, but I don't know the exact operation that is generating this
condition.
We can re-mount the same NFS mount, with the same mount options, from this
host and interact with it just fine, run benchmarks, etc. So it doesn't
seem to be a particular problem with the network or the client or the
server.
Although I am seeing a large number of "hw tcp v4 csum failed"
errors on this host (running e100 driver), I have also seen smaller
numbers of these errors on another host, same kernel, running the tg3
network driver and not experiencing this hang. Still, this looks like an
NFS client problem.
I've attached kernel logs from a few minutes of
echo 1023 > /proc/sys/sunrpc/nfsd_debug
which may shed some more light for people in the know.
thanks,
andrew
>>>>> " " == Andrew Ryan <[email protected]> writes:
> Although I am seeing a large number of "hw tcp v4 csum failed"
> errors on this host (running e100 driver), I have also seen
> smaller numbers of these errors on another host, same kernel,
> running the tg3 network driver and not experiencing this
> hang. Still, this looks like an NFS client problem.
I disagree: it looks like a hardware problem. From what I can see from
your RPC dump, the NFS client is treating what it is getting from the
server as if it were junk data. This is what I would expect to occur
if the server and client RPC streams are getting desynchronized due to
data corruption.
Try using the eepro100 driver and/or a different card.
Cheers,
Trond
-------------------------------------------------------
This SF.NET email is sponsored by: Thawte.com
Understand how to protect your customers personal information by implementing
SSL on your Apache Web Server. Click here to get our FREE Thawte Apache
Guide: http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0029en
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
On 16 Jan 2003, Trond Myklebust wrote:
> I disagree: it looks like a hardware problem. From what I can see from
> your RPC dump, the NFS client is treating what it is getting from the
> server as if it were junk data. This is what I would expect to occur
> if the server and client RPC streams are getting desynchronized due to
> data corruption.
This machine has a long and faithful service record, although of course
that doesn't rule out hardware failure.
I can try to dig further into this process to find out exactly what is
triggering the hang. But, so you know, I switched the mount to UDP
(rw,udp,nfsvers=3,rsize=32768,wsize=32768,intr,hard) and now the process
completes normally.
Not being a kernel developer, I do wonder why a single piece of corrupted
data can have the effect of hanging the mount for all users and all
processes. It also creates an unkillable process, requiring the
machine to be rebooted, relating to the 'umount -f' discussion that's
been going on here lately.
>
> Try using the eepro100 driver and/or a different card.
Swapping cards is not so easy: this is a motherboard with built-in Ethernet.
I can try the eepro100 driver though.
thanks,
andrew
>>>>> " " == Andrew Ryan <[email protected]> writes:
> Not being a kernel developer, I do wonder why a single piece of
> corrupted data can have the effect of hanging the mount for all
> users and all processes. It also creates an unkillable process,
> requiring the machine to be rebooted, relating to the 'umount
> -f' discussion that's been going on here lately.
That's the downside of TCP: it is supposed to be a continuous stream,
but it is separated into 'segments' (RPC messages) of a length that is
specified within the stream itself. Get the length of one such segment
wrong, and you will be very unlikely to happen on the start of another
segment.
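To make the framing problem concrete, here is a minimal sketch of
length-prefixed framing in the style of RPC record marking (a 4-byte
big-endian header whose top bit flags the last fragment and whose low 31
bits give the length). It is an illustration only, not the kernel's code:
the helper names and the sample payloads are invented, and every fragment
is treated as a complete record for simplicity.

```python
import struct

LAST_FRAGMENT = 0x80000000  # top bit of the 4-byte record marker

def frame(payload: bytes) -> bytes:
    """Prefix a payload with a record marker carrying its length."""
    return struct.pack(">I", LAST_FRAGMENT | len(payload)) + payload

def read_records(stream: bytes):
    """Walk the stream, trusting each length field to find the next header."""
    records = []
    offset = 0
    while offset + 4 <= len(stream):
        (marker,) = struct.unpack_from(">I", stream, offset)
        length = marker & 0x7FFFFFFF
        offset += 4
        # A real receiver would wait for 'length' bytes to arrive; here we
        # simply take whatever is left if the stream is shorter than that.
        records.append(stream[offset:offset + length])
        offset += length
    return records

messages = [b"reply-1", b"reply-2", b"reply-3"]
stream = b"".join(frame(m) for m in messages)

# Intact stream: every record is recovered.
print(read_records(stream))

# Flip one bit in the first length field (7 becomes 6).
corrupt = bytearray(stream)
corrupt[3] ^= 0x01
damaged = read_records(bytes(corrupt))
print(damaged)
```

After the single bit flip, the reader takes the last payload byte plus
part of the next header as a new header, so nothing after the first
record is parsed correctly and there is no marker to resynchronize on.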
For the unkillable process, see the discussion about 'intr' and
earlier mails on the subject of unkillable NFS processes.
Cheers,
Trond
On 16 Jan 2003, Trond Myklebust wrote:
> >>>>> " " == Andrew Ryan <[email protected]> writes:
>
> > Although I am seeing a large number of "hw tcp v4 csum failed"
> > errors on this host (running e100 driver), I have also seen
> > smaller numbers of these errors on another host, same kernel,
> > running the tg3 network driver and not experiencing this
> > hang. Still, this looks like an NFS client problem.
>
> I disagree: it looks like a hardware problem. From what I can see from
> your RPC dump, the NFS client is treating what it is getting from the
> server as if it were junk data. This is what I would expect to occur
> if the server and client RPC streams are getting desynchronized due to
> data corruption.
>
> Try using the eepro100 driver and/or a different card.
Since then we have seen this problem occur over and over again on a
certain set of hosts and ended up tracing the problem to a bad switch
trunk. Different cables, different cards, different drivers, same
result: total hang of mount on the client and processes sticking in
unkillable D state. Once we took the bad trunk out of the equation,
everything worked fine.
What is really interesting to me is that this *same* problem occurred with a
UDP mount as well. I got very similar NFS debug messages to what I
previously posted with TCP.
According to subsequent messages in this thread, bad network data can
sometimes hang TCP mounts, but I got the same effect on UDP. Which
shouldn't happen, right? That would imply that something isn't right in
the Linux NFS client.
By the way, other than this, 2.4.20+NFS_ALL has been stellar! The TCP
support is great, we are maxing out our clients at 100Mb day in and day
out and not seeing any problems so far, performance and reliability have
been very good (knock on wood, praise be to Zeus, mighty may he reign,
etc.).
thanks,
andrew
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
>>>>> " " == Andrew Ryan <[email protected]> writes:
> According to subsequent messages in this thread, bad network
> data can sometimes hang TCP mounts, but I got the same effect
> on UDP. Which shouldn't happen, right? That would imply that
> something isn't right in the Linux NFS client.
That depends. The UDP client emulates the TCP congestion control (with
exponential backoff etc), so of course it can end up hanging. If
indeed your problem was a bad switch, then lost packets can quickly
end up slowing throughput to a trickle (which would appear to the user
as if the NFS client is hanging).
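As a rough illustration of how quickly exponential backoff adds up, the
sketch below doubles a retransmit timeout after every lost request up to
a cap. The starting value (7 tenths of a second), the cap (60 seconds)
and the function name are assumptions chosen for the example; this is not
the kernel's actual congestion control code.

```python
def backoff_schedule(initial_timeout, max_timeout, attempts):
    """Timeouts used for successive retransmissions, doubling up to a cap.

    All values are in tenths of a second (illustrative units).
    """
    timeouts = []
    current = initial_timeout
    for _ in range(attempts):
        timeouts.append(current)
        current = min(current * 2, max_timeout)
    return timeouts

# Eight consecutive losses of the same request, starting at 0.7 s.
schedule = backoff_schedule(initial_timeout=7, max_timeout=600, attempts=8)
print(schedule)             # per-attempt waits, in tenths of a second
print(sum(schedule) / 10)   # total seconds spent waiting on one request
```

Under these assumed values a single request that keeps getting dropped
waits about two and a half minutes in total, which from the user's side
is hard to tell apart from a hung mount.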
Cheers,
Trond