From: Michel Lespinasse <walken-Y93EPB1FQwg@public.gmane.org>
Subject: NFS hangs with 2.6.25/2.6.26 despite server being reachable
Date: Tue, 15 Jul 2008 22:40:53 -0700
Message-ID: <20080716054053.GE6159@zoy.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: linux-nfs@vger.kernel.org
Sender: linux-nfs-owner@vger.kernel.org

I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my
NFS clients, while 2.6.24 seems to work fine.

Here are some details about my configuration:

* The NFS server is a Linux host (x86-64) running 2.6.22.19.

* I have two diskless clients on different hardware: one is an old
  AMD K7 with an Intel PRO/100 adapter, the other is an x86-64 machine
  with on-board Intel e1000e networking.

* The clients boot from a small initrd and mount their rootfs using
  NFSv3 over TCP. The initial mount is done with "nfsmount -o ro,nolock"
  from the klibc-utils package.

* /etc/fstab on the clients lists the NFS root with the nolock,noatime
  options. The boot scripts (unmodified from my Debian etch distro)
  remount the root using these. In the end, /proc/mounts shows the rootfs
  with the following options:
  When running 2.6.24 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys
  When running 2.6.26 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys,mountproto=udp

* I often run suspend-to-ram on the clients, if I know they'll be unused
  for a few hours.
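
To make the difference between the two /proc/mounts option strings listed
above explicit, they can be compared mechanically. This is only a minimal
sketch in Python; the function and constant names are made up for
illustration, and the strings are copied from the listing above:

```python
def parse_mount_options(options):
    """Split a comma-separated mount option string into a dict.

    Flag options such as "hard" map to None; key=value options map
    to their value as a string.
    """
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else None
    return parsed


def diff_mount_options(old, new):
    """Return (added, removed, changed) between two option strings."""
    old_opts = parse_mount_options(old)
    new_opts = parse_mount_options(new)
    added = {k: v for k, v in new_opts.items() if k not in old_opts}
    removed = {k: v for k, v in old_opts.items() if k not in new_opts}
    changed = {
        k: (old_opts[k], new_opts[k])
        for k in old_opts
        if k in new_opts and old_opts[k] != new_opts[k]
    }
    return added, removed, changed


# Option strings as reported by /proc/mounts on the clients.
OPTS_2_6_24 = ("rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,"
               "nolock,proto=tcp,timeo=7,retrans=3,sec=sys")
OPTS_2_6_26 = ("rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,"
               "hard,nointr,nolock,proto=tcp,timeo=7,retrans=3,sec=sys,"
               "mountproto=udp")

if __name__ == "__main__":
    added, removed, changed = diff_mount_options(OPTS_2_6_24, OPTS_2_6_26)
    print("added:  ", added)
    print("removed:", removed)
    print("changed:", changed)
```

The only differences are the extra namlen=255 and mountproto=udp entries
under 2.6.26. Whether these are newly reported defaults or an actual
change in behavior is not something this comparison can tell.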

The above setup works just fine when running 2.6.24 on the clients.

When running 2.6.25 on the clients, the rootfs often hangs after a
suspend/resume cycle. This happens most often if the clients had been
suspended for a while (say overnight, rather than just during dinner).
By 'hangs' I mean that the clients show all the symptoms that would be
expected if the server were gone, i.e. processes can't access the
filesystem, the load climbs to high values, and the machine becomes
basically unusable. However, the clients are still pingable from the NFS
server host, so I believe the network itself is fine. After 5-10 minutes
of this, everything comes back to normal without any intervention on my part.

When running 2.6.26 on the clients, the same behavior is observed, except
that the clients never recover. In fact, I could not find any way to fix
them short of rebooting.

I'm not sure what's going on. My best guess is that something times out
(possibly the TCP connection itself?) and that the NFS client in 2.6.25
and above takes a very long time to give up and establish a new
connection with the server. I'm not sure whether that makes sense, and I
have not been able to collect hard evidence of it so far.
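
One way to look for such evidence would be to record the state of the
client's TCP connection to the server across a suspend/resume cycle. The
following is only a sketch under several assumptions: that the NFS
client's socket is listed in /proc/net/tcp, that the server listens on
the standard NFS port 2049, and that the host is little-endian (true for
both x86 clients). The function names are hypothetical.

```python
import socket
import struct

# TCP state codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

NFS_PORT = 2049  # assumption: server uses the standard NFS port


def decode_endpoint(endpoint):
    """Decode 'HEXIP:HEXPORT' from /proc/net/tcp into (ip, port).

    Assumes a little-endian host, where the IPv4 address is printed
    byte-reversed.
    """
    hex_ip, hex_port = endpoint.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def nfs_connections(proc_net_tcp_text, port=NFS_PORT):
    """Return (local, remote, state) for sockets whose remote port matches."""
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local = decode_endpoint(fields[1])
        remote = decode_endpoint(fields[2])
        if remote[1] != port:
            continue
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        results.append((local, remote, state))
    return results


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            text = f.read()
    except OSError as err:
        print("cannot read /proc/net/tcp:", err)
    else:
        for local, remote, state in nfs_connections(text):
            print("%s:%d -> %s:%d %s" % (local + remote + (state,)))
```

Since the rootfs itself is on NFS, something like this would have to be
started before the suspend and left looping, logging to the console or a
tmpfs. Comparing the state reported on the client with what the server
sees for the same connection might show whether the two ends disagree.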

Any ideas about what might be going wrong, and/or what additional information
I should try to collect about the hangs?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.