2008-07-16 06:06:11

by Michel Lespinasse

Subject: NFS hangs with 2.6.25/2.6.26 despite server being reachable

I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my
NFS clients, while 2.6.24 seems to work fine.

Here are some details about my configuration:

* The NFS server is a Linux host (x86-64) running 2.6.22.19;

* I have two diskless clients on different hardware: one is an old
AMD K7 with an Intel PRO/100 adapter, the other is an x86-64 machine with
on-board Intel e1000e networking.

* The clients boot from a small initrd and mount their rootfs using
NFS3 over TCP. The initial mounting is done with "nfsmount -o ro,nolock"
from the klibc-utils package.

* /etc/fstab on the clients describes the NFS root as having the
nolock,noatime options. The boot scripts (unmodified from my Debian etch
distro) remount the root using these. In the end, /proc/mounts shows the
rootfs as using the following options:
  When running 2.6.24 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys
  When running 2.6.26 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys,mountproto=udp

* I often run suspend-to-ram on the clients, if I know they'll be unused
for a few hours.
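
To make the difference between the two /proc/mounts option strings listed above explicit, here is a minimal Python sketch that parses both and reports what changed. The strings are copied from this message; the helper names (parse_opts, diff_opts) are made up for illustration. Whether the extra options reflect a behavior change or only additional reporting by the newer kernel is not established here.

```python
# Option strings copied from the /proc/mounts output quoted in this message
# (rejoined into single lines).
OPTS_2_6_24 = ("rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,"
               "nolock,proto=tcp,timeo=7,retrans=3,sec=sys")
OPTS_2_6_26 = ("rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,"
               "nointr,nolock,proto=tcp,timeo=7,retrans=3,sec=sys,mountproto=udp")


def parse_opts(line):
    """Turn 'a,b=1,c' into {'a': None, 'b': '1', 'c': None}."""
    opts = {}
    for item in line.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else None
    return opts


def diff_opts(old, new):
    """Return (only_in_old, only_in_new, changed) for two option strings."""
    a, b = parse_opts(old), parse_opts(new)
    only_old = {k: a[k] for k in a if k not in b}
    only_new = {k: b[k] for k in b if k not in a}
    changed = {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}
    return only_old, only_new, changed
```

Calling diff_opts(OPTS_2_6_24, OPTS_2_6_26) shows that nothing was removed or changed, and that namlen=255 and mountproto=udp appear only under 2.6.26.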

The above setup works just fine when running 2.6.24 on the clients.

When running 2.6.25 on the clients, the rootfs often hangs after a
suspend/resume cycle. This happens most often if the clients had been
suspended for a while (say overnight, rather than just during dinner).
By 'hangs' I mean that the clients show all the symptoms one would expect
if the server were gone: processes can't access the filesystem, the load
climbs to high values, and the machine becomes basically unusable. However,
the clients are still pingable from the NFS server host, so I believe
the network itself is fine. After 5-10 minutes of this, everything comes
back to normal without any intervention on my part.

When running 2.6.26 on the clients, the same behavior is observed except that
it never goes back to normal. Actually, I could not find any way to fix the
clients without rebooting them.

I'm not sure what's going on. My best guess is that something times
out (possibly the TCP connection itself?) and that the NFS client in 2.6.25
and above takes a very long time before giving up and establishing a new
connection with the server. I'm not sure that makes sense, and I have
not been able to collect hard evidence for it so far.
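
One way to look for evidence for or against the TCP-connection guess would be to record the state of the client's connection to the server during a hang. The Python sketch below parses lines in the /proc/net/tcp format and reports sockets whose remote port is the NFS port. It rests on assumptions not confirmed in this thread: that the server uses the standard port 2049, that the connection is IPv4, and that the kernel NFS client's socket is listed in /proc/net/tcp. The function name nfs_connections is made up for illustration.

```python
NFS_PORT = 2049  # assumption: the server listens on the standard NFS port

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}


def nfs_connections(lines, port=NFS_PORT):
    """Yield (local_port, state_name) for sockets whose remote port is `port`.

    `lines` is any iterable of text lines in /proc/net/tcp format.
    """
    for line in lines:
        fields = line.split()
        # Skip the header line and anything that is not a socket entry.
        if len(fields) < 4 or ":" not in fields[1] or ":" not in fields[2]:
            continue
        # Addresses are "HEXIP:HEXPORT"; only the port part is needed here.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if remote_port != port:
            continue
        state = int(fields[3], 16)
        yield local_port, TCP_STATES.get(state, "UNKNOWN(%#x)" % state)
```

On a client this could be fed with open("/proc/net/tcp") and logged periodically. Since the rootfs itself is what hangs, the script would have to be started before the hang or run from something that does not need the NFS root.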

Any ideas about what might be going wrong and/or what additional information
I should try to collect about the hangs?

Thanks,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


2008-07-18 06:04:07

by Michel Lespinasse

Subject: Re: NFS hangs with 2.6.25/2.6.26 despite server being reachable

On Wed, Jul 16, 2008 at 03:15:53PM -0400, J. Bruce Fields wrote:
> On Tue, Jul 15, 2008 at 10:40:53PM -0700, Michel Lespinasse wrote:
> > I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my
> > NFS clients, while 2.6.24 seems to work fine.
> > [...]
> > Any ideas about what might be going wrong and/or what additional
> > information I should try to collect about the hangs ?
>
> A sysrq-T trace showing where the clients were hung might help. (So,
> "echo T >/proc/sysrq-trigger", then look at the logs.)

Thanks for the reply. I'm now running 2.6.25.11 with sysrq enabled.
Have not captured the failure yet, but then again it's been only one night.
I prefer to go with 2.6.25 instead of 2.6.26 because 2.6.25 generally
recovers from the failure after a few minutes - so there is a higher chance
that I'll actually get something useful logged.

> If it were possible to get it down to a simple test case, then we'd
> probably learn something from a git-bisect to figure out exactly when
> the problem was first introduced.

I wish I had a better way to reproduce this. As it is, it happens only
every 2 or 3 days (with 2.6.25.4, but I suppose 2.6.25.11 will be the same).

I'll let you know when I capture a good trace.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2008-07-16 19:15:55

by J. Bruce Fields

Subject: Re: NFS hangs with 2.6.25/2.6.26 despite server being reachable

On Tue, Jul 15, 2008 at 10:40:53PM -0700, Michel Lespinasse wrote:
> I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my
> NFS clients, while 2.6.24 seems to work fine.
>
> Here are some details about my configuration:
>
> * The NFS server is a linux host (x86-64) running 2.6.22.19;
>
> * I have two diskless clients using different hardware, one is an old
> AMD K7 with intel pro/100 adapter, the other is x86-64 with on-board
> intel e1000e networking.
>
> * The clients boot from a small initrd and mount their rootfs using
> NFS3 over TCP. The initial mounting is done with "nfsmount -o ro,nolock"
> from the klibc-utils package.
>
> * /etc/fstab in the clients describes the nfs root as having the nolock,noatime
> options. The boot scripts (unmodified from my debian etch distro)
> remount the root using these. In the end, /proc/mounts shows the rootfs as
> using the following options:
> When running 2.6.24 on the client:
> rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,
> nolock,proto=tcp,timeo=7,retrans=3,sec=sys
> When running 2.6.26 on the client:
> rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,
> nolock,proto=tcp,timeo=7,retrans=3,sec=sys,mountproto=udp
>
> * I often run suspend-to-ram on the clients, if I know they'll be unused
> for a few hours.
>
> The above setup works just fine when running 2.6.24 on the clients.
>
> When running 2.6.25 on the clients, the rootfs often hangs after a
> suspend/resume cycle. This happens most often if the clients had been
> suspended for a while (say overnight, rather than just during dinner).
> By 'hangs' I mean the clients show all symptoms that would be expected
> if the server was gone, i.e. processes can't access the filesystem,
> load goes to high values, machine becomes basically unusable. However,
> the clients are still pingable from the NFS server host, so I believe
> the network itself is fine. After 5-10 minutes of this, everything comes
> back to normal without any intervention on my part.
>
> When running 2.6.26 on the clients, the same behavior is observed except that
> it never goes back to normal. Actually, I could not find any way to fix the
> clients without rebooting them.
>
> I'm not sure what's going on. My best guess would be that something times
> out (possibly the TCP connection itself ?) and that the NFS client in 2.6.25
> and above takes a very long time before giving up and establishing a new
> connection with the server ??? I'm not sure if that makes any sense, and I
> was not able to collect hard evidence of that so far...
>
> Any ideas about what might be going wrong and/or what additional information
> I should try to collect about the hangs ?

A sysrq-T trace showing where the clients were hung might help. (So,
"echo T >/proc/sysrq-trigger", then look at the logs.)
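
For scripting the capture (for example, firing it automatically when a hang is detected), the same trigger can be written from Python. This is a minimal sketch, not a tested recipe for this setup: the function name trigger_sysrq is made up, it must run as root with sysrq enabled, and the default key is lowercase t, which is how the kernel's sysrq documentation spells the task-dump command (an assumption relative to the T written above).

```python
def trigger_sysrq(key="t", trigger_path="/proc/sysrq-trigger"):
    """Write one sysrq command character to the trigger file.

    The default 't' asks the kernel to dump task states to the kernel log.
    `trigger_path` is a parameter only so the function can be exercised
    against an ordinary file instead of the real trigger.
    """
    if len(key) != 1:
        raise ValueError("sysrq key must be a single character")
    with open(trigger_path, "w") as f:
        f.write(key)
```

After calling trigger_sysrq(), the trace should show up in the kernel log (dmesg or the syslog files). On a client whose rootfs is hung, a log on the NFS root may not get written, so a serial console or remote logging may be the more reliable place to look.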

If it were possible to get it down to a simple test case, then we'd
probably learn something from a git-bisect to figure out exactly when
the problem was first introduced.

--b.