2008-07-23 21:52:12

by J. Bruce Fields

Subject: Re: NFS hangs with 2.6.25/2.6.26 despite server being reachable

On Sun, Jul 20, 2008 at 07:14:08PM -0700, Michel Lespinasse wrote:
> On Thu, Jul 17, 2008 at 11:04:05PM -0700, Michel Lespinasse wrote:
> > On Wed, Jul 16, 2008 at 03:15:53PM -0400, J. Bruce Fields wrote:
> > > On Tue, Jul 15, 2008 at 10:40:53PM -0700, Michel Lespinasse wrote:
> > > > I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my
> > > > NFS clients, while 2.6.24 seems to work fine.
> > > > [...]
> > > > Any ideas about what might be going wrong and/or what additional
> > > > information I should try to collect about the hangs ?
> > >
> > > A sysrq-T trace showing where the clients were hung might help. (So,
> > > "echo T >/proc/sysrq-trigger", then look at the logs.)
> >
> > Thanks for the reply. I'm now running 2.6.25.11 with sysrq enabled.
> > Have not captured the failure yet, but then again it's been only one night.
> > I prefer to go with 2.6.25 instead of 2.6.26 because 2.6.25 generally
> > recovers from the failure after a few minutes - so there is a higher chance
> > that I'll actually get something useful logged.
>
> It took me a while, as for some reason I could not get things to fail
> this week (It's probably that I don't know all the factors that trigger
> the NFS hangs, yet). Then today I got two NFS hangs in a row, running
> kernel version 2.6.25.11 on my K7 based client.
>
> In both cases I captured information using alt-sysrq-t, the system
> hung there for a few minutes, I double checked that the machine was
> pingable from the server, and I got the dumps out of kern.log after
> the machine recovered. The logs are incomplete, given that syslog
> could not run well with the rootfs hung. I'm not sure if a larger
> dmesg buffer would help ? Anyway, please get the logs from
>
> http://lespinasse.org/kern.log
> http://lespinasse.org/kern.log.2nd

Oh, sorry, I think I overlooked the note in your original message that
this happened after suspend-to-ram. That's interesting! You really
need someone with more experience debugging the rpc client.... It might
also be worth turning on rpc debugging during the hang just to get the
dump of rpc task states. (So

echo 0 >/proc/sys/sunrpc/rpc_debug

and capture the first table it dumps to the log. Or Trond or Chuck
might have some better idea.)
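
For reference, the two captures suggested in this thread (the sysrq-T task
dump and the rpc task table) could be triggered together with something like
the sketch below. It assumes a root shell on the hung client, sysrq enabled,
and a kernel built with sunrpc debugging so that a write to rpc_debug logs
the task table. The function name dump_hang_state and its optional path
arguments are hypothetical, added only so the steps can be tried against
ordinary files first.

```shell
#!/bin/sh
# Sketch only: ask the kernel to log task stacks and RPC task states.
# dump_hang_state is a hypothetical helper name, not an existing tool.
# Run as root on the hung client; output goes to the kernel log.

dump_hang_state() {
    # Default to the real control files; arguments allow a dry run elsewhere.
    sysrq_file=${1:-/proc/sysrq-trigger}
    rpc_debug_file=${2:-/proc/sys/sunrpc/rpc_debug}

    # sysrq-T: log a stack trace for every task.
    echo t > "$sysrq_file" || return 1

    # A write to rpc_debug makes sunrpc log its table of RPC tasks
    # (assumes sunrpc debugging is compiled in); writing 0 leaves the
    # verbose debug flags switched off.
    echo 0 > "$rpc_debug_file" || return 1
}
```

After calling dump_hang_state, something like "dmesg > /dev/shm/nfs-hang.log"
may preserve the result even while syslog is stuck on the hung rootfs. The
/dev/shm path is an assumption (any local, non-NFS filesystem would do).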

--b.

>
> In both cases I see a lot of nfs_wait_schedule, wait_on_bit_lock,
> nfs_revalidate_inode, nfs_check_verifier. Not sure if that's expected,
> but that's what I get, and the machine is pingable from the server side.
>
> Hope this helps. Let me know if you want me to try something else.
>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.


2008-07-24 10:03:07

by Michel Lespinasse

Subject: Re: NFS hangs with 2.6.25/2.6.26 despite server being reachable

On Wed, Jul 23, 2008 at 05:52:10PM -0400, J. Bruce Fields wrote:
> Oh, sorry, I think I overlooked the note in your original message that
> this happened after suspend-to-ram. That's interesting! You really
> need someone with more experience debugging the rpc client.... It might
> also be worth turning on rpc debugging during the hang just to get the
> dump of rpc task states. (So
>
> echo 0 >/proc/sys/sunrpc/rpc_debug
>
> and capture the first table it dumps to the log. Or Trond or Chuck
> might have some better idea.)

Since it does not look like we'll solve this overnight, I opened
bug 11154 to track this. Please feel free to suggest what to try there...

Thanks,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.