2008-08-24 19:17:12

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Sun, 2008-08-24 at 19:53 +0100, Ian Campbell wrote:
> On Fri, 2008-08-22 at 14:56 -0700, Trond Myklebust wrote:
> > On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
> > > I can ssh to the server fine. The same server also serves my NFS home
> > > directory to the box I'm writing this from and I've not seen any trouble
> > > with this box at all, it's a 2.6.18-xen box.
> >
> > OK... Are you able to reproduce the problem reliably?
> >
> > If so, can you provide me with a binary tcpdump or wireshark dump? If
> > using tcpdump, then please use something like
> >
> > tcpdump -w /tmp/dump.out -s 90000 host myserver.foo.bar and port 2049
> >
> > Please also try to provide a netstat dump of the current TCP connections
> > as soon as the hang occurs:
> >
> > netstat -t
>
> Aug 24 18:08:59 iranon kernel: [168839.556017] nfs: server hopkins not responding, still trying
> but I wasn't around until 19:38 to spot it.
>
> netstat when I got to it was:
>
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 localhost.localdo:50891 localhost.localdom:6543 ESTABLISHED
> tcp 1 0 iranon.hellion.org.:ssh azathoth.hellion.:52682 CLOSE_WAIT
> tcp 0 0 localhost.localdom:6543 localhost.localdo:50893 ESTABLISHED
> tcp 0 0 iranon.hellion.org.:837 hopkins.hellion.org:nfs FIN_WAIT2
> tcp 0 0 localhost.localdom:6543 localhost.localdo:41831 ESTABLISHED
> tcp 0 0 localhost.localdo:13666 localhost.localdo:59482 ESTABLISHED
> tcp 0 0 localhost.localdo:34288 localhost.localdom:6545 ESTABLISHED
> tcp 0 0 iranon.hellion.org.:ssh azathoth.hellion.:48977 ESTABLISHED
> tcp 0 0 iranon.hellion.org.:ssh azathoth.hellion.:52683 ESTABLISHED
> tcp 0 0 localhost.localdom:6545 localhost.localdo:34288 ESTABLISHED
> tcp 0 0 localhost.localdom:6543 localhost.localdo:50891 ESTABLISHED
> tcp 0 0 localhost.localdo:50893 localhost.localdom:6543 ESTABLISHED
> tcp 0 0 localhost.localdo:41831 localhost.localdom:6543 ESTABLISHED
> tcp 0 87 localhost.localdo:59482 localhost.localdo:13666 ESTABLISHED
> tcp 1 0 localhost.localdom:6543 localhost.localdo:41830 CLOSE_WAIT
>
> (iranon is the problematic host .4, azathoth is my desktop machine .5, hopkins is the NFS server .6)
>
> tcpdumps are pretty big. I've attached the last 100 packets captured. If
> you need more I can put the full file up somewhere.
>
> -rw-r--r-- 1 root root 1.3G Aug 24 17:57 dump.out0
> -rw-r--r-- 1 root root 536M Aug 24 19:38 dump.out1
>
> Ian.

>From the tcpdump, it looks as if the NFS server is failing to close the
socket, when the client closes its side. You therefore end up getting
stuck in the FIN_WAIT2 state (as netstat clearly shows above).

Is the server keeping the client in this state for a very long period?

Cheers
Trond



2008-08-24 19:20:03

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
> >From the tcpdump, it looks as if the NFS server is failing to close the
> socket, when the client closes its side. You therefore end up getting
> stuck in the FIN_WAIT2 state (as netstat clearly shows above).

BTW: the RPC client is closing the socket because it detected no NFS
activity for 5 minutes. Did you expect any NFS activity during this
time?

Trond


2008-08-24 22:11:54

by Ian Campbell

[permalink] [raw]
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

(added some quoting from previous mail to save replying twice)

On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
> On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
> > >From the tcpdump, it looks as if the NFS server is failing to close the
> > socket, when the client closes its side. You therefore end up getting
> > stuck in the FIN_WAIT2 state (as netstat clearly shows above).
> >
> > Is the server keeping the client in this state for a very long
> > period?

Well, it had been around an hour and a half on this occasion. Next time
it happens I can wait longer but I'm pretty sure I've come back from
time away and it's been wedged for at least a day. How long would you
expect it to remain in this state for?

> BTW: the RPC client is closing the socket because it detected no NFS
> activity for 5 minutes. Did you expect any NFS activity during this
> time?

It's a mythtv box so at times where no one is watching anything and
there isn't anything to record I expect NFS activity is pretty minimal.

Ian.

--
Ian Campbell

Can anyone remember when the times were not hard, and money not scarce?


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2008-08-25 21:39:15

by Roger Heflin

[permalink] [raw]
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

Ian Campbell wrote:
> (added some quoting from previous mail to save replying twice)
>
> On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
>> On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
>>> >From the tcpdump, it looks as if the NFS server is failing to close the
>>> socket, when the client closes its side. You therefore end up getting
>>> stuck in the FIN_WAIT2 state (as netstat clearly shows above).
>>>
>>> Is the server keeping the client in this state for a very long
>>> period?
>
> Well, it had been around an hour and a half on this occasion. Next time
> it happens I can wait longer but I'm pretty sure I've come back from
> time away and it's been wedged for at least a day. How long would you
> expect it to remain in this state for?
>
>> BTW: the RPC client is closing the socket because it detected no NFS
>> activity for 5 minutes. Did you expect any NFS activity during this
>> time?
>
> It's a mythtv box so at times where no one is watching anything and
> there isn't anything to record I expect NFS activity is pretty minimal.
>
> Ian.
>

Ian,

Do you have a recording group setup on the NFS partition that mythtv is
going to be accessing?

I have seen similar funny stuff happen, it used to happened around 2.6.22*
(on each end), and quit happening around 2.6.24* and now has started happening
again with 2.6.25* on both ends.

Similar to what you have the only thing I see is "NFS server not responding" and
restarting the NFS server end (/etc/init.d/nfs restart) appears to get things to
continue on the NFS client. No other messages appear on either end that
indicate that anything is wrong, other non-nfs partitions on the client work
find, the machine is still up, and the NFS server is still up and fine, and
after a restart things will again work for a while (hours or days).

Roger

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/