2013-04-19 10:54:42

by Joakim Tjernlund

Subject: Re: NFS loop on 3.4.39

Joakim Tjernlund/Transmode wrote on 2013/04/18 14:34:03:
>
> "Myklebust, Trond" <[email protected]> wrote on 2013/04/17 00:06:51:
> >
> > On Tue, 2013-04-16 at 21:07 +0200, Joakim Tjernlund wrote:
> > > "Myklebust, Trond" <[email protected]> wrote on 2013/04/16
> > > 17:36:55:
> > >
> > > > From: "Myklebust, Trond" <[email protected]>
> > > > To: Joakim Tjernlund <[email protected]>,
> > > > Cc: "[email protected]" <[email protected]>
> > > > Date: 2013/04/16 17:37
> > > > Subject: Re: NFS loop on 3.4.39
> > > >
> > > > On Tue, 2013-04-16 at 12:41 +0200, Joakim Tjernlund wrote:
> > > > > Here we go again, this time it happened while browsing the Boston
> > > > > news on http://www.dn.se
> > > > > Now gvfsd-metadata is turned off (not running at all) and I get:
> > > > > 10:28:44.616146 IP 192.168.201.44.nfs > 172.20.4.10.3671768838: reply ok 52 getattr ERROR: unk 10024
> > > >
> > > > Part of the reason why you are getting no response to these posts is
> > > > that you are posting tcpdump-decoded data. Tcpdump still has no
> > > > support for NFSv4, and therefore completely garbles the output by
> > > > trying to interpret it as NFSv2/v3.
> > > > In general, if you are posting network traffic, please record it as
> > > > binary raw packet data (using the '-w' option on tcpdump) so that we
> > > > can look at the full contents. Either include it as an attachment, or
> > > > provide us with details on how to download it from an http server.
> > > >
> > > > Other information that is needed in order to make sense of NFS bug
> > > > reports includes:
> > >
> > > Thank you Trond, I figured there was something missing but I didn't know
> > > where to start, so here goes:
> > >
> > > >
> > > > - client OS (non-linux) or kernel version (linux)
> > > Client OS Linux 3.4.39, x86
> > >
> > > > - mount options on the client
> > > ~ # ypmatch jocke auto.home
> > > -fstype=nfs,soft devsrv:/mnt/home/jocke
> > >
> > > > - server OS (non-linux) or kernel version (linux)
> > > Server OS Linux 3.4.39, amd64
> > >
> > > > - type of exported filesystem on the server
> > > XFS
> > >
> > > > - contents of /etc/exports on the server
> > > more /etc/exports
> > > # /etc/exports: NFS file systems being exported. See exports(5).
> > > /mnt/home *(rw,async,root_squash,no_subtree_check)
> > > /mnt/systemtest *(rw,sync,root_squash,no_subtree_check)
> > > /mnt/TNM *(rw,sync,root_squash,no_subtree_check)
> > > /tftproot *(rw,async,root_squash,no_subtree_check)
> > > /mnt/images *(rw,async,no_root_squash,no_subtree_check,insecure)
> > > /rescue *(ro,async,no_root_squash,no_subtree_check,insecure)
> > >
> > > /mnt/home is the one failing
> > >
> > > >
> > > > Please ensure that you always include those in your emails.
> > >
> > > nfs.pcap:
> > > http://ftp-us.transmode.se/get/?id=1bf2561ed2e7d4e379b2936319c82c25
> > >
> > > nfs2.pcap:
> > > http://ftp-us.transmode.se/get/?id=759c7645248a426720da8e9ba7074040
> > >
> > > nfs3.pcap:
> > > http://ftp-us.transmode.se/get/?id=051c6d771978b2407e15e96152bd6e66
> > >
> > > nfs4.pcap:
> > > http://ftp-us.transmode.se/get/?id=5dfab4da6cbbe400697bc1621b541c9f
> > >
> > > nfs3.pcap is the gvfsd-metadata problem one can find using Google; it
> > > doesn't have to be an NFS problem.
> > > The other 3 all come from surfing the web using Firefox 17.0.3.
> >
> > The nfs2.pcap file and nfs4.pcap seem to show the server returning
> > NFS4ERR_OLD_STATEID, which usually means that the client has an
> > OPEN/CLOSE/LOCK or LOCKU... in flight and that while the server has
> > updated the stateid, the client has not yet received the reply. The
> > problem is that I see no sign of the OPEN/CLOSE/LOCK/LOCKU...
> >
> > The nfs.pcap file is resending a load of LOCK requests that are
> > receiving NFS4ERR_BAD_STATEID replies. Normally, I'd expect the recovery
> > engine to kick in and try to recover the OPEN.
> >
> > So when you do 'ps -efwww', on any of these clients, do you see a
> > process with a name containing the server IP address (192.168.201.44)?
> >
> > Also, is there anything special in the log when you do 'dmesg -s 90000'?

> Of course this happened again while I wasn't looking so I don't know what
> caused it, probably firefox though.
>
> There is nothing in dmesg and ps -efwww has no hit on IP
> address 192.168.201.44, the closest I can get is:
> ps -efwww | grep nfs
> root 568 2 0 Apr16 ? 00:00:00 [nfsiod]
> root 2440 2 0 Apr16 ? 00:00:00 [nfsd4]
> root 2441 2 0 Apr16 ? 00:00:00 [nfsd4_callbacks]
> root 2442 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2443 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2444 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2445 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2446 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2447 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2448 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2449 2 0 Apr16 ? 00:00:00 [nfsd]
> root 2667 2 0 Apr16 ? 00:00:00 [nfsv4.0-svc]
> jocke 27048 26888 0 14:28 pts/3 00:00:00 grep --colour=auto nfs
>
> Got a new pcap file also:
> http://ftp-us.transmode.se/get/?id=6f935e1d7e105d01e9a5b907c6493521 nfs5.pcap
>
> The load is not that noticeable so I can stay in this mode a while, until
> I go home today.

So I left it overnight, and this morning my NFS client had completely locked
up; I had to press the power button. This has happened twice now.

One more piece of info: we think this problem started when the NFS server
was upgraded from 3.4.28 to 3.4.39.

I have no idea how to move forward now. Trond, are you also stuck?

Jocke