Return-Path: linux-nfs-owner@vger.kernel.org
Received: from gw1.transmode.se ([195.58.98.146]:64236 "EHLO gw1.transmode.se" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756454Ab3DWOen (ORCPT ); Tue, 23 Apr 2013 10:34:43 -0400
In-Reply-To: <1366726687.35524.6.camel@leira.trondhjem.org>
References: <1366126613.12556.18.camel@leira.trondhjem.org> <1366150010.27817.8.camel@leira.trondhjem.org> <1366725123.35524.2.camel@leira.trondhjem.org>
To: "Myklebust, Trond"
Cc: "linux-nfs@vger.kernel.org"
MIME-Version: 1.0
Subject: Re: NFS loop on 3.4.39
From: Joakim Tjernlund
Message-ID:
Date: Tue, 23 Apr 2013 16:34:41 +0200
Content-Type: text/plain; charset="US-ASCII"
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

"Myklebust, Trond" wrote on 2013/04/23 16:18:07:
>
> On Tue, 2013-04-23 at 16:14 +0200, Joakim Tjernlund wrote:
> > "Myklebust, Trond" wrote on 2013/04/23 15:52:06:
> > >
> > > On Tue, 2013-04-23 at 15:38 +0200, Joakim Tjernlund wrote:
> > > > So, it happened again. Just when hitting search on bugs.gentoo.org in
> > > > firefox 17.0.3
> > > >
> > > > This time I got a NFS loop with NFS4ERR_BAD_STATEID looping over and over
> > > > again and FF was hung. Not posting the logs as it does not appear to
> > > > do any good. Nothing in dmesg either.
> > > >
> > > > Noticed this patch on the NFS list:
> > > > http://marc.info/?l=linux-nfs&m=136643651710066&w=2
> > > > I wonder if that could be a potential cure and if so, could it be
> > > > backported to 3.4?
> > >
> > > It is in the testing branch on
> > >
> > > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=summary
> > >
> > > if you want to try it out. I'm not planning on backporting anything that
> > > hasn't been labelled with a Cc: stable in that branch.
> >
> > Well, we won't use tip of linus tree in production so there is
> > little point to use your testing branch. However it looks like a trivial
> > backport so I can test it on my client easily.
hmm, after testing a patched 3.4 kernel I could possibly try Linus tree on
my client but I doubt I will have time to bisect it as it can take days to
reproduce. Will have it in mind though.

>
> The point of testing would not be to discover if you can use Linus' tree
> in production, but rather to see if the problem is already fixed
> upstream. If it is, we can bisect to figure out which patch is the fix.
>
> > Even the NFS server if required, is the above referenced patch for
> > NFS client/server or both? Any chance this is the culprit?
>
> That's a client patch.

Thanks, rebuilding my client's kernel now.

>
> > Jocke
> >
> > PS.
> > I guess I should throw in
> > NFSv4: Ensure the LOCK call cannot use the delegation stateid
> > too?
> > >
> > > Cheers
> > > Trond
> > >
> > > > Jocke
> > > >
> > > > Joakim Tjernlund/Transmode wrote on 2013/04/19 12:54:38:
> > > > >
> > > > > Joakim Tjernlund/Transmode wrote on 2013/04/18 14:34:03:
> > > > > >
> > > > > > "Myklebust, Trond" wrote on 2013/04/17 00:06:51:
> > > > > > >
> > > > > > > On Tue, 2013-04-16 at 21:07 +0200, Joakim Tjernlund wrote:
> > > > > > > > "Myklebust, Trond" wrote on 2013/04/16 17:36:55:
> > > > > > > >
> > > > > > > > > From: "Myklebust, Trond"
> > > > > > > > > To: Joakim Tjernlund ,
> > > > > > > > > Cc: "linux-nfs@vger.kernel.org"
> > > > > > > > > Date: 2013/04/16 17:37
> > > > > > > > > Subject: Re: NFS loop on 3.4.39
> > > > > > > > >
> > > > > > > > > On Tue, 2013-04-16 at 12:41 +0200, Joakim Tjernlund wrote:
> > > > > > > > > > Here we go again, this time it happened while browsing the Boston news on
> > > > > > > > > > www.dn.se
> > > > > > > > > > Now gvfsd-metadata is turned off (not running at all) and I get:
> > > > > > > > > > 10:28:44.616146 IP 192.168.201.44.nfs > 172.20.4.10.3671768838: reply ok
> > > > > > > > > > 52 getattr ERROR: unk 10024
> > > > > > > > >
> > > > > > > > > Part of the
> > > > > > > > > reason why you are getting no response to these posts is
> > > > > > > > > that you are posting tcpdump-decoded data. Tcpdump still has no support
> > > > > > > > > for NFSv4, and therefore completely garbles the output by trying to
> > > > > > > > > interpret it as NFSv2/v3.
> > > > > > > > > In general, if you are posting network traffic, please record it as
> > > > > > > > > binary raw packet data (using the '-w' option on tcpdump) so that we can
> > > > > > > > > look at the full contents. Either include it as an attachment, or
> > > > > > > > > provide us with details on how to download it from an http server.
> > > > > > > > >
> > > > > > > > > Other information that is needed in order to make sense of NFS bug
> > > > > > > > > reports includes:
> > > > > > > >
> > > > > > > > Thank you Trond, I figured there was something missing but I didn't know
> > > > > > > > where to start but here goes:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > - client OS (non-linux) or kernel version (linux)
> > > > > > > > Client OS Linux 3.4.39, x86
> > > > > > > >
> > > > > > > > > - mount options on the client
> > > > > > > > ~ # ypmatch jocke auto.home
> > > > > > > > -fstype=nfs,soft devsrv:/mnt/home/jocke
> > > > > > > >
> > > > > > > > > - server OS (non-linux) or kernel version (linux)
> > > > > > > > Server OS Linux 3.4.39, amd64
> > > > > > > >
> > > > > > > > > - type of exported filesystem on the server
> > > > > > > > XFS
> > > > > > > >
> > > > > > > > > - contents of /etc/exports on the server
> > > > > > > > more /etc/exports
> > > > > > > > # /etc/exports: NFS file systems being exported. See exports(5).
> > > > > > > > /mnt/home *(rw,async,root_squash,no_subtree_check)
> > > > > > > > /mnt/systemtest *(rw,sync,root_squash,no_subtree_check)
> > > > > > > > /mnt/TNM *(rw,sync,root_squash,no_subtree_check)
> > > > > > > > /tftproot *(rw,async,root_squash,no_subtree_check)
> > > > > > > > /mnt/images *(rw,async,no_root_squash,no_subtree_check,insecure)
> > > > > > > > /rescue *(ro,async,no_root_squash,no_subtree_check,insecure)
> > > > > > > >
> > > > > > > > /mnt/home is the one failing
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Please ensure that you always include those in your emails.
> > > > > > > >
> > > > > > > > nfs.pcap:
> > > > > > > > http://ftp-us.transmode.se/get/?id=1bf2561ed2e7d4e379b2936319c82c25
> > > > > > > >
> > > > > > > > nfs2.pcap:
> > > > > > > > http://ftp-us.transmode.se/get/?id=759c7645248a426720da8e9ba7074040
> > > > > > > >
> > > > > > > > nfs3.pcap:
> > > > > > > > http://ftp-us.transmode.se/get/?id=051c6d771978b2407e15e96152bd6e66
> > > > > > > >
> > > > > > > > nfs4.pcap:
> > > > > > > > http://ftp-us.transmode.se/get/?id=5dfab4da6cbbe400697bc1621b541c9f
> > > > > > > >
> > > > > > > > nfs3.pcap is the gvfsd-metadata problem one can find using google, doesn't
> > > > > > > > have to be a NFS problem
> > > > > > > > The other 3 all come from surfing the www using firefox 17.0.3
> > > > > > > >
> > > > > > > The nfs2.pcap file and nfs4.pcap seem to show the server returning
> > > > > > > NFS4ERR_OLD_STATEID, which usually means that the client has an
> > > > > > > OPEN/CLOSE/LOCK or LOCKU... in flight and that while the server has
> > > > > > > updated the stateid, the client has not yet received the reply. The
> > > > > > > problem is that I see no sign of the OPEN/CLOSE/LOCK/LOCKU...
> > > > > > >
> > > > > > > The nfs.pcap file is resending a load of LOCK requests that are
> > > > > > > receiving NFS4ERR_BAD_STATEID replies.
> > > > > > > Normally, I'd expect the recovery
> > > > > > > engine to kick in and try to recover the OPEN.
> > > > > > >
> > > > > > > So when you do 'ps -efwww', on any of these clients, do you see a
> > > > > > > process with a name containing the server IP address (192.168.201.44)?
> > > > > > >
> > > > > > > Also, is there anything special in the log when you do 'dmesg -s 90000'?
> > > > > >
> > > > > > Of course this happened again while I wasn't looking so I don't know what
> > > > > > caused it, probably firefox though.
> > > > > >
> > > > > > There is nothing in dmesg and ps -efwww has no hit on IP
> > > > > > address 192.168.201.44, the closest I can get is:
> > > > > > ps -efwww | grep nfs
> > > > > > root 568 2 0 Apr16 ? 00:00:00 [nfsiod]
> > > > > > root 2440 2 0 Apr16 ? 00:00:00 [nfsd4]
> > > > > > root 2441 2 0 Apr16 ? 00:00:00 [nfsd4_callbacks]
> > > > > > root 2442 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2443 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2444 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2445 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2446 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2447 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2448 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2449 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > > root 2667 2 0 Apr16 ? 00:00:00 [nfsv4.0-svc]
> > > > > > jocke 27048 26888 0 14:28 pts/3 00:00:00 grep --colour=auto nfs
> > > > > >
> > > > > > Got a new pcap file also:
> > > > > > http://ftp-us.transmode.se/get/?id=6f935e1d7e105d01e9a5b907c6493521 nfs5.pcap
> > > > > >
> > > > > > The load is not that noticeable so I can stay in this mode a while, until I go
> > > > > > home today.
> > > > >
> > > > > So left it overnight and this morning my NFS client had completely locked up,
> > > > > had to press the power button. This has happened twice now.
> > > > >
> > > > > One more piece of info, we think this problem started when NFS server
> > > > > was upgraded from 3.4.28 to 3.4.39
> > > > >
> > > > > I have no idea how to move forward now. Trond, are you also stuck?
> > > > >
> > > > > Jocke
> > >
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer
> > >
> > > NetApp
> > > Trond.Myklebust@netapp.com
> > > www.netapp.com
> > >
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com