Return-Path: linux-nfs-owner@vger.kernel.org
Received: from mx12.netapp.com ([216.240.18.77]:37593 "EHLO mx12.netapp.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1755410Ab3DWOSI convert rfc822-to-8bit (ORCPT );
        Tue, 23 Apr 2013 10:18:08 -0400
From: "Myklebust, Trond"
To: Joakim Tjernlund
CC: "linux-nfs@vger.kernel.org"
Subject: Re: NFS loop on 3.4.39
Date: Tue, 23 Apr 2013 14:18:07 +0000
Message-ID: <1366726687.35524.6.camel@leira.trondhjem.org>
References: <1366126613.12556.18.camel@leira.trondhjem.org>
        <1366150010.27817.8.camel@leira.trondhjem.org>
        <1366725123.35524.2.camel@leira.trondhjem.org>
In-Reply-To:
Content-Type: text/plain; charset=US-ASCII
MIME-Version: 1.0
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Tue, 2013-04-23 at 16:14 +0200, Joakim Tjernlund wrote:
> "Myklebust, Trond" wrote on 2013/04/23 15:52:06:
> >
> > On Tue, 2013-04-23 at 15:38 +0200, Joakim Tjernlund wrote:
> > > So, it happened again. Just when hitting search on bugs.gentoo.org
> > > in firefox 17.0.3
> > >
> > > This time I got an NFS loop with NFS4ERR_BAD_STATEID looping over
> > > and over again and FF was hung. Not posting the logs as it does not
> > > appear to do any good. Nothing in dmesg either.
> > >
> > > Noticed this patch on the NFS list:
> > > http://marc.info/?l=linux-nfs&m=136643651710066&w=2
> > > I wonder if that could be a potential cure and if so, could it be
> > > backported to 3.4?
> >
> > It is in the testing branch on
> >
> > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=summary
> >
> > if you want to try it out. I'm not planning on backporting anything
> > that hasn't been labelled with a Cc: stable in that branch.
>
> Well, we won't use tip of Linus' tree in production so there is
> little point to use your testing branch. However it looks like a
> trivial backport so I can test it on my client easily.

The point of testing would not be to discover if you can use Linus' tree
in production, but rather to see if the problem is already fixed
upstream. If it is, we can bisect to figure out which patch is the fix.

> Even the NFS server if required, is the above referenced patch for
> NFS client/server or both? Any chance this is the culprit?

That's a client patch.

> Jocke
>
> PS.
> I guess I should throw in
> NFSv4: Ensure the LOCK call cannot use the delegation stateid
> too?
> >
> > Cheers
> > Trond
> >
> > > Jocke
> > >
> > > Joakim Tjernlund/Transmode wrote on 2013/04/19 12:54:38:
> > > >
> > > > Joakim Tjernlund/Transmode wrote on 2013/04/18 14:34:03:
> > > > >
> > > > > "Myklebust, Trond" wrote on 2013/04/17 00:06:51:
> > > > > >
> > > > > > On Tue, 2013-04-16 at 21:07 +0200, Joakim Tjernlund wrote:
> > > > > > > "Myklebust, Trond" wrote on 2013/04/16 17:36:55:
> > > > > > >
> > > > > > > > From: "Myklebust, Trond"
> > > > > > > > To: Joakim Tjernlund ,
> > > > > > > > Cc: "linux-nfs@vger.kernel.org"
> > > > > > > > Date: 2013/04/16 17:37
> > > > > > > > Subject: Re: NFS loop on 3.4.39
> > > > > > > >
> > > > > > > > On Tue, 2013-04-16 at 12:41 +0200, Joakim Tjernlund wrote:
> > > > > > > > > Here we go again, this time it happened while browsing
> > > > > > > > > the Boston news on www.dn.se
> > > > > > > > > Now gvfsd-metadata is turned off (not running at all)
> > > > > > > > > and I get:
> > > > > > > > > 10:28:44.616146 IP 192.168.201.44.nfs > 172.20.4.10.3671768838: reply ok 52 getattr ERROR: unk 10024
> > > > > > > >
> > > > > > > > Part of the reason why you are getting no response to
> > > > > > > > these posts is that you are posting tcpdump-decoded data.
> > > > > > > > Tcpdump still has no support for NFSv4, and therefore
> > > > > > > > completely garbles the output by trying to interpret it
> > > > > > > > as NFSv2/v3.
> > > > > > > >
> > > > > > > > In general, if you are posting network traffic, please
> > > > > > > > record it as binary raw packet data (using the '-w'
> > > > > > > > option on tcpdump) so that we can look at the full
> > > > > > > > contents. Either include it as an attachment, or provide
> > > > > > > > us with details on how to download it from an http server.
> > > > > > > >
> > > > > > > > Other information that is needed in order to make sense
> > > > > > > > of NFS bug reports includes:
> > > > > > >
> > > > > > > Thank you Trond, I figured there was something missing but I
> > > > > > > didn't know where to start but here goes:
> > > > > > >
> > > > > > > > - client OS (non-linux) or kernel version (linux)
> > > > > > > Client OS Linux 3.4.39, x86
> > > > > > >
> > > > > > > > - mount options on the client
> > > > > > > ~ # ypmatch jocke auto.home
> > > > > > > -fstype=nfs,soft devsrv:/mnt/home/jocke
> > > > > > >
> > > > > > > > - server OS (non-linux) or kernel version (linux)
> > > > > > > Server OS Linux 3.4.39, amd64
> > > > > > >
> > > > > > > > - type of exported filesystem on the server
> > > > > > > XFS
> > > > > > >
> > > > > > > > - contents of /etc/exports on the server
> > > > > > > more /etc/exports
> > > > > > > # /etc/exports: NFS file systems being exported. See exports(5).
> > > > > > > /mnt/home *(rw,async,root_squash,no_subtree_check)
> > > > > > > /mnt/systemtest *(rw,sync,root_squash,no_subtree_check)
> > > > > > > /mnt/TNM *(rw,sync,root_squash,no_subtree_check)
> > > > > > > /tftproot *(rw,async,root_squash,no_subtree_check)
> > > > > > > /mnt/images *(rw,async,no_root_squash,no_subtree_check,insecure)
> > > > > > > /rescue *(ro,async,no_root_squash,no_subtree_check,insecure)
> > > > > > >
> > > > > > > /mnt/home is the one failing
> > > > > > >
> > > > > > > > Please ensure that you always include those in your emails.
> > > > > > >
> > > > > > > nfs.pcap:
> > > > > > > http://ftp-us.transmode.se/get/?id=1bf2561ed2e7d4e379b2936319c82c25
> > > > > > >
> > > > > > > nfs2.pcap:
> > > > > > > http://ftp-us.transmode.se/get/?id=759c7645248a426720da8e9ba7074040
> > > > > > >
> > > > > > > nfs3.pcap:
> > > > > > > http://ftp-us.transmode.se/get/?id=051c6d771978b2407e15e96152bd6e66
> > > > > > >
> > > > > > > nfs4.pcap:
> > > > > > > http://ftp-us.transmode.se/get/?id=5dfab4da6cbbe400697bc1621b541c9f
> > > > > > >
> > > > > > > nfs3.pcap is the gvfsd-metadata problem one can find using
> > > > > > > google, doesn't have to be an NFS problem
> > > > > > > The other 3 all come from surfing the www using firefox 17.0.3
> > > > > >
> > > > > > The nfs2.pcap file and nfs4.pcap seem to show the server
> > > > > > returning NFS4ERR_OLD_STATEID, which usually means that the
> > > > > > client has an OPEN/CLOSE/LOCK or LOCKU... in flight and that
> > > > > > while the server has updated the stateid, the client has not
> > > > > > yet received the reply. The problem is that I see no sign of
> > > > > > the OPEN/CLOSE/LOCK/LOCKU...
> > > > > >
> > > > > > The nfs.pcap file is resending a load of LOCK requests that are
> > > > > > receiving NFS4ERR_BAD_STATEID replies. Normally, I'd expect the
> > > > > > recovery engine to kick in and try to recover the OPEN.
> > > > > >
> > > > > > So when you do 'ps -efwww', on any of these clients, do you see
> > > > > > a process with a name containing the server IP address
> > > > > > (192.168.201.44)?
> > > > > >
> > > > > > Also, is there anything special in the log when you do
> > > > > > 'dmesg -s 90000'?
> > > > >
> > > > > Of course this happened again while I wasn't looking so I don't
> > > > > know what caused it, probably firefox though.
> > > > >
> > > > > There is nothing in dmesg and ps -efwww has no hit on IP
> > > > > address 192.168.201.44, the closest I can get is:
> > > > > ps -efwww | grep nfs
> > > > > root 568 2 0 Apr16 ? 00:00:00 [nfsiod]
> > > > > root 2440 2 0 Apr16 ? 00:00:00 [nfsd4]
> > > > > root 2441 2 0 Apr16 ? 00:00:00 [nfsd4_callbacks]
> > > > > root 2442 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2443 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2444 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2445 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2446 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2447 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2448 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2449 2 0 Apr16 ? 00:00:00 [nfsd]
> > > > > root 2667 2 0 Apr16 ? 00:00:00 [nfsv4.0-svc]
> > > > > jocke 27048 26888 0 14:28 pts/3 00:00:00 grep --colour=auto nfs
> > > > >
> > > > > Got a new pcap file also:
> > > > > http://ftp-us.transmode.se/get/?id=6f935e1d7e105d01e9a5b907c6493521 nfs5.pcap
> > > > >
> > > > > The load is not that noticeable so I can stay in this mode a
> > > > > while, until I go home today.
> > > >
> > > > So left it overnight and this morning my NFS client had completely
> > > > locked up, had to press the power button. This has happened twice
> > > > now.
> > > >
> > > > One more piece of info, we think this problem started when NFS
> > > > server was upgraded from 3.4.28 to 3.4.39
> > > >
> > > > I have no idea how to move forward now. Trond, are you also stuck?
> > > >
> > > > Jocke
> > > >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer
> >
> > NetApp
> > Trond.Myklebust@netapp.com
> > www.netapp.com
>

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com