Return-Path: linux-nfs-owner@vger.kernel.org
Received: from gw1.transmode.se ([195.58.98.146]:58818 "EHLO gw1.transmode.se" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756375Ab3DWNiI (ORCPT ); Tue, 23 Apr 2013 09:38:08 -0400
In-Reply-To:
References: <1366126613.12556.18.camel@leira.trondhjem.org> <1366150010.27817.8.camel@leira.trondhjem.org>
Cc: "Myklebust, Trond" , "linux-nfs@vger.kernel.org"
MIME-Version: 1.0
Subject: Re: NFS loop on 3.4.39
From: Joakim Tjernlund
Message-ID:
Date: Tue, 23 Apr 2013 15:38:03 +0200
Content-Type: text/plain; charset="US-ASCII"
To: unlisted-recipients:; (no To-header on input)
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

So, it happened again. Just when hitting search on bugs.gentoo.org in firefox 17.0.3
This time I got a NFS loop with NFS4ERR_BAD_STATEID looping over and over again and
FF was hung. Not posting the logs as it does not appear to do any good.
Nothing in dmesg either.

Noticed this patch on the NFS list:
http://marc.info/?l=linux-nfs&m=136643651710066&w=2
I wonder if that could be a potential cure and if so, could it be backported to 3.4?

 Jocke

Joakim Tjernlund/Transmode wrote on 2013/04/19 12:54:38:
>
> Joakim Tjernlund/Transmode wrote on 2013/04/18 14:34:03:
> >
> > "Myklebust, Trond" wrote on 2013/04/17 00:06:51:
> > >
> > > On Tue, 2013-04-16 at 21:07 +0200, Joakim Tjernlund wrote:
> > > > "Myklebust, Trond" wrote on 2013/04/16 17:36:55:
> > > >
> > > > > From: "Myklebust, Trond"
> > > > > To: Joakim Tjernlund ,
> > > > > Cc: "linux-nfs@vger.kernel.org"
> > > > > Date: 2013/04/16 17:37
> > > > > Subject: Re: NFS loop on 3.4.39
> > > > >
> > > > > On Tue, 2013-04-16 at 12:41 +0200, Joakim Tjernlund wrote:
> > > > > > Here we go again, this time it happened while browsing the Boston news on
> > > > > > www.dn.se
> > > > > > Now gvfsd-metadata is turned off (not running at all) and I get:
> > > > > > 10:28:44.616146 IP 192.168.201.44.nfs > 172.20.4.10.3671768838: reply ok
> > > > > > 52 getattr ERROR: unk 10024
> > > > >
> > > > > Part of the reason why you are getting no response to these posts is
> > > > > that you are posting tcpdump-decoded data. Tcpdump still has no support
> > > > > for NFSv4, and therefore completely garbles the output by trying to
> > > > > interpret it as NFSv2/v3.
> > > > > In general, if you are posting network traffic, please record it as
> > > > > binary raw packet data (using the '-w' option on tcpdump) so that we can
> > > > > look at the full contents. Either include it as an attachment, or
> > > > > provide us with details on how to download it from an http server.
> > > > >
> > > > > Other information that is needed in order to make sense of NFS bug
> > > > > reports includes:
> > > >
> > > > Thank you Trond, I figured there was something missing but I didn't know
> > > > where to start but here goes:
> > > >
> > > > >
> > > > > - client OS (non-linux) or kernel version (linux)
> > > > Client OS Linux 3.4.39, x86
> > > >
> > > > > - mount options on the client
> > > > ~ # ypmatch jocke auto.home
> > > > -fstype=nfs,soft devsrv:/mnt/home/jocke
> > > >
> > > > > - server OS (non-linux) or kernel version (linux)
> > > > Server OS Linux 3.4.39, amd64
> > > >
> > > > > - type of exported filesystem on the server
> > > > XFS
> > > >
> > > > > - contents of /etc/exports on the server
> > > > more /etc/exports
> > > > # /etc/exports: NFS file systems being exported. See exports(5).
> > > > /mnt/home *(rw,async,root_squash,no_subtree_check)
> > > > /mnt/systemtest *(rw,sync,root_squash,no_subtree_check)
> > > > /mnt/TNM *(rw,sync,root_squash,no_subtree_check)
> > > > /tftproot *(rw,async,root_squash,no_subtree_check)
> > > > /mnt/images *(rw,async,no_root_squash,no_subtree_check,insecure)
> > > > /rescue *(ro,async,no_root_squash,no_subtree_check,insecure)
> > > >
> > > > /mnt/home is the one failing
> > > >
> > > > >
> > > > > Please ensure that you always include those in your emails.
> > > >
> > > > nfs.pcap:
> > > > http://ftp-us.transmode.se/get/?id=1bf2561ed2e7d4e379b2936319c82c25
> > > >
> > > > nfs2.pcap:
> > > > http://ftp-us.transmode.se/get/?id=759c7645248a426720da8e9ba7074040
> > > >
> > > > nfs3.pcap:
> > > > http://ftp-us.transmode.se/get/?id=051c6d771978b2407e15e96152bd6e66
> > > >
> > > > nfs4.pcap:
> > > > http://ftp-us.transmode.se/get/?id=5dfab4da6cbbe400697bc1621b541c9f
> > > >
> > > > nfs3.pcap is the gvfsd-metadata problem one can find using google, doesn't
> > > > have to be a NFS problem
> > > > The other 3 all come from surfing the www using firefox 17.0.3
> > >
> > > The nfs2.pcap file and nfs4.pcap seem to show the server returning
> > > NFS4ERR_OLD_STATEID, which usually means that the client has an
> > > OPEN/CLOSE/LOCK or LOCKU... in flight and that while the server has
> > > updated the stateid, the client has not yet received the reply. The
> > > problem is that I see no sign of the OPEN/CLOSE/LOCK/LOCKU...
> > >
> > > The nfs.pcap file is resending a load of LOCK requests that are
> > > receiving NFS4ERR_BAD_STATEID replies. Normally, I'd expect the recovery
> > > engine to kick in and try to recover the OPEN.
> > >
> > > So when you do 'ps -efwww', on any of these clients, do you see a
> > > process with a name containing the server IP address (192.168.201.44)?
> > >
> > > Also, is there anything special in the log when you do 'dmesg -s 90000'?
> > Of course this happened again while I wasn't looking so I don't know what
> > caused it, probably firefox though.
> >
> > There is nothing in dmesg and ps -efwww has no hit on IP
> > address 192.168.201.44, the closest I can get is:
> > ps -efwww | grep nfs
> > root       568     2  0 Apr16 ?        00:00:00 [nfsiod]
> > root      2440     2  0 Apr16 ?        00:00:00 [nfsd4]
> > root      2441     2  0 Apr16 ?        00:00:00 [nfsd4_callbacks]
> > root      2442     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2443     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2444     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2445     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2446     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2447     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2448     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2449     2  0 Apr16 ?        00:00:00 [nfsd]
> > root      2667     2  0 Apr16 ?        00:00:00 [nfsv4.0-svc]
> > jocke    27048 26888  0 14:28 pts/3    00:00:00 grep --colour=auto nfs
> >
> > Got a new pcap file also:
> > http://ftp-us.transmode.se/get/?id=6f935e1d7e105d01e9a5b907c6493521 nfs5.pcap
> >
> > The load is not that noticeable so I can stay in this mode a while, until I go
> > home today.
>
> So left it overnight and this morning my NFS client had completely locked up,
> had to press the power button. This has happened twice now.
>
> One more piece of info, we think this problem started when NFS server
> was upgraded from 3.4.28 to 3.4.39
>
> I have no idea how to move forward now. Trond, are you also stuck?
>
> Jocke
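
For reference, the "ERROR: unk 10024" that tcpdump printed in the first report is the numeric form of NFS4ERR_OLD_STATEID, which matches what the raw captures showed later in the thread. Below is a minimal sketch for mapping such numeric codes back to names. The table is only a small subset of the nfsstat4 values from the NFSv4 specification (RFC 3530), and the function name and regular expression are made up for illustration; they are not part of any existing tool.

```python
import re

# Subset of NFSv4 status codes (nfsstat4, RFC 3530) related to stateids.
# Deliberately incomplete; extend from the RFC as needed.
NFS4_STATEID_ERRORS = {
    10023: "NFS4ERR_STALE_STATEID",
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
    10026: "NFS4ERR_BAD_SEQID",
}


def decode_tcpdump_status(line):
    """Hypothetical helper: find 'ERROR: unk <number>' in a line of
    tcpdump text output and return the NFSv4 status name, or None
    if the line contains no such marker."""
    match = re.search(r"ERROR: unk (\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return NFS4_STATEID_ERRORS.get(code, "unknown (%d)" % code)


if __name__ == "__main__":
    print(decode_tcpdump_status("52 getattr ERROR: unk 10024"))
```

This only recovers the status name. As Trond points out above, the rest of the tcpdump text decode is unreliable for NFSv4, so a raw capture written with '-w' is still what is needed for real analysis.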