From: Dennis Nezic
Subject: Re: nfs: server not responding, timed out
Date: Sat, 20 Mar 2010 16:28:45 -0400
Message-ID: <20100320162845.c6b7b6c4.dennisn@dennisn.dyndns.org>
References: <20100318170603.f6a7f188.dennisn@dennisn.dyndns.org>
 <4BA2DFC5.1010400@cn.fujitsu.com>
 <20100319002720.0e93411e.dennisn@dennisn.dyndns.org>
 <20100319181038.c94fa3c4.dennisn@dennisn.dyndns.org>
 <20100320105237.1353566e.dennisn@dennisn.dyndns.org>
 <1269099753.12826.74.camel@oxygen.netxsys.com>
To: linux-nfs@vger.kernel.org

On Sat, 20 Mar 2010 11:42:33 -0400, Krzysztof Adamski wrote:
> On Sat, 2010-03-20 at 10:52 -0400, Dennis Nezic wrote:
> > On Fri, 19 Mar 2010 18:10:38 -0400, Dennis Nezic wrote:
> > > On Fri, 19 Mar 2010 00:27:20 -0400, Dennis Nezic wrote:
> > > > On Fri, 19 Mar 2010 10:21:57 +0800, Bian Naimeng wrote:
> > > > > > After upgrading my server (kernel 2.6.19 to 2.6.33,
> > > > > > nfs-utils 1.1.0 to 1.2.1/1.1.4/1.1.6, and probably other
> > > > > > stuff too), and possibly my client laptop's kernel, I have
> > > > > > suddenly started to get these "server X not responding,
> > > > > > timed out" errors (on my client), especially (only?) when
> > > > > > doing large file transfers. This would lead to input/output
> > > > > > errors, and the transfers would fail.
> > > > > >
> > > > > > I never noticed any such problems for over two years, using
> > > > > > the older versions. The networking (wifi link) hasn't
> > > > > > changed.
> > > > > >
> > > > > > Usually the file transfer trips and falls over itself near
> > > > > > the end -- i.e. it will do 600MB out of 800MB just fine,
> > > > > > and then suddenly start giving these "timed out" errors,
> > > > > > and then crash and burn. At this point, I am forced to
> > > > > > "umount -fl" the mount. If I then try to remount it, the
> > > > > > server acknowledges my "authenticated mount requests"
> > > > > > perfectly fine, but my client (laptop) still appears
> > > > > > "hung". After a few minutes, I am able to remount it.
> > > > > >
> > > > > > I tried playing with the rsize/wsize/timeo/retrans
> > > > > > variables, but none of it seemed to fix the problem.
> > > > > >
> > > > > > Any ideas about what has changed? Maybe this is/was a
> > > > > > well-known problem? :P
> > > > > >
> > > > >
> > > > > I do not know what the reason is. And I am not sure the
> > > > > following discussion can fix this problem, but maybe it can
> > > > > help you. http://marc.info/?l=linux-nfs&m=123478426412524&w=2
> > > >
> > > > Both the patches mentioned in that thread already seem to have
> > > > been applied to my kernels. So, although the problem seems
> > > > related, it wasn't that bug in particular. The person in that
> > > > thread was talking about mounts dying after 5-15 minutes, which
> > > > doesn't happen with me -- my problem only seems to occur under
> > > > intense activity.
> > >
> > > Hrm. I just noticed that my scp transfers are stalling -- which
> > > also didn't used to happen before with my old kernel. No error
> > > messages. Ftp transfers work fine. Eek. :S. (Despite the
> > > freezing/stalling, my *actual* network connection works
> > > perfectly.)
> > >
> > > Ideas?
> >
> > Changing the mount options from "soft" to "hard" seems to "work"
> > -- at least the transfers eventually finish! Although there are
> > still stalls of 6-8 minutes ... between the 16 syslog error
> > messages: "nfs: server XYZ not responding, still trying" and the
> > 16 subsequent error messages: "nfs: server XYZ OK". The key
> > difference being that with "hard", it is "still trying" rather
> > than "timed out".
> >
> > Now why is it stalling for so long?
>
> I can't tell you why, but I had the same problem with NFS server in
> 2.6.32.*. Try 2.6.31.something to see if the problem goes away.

I'll try that. (By the way, do you also access your nfs server over
wifi? It wouldn't happen to be the b43 driver on the client side? I
only ask because somehow (by setting timeo=10) I managed to get my
client in a state where the transfer (actually an mplayer stream)
seemed frozen, but the wifi activity still appeared to be streaming.
Although, this didn't happen before when timeo was the default
10 minutes, so it's probably unrelated.)

Here is a gratuitous graph from when I tried to transfer a ~2G file
from my nfs server to my wifi nfs client. The plateaus are where it
stalls (no net traffic (although the network still works fine)),
usually for about 16 minutes, which includes the default timeo=600(s)
plus the ~6 min delay between the "server not responding" and "OK"
messages.

http://dennisn.dyndns.org/guest/pubstuff/nfs-debug/nfs-stalling-2g-file-transfer.jpg

Maybe I should also note that during the "stalls", the
"rpcinfo -t server 1000XY 3" checks (I use nfs3) all report "ready and
waiting". Maybe there are other things I can check to pinpoint the
fault?
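
A note on units, since timeo comes up several times in this thread: nfs(5) documents timeo in deciseconds (tenths of a second), so timeo=600 would mean 60 seconds and timeo=10 would mean 1 second, not 10 minutes. Below is a minimal Python sketch of the per-retry waits under the linear backoff that the man page describes for TCP mounts (each retransmission waits one more timeo, capped at 600 seconds). The helper name retry_waits and the retrans value of 3 are illustrative assumptions, and the kernel's real accounting of when it logs "not responding" may well differ.

```python
MAX_TIMEOUT_S = 600.0  # cap on a single wait under linear backoff, per nfs(5)


def retry_waits(timeo_ds, retrans):
    """Return the wait in seconds before each of `retrans` retransmissions.

    Hypothetical helper, not kernel code: it only models the linear
    backoff described in nfs(5), where the wait grows by `timeo` after
    every retransmission.
    """
    base = timeo_ds / 10.0  # timeo is expressed in tenths of a second
    return [min(base * (i + 1), MAX_TIMEOUT_S) for i in range(retrans)]


# Compare the documented TCP default (600) with the timeo=10 experiment.
for timeo in (600, 10):
    waits = retry_waits(timeo, retrans=3)
    print(f"timeo={timeo}: waits {waits} s, total {sum(waits)} s")
```

Under these assumptions, timeo=600 with three retries adds up to 360 seconds. That is close to the ~6 minute gap between the "not responding" and "OK" messages, though it could be a coincidence.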