From: comsatcat
Subject: Re: Spontaneous server reboot with 2.6.10 and nfsd
Date: Fri, 11 Feb 2005 13:17:30 -0700
Message-ID: <1108153050.9386.3.camel@solaris.skunkware.org>
References: <420CAB6E.4010003@holviala.com>
Mime-Version: 1.0
Content-Type: text/plain
Cc: nfs@lists.sourceforge.net
Received: from sc8-sf-mx2-b.sourceforge.net ([10.3.1.12] helo=sc8-sf-mx2.sourceforge.net)
	by sc8-sf-list2.sourceforge.net with esmtp (Exim 4.30) id 1CzhEQ-0002iW-NA
	for nfs@lists.sourceforge.net; Fri, 11 Feb 2005 12:17:38 -0800
Received: from smtp2.eldosales.com ([63.78.12.18] helo=tweeter.eldosales.com)
	by sc8-sf-mx2.sourceforge.net with esmtp (Exim 4.41) id 1CzhEP-0004Up-4X
	for nfs@lists.sourceforge.net; Fri, 11 Feb 2005 12:17:38 -0800
To: Kim Holviala
In-Reply-To: <420CAB6E.4010003@holviala.com>
Sender: nfs-admin@lists.sourceforge.net
Errors-To: nfs-admin@lists.sourceforge.net
List-Id: Discussion of NFS under Linux development, interoperability, and testing.

I'm not sure whether this is related, but on a batch of 8 servers running
2.6.9 and 2.6.10 and pushing 300-600mb/s we're seeing the same thing, using
32k r/wsize with jumbo frames (MTU 9000). We push files of all sizes (a few
bytes up to 2+ gigs), so we haven't been able to link this to specific
file sizes.

Do you have a kernel version that used to work for you that I can test
on some of our boxes?

Note that we are also running Gentoo 2004.3 on all 8 servers.

Thanks,
Ben

On Fri, 2005-02-11 at 14:56 +0200, Kim Holviala wrote:
> I already posted this to LKML, but I don't think anyone was interested
> there... Here's the original posting:
>
> ===============
> I hit an obscure bug last night when trying to copy files from an nfs
> client to my nfs server. The server is a P3/800 with three IDE disks in
> software RAID5 running vanilla 2.6.10 and Debian Sarge. The network is
> local 100Mbit/s switched ethernet.
> The server exports a 220 gig
> partition which contains a lot of data.
>
> Oh, kernel configs and stuff from the server can be found at:
> http://www.holviala.com/~kimmy/crash/
>
> Anyway, I mount the export on a Linux client (tried a few, with
> different 2.6 kernels and distros) and then start copying files from the
> client's CDROM to the server through NFS. After copying a few small
> files, the first big one reboots the server. There are no log entries,
> and the server has no local console so I don't know what happens. This
> is reproducible 100% of the time.
>
> To narrow down the problem, I've tried the following:
>
> - copied files from a different client running Gentoo: reboot
> - exported a non-raided partition (hdc9) and tried that: reboot
> - switched 2.6.10 to 2.6.11-rc3: reboot, but it took longer
>
> I hope it's just something that I've done, but this server has been in
> use for a long time now without any problems, and I haven't touched it
> for a while.
>
> So, if anyone knows what's wrong, or can tell me a way to debug the
> situation more I'd be grateful. The server is in a place where it's
> nearly impossible to have a local console - I could probably use a
> serial one if necessary for debugging.
> ===============
>
> So, that was my original posting. Since then I've tried localhost
> mounts, tcp, udp, different r/wsizes etc etc. I can still reliably
> reboot the server remotely just by copying something to the NFS mount :-/.
>
> Now, there are two things that I've tested that worked better than
> others: First I switched to async exports, mounted localhost:/export/tmp
> with udp and copied stuff there. The copy hung
> (http://www.holviala.com/~kimmy/crash/nfsd.log) but the server didn't
> crash. Woo! Tried that remotely and it once again rebooted the server...
>
> And then I made one test with tcp,rsize=1024,wsize=1024 again with
> localhost:/export/tmp, and that worked ok. I haven't had the time to
> test that remotely, yet.
>
> So, I can only assume that there's something wrong with using an r/wsize
> which is bigger than the MTU. However, I run a lot of stuff through that
> same network and I never see any TCP retransmissions or any other
> problems. Besides, I'm getting the same reboot even with localhost NFS
> mounts.
>
> I have managed to capture some logs with nfsd logging on, those can be
> found at the above link.
>
> I'd be grateful for any pointers, debugging flags, anything. I've
> crashed my server now maybe three dozen times trying to narrow the
> problem down....
>
>
>
> Kim
>
>
>
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> NFS maillist - NFS@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nfs
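
For reference, here is a rough sketch of the arithmetic behind the
"r/wsize bigger than MTU" guess quoted above, for NFS over UDP only. It
counts how many IP fragments a single write request would be split into.
The function name, the 20-byte IPv4 header (no options), and the 200-byte
RPC/NFS header allowance are all assumptions made for illustration, not
measured values from either setup. It also does not by itself explain the
reboots seen over TCP or on localhost mounts.

```python
import math

IP_HEADER = 20      # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8      # bytes
RPC_OVERHEAD = 200  # bytes, placeholder for RPC + NFS headers (assumption)


def udp_fragments(wsize, mtu, rpc_overhead=RPC_OVERHEAD):
    """Estimate the number of IP fragments for one NFS-over-UDP write."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    # The UDP header travels inside the IP payload and is fragmented with it.
    datagram = wsize + rpc_overhead + UDP_HEADER
    return math.ceil(datagram / per_fragment)


if __name__ == "__main__":
    print(udp_fragments(32 * 1024, 1500))  # 23
    print(udp_fragments(32 * 1024, 9000))  # 4
    print(udp_fragments(1024, 1500))       # 1
```

Under these assumptions a 32k write at MTU 1500 is reassembled from about
23 fragments, against about 4 with jumbo frames, while a 1024-byte write
fits in a single packet.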