From: Bogdan Costescu
Subject: Re: NFS server not responding
Date: Fri, 28 Nov 2003 19:43:24 +0100 (CET)
Sender: nfs-admin@lists.sourceforge.net
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Douglas Furlong
Received: from sc8-sf-mx1-b.sourceforge.net ([10.3.1.11] helo=sc8-sf-mx1.sourceforge.net)
        by sc8-sf-list1.sourceforge.net with esmtp (Cipher TLSv1:DES-CBC3-SHA:168)
        (Exim 3.31-VA-mm2 #1 (Debian)) id 1APnaT-0001Wj-00
        for ; Fri, 28 Nov 2003 10:43:29 -0800
Received: from relay2.uni-heidelberg.de ([129.206.210.211])
        by sc8-sf-mx1.sourceforge.net with esmtp (Exim 4.24) id 1APnaT-0003ul-AU
        for nfs@lists.sourceforge.net; Fri, 28 Nov 2003 10:43:29 -0800
To: Trond Myklebust
Errors-To: nfs-admin@lists.sourceforge.net
List-Id: Discussion of NFS under Linux development, interoperability, and testing.

On 28 Nov 2003, Trond Myklebust wrote:

> Huh? Why should a 1% retransmission make a noticeable difference?

I think that I wasn't too clear in my previous message. I did not mean
to suggest that the two things (retransmission rate and "server not
responding") are strongly correlated; rather, I provided another data
point and compared it with another setup using older kernels but the
same hardware. For example, one node has:

Client rpc stats:
calls      retrans    authrefrsh
4211065    38746      0

> uptime
 19:07:14  up 9 days,  5:59,  1 user,  load average: 1.00, 1.00, 1.00

and

> dmesg | grep -i "not responding" | wc -l
     45

Probably about half of the "not responding" messages were generated by
the previously mentioned "slocate" cron job before I disabled it, and
another 4-5 by another NFS server with user data that was unavailable
at some point. But the rest were generated at various times when the
NFS server was not so busy.
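
As a quick sanity check on the figures above, here is a minimal Python
sketch that derives the retransmission rate and the raw "not responding"
message rate from the quoted counters. The variable and function names
are illustrative only, and the per-day figure uses the full count of 45,
including the messages attributed to the "slocate" job and to the other
server.

```python
# Counters copied from the "Client rpc stats" output quoted above.
calls = 4211065
retrans = 38746

# Uptime of the same node: 9 days, 5 hours, 59 minutes.
uptime_days = 9 + (5 * 60 + 59) / (24 * 60)

# Total "not responding" lines counted in dmesg.
not_responding = 45


def retrans_rate(calls: int, retrans: int) -> float:
    """Return retransmissions as a fraction of all RPC calls."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return retrans / calls


rate = retrans_rate(calls, retrans)
print(f"retransmission rate: {rate:.2%}")
print(f"'not responding' messages per day: {not_responding / uptime_days:.1f}")
```

This gives a retransmission rate of about 0.92%, in line with the
"1%" figure in the quoted question, and roughly 4.9 messages per day
over the uptime shown.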

It's clear from what I've seen so far that if only one client is
generating massive NFS traffic, the server copes with it well and the
client does not display the "not responding" messages; I've tried
running the "slocate" cron job manually, along with other stress tests,
and did not get any such message. But I do get them when several tens
of nodes do it and, again, this did not happen with older kernels.

As I mentioned, I cannot get more details at the moment, as I'm in the
middle of a big software and hardware update. With the current
settings, things seem to work, so people can continue their work and
I'll debug these problems later... hopefully ;-)

> I get ~2% retransmission rate when I do UDP loopback mounts without
> seeing any problems at all: it still compares well to the same mount
> using TCP.

I don't think that we disagree here :-) I don't see anything wrong with
having some retransmissions, unless they amount to several tens of
percent of the total number of calls. The small percentage of
retransmissions doesn't bother me; the large number of "not responding"
messages does... I know, I can always increase the "retrans" and
"timeo" parameters to something very big, but I didn't need to do it
before...

> Now it may be that the Fedora kernel has some other crap in it that is
> screwing up interrupts & other such things (NAPI perhaps?).

NAPI for 3c59x, which is used in this node, doesn't exist. You can take
my word for it :-) But I cannot say anything about the rest... OTOH,
testing with a vanilla kernel on Fedora might break some things,
especially threaded applications, as glibc expects NPTL support in the
kernel; the answer to this question on the Red Hat lists doesn't get
clearer than that.

> Has anybody that is seeing these problems made a comparison with an
> equivalent stock Marcelo kernel?

I know that I can't claim anything until I do that comparison. But at
the moment, it's not possible. That's why I said that this is just
another data point...
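
To illustrate what raising "timeo" and "retrans" would buy, here is a
rough Python sketch of how long a client waits before it reports a
major timeout. It is a simplified model and an assumption, not
something stated in this thread: "timeo" is taken in tenths of a
second, each retry doubles the previous wait up to a 60 second cap (as
the nfs(5) man page describes for UDP), and a major timeout is declared
after "retrans" consecutive minor timeouts. It ignores any adaptive
round-trip estimation the kernel may apply, and the parameter values in
the loop are picked only for illustration.

```python
def major_timeout_deciseconds(timeo: int, retrans: int, cap: int = 600) -> int:
    """Total wait, in tenths of a second, before a major timeout.

    Simplified model (an assumption): the wait doubles after every minor
    timeout, no single wait exceeds `cap`, and the major timeout fires
    after `retrans` consecutive minor timeouts.
    """
    total = 0
    wait = timeo
    for _ in range(retrans):
        total += min(wait, cap)
        wait *= 2
    return total


# Hypothetical settings, chosen only to show how the total grows.
for timeo, retrans in [(7, 3), (7, 5), (30, 5)]:
    tenths = major_timeout_deciseconds(timeo, retrans)
    print(f"timeo={timeo} retrans={retrans}: {tenths / 10:.1f} s")
```

Under this model, larger values only delay the message; they do not
explain why the server stops answering within the original window.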

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De