Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:34850 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761868AbbEERYF (ORCPT ); Tue, 5 May 2015 13:24:05 -0400 Date: Tue, 5 May 2015 13:24:03 -0400 (EDT) From: Benjamin Coddington To: Joschi Brauchle cc: "linux-nfs@vger.kernel.org" Subject: Re: NFS Client looping with STALE_STATE_ID error - 100% load on one CPU core In-Reply-To: <5538C66A.1020302@tum.de> Message-ID: References: <5538C66A.1020302@tum.de> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-nfs-owner@vger.kernel.org List-ID: Hi J Brauchle, Are you producing this on a kernel that has "8faaa6d Fixing lease renewal"? https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8faaa6d5d48b2015 If not, I think that might be what you need. Ben On Thu, 23 Apr 2015, Joschi Brauchle wrote: > Hello everyone, > > I am posting to this list as suggested by Neil F Brown in this OpenSUSE > bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=909634 > > > We are mounting user homes and other shares using 'sec=krb' and NFSv4. With > kernel 3.16.7+ (includes latest NFS patches, please see BZ link for details) > we are seeing 100% CPU core usage by a kernel process on some machine > connected to a 'NetApp FAS 3140 with ONTAP 8.1.4P1 7-mode' NFSv3/4 Server. It > looks like the problem is triggered when the NFS server is unavailable for a > short period of time or was rebooted. > > Neil F Brown has come to the following conclusion so far: > ------------- QUOTE ------------- > What I see happening here is that a WRITE or LOCK request gets > NFS4ERR_STALE_STATEID, and then the client sends a "RENEW" request. > > If the stateid is really stale, that should fail, but it doesn't. So the > client assumes everything is OK and keeps sending requests .... which fail > with NFS4ERR_STALE_STATEID. > > STALE_STATEID means: > > A stateid generated by an earlier server instance was used. > > which means that the server must have restarted since that stateid was issued. > > So there seem to be two possible explanations: > > 1/ The server is lying. The stateid isn't really 'stale'. Maybe it is 'bad'. > 2/ The client has previously handled a STALE_STATEID error and created a new > "client" handle, but hasn't refreshed all of the stateids that it was holding. > So it is trying to use old stateid even though it has established a new > client. > > The client code looks OK, though it is sufficiently complex that I cannot be > certain. There are no patches in more recent versions. > ------------- END ------------- > > We have some RPC/NFS debug logs available here: > https://bugzilla.novell.com/attachment.cgi?id=631962 > And a wireshark trace here: > https://bugzilla.novell.com/attachment.cgi?id=632104 > > > What I cannot do: > - Trigger or reproduce the problem reliably. It comes up every once in a > while. Around 2-3 times a month on a handful of ~50 machines. > - Provide logs from with the latest kernel (because I cannot trigger the > problem by any means. Updating all ~50 machines is not an option) > > > What I can do: > - Provide logs from or possibly even access to the machine in the 'broken' > state for the next couple of days. I will have to reboot it at some time to > take the load of the NFS server. > > > > I would be grateful for any help or comments. > > -J Brauchle