MIME-Version: 1.0
Date: Mon, 23 Mar 2015 10:13:41 -0400
Subject: Re: Recovery after BAD_SEQID
From: Trond Myklebust
To: Benjamin Coddington
Cc: Linux NFS Mailing List
Content-Type: text/plain; charset=UTF-8

On Mon, Mar 23, 2015 at 5:15 AM, Benjamin Coddington wrote:
> On Sun, 22 Mar 2015, Trond Myklebust wrote:
>
>> On Thu, Mar 19, 2015 at 6:48 AM, Benjamin Coddington wrote:
>> > I wrote yesterday about a RHEL6 bug, but I'd gotten some details wrong about
>> > the problem, so I'm starting new thread.
>> >
>> > It looks like getting BAD_SEQID back from an OPEN operation drops the state_owner
>> > which means that the state machine can't find or recover any other objects
>> > for that state_owner. That can get the client into unrecoverable loops. I
>> > can produce one of them with:
>> >
>> > 1) OPEN file1, OPEN file2
>> > 2) break the network for longer than the lease period
>> > 3) during recovery, have the server return BAD_SEQID for one of the OPENS
>> > 4) break the network again for longer than the lease period
>> > 5) WRITE to the file that recovered properly in #3
>> >
>> > This gets stuck in WRITE,NFS4ERR_EXPIRED.
>> >
>> > It looks like some cleanup is needed if we have to drop the whole
>> > state_owner. Alternatively, does it make sense to just drop the objects in
>> > that sequence?
>> >
>> >
>>
>> Ummm... Why are you seeing BAD_SEQID in the first place? That specific
>> error means that the client and server disagree on the sequencing of
>> the OPENs, which means there is a bug either on the client or on the
>> server.
>
> It definitely needs a server bug to get here, and unfortunately that server
> bug is out there. I'd like to have the client not get stuck when
> encountering this bug. Recovery here would mean that we return
> EIO instead of getting stuck endlessly trying to complete a write for
> another open file.

We do _not_ fix server bugs on the client.

> I wonder now what should be the position of the client upon "discovering"
> there's a bug somewhere. That bug could be client or server. Should the
> client blacklist the server at that point, or can other sequences continue?

It should do its best to report the server as being buggy, if that is
the case, and then make a limited effort to continue (the key word
here being: "limited").

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com