Date: Thu, 19 Mar 2015 06:48:47 -0400 (EDT)
From: Benjamin Coddington <bcodding@redhat.com>
To: linux-nfs@vger.kernel.org
Subject: Recovery after BAD_SEQID
Message-ID: <alpine.OSX.2.19.9992.1503190633310.947@planck.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-nfs-owner@vger.kernel.org

I wrote yesterday about a RHEL6 bug, but I'd gotten some details wrong about
the problem, so I'm starting a new thread.

It looks like getting BAD_SEQID back from an OPEN operation drops the
state_owner, which means that the state machine can't find or recover any
other objects for that state_owner.  That can get the client into
unrecoverable loops.  I can produce one of them with:

1) OPEN file1, OPEN file2
2) break the network for longer than the lease period
3) during recovery, have the server return BAD_SEQID for one of the OPENs
4) break the network again for longer than the lease period
5) WRITE to the file that recovered properly in #3

This gets stuck in a loop of WRITE, NFS4ERR_EXPIRED.
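
To make the sequence concrete, here is a toy model of the five steps in
Python.  It is only a sketch of the suspected behavior, not kernel code:
ToyClient, StateOwner, OpenState and all of their methods are hypothetical
names, and the "drop the whole owner on BAD_SEQID" rule is the assumption
being illustrated.  A real client would retry indefinitely; the model caps
the retries so that it terminates.

```python
from dataclasses import dataclass, field


@dataclass
class OpenState:
    filename: str
    valid: bool = True  # False once the server has expired the lease


@dataclass(eq=False)  # compare owners by identity, not by contents
class StateOwner:
    states: list = field(default_factory=list)


class ToyClient:
    """Hypothetical model; none of these names come from the kernel."""

    def __init__(self):
        self.owners = []      # owners that recovery is able to walk
        self.all_states = []  # every open state, reachable or not

    def open(self, owner, filename):
        state = OpenState(filename)
        owner.states.append(state)
        self.all_states.append(state)
        if owner not in self.owners:
            self.owners.append(owner)
        return state

    def lease_expired(self):
        # Steps 2 and 4: the server forgets all of this client's state.
        for state in self.all_states:
            state.valid = False

    def recover(self, bad_seqid_for=frozenset()):
        # Step 3: reclaim each open; the server answers BAD_SEQID for
        # any filename listed in bad_seqid_for.
        for owner in list(self.owners):
            for state in owner.states:
                if state.filename in bad_seqid_for:
                    # Assumed behavior: the whole owner is dropped, so
                    # its other states become unreachable to recovery.
                    self.owners.remove(owner)
                    break
                state.valid = True

    def write(self, state, max_attempts=5):
        # Step 5: on an expired state, run recovery and retry.
        for _ in range(max_attempts):
            if state.valid:
                return "OK"
            self.recover()
        return "NFS4ERR_EXPIRED"


client = ToyClient()
owner = StateOwner()
file1 = client.open(owner, "file1")
file2 = client.open(owner, "file2")
client.lease_expired()
client.recover(bad_seqid_for={"file2"})  # file1 recovers, owner is dropped
client.lease_expired()
print(client.write(file1))
```

In this model file1 recovers in step 3, but after the second lease expiry
recovery no longer has an owner to walk, so every WRITE retry sees the same
expired state.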

It looks like some cleanup is needed if we have to drop the whole
state_owner.  Alternatively, does it make sense to just drop the objects in
that sequence?

Ben