Return-Path: Subject: Re: [PATCH v7 13/31] NFSv4.1: Ensure we always run TEST/FREE_STATEID on locks To: Benjamin Coddington References: <1474565961-21303-1-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-9-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-10-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-11-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-12-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-13-git-send-email-trond.myklebust@primarydata.com> <1474565961-21303-14-git-send-email-trond.myklebust@primarydata.com> <599EE56B-46DD-411B-805D-11C2FB5E30A4@redhat.com> <34B1D68A-1A1C-4B59-A19E-467D48F7A9D0@redhat.com> <6ABCDB9B-997E-49C1-9363-D59AF9BEC0E9@primarydata.com> <806bf204-eb35-5a3a-30fa-612bf22fb09a@Netapp.com> CC: Trond Myklebust , List Linux NFS Mailing , Oleg Drokin From: Anna Schumaker Message-ID: <50b6aeb9-cb21-9f46-dadd-e7ba0f5d86ed@Netapp.com> Date: Thu, 10 Nov 2016 15:54:46 -0500 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8" List-ID: On 11/10/2016 03:18 PM, Benjamin Coddington wrote: > > On 10 Nov 2016, at 10:58, Benjamin Coddington wrote: > >> Hi Anna, >> >> On 10 Nov 2016, at 10:01, Anna Schumaker wrote: >>> Do you have an estimate for when this patch will be ready? I want to include it in my next bugfix pull request for 4.9. >> >> I haven't posted because I am still trying to get to the bottom of another >> problem where the client gets stuck in a loop sending the same stateid over >> and over on NFS4ERR_OLD_STATEID. I want to make sure this problem isn't >> caused by this fix -- which I don't think it is, but I'd rather make sure. >> If I don't make any progress on this problem by the end of today, I'll post >> what I have. >> >> Read on if interested in this new problem: >> >> It looks like racing opens with the same openowner can be returned out of >> order by the server, so the client sees stateid seqid of 2 before 1. Then a >> LOCK sent with seqid 1 is endlessly retried if sent while doing recovery. >> >> It's hard to tell if I was able to capture all the moving parts to describe >> this problem, though. As it takes a very long time for me to reproduce, and >> the packet captures were dropping frames. I'm working on manually >> reproducing it now. > > Anna, > > I haven't gotten to the bottom of it, and so I'm not confident it isn't a > problem created by the fix I've been testing, which is: > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > index e809498..2aa9d86 100644 > --- a/fs/nfs/nfs4proc.c > +++ b/fs/nfs/nfs4proc.c > @@ -2564,12 +2564,15 @@ static void nfs41_check_delegation_stateid(struct > nfs4_state *state) > static int nfs41_check_expired_locks(struct nfs4_state *state) > { > int status, ret = NFS_OK; > - struct nfs4_lock_state *lsp; > + struct nfs4_lock_state *lsp, *tmp; > struct nfs_server *server = NFS_SERVER(state->inode); > > if (!test_bit(LK_STATE_IN_USE, &state->flags)) > goto out; > - list_for_each_entry(lsp, &state->lock_states, ls_locks) { > + spin_lock(&state->state_lock); > + list_for_each_entry_safe(lsp, tmp, &state->lock_states, ls_locks) { > + atomic_inc(&lsp->ls_count); > + spin_unlock(&state->state_lock); > if (test_bit(NFS_LOCK_INITIALIZED, &lsp->ls_flags)) { > struct rpc_cred *cred = > lsp->ls_state->owner->so_cred; > > @@ -2588,7 +2591,10 @@ static int nfs41_check_expired_locks(struct > nfs4_state *state) > break; > } > } > - }; > + nfs4_put_lock_state(lsp); > + spin_lock(&state->state_lock); > + } > + spin_unlock(&state->state_lock); > out: > return ret; > } > > http://people.redhat.com/bcodding/old_stateid_loop is tshark output of my > only good wirecapture of the problem. Without this patch, generic/089 > crashes long before this problem is reproduced, so I am stuck figuring it > out, I'm afraid. Don't wait on my account. > > I plan on trying a bit more to reproduce tomorrow, and if I cannot, I'll > write about it under separate cover. Sounds good. Thanks for the update! Anna > > Ben