Return-Path: Received: from fieldses.org ([173.255.197.46]:43406 "EHLO fieldses.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965301AbeALWHt (ORCPT ); Fri, 12 Jan 2018 17:07:49 -0500 Date: Fri, 12 Jan 2018 17:07:49 -0500 From: Bruce Fields To: Chuck Lever Cc: Trond Myklebust , Bruce Fields , Linux NFS Mailing List Subject: Re: NFSv4.1 regression with v4.15-rc Message-ID: <20180112220749.GA18468@fieldses.org> References: <337F485E-4E53-4EBF-8186-009326C281EC@oracle.com> <20171230180526.GA4141@fieldses.org> <874F5218-43E6-423C-9F94-4DFC07FFDF8D@oracle.com> <1515529615.26283.2.camel@primarydata.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: Sender: linux-nfs-owner@vger.kernel.org List-ID: On Tue, Jan 09, 2018 at 03:28:23PM -0500, Chuck Lever wrote: > > > > On Jan 9, 2018, at 3:26 PM, Trond Myklebust wrote: > > > > On Sun, 2017-12-31 at 13:35 -0500, Chuck Lever wrote: > >>> On Dec 30, 2017, at 1:14 PM, Chuck Lever > >>> wrote: > >>> > >>>> > >>>> On Dec 30, 2017, at 1:05 PM, Bruce Fields > >>>> wrote: > >>>> > >>>> On Wed, Dec 27, 2017 at 03:40:58PM -0500, Chuck Lever wrote: > >>>>> Last week I updated my test server from v4.14 to v4.15-rc4, and > >>>>> began to > >>>>> observe intermittent failures in the git regression suite on > >>>>> NFSv4.1. > >>>> > >>>> I haven't run that before. Should I just > >>>> > >>>> mount -overs=4.1 server:/fs /mnt/ > >>>> cd /mnt/ > >>>> git clone git://git.kernel.org/pub/scm/git/git.git > >>>> cd git > >>>> make test > >>>> > >>>> ? > >>> > >>> You'll need to install SVN and CVS on your client as well. > >>> The failures seem to occur only in the SVN/CVS related > >>> tests. > >>> > >>> > >>>>> I > >>>>> was able to reproduce these failures with NFSv4.1 on both TCP > >>>>> and RDMA, > >>>>> yet there has not been a reproduction with NFSv3 or NFSv4.0. > >>>>> > >>>>> The server hardware is a single-socket 4-core system with 32GB > >>>>> of RAM. > >>>>> The export is a tmpfs. Networking is 56Gb InfiniBand (or > >>>>> IPoIB). > >>>>> > >>>>> The git regression suite reports individual test failures in > >>>>> the SVN > >>>>> and CVS tests. On occasion, the client mount point freezes, > >>>>> requiring > >>>>> that the client be rebooted in order to unstick the mount. > >>>>> > >>>>> Just before Christmas, I bisected the problem to: > >>>> > >>>> Thanks for the report! I'll make some time for this next > >>>> week. What's > >>>> your client? > >> > >> Oops, I didn't answer this question. The client is v4.15-rc4. > >> > >> > >>>> I guess one start might be to see if the reproducer can be > >>>> simplified e.g. by running just one of the tests from the suite. > >>> > >>> The failures are intermittent, and occur in a different test > >>> each time. You have to wait for the 9000-series scripts, which > >>> test SVN/CVS repo operations. To speed up time-to-failure, use > >>> "make -jN test" where N is more than a few. > >>> > >>> My client and server both have multiple real cores. I'm > >>> thinking it's the server that matters here (possibly a race > >>> condition is introduced by the below commit?). > >>> > >>> > >>>> --b. > >>>> > >>>>> > >>>>> commit 659aefb68eca28ba9aa482a9fc64de107332e256 > >>>>> Author: Trond Myklebust > >>>>> Date: Fri Nov 3 08:00:13 2017 -0400 > >>>>> > >>>>> nfsd: Ensure we don't recognise lock stateids after freeing > >>>>> them > >>>>> > >>>>> In order to deal with lookup races, nfsd4_free_lock_stateid() > >>>>> needs > >>>>> to be able to signal to other stateful functions that the > >>>>> lock stateid > >>>>> is no longer valid. Right now, nfsd_lock() will check whether > >>>>> or not an > >>>>> existing stateid is still hashed, but only in the "new lock" > >>>>> path. > >>>>> > >>>>> To ensure the stateid invalidation is also recognised by the > >>>>> "existing lock" > >>>>> path, and also by a second call to nfsd4_free_lock_stateid() > >>>>> itself, we can > >>>>> change the type to NFS4_CLOSED_STID under the stp->st_mutex. > >>>>> > >>>>> Signed-off-by: Trond Myklebust >>>>> om> > >>>>> Signed-off-by: J. Bruce Fields > >>>>> > >>>>> > > > > So, I'm thinking that release_open_stateid_locks() and > > nfsd4_release_lockowner() should probably be setting NFS4_CLOSED_STID > > when they call unhash_lock_stateid() (sorry for missing that). > > Send me a patch and I can test it. I'm pretty confused by that patch. Among other things, it's setting sc_type to NFS4_CLOSED_STID immediately before calling release_lock_stateid() which itself will set sc_type to 0. --b.