From: "Benjamin Coddington"
To: "Murphy Zhou"
Cc: "Trond Myklebust", linux-nfs@vger.kernel.org
Subject: Re: [PATCH] NFSv4: fix stateid refreshing when CLOSE racing with OPEN
Date: Fri, 04 Sep 2020 06:55:01 -0400
In-Reply-To: <20200904030411.enioqeng4wxftucd@xzhoux.usersys.redhat.com>
References: <20191010074020.o2uwtuyegtmfdlze@XZHOUW.usersys.redhat.com>
 <20191011084910.joa3ptovudasyo7u@xzhoux.usersys.redhat.com>
 <6AAFBD30-1931-49A8-8120-B7171B0DA01C@redhat.com>
 <20200904030411.enioqeng4wxftucd@xzhoux.usersys.redhat.com>
X-Mailing-List: linux-nfs@vger.kernel.org

On 3 Sep 2020, at 23:04, Murphy Zhou wrote:

> Hi Benjamin,
>
> On Thu, Sep 03, 2020 at 01:54:26PM -0400, Benjamin Coddington wrote:
>>
>> On 11 Oct 2019, at 10:14, Trond Myklebust wrote:
>>> On Fri, 2019-10-11 at 16:49 +0800, Murphy Zhou wrote:
>>>> On Thu, Oct 10, 2019 at 02:46:40PM +0000, Trond Myklebust wrote:
>>>>> On Thu, 2019-10-10 at 15:40 +0800, Murphy Zhou wrote:
>> ...
>>>>>> @@ -3367,14 +3368,16 @@ static bool nfs4_refresh_open_old_stateid(nfs4_stateid *dst,
>>>>>> 			break;
>>>>>> 		}
>>>>>> 		seqid_open = state->open_stateid.seqid;
>>>>>> -		if (read_seqretry(&state->seqlock, seq))
>>>>>> -			continue;
>>>>>>
>>>>>> 		dst_seqid = be32_to_cpu(dst->seqid);
>>>>>> -		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) >= 0)
>>>>>> +		if ((s32)(dst_seqid - be32_to_cpu(seqid_open)) > 0)
>>>>>> 			dst->seqid = cpu_to_be32(dst_seqid + 1);
>>>>>
>>>>> This negates the whole intention of the patch you reference in the
>>>>> 'Fixes:', which was to allow us to CLOSE files even if seqid bumps
>>>>> have been lost due to interrupted RPC calls, e.g. when using 'soft'
>>>>> or 'softerr' mounts.
>>>>> With the above change, the check could just be tossed out
>>>>> altogether, because dst_seqid will never become larger than
>>>>> seqid_open.
>>>>
>>>> Hmm.. I got it wrong. Thanks for the explanation.
>>>
>>> So to be clear: I'm not saying that what you describe is not a
>>> problem. I'm just saying that the fix you propose is really no better
>>> than reverting the entire patch.
>>> I'd prefer not to do that, and would rather see us look for ways to
>>> fix both problems, but if we can't find such a fix then that would be
>>> the better solution.
>>
>> Hi Trond and Murphy Zhou,
>>
>> Sorry to resurrect this old thread, but I'm wondering if any progress
>> was made on this front.
>
> This failure stopped showing up since the v5.6-rc1 release cycle
> in my records. Can you reproduce this on the latest upstream kernel?

I'm seeing it on generic/168 with a v5.8 client against a v5.3 knfsd
server. When I test against a v5.8 server, the test takes longer to
complete and I have yet to reproduce the livelock:

- on the v5.3 server it takes ~50 iterations to reproduce, and each test
  completes in ~40 seconds
- on the v5.8 server my test has run ~750 iterations without getting into
  the livelock, and each test takes ~60 seconds

I suspect recent changes to the server have changed the timing of open
replies such that the problem isn't reproduced on the client.

Ben