Date: Fri, 23 Sep 2022 09:13:40 -0600
From: Keith Busch
To: Jan Kara
Cc: Hugh Dickins, Jens Axboe, Yu Kuai, Liu Song,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH next] sbitmap: fix lockup while swapping
References: <20220921164012.s7lvklp2qk6occcg@quack3>
	<20220923144303.fywkmgnkg6eken4x@quack3>
In-Reply-To: <20220923144303.fywkmgnkg6eken4x@quack3>

On Fri, Sep 23, 2022 at 04:43:03PM +0200, Jan Kara wrote:
> On Wed 21-09-22 18:40:12, Jan Kara wrote:
> > On Mon 19-09-22 16:01:39, Hugh Dickins wrote:
> > > On Mon, 19 Sep 2022, Keith Busch wrote:
> > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote:
> > > > > I have almost no grasp of all the possible sbitmap races, and their
> > > > > consequences: but using the same !waitqueue_active() check as used
> > > > > elsewhere, fixes the lockup and shows no adverse consequence for me.
> > > > > 
> > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
> > > > > Signed-off-by: Hugh Dickins
> > > > > ---
> > > > > 
> > > > >  lib/sbitmap.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > --- a/lib/sbitmap.c
> > > > > +++ b/lib/sbitmap.c
> > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap
> > > > >  	 * function again to wakeup a new batch on a different 'ws'.
> > > > >  	 */
> > > > >  	if (cur == 0)
> > > > > -		return true;
> > > > > +		return !waitqueue_active(&ws->wait);
> > > > 
> > > > If it's 0, that is supposed to mean another thread is about to make it not zero
> > > > as well as increment the wakestate index. That should be happening after patch
> > > > 48c033314f37 was included, at least.
> > > 
> > > I believe that the thread about to make wait_cnt not zero (and increment the
> > > wakestate index) is precisely this interrupted thread: the backtrace shows
> > > that it had just done its wakeups, so has not yet reached making wait_cnt
> > > not zero; and I suppose that either its wakeups did not empty the waitqueue
> > > completely, or another waiter got added as soon as it dropped the spinlock.
> 
> I was trying to wrap my head around this but I am failing to see how we
> could have wait_cnt == 0 for long enough to cause any kind of stall let
> alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand
> we have:
> 
> CPU1                                    CPU2
> sbitmap_queue_wake_up()
>   ws = sbq_wake_ptr(sbq);
>   cur = atomic_read(&ws->wait_cnt);
>   do {
>     ...
>     wait_cnt = cur - sub;  /* this will be 0 */
>   } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
>   ...
>                                         /* Gets the same waitqueue */
>                                         ws = sbq_wake_ptr(sbq);
>                                         cur = atomic_read(&ws->wait_cnt);
>                                         do {
>                                           if (cur == 0)
>                                             return true;  /* loop */
>   wake_up_nr(&ws->wait, wake_batch);
>   smp_mb__before_atomic();
>   sbq_index_atomic_inc(&sbq->wake_index);
>   atomic_set(&ws->wait_cnt, wake_batch);  /* This stops looping on CPU2 */
> 
> So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come
> this takes so long that it causes a hang as you describe? Hum... So either
> CPU1 takes really long to get to atomic_set():
> - can CPU1 get preempted? Likely not at least in the context you show in
>   your message
> - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is
>   contended but still...
> 
> or CPU2 somehow sees cur==0 for longer than it should. The whole sequence
> executed in a loop on CPU2 does not contain anything that would force CPU2
> to refresh its cache and get new ws->wait_cnt value so we are at the mercy
> of CPU cache coherency mechanisms to stage the write on CPU1 and propagate
> it to other CPUs. But still I would not expect that to take significantly
> long. Any other ideas?

Thank you for the analysis. I arrived at the same conclusions. If this is a
preempt-enabled context, and there's just one CPU, I suppose the 2nd task
could spin in the while(), blocking the 1st task from resetting the
wait_cnt. I doubt that's happening though, at least for nvme, where we call
this function in irq context.
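
To make the window above concrete, here is a rough userspace model, not the
actual lib/sbitmap.c code: two threads stand in for CPU1 and CPU2, wait_cnt
and WAKE_BATCH mirror the names in Jan's sketch, and the usleep() exaggerates
the gap between CPU1 dropping wait_cnt to 0 and resetting it. Everything else
(thread setup, the retry counter, the timing) is purely illustrative.

/* wait_cnt_model.c - build with: cc -O2 -pthread wait_cnt_model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

enum { WAKE_BATCH = 8 };

static atomic_int wait_cnt = WAKE_BATCH;        /* stands in for ws->wait_cnt */

/* "CPU1": wins the cmpxchg, leaves wait_cnt at 0 while it does its wakeups. */
static void *cpu1(void *arg)
{
	int cur = atomic_load(&wait_cnt);
	int next;

	(void)arg;
	do {
		next = cur - WAKE_BATCH;        /* this will be 0 */
	} while (!atomic_compare_exchange_weak(&wait_cnt, &cur, next));

	usleep(2000);                           /* stands in for wake_up_nr() etc. */
	atomic_store(&wait_cnt, WAKE_BATCH);    /* the atomic_set() that ends the window */
	return NULL;
}

/* "CPU2": the caller's retry loop; counts how often it sees wait_cnt == 0. */
static void *cpu2(void *arg)
{
	long retries = 0;

	(void)arg;
	while (atomic_load(&wait_cnt) == 0)
		retries++;                      /* "return true" -> caller calls again */

	printf("CPU2 retried %ld times while wait_cnt was 0\n", retries);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cpu1, NULL);
	usleep(500);    /* let CPU1 drop wait_cnt to 0 before CPU2 looks */
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

In this model, Hugh's one-liner corresponds to CPU2 giving up the retry as
soon as no one is actually waiting (the !waitqueue_active() check) instead of
retrying unconditionally while wait_cnt reads 0; whether that is the right
fix for the kernel is exactly what the thread above is debating.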