Subject: Re: [PATCH RFC -next] sbitmap: fix possible io hung due to lost wakeups
From: Yu Kuai
To: Jan Kara
Date: Mon, 20 Jun 2022 21:44:16 +0800
References: <20220617141125.3024491-1-yukuai3@huawei.com>
 <20220620122413.2fewshn5u6t2y4oi@quack3.lan>
 <20220620124831.g7bswgivvg5urv3d@quack3.lan>
In-Reply-To: <20220620124831.g7bswgivvg5urv3d@quack3.lan>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2022/06/20 20:48, Jan Kara wrote:
> On Mon 20-06-22 14:24:13, Jan Kara wrote:
>> On Fri 17-06-22 22:11:25, Yu Kuai wrote:
>>> Currently, the same waitqueue might be woken up continuously:
>>>
>>> __sbq_wake_up				__sbq_wake_up
>>>  sbq_wake_ptr -> assume 0
>>>					 sbq_wake_ptr -> 0
>>>  atomic_dec_return
>>>					 atomic_dec_return
>>>  atomic_cmpxchg -> succeed
>>>					 atomic_cmpxchg -> failed
>>>					 return true
>>>
>>> __sbq_wake_up
>>>  sbq_wake_ptr
>>>   atomic_read(&sbq->wake_index) -> still 0
>>>   sbq_index_atomic_inc -> inc to 1
>>>   if (waitqueue_active(&ws->wait))
>>>    if (wake_index != atomic_read(&sbq->wake_index))
>>>     atomic_set -> reset from 1 to 0
>>>  wake_up_nr -> wake up first waitqueue
>>>  // continue to wake up in first waitqueue
>>>
>>> What's worse, an I/O hang is possible in theory because wakeups might
>>> be missed. For example, 2 * wake_batch tags are put, while only
>>> wake_batch threads are woken:
>>>
>>> __sbq_wake_up
>>>  atomic_cmpxchg -> reset wait_cnt
>>>			__sbq_wake_up -> decrease wait_cnt
>>>			...
>>>			__sbq_wake_up -> wait_cnt is decreased to 0 again
>>>			 atomic_cmpxchg
>>>			 sbq_index_atomic_inc -> increase wake_index
>>>			 wake_up_nr -> wake up and waitqueue might be empty
>>>  sbq_index_atomic_inc -> increase again, one waitqueue is skipped
>>>  wake_up_nr -> invalid wake up because old waitqueue might be empty
>>>
>>> To fix the problem, refactor to make sure waitqueues will be woken up
>>> one by one.
>>>
>>> Signed-off-by: Yu Kuai
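
By the way, the batch accounting is easier to experiment with in a small
userspace model. The sketch below is only an illustration using C11
atomics instead of the kernel's atomic_t, and the names here
(wake_up_one_batch(), index_inc()) are made up for the example, not
kernel API; it is also driven single-threaded, so it demonstrates the
intended dec/cmpxchg/advance flow rather than the race itself:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define SBQ_WAIT_QUEUES	8
	#define WAKE_BATCH	4

	static atomic_int wait_cnt[SBQ_WAIT_QUEUES];
	static atomic_int wake_index;

	/* Round-robin successor, like sbq_index_inc() in the kernel. */
	static int index_inc(int idx)
	{
		return (idx + 1) % SBQ_WAIT_QUEUES;
	}

	/*
	 * Decrement wait_cnt; once a batch is used up, exactly one caller
	 * wins the cmpxchg that resets the count and advances wake_index.
	 * A caller that loses the cmpxchg returns true and must retry, so
	 * its decrement is not turned into a lost wakeup.
	 */
	static bool wake_up_one_batch(void)
	{
		int idx = atomic_load(&wake_index);
		int cnt = atomic_fetch_sub(&wait_cnt[idx], 1) - 1;

		if (cnt > 0)
			return false;	/* batch not exhausted yet */

		if (atomic_compare_exchange_strong(&wait_cnt[idx], &cnt,
						   WAKE_BATCH)) {
			atomic_compare_exchange_strong(&wake_index, &idx,
						       index_inc(idx));
			printf("wake %d waiters on queue %d\n",
			       WAKE_BATCH, idx);
			return false;
		}

		return true;	/* lost the cmpxchg, caller must retry */
	}

	int main(void)
	{
		for (int i = 0; i < SBQ_WAIT_QUEUES; i++)
			atomic_store(&wait_cnt[i], WAKE_BATCH);

		/* Put 2 * wake_batch tags: expect two full batches woken. */
		for (int i = 0; i < 2 * WAKE_BATCH; i++)
			while (wake_up_one_batch())
				;
		return 0;
	}

With every queue starting at WAKE_BATCH, putting 2 * WAKE_BATCH tags must
wake two full batches on two different queues; the broken interleavings
above are exactly the ones where one of those batches disappears.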

>>
>> So as far as I can tell your patch does not completely fix this race. See
>> below:
>>
>>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>>> index ae4fd4de9ebe..dc2959cb188c 100644
>>> --- a/lib/sbitmap.c
>>> +++ b/lib/sbitmap.c
>>> @@ -574,66 +574,69 @@ void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
>>>  }
>>>  EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
>>>
>>> -static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>>> +static void sbq_update_wake_index(struct sbitmap_queue *sbq,
>>> +				  int old_wake_index)
>>>  {
>>>  	int i, wake_index;
>>> -
>>> -	if (!atomic_read(&sbq->ws_active))
>>> -		return NULL;
>>> +	struct sbq_wait_state *ws;
>>>
>>>  	wake_index = atomic_read(&sbq->wake_index);
>>> -	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>>> -		struct sbq_wait_state *ws = &sbq->ws[wake_index];
>>> +	if (old_wake_index != wake_index)
>>> +		return;
>>>
>>> +	for (i = 1; i < SBQ_WAIT_QUEUES; i++) {
>>> +		wake_index = sbq_index_inc(wake_index);
>>> +		ws = &sbq->ws[wake_index];
>>> +		/* Find the next active waitqueue in round robin manner */
>>>  		if (waitqueue_active(&ws->wait)) {
>>> -			if (wake_index != atomic_read(&sbq->wake_index))
>>> -				atomic_set(&sbq->wake_index, wake_index);
>>> -			return ws;
>>> +			atomic_cmpxchg(&sbq->wake_index, old_wake_index,
>>> +				       wake_index);
>>> +			return;
>>>  		}
>>> -
>>> -		wake_index = sbq_index_inc(wake_index);
>>>  	}
>>> -
>>> -	return NULL;
>>>  }
>>>
>>>  static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>>>  {
>>>  	struct sbq_wait_state *ws;
>>>  	unsigned int wake_batch;
>>> -	int wait_cnt;
>>> +	int wait_cnt, wake_index;
>>>
>>> -	ws = sbq_wake_ptr(sbq);
>>> -	if (!ws)
>>> +	if (!atomic_read(&sbq->ws_active))
>>>  		return false;
>>>
>>> -	wait_cnt = atomic_dec_return(&ws->wait_cnt);
>>> -	if (wait_cnt <= 0) {
>>> -		int ret;
>>> -
>>> -		wake_batch = READ_ONCE(sbq->wake_batch);
>>> -
>>> -		/*
>>> -		 * Pairs with the memory barrier in sbitmap_queue_resize() to
>>> -		 * ensure that we see the batch size update before the wait
>>> -		 * count is reset.
>>> -		 */
>>> -		smp_mb__before_atomic();
>>> +	wake_index = atomic_read(&sbq->wake_index);
>>> +	ws = &sbq->ws[wake_index];
>>> +	/*
>>> +	 * This can only happen in the first wakeup when sbitmap waitqueues
>>> +	 * are no longer idle.
>>> +	 */
>>> +	if (!waitqueue_active(&ws->wait)) {
>>> +		sbq_update_wake_index(sbq, wake_index);
>>> +		return true;
>>> +	}
>>>
>>> -		/*
>>> -		 * For concurrent callers of this, the one that failed the
>>> -		 * atomic_cmpxhcg() race should call this function again
>>> -		 * to wakeup a new batch on a different 'ws'.
>>> -		 */
>>> -		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
>>> -		if (ret == wait_cnt) {
>>> -			sbq_index_atomic_inc(&sbq->wake_index);
>>> -			wake_up_nr(&ws->wait, wake_batch);
>>> -			return false;
>>> -		}
>>> +	wait_cnt = atomic_dec_return(&ws->wait_cnt);
>>> +	if (wait_cnt > 0)
>>> +		return false;
>>
>> The following race is still possible:
>>
>> CPU1					CPU2
>> __sbq_wake_up				__sbq_wake_up
>>  wake_index = atomic_read(&sbq->wake_index);
>>					 wake_index = atomic_read(&sbq->wake_index);
>>
>>  if (!waitqueue_active(&ws->wait)) -> not taken
>>					 if (!waitqueue_active(&ws->wait)) -> not taken
>>  wait_cnt = atomic_dec_return(&ws->wait_cnt);
>>   /* decremented to 0 now */
>>  if (wait_cnt > 0) -> not taken
>>  sbq_update_wake_index(sbq, wake_index);
>>  if (wait_cnt < 0) -> not taken
>>  ...
>>  atomic_set(&ws->wait_cnt, wake_batch);
>>  wake_up_nr(&ws->wait, wake_batch);
>>					 wait_cnt = atomic_dec_return(&ws->wait_cnt);
>>					 /*
>>					  * decremented to wake_batch - 1 but
>>					  * there are no tasks waiting anymore
>>					  * so the wakeup should have gone
>>					  * to a different waitqueue.
>>					  */
>>
>> I have an idea how to fix all these lost wakeups. I'll try to code it and
>> see whether it looks usable...

Hi, Jan

Thanks for the analysis; you are right, this race is possible.

>
> Thinking a bit more about it, your code would just need a small tweak like:
>
> 	wait_cnt = atomic_dec_return(&ws->wait_cnt);
> 	/*
> 	 * Concurrent callers should call this function again
> 	 * to wakeup a new batch on a different 'ws'.
> 	 */
> 	if (wait_cnt < 0 || !waitqueue_active(&ws->wait)) {
> 		sbq_update_wake_index(sbq, wake_index);
> 		return true;
> 	}

I'm thinking that if the waitqueue is still active, this will decrease
'wait_cnt' in the old waitqueue while 'wake_index' has already moved to
the next waitqueue. This really breaks the design...

Thanks,
Kuai

> 	if (wait_cnt > 0)
> 		return false;
> 	sbq_update_wake_index(sbq, wake_index);
>
> 	wake_batch = READ_ONCE(sbq->wake_batch);
> 	wake_up_nr(&ws->wait, wake_batch);
> 	/*
> 	 * Pairs with the memory barrier in sbitmap_queue_resize() to
> 	 * ensure that we see the batch size update before the wait
> 	 * count is reset.
> 	 *
> 	 * Also pairs with the implicit barrier between decrementing
> 	 * wait_cnt and checking for waitqueue_active() to make sure
> 	 * waitqueue_active() sees results of the wakeup if
> 	 * atomic_dec_return() has seen results of the atomic_set.
> 	 */
> 	smp_mb__before_atomic();
> 	atomic_set(&ws->wait_cnt, wake_batch);
>
> 								Honza
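
For concreteness, here is how I read your two fragments once they are put
together on top of my patch (just my reading of the quoted snippets,
untested):

	static bool __sbq_wake_up(struct sbitmap_queue *sbq)
	{
		struct sbq_wait_state *ws;
		unsigned int wake_batch;
		int wait_cnt, wake_index;

		if (!atomic_read(&sbq->ws_active))
			return false;

		wake_index = atomic_read(&sbq->wake_index);
		ws = &sbq->ws[wake_index];

		wait_cnt = atomic_dec_return(&ws->wait_cnt);
		/*
		 * Concurrent callers should call this function again
		 * to wakeup a new batch on a different 'ws'.
		 */
		if (wait_cnt < 0 || !waitqueue_active(&ws->wait)) {
			sbq_update_wake_index(sbq, wake_index);
			return true;
		}

		if (wait_cnt > 0)
			return false;

		sbq_update_wake_index(sbq, wake_index);

		wake_batch = READ_ONCE(sbq->wake_batch);
		wake_up_nr(&ws->wait, wake_batch);
		/*
		 * Pairs with the memory barrier in sbitmap_queue_resize() to
		 * ensure that we see the batch size update before the wait
		 * count is reset.
		 *
		 * Also pairs with the implicit barrier between decrementing
		 * wait_cnt and checking for waitqueue_active() to make sure
		 * waitqueue_active() sees results of the wakeup if
		 * atomic_dec_return() has seen results of the atomic_set.
		 */
		smp_mb__before_atomic();
		atomic_set(&ws->wait_cnt, wake_batch);

		return false;
	}

And if the '!waitqueue_active()' branch is taken here, the decrement has
already landed on the old 'ws' while 'wake_index' moves on, which is the
case I'm worried about above.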

>>> +	sbq_update_wake_index(sbq, wake_index);
>>> +	/*
>>> +	 * Concurrent callers should call this function again
>>> +	 * to wakeup a new batch on a different 'ws'.
>>> +	 */
>>> +	if (wait_cnt < 0)
>>>  		return true;
>>> -	}
>>> +
>>> +	wake_batch = READ_ONCE(sbq->wake_batch);
>>> +	/*
>>> +	 * Pairs with the memory barrier in sbitmap_queue_resize() to
>>> +	 * ensure that we see the batch size update before the wait
>>> +	 * count is reset.
>>> +	 */
>>> +	smp_mb__before_atomic();
>>> +	atomic_set(&ws->wait_cnt, wake_batch);
>>> +	wake_up_nr(&ws->wait, wake_batch);
>>>
>>>  	return false;
>>>  }
>>> --
>>> 2.31.1
>>>
>> --
>> Jan Kara
>> SUSE Labs, CR