From: Yu Kuai <yukuai3@huawei.com>
Subject: [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced
Date: Fri, 8 Apr 2022 15:39:11 +0800
Message-ID: <20220408073916.1428590-4-yukuai3@huawei.com>
In-Reply-To: <20220408073916.1428590-1-yukuai3@huawei.com>
References: <20220408073916.1428590-1-yukuai3@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Currently, the same waitqueue might be woken up continuously:

__sbq_wake_up                           __sbq_wake_up
 sbq_wake_ptr -> assume 0                sbq_wake_ptr -> 0
 atomic_dec_return                       atomic_dec_return
 atomic_cmpxchg -> succeed               atomic_cmpxchg -> failed
                                          return true
                                        __sbq_wake_up
                                         sbq_wake_ptr
                                          atomic_read(&sbq->wake_index)
                                          -> still 0
 sbq_index_atomic_inc -> inc to 1
                                          if (waitqueue_active(&ws->wait))
                                           if (wake_index != atomic_read(&sbq->wake_index))
                                            atomic_set -> reset from 1 to 0
                                          wake_up_nr -> wake up first waitqueue
                                          // continue to wake up in first waitqueue

What's worse, an I/O hang is possible in theory because a wakeup might be
missed. For example, 2 * wake_batch tags are put, while only wake_batch
threads are woken:

__sbq_wake_up
 atomic_cmpxchg -> reset wait_cnt
                                        __sbq_wake_up -> decrease wait_cnt
                                        ...
                                        __sbq_wake_up -> wait_cnt is decreased to 0 again
                                         atomic_cmpxchg
                                         sbq_index_atomic_inc -> increase wake_index
                                         wake_up_nr -> wake up and waitqueue might be empty
 sbq_index_atomic_inc -> increase again, one waitqueue is skipped
 wake_up_nr -> invalid wake up because old waitqueue might be empty

To fix the problem, refactor to make sure waitqueues will be woken up
one by one, and also choose the next waitqueue by the number of threads
that are waiting, to keep waitqueues balanced.
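To make the first race concrete, here is a simplified user-space model of
the old wait_cnt accounting, using C11 atomics in place of the kernel's
atomic_t helpers. This is an illustrative sketch only: the names mirror
the kernel code, but the waitqueue_active() scan and the actual wakeups
are omitted.

#include <stdatomic.h>
#include <stdbool.h>

#define SBQ_WAIT_QUEUES 8
#define WAKE_BATCH      8

static atomic_int wake_index;
static atomic_int wait_cnt[SBQ_WAIT_QUEUES];

/* models the old __sbq_wake_up(): true means "caller should retry" */
static bool old_sbq_wake_up(void)
{
        int ws = atomic_load(&wake_index);
        /* atomic_dec_return(): fetch_sub returns the old value */
        int cnt = atomic_fetch_sub(&wait_cnt[ws], 1) - 1;

        if (cnt <= 0) {
                int expected = cnt;

                /*
                 * Two racing callers can both reach this point, one
                 * having seen cnt == 0 and the other cnt < 0.  Only the
                 * cmpxchg winner resets the count and advances
                 * wake_index; the loser retries, and its stale view of
                 * wake_index is what lets the same queue be woken twice
                 * or a whole queue be skipped.
                 */
                if (atomic_compare_exchange_strong(&wait_cnt[ws],
                                                   &expected, WAKE_BATCH)) {
                        atomic_store(&wake_index,
                                     (ws + 1) % SBQ_WAIT_QUEUES);
                        /* kernel: wake_up_nr(&ws->wait, wake_batch) */
                        return false;
                }
                return true;
        }
        return false;
}

int main(void)
{
        for (int i = 0; i < SBQ_WAIT_QUEUES; i++)
                atomic_init(&wait_cnt[i], WAKE_BATCH);
        /* a real reproducer would call this from many threads */
        while (old_sbq_wake_up())
                ;
        return 0;
}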
Test cmd (fio; nr_requests is 64, and queue_depth is 32):

[global]
filename=/dev/sda
ioengine=libaio
direct=1
allow_mounted_write=0
group_reporting

[test]
rw=randwrite
bs=4k
numjobs=512
iodepth=2

Before this patch, waitqueues can be extremely unbalanced, for example:

ws_active=484
ws={
        {.wait_cnt=8, .waiters_cnt=117},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=76},
        {.wait_cnt=8, .waiters_cnt=0},
        {.wait_cnt=5, .waiters_cnt=24},
        {.wait_cnt=8, .waiters_cnt=12},
        {.wait_cnt=8, .waiters_cnt=21},
        {.wait_cnt=8, .waiters_cnt=175},
}

With this patch, waitqueues are always balanced, for example:

ws_active=477
ws={
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=6, .waiters_cnt=62},
        {.wait_cnt=8, .waiters_cnt=61},
        {.wait_cnt=8, .waiters_cnt=60},
        {.wait_cnt=8, .waiters_cnt=63},
        {.wait_cnt=8, .waiters_cnt=56},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=57},
}

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 81 ++++++++++++++++++++++++++-------------------------
 1 file changed, 42 insertions(+), 39 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 393f2b71647a..176fba0252d7 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -575,68 +575,71 @@ void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
 
-static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
+/* always choose the 'ws' with the max waiters */
+static void sbq_update_wake_index(struct sbitmap_queue *sbq,
+                                  int old_wake_index)
 {
-        int i, wake_index;
+        int index, wake_index;
+        int max_waiters = 0;
 
-        if (!atomic_read(&sbq->ws_active))
-                return NULL;
+        if (old_wake_index != atomic_read(&sbq->wake_index))
+                return;
 
-        wake_index = atomic_read(&sbq->wake_index);
-        for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
-                struct sbq_wait_state *ws = &sbq->ws[wake_index];
+        for (wake_index = 0; wake_index < SBQ_WAIT_QUEUES; wake_index++) {
+                struct sbq_wait_state *ws;
+                int waiters;
 
-                if (waitqueue_active(&ws->wait)) {
-                        if (wake_index != atomic_read(&sbq->wake_index))
-                                atomic_set(&sbq->wake_index, wake_index);
-                        return ws;
-                }
+                if (wake_index == old_wake_index)
+                        continue;
 
-                wake_index = sbq_index_inc(wake_index);
+                ws = &sbq->ws[wake_index];
+                waiters = atomic_read(&ws->waiters_cnt);
+                if (waiters > max_waiters) {
+                        max_waiters = waiters;
+                        index = wake_index;
+                }
         }
 
-        return NULL;
+        if (max_waiters)
+                atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
 }
 
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
         struct sbq_wait_state *ws;
         unsigned int wake_batch;
-        int wait_cnt;
+        int wait_cnt, wake_index;
 
-        ws = sbq_wake_ptr(sbq);
-        if (!ws)
+        if (!atomic_read(&sbq->ws_active))
                 return false;
 
+        wake_index = atomic_read(&sbq->wake_index);
+        ws = &sbq->ws[wake_index];
         wait_cnt = atomic_dec_return(&ws->wait_cnt);
-        if (wait_cnt <= 0) {
-                int ret;
-
-                wake_batch = READ_ONCE(sbq->wake_batch);
-
-                /*
-                 * Pairs with the memory barrier in sbitmap_queue_resize() to
-                 * ensure that we see the batch size update before the wait
-                 * count is reset.
-                 */
-                smp_mb__before_atomic();
-
+        if (wait_cnt > 0) {
+                return false;
+        } else if (wait_cnt < 0) {
                 /*
-                 * For concurrent callers of this, the one that failed the
-                 * atomic_cmpxhcg() race should call this function again
+                 * Concurrent callers should call this function again
                  * to wakeup a new batch on a different 'ws'.
                  */
-                ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
-                if (ret == wait_cnt) {
-                        sbq_index_atomic_inc(&sbq->wake_index);
-                        wake_up_nr(&ws->wait, wake_batch);
-                        return false;
-                }
-
+                sbq_update_wake_index(sbq, wake_index);
                 return true;
         }
 
-        return false;
+        sbq_update_wake_index(sbq, wake_index);
+        wake_batch = READ_ONCE(sbq->wake_batch);
+
+        /*
+         * Pairs with the memory barrier in sbitmap_queue_resize() to
+         * ensure that we see the batch size update before the wait
+         * count is reset.
+         */
+        smp_mb__before_atomic();
+        atomic_set(&ws->wait_cnt, wake_batch);
+        wake_up_nr(&ws->wait, wake_batch);
+
+        return true;
 }
 
 void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
-- 
2.31.1
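A note on the retry contract (quoting what I believe is the unchanged
caller in lib/sbitmap.c at this series' base; it is not part of this
diff): sbitmap_queue_wake_up() keeps calling __sbq_wake_up() until it
returns false. That is why the refactored function returns true both
after a successful batch wakeup and after losing the wait_cnt race --
in either case another pass may still have wakeups to deliver on the
updated wake_index.

/* caller loops until __sbq_wake_up() reports no more work */
void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
{
        while (__sbq_wake_up(sbq))
                ;
}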