Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3319777imu; Sun, 11 Nov 2018 12:17:10 -0800 (PST) X-Google-Smtp-Source: AJdET5f7jY5rO5SbSLnX5FZzmD9/n1Pvr0yyDhuEptFqtQp2hIqFYzQdoEJmHtiXqEadeU8myMgV X-Received: by 2002:a63:ed15:: with SMTP id d21mr10968010pgi.305.1541967430496; Sun, 11 Nov 2018 12:17:10 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1541967430; cv=none; d=google.com; s=arc-20160816; b=iDIRfqV3xTKYKS+Su+dM7X2yaExfjaqMElJlwtHHb76ZRuUYTfhm8Pk343PufG8WYa kfqEhdvLiCTyfTEEN2PqLsMXL1AG1wc+ILPPYpl45SUyejmcwk88FVpfTQ47QEbYm48u n1r0w6zfIw31FKLJXjjvllualzmOTQUZry59zkXi3VQNR7lvOLhon9Vnvze6K29ye6xL XSLxbpfoM2pcM1c8anknGTwaZvRbudeY2lu08XBMkVdqQN2ERi2uz1zptpCaLWokfbp8 8PlqhEEwp3dGwBg8yTPqJn9cjVZTQik4CIlhVZWOlGbHWAKf6xsJfDB/nNKnnC3H8cFo Ru6A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:subject:message-id:date:cc:to :from:mime-version:content-transfer-encoding:content-disposition; bh=iVgpEhF0aKjL/F8S4FSo7HhxH0vzBmOZW0QkFjtWKcc=; b=1CIxQS3B++aVhR92Ahvy3JYQxzqRfkXJOCgCTAJxV2EfbRqfMQrSI730MdfaO2j816 2cJ+zlUqCEgTsBkeCY9ullS0G3tz1gEXLwaEp9n6eVtGiYkg2894lEaHQiibZnIqHKXv efXcFf3y2w4kSY1wLu0DdpigXB/1ykVXYTqkHohY2kuIjeCP1OvJzJLUhEu/sKxQN3XF trwLYtNHjBIr3AOIfh8Z/zQlgTJrOeDHhrjSyJvljU0m4BKuZq5BXd+D1irVCgxfH3RO 2yUMciZdHULBpIAIinKthgBFKNmhllSfGuv8A57bageLAjawJXTxmd9fIOy319mbKiPL BSSA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id g6si966009pgr.472.2018.11.11.12.16.55; Sun, 11 Nov 2018 12:17:10 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731927AbeKLGGE (ORCPT + 99 others); Mon, 12 Nov 2018 01:06:04 -0500 Received: from shadbolt.e.decadent.org.uk ([88.96.1.126]:53178 "EHLO shadbolt.e.decadent.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726912AbeKLGGD (ORCPT ); Mon, 12 Nov 2018 01:06:03 -0500 Received: from [192.168.4.242] (helo=deadeye) by shadbolt.decadent.org.uk with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1gLvtB-0000oO-4I; Sun, 11 Nov 2018 19:59:21 +0000 Received: from ben by deadeye with local (Exim 4.91) (envelope-from ) id 1gLvsP-0001QI-2B; Sun, 11 Nov 2018 19:58:33 +0000 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit MIME-Version: 1.0 From: Ben Hutchings To: linux-kernel@vger.kernel.org, stable@vger.kernel.org CC: akpm@linux-foundation.org, "Omar Sandoval" , "Jens Axboe" Date: Sun, 11 Nov 2018 19:49:05 +0000 Message-ID: X-Mailer: LinuxStableQueue (scripts by bwh) Subject: [PATCH 3.16 053/366] sbitmap: fix race in wait batch accounting In-Reply-To: X-SA-Exim-Connect-IP: 192.168.4.242 X-SA-Exim-Mail-From: ben@decadent.org.uk X-SA-Exim-Scanned: No (on shadbolt.decadent.org.uk); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 3.16.61-rc1 review patch. If anyone has any objections, please let me know. ------------------ From: Jens Axboe commit c854ab5773be1c1a0d3cef0c3a3261f2c48ab7f8 upstream. If we have multiple callers of sbq_wake_up(), we can end up in a situation where the wait_cnt will continually go more and more negative. Consider the case where our wake batch is 1, hence wait_cnt will start out as 1. wait_cnt == 1 CPU0 CPU1 atomic_dec_return(), cnt == 0 atomic_dec_return(), cnt == -1 cmpxchg(-1, 0) (succeeds) [wait_cnt now 0] cmpxchg(0, 1) (fails) This ends up with wait_cnt being 0, we'll wakeup immediately next time. Going through the same loop as above again, and we'll have wait_cnt -1. For the case where we have a larger wake batch, the only difference is that the starting point will be higher. We'll still end up with continually smaller batch wakeups, which defeats the purpose of the rolling wakeups. Always reset the wait_cnt to the batch value. Then it doesn't matter who wins the race. But ensure that whomever does win the race is the one that increments the ws index and wakes up our batch count, loser gets to call __sbq_wake_up() again to account his wakeups towards the next active wait state index. Fixes: 6c0ca7ae292a ("sbitmap: fix wakeup hang after sbq resize") Reviewed-by: Omar Sandoval Signed-off-by: Jens Axboe [bwh: Backported to 3.16: - Rename almost everything - Adjust filename, context] Signed-off-by: Ben Hutchings --- block/blk-mq-tag.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -336,42 +336,58 @@ static struct bt_wait_state *bt_wake_ptr return NULL; } -static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag) +static bool __bt_wake_up(struct blk_mq_bitmap_tags *bt) { - const int index = TAG_TO_INDEX(bt, tag); struct bt_wait_state *bs; unsigned int wake_batch; int wait_cnt; - clear_bit(TAG_TO_BIT(bt, tag), &bt->map[index].word); - /* Ensure that the wait list checks occur after clear_bit(). */ smp_mb(); bs = bt_wake_ptr(bt); if (!bs) - return; + return false; wait_cnt = atomic_dec_return(&bs->wait_cnt); if (wait_cnt <= 0) { + int ret; + wake_batch = ACCESS_ONCE(bt->wake_cnt); + /* * Pairs with the memory barrier in bt_update_count() to * ensure that we see the batch size update before the wait * count is reset. */ smp_mb__before_atomic(); + /* - * If there are concurrent callers to bt_clear_tag(), the last - * one to decrement the wait count below zero will bump it back - * up. If there is a concurrent resize, the count reset will - * either cause the cmpxchg to fail or overwrite after the - * cmpxchg. + * For concurrent callers of this, the one that failed the + * atomic_cmpxhcg() race should call this function again + * to wakeup a new batch on a different 'bs'. */ - atomic_cmpxchg(&bs->wait_cnt, wait_cnt, wait_cnt + wake_batch); - bt_index_atomic_inc(&bt->wake_index); - wake_up(&bs->wait); + ret = atomic_cmpxchg(&bs->wait_cnt, wait_cnt, wake_batch); + if (ret == wait_cnt) { + bt_index_atomic_inc(&bt->wake_index); + wake_up(&bs->wait); + return false; + } + + return true; } + + return false; +} + +static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag) +{ + const int index = TAG_TO_INDEX(bt, tag); + + clear_bit(TAG_TO_BIT(bt, tag), &bt->map[index].word); + + while (__bt_wake_up(bt)) + ; } static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)