Received: by 2002:a25:c593:0:0:0:0:0 with SMTP id v141csp889546ybe; Fri, 13 Sep 2019 07:53:35 -0700 (PDT) X-Google-Smtp-Source: APXvYqz+eCj/MJYJ7N6IktKwyOObft3hOa723WPvAawV5Foj50D2ceFVV9KxZPBcxjphGl55ONbi X-Received: by 2002:aa7:d0d1:: with SMTP id u17mr42849288edo.99.1568386415867; Fri, 13 Sep 2019 07:53:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1568386415; cv=none; d=google.com; s=arc-20160816; b=c9eZ2dIag6iX9MfhBa4D6cwiW9iUKIZl1tcsep8aZsyVBsFuPEvioKskRFJYscv9pF hBJpDrvvt6QroozttOBVZKKofOzhJYf4zF6CQ2+Gi0gEkr9yt+MwI1FNLArRQtiCrikk hnJNCUVnRX0noTRZHI4VGTEAjwa3BHxvo82sNVEfpccf+VdA9EN+7M3+q8m5Kabhfec6 QAB+xIvrSUqXRZE9rdh5jJi9qSN3tUpM8CUvgDPPMQgD9i0H8RrIZ+zyeH5gm5lcSMRt flV9vhFeb0Pw/TVcRq/hGBGUlUstUlvnDSYrDxXjL8xAKJQFZCrCYtM7uxI+QZRsIxV6 A+/g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=m6ZkFl13fsPxZl6OkNietweKCmmRWPTHTeCom6URSl4=; b=groPfNYqc5DCpUbf2wjw4lO7fUZwB7/ddA1foOrBqLXMPytW4YOjS1l+UlGns7O6g7 0FfU2+bRLswr8EP8m1BZI4Ug5qDAPGmG36bJ1px3pPBeEwVnmqyyu/YzIoERMmhuOlKh XBhgZ3pkupiEuxNmf6Luh0HdwVd9Nf65wvMxEUQKSnL7+tJh34nUydSgp2vwfcR1xGAw qavfknEE7R/W7mEc/uvnZSe0sUz1vuqSaXRh5l0SWmcujJdOVKKhqTi1oLchw3qEUTi/ +D3my7zyMQ30BvZi03mPBihwbyI9muWQkaZUGkpEDHNPchSNIQV6rNeggMDgeLK31QOl a3FA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=KXSoiu5U; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y10si14960480ejj.116.2019.09.13.07.53.12; Fri, 13 Sep 2019 07:53:35 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=KXSoiu5U; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2390466AbfIMNTz (ORCPT + 99 others); Fri, 13 Sep 2019 09:19:55 -0400 Received: from mail.kernel.org ([198.145.29.99]:48282 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2390444AbfIMNTy (ORCPT ); Fri, 13 Sep 2019 09:19:54 -0400 Received: from localhost (unknown [104.132.45.99]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 2D3D820640; Fri, 13 Sep 2019 13:19:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1568380791; bh=SKOaMgoQdCagdl1Ji8fMjbDb3ogMdQnZpGW1zMos2ls=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=KXSoiu5UPfomDl2gp4/9IN7kk1yM1G0nLFtSOn+MVaa78nCSKosXgabOVw6NbsUyw vOO8iBRZwJIh53JCnC3gTrpoG3CWlVnyQOuanbLLGNPu8s/NmyS4DYqMBZvCn1RoIQ I6f3h9ACifnj8o6w2k/rtW7roWNevWfgpo8HLprU= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Coly Li , Jens Axboe , Sasha Levin , kbuild test robot Subject: [PATCH 4.19 179/190] bcache: fix race in btree_flush_write() Date: Fri, 13 Sep 2019 14:07:14 +0100 Message-Id: <20190913130614.091905416@linuxfoundation.org> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190913130559.669563815@linuxfoundation.org> References: <20190913130559.669563815@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [ Upstream commit 50a260e859964002dab162513a10f91ae9d3bcd3 ] There is a race between mca_reap(), btree_node_free() and journal code btree_flush_write(), which results very rare and strange deadlock or panic and are very hard to reproduce. Let me explain how the race happens. In btree_flush_write() one btree node with oldest journal pin is selected, then it is flushed to cache device, the select-and-flush is a two steps operation. Between these two steps, there are something may happen inside the race window, - The selected btree node was reaped by mca_reap() and allocated to other requesters for other btree node. - The slected btree node was selected, flushed and released by mca shrink callback bch_mca_scan(). When btree_flush_write() tries to flush the selected btree node, firstly b->write_lock is held by mutex_lock(). If the race happens and the memory of selected btree node is allocated to other btree node, if that btree node's write_lock is held already, a deadlock very probably happens here. A worse case is the memory of the selected btree node is released, then all references to this btree node (e.g. b->write_lock) will trigger NULL pointer deference panic. This race was introduced in commit cafe56359144 ("bcache: A block layer cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU occupancy during journal"), which selected 128 btree nodes and flushed them one-by-one in a quite long time period. Such race is not easy to reproduce before. On a Lenovo SR650 server with 48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0 device assembled by 3 NVMe SSDs as backing device, this race can be observed around every 10,000 times btree_flush_write() gets called. Both deadlock and kernel panic all happened as aftermath of the race. The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It is set when selecting btree nodes, and cleared after btree nodes flushed. Then when mca_reap() selects a btree node with this bit set, this btree node will be skipped. Since mca_reap() only reaps btree node without BTREE_NODE_journal_flush flag, such race is avoided. Once corner case should be noticed, that is btree_node_free(). It might be called in some error handling code path. For example the following code piece from btree_split(), 2149 err_free2: 2150 bkey_put(b->c, &n2->key); 2151 btree_node_free(n2); 2152 rw_unlock(true, n2); 2153 err_free1: 2154 bkey_put(b->c, &n1->key); 2155 btree_node_free(n1); 2156 rw_unlock(true, n1); At line 2151 and 2155, the btree node n2 and n1 are released without mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here. If btree_node_free() is called directly in such error handling path, and the selected btree node has BTREE_NODE_journal_flush bit set, just delay for 1 us and retry again. In this case this btree node won't be skipped, just retry until the BTREE_NODE_journal_flush bit cleared, and free the btree node memory. Fixes: cafe56359144 ("bcache: A block layer cache") Signed-off-by: Coly Li Reported-and-tested-by: kbuild test robot Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin --- drivers/md/bcache/btree.c | 28 +++++++++++++++++++++++++++- drivers/md/bcache/btree.h | 2 ++ drivers/md/bcache/journal.c | 7 +++++++ 3 files changed, 36 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index e0468fd41b6ea..45f684689c357 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -35,7 +35,7 @@ #include #include #include - +#include #include /* @@ -649,12 +649,25 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush) up(&b->io_mutex); } +retry: /* * BTREE_NODE_dirty might be cleared in btree_flush_btree() by * __bch_btree_node_write(). To avoid an extra flush, acquire * b->write_lock before checking BTREE_NODE_dirty bit. */ mutex_lock(&b->write_lock); + /* + * If this btree node is selected in btree_flush_write() by journal + * code, delay and retry until the node is flushed by journal code + * and BTREE_NODE_journal_flush bit cleared by btree_flush_write(). + */ + if (btree_node_journal_flush(b)) { + pr_debug("bnode %p is flushing by journal, retry", b); + mutex_unlock(&b->write_lock); + udelay(1); + goto retry; + } + if (btree_node_dirty(b)) __bch_btree_node_write(b, &cl); mutex_unlock(&b->write_lock); @@ -1071,7 +1084,20 @@ static void btree_node_free(struct btree *b) BUG_ON(b == b->c->root); +retry: mutex_lock(&b->write_lock); + /* + * If the btree node is selected and flushing in btree_flush_write(), + * delay and retry until the BTREE_NODE_journal_flush bit cleared, + * then it is safe to free the btree node here. Otherwise this btree + * node will be in race condition. + */ + if (btree_node_journal_flush(b)) { + mutex_unlock(&b->write_lock); + pr_debug("bnode %p journal_flush set, retry", b); + udelay(1); + goto retry; + } if (btree_node_dirty(b)) { btree_complete_write(b, btree_current_write(b)); diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h index a68d6c55783bd..4d0cca145f699 100644 --- a/drivers/md/bcache/btree.h +++ b/drivers/md/bcache/btree.h @@ -158,11 +158,13 @@ enum btree_flags { BTREE_NODE_io_error, BTREE_NODE_dirty, BTREE_NODE_write_idx, + BTREE_NODE_journal_flush, }; BTREE_FLAG(io_error); BTREE_FLAG(dirty); BTREE_FLAG(write_idx); +BTREE_FLAG(journal_flush); static inline struct btree_write *btree_current_write(struct btree *b) { diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index ec1e35a62934d..7bb15cddca5ec 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -404,6 +404,7 @@ static void btree_flush_write(struct cache_set *c) retry: best = NULL; + mutex_lock(&c->bucket_lock); for_each_cached_btree(b, c, i) if (btree_current_write(b)->journal) { if (!best) @@ -416,9 +417,14 @@ retry: } b = best; + if (b) + set_btree_node_journal_flush(b); + mutex_unlock(&c->bucket_lock); + if (b) { mutex_lock(&b->write_lock); if (!btree_current_write(b)->journal) { + clear_bit(BTREE_NODE_journal_flush, &b->flags); mutex_unlock(&b->write_lock); /* We raced */ atomic_long_inc(&c->retry_flush_write); @@ -426,6 +432,7 @@ retry: } __bch_btree_node_write(b, NULL); + clear_bit(BTREE_NODE_journal_flush, &b->flags); mutex_unlock(&b->write_lock); } } -- 2.20.1