From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Nikhil Kshirsagar, Coly Li, Jens Axboe
Subject: [PATCH 5.10 413/452] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
Date: Tue, 7 Jun 2022 19:04:30 +0200
Message-Id: <20220607164920.863780734@linuxfoundation.org>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220607164908.521895282@linuxfoundation.org>
References: <20220607164908.521895282@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

From: Coly Li

commit 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6 upstream.

The journal no-space deadlock was reported from time to time. Such a
deadlock can happen in the following situation.

When all journal buckets are fully filled by active jsets under heavy
write I/O load, cache set registration (after a reboot) will load all
active jsets and insert them into the btree again (this is called
journal replay). If a journaled bkey is inserted into a btree node and
causes the btree node to split, a new journal request might be
triggered.
For example, if the btree grows one more level after the node split,
the root node record in the cache device super block will be updated
by bch_journal_meta() from bch_btree_set_root(). But if there is no
space left in the journal buckets, journal replay has to wait for a
new journal bucket to be reclaimed, which in turn requires at least
one journal bucket to be replayed first. This is one example of how
the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket at
run time, and only permit the reserved journal bucket to be used
during the cache set registration procedure, for things like journal
replay. Then the journal space will never be fully filled, and there
is no chance for the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" to struct journal. It
is initialized to 0 (false) when struct journal is allocated, and set
to 1 (true) by bch_journal_space_reserve() once all initialization is
done in run_cache_set(). At run time, when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to
check whether there are enough free journal buckets to use. If there
is only 1 free journal bucket left and journal->do_reserve is 1
(true), the last bucket is reserved and free_journal_buckets() returns
0 to indicate there is no free journal bucket. Then journal_reclaim()
gives up and retries the next time it is called. By this method, there
is always 1 journal bucket reserved at run time. During cache set
registration, journal->do_reserve is 0 (false), so the reserved
journal bucket can be used to avoid the no-space deadlock.
Reported-by: Nikhil Kshirsagar
Signed-off-by: Coly Li
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -407,6 +407,11 @@ err:
 	return ret;
 }
 
+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */
 
 static void btree_flush_write(struct cache_set *c)
@@ -625,12 +630,30 @@ static void do_journal_discard(struct ca
 	}
 }
 
+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;
 
@@ -653,12 +676,10 @@ static void journal_reclaim(struct cache
 	if (c->journal.blocks_free)
 		goto out;
 
-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;
 
-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set
 
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);
 
 #endif /* _BCACHE_JOURNAL_H */
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2150,6 +2150,7 @@ static int run_cache_set(struct cache_se
 	flash_devs_run(c);
 
+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err: