From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Filipe Manana, David Sterba, Sasha Levin
Subject: [PATCH 5.10 06/29] btrfs: fix mount failure due to past and transient device flush error
Date: Fri, 8 Oct 2021 13:27:53 +0200
Message-Id: <20211008112717.141324446@linuxfoundation.org>
X-Mailer: git-send-email 2.33.0
In-Reply-To: <20211008112716.914501436@linuxfoundation.org>
References: <20211008112716.914501436@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain;
 charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

From: Filipe Manana

[ Upstream commit 6b225baababf1e3d41a4250e802cbd193e1343fb ]

When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equal
to the maximum tolerated device failures for a raid profile.

However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device as missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

  $ mount /dev/sdc /mnt
  mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

  $ dmesg
  (...)
  [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
  [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
  [12918.890853] BTRFS error (device sdc): open_ctree failed

The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, it sees a non-zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.

Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt.
We only need to track flush errors during the current mount, so that we
never commit a super block if such errors happened.

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin
---
 fs/btrfs/volumes.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8b8764f5bd1..593e0c6d6b44 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1147,6 +1147,19 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 	atomic_set(&device->dev_stats_ccnt, 0);
 	extent_io_tree_release(&device->alloc_state);
 
+	/*
+	 * Reset the flush error record. We might have a transient flush error
+	 * in this mount, and if so we aborted the current transaction and set
+	 * the fs to an error state, guaranteeing no super blocks can be further
+	 * committed. However that error might be transient and if we unmount the
+	 * filesystem and mount it again, we should allow the mount to succeed
+	 * (btrfs_check_rw_degradable() should not fail) - if after mounting the
+	 * filesystem again we still get flush errors, then we will again abort
+	 * any transaction and set the error state, guaranteeing no commits of
+	 * unsafe super blocks.
+	 */
+	device->last_flush_error = 0;
+
 	/* Verify the device is back in a pristine state */
 	ASSERT(!test_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state));
 	ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));
-- 
2.33.0
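
For readers who want to see the failure mode outside the kernel, here is a
rough userspace sketch of the logic the changelog describes. It is not the
btrfs code. The names toy_device, toy_check_rw_degradable(),
toy_close_device() and toy_remount_after_flush_error() are hypothetical and
exist only for illustration. The model assumes that a device with a non-zero
last_flush_error is counted as missing, and that a writable mount is refused
once the missing count exceeds the tolerated maximum (0 in the dmesg output
quoted in the changelog).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct btrfs_device. */
struct toy_device {
	bool present;         /* device was found when scanning */
	int last_flush_error; /* non-zero if a flush failed during the mount */
};

/*
 * Toy model of the mount-time check: a device counts as "missing" if it
 * is absent or carries a recorded flush error. The writable mount is
 * allowed only while the count stays within max_tolerated.
 */
static bool toy_check_rw_degradable(const struct toy_device *devs, size_t n,
				    int max_tolerated)
{
	int missing = 0;

	assert(devs != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].present || devs[i].last_flush_error)
			missing++;
	}
	return missing <= max_tolerated;
}

/* Toy model of the close path; with_fix selects the patched behaviour. */
static void toy_close_device(struct toy_device *dev, bool with_fix)
{
	if (with_fix)
		dev->last_flush_error = 0; /* forget errors from the old mount */
}

/* Simulates: transient flush error, unmount, then a new mount attempt. */
static bool toy_remount_after_flush_error(bool with_fix)
{
	struct toy_device dev = { .present = true, .last_flush_error = 0 };

	dev.last_flush_error = -5; /* assumed example value, e.g. -EIO */
	toy_close_device(&dev, with_fix);
	return toy_check_rw_degradable(&dev, 1, 0);
}
```

In this model, toy_remount_after_flush_error(false) reports a refused mount
because the stale error survives the close, while
toy_remount_after_flush_error(true) reports success, which matches the
behaviour the patch aims for.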