From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, "Darrick J. Wong", Chandan Babu R, Dave Chinner, Leah Rumancik
Subject: [PATCH 5.15 026/167] xfs: only run COW extent recovery when there are no live extents
Date: Tue, 19 Jul 2022 13:52:38 +0200
Message-Id: <20220719114659.277503687@linuxfoundation.org>
In-Reply-To: <20220719114656.750574879@linuxfoundation.org>
References: <20220719114656.750574879@linuxfoundation.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: "Darrick J. Wong"

[ Upstream commit 7993f1a431bc5271369d359941485a9340658ac3 ]

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount				# (0)
fsstress &			# (1)
while true; do
	fsstress
	mount -o remount,ro	# (2)
	fsstress
	mount -o remount,rw	# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that the
incore COW forks of inode (A) are now inconsistent with the ondisk
metadata.  If one of those former COW extents is allocated and mapped
into another file (B) and someone triggers a COW to the stale
reservation in (A), A's dirty data will be written into (B) and once
that's done, those blocks will be transferred to (A)'s data fork without
bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.

In the first patch, we fixed the race condition in (2) so that (A) will
always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging extents that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.

While we're at it, move the callsite to xfs_log_mount_finish because any
refcount btree expansion (however unlikely given that we're removing
records from the right side of the index) must be fed by a per-AG
reservation, which doesn't exist in its current location.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong
Reviewed-by: Chandan Babu R
Reviewed-by: Dave Chinner
Signed-off-by: Leah Rumancik
Acked-by: Darrick J. Wong
Signed-off-by: Greg Kroah-Hartman
---
 fs/xfs/xfs_log_recover.c |   24 +++++++++++++++++++++++-
 fs/xfs/xfs_mount.c       |   10 ----------
 fs/xfs/xfs_reflink.c     |    5 ++++-
 fs/xfs/xfs_super.c       |    9 ---------
 4 files changed, 27 insertions(+), 21 deletions(-)

--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -27,7 +27,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_ag.h"
 #include "xfs_quota.h"
-
+#include "xfs_reflink.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -3502,6 +3502,28 @@ xlog_recover_finish(
 
 	xlog_recover_process_iunlinks(log);
 	xlog_recover_check_summary(log);
+
+	/*
+	 * Recover any CoW staging blocks that are still referenced by the
+	 * ondisk refcount metadata.  During mount there cannot be any live
+	 * staging extents as we have not permitted any user modifications.
+	 * Therefore, it is safe to free them all right now, even on a
+	 * read-only mount.
+	 */
+	error = xfs_reflink_recover_cow(log->l_mp);
+	if (error) {
+		xfs_alert(log->l_mp,
+	"Failed to recover leftover CoW staging extents, err %d.",
+				error);
+		/*
+		 * If we get an error here, make sure the log is shut down
+		 * but return zero so that any log items committed since the
+		 * end of intents processing can be pushed through the CIL
+		 * and AIL.
+		 */
+		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+	}
+
 	return 0;
 }
 
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -922,15 +922,6 @@ xfs_mountfs(
 		xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
 
-		/* Recover any CoW blocks that never got remapped. */
-		error = xfs_reflink_recover_cow(mp);
-		if (error) {
-			xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-			goto out_quota;
-		}
-
 		/* Reserve AG blocks for future btree expansion. */
 		error = xfs_fs_reserve_ag_blocks(mp);
 		if (error && error != -ENOSPC)
@@ -941,7 +932,6 @@ xfs_mountfs(
 
  out_agresv:
 	xfs_fs_unreserve_ag_blocks(mp);
- out_quota:
 	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
 	xfs_rtunmount_inodes(mp);
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -749,7 +749,10 @@ xfs_reflink_end_cow(
 }
 
 /*
- * Free leftover CoW reservations that didn't get cleaned out.
+ * Free all CoW staging blocks that are still referenced by the ondisk refcount
+ * metadata.  The ondisk metadata does not track which inode created the
+ * staging extent, so callers must ensure that there are no cached inodes with
+ * live CoW staging extents.
  */
 int
 xfs_reflink_recover_cow(
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1742,15 +1742,6 @@ xfs_remount_rw(
 	 */
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-
-	/* Recover any CoW blocks that never got remapped. */
-	error = xfs_reflink_recover_cow(mp);
-	if (error) {
-		xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return error;
-	}
 	xfs_blockgc_start(mp);
 
 	/* Create the per-AG metadata reservation pool .*/
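
The ordering argument in the commit message can be illustrated with a toy
userspace model in C. This is only a sketch and not kernel code: every name
in it (toy_fs, toy_reserve_cow, toy_drop_inode_cache, toy_recover_cow,
toy_stale_reservations) is hypothetical, and two flat arrays stand in for
the ondisk refcount records and the incore COW forks. The only property it
tries to capture is the one the commit message states: recovery frees every
staging extent recorded ondisk without consulting the inode cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

/* Toy ondisk state of a block; TOY_FREE is 0 so zero-init means "all free". */
enum toy_state { TOY_FREE = 0, TOY_COW_STAGING };

struct toy_fs {
	enum toy_state ondisk[TOY_NBLOCKS];	/* stand-in for ondisk refcount records */
	bool incore_cow[TOY_NBLOCKS];		/* stand-in for cached inodes' COW forks */
};

/* Reserve block b as CoW staging: recorded ondisk and in the incore fork. */
void toy_reserve_cow(struct toy_fs *fs, size_t b)
{
	assert(b < TOY_NBLOCKS);
	fs->ondisk[b] = TOY_COW_STAGING;
	fs->incore_cow[b] = true;
}

/* Model a state with no live inodes (as at mount time): incore state is gone. */
void toy_drop_inode_cache(struct toy_fs *fs)
{
	for (size_t b = 0; b < TOY_NBLOCKS; b++)
		fs->incore_cow[b] = false;
}

/*
 * Free every staging block recorded ondisk without looking at incore state.
 * Returns the number of blocks freed.
 */
int toy_recover_cow(struct toy_fs *fs)
{
	int freed = 0;

	for (size_t b = 0; b < TOY_NBLOCKS; b++) {
		if (fs->ondisk[b] == TOY_COW_STAGING) {
			fs->ondisk[b] = TOY_FREE;
			freed++;
		}
	}
	return freed;
}

/* Count incore reservations that point at blocks the ondisk state calls free. */
int toy_stale_reservations(const struct toy_fs *fs)
{
	int stale = 0;

	for (size_t b = 0; b < TOY_NBLOCKS; b++) {
		if (fs->incore_cow[b] && fs->ondisk[b] == TOY_FREE)
			stale++;
	}
	return stale;
}
```

In this model, calling toy_recover_cow() while incore reservations are still
live leaves every one of them stale, which corresponds to the state that
step (3) could produce before this patch. Calling toy_drop_inode_cache()
first, as a stand-in for running recovery only at mount time, leaves none.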