Date: Tue, 2 May 2023 16:27:09 -0700
From: Darrick J. Wong
To: Dave Chinner
Cc: Ming Lei, Christoph Hellwig, Theodore Ts'o, Baokun Li, Matthew Wilcox,
    linux-ext4@vger.kernel.org, Andreas Dilger, linux-block@vger.kernel.org,
    Andrew Morton, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    Eric Sandeen, Zhang Yi, yangerkun
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages
Message-ID: <20230502232709.GD15420@frogsfrogsfrogs>
References: <20230429044038.GA7561@lst.de>
 <20230501044744.GA20056@lst.de>
 <20230502013557.GH2155823@dread.disaster.area>
 <20230502153516.GA15376@frogsfrogsfrogs>
 <20230502223323.GI2155823@dread.disaster.area>
In-Reply-To: <20230502223323.GI2155823@dread.disaster.area>
X-Mailing-List: linux-ext4@vger.kernel.org

On Wed, May 03, 2023 at 08:33:23AM +1000, Dave Chinner wrote:
> On Tue, May 02, 2023 at 08:35:16AM -0700, Darrick J. Wong wrote:
> > On Tue, May 02, 2023 at 11:35:57AM +1000, Dave Chinner wrote:
> > > On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> > > > On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > > > > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > > > > Not sure if it is needed for non s_bdev
> > > > >
> > > > > So you don't want this to work at all for btrfs?  Or the XFS
> > > > > log device, or ..
> > > >
> > > > Basically the FS can provide one generic API, shutdown_filesystem(),
> > > > which shuts down the FS generically and in turn calls each fs's
> > > > ->shutdown() to deal with fs-specific shutdown.
> > > >
> > > > If there isn't a superblock attached to a bdev, can you explain a
> > > > bit what filesystem code can do?  Same with a block layer bdev.
> > > >
> > > > The current bio->bi_status together with disk_live() (maybe
> > > > bdev_live() is needed) should be enough for FS code to handle
> > > > non-s_bdev devices.
> > >
> > > Maybe necessary for btrfs, but not for XFS....
> > >
> > > > > > , because the FS is over the stackable device directly.  A
> > > > > > stackable device has its own logic for handling underlying
> > > > > > disks that die or are deleted, and then decides whether its
> > > > > > own disk needs to be deleted; for example, from the user's
> > > > > > viewpoint it is fine for raid1 to keep working if one
> > > > > > underlying disk is deleted.
> > > > >
> > > > > We still need to propagate the event that a device has been
> > > > > removed upwards.  Right now some file systems (especially XFS)
> > > > > are good at just propagating it from an I/O error.  An explicit
> > > > > call would be much better.
> > > >
> > > > It depends on the above question about how FS code handles
> > > > non-s_bdev deletion/death.
> > >
> > > XFS doesn't treat the individual devices differently: a failure
> > > on an external log device is just as fatal as a failure on a
> > > single-device filesystem with an internal log.  ext4 is going to
> > > consider external journal device removal as fatal, too.
> > >
> > > As for removal of realtime devices on XFS, all the user data has
> > > gone away, so the filesystem will largely be useless for users and
> > > applications.  At that point we'll probably want to shut down the
> > > filesystem, because we've had an unknown amount of user data loss,
> > > and silently continuing on as if nothing happened is not the right
> > > thing to do.
> > >
> > > So as long as we can attach the superblock to each block device that
> > > the filesystem opens (regardless of where sb->s_bdev points), device
> > > removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
> > > do what we need.  If we need anything different in future, then we
> > > can worry about how to do that in the future.
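
(As a concrete strawman, the device-removal side could be as small as
the sketch below.  None of this is a mainline interface: sb_force_shutdown()
and SB_SHUTDOWN_DEVICE_DEAD are the proposal above, the hook name is
made up, and get_active_super() only finds the sb via sb->s_bdev today,
whereas a real version would need to find it from *any* bdev the
filesystem opened:

/* Hypothetical hook, called by the block layer when a device goes away. */
static void bdev_notify_dead(struct block_device *bdev)
{
        struct super_block *sb = get_active_super(bdev);

        if (!sb)
                return;         /* nothing mounted on this bdev */

        /* Tell the fs its device is gone so it stops issuing new I/O. */
        sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD);
        deactivate_super(sb);   /* drop the active reference we took */
}

The locking and lifetime details are handwaved, of course.)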

> > Shiyang spent a lot of time hooking up pmem failure notifications so
> > that xfs can kill processes that have pmem in their mappings.  I
> > wonder if we could reuse some of that infrastructure here?
>
> ISTR that the generic mechanism for "device failure ranges" (I think
> I called the mechanism ->corrupt_range()) that we came up with in
> the first instance for this functionality got shouted down by some
> block layer devs, because they saw it as unnecessary complexity to
> push device range failure notifications through block devices up
> to the filesystem.
>
> The whole point of starting from there was that any type of block
> device could report a failure to the filesystem and have the
> filesystem deal with it appropriately.
>
> This is where we started:
>
> https://lore.kernel.org/linux-xfs/20201215121414.253660-1-ruansy.fnst@cn.fujitsu.com/
>
> ".....
>  The call trace is like this:
>  memory_failure()
>   pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
>    gendisk->fops->corrupted_range() => - pmem_corrupted_range()
>                                        - md_blk_corrupted_range()
>     sb->s_ops->currupted_range()    => xfs_fs_corrupted_range()
>      xfs_rmap_query_range()
>       xfs_currupt_helper()
>        * corrupted on metadata
>            try to recover data, call xfs_force_shutdown()
>        * corrupted on file data
>            try to recover data, call mf_dax_mapping_kill_procs()
> ...."

I dug up

https://lore.kernel.org/linux-xfs/20210108095614.GB5647@lst.de/

which I interpreted as Christoph asking Shiyang not to make the dax
device code go swerving through the block layer to call
->corrupted_range, since he was trying to separate the two entirely.

I didn't think he was shutting down the idea of block devices being
able to call ->corrupted_range to tell the filesystem that the user's
$2 NVME<->STL<->USB bridge caught on fire.
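
Something like this, say -- a sketch only, since none of it is in
mainline: the ->corrupted_range() hook and the rmap helper come from
that unmerged patchset (modulo spelling), and SHUTDOWN_DEVICE_REMOVED
stands in for whichever shutdown reason we'd actually pick:

static int xfs_fs_corrupted_range(struct super_block *sb,
                                  struct block_device *bdev,
                                  loff_t offset, u64 len, void *data)
{
        struct xfs_mount        *mp = XFS_M(sb);

        /* (0, -1ULL) is the "entire device is gone" convention. */
        if (offset == 0 && len == -1ULL) {
                xfs_force_shutdown(mp, SHUTDOWN_DEVICE_REMOVED);
                return 0;
        }

        /*
         * Walk the rmap to find the owner of the bad range: shut down
         * if we lost metadata, kill the mappings if we lost file data.
         */
        return xfs_corrupt_helper(mp, bdev, offset, len, data);
}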

> > That MF_MEM_REMOVE patchset he's been trying to get us to merge
> > would be a good starting point for building something similar for
> > block devices.  AFAICT it does the right thing if you hand it a
> > subrange of the dax device or if you pass it the customary
> > (0, -1ULL) to mean "the entire device".
>
> *nod*
>
> That was exactly how I originally envisaged that whole "bad device
> range" stack being used.
>
> > The block device version of that could be a lot simpler -- imagine
> > if "echo 0 > /sys/block/fd0/device/delete" resulted in the block
> > layer first sending us a notification that the device is about to
> > be removed.  We could then flush the fs and try to freeze it.
> > After the device actually goes away, the block layer would send us
> > a second notification about DEVICE_DEAD and we could shut down the
> > incore filesystem objects.
>
> *nod*
>
> But seeing as this mechanism has already been shot down by the block
> layer devs, let's be a little less ambitious and just start with a
> simple, pre-existing "kill the filesystem" mechanism.  Once we've got
> that in place and working, we can then expand the error handling
> mechanism to perform notification of more fine-grained storage
> errors...

Seeing as LSF is next week, I'll ask the room about this when I'm
there.

--D

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com