Received: by 2002:a05:6a10:8c0a:0:0:0:0 with SMTP id go10csp938869pxb; Wed, 3 Mar 2021 21:49:14 -0800 (PST) X-Google-Smtp-Source: ABdhPJyk5AUTuAWf+kXc9Pbk87c/VeV4P3aMzpFl1bYdfx6E0C/4VMUJRJ5KwlX5O73tDE/vaVP5 X-Received: by 2002:a17:906:7c57:: with SMTP id g23mr2330677ejp.195.1614836954199; Wed, 03 Mar 2021 21:49:14 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1614836954; cv=none; d=google.com; s=arc-20160816; b=Yc85QiM2LWs5R4vVYSwjIDIA0LF/Kii4Em6URgyokvfR50z0grJrCMhuHH0TKm7QyQ QvQrrWc8VvsQlzVZ4zJhYU6khhnH3ApS0+2eCRXcV2uIVdqxgr/gnrSevTe+YSvgxXpB hS4oKLh5ijRRnsao9z4NbDG/dG+BNFyM+vSTBhLsTWrAk8M5ef7ycXxihUxhNQ+bvD6Q F0nxrndEtpNlIKco0dw4jw8vGcUtmqXTNlfUQb7shFn7qg0GqtFXnt++HJ6XfPKkrGca asTMVu7/9CLbCpTlKvbDE/BnknLfGfRX+HDk2ynutWhiULTjs0uyRk13nd5c5vlR+KO2 BtdQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=qh4mKoIkd3kQcCA7D3vdmXdlqTwhkSQo5TnRd9KvzpU=; b=YB+skp8Oh3UqwwrabJMgNwn6P8wq8HlHEk+BAnMBFNsTqcteABAWuZ9sBIaYWw8cKK uKIErorfLe5ie0rZWjSBQ6i9+B4KvJ6ZcxptjmTJtKc0OhoPlDX9Evz8bPjbhRdIjTdY hWYk8oRDv3lhGxwLd3kvhkp/WFjUlZNkMLmeuuVXPZAgWg9OaHs7W/44nkigrX7pNT4O 9VjmJ0GmVPbpoZNvk3/0co+g0U3sx3z8SRmHAv5BB7F/2qR7tayC5txpPPwd1Ts/Pkzw 6n3Py6uL4xPAQRBonDawdIgxibyvMih2l+O/EciTMFcTCFDN4VgkgMT7rFQOjPYIl34r /bog== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel-com.20150623.gappssmtp.com header.s=20150623 header.b=0STt5Ycl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id f7si14324444ejw.650.2021.03.03.21.48.51; Wed, 03 Mar 2021 21:49:14 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@intel-com.20150623.gappssmtp.com header.s=20150623 header.b=0STt5Ycl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1836274AbhCBHAG (ORCPT + 99 others); Tue, 2 Mar 2021 02:00:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1577110AbhCBFnn (ORCPT ); Tue, 2 Mar 2021 00:43:43 -0500 Received: from mail-ej1-x634.google.com (mail-ej1-x634.google.com [IPv6:2a00:1450:4864:20::634]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B2A49C061794 for ; Mon, 1 Mar 2021 21:41:12 -0800 (PST) Received: by mail-ej1-x634.google.com with SMTP id r17so32846727ejy.13 for ; Mon, 01 Mar 2021 21:41:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=qh4mKoIkd3kQcCA7D3vdmXdlqTwhkSQo5TnRd9KvzpU=; b=0STt5Ycl180R7lj4Cga2Ph5n69e44b24ZikKiVmIWGCDqXIWGsAZuPE4P0K69CzqQ3 hp4qehMa5X4OPGH2Cr6+OD16I0cwG0Yobr5C7FFrJ2oVXh0fsFCgZSEE5gXjmtH/7ivf gbe24LuMSjYytNGphnsVeKLeBfGmFHEgw9wvB45LrAM79wWh/31Paw4y+M6f0c1HHvBb D/4wp9TatYE6PUNIEVVcMd2AL2+Gye2NSMkLupfGJBqy9jKBBehKmfdXd4aobOFuoN5Q 0J6ACOdAzbGROEwk70MjcGHkQHXaOLPtpkVWjw7rqwCLHJGQnHz2c5zSF4vOvBi7+BB6 YGIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=qh4mKoIkd3kQcCA7D3vdmXdlqTwhkSQo5TnRd9KvzpU=; b=Rr8uZ05UP9LEDSjpDNxqMDyjB8vRCRIpC4xmD4t4pQDl8sOPZViUYkx+dK/PdpcHIt wJa8sXNqSNs6tt6YSy2xCT+oOnxNTYltgu1G+jvsJZXrPLF9jLIIyGTyJ9T+cG1mJtYa rkdDtbzqDAv6+r84zlQy1y1/I17nBTLfDe2OcakVpNxfSHUl0pWwjRkmy/JaOZHDX4l0 QVN9YCsegOfA0FZ5ikQl3Qhj7LglxZIlDJjblIDVK0UVXyQwprWPFunG6RYSA9FJpiJC 3WhaZkXLSXgdHcGXZVB6B2L2VqeXjl+Q3CDPhH6S8XtLITU6Qyw0Brovt+/HwngouTvH 7CpA== X-Gm-Message-State: AOAM532Jigqudm9jxVuHPXpRejWrb2PwiLfnS1yIf4JUkfmqnJl94cHO xf83AmjH/Mt+Ia0Mj33caDF2n1LJnk7fhlNb0zfGmGE0dhXGDg== X-Received: by 2002:a17:906:2818:: with SMTP id r24mr19400531ejc.472.1614663671165; Mon, 01 Mar 2021 21:41:11 -0800 (PST) MIME-Version: 1.0 References: <20210226190454.GD7272@magnolia> <20210226205126.GX4662@dread.disaster.area> <20210226212748.GY4662@dread.disaster.area> <20210227223611.GZ4662@dread.disaster.area> <20210228223846.GA4662@dread.disaster.area> <20210302032805.GM7272@magnolia> In-Reply-To: <20210302032805.GM7272@magnolia> From: Dan Williams Date: Mon, 1 Mar 2021 21:41:02 -0800 Message-ID: Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS To: "Darrick J. Wong" Cc: Dave Chinner , "ruansy.fnst@fujitsu.com" , "linux-kernel@vger.kernel.org" , "linux-xfs@vger.kernel.org" , "linux-nvdimm@lists.01.org" , "linux-fsdevel@vger.kernel.org" , "darrick.wong@oracle.com" , "willy@infradead.org" , "jack@suse.cz" , "viro@zeniv.linux.org.uk" , "linux-btrfs@vger.kernel.org" , "ocfs2-devel@oss.oracle.com" , "hch@lst.de" , "rgoldwyn@suse.de" , "y-goto@fujitsu.com" , "qi.fuli@fujitsu.com" , "fnstml-iaas@cn.fujitsu.com" Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote: > > On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote: > > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote: > > > > > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote: > > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote: > > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote: > > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote: > > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote: > > > > > it points to, check if it points to the PMEM that is being removed, > > > > > grab the page it points to, map that to the relevant struct page, > > > > > run collect_procs() on that page, then kill the user processes that > > > > > map that page. > > > > > > > > > > So why can't we walk the ptescheck the physical pages that they > > > > > map to and if they map to a pmem page we go poison that > > > > > page and that kills any user process that maps it. > > > > > > > > > > i.e. I can't see how unexpected pmem device unplug is any different > > > > > to an MCE delivering a hwpoison event to a DAX mapped page. > > > > > > > > I guess the tradeoff is walking a long list of inodes vs walking a > > > > large array of pages. > > > > > > Not really. You're assuming all a filesystem has to do is invalidate > > > everything if a device goes away, and that's not true. Finding if an > > > inode has a mapping that spans a specific device in a multi-device > > > filesystem can be a lot more complex than that. Just walking inodes > > > is easy - determining whihc inodes need invalidation is the hard > > > part. > > > > That inode-to-device level of specificity is not needed for the same > > reason that drop_caches does not need to be specific. If the wrong > > page is unmapped a re-fault will bring it back, and re-fault will fail > > for the pages that are successfully removed. > > > > > That's where ->corrupt_range() comes in - the filesystem is already > > > set up to do reverse mapping from physical range to inode(s) > > > offsets... > > > > Sure, but what is the need to get to that level of specificity with > > the filesystem for something that should rarely happen in the course > > of normal operation outside of a mistake? > > I can't tell if we're conflating the "a bunch of your pmem went bad" > case with the "all your dimms fell out of the machine" case. From the pmem driver perspective it has the media scanning to find some small handful of cachelines that have gone bad, and it has the driver ->remove() callback to tell it a bunch of pmem is now offline. The NVDIMM device "range has gone bad" mechanism has no way to communicate multiple terabytes have gone bad at once. In fact I think the distinction is important that ->remove() is not treated as ->corrupted_range() because I expect the level of freakout is much worse for a "your storage is offline" notification vs "your storage is corrupted" notification. > If, say, a single cacheline's worth of pmem goes bad on a node with 2TB > of pmem, I certainly want that level of specificity. Just notify the > users of the dead piece, don't flush the whole machine down the drain. Right, something like corrupted_range() is there to say, "keep going upper layers, but note that this handful of sectors now has indeterminant data and will return -EIO on access until repaired". The repair for device-offline is device-online. > > > > > There's likely always more pages than inodes, but perhaps it's more > > > > efficient to walk the 'struct page' array than sb->s_inodes? > > > > > > I really don't see you seem to be telling us that invalidation is an > > > either/or choice. There's more ways to convert physical block > > > address -> inode file offset and mapping index than brute force > > > inode cache walks.... > > > > Yes, but I was trying to map it to an existing mechanism and the > > internals of drop_pagecache_sb() are, in coarse terms, close to what > > needs to happen here. > > Yes. XFS (with rmap enabled) can do all the iteration and walking in > that function except for the invalidate_mapping_* call itself. The goal > of this series is first to wire up a callback within both the block and > pmem subsystems so that they can take notifications and reverse-map them > through the storage stack until they reach an fs superblock. I'm chuckling because this "reverse map all the way up the block layer" is the opposite of what Dave said at the first reaction to my proposal, "can't the mm map pfns to fs inode address_spaces?". I think dax unmap is distinct from corrupted_range() precisely because they are events happening in two different domains, block device sectors vs dax device pfns. Let's step back. I think a chain of ->corrupted_range() callbacks up the block stack terminating in the filesystem with dax implications tacked on is the wrong abstraction. Why not use the existing generic object for communicating bad sector ranges, 'struct badblocks'? Today whenever the pmem driver receives new corrupted range notification from the lower level nvdimm infrastructure(nd_pmem_notify) it updates the 'badblocks' instance associated with the pmem gendisk and then notifies userspace that there are new badblocks. This seems a perfect place to signal an upper level stacked block device that may also be watching disk->bb. Then each gendisk in a stacked topology is responsible for watching the badblock notifications of the next level and storing a remapped instance of those blocks until ultimately the filesystem mounted on the top-level block device is responsible for registering for those top-level disk->bb events. The device gone notification does not map cleanly onto 'struct badblocks'. If an upper level agent really cared about knowing about ->remove() events before they happened it could maybe do something like: dev = disk_to_dev(bdev->bd_disk)->parent; bus_register_notifier(dev->bus. &disk_host_device_notifier_block) ...where it's trying to watch for events that will trigger the driver ->remove() callback on the device hosting a disk. I still don't think that solves the need for a separate mechanism for global dax_device pte invalidation. I think that global dax_device invalidation needs new kernel infrastructure to allow internal users, like dm-writecache and future filesystems using dax for metadata, to take a fault when pmem is offlined. They can't use the direct-map because the direct-map can't fault, and they can't indefinitely pin metadata pages because that blocks ->remove() from being guaranteed of forward progress. Then an invalidation event is indeed a walk of address_space like objects where some are fs-inodes and some are kernel-mode dax-users, and that remains independent from remove events and badblocks notifications because they are independent objects and events. In contrast I think calling something like soft_offline_page() a pfn at a time over terabytes will take forever especially when that event need not fire if the dax_device is not mounted. > Once the information has reached XFS, it can use its own reverse > mappings to figure out which pages of which inodes are now targetted. It has its own sector based reverse mappings, it does not have pfn reverse map. > The future of DAX hw error handling can be that you throw the spitwad at > us, and it's our problem to distill that into mm invalidation calls. > XFS' reverse mapping data is indexed by storage location and isn't > sharded by address_space, so (except for the DIMMs falling out), we > don't need to walk the entire inode list or scan the entire mapping. ->remove() is effectively all the DIMMs falling out for all XFS knows. > Between XFS and DAX and mm, the mm already has the invalidation calls, > xfs already has the distiller, and so all we need is that first bit. > The current mm code doesn't fully solve the problem, nor does it need > to, since it handles DRAM errors acceptably* already. > > * Actually, the hwpoison code should _also_ be calling ->corrupted_range > when DRAM goes bad so that we can detect metadata failures and either > reload the buffer or (if it was dirty) shut down. [..] > > Going forward, for buses like CXL, there will be a managed physical > > remove operation via PCIE native hotplug. The flow there is that the > > PCIE hotplug driver will notify the OS of a pending removal, trigger > > ->remove() on the pmem driver, and then notify the technician (slot > > status LED) that the card is safe to pull. > > Well, that's a relief. Can we cancel longterm RDMA leases now too? > Yes, all problems can be solved with more blinky lights.