From: Dan Williams
Date: Thu, 18 Mar 2021 12:20:35 -0700
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
To: Dave Chinner
Cc: Linux MM, linux-nvdimm, Jason Gunthorpe, Christoph Hellwig,
    Shiyang Ruan, Vishal Verma, Dave Jiang, Ira Weiny, Matthew Wilcox,
    Jan Kara, Andrew Morton, Naoya Horiguchi, "Darrick J. Wong",
    linux-fsdevel, Linux Kernel Mailing List
In-Reply-To: <20210318045745.GC349301@dread.disaster.area>
References: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>
    <161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com>
    <20210318045745.GC349301@dread.disaster.area>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a
> > 'struct page' that might be in the process of being removed by the
> > driver, but he rightly points out that this should be solved the
> > same way all gup-fast synchronization is solved: invalidate the
> > mapping and let the gup slow path do @pgmap synchronization [1].
> >
> > To achieve that, new user mappings need to stop being created and
> > all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case, as kill_dax() prevents
> > future faults from installing a pte, and the single device-dax
> > inode address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax, where device pages
> > could be mapped by any number of inode address_space instances. An
> > initial thought was to treat the device removal event like a
> > drop_pagecache_sb() event that walks superblocks and unmaps all
> > inodes. However, Dave points out that it is not just the
> > filesystem user-mappings that need to react to global DAX
> > page-unmap events; it is also filesystem metadata (proposed DAX
> > metadata access) and other drivers (upstream DM-writecache) that
> > need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the
> > loss of a page (via corruption or surprise removal) is
> > memory_failure(). The downside of memory_failure() is that it is a
> > pfn-at-a-time interface. However, the events that would trigger
> > the need to call memory_failure() over a full PMEM device should
> > be rare.
>
> This is a highly suboptimal design. Filesystems only need a single
> callout to trigger a shutdown that unmaps every active mapping in
> the filesystem - we do not need a page-by-page error notification
> which results in 250 million hwpoison callouts per TB of pmem to do
> this.
>
> Indeed, the moment we get the first hwpoison from this patch, we'll
> map it to the primary XFS superblock and we'd almost certainly
> consider losing the storage behind that block to be a shutdown
> trigger. During the shutdown, the filesystem should unmap all the
> active mappings (we already need to add this to shutdown on DAX
> regardless of this device remove issue) and so we really don't need
> a page-by-page notification of badness.

XFS doesn't, but what about device-mapper and other agents? Even if
the driver had a callback up the stack, memory_failure() still needs
to be able to trigger failures down the stack for CPU-consumed
poison.

>
> AFAICT, it's going to take minutes, maybe hours, to do the
> page-by-page iteration to hwpoison every page. It's going to take a
> few seconds for the filesystem shutdown to run a device-wide
> invalidation.
>
> So, yeah, I think this should simply be a single ranged call to the
> filesystem like:
>
> 	->memory_failure(dev, 0, -1ULL)
>
> to tell the filesystem that the entire backing device has gone
> away, and leave the filesystem to handle failure entirely at the
> filesystem level.

So I went with memory_failure() after our discussion of all the
other agents in the system that might care about these pfns going
offline, relying on memory_failure() to route down to each of those.
I.e. the "reuse the drop_pagecache_sb() model" idea was indeed
insufficient.

Now I'm trying to reconcile the fact that platform poison handling
will hit memory_failure() first and may not immediately reach the
driver, if ever (see the perennially awkward firmware-first-mode
error handling: ghes_handle_memory_failure()). So even if the
->memory_failure(dev...) up-call exists, there is no guarantee it
can be called for all poison before the memory_failure() down-call
happens. Which means that, regardless of whether
->memory_failure(dev...) exists, memory_failure() needs to be able
to do the right thing.
Combine that with the fact that new buses like CXL might be
configured in "poison on decode error" mode, which means that a
memory_failure() storm can happen regardless of whether the driver
initiates it programmatically.

How about a mechanism to optionally let a filesystem take over
memory-failure handling for a range of pfns, one that
memory_failure() can consult so it can fail ranges at a time rather
than pfn by pfn? Something like a new 'struct dax_operations' op:

	void (*memory_failure_register)(struct dax_device *, void *data);

Any agent that claims a dax_dev could register to take over
memory_failure() handling for any event that happens in that range.
This would be routed through device-mapper like any other 'struct
dax_operations' op. I think that meets your requirement to get
notifications of all the events you want to handle, while still
allowing memory_failure() to be the last resort for everything that
has not opted into this error handling.