From: Dan Williams
Date: Tue, 23 Mar 2021 19:19:28 -0700
Subject: Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
To: "ruansy.fnst@fujitsu.com"
Cc: Linux Kernel Mailing List, linux-xfs, linux-nvdimm, Linux MM,
 linux-fsdevel, device-mapper development, "Darrick J. Wong", david,
 Christoph Hellwig, Alasdair Kergon, Mike Snitzer, Goldwyn Rodrigues,
 "qi.fuli@fujitsu.com", "y-goto@fujitsu.com"
References: <20210208105530.3072869-1-ruansy.fnst@cn.fujitsu.com>
 <20210208105530.3072869-2-ruansy.fnst@cn.fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Wong" , david , Christoph Hellwig , Alasdair Kergon , Mike Snitzer , Goldwyn Rodrigues , "qi.fuli@fujitsu.com" , "y-goto@fujitsu.com" Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 18, 2021 at 7:18 PM ruansy.fnst@fujitsu.com wrote: > > > > > -----Original Message----- > > From: ruansy.fnst@fujitsu.com > > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure() > > > > > > > > > > > > > > After the conversation with Dave I don't see the point of this. > > > > > > > If there is a memory_failure() on a page, why not just call > > > > > > > memory_failure()? That already knows how to find the inode and > > > > > > > the filesystem can be notified from there. > > > > > > > > > > > > We want memory_failure() supports reflinked files. In this > > > > > > case, we are not able to track multiple files from a page(this > > > > > > broken > > > > > > page) because > > > > > > page->mapping,page->index can only track one file. Thus, I > > > > > > page->introduce this > > > > > > ->memory_failure() implemented in pmem driver, to call > > > > > > ->->corrupted_range() > > > > > > upper level to upper level, and finally find out files who are > > > > > > using(mmapping) this page. > > > > > > > > > > > > > > > > I know the motivation, but this implementation seems backwards. > > > > > It's already the case that memory_failure() looks up the > > > > > address_space associated with a mapping. From there I would expect > > > > > a new 'struct address_space_operations' op to let the fs handle > > > > > the case when there are multiple address_spaces associated with a given > > file. > > > > > > > > > > > > > Let me think about it. In this way, we > > > > 1. associate file mapping with dax page in dax page fault; > > > > > > I think this needs to be a new type of association that proxies the > > > representation of the reflink across all involved address_spaces. > > > > > > > 2. iterate files reflinked to notify `kill processes signal` by the > > > > new address_space_operation; > > > > 3. re-associate to another reflinked file mapping when unmmaping > > > > (rmap qeury in filesystem to get the another file). > > > > > > Perhaps the proxy object is reference counted per-ref-link. It seems > > > error prone to keep changing the association of the pfn while the reflink is > > in-tact. > > Hi, Dan > > > > I think my early rfc patchset was implemented in this way: > > - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping, > > offset) when causing dax page fault. > > - Mount this tree on page->zone_device_data which is not used in fsdax, so > > that we can iterate reflinked file mappings in memory_failure() easily. > > In my understanding, the dax-rmap tree is the proxy object you mentioned. If > > so, I have to say, this method was rejected. Because this will cause huge > > overhead in some case that every dax page have one dax-rmap tree. > > > > Hi, Dan > > How do you think about this? I am still confused. Could you give me some advice? So I think the primary driver of this functionality is dax-devices and the architectural model for memory failure where several architectures and error handlers know how to route pfn failure to the memory_failure() frontend. 
Compare that to block-devices, where sector failure has no similar
framework, and despite some initial interest in reusing 'struct
badblocks' for this type of scenario there has been no real uptake to
expand 'struct badblocks' outside of the pmem driver. I think the work
you have done for ->corrupted_range() just needs to be repurposed away
from a block-device operation to dax-device infrastructure. Christoph's
pushback on extending block_device_operations makes sense to me because
there is likely no other user of this facility than the pmem driver,
and the pmem driver only needs it for the vestigial reason that
filesystems mount on block-devices and not dax-devices.

Recently Dave drove home the point that a filesystem can't do anything
with pfns, it needs LBAs. A dax-device does not have LBAs, but it does
operate on the concept of device-relative offsets. The filesystem is
allowed to assume that dax-device:PFN[device_byte_offset >> PAGE_SHIFT]
aliases the same data as the associated
block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also
reiterated that this interface should be range based, which you already
had, but which I did not include in my attempt to communicate the mass
failure of an entire surprise-removed device.

So I think the path forward is:

- teach memory_failure() to allow for ranged failures

- let interested drivers register for memory failure events via a
  blocking_notifier_head

- teach memory_failure() to optionally let the notifier chain claim the
  event vs its current default of walking page->mapping

- teach the pmem driver to register for memory_failure() events and
  filter the ones that apply to pfns that the driver owns

- drop the nfit driver's usage of the mce notifier chain since
  memory_failure() is a superset of what the mce notifier communicates

- augment the pmem driver's view of badblocks that it gets from address
  range scrub with ones it gets from memory_failure() events

- when pmem handles a memory_failure() event or an address range scrub
  event, fire a new event on a new per-dax-device
  blocking_notifier_head indicating the dax-relative offset ranges of
  the translated pfns. This notification can optionally indicate
  failure, offline (for removal), and online (for repaired ranges).

- teach dm to receive dax-device notifier events from its leaf devices
  and then translate them into dax-device notifications relative to the
  dm-device offset. This would seem to be a straightforward conversion
  of what you have done with ->corrupted_range()

- teach filesystems to register for dax-device notifiers

With all of that in place an interested filesystem can take ownership
of a memory failure that impacts a range of pfns it is responsible for
via a dax-device, while a filesystem that is not interested can default
to standard single-pfn-at-a-time error handling and the assumption that
page->mapping only refers to a single address space.

This obviously does not solve Dave's desire to get this type of error
reporting on block_devices, but there is nothing stopping a parallel
notifier chain from being created for block-devices; that is orthogonal
to the requirements and capabilities provided by dax-devices.
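
To make the middle steps of that list concrete, here is a minimal
sketch of what the memory-failure notifier chain and a pmem-side
listener could look like. Only the blocking_notifier_* primitives and
the NOTIFY_* return codes are existing kernel interfaces; the event
structure, chain name, and helper functions are hypothetical
illustrations of the proposal, not code that exists today.

#include <linux/notifier.h>
#include <linux/types.h>

/* Hypothetical payload: a ranged failure rather than a single pfn. */
struct mf_range_event {
	unsigned long pfn;		/* first failed pfn */
	unsigned long nr_pages;		/* length of the failed range */
	int flags;			/* MF_* flags from memory_failure() */
};

/* Hypothetical chain that memory_failure() would fire before falling
 * back to its current page->mapping walk. */
static BLOCKING_NOTIFIER_HEAD(mf_notifier_chain);

/* Hypothetical helper: does this pfn range belong to the pmem device? */
static bool pmem_owns_pfn_range(unsigned long pfn, unsigned long nr_pages);

static int pmem_mf_notify(struct notifier_block *nb, unsigned long action,
			  void *data)
{
	struct mf_range_event *ev = data;

	if (!pmem_owns_pfn_range(ev->pfn, ev->nr_pages))
		return NOTIFY_DONE;	/* not ours, default handling proceeds */

	/*
	 * Here the driver would record the range in its badblocks view
	 * and re-fire a per-dax-device notification carrying
	 * dax-relative byte offsets instead of raw pfns.
	 */
	return NOTIFY_STOP;		/* claim the event */
}

static struct notifier_block pmem_mf_nb = {
	.notifier_call = pmem_mf_notify,
};

/* Registration at driver probe time; memory_failure() would then call
 * blocking_notifier_call_chain(&mf_notifier_chain, 0, &ev) for a
 * failed range. */
static int pmem_register_mf_listener(void)
{
	return blocking_notifier_chain_register(&mf_notifier_chain,
						&pmem_mf_nb);
}

The second-stage per-dax-device notification, and the dm translation of
it, could follow the same pattern, just keyed on device-relative byte
offsets instead of raw pfns.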