From: Dan Williams
Date: Mon, 8 Mar 2021 10:01:52 -0800
Subject: Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
To: "ruansy.fnst@fujitsu.com"
Cc: Linux Kernel Mailing List, linux-xfs, linux-nvdimm, Linux MM,
 linux-fsdevel, device-mapper development, "Darrick J. Wong", david,
 Christoph Hellwig, Alasdair Kergon, Mike Snitzer, Goldwyn Rodrigues,
 "qi.fuli@fujitsu.com", "y-goto@fujitsu.com"
References: <20210208105530.3072869-1-ruansy.fnst@cn.fujitsu.com>
 <20210208105530.3072869-2-ruansy.fnst@cn.fujitsu.com>

On Mon, Mar 8, 2021 at 3:34 AM ruansy.fnst@fujitsu.com wrote:
>
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > > --- a/include/linux/memremap.h
> > > > > +++ b/include/linux/memremap.h
> > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > > >          * the page back to a CPU accessible page.
> > > > >          */
> > > > >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > > +
> > > > > +       /*
> > > > > +        * Handle a memory failure that happens on one page. Notify the
> > > > > +        * processes who are using this page, and try to recover the data
> > > > > +        * on this page if necessary.
> > > > > +        */
> > > > > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > > > +                             int flags);
> > > > >  };
> > > >
> > > > After the conversation with Dave I don't see the point of this. If
> > > > there is a memory_failure() on a page, why not just call
> > > > memory_failure()? That already knows how to find the inode, and the
> > > > filesystem can be notified from there.
> > >
> > > We want memory_failure() to support reflinked files. In this case, we
> > > are not able to track multiple files from a page (this broken page)
> > > because page->mapping and page->index can only track one file. Thus, I
> > > introduce this ->memory_failure(), implemented in the pmem driver, to
> > > call ->corrupted_range() level by level, and finally find out the files
> > > that are using (mmapping) this page.
> >
> > I know the motivation, but this implementation seems backwards. It's
> > already the case that memory_failure() looks up the address_space
> > associated with a mapping. From there I would expect a new 'struct
> > address_space_operations' op to let the fs handle the case when there
> > are multiple address_spaces associated with a given file.
>
> Let me think about it. In this way, we
> 1. associate the file mapping with the dax page in the dax page fault;

I think this needs to be a new type of association that proxies the
representation of the reflink across all involved address_spaces.

> 2. iterate over the reflinked files to send the "kill processes" signal
>    via the new address_space_operation;
> 3. re-associate to another reflinked file mapping when unmapping
>    (rmap query in the filesystem to get the other file).
Perhaps the proxy object is reference counted per-reflink. It seems
error prone to keep changing the association of the pfn while the
reflink is intact.

> It does not handle those dax pages that are not in use, because their
> ->mapping is not associated with any file. I didn't think it through
> until reading your conversation. Here is my understanding: this case
> should be handled by the badblock mechanism in the pmem driver. This
> badblock mechanism will call ->corrupted_range() to tell the filesystem
> to repair the data if possible.

There are two types of notifications. There are badblocks discovered
by the driver (see notify_pmem()) and there are memory_failures()
signalled by the CPU machine-check handler, or the platform BIOS. In
the case of badblocks, that information needs to be considered by the
fs block allocator to avoid / try-to-repair badblocks on allocation,
and to allow listing damaged files that need repair. The
memory_failure() notification needs immediate handling to tear down
mappings to that pfn and to signal processes that have consumed it
with SIGBUS-action-required. Processes that have the poison mapped,
but have not consumed it, receive SIGBUS-action-optional.

> So, we split it into two parts, and the dax device and block device
> won't be mixed up again. Is my understanding right?

Right, it's only the filesystem that knows that the block_device and
the dax_device alias data at the same logical offset. The requirements
for sector error handling and page error handling are separate, like
block_device_operations and dax_operations.

> But the solution above solves hwpoison on only one or a couple of
> pages, which happens rarely (I think). Does the 'pmem remove' operation
> cause hwpoison too? Does it call memory_failure() so many times? I
> haven't understood this yet.

I'm working on a patch here to call memory_failure() on a wide range
for the surprise removal of a dax_device while a filesystem might be
mounted.
It won't be efficient, but there is no other way to notify the kernel that it needs to immediately stop referencing a page.