From: Dan Williams
Date: Thu, 7 Apr 2022 18:38:05 -0700
Subject: Re: [PATCH v11 1/8] dax: Introduce holder for dax_device
To: "Darrick J. Wong"
Cc: Jane Chu, Christoph Hellwig, Shiyang Ruan,
    Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
    linux-fsdevel, david, "Luck, Tony", Mauro Carvalho Chehab
X-Mailing-List: linux-kernel@vger.kernel.org

[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong wrote:
>
> On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 5:55 PM Jane Chu wrote:
> > >
> > > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > > >>> As the code I pasted before, pmem driver will subtract its ->data_offset,
> > > >>> which is byte-based. And the filesystem who implements ->notify_failure()
> > > >>> will calculate the offset in unit of byte again.
> > > >>>
> > > >>> So, leave its function signature byte-based, to avoid repeated conversions.
> > > >>
> > > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > > >
> > > > FWIW I'd convinced myself that the reason for using byte units is to
> > > > make it possible to reduce the pmem failure blast radius to subpage
> > > > units... but then I've also been distracted for months. :/
> > > >
> > >
> > > Yes, thanks Darrick! I recall that.
> > > Maybe just add a comment about why byte unit is used?
> >
> > I think we start with page failure notification and then figure out
> > how to get finer grained through the dax interface in follow-on
> > changes.
> > Otherwise, for finer grained error handling support,
> > memory_failure() would also need to be converted to stop upcasting
> > cache-line granularity to page granularity failures. The native MCE
> > notification communicates a 'struct mce' that can be in terms of
> > sub-page bytes, but the memory management implications are all page
> > based. I assume the FS implications are all FS-block-size based?
>
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the file
> contents.
>
> For fs metadata, it really depends. In principle, if (say) we could get
> byte granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or something
> that was storing useful data, then you'd have to rebuild the whole block
> (or the whole data structure), but that's not necessarily a given.

tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem it isn't information that memory_failure()
and DAX can utilize.
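
As a rough illustration of that upcasting, here is a minimal userspace
sketch that widens a byte-granularity error range to the whole pages that
would have to be mapped out. This is not kernel code: the 4 KiB page size
and the names byte_range, pfn_range and upcast_to_pages() are all
assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for illustration only; real values are arch-specific. */
#define SKETCH_PAGE_SHIFT 12u

/* Hypothetical: a raw hardware error report in byte units. */
struct byte_range {
	uint64_t start;	/* physical address of the first bad byte */
	uint64_t len;	/* number of bad bytes, e.g. one 64-byte cacheline */
};

/* Hypothetical: the page-granularity view used to map pages out. */
struct pfn_range {
	uint64_t first_pfn;
	uint64_t nr_pages;
};

/*
 * Widen a byte range to every page it touches. The sub-page detail
 * (which cacheline inside the page is bad) is lost in this step.
 */
static struct pfn_range upcast_to_pages(struct byte_range r)
{
	assert(r.len > 0);

	uint64_t first = r.start >> SKETCH_PAGE_SHIFT;
	uint64_t last = (r.start + r.len - 1) >> SKETCH_PAGE_SHIFT;

	return (struct pfn_range){ .first_pfn = first,
				   .nr_pages = last - first + 1 };
}
```

A single bad cacheline anywhere inside a page fails that whole page, and
a range that straddles a page boundary fails both pages.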

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped, because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.

Currently, when a raw hardware event happens, there are mechanisms such
as arch-specific notifier chains (powerpc::mce_register_notifier() and
x86::mce_register_decode_chain()) or other platform firmware code like
ghes_edac_report_mem_error() that uplevel the error to a coarse,
page-granularity failure while emitting the fine-granularity error
event to userspace.

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
than the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of the
fs->notify_failure() handler is this code:

+       if (pgmap->ops->memory_failure) {
+               rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+                               flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte-based interface will likely
need to be something driven by another agent. Perhaps something like
rasdaemon in userspace translating all the arch-specific physical
address events back into device-relative offsets and then calling a
new ABI that is serviced by fs->notify_failure() on the backend.