Date: Thu, 4 Nov 2021 01:30:48 -0700
From: Christoph Hellwig
To: Dan Williams
Cc: Christoph Hellwig, "Darrick J. Wong", Jane Chu, david@fromorbit.com,
    vishal.l.verma@intel.com, dave.jiang@intel.com, agk@redhat.com,
    snitzer@redhat.com, dm-devel@redhat.com, ira.weiny@intel.com,
    willy@infradead.org, vgoyal@redhat.com, linux-fsdevel@vger.kernel.org,
    nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
    linux-xfs@vger.kernel.org
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
References: <2102a2e6-c543-2557-28a2-8b0bdc470855@oracle.com>
 <20211028002451.GB2237511@magnolia>

On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote:
> Is the exception table requirement not already fulfilled by:
>
>     sigaction(SIGBUS, &act, 0);
>     ...
>     if (sigsetjmp(sj_env, 1)) {
>     ...
>
> ...but yes, that's awkward when all you want is an error return from a
> copy operation.

Yeah.  The nice thing about the kernel uaccess / _nofault helpers is
that they allow normal error handling flows.

> For _nofault I'll note that on the kernel side Linus was explicit
> about not mixing fault handling and memory error exception handling in
> the same accessor.  That's why copy_mc_to_kernel() and
> copy_{to,from}_kernel_nofault() are distinct.

I've always wondered why we need all this mess.  But if the head
penguin wants it..

> I only say that to probe deeper about what a "copy_mc()" looks like in
> userspace?  Perhaps an interface to suppress SIGBUS generation and
> register a ring buffer that gets filled with error-event records
> encountered over a given MMAP I/O code sequence?
Well, the equivalent to the kernel uaccess model would be to register a
SIGBUS handler that uses an exception table like the kernel, and then if
you use the right helpers to load/store they can return errors.  The
other option would be something more like the Windows Structured
Exception Handling:

https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160

> > I think you misunderstood me.  I don't think pmem needs to be
> > decoupled from the read/write path.  But I'm very skeptical of adding
> > a new flag to the common read/write path for the special workaround
> > that a plain old write will not actually clear errors unlike every
> > other store interface.
>
> Ah, ok, yes, I agree with you there that needing to redirect writes to
> a platform firmware call to clear errors, and notify the device that
> its error-list has changed is exceedingly awkward.

Yes.  And that is the big difference from every other modern storage
device.  SSDs always write out of place and will just not reuse bad
blocks, and all drives built in the last 25-30 years will also do
internal bad block remapping.

> That said, even if the device-side error-list auto-updated on write
> (like the promise of MOVDIR64B) there's still the question about when
> to do management on the software error lists in the driver and/or
> filesystem.  I.e. given that XFS at least wants to be aware of the
> error lists for block allocation and "list errors" type features.
> More below...

Well, the whole problem is that we should not have to manage this at
all, and this is where I blame Intel.  There is no good reason not to
slightly overprovision the nvdimms and just do internal bad page
remapping like every other modern storage device.

> Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> interface for the filesystem to request "try harder" to return data?

Trying harder to _get_ data or to _store_ data?  All the discussion
here seems to be about actually writing data.
> If the device has a recovery slow-path, or error tracking granularity
> is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> device/driver leeway to do better than the typical fast path.  For
> writes though, I can only come up with the use case of this being a
> signal to the driver to take the opportunity to do error-list
> management relative to the incoming write data.

Well, while devices have data recovery slow paths they tend to use
those automatically.  What I'm actually seeing in practice is flags in
the storage interfaces to skip this slow path recovery, because the
higher level software would prefer to instead recover e.g. from remote
copies.

> However, if signaling that "now is the time to update error-lists" is
> the requirement, I imagine the @kaddr returned from
> dax_direct_access() could be made to point to an unmapped address
> representing the poisoned page.  Then, arrange for a pmem-driver fault
> handler to emulate the copy operation and do the slow path updates
> that would otherwise have been gated by RWF_RECOVER_DATA.

That does sound like a much better interface than most of what we've
discussed so far.

> Although, I'm not excited about teaching every PMEM arch's fault
> handler about this new source of kernel faults.  Other ideas?
> RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> I'm willing to do what it takes to move this error management
> capability forward.

Out of the low-intrusiveness options, Jane's previous series to
automatically retry after calling a clear_poison operation seems like
the best idea so far.  We just need to also think about what we want to
do for direct users of ->direct_access that do not use the mcsafe
iov_iter helpers.