From: Dan Williams
Date: Thu, 4 Nov 2021 09:24:15 -0700
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
To: Christoph Hellwig
X-Mailing-List: linux-kernel@vger.kernel.org
Wong" , Jane Chu , "david@fromorbit.com" , "vishal.l.verma@intel.com" , "dave.jiang@intel.com" , "agk@redhat.com" , "snitzer@redhat.com" , "dm-devel@redhat.com" , "ira.weiny@intel.com" , "willy@infradead.org" , "vgoyal@redhat.com" , "linux-fsdevel@vger.kernel.org" , "nvdimm@lists.linux.dev" , "linux-kernel@vger.kernel.org" , "linux-xfs@vger.kernel.org" Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 4, 2021 at 1:30 AM Christoph Hellwig wrote: > > On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote: > > Is the exception table requirement not already fulfilled by: > > > > sigaction(SIGBUS, &act, 0); > > ... > > if (sigsetjmp(sj_env, 1)) { > > ... > > > > ...but yes, that's awkward when all you want is an error return from a > > copy operation. > > Yeah. The nice thing about the kernel uaccess / _nofault helpers is > that they allow normal error handling flows. > > > For _nofault I'll note that on the kernel side Linus was explicit > > about not mixing fault handling and memory error exception handling in > > the same accessor. That's why copy_mc_to_kernel() and > > copy_{to,from}_kernel_nofault() are distinct. > > I've always wondered why we need all this mess. But if the head > penguin wants it.. > > > I only say that to probe > > deeper about what a "copy_mc()" looks like in userspace? Perhaps an > > interface to suppress SIGBUS generation and register a ring buffer > > that gets filled with error-event records encountered over a given > > MMAP I/O code sequence? > > Well, the equivalent to the kernel uaccess model would be to register > a SIGBUS handler that uses an exception table like the kernel, and then > if you use the right helpers to load/store they can return errors. > > The other option would be something more like the Windows Structured > Exception Handling: > > https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160 Yes, try-catch blocks for PMEM is what is needed if it's not there already from PMDK. Searching for SIGBUS in PMDK finds things like: https://github.com/pmem/pmdk/blob/26449a51a86861db17feabdfefaa5ee4d5afabc9/src/libpmem2/mcsafe_ops_posix.c > > > > > I think you misunderstood me. I don't think pmem needs to be > > > decoupled from the read/write path. But I'm very skeptical of adding > > > a new flag to the common read/write path for the special workaround > > > that a plain old write will not actually clear errors unlike every > > > other store interfac. > > > > Ah, ok, yes, I agree with you there that needing to redirect writes to > > a platform firmware call to clear errors, and notify the device that > > its error-list has changed is exceedingly awkward. > > Yes. And that is the big difference to every other modern storage > device. SSDs always write out of place and will just not reuse bad > blocks, and all drivers built in the last 25-30 years will also do > internal bad block remapping. > No, the big difference with every other modern storage device is access to byte-addressable storage. Storage devices get to "cheat" with guaranteed minimum 512-byte accesses. So you can arrange for writes to always be large enough to scrub the ECC bits along with the data. For PMEM and byte-granularity DAX accesses the "sector size" is a cacheline and it needed a new CPU instruction before software could atomically update data + ECC. Otherwise, with sub-cacheline accesses, a RMW cycle can't always be avoided. 
> > That said, even if
> > the device-side error-list auto-updated on write (like the promise of
> > MOVDIR64B) there's still the question about when to do management on
> > the software error lists in the driver and/or filesystem. I.e. given
> > that XFS at least wants to be aware of the error lists for block
> > allocation and "list errors" type features. More below...
>
> Well, the whole problem is that we should not have to manage this at
> all, and this is where I blame Intel.  There is no good reason to not
> slightly overprovision the nvdimms and just do internal bad page
> remapping like every other modern storage device.

I don't understand what overprovisioning has to do with better error
management? No other storage device has seen fit to be as transparent
about communicating its error list and offering ways to proactively
scrub it. Dave and Darrick rightly saw this and said "hey, the FS could
do a much better job for the user if it knew about this error list". So
I don't get what this argument about spare blocks has to do with what
XFS wants, i.e. an rmap facility to communicate files that have been
clobbered by cosmic rays and other calamities.

> > Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> > interface for the filesystem to request "try harder" to return data?
>
> Trying harder to _get_ data or to _store_ data?  All the discussion
> here seems to be about actually writing data.
>
> > If the device has a recovery slow-path, or error tracking granularity
> > is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> > device/driver leeway to do better than the typical fast path. For
> > writes though, I can only come up with the use case of this being a
> > signal to the driver to take the opportunity to do error-list
> > management relative to the incoming write data.
>
> Well, while devices have data recovery slow paths, they tend to use
> them automatically.  What I'm actually seeing in practice is flags in
> the storage interfaces to skip this slow path recovery because the
> higher level software would prefer to instead recover e.g. from remote
> copies.

Ok.

> > However, if signaling that "now is the time to update error-lists" is
> > the requirement, I imagine the @kaddr returned from
> > dax_direct_access() could be made to point to an unmapped address
> > representing the poisoned page. Then, arrange for a pmem-driver fault
> > handler to emulate the copy operation and do the slow path updates
> > that would otherwise have been gated by RWF_RECOVER_DATA.
>
> That does sound like a much better interface than most of what we've
> discussed so far.
>
> > Although, I'm not excited about teaching every PMEM arch's fault
> > handler about this new source of kernel faults. Other ideas?
> > RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> > I'm willing to do what it takes to move this error management
> > capability forward.
>
> Out of the low-intrusiveness options, Jane's previous series to
> automatically retry after calling a clear_poison operation seems
> like the best idea so far.  We just need to also think about what
> we want to do for direct users of ->direct_access that do not use
> the mcsafe iov_iter helpers.

Those exist? Even dm-writecache uses copy_mc_to_kernel().
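I.e. something like the following pattern (illustrative fragment, not a
verbatim dm-writecache excerpt; assumes the v5.15-era 5-argument
dax_direct_access() signature):

    /*
     * Sketch: reading through ->direct_access() with the
     * machine-check-safe copy helper.  copy_mc_to_kernel() returns
     * the number of bytes *not* copied, so a non-zero return means
     * the copy tripped over poison.
     */
    long nr;
    void *kaddr;

    nr = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
    if (nr < 0)
            return nr;
    if (copy_mc_to_kernel(buf, kaddr, size))
            return -EIO;    /* poisoned cacheline hit mid-copy */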