From: Dan Williams
Date: Tue, 2 Nov 2021 12:57:10 -0700
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
To: Christoph Hellwig
Cc: "Darrick J. Wong", Jane Chu, david@fromorbit.com,
    vishal.l.verma@intel.com, dave.jiang@intel.com, agk@redhat.com,
    snitzer@redhat.com, dm-devel@redhat.com, ira.weiny@intel.com,
    willy@infradead.org, vgoyal@redhat.com, linux-fsdevel@vger.kernel.org,
    nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
    linux-xfs@vger.kernel.org

On Mon, Nov 1, 2021 at 11:19 PM Christoph Hellwig wrote:
>
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> > ...so would you happen to know if anyone's working on solving this
> > problem for us by putting the memory controller in charge of dealing
> > with media errors?
>
> The only one who could know is Intel..
>
> > The trouble is, we really /do/ want to be able to (re)write the failed
> > area, and we probably want to try to read whatever we can. Those are
> > reads and writes, not {pre,f}allocation activities. This is where Dave
> > and I arrived at a month ago.
> >
> > Unless you'd be ok with a second IO path for recovery where we're
> > allowed to be slow? That would probably have the same user interface
> > flag, just a different path into the pmem driver.
>
> Which is fine with me. If you look at the API here we do have the
> RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
> API, which then gets special casing over three methods.
>
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this, I think this actually is silly, and it is
> not my point.
>
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it out first. What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense. Most of the DAX API is based around getting a memory mapping
> using ->direct_access; it is just the read/write path, which is a slow
> path, that actually uses this. I have a very WIP patch series to try
> to sort this out here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> But back to this series. The basic DAX model is that the caller gets a
> memory mapping and just works on that, maybe calling a sync after a
> write in a few cases. So any kind of recovery really needs to be able
> to work with that model, as going forward the copy_to/from_iter path
> will be used less and less, i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver. As an example, the dm-writecache driver, the
> pending bcache nvdimm support, and the (horrible and out-of-tree) nova
> file system won't even use this path. We need to find a way to support
> recovery for them. And overloading it onto the read/write path, which
> is not the main path for DAX but is the absolute fast path for 99% of
> kernel users, is a horrible idea.
>
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
>
>  a) actually works with the intended direct memory map use case
>  b) doesn't really affect the normal kernel too much
>
> ?

Ok, now I see where you are going, but I don't see line of sight to
something better than RWF_RECOVER_DATA.

This goes back to one of the original DAX concerns of wanting a kernel
library for coordinating PMEM mmap I/O vs leaving userspace to wrap
PMEM semantics on top of a DAX mapping.
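For concreteness, the "userspace wraps PMEM semantics on a DAX mapping"
model boils down to something like the sketch below. It is illustrative
only: it assumes a file on a filesystem mounted with -o dax, a libc new
enough to expose MAP_SYNC, and it uses msync() where a real PMEM
library would flush CPU caches directly (CLWB + fence).

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* assumes 'path' lives on a DAX-mounted fs and is at least 'len' bytes */
int pmem_mmap_write(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* the "I/O" is just a CPU store; poison here surfaces as SIGBUS */
	memcpy(addr, buf, len);

	/* persistence point; a PMEM library would CLWB + fence instead */
	msync(addr, len, MS_SYNC);

	munmap(addr, len);
	close(fd);
	return 0;
}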
The problem is that mmap-I/O has this error-handling-API issue whether
it is a DAX mapping or not. I.e. a memory failure in page cache is
going to signal the process the same way, and it will need to fall back
to something other than mmap I/O to make forward progress. This is not
a PMEM, Intel, or even x86 problem, it's a generic
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE problem.

CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will receive
SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled and then
rely on readv(2)/writev(2) to recover. Do you see a readily available
way to improve upon that model without CPU instruction changes? Even
with CPU instruction changes, do you think it could improve much upon
the model of interrupting the process when a load instruction aborts?

I do agree with you that DAX needs to separate itself from block, but I
don't think it follows that DAX also needs to separate itself from
readv/writev for when a kernel slow path needs to get involved, because
mmap I/O (just CPU instructions) does not have the proper semantics.
Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE architectures
to implement those semantics in new / augmented CPU instructions, you
will likely not get all of them to move, and certainly not in any
near-term timeframe, so the kernel path will be around indefinitely.

Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
storage besides PMEM and helps storage drivers do better than
large-blast-radius "I/O error" completions with no other recourse.
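Spelled out, the SIGBUS-then-write-back recovery flow described above
looks roughly like the sketch below. Again illustrative only: the
handler bookkeeping is simplified, and the 'rwf_flags' argument to the
recovery write is where the flag proposed by this series
(RWF_RECOVER_DATA, spelled RWF_RECOVERY_DATA in the patch subject)
would be passed; it is not an existing uapi flag.

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

static volatile sig_atomic_t poison_seen;
static void *poison_addr;
static size_t poison_len;

/* SIGBUS from memory-failure carries BUS_MCEERR_AR/AO plus the bad range */
static void mce_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		poison_addr = si->si_addr;
		poison_len = (size_t)1 << si->si_addr_lsb;
		poison_seen = 1;
	}
	/* a real consumer would siglongjmp() past the faulting access */
}

static void install_mce_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction = mce_handler,
		.sa_flags = SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);
}

/*
 * Off the signal path: rewrite the failed file range through the syscall
 * interface instead of the mapping.  'rwf_flags' is where a caller would
 * pass the recovery flag proposed by this series.
 */
static ssize_t recover_range(int fd, off_t file_off, const void *good_copy,
			     size_t len, int rwf_flags)
{
	struct iovec iov = {
		.iov_base = (void *)good_copy,
		.iov_len = len,
	};

	return pwritev2(fd, &iov, 1, file_off, rwf_flags);
}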