From: Dan Williams
Date: Tue, 9 Nov 2021 13:02:23 -0800
Subject: Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
To: Jane Chu
Cc: Christoph Hellwig, david, "Darrick J. Wong", Vishal L Verma, Dave Jiang,
 Alasdair Kergon, Mike Snitzer, device-mapper development, "Weiny, Ira",
 Matthew Wilcox, Vivek Goyal, linux-fsdevel, Linux NVDIMM,
 Linux Kernel Mailing List, linux-xfs
In-Reply-To: <15f01d51-2611-3566-0d08-bdfbec53f88c@oracle.com>
References: <20211106011638.2613039-1-jane.chu@oracle.com>
 <20211106011638.2613039-3-jane.chu@oracle.com>
 <15f01d51-2611-3566-0d08-bdfbec53f88c@oracle.com>
Wong" , Vishal L Verma , Dave Jiang , Alasdair Kergon , Mike Snitzer , device-mapper development , "Weiny, Ira" , Matthew Wilcox , Vivek Goyal , linux-fsdevel , Linux NVDIMM , Linux Kernel Mailing List , linux-xfs Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 9, 2021 at 11:59 AM Jane Chu wrote: > > On 11/9/2021 10:48 AM, Dan Williams wrote: > > On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig wrote: > >> > >> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote: > >>> static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, > >>> void *addr, size_t bytes, struct iov_iter *i, int mode) > >>> { > >>> + phys_addr_t pmem_off; > >>> + size_t len, lead_off; > >>> + struct pmem_device *pmem = dax_get_private(dax_dev); > >>> + struct device *dev = pmem->bb.dev; > >>> + > >>> + if (unlikely(mode == DAX_OP_RECOVERY)) { > >>> + lead_off = (unsigned long)addr & ~PAGE_MASK; > >>> + len = PFN_PHYS(PFN_UP(lead_off + bytes)); > >>> + if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) { > >>> + if (lead_off || !(PAGE_ALIGNED(bytes))) { > >>> + dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n", > >>> + addr, bytes); > >>> + return (size_t) -EIO; > >>> + } > >>> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset; > >>> + if (pmem_clear_poison(pmem, pmem_off, bytes) != > >>> + BLK_STS_OK) > >>> + return (size_t) -EIO; > >>> + } > >>> + } > >> > >> This is in the wrong spot. As seen in my WIP series individual drivers > >> really should not hook into copying to and from the iter, because it > >> really is just one way to write to a nvdimm. How would dm-writecache > >> clear the errors with this scheme? > >> > >> So IMHO going back to the separate recovery method as in your previous > >> patch really is the way to go. If/when the 64-bit store happens we > >> need to figure out a good way to clear the bad block list for that. > > > > I think we just make error management a first class citizen of a > > dax-device and stop abstracting it behind a driver callback. That way > > the driver that registers the dax-device can optionally register error > > management as well. Then fsdax path can do: > > > > rc = dax_direct_access(..., &kaddr, ...); > > if (unlikely(rc)) { > > kaddr = dax_mk_recovery(kaddr); > > Sorry, what does dax_mk_recovery(kaddr) do? I was thinking this just does the hackery to set a flag bit in the pointer, something like: return (void *) ((unsigned long) kaddr | DAX_RECOVERY) > > > dax_direct_access(..., &kaddr, ...); > > return dax_recovery_{read,write}(..., kaddr, ...); > > } > > return copy_{mc_to_iter,from_iter_flushcache}(...); > > > > Where, the recovery version of dax_direct_access() has the opportunity > > to change the page permissions / use an alias mapping for the access, > > again, sorry, what 'page permissions'? memory_failure_dev_pagemap() > changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?), > do you mean to reverse the change? Right, the result of the conversation with Boris is that memory_failure() should mark the page as NP in call cases, so dax_direct_access() needs to create a UC mapping and dax_recover_{read,write}() would sink that operation and either return the page to NP after the access completes, or convert it to WB if the operation cleared the error. 

> > dax_recovery_read() allows reading the good cachelines out of a
> > poisoned page, and dax_recovery_write() coordinates error list
> > management and returning a poison page to full write-back caching
> > operation when no more poisoned cachelines are detected in the page.
> >
>
> How about introducing 3 dax_recover_ APIs:
> dax_recover_direct_access(): similar to dax_direct_access() except
> it ignores the error list and returns the kaddr, and hence is also
> optional, exported by a device driver that has the ability to
> detect errors;
> dax_recovery_read(): optional, supported by the pmem driver only,
> reads as much data as possible up to the poisoned page;

It wouldn't be a property of the pmem driver, I expect it would be a
flag on the dax device indicating whether to attempt recovery or not.
I.e. get away from this being a pmem callback and make this a native
capability of a dax device.

> dax_recovery_write(): optional, supported by the pmem driver only,
> first clear-poison, then write.
>
> Should we worry about the dm targets?

The dm targets after Christoph's conversion should be able to do all
the translation at direct access time and then dax_recovery_X can be
done on the resulting, already translated kaddr.

> Both dax_recovery_read/write() are hooked up to dax_iomap_iter().

Yes.
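For discussion, a rough sketch of what pushing this down to the dax
device could look like; the DAXDEV_RECOVERY flag, the struct layout, and
the signatures below are assumptions for illustration, not the actual
kernel interfaces:

#include <stddef.h>
#include <stdbool.h>
#include <sys/types.h>

/* Assumed capability flag, set by drivers that can detect/clear poison. */
#define DAXDEV_RECOVERY (1UL << 0)

struct dax_device {
	unsigned long flags;
	/* ... */
};

static inline bool dax_recovery_capable(struct dax_device *dax_dev)
{
	return dax_dev->flags & DAXDEV_RECOVERY;
}

/* Read the still-good cachelines out of a page that contains poison. */
ssize_t dax_recovery_read(struct dax_device *dax_dev, void *kaddr,
			  void *buf, size_t bytes);

/* Clear poison over the range, write the data, update the error list. */
ssize_t dax_recovery_write(struct dax_device *dax_dev, void *kaddr,
			   const void *buf, size_t bytes);

dax_iomap_iter() would then only need something like
dax_recovery_capable() to decide whether the dax_recovery_{read,write}()
slow path is worth attempting.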