From: Dan Williams
Date: Tue, 19 Apr 2022 23:26:31 -0700
Subject: Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write()
To: Jane Chu
Cc: Borislav Petkov, Christoph Hellwig, Dave Hansen, Peter Zijlstra,
    Andy Lutomirski, david, "Darrick J. Wong", linux-fsdevel,
    Linux NVDIMM, Linux Kernel Mailing List, X86 ML, Vishal L Verma,
    Dave Jiang, Alasdair Kergon, Mike Snitzer,
    device-mapper development, "Weiny, Ira", Matthew Wilcox, Vivek Goyal
In-Reply-To: <20220420020435.90326-8-jane.chu@oracle.com>
References: <20220420020435.90326-1-jane.chu@oracle.com>
    <20220420020435.90326-8-jane.chu@oracle.com>
Wong" , linux-fsdevel , Linux NVDIMM , Linux Kernel Mailing List , X86 ML , Vishal L Verma , Dave Jiang , Alasdair Kergon , Mike Snitzer , device-mapper development , "Weiny, Ira" , Matthew Wilcox , Vivek Goyal Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 19, 2022 at 7:06 PM Jane Chu wrote: > > The recovery write thread started out as a normal pwrite thread and > when the filesystem was told about potential media error in the > range, filesystem turns the normal pwrite to a dax_recovery_write. > > The recovery write consists of clearing media poison, clearing page > HWPoison bit, reenable page-wide read-write permission, flush the > caches and finally write. A competing pread thread will be held > off during the recovery process since data read back might not be > valid, and this is achieved by clearing the badblock records after > the recovery write is complete. Competing recovery write threads > are serialized by pmem device level .recovery_lock. > > Signed-off-by: Jane Chu > --- > drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++- > drivers/nvdimm/pmem.h | 1 + > 2 files changed, 56 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index c3772304d417..134f8909eb65 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, > return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn); > } > > +/* > + * The recovery write thread started out as a normal pwrite thread and > + * when the filesystem was told about potential media error in the > + * range, filesystem turns the normal pwrite to a dax_recovery_write. > + * > + * The recovery write consists of clearing media poison, clearing page > + * HWPoison bit, reenable page-wide read-write permission, flush the > + * caches and finally write. A competing pread thread will be held > + * off during the recovery process since data read back might not be > + * valid, and this is achieved by clearing the badblock records after > + * the recovery write is complete. Competing recovery write threads > + * are serialized by pmem device level .recovery_lock. > + */ > static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, > void *addr, size_t bytes, struct iov_iter *i) > { > - return 0; > + struct pmem_device *pmem = dax_get_private(dax_dev); > + size_t olen, len, off; > + phys_addr_t pmem_off; > + struct device *dev = pmem->bb.dev; > + long cleared; > + > + off = offset_in_page(addr); > + len = PFN_PHYS(PFN_UP(off + bytes)); > + if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len)) > + return _copy_from_iter_flushcache(addr, bytes, i); > + > + /* > + * Not page-aligned range cannot be recovered. This should not > + * happen unless something else went wrong. 
> +		return 0;
> +	}
> +
> +	mutex_lock(&pmem->recovery_lock);
> +	pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> +	cleared = __pmem_clear_poison(pmem, pmem_off, len);
> +	if (cleared > 0 && cleared < len) {
> +		dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
> +			cleared, len);

This looks like dev_dbg() to me, or at a minimum the same
dev_warn_ratelimited() print as the one above, to just indicate that a
write to the device at the given offset failed.

> +		mutex_unlock(&pmem->recovery_lock);
> +		return 0;
> +	}
> +	if (cleared < 0) {
> +		dev_warn(dev, "poison clear failed: %ld\n", cleared);

Same feedback here, these should probably all map to the identical
error-exit ratelimited print.

> +		mutex_unlock(&pmem->recovery_lock);

It occurs to me that all callers of this arrive through the fsdax iomap
ops, and all of those callers take an exclusive lock to prevent
simultaneous access to the inode. If recovery_write() is only used
through that path then this lock is redundant. Simultaneous reads are
protected by the fact that the badblocks are cleared last. I think
adding the lock can wait until / if a non-fsdax access path arrives for
dax_recovery_write(), and even then it should probably enforce the
single-writer exclusion guarantee of the fsdax path.

> +		return 0;
> +	}
> +
> +	olen = _copy_from_iter_flushcache(addr, bytes, i);
> +	pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
> +
> +	mutex_unlock(&pmem->recovery_lock);
> +	return olen;
>  }
>
>  static const struct dax_operations pmem_dax_ops = {
> @@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
>  	if (rc)
>  		goto out_cleanup_dax;
>  	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> +	mutex_init(&pmem->recovery_lock);
>  	pmem->dax_dev = dax_dev;
>
>  	rc = device_add_disk(dev, disk, pmem_attribute_groups);
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index 392b0b38acb9..91e40f5e3c0e 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -27,6 +27,7 @@ struct pmem_device {
>  	struct dax_device *dax_dev;
>  	struct gendisk *disk;
>  	struct dev_pagemap pgmap;
> +	struct mutex recovery_lock;
>  };
>
>  long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> --
> 2.18.4
>
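As a sketch of the "identical error exit" feedback above (completely
untested; the mutex is kept as in the patch, and the message text is
only illustrative), the poison-clearing section of pmem_recovery_write()
could collapse the partial-clear and failure cases into one ratelimited
exit:

	mutex_lock(&pmem->recovery_lock);
	pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
	cleared = __pmem_clear_poison(pmem, pmem_off, len);
	if (cleared < 0 || (size_t)cleared < len)
		goto fail;

	olen = _copy_from_iter_flushcache(addr, bytes, i);
	pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
	mutex_unlock(&pmem->recovery_lock);
	return olen;

fail:
	/* partial clear and outright failure share one ratelimited print */
	dev_warn_ratelimited(dev, "failed to recover write, offset: %#llx\n",
			(unsigned long long)PFN_PHYS(pgoff));
	mutex_unlock(&pmem->recovery_lock);
	return 0;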