Received: by 2002:a05:6a10:6d10:0:0:0:0 with SMTP id gq16csp774709pxb; Fri, 22 Apr 2022 10:53:07 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzA33GzeU2YxjJ/OInYSIl8lzzUU0wjVMWd4jjwNHCz1SoqEwue52nLIWVF30HUnxK6mfqc X-Received: by 2002:a17:902:7c13:b0:156:ca91:877f with SMTP id x19-20020a1709027c1300b00156ca91877fmr5702790pll.15.1650649987488; Fri, 22 Apr 2022 10:53:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1650649987; cv=none; d=google.com; s=arc-20160816; b=fF/Dg9B83LKEjs+wiKVVCG5ZuSm8Ayq268wD0b1/XYp0sjLo2taaNMTqwr9tQz9e5T 3/fOcznva8z5GzWtF/0/m+wpAmd5yUhu/+vKtroL59JEETTCJ2Rd6lLe6X4N3NToeNas xe8sch4ytS9DmjkzPD3xNBi1mVRzbCoipv4W02zzFmOYX0w0+eQdYMaNPwbPbIFy2Mfn nvDhub0KywH78WyR6prJKVC7dFrtAS+NSJ3WqjTeSTRggEQJuoaBoOkDvXv7Af5to4yn LF+h0BlOFJgrwLG9RMOLMVsszeaUHDmqewQ+6JlAbWVN8N+1+oTKfY3VMO6JS5mH/mQ8 Gnog== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=S6WRgsV29at1u8BYECoaCAIUEhBO/TRLPFjdvW2DSKY=; b=hq8x4NJ+ca5pEIHiLvCLbP2UJBrZILNQo3USHCNF2d1+DJ6PEdaMUDk1pqwxuXHUPM fGGGUaMvPiWc+rIYPLJ+GBHC6TAqdbVg/qAajKAfLM0fxR5EesFy653JgOPn1eXhjRWJ +a/agdjXwMSkMJEtnWNvD6ItsnU1yVxFTCY154mMawlAlmRixqG147FPdN+A/6/C22Cn REDMsFt+Ul0M0fSdRTqHQY+pl7ovLKSGRUAeQtLgP+/38dpaB6CUKi9r46zbLHLTuQQ4 HeAjei89+tYVy8TJrgkdlLiHRsKQDn+us67/D09m0GfhynbzBc7IVw/ZjqK/apSkH6Ud Wjtg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel-com.20210112.gappssmtp.com header.s=20210112 header.b=evjw5k1x; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id n14-20020a170902e54e00b00153b2d165b2si9094349plf.442.2022.04.22.10.53.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Apr 2022 10:53:07 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@intel-com.20210112.gappssmtp.com header.s=20210112 header.b=evjw5k1x; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email [IPv6:2620:137:e000::1:20]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 562A3101748; Fri, 22 Apr 2022 10:36:43 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242237AbiDTRI7 (ORCPT + 99 others); Wed, 20 Apr 2022 13:08:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55828 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240622AbiDTRI4 (ORCPT ); Wed, 20 Apr 2022 13:08:56 -0400 Received: from mail-pg1-x536.google.com (mail-pg1-x536.google.com [IPv6:2607:f8b0:4864:20::536]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0452D55AB for ; Wed, 20 Apr 2022 10:06:10 -0700 (PDT) Received: by mail-pg1-x536.google.com with SMTP id k29so2139983pgm.12 for ; Wed, 20 Apr 2022 10:06:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=S6WRgsV29at1u8BYECoaCAIUEhBO/TRLPFjdvW2DSKY=; b=evjw5k1x3EAw9uJs5q2tETiXT+pKUtC4XYcufWkJsAHLg9hXUQrAWQMq0wiRFnln32 spAzzDjfjw7BI0gUWKScZ4cr58Qby2B8mefJOwzhW4cd4tmY9kaq+e+ALMjVY9ZkE0cl nE4L215/WYnmDJHpTZGmwXNncbEyunYyHKwJC8+2r92NR+QL+/lpwXByvNQEIhsWG+N+ A3iGQ442+417QNeCiqJ5SynC60yv0aCTOyOw7uduV9Idg+pfymRL64fPWSVzJaOL1cE1 bF8zUDoVWgAnHqSIC1/loS/BL9Ti6C5qCBN8M/DzOmEoJtCeFp6YQmfXZASB+uFLc/n8 WcYg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=S6WRgsV29at1u8BYECoaCAIUEhBO/TRLPFjdvW2DSKY=; b=CQmFUSiLgNuB891u6HfDMbzj/OnG+XEoOr+LZMLO//hqO0hmxCP9CNFZZA3D1/l2+I n0bw2ubfH7sKcVaQaEF8NdL2evDRLjYDWM1aqAe7B9dOM+/U7MnxjBtktWsa8s94qSM7 3oXt4GDZO/AU0tHUZgan63GBwne4H1fSuX3cTXBpMI4eTyM9W/e/cyS0ohiUjG0tUxpg 6A8lhr7BU41Q+MrePfPPPDuK6MUNu/wZNPU/llqyihG6GO/jQv956VyOWw6liC0SX7PE ewXm7Ib3Noqldhp31TbwGXIKoEubFyxy0i+MsVPWvVkw0nVxDyoF0w5rb34fjyS4KNwc eKVg== X-Gm-Message-State: AOAM530EVyaxpE5rzxXQhLlwvQM/WL6USI27Gjtvf9ILJ5rBomTAx722 odisdGhgjNspKveMsIm1p7k0C6V80HWLRxPJUIkvBw== X-Received: by 2002:a05:6a00:2992:b0:505:cf4b:baef with SMTP id cj18-20020a056a00299200b00505cf4bbaefmr24081084pfb.61.1650474369443; Wed, 20 Apr 2022 10:06:09 -0700 (PDT) MIME-Version: 1.0 References: <20220420020435.90326-1-jane.chu@oracle.com> <20220420020435.90326-8-jane.chu@oracle.com> In-Reply-To: From: Dan Williams Date: Wed, 20 Apr 2022 10:05:58 -0700 Message-ID: Subject: Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write() To: Jane Chu Cc: Borislav Petkov , Christoph Hellwig , Dave Hansen , Peter Zijlstra , Andy Lutomirski , david , "Darrick J. Wong" , linux-fsdevel , Linux NVDIMM , Linux Kernel Mailing List , X86 ML , Vishal L Verma , Dave Jiang , Alasdair Kergon , Mike Snitzer , device-mapper development , "Weiny, Ira" , Matthew Wilcox , Vivek Goyal Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RDNS_NONE, SPF_HELO_NONE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 20, 2022 at 10:02 AM Jane Chu wrote: > > On 4/19/2022 11:26 PM, Dan Williams wrote: > > On Tue, Apr 19, 2022 at 7:06 PM Jane Chu wrote: > >> > >> The recovery write thread started out as a normal pwrite thread and > >> when the filesystem was told about potential media error in the > >> range, filesystem turns the normal pwrite to a dax_recovery_write. > >> > >> The recovery write consists of clearing media poison, clearing page > >> HWPoison bit, reenable page-wide read-write permission, flush the > >> caches and finally write. A competing pread thread will be held > >> off during the recovery process since data read back might not be > >> valid, and this is achieved by clearing the badblock records after > >> the recovery write is complete. Competing recovery write threads > >> are serialized by pmem device level .recovery_lock. > >> > >> Signed-off-by: Jane Chu > >> --- > >> drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++- > >> drivers/nvdimm/pmem.h | 1 + > >> 2 files changed, 56 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > >> index c3772304d417..134f8909eb65 100644 > >> --- a/drivers/nvdimm/pmem.c > >> +++ b/drivers/nvdimm/pmem.c > >> @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, > >> return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn); > >> } > >> > >> +/* > >> + * The recovery write thread started out as a normal pwrite thread and > >> + * when the filesystem was told about potential media error in the > >> + * range, filesystem turns the normal pwrite to a dax_recovery_write. > >> + * > >> + * The recovery write consists of clearing media poison, clearing page > >> + * HWPoison bit, reenable page-wide read-write permission, flush the > >> + * caches and finally write. A competing pread thread will be held > >> + * off during the recovery process since data read back might not be > >> + * valid, and this is achieved by clearing the badblock records after > >> + * the recovery write is complete. Competing recovery write threads > >> + * are serialized by pmem device level .recovery_lock. > >> + */ > >> static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, > >> void *addr, size_t bytes, struct iov_iter *i) > >> { > >> - return 0; > >> + struct pmem_device *pmem = dax_get_private(dax_dev); > >> + size_t olen, len, off; > >> + phys_addr_t pmem_off; > >> + struct device *dev = pmem->bb.dev; > >> + long cleared; > >> + > >> + off = offset_in_page(addr); > >> + len = PFN_PHYS(PFN_UP(off + bytes)); > >> + if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len)) > >> + return _copy_from_iter_flushcache(addr, bytes, i); > >> + > >> + /* > >> + * Not page-aligned range cannot be recovered. This should not > >> + * happen unless something else went wrong. > >> + */ > >> + if (off || !PAGE_ALIGNED(bytes)) { > >> + dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n", > >> + addr, bytes); > > > > If this warn stays: > > > > s/dev_warn/dev_warn_ratelimited/ > > > > The kernel prints hashed addresses for %p, so I'm not sure printing > > @addr is useful or @bytes because there is nothing actionable that can > > be done with that information in the log. @pgoff is probably the only > > variable worth printing (after converting to bytes or sectors) as that > > might be able to be reverse mapped back to the impacted data. > > The intention with printing @addr and @bytes is to show the > mis-alignment. In the past when UC- was set on poisoned dax page, > returning less than a page being written would cause dax_iomap_iter to > produce next iteration with @addr and @bytes not-page-aligned. Although > UC- doesn't apply here, I thought it might still be worth while to watch > for similar scenario. Also that's why @pgoff isn't helpful. > > How about s/dev_warn/dev_dbg/ ? > > > > >> + return 0; > >> + } > >> + > >> + mutex_lock(&pmem->recovery_lock); > >> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset; > >> + cleared = __pmem_clear_poison(pmem, pmem_off, len); > >> + if (cleared > 0 && cleared < len) { > >> + dev_warn(dev, "poison cleared only %ld out of %lu bytes\n", > >> + cleared, len); > > > > This looks like dev_dbg() to me, or at minimum the same > > dev_warn_ratelimited() print as the one above to just indicate a write > > to the device at the given offset failed. > > Will s/dev_warn/dev_dbg/ > > > > >> + mutex_unlock(&pmem->recovery_lock); > >> + return 0; > >> + } > >> + if (cleared < 0) { > >> + dev_warn(dev, "poison clear failed: %ld\n", cleared); > > > > Same feedback here, these should probably all map to the identical > > error exit ratelimited print. > > Will s/dev_warn/dev_dbg/ > > > > >> + mutex_unlock(&pmem->recovery_lock); > > > > It occurs to me that all callers of this are arriving through the > > fsdax iomap ops and all of these callers take an exclusive lock to > > prevent simultaneous access to the inode. If recovery_write() is only > > used through that path then this lock is redundant. Simultaneous reads > > are protected by the fact that the badblocks are cleared last. I think > > this can wait to add the lock when / if a non-fsdax access path > > arrives for dax_recovery_write(), and even then it should probably > > enforce the single writer exclusion guarantee of the fsdax path. > > > > Indeed, the caller dax_iomap_rw has already held the writer lock. > > Will remove .recovery_lock for now. > > BTW, how are the other patches look to you? First pass looked good, so I'll do one more lookover, but this is coming together nicely. Thanks for all the effort on this!