Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp4358587pxb; Sat, 5 Feb 2022 10:57:26 -0800 (PST) X-Google-Smtp-Source: ABdhPJypqrp9+bdszIromyAPavusj3mZ+w9pvFwEtWUYChbPMdIOQD3M0lZfs8j2s9CDD7nliXQC X-Received: by 2002:a17:902:edd5:: with SMTP id q21mr9311224plk.93.1644087446313; Sat, 05 Feb 2022 10:57:26 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1644087446; cv=none; d=google.com; s=arc-20160816; b=tqY2LObLhtk5rIZcs4z2lzJCL/X6V7m8xTP61kfEvEDZzEdePoAeh5lBtKvoojV9RL OChu5umTeco7S3LuPBX9tN2CCk7aYm3rufvJfAa27zh2fZKBNMBVQzaBhg+wOm7dY5AL dB+8M/pqE/rEryhpVYuK/dJSeMSv7ZlcOiueTOr3emBZ8yKJTvThJy5Q1ooSNPNaxUT5 ntU9lzd5m4aEPHOczKQ/WGNu7nZwMXCjEKVi7JN9rLXD0HbY8RA43CHlbBnK/t6IpE/h g2OA3GO6n3/5/paJqpj3TAeVpmI1lydfsmkrUPQTk+572kvuBNFlQUORD1Ni1Jjm9fnb yw4Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=4pCg5LRiqmbI3TAMDbxbNyx17A118iw1ZgmwbQZNhZc=; b=rgZDzYja/7m8J2HSHY4CiTtF/2w74SC42+LVzNZm/JC98RkSUVYsIiGZ+fyxVfU6lQ s2URA57wOiUfgvwtuj0SIHSKUxHhkOfsVtlS8/A1OURs42uK+994kLW9h7g/xcLIDDdC FkOfZHlvZ9XIF9FAIAC3bXmyh+aXxujfcUwYjl5W7Zxyxglk7J1t+86MA2yEQUaupoXI M1CJ52tnECY0wNmruzXvQEYwHt8bYX8OOv6zIHlsufRXliEOcBruQu7fasRfkt4hWJmH gWh73My/jwAgyIr21+662WcRnGXbzfBCzYVAYL8D1vhSLmg1gBhG+L3s0SoG7rsUW/HF oBpA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel-com.20210112.gappssmtp.com header.s=20210112 header.b=3V3KExzn; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f13si6588937pgm.0.2022.02.05.10.57.14; Sat, 05 Feb 2022 10:57:26 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel-com.20210112.gappssmtp.com header.s=20210112 header.b=3V3KExzn; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242626AbiBDGVf (ORCPT + 99 others); Fri, 4 Feb 2022 01:21:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48556 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242587AbiBDGVa (ORCPT ); Fri, 4 Feb 2022 01:21:30 -0500 Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E4BBFC061714 for ; Thu, 3 Feb 2022 22:21:29 -0800 (PST) Received: by mail-pj1-x102e.google.com with SMTP id y5-20020a17090aca8500b001b8127e3d3aso5249695pjt.3 for ; Thu, 03 Feb 2022 22:21:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=4pCg5LRiqmbI3TAMDbxbNyx17A118iw1ZgmwbQZNhZc=; b=3V3KExznjp8wESoZXltleisSjwaUOua+bzzCIMq/fH/eo3G/62yVKSbNMwWWHQ156X +ZhOVGRPeRzxJPDdiVJ8BmEWgh51BDK2dVYjq2Rh/3WelKHpJtScg+IkPbKfTun++/4Y aXCY78xBIVe4k15srEh1t8FyOZ396mIU707Zcru78W3FGmItRp11+pgzL3GxHlTghRU6 YOdEqPV0uFkx0h5+BpLgQNPEjjSNoFxb4xijJHK5iL1sdLiUrWAESrkp3RumVt+1HDn+ qLWQbOB/sSLN0GIfOd/65rVWK1cqr8+H76F54LT1YbOtW0ZS92GKJOhr1IB8Ic/qcfSh SnTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=4pCg5LRiqmbI3TAMDbxbNyx17A118iw1ZgmwbQZNhZc=; b=KcNhkBkFq3LX5A/CX6o1OjJJ5APq5QKpuO1FCL+RvPRHLjw47/BybbPqtHsp84Fh8f 7gLdRha0q9UU+U0MjblRxLCiAa6Red0AVF+V1Fg/LUqIY0sGfZTHCcwZ6DGyPdacRwHW MNPship70ZYMwIDN3YR3PBDoEPKQdMuiksbwg5RSDjoU+XOI7q2pOX/m198v8jN1etaO o5IGe1yDQgi9Y9vasgpK9Z/sgY7+Y9fl4Y/uk30z5NeGqO/UgN23eoTSP+xPnOcj3pkY RAF8og8DA+e9Ngds2rsEMQrfcGugq443jOqcUT0hP0DciYbJ02KlKoljmDR9EhiREpie Ej5A== X-Gm-Message-State: AOAM533yxFObI+hAPzt4zvZ7UbyrVeBiyOp3XKjWiJ858l5noMzqIvzL uNwmCgVmR0Dg3w9Sgb+9TXMgQR3pWA5freLEvFj5tA== X-Received: by 2002:a17:902:bcca:: with SMTP id o10mr1673853pls.147.1643955689386; Thu, 03 Feb 2022 22:21:29 -0800 (PST) MIME-Version: 1.0 References: <20220128213150.1333552-1-jane.chu@oracle.com> <20220128213150.1333552-6-jane.chu@oracle.com> In-Reply-To: <20220128213150.1333552-6-jane.chu@oracle.com> From: Dan Williams Date: Thu, 3 Feb 2022 22:21:17 -0800 Message-ID: Subject: Re: [PATCH v5 5/7] pmem: add pmem_recovery_write() dax op To: Jane Chu Cc: david , "Darrick J. Wong" , Christoph Hellwig , Vishal L Verma , Dave Jiang , Alasdair Kergon , Mike Snitzer , device-mapper development , "Weiny, Ira" , Matthew Wilcox , Vivek Goyal , linux-fsdevel , Linux NVDIMM , Linux Kernel Mailing List , linux-xfs Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 28, 2022 at 1:32 PM Jane Chu wrote: > > pmem_recovery_write() consists of clearing poison via DSM, > clearing page HWPoison bit, re-state _PAGE_PRESENT bit, > cacheflush, write, and finally clearing bad-block record. > > A competing pread thread is held off during recovery write > by the presence of bad-block record. A competing recovery_write > thread is serialized by a lock. > > Signed-off-by: Jane Chu > --- > drivers/nvdimm/pmem.c | 82 +++++++++++++++++++++++++++++++++++++++---- > drivers/nvdimm/pmem.h | 1 + > 2 files changed, 77 insertions(+), 6 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 638e64681db9..f2d6b34d48de 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -69,6 +69,14 @@ static void hwpoison_clear(struct pmem_device *pmem, > } > } > > +static void pmem_clear_badblocks(struct pmem_device *pmem, sector_t sector, > + long cleared_blks) > +{ > + badblocks_clear(&pmem->bb, sector, cleared_blks); > + if (pmem->bb_state) > + sysfs_notify_dirent(pmem->bb_state); > +} > + > static blk_status_t pmem_clear_poison(struct pmem_device *pmem, > phys_addr_t offset, unsigned int len) > { > @@ -88,9 +96,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem, > dev_dbg(dev, "%#llx clear %ld sector%s\n", > (unsigned long long) sector, cleared, > cleared > 1 ? "s" : ""); > - badblocks_clear(&pmem->bb, sector, cleared); > - if (pmem->bb_state) > - sysfs_notify_dirent(pmem->bb_state); > + pmem_clear_badblocks(pmem, sector, cleared); > } > > arch_invalidate_pmem(pmem->virt_addr + offset, len); > @@ -257,10 +263,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector, > __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, > long nr_pages, void **kaddr, pfn_t *pfn) > { > + bool bad_pmem; > + bool do_recovery = false; > resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; > > - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, > - PFN_PHYS(nr_pages)))) > + bad_pmem = is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, > + PFN_PHYS(nr_pages)); > + if (bad_pmem && kaddr) > + do_recovery = dax_recovery_started(pmem->dax_dev, kaddr); > + if (bad_pmem && !do_recovery) > return -EIO; > > if (kaddr) > @@ -301,10 +312,68 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, > return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn); > } > > +/* > + * The recovery write thread started out as a normal pwrite thread and > + * when the filesystem was told about potential media error in the > + * range, filesystem turns the normal pwrite to a dax_recovery_write. > + * > + * The recovery write consists of clearing poison via DSM, clearing page > + * HWPoison bit, reenable page-wide read-write permission, flush the > + * caches and finally write. A competing pread thread needs to be held > + * off during the recovery process since data read back might not be valid. > + * And that's achieved by placing the badblock records clearing after > + * the completion of the recovery write. > + * > + * Any competing recovery write thread needs to be serialized, and this is > + * done via pmem device level lock .recovery_lock. > + */ > static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, > void *addr, size_t bytes, struct iov_iter *i) > { > - return 0; > + size_t rc, len, off; > + phys_addr_t pmem_off; > + struct pmem_device *pmem = dax_get_private(dax_dev); > + struct device *dev = pmem->bb.dev; > + sector_t sector; > + long cleared, cleared_blk; > + > + mutex_lock(&pmem->recovery_lock); > + > + /* If no poison found in the range, go ahead with write */ > + off = (unsigned long)addr & ~PAGE_MASK; > + len = PFN_PHYS(PFN_UP(off + bytes)); > + if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) { > + rc = _copy_from_iter_flushcache(addr, bytes, i); > + goto write_done; > + } is_bad_pmem() takes a seqlock so it should be safe to move the recovery_lock below this point. > + > + /* Not page-aligned range cannot be recovered */ > + if (off || !(PAGE_ALIGNED(bytes))) { > + dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n", > + addr, bytes); fs/dax.c will prevent this from happening, right? It seems like an upper layer bug if we get this far into the recovery process without checking if a full page is being overwritten. > + rc = (size_t) -EIO; > + goto write_done; > + } > + > + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset; > + sector = (pmem_off - pmem->data_offset) / 512; > + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + pmem_off, len); > + cleared_blk = cleared / 512; > + if (cleared_blk > 0) { > + hwpoison_clear(pmem, pmem->phys_addr + pmem_off, cleared); > + } else { > + dev_warn(dev, "pmem_recovery_write: cleared_blk: %ld\n", > + cleared_blk); > + rc = (size_t) -EIO; > + goto write_done; > + } > + arch_invalidate_pmem(pmem->virt_addr + pmem_off, bytes); > + rc = _copy_from_iter_flushcache(addr, bytes, i); > + pmem_clear_badblocks(pmem, sector, cleared_blk); This duplicates pmem_clear_poison() can more code be shared between the 2 functions? > + > +write_done: > + mutex_unlock(&pmem->recovery_lock); > + return rc; > } > > static const struct dax_operations pmem_dax_ops = { > @@ -495,6 +564,7 @@ static int pmem_attach_disk(struct device *dev, > goto out_cleanup_dax; > dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > set_dax_recovery(dax_dev); > + mutex_init(&pmem->recovery_lock); > pmem->dax_dev = dax_dev; > > rc = device_add_disk(dev, disk, pmem_attribute_groups); > diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h > index 59cfe13ea8a8..971bff7552d6 100644 > --- a/drivers/nvdimm/pmem.h > +++ b/drivers/nvdimm/pmem.h > @@ -24,6 +24,7 @@ struct pmem_device { > struct dax_device *dax_dev; > struct gendisk *disk; > struct dev_pagemap pgmap; > + struct mutex recovery_lock; > }; > > long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, > -- > 2.18.4 >