2022-04-22 23:29:48

by Jane Chu

Subject: [PATCH v9 0/7] DAX poison recovery

In this series, the dax recovery code path is independent of the
normal write path. Competing dax recovery threads are serialized,
and racing read threads are guaranteed not to overlap with the
recovery process.

In this phase, the recovery granularity is a page; future patches
will explore recovery at finer granularity.

Changelog:
v8 -> v9:
- collect R-Bs from Christoph and Dan
- move the actual DAX_RECOVERY_WRITE part of the work out of the
dax access mode plumbing patch and into the next patch that
introduces .recovery_write

v7 -> v8:
- add a patch to teach the nfit driver to rely on mce->misc for poison
granularity, suggested by Dan
- add set_memory_present() helper to be invoked by set_mce_nospec() for
better readability, suggested by Dan
- folded in a trivial fix to comments in patch 2/NN, suggested by Boris,
and addressed more commit message comments from Boris
- moved .recovery_write to dax_operations as dev_pagemap_ops is meant for
device-agnostic operations, suggested by Christoph
- replace DAX_RECOVERY flag with enum dax_access_mode suggested by Dan
- re-organized __pmem_direct_access as provided by Christoph
- split [PATCH v7 4/6] into two patches: one introduces
DAX_RECOVERY_WRITE, and the other introduces .recovery_write operation

v6 -> v7:
. incorporated comments from Christoph, and picked up a reviewed-by
. add [email protected] per Boris
. discovered pmem firmware doesn't reliably handle a request to clear
poison over a large range (such as 2M), hence worked around the
issue by limiting the size of the requested range to the kernel
page size.

v5->v6:
. per Christoph, move set{clear}_mce_nospec() inline functions out
of include/linux/set_memory.h and into arch/x86/mm/pat/set_memory.c
file, so that there is no need to export _set_memory_present().
. per Christoph, ratelimit warning message in pmem_do_write()
. per both Christoph and Dan, switch back to adding a flag to
dax_direct_access() instead of embedding the flag in kvaddr
. suggestions from Christoph for improving code structure and
readability
. per Dan, add .recovery_write to dev_pagemap.ops instead of adding
it to dax_operations, such that the DM layer doesn't need to be
involved explicitly in dax recovery write
. per Dan, is_bad_pmem() takes a seqlock, so no need to place it
under recovery_lock.
Many thanks to both reviewers!

v4->v5:
Fixed build errors reported by kernel test robot

v3->v4:
Rebased to v5.17-rc1-81-g0280e3c58f92

References:
v4 https://lore.kernel.org/lkml/[email protected]/T/
v3 https://lkml.org/lkml/2022/1/11/900
v2 https://lore.kernel.org/all/[email protected]/
Discussions about marking a poisoned page as 'np'
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Jane Chu (7):
acpi/nfit: rely on mce->misc to determine poison granularity
x86/mce: relocate set{clear}_mce_nospec() functions
mce: fix set_mce_nospec to always unmap the whole page
dax: introduce DAX_RECOVERY_WRITE dax access mode
dax: add .recovery_write dax_operation
pmem: refactor pmem_clear_poison()
pmem: implement pmem_recovery_write()

arch/x86/include/asm/set_memory.h | 52 --------
arch/x86/kernel/cpu/mce/core.c | 6 +-
arch/x86/mm/pat/set_memory.c | 48 ++++++-
drivers/acpi/nfit/mce.c | 4 +-
drivers/dax/super.c | 14 ++-
drivers/md/dm-linear.c | 15 ++-
drivers/md/dm-log-writes.c | 15 ++-
drivers/md/dm-stripe.c | 15 ++-
drivers/md/dm-target.c | 4 +-
drivers/md/dm-writecache.c | 7 +-
drivers/md/dm.c | 25 +++-
drivers/nvdimm/pmem.c | 203 +++++++++++++++++++++---------
drivers/nvdimm/pmem.h | 5 +-
drivers/s390/block/dcssblk.c | 9 +-
fs/dax.c | 22 +++-
fs/fuse/dax.c | 4 +-
include/linux/dax.h | 22 +++-
include/linux/device-mapper.h | 13 +-
include/linux/set_memory.h | 10 +-
tools/testing/nvdimm/pmem-dax.c | 3 +-
20 files changed, 346 insertions(+), 150 deletions(-)


base-commit: 028192fea1de083f4f12bfb1eb7c4d7beb5c8ecd
--
2.18.4


2022-04-22 23:30:36

by Jane Chu

Subject: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out: when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest",
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."

Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let the driver handle it when
it comes to repair.

Please refer to the discussion here for more details:
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now that the poisoned page is marked as not-present, also fix
pmem_do_write() in order to avoid writing to a not-present page and
triggering a kernel Oops.

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 6 +++---
arch/x86/mm/pat/set_memory.c | 23 +++++++++++------------
drivers/nvdimm/pmem.c | 30 +++++++-----------------------
include/linux/set_memory.h | 4 ++--
4 files changed, 23 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 981496e6bc0e..fa67bb9d1afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,

pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn, whole_page(mce));
+ set_mce_nospec(pfn);
mce->kflags |= MCE_HANDLED_UC;
}

@@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)

ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
if (!ret) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
sync_core();
return;
}
@@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
p->mce_count = 0;
pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
}

static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 978cf5bd2ab6..e3a5e55f3e08 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages)
}
EXPORT_SYMBOL(set_memory_wb);

-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-int set_mce_nospec(unsigned long pfn, bool unmap)
+/* Prevent speculative access to a page by marking it not-present */
+int set_mce_nospec(unsigned long pfn)
{
unsigned long decoy_addr;
int rc;
@@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

- if (unmap)
- rc = set_memory_np(decoy_addr, 1);
- else
- rc = set_memory_uc(decoy_addr, 1);
+ rc = set_memory_np(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
}

+static int set_memory_present(unsigned long *addr, int numpages)
+{
+ return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
- return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+ unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
+
+ return set_memory_present(&addr, 1);
}
EXPORT_SYMBOL_GPL(clear_mce_nospec);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..4aa17132a557 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
{
- blk_status_t rc = BLK_STS_OK;
- bool bad_pmem = false;
phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
void *pmem_addr = pmem->virt_addr + pmem_off;

- if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
- bad_pmem = true;
+ if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+ blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
+
+ if (rc != BLK_STS_OK)
+ return rc;
+ }

- /*
- * Note that we write the data both before and after
- * clearing poison. The write before clear poison
- * handles situations where the latest written data is
- * preserved and the clear poison operation simply marks
- * the address range as valid without changing the data.
- * In this case application software can assume that an
- * interrupted write will either return the new good
- * data or an error.
- *
- * However, if pmem_clear_poison() leaves the data in an
- * indeterminate state we need to perform the write
- * after clear poison.
- */
flush_dcache_page(page);
write_pmem(pmem_addr, page, page_off, len);
- if (unlikely(bad_pmem)) {
- rc = pmem_clear_poison(pmem, pmem_off, len);
- write_pmem(pmem_addr, page, page_off, len);
- }

- return rc;
+ return BLK_STS_OK;
}

static void pmem_submit_bio(struct bio *bio)
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 683a6c3f7179..369769ce7399 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */

#ifdef CONFIG_X86_64
-int set_mce_nospec(unsigned long pfn, bool unmap);
+int set_mce_nospec(unsigned long pfn);
int clear_mce_nospec(unsigned long pfn);
#else
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
+static inline int set_mce_nospec(unsigned long pfn)
{
return 0;
}
--
2.18.4

2022-04-22 23:45:10

by Jane Chu

Subject: [PATCH v9 7/7] pmem: implement pmem_recovery_write()

The recovery write thread starts out as a normal pwrite thread;
when the filesystem is told about a potential media error in the
range, it turns the normal pwrite into a dax_recovery_write.

The recovery write consists of clearing the media poison, clearing
the page's HWPoison bit, re-enabling page-wide read-write permission,
flushing the caches, and finally writing the data. A competing pread
thread is held off during the recovery process since data read back
might not be valid; this is achieved by clearing the badblock records
only after the recovery write is complete. Competing recovery write
threads are already serialized by the writer lock held by dax_iomap_rw().

Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 87 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0961625dfa05..91ce7b2b2ada 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -266,21 +266,43 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
pfn_t *pfn)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
-
- if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
- PFN_PHYS(nr_pages))))
- return -EIO;
+ sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+ unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
+ struct badblocks *bb = &pmem->bb;
+ sector_t first_bad;
+ int num_bad;

if (kaddr)
*kaddr = pmem->virt_addr + offset;
if (pfn)
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

+ if (bb->count &&
+ badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
+ long actual_nr;
+
+ if (mode != DAX_RECOVERY_WRITE)
+ return -EIO;
+
+ /*
+ * The recovery stride is set to the kernel page size because
+ * the underlying driver and firmware clear poison functions
+ * don't appear to handle large chunks (such as 2MiB) reliably.
+ */
+ actual_nr = PHYS_PFN(
+ PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
+ dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
+ sector, nr_pages, first_bad, actual_nr);
+ if (actual_nr)
+ return actual_nr;
+ return 1;
+ }
+
/*
- * If badblocks are present, limit known good range to the
- * requested range.
+ * If badblocks are present but not in the range, limit known good range
+ * to the requested range.
*/
- if (unlikely(pmem->bb.count))
+ if (bb->count)
return nr_pages;
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
}
@@ -310,10 +332,59 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
}

+/*
+ * The recovery write thread started out as a normal pwrite thread and
+ * when the filesystem was told about potential media error in the
+ * range, filesystem turns the normal pwrite to a dax_recovery_write.
+ *
+ * The recovery write consists of clearing media poison, clearing page
+ * HWPoison bit, reenable page-wide read-write permission, flush the
+ * caches and finally write. A competing pread thread will be held
+ * off during the recovery process since data read back might not be
+ * valid, and this is achieved by clearing the badblock records after
+ * the recovery write is complete. Competing recovery write threads
+ * are already serialized by writer lock held by dax_iomap_rw().
+ */
static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
{
- return 0;
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ size_t olen, len, off;
+ phys_addr_t pmem_off;
+ struct device *dev = pmem->bb.dev;
+ long cleared;
+
+ off = offset_in_page(addr);
+ len = PFN_PHYS(PFN_UP(off + bytes));
+ if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
+ return _copy_from_iter_flushcache(addr, bytes, i);
+
+ /*
+ * Not page-aligned range cannot be recovered. This should not
+ * happen unless something else went wrong.
+ */
+ if (off || !PAGE_ALIGNED(bytes)) {
+ dev_dbg(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
+ addr, bytes);
+ return 0;
+ }
+
+ pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+ cleared = __pmem_clear_poison(pmem, pmem_off, len);
+ if (cleared > 0 && cleared < len) {
+ dev_dbg(dev, "poison cleared only %ld out of %lu bytes\n",
+ cleared, len);
+ return 0;
+ }
+ if (cleared < 0) {
+ dev_dbg(dev, "poison clear failed: %ld\n", cleared);
+ return 0;
+ }
+
+ olen = _copy_from_iter_flushcache(addr, bytes, i);
+ pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
+
+ return olen;
}

static const struct dax_operations pmem_dax_ops = {
--
2.18.4

2022-04-22 23:45:19

by Jane Chu

Subject: [PATCH v9 1/7] acpi/nfit: rely on mce->misc to determine poison granularity

nfit_handle_mce() hardcodes the poison granularity at L1_CACHE_BYTES.
Instead, let the driver rely on the mce->misc register to determine
the poison granularity.
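
As a quick, hypothetical illustration of the arithmetic (not part of this
patch): the sketch below assumes MCI_MISC_ADDR_LSB() is the low 6 bits of
MCi_MISC, as defined in arch/x86/include/asm/mce.h, and shows how the
granularity and the badrange start fall out of mce->misc and mce->addr.

#include <stdio.h>

/* Assumed to mirror the kernel's MCI_MISC_ADDR_LSB(): the low 6 bits of
 * MCi_MISC encode the least significant valid bit of the recorded address.
 */
#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	/* Hypothetical values: MCi_MISC reporting page (2^12 = 4KiB)
	 * granularity for a poisoned address.
	 */
	unsigned long long addr  = 0x1234567890ULL;
	unsigned long long misc  = 12;	/* only the LSB field, for illustration */
	unsigned long long align = 1ULL << MCI_MISC_ADDR_LSB(misc);

	printf("granularity %llu bytes, badrange start %#llx\n",
	       align, ALIGN_DOWN(addr, align));	/* 4096, 0x1234567000 */
	return 0;
}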

Suggested-by: Dan Williams <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
---
drivers/acpi/nfit/mce.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..d48a388b796e 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
*/
mutex_lock(&acpi_desc_lock);
list_for_each_entry(acpi_desc, &acpi_descs, list) {
+ unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
struct device *dev = acpi_desc->dev;
int found_match = 0;

@@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,

/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
- ALIGN(mce->addr, L1_CACHE_BYTES),
- L1_CACHE_BYTES);
+ ALIGN_DOWN(mce->addr, align), align);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);

--
2.18.4

2022-04-22 23:47:28

by Jane Chu

Subject: [PATCH v9 6/7] pmem: refactor pmem_clear_poison()

Refactor the pmem_clear_poison() function such that the code shared
between the typical write path and the recovery write path is
factored out.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 73 ++++++++++++++++++++++++++++---------------
1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e5e288135af7..0961625dfa05 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,9 +45,25 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
}

-static void hwpoison_clear(struct pmem_device *pmem,
- phys_addr_t phys, unsigned int len)
+static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
{
+ return pmem->phys_addr + offset;
+}
+
+static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
+{
+ return (offset - pmem->data_offset) >> SECTOR_SHIFT;
+}
+
+static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
+{
+ return (sector << SECTOR_SHIFT) + pmem->data_offset;
+}
+
+static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset,
+ unsigned int len)
+{
+ phys_addr_t phys = to_phys(pmem, offset);
unsigned long pfn_start, pfn_end, pfn;

/* only pmem in the linear map supports HWPoison */
@@ -69,33 +85,40 @@ static void hwpoison_clear(struct pmem_device *pmem,
}
}

-static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
- phys_addr_t offset, unsigned int len)
+static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
{
- struct device *dev = to_dev(pmem);
- sector_t sector;
- long cleared;
- blk_status_t rc = BLK_STS_OK;
+ if (blks == 0)
+ return;
+ badblocks_clear(&pmem->bb, sector, blks);
+ if (pmem->bb_state)
+ sysfs_notify_dirent(pmem->bb_state);
+}

- sector = (offset - pmem->data_offset) / 512;
+static long __pmem_clear_poison(struct pmem_device *pmem,
+ phys_addr_t offset, unsigned int len)
+{
+ phys_addr_t phys = to_phys(pmem, offset);
+ long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);

- cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
- if (cleared < len)
- rc = BLK_STS_IOERR;
- if (cleared > 0 && cleared / 512) {
- hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
- cleared /= 512;
- dev_dbg(dev, "%#llx clear %ld sector%s\n",
- (unsigned long long) sector, cleared,
- cleared > 1 ? "s" : "");
- badblocks_clear(&pmem->bb, sector, cleared);
- if (pmem->bb_state)
- sysfs_notify_dirent(pmem->bb_state);
+ if (cleared > 0) {
+ pmem_mkpage_present(pmem, offset, cleared);
+ arch_invalidate_pmem(pmem->virt_addr + offset, len);
}
+ return cleared;
+}

- arch_invalidate_pmem(pmem->virt_addr + offset, len);
+static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
+ phys_addr_t offset, unsigned int len)
+{
+ long cleared = __pmem_clear_poison(pmem, offset, len);

- return rc;
+ if (cleared < 0)
+ return BLK_STS_IOERR;
+
+ pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
+ if (cleared < len)
+ return BLK_STS_IOERR;
+ return BLK_STS_OK;
}

static void write_pmem(void *pmem_addr, struct page *page,
@@ -143,7 +166,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
sector_t sector, unsigned int len)
{
blk_status_t rc;
- phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+ phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;

if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
@@ -158,7 +181,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
{
- phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+ phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;

if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
--
2.18.4

2022-04-22 23:59:18

by Jane Chu

Subject: [PATCH v9 5/7] dax: add .recovery_write dax_operation

Introduce the dax_recovery_write() operation. The function is used to
recover a dax range that contains poison. The typical use case is when
a user process receives a SIGBUS with si_code BUS_MCEERR_AR
indicating poison(s) in a dax range; in response, the user process
issues a pwrite() to the page-aligned dax range, thus clearing the
poison and putting valid data in the range.
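
To make the use case concrete, below is a minimal, hypothetical user-space
sketch (not part of this series). It maps a file on a DAX filesystem, and if
a load hits poison (SIGBUS with si_code BUS_MCEERR_AR) it bails out of the
access via siglongjmp() and overwrites the affected page with a page-aligned
pwrite(), which this series turns into a dax_recovery_write(). The path
/mnt/pmem/data is made up, and jumping out of the signal handler is done
only to keep the example short.

#define _GNU_SOURCE
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf poison_env;
static void *volatile poison_addr;

/* SIGBUS with BUS_MCEERR_AR means the access consumed hardware poison. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AR) {
		poison_addr = si->si_addr;
		siglongjmp(poison_env, 1);
	}
	_exit(EXIT_FAILURE);	/* some other SIGBUS cause */
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };
	long pagesz = sysconf(_SC_PAGESIZE);
	int fd = open("/mnt/pmem/data", O_RDWR);	/* hypothetical DAX file */
	char *map, *buf;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	sigaction(SIGBUS, &sa, NULL);

	map = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	buf = aligned_alloc(pagesz, pagesz);
	memset(buf, 0, pagesz);	/* known-good replacement data */

	if (sigsetjmp(poison_env, 1) == 0) {
		/* Faults with BUS_MCEERR_AR if the backing page is poisoned. */
		printf("first byte: %d\n", map[0]);
	} else {
		/*
		 * Recovery: overwrite the whole poisoned page through the
		 * file with a page-aligned pwrite(); the filesystem turns
		 * this into a dax_recovery_write() that clears the poison.
		 */
		off_t off = ((char *)poison_addr - map) & ~(pagesz - 1);

		if (pwrite(fd, buf, pagesz, off) != pagesz)
			perror("pwrite");
	}
	return 0;
}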

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
---
drivers/dax/super.c | 9 +++++++++
drivers/md/dm-linear.c | 10 ++++++++++
drivers/md/dm-log-writes.c | 10 ++++++++++
drivers/md/dm-stripe.c | 10 ++++++++++
drivers/md/dm.c | 20 ++++++++++++++++++++
drivers/nvdimm/pmem.c | 7 +++++++
fs/dax.c | 13 ++++++++++++-
include/linux/dax.h | 13 +++++++++++++
include/linux/device-mapper.h | 9 +++++++++
9 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5405eb553430..50a08b2ec247 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -195,6 +195,15 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
}
EXPORT_SYMBOL_GPL(dax_zero_page_range);

+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *iter)
+{
+ if (!dax_dev->ops->recovery_write)
+ return 0;
+ return dax_dev->ops->recovery_write(dax_dev, pgoff, addr, bytes, iter);
+}
+EXPORT_SYMBOL_GPL(dax_recovery_write);
+
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 13e263299c9c..cdf48bc8c5b0 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -188,9 +188,18 @@ static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages);
}

+static size_t linear_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
+
+ return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
#else
#define linear_dax_direct_access NULL
#define linear_dax_zero_page_range NULL
+#define linear_dax_recovery_write NULL
#endif

static struct target_type linear_target = {
@@ -208,6 +217,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_zero_page_range = linear_dax_zero_page_range,
+ .dax_recovery_write = linear_dax_recovery_write,
};

int __init dm_linear_init(void)
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 06bdbed65eb1..22739dccdd17 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -905,9 +905,18 @@ static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages << PAGE_SHIFT);
}

+static size_t log_writes_dax_recovery_write(struct dm_target *ti,
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
+
+ return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
#else
#define log_writes_dax_direct_access NULL
#define log_writes_dax_zero_page_range NULL
+#define log_writes_dax_recovery_write NULL
#endif

static struct target_type log_writes_target = {
@@ -925,6 +934,7 @@ static struct target_type log_writes_target = {
.io_hints = log_writes_io_hints,
.direct_access = log_writes_dax_direct_access,
.dax_zero_page_range = log_writes_dax_zero_page_range,
+ .dax_recovery_write = log_writes_dax_recovery_write,
};

static int __init dm_log_writes_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 77d72900e997..baa085cc67bd 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -331,9 +331,18 @@ static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
return dax_zero_page_range(dax_dev, pgoff, nr_pages);
}

+static size_t stripe_dax_recovery_write(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
+
+ return dax_recovery_write(dax_dev, pgoff, addr, bytes, i);
+}
+
#else
#define stripe_dax_direct_access NULL
#define stripe_dax_zero_page_range NULL
+#define stripe_dax_recovery_write NULL
#endif

/*
@@ -470,6 +479,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_zero_page_range = stripe_dax_zero_page_range,
+ .dax_recovery_write = stripe_dax_recovery_write,
};

int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8258676a352f..5374c8aba2d6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1147,6 +1147,25 @@ static int dm_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}

+static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ int srcu_idx;
+ long ret = 0;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+ if (!ti || !ti->type->dax_recovery_write)
+ goto out;
+
+ ret = ti->type->dax_recovery_write(ti, pgoff, addr, bytes, i);
+out:
+ dm_put_live_table(md, srcu_idx);
+ return ret;
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_* zone management
@@ -3151,6 +3170,7 @@ static const struct block_device_operations dm_rq_blk_dops = {
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
.zero_page_range = dm_dax_zero_page_range,
+ .recovery_write = dm_dax_recovery_write,
};

/*
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 47f34c50f944..e5e288135af7 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -287,9 +287,16 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
}

+static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ return 0;
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
.zero_page_range = pmem_dax_zero_page_range,
+ .recovery_write = pmem_recovery_write,
};

static ssize_t write_cache_show(struct device *dev,
diff --git a/fs/dax.c b/fs/dax.c
index ef3103107104..a1e4b45cbf55 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1240,6 +1240,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
const size_t size = ALIGN(length + offset, PAGE_SIZE);
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
ssize_t map_len;
+ bool recovery = false;
void *kaddr;

if (fatal_signal_pending(current)) {
@@ -1249,6 +1250,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,

map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
+ if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+ map_len = dax_direct_access(dax_dev, pgoff,
+ PHYS_PFN(size), DAX_RECOVERY_WRITE,
+ &kaddr, NULL);
+ if (map_len > 0)
+ recovery = true;
+ }
if (map_len < 0) {
ret = map_len;
break;
@@ -1260,7 +1268,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
if (map_len > end - pos)
map_len = end - pos;

- if (iov_iter_rw(iter) == WRITE)
+ if (recovery)
+ xfer = dax_recovery_write(dax_dev, pgoff, kaddr,
+ map_len, iter);
+ else if (iov_iter_rw(iter) == WRITE)
xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
map_len, iter);
else
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3f1339bce3c0..e7b81634c52a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -35,6 +35,12 @@ struct dax_operations {
sector_t, sector_t);
/* zero_page_range: required operation. Zero page range */
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+ /*
+ * recovery_write: recover a poisoned range by DAX device driver
+ * capable of clearing poison.
+ */
+ size_t (*recovery_write)(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *iter);
};

#if IS_ENABLED(CONFIG_DAX)
@@ -45,6 +51,8 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool dax_synchronous(struct dax_device *dax_dev);
void set_dax_synchronous(struct dax_device *dax_dev);
+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i);
/*
* Check if given mapping is supported by the file / underlying device.
*/
@@ -92,6 +100,11 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
{
return !(vma->vm_flags & VM_SYNC);
}
+static inline size_t dax_recovery_write(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+{
+ return 0;
+}
#endif

void set_dax_nocache(struct dax_device *dax_dev);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index acdedda0d12b..47a01c7cffdf 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -152,6 +152,14 @@ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
size_t nr_pages);

+/*
+ * Returns:
+ * != 0 : number of bytes transferred
+ * 0 : recovery write failed
+ */
+typedef size_t (*dm_dax_recovery_write_fn)(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i);
+
void dm_error(const char *message);

struct dm_dev {
@@ -201,6 +209,7 @@ struct target_type {
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
dm_dax_zero_page_range_fn dax_zero_page_range;
+ dm_dax_recovery_write_fn dax_recovery_write;

/* For internal device-mapper use. */
struct list_head list;
--
2.18.4

2022-04-23 00:09:27

by Dan Williams

Subject: Re: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

[ Add Tony as the originator of the whole_page() logic and Jue who
reported the issue that lead to 17fae1294ad9 x86/{mce,mm}: Unmap the
entire page if the whole page is affected and poisoned ]


On Fri, Apr 22, 2022 at 3:46 PM Jane Chu <[email protected]> wrote:
>
> The set_memory_uc() approach doesn't work well in all cases.
> As Dan pointed out when "The VMM unmapped the bad page from
> guest physical space and passed the machine check to the guest."
> "The guest gets virtual #MC on an access to that page. When
> the guest tries to do set_memory_uc() and instructs cpa_flush()
> to do clean caches that results in taking another fault / exception
> perhaps because the VMM unmapped the page from the guest."
>
> Since the driver has special knowledge to handle NP or UC,
> mark the poisoned page with NP and let driver handle it when
> it comes down to repair.
>
> Please refer to discussions here for more details.
> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>
> Now since poisoned page is marked as not-present, in order to
> avoid writing to a not-present page and trigger kernel Oops,
> also fix pmem_do_write().
>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 6 +++---
> arch/x86/mm/pat/set_memory.c | 23 +++++++++++------------
> drivers/nvdimm/pmem.c | 30 +++++++-----------------------
> include/linux/set_memory.h | 4 ++--
> 4 files changed, 23 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 981496e6bc0e..fa67bb9d1afe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>
> pfn = mce->addr >> PAGE_SHIFT;
> if (!memory_failure(pfn, 0)) {
> - set_mce_nospec(pfn, whole_page(mce));
> + set_mce_nospec(pfn);
> mce->kflags |= MCE_HANDLED_UC;
> }
>
> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>
> ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
> if (!ret) {
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> sync_core();
> return;
> }
> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
> p->mce_count = 0;
> pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
> if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> }
>
> static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 978cf5bd2ab6..e3a5e55f3e08 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1925,13 +1925,8 @@ int set_memory_wb(unsigned long addr, int numpages)
> }
> EXPORT_SYMBOL(set_memory_wb);
>
> -/*
> - * Prevent speculative access to the page by either unmapping
> - * it (if we do not require access to any part of the page) or
> - * marking it uncacheable (if we want to try to retrieve data
> - * from non-poisoned lines in the page).
> - */
> -int set_mce_nospec(unsigned long pfn, bool unmap)
> +/* Prevent speculative access to a page by marking it not-present */
> +int set_mce_nospec(unsigned long pfn)
> {
> unsigned long decoy_addr;
> int rc;
> @@ -1956,19 +1951,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
> */
> decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
>
> - if (unmap)
> - rc = set_memory_np(decoy_addr, 1);
> - else
> - rc = set_memory_uc(decoy_addr, 1);
> + rc = set_memory_np(decoy_addr, 1);
> if (rc)
> pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
> return rc;
> }
>
> +static int set_memory_present(unsigned long *addr, int numpages)
> +{
> + return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
> +}
> +
> /* Restore full speculative operation to the pfn. */
> int clear_mce_nospec(unsigned long pfn)
> {
> - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
> + unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
> +
> + return set_memory_present(&addr, 1);
> }
> EXPORT_SYMBOL_GPL(clear_mce_nospec);
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 58d95242a836..4aa17132a557 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
> struct page *page, unsigned int page_off,
> sector_t sector, unsigned int len)
> {
> - blk_status_t rc = BLK_STS_OK;
> - bool bad_pmem = false;
> phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> void *pmem_addr = pmem->virt_addr + pmem_off;
>
> - if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> - bad_pmem = true;
> + if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
> + blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
> +
> + if (rc != BLK_STS_OK)
> + return rc;
> + }
>
> - /*
> - * Note that we write the data both before and after
> - * clearing poison. The write before clear poison
> - * handles situations where the latest written data is
> - * preserved and the clear poison operation simply marks
> - * the address range as valid without changing the data.
> - * In this case application software can assume that an
> - * interrupted write will either return the new good
> - * data or an error.
> - *
> - * However, if pmem_clear_poison() leaves the data in an
> - * indeterminate state we need to perform the write
> - * after clear poison.
> - */
> flush_dcache_page(page);
> write_pmem(pmem_addr, page, page_off, len);
> - if (unlikely(bad_pmem)) {
> - rc = pmem_clear_poison(pmem, pmem_off, len);
> - write_pmem(pmem_addr, page, page_off, len);
> - }
>
> - return rc;
> + return BLK_STS_OK;
> }
>
> static void pmem_submit_bio(struct bio *bio)
> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
> index 683a6c3f7179..369769ce7399 100644
> --- a/include/linux/set_memory.h
> +++ b/include/linux/set_memory.h
> @@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
> #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
>
> #ifdef CONFIG_X86_64
> -int set_mce_nospec(unsigned long pfn, bool unmap);
> +int set_mce_nospec(unsigned long pfn);
> int clear_mce_nospec(unsigned long pfn);
> #else
> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
> +static inline int set_mce_nospec(unsigned long pfn)
> {
> return 0;
> }
> --
> 2.18.4
>

2022-04-23 07:48:20

by Christoph Hellwig

Subject: Re: [PATCH v9 7/7] pmem: implement pmem_recovery_write()

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-05-11 11:14:25

by Dan Williams

Subject: Re: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

On Fri, Apr 22, 2022 at 4:25 PM Dan Williams <[email protected]> wrote:
>
> [ Add Tony as the originator of the whole_page() logic and Jue who
> reported the issue that lead to 17fae1294ad9 x86/{mce,mm}: Unmap the
> entire page if the whole page is affected and poisoned ]
>
>
> On Fri, Apr 22, 2022 at 3:46 PM Jane Chu <[email protected]> wrote:
> >
> > The set_memory_uc() approach doesn't work well in all cases.
> > As Dan pointed out when "The VMM unmapped the bad page from
> > guest physical space and passed the machine check to the guest."
> > "The guest gets virtual #MC on an access to that page. When
> > the guest tries to do set_memory_uc() and instructs cpa_flush()
> > to do clean caches that results in taking another fault / exception
> > perhaps because the VMM unmapped the page from the guest."
> >
> > Since the driver has special knowledge to handle NP or UC,
> > mark the poisoned page with NP and let driver handle it when
> > it comes down to repair.
> >
> > Please refer to discussions here for more details.
> > https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
> >
> > Now since poisoned page is marked as not-present, in order to
> > avoid writing to a not-present page and trigger kernel Oops,
> > also fix pmem_do_write().
> >
> > Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Reviewed-by: Dan Williams <[email protected]>
> > Signed-off-by: Jane Chu <[email protected]>

Boris,

This is the last patch in this set that needs an x86 maintainer ack.
Since you have been involved in the history for most of this, mind
giving it an ack so I can pull it in for v5.19? Let me know if you
want a resend.

2022-05-11 15:06:24

by Borislav Petkov

Subject: Re: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

On Tue, May 10, 2022 at 08:56:21PM -0700, Dan Williams wrote:
> This is the last patch in this set that needs an x86 maintainer ack.
> Since you have been involved in the history for most of this, mind
> giving it an ack so I can pull it in for v5.19? Let me know if you
> want a resend.

I - just like you - am waiting for Tony to say whether he still needs
this whole_page() thing. I already suggested removing it so I'm fine
with this patch.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-05-12 07:43:16

by Tony Luck

Subject: RE: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

> I - just like you - am waiting for Tony to say whether he still needs
> this whole_page() thing. I already suggested removing it so I'm fine
> with this patch.

IIRC this new patch effectively reverts back to the original behavior that
I implemented back at the dawn of time. I.e. just always mark the whole
page "not present" and don't try to mess with UC mappings to allow
partial (but non-speculative) access to the not-poisoned parts of the
page.

If that is the case ... then Acked-by: Tony Luck <[email protected]>

If I've misunderstood ... then please explain what it is doing.

Thanks

-Tony

2022-05-14 00:48:19

by Dan Williams

Subject: Re: [PATCH v9 3/7] mce: fix set_mce_nospec to always unmap the whole page

On Wed, May 11, 2022 at 10:17 AM Luck, Tony <[email protected]> wrote:
>
> > I - just like you - am waiting for Tony to say whether he still needs
> > this whole_page() thing. I already suggested removing it so I'm fine
> > with this patch.
>
> IIRC this new patch effectively reverts back to the original behavior that
> I implemented back at the dawn of time. I.e. just always mark the whole
> page "not present" and don't try to mess with UC mappings to allow
> partial (but non-speculative) access to the not-poisoned parts of the
> page.
>
> If that is the case ... then Acked-by: Tony Luck <[email protected]>
>
> If I've misunderstood ... then please explain what it is doing.

You are correct. The page is always marked not present as far as the
page-offlining code is concerned, back to the way it always was.

The code in the pmem driver that repairs the page now knows that the
page is to be kept "not present" until the poison is cleared and
clear_mce_nospec() returns the mapping to typical write-back caching.

There is no support for what the UC case previously allowed, which was
reading the good lines around the one bad line; overwrites are simply
handled to clear the poison and restore access.

2022-05-14 03:22:01

by Dan Williams

Subject: [PATCH v10 7/7] pmem: implement pmem_recovery_write()

From: Jane Chu <[email protected]>

The recovery write thread starts out as a normal pwrite thread;
when the filesystem is told about a potential media error in the
range, it turns the normal pwrite into a dax_recovery_write.

The recovery write consists of clearing the media poison, clearing
the page's HWPoison bit, re-enabling page-wide read-write permission,
flushing the caches, and finally writing the data. A competing pread
thread is held off during the recovery process since data read back
might not be valid; this is achieved by clearing the badblock records
only after the recovery write is complete. Competing recovery write
threads are already serialized by the writer lock held by dax_iomap_rw().

Signed-off-by: Jane Chu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Changes since v9:
- Fixup compile warnings in debug messages

drivers/nvdimm/pmem.c | 87 ++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0961625dfa05..6b24ecada695 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -266,21 +266,43 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
pfn_t *pfn)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
-
- if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
- PFN_PHYS(nr_pages))))
- return -EIO;
+ sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+ unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
+ struct badblocks *bb = &pmem->bb;
+ sector_t first_bad;
+ int num_bad;

if (kaddr)
*kaddr = pmem->virt_addr + offset;
if (pfn)
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

+ if (bb->count &&
+ badblocks_check(bb, sector, num, &first_bad, &num_bad)) {
+ long actual_nr;
+
+ if (mode != DAX_RECOVERY_WRITE)
+ return -EIO;
+
+ /*
+ * The recovery stride is set to the kernel page size because
+ * the underlying driver and firmware clear poison functions
+ * don't appear to handle large chunks (such as 2MiB) reliably.
+ */
+ actual_nr = PHYS_PFN(
+ PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
+ dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
+ sector, nr_pages, first_bad, actual_nr);
+ if (actual_nr)
+ return actual_nr;
+ return 1;
+ }
+
/*
- * If badblocks are present, limit known good range to the
- * requested range.
+ * If badblocks are present but not in the range, limit known good range
+ * to the requested range.
*/
- if (unlikely(pmem->bb.count))
+ if (bb->count)
return nr_pages;
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
}
@@ -310,10 +332,59 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
}

+/*
+ * The recovery write thread started out as a normal pwrite thread and
+ * when the filesystem was told about potential media error in the
+ * range, filesystem turns the normal pwrite to a dax_recovery_write.
+ *
+ * The recovery write consists of clearing media poison, clearing page
+ * HWPoison bit, reenable page-wide read-write permission, flush the
+ * caches and finally write. A competing pread thread will be held
+ * off during the recovery process since data read back might not be
+ * valid, and this is achieved by clearing the badblock records after
+ * the recovery write is complete. Competing recovery write threads
+ * are already serialized by writer lock held by dax_iomap_rw().
+ */
static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
{
- return 0;
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ size_t olen, len, off;
+ phys_addr_t pmem_off;
+ struct device *dev = pmem->bb.dev;
+ long cleared;
+
+ off = offset_in_page(addr);
+ len = PFN_PHYS(PFN_UP(off + bytes));
+ if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
+ return _copy_from_iter_flushcache(addr, bytes, i);
+
+ /*
+ * Not page-aligned range cannot be recovered. This should not
+ * happen unless something else went wrong.
+ */
+ if (off || !PAGE_ALIGNED(bytes)) {
+ dev_dbg(dev, "Found poison, but addr(%p) or bytes(%#zx) not page aligned\n",
+ addr, bytes);
+ return 0;
+ }
+
+ pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+ cleared = __pmem_clear_poison(pmem, pmem_off, len);
+ if (cleared > 0 && cleared < len) {
+ dev_dbg(dev, "poison cleared only %ld out of %zu bytes\n",
+ cleared, len);
+ return 0;
+ }
+ if (cleared < 0) {
+ dev_dbg(dev, "poison clear failed: %ld\n", cleared);
+ return 0;
+ }
+
+ olen = _copy_from_iter_flushcache(addr, bytes, i);
+ pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
+
+ return olen;
}

static const struct dax_operations pmem_dax_ops = {


2022-05-17 12:00:22

by Dan Williams

Subject: [PATCH v10 3/7] mce: fix set_mce_nospec to always unmap the whole page

From: Jane Chu <[email protected]>

The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out: when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest",
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."

Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let the driver handle it when
it comes to repair.

Please refer to the discussion here for more details:
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now that the poisoned page is marked as not-present, also fix
pmem_do_write() in order to avoid writing to a not-present page and
triggering a kernel Oops.

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
Acked-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
---
Changes since v9:
- Collect Tony's ack
- Rebase on [PATCH v11 2/7]:
https://lore.kernel.org/r/165272527328.90175.8336008202048685278.stgit@dwillia2-desk3.amr.corp.intel.com

arch/x86/kernel/cpu/mce/core.c | 6 +++---
arch/x86/mm/pat/set_memory.c | 23 +++++++++++------------
drivers/nvdimm/pmem.c | 30 +++++++-----------------------
include/linux/set_memory.h | 4 ++--
4 files changed, 23 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 981496e6bc0e..fa67bb9d1afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,

pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn, whole_page(mce));
+ set_mce_nospec(pfn);
mce->kflags |= MCE_HANDLED_UC;
}

@@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)

ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
if (!ret) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
sync_core();
return;
}
@@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
p->mce_count = 0;
pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
}

static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 0caf4b0edcbc..44f0d4260bd8 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1925,14 +1925,9 @@ int set_memory_wb(unsigned long addr, int numpages)
}
EXPORT_SYMBOL(set_memory_wb);

-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
+/* Prevent speculative access to a page by marking it not-present */
#ifdef CONFIG_X86_64
-int set_mce_nospec(unsigned long pfn, bool unmap)
+int set_mce_nospec(unsigned long pfn)
{
unsigned long decoy_addr;
int rc;
@@ -1954,19 +1949,23 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

- if (unmap)
- rc = set_memory_np(decoy_addr, 1);
- else
- rc = set_memory_uc(decoy_addr, 1);
+ rc = set_memory_np(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
}

+static int set_memory_present(unsigned long *addr, int numpages)
+{
+ return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
- return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+ unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
+
+ return set_memory_present(&addr, 1);
}
EXPORT_SYMBOL_GPL(clear_mce_nospec);
#endif /* CONFIG_X86_64 */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..4aa17132a557 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,36 +158,20 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
{
- blk_status_t rc = BLK_STS_OK;
- bool bad_pmem = false;
phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
void *pmem_addr = pmem->virt_addr + pmem_off;

- if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
- bad_pmem = true;
+ if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+ blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
+
+ if (rc != BLK_STS_OK)
+ return rc;
+ }

- /*
- * Note that we write the data both before and after
- * clearing poison. The write before clear poison
- * handles situations where the latest written data is
- * preserved and the clear poison operation simply marks
- * the address range as valid without changing the data.
- * In this case application software can assume that an
- * interrupted write will either return the new good
- * data or an error.
- *
- * However, if pmem_clear_poison() leaves the data in an
- * indeterminate state we need to perform the write
- * after clear poison.
- */
flush_dcache_page(page);
write_pmem(pmem_addr, page, page_off, len);
- if (unlikely(bad_pmem)) {
- rc = pmem_clear_poison(pmem, pmem_off, len);
- write_pmem(pmem_addr, page, page_off, len);
- }

- return rc;
+ return BLK_STS_OK;
}

static void pmem_submit_bio(struct bio *bio)
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 683a6c3f7179..369769ce7399 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */

#ifdef CONFIG_X86_64
-int set_mce_nospec(unsigned long pfn, bool unmap);
+int set_mce_nospec(unsigned long pfn);
int clear_mce_nospec(unsigned long pfn);
#else
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
+static inline int set_mce_nospec(unsigned long pfn)
{
return 0;
}