2022-04-06 05:51:31

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 0/6] DAX poison recovery

In this series, dax recovery code path is independent of that of
normal write. Competing dax recovery threads are serialized,
racing read threads are guaranteed not overlapping with the
recovery process.

In this phase, the recovery granularity is page, future patch
will explore recovery in finer granularity.

Changelog:
v6 -> v7:
. incorporated comments from Christoph, and picked up a reviewed-by
. add [email protected] per Boris
. discovered pmem firmware doesn't reliably handle a request to clear
poison over a large range (such as 2M), hence worked around the
the feature by limiting the size of the requested range to kernel
page size.

v5->v6:
. per Christoph, move set{clear}_mce_nospec() inline functions out
of include/linux/set_memory.h and into arch/x86/mm/pat/set_memory.c
file, so that no need to export _set_memory_present().
. per Christoph, ratelimit warning message in pmem_do_write()
. per both Christoph and Dan, switch back to adding a flag to
dax_direct_access() instead of embedding the flag in kvaddr
. suggestions from Christoph for improving code structure and
readability
. per Dan, add .recovery_write to dev_pagemap.ops instead of adding
it to dax_operations, such that, the DM layer doesn't need to be
involved explicitly in dax recoovery write
. per Dan, is_bad_pmem() takes a seqlock, so no need to place it
under recovery_lock.
Many thanks for both reviewers!

v4->v5:
Fixed build errors reported by kernel test robot

v3->v4:
Rebased to v5.17-rc1-81-g0280e3c58f92

References:
v4 https://lore.kernel.org/lkml/[email protected]/T/
v3 https://lkml.org/lkml/2022/1/11/900
v2 https://lore.kernel.org/all/[email protected]/
Disussions about marking poisoned page as 'np'
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Jane Chu (6):
x86/mm: fix comment
x86/mce: relocate set{clear}_mce_nospec() functions
mce: fix set_mce_nospec to always unmap the whole page
dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops
pmem: refactor pmem_clear_poison()
pmem: implement pmem_recovery_write()

arch/x86/include/asm/set_memory.h | 52 -------
arch/x86/kernel/cpu/mce/core.c | 6 +-
arch/x86/mm/pat/set_memory.c | 43 +++++-
drivers/dax/super.c | 17 ++-
drivers/md/dm-linear.c | 4 +-
drivers/md/dm-log-writes.c | 5 +-
drivers/md/dm-stripe.c | 4 +-
drivers/md/dm-target.c | 2 +-
drivers/md/dm-writecache.c | 5 +-
drivers/md/dm.c | 5 +-
drivers/nvdimm/pmem.c | 224 ++++++++++++++++++++++--------
drivers/nvdimm/pmem.h | 3 +-
drivers/s390/block/dcssblk.c | 4 +-
fs/dax.c | 24 +++-
fs/fuse/dax.c | 4 +-
include/linux/dax.h | 11 +-
include/linux/device-mapper.h | 2 +-
include/linux/memremap.h | 7 +
include/linux/set_memory.h | 11 +-
tools/testing/nvdimm/pmem-dax.c | 2 +-
20 files changed, 285 insertions(+), 150 deletions(-)


base-commit: ae085d7f9365de7da27ab5c0d16b12d51ea7fca9
--
2.18.4


2022-04-06 09:05:05

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 6/6] pmem: implement pmem_recovery_write()

The recovery write thread started out as a normal pwrite thread and
when the filesystem was told about potential media error in the
range, filesystem turns the normal pwrite to a dax_recovery_write.

The recovery write consists of clearing media poison, clearing page
HWPoison bit, reenable page-wide read-write permission, flush the
caches and finally write. A competing pread thread will be held
off during the recovery process since data read back might not be
valid, and this is achieved by clearing the badblock records after
the recovery write is complete. Competing recovery write threads
are serialized by pmem device level .recovery_lock.

Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 60 ++++++++++++++++++++++++++++++++++++++++++-
drivers/nvdimm/pmem.h | 1 +
2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 56596be70400..b868a88a0d58 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -340,10 +340,67 @@ static const struct dax_operations pmem_dax_ops = {
.zero_page_range = pmem_dax_zero_page_range,
};

+/*
+ * The recovery write thread started out as a normal pwrite thread and
+ * when the filesystem was told about potential media error in the
+ * range, filesystem turns the normal pwrite to a dax_recovery_write.
+ *
+ * The recovery write consists of clearing media poison, clearing page
+ * HWPoison bit, reenable page-wide read-write permission, flush the
+ * caches and finally write. A competing pread thread will be held
+ * off during the recovery process since data read back might not be
+ * valid, and this is achieved by clearing the badblock records after
+ * the recovery write is complete. Competing recovery write threads
+ * are serialized by pmem device level .recovery_lock.
+ */
static size_t pmem_recovery_write(struct dev_pagemap *pgmap, pgoff_t pgoff,
void *addr, size_t bytes, void *iter)
{
- return 0;
+ struct pmem_device *pmem = pgmap->owner;
+ size_t olen, len, off;
+ phys_addr_t pmem_off;
+ struct device *dev = pmem->bb.dev;
+ long cleared;
+
+ if (!pmem) {
+ dev_warn(dev, "pgmap->owner field not set, cannot recover\n");
+ return 0;
+ }
+
+ off = (unsigned long)addr & ~PAGE_MASK;
+ len = PFN_PHYS(PFN_UP(off + bytes));
+ if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
+ return _copy_from_iter_flushcache(addr, bytes, iter);
+
+ /*
+ * Not page-aligned range cannot be recovered. This should not
+ * happen unless something else went wrong.
+ */
+ if (off || !(PAGE_ALIGNED(bytes))) {
+ dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
+ addr, bytes);
+ return 0;
+ }
+
+ mutex_lock(&pmem->recovery_lock);
+ pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+ cleared = __pmem_clear_poison(pmem, pmem_off, len);
+ if (cleared > 0 && cleared < len) {
+ dev_warn(dev, "poison cleared only %ld out of %lu\n",
+ cleared, len);
+ mutex_unlock(&pmem->recovery_lock);
+ return 0;
+ } else if (cleared < 0) {
+ dev_warn(dev, "poison clear failed: %ld\n", cleared);
+ mutex_unlock(&pmem->recovery_lock);
+ return 0;
+ }
+
+ olen = _copy_from_iter_flushcache(addr, bytes, iter);
+ pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
+
+ mutex_unlock(&pmem->recovery_lock);
+ return olen;
}

static ssize_t write_cache_show(struct device *dev,
@@ -533,6 +590,7 @@ static int pmem_attach_disk(struct device *dev,
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+ mutex_init(&pmem->recovery_lock);
pmem->dax_dev = dax_dev;

rc = device_add_disk(dev, disk, pmem_attribute_groups);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index e9c53d42c488..d44f6da34009 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -25,6 +25,7 @@ struct pmem_device {
struct dax_device *dax_dev;
struct gendisk *disk;
struct dev_pagemap pgmap;
+ struct mutex recovery_lock;
};

long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
--
2.18.4

2022-04-06 11:21:40

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
not set by default in dax_direct_access() such that the helper
does not translate a pmem range to kernel virtual address if the
range contains uncorrectable errors. When the flag is set,
the helper ignores the UEs and return kernel virtual adderss so
that the caller may get on with data recovery via write.

Also introduce a new dev_pagemap_ops .recovery_write function.
The function is applicable to FSDAX device only. The device
page backend driver provides .recovery_write function if the
device has underlying mechanism to clear the uncorrectable
errors on the fly.

Signed-off-by: Jane Chu <[email protected]>
---
drivers/dax/super.c | 17 ++++++++--
drivers/md/dm-linear.c | 4 +--
drivers/md/dm-log-writes.c | 5 +--
drivers/md/dm-stripe.c | 4 +--
drivers/md/dm-target.c | 2 +-
drivers/md/dm-writecache.c | 5 +--
drivers/md/dm.c | 5 +--
drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------
drivers/nvdimm/pmem.h | 2 +-
drivers/s390/block/dcssblk.c | 4 +--
fs/dax.c | 24 ++++++++++----
fs/fuse/dax.c | 4 +--
include/linux/dax.h | 11 +++++--
include/linux/device-mapper.h | 2 +-
include/linux/memremap.h | 7 ++++
tools/testing/nvdimm/pmem-dax.c | 2 +-
16 files changed, 116 insertions(+), 39 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..8252858cd25a 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/fs.h>
+#include <linux/memremap.h>
#include "dax-private.h"

/**
@@ -117,6 +118,7 @@ enum dax_device_flags {
* @dax_dev: a dax_device instance representing the logical memory range
* @pgoff: offset in pages from the start of the device to translate
* @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery
* @kaddr: output parameter that returns a virtual address mapping of pfn
* @pfn: output parameter that returns an absolute pfn translation of @pgoff
*
@@ -124,7 +126,7 @@ enum dax_device_flags {
* pages accessible at the device relative @pgoff.
*/
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
- void **kaddr, pfn_t *pfn)
+ int flags, void **kaddr, pfn_t *pfn)
{
long avail;

@@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
if (nr_pages < 0)
return -EINVAL;

- avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+ avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags,
kaddr, pfn);
if (!avail)
return -ERANGE;
@@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
}
EXPORT_SYMBOL_GPL(dax_zero_page_range);

+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter)
+{
+ struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
+
+ if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write)
+ return 0;
+ return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter);
+}
+EXPORT_SYMBOL_GPL(dax_recovery_write);
+
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 76b486e4d2be..9e6d8bdf3b2a 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
}

static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);

- return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
}

static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index c9d036d6bb2e..e23f062ade5f 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
}

static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags,
+ void **kaddr, pfn_t *pfn)
{
struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);

- return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
}

static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c81d331d1afe..b89339c78702 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
}

static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);

- return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
}

static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 64dd0b34fcf4..24b1e5628f3a 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone,
}

static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
return -EIO;
}
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 5630b470ba42..180ca8fa383e 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)

id = dax_read_lock();

- da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
+ da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0,
+ &wc->memory_map, &pfn);
if (da < 0) {
wc->memory_map = NULL;
r = da;
@@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
do {
long daa;
daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
- NULL, &pfn);
+ 0, NULL, &pfn);
if (daa <= 0) {
r = daa ? daa : -EINVAL;
goto err3;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ad2e0bbeb559..a8c697bb6603 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
}

static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr,
+ pfn_t *pfn)
{
struct mapped_device *md = dax_get_private(dax_dev);
sector_t sector = pgoff * PAGE_SECTORS;
@@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
- ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+ ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn);

out:
dm_put_live_table(md, srcu_idx);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 30c71a68175b..0400c5a7ba39 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,

/* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
+ sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+ unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
+ struct badblocks *bb = &pmem->bb;
+ sector_t first_bad;
+ int num_bad;
+ bool bad_in_range;
+ long actual_nr;
+
+ if (!bb->count)
+ bad_in_range = false;
+ else
+ bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);

- if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
- PFN_PHYS(nr_pages))))
+ if (bad_in_range && !(flags & DAX_RECOVERY))
return -EIO;

if (kaddr)
@@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
if (pfn)
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

+ if (!bad_in_range) {
+ /*
+ * If badblock is present but not in the range, limit known good range
+ * to the requested range.
+ */
+ if (bb->count)
+ return nr_pages;
+ return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+ }
+
/*
- * If badblocks are present, limit known good range to the
- * requested range.
+ * In case poison is found in the given range and DAX_RECOVERY flag is set,
+ * recovery stride is set to kernel page size because the underlying driver and
+ * firmware clear poison functions don't appear to handle large chunk (such as
+ * 2MiB) reliably.
*/
- if (unlikely(pmem->bb.count))
- return nr_pages;
- return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+ actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
+ dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
+ sector, nr_pages, first_bad, actual_nr);
+ return (actual_nr == 0) ? 1 : actual_nr;
}

static const struct block_device_operations pmem_fops = {
@@ -277,11 +301,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
}

static long pmem_dax_direct_access(struct dax_device *dax_dev,
- pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+ pgoff_t pgoff, long nr_pages, int flags, void **kaddr,
+ pfn_t *pfn)
{
struct pmem_device *pmem = dax_get_private(dax_dev);

- return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+ return __pmem_direct_access(pmem, pgoff, nr_pages, flags, kaddr, pfn);
}

static const struct dax_operations pmem_dax_ops = {
@@ -289,6 +314,12 @@ static const struct dax_operations pmem_dax_ops = {
.zero_page_range = pmem_dax_zero_page_range,
};

+static size_t pmem_recovery_write(struct dev_pagemap *pgmap, pgoff_t pgoff,
+ void *addr, size_t bytes, void *iter)
+{
+ return 0;
+}
+
static ssize_t write_cache_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -349,6 +380,10 @@ static void pmem_release_disk(void *__pmem)
blk_cleanup_disk(pmem->disk);
}

+static const struct dev_pagemap_ops pmem_pgmap_ops = {
+ .recovery_write = pmem_recovery_write,
+};
+
static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
{
@@ -380,6 +415,8 @@ static int pmem_attach_disk(struct device *dev,
rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap);
if (rc)
return rc;
+ if (nd_pfn->mode == PFN_MODE_PMEM)
+ pmem->pgmap.ops = &pmem_pgmap_ops;
}

/* we're attaching a block device, disable raw namespace access */
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 1f51a2361429..e9c53d42c488 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -28,7 +28,7 @@ struct pmem_device {
};

long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, int flag, void **kaddr, pfn_t *pfn);

#ifdef CONFIG_MEMORY_FAILURE
static inline bool test_and_clear_pmem_poison(struct page *page)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index d614843caf6c..c3fbf500868f 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -32,7 +32,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_submit_bio(struct bio *bio);
static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -927,7 +927,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,

static long
dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);

diff --git a/fs/dax.c b/fs/dax.c
index 67a08a32fccb..e8900e92990b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -721,7 +721,7 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
int id;

id = dax_read_lock();
- rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
+ rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, 0, &kaddr, NULL);
if (rc < 0) {
dax_read_unlock(id);
return rc;
@@ -1012,7 +1012,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
long length;

id = dax_read_lock();
- length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+ length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), 0,
NULL, pfnp);
if (length < 0) {
rc = length;
@@ -1122,7 +1122,7 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
void *kaddr;
long ret;

- ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+ ret = dax_direct_access(dax_dev, pgoff, 1, 0, &kaddr, NULL);
if (ret > 0) {
memset(kaddr + offset, 0, size);
dax_flush(dax_dev, kaddr + offset, size);
@@ -1239,15 +1239,24 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
const size_t size = ALIGN(length + offset, PAGE_SIZE);
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
ssize_t map_len;
+ bool recovery = false;
void *kaddr;
+ long nrpg;
+ pfn_t pfn;

if (fatal_signal_pending(current)) {
ret = -EINTR;
break;
}

- map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
- &kaddr, NULL);
+ nrpg = PHYS_PFN(size);
+ map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL);
+ if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+ map_len = dax_direct_access(dax_dev, pgoff, nrpg,
+ DAX_RECOVERY, &kaddr, &pfn);
+ if (map_len > 0)
+ recovery = true;
+ }
if (map_len < 0) {
ret = map_len;
break;
@@ -1259,7 +1268,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
if (map_len > end - pos)
map_len = end - pos;

- if (iov_iter_rw(iter) == WRITE)
+ if (recovery)
+ xfer = dax_recovery_write(dax_dev, pgoff, pfn, kaddr,
+ map_len, iter);
+ else if (iov_iter_rw(iter) == WRITE)
xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
map_len, iter);
else
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index d7d3a7f06862..bf5e40a0707b 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1241,8 +1241,8 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
INIT_DELAYED_WORK(&fcd->free_work, fuse_dax_free_mem_worker);

id = dax_read_lock();
- nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL,
- NULL);
+ nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), 0,
+ NULL, NULL);
dax_read_unlock(id);
if (nr_pages < 0) {
pr_debug("dax_direct_access() returned %ld\n", nr_pages);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..fc9ee886de89 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,14 +14,17 @@ struct iomap_ops;
struct iomap_iter;
struct iomap;

+/* Flag to communicate for DAX recovery operation */
+#define DAX_RECOVERY 0x1
+
struct dax_operations {
/*
* direct_access: translate a device-relative
* logical-page-offset into an absolute physical pfn. Return the
* number of pages available for DAX at that pfn.
*/
- long (*direct_access)(struct dax_device *, pgoff_t, long,
- void **, pfn_t *);
+ long (*direct_access)(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn);
/*
* Validate whether this device is usable as an fsdax backing
* device.
@@ -40,6 +43,8 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool dax_synchronous(struct dax_device *dax_dev);
void set_dax_synchronous(struct dax_device *dax_dev);
+size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, pfn_t pfn,
+ void *addr, size_t bytes, struct iov_iter *i);
/*
* Check if given mapping is supported by the file / underlying device.
*/
@@ -178,7 +183,7 @@ static inline void dax_read_unlock(int id)
bool dax_alive(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
- void **kaddr, pfn_t *pfn);
+ int flags, void **kaddr, pfn_t *pfn);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index c2a3758c4aaa..45ad013294a3 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -146,7 +146,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
* >= 0 : the number of bytes accessible at the address
*/
typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn);
typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
size_t nr_pages);

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..79a170cb49ef 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,6 +79,13 @@ struct dev_pagemap_ops {
* the page back to a CPU accessible page.
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+ /*
+ * Used for FS DAX device only. For synchronous recovery from DAX media
+ * encountering Uncorrectable Error.
+ */
+ size_t (*recovery_write)(struct dev_pagemap *pgmap, pgoff_t pgoff,
+ void *addr, size_t bytes, void *iter);
};

#define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index af19c85558e7..287db5e3e293 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -8,7 +8,7 @@
#include <nd.h>

long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, int flags, void **kaddr, pfn_t *pfn)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;

--
2.18.4

2022-04-06 11:59:59

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

The set_memory_uc() approach doesn't work well in all cases.
For example, when "The VMM unmapped the bad page from guest
physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page.
When the guest tries to do set_memory_uc() and instructs
cpa_flush() to do clean caches that results in taking another
fault / exception perhaps because the VMM unmapped the page
from the guest."

Since the driver has special knowledge to handle NP or UC,
let's mark the poisoned page with NP and let driver handle it
when it comes down to repair.

Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now since poisoned page is marked as not-present, in order to
avoid writing to a 'np' page and trigger kernel Oops, also fix
pmem_do_write().

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Signed-off-by: Jane Chu <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 6 +++---
arch/x86/mm/pat/set_memory.c | 18 ++++++------------
drivers/nvdimm/pmem.c | 31 +++++++------------------------
include/linux/set_memory.h | 4 ++--
4 files changed, 18 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 981496e6bc0e..fa67bb9d1afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,

pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn, whole_page(mce));
+ set_mce_nospec(pfn);
mce->kflags |= MCE_HANDLED_UC;
}

@@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)

ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
if (!ret) {
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
sync_core();
return;
}
@@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
p->mce_count = 0;
pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
- set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
}

static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 93dde949f224..404ffcb3f2cb 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1926,13 +1926,8 @@ int set_memory_wb(unsigned long addr, int numpages)
EXPORT_SYMBOL(set_memory_wb);

#ifdef CONFIG_X86_64
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-int set_mce_nospec(unsigned long pfn, bool unmap)
+/* Prevent speculative access to a page by marking it not-present */
+int set_mce_nospec(unsigned long pfn)
{
unsigned long decoy_addr;
int rc;
@@ -1954,10 +1949,7 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

- if (unmap)
- rc = set_memory_np(decoy_addr, 1);
- else
- rc = set_memory_uc(decoy_addr, 1);
+ rc = set_memory_np(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
@@ -1966,7 +1958,9 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
- return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+ unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
+
+ return change_page_attr_set(&addr, 1, __pgprot(_PAGE_PRESENT), 0);
}
EXPORT_SYMBOL_GPL(clear_mce_nospec);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..30c71a68175b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,36 +158,19 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
{
- blk_status_t rc = BLK_STS_OK;
- bool bad_pmem = false;
phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
void *pmem_addr = pmem->virt_addr + pmem_off;

- if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
- bad_pmem = true;
+ if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+ blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);

- /*
- * Note that we write the data both before and after
- * clearing poison. The write before clear poison
- * handles situations where the latest written data is
- * preserved and the clear poison operation simply marks
- * the address range as valid without changing the data.
- * In this case application software can assume that an
- * interrupted write will either return the new good
- * data or an error.
- *
- * However, if pmem_clear_poison() leaves the data in an
- * indeterminate state we need to perform the write
- * after clear poison.
- */
+ if (rc != BLK_STS_OK)
+ pr_warn_ratelimited("%s: failed to clear poison\n", __func__);
+ return rc;
+ }
flush_dcache_page(page);
write_pmem(pmem_addr, page, page_off, len);
- if (unlikely(bad_pmem)) {
- rc = pmem_clear_poison(pmem, pmem_off, len);
- write_pmem(pmem_addr, page, page_off, len);
- }
-
- return rc;
+ return BLK_STS_OK;
}

static void pmem_submit_bio(struct bio *bio)
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index d6263d7afb55..cde2d8687a7b 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */

#ifdef CONFIG_X86_64
-int set_mce_nospec(unsigned long pfn, bool unmap);
+int set_mce_nospec(unsigned long pfn);
int clear_mce_nospec(unsigned long pfn);
#else
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
+static inline int set_mce_nospec(unsigned long pfn)
{
return 0;
}
--
2.18.4

2022-04-06 13:08:06

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

Refactor the pmem_clear_poison() in order to share common code
later.

Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 78 ++++++++++++++++++++++++++++---------------
1 file changed, 52 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0400c5a7ba39..56596be70400 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,10 +45,27 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
}

-static void hwpoison_clear(struct pmem_device *pmem,
- phys_addr_t phys, unsigned int len)
+static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
{
+ return (pmem->phys_addr + offset);
+}
+
+static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
+{
+ return (offset - pmem->data_offset) >> SECTOR_SHIFT;
+}
+
+static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
+{
+ return ((sector << SECTOR_SHIFT) + pmem->data_offset);
+}
+
+static void pmem_clear_hwpoison(struct pmem_device *pmem, phys_addr_t offset,
+ unsigned int len)
+{
+ phys_addr_t phys = to_phys(pmem, offset);
unsigned long pfn_start, pfn_end, pfn;
+ unsigned int blks = len >> SECTOR_SHIFT;

/* only pmem in the linear map supports HWPoison */
if (is_vmalloc_addr(pmem->virt_addr))
@@ -67,35 +84,44 @@ static void hwpoison_clear(struct pmem_device *pmem,
if (test_and_clear_pmem_poison(page))
clear_mce_nospec(pfn);
}
+
+ dev_dbg(to_dev(pmem), "%#llx clear %u sector%s\n",
+ (unsigned long long) to_sect(pmem, offset), blks,
+ blks > 1 ? "s" : "");
}

-static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
+static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
+{
+ if (blks == 0)
+ return;
+ badblocks_clear(&pmem->bb, sector, blks);
+ if (pmem->bb_state)
+ sysfs_notify_dirent(pmem->bb_state);
+}
+
+static long __pmem_clear_poison(struct pmem_device *pmem,
phys_addr_t offset, unsigned int len)
{
- struct device *dev = to_dev(pmem);
- sector_t sector;
- long cleared;
- blk_status_t rc = BLK_STS_OK;
-
- sector = (offset - pmem->data_offset) / 512;
-
- cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
- if (cleared < len)
- rc = BLK_STS_IOERR;
- if (cleared > 0 && cleared / 512) {
- hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
- cleared /= 512;
- dev_dbg(dev, "%#llx clear %ld sector%s\n",
- (unsigned long long) sector, cleared,
- cleared > 1 ? "s" : "");
- badblocks_clear(&pmem->bb, sector, cleared);
- if (pmem->bb_state)
- sysfs_notify_dirent(pmem->bb_state);
+ phys_addr_t phys = to_phys(pmem, offset);
+ long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
+
+ if (cleared > 0) {
+ pmem_clear_hwpoison(pmem, offset, cleared);
+ arch_invalidate_pmem(pmem->virt_addr + offset, len);
}
+ return cleared;
+}

- arch_invalidate_pmem(pmem->virt_addr + offset, len);
+static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
+ phys_addr_t offset, unsigned int len)
+{
+ long cleared = __pmem_clear_poison(pmem, offset, len);

- return rc;
+ if (cleared < 0)
+ return BLK_STS_IOERR;
+
+ pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
+ return (cleared < len) ? BLK_STS_IOERR : BLK_STS_OK;
}

static void write_pmem(void *pmem_addr, struct page *page,
@@ -143,7 +169,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
sector_t sector, unsigned int len)
{
blk_status_t rc;
- phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+ phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;

if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
@@ -158,7 +184,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
{
- phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+ phys_addr_t pmem_off = to_offset(pmem, sector);
void *pmem_addr = pmem->virt_addr + pmem_off;

if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
--
2.18.4

2022-04-06 13:08:10

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 1/6] x86/mm: fix comment

There is no _set_memory_prot internal helper, while coming across
the code, might as well fix the comment.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jane Chu <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index abf5ed76e4b7..38af155aaba9 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
}

/*
- * _set_memory_prot is an internal helper for callers that have been passed
+ * __set_memory_prot is an internal helper for callers that have been passed
* a pgprot_t value from upper layers and a reservation has already been taken.
* If you want to set the pgprot to a specific page protocol, use the
* set_memory_xx() functions.
--
2.18.4

2022-04-06 13:56:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] pmem: implement pmem_recovery_write()

On Tue, Apr 05, 2022 at 01:47:47PM -0600, Jane Chu wrote:
> + off = (unsigned long)addr & ~PAGE_MASK;

offset_inpage()

> + if (off || !(PAGE_ALIGNED(bytes))) {

No need for the inner braces.

> + mutex_lock(&pmem->recovery_lock);
> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> + cleared = __pmem_clear_poison(pmem, pmem_off, len);
> + if (cleared > 0 && cleared < len) {
> + dev_warn(dev, "poison cleared only %ld out of %lu\n",
> + cleared, len);
> + mutex_unlock(&pmem->recovery_lock);
> + return 0;
> + } else if (cleared < 0) {

No need for an else after a return.

2022-04-06 13:57:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

On Tue, Apr 05, 2022 at 01:47:46PM -0600, Jane Chu wrote:
> + pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
> + return (cleared < len) ? BLK_STS_IOERR : BLK_STS_OK;

No need for the braces. That being said perosnally I find a simple:

if (cleared < len)
return BLK_STS_IOERR;
return BLK_STS_OK;

much easier to read anyway.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-04-06 13:57:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote:
> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
> not set by default in dax_direct_access() such that the helper
> does not translate a pmem range to kernel virtual address if the
> range contains uncorrectable errors. When the flag is set,
> the helper ignores the UEs and return kernel virtual adderss so
> that the caller may get on with data recovery via write.
>
> Also introduce a new dev_pagemap_ops .recovery_write function.
> The function is applicable to FSDAX device only. The device
> page backend driver provides .recovery_write function if the
> device has underlying mechanism to clear the uncorrectable
> errors on the fly.

I know Dan suggested it, but I still think dev_pagemap_ops is the very
wrong choice here. It is about VM callbacks to ZONE_DEVICE owners
independent of what pagemap type they are. .recovery_write on the
other hand is completely specific to the DAX write path and has no
MM interactions at all.

> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
> {
> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
> + struct badblocks *bb = &pmem->bb;
> + sector_t first_bad;
> + int num_bad;
> + bool bad_in_range;
> + long actual_nr;
> +
> + if (!bb->count)
> + bad_in_range = false;
> + else
> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);
>
> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> - PFN_PHYS(nr_pages))))
> + if (bad_in_range && !(flags & DAX_RECOVERY))
> return -EIO;

The use of bad_in_range here seems a litle convoluted. See the attached
patch on how I would structure the function to avoid the variable and
have the reocvery code in a self-contained chunk.

> - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
> - &kaddr, NULL);
> + nrpg = PHYS_PFN(size);
> + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL);

Overly long line here.


Attachments:
(No filename) (2.28 kB)
diff (2.36 kB)
Download all attachments

2022-04-06 14:18:36

by Jane Chu

[permalink] [raw]
Subject: [PATCH v7 2/6] x86/mce: relocate set{clear}_mce_nospec() functions

Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
file where they belong.

Signed-off-by: Jane Chu <[email protected]>
---
arch/x86/include/asm/set_memory.h | 52 -------------------------------
arch/x86/mm/pat/set_memory.c | 47 ++++++++++++++++++++++++++++
include/linux/set_memory.h | 9 +++---
3 files changed, 52 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 78ca53512486..b45c4d27fd46 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page);

extern int kernel_set_to_readonly;

-#ifdef CONFIG_X86_64
-/*
- * Prevent speculative access to the page by either unmapping
- * it (if we do not require access to any part of the page) or
- * marking it uncacheable (if we want to try to retrieve data
- * from non-poisoned lines in the page).
- */
-static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-{
- unsigned long decoy_addr;
- int rc;
-
- /* SGX pages are not in the 1:1 map */
- if (arch_is_platform_page(pfn << PAGE_SHIFT))
- return 0;
- /*
- * We would like to just call:
- * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
- * but doing that would radically increase the odds of a
- * speculative access to the poison page because we'd have
- * the virtual address of the kernel 1:1 mapping sitting
- * around in registers.
- * Instead we get tricky. We create a non-canonical address
- * that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_XX() properly sanitizing any __pa()
- * results with __PHYSICAL_MASK or PTE_PFN_MASK.
- */
- decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
- if (unmap)
- rc = set_memory_np(decoy_addr, 1);
- else
- rc = set_memory_uc(decoy_addr, 1);
- if (rc)
- pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
- return rc;
-}
-#define set_mce_nospec set_mce_nospec
-
-/* Restore full speculative operation to the pfn. */
-static inline int clear_mce_nospec(unsigned long pfn)
-{
- return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
-}
-#define clear_mce_nospec clear_mce_nospec
-#else
-/*
- * Few people would run a 32-bit kernel on a machine that supports
- * recoverable errors because they have too much memory to boot 32-bit.
- */
-#endif
-
#endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 38af155aaba9..93dde949f224 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages)
}
EXPORT_SYMBOL(set_memory_wb);

+#ifdef CONFIG_X86_64
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+int set_mce_nospec(unsigned long pfn, bool unmap)
+{
+ unsigned long decoy_addr;
+ int rc;
+
+ /* SGX pages are not in the 1:1 map */
+ if (arch_is_platform_page(pfn << PAGE_SHIFT))
+ return 0;
+ /*
+ * We would like to just call:
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
+ * but doing that would radically increase the odds of a
+ * speculative access to the poison page because we'd have
+ * the virtual address of the kernel 1:1 mapping sitting
+ * around in registers.
+ * Instead we get tricky. We create a non-canonical address
+ * that looks just like the one we want, but has bit 63 flipped.
+ * This relies on set_memory_XX() properly sanitizing any __pa()
+ * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+ */
+ decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
+ if (rc)
+ pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+ return rc;
+}
+
+/* Restore full speculative operation to the pfn. */
+int clear_mce_nospec(unsigned long pfn)
+{
+ return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+EXPORT_SYMBOL_GPL(clear_mce_nospec);
+
+#endif
+
int set_memory_x(unsigned long addr, int numpages)
{
if (!(__supported_pte_mask & _PAGE_NX))
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index f36be5166c19..d6263d7afb55 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -42,20 +42,21 @@ static inline bool can_set_direct_map(void)
#endif
#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */

-#ifndef set_mce_nospec
+#ifdef CONFIG_X86_64
+int set_mce_nospec(unsigned long pfn, bool unmap);
+int clear_mce_nospec(unsigned long pfn);
+#else
static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
-#endif
-
-#ifndef clear_mce_nospec
static inline int clear_mce_nospec(unsigned long pfn)
{
return 0;
}
#endif

+
#ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
static inline int set_memory_encrypted(unsigned long addr, int numpages)
{
--
2.18.4

2022-04-06 20:12:42

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On 4/5/2022 10:19 PM, Christoph Hellwig wrote:
> On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote:
>> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
>> not set by default in dax_direct_access() such that the helper
>> does not translate a pmem range to kernel virtual address if the
>> range contains uncorrectable errors. When the flag is set,
>> the helper ignores the UEs and return kernel virtual adderss so
>> that the caller may get on with data recovery via write.
>>
>> Also introduce a new dev_pagemap_ops .recovery_write function.
>> The function is applicable to FSDAX device only. The device
>> page backend driver provides .recovery_write function if the
>> device has underlying mechanism to clear the uncorrectable
>> errors on the fly.
>
> I know Dan suggested it, but I still think dev_pagemap_ops is the very
> wrong choice here. It is about VM callbacks to ZONE_DEVICE owners
> independent of what pagemap type they are. .recovery_write on the
> other hand is completely specific to the DAX write path and has no
> MM interactions at all.

Yes, I believe Dan was motivated by avoiding the dm dance as a result of
adding .recovery_write to dax_operations.

I understand your point about .recovery_write is device specific and
thus not something appropriate for device agnostic ops.

I can see 2 options so far -

1) add .recovery_write to dax_operations and do the dm dance to hunt
down to the base device that actually provides the recovery action

2) an ugly but expedient approach based on the observation that
dax_direct_access() has already gone through the dm dance and thus could
scoop up the .recovery_write function pointer if DAX_RECOVERY flag is
set. Like bundle action-flag with action, and if should there need more
device specific actions, just add another action with associated flag.

I'm thinking about something like this

long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
long nr_pages, struct daxdev_specific *action,
int flags, void **kaddr, pfn_t *pfn)

where
struct daxdev_specific {
int flags; /* DAX_RECOVERY, etc */
size_t (*recovery_write) (pfn_t pfn, pgoff_t pgoff, void *addr,
size_t bytes, void *iter);
}

__pmem_direct_access() provides the .recovery_write function pointer;
dax_iomap_iter() ends up directly invoke the function in pmem.c
which finds pgmap from pfn_t, and (struct pmem *) from
pgmap->owner;

In this way, we get rid of dax_recovery_write() interface as well as the
dm dance.

What do you think?

Dan, could you also chime in ?

>
>> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>> + struct badblocks *bb = &pmem->bb;
>> + sector_t first_bad;
>> + int num_bad;
>> + bool bad_in_range;
>> + long actual_nr;
>> +
>> + if (!bb->count)
>> + bad_in_range = false;
>> + else
>> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);
>>
>> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> - PFN_PHYS(nr_pages))))
>> + if (bad_in_range && !(flags & DAX_RECOVERY))
>> return -EIO;
>
> The use of bad_in_range here seems a litle convoluted. See the attached
> patch on how I would structure the function to avoid the variable and
> have the reocvery code in a self-contained chunk.

Much better, will use your version, thanks!

>
>> - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>> - &kaddr, NULL);
>> + nrpg = PHYS_PFN(size);
>> + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL);
>
> Overly long line here.

Okay, will run the checkpatch.pl test again.

thanks!
-jane

2022-04-06 20:52:52

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

Resend, not sure it didn't go through.

On 4/6/2022 10:32 AM, Jane Chu wrote:
> On 4/5/2022 10:19 PM, Christoph Hellwig wrote:
>> On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote:
>>> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
>>> not set by default in dax_direct_access() such that the helper
>>> does not translate a pmem range to kernel virtual address if the
>>> range contains uncorrectable errors.  When the flag is set,
>>> the helper ignores the UEs and return kernel virtual adderss so
>>> that the caller may get on with data recovery via write.
>>>
>>> Also introduce a new dev_pagemap_ops .recovery_write function.
>>> The function is applicable to FSDAX device only. The device
>>> page backend driver provides .recovery_write function if the
>>> device has underlying mechanism to clear the uncorrectable
>>> errors on the fly.
>>
>> I know Dan suggested it, but I still think dev_pagemap_ops is the very
>> wrong choice here.  It is about VM callbacks to ZONE_DEVICE owners
>> independent of what pagemap type they are.  .recovery_write on the
>> other hand is completely specific to the DAX write path and has no
>> MM interactions at all.
>
> Yes, I believe Dan was motivated by avoiding the dm dance as a result of
> adding .recovery_write to dax_operations.
>
> I understand your point about .recovery_write is device specific and
> thus not something appropriate for device agnostic ops.
>
> I can see 2 options so far -
>
> 1)  add .recovery_write to dax_operations and do the dm dance to hunt
> down to the base device that actually provides the recovery action
>
> 2)  an ugly but expedient approach based on the observation that
> dax_direct_access() has already gone through the dm dance and thus could
> scoop up the .recovery_write function pointer if DAX_RECOVERY flag is
> set.  Like bundle action-flag with action, and if should there need more
> device specific actions, just add another action with associated flag.
>
> I'm thinking about something like this
>
>    long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>                           long nr_pages, struct daxdev_specific *action,
>                           int flags, void **kaddr, pfn_t *pfn)
>
>    where
>    struct daxdev_specific {
>     int flags;    /* DAX_RECOVERY, etc */
>     size_t (*recovery_write) (pfn_t pfn, pgoff_t pgoff, void *addr,
>                  size_t bytes, void *iter);
>    }
>
>    __pmem_direct_access() provides the .recovery_write function pointer;
>    dax_iomap_iter() ends up directly invoke the function in pmem.c
>      which finds pgmap from pfn_t, and (struct pmem *) from
>      pgmap->owner;
>
> In this way, we get rid of dax_recovery_write() interface as well as the
> dm dance.
>
> What do you think?
>
> Dan, could you also chime in ?
>
>>
>>>   /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>>>   __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t
>>> pgoff,
>>> -        long nr_pages, void **kaddr, pfn_t *pfn)
>>> +        long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>>>   {
>>>       resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>>> +    sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>>> +    unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>>> +    struct badblocks *bb = &pmem->bb;
>>> +    sector_t first_bad;
>>> +    int num_bad;
>>> +    bool bad_in_range;
>>> +    long actual_nr;
>>> +
>>> +    if (!bb->count)
>>> +        bad_in_range = false;
>>> +    else
>>> +        bad_in_range = !!badblocks_check(bb, sector, num,
>>> &first_bad, &num_bad);
>>> -    if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>>> -                    PFN_PHYS(nr_pages))))
>>> +    if (bad_in_range && !(flags & DAX_RECOVERY))
>>>           return -EIO;
>>
>> The use of bad_in_range here seems a litle convoluted.  See the attached
>> patch on how I would structure the function to avoid the variable and
>> have the reocvery code in a self-contained chunk.
>
> Much better, will use your version, thanks!
>
>>
>>> -        map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>>> -                &kaddr, NULL);
>>> +        nrpg = PHYS_PFN(size);
>>> +        map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr,
>>> NULL);
>>
>> Overly long line here.
>
> Okay, will run the checkpatch.pl test again.
>
> thanks!
> -jane

2022-04-07 01:25:48

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

On 4/5/2022 10:04 PM, Christoph Hellwig wrote:
> On Tue, Apr 05, 2022 at 01:47:46PM -0600, Jane Chu wrote:
>> + pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
>> + return (cleared < len) ? BLK_STS_IOERR : BLK_STS_OK;
>
> No need for the braces. That being said perosnally I find a simple:
>
> if (cleared < len)
> return BLK_STS_IOERR;
> return BLK_STS_OK;
>
> much easier to read anyway.

will do.

>
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>

Thanks!
-jane

2022-04-07 01:31:39

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] pmem: implement pmem_recovery_write()

On 4/5/2022 10:21 PM, Christoph Hellwig wrote:
> On Tue, Apr 05, 2022 at 01:47:47PM -0600, Jane Chu wrote:
>> + off = (unsigned long)addr & ~PAGE_MASK;
>
> offset_inpage()
>
>> + if (off || !(PAGE_ALIGNED(bytes))) {
>
> No need for the inner braces.
>
>> + mutex_lock(&pmem->recovery_lock);
>> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>> + cleared = __pmem_clear_poison(pmem, pmem_off, len);
>> + if (cleared > 0 && cleared < len) {
>> + dev_warn(dev, "poison cleared only %ld out of %lu\n",
>> + cleared, len);
>> + mutex_unlock(&pmem->recovery_lock);
>> + return 0;
>> + } else if (cleared < 0) {
>
> No need for an else after a return.


Agreed, will reflect your comments in next rev.

thanks!
-jane

2022-04-07 09:32:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote:
> Yes, I believe Dan was motivated by avoiding the dm dance as a result of
> adding .recovery_write to dax_operations.
>
> I understand your point about .recovery_write is device specific and
> thus not something appropriate for device agnostic ops.
>
> I can see 2 options so far -
>
> 1) add .recovery_write to dax_operations and do the dm dance to hunt
> down to the base device that actually provides the recovery action

That would be my preference. But I'll wait for Dan to chime in.

> Okay, will run the checkpatch.pl test again.

Unfortuntely checkpatch.pl is broken in that regard. It treats the
exception to occasionally go longer or readability as the default.

2022-04-12 06:44:08

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>
> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
> not set by default in dax_direct_access() such that the helper
> does not translate a pmem range to kernel virtual address if the
> range contains uncorrectable errors. When the flag is set,
> the helper ignores the UEs and return kernel virtual adderss so
> that the caller may get on with data recovery via write.

It strikes me that there is likely never going to be any other flags
to dax_direct_access() and what this option really is an access type.
I also find code changes like this error prone to read:

- rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
+ rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, 0, &kaddr, NULL);

...i.e. without looking at the prototype, which option is the nr_pages
and which is the flags?

So how about change 'int flags' to 'enum dax_access_mode mode' where
dax_access_mode is:

/**
* enum dax_access_mode - operational mode for dax_direct_access()
* @DAX_ACCESS: nominal access, fail / trim access on encountering poison
* @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable
for use with dax_recovery_write()
*/
enum dax_access_mode {
DAX_ACCESS,
DAX_RECOVERY_WRITE,
};

Then the conversions look like this:

- rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
+ rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1,
DAX_ACCESS, &kaddr, NULL);

...and there's less chance of confusion with the @nr_pages argument.

2022-04-12 06:46:39

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 1/6] x86/mm: fix comment

On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>
> There is no _set_memory_prot internal helper, while coming across
> the code, might as well fix the comment.

Looks good,

Reviewed-by: Dan Williams <[email protected]>

>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/mm/pat/set_memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index abf5ed76e4b7..38af155aaba9 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
> }
>
> /*
> - * _set_memory_prot is an internal helper for callers that have been passed
> + * __set_memory_prot is an internal helper for callers that have been passed
> * a pgprot_t value from upper layers and a reservation has already been taken.
> * If you want to set the pgprot to a specific page protocol, use the
> * set_memory_xx() functions.
> --
> 2.18.4
>

2022-04-12 09:24:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Mon, Apr 11, 2022 at 09:57:36PM -0700, Dan Williams wrote:
> So how about change 'int flags' to 'enum dax_access_mode mode' where
> dax_access_mode is:
>
> /**
> * enum dax_access_mode - operational mode for dax_direct_access()
> * @DAX_ACCESS: nominal access, fail / trim access on encountering poison
> * @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable
> for use with dax_recovery_write()
> */
> enum dax_access_mode {
> DAX_ACCESS,
> DAX_RECOVERY_WRITE,
> };
>
> Then the conversions look like this:
>
> - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
> + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1,
> DAX_ACCESS, &kaddr, NULL);
>
> ...and there's less chance of confusion with the @nr_pages argument.

Yes, this might be a little nicer.

2022-04-12 09:36:27

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 2/6] x86/mce: relocate set{clear}_mce_nospec() functions

I notice that none of the folks from "X86 MM" are on the cc, added.

On Tue, Apr 5, 2022 at 12:49 PM Jane Chu <[email protected]> wrote:
>
> Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
> file where they belong.
>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/include/asm/set_memory.h | 52 -------------------------------
> arch/x86/mm/pat/set_memory.c | 47 ++++++++++++++++++++++++++++
> include/linux/set_memory.h | 9 +++---
> 3 files changed, 52 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 78ca53512486..b45c4d27fd46 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page);
>
> extern int kernel_set_to_readonly;
>
> -#ifdef CONFIG_X86_64
> -/*
> - * Prevent speculative access to the page by either unmapping
> - * it (if we do not require access to any part of the page) or
> - * marking it uncacheable (if we want to try to retrieve data
> - * from non-poisoned lines in the page).
> - */
> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
> -{
> - unsigned long decoy_addr;
> - int rc;
> -
> - /* SGX pages are not in the 1:1 map */
> - if (arch_is_platform_page(pfn << PAGE_SHIFT))
> - return 0;
> - /*
> - * We would like to just call:
> - * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
> - * but doing that would radically increase the odds of a
> - * speculative access to the poison page because we'd have
> - * the virtual address of the kernel 1:1 mapping sitting
> - * around in registers.
> - * Instead we get tricky. We create a non-canonical address
> - * that looks just like the one we want, but has bit 63 flipped.
> - * This relies on set_memory_XX() properly sanitizing any __pa()
> - * results with __PHYSICAL_MASK or PTE_PFN_MASK.
> - */
> - decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> -
> - if (unmap)
> - rc = set_memory_np(decoy_addr, 1);
> - else
> - rc = set_memory_uc(decoy_addr, 1);
> - if (rc)
> - pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
> - return rc;
> -}
> -#define set_mce_nospec set_mce_nospec
> -
> -/* Restore full speculative operation to the pfn. */
> -static inline int clear_mce_nospec(unsigned long pfn)
> -{
> - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
> -}
> -#define clear_mce_nospec clear_mce_nospec
> -#else
> -/*
> - * Few people would run a 32-bit kernel on a machine that supports
> - * recoverable errors because they have too much memory to boot 32-bit.
> - */
> -#endif
> -
> #endif /* _ASM_X86_SET_MEMORY_H */
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 38af155aaba9..93dde949f224 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages)
> }
> EXPORT_SYMBOL(set_memory_wb);
>
> +#ifdef CONFIG_X86_64

It seems like the only X86_64 dependency in this routine is the
address bit 63 usage, so how about:

if (!IS_ENABLED(CONFIG_64BIT))
return 0;

...and drop the ifdef?

Other than that you can add:

Reviewed-by: Dan Williams <[email protected]>

2022-04-12 10:26:04

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>
> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
> not set by default in dax_direct_access() such that the helper
> does not translate a pmem range to kernel virtual address if the
> range contains uncorrectable errors. When the flag is set,
> the helper ignores the UEs and return kernel virtual adderss so
> that the caller may get on with data recovery via write.
>
> Also introduce a new dev_pagemap_ops .recovery_write function.
> The function is applicable to FSDAX device only. The device
> page backend driver provides .recovery_write function if the
> device has underlying mechanism to clear the uncorrectable
> errors on the fly.
>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> drivers/dax/super.c | 17 ++++++++--
> drivers/md/dm-linear.c | 4 +--
> drivers/md/dm-log-writes.c | 5 +--
> drivers/md/dm-stripe.c | 4 +--
> drivers/md/dm-target.c | 2 +-
> drivers/md/dm-writecache.c | 5 +--
> drivers/md/dm.c | 5 +--
> drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------
> drivers/nvdimm/pmem.h | 2 +-
> drivers/s390/block/dcssblk.c | 4 +--
> fs/dax.c | 24 ++++++++++----
> fs/fuse/dax.c | 4 +--
> include/linux/dax.h | 11 +++++--
> include/linux/device-mapper.h | 2 +-
> include/linux/memremap.h | 7 ++++
> tools/testing/nvdimm/pmem-dax.c | 2 +-
> 16 files changed, 116 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0211e6f7b47a..8252858cd25a 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -13,6 +13,7 @@
> #include <linux/uio.h>
> #include <linux/dax.h>
> #include <linux/fs.h>
> +#include <linux/memremap.h>
> #include "dax-private.h"
>
> /**
> @@ -117,6 +118,7 @@ enum dax_device_flags {
> * @dax_dev: a dax_device instance representing the logical memory range
> * @pgoff: offset in pages from the start of the device to translate
> * @nr_pages: number of consecutive pages caller can handle relative to @pfn
> + * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery
> * @kaddr: output parameter that returns a virtual address mapping of pfn
> * @pfn: output parameter that returns an absolute pfn translation of @pgoff
> *
> @@ -124,7 +126,7 @@ enum dax_device_flags {
> * pages accessible at the device relative @pgoff.
> */
> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> - void **kaddr, pfn_t *pfn)
> + int flags, void **kaddr, pfn_t *pfn)
> {
> long avail;
>
> @@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> if (nr_pages < 0)
> return -EINVAL;
>
> - avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
> + avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags,
> kaddr, pfn);
> if (!avail)
> return -ERANGE;
> @@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> }
> EXPORT_SYMBOL_GPL(dax_zero_page_range);
>
> +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> + pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter)
> +{
> + struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
> +
> + if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write)
> + return 0;
> + return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter);
> +}
> +EXPORT_SYMBOL_GPL(dax_recovery_write);
> +
> #ifdef CONFIG_ARCH_HAS_PMEM_API
> void arch_wb_cache_pmem(void *addr, size_t size);
> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 76b486e4d2be..9e6d8bdf3b2a 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
> }
>
> static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
> {
> struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
>
> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
> }
>
> static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
> index c9d036d6bb2e..e23f062ade5f 100644
> --- a/drivers/md/dm-log-writes.c
> +++ b/drivers/md/dm-log-writes.c
> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
> }
>
> static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags,
> + void **kaddr, pfn_t *pfn)
> {
> struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
>
> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
> }
>
> static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> index c81d331d1afe..b89339c78702 100644
> --- a/drivers/md/dm-stripe.c
> +++ b/drivers/md/dm-stripe.c
> @@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
> }
>
> static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
> {
> struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
>
> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
> }
>
> static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
> index 64dd0b34fcf4..24b1e5628f3a 100644
> --- a/drivers/md/dm-target.c
> +++ b/drivers/md/dm-target.c
> @@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone,
> }
>
> static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
> {
> return -EIO;
> }
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 5630b470ba42..180ca8fa383e 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>
> id = dax_read_lock();
>
> - da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
> + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0,
> + &wc->memory_map, &pfn);
> if (da < 0) {
> wc->memory_map = NULL;
> r = da;
> @@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
> do {
> long daa;
> daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
> - NULL, &pfn);
> + 0, NULL, &pfn);
> if (daa <= 0) {
> r = daa ? daa : -EINVAL;
> goto err3;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index ad2e0bbeb559..a8c697bb6603 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
> }
>
> static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr,
> + pfn_t *pfn)
> {
> struct mapped_device *md = dax_get_private(dax_dev);
> sector_t sector = pgoff * PAGE_SECTORS;
> @@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> if (len < 1)
> goto out;
> nr_pages = min(len, nr_pages);
> - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
> + ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn);
>
> out:
> dm_put_live_table(md, srcu_idx);
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 30c71a68175b..0400c5a7ba39 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>
> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> - long nr_pages, void **kaddr, pfn_t *pfn)
> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
> {
> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
> + struct badblocks *bb = &pmem->bb;
> + sector_t first_bad;
> + int num_bad;
> + bool bad_in_range;
> + long actual_nr;
> +
> + if (!bb->count)
> + bad_in_range = false;
> + else
> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);

Why all this change...

>
> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> - PFN_PHYS(nr_pages))))

...instead of adding "&& !(flags & DAX_RECOVERY)" to this statement?

> + if (bad_in_range && !(flags & DAX_RECOVERY))
> return -EIO;
>
> if (kaddr)
> @@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
> if (pfn)
> *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>
> + if (!bad_in_range) {
> + /*
> + * If badblock is present but not in the range, limit known good range
> + * to the requested range.
> + */
> + if (bb->count)
> + return nr_pages;
> + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
> + }
> +
> /*
> - * If badblocks are present, limit known good range to the
> - * requested range.
> + * In case poison is found in the given range and DAX_RECOVERY flag is set,
> + * recovery stride is set to kernel page size because the underlying driver and
> + * firmware clear poison functions don't appear to handle large chunk (such as
> + * 2MiB) reliably.
> */
> - if (unlikely(pmem->bb.count))
> - return nr_pages;
> - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
> + actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
> + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
> + sector, nr_pages, first_bad, actual_nr);
> + return (actual_nr == 0) ? 1 : actual_nr;

Similar feedback as above that this is more change than I would expect.

I think just adding...

if (flags & DAX_RECOVERY)
return 1;

...before the typical return path is enough.

2022-04-12 13:39:58

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>
> The set_memory_uc() approach doesn't work well in all cases.
> For example, when "The VMM unmapped the bad page from guest
> physical space and passed the machine check to the guest."
> "The guest gets virtual #MC on an access to that page.
> When the guest tries to do set_memory_uc() and instructs
> cpa_flush() to do clean caches that results in taking another
> fault / exception perhaps because the VMM unmapped the page
> from the guest."
>
> Since the driver has special knowledge to handle NP or UC,

I think a patch is needed before this one to make this statement true? I.e.:

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..11641f55025a 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block
*nb, unsigned long val,
*/
mutex_lock(&acpi_desc_lock);
list_for_each_entry(acpi_desc, &acpi_descs, list) {
+ unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
struct device *dev = acpi_desc->dev;
int found_match = 0;

@@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block
*nb, unsigned long val,

/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
- ALIGN(mce->addr, L1_CACHE_BYTES),
- L1_CACHE_BYTES);
+ ALIGN(mce->addr, align), align);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);


> let's mark the poisoned page with NP and let driver handle it
> when it comes down to repair.
>
> Please refer to discussions here for more details.
> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>
> Now since poisoned page is marked as not-present, in order to
> avoid writing to a 'np' page and trigger kernel Oops, also fix
> pmem_do_write().
>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 6 +++---
> arch/x86/mm/pat/set_memory.c | 18 ++++++------------
> drivers/nvdimm/pmem.c | 31 +++++++------------------------
> include/linux/set_memory.h | 4 ++--
> 4 files changed, 18 insertions(+), 41 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 981496e6bc0e..fa67bb9d1afe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>
> pfn = mce->addr >> PAGE_SHIFT;
> if (!memory_failure(pfn, 0)) {
> - set_mce_nospec(pfn, whole_page(mce));
> + set_mce_nospec(pfn);
> mce->kflags |= MCE_HANDLED_UC;
> }
>
> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>
> ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
> if (!ret) {
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> sync_core();
> return;
> }
> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
> p->mce_count = 0;
> pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
> if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> }
>
> static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 93dde949f224..404ffcb3f2cb 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1926,13 +1926,8 @@ int set_memory_wb(unsigned long addr, int numpages)
> EXPORT_SYMBOL(set_memory_wb);
>
> #ifdef CONFIG_X86_64
> -/*
> - * Prevent speculative access to the page by either unmapping
> - * it (if we do not require access to any part of the page) or
> - * marking it uncacheable (if we want to try to retrieve data
> - * from non-poisoned lines in the page).
> - */
> -int set_mce_nospec(unsigned long pfn, bool unmap)
> +/* Prevent speculative access to a page by marking it not-present */
> +int set_mce_nospec(unsigned long pfn)
> {
> unsigned long decoy_addr;
> int rc;
> @@ -1954,10 +1949,7 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
> */
> decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
>
> - if (unmap)
> - rc = set_memory_np(decoy_addr, 1);
> - else
> - rc = set_memory_uc(decoy_addr, 1);
> + rc = set_memory_np(decoy_addr, 1);
> if (rc)
> pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
> return rc;
> @@ -1966,7 +1958,9 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
> /* Restore full speculative operation to the pfn. */
> int clear_mce_nospec(unsigned long pfn)
> {
> - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
> + unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
> +
> + return change_page_attr_set(&addr, 1, __pgprot(_PAGE_PRESENT), 0);

This probably warrants a set_memory_present() helper.

> }
> EXPORT_SYMBOL_GPL(clear_mce_nospec);
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 58d95242a836..30c71a68175b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -158,36 +158,19 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
> struct page *page, unsigned int page_off,
> sector_t sector, unsigned int len)
> {
> - blk_status_t rc = BLK_STS_OK;
> - bool bad_pmem = false;
> phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> void *pmem_addr = pmem->virt_addr + pmem_off;
>
> - if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> - bad_pmem = true;
> + if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
> + blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
>
> - /*
> - * Note that we write the data both before and after
> - * clearing poison. The write before clear poison
> - * handles situations where the latest written data is
> - * preserved and the clear poison operation simply marks
> - * the address range as valid without changing the data.
> - * In this case application software can assume that an
> - * interrupted write will either return the new good
> - * data or an error.
> - *
> - * However, if pmem_clear_poison() leaves the data in an
> - * indeterminate state we need to perform the write
> - * after clear poison.
> - */
> + if (rc != BLK_STS_OK)
> + pr_warn_ratelimited("%s: failed to clear poison\n", __func__);

This should be either "dev_warn_ratelimited(to_dev(pmem), ...", or a
trace point similar to trace_block_rq_complete() that tells userspace
about adverse I/O completion results.

However, that's probably a discussion for another patch, so I would
just drop this new addition for now and we can discuss the logging in
a follow-on patch.


> + return rc;
> + }
> flush_dcache_page(page);
> write_pmem(pmem_addr, page, page_off, len);
> - if (unlikely(bad_pmem)) {
> - rc = pmem_clear_poison(pmem, pmem_off, len);
> - write_pmem(pmem_addr, page, page_off, len);
> - }
> -
> - return rc;
> + return BLK_STS_OK;
> }
>
> static void pmem_submit_bio(struct bio *bio)
> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
> index d6263d7afb55..cde2d8687a7b 100644
> --- a/include/linux/set_memory.h
> +++ b/include/linux/set_memory.h
> @@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
> #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
>
> #ifdef CONFIG_X86_64
> -int set_mce_nospec(unsigned long pfn, bool unmap);
> +int set_mce_nospec(unsigned long pfn);
> int clear_mce_nospec(unsigned long pfn);
> #else
> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
> +static inline int set_mce_nospec(unsigned long pfn)

Looks like after this change the "whole_page()" helper can be deleted
and the ->mce_whole_page flag in the task_struct can also go, right?
In a follow-on of course.

Other than those small issues, this looks good!

2022-04-12 22:19:55

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On Wed, Apr 6, 2022 at 10:31 PM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote:
> > Yes, I believe Dan was motivated by avoiding the dm dance as a result of
> > adding .recovery_write to dax_operations.
> >
> > I understand your point about .recovery_write is device specific and
> > thus not something appropriate for device agnostic ops.
> >
> > I can see 2 options so far -
> >
> > 1) add .recovery_write to dax_operations and do the dm dance to hunt
> > down to the base device that actually provides the recovery action
>
> That would be my preference. But I'll wait for Dan to chime in.

Yeah, so the motivation was avoiding plumbing recovery through stacked
lookups when the recovery is specific to a pfn and the provider of
that pfn, but I also see it from Christoph's perspective that the only
agent that cares about recovery is the fsdax I/O path. Certainly
having ->dax_direct_access() take a DAX_RECOVERY flag and the op
itself go through the pgmap is a confusing split that I did not
anticipate when I made the suggestion. Since that flag must be there,
then the ->recovery_write() should also stay relative to a dax device.

Apologies for the thrash Jane.

One ask though, please separate plumbing the new flag argument to
->dax_direct_access() and plumbing the new operation into preparation
patches before filling them in with the new goodness.

2022-04-12 23:12:25

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On Tue, Apr 05, 2022 at 01:47:44PM -0600, Jane Chu wrote:
> The set_memory_uc() approach doesn't work well in all cases.
> For example, when "The VMM unmapped the bad page from guest
> physical space and passed the machine check to the guest."
> "The guest gets virtual #MC on an access to that page.
> When the guest tries to do set_memory_uc() and instructs
> cpa_flush() to do clean caches that results in taking another
> fault / exception perhaps because the VMM unmapped the page
> from the guest."

I presume this is quoting someone...

> Since the driver has special knowledge to handle NP or UC,
> let's mark the poisoned page with NP and let driver handle it

s/let's mark/mark/

> when it comes down to repair.
>
> Please refer to discussions here for more details.
> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>
> Now since poisoned page is marked as not-present, in order to
> avoid writing to a 'np' page and trigger kernel Oops, also fix

You can write it out: "non-present page..."

> pmem_do_write().
>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 6 +++---
> arch/x86/mm/pat/set_memory.c | 18 ++++++------------
> drivers/nvdimm/pmem.c | 31 +++++++------------------------
> include/linux/set_memory.h | 4 ++--
> 4 files changed, 18 insertions(+), 41 deletions(-)

For such mixed subsystem patches we probably should talk about who picks
them up, eventually...

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 981496e6bc0e..fa67bb9d1afe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>
> pfn = mce->addr >> PAGE_SHIFT;
> if (!memory_failure(pfn, 0)) {
> - set_mce_nospec(pfn, whole_page(mce));
> + set_mce_nospec(pfn);
> mce->kflags |= MCE_HANDLED_UC;
> }
>
> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>
> ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
> if (!ret) {
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> sync_core();
> return;
> }
> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
> p->mce_count = 0;
> pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
> if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
> }

Both that ->mce_whole_page and whole_page() look unused after this.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-04-12 23:29:09

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>
> Refactor the pmem_clear_poison() in order to share common code
> later.
>

I would just add a note here about why, i.e. to factor out the common
shared code between the typical write path and the recovery write
path.

> Signed-off-by: Jane Chu <[email protected]>
> ---
> drivers/nvdimm/pmem.c | 78 ++++++++++++++++++++++++++++---------------
> 1 file changed, 52 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 0400c5a7ba39..56596be70400 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -45,10 +45,27 @@ static struct nd_region *to_region(struct pmem_device *pmem)
> return to_nd_region(to_dev(pmem)->parent);
> }
>
> -static void hwpoison_clear(struct pmem_device *pmem,
> - phys_addr_t phys, unsigned int len)
> +static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
> {
> + return (pmem->phys_addr + offset);

Christoph already mentioned dropping the unnecessary parenthesis.

> +}
> +
> +static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
> +{
> + return (offset - pmem->data_offset) >> SECTOR_SHIFT;
> +}
> +
> +static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
> +{
> + return ((sector << SECTOR_SHIFT) + pmem->data_offset);
> +}
> +
> +static void pmem_clear_hwpoison(struct pmem_device *pmem, phys_addr_t offset,
> + unsigned int len)

Perhaps now is a good time to rename this to something else like
pmem_clear_mce_nospec()? Just to make it more distinct from
pmem_clear_poison(). While "hwpoison" is the page flag name
pmem_clear_poison() is the function that's actually clearing the
poison in hardware ("hw") and the new pmem_clear_mce_nospec() is
toggling the page back into service.

> +{
> + phys_addr_t phys = to_phys(pmem, offset);
> unsigned long pfn_start, pfn_end, pfn;
> + unsigned int blks = len >> SECTOR_SHIFT;
>
> /* only pmem in the linear map supports HWPoison */
> if (is_vmalloc_addr(pmem->virt_addr))
> @@ -67,35 +84,44 @@ static void hwpoison_clear(struct pmem_device *pmem,
> if (test_and_clear_pmem_poison(page))
> clear_mce_nospec(pfn);
> }
> +
> + dev_dbg(to_dev(pmem), "%#llx clear %u sector%s\n",
> + (unsigned long long) to_sect(pmem, offset), blks,
> + blks > 1 ? "s" : "");

In anticipation of better tracing support and the fact that this is no
longer called from pmem_clear_poison() let's drop it for now.

> }
>
> -static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
> +static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
> +{
> + if (blks == 0)
> + return;
> + badblocks_clear(&pmem->bb, sector, blks);
> + if (pmem->bb_state)
> + sysfs_notify_dirent(pmem->bb_state);
> +}
> +
> +static long __pmem_clear_poison(struct pmem_device *pmem,
> phys_addr_t offset, unsigned int len)
> {
> - struct device *dev = to_dev(pmem);
> - sector_t sector;
> - long cleared;
> - blk_status_t rc = BLK_STS_OK;
> -
> - sector = (offset - pmem->data_offset) / 512;
> -
> - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> - if (cleared < len)
> - rc = BLK_STS_IOERR;
> - if (cleared > 0 && cleared / 512) {
> - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> - cleared /= 512;
> - dev_dbg(dev, "%#llx clear %ld sector%s\n",
> - (unsigned long long) sector, cleared,
> - cleared > 1 ? "s" : "");
> - badblocks_clear(&pmem->bb, sector, cleared);
> - if (pmem->bb_state)
> - sysfs_notify_dirent(pmem->bb_state);
> + phys_addr_t phys = to_phys(pmem, offset);
> + long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
> +
> + if (cleared > 0) {
> + pmem_clear_hwpoison(pmem, offset, cleared);
> + arch_invalidate_pmem(pmem->virt_addr + offset, len);
> }
> + return cleared;
> +}
>
> - arch_invalidate_pmem(pmem->virt_addr + offset, len);
> +static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
> + phys_addr_t offset, unsigned int len)
> +{
> + long cleared = __pmem_clear_poison(pmem, offset, len);
>
> - return rc;
> + if (cleared < 0)
> + return BLK_STS_IOERR;
> +
> + pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
> + return (cleared < len) ? BLK_STS_IOERR : BLK_STS_OK;

I prefer "if / else" syntax instead of a ternary conditional.

> }
>
> static void write_pmem(void *pmem_addr, struct page *page,
> @@ -143,7 +169,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
> sector_t sector, unsigned int len)
> {
> blk_status_t rc;
> - phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> + phys_addr_t pmem_off = to_offset(pmem, sector);
> void *pmem_addr = pmem->virt_addr + pmem_off;
>
> if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> @@ -158,7 +184,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
> struct page *page, unsigned int page_off,
> sector_t sector, unsigned int len)
> {
> - phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> + phys_addr_t pmem_off = to_offset(pmem, sector);
> void *pmem_addr = pmem->virt_addr + pmem_off;
>
> if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {

With those small fixups you can add:

Reviewed-by: Dan Williams <[email protected]>

2022-04-12 23:39:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/6] x86/mm: fix comment

On Tue, Apr 05, 2022 at 01:47:42PM -0600, Jane Chu wrote:
> There is no _set_memory_prot internal helper, while coming across
> the code, might as well fix the comment.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> arch/x86/mm/pat/set_memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index abf5ed76e4b7..38af155aaba9 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
> }
>
> /*
> - * _set_memory_prot is an internal helper for callers that have been passed
> + * __set_memory_prot is an internal helper for callers that have been passed
> * a pgprot_t value from upper layers and a reservation has already been taken.
> * If you want to set the pgprot to a specific page protocol, use the
> * set_memory_xx() functions.
> --

This is such a trivial change so that having it as a separate patch is
probably not needed - might as well merge it with patch 3...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-04-14 12:17:52

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On 4/11/2022 4:27 PM, Dan Williams wrote:
> On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>>
>> The set_memory_uc() approach doesn't work well in all cases.
>> For example, when "The VMM unmapped the bad page from guest
>> physical space and passed the machine check to the guest."
>> "The guest gets virtual #MC on an access to that page.
>> When the guest tries to do set_memory_uc() and instructs
>> cpa_flush() to do clean caches that results in taking another
>> fault / exception perhaps because the VMM unmapped the page
>> from the guest."
>>
>> Since the driver has special knowledge to handle NP or UC,
>
> I think a patch is needed before this one to make this statement true? I.e.:
>
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index ee8d9973f60b..11641f55025a 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block
> *nb, unsigned long val,
> */
> mutex_lock(&acpi_desc_lock);
> list_for_each_entry(acpi_desc, &acpi_descs, list) {
> + unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
> struct device *dev = acpi_desc->dev;
> int found_match = 0;
>
> @@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block
> *nb, unsigned long val,
>
> /* If this fails due to an -ENOMEM, there is little we can do */
> nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> - ALIGN(mce->addr, L1_CACHE_BYTES),
> - L1_CACHE_BYTES);
> + ALIGN(mce->addr, align), align);
> nvdimm_region_notify(nfit_spa->nd_region,
> NVDIMM_REVALIDATE_POISON);
>

Dan, I tried the above change, and this is what I got after injecting 8
back-to-back poisons, then read them and received SIGBUS/BUS_MCEERR_AR,
then repair via the v7 patch which works until this change is added.

[ 6240.955331] nfit ACPI0012:00: XXX, align = 100
[ 6240.960300] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
L1_CACHE_BYTES)=1851600400, L1_CACHE_BYTES=40, ALIGN(mce->addr,
align)=1851600400
[..]
[ 6242.052277] nfit ACPI0012:00: XXX, align = 100
[ 6242.057243] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
L1_CACHE_BYTES)=1851601000, L1_CACHE_BYTES=40, ALIGN(mce->addr,
align)=1851601000
[..]
[ 6244.917198] nfit ACPI0012:00: XXX, align = 1000
[ 6244.922258] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
L1_CACHE_BYTES)=1851601200, L1_CACHE_BYTES=40, ALIGN(mce->addr,
align)=1851602000
[..]

All 8 poisons remain uncleared.

Without further investigation, I don't know why the failure.
Could we mark this change to a follow-on task?
The driver knows a lot about how to clear poisons besides hardcoding
poison alignment to 0x40 bytes.

>
>> let's mark the poisoned page with NP and let driver handle it
>> when it comes down to repair.
>>
>> Please refer to discussions here for more details.
>> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>>
>> Now since poisoned page is marked as not-present, in order to
>> avoid writing to a 'np' page and trigger kernel Oops, also fix
>> pmem_do_write().
>>
>> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 6 +++---
>> arch/x86/mm/pat/set_memory.c | 18 ++++++------------
>> drivers/nvdimm/pmem.c | 31 +++++++------------------------
>> include/linux/set_memory.h | 4 ++--
>> 4 files changed, 18 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 981496e6bc0e..fa67bb9d1afe 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>>
>> pfn = mce->addr >> PAGE_SHIFT;
>> if (!memory_failure(pfn, 0)) {
>> - set_mce_nospec(pfn, whole_page(mce));
>> + set_mce_nospec(pfn);
>> mce->kflags |= MCE_HANDLED_UC;
>> }
>>
>> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>>
>> ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
>> if (!ret) {
>> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>> sync_core();
>> return;
>> }
>> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
>> p->mce_count = 0;
>> pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
>> if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
>> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>> }
>>
>> static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 93dde949f224..404ffcb3f2cb 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -1926,13 +1926,8 @@ int set_memory_wb(unsigned long addr, int numpages)
>> EXPORT_SYMBOL(set_memory_wb);
>>
>> #ifdef CONFIG_X86_64
>> -/*
>> - * Prevent speculative access to the page by either unmapping
>> - * it (if we do not require access to any part of the page) or
>> - * marking it uncacheable (if we want to try to retrieve data
>> - * from non-poisoned lines in the page).
>> - */
>> -int set_mce_nospec(unsigned long pfn, bool unmap)
>> +/* Prevent speculative access to a page by marking it not-present */
>> +int set_mce_nospec(unsigned long pfn)
>> {
>> unsigned long decoy_addr;
>> int rc;
>> @@ -1954,10 +1949,7 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
>> */
>> decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
>>
>> - if (unmap)
>> - rc = set_memory_np(decoy_addr, 1);
>> - else
>> - rc = set_memory_uc(decoy_addr, 1);
>> + rc = set_memory_np(decoy_addr, 1);
>> if (rc)
>> pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>> return rc;
>> @@ -1966,7 +1958,9 @@ int set_mce_nospec(unsigned long pfn, bool unmap)
>> /* Restore full speculative operation to the pfn. */
>> int clear_mce_nospec(unsigned long pfn)
>> {
>> - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
>> + unsigned long addr = (unsigned long) pfn_to_kaddr(pfn);
>> +
>> + return change_page_attr_set(&addr, 1, __pgprot(_PAGE_PRESENT), 0);
>
> This probably warrants a set_memory_present() helper.

I had a set_memory_present() helper in an earlier version, but as there
is no other caller, also if I'm not mis-remembering, thought Christoph
had suggested to simplify the code and I agreed. If you feel strong
about adding the helper back, I can do that to improve readibility.

>
>> }
>> EXPORT_SYMBOL_GPL(clear_mce_nospec);
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 58d95242a836..30c71a68175b 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -158,36 +158,19 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
>> struct page *page, unsigned int page_off,
>> sector_t sector, unsigned int len)
>> {
>> - blk_status_t rc = BLK_STS_OK;
>> - bool bad_pmem = false;
>> phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>> void *pmem_addr = pmem->virt_addr + pmem_off;
>>
>> - if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
>> - bad_pmem = true;
>> + if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
>> + blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
>>
>> - /*
>> - * Note that we write the data both before and after
>> - * clearing poison. The write before clear poison
>> - * handles situations where the latest written data is
>> - * preserved and the clear poison operation simply marks
>> - * the address range as valid without changing the data.
>> - * In this case application software can assume that an
>> - * interrupted write will either return the new good
>> - * data or an error.
>> - *
>> - * However, if pmem_clear_poison() leaves the data in an
>> - * indeterminate state we need to perform the write
>> - * after clear poison.
>> - */
>> + if (rc != BLK_STS_OK)
>> + pr_warn_ratelimited("%s: failed to clear poison\n", __func__);
>
> This should be either "dev_warn_ratelimited(to_dev(pmem), ...", or a
> trace point similar to trace_block_rq_complete() that tells userspace
> about adverse I/O completion results.
>
> However, that's probably a discussion for another patch, so I would
> just drop this new addition for now and we can discuss the logging in
> a follow-on patch.

Okay, drop the warning message. If the unexpected pathological scenario
happens, user will see I/O failure.

>
>
>> + return rc;
>> + }
>> flush_dcache_page(page);
>> write_pmem(pmem_addr, page, page_off, len);
>> - if (unlikely(bad_pmem)) {
>> - rc = pmem_clear_poison(pmem, pmem_off, len);
>> - write_pmem(pmem_addr, page, page_off, len);
>> - }
>> -
>> - return rc;
>> + return BLK_STS_OK;
>> }
>>
>> static void pmem_submit_bio(struct bio *bio)
>> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
>> index d6263d7afb55..cde2d8687a7b 100644
>> --- a/include/linux/set_memory.h
>> +++ b/include/linux/set_memory.h
>> @@ -43,10 +43,10 @@ static inline bool can_set_direct_map(void)
>> #endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
>>
>> #ifdef CONFIG_X86_64
>> -int set_mce_nospec(unsigned long pfn, bool unmap);
>> +int set_mce_nospec(unsigned long pfn);
>> int clear_mce_nospec(unsigned long pfn);
>> #else
>> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
>> +static inline int set_mce_nospec(unsigned long pfn)
>
> Looks like after this change the "whole_page()" helper can be deleted
> and the ->mce_whole_page flag in the task_struct can also go, right?
> In a follow-on of course.

Okay, will do in a follow-on patch.

>
> Other than those small issues, this looks good!

thanks!
-jane

2022-04-14 13:44:05

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On 4/11/2022 10:02 PM, Christoph Hellwig wrote:
> On Mon, Apr 11, 2022 at 09:57:36PM -0700, Dan Williams wrote:
>> So how about change 'int flags' to 'enum dax_access_mode mode' where
>> dax_access_mode is:
>>
>> /**
>> * enum dax_access_mode - operational mode for dax_direct_access()
>> * @DAX_ACCESS: nominal access, fail / trim access on encountering poison
>> * @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable
>> for use with dax_recovery_write()
>> */
>> enum dax_access_mode {
>> DAX_ACCESS,
>> DAX_RECOVERY_WRITE,
>> };
>>
>> Then the conversions look like this:
>>
>> - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
>> + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1,
>> DAX_ACCESS, &kaddr, NULL);
>>
>> ...and there's less chance of confusion with the @nr_pages argument.
>
> Yes, this might be a little nicer.

Will do.

Thanks!
-jane

2022-04-14 14:53:39

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 1/6] x86/mm: fix comment

On 4/12/2022 2:53 AM, Borislav Petkov wrote:
> On Tue, Apr 05, 2022 at 01:47:42PM -0600, Jane Chu wrote:
>> There is no _set_memory_prot internal helper, while coming across
>> the code, might as well fix the comment.
>>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> arch/x86/mm/pat/set_memory.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index abf5ed76e4b7..38af155aaba9 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -1816,7 +1816,7 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
>> }
>>
>> /*
>> - * _set_memory_prot is an internal helper for callers that have been passed
>> + * __set_memory_prot is an internal helper for callers that have been passed
>> * a pgprot_t value from upper layers and a reservation has already been taken.
>> * If you want to set the pgprot to a specific page protocol, use the
>> * set_memory_xx() functions.
>> --
>
> This is such a trivial change so that having it as a separate patch is
> probably not needed - might as well merge it with patch 3...

This change used to be folded in the mce patch, but for that I received
a review comment pointing out that the change is unrelated to the said
patch and doesn't belong, hence I pulled it out to stand by itself. :)

thanks!
-jane

>
> Thx.
>

2022-04-15 00:47:31

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

On 4/11/2022 9:26 PM, Dan Williams wrote:
> On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>>
>> Refactor the pmem_clear_poison() in order to share common code
>> later.
>>
>
> I would just add a note here about why, i.e. to factor out the common
> shared code between the typical write path and the recovery write
> path.

Okay.

>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> drivers/nvdimm/pmem.c | 78 ++++++++++++++++++++++++++++---------------
>> 1 file changed, 52 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 0400c5a7ba39..56596be70400 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -45,10 +45,27 @@ static struct nd_region *to_region(struct pmem_device *pmem)
>> return to_nd_region(to_dev(pmem)->parent);
>> }
>>
>> -static void hwpoison_clear(struct pmem_device *pmem,
>> - phys_addr_t phys, unsigned int len)
>> +static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
>> {
>> + return (pmem->phys_addr + offset);
>
> Christoph already mentioned dropping the unnecessary parenthesis.
>
>> +}
>> +
>> +static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
>> +{
>> + return (offset - pmem->data_offset) >> SECTOR_SHIFT;
>> +}
>> +
>> +static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
>> +{
>> + return ((sector << SECTOR_SHIFT) + pmem->data_offset);
>> +}
>> +
>> +static void pmem_clear_hwpoison(struct pmem_device *pmem, phys_addr_t offset,
>> + unsigned int len)
>
> Perhaps now is a good time to rename this to something else like
> pmem_clear_mce_nospec()? Just to make it more distinct from
> pmem_clear_poison(). While "hwpoison" is the page flag name
> pmem_clear_poison() is the function that's actually clearing the
> poison in hardware ("hw") and the new pmem_clear_mce_nospec() is
> toggling the page back into service.

I get your point. How about calling the function explicitly
pmem_mkpage_present()? or pmem_page_present_on()? or
pmem_set_page_present()? because setting a poisoned present requires
both clearing the HWpoison bit & clear mce_nospec.

>
>> +{
>> + phys_addr_t phys = to_phys(pmem, offset);
>> unsigned long pfn_start, pfn_end, pfn;
>> + unsigned int blks = len >> SECTOR_SHIFT;
>>
>> /* only pmem in the linear map supports HWPoison */
>> if (is_vmalloc_addr(pmem->virt_addr))
>> @@ -67,35 +84,44 @@ static void hwpoison_clear(struct pmem_device *pmem,
>> if (test_and_clear_pmem_poison(page))
>> clear_mce_nospec(pfn);
>> }
>> +
>> + dev_dbg(to_dev(pmem), "%#llx clear %u sector%s\n",
>> + (unsigned long long) to_sect(pmem, offset), blks,
>> + blks > 1 ? "s" : "");
>
> In anticipation of better tracing support and the fact that this is no
> longer called from pmem_clear_poison() let's drop it for now.

OKay.

>
>> }
>>
>> -static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
>> +static void pmem_clear_bb(struct pmem_device *pmem, sector_t sector, long blks)
>> +{
>> + if (blks == 0)
>> + return;
>> + badblocks_clear(&pmem->bb, sector, blks);
>> + if (pmem->bb_state)
>> + sysfs_notify_dirent(pmem->bb_state);
>> +}
>> +
>> +static long __pmem_clear_poison(struct pmem_device *pmem,
>> phys_addr_t offset, unsigned int len)
>> {
>> - struct device *dev = to_dev(pmem);
>> - sector_t sector;
>> - long cleared;
>> - blk_status_t rc = BLK_STS_OK;
>> -
>> - sector = (offset - pmem->data_offset) / 512;
>> -
>> - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
>> - if (cleared < len)
>> - rc = BLK_STS_IOERR;
>> - if (cleared > 0 && cleared / 512) {
>> - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
>> - cleared /= 512;
>> - dev_dbg(dev, "%#llx clear %ld sector%s\n",
>> - (unsigned long long) sector, cleared,
>> - cleared > 1 ? "s" : "");
>> - badblocks_clear(&pmem->bb, sector, cleared);
>> - if (pmem->bb_state)
>> - sysfs_notify_dirent(pmem->bb_state);
>> + phys_addr_t phys = to_phys(pmem, offset);
>> + long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
>> +
>> + if (cleared > 0) {
>> + pmem_clear_hwpoison(pmem, offset, cleared);
>> + arch_invalidate_pmem(pmem->virt_addr + offset, len);
>> }
>> + return cleared;
>> +}
>>
>> - arch_invalidate_pmem(pmem->virt_addr + offset, len);
>> +static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
>> + phys_addr_t offset, unsigned int len)
>> +{
>> + long cleared = __pmem_clear_poison(pmem, offset, len);
>>
>> - return rc;
>> + if (cleared < 0)
>> + return BLK_STS_IOERR;
>> +
>> + pmem_clear_bb(pmem, to_sect(pmem, offset), cleared >> SECTOR_SHIFT);
>> + return (cleared < len) ? BLK_STS_IOERR : BLK_STS_OK;
>
> I prefer "if / else" syntax instead of a ternary conditional.
>

Got it.

>> }
>>
>> static void write_pmem(void *pmem_addr, struct page *page,
>> @@ -143,7 +169,7 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
>> sector_t sector, unsigned int len)
>> {
>> blk_status_t rc;
>> - phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>> + phys_addr_t pmem_off = to_offset(pmem, sector);
>> void *pmem_addr = pmem->virt_addr + pmem_off;
>>
>> if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
>> @@ -158,7 +184,7 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
>> struct page *page, unsigned int page_off,
>> sector_t sector, unsigned int len)
>> {
>> - phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>> + phys_addr_t pmem_off = to_offset(pmem, sector);
>> void *pmem_addr = pmem->virt_addr + pmem_off;
>>
>> if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
>
> With those small fixups you can add:
>
> Reviewed-by: Dan Williams <[email protected]>

Thanks!
-jane

2022-04-15 02:26:14

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On Wed, Apr 13, 2022 at 4:36 PM Jane Chu <[email protected]> wrote:
>
> On 4/11/2022 4:27 PM, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
> >>
> >> The set_memory_uc() approach doesn't work well in all cases.
> >> For example, when "The VMM unmapped the bad page from guest
> >> physical space and passed the machine check to the guest."
> >> "The guest gets virtual #MC on an access to that page.
> >> When the guest tries to do set_memory_uc() and instructs
> >> cpa_flush() to do clean caches that results in taking another
> >> fault / exception perhaps because the VMM unmapped the page
> >> from the guest."
> >>
> >> Since the driver has special knowledge to handle NP or UC,
> >
> > I think a patch is needed before this one to make this statement true? I.e.:
> >
> > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> > index ee8d9973f60b..11641f55025a 100644
> > --- a/drivers/acpi/nfit/mce.c
> > +++ b/drivers/acpi/nfit/mce.c
> > @@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block
> > *nb, unsigned long val,
> > */
> > mutex_lock(&acpi_desc_lock);
> > list_for_each_entry(acpi_desc, &acpi_descs, list) {
> > + unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
> > struct device *dev = acpi_desc->dev;
> > int found_match = 0;
> >
> > @@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block
> > *nb, unsigned long val,
> >
> > /* If this fails due to an -ENOMEM, there is little we can do */
> > nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> > - ALIGN(mce->addr, L1_CACHE_BYTES),
> > - L1_CACHE_BYTES);
> > + ALIGN(mce->addr, align), align);
> > nvdimm_region_notify(nfit_spa->nd_region,
> > NVDIMM_REVALIDATE_POISON);
> >
>
> Dan, I tried the above change, and this is what I got after injecting 8
> back-to-back poisons, then read them and received SIGBUS/BUS_MCEERR_AR,
> then repair via the v7 patch which works until this change is added.
>
> [ 6240.955331] nfit ACPI0012:00: XXX, align = 100
> [ 6240.960300] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851600400, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851600400
> [..]
> [ 6242.052277] nfit ACPI0012:00: XXX, align = 100
> [ 6242.057243] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851601000, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851601000
> [..]
> [ 6244.917198] nfit ACPI0012:00: XXX, align = 1000
> [ 6244.922258] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
> L1_CACHE_BYTES)=1851601200, L1_CACHE_BYTES=40, ALIGN(mce->addr,
> align)=1851602000
> [..]
>
> All 8 poisons remain uncleared.
>
> Without further investigation, I don't know why the failure.
> Could we mark this change to a follow-on task?

Perhaps a bit more debug before kicking this can down the road...

I'm worried that this means that the driver is not accurately tracking
poison data For example, that last case the hardware is indicating a
full page clobber, but the old code would only track poison from
1851601200 to 1851601400 (i.e. the first 512 bytes from the base error
address).

Oh... wait, I think there is a second bug here, that ALIGN should be
ALIGN_DOWN(). Does this restore the result you expect?

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index ee8d9973f60b..d7a52238a741 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -63,8 +63,7 @@ static int nfit_handle_mce(struct notifier_block
*nb, unsigned long val,

/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
- ALIGN(mce->addr, L1_CACHE_BYTES),
- L1_CACHE_BYTES);
+ ALIGN_DOWN(mce->addr, align), align);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);


> The driver knows a lot about how to clear poisons besides hardcoding
> poison alignment to 0x40 bytes.

It does, but the badblocks tracking should still be reliable, and if
it's not reliable I expect there are cases where recovery_write() will
not be triggered because the driver will not fail the
dax_direct_access() attempt.

2022-04-15 08:02:09

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 1/6] x86/mm: fix comment

On 4/14/2022 1:44 AM, Borislav Petkov wrote:
> On Thu, Apr 14, 2022 at 01:00:05AM +0000, Jane Chu wrote:
>> This change used to be folded in the mce patch, but for that I received
>> a review comment pointing out that the change is unrelated to the said
>> patch and doesn't belong, hence I pulled it out to stand by itself. :)
>
> Aha, someone is being very pedantic.
>
> For trivial unrelated changes like that I usually add them to a patch
> which already touches that file and put in the commit message:
>
> "While at it, fixup a function name in a comment."
>
> Less patches to handle.
>

Sounds good, I'll fold it to one of the mce patches and make a note that
it is a BTW-kind of fix, not something related.

thanks!
-jane

2022-04-15 23:21:10

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] pmem: refactor pmem_clear_poison()

On Wed, Apr 13, 2022 at 5:55 PM Jane Chu <[email protected]> wrote:
>
> On 4/11/2022 9:26 PM, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
> >>
> >> Refactor the pmem_clear_poison() in order to share common code
> >> later.
> >>
> >
> > I would just add a note here about why, i.e. to factor out the common
> > shared code between the typical write path and the recovery write
> > path.
>
> Okay.
>
> >
> >> Signed-off-by: Jane Chu <[email protected]>
> >> ---
> >> drivers/nvdimm/pmem.c | 78 ++++++++++++++++++++++++++++---------------
> >> 1 file changed, 52 insertions(+), 26 deletions(-)
> >>
> >> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> >> index 0400c5a7ba39..56596be70400 100644
> >> --- a/drivers/nvdimm/pmem.c
> >> +++ b/drivers/nvdimm/pmem.c
> >> @@ -45,10 +45,27 @@ static struct nd_region *to_region(struct pmem_device *pmem)
> >> return to_nd_region(to_dev(pmem)->parent);
> >> }
> >>
> >> -static void hwpoison_clear(struct pmem_device *pmem,
> >> - phys_addr_t phys, unsigned int len)
> >> +static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
> >> {
> >> + return (pmem->phys_addr + offset);
> >
> > Christoph already mentioned dropping the unnecessary parenthesis.
> >
> >> +}
> >> +
> >> +static sector_t to_sect(struct pmem_device *pmem, phys_addr_t offset)
> >> +{
> >> + return (offset - pmem->data_offset) >> SECTOR_SHIFT;
> >> +}
> >> +
> >> +static phys_addr_t to_offset(struct pmem_device *pmem, sector_t sector)
> >> +{
> >> + return ((sector << SECTOR_SHIFT) + pmem->data_offset);
> >> +}
> >> +
> >> +static void pmem_clear_hwpoison(struct pmem_device *pmem, phys_addr_t offset,
> >> + unsigned int len)
> >
> > Perhaps now is a good time to rename this to something else like
> > pmem_clear_mce_nospec()? Just to make it more distinct from
> > pmem_clear_poison(). While "hwpoison" is the page flag name
> > pmem_clear_poison() is the function that's actually clearing the
> > poison in hardware ("hw") and the new pmem_clear_mce_nospec() is
> > toggling the page back into service.
>
> I get your point. How about calling the function explicitly
> pmem_mkpage_present()?

Sure, I like pmem_mkpage_present().

2022-04-16 00:09:57

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On 4/12/2022 3:07 AM, Borislav Petkov wrote:
> On Tue, Apr 05, 2022 at 01:47:44PM -0600, Jane Chu wrote:
>> The set_memory_uc() approach doesn't work well in all cases.
>> For example, when "The VMM unmapped the bad page from guest
>> physical space and passed the machine check to the guest."
>> "The guest gets virtual #MC on an access to that page.
>> When the guest tries to do set_memory_uc() and instructs
>> cpa_flush() to do clean caches that results in taking another
>> fault / exception perhaps because the VMM unmapped the page
>> from the guest."
>
> I presume this is quoting someone...

Yes, will mention Dan's name next time.

>
>> Since the driver has special knowledge to handle NP or UC,
>> let's mark the poisoned page with NP and let driver handle it
>
> s/let's mark/mark/
>

Okay.

>> when it comes down to repair.
>>
>> Please refer to discussions here for more details.
>> https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
>>
>> Now since poisoned page is marked as not-present, in order to
>> avoid writing to a 'np' page and trigger kernel Oops, also fix
>
> You can write it out: "non-present page..."

Okay.

>
>> pmem_do_write().
>>
>> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 6 +++---
>> arch/x86/mm/pat/set_memory.c | 18 ++++++------------
>> drivers/nvdimm/pmem.c | 31 +++++++------------------------
>> include/linux/set_memory.h | 4 ++--
>> 4 files changed, 18 insertions(+), 41 deletions(-)
>
> For such mixed subsystem patches we probably should talk about who picks
> them up, eventually...

Hmm, maintainers' action item?

>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 981496e6bc0e..fa67bb9d1afe 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -579,7 +579,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>>
>> pfn = mce->addr >> PAGE_SHIFT;
>> if (!memory_failure(pfn, 0)) {
>> - set_mce_nospec(pfn, whole_page(mce));
>> + set_mce_nospec(pfn);
>> mce->kflags |= MCE_HANDLED_UC;
>> }
>>
>> @@ -1316,7 +1316,7 @@ static void kill_me_maybe(struct callback_head *cb)
>>
>> ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
>> if (!ret) {
>> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>> sync_core();
>> return;
>> }
>> @@ -1342,7 +1342,7 @@ static void kill_me_never(struct callback_head *cb)
>> p->mce_count = 0;
>> pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
>> if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
>> - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>> + set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
>> }
>
> Both that ->mce_whole_page and whole_page() look unused after this.
>

Yes, indeed, Dan also noticed, will remove while_page() in a follow-on
patch.

thanks,
-jane

2022-04-16 00:15:02

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 3/6] mce: fix set_mce_nospec to always unmap the whole page

On 4/13/2022 7:32 PM, Dan Williams wrote:
> On Wed, Apr 13, 2022 at 4:36 PM Jane Chu <[email protected]> wrote:
>>
>> On 4/11/2022 4:27 PM, Dan Williams wrote:
>>> On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>>>>
>>>> The set_memory_uc() approach doesn't work well in all cases.
>>>> For example, when "The VMM unmapped the bad page from guest
>>>> physical space and passed the machine check to the guest."
>>>> "The guest gets virtual #MC on an access to that page.
>>>> When the guest tries to do set_memory_uc() and instructs
>>>> cpa_flush() to do clean caches that results in taking another
>>>> fault / exception perhaps because the VMM unmapped the page
>>>> from the guest."
>>>>
>>>> Since the driver has special knowledge to handle NP or UC,
>>>
>>> I think a patch is needed before this one to make this statement true? I.e.:
>>>
>>> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
>>> index ee8d9973f60b..11641f55025a 100644
>>> --- a/drivers/acpi/nfit/mce.c
>>> +++ b/drivers/acpi/nfit/mce.c
>>> @@ -32,6 +32,7 @@ static int nfit_handle_mce(struct notifier_block
>>> *nb, unsigned long val,
>>> */
>>> mutex_lock(&acpi_desc_lock);
>>> list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>> + unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
>>> struct device *dev = acpi_desc->dev;
>>> int found_match = 0;
>>>
>>> @@ -63,8 +64,7 @@ static int nfit_handle_mce(struct notifier_block
>>> *nb, unsigned long val,
>>>
>>> /* If this fails due to an -ENOMEM, there is little we can do */
>>> nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>> - ALIGN(mce->addr, L1_CACHE_BYTES),
>>> - L1_CACHE_BYTES);
>>> + ALIGN(mce->addr, align), align);
>>> nvdimm_region_notify(nfit_spa->nd_region,
>>> NVDIMM_REVALIDATE_POISON);
>>>
>>
>> Dan, I tried the above change, and this is what I got after injecting 8
>> back-to-back poisons, then read them and received SIGBUS/BUS_MCEERR_AR,
>> then repair via the v7 patch which works until this change is added.
>>
>> [ 6240.955331] nfit ACPI0012:00: XXX, align = 100
>> [ 6240.960300] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
>> L1_CACHE_BYTES)=1851600400, L1_CACHE_BYTES=40, ALIGN(mce->addr,
>> align)=1851600400
>> [..]
>> [ 6242.052277] nfit ACPI0012:00: XXX, align = 100
>> [ 6242.057243] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
>> L1_CACHE_BYTES)=1851601000, L1_CACHE_BYTES=40, ALIGN(mce->addr,
>> align)=1851601000
>> [..]
>> [ 6244.917198] nfit ACPI0012:00: XXX, align = 1000
>> [ 6244.922258] nfit ACPI0012:00: XXX, ALIGN(mce->addr,
>> L1_CACHE_BYTES)=1851601200, L1_CACHE_BYTES=40, ALIGN(mce->addr,
>> align)=1851602000
>> [..]
>>
>> All 8 poisons remain uncleared.
>>
>> Without further investigation, I don't know why the failure.
>> Could we mark this change to a follow-on task?
>
> Perhaps a bit more debug before kicking this can down the road...
>
> I'm worried that this means that the driver is not accurately tracking
> poison data For example, that last case the hardware is indicating a
> full page clobber, but the old code would only track poison from
> 1851601200 to 1851601400 (i.e. the first 512 bytes from the base error
> address).

I would appear so, but the old code tracking seems to be working
correctly. For example, injecting 8 back-to-back poison, the
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/badblocks
cpatures all 8 of them, that spans 2K range, right? I never had issue
about missing poison in my tests.

>
> Oh... wait, I think there is a second bug here, that ALIGN should be
> ALIGN_DOWN(). Does this restore the result you expect?

That's it, ALIGN_DOWN fixed the issue, thanks!!
I'll add a patch, suggested-by you.

>
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index ee8d9973f60b..d7a52238a741 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -63,8 +63,7 @@ static int nfit_handle_mce(struct notifier_block
> *nb, unsigned long val,
>
> /* If this fails due to an -ENOMEM, there is little we can do */
> nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> - ALIGN(mce->addr, L1_CACHE_BYTES),
> - L1_CACHE_BYTES);
> + ALIGN_DOWN(mce->addr, align), align);
> nvdimm_region_notify(nfit_spa->nd_region,
> NVDIMM_REVALIDATE_POISON);
>
>
>> The driver knows a lot about how to clear poisons besides hardcoding
>> poison alignment to 0x40 bytes.
>
> It does, but the badblocks tracking should still be reliable, and if
> it's not reliable I expect there are cases where recovery_write() will
> not be triggered because the driver will not fail the
> dax_direct_access() attempt.

thanks!
-jane

2022-04-16 00:17:18

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 2/6] x86/mce: relocate set{clear}_mce_nospec() functions

On 4/11/2022 3:20 PM, Dan Williams wrote:
> I notice that none of the folks from "X86 MM" are on the cc, added.
>

Noted, thanks!

> On Tue, Apr 5, 2022 at 12:49 PM Jane Chu <[email protected]> wrote:
>>
>> Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
>> file where they belong.
>>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> arch/x86/include/asm/set_memory.h | 52 -------------------------------
>> arch/x86/mm/pat/set_memory.c | 47 ++++++++++++++++++++++++++++
>> include/linux/set_memory.h | 9 +++---
>> 3 files changed, 52 insertions(+), 56 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
>> index 78ca53512486..b45c4d27fd46 100644
>> --- a/arch/x86/include/asm/set_memory.h
>> +++ b/arch/x86/include/asm/set_memory.h
>> @@ -86,56 +86,4 @@ bool kernel_page_present(struct page *page);
>>
>> extern int kernel_set_to_readonly;
>>
>> -#ifdef CONFIG_X86_64
>> -/*
>> - * Prevent speculative access to the page by either unmapping
>> - * it (if we do not require access to any part of the page) or
>> - * marking it uncacheable (if we want to try to retrieve data
>> - * from non-poisoned lines in the page).
>> - */
>> -static inline int set_mce_nospec(unsigned long pfn, bool unmap)
>> -{
>> - unsigned long decoy_addr;
>> - int rc;
>> -
>> - /* SGX pages are not in the 1:1 map */
>> - if (arch_is_platform_page(pfn << PAGE_SHIFT))
>> - return 0;
>> - /*
>> - * We would like to just call:
>> - * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
>> - * but doing that would radically increase the odds of a
>> - * speculative access to the poison page because we'd have
>> - * the virtual address of the kernel 1:1 mapping sitting
>> - * around in registers.
>> - * Instead we get tricky. We create a non-canonical address
>> - * that looks just like the one we want, but has bit 63 flipped.
>> - * This relies on set_memory_XX() properly sanitizing any __pa()
>> - * results with __PHYSICAL_MASK or PTE_PFN_MASK.
>> - */
>> - decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
>> -
>> - if (unmap)
>> - rc = set_memory_np(decoy_addr, 1);
>> - else
>> - rc = set_memory_uc(decoy_addr, 1);
>> - if (rc)
>> - pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>> - return rc;
>> -}
>> -#define set_mce_nospec set_mce_nospec
>> -
>> -/* Restore full speculative operation to the pfn. */
>> -static inline int clear_mce_nospec(unsigned long pfn)
>> -{
>> - return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
>> -}
>> -#define clear_mce_nospec clear_mce_nospec
>> -#else
>> -/*
>> - * Few people would run a 32-bit kernel on a machine that supports
>> - * recoverable errors because they have too much memory to boot 32-bit.
>> - */
>> -#endif
>> -
>> #endif /* _ASM_X86_SET_MEMORY_H */
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 38af155aaba9..93dde949f224 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -1925,6 +1925,53 @@ int set_memory_wb(unsigned long addr, int numpages)
>> }
>> EXPORT_SYMBOL(set_memory_wb);
>>
>> +#ifdef CONFIG_X86_64
>
> It seems like the only X86_64 dependency in this routine is the
> address bit 63 usage, so how about:
>
> if (!IS_ENABLED(CONFIG_64BIT))
> return 0;
>
> ...and drop the ifdef?

Sure.

>
> Other than that you can add:
>
> Reviewed-by: Dan Williams <[email protected]>

Thanks!
-jane

2022-04-16 00:28:35

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On 4/11/2022 4:55 PM, Dan Williams wrote:
> On Wed, Apr 6, 2022 at 10:31 PM Christoph Hellwig <[email protected]> wrote:
>>
>> On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote:
>>> Yes, I believe Dan was motivated by avoiding the dm dance as a result of
>>> adding .recovery_write to dax_operations.
>>>
>>> I understand your point about .recovery_write is device specific and
>>> thus not something appropriate for device agnostic ops.
>>>
>>> I can see 2 options so far -
>>>
>>> 1) add .recovery_write to dax_operations and do the dm dance to hunt
>>> down to the base device that actually provides the recovery action
>>
>> That would be my preference. But I'll wait for Dan to chime in.
>
> Yeah, so the motivation was avoiding plumbing recovery through stacked
> lookups when the recovery is specific to a pfn and the provider of
> that pfn, but I also see it from Christoph's perspective that the only
> agent that cares about recovery is the fsdax I/O path. Certainly
> having ->dax_direct_access() take a DAX_RECOVERY flag and the op
> itself go through the pgmap is a confusing split that I did not
> anticipate when I made the suggestion. Since that flag must be there,
> then the ->recovery_write() should also stay relative to a dax device.
>
> Apologies for the thrash Jane.
>
> One ask though, please separate plumbing the new flag argument to
> ->dax_direct_access() and plumbing the new operation into preparation
> patches before filling them in with the new goodness.

Okay, will do in next revision.

thanks!
-jane

2022-04-16 00:32:58

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On 4/6/2022 10:30 PM, Christoph Hellwig wrote:
> On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote:
>> Yes, I believe Dan was motivated by avoiding the dm dance as a result of
>> adding .recovery_write to dax_operations.
>>
>> I understand your point about .recovery_write is device specific and
>> thus not something appropriate for device agnostic ops.
>>
>> I can see 2 options so far -
>>
>> 1) add .recovery_write to dax_operations and do the dm dance to hunt
>> down to the base device that actually provides the recovery action
>
> That would be my preference. But I'll wait for Dan to chime in.

Okay.

>
>> Okay, will run the checkpatch.pl test again.
>
> Unfortuntely checkpatch.pl is broken in that regard. It treats the
> exception to occasionally go longer or readability as the default.

I see.

thanks,
-jane

2022-04-16 00:34:33

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

On 4/11/2022 5:08 PM, Dan Williams wrote:
> On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <[email protected]> wrote:
>>
>> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
>> not set by default in dax_direct_access() such that the helper
>> does not translate a pmem range to kernel virtual address if the
>> range contains uncorrectable errors. When the flag is set,
>> the helper ignores the UEs and return kernel virtual adderss so
>> that the caller may get on with data recovery via write.
>>
>> Also introduce a new dev_pagemap_ops .recovery_write function.
>> The function is applicable to FSDAX device only. The device
>> page backend driver provides .recovery_write function if the
>> device has underlying mechanism to clear the uncorrectable
>> errors on the fly.
>>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> drivers/dax/super.c | 17 ++++++++--
>> drivers/md/dm-linear.c | 4 +--
>> drivers/md/dm-log-writes.c | 5 +--
>> drivers/md/dm-stripe.c | 4 +--
>> drivers/md/dm-target.c | 2 +-
>> drivers/md/dm-writecache.c | 5 +--
>> drivers/md/dm.c | 5 +--
>> drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------
>> drivers/nvdimm/pmem.h | 2 +-
>> drivers/s390/block/dcssblk.c | 4 +--
>> fs/dax.c | 24 ++++++++++----
>> fs/fuse/dax.c | 4 +--
>> include/linux/dax.h | 11 +++++--
>> include/linux/device-mapper.h | 2 +-
>> include/linux/memremap.h | 7 ++++
>> tools/testing/nvdimm/pmem-dax.c | 2 +-
>> 16 files changed, 116 insertions(+), 39 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0211e6f7b47a..8252858cd25a 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -13,6 +13,7 @@
>> #include <linux/uio.h>
>> #include <linux/dax.h>
>> #include <linux/fs.h>
>> +#include <linux/memremap.h>
>> #include "dax-private.h"
>>
>> /**
>> @@ -117,6 +118,7 @@ enum dax_device_flags {
>> * @dax_dev: a dax_device instance representing the logical memory range
>> * @pgoff: offset in pages from the start of the device to translate
>> * @nr_pages: number of consecutive pages caller can handle relative to @pfn
>> + * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery
>> * @kaddr: output parameter that returns a virtual address mapping of pfn
>> * @pfn: output parameter that returns an absolute pfn translation of @pgoff
>> *
>> @@ -124,7 +126,7 @@ enum dax_device_flags {
>> * pages accessible at the device relative @pgoff.
>> */
>> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> - void **kaddr, pfn_t *pfn)
>> + int flags, void **kaddr, pfn_t *pfn)
>> {
>> long avail;
>>
>> @@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> if (nr_pages < 0)
>> return -EINVAL;
>>
>> - avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
>> + avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags,
>> kaddr, pfn);
>> if (!avail)
>> return -ERANGE;
>> @@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>> }
>> EXPORT_SYMBOL_GPL(dax_zero_page_range);
>>
>> +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
>> + pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter)
>> +{
>> + struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
>> +
>> + if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write)
>> + return 0;
>> + return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter);
>> +}
>> +EXPORT_SYMBOL_GPL(dax_recovery_write);
>> +
>> #ifdef CONFIG_ARCH_HAS_PMEM_API
>> void arch_wb_cache_pmem(void *addr, size_t size);
>> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
>> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
>> index 76b486e4d2be..9e6d8bdf3b2a 100644
>> --- a/drivers/md/dm-linear.c
>> +++ b/drivers/md/dm-linear.c
>> @@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>> }
>>
>> static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff);
>>
>> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
>> }
>>
>> static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
>> index c9d036d6bb2e..e23f062ade5f 100644
>> --- a/drivers/md/dm-log-writes.c
>> +++ b/drivers/md/dm-log-writes.c
>> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti,
>> }
>>
>> static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags,
>> + void **kaddr, pfn_t *pfn)
>> {
>> struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff);
>>
>> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
>> }
>>
>> static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
>> index c81d331d1afe..b89339c78702 100644
>> --- a/drivers/md/dm-stripe.c
>> +++ b/drivers/md/dm-stripe.c
>> @@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff)
>> }
>>
>> static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff);
>>
>> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn);
>> }
>>
>> static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
>> index 64dd0b34fcf4..24b1e5628f3a 100644
>> --- a/drivers/md/dm-target.c
>> +++ b/drivers/md/dm-target.c
>> @@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone,
>> }
>>
>> static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> return -EIO;
>> }
>> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
>> index 5630b470ba42..180ca8fa383e 100644
>> --- a/drivers/md/dm-writecache.c
>> +++ b/drivers/md/dm-writecache.c
>> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>
>> id = dax_read_lock();
>>
>> - da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
>> + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0,
>> + &wc->memory_map, &pfn);
>> if (da < 0) {
>> wc->memory_map = NULL;
>> r = da;
>> @@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>> do {
>> long daa;
>> daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
>> - NULL, &pfn);
>> + 0, NULL, &pfn);
>> if (daa <= 0) {
>> r = daa ? daa : -EINVAL;
>> goto err3;
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index ad2e0bbeb559..a8c697bb6603 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
>> }
>>
>> static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr,
>> + pfn_t *pfn)
>> {
>> struct mapped_device *md = dax_get_private(dax_dev);
>> sector_t sector = pgoff * PAGE_SECTORS;
>> @@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>> if (len < 1)
>> goto out;
>> nr_pages = min(len, nr_pages);
>> - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
>> + ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn);
>>
>> out:
>> dm_put_live_table(md, srcu_idx);
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 30c71a68175b..0400c5a7ba39 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>>
>> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>> + struct badblocks *bb = &pmem->bb;
>> + sector_t first_bad;
>> + int num_bad;
>> + bool bad_in_range;
>> + long actual_nr;
>> +
>> + if (!bb->count)
>> + bad_in_range = false;
>> + else
>> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);
>
> Why all this change. >
>>
>> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> - PFN_PHYS(nr_pages))))
>
> ...instead of adding "&& !(flags & DAX_RECOVERY)" to this statement?
>
>> + if (bad_in_range && !(flags & DAX_RECOVERY))
>> return -EIO;
>>
>> if (kaddr)
>> @@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> if (pfn)
>> *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>>
>> + if (!bad_in_range) {
>> + /*
>> + * If badblock is present but not in the range, limit known good range
>> + * to the requested range.
>> + */
>> + if (bb->count)
>> + return nr_pages;
>> + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
>> + }
>> +
>> /*
>> - * If badblocks are present, limit known good range to the
>> - * requested range.
>> + * In case poison is found in the given range and DAX_RECOVERY flag is set,
>> + * recovery stride is set to kernel page size because the underlying driver and
>> + * firmware clear poison functions don't appear to handle large chunk (such as
>> + * 2MiB) reliably.
>> */
>> - if (unlikely(pmem->bb.count))
>> - return nr_pages;
>> - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
>> + actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT));
>> + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n",
>> + sector, nr_pages, first_bad, actual_nr);
>> + return (actual_nr == 0) ? 1 : actual_nr;
>
> Similar feedback as above that this is more change than I would expect.
>
> I think just adding...
>
> if (flags & DAX_RECOVERY)
> return 1;
>
> ...before the typical return path is enough.

I plan to use the re-organized code by Christoph - looks cleaner to me.

thanks!
-jane

2022-04-16 01:24:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/6] x86/mm: fix comment

On Thu, Apr 14, 2022 at 01:00:05AM +0000, Jane Chu wrote:
> This change used to be folded in the mce patch, but for that I received
> a review comment pointing out that the change is unrelated to the said
> patch and doesn't belong, hence I pulled it out to stand by itself. :)

Aha, someone is being very pedantic.

For trivial unrelated changes like that I usually add them to a patch
which already touches that file and put in the commit message:

"While at it, fixup a function name in a comment."

Less patches to handle.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette