This patch set is a follow up to below conversation
https://lore.kernel.org/linux-xfs/[email protected]/
where Dave Chinner proposed a single flag RWF_RECOVER_DATA to
be added to the system calls preadv2() and pwritev2() for the
purpose of user directed recovery from data loss due to poison
in a dax range.
The idea is that when a dax range is poisoned, this flag gives
preadv2() permission to fetch as much clean data as possible,
and permission for pwritev2() to attempt to clear poison(s)
in the range and then store the good data over.
This feature maybe deployed by user process' signal handler
in response to SIGBUS with BUS_MCEERR_AR or BUS_ADRERR when
the 'si_addr' points to a dax range; or, simply when not sure
of a poison free range. This approach does not fragment the
dax backend, if it is unable to clear poison, it will fail
the pwritev2() so that the situation is explicit to the user.
This approach is not recommended to normal non-recovery
code path due to potential performance impact incurred by
poison checking.
Also, this approach is an alternative to the existing
punch-a-hole-followed-by-pwrite approach which does not clear
poison, but instead, allocates a discontiguous clean backend
range to satisfy the pwrite().
Jane Chu (6):
dax: introduce RWF_RECOVERY_DATA flag to preadv2() and pwritev2()
dax: prepare dax_direct_access() API with DAXDEV_F_RECOVERY flag
pmem: pmem_dax_direct_access() to honor the DAXDEV_F_RECOVERY flag
dm,dax,pmem: prepare dax_copy_to/from_iter() APIs with
DAXDEV_F_RECOVERY
dax,pmem: Add data recovery feature to pmem_copy_to/from_iter()
dm: Ensure dm honors DAXDEV_F_RECOVERY flag on dax only
drivers/dax/super.c | 19 ++++---
drivers/md/dm-linear.c | 12 ++---
drivers/md/dm-log-writes.c | 17 ++++---
drivers/md/dm-stripe.c | 12 ++---
drivers/md/dm-target.c | 2 +-
drivers/md/dm-writecache.c | 4 +-
drivers/md/dm.c | 16 +++---
drivers/nvdimm/pmem.c | 88 ++++++++++++++++++++++++++++-----
drivers/nvdimm/pmem.h | 2 +-
drivers/s390/block/dcssblk.c | 13 +++--
fs/dax.c | 24 ++++++---
fs/fuse/dax.c | 2 +-
fs/fuse/virtio_fs.c | 12 ++---
include/linux/dax.h | 15 +++---
include/linux/device-mapper.h | 4 +-
include/linux/fs.h | 1 +
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 5 +-
tools/testing/nvdimm/pmem-dax.c | 2 +-
19 files changed, 173 insertions(+), 78 deletions(-)
--
2.18.4
Introduce RWF_RECOVERY_DATA flag to preadv2() and pwritev2()
for the purpose of recovering data loss due to dax media error.
Hence the functionality ties to the underlying media and driver
with capability to clear media error(s) on the fly.
When this flag is provided with preadv2(), preadv2() will attempt
to read as much data as possible until the poisoned page is
encountered.
When the flag is provided with pwritev2(), pwritev2() will attempt
to clear media error within the user specified range and then write
the user provided data to the range. Both the range and length
parameters must be page aligned in order get the recovery process
to work.
Signed-off-by: Jane Chu <[email protected]>
---
fs/dax.c | 3 +++
include/linux/fs.h | 1 +
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 5 ++++-
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..01118de00011 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1288,6 +1288,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iocb->ki_flags & IOCB_NOWAIT)
iomi.flags |= IOMAP_NOWAIT;
+ if (iocb->ki_flags & IOCB_RECOVERY)
+ iomi.flags |= IOMAP_RECOVERY;
+
while ((ret = iomap_iter(&iomi, ops)) > 0)
iomi.processed = dax_iomap_iter(&iomi, iter);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..ae138649cbe3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -311,6 +311,7 @@ enum rw_hint {
#define IOCB_SYNC (__force int) RWF_SYNC
#define IOCB_NOWAIT (__force int) RWF_NOWAIT
#define IOCB_APPEND (__force int) RWF_APPEND
+#define IOCB_RECOVERY (__force int) RWF_RECOVERY_DATA
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 24f8489583ca..c13d23328140 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -141,6 +141,7 @@ struct iomap_page_ops {
#define IOMAP_NOWAIT (1 << 5) /* do not block */
#define IOMAP_OVERWRITE_ONLY (1 << 6) /* only pure overwrites allowed */
#define IOMAP_UNSHARE (1 << 7) /* unshare_file_range */
+#define IOMAP_RECOVERY (1 << 8) /* data recovery */
struct iomap_ops {
/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..febec55ea4b8 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
/* per-IO O_APPEND */
#define RWF_APPEND ((__force __kernel_rwf_t)0x00000010)
+/* per-IO for data recovery */
+#define RWF_RECOVERY_DATA ((__force __kernel_rwf_t)0x00000020)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
- RWF_APPEND)
+ RWF_APPEND | RWF_RECOVERY_DATA)
#endif /* _UAPI_LINUX_FS_H */
--
2.18.4
Prepare dax_direct_access() API with DAXDEV_F_RECOVERY flag
such that the API may perform device address translation
in spite of the presence of poison(s) in a given range.
Signed-off-by: Jane Chu <[email protected]>
---
drivers/dax/super.c | 9 +++++----
drivers/md/dm-linear.c | 4 ++--
drivers/md/dm-log-writes.c | 5 +++--
drivers/md/dm-stripe.c | 4 ++--
drivers/md/dm-target.c | 2 +-
drivers/md/dm-writecache.c | 4 ++--
drivers/md/dm.c | 4 ++--
drivers/nvdimm/pmem.c | 7 ++++---
drivers/nvdimm/pmem.h | 2 +-
drivers/s390/block/dcssblk.c | 7 ++++---
fs/dax.c | 12 ++++++++----
fs/fuse/dax.c | 2 +-
fs/fuse/virtio_fs.c | 4 ++--
include/linux/dax.h | 7 +++++--
include/linux/device-mapper.h | 2 +-
tools/testing/nvdimm/pmem-dax.c | 2 +-
16 files changed, 44 insertions(+), 33 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index fc89e91beea7..67093f1c3341 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -156,8 +156,8 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
}
id = dax_read_lock();
- len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
- len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
+ len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn, 0);
+ len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn, 0);
if (len < 1 || len2 < 1) {
pr_info("%pg: error: dax access failed (%ld)\n",
@@ -302,12 +302,13 @@ EXPORT_SYMBOL_GPL(dax_attribute_group);
* @nr_pages: number of consecutive pages caller can handle relative to @pfn
* @kaddr: output parameter that returns a virtual address mapping of pfn
* @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ * @flags: indication whether on dax data recovery code path or not
*
* Return: negative errno if an error occurs, otherwise the number of
* pages accessible at the device relative @pgoff.
*/
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
- void **kaddr, pfn_t *pfn)
+ void **kaddr, pfn_t *pfn, unsigned long flags)
{
long avail;
@@ -321,7 +322,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
return -EINVAL;
avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
- kaddr, pfn);
+ kaddr, pfn, flags);
if (!avail)
return -ERANGE;
return min(avail, nr_pages);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 679b4c0a2eea..cb7c8518f02d 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -165,7 +165,7 @@ static int linear_iterate_devices(struct dm_target *ti,
#if IS_ENABLED(CONFIG_DAX_DRIVER)
static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
long ret;
struct linear_c *lc = ti->private;
@@ -177,7 +177,7 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
if (ret)
return ret;
- return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn, flags);
}
static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index d93a4db23512..6d8b88dcce6c 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -950,7 +950,7 @@ static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
}
static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
struct log_writes_c *lc = ti->private;
sector_t sector = pgoff * PAGE_SECTORS;
@@ -959,7 +959,8 @@ static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
ret = bdev_dax_pgoff(lc->dev->bdev, sector, nr_pages * PAGE_SIZE, &pgoff);
if (ret)
return ret;
- return dax_direct_access(lc->dev->dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(lc->dev->dax_dev, pgoff, nr_pages, kaddr, pfn,
+ flags);
}
static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 6660b6b53d5b..0a97d0472a0b 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -302,7 +302,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
#if IS_ENABLED(CONFIG_DAX_DRIVER)
static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
@@ -319,7 +319,7 @@ static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
if (ret)
return ret;
- return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
+ return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn, flags);
}
static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 64dd0b34fcf4..431764b77528 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone,
}
static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
return -EIO;
}
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 18320444fb0a..c523cb911eca 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -286,7 +286,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
id = dax_read_lock();
- da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
+ da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn, 0);
if (da < 0) {
wc->memory_map = NULL;
r = da;
@@ -309,7 +309,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
do {
long daa;
daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i,
- NULL, &pfn);
+ NULL, &pfn, 0);
if (daa <= 0) {
r = daa ? daa : -EINVAL;
goto err3;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a011d09cb0fa..e5a14abd45f9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -998,7 +998,7 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
}
static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
struct mapped_device *md = dax_get_private(dax_dev);
sector_t sector = pgoff * PAGE_SECTORS;
@@ -1016,7 +1016,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
- ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+ ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn, flags);
out:
dm_put_live_table(md, srcu_idx);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 72de88ff0d30..b0b7fd40560e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -256,7 +256,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
/* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
@@ -295,11 +295,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
}
static long pmem_dax_direct_access(struct dax_device *dax_dev,
- pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+ pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn,
+ unsigned long flags)
{
struct pmem_device *pmem = dax_get_private(dax_dev);
- return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+ return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn, flags);
}
/*
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a8..fb769b22777a 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -27,7 +27,7 @@ struct pmem_device {
};
long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags);
#ifdef CONFIG_MEMORY_FAILURE
static inline bool test_and_clear_pmem_poison(struct page *page)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5be3d1c39a78..6ab2f9badc8d 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -32,7 +32,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_submit_bio(struct bio *bio);
static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags);
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
@@ -62,7 +62,7 @@ static int dcssblk_dax_zero_page_range(struct dax_device *dax_dev,
long rc;
void *kaddr;
- rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+ rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL, 0);
if (rc < 0)
return rc;
memset(kaddr, 0, nr_pages << PAGE_SHIFT);
@@ -932,7 +932,8 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
static long
dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn,
+ unsigned long flags)
{
struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev);
diff --git a/fs/dax.c b/fs/dax.c
index 01118de00011..f603a9ce7f20 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -722,7 +722,7 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
return rc;
id = dax_read_lock();
- rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+ rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL, 0);
if (rc < 0) {
dax_read_unlock(id);
return rc;
@@ -1023,7 +1023,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
return rc;
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
- NULL, pfnp);
+ NULL, pfnp, 0);
if (length < 0) {
rc = length;
goto out;
@@ -1149,7 +1149,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
if (page_aligned)
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
else
- rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
+ rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL, 0);
if (rc < 0) {
dax_read_unlock(id);
return rc;
@@ -1172,6 +1172,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
loff_t end = pos + length, done = 0;
+ unsigned long dax_flag = 0;
ssize_t ret = 0;
size_t xfer;
int id;
@@ -1199,6 +1200,9 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
(end - 1) >> PAGE_SHIFT);
}
+ if (iomi->flags & IOMAP_RECOVERY)
+ dax_flag |= DAXDEV_F_RECOVERY;
+
id = dax_read_lock();
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
@@ -1218,7 +1222,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
break;
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
- &kaddr, NULL);
+ &kaddr, NULL, dax_flag);
if (map_len < 0) {
ret = map_len;
break;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 281d79f8b3d3..2c45b94647f1 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1245,7 +1245,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd)
id = dax_read_lock();
nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL,
- NULL);
+ NULL, 0);
dax_read_unlock(id);
if (nr_pages < 0) {
pr_debug("dax_direct_access() returned %ld\n", nr_pages);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 0ad89c6629d7..d201b6e8a190 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -739,7 +739,7 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
* offset.
*/
static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
struct virtio_fs *fs = dax_get_private(dax_dev);
phys_addr_t offset = PFN_PHYS(pgoff);
@@ -773,7 +773,7 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
long rc;
void *kaddr;
- rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+ rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL, 0);
if (rc < 0)
return rc;
memset(kaddr, 0, nr_pages << PAGE_SHIFT);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..0044a5d87e5d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,6 +9,9 @@
/* Flag for synchronous flush */
#define DAXDEV_F_SYNC (1UL << 0)
+/* Flag for DAX data recovery */
+#define DAXDEV_F_RECOVERY (1UL << 1)
+
typedef unsigned long dax_entry_t;
struct iomap_ops;
@@ -21,7 +24,7 @@ struct dax_operations {
* number of pages available for DAX at that pfn.
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
- void **, pfn_t *);
+ void **, pfn_t *, unsigned long);
/*
* Validate whether this device is usable as an fsdax backing
* device.
@@ -192,7 +195,7 @@ static inline void dax_read_unlock(int id)
bool dax_alive(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
- void **kaddr, pfn_t *pfn);
+ void **kaddr, pfn_t *pfn, unsigned long);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 114553b487ef..307c29789332 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -146,7 +146,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
* >= 0 : the number of bytes accessible at the address
*/
typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn);
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags);
typedef size_t (*dm_dax_copy_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c
index af19c85558e7..45dcfffe5575 100644
--- a/tools/testing/nvdimm/pmem-dax.c
+++ b/tools/testing/nvdimm/pmem-dax.c
@@ -8,7 +8,7 @@
#include <nd.h>
long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
- long nr_pages, void **kaddr, pfn_t *pfn)
+ long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags)
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
--
2.18.4
Prepare dax_copy_to/from_iter() APIs with DAXDEV_F_RECOVERY flag
such that when the flag is set, the underlying driver implementation
of the APIs may deal with potential poison in a given address
range and read partial data or write after clearing poison.
Signed-off-by: Jane Chu <[email protected]>
---
drivers/dax/super.c | 10 ++++++----
drivers/md/dm-linear.c | 8 ++++----
drivers/md/dm-log-writes.c | 12 ++++++------
drivers/md/dm-stripe.c | 8 ++++----
drivers/md/dm.c | 8 ++++----
drivers/nvdimm/pmem.c | 4 ++--
drivers/s390/block/dcssblk.c | 6 ++++--
fs/dax.c | 4 ++--
fs/fuse/virtio_fs.c | 8 ++++----
include/linux/dax.h | 8 ++++----
include/linux/device-mapper.h | 2 +-
11 files changed, 41 insertions(+), 37 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 67093f1c3341..97854da1ecf7 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -330,22 +330,24 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
EXPORT_SYMBOL_GPL(dax_direct_access);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i)
+ size_t bytes, struct iov_iter *i, unsigned long flags)
{
if (!dax_alive(dax_dev))
return 0;
- return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i,
+ flags);
}
EXPORT_SYMBOL_GPL(dax_copy_from_iter);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i)
+ size_t bytes, struct iov_iter *i, unsigned long flags)
{
if (!dax_alive(dax_dev))
return 0;
- return dax_dev->ops->copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_dev->ops->copy_to_iter(dax_dev, pgoff, addr, bytes, i,
+ flags);
}
EXPORT_SYMBOL_GPL(dax_copy_to_iter);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index cb7c8518f02d..cc57bd639871 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -181,7 +181,7 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
}
static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
struct linear_c *lc = ti->private;
struct block_device *bdev = lc->dev->bdev;
@@ -191,11 +191,11 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
dev_sector = linear_map_sector(ti, sector);
if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
return 0;
- return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i, flags);
}
static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
struct linear_c *lc = ti->private;
struct block_device *bdev = lc->dev->bdev;
@@ -205,7 +205,7 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
dev_sector = linear_map_sector(ti, sector);
if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
return 0;
- return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i, flags);
}
static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 6d8b88dcce6c..b8e9bddc47b8 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -964,8 +964,8 @@ static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
}
static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
- pgoff_t pgoff, void *addr, size_t bytes,
- struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
struct log_writes_c *lc = ti->private;
sector_t sector = pgoff * PAGE_SECTORS;
@@ -984,19 +984,19 @@ static size_t log_writes_dax_copy_from_iter(struct dm_target *ti,
return 0;
}
dax_copy:
- return dax_copy_from_iter(lc->dev->dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_from_iter(lc->dev->dax_dev, pgoff, addr, bytes, i, flags);
}
static size_t log_writes_dax_copy_to_iter(struct dm_target *ti,
- pgoff_t pgoff, void *addr, size_t bytes,
- struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
struct log_writes_c *lc = ti->private;
sector_t sector = pgoff * PAGE_SECTORS;
if (bdev_dax_pgoff(lc->dev->bdev, sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
return 0;
- return dax_copy_to_iter(lc->dev->dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_to_iter(lc->dev->dax_dev, pgoff, addr, bytes, i, flags);
}
static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 0a97d0472a0b..eefaa23a36fa 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -323,7 +323,7 @@ static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
}
static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
@@ -338,11 +338,11 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
return 0;
- return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i, flags);
}
static size_t stripe_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
@@ -357,7 +357,7 @@ static size_t stripe_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
return 0;
- return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
+ return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i, flags);
}
static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e5a14abd45f9..764183ddebc1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1045,7 +1045,7 @@ static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bd
}
static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
struct mapped_device *md = dax_get_private(dax_dev);
sector_t sector = pgoff * PAGE_SECTORS;
@@ -1061,7 +1061,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
ret = copy_from_iter(addr, bytes, i);
goto out;
}
- ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i);
+ ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i, flags);
out:
dm_put_live_table(md, srcu_idx);
@@ -1069,7 +1069,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
}
static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
struct mapped_device *md = dax_get_private(dax_dev);
sector_t sector = pgoff * PAGE_SECTORS;
@@ -1085,7 +1085,7 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
ret = copy_to_iter(addr, bytes, i);
goto out;
}
- ret = ti->type->dax_copy_to_iter(ti, pgoff, addr, bytes, i);
+ ret = ti->type->dax_copy_to_iter(ti, pgoff, addr, bytes, i, flags);
out:
dm_put_live_table(md, srcu_idx);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ed699416655b..e2a1c35108cd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -311,13 +311,13 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
* dax_iomap_actor()
*/
static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
return _copy_from_iter_flushcache(addr, bytes, i);
}
static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
return _copy_mc_to_iter(addr, bytes, i);
}
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 6ab2f9badc8d..6eb2b9a7682b 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -45,13 +45,15 @@ static const struct block_device_operations dcssblk_devops = {
};
static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
- pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
return copy_from_iter(addr, bytes, i);
}
static size_t dcssblk_dax_copy_to_iter(struct dax_device *dax_dev,
- pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
return copy_to_iter(addr, bytes, i);
}
diff --git a/fs/dax.c b/fs/dax.c
index f603a9ce7f20..69433c6cd6c4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1241,10 +1241,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
*/
if (iov_iter_rw(iter) == WRITE)
xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
- map_len, iter);
+ map_len, iter, dax_flag);
else
xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
- map_len, iter);
+ map_len, iter, dax_flag);
pos += xfer;
length -= xfer;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d201b6e8a190..b0d80459b1cb 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -754,15 +754,15 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
}
static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
- pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
return copy_from_iter(addr, bytes, i);
}
static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
- pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i)
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i,
+ unsigned long flags)
{
return copy_to_iter(addr, bytes, i);
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0044a5d87e5d..97f421f831e2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -33,10 +33,10 @@ struct dax_operations {
sector_t, sector_t);
/* copy_from_iter: required operation for fs-dax direct-i/o */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
- struct iov_iter *);
+ struct iov_iter *, unsigned long);
/* copy_to_iter: required operation for fs-dax direct-i/o */
size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
- struct iov_iter *);
+ struct iov_iter *, unsigned long);
/* zero_page_range: required operation. Zero page range */
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
};
@@ -197,9 +197,9 @@ void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn, unsigned long);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i);
+ size_t bytes, struct iov_iter *i, unsigned long flags);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
- size_t bytes, struct iov_iter *i);
+ size_t bytes, struct iov_iter *i, unsigned long flags);
int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 307c29789332..81c67c3d96ed 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -148,7 +148,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn, unsigned long flags);
typedef size_t (*dm_dax_copy_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i);
+ void *addr, size_t bytes, struct iov_iter *i, unsigned long flags);
typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
size_t nr_pages);
--
2.18.4
When DAXDEV_F_RECOVERY flag is set, pmem_copy_to_iter() shall read
as much data as possible up till the first poisoned page is
encountered, and pmem_copy_from_iter() shall try to clear poison(s)
within the page aligned range prior to writing.
Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 72 ++++++++++++++++++++++++++++++++++++++++---
fs/dax.c | 5 +++
2 files changed, 72 insertions(+), 5 deletions(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e2a1c35108cd..c456f84d2f6f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -305,21 +305,83 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
}
/*
- * Use the 'no check' versions of copy_from_iter_flushcache() and
- * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
- * checking, both file offset and device offset, is handled by
- * dax_iomap_actor()
+ * Even though the 'no check' versions of copy_from_iter_flushcache()
+ * and copy_mc_to_iter() are used to bypass HARDENED_USERCOPY overhead,
+ * 'read'/'write' aren't always safe when poison is consumed. They happen
+ * to be safe because the 'read'/'write' range has been guaranteed
+ * be free of poison(s) by a prior call to dax_direct_access() on the
+ * caller stack.
+ * However with the introduction of DAXDEV_F_RECOVERY, the 'read'/'write'
+ * range may contain poison(s), so the functions perform explicit check
+ * on poison, and 'read' end up fetching only non-poisoned page(s) up
+ * till the first poison is encountered while 'write' require the range
+ * is page aligned in order to restore the poisoned page's memory type
+ * back to "rw" after clearing the poison(s).
+ * In the event of poison related failure, (size_t) -EIO is returned and
+ * caller may check the return value after casting it to (ssize_t).
*/
static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
+ phys_addr_t pmem_off;
+ size_t len, lead_off;
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ struct device *dev = pmem->bb.dev;
+
+ if (flags & DAXDEV_F_RECOVERY) {
+ lead_off = (unsigned long)addr & ~PAGE_MASK;
+ len = PFN_PHYS(PFN_UP(lead_off + bytes));
+ if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
+ if (lead_off || !(PAGE_ALIGNED(bytes))) {
+ dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
+ addr, bytes);
+ return (size_t) -EIO;
+ }
+ pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
+ if (pmem_clear_poison(pmem, pmem_off, bytes) !=
+ BLK_STS_OK)
+ return (size_t) -EIO;
+ }
+ }
+
return _copy_from_iter_flushcache(addr, bytes, i);
}
static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i, unsigned long flags)
{
- return _copy_mc_to_iter(addr, bytes, i);
+ int num_bad;
+ size_t len, lead_off;
+ unsigned long bad_pfn;
+ bool bad_pmem = false;
+ size_t adj_len = bytes;
+ sector_t sector, first_bad;
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ struct device *dev = pmem->bb.dev;
+
+ if (flags & DAXDEV_F_RECOVERY) {
+ sector = PFN_PHYS(pgoff) / 512;
+ lead_off = (unsigned long)addr & ~PAGE_MASK;
+ len = PFN_PHYS(PFN_UP(lead_off + bytes));
+ if (pmem->bb.count)
+ bad_pmem = !!badblocks_check(&pmem->bb, sector,
+ len / 512, &first_bad, &num_bad);
+ if (bad_pmem) {
+ bad_pfn = PHYS_PFN(first_bad * 512);
+ if (bad_pfn == pgoff) {
+ dev_warn(dev, "Found poison in page: pgoff(%#lx)\n",
+ pgoff);
+ return -EIO;
+ }
+ adj_len = PFN_PHYS(bad_pfn - pgoff) - lead_off;
+ dev_WARN_ONCE(dev, (adj_len > bytes),
+ "out-of-range first_bad?");
+ }
+ if (adj_len == 0)
+ return (size_t) -EIO;
+ }
+
+ return _copy_mc_to_iter(addr, adj_len, i);
}
static const struct dax_operations pmem_dax_ops = {
diff --git a/fs/dax.c b/fs/dax.c
index 69433c6cd6c4..b9286668dc46 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1246,6 +1246,11 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
map_len, iter, dax_flag);
+ if ((ssize_t)xfer == -EIO) {
+ ret = -EIO;
+ break;
+ }
+
pos += xfer;
length -= xfer;
done += xfer;
--
2.18.4
dm_dax_direct_access() supports DAXDEV_F_RECOVERY, so it may
translate a poisoned range. But if dm_dax_copy_to/from_iter()
don't have a dax_copy_to/from_iter() foundation underneath,
performing load/store over poisoned range is dangerous and
should be avoided.
Signed-off-by: Jane Chu <[email protected]>
---
drivers/md/dm.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 764183ddebc1..5f7fe64d3c37 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1058,6 +1058,8 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
if (!ti)
goto out;
if (!ti->type->dax_copy_from_iter) {
+ if (flags & DAXDEV_F_RECOVERY)
+ goto out;
ret = copy_from_iter(addr, bytes, i);
goto out;
}
@@ -1082,6 +1084,8 @@ static size_t dm_dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
if (!ti)
goto out;
if (!ti->type->dax_copy_to_iter) {
+ if (flags & DAXDEV_F_RECOVERY)
+ goto out;
ret = copy_to_iter(addr, bytes, i);
goto out;
}
--
2.18.4
Let pmem_dax_direct_access() skip badblock checking if the caller
intends to do data recovery by providing the DAXDEV_F_RECOVERY flag.
Signed-off-by: Jane Chu <[email protected]>
---
drivers/nvdimm/pmem.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b0b7fd40560e..ed699416655b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,8 +260,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
{
resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
- if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
- PFN_PHYS(nr_pages))))
+ if (unlikely(!(flags & DAXDEV_F_RECOVERY) &&
+ is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+ PFN_PHYS(nr_pages))))
return -EIO;
if (kaddr)
--
2.18.4
On Wed, Oct 20, 2021 at 06:10:55PM -0600, Jane Chu wrote:
> @@ -156,8 +156,8 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
> }
>
> id = dax_read_lock();
> - len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
> - len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
> + len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn, 0);
> + len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn, 0);
FYI, I have a series killing this code. But either way please avoid
these overly long lines.
> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> - void **kaddr, pfn_t *pfn)
> + void **kaddr, pfn_t *pfn, unsigned long flags)
API design: I'd usually expect flags before output paramters.
On Wed, Oct 20, 2021 at 06:10:56PM -0600, Jane Chu wrote:
> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> - PFN_PHYS(nr_pages))))
> + if (unlikely(!(flags & DAXDEV_F_RECOVERY) &&
> + is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> + PFN_PHYS(nr_pages))))
The indentation here is pretty messed up. Something like this would
be move normal:
if (unlikely(!(flags & DAXDEV_F_RECOVERY) &&
is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
PFN_PHYS(nr_pages)))) {
but if we don't really need the unlikely we could do an actually
readable variant:
if (!(flags & DAXDEV_F_RECOVERY) &&
is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, PFN_PHYS(nr_pages)))
On Wed, Oct 20, 2021 at 06:10:57PM -0600, Jane Chu wrote:
> Prepare dax_copy_to/from_iter() APIs with DAXDEV_F_RECOVERY flag
> such that when the flag is set, the underlying driver implementation
> of the APIs may deal with potential poison in a given address
> range and read partial data or write after clearing poison.
FYI, I've been wondering for a while if we could just kill off these
methods entirely. Basically the driver interaction consists of two
parts:
a) wether to use the flushcache/mcsafe variants of the generic helpers
b) actually doing remapping for device mapper
to me it seems like we should handle a) with flags in dax_operations,
and only have a remap callback for device mapper. That way we'd avoid
the indirect calls for the native case, and also avoid tons of
boilerplate code. "futher decouple DAX from block devices" series
already massages the device mapper into a form suitable for such
callbacks.
> + if (flags & DAXDEV_F_RECOVERY) {
> + lead_off = (unsigned long)addr & ~PAGE_MASK;
> + len = PFN_PHYS(PFN_UP(lead_off + bytes));
> + if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
> + if (lead_off || !(PAGE_ALIGNED(bytes))) {
> + dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
> + addr, bytes);
> + return (size_t) -EIO;
> + }
> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
> + if (pmem_clear_poison(pmem, pmem_off, bytes) !=
> + BLK_STS_OK)
> + return (size_t) -EIO;
> + }
Shouldn't this just go down in a separe ->clear_poison operation
to make the whole thing a little easier to follow?
Looking over the series I have serious doubts that overloading the
slow path clear poison operation over the fast path read/write
path is such a great idea.
On 10/21/2021 4:20 AM, Christoph Hellwig wrote:
> On Wed, Oct 20, 2021 at 06:10:55PM -0600, Jane Chu wrote:
>> @@ -156,8 +156,8 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>> }
>>
>> id = dax_read_lock();
>> - len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>> - len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
>> + len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn, 0);
>> + len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn, 0);
>
> FYI, I have a series killing this code. But either way please avoid
> these overly long lines.
>
>> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
>> - void **kaddr, pfn_t *pfn)
>> + void **kaddr, pfn_t *pfn, unsigned long flags)
>
> API design: I'd usually expect flags before output paramters.
>
Thanks for the heads up.
Sure, will break long lines and move 'flags' ahead of output parameters.
thanks,
jane
On 10/21/2021 4:23 AM, Christoph Hellwig wrote:
> On Wed, Oct 20, 2021 at 06:10:56PM -0600, Jane Chu wrote:
>> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> - PFN_PHYS(nr_pages))))
>> + if (unlikely(!(flags & DAXDEV_F_RECOVERY) &&
>> + is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> + PFN_PHYS(nr_pages))))
>
> The indentation here is pretty messed up. Something like this would
> be move normal:
>
> if (unlikely(!(flags & DAXDEV_F_RECOVERY) &&
> is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
> PFN_PHYS(nr_pages)))) {
>
Will do.
> but if we don't really need the unlikely we could do an actually
> readable variant:
>
> if (!(flags & DAXDEV_F_RECOVERY) &&
> is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, PFN_PHYS(nr_pages)))
>
'unlikely' is needed because 'RWF_RECOVERY_DATA' flag is not
recommended for normal preadv2/pwritev2 usage, it's recommended
only if user is aware of or suspect poison in the range.
thanks,
-jane
On 10/21/2021 4:27 AM, Christoph Hellwig wrote:
> On Wed, Oct 20, 2021 at 06:10:57PM -0600, Jane Chu wrote:
>> Prepare dax_copy_to/from_iter() APIs with DAXDEV_F_RECOVERY flag
>> such that when the flag is set, the underlying driver implementation
>> of the APIs may deal with potential poison in a given address
>> range and read partial data or write after clearing poison.
>
> FYI, I've been wondering for a while if we could just kill off these
> methods entirely. Basically the driver interaction consists of two
> parts:
>
> a) wether to use the flushcache/mcsafe variants of the generic helpers
> b) actually doing remapping for device mapper
>
> to me it seems like we should handle a) with flags in dax_operations,
> and only have a remap callback for device mapper. That way we'd avoid
> the indirect calls for the native case, and also avoid tons of
> boilerplate code. "futher decouple DAX from block devices" series
> already massages the device mapper into a form suitable for such
> callbacks.
>
I've looked through your "futher decouple DAX from block devices" series
and likes the use of xarray in place of the host hash list.
Which upstream version is the series based upon?
If it's based on your development repo, I'd be happy to take a clone
and rebase my patches on yours if you provide a link. Please let me
know the best way to cooperate.
That said, I'm unclear at what you're trying to suggest with respect
to the 'DAXDEV_F_RECOVERY' flag. The flag came from upper dax-fs
call stack to the dm target layer, and the dm targets are equipped
with handling pmem driver specific task, so it appears that the flag
would need to be passed down to the native pmem layer, right?
Am I totally missing your point?
thanks,
-jane
On 10/21/2021 4:28 AM, Christoph Hellwig wrote:
>> + if (flags & DAXDEV_F_RECOVERY) {
>> + lead_off = (unsigned long)addr & ~PAGE_MASK;
>> + len = PFN_PHYS(PFN_UP(lead_off + bytes));
>> + if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
>> + if (lead_off || !(PAGE_ALIGNED(bytes))) {
>> + dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
>> + addr, bytes);
>> + return (size_t) -EIO;
>> + }
>> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>> + if (pmem_clear_poison(pmem, pmem_off, bytes) !=
>> + BLK_STS_OK)
>> + return (size_t) -EIO;
>> + }
>
> Shouldn't this just go down in a separe ->clear_poison operation
> to make the whole thing a little easier to follow?
>
Do you mean to lift or refactor the above to a helper function so as
to improve the readability of the code? I can do that, just to confirm.
On the same note, would you prefer to refactor the read path as well?
thanks!
-jane
On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> Looking over the series I have serious doubts that overloading the
> slow path clear poison operation over the fast path read/write
> path is such a great idea.
>
Understood, sounds like a concern on principle. But it seems to me
that the task of recovery overlaps with the normal write operation
on the write part. Without overloading some write operation for
'recovery', I guess we'll need to come up with a new userland
command coupled with a new dax API ->clear_poison and propagate the
new API support to each dm targets that support dax which, again,
is an idea that sounds too bulky if I recall Dan's earlier rejection
correctly.
It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages
with description about the RWF_RECOVERY_DATA flag and specifically not
recommending the flag for normal read/write operation - due to potential
performance impact from searching badblocks in the range.
That said, the badblock searching is turned on only if the pmem device
contains badblocks(i.e. bb->count > 0), otherwise, the performance
impact is next to none. And once the badblock search starts,
it is a binary search over user provided range. The unwanted impact
happens to be the case when the pmem device contains badblocks
that do not fall in the user specified range, and in that case, the
search would end in O(1).
Thanks!
-jane
On 10/21/2021 5:49 PM, Jane Chu wrote:
> On 10/21/2021 4:27 AM, Christoph Hellwig wrote:
>> On Wed, Oct 20, 2021 at 06:10:57PM -0600, Jane Chu wrote:
>>> Prepare dax_copy_to/from_iter() APIs with DAXDEV_F_RECOVERY flag
>>> such that when the flag is set, the underlying driver implementation
>>> of the APIs may deal with potential poison in a given address
>>> range and read partial data or write after clearing poison.
>>
>> FYI, I've been wondering for a while if we could just kill off these
>> methods entirely. Basically the driver interaction consists of two
>> parts:
>>
>> a) wether to use the flushcache/mcsafe variants of the generic helpers
>> b) actually doing remapping for device mapper
>>
>> to me it seems like we should handle a) with flags in dax_operations,
>> and only have a remap callback for device mapper. That way we'd avoid
>> the indirect calls for the native case, and also avoid tons of
>> boilerplate code. "futher decouple DAX from block devices" series
>> already massages the device mapper into a form suitable for such
>> callbacks.
>>
>
> I've looked through your "futher decouple DAX from block devices" series
> and likes the use of xarray in place of the host hash list.
> Which upstream version is the series based upon?
> If it's based on your development repo, I'd be happy to take a clone
> and rebase my patches on yours if you provide a link. Please let me
> know the best way to cooperate.
>
> That said, I'm unclear at what you're trying to suggest with respect
> to the 'DAXDEV_F_RECOVERY' flag. The flag came from upper dax-fs
> call stack to the dm target layer, and the dm targets are equipped
> with handling pmem driver specific task, so it appears that the flag
Apologies. The above line should be
"..., and the dm targets are _not_ equipped with handling pmem driver
specific task,"
-jane
> would need to be passed down to the native pmem layer, right?
> Am I totally missing your point?
>
> thanks,
> -jane
>
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > Looking over the series I have serious doubts that overloading the
> > slow path clear poison operation over the fast path read/write
> > path is such a great idea.
Why would data recovery after a media error ever be considered a
fast/hot path? A normal read/write to a fsdax file would not pass the
flag, which skips the poison checking with whatever MCE consequences
that has, right?
pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that
carefully stepping around dax_direct_access only has to be faster than
restoring from backup, I hope?
> Understood, sounds like a concern on principle. But it seems to me
> that the task of recovery overlaps with the normal write operation
> on the write part. Without overloading some write operation for
> 'recovery', I guess we'll need to come up with a new userland
> command coupled with a new dax API ->clear_poison and propagate the
> new API support to each dm targets that support dax which, again,
> is an idea that sounds too bulky if I recall Dan's earlier rejection
> correctly.
>
> It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages
> with description about the RWF_RECOVERY_DATA flag and specifically not
> recommending the flag for normal read/write operation - due to potential
> performance impact from searching badblocks in the range.
Yes, this will help much. :)
> That said, the badblock searching is turned on only if the pmem device
> contains badblocks(i.e. bb->count > 0), otherwise, the performance
> impact is next to none. And once the badblock search starts,
> it is a binary search over user provided range. The unwanted impact
> happens to be the case when the pmem device contains badblocks
> that do not fall in the user specified range, and in that case, the
> search would end in O(1).
(I wonder about improving badblocks to be less sector-oriented and not
have that weird 16-records-max limit, but that can be a later
optimization.)
--D
> Thanks!
> -jane
>
>
On Fri, Oct 22, 2021 at 12:49:15AM +0000, Jane Chu wrote:
> I've looked through your "futher decouple DAX from block devices" series
> and likes the use of xarray in place of the host hash list.
> Which upstream version is the series based upon?
> If it's based on your development repo, I'd be happy to take a clone
> and rebase my patches on yours if you provide a link. Please let me
> know the best way to cooperate.
It is based on linux-next from when it was posted. A git tree is here:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-block-cleanup
> That said, I'm unclear at what you're trying to suggest with respect
> to the 'DAXDEV_F_RECOVERY' flag. The flag came from upper dax-fs
> call stack to the dm target layer, and the dm targets are equipped
> with handling pmem driver specific task, so it appears that the flag
> would need to be passed down to the native pmem layer, right?
> Am I totally missing your point?
We'll need to pass it through (assuming we want to keep supporting
dm, see the recent discussion with Dan).
FYI, here is a sketch where I'd like to move to, but this isn't properly
tested yet:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
To support something like DAXDEV_F_RECOVERYwe'd need a separate
dax_operations methods. Which to me suggest it probably should be
a different operation (fallocate / ioctl / etc) as Darrick did earlier.
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > Looking over the series I have serious doubts that overloading the
> > slow path clear poison operation over the fast path read/write
> > path is such a great idea.
> >
>
> Understood, sounds like a concern on principle. But it seems to me
> that the task of recovery overlaps with the normal write operation
> on the write part. Without overloading some write operation for
> 'recovery', I guess we'll need to come up with a new userland
> command coupled with a new dax API ->clear_poison and propagate the
> new API support to each dm targets that support dax which, again,
> is an idea that sounds too bulky if I recall Dan's earlier rejection
> correctly.
When I wrote the above I mostly thought about the in-kernel API, that
is use a separate method. But reading your mail and thinking about
this a bit more I'm actually less and less sure that overloading
pwritev2 and preadv2 with this at the syscall level makes sense either.
read/write are our I/O fast path. We really should not overload the
core of the VFS with error recovery for a broken hardware interface.
On Thu, Oct 21, 2021 at 06:58:17PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> > On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > > Looking over the series I have serious doubts that overloading the
> > > slow path clear poison operation over the fast path read/write
> > > path is such a great idea.
>
> Why would data recovery after a media error ever be considered a
> fast/hot path?
Not sure what you're replying to (the text is from me, the mail you are
repling to is fom Jane), but my point is that the read/write
got path should not be overloaded with data recovery.
> A normal read/write to a fsdax file would not pass the
> flag, which skips the poison checking with whatever MCE consequences
> that has, right?
Exactly!
> pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that
> carefully stepping around dax_direct_access only has to be faster than
> restoring from backup, I hope?
Yes. And thus be kept as separate as possible.
On 10/21/2021 10:33 PM, Christoph Hellwig wrote:
> On Fri, Oct 22, 2021 at 12:49:15AM +0000, Jane Chu wrote:
>> I've looked through your "futher decouple DAX from block devices" series
>> and likes the use of xarray in place of the host hash list.
>> Which upstream version is the series based upon?
>> If it's based on your development repo, I'd be happy to take a clone
>> and rebase my patches on yours if you provide a link. Please let me
>> know the best way to cooperate.
>
> It is based on linux-next from when it was posted. A git tree is here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-block-cleanup
>
>> That said, I'm unclear at what you're trying to suggest with respect
>> to the 'DAXDEV_F_RECOVERY' flag. The flag came from upper dax-fs
>> call stack to the dm target layer, and the dm targets are equipped
>> with handling pmem driver specific task, so it appears that the flag
>> would need to be passed down to the native pmem layer, right?
>> Am I totally missing your point?
>
> We'll need to pass it through (assuming we want to keep supporting
> dm, see the recent discussion with Dan).
>
> FYI, here is a sketch where I'd like to move to, but this isn't properly
> tested yet:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> To support something like DAXDEV_F_RECOVERYwe'd need a separate
> dax_operations methods. Which to me suggest it probably should be
> a different operation (fallocate / ioctl / etc) as Darrick did earlier.
>
Thanks for the info!
-jane
On 10/21/2021 10:36 PM, Christoph Hellwig wrote:
> On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
>> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
>>> Looking over the series I have serious doubts that overloading the
>>> slow path clear poison operation over the fast path read/write
>>> path is such a great idea.
>>>
>>
>> Understood, sounds like a concern on principle. But it seems to me
>> that the task of recovery overlaps with the normal write operation
>> on the write part. Without overloading some write operation for
>> 'recovery', I guess we'll need to come up with a new userland
>> command coupled with a new dax API ->clear_poison and propagate the
>> new API support to each dm targets that support dax which, again,
>> is an idea that sounds too bulky if I recall Dan's earlier rejection
>> correctly.
>
> When I wrote the above I mostly thought about the in-kernel API, that
> is use a separate method. But reading your mail and thinking about
> this a bit more I'm actually less and less sure that overloading
> pwritev2 and preadv2 with this at the syscall level makes sense either.
> read/write are our I/O fast path. We really should not overload the
> core of the VFS with error recovery for a broken hardware interface.
>
Thanks - I try to be honest. As far as I can tell, the argument
about the flag is a philosophical argument between two views.
One view assumes design based on perfect hardware, and media error
belongs to the category of brokenness. Another view sees media
error as a build-in hardware component and make design to include
dealing with such errors.
Back when I was fresh out of school, a senior engineer explained
to me about media error might be caused by cosmic ray hitting on
the media at however frequency and at whatever timing. It's an
argument that media error within certain range is a fact of the product,
and to me, it argues for building normal software component with
errors in mind from start. I guess I'm trying to articulate why
it is acceptable to include the RWF_DATA_RECOVERY flag to the
existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
and its slow path (w/ error clearing) is faster than other alternative.
Other alternative being 1 system call to clear the poison, and
another system call to run the fast pwrite for recovery, what
happens if something happened in between?
thanks!
-jane
On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> Thanks - I try to be honest. As far as I can tell, the argument
> about the flag is a philosophical argument between two views.
> One view assumes design based on perfect hardware, and media error
> belongs to the category of brokenness. Another view sees media
> error as a build-in hardware component and make design to include
> dealing with such errors.
No, I don't think so. Bit errors do happen in all media, which is
why devices are built to handle them. It is just the Intel-style
pmem interface to handle them which is completely broken.
> errors in mind from start. I guess I'm trying to articulate why
> it is acceptable to include the RWF_DATA_RECOVERY flag to the
> existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> and its slow path (w/ error clearing) is faster than other alternative.
> Other alternative being 1 system call to clear the poison, and
> another system call to run the fast pwrite for recovery, what
> happens if something happened in between?
Well, my point is doing recovery from bit errors is by definition not
the fast path. Which is why I'd rather keep it away from the pmem
read/write fast path, which also happens to be the (much more important)
non-pmem read/write path.
On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > Thanks - I try to be honest. As far as I can tell, the argument
> > about the flag is a philosophical argument between two views.
> > One view assumes design based on perfect hardware, and media error
> > belongs to the category of brokenness. Another view sees media
> > error as a build-in hardware component and make design to include
> > dealing with such errors.
>
> No, I don't think so. Bit errors do happen in all media, which is
> why devices are built to handle them. It is just the Intel-style
> pmem interface to handle them which is completely broken.
Yeah, I agree, this takes me back to learning how to use DISKEDIT to
work around a hole punched in a file (with a pen!) in the 1980s...
...so would you happen to know if anyone's working on solving this
problem for us by putting the memory controller in charge of dealing
with media errors?
> > errors in mind from start. I guess I'm trying to articulate why
> > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > and its slow path (w/ error clearing) is faster than other alternative.
> > Other alternative being 1 system call to clear the poison, and
> > another system call to run the fast pwrite for recovery, what
> > happens if something happened in between?
>
> Well, my point is doing recovery from bit errors is by definition not
> the fast path. Which is why I'd rather keep it away from the pmem
> read/write fast path, which also happens to be the (much more important)
> non-pmem read/write path.
The trouble is, we really /do/ want to be able to (re)write the failed
area, and we probably want to try to read whatever we can. Those are
reads and writes, not {pre,f}allocation activities. This is where Dave
and I arrived at a month ago.
Unless you'd be ok with a second IO path for recovery where we're
allowed to be slow? That would probably have the same user interface
flag, just a different path into the pmem driver.
Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
speshul options (retry raid mirrors! scrape the film off the disk if
you have to!) you want that can take forever, leaving the fast paths
alone?
(Ok, that wasn't entirely serious...)
--D
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > > Thanks - I try to be honest. As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.
> >
> > No, I don't think so. Bit errors do happen in all media, which is
> > why devices are built to handle them. It is just the Intel-style
> > pmem interface to handle them which is completely broken.
>
> Yeah, I agree, this takes me back to learning how to use DISKEDIT to
> work around a hole punched in a file (with a pen!) in the 1980s...
>
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?
>
> > > errors in mind from start. I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path. Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
>
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can. Those are
> reads and writes, not {pre,f}allocation activities. This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow? That would probably have the same user interface
> flag, just a different path into the pmem driver.
I just don't see how 4 single line branches to propage RWF_RECOVERY
down to the hardware is in any way an imposition on the fast path.
It's no different for passing RWF_HIPRI down to the hardware *in the
fast path* so that the IO runs the hardware in polling mode because
it's faster for some hardware.
IOWs, saying that we shouldn't implement RWF_RECOVERY because it
adds a handful of branches to the fast path is like saying that we
shouldn't implement RWF_HIPRI because it slows down the fast path
for non-polled IO....
Just factor the actual recovery operations out into a separate
function like:
static void noinline
pmem_media_recovery(...)
{
}
pmem_copy_from_iter()
{
if ((unlikely)(flag & RECOVERY))
pmem_media_recovery(...);
return _copy_from_iter_flushcache(addr, bytes, i);
}
....
And there's basically zero overhead in the fast paths for normal
data IO operations, whilst supporting a simple, easy to use data
recovery IO operations for regions that have bad media...
> Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
> speshul options (retry raid mirrors! scrape the film off the disk if
> you have to!) you want that can take forever, leaving the fast paths
> alone?
Why wouldn't we just pass RWF_RECOVERY down to a REQ_RECOVERY bio
flag and have raid devices use that to trigger scraping whatever
they can if there are errors? The io path through the VFS and
filesystem to get the scraped data out to the user *is exactly the
same*, so we're going to have to plumb this functionality into fast
paths *somewhere along the line*.
Really, I think this whole "flag propagation is too much overhead
for the fast path" argument is completely invalid - if 4 conditional
branches is too much overhead to add to the fast path, then we can't
add *anything ever again* to the IO path because it has too much
overhead and impact on the fast path.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 10/28/21 23:59, Dave Chinner wrote:
[...]
>>> Well, my point is doing recovery from bit errors is by definition not
>>> the fast path. Which is why I'd rather keep it away from the pmem
>>> read/write fast path, which also happens to be the (much more important)
>>> non-pmem read/write path.
>>
>> The trouble is, we really /do/ want to be able to (re)write the failed
>> area, and we probably want to try to read whatever we can. Those are
>> reads and writes, not {pre,f}allocation activities. This is where Dave
>> and I arrived at a month ago.
>>
>> Unless you'd be ok with a second IO path for recovery where we're
>> allowed to be slow? That would probably have the same user interface
>> flag, just a different path into the pmem driver.
>
> I just don't see how 4 single line branches to propage RWF_RECOVERY
> down to the hardware is in any way an imposition on the fast path.
> It's no different for passing RWF_HIPRI down to the hardware *in the
> fast path* so that the IO runs the hardware in polling mode because
> it's faster for some hardware.
Not particularly about this flag, but it is expensive. Surely looks
cheap when it's just one feature, but there are dozens of them with
limited applicability, default config kernels are already sluggish
when it comes to really fast devices and it's not getting better.
Also, pretty often every of them will add a bunch of extra checks
to fix something of whatever it would be.
So let's add a bit of pragmatism to the picture, if there is just one
user of a feature but it adds overhead for millions of machines that
won't ever use it, it's expensive.
This one doesn't spill yet into paths I care about, but in general
it'd be great if we start thinking more about such stuff instead of
throwing yet another if into the path, e.g. by shifting the overhead
from linear to a constant for cases that don't use it, for instance
with callbacks or bit masks.
> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> adds a handful of branches the fast path is like saying that we
> shouldn't implement RWF_HIPRI because it slows down the fast path
> for non-polled IO....
>
> Just factor the actual recovery operations out into a separate
> function like:
--
Pavel Begunkov
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
> > > > Well, my point is doing recovery from bit errors is by definition not
> > > > the fast path. Which is why I'd rather keep it away from the pmem
> > > > read/write fast path, which also happens to be the (much more important)
> > > > non-pmem read/write path.
> > >
> > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > area, and we probably want to try to read whatever we can. Those are
> > > reads and writes, not {pre,f}allocation activities. This is where Dave
> > > and I arrived at a month ago.
> > >
> > > Unless you'd be ok with a second IO path for recovery where we're
> > > allowed to be slow? That would probably have the same user interface
> > > flag, just a different path into the pmem driver.
> >
> > I just don't see how 4 single line branches to propage RWF_RECOVERY
> > down to the hardware is in any way an imposition on the fast path.
> > It's no different for passing RWF_HIPRI down to the hardware *in the
> > fast path* so that the IO runs the hardware in polling mode because
> > it's faster for some hardware.
>
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.
So we can't have data recovery because moving fast the only goal?
That's so meta.
--D
> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.
>
> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.
>
> > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > adds a handful of branches the fast path is like saying that we
> > shouldn't implement RWF_HIPRI because it slows down the fast path
> > for non-polled IO....
> >
> > Just factor the actual recovery operations out into a separate
> > function like:
>
> --
> Pavel Begunkov
On 10/29/2021 4:46 AM, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
>>>> Well, my point is doing recovery from bit errors is by definition not
>>>> the fast path. Which is why I'd rather keep it away from the pmem
>>>> read/write fast path, which also happens to be the (much more
>>>> important)
>>>> non-pmem read/write path.
>>>
>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>> area, and we probably want to try to read whatever we can. Those are
>>> reads and writes, not {pre,f}allocation activities. This is where Dave
>>> and I arrived at a month ago.
>>>
>>> Unless you'd be ok with a second IO path for recovery where we're
>>> allowed to be slow? That would probably have the same user interface
>>> flag, just a different path into the pmem driver.
>>
>> I just don't see how 4 single line branches to propage RWF_RECOVERY
>> down to the hardware is in any way an imposition on the fast path.
>> It's no different for passing RWF_HIPRI down to the hardware *in the
>> fast path* so that the IO runs the hardware in polling mode because
>> it's faster for some hardware.
>
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.
>
> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.
>
> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.
May I ask what solution would you propose for pmem recovery that satisfy
the requirement of binding poison-clearing and write in one operation?
thanks!
-jane
>
>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>> adds a handful of branches     the fast path is like saying that we
>> shouldn't implement RWF_HIPRI because it slows down the fast path
>> for non-polled IO....
>>
>> Just factor the actual recovery operations out into a separate
>> function like:
>
On 10/29/21 17:57, Darrick J. Wong wrote:
> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>> On 10/28/21 23:59, Dave Chinner wrote:
>> [...]
>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>> the fast path. Which is why I'd rather keep it away from the pmem
>>>>> read/write fast path, which also happens to be the (much more important)
>>>>> non-pmem read/write path.
>>>>
>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>> area, and we probably want to try to read whatever we can. Those are
>>>> reads and writes, not {pre,f}allocation activities. This is where Dave
>>>> and I arrived at a month ago.
>>>>
>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>> allowed to be slow? That would probably have the same user interface
>>>> flag, just a different path into the pmem driver.
>>>
>>> I just don't see how 4 single line branches to propage RWF_RECOVERY
>>> down to the hardware is in any way an imposition on the fast path.
>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>> fast path* so that the IO runs the hardware in polling mode because
>>> it's faster for some hardware.
>>
>> Not particularly about this flag, but it is expensive. Surely looks
>> cheap when it's just one feature, but there are dozens of them with
>> limited applicability, default config kernels are already sluggish
>> when it comes to really fast devices and it's not getting better.
>> Also, pretty often every of them will add a bunch of extra checks
>> to fix something of whatever it would be.
>
> So we can't have data recovery because moving fast the only goal?
That's not what was said and you missed the point, which was in
the rest of the message.
>
> That's so meta.
>
> --D
>
>> So let's add a bit of pragmatism to the picture, if there is just one
>> user of a feature but it adds overhead for millions of machines that
>> won't ever use it, it's expensive.
>>
>> This one doesn't spill yet into paths I care about, but in general
>> it'd be great if we start thinking more about such stuff instead of
>> throwing yet another if into the path, e.g. by shifting the overhead
>> from linear to a constant for cases that don't use it, for instance
>> with callbacks or bit masks.
>>
>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>>> adds a handful of branches the fast path is like saying that we
>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>> for non-polled IO....
>>>
>>> Just factor the actual recovery operations out into a separate
>>> function like:
>>
>> --
>> Pavel Begunkov
--
Pavel Begunkov
On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote:
> On 10/29/21 17:57, Darrick J. Wong wrote:
> > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> > > On 10/28/21 23:59, Dave Chinner wrote:
> > > [...]
> > > > > > Well, my point is doing recovery from bit errors is by definition not
> > > > > > the fast path. Which is why I'd rather keep it away from the pmem
> > > > > > read/write fast path, which also happens to be the (much more important)
> > > > > > non-pmem read/write path.
> > > > >
> > > > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > > > area, and we probably want to try to read whatever we can. Those are
> > > > > reads and writes, not {pre,f}allocation activities. This is where Dave
> > > > > and I arrived at a month ago.
> > > > >
> > > > > Unless you'd be ok with a second IO path for recovery where we're
> > > > > allowed to be slow? That would probably have the same user interface
> > > > > flag, just a different path into the pmem driver.
> > > >
> > > > I just don't see how 4 single line branches to propage RWF_RECOVERY
> > > > down to the hardware is in any way an imposition on the fast path.
> > > > It's no different for passing RWF_HIPRI down to the hardware *in the
> > > > fast path* so that the IO runs the hardware in polling mode because
> > > > it's faster for some hardware.
> > >
> > > Not particularly about this flag, but it is expensive. Surely looks
> > > cheap when it's just one feature, but there are dozens of them with
> > > limited applicability, default config kernels are already sluggish
> > > when it comes to really fast devices and it's not getting better.
> > > Also, pretty often every of them will add a bunch of extra checks
> > > to fix something of whatever it would be.
> >
> > So we can't have data recovery because moving fast the only goal?
>
> That's not what was said and you missed the point, which was in
> the rest of the message.
...whatever point you were trying to make was so vague that it was
totally uninformative and I completely missed it.
What does "callbacks or bit masks" mean, then, specifically? How
*exactly* would you solve the problem that Jane is seeking to solve by
using callbacks?
Actually, you know what? I'm so fed up with every single DAX
conversation turning into a ****storm of people saying NO NO NO NO NO NO
NO NO to everything proposed that I'm actually going to respond to
whatever I think your point is, and you can defend whatever I come up
with.
> >
> > That's so meta.
> >
> > --D
> >
> > > So let's add a bit of pragmatism to the picture, if there is just one
> > > user of a feature but it adds overhead for millions of machines that
> > > won't ever use it, it's expensive.
Errors are infrequent, and since everything is cloud-based and disposble
now, we can replace error handling with BUG_ON(). This will reduce code
complexity, which will reduce code size, and improve icache usage. Win!
> > > This one doesn't spill yet into paths I care about,
...so you sail in and say 'no' even though you don't yet care...
> > > but in general
> > > it'd be great if we start thinking more about such stuff instead of
> > > throwing yet another if into the path, e.g. by shifting the overhead
> > > from linear to a constant for cases that don't use it, for instance
> > > with callbacks
Ok so after userspace calls into pread to access a DAX file, hits the
poisoned memory line and the machinecheck fires, what then? I guess we
just have to figure out how to get from the MCA handler (assuming the
machine doesn't just reboot instantly) all the way back into memcpy?
Ok, you're in charge of figuring that out because I don't know how to do
that.
Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a
callback that happens after memory controller realizes it's lost
something, kicks a notification to the OS kernel through ACPI, and the
kernel signal userspace to do something about it. Yeah, that's dumb
since spinning rust already does all this for us, but that's pmem.
> > > or bit masks.
WTF does this even mean?
--D
> > >
> > > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > > > adds a handful of branches the fast path is like saying that we
> > > > shouldn't implement RWF_HIPRI because it slows down the fast path
> > > > for non-polled IO....
> > > >
> > > > Just factor the actual recovery operations out into a separate
> > > > function like:
> > >
> > > --
> > > Pavel Begunkov
>
> --
> Pavel Begunkov
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
> > > > Well, my point is doing recovery from bit errors is by definition not
> > > > the fast path. Which is why I'd rather keep it away from the pmem
> > > > read/write fast path, which also happens to be the (much more important)
> > > > non-pmem read/write path.
> > >
> > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > area, and we probably want to try to read whatever we can. Those are
> > > reads and writes, not {pre,f}allocation activities. This is where Dave
> > > and I arrived at a month ago.
> > >
> > > Unless you'd be ok with a second IO path for recovery where we're
> > > allowed to be slow? That would probably have the same user interface
> > > flag, just a different path into the pmem driver.
> >
> > I just don't see how 4 single line branches to propage RWF_RECOVERY
> > down to the hardware is in any way an imposition on the fast path.
> > It's no different for passing RWF_HIPRI down to the hardware *in the
> > fast path* so that the IO runs the hardware in polling mode because
> > it's faster for some hardware.
>
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.
>
> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.
Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
past this? I'll quote what I said again, because I've already
addressed this argument to point out how silly it is:
> > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > adds a handful of branches to the fast path is like saying that we
> > shouldn't implement RWF_HIPRI because it slows down the fast path
> > for non-polled IO....
RWF_HIPRI functionality represents a *tiny* niche in the wider
Linux ecosystem, so by your reasoning it is too expensive to
implement because millions (billions!) of machines don't need or use
it. Do you now see how silly your argument is?
Seriously, this "optimise the IO fast path at the cost of everything
else" craziness has gotten out of hand. Nobody in the filesystem or
application world cares if you can do 10M IOPS per core when all the
CPU is doing is sitting in a tight loop inside the kernel repeatedly
overwriting data in the same memory buffers, essentially tossing the
old away the data without ever accessing it or doing anything with
it. Such speed racer games are *completely meaningless* as an
optimisation goal - it's what we've called "benchmarketing" for a
couple of decades now.
If all we focus on is bragging rights because "bigger number is
always better", then we'll end up with iand IO path that looks like
the awful mess that the fs/direct-io.c turned into. That ended up
being hyper-optimised for CPU performance right down to single
instructions and cacheline load orders that the code became
extremely fragile and completely unmaintainable.
We ended up *reimplementing the direct IO code from scratch* so that
XFS could build and submit direct IO smarter and faster because it
simply couldn't be done to the old code. That's how iomap came
about, and without *any optimisation at all* iomap was 20-30% faster
than the old, hyper-optimised fs/direct-io.c code. IOWs, we always
knew we could do direct IO faster than fs/direct-io.c, but we
couldn't make the fs/direct-io.c faster because of the
hyper-optimisation of the code paths made it impossible to modify
and maintain.
The current approach of hyper-optimising the IO path for maximum
per-core IOPS at the expensive of everything else has been proven in
the past to be exactly the wrong approach to be taking for IO path
development. Yes, we need to be concerned about performance and work
to improve it, but we should not be doing that at the cost of
everything else that the IO stack needs to be able to do.
Fundamentally, optimisation is something we do *after* we provide
the functionality that is required; using "fast path optimisation"
as a blunt force implement to prevent new features from being
implemented is just ... obnoxious.
> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.
This is orthogonal to providing data recovery functionality.
If the claims that flag propagation is too expensive are true, then
fixing this problem this will also improve RWF_HIPRI performance
regardless of whether RWF_DATA_RECOVERY exists or not...
IOWs, *if* there is a fast path performance degradation as a result
of flag propagation, then *go measure it* and show us how much
impact it has on _real world applications_. *Show us the numbers*
and document how much each additional flag propagation actually
costs so we can talk about whether it is acceptible, mitigation
strategies and/or alternative implementations. Flag propagation
overhead is just not a valid reason for preventing us adding new
flags to the IO path. Fix the flag propagation overhead if it's a
problem for you, don't use it as an excuse for preventing people
from adding new functionality that uses flag propagation...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 10/29/21 23:32, Dave Chinner wrote:
> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>> On 10/28/21 23:59, Dave Chinner wrote:
>> [...]
>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>> the fast path. Which is why I'd rather keep it away from the pmem
>>>>> read/write fast path, which also happens to be the (much more important)
>>>>> non-pmem read/write path.
>>>>
>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>> area, and we probably want to try to read whatever we can. Those are
>>>> reads and writes, not {pre,f}allocation activities. This is where Dave
>>>> and I arrived at a month ago.
>>>>
>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>> allowed to be slow? That would probably have the same user interface
>>>> flag, just a different path into the pmem driver.
>>>
>>> I just don't see how 4 single line branches to propage RWF_RECOVERY
>>> down to the hardware is in any way an imposition on the fast path.
>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>> fast path* so that the IO runs the hardware in polling mode because
>>> it's faster for some hardware.
>>
>> Not particularly about this flag, but it is expensive. Surely looks
>> cheap when it's just one feature, but there are dozens of them with
>> limited applicability, default config kernels are already sluggish
>> when it comes to really fast devices and it's not getting better.
>> Also, pretty often every of them will add a bunch of extra checks
>> to fix something of whatever it would be.
>>
>> So let's add a bit of pragmatism to the picture, if there is just one
>> user of a feature but it adds overhead for millions of machines that
>> won't ever use it, it's expensive.
>
> Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
> past this? I'll quote what I said again, because I've already
> addressed this argument to point out how silly it is:
And you almost got to the initial point in your penult paragraph. A
single if for a single flag is not an issue, what is the problem is
when there are dozens of them and the overhead for it is not isolated,
so the kernel has to jump through dozens of those.
And just to be clear I'll outline again, that's a general problem,
I have no relation to the layers touched and it's up to relevant
people, obviously. Even though I'd expect but haven't found the flag
being rejected in other places, but well I may have missed something.
>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>>> adds a handful of branches to the fast path is like saying that we
>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>> for non-polled IO....
>
> RWF_HIPRI functionality represents a *tiny* niche in the wider
> Linux ecosystem, so by your reasoning it is too expensive to
> implement because millions (billions!) of machines don't need or use
> it. Do you now see how silly your argument is?
>
> Seriously, this "optimise the IO fast path at the cost of everything
> else" craziness has gotten out of hand. Nobody in the filesystem or
> application world cares if you can do 10M IOPS per core when all the
> CPU is doing is sitting in a tight loop inside the kernel repeatedly
> overwriting data in the same memory buffers, essentially tossing the
> old away the data without ever accessing it or doing anything with
> it. Such speed racer games are *completely meaningless* as an
> optimisation goal - it's what we've called "benchmarketing" for a
> couple of decades now.
10M you mentioned is just a way to measure, there is nothing wrong
with it. And considering that there are enough of users considering
or already switching to spdk because of performance, the approach
is not wrong. And it goes not only for IO polling, normal irq IO
suffers from the same problems.
A related story is that this number is for a pretty reduced config,
it'll go down with a more default-ish kernel.
> If all we focus on is bragging rights because "bigger number is
> always better", then we'll end up with iand IO path that looks like
> the awful mess that the fs/direct-io.c turned into. That ended up
> being hyper-optimised for CPU performance right down to single
> instructions and cacheline load orders that the code became
> extremely fragile and completely unmaintainable.
>
> We ended up *reimplementing the direct IO code from scratch* so that
> XFS could build and submit direct IO smarter and faster because it
> simply couldn't be done to the old code. That's how iomap came
> about, and without *any optimisation at all* iomap was 20-30% faster
> than the old, hyper-optimised fs/direct-io.c code. IOWs, we always
> knew we could do direct IO faster than fs/direct-io.c, but we
> couldn't make the fs/direct-io.c faster because of the
> hyper-optimisation of the code paths made it impossible to modify
> and maintain.> The current approach of hyper-optimising the IO path for maximum
> per-core IOPS at the expensive of everything else has been proven in
> the past to be exactly the wrong approach to be taking for IO path
> development. Yes, we need to be concerned about performance and work
> to improve it, but we should not be doing that at the cost of
> everything else that the IO stack needs to be able to do.
And iomap is great, what you described is a good typical example
of unmaintainable code. I may get wrong what you exactly refer
to, but I don't see maintainability not being considered.
Even more interesting to notice that more often than not extra
features (and flags) almost always hurt maintainability of the
kernel, but then other benefits outweigh (hopefully).
> Fundamentally, optimisation is something we do *after* we provide
> the functionality that is required; using "fast path optimisation"
> as a blunt force implement to prevent new features from being
> implemented is just ... obnoxious.
>
>> This one doesn't spill yet into paths I care about, but in general
>> it'd be great if we start thinking more about such stuff instead of
>> throwing yet another if into the path, e.g. by shifting the overhead
>> from linear to a constant for cases that don't use it, for instance
>> with callbacks or bit masks.
>
> This is orthogonal to providing data recovery functionality.
> If the claims that flag propagation is too expensive are true, then
> fixing this problem this will also improve RWF_HIPRI performance
> regardless of whether RWF_DATA_RECOVERY exists or not...
>
> IOWs, *if* there is a fast path performance degradation as a result
> of flag propagation, then *go measure it* and show us how much
> impact it has on _real world applications_. *Show us the numbers*
> and document how much each additional flag propagation actually
> costs so we can talk about whether it is acceptible, mitigation
> strategies and/or alternative implementations. Flag propagation
> overhead is just not a valid reason for preventing us adding new
> flags to the IO path. Fix the flag propagation overhead if it's a
> problem for you, don't use it as an excuse for preventing people
> from adding new functionality that uses flag propagation...
--
Pavel Begunkov
On 10/29/21 21:08, Darrick J. Wong wrote:
> On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote:
>> On 10/29/21 17:57, Darrick J. Wong wrote:
>>> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>>>> On 10/28/21 23:59, Dave Chinner wrote:
>>>> [...]
>>>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>>>> the fast path. Which is why I'd rather keep it away from the pmem
>>>>>>> read/write fast path, which also happens to be the (much more important)
>>>>>>> non-pmem read/write path.
>>>>>>
>>>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>>>> area, and we probably want to try to read whatever we can. Those are
>>>>>> reads and writes, not {pre,f}allocation activities. This is where Dave
>>>>>> and I arrived at a month ago.
>>>>>>
>>>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>>>> allowed to be slow? That would probably have the same user interface
>>>>>> flag, just a different path into the pmem driver.
>>>>>
>>>>> I just don't see how 4 single line branches to propage RWF_RECOVERY
>>>>> down to the hardware is in any way an imposition on the fast path.
>>>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>>>> fast path* so that the IO runs the hardware in polling mode because
>>>>> it's faster for some hardware.
>>>>
>>>> Not particularly about this flag, but it is expensive. Surely looks
>>>> cheap when it's just one feature, but there are dozens of them with
>>>> limited applicability, default config kernels are already sluggish
>>>> when it comes to really fast devices and it's not getting better.
>>>> Also, pretty often every of them will add a bunch of extra checks
>>>> to fix something of whatever it would be.
>>>
>>> So we can't have data recovery because moving fast the only goal?
>>
>> That's not what was said and you missed the point, which was in
>> the rest of the message.
>
> ...whatever point you were trying to make was so vague that it was
> totally uninformative and I completely missed it.
>
> What does "callbacks or bit masks" mean, then, specifically? How
> *exactly* would you solve the problem that Jane is seeking to solve by
> using callbacks?
>
> Actually, you know what? I'm so fed up with every single DAX
> conversation turning into a ****storm of people saying NO NO NO NO NO NO
> NO NO to everything proposed that I'm actually going to respond to
> whatever I think your point is, and you can defend whatever I come up
> with.
Interesting, I don't want to break it to you but nobody is going to
defend against yours made up out of thin air interpretations. I think
there is one thing we can relate, I wonder as well what the bloody
hell that opus of yours was
>
>>>
>>> That's so meta.
>>>
>>> --D
>>>
>>>> So let's add a bit of pragmatism to the picture, if there is just one
>>>> user of a feature but it adds overhead for millions of machines that
>>>> won't ever use it, it's expensive.
>
> Errors are infrequent, and since everything is cloud-based and disposble
> now, we can replace error handling with BUG_ON(). This will reduce code
> complexity, which will reduce code size, and improve icache usage. Win!
>
>>>> This one doesn't spill yet into paths I care about,
>
> ...so you sail in and say 'no' even though you don't yet care...
>
>>>> but in general
>>>> it'd be great if we start thinking more about such stuff instead of
>>>> throwing yet another if into the path, e.g. by shifting the overhead
>>>> from linear to a constant for cases that don't use it, for instance
>>>> with callbacks
>
> Ok so after userspace calls into pread to access a DAX file, hits the
> poisoned memory line and the machinecheck fires, what then? I guess we
> just have to figure out how to get from the MCA handler (assuming the
> machine doesn't just reboot instantly) all the way back into memcpy?
> Ok, you're in charge of figuring that out because I don't know how to do
> that.
>
> Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a
> callback that happens after memory controller realizes it's lost
> something, kicks a notification to the OS kernel through ACPI, and the
> kernel signal userspace to do something about it. Yeah, that's dumb
> since spinning rust already does all this for us, but that's pmem.
>
>>>> or bit masks.
>
> WTF does this even mean?
>
> --D
>
>>>>
>>>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>>>>> adds a handful of branches the fast path is like saying that we
>>>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>>>> for non-polled IO....
>>>>>
>>>>> Just factor the actual recovery operations out into a separate
>>>>> function like:
--
Pavel Begunkov
On Sun, Oct 31, 2021 at 01:19:48PM +0000, Pavel Begunkov wrote:
> On 10/29/21 23:32, Dave Chinner wrote:
> > Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
> > past this? I'll quote what I said again, because I've already
> > addressed this argument to point out how silly it is:
>
> And you almost got to the initial point in your penult paragraph. A
> single if for a single flag is not an issue, what is the problem is
> when there are dozens of them and the overhead for it is not isolated,
> so the kernel has to jump through dozens of those.
This argument can be used to reject *ANY* new feature. For example, by
using your argument, we should have rejected the addition of IOCB_WAITQ
because it penalises the vast majority of IOs which do not use it.
But we didn't. Because we see that while it may not be of use to US
today, it's a generally useful feature for Linux to support. You say
yourself that this feature doesn't slow down your use case, so why are
you spending so much time and energy annoying the people who actually
want to use it?
Seriously. Stop arguing about something you actually don't care about.
You're just making Linux less fun to work on.
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?
The only one who could know is Intel..
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can. Those are
> reads and writes, not {pre,f}allocation activities. This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow? That would probably have the same user interface
> flag, just a different path into the pmem driver.
Which is fine with me. If you look at the API here we do have the
RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
API which then gets special casing over three methods.
And while Pavel pointed out that he and Jens are now optimizing for
single branches like this. I think this actually is silly and it is
not my point.
The point is that the DAX in-kernel API is a mess, and before we make
it even worse we need to sort it first. What is directly relevant
here is that the copy_from_iter and copy_to_iter APIs do not make
sense. Most of the DAX API is based around getting a memory mapping
using ->direct_access, it is just the read/write path which is a slow
path that actually uses this. I have a very WIP patch series to try
to sort this out here:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
But back to this series. The basic DAX model is that the callers gets a
memory mapping an just works on that, maybe calling a sync after a write
in a few cases. So any kind of recovery really needs to be able to
work with that model as going forward the copy_to/from_iter path will
be used less and less. i.e. file systems can and should use
direct_access directly instead of using the block layer implementation
in the pmem driver. As an example the dm-writecache driver, the pending
bcache nvdimm support and the (horribly and out of tree) nova file systems
won't even use this path. We need to find a way to support recovery
for them. And overloading it over the read/write path which is not
the main path for DAX, but the absolutely fast path for 99% of the
kernel users is a horrible idea.
So how can we work around the horrible nvdimm design for data recovery
in a way that:
a) actually works with the intended direct memory map use case
b) doesn't really affect the normal kernel too much
?
On Wed, Oct 27, 2021 at 5:25 PM Darrick J. Wong <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > > Thanks - I try to be honest. As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.
> >
> > No, I don't think so. Bit errors do happen in all media, which is
> > why devices are built to handle them. It is just the Intel-style
> > pmem interface to handle them which is completely broken.
>
> Yeah, I agree, this takes me back to learning how to use DISKEDIT to
> work around a hole punched in a file (with a pen!) in the 1980s...
>
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?
What are you guys going on about? ECC memory corrects single-bit
errors in the background, multi-bit errors cause the memory controller
to signal that data is gone. This is how ECC memory has worked since
forever. Typically the kernel's memory-failure path is just throwing
away pages that signal data loss. Throwing away pmem pages is harder
because unlike DRAM the physical address of the page matters to upper
layers.
>
> > > errors in mind from start. I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path. Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
>
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can. Those are
> reads and writes, not {pre,f}allocation activities. This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow? That would probably have the same user interface
> flag, just a different path into the pmem driver.
>
> Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
> speshul options (retry raid mirrors! scrape the film off the disk if
> you have to!) you want that can take forever, leaving the fast paths
> alone?
I am still failing to see the technical argument for why
RWF_RECOVER_DATA significantly impacts the fast path, and why you
think this is somehow specific to pmem. In fact the pmem effort is
doing the responsible thing and trying to plumb this path while other
storage drivers just seem to be pretending that memory errors never
happen.
On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > Thanks - I try to be honest. As far as I can tell, the argument
> > about the flag is a philosophical argument between two views.
> > One view assumes design based on perfect hardware, and media error
> > belongs to the category of brokenness. Another view sees media
> > error as a build-in hardware component and make design to include
> > dealing with such errors.
>
> No, I don't think so. Bit errors do happen in all media, which is
> why devices are built to handle them. It is just the Intel-style
> pmem interface to handle them which is completely broken.
No, any media can report checksum / parity errors. NVME also seems to
do a poor job with multi-bit ECC errors consumed from DRAM. There is
nothing "pmem" or "Intel" specific here.
> > errors in mind from start. I guess I'm trying to articulate why
> > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > and its slow path (w/ error clearing) is faster than other alternative.
> > Other alternative being 1 system call to clear the poison, and
> > another system call to run the fast pwrite for recovery, what
> > happens if something happened in between?
>
> Well, my point is doing recovery from bit errors is by definition not
> the fast path. Which is why I'd rather keep it away from the pmem
> read/write fast path, which also happens to be the (much more important)
> non-pmem read/write path.
I would expect this interface to be useful outside of pmem as a
"failfast" or "try harder to recover" flag for reading over media
errors.
On Mon, Nov 1, 2021 at 11:19 PM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> > ...so would you happen to know if anyone's working on solving this
> > problem for us by putting the memory controller in charge of dealing
> > with media errors?
>
> The only one who could know is Intel..
>
> > The trouble is, we really /do/ want to be able to (re)write the failed
> > area, and we probably want to try to read whatever we can. Those are
> > reads and writes, not {pre,f}allocation activities. This is where Dave
> > and I arrived at a month ago.
> >
> > Unless you'd be ok with a second IO path for recovery where we're
> > allowed to be slow? That would probably have the same user interface
> > flag, just a different path into the pmem driver.
>
> Which is fine with me. If you look at the API here we do have the
> RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> API which then gets special casing over three methods.
>
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this. I think this actually is silly and it is
> not my point.
>
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it first. What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense. Most of the DAX API is based around getting a memory mapping
> using ->direct_access, it is just the read/write path which is a slow
> path that actually uses this. I have a very WIP patch series to try
> to sort this out here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> But back to this series. The basic DAX model is that the callers gets a
> memory mapping an just works on that, maybe calling a sync after a write
> in a few cases. So any kind of recovery really needs to be able to
> work with that model as going forward the copy_to/from_iter path will
> be used less and less. i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver. As an example the dm-writecache driver, the pending
> bcache nvdimm support and the (horribly and out of tree) nova file systems
> won't even use this path. We need to find a way to support recovery
> for them. And overloading it over the read/write path which is not
> the main path for DAX, but the absolutely fast path for 99% of the
> kernel users is a horrible idea.
>
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
>
> a) actually works with the intended direct memory map use case
> b) doesn't really affect the normal kernel too much
>
> ?
Ok, now I see where you are going, but I don't see line of sight to
something better than RWF_RECOVER_DATA.
This goes back to one of the original DAX concerns of wanting a kernel
library for coordinating PMEM mmap I/O vs leaving userspace to wrap
PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
has this error-handling-API issue whether it is a DAX mapping or not.
I.e. a memory failure in page cache is going to signal the process the
same way and it will need to fall back to something other than mmap
I/O to make forward progress. This is not a PMEM, Intel or even x86
problem, it's a generic CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE problem.
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
and then rely on readv(2)/writev(2) to recover. Do you see a readily
available way to improve upon that model without CPU instruction
changes? Even with CPU instructions changes, do you think it could
improve much upon the model of interrupting the process when a load
instruction aborts?
I do agree with you that DAX needs to separate itself from block, but
I don't think it follows that DAX also needs to separate itself from
readv/writev for when a kernel slow-path needs to get involved because
mmap I/O (just CPU instructions) does not have the proper semantics.
Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
those semantics in new / augmented CPU instructions you will likely
not get all of them to move and certainly not in any near term
timeframe, so the kernel path will be around indefinitely.
Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
storage besides PMEM and helps storage-drivers do better than large
blast radius "I/O error" completions with no other recourse.
On Tue, Nov 02, 2021 at 09:03:55AM -0700, Dan Williams wrote:
> > why devices are built to handle them. It is just the Intel-style
> > pmem interface to handle them which is completely broken.
>
> No, any media can report checksum / parity errors. NVME also seems to
> do a poor job with multi-bit ECC errors consumed from DRAM. There is
> nothing "pmem" or "Intel" specific here.
If you do get data corruption from NVMe (which yes can happen despite
the typical very good UBER rate) you just write over it again. You
don't need to magically whack the underlying device. Same for hard
drives.
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path. Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
>
> I would expect this interface to be useful outside of pmem as a
> "failfast" or "try harder to recover" flag for reading over media
> errors.
Maybe we need to sit down and define useful semantics then? The problem
on the write side isn't really that the behavior with the flag is
undefined, it is more that writes without the flag have horrible
semantics if they don't just clear the error.
On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote:
> This goes back to one of the original DAX concerns of wanting a kernel
> library for coordinating PMEM mmap I/O vs leaving userspace to wrap
> PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
> has this error-handling-API issue whether it is a DAX mapping or not.
Semantics of writes through shared mmaps are a nightmare. Agreed,
including agreeing that this is neither new nor pmem specific. But
it also has absolutely nothing to do with the new RWF_ flag.
> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
> receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
> and then rely on readv(2)/writev(2) to recover. Do you see a readily
> available way to improve upon that model without CPU instruction
> changes? Even with CPU instructions changes, do you think it could
> improve much upon the model of interrupting the process when a load
> instruction aborts?
The "only" think we need is something like the exception table we
use in the kernel for the uaccess helpers (and the new _nofault
kernel access helper). But I suspect refitting that into userspace
environments is probably non-trivial.
> I do agree with you that DAX needs to separate itself from block, but
> I don't think it follows that DAX also needs to separate itself from
> readv/writev for when a kernel slow-path needs to get involved because
> mmap I/O (just CPU instructions) does not have the proper semantics.
> Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
> those semantics in new / augmented CPU instructions you will likely
> not get all of them to move and certainly not in any near term
> timeframe, so the kernel path will be around indefinitely.
I think you misunderstood me. I don't think pmem needs to be
decoupled from the read/write path. But I'm very skeptical of adding
a new flag to the common read/write path for the special workaround
that a plain old write will not actually clear errors unlike every
other store interfac.
> Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
> storage besides PMEM and helps storage-drivers do better than large
> blast radius "I/O error" completions with no other recourse.
How?
On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
>> ...so would you happen to know if anyone's working on solving this
>> problem for us by putting the memory controller in charge of dealing
>> with media errors?
>
> The only one who could know is Intel..
>
>> The trouble is, we really /do/ want to be able to (re)write the failed
>> area, and we probably want to try to read whatever we can. Those are
>> reads and writes, not {pre,f}allocation activities. This is where Dave
>> and I arrived at a month ago.
>>
>> Unless you'd be ok with a second IO path for recovery where we're
>> allowed to be slow? That would probably have the same user interface
>> flag, just a different path into the pmem driver.
>
> Which is fine with me. If you look at the API here we do have the
> RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> API which then gets special casing over three methods.
>
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this. I think this actually is silly and it is
> not my point.
>
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it first. What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense. Most of the DAX API is based around getting a memory mapping
> using ->direct_access, it is just the read/write path which is a slow
> path that actually uses this. I have a very WIP patch series to try
> to sort this out here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> But back to this series. The basic DAX model is that the callers gets a
> memory mapping an just works on that, maybe calling a sync after a write
> in a few cases. So any kind of recovery really needs to be able to
> work with that model as going forward the copy_to/from_iter path will
> be used less and less. i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver. As an example the dm-writecache driver, the pending
> bcache nvdimm support and the (horribly and out of tree) nova file systems
> won't even use this path. We need to find a way to support recovery
> for them. And overloading it over the read/write path which is not
> the main path for DAX, but the absolutely fast path for 99% of the
> kernel users is a horrible idea.
>
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
>
> a) actually works with the intended direct memory map use case
> b) doesn't really affect the normal kernel too much
>
> ?
>
This is clearer, I've looked at your 'dax-devirtualize' patch which
removes pmem_copy_to/from_iter, and as you mentioned before,
a separate API for poison-clearing is needed. So how about I go ahead
rebase my earlier patch
https://lore.kernel.org/lkml/[email protected]/
on 'dax-devirtualize', provide dm support for clear-poison?
That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
and we resolve the urgent pmem poison-clearing issue?
Dan, are you okay with this? I am getting pressure from our customers
who are basically stuck at the moment.
thanks!
-jane
On Wed, Nov 3, 2021 at 9:58 AM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote:
> > This goes back to one of the original DAX concerns of wanting a kernel
> > library for coordinating PMEM mmap I/O vs leaving userspace to wrap
> > PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
> > has this error-handling-API issue whether it is a DAX mapping or not.
>
> Semantics of writes through shared mmaps are a nightmare. Agreed,
> including agreeing that this is neither new nor pmem specific. But
> it also has absolutely nothing to do with the new RWF_ flag.
Ok.
> > CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
> > receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
> > and then rely on readv(2)/writev(2) to recover. Do you see a readily
> > available way to improve upon that model without CPU instruction
> > changes? Even with CPU instructions changes, do you think it could
> > improve much upon the model of interrupting the process when a load
> > instruction aborts?
>
> The "only" think we need is something like the exception table we
> use in the kernel for the uaccess helpers (and the new _nofault
> kernel access helper). But I suspect refitting that into userspace
> environments is probably non-trivial.
Is the exception table requirement not already fulfilled by:
sigaction(SIGBUS, &act, 0);
...
if (sigsetjmp(sj_env, 1)) {
...
...but yes, that's awkward when all you want is an error return from a
copy operation.
For _nofault I'll note that on the kernel side Linus was explicit
about not mixing fault handling and memory error exception handling in
the same accessor. That's why copy_mc_to_kernel() and
copy_{to,from}_kernel_nofault() are distinct. I only say that to probe
deeper about what a "copy_mc()" looks like in userspace? Perhaps an
interface to suppress SIGBUS generation and register a ring buffer
that gets filled with error-event records encountered over a given
MMAP I/O code sequence?
> > I do agree with you that DAX needs to separate itself from block, but
> > I don't think it follows that DAX also needs to separate itself from
> > readv/writev for when a kernel slow-path needs to get involved because
> > mmap I/O (just CPU instructions) does not have the proper semantics.
> > Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
> > those semantics in new / augmented CPU instructions you will likely
> > not get all of them to move and certainly not in any near term
> > timeframe, so the kernel path will be around indefinitely.
>
> I think you misunderstood me. I don't think pmem needs to be
> decoupled from the read/write path. But I'm very skeptical of adding
> a new flag to the common read/write path for the special workaround
> that a plain old write will not actually clear errors unlike every
> other store interfac.
Ah, ok, yes, I agree with you there that needing to redirect writes to
a platform firmware call to clear errors, and notify the device that
its error-list has changed is exceedingly awkward. That said, even if
the device-side error-list auto-updated on write (like the promise of
MOVDIR64B) there's still the question about when to do management on
the software error lists in the driver and/or filesytem. I.e. given
that XFS at least wants to be aware of the error lists for block
allocation and "list errors" type features. More below...
> > Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
> > storage besides PMEM and helps storage-drivers do better than large
> > blast radius "I/O error" completions with no other recourse.
>
> How?
Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
interface for the filesystem to request "try harder" to return data?
If the device has a recovery slow-path, or error tracking granularity
is smaller than the I/O size, then RWF_RECOVER_DATA gives the
device/driver leeway to do better than the typical fast path. For
writes though, I can only come up with the use case of this being a
signal to the driver to take the opportunity to do error-list
management relative to the incoming write data.
However, if signaling that "now is the time to update error-lists" is
the requirement, I imagine the @kaddr returned from
dax_direct_access() could be made to point to an unmapped address
representing the poisoned page. Then, arrange for a pmem-driver fault
handler to emulate the copy operation and do the slow path updates
that would otherwise have been gated by RWF_RECOVER_DATA.
Although, I'm not excited about teaching every PMEM arch's fault
handler about this new source of kernel faults. Other ideas?
RWF_RECOVER_DATA still seems the most viable / cleanest option, but
I'm willing to do what it takes to move this error management
capability forward.
On Wed, Nov 3, 2021 at 11:10 AM Jane Chu <[email protected]> wrote:
>
> On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> >> ...so would you happen to know if anyone's working on solving this
> >> problem for us by putting the memory controller in charge of dealing
> >> with media errors?
> >
> > The only one who could know is Intel..
> >
> >> The trouble is, we really /do/ want to be able to (re)write the failed
> >> area, and we probably want to try to read whatever we can. Those are
> >> reads and writes, not {pre,f}allocation activities. This is where Dave
> >> and I arrived at a month ago.
> >>
> >> Unless you'd be ok with a second IO path for recovery where we're
> >> allowed to be slow? That would probably have the same user interface
> >> flag, just a different path into the pmem driver.
> >
> > Which is fine with me. If you look at the API here we do have the
> > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> > API which then gets special casing over three methods.
> >
> > And while Pavel pointed out that he and Jens are now optimizing for
> > single branches like this. I think this actually is silly and it is
> > not my point.
> >
> > The point is that the DAX in-kernel API is a mess, and before we make
> > it even worse we need to sort it first. What is directly relevant
> > here is that the copy_from_iter and copy_to_iter APIs do not make
> > sense. Most of the DAX API is based around getting a memory mapping
> > using ->direct_access, it is just the read/write path which is a slow
> > path that actually uses this. I have a very WIP patch series to try
> > to sort this out here:
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> >
> > But back to this series. The basic DAX model is that the callers gets a
> > memory mapping an just works on that, maybe calling a sync after a write
> > in a few cases. So any kind of recovery really needs to be able to
> > work with that model as going forward the copy_to/from_iter path will
> > be used less and less. i.e. file systems can and should use
> > direct_access directly instead of using the block layer implementation
> > in the pmem driver. As an example the dm-writecache driver, the pending
> > bcache nvdimm support and the (horribly and out of tree) nova file systems
> > won't even use this path. We need to find a way to support recovery
> > for them. And overloading it over the read/write path which is not
> > the main path for DAX, but the absolutely fast path for 99% of the
> > kernel users is a horrible idea.
> >
> > So how can we work around the horrible nvdimm design for data recovery
> > in a way that:
> >
> > a) actually works with the intended direct memory map use case
> > b) doesn't really affect the normal kernel too much
> >
> > ?
> >
>
> This is clearer, I've looked at your 'dax-devirtualize' patch which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> rebase my earlier patch
>
> https://lore.kernel.org/lkml/[email protected]/
> on 'dax-devirtualize', provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?
>
> Dan, are you okay with this? I am getting pressure from our customers
> who are basically stuck at the moment.
The concern I have with dax_clear_poison() is that it precludes atomic
error clearing. Also, as Boris and I discussed, poisoned pages should
be marked NP (not present) rather than UC (uncacheable) [1]. With
those 2 properties combined I think that wants a custom pmem fault
handler that knows how to carefully write to pmem pages with poison
present, rather than an additional explicit dax-operation. That also
meets Christoph's requirement of "works with the intended direct
memory map use case".
[1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com
On Wed, Nov 03, 2021 at 06:09:57PM +0000, Jane Chu wrote:
> This is clearer, I've looked at your 'dax-devirtualize' patch which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> rebase my earlier patch
>
> https://lore.kernel.org/lkml/[email protected]/
> on 'dax-devirtualize', provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?
FYI, I really do like the in-kernel interface in that series. And
now that we discussed all the effects I also think that automatically
doing the clear on EIO is a good idea.
I'll go back to the series and will reply with a few nitpicks.
On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote:
> Is the exception table requirement not already fulfilled by:
>
> sigaction(SIGBUS, &act, 0);
> ...
> if (sigsetjmp(sj_env, 1)) {
> ...
>
> ...but yes, that's awkward when all you want is an error return from a
> copy operation.
Yeah. The nice thing about the kernel uaccess / _nofault helpers is
that they allow normal error handling flows.
> For _nofault I'll note that on the kernel side Linus was explicit
> about not mixing fault handling and memory error exception handling in
> the same accessor. That's why copy_mc_to_kernel() and
> copy_{to,from}_kernel_nofault() are distinct.
I've always wondered why we need all this mess. But if the head
penguin wants it..
> I only say that to probe
> deeper about what a "copy_mc()" looks like in userspace? Perhaps an
> interface to suppress SIGBUS generation and register a ring buffer
> that gets filled with error-event records encountered over a given
> MMAP I/O code sequence?
Well, the equivalent to the kernel uaccess model would be to register
a SIGBUS handler that uses an exception table like the kernel, and then
if you use the right helpers to load/store they can return errors.
The other option would be something more like the Windows Structured
Exception Handling:
https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160
> > I think you misunderstood me. I don't think pmem needs to be
> > decoupled from the read/write path. But I'm very skeptical of adding
> > a new flag to the common read/write path for the special workaround
> > that a plain old write will not actually clear errors unlike every
> > other store interfac.
>
> Ah, ok, yes, I agree with you there that needing to redirect writes to
> a platform firmware call to clear errors, and notify the device that
> its error-list has changed is exceedingly awkward.
Yes. And that is the big difference to every other modern storage
device. SSDs always write out of place and will just not reuse bad
blocks, and all drivers built in the last 25-30 years will also do
internal bad block remapping.
> That said, even if
> the device-side error-list auto-updated on write (like the promise of
> MOVDIR64B) there's still the question about when to do management on
> the software error lists in the driver and/or filesytem. I.e. given
> that XFS at least wants to be aware of the error lists for block
> allocation and "list errors" type features. More below...
Well, the whole problem is that we should not have to manage this at
all, and this is where I blame Intel. There is no good reason to not
slightly overprovision the nvdimms and just do internal bad page
remapping like every other modern storage device.
> Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> interface for the filesystem to request "try harder" to return data?
Trying harder to _get_ data or to _store_ data? All the discussion
here seems to be able about actually writing data.
> If the device has a recovery slow-path, or error tracking granularity
> is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> device/driver leeway to do better than the typical fast path. For
> writes though, I can only come up with the use case of this being a
> signal to the driver to take the opportunity to do error-list
> management relative to the incoming write data.
Well, while devices have data recovery slow path they tend to use those
automatically. What I'm actually seeing in practice is flags in the
storage interfaces to skip this slow path recovery because the higher
level software would prefer to instead recover e.g. from remote copies.
> However, if signaling that "now is the time to update error-lists" is
> the requirement, I imagine the @kaddr returned from
> dax_direct_access() could be made to point to an unmapped address
> representing the poisoned page. Then, arrange for a pmem-driver fault
> handler to emulate the copy operation and do the slow path updates
> that would otherwise have been gated by RWF_RECOVER_DATA.
That does sound like a much better interface than most of what we've
discussed so far.
> Although, I'm not excited about teaching every PMEM arch's fault
> handler about this new source of kernel faults. Other ideas?
> RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> I'm willing to do what it takes to move this error management
> capability forward.
So far out of the low instrusiveness options Janes' previous series
to automatically retry after calling a clear_poison operation seems
like the best idea so far. We just need to also think about what
we want to do for direct users of ->direct_access that do not use
the mcsafe iov_iter helpers.
On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote:
> The concern I have with dax_clear_poison() is that it precludes atomic
> error clearing.
atomic as in clear poison and write the actual data? Yes, that would
be useful, but it is not how the Intel pmem support actually works, right?
> Also, as Boris and I discussed, poisoned pages should
> be marked NP (not present) rather than UC (uncacheable) [1].
This would not really have an affect on the series, right? But yes,
that seems like the right thing to do.
> With
> those 2 properties combined I think that wants a custom pmem fault
> handler that knows how to carefully write to pmem pages with poison
> present, rather than an additional explicit dax-operation. That also
> meets Christoph's requirement of "works with the intended direct
> memory map use case".
So we have 3 kinds of accesses to DAX memory:
(1) user space mmap direct access.
(2) iov_iter based access (could be from kernel or userspace)
(3) open coded kernel access using ->direct_access
One thing I noticed: (2) could also work with kernel memory or pages,
but that doesn't use MC safe access. Which seems like a major independent
of this discussion.
I suspect all kernel access could work fine with a copy_mc_to_kernel
helper as long as everyone actually uses it, which will cover the
missing required bits of (2) and (3) together with something like the
->clear_poison series from Jane. We just need to think hard what we
want to do for userspace mmap access.
On Thu, Nov 04, 2021 at 01:30:48AM -0700, Christoph Hellwig wrote:
> Well, the whole problem is that we should not have to manage this at
> all, and this is where I blame Intel. There is no good reason to not
> slightly overprovision the nvdimms and just do internal bad page
> remapping like every other modern storage device.
What makes you think they don't?
The problem is persuading the CPU to do writes without doing reads.
That's where magic instructions or out of band IO is needed.
On Thu, Nov 4, 2021 at 1:36 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote:
> > The concern I have with dax_clear_poison() is that it precludes atomic
> > error clearing.
>
> atomic as in clear poison and write the actual data? Yes, that would
> be useful, but it is not how the Intel pmem support actually works, right?
Yes, atomic clear+write new data. The ability to atomic clear requires
either a CPU with the ability to overwrite cachelines without doing a
RMW cycle (MOVDIR64B), or it requires a device with a suitable
slow-path mailbox command like the one defined for CXL devices (see
section 8.2.9.5.4.3 Clear Poison in CXL 2.0).
I don't know why you think these devices don't perform wear-leveling
with spare blocks?
> > Also, as Boris and I discussed, poisoned pages should
> > be marked NP (not present) rather than UC (uncacheable) [1].
>
> This would not really have an affect on the series, right? But yes,
> that seems like the right thing to do.
It would because the implementations would need to be careful to clear
poison in an entire page before any of it could be accessed. With an
enlightened write-path RWF flag or custom fault handler it could do
sub-page overwrites of poison. Not that I think the driver should
optimize for multiple failed cachelines in a page, but it does mean
dax_clear_poison() fails in more theoretical scenarios.
> > With
> > those 2 properties combined I think that wants a custom pmem fault
> > handler that knows how to carefully write to pmem pages with poison
> > present, rather than an additional explicit dax-operation. That also
> > meets Christoph's requirement of "works with the intended direct
> > memory map use case".
>
> So we have 3 kinds of accesses to DAX memory:
>
> (1) user space mmap direct access.
> (2) iov_iter based access (could be from kernel or userspace)
> (3) open coded kernel access using ->direct_access
>
> One thing I noticed: (2) could also work with kernel memory or pages,
> but that doesn't use MC safe access.
Yes, but after the fight to even get copy_mc_to_kernel() to exist for
pmem_copy_to_iter() I did not have the nerve to push for wider usage.
> Which seems like a major independent
> of this discussion.
>
> I suspect all kernel access could work fine with a copy_mc_to_kernel
> helper as long as everyone actually uses it,
All kernel accesses do use it. They either route to
pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you
see a kernel path that does not use that helper?
> missing required bits of (2) and (3) together with something like the
> ->clear_poison series from Jane. We just need to think hard what we
> want to do for userspace mmap access.
dax_clear_poison() is at least ready to go today and does not preclude
adding the atomic and finer grained support later.
On Thu, Nov 4, 2021 at 1:30 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote:
> > Is the exception table requirement not already fulfilled by:
> >
> > sigaction(SIGBUS, &act, 0);
> > ...
> > if (sigsetjmp(sj_env, 1)) {
> > ...
> >
> > ...but yes, that's awkward when all you want is an error return from a
> > copy operation.
>
> Yeah. The nice thing about the kernel uaccess / _nofault helpers is
> that they allow normal error handling flows.
>
> > For _nofault I'll note that on the kernel side Linus was explicit
> > about not mixing fault handling and memory error exception handling in
> > the same accessor. That's why copy_mc_to_kernel() and
> > copy_{to,from}_kernel_nofault() are distinct.
>
> I've always wondered why we need all this mess. But if the head
> penguin wants it..
>
> > I only say that to probe
> > deeper about what a "copy_mc()" looks like in userspace? Perhaps an
> > interface to suppress SIGBUS generation and register a ring buffer
> > that gets filled with error-event records encountered over a given
> > MMAP I/O code sequence?
>
> Well, the equivalent to the kernel uaccess model would be to register
> a SIGBUS handler that uses an exception table like the kernel, and then
> if you use the right helpers to load/store they can return errors.
>
> The other option would be something more like the Windows Structured
> Exception Handling:
>
> https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160
Yes, try-catch blocks for PMEM is what is needed if it's not there
already from PMDK. Searching for SIGBUS in PMDK finds things like:
https://github.com/pmem/pmdk/blob/26449a51a86861db17feabdfefaa5ee4d5afabc9/src/libpmem2/mcsafe_ops_posix.c
>
>
> > > I think you misunderstood me. I don't think pmem needs to be
> > > decoupled from the read/write path. But I'm very skeptical of adding
> > > a new flag to the common read/write path for the special workaround
> > > that a plain old write will not actually clear errors unlike every
> > > other store interfac.
> >
> > Ah, ok, yes, I agree with you there that needing to redirect writes to
> > a platform firmware call to clear errors, and notify the device that
> > its error-list has changed is exceedingly awkward.
>
> Yes. And that is the big difference to every other modern storage
> device. SSDs always write out of place and will just not reuse bad
> blocks, and all drivers built in the last 25-30 years will also do
> internal bad block remapping.
>
No, the big difference with every other modern storage device is
access to byte-addressable storage. Storage devices get to "cheat"
with guaranteed minimum 512-byte accesses. So you can arrange for
writes to always be large enough to scrub the ECC bits along with the
data. For PMEM and byte-granularity DAX accesses the "sector size" is
a cacheline and it needed a new CPU instruction before software could
atomically update data + ECC. Otherwise, with sub-cacheline accesses,
a RMW cycle can't always be avoided. Such a cycle pulls poison from
the device on the read and pushes it back out to the media on the
cacheline writeback.
> > That said, even if
> > the device-side error-list auto-updated on write (like the promise of
> > MOVDIR64B) there's still the question about when to do management on
> > the software error lists in the driver and/or filesytem. I.e. given
> > that XFS at least wants to be aware of the error lists for block
> > allocation and "list errors" type features. More below...
>
> Well, the whole problem is that we should not have to manage this at
> all, and this is where I blame Intel. There is no good reason to not
> slightly overprovision the nvdimms and just do internal bad page
> remapping like every other modern storage device.
I don't understand what overprovisioning has to do with better error
management? No other storage device has seen fit to be as transparent
with communicating the error list and offering ways to proactively
scrub it. Dave and Darrick rightly saw this and said "hey, the FS
could do a much better job for the user if it knew about this error
list". So I don't get what this argument about spare blocks has to do
with what XFS wants? I.e. an rmap facility to communicate files that
have been clobbered by cosmic rays and other calamities.
> > Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> > interface for the filesystem to request "try harder" to return data?
>
> Trying harder to _get_ data or to _store_ data? All the discussion
> here seems to be able about actually writing data.
>
> > If the device has a recovery slow-path, or error tracking granularity
> > is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> > device/driver leeway to do better than the typical fast path. For
> > writes though, I can only come up with the use case of this being a
> > signal to the driver to take the opportunity to do error-list
> > management relative to the incoming write data.
>
> Well, while devices have data recovery slow path they tend to use those
> automatically. What I'm actually seeing in practice is flags in the
> storage interfaces to skip this slow path recovery because the higher
> level software would prefer to instead recover e.g. from remote copies.
Ok.
> > However, if signaling that "now is the time to update error-lists" is
> > the requirement, I imagine the @kaddr returned from
> > dax_direct_access() could be made to point to an unmapped address
> > representing the poisoned page. Then, arrange for a pmem-driver fault
> > handler to emulate the copy operation and do the slow path updates
> > that would otherwise have been gated by RWF_RECOVER_DATA.
>
> That does sound like a much better interface than most of what we've
> discussed so far.
>
> > Although, I'm not excited about teaching every PMEM arch's fault
> > handler about this new source of kernel faults. Other ideas?
> > RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> > I'm willing to do what it takes to move this error management
> > capability forward.
>
> So far out of the low instrusiveness options Janes' previous series
> to automatically retry after calling a clear_poison operation seems
> like the best idea so far. We just need to also think about what
> we want to do for direct users of ->direct_access that do not use
> the mcsafe iov_iter helpers.
Those exist? Even dm-writecache uses copy_mc_to_kernel().
On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote:
> No, the big difference with every other modern storage device is
> access to byte-addressable storage. Storage devices get to "cheat"
> with guaranteed minimum 512-byte accesses. So you can arrange for
> writes to always be large enough to scrub the ECC bits along with the
> data. For PMEM and byte-granularity DAX accesses the "sector size" is
> a cacheline and it needed a new CPU instruction before software could
> atomically update data + ECC. Otherwise, with sub-cacheline accesses,
> a RMW cycle can't always be avoided. Such a cycle pulls poison from
> the device on the read and pushes it back out to the media on the
> cacheline writeback.
Indeed. The fake byte addressability is indeed the problem, and the
fix is to not do that, at least on the second attempt.
> I don't understand what overprovisioning has to do with better error
> management? No other storage device has seen fit to be as transparent
> with communicating the error list and offering ways to proactively
> scrub it. Dave and Darrick rightly saw this and said "hey, the FS
> could do a much better job for the user if it knew about this error
> list". So I don't get what this argument about spare blocks has to do
> with what XFS wants? I.e. an rmap facility to communicate files that
> have been clobbered by cosmic rays and other calamities.
Well, the answer for other interfaces (at least at the gold plated
cost option) is so strong internal CRCs that user visible bits clobbered
by cosmic rays don't realisticly happen. But it is a problem with the
cheaper ones, and at least SCSI and NVMe offer the error list through
the Get LBA status command (and I bet ATA too, but I haven't looked into
that). Oddly enough there has never been much interested from the
fs community for those.
> > So far out of the low instrusiveness options Janes' previous series
> > to automatically retry after calling a clear_poison operation seems
> > like the best idea so far. We just need to also think about what
> > we want to do for direct users of ->direct_access that do not use
> > the mcsafe iov_iter helpers.
>
> Those exist? Even dm-writecache uses copy_mc_to_kernel().
I'm sorry, I have completely missed that it has been added. And it's
been in for a whole year..
On Thu, Nov 04, 2021 at 09:08:41AM -0700, Dan Williams wrote:
> Yes, atomic clear+write new data. The ability to atomic clear requires
> either a CPU with the ability to overwrite cachelines without doing a
> RMW cycle (MOVDIR64B), or it requires a device with a suitable
> slow-path mailbox command like the one defined for CXL devices (see
> section 8.2.9.5.4.3 Clear Poison in CXL 2.0).
>
> I don't know why you think these devices don't perform wear-leveling
> with spare blocks?
Because the interface looks so broken. But yes, apparently it's not
the media management that is broken but just the inteface that fakes
up byte level access.
> All kernel accesses do use it. They either route to
> pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you
> see a kernel path that does not use that helper?
No, sorry. My knowledge is out of date.
(nova does, but it is out of tree, and the lack of using
copy_mc_to_kernel is the least of its problems)
On Thu, Nov 4, 2021 at 10:43 AM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote:
> > No, the big difference with every other modern storage device is
> > access to byte-addressable storage. Storage devices get to "cheat"
> > with guaranteed minimum 512-byte accesses. So you can arrange for
> > writes to always be large enough to scrub the ECC bits along with the
> > data. For PMEM and byte-granularity DAX accesses the "sector size" is
> > a cacheline and it needed a new CPU instruction before software could
> > atomically update data + ECC. Otherwise, with sub-cacheline accesses,
> > a RMW cycle can't always be avoided. Such a cycle pulls poison from
> > the device on the read and pushes it back out to the media on the
> > cacheline writeback.
>
> Indeed. The fake byte addressability is indeed the problem, and the
> fix is to not do that, at least on the second attempt.
Fair enough.
> > I don't understand what overprovisioning has to do with better error
> > management? No other storage device has seen fit to be as transparent
> > with communicating the error list and offering ways to proactively
> > scrub it. Dave and Darrick rightly saw this and said "hey, the FS
> > could do a much better job for the user if it knew about this error
> > list". So I don't get what this argument about spare blocks has to do
> > with what XFS wants? I.e. an rmap facility to communicate files that
> > have been clobbered by cosmic rays and other calamities.
>
> Well, the answer for other interfaces (at least at the gold plated
> cost option) is so strong internal CRCs that user visible bits clobbered
> by cosmic rays don't realisticly happen. But it is a problem with the
> cheaper ones, and at least SCSI and NVMe offer the error list through
> the Get LBA status command (and I bet ATA too, but I haven't looked into
> that). Oddly enough there has never been much interested from the
> fs community for those.
It seems the entanglement with 'struct page', error handling, and
reflink made people take notice. Hopefully someone could follow the
same plumbing we're doing for pmem to offer error-rmap help for NVME
badblocks.
On Thu, Nov 04, 2021 at 10:43:23AM -0700, Christoph Hellwig wrote:
> Well, the answer for other interfaces (at least at the gold plated
> cost option) is so strong internal CRCs that user visible bits clobbered
> by cosmic rays don't realisticly happen. But it is a problem with the
> cheaper ones, and at least SCSI and NVMe offer the error list through
> the Get LBA status command (and I bet ATA too, but I haven't looked into
> that). Oddly enough there has never been much interested from the
> fs community for those.
"don't realistically happen" is different when you're talking about
"doesn't happen within the warranty period of my laptop's SSD" and
"doesn't happen on my fleet of 10k servers before they're taken out of
service". There's also a big difference in speeds between an NVMe drive
(7GB/s) and a memory device (20-50GB/s). The UBER being talked about
when I was still at Intel was similar to / slightly better than DRAM,
but that's still several failures per year across an entire data centre
that's using pmem flat-out.
Thanks for the enlightening discussion here, it's so helpful!
Please allow me to recap what I've caught up so far -
1. recovery write at page boundary due to NP setting in poisoned
page to prevent undesirable prefetching
2. single interface to perform 3 tasks:
{ clear-poison, update error-list, write }
such as an API in pmem driver.
For CPUs that support MOVEDIR64B, the 'clear-poison' and 'write'
task can be combined (would need something different from the
existing _copy_mcsafe though) and 'update error-list' follows
closely behind;
For CPUs that rely on firmware call to clear posion, the existing
pmem_clear_poison() can be used, followed by the 'write' task.
3. if user isn't given RWF_RECOVERY_FLAG flag, then dax recovery
would be automatic for a write if range is page aligned;
otherwise, the write fails with EIO as usual.
Also, user mustn't have punched out the poisoned page in which
case poison repairing will be a lot more complicated.
4. desirable to fetch as much data as possible from a poisoned range.
If this understanding is in the right direction, then I'd like to
propose below changes to
dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
and the dm layer copy_to/from_iter, dax_iomap_iter().
1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
is likely media error: if the API without DAX_F_RECOVERY returns
-EIO, then switch to recovery-read/write code. In recovery code,
supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
2. the _copy_to/from_iter implementation would be largely the same
as in my recent patch, but some changes in Christoph's
'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
virtual devices don't have the ability to clear poison, so no need
to complicate them. And this also means that not every endpoint
dax device has to provide dax_op.copy_to/from_iter, they may use the
default.
I'm not sure about nova and others, if they use different 'write' other
than via iomap, does that mean there will be need for a new set of
dax_op for their read/write? the 3-in-1 binding would always be
required though. Maybe that'll be an ongoing discussion?
Comments? Suggestions?
Thanks!
-jane
On Thu, Nov 4, 2021 at 11:34 AM Jane Chu <[email protected]> wrote:
>
> Thanks for the enlightening discussion here, it's so helpful!
>
> Please allow me to recap what I've caught up so far -
>
> 1. recovery write at page boundary due to NP setting in poisoned
> page to prevent undesirable prefetching
> 2. single interface to perform 3 tasks:
> { clear-poison, update error-list, write }
> such as an API in pmem driver.
> For CPUs that support MOVEDIR64B, the 'clear-poison' and 'write'
> task can be combined (would need something different from the
> existing _copy_mcsafe though) and 'update error-list' follows
> closely behind;
> For CPUs that rely on firmware call to clear posion, the existing
> pmem_clear_poison() can be used, followed by the 'write' task.
> 3. if user isn't given RWF_RECOVERY_FLAG flag, then dax recovery
> would be automatic for a write if range is page aligned;
> otherwise, the write fails with EIO as usual.
> Also, user mustn't have punched out the poisoned page in which
> case poison repairing will be a lot more complicated.
> 4. desirable to fetch as much data as possible from a poisoned range.
>
> If this understanding is in the right direction, then I'd like to
> propose below changes to
> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
> and the dm layer copy_to/from_iter, dax_iomap_iter().
>
> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> is likely media error: if the API without DAX_F_RECOVERY returns
> -EIO, then switch to recovery-read/write code. In recovery code,
> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
I like it. It allows for an atomic write+clear implementation on
capable platforms and coordinates with potentially unmapped pages. The
best of both worlds from the dax_clear_poison() proposal and my "take
a fault and do a slow-path copy".
> 2. the _copy_to/from_iter implementation would be largely the same
> as in my recent patch, but some changes in Christoph's
> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> virtual devices don't have the ability to clear poison, so no need
> to complicate them. And this also means that not every endpoint
> dax device has to provide dax_op.copy_to/from_iter, they may use the
> default.
Did I miss this series or are you talking about this one?
https://lore.kernel.org/all/[email protected]/
> I'm not sure about nova and others, if they use different 'write' other
> than via iomap, does that mean there will be need for a new set of
> dax_op for their read/write?
No, they're out-of-tree they'll adjust to the same interface that xfs
and ext4 are using when/if they go upstream.
> the 3-in-1 binding would always be
> required though. Maybe that'll be an ongoing discussion?
Yeah, let's cross that bridge when we come to it.
> Comments? Suggestions?
It sounds great to me!
On 11/4/2021 12:00 PM, Dan Williams wrote:
>>
>> If this understanding is in the right direction, then I'd like to
>> propose below changes to
>> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
>> and the dm layer copy_to/from_iter, dax_iomap_iter().
>>
>> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
>> is likely media error: if the API without DAX_F_RECOVERY returns
>> -EIO, then switch to recovery-read/write code. In recovery code,
>> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
>> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
>
> I like it. It allows for an atomic write+clear implementation on
> capable platforms and coordinates with potentially unmapped pages. The
> best of both worlds from the dax_clear_poison() proposal and my "take
> a fault and do a slow-path copy".
>
>> 2. the _copy_to/from_iter implementation would be largely the same
>> as in my recent patch, but some changes in Christoph's
>> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
>> virtual devices don't have the ability to clear poison, so no need
>> to complicate them. And this also means that not every endpoint
>> dax device has to provide dax_op.copy_to/from_iter, they may use the
>> default.
>
> Did I miss this series or are you talking about this one?
> https://lore.kernel.org/all/[email protected]/
I was referring to
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
that has not come out yet, I said early on that I'll rebase on it,
but looks like we still need pmem_copy_to/from_iter(), so.
>
>> I'm not sure about nova and others, if they use different 'write' other
>> than via iomap, does that mean there will be need for a new set of
>> dax_op for their read/write?
>
> No, they're out-of-tree they'll adjust to the same interface that xfs
> and ext4 are using when/if they go upstream.
>
>> the 3-in-1 binding would always be
>> required though. Maybe that'll be an ongoing discussion?
>
> Yeah, let's cross that bridge when we come to it.
>
>> Comments? Suggestions?
>
> It sounds great to me!
>
Thanks! I'll send out an updated patchset when it's ready.
-jane
On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <[email protected]> wrote:
>
> On 11/4/2021 12:00 PM, Dan Williams wrote:
>
> >>
> >> If this understanding is in the right direction, then I'd like to
> >> propose below changes to
> >> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
> >> and the dm layer copy_to/from_iter, dax_iomap_iter().
> >>
> >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> >> is likely media error: if the API without DAX_F_RECOVERY returns
> >> -EIO, then switch to recovery-read/write code. In recovery code,
> >> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> >> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> >
> > I like it. It allows for an atomic write+clear implementation on
> > capable platforms and coordinates with potentially unmapped pages. The
> > best of both worlds from the dax_clear_poison() proposal and my "take
> > a fault and do a slow-path copy".
> >
> >> 2. the _copy_to/from_iter implementation would be largely the same
> >> as in my recent patch, but some changes in Christoph's
> >> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> >> virtual devices don't have the ability to clear poison, so no need
> >> to complicate them. And this also means that not every endpoint
> >> dax device has to provide dax_op.copy_to/from_iter, they may use the
> >> default.
> >
> > Did I miss this series or are you talking about this one?
> > https://lore.kernel.org/all/[email protected]/
>
> I was referring to
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> that has not come out yet, I said early on that I'll rebase on it,
> but looks like we still need pmem_copy_to/from_iter(), so.
Yeah, since the block-layer divorce gets rid of the old poison
clearing path, then we're back to pmem_copy_to_iter() (or something
like it) needing to pick up the slack for poison clearing. I do agree
it would be nice to clean up all the unnecessary boilerplate, but the
error-list coordination requires a driver specific callback. At least
the DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks.
On Thu, Nov 4, 2021 at 5:46 PM Dan Williams <[email protected]> wrote:
>
> On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <[email protected]> wrote:
> >
> > On 11/4/2021 12:00 PM, Dan Williams wrote:
> >
> > >>
> > >> If this understanding is in the right direction, then I'd like to
> > >> propose below changes to
> > >> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
> > >> and the dm layer copy_to/from_iter, dax_iomap_iter().
> > >>
> > >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> > >> is likely media error: if the API without DAX_F_RECOVERY returns
> > >> -EIO, then switch to recovery-read/write code. In recovery code,
> > >> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> > >> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> > >
> > > I like it. It allows for an atomic write+clear implementation on
> > > capable platforms and coordinates with potentially unmapped pages. The
> > > best of both worlds from the dax_clear_poison() proposal and my "take
> > > a fault and do a slow-path copy".
> > >
> > >> 2. the _copy_to/from_iter implementation would be largely the same
> > >> as in my recent patch, but some changes in Christoph's
> > >> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> > >> virtual devices don't have the ability to clear poison, so no need
> > >> to complicate them. And this also means that not every endpoint
> > >> dax device has to provide dax_op.copy_to/from_iter, they may use the
> > >> default.
> > >
> > > Did I miss this series or are you talking about this one?
> > > https://lore.kernel.org/all/[email protected]/
> >
> > I was referring to
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> > that has not come out yet, I said early on that I'll rebase on it,
> > but looks like we still need pmem_copy_to/from_iter(), so.
>
> Yeah, since the block-layer divorce gets rid of the old poison
> clearing path, then we're back to pmem_copy_to_iter() (or something
> like it) needing to pick up the slack for poison clearing. I do agree
> it would be nice to clean up all the unnecessary boilerplate, but the
> error-list coordination requires a driver specific callback. At least
> the DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks.
Although, if error management is generically implemented relative to a
'struct dax_device' then there wouldn't be a need to call all the way
back into the driver, and the cleanup could still move forward.
On Thu, Nov 04, 2021 at 12:00:12PM -0700, Dan Williams wrote:
> > 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> > is likely media error: if the API without DAX_F_RECOVERY returns
> > -EIO, then switch to recovery-read/write code. In recovery code,
> > supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> > 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
>
> I like it. It allows for an atomic write+clear implementation on
> capable platforms and coordinates with potentially unmapped pages. The
> best of both worlds from the dax_clear_poison() proposal and my "take
> a fault and do a slow-path copy".
Fine with me as well.
>
> > 2. the _copy_to/from_iter implementation would be largely the same
> > as in my recent patch, but some changes in Christoph's
> > 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> > virtual devices don't have the ability to clear poison, so no need
> > to complicate them. And this also means that not every endpoint
> > dax device has to provide dax_op.copy_to/from_iter, they may use the
> > default.
>
> Did I miss this series or are you talking about this one?
> https://lore.kernel.org/all/[email protected]/
Yes. This is an early RFC, but I plan to finish this up and submit
it after the updated decouple series.
>
> > I'm not sure about nova and others, if they use different 'write' other
> > than via iomap, does that mean there will be need for a new set of
> > dax_op for their read/write?
>
> No, they're out-of-tree they'll adjust to the same interface that xfs
> and ext4 are using when/if they go upstream.
Yepp.
On Tue, 2 Nov 2021 09:03:55 -0700
Dan Williams <[email protected]> wrote:
> On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > > Thanks - I try to be honest. As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.
> >
> > No, I don't think so. Bit errors do happen in all media, which is
> > why devices are built to handle them. It is just the Intel-style
> > pmem interface to handle them which is completely broken.
>
> No, any media can report checksum / parity errors. NVME also seems to
> do a poor job with multi-bit ECC errors consumed from DRAM. There is
> nothing "pmem" or "Intel" specific here.
>
> > > errors in mind from start. I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path. Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
>
> I would expect this interface to be useful outside of pmem as a
> "failfast" or "try harder to recover" flag for reading over media
> errors.
Yeah, I think this flag could also be useful for non-raid btrfs.
If you have an extend that is shared between multiple snapshots and
it's data is corrupted (without the disk returning an i/o error), btrfs
won't be able to fix the corruption without raid and will always return
an i/o error when accessing the affected range (due to checksum
mismatch).
Of course you could just overwrite the range in the file with good
data, but that would only fix the file you are operating on, snapshots
will still reference the corrupted data.
With this flag, a read could just return the corrupted data without i/o
error and a write could write directly to the on-disk data to fixup the
corruption everywhere. btrfs could also check that the newly written
data actually matches the checksum.
However, in this btrfs usecase the process still needs to be
CAP_SYS_ADMIN or similar, since it's easy to create collisions for
crc32 and so an attacker could write to a file that he has no
permissions for, if that file shares an extend with one where he has
write permissions.
Regards,
Lukas Straub
--