Until now, dax has been disabled if media errors were found on
any device. This series attempts to address that.
The first three patches from Dan re-enable dax even when media
errors are present.
The fourth patch from Matthew removes the
zeroout path from dax entirely, making zeroout operations always
go through the driver (The motivation is that if a backing device
has media errors, and we create a sparse file on it, we don't
want the initial zeroing to happen via dax, we want to give the
block driver a chance to clear the errors).
One pending item is addressing clear_pmem usages in dax.c. clear_pmem is
'unsafe' as it attempts to simply memcpy, and does not go through the driver.
We have a few options of solving this:
1. Remove all usages of clear_pmem that are not sector-aligned. For the
ones that are aligned, replace them with a bio submission that goes
through the driver to clear errors.
2. Export from the block layer, either an API to zero sub-sector ranges,
or in general, clear errors in a range. The dax attempts to clear_pmem
can then use either of these and not be hit be media errors.
I'll send out a v3 with a crack at option 1, but I wanted to get these
changes (especially the ones in xfs) out for review.
The fifth patch changes all the callers of dax_do_io to check for
EIO, and fallback to direct_IO as needed. This forces the IO to
go through the block driver, and can attempt to clear the error.
v2:
- Use blockdev_issue_zeroout in xfs instead of sb_issue_zeroout (Christoph)
- Un-wrapper-ize dax_do_io and leave the fallback to direct_IO to callers
(Christoph)
- Rebase to v4.6-rc1 (fixup a couple of conflicts in ext4 and xfs)
Dan Williams (3):
block, dax: pass blk_dax_ctl through to drivers
dax: fallback from pmd to pte on error
dax: enable dax in the presence of known media errors (badblocks)
Vishal Verma (2):
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: handle media errors in dax_do_io
arch/powerpc/sysdev/axonram.c | 10 +++++-----
block/ioctl.c | 9 ---------
drivers/block/brd.c | 9 +++++----
drivers/nvdimm/pmem.c | 17 +++++++++++++----
drivers/s390/block/dcssblk.c | 12 ++++++------
fs/block_dev.c | 19 +++++++++++++++----
fs/dax.c | 36 ++----------------------------------
fs/ext2/inode.c | 29 ++++++++++++++++++-----------
fs/ext4/indirect.c | 18 +++++++++++++-----
fs/ext4/inode.c | 21 ++++++++++++++-------
fs/xfs/xfs_aops.c | 14 ++++++++++++--
fs/xfs/xfs_bmap_util.c | 15 ++++-----------
include/linux/blkdev.h | 3 +--
include/linux/dax.h | 1 -
14 files changed, 108 insertions(+), 105 deletions(-)
--
2.5.5
From: Dan Williams <[email protected]>
In preparation for consulting a badblocks list in pmem_direct_access(),
teach dax_pmd_fault() to fallback rather than fail immediately upon
encountering an error. The thought being that reducing the span of the
dax request may avoid the error region.
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 90322eb..ec6417b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -945,8 +945,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
long length = dax_map_atomic(bdev, &dax);
if (length < 0) {
- result = VM_FAULT_SIGBUS;
- goto out;
+ dax_pmd_dbg(&bh, address, "dax-error fallback");
+ goto fallback;
}
if (length < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "dax-length too small");
--
2.5.5
dax_do_io (called for read() or write() for a dax file system) may fail
in the presence of bad blocks or media errors. Since we expect that a
write should clear media errors on nvdimms, make dax_do_io fall back to
the direct_IO path, which will send down a bio to the driver, which can
then attempt to clear the error.
Cc: Matthew Wilcox <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
fs/block_dev.c | 17 ++++++++++++++---
fs/ext2/inode.c | 22 +++++++++++++++-------
fs/ext4/indirect.c | 18 +++++++++++++-----
fs/ext4/inode.c | 21 ++++++++++++++-------
fs/xfs/xfs_aops.c | 14 ++++++++++++--
5 files changed, 68 insertions(+), 24 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c5837fa..d6113b9 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -166,13 +166,24 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = bdev_file_inode(file);
+ ssize_t ret, ret_saved = 0;
- if (IS_DAX(inode))
- return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
+ if (IS_DAX(inode)) {
+ ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
NULL, DIO_SKIP_DIO_COUNT);
- return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
+ if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
+ ret_saved = ret;
+ else
+ return ret;
+ }
+
+ ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
+ if (ret < 0 && ret_saved)
+ return ret_saved;
+
+ return ret;
}
int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 824f249..64792c6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,14 +859,22 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
size_t count = iov_iter_count(iter);
- ssize_t ret;
+ ssize_t ret, ret_saved = 0;
- if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
- DIO_LOCKING);
- else
- ret = blockdev_direct_IO(iocb, inode, iter, offset,
- ext2_get_block);
+ if (IS_DAX(inode)) {
+ ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block,
+ NULL, DIO_LOCKING | DIO_SKIP_HOLES);
+ if (ret == -EIO && iov_iter_rw(iter) == WRITE)
+ ret_saved = ret;
+ else
+ goto out;
+ }
+
+ ret = blockdev_direct_IO(iocb, inode, iter, offset, ext2_get_block);
+ if (ret < 0 && ret_saved)
+ ret = ret_saved;
+
+ out:
if (ret < 0 && iov_iter_rw(iter) == WRITE)
ext2_write_failed(mapping, offset + count);
return ret;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa6..798f341 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -716,14 +716,22 @@ retry:
NULL, NULL, 0);
inode_dio_end(inode);
} else {
+ ssize_t ret_saved = 0;
+
locked:
- if (IS_DAX(inode))
+ if (IS_DAX(inode)) {
ret = dax_do_io(iocb, inode, iter, offset,
ext4_dio_get_block, NULL, DIO_LOCKING);
- else
- ret = blockdev_direct_IO(iocb, inode, iter, offset,
- ext4_dio_get_block);
-
+ if (ret == -EIO && iov_iter_rw(iter) == WRITE)
+ ret_saved = ret;
+ else
+ goto skip_dio;
+ }
+ ret = blockdev_direct_IO(iocb, inode, iter, offset,
+ ext4_get_block);
+ if (ret < 0 && ret_saved)
+ ret = ret_saved;
+skip_dio:
if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
loff_t isize = i_size_read(inode);
loff_t end = offset + count;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2..27f07c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3341,7 +3341,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
- ssize_t ret;
+ ssize_t ret, ret_saved = 0;
size_t count = iov_iter_count(iter);
int overwrite = 0;
get_block_t *get_block_func = NULL;
@@ -3401,15 +3401,22 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
#endif
- if (IS_DAX(inode))
+ if (IS_DAX(inode)) {
ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
ext4_end_io_dio, dio_flags);
- else
- ret = __blockdev_direct_IO(iocb, inode,
- inode->i_sb->s_bdev, iter, offset,
- get_block_func,
- ext4_end_io_dio, NULL, dio_flags);
+ if (ret == -EIO && iov_iter_rw(iter) == WRITE)
+ ret_saved = ret;
+ else
+ goto skip_dio;
+ }
+ ret = __blockdev_direct_IO(iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);
+ if (ret < 0 && ret_saved)
+ ret = ret_saved;
+ skip_dio:
if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN)) {
int err;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d445a64..7cfcf86 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1413,6 +1413,7 @@ xfs_vm_direct_IO(
dio_iodone_t *endio = NULL;
int flags = 0;
struct block_device *bdev;
+ ssize_t ret, ret_saved = 0;
if (iov_iter_rw(iter) == WRITE) {
endio = xfs_end_io_direct_write;
@@ -1420,13 +1421,22 @@ xfs_vm_direct_IO(
}
if (IS_DAX(inode)) {
- return dax_do_io(iocb, inode, iter, offset,
+ ret = dax_do_io(iocb, inode, iter, offset,
xfs_get_blocks_direct, endio, 0);
+ if (ret == -EIO && iov_iter_rw(iter) == WRITE)
+ ret_saved = ret;
+ else
+ return ret;
}
bdev = xfs_find_bdev_for_inode(inode);
- return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
+ ret = __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
xfs_get_blocks_direct, endio, NULL, flags);
+
+ if (ret < 0 && ret_saved)
+ ret = ret_saved;
+
+ return ret;
}
/*
--
2.5.5
From: Matthew Wilcox <[email protected]>
dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().
Signed-off-by: Matthew Wilcox <[email protected]>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 32 --------------------------------
fs/ext2/inode.c | 7 +++----
fs/xfs/xfs_bmap_util.c | 15 ++++-----------
include/linux/dax.h | 1 -
4 files changed, 7 insertions(+), 48 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index ec6417b..f4ac5f2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -78,38 +78,6 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
return page;
}
-/*
- * dax_clear_sectors() is called from within transaction context from XFS,
- * and hence this means the stack from this point must follow GFP_NOFS
- * semantics for all operations.
- */
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
-{
- struct blk_dax_ctl dax = {
- .sector = _sector,
- .size = _size,
- };
-
- might_sleep();
- do {
- long count, sz;
-
- count = dax_map_atomic(bdev, &dax);
- if (count < 0)
- return count;
- sz = min_t(long, count, SZ_128K);
- clear_pmem(dax.addr, sz);
- dax.size -= sz;
- dax.sector += sz / 512;
- dax_unmap_atomic(bdev, &dax);
- cond_resched();
- } while (dax.size);
-
- wmb_pmem();
- return 0;
-}
-EXPORT_SYMBOL_GPL(dax_clear_sectors);
-
/* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
loff_t pos, loff_t end)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 6bd58e6..824f249 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -26,6 +26,7 @@
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/dax.h>
+#include <linux/blkdev.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
@@ -737,10 +738,8 @@ static int ext2_get_blocks(struct inode *inode,
* so that it's not found by another thread before it's
* initialised
*/
- err = dax_clear_sectors(inode->i_sb->s_bdev,
- le32_to_cpu(chain[depth-1].key) <<
- (inode->i_blkbits - 9),
- 1 << inode->i_blkbits);
+ err = sb_issue_zeroout(inode->i_sb,
+ le32_to_cpu(chain[depth-1].key), 1, GFP_NOFS);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a32c1dc..5b4351a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -72,18 +72,11 @@ xfs_zero_extent(
struct xfs_mount *mp = ip->i_mount;
xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
sector_t block = XFS_BB_TO_FSBT(mp, sector);
- ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
-
- if (IS_DAX(VFS_I(ip)))
- return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
- sector, size);
-
- /*
- * let the block layer decide on the fastest method of
- * implementing the zeroing.
- */
- return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
+ return blkdev_issue_zeroout(xfs_find_bdev_for_inode(VFS_I(ip)),
+ block << (mp->m_super->s_blocksize_bits - 9),
+ count_fsb << (mp->m_super->s_blocksize_bits - 9),
+ GFP_NOFS, true);
}
/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59..933198a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,7 +7,6 @@
ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
--
2.5.5
From: Dan Williams <[email protected]>
This is in preparation for doing badblocks checking against the
requested sector range in the driver. Currently we opportunistically
return as much data that can be "dax'd" starting at the given sector.
When errors are present we want to limit that range to the first
encountered error, or fail the dax request if the range encompasses an
error.
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 10 +++++-----
drivers/block/brd.c | 9 +++++----
drivers/nvdimm/pmem.c | 9 +++++----
drivers/s390/block/dcssblk.c | 12 ++++++------
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 3 +--
6 files changed, 23 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 0d112b9..d85673f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,17 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
/**
* axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
+ * @dax: see block_device_operations method
*/
static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(device) + dax->sector;
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
- *kaddr = (void __pmem __force *) bank->io_addr + offset;
- *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+ dax->addr = (void __pmem __force *) bank->io_addr + offset;
+ dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
return bank->size - offset;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index f7ecc28..e3e4780 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -380,9 +380,10 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}
#ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long brd_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -391,8 +392,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
- *kaddr = (void __pmem *)page_address(page);
- *pfn = page_to_pfn_t(page);
+ dax->addr = (void __pmem *)page_address(page);
+ dax->pfn = page_to_pfn_t(page);
return PAGE_SIZE;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ca5721c..da10554 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -167,14 +167,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return rc;
}
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long pmem_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;
- *kaddr = pmem->virt_addr + offset;
- *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+ dax->addr = pmem->virt_addr + offset;
+ dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
return pmem->size - pmem->pfn_pad - offset;
}
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 1bce9cf..5719c30 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -30,8 +30,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn);
+static long dcssblk_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
@@ -882,9 +882,9 @@ fail:
}
static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn)
+dcssblk_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
{
+ sector_t secnum = get_start_sect(bdev) + dax->sector;
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
@@ -893,8 +893,8 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
return -ENODEV;
dev_sz = dev_info->end - dev_info->start;
offset = secnum * 512;
- *kaddr = (void __pmem *) (dev_info->start + offset);
- *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+ dax->addr = (void __pmem *) (dev_info->start + offset);
+ dax->pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
return dev_sz - offset;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e..c5837fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -488,7 +488,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
+ avail = ops->direct_access(bdev, dax);
if (!avail)
return -ERANGE;
if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e0..92f8a1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1656,8 +1656,7 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- long (*direct_access)(struct block_device *, sector_t, void __pmem **,
- pfn_t *);
+ long (*direct_access)(struct block_device *, struct blk_dax_ctl *dax);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
2.5.5
From: Dan Williams <[email protected]>
1/ If a mapping overlaps a bad sector fail the request.
2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.
[vishal: fix a conflict with system RAM collision patches]
Signed-off-by: Dan Williams <[email protected]>
---
block/ioctl.c | 9 ---------
drivers/nvdimm/pmem.c | 8 ++++++++
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index d8996bb..cd7f392 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -423,15 +423,6 @@ bool blkdev_dax_capable(struct block_device *bdev)
|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
return false;
- /*
- * If the device has known bad blocks, force all I/O through the
- * driver / page cache.
- *
- * TODO: support finer grained dax error handling
- */
- if (disk->bb && disk->bb->count)
- return false;
-
return true;
}
#endif
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index da10554..eac5f93 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -174,9 +174,17 @@ static long pmem_direct_access(struct block_device *bdev,
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;
+ if (unlikely(is_bad_pmem(&pmem->bb, sector, dax->size)))
+ return -EIO;
dax->addr = pmem->virt_addr + offset;
dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+ /*
+ * If badblocks are present, limit known good range to the
+ * requested range.
+ */
+ if (unlikely(pmem->bb.count))
+ return dax->size;
return pmem->size - pmem->pfn_pad - offset;
}
--
2.5.5
Hi Vishal,
[auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
[also build test WARNING on v4.6-rc1 next-20160329]
[cannot apply to xfs/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url: https://github.com/0day-ci/linux/commits/Vishal-Verma/dax-handling-of-media-errors/20160330-100409
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-for-next
config: x86_64-randconfig-i0-03300245 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
All warnings (new ones prefixed by >>):
fs/ext4/indirect.c: In function 'ext4_ind_direct_IO':
>> fs/ext4/indirect.c:719:11: warning: 'ret_saved' may be used uninitialized in this function [-Wmaybe-uninitialized]
ssize_t ret_saved = 0;
^
vim +/ret_saved +719 fs/ext4/indirect.c
703 smp_mb();
704 if (unlikely(ext4_test_inode_state(inode,
705 EXT4_STATE_DIOREAD_LOCK))) {
706 inode_dio_end(inode);
707 goto locked;
708 }
709 if (IS_DAX(inode))
710 ret = dax_do_io(iocb, inode, iter, offset,
711 ext4_dio_get_block, NULL, 0);
712 else
713 ret = __blockdev_direct_IO(iocb, inode,
714 inode->i_sb->s_bdev, iter,
715 offset, ext4_dio_get_block,
716 NULL, NULL, 0);
717 inode_dio_end(inode);
718 } else {
> 719 ssize_t ret_saved = 0;
720
721 locked:
722 if (IS_DAX(inode)) {
723 ret = dax_do_io(iocb, inode, iter, offset,
724 ext4_dio_get_block, NULL, DIO_LOCKING);
725 if (ret == -EIO && iov_iter_rw(iter) == WRITE)
726 ret_saved = ret;
727 else
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Dan,
[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.6-rc1 next-20160329]
[cannot apply to xfs/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url: https://github.com/0day-ci/linux/commits/Vishal-Verma/dax-handling-of-media-errors/20160330-100409
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-for-next
config: s390-default_defconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390
All error/warnings (new ones prefixed by >>):
drivers/s390/block/dcssblk.c: In function 'dcssblk_direct_access':
>> drivers/s390/block/dcssblk.c:36:13: error: storage class specified for parameter 'dcssblk_segments'
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
^
>> drivers/s390/block/dcssblk.c:36:1: error: parameter 'dcssblk_segments' is initialized
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
^
>> drivers/s390/block/dcssblk.c:38:12: error: storage class specified for parameter 'dcssblk_major'
static int dcssblk_major;
^
>> drivers/s390/block/dcssblk.c:39:45: error: storage class specified for parameter 'dcssblk_devops'
static const struct block_device_operations dcssblk_devops = {
^
>> drivers/s390/block/dcssblk.c:39:21: error: parameter 'dcssblk_devops' is initialized
static const struct block_device_operations dcssblk_devops = {
^
>> drivers/s390/block/dcssblk.c:46:1: warning: empty declaration
struct dcssblk_dev_info {
^
drivers/s390/block/dcssblk.c:62:1: warning: empty declaration
struct segment_info {
^
>> drivers/s390/block/dcssblk.c:70:16: error: storage class specified for parameter 'dcssblk_add_store'
static ssize_t dcssblk_add_store(struct device * dev, struct device_attribute *attr, const char * buf,
^
>> drivers/s390/block/dcssblk.c:72:16: error: storage class specified for parameter 'dcssblk_remove_store'
static ssize_t dcssblk_remove_store(struct device * dev, struct device_attribute *attr, const char * buf,
^
In file included from include/linux/genhd.h:63:0,
from include/linux/blkdev.h:9,
from drivers/s390/block/dcssblk.c:16:
>> include/linux/device.h:575:26: error: storage class specified for parameter 'dev_attr_add'
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
^
>> drivers/s390/block/dcssblk.c:75:8: note: in expansion of macro 'DEVICE_ATTR'
static DEVICE_ATTR(add, S_IWUSR, NULL, dcssblk_add_store);
^
>> include/linux/device.h:575:9: error: parameter 'dev_attr_add' is initialized
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
^
>> drivers/s390/block/dcssblk.c:75:8: note: in expansion of macro 'DEVICE_ATTR'
static DEVICE_ATTR(add, S_IWUSR, NULL, dcssblk_add_store);
^
>> include/linux/device.h:575:26: error: storage class specified for parameter 'dev_attr_remove'
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
^
drivers/s390/block/dcssblk.c:76:8: note: in expansion of macro 'DEVICE_ATTR'
static DEVICE_ATTR(remove, S_IWUSR, NULL, dcssblk_remove_store);
^
>> include/linux/device.h:575:9: error: parameter 'dev_attr_remove' is initialized
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
^
drivers/s390/block/dcssblk.c:76:8: note: in expansion of macro 'DEVICE_ATTR'
static DEVICE_ATTR(remove, S_IWUSR, NULL, dcssblk_remove_store);
^
>> drivers/s390/block/dcssblk.c:78:23: error: storage class specified for parameter 'dcssblk_root_dev'
static struct device *dcssblk_root_dev;
^
In file included from include/linux/module.h:9:0,
from drivers/s390/block/dcssblk.c:10:
>> drivers/s390/block/dcssblk.c:80:18: error: storage class specified for parameter 'dcssblk_devices'
static LIST_HEAD(dcssblk_devices);
^
include/linux/list.h:23:19: note: in definition of macro 'LIST_HEAD'
struct list_head name = LIST_HEAD_INIT(name)
^
>> include/linux/list.h:23:9: error: parameter 'dcssblk_devices' is initialized
struct list_head name = LIST_HEAD_INIT(name)
^
>> drivers/s390/block/dcssblk.c:80:8: note: in expansion of macro 'LIST_HEAD'
static LIST_HEAD(dcssblk_devices);
^
>> drivers/s390/block/dcssblk.c:81:28: error: storage class specified for parameter 'dcssblk_devices_sem'
static struct rw_semaphore dcssblk_devices_sem;
^
>> drivers/s390/block/dcssblk.c:88:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:109:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:136:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:154:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:172:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:189:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:213:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:279:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:315:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
drivers/s390/block/dcssblk.c:324:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
{
^
In file included from include/linux/genhd.h:63:0,
from include/linux/blkdev.h:9,
from drivers/s390/block/dcssblk.c:16:
vim +/dcssblk_segments +36 drivers/s390/block/dcssblk.c
^1da177e Linus Torvalds 2005-04-16 23
^1da177e Linus Torvalds 2005-04-16 24 #define DCSSBLK_NAME "dcssblk"
^1da177e Linus Torvalds 2005-04-16 25 #define DCSSBLK_MINORS_PER_DISK 1
^1da177e Linus Torvalds 2005-04-16 26 #define DCSSBLK_PARM_LEN 400
98df67b3 Kay Sievers 2008-12-25 27 #define DCSS_BUS_ID_SIZE 20
^1da177e Linus Torvalds 2005-04-16 28
46d74326 Al Viro 2008-03-02 @29 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
db2a144b Al Viro 2013-05-05 @30 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
dece1635 Jens Axboe 2015-11-05 31 static blk_qc_t dcssblk_make_request(struct request_queue *q,
dece1635 Jens Axboe 2015-11-05 32 struct bio *bio);
e3cb53fb Dan Williams 2016-03-29 @33 static long dcssblk_direct_access(struct block_device *bdev,
e3cb53fb Dan Williams 2016-03-29 34 struct blk_dax_ctl *dax)
^1da177e Linus Torvalds 2005-04-16 35
^1da177e Linus Torvalds 2005-04-16 @36 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
^1da177e Linus Torvalds 2005-04-16 37
^1da177e Linus Torvalds 2005-04-16 @38 static int dcssblk_major;
83d5cde4 Alexey Dobriyan 2009-09-21 @39 static const struct block_device_operations dcssblk_devops = {
^1da177e Linus Torvalds 2005-04-16 40 .owner = THIS_MODULE,
46d74326 Al Viro 2008-03-02 41 .open = dcssblk_open,
46d74326 Al Viro 2008-03-02 42 .release = dcssblk_release,
420edbcc Carsten Otte 2005-06-23 43 .direct_access = dcssblk_direct_access,
^1da177e Linus Torvalds 2005-04-16 44 };
^1da177e Linus Torvalds 2005-04-16 45
b2300b9e Hongjie Yang 2008-10-10 @46 struct dcssblk_dev_info {
b2300b9e Hongjie Yang 2008-10-10 47 struct list_head lh;
b2300b9e Hongjie Yang 2008-10-10 48 struct device dev;
98df67b3 Kay Sievers 2008-12-25 49 char segment_name[DCSS_BUS_ID_SIZE];
b2300b9e Hongjie Yang 2008-10-10 50 atomic_t use_count;
b2300b9e Hongjie Yang 2008-10-10 51 struct gendisk *gd;
b2300b9e Hongjie Yang 2008-10-10 52 unsigned long start;
b2300b9e Hongjie Yang 2008-10-10 53 unsigned long end;
b2300b9e Hongjie Yang 2008-10-10 54 int segment_type;
b2300b9e Hongjie Yang 2008-10-10 55 unsigned char save_pending;
b2300b9e Hongjie Yang 2008-10-10 56 unsigned char is_shared;
b2300b9e Hongjie Yang 2008-10-10 57 struct request_queue *dcssblk_queue;
b2300b9e Hongjie Yang 2008-10-10 58 int num_of_segments;
b2300b9e Hongjie Yang 2008-10-10 59 struct list_head seg_list;
b2300b9e Hongjie Yang 2008-10-10 60 };
b2300b9e Hongjie Yang 2008-10-10 61
b2300b9e Hongjie Yang 2008-10-10 62 struct segment_info {
b2300b9e Hongjie Yang 2008-10-10 63 struct list_head lh;
98df67b3 Kay Sievers 2008-12-25 64 char segment_name[DCSS_BUS_ID_SIZE];
b2300b9e Hongjie Yang 2008-10-10 65 unsigned long start;
b2300b9e Hongjie Yang 2008-10-10 66 unsigned long end;
b2300b9e Hongjie Yang 2008-10-10 67 int segment_type;
b2300b9e Hongjie Yang 2008-10-10 68 };
b2300b9e Hongjie Yang 2008-10-10 69
e404e274 Yani Ioannou 2005-05-17 @70 static ssize_t dcssblk_add_store(struct device * dev, struct device_attribute *attr, const char * buf,
^1da177e Linus Torvalds 2005-04-16 71 size_t count);
e404e274 Yani Ioannou 2005-05-17 @72 static ssize_t dcssblk_remove_store(struct device * dev, struct device_attribute *attr, const char * buf,
^1da177e Linus Torvalds 2005-04-16 73 size_t count);
^1da177e Linus Torvalds 2005-04-16 74
^1da177e Linus Torvalds 2005-04-16 @75 static DEVICE_ATTR(add, S_IWUSR, NULL, dcssblk_add_store);
^1da177e Linus Torvalds 2005-04-16 @76 static DEVICE_ATTR(remove, S_IWUSR, NULL, dcssblk_remove_store);
^1da177e Linus Torvalds 2005-04-16 77
^1da177e Linus Torvalds 2005-04-16 @78 static struct device *dcssblk_root_dev;
^1da177e Linus Torvalds 2005-04-16 79
c11ca97e Denis Cheng 2008-01-26 @80 static LIST_HEAD(dcssblk_devices);
^1da177e Linus Torvalds 2005-04-16 @81 static struct rw_semaphore dcssblk_devices_sem;
^1da177e Linus Torvalds 2005-04-16 82
^1da177e Linus Torvalds 2005-04-16 83 /*
^1da177e Linus Torvalds 2005-04-16 84 * release function for segment device.
^1da177e Linus Torvalds 2005-04-16 85 */
^1da177e Linus Torvalds 2005-04-16 86 static void
^1da177e Linus Torvalds 2005-04-16 87 dcssblk_release_segment(struct device *dev)
^1da177e Linus Torvalds 2005-04-16 @88 {
b2300b9e Hongjie Yang 2008-10-10 89 struct dcssblk_dev_info *dev_info;
b2300b9e Hongjie Yang 2008-10-10 90 struct segment_info *entry, *temp;
b2300b9e Hongjie Yang 2008-10-10 91
:::::: The code at line 36 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2
:::::: TO: Linus Torvalds <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Vishal,
still NAK to calling the direct I/O code directly from the dax code.
On Tue, 2016-03-29 at 23:34 -0700, Christoph Hellwig wrote:
> Hi Vishal,
>
> still NAK to calling the direct I/O code directly from the dax code.
Hm, I thought this was what you meant -- do the fallback/retry attempts
at the callers of dax_do_io instead of the new dax wrapper function..
Did I misunderstand you?
On Wed, Mar 30, 2016 at 12:54:37AM -0600, Vishal Verma wrote:
> On Tue, 2016-03-29 at 23:34 -0700, Christoph Hellwig wrote:
> > Hi Vishal,
> >
> > still NAK to calling the direct I/O code directly from the dax code.
>
> Hm, I thought this was what you meant -- do the fallback/retry attempts
> at the callers of dax_do_io instead of the new dax wrapper function..
> Did I misunderstand you?
Sorry, it is. I misread fs/block_dev.c as fs/dax.c before my first
coffee this morning. I'll properly review the series in the afternoon..
Vishal Verma <[email protected]> writes:
> From: Dan Williams <[email protected]>
>
> This is in preparation for doing badblocks checking against the
> requested sector range in the driver. Currently we opportunistically
> return as much data that can be "dax'd" starting at the given sector.
> When errors are present we want to limit that range to the first
> encountered error, or fail the dax request if the range encompasses an
> error.
I'm not a fan of hiding arguments like this, but it looks fine.
Reviewed-by: Jeff Moyer <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> arch/powerpc/sysdev/axonram.c | 10 +++++-----
> drivers/block/brd.c | 9 +++++----
> drivers/nvdimm/pmem.c | 9 +++++----
> drivers/s390/block/dcssblk.c | 12 ++++++------
> fs/block_dev.c | 2 +-
> include/linux/blkdev.h | 3 +--
> 6 files changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index 0d112b9..d85673f 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -139,17 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
>
> /**
> * axon_ram_direct_access - direct_access() method for block device
> - * @device, @sector, @data: see block_device_operations method
> + * @dax: see block_device_operations method
> */
> static long
> -axon_ram_direct_access(struct block_device *device, sector_t sector,
> - void __pmem **kaddr, pfn_t *pfn)
> +axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
> {
> + sector_t sector = get_start_sect(device) + dax->sector;
> struct axon_ram_bank *bank = device->bd_disk->private_data;
> loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
>
> - *kaddr = (void __pmem __force *) bank->io_addr + offset;
> - *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
> + dax->addr = (void __pmem __force *) bank->io_addr + offset;
> + dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
> return bank->size - offset;
> }
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index f7ecc28..e3e4780 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -380,9 +380,10 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
> }
>
> #ifdef CONFIG_BLK_DEV_RAM_DAX
> -static long brd_direct_access(struct block_device *bdev, sector_t sector,
> - void __pmem **kaddr, pfn_t *pfn)
> +static long brd_direct_access(struct block_device *bdev,
> + struct blk_dax_ctl *dax)
> {
> + sector_t sector = get_start_sect(bdev) + dax->sector;
> struct brd_device *brd = bdev->bd_disk->private_data;
> struct page *page;
>
> @@ -391,8 +392,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
> page = brd_insert_page(brd, sector);
> if (!page)
> return -ENOSPC;
> - *kaddr = (void __pmem *)page_address(page);
> - *pfn = page_to_pfn_t(page);
> + dax->addr = (void __pmem *)page_address(page);
> + dax->pfn = page_to_pfn_t(page);
>
> return PAGE_SIZE;
> }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ca5721c..da10554 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -167,14 +167,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> return rc;
> }
>
> -static long pmem_direct_access(struct block_device *bdev, sector_t sector,
> - void __pmem **kaddr, pfn_t *pfn)
> +static long pmem_direct_access(struct block_device *bdev,
> + struct blk_dax_ctl *dax)
> {
> + sector_t sector = get_start_sect(bdev) + dax->sector;
> struct pmem_device *pmem = bdev->bd_disk->private_data;
> resource_size_t offset = sector * 512 + pmem->data_offset;
>
> - *kaddr = pmem->virt_addr + offset;
> - *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
> + dax->addr = pmem->virt_addr + offset;
> + dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>
> return pmem->size - pmem->pfn_pad - offset;
> }
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 1bce9cf..5719c30 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -30,8 +30,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
> static void dcssblk_release(struct gendisk *disk, fmode_t mode);
> static blk_qc_t dcssblk_make_request(struct request_queue *q,
> struct bio *bio);
> -static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
> - void __pmem **kaddr, pfn_t *pfn);
> +static long dcssblk_direct_access(struct block_device *bdev,
> + struct blk_dax_ctl *dax)
>
> static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>
> @@ -882,9 +882,9 @@ fail:
> }
>
> static long
> -dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> - void __pmem **kaddr, pfn_t *pfn)
> +dcssblk_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
> {
> + sector_t secnum = get_start_sect(bdev) + dax->sector;
> struct dcssblk_dev_info *dev_info;
> unsigned long offset, dev_sz;
>
> @@ -893,8 +893,8 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> return -ENODEV;
> dev_sz = dev_info->end - dev_info->start;
> offset = secnum * 512;
> - *kaddr = (void __pmem *) (dev_info->start + offset);
> - *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
> + dax->addr = (void __pmem *) (dev_info->start + offset);
> + dax->pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
>
> return dev_sz - offset;
> }
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3172c4e..c5837fa 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -488,7 +488,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
> sector += get_start_sect(bdev);
> if (sector % (PAGE_SIZE / 512))
> return -EINVAL;
> - avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
> + avail = ops->direct_access(bdev, dax);
> if (!avail)
> return -ERANGE;
> if (avail > 0 && avail & ~PAGE_MASK)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7e5d7e0..92f8a1f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1656,8 +1656,7 @@ struct block_device_operations {
> int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
> int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
> int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
> - long (*direct_access)(struct block_device *, sector_t, void __pmem **,
> - pfn_t *);
> + long (*direct_access)(struct block_device *, struct blk_dax_ctl *dax);
> unsigned int (*check_events) (struct gendisk *disk,
> unsigned int clearing);
> /* ->media_changed() is DEPRECATED, use ->check_events() instead */
Vishal Verma <[email protected]> writes:
> From: Dan Williams <[email protected]>
>
> In preparation for consulting a badblocks list in pmem_direct_access(),
> teach dax_pmd_fault() to fallback rather than fail immediately upon
> encountering an error. The thought being that reducing the span of the
> dax request may avoid the error region.
>
> Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
> ---
> fs/dax.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 90322eb..ec6417b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -945,8 +945,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> long length = dax_map_atomic(bdev, &dax);
>
> if (length < 0) {
> - result = VM_FAULT_SIGBUS;
> - goto out;
> + dax_pmd_dbg(&bh, address, "dax-error fallback");
> + goto fallback;
> }
> if (length < PMD_SIZE) {
> dax_pmd_dbg(&bh, address, "dax-length too small");
Vishal Verma <[email protected]> writes:
> From: Dan Williams <[email protected]>
>
> 1/ If a mapping overlaps a bad sector fail the request.
>
> 2/ Do not opportunistically report more dax-capable capacity than is
> requested when errors present.
>
> [vishal: fix a conflict with system RAM collision patches]
> Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
> ---
> block/ioctl.c | 9 ---------
> drivers/nvdimm/pmem.c | 8 ++++++++
> 2 files changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/block/ioctl.c b/block/ioctl.c
> index d8996bb..cd7f392 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -423,15 +423,6 @@ bool blkdev_dax_capable(struct block_device *bdev)
> || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
> return false;
>
> - /*
> - * If the device has known bad blocks, force all I/O through the
> - * driver / page cache.
> - *
> - * TODO: support finer grained dax error handling
> - */
> - if (disk->bb && disk->bb->count)
> - return false;
> -
> return true;
> }
> #endif
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index da10554..eac5f93 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -174,9 +174,17 @@ static long pmem_direct_access(struct block_device *bdev,
> struct pmem_device *pmem = bdev->bd_disk->private_data;
> resource_size_t offset = sector * 512 + pmem->data_offset;
>
> + if (unlikely(is_bad_pmem(&pmem->bb, sector, dax->size)))
> + return -EIO;
> dax->addr = pmem->virt_addr + offset;
> dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
>
> + /*
> + * If badblocks are present, limit known good range to the
> + * requested range.
> + */
> + if (unlikely(pmem->bb.count))
> + return dax->size;
> return pmem->size - pmem->pfn_pad - offset;
> }
Vishal Verma <[email protected]> writes:
> From: Matthew Wilcox <[email protected]>
>
> dax_clear_sectors() cannot handle poisoned blocks. These must be
> zeroed using the BIO interface instead. Convert ext2 and XFS to use
> only sb_issue_zerout().
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> [vishal: Also remove the dax_clear_sectors function entirely]
> Signed-off-by: Vishal Verma <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Vishal Verma <[email protected]> writes:
> dax_do_io (called for read() or write() for a dax file system) may fail
> in the presence of bad blocks or media errors. Since we expect that a
> write should clear media errors on nvdimms, make dax_do_io fall back to
> the direct_IO path, which will send down a bio to the driver, which can
> then attempt to clear the error.
[snip]
> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> + ret_saved = ret;
> + else
> + return ret;
> + }
> +
> + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> blkdev_get_block, NULL, NULL,
> DIO_SKIP_DIO_COUNT);
> + if (ret < 0 && ret_saved)
> + return ret_saved;
> +
Hmm, did you just break async DIO? I think you did! :)
__blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
into -EIO. Really, I don't see a reason to save that first -EIO. The
same applies to all instances in this patch.
Cheers,
Jeff
> + return ret;
> }
>
> int __sync_blockdev(struct block_device *bdev, int wait)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 824f249..64792c6 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -859,14 +859,22 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> struct address_space *mapping = file->f_mapping;
> struct inode *inode = mapping->host;
> size_t count = iov_iter_count(iter);
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
>
> - if (IS_DAX(inode))
> - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> - DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> - ext2_get_block);
> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block,
> + NULL, DIO_LOCKING | DIO_SKIP_HOLES);
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto out;
> + }
> +
> + ret = blockdev_direct_IO(iocb, inode, iter, offset, ext2_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +
> + out:
> if (ret < 0 && iov_iter_rw(iter) == WRITE)
> ext2_write_failed(mapping, offset + count);
> return ret;
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 3027fa6..798f341 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -716,14 +716,22 @@ retry:
> NULL, NULL, 0);
> inode_dio_end(inode);
> } else {
> + ssize_t ret_saved = 0;
> +
> locked:
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
> ret = dax_do_io(iocb, inode, iter, offset,
> ext4_dio_get_block, NULL, DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> - ext4_dio_get_block);
> -
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto skip_dio;
> + }
> + ret = blockdev_direct_IO(iocb, inode, iter, offset,
> + ext4_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +skip_dio:
> if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
> loff_t isize = i_size_read(inode);
> loff_t end = offset + count;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index dab84a2..27f07c2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3341,7 +3341,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> {
> struct file *file = iocb->ki_filp;
> struct inode *inode = file->f_mapping->host;
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
> size_t count = iov_iter_count(iter);
> int overwrite = 0;
> get_block_t *get_block_func = NULL;
> @@ -3401,15 +3401,22 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> #ifdef CONFIG_EXT4_FS_ENCRYPTION
> BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
> #endif
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
> ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
> ext4_end_io_dio, dio_flags);
> - else
> - ret = __blockdev_direct_IO(iocb, inode,
> - inode->i_sb->s_bdev, iter, offset,
> - get_block_func,
> - ext4_end_io_dio, NULL, dio_flags);
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto skip_dio;
> + }
>
> + ret = __blockdev_direct_IO(iocb, inode,
> + inode->i_sb->s_bdev, iter, offset,
> + get_block_func,
> + ext4_end_io_dio, NULL, dio_flags);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> + skip_dio:
> if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> EXT4_STATE_DIO_UNWRITTEN)) {
> int err;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d445a64..7cfcf86 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1413,6 +1413,7 @@ xfs_vm_direct_IO(
> dio_iodone_t *endio = NULL;
> int flags = 0;
> struct block_device *bdev;
> + ssize_t ret, ret_saved = 0;
>
> if (iov_iter_rw(iter) == WRITE) {
> endio = xfs_end_io_direct_write;
> @@ -1420,13 +1421,22 @@ xfs_vm_direct_IO(
> }
>
> if (IS_DAX(inode)) {
> - return dax_do_io(iocb, inode, iter, offset,
> + ret = dax_do_io(iocb, inode, iter, offset,
> xfs_get_blocks_direct, endio, 0);
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + return ret;
> }
>
> bdev = xfs_find_bdev_for_inode(inode);
> - return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
> + ret = __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
> xfs_get_blocks_direct, endio, NULL, flags);
> +
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +
> + return ret;
> }
>
> /*
On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> Vishal Verma <[email protected]> writes:
>
> >
> > dax_do_io (called for read() or write() for a dax file system) may
> > fail
> > in the presence of bad blocks or media errors. Since we expect that
> > a
> > write should clear media errors on nvdimms, make dax_do_io fall
> > back to
> > the direct_IO path, which will send down a bio to the driver, which
> > can
> > then attempt to clear the error.
> [snip]
>
> >
> > + if (IS_DAX(inode)) {
> > + ret = dax_do_io(iocb, inode, iter, offset,
> > blkdev_get_block,
> > NULL, DIO_SKIP_DIO_COUNT);
> > - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > + ret_saved = ret;
> > + else
> > + return ret;
> > + }
> > +
> > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> > blkdev_get_block, NULL, NULL,
> > DIO_SKIP_DIO_COUNT);
> > + if (ret < 0 && ret_saved)
> > + return ret_saved;
> > +
> Hmm, did you just break async DIO? I think you did! :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> that
> into -EIO. Really, I don't see a reason to save that first
> -EIO. The
> same applies to all instances in this patch.
The reason I saved it was if __blockdev_direct_IO fails for some
reason, we should return the original cause o the error, which was an
EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
something else..
But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
Thanks,
-Vishal
"Verma, Vishal L" <[email protected]> writes:
> On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
>> Vishal Verma <[email protected]> writes:
>> > + if (IS_DAX(inode)) {
>> > + ret = dax_do_io(iocb, inode, iter, offset,
>> > blkdev_get_block,
>> > NULL, DIO_SKIP_DIO_COUNT);
>> > - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> > + ret_saved = ret;
>> > + else
>> > + return ret;
>> > + }
>> > +
>> > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> > blkdev_get_block, NULL, NULL,
>> > DIO_SKIP_DIO_COUNT);
>> > + if (ret < 0 && ret_saved)
>> > + return ret_saved;
>> > +
>> Hmm, did you just break async DIO? I think you did! :)
>> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
>> that
>> into -EIO. Really, I don't see a reason to save that first
>> -EIO. The
>> same applies to all instances in this patch.
>
> The reason I saved it was if __blockdev_direct_IO fails for some
> reason, we should return the original cause o the error, which was an
> EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
> something else..
OK.
> But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
For async direct I/O, only the setup phase of the I/O is performed and
then we return to the caller. -EIOCBQUEUED signifies this.
You're heading towards code that looks like this:
if (IS_DAX(inode)) {
ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
NULL, DIO_SKIP_DIO_COUNT);
if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
ret_saved = ret;
else
return ret;
}
ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
return ret_saved;
There's a lot of special casing here, so you might consider adding
comments.
Cheers,
Jeff
On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
> "Verma, Vishal L" <[email protected]> writes:
>
> >
> > On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> > >
> > > Vishal Verma <[email protected]> writes:
> > > >
> > > > + if (IS_DAX(inode)) {
> > > > + ret = dax_do_io(iocb, inode, iter, offset,
> > > > blkdev_get_block,
> > > > NULL, DIO_SKIP_DIO_COUNT);
> > > > - return __blockdev_direct_IO(iocb, inode,
> > > > I_BDEV(inode),
> > > > iter, offset,
> > > > + if (ret == -EIO && (iov_iter_rw(iter) ==
> > > > WRITE))
> > > > + ret_saved = ret;
> > > > + else
> > > > + return ret;
> > > > + }
> > > > +
> > > > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > > iter, offset,
> > > > blkdev_get_block, NULL,
> > > > NULL,
> > > > DIO_SKIP_DIO_COUNT);
> > > > + if (ret < 0 && ret_saved)
> > > > + return ret_saved;
> > > > +
> > > Hmm, did you just break async DIO? I think you did! :)
> > > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now
> > > turned
> > > that
> > > into -EIO. Really, I don't see a reason to save that first
> > > -EIO. The
> > > same applies to all instances in this patch.
> > The reason I saved it was if __blockdev_direct_IO fails for some
> > reason, we should return the original cause o the error, which was
> > an
> > EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails
> > with
> > something else..
> OK.
>
> >
> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
> For async direct I/O, only the setup phase of the I/O is performed
> and
> then we return to the caller. -EIOCBQUEUED signifies this.
>
> You're heading towards code that looks like this:
>
> if (IS_DAX(inode)) {
> ret = dax_do_io(iocb, inode, iter, offset,
> blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> ret_saved = ret;
> else
> return ret;
> }
>
> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> offset,
> blkdev_get_block, NULL, NULL,
> DIO_SKIP_DIO_COUNT);
> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
> return ret_saved;
>
> There's a lot of special casing here, so you might consider adding
> comments.
Correct - maybe we should reconsider wrapper-izing this? :)
Thanks for the explanation and for catching this. I'll fix it for the
next revision.
>
> Cheers,
> Jeff
On Fri, Apr 15, 2016 at 10:37 AM, Verma, Vishal L
<[email protected]> wrote:
> On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
[..]
>> >
>> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
>> For async direct I/O, only the setup phase of the I/O is performed
>> and
>> then we return to the caller. -EIOCBQUEUED signifies this.
>>
>> You're heading towards code that looks like this:
>>
>> if (IS_DAX(inode)) {
>> ret = dax_do_io(iocb, inode, iter, offset,
>> blkdev_get_block,
>> NULL, DIO_SKIP_DIO_COUNT);
>> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> ret_saved = ret;
>> else
>> return ret;
>> }
>>
>> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
>> offset,
>> blkdev_get_block, NULL, NULL,
>> DIO_SKIP_DIO_COUNT);
>> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
>> return ret_saved;
>>
>> There's a lot of special casing here, so you might consider adding
>> comments.
>
> Correct - maybe we should reconsider wrapper-izing this? :)
Another option is just to skip dax_do_io() and this special casing
fallback entirely if errors are present. I.e. only attempt dax_do_io
when: IS_DAX() && gendisk->bb && bb->count == 0.
Dan Williams <[email protected]> writes:
>>> There's a lot of special casing here, so you might consider adding
>>> comments.
>>
>> Correct - maybe we should reconsider wrapper-izing this? :)
>
> Another option is just to skip dax_do_io() and this special casing
> fallback entirely if errors are present. I.e. only attempt dax_do_io
> when: IS_DAX() && gendisk->bb && bb->count == 0.
So, if there's an error anywhere on the device, penalize all I/O (not
just writes, and not just on sectors that are bad)? I'm not sure that's
a great plan, either.
-Jeff
On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
>
>>>> There's a lot of special casing here, so you might consider adding
>>>> comments.
>>>
>>> Correct - maybe we should reconsider wrapper-izing this? :)
>>
>> Another option is just to skip dax_do_io() and this special casing
>> fallback entirely if errors are present. I.e. only attempt dax_do_io
>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>
> So, if there's an error anywhere on the device, penalize all I/O (not
> just writes, and not just on sectors that are bad)? I'm not sure that's
> a great plan, either.
>
If errors are rare how much are we actually losing in practice?
Moreover, we're going to do the full badblocks lookup anyway when we
call ->direct_access(). If we had that information earlier we can
avoid this fallback dance.
Dan Williams <[email protected]> writes:
> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer <[email protected]> wrote:
>> Dan Williams <[email protected]> writes:
>>
>>>>> There's a lot of special casing here, so you might consider adding
>>>>> comments.
>>>>
>>>> Correct - maybe we should reconsider wrapper-izing this? :)
>>>
>>> Another option is just to skip dax_do_io() and this special casing
>>> fallback entirely if errors are present. I.e. only attempt dax_do_io
>>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>>
>> So, if there's an error anywhere on the device, penalize all I/O (not
>> just writes, and not just on sectors that are bad)? I'm not sure that's
>> a great plan, either.
>>
>
> If errors are rare how much are we actually losing in practice?
How long is a piece of string?
> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access(). If we had that information earlier we can
> avoid this fallback dance.
None of the proposed approaches looks clean to me. I'll go along with
whatever you guys think is best. I am in favor of wrapping up all that
duplicated code, though.
Cheers,
Jeff
On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer <[email protected]> wrote:
>> Moreover, we're going to do the full badblocks lookup anyway when we
>> call ->direct_access(). If we had that information earlier we can
>> avoid this fallback dance.
>
> None of the proposed approaches looks clean to me. I'll go along with
> whatever you guys think is best. I am in favor of wrapping up all that
> duplicated code, though.
Christoph originally pushed for open coding this fallback decision
per-filesystem. I agree with you on the "none the above" options are
clean.
On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer <[email protected]> wrote:
> >
> > Dan Williams <[email protected]> writes:
> >
> > > > > There's a lot of special casing here, so you might consider
> > > > > adding comments.
> > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > Another option is just to skip dax_do_io() and this special casing
> > > fallback entirely if errors are present. I.e. only attempt dax_do_io
> > > when: IS_DAX() && gendisk->bb && bb->count == 0.
> >
> > So, if there's an error anywhere on the device, penalize all I/O (not
> > just writes, and not just on sectors that are bad)? I'm not sure
> > that's a great plan, either.
> >
> If errors are rare how much are we actually losing in practice?
> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access(). If we had that information earlier we can
> avoid this fallback dance.
A system running with DAX may have active data set in NVDIMM lager than RAM
size. In this case, falling back to non-DAX will allocate page cache for
the data, which will saturate the system with memory pressure.
Thanks,
-Toshi
Dan Williams <[email protected]> writes:
> On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer <[email protected]> wrote:
>>> Moreover, we're going to do the full badblocks lookup anyway when we
>>> call ->direct_access(). If we had that information earlier we can
>>> avoid this fallback dance.
>>
>> None of the proposed approaches looks clean to me. I'll go along with
>> whatever you guys think is best. I am in favor of wrapping up all that
>> duplicated code, though.
>
> Christoph originally pushed for open coding this fallback decision
> per-filesystem. I agree with you on the "none the above" options are
> clean.
I don't recall him saying "open code". Rather, the sentiment was to
leave the fallback to the callers. That doesn't mean you can't wrap it
up in a convenience function.
Cheers,
Jeff
On Fri, 2016-04-15 at 13:01 -0600, Toshi Kani wrote:
> On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> >
> > On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer <[email protected]> wrote:
> > >
> > > Dan Williams <[email protected]> writes:
> > >
> > > > > > There's a lot of special casing here, so you might consider
> > > > > > adding comments.
> > > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > > Another option is just to skip dax_do_io() and this special casing
> > > > fallback entirely if errors are present. I.e. only attempt
> > > > dax_do_io when: IS_DAX() && gendisk->bb && bb->count == 0.
> > >
> > > So, if there's an error anywhere on the device, penalize all I/O (not
> > > just writes, and not just on sectors that are bad)? I'm not sure
> > > that's a great plan, either.
> > >
> > If errors are rare how much are we actually losing in practice?
> > Moreover, we're going to do the full badblocks lookup anyway when we
> > call ->direct_access(). If we had that information earlier we can
> > avoid this fallback dance.
>
> A system running with DAX may have active data set in NVDIMM lager than
> RAM size. In this case, falling back to non-DAX will allocate page cache
> for the data, which will saturate the system with memory pressure.
Oh, sorry, we are still in DIO path. Falling back to DIO should not cause
this issue.
-Toshi
On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> > + if (IS_DAX(inode)) {
> > + ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> > NULL, DIO_SKIP_DIO_COUNT);
> > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > + ret_saved = ret;
> > + else
> > + return ret;
> > + }
> > +
> > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> > blkdev_get_block, NULL, NULL,
> > DIO_SKIP_DIO_COUNT);
> > + if (ret < 0 && ret_saved)
> > + return ret_saved;
> > +
>
> Hmm, did you just break async DIO? I think you did! :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
> into -EIO. Really, I don't see a reason to save that first -EIO. The
> same applies to all instances in this patch.
Yes, there is no point in saving the earlier error - just return the
second error all the time.
E.g.
ret = dax_io();
if (dax_need_dio_retry(ret))
ret = direct_IO();
On Wed, 2016-04-20 at 13:59 -0700, Christoph Hellwig wrote:
> On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> >
> > >
> > > + if (IS_DAX(inode)) {
> > > + ret = dax_do_io(iocb, inode, iter, offset,
> > > blkdev_get_block,
> > > NULL, DIO_SKIP_DIO_COUNT);
> > > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > > + ret_saved = ret;
> > > + else
> > > + return ret;
> > > + }
> > > +
> > > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > iter, offset,
> > > blkdev_get_block, NULL,
> > > NULL,
> > > DIO_SKIP_DIO_COUNT);
> > > + if (ret < 0 && ret_saved)
> > > + return ret_saved;
> > > +
> > Hmm, did you just break async DIO? I think you did! :)
> > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> > that
> > into -EIO. Really, I don't see a reason to save that first
> > -EIO. The
> > same applies to all instances in this patch.
> Yes, there is no point in saving the earlier error - just return the
> second error all the time.
Is it ok to do that?
direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
to some allocation failing, and I thought we should return the original
-EIO in such cases so that the application doesn't lose the information
that the bad block is actually causing the error.
>
> E.g.
>
> ret = dax_io();
> if (dax_need_dio_retry(ret))
> ret = direct_IO();
>
On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
> to some allocation failing, and I thought we should return the original
> -EIO in such cases so that the application doesn't lose the information
> that the bad block is actually causing the error.
EINVAL is a concern here. Not due to the right error reported, but
because it means your current scheme is fundamentally broken - we
need to support I/O at any alignment for DAX I/O, and not fail due to
alignbment concernes for a highly specific degraded case.
I think this whole series need to go back to the drawing board as I
don't think it can actually rely on using direct I/O as the EIO
fallback.
"[email protected]" <[email protected]> writes:
> On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
>> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
>> to some allocation failing, and I thought we should return the original
>> -EIO in such cases so that the application doesn't lose the information
>> that the bad block is actually causing the error.
>
> EINVAL is a concern here. Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
>
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.
The only callers of dax_do_io are direct_IO methods.
Cheers,
Jeff
On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
> On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
> >
> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > due
> > to some allocation failing, and I thought we should return the
> > original
> > -EIO in such cases so that the application doesn't lose the
> > information
> > that the bad block is actually causing the error.
> EINVAL is a concern here. Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
>
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.
>
Agreed that DAX I/O can happen with any size/alignment, but how else do
we send an IO through the driver without alignment restrictions? Also,
the granularity at which we store badblocks is 512B sectors, so it
seems natural that to clear such a sector, you'd expect to send a write
to the whole sector.
The expected usage flow is:
- Application hits EIO doing dax_IO or load/store io
- It checks badblocks and discovers it's files have lost data
- It write()s those sectors (possibly converted to file offsets using
fiemap)
* This triggers the fallback path, but if the application is doing
this level of recovery, it will know the sector is bad, and write the
entire sector
- Or it replaces the entire file from backup also using write() (not
mmap+stores)
* This just frees the fs block, and the next time the block is
reallocated by the fs, it will likely be zeroed first, and that will be
done through the driver and will clear errors
I think if we want to keep allowing arbitrary alignments for the
dax_do_io path, we'd need:
1. To represent badblocks at a finer granularity (likely cache lines)
2. To allow the driver to do IO to a *block device* at sub-sector
granularity
Can we do that?
On Mon, Apr 25, 2016 at 10:14 AM, Verma, Vishal L
<[email protected]> wrote:
> On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
>> On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
>> >
>> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > due
>> > to some allocation failing, and I thought we should return the
>> > original
>> > -EIO in such cases so that the application doesn't lose the
>> > information
>> > that the bad block is actually causing the error.
>> EINVAL is a concern here. Not due to the right error reported, but
>> because it means your current scheme is fundamentally broken - we
>> need to support I/O at any alignment for DAX I/O, and not fail due to
>> alignbment concernes for a highly specific degraded case.
>>
>> I think this whole series need to go back to the drawing board as I
>> don't think it can actually rely on using direct I/O as the EIO
>> fallback.
>>
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
>
> The expected usage flow is:
>
> - Application hits EIO doing dax_IO or load/store io
>
> - It checks badblocks and discovers it's files have lost data
>
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
>
> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
> * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors
>
>
> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity
3. Arrange for O_DIRECT to bypass dax_do_io(), and leave the
optimization only for the dax "buffered I/O" case.
4. Skip dax_do_io() entirely in the presence of errors
I think 3 is the most closely aligned with the typical block device
model. In the typical case a buffered write may fail due to a
badblock read when filling the page cache, but an O_DIRECT write would
bypass the page cache and potentially clear the error / cause the
block to be reallocated internally to the drive.
On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
> On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
> > On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
> > >
> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > due
> > > to some allocation failing, and I thought we should return the
> > > original
> > > -EIO in such cases so that the application doesn't lose the
> > > information
> > > that the bad block is actually causing the error.
> > EINVAL is a concern here.??Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> >
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> >
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
>
> The expected usage flow is:
>
> - Application hits EIO doing dax_IO or load/store io
>
> - It checks badblocks and discovers it's files have lost data
Lots of hand-waving here. How does the application map a bad
"sector" to a file without scanning the entire filesystem to find
the owner of the bad sector?
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> ? ? * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
Where does the application find the data that was lost to be able to
rewrite it?
> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
> ? ? * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors
There's an implicit assumption that applications will keep redundant
copies of their data at the /application layer/ and be able to
automatically repair it? And then there's the implicit assumption
that it will unlink and free the entire file before writing a new
copy, and that then assumes the the filesystem will zero blocks if
they get reused to clear errors on that LBA sector mapping before
they are accessible again to userspace..
It seems to me that there are a number of assumptions being made
across multiple layers here. Maybe I've missed something - can you
point me to the design/architecture description so I can see how
"app does data recovery itself" dance is supposed to work?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 26, 2016 at 09:25:52AM +1000, Dave Chinner wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
> > On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
> > > On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
> > > >
> > > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > > due
> > > > to some allocation failing, and I thought we should return the
> > > > original
> > > > -EIO in such cases so that the application doesn't lose the
> > > > information
> > > > that the bad block is actually causing the error.
> > > EINVAL is a concern here.??Not due to the right error reported, but
> > > because it means your current scheme is fundamentally broken - we
> > > need to support I/O at any alignment for DAX I/O, and not fail due to
> > > alignbment concernes for a highly specific degraded case.
> > >
> > > I think this whole series need to go back to the drawing board as I
> > > don't think it can actually rely on using direct I/O as the EIO
> > > fallback.
> > >
> > Agreed that DAX I/O can happen with any size/alignment, but how else do
> > we send an IO through the driver without alignment restrictions? Also,
> > the granularity at which we store badblocks is 512B sectors, so it
> > seems natural that to clear such a sector, you'd expect to send a write
> > to the whole sector.
> >
> > The expected usage flow is:
> >
> > - Application hits EIO doing dax_IO or load/store io
> >
> > - It checks badblocks and discovers it's files have lost data
>
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?
FWIW there was some discussion @ LSF about using (XFS) rmap to figure out
which parts of a file (on XFS) have gone bad. Chris Mason said that he'd
like to collaborate on having a common getfsmap ioctl between btrfs and
XFS since they have a backref index that could be hooked up to it for them.
Obviously the app still has to coordinate stopping file IO and calling
GETFSMAP since the fs won't do that on its own. There's also the question
of how to handle LBA translation if there's other stuff like dm in the way.
I don't think device-mapper or md do reverse mapping, so things get murky
from here.
Guess I should get on pushing out a getfsmap patch for review. :)
--D
(/me doesn't have answers to any of your other questions.)
> > - It write()s those sectors (possibly converted to file offsets using
> > fiemap)
> > ? ? * This triggers the fallback path, but if the application is doing
> > this level of recovery, it will know the sector is bad, and write the
> > entire sector
>
> Where does the application find the data that was lost to be able to
> rewrite it?
>
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> > ? ? * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will be
> > done through the driver and will clear errors
>
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
>
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> _______________________________________________
> xfs mailing list
> [email protected]
> http://oss.sgi.com/mailman/listinfo/xfs
On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner <[email protected]> wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
>> On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
>> > On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
>> > >
>> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > > due
>> > > to some allocation failing, and I thought we should return the
>> > > original
>> > > -EIO in such cases so that the application doesn't lose the
>> > > information
>> > > that the bad block is actually causing the error.
>> > EINVAL is a concern here. Not due to the right error reported, but
>> > because it means your current scheme is fundamentally broken - we
>> > need to support I/O at any alignment for DAX I/O, and not fail due to
>> > alignbment concernes for a highly specific degraded case.
>> >
>> > I think this whole series need to go back to the drawing board as I
>> > don't think it can actually rely on using direct I/O as the EIO
>> > fallback.
>> >
>> Agreed that DAX I/O can happen with any size/alignment, but how else do
>> we send an IO through the driver without alignment restrictions? Also,
>> the granularity at which we store badblocks is 512B sectors, so it
>> seems natural that to clear such a sector, you'd expect to send a write
>> to the whole sector.
>>
>> The expected usage flow is:
>>
>> - Application hits EIO doing dax_IO or load/store io
>>
>> - It checks badblocks and discovers it's files have lost data
>
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?
>
>> - It write()s those sectors (possibly converted to file offsets using
>> fiemap)
>> * This triggers the fallback path, but if the application is doing
>> this level of recovery, it will know the sector is bad, and write the
>> entire sector
>
> Where does the application find the data that was lost to be able to
> rewrite it?
>
>> - Or it replaces the entire file from backup also using write() (not
>> mmap+stores)
>> * This just frees the fs block, and the next time the block is
>> reallocated by the fs, it will likely be zeroed first, and that will be
>> done through the driver and will clear errors
>
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
>
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
>
Maybe I missed something, but all these assumptions are already
present for typical block devices, i.e. sectors may go bad and a write
may make the sector usable again. This patch series is extending that
out to the DAX-mmap case, but it's the same principle of "write to
clear error" that we live with in the block-I/O path. What
clarification are you looking for beyond that point?
On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>
<>
> >
> > - It checks badblocks and discovers it's files have lost data
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?
Yes this was hand-wavey, but we talked about this a bit at LSF..
The idea is that a per-block-device badblocks list is available at
/sys/block/<pmemX>/badblocks. The application (or a suitable yet-to-be-
written library function) does a fiemap to figure out the sectors its
files are using, and correlates the two lists.
We can also look into providing an easier-to-use interface from the
kernel, in the form of an fiemap flag to report only the bad sectors, or
a SEEK_BAD flag..
The application doesn't have to scan the entire filesystem, but
presumably it knows what files it 'owns', and does a fiemap for those.
>
> >
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> > * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> Where does the application find the data that was lost to be able to
> rewrite it?
The data that was lost is gone -- this assumes the application has some
ability to recover using a journal/log or other redundancy - yes, at the
application layer. If it doesn't have this sort of capability, the only
option is to restore files from a backup/mirror.
>
> >
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> > * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will
> > be
> > done through the driver and will clear errors
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
>
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
There isn't a document other than the flow in my head :) - but maybe I
could write one up..
I wasn't thinking the application itself maintains and restores from
backup copy of the file.. The application hits either a SIGBUS or EIO
depending on how it accesses the data, and crashes or raises some alarm.
The recovery is then done out-of-band, by a sysadmin or such (i.e.
delete the file, replace with a known good copy, restart application).
To summarize, the two cases we want to handle are:
1. Application has inbuilt recovery:
- hits badblock
- figures out it is able to recover the data
- handles SIGBUS or EIO
- does a (sector aligned) write() to restore the data
2. Application doesn't have any inbuilt recovery mechanism
- hits badblock
- gets SIGBUS (or EIO) and crashes
- Sysadmin restores file from backup
Case 1 is handled by either a fallback to direct_IO from dax_do_io, or
always _actually_ doing direct_IO when we're opened with O_DIRECT in
spite of dax (what Dan suggested). Currently if we're mounted with dax,
all IO O_DIRECT or otherwise will go through dax_do_io.
Case 2 is handled by patch 4 of the series:
dax: use sb_issue_zerout instead of calling dax_clear_sectors
>
> Cheers,
>
> Dave.
On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner <[email protected]> wrote:
> > On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
> >> On Mon, 2016-04-25 at 01:31 -0700, [email protected] wrote:
> >> > On Sat, Apr 23, 2016 at 06:08:37PM +0000, Verma, Vishal L wrote:
> >> > >
> >> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> >> > > due
> >> > > to some allocation failing, and I thought we should return the
> >> > > original
> >> > > -EIO in such cases so that the application doesn't lose the
> >> > > information
> >> > > that the bad block is actually causing the error.
> >> > EINVAL is a concern here. Not due to the right error reported, but
> >> > because it means your current scheme is fundamentally broken - we
> >> > need to support I/O at any alignment for DAX I/O, and not fail due to
> >> > alignbment concernes for a highly specific degraded case.
> >> >
> >> > I think this whole series need to go back to the drawing board as I
> >> > don't think it can actually rely on using direct I/O as the EIO
> >> > fallback.
> >> >
> >> Agreed that DAX I/O can happen with any size/alignment, but how else do
> >> we send an IO through the driver without alignment restrictions? Also,
> >> the granularity at which we store badblocks is 512B sectors, so it
> >> seems natural that to clear such a sector, you'd expect to send a write
> >> to the whole sector.
> >>
> >> The expected usage flow is:
> >>
> >> - Application hits EIO doing dax_IO or load/store io
> >>
> >> - It checks badblocks and discovers it's files have lost data
> >
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
> >
> >> - It write()s those sectors (possibly converted to file offsets using
> >> fiemap)
> >> * This triggers the fallback path, but if the application is doing
> >> this level of recovery, it will know the sector is bad, and write the
> >> entire sector
> >
> > Where does the application find the data that was lost to be able to
> > rewrite it?
> >
> >> - Or it replaces the entire file from backup also using write() (not
> >> mmap+stores)
> >> * This just frees the fs block, and the next time the block is
> >> reallocated by the fs, it will likely be zeroed first, and that will be
> >> done through the driver and will clear errors
> >
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> >
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
> >
>
> Maybe I missed something, but all these assumptions are already
> present for typical block devices, i.e. sectors may go bad and a write
> may make the sector usable again.
The assumption we make about sectors going bad on SSDs or SRDs is
that the device is about to die and needs replacing ASAP. Then
RAID takes care of the rebuild completely transparently. i.e.
handling and correcting bad sectors is typically done completely
transparently /below/ the filesytem like so:
Application
Filesystem
block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware
[LBA redundancy/correction e.g h/w RAID]
In the case of filesystems with their own RAID/redundancy code (e.g.
btrfs), then it looks like this:
Application
Filesystem
mapping/redundancy/correction driver
block
driver
hardware
[LBA redundancy/correction e.g h/w RAID]
> This patch series is extending that
> out to the DAX-mmap case, but it's the same principle of "write to
> clear error" that we live with in the block-I/O path. What
> clarification are you looking for beyond that point?
I'm asking for an actual design document that explains how moving
all the redundancy and bad sector correction stuff from the LBA
layer up into application space is supposed to work when
applications have no clue about LBA mappings, nor tend to keep
redundant data around. i.e. you're proposing this:
Application
Application data redundancy/correction
Filesystem
Block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware
And somehow all the error information from the hardware layer needs
to be propagated up to the application layer, along with all the
mapping information from the filesystem and block layers for the
application to make sense of the hardware reported errors.
I see assumptions this this "just works" but we don't have any of
the relevant APIs or infrastructure to enable the application to do
the hardware error->file+offset namespace mapping (i.e. filesystem
reverse mapping for for file offsets and directory paths, and
reverse mapping for the the block layer remapping drivers).
I haven't seen any design/documentation for infrastructure at the
application layer to handle redundant data and correctly
transparently so I don't have any idea what the technical
requirements this different IO stack places on filesystems may be.
Hence I'm asking for some kind of architecture/design documentation
that I can read to understand exactly what is being proposed here...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Apr 25, 2016 at 11:53:13PM +0000, Verma, Vishal L wrote:
> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> >?
> <>
>
> > >
> > > - It checks badblocks and discovers it's files have lost data
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
>
> Yes this was hand-wavey, but we talked about this a bit at LSF..
> The idea is that a per-block-device badblocks list is available at
> /sys/block/<pmemX>/badblocks. The application (or a suitable yet-to-be-
> written library function) does a fiemap to figure out the sectors its
> files are using, and correlates the two lists.
> We can also look into providing an easier-to-use interface from the
> kernel, in the form of an fiemap flag to report only the bad sectors, or
> a SEEK_BAD flag..
> The application doesn't have to scan the entire filesystem, but
> presumably it knows what files it 'owns', and does a fiemap for those.
You're assuming that only the DAX aware application accesses it's
files. users, backup programs, data replicators, fileystem
re-organisers (e.g. defragmenters) etc all may access the files and
they may throw errors. What then?
> > > - It write()s those sectors (possibly converted to file offsets
> > > using
> > > fiemap)
> > > ? ? * This triggers the fallback path, but if the application is
> > > doing
> > > this level of recovery, it will know the sector is bad, and write
> > > the
> > > entire sector
> > Where does the application find the data that was lost to be able to
> > rewrite it?
>
> The data that was lost is gone -- this assumes the application has some
> ability to recover using a journal/log or other redundancy - yes, at the
> application layer. If it doesn't have this sort of capability, the only
> option is to restore files from a backup/mirror.
So the architecture has a built in assumption that only userspace
can handle data loss?
What about filesytsems like NOVA, that use log structured design to
provide DAX w/ update atomicity and can potentially also provide
redundancy/repair through the same mechanisms? Won't pmem native
filesystems with built in data protection features like this remove
the need for adding all this to userspace applications?
If so, shouldn't that be the focus of development rahter than
placing the burden on userspace apps to handle storage repair
situations?
> > > - Or it replaces the entire file from backup also using write() (not
> > > mmap+stores)
> > > ? ? * This just frees the fs block, and the next time the block is
> > > reallocated by the fs, it will likely be zeroed first, and that will
> > > be
> > > done through the driver and will clear errors
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> >
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
>
> There isn't a document other than the flow in my head :) - but maybe I
> could write one up..
> I wasn't thinking the application itself maintains and restores from
> backup copy of the file.. The application hits either a SIGBUS or EIO
> depending on how it accesses the data, and crashes or raises some alarm.
> The recovery is then done out-of-band, by a sysadmin or such (i.e.
> delete the file, replace with a known good copy, restart application).
>
> To summarize, the two cases we want to handle are:
> 1. Application has inbuilt recovery:
> ? - hits badblock
> ? - figures out it is able to recover the data
> ? - handles SIGBUS or EIO
> ? - does a (sector aligned) write() to restore the data
The "figures out" step here is where >95% of the work we'd have to
do is. And that's in filesystem and block layer code, not
userspace, and userspace can't do that work in a signal handler.
And it can still fall down to the second case when the application
doesn't have another copy of the data somewhere.
FWIW, we don't have a DAX enabled filesystem that can do
reverse block mapping, so we're a year or two away from this being a
workable production solution from the filesystem perspective. And
AFAICT, it's not even on the roadmap for dm/md layers.
> 2. Application doesn't have any inbuilt recovery mechanism
> ? - hits badblock
> ? - gets SIGBUS (or EIO) and crashes
> ? - Sysadmin restores file from backup
Which is no different to an existing non-DAX application getting an
EIO/sigbus from current storage technologies.
Except: in the existing storage stack, redundancy and correction has
already had to have failed for the application to see such an error.
Hence this is normally considered a DR case as there's had to be
cascading failures (e.g. multiple disk failures in a RAID) to get
to this stage, not a single error in a single sector in
non-redundant storage.
We need some form of redundancy and correction in the PMEM stack to
prevent single sector errors from taking down services until an
administrator can correct the problem. I'm trying to understand
where this is supposed to fit into the picture - at this point I
really don't think userspace applications are going to be able to do
this reliably....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner <[email protected]> wrote:
> On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
[..]
>> Maybe I missed something, but all these assumptions are already
>> present for typical block devices, i.e. sectors may go bad and a write
>> may make the sector usable again.
>
> The assumption we make about sectors going bad on SSDs or SRDs is
> that the device is about to die and needs replacing ASAP.
Similar assumptions here. Storage media is experiencing errors and
past a certain threshold it may be time to decommission the device.
You can see definitions for SMART / media health commands from various
vendors at these links, and yes, hopefully these are standardized /
unified at some point down the road:
http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx
> Then
> RAID takes care of the rebuild completely transparently. i.e.
> handling and correcting bad sectors is typically done completely
> transparently /below/ the filesytem like so:
Again, same for an NVDIMM. Use the pmem block-device as a RAID-member device.
>> This patch series is extending that
>> out to the DAX-mmap case, but it's the same principle of "write to
>> clear error" that we live with in the block-I/O path. What
>> clarification are you looking for beyond that point?
>
> I'm asking for an actual design document that explains how moving
> all the redundancy and bad sector correction stuff from the LBA
> layer up into application space is supposed to work when
> applications have no clue about LBA mappings, nor tend to keep
> redundant data around. i.e. you're proposing this:
These patches are not proposing *new* / general infrastructure for
moving redundancy and bad sector correction handling to userspace. If
an existing app is somehow dealing with raw (without RAID) device
errors on disk storage media today it should not need to change to
handle errors on an NVDIMM. My expectation is that very few if any
applications handle this today and just fail in the presence of media
errors.
> Application
> Application data redundancy/correction
> Filesystem
> Block
> [LBA mapping/redundancy/correction driver e.g. md/dm]
> driver
> hardware
>
> And somehow all the error information from the hardware layer needs
> to be propagated up to the application layer, along with all the
> mapping information from the filesystem and block layers for the
> application to make sense of the hardware reported errors.
>
> I see assumptions this this "just works" but we don't have any of
> the relevant APIs or infrastructure to enable the application to do
> the hardware error->file+offset namespace mapping (i.e. filesystem
> reverse mapping for for file offsets and directory paths, and
> reverse mapping for the the block layer remapping drivers).
If an application expects errors to be handled beneath the filesystem
then it should forgo DAX and arrange for the NVDIMM devices to be
RAIDed. Otherwise, if an application wants to use DAX then it might
need to be prepared to handle media errors itself same as the
un-RAIDed disk case. Yes, at an administrative level without
reverse-mapping support from a filesystem there's presently no way to
ask "which files on this fs are impacted by media errors", and we're
aware that reverse-mapping capabilities are nascent for current
DAX-aware filesystems. The forward lookup path, as impractical as it
is for large numbers of files, is available if an application wanted
to know if a specific file was impacted. We've discussed possibly
extending fiemap() to return bad blocks in a file rather than
consulting sysfs, or extending lseek() with something like SEEK_ERROR
to return offsets of bad areas in a file.
> I haven't seen any design/documentation for infrastructure at the
> application layer to handle redundant data and correctly
> transparently so I don't have any idea what the technical
> requirements this different IO stack places on filesystems may be.
> Hence I'm asking for some kind of architecture/design documentation
> that I can read to understand exactly what is being proposed here...
I think this is a discussion for a solution that would build on top of
this basic "here are the errors, re-write them with good data if you
can; otherwise, best of luck" foundation. Something like a DAX-aware
device mapper layer that duplicates data tagged with REQ_META so at
least we have a recovery path when a sector error lands in critical
filesystem-metadata. However, anything we come up with to make NVDIMM
errors more survivable should be directly applicable to traditional
disk storage as well. Along these lines we had a BoF session at Vault
where drive vendors we're wondering if the sysfs bad sectors list
could help software recover from the loss of a disk-head, or other
errors that only take down part of the drive. An I/O hint that flags
data that should be stored redundantly might be useful there as well.
By the way, your presence was sorely missed at LSF/MM!
On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner <[email protected]> wrote:
> > On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> [..]
> >> Maybe I missed something, but all these assumptions are already
> >> present for typical block devices, i.e. sectors may go bad and a write
> >> may make the sector usable again.
> >
> > The assumption we make about sectors going bad on SSDs or SRDs is
> > that the device is about to die and needs replacing ASAP.
>
> Similar assumptions here. Storage media is experiencing errors and
> past a certain threshold it may be time to decommission the device.
>
> You can see definitions for SMART / media health commands from various
> vendors at these links, and yes, hopefully these are standardized /
> unified at some point down the road:
>
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
> https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx
>
>
> > Then
> > RAID takes care of the rebuild completely transparently. i.e.
> > handling and correcting bad sectors is typically done completely
> > transparently /below/ the filesytem like so:
>
> Again, same for an NVDIMM. Use the pmem block-device as a RAID-member device.
Which means we're not using DAX and so the existing storage model
applies. I understand how this works.
What I'm asking about the redundancy/error correction model /when
using DAX/ and a userspace DAX load/store throws the MCE.
> > And somehow all the error information from the hardware layer needs
> > to be propagated up to the application layer, along with all the
> > mapping information from the filesystem and block layers for the
> > application to make sense of the hardware reported errors.
> >
> > I see assumptions this this "just works" but we don't have any of
> > the relevant APIs or infrastructure to enable the application to do
> > the hardware error->file+offset namespace mapping (i.e. filesystem
> > reverse mapping for for file offsets and directory paths, and
> > reverse mapping for the the block layer remapping drivers).
>
> If an application expects errors to be handled beneath the filesystem
> then it should forgo DAX and arrange for the NVDIMM devices to be
> RAIDed.
See above: I'm asking about the DAX-enabled error handling model,
not the traditional error handling model.
> Otherwise, if an application wants to use DAX then it might
> need to be prepared to handle media errors itself same as the
> un-RAIDed disk case. Yes, at an administrative level without
> reverse-mapping support from a filesystem there's presently no way to
> ask "which files on this fs are impacted by media errors", and we're
> aware that reverse-mapping capabilities are nascent for current
> DAX-aware filesystems.
Precisely my point - suggestions are being proposed which assume
use of infrastructure that *does not exist yet* and has not been
discussed or documented. If we're expecting such infrastructure to
be implemented in the filesystems and block device drivers, then we
need to determine that the error model actually works first...
> The forward lookup path, as impractical as it
> is for large numbers of files, is available if an application wanted
> to know if a specific file was impacted. We've discussed possibly
> extending fiemap() to return bad blocks in a file rather than
> consulting sysfs, or extending lseek() with something like SEEK_ERROR
> to return offsets of bad areas in a file.
Via what infrastructure will the filesystem use for finding out
whether a file has bad blocks in it? And if the file does have bad
blocks, what are you expecting the filesystem to do with that
information?
> > I haven't seen any design/documentation for infrastructure at the
> > application layer to handle redundant data and correctly
> > transparently so I don't have any idea what the technical
> > requirements this different IO stack places on filesystems may be.
> > Hence I'm asking for some kind of architecture/design documentation
> > that I can read to understand exactly what is being proposed here...
>
> I think this is a discussion for a solution that would build on top of
> this basic "here are the errors, re-write them with good data if you
> can; otherwise, best of luck" foundation. Something like a DAX-aware
> device mapper layer that duplicates data tagged with REQ_META so at
> least we have a recovery path when a sector error lands in critical
> filesystem-metadata.
Filesytsem metadata is not the topic of discussion here - it's
user data that throws an error on a DAX load/store that is the
issue.
> However, anything we come up with to make NVDIMM
> errors more survivable should be directly applicable to traditional
> disk storage as well.
I'm not sure it does. DAX implies that traditional block layer RAID
infrastructure is not possible, nor are data CRCs, nor are any other
sort of data transformations that are needed for redundancy at the
device layers. Anything that relies on copying/modifying/stable data to
provide redundancies needs to do such work at a place where it can
stall userspace page faults.
This is where pmem native filesystem designs like NOVA take over
from traditional block based filesystems - they are designed around
the ability to do atomic page-based operations for data protection
and recovery operations. It is this mechanism that allows stable
pages to be committed to permanent storage and as such, allow
redundancy operations such as mirroring to be performed before
operations are marked as "stable".
I'm missing the bigger picture that is being aimed at here - what's the
point of DAX if we have to turn it off if we want any sort of
failure protection? What's the big plan for fully enabling DAX with
robust error correction? Where is this all supposed to be leading
to?
> Along these lines we had a BoF session at Vault
> where drive vendors we're wondering if the sysfs bad sectors list
> could help software recover from the loss of a disk-head, or other
> errors that only take down part of the drive.
Right, but as I've said elsewhere, loss of a disk head implies
terabyte scale data loss. That is not something we can automatically
recovery from at the filesystem level. Low level raid recovery could
handle that sort of loss, but at the higher layers it's a disaster
similar to multiple disk RAID failure. It's a completely different
scale to a single sector/page loss we are talking about here, and so
I don't see there as being much (if any) overlap here.
> An I/O hint that flags
> data that should be stored redundantly might be useful there as well.
DAX doesn't have an IO path to hint with... :/
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner <[email protected]> wrote:
> On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
[..]
>> Otherwise, if an application wants to use DAX then it might
>> need to be prepared to handle media errors itself same as the
>> un-RAIDed disk case. Yes, at an administrative level without
>> reverse-mapping support from a filesystem there's presently no way to
>> ask "which files on this fs are impacted by media errors", and we're
>> aware that reverse-mapping capabilities are nascent for current
>> DAX-aware filesystems.
>
> Precisely my point - suggestions are being proposed which assume
> use of infrastructure that *does not exist yet* and has not been
> discussed or documented. If we're expecting such infrastructure to
> be implemented in the filesystems and block device drivers, then we
> need to determine that the error model actually works first...
These patches only assume the clear-error-on write-model, and that
*maybe* the sysfs bad blocks list is useful if the filesystem has a
reverse-map, or if the application can compare the list against the
results of fiemap(). Beyond that, this is the same perennial "we
should really have better error coordination between block device and
filesystems" discussions that we have at LSF.
>
>> The forward lookup path, as impractical as it
>> is for large numbers of files, is available if an application wanted
>> to know if a specific file was impacted. We've discussed possibly
>> extending fiemap() to return bad blocks in a file rather than
>> consulting sysfs, or extending lseek() with something like SEEK_ERROR
>> to return offsets of bad areas in a file.
>
> Via what infrastructure will the filesystem use for finding out
> whether a file has bad blocks in it? And if the file does have bad
> blocks, what are you expecting the filesystem to do with that
> information?
We currently have no expectation that the filesystem does anything
with the bad blocks list. However, if a filesystem had btrfs-like
capabilities to recover data from a redundant location we'd be looking
to plug into that infrastructure.
>> > I haven't seen any design/documentation for infrastructure at the
>> > application layer to handle redundant data and correctly
>> > transparently so I don't have any idea what the technical
>> > requirements this different IO stack places on filesystems may be.
>> > Hence I'm asking for some kind of architecture/design documentation
>> > that I can read to understand exactly what is being proposed here...
>>
>> I think this is a discussion for a solution that would build on top of
>> this basic "here are the errors, re-write them with good data if you
>> can; otherwise, best of luck" foundation. Something like a DAX-aware
>> device mapper layer that duplicates data tagged with REQ_META so at
>> least we have a recovery path when a sector error lands in critical
>> filesystem-metadata.
>
> Filesytsem metadata is not the topic of discussion here - it's
> user data that throws an error on a DAX load/store that is the
> issue.
Which is not a new problem since volatile DRAM in the non-DAX case can
throw the exact same error. The current recovery model there is crash
the kernel (without MCE recovery), or crash the application and hope
the kernel maps out the page or the application knows how to restart
after SIGBUS. Memory mirroring is meant to make this a bit less
harsh, but there's no mechanism to make this available outside the
kernel.
>> However, anything we come up with to make NVDIMM
>> errors more survivable should be directly applicable to traditional
>> disk storage as well.
>
> I'm not sure it does. DAX implies that traditional block layer RAID
> infrastructure is not possible, nor are data CRCs, nor are any other
> sort of data transformations that are needed for redundancy at the
> device layers. Anything that relies on copying/modifying/stable data to
> provide redundancies needs to do such work at a place where it can
> stall userspace page faults.
>
> This is where pmem native filesystem designs like NOVA take over
> from traditional block based filesystems - they are designed around
> the ability to do atomic page-based operations for data protection
> and recovery operations. It is this mechanism that allows stable
> pages to be committed to permanent storage and as such, allow
> redundancy operations such as mirroring to be performed before
> operations are marked as "stable".
>
> I'm missing the bigger picture that is being aimed at here - what's the
> point of DAX if we have to turn it off if we want any sort of
> failure protection? What's the big plan for fully enabling DAX with
> robust error correction? Where is this all supposed to be leading
> to?
>
NOVA and other solutions are free and encouraged to do a coherent
bottoms-up rethink of error handling on top of persistent memory
devices, in the meantime applications can only expect the legacy
SIGBUS and -EIO mechanisms are available. So I'm still trying to
connect how the "What would NOVA do?" discussion is anything but
orthogonal to hooking up SIGBUS and -EIO for traditional-filesystem
DAX. It's the only error model an application can expect because it's
the only one that currently exists.
>> Along these lines we had a BoF session at Vault
>> where drive vendors we're wondering if the sysfs bad sectors list
>> could help software recover from the loss of a disk-head, or other
>> errors that only take down part of the drive.
>
> Right, but as I've said elsewhere, loss of a disk head implies
> terabyte scale data loss. That is not something we can automatically
> recovery from at the filesystem level. Low level raid recovery could
> handle that sort of loss, but at the higher layers it's a disaster
> similar to multiple disk RAID failure. It's a completely different
> scale to a single sector/page loss we are talking about here, and so
> I don't see there as being much (if any) overlap here.
>
>> An I/O hint that flags
>> data that should be stored redundantly might be useful there as well.
>
> DAX doesn't have an IO path to hint with... :/
...I was thinking traditional filesystem metadata operations through
the block layer. NOVA could of course do something better since it
always indirects userspace access through a filesystem managed page.
On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner <[email protected]> wrote:
> > On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> >> > I haven't seen any design/documentation for infrastructure at the
> >> > application layer to handle redundant data and correctly
> >> > transparently so I don't have any idea what the technical
> >> > requirements this different IO stack places on filesystems may be.
> >> > Hence I'm asking for some kind of architecture/design documentation
> >> > that I can read to understand exactly what is being proposed here...
> >>
> >> I think this is a discussion for a solution that would build on top of
> >> this basic "here are the errors, re-write them with good data if you
> >> can; otherwise, best of luck" foundation. Something like a DAX-aware
> >> device mapper layer that duplicates data tagged with REQ_META so at
> >> least we have a recovery path when a sector error lands in critical
> >> filesystem-metadata.
> >
> > Filesytsem metadata is not the topic of discussion here - it's
> > user data that throws an error on a DAX load/store that is the
> > issue.
>
> Which is not a new problem since volatile DRAM in the non-DAX case can
> throw the exact same error.
They are not the same class of error, not by a long shot.
The "bad page in page cache" error on traditional storage means data
is not lost - the original copy still in whatever storage medium
that the cached page was filled from. i.e. Re-read the file and the
data is still there, which is no different to crashing and
restarting that machine and losing whatever writes had not been
committed to stable storage..
In the pmem case, a "bad page" is a permanent loss of data - it's
unrecoverable without some form data recovery operation being
performed on the storage.
> The current recovery model there is crash
> the kernel (without MCE recovery),
Ouch. Permanent data loss and a system wide DoS.
> or crash the application and hope
> the kernel maps out the page or the application knows how to restart
> after SIGBUS.
Not much better - neither provide a mechanism for recovery.
> Memory mirroring is meant to make this a bit less
> harsh, but there's no mechanism to make this available outside the
> kernel.
Which implies that we need a DM module that interfaces with the
hardware memory mirroring to perform recovery and remapping
operations. i.e. in the traditional storage stack location.
> >> However, anything we come up with to make NVDIMM
> >> errors more survivable should be directly applicable to traditional
> >> disk storage as well.
> >
> > I'm not sure it does. DAX implies that traditional block layer RAID
> > infrastructure is not possible, nor are data CRCs, nor are any other
> > sort of data transformations that are needed for redundancy at the
> > device layers. Anything that relies on copying/modifying/stable data to
> > provide redundancies needs to do such work at a place where it can
> > stall userspace page faults.
> >
> > This is where pmem native filesystem designs like NOVA take over
> > from traditional block based filesystems - they are designed around
> > the ability to do atomic page-based operations for data protection
> > and recovery operations. It is this mechanism that allows stable
> > pages to be committed to permanent storage and as such, allow
> > redundancy operations such as mirroring to be performed before
> > operations are marked as "stable".
> >
> > I'm missing the bigger picture that is being aimed at here - what's the
> > point of DAX if we have to turn it off if we want any sort of
> > failure protection? What's the big plan for fully enabling DAX with
> > robust error correction? Where is this all supposed to be leading
> > to?
> >
>
> NOVA and other solutions are free and encouraged to do a coherent
> bottoms-up rethink of error handling on top of persistent memory
> devices, in the meantime applications can only expect the legacy
> SIGBUS and -EIO mechanisms are available. So I'm still trying to
> connect how the "What would NOVA do?" discussion is anything but
> orthogonal to hooking up SIGBUS and -EIO for traditional-filesystem
> DAX. It's the only error model an application can expect because it's
> the only one that currently exists.
<sigh>
Yes, I get that. I'm not interested in the resultant fatal error
delivery - I'm asking about what happens between the memory error
and the delivery of the fatal "we've lost your data forever" error
that gets delivered to userspace. i.e. I'm after a description of
how error correction/recovery is supposed to be applied to DAX
*before we report SIGBUS or EIO* to the application.
What is the plan/model/vision for intercepting MCEs and recovering
from them? e.g. how do we going to pull the good copy from
hardware/software memory mirrors? What layer is supposed to be
responsible for that? Is it different for hardware mirroring
compared to a more traditional software dm-RAID1 solution? What
requirements does software recovery imply - do we need stable page
state for DAX (i.e. to prevent userspace modification while we
make copies)? Do we need to remap LBAs in the storage stack iduring
recovery when bad blocks are reported? If so, where does it get
done? What atomicity and resiliency requirements are there for
recovery? e.g. bad block is reported, system crashes - what needs to
happen on reboot to have recovery work correctly?
There's heaps of stuff that is completely undefined here - error
handling is fucking hard at the best of times, but I'm struggling to
understand even the basics of what is being proposed here apart from
"pmem error == crash the application, maybe even the system".
Future filesystems are only part of the solution here -
infrastructure like access to hardware mirrored copies for recovery
purposes will impact greatly on the design of upper layers and their
performance (e.g. no need for RAID1 in a software layer), so we
really need the model/architecture to be pretty clearly defined at
the outset before people waste too much time going down paths that
simply won't work on the hardware/infrastructure that is being
provided....
> >> An I/O hint that flags
> >> data that should be stored redundantly might be useful there as well.
> >
> > DAX doesn't have an IO path to hint with... :/
>
> ...I was thinking traditional filesystem metadata operations through
> the block layer. NOVA could of course do something better since it
> always indirects userspace access through a filesystem managed page.
It seems to me you are focussing on code/technologies that exist
today instead of trying to define an architecture that is more
optimal for pmem storage systems. Yes, working code is great, but if
you can't tell people how things like robust error handling and
redundancy are going to work in future then it's going to take
forever for everyone else to handle such errors robustly through the
storage stack...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Apr 25, 2016 at 11:32:08AM -0400, Jeff Moyer wrote:
> > EINVAL is a concern here. Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> >
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
>
> The only callers of dax_do_io are direct_IO methods.
They are because the DAX I/O pass is a mess, but that doesn't mean
the user specific O_DIRECT on the open nessecarily.
On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
> - Application hits EIO doing dax_IO or load/store io
>
> - It checks badblocks and discovers it's files have lost data
>
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> ?? ?? * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
This sounds like a mess.
> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity
It's not a block device if it supports DAX. It's byte addressable
memory masquerading as a block device.
On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>
> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses it's
> files. users, backup programs, data replicators, fileystem
> re-organisers (e.g. defragmenters) etc all may access the files and
> they may throw errors. What then?
In this scenario, backup applications etc that try to read that data
before it has been replaced will just hit the errors and fail..
>
<>
> > The data that was lost is gone -- this assumes the application has
> > some
> > ability to recover using a journal/log or other redundancy - yes,
> > at the
> > application layer. If it doesn't have this sort of capability, the
> > only
> > option is to restore files from a backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
>
> What about filesytsems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
>
> If so, shouldn't that be the focus of development rahter than
> placing the burden on userspace apps to handle storage repair
> situations?
Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the until-
then solution that gets us most of the way there.. I think we just want
to ensure that current-DAX has some way to deal with errors, and these
patches provide an admin-intervention recovery path and possibly
another if the app wants to try something fancy for recovery.
<>
>
> >
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> > - hits badblock
> > - figures out it is able to recover the data
> > - handles SIGBUS or EIO
> > - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.
Ah when I said "figures out" I was only thinking if the application has
some redundancy/jouranlling, and if it can recover using that -- not
additional recovery mechanisms at the block/fs layer.
>
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
>
> >
> > 2. Application doesn't have any inbuilt recovery mechanism
> > - hits badblock
> > - gets SIGBUS (or EIO) and crashes
> > - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g. multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably....
Agreed that the pmem stack could use more redundancy and error
correction, perhaps enabling md-raid to raid pmem devices and then
enable DAX on top of that and we'll have a better chance to handle
errors, but that level of recovery isn't what these patches are aiming
for -- that is obviously a longer term effort. These simply aim to
provide that disaster recovery path when a single sector failure does
take down the service.
Today, on a dax enabled filesystem, if/when the app hits an error and
crashes, dax is simply disabled till the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks). These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver which can tell the device to do some remapping
etc.
So, yes, this is very much a DR case in our current pmem+dax
architecture, and we should probably design more robust handling at the
block/md/fs layer, but with these, you at least get to crash the app,
delete its files and restore them from out-of-band backups and continue
with DAX.
>
> Cheers,
>
> Dave.
On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner <[email protected]> wrote:
> On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
[..]
> It seems to me you are focussing on code/technologies that exist
> today instead of trying to define an architecture that is more
> optimal for pmem storage systems. Yes, working code is great, but if
> you can't tell people how things like robust error handling and
> redundancy are going to work in future then it's going to take
> forever for everyone else to handle such errors robustly through the
> storage stack...
Precisely because higher order redundancy is built on top this baseline.
MD-RAID can't do it's error recovery if we don't have -EIO and
clear-error-on-write. On the other hand, you're absolutely right that
we have a gaping hole on top of the SIGBUS recovery model, and don't
have a kernel layer we can interpose on top of DAX to provide some
semblance of redundancy.
In the meantime, a handful of applications with a team of full-time
site-reliability-engineers may be able to plug in external redundancy
infrastructure on top of what is defined in these patches. For
everyone else, the hard problem, we need to do a lot more thinking
about a trap and recover solution.
On Tue, 2016-04-26 at 01:33 -0700, [email protected] wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +0000, Verma, Vishal L wrote:
> >
> > - Application hits EIO doing dax_IO or load/store io
> >
> > - It checks badblocks and discovers it's files have lost data
> >
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> > ?? ?? * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> This sounds like a mess.
>
> >
> > I think if we want to keep allowing arbitrary alignments for the
> > dax_do_io path, we'd need:
> > 1. To represent badblocks at a finer granularity (likely cache
> > lines)
> > 2. To allow the driver to do IO to a *block device* at sub-sector
> > granularity
> It's not a block device if it supports DAX. It's byte addressable
> memory masquerading as a block device.
Yes but we made that decision a while back with pmem :)
Are you saying it should stop being a block device anymore?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> block" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 26-04-16 07:59:10, Dan Williams wrote:
> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner <[email protected]> wrote:
> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> [..]
> > It seems to me you are focussing on code/technologies that exist
> > today instead of trying to define an architecture that is more
> > optimal for pmem storage systems. Yes, working code is great, but if
> > you can't tell people how things like robust error handling and
> > redundancy are going to work in future then it's going to take
> > forever for everyone else to handle such errors robustly through the
> > storage stack...
>
> Precisely because higher order redundancy is built on top this baseline.
>
> MD-RAID can't do it's error recovery if we don't have -EIO and
> clear-error-on-write. On the other hand, you're absolutely right that
> we have a gaping hole on top of the SIGBUS recovery model, and don't
> have a kernel layer we can interpose on top of DAX to provide some
> semblance of redundancy.
>
> In the meantime, a handful of applications with a team of full-time
> site-reliability-engineers may be able to plug in external redundancy
> infrastructure on top of what is defined in these patches. For
> everyone else, the hard problem, we need to do a lot more thinking
> about a trap and recover solution.
So we could actually implement some kind of redundancy with DAX with
reasonable effort. We already do track dirty storage PFNs in the radix
tree. After DAX locking patches get merged we also have a reliable way to
write-protect them when we decide to do 'writeback' (translates to flushing
CPU caches) for them. When we do that, we have all the infrastructure in
place to provide 'stable pages' while some mirroring or other redundancy
mechanism in kernel works with the data.
But as Dave said, we should do some writeup of how this is all supposed to
work and e.g. which layer is going to be responsible for the redundancy. Do
we want to have that in DAX code? Or just provide stable page guarantees
from DAX and do the redundancy from device mapper? This needs more
thought...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Apr 26, 2016 at 8:31 AM, Jan Kara <[email protected]> wrote:
> On Tue 26-04-16 07:59:10, Dan Williams wrote:
>> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner <[email protected]> wrote:
>> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
>> [..]
>> > It seems to me you are focussing on code/technologies that exist
>> > today instead of trying to define an architecture that is more
>> > optimal for pmem storage systems. Yes, working code is great, but if
>> > you can't tell people how things like robust error handling and
>> > redundancy are going to work in future then it's going to take
>> > forever for everyone else to handle such errors robustly through the
>> > storage stack...
>>
>> Precisely because higher order redundancy is built on top this baseline.
>>
>> MD-RAID can't do it's error recovery if we don't have -EIO and
>> clear-error-on-write. On the other hand, you're absolutely right that
>> we have a gaping hole on top of the SIGBUS recovery model, and don't
>> have a kernel layer we can interpose on top of DAX to provide some
>> semblance of redundancy.
>>
>> In the meantime, a handful of applications with a team of full-time
>> site-reliability-engineers may be able to plug in external redundancy
>> infrastructure on top of what is defined in these patches. For
>> everyone else, the hard problem, we need to do a lot more thinking
>> about a trap and recover solution.
>
> So we could actually implement some kind of redundancy with DAX with
> reasonable effort. We already do track dirty storage PFNs in the radix
> tree. After DAX locking patches get merged we also have a reliable way to
> write-protect them when we decide to do 'writeback' (translates to flushing
> CPU caches) for them. When we do that, we have all the infrastructure in
> place to provide 'stable pages' while some mirroring or other redundancy
> mechanism in kernel works with the data.
>
> But as Dave said, we should do some writeup of how this is all supposed to
> work and e.g. which layer is going to be responsible for the redundancy. Do
> we want to have that in DAX code? Or just provide stable page guarantees
> from DAX and do the redundancy from device mapper? This needs more
> thought...
>
[ adding Mike, since his ears are likely burning by this point ]
If we had the ability to specify a range or list of ranges to
blkdev_issue_flush() that would allow the driver level to implement
redundancy at sync time. And no, before someone flies off the handle,
this isn't rehashing the same argument I lost about where to track
dirty pfns. Rather this relies on the radix to track dirty pfns, but
asks the driver to do the flush operation. In the nominal case this
is a clflush / clwb loop or wbinvd in the pmem driver, in the
redundancy case the pmem driver is swapped out for a driver that uses
the flush request as a trigger point to synchronize redundant data.
We want this at the driver level to take advantage of standard
asynchronous completions, and make it administratively equivalent to
the dm/md layering people are used to using.