2016-03-24 23:17:25

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 0/5] dax: handling of media errors

Until now, dax has been disabled if media errors were found on
any device. This series attempts to address that.

The first three patches from Dan re-enable dax even when media
errors are present.

The fourth patch from Matthew removes the
zeroout path from dax entirely, making zeroout operations always
go through the driver (The motivation is that if a backing device
has media errors, and we create a sparse file on it, we don't
want the initial zeroing to happen via dax, we want to give the
block driver a chance to clear the errors).

The fifth patch changes the behaviour of dax_do_io by adding a
wrapper around it that is passed all the arguments also needed by
__blockdev_do_direct_IO. If (the new) __dax_do_io fails with -EIO
due to a bad block, we simply retry with the direct_IO path which
forces the IO to go through the block driver, and can attempt to
clear the error.

Dan Williams (3):
block, dax: pass blk_dax_ctl through to drivers
dax: fallback from pmd to pte on error
dax: enable dax in the presence of known media errors (badblocks)

Vishal Verma (2):
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: handle media errors in dax_do_io

arch/powerpc/sysdev/axonram.c | 10 +++----
block/ioctl.c | 9 ------
drivers/block/brd.c | 9 +++---
drivers/nvdimm/pmem.c | 17 ++++++++---
drivers/s390/block/dcssblk.c | 12 ++++----
fs/block_dev.c | 7 +++--
fs/dax.c | 70 +++++++++++++++++++++----------------------
fs/ext2/inode.c | 12 ++++----
fs/ext4/indirect.c | 11 ++++---
fs/ext4/inode.c | 5 ++--
fs/xfs/xfs_aops.c | 7 +++--
fs/xfs/xfs_bmap_util.c | 9 ------
include/linux/blkdev.h | 3 +-
include/linux/dax.h | 7 +++--
14 files changed, 93 insertions(+), 95 deletions(-)

--
2.5.5



2016-03-24 23:17:26

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 1/5] block, dax: pass blk_dax_ctl through to drivers

From: Dan Williams <[email protected]>

From: Dan Williams <[email protected]>

This is in preparation for doing badblocks checking against the
requested sector range in the driver. Currently we opportunistically
return as much data that can be "dax'd" starting at the given sector.
When errors are present we want to limit that range to the first
encountered error, or fail the dax request if the range encompasses an
error.

Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 10 +++++-----
drivers/block/brd.c | 9 +++++----
drivers/nvdimm/pmem.c | 9 +++++----
drivers/s390/block/dcssblk.c | 12 ++++++------
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 3 +--
6 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 0d112b9..d85673f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,17 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)

/**
* axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
+ * @dax: see block_device_operations method
*/
static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(device) + dax->sector;
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

- *kaddr = (void __pmem __force *) bank->io_addr + offset;
- *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+ dax->addr = (void __pmem __force *) bank->io_addr + offset;
+ dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
return bank->size - offset;
}

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index cb27190..0b8bcfa 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -380,9 +380,10 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}

#ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long brd_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;

@@ -391,8 +392,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
- *kaddr = (void __pmem *)page_address(page);
- *pfn = page_to_pfn_t(page);
+ dax->addr = (void __pmem *)page_address(page);
+ dax->pfn = page_to_pfn_t(page);

return PAGE_SIZE;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ca5721c..da10554 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -167,14 +167,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return rc;
}

-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long pmem_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;

- *kaddr = pmem->virt_addr + offset;
- *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+ dax->addr = pmem->virt_addr + offset;
+ dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

return pmem->size - pmem->pfn_pad - offset;
}
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ce7b701..e5bda68 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -30,8 +30,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn);
+static long dcssblk_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -883,9 +883,9 @@ fail:
}

static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn)
+dcssblk_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
{
+ sector_t secnum = get_start_sect(bdev) + dax->sector;
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;

@@ -894,8 +894,8 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
return -ENODEV;
dev_sz = dev_info->end - dev_info->start;
offset = secnum * 512;
- *kaddr = (void __pmem *) (dev_info->start + offset);
- *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+ dax->addr = (void __pmem *) (dev_info->start + offset);
+ dax->pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);

return dev_sz - offset;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164..9c0765b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -488,7 +488,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
+ avail = ops->direct_access(bdev, dax);
if (!avail)
return -ERANGE;
if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 413c84f..f8c65b8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1653,8 +1653,7 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- long (*direct_access)(struct block_device *, sector_t, void __pmem **,
- pfn_t *);
+ long (*direct_access)(struct block_device *, struct blk_dax_ctl *dax);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-24 23:17:27

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 2/5] dax: fallback from pmd to pte on error

From: Dan Williams <[email protected]>

From: Dan Williams <[email protected]>

In preparation for consulting a badblocks list in pmem_direct_access(),
teach dax_pmd_fault() to fallback rather than fail immediately upon
encountering an error. The thought being that reducing the span of the
dax request may avoid the error region.

Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bbb2ad7..bb7e9f8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -940,8 +940,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
long length = dax_map_atomic(bdev, &dax);

if (length < 0) {
- result = VM_FAULT_SIGBUS;
- goto out;
+ dax_pmd_dbg(&bh, address, "dax-error fallback");
+ goto fallback;
}
if (length < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "dax-length too small");
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-24 23:17:29

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

From: Matthew Wilcox <[email protected]>

dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().

Signed-off-by: Matthew Wilcox <[email protected]>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 32 --------------------------------
fs/ext2/inode.c | 7 +++----
fs/xfs/xfs_bmap_util.c | 9 ---------
include/linux/dax.h | 1 -
4 files changed, 3 insertions(+), 46 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bb7e9f8..a30481e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -78,38 +78,6 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
return page;
}

-/*
- * dax_clear_sectors() is called from within transaction context from XFS,
- * and hence this means the stack from this point must follow GFP_NOFS
- * semantics for all operations.
- */
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
-{
- struct blk_dax_ctl dax = {
- .sector = _sector,
- .size = _size,
- };
-
- might_sleep();
- do {
- long count, sz;
-
- count = dax_map_atomic(bdev, &dax);
- if (count < 0)
- return count;
- sz = min_t(long, count, SZ_128K);
- clear_pmem(dax.addr, sz);
- dax.size -= sz;
- dax.sector += sz / 512;
- dax_unmap_atomic(bdev, &dax);
- cond_resched();
- } while (dax.size);
-
- wmb_pmem();
- return 0;
-}
-EXPORT_SYMBOL_GPL(dax_clear_sectors);
-
/* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
loff_t pos, loff_t end)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 6bd58e6..824f249 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -26,6 +26,7 @@
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/dax.h>
+#include <linux/blkdev.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
@@ -737,10 +738,8 @@ static int ext2_get_blocks(struct inode *inode,
* so that it's not found by another thread before it's
* initialised
*/
- err = dax_clear_sectors(inode->i_sb->s_bdev,
- le32_to_cpu(chain[depth-1].key) <<
- (inode->i_blkbits - 9),
- 1 << inode->i_blkbits);
+ err = sb_issue_zeroout(inode->i_sb,
+ le32_to_cpu(chain[depth-1].key), 1, GFP_NOFS);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 6c87601..23a759a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -72,16 +72,7 @@ xfs_zero_extent(
struct xfs_mount *mp = ip->i_mount;
xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
sector_t block = XFS_BB_TO_FSBT(mp, sector);
- ssize_t size = XFS_FSB_TO_B(mp, count_fsb);

- if (IS_DAX(VFS_I(ip)))
- return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
- sector, size);
-
- /*
- * let the block layer decide on the fastest method of
- * implementing the zeroing.
- */
return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);

}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59..933198a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,7 +7,6 @@

ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-24 23:17:30

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 5/5] dax: handle media errors in dax_do_io

dax_do_io (called for read() or write() for a dax file system) may fail
in the presence of bad blocks or media errors. Since we expect that a
write should clear media errors on nvdimms, make dax_do_io fall back to
the direct_IO path, which will send down a bio to the driver, which can
then attempt to clear the error.

Cc: Matthew Wilcox <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
fs/block_dev.c | 5 +++--
fs/dax.c | 34 ++++++++++++++++++++++++++++++++--
fs/ext2/inode.c | 5 +++--
fs/ext4/indirect.c | 11 +++++++----
fs/ext4/inode.c | 5 +++--
fs/xfs/xfs_aops.c | 7 ++++---
include/linux/dax.h | 6 +++++-
7 files changed, 57 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9c0765b..f3873ab 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -168,8 +168,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
struct inode *inode = bdev_file_inode(file);

if (IS_DAX(inode))
- return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
- NULL, DIO_SKIP_DIO_COUNT);
+ return dax_do_io(iocb, inode, I_BDEV(inode), iter, offset,
+ blkdev_get_block, blkdev_get_block,
+ NULL, NULL, DIO_SKIP_DIO_COUNT);
return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
diff --git a/fs/dax.c b/fs/dax.c
index a30481e..b90c8e9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -208,7 +208,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
}

/**
- * dax_do_io - Perform I/O to a DAX file
+ * __dax_do_io - Perform I/O to a DAX file
* @iocb: The control block for this I/O
* @inode: The file which the I/O is directed at
* @iter: The addresses to do I/O from or to
@@ -224,7 +224,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
* As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
* is in progress.
*/
-ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+ssize_t __dax_do_io(struct kiocb *iocb, struct inode *inode,
struct iov_iter *iter, loff_t pos, get_block_t get_block,
dio_iodone_t end_io, int flags)
{
@@ -262,8 +262,38 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
out:
return retval;
}
+EXPORT_SYMBOL_GPL(__dax_do_io);
+
+/*
+ * This is a library function for use by file systems. It will perform a
+ * fallback to direct_io semantics if the dax_io fails due to a media error.
+ */
+ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+ struct block_device *bdev, struct iov_iter *iter, loff_t pos,
+ get_block_t dax_get_block, get_block_t dio_get_block,
+ dio_iodone_t end_io, dio_submit_t submit_io, int flags)
+{
+ ssize_t retval;
+
+ retval = __dax_do_io(iocb, inode, iter, pos, dax_get_block, end_io,
+ flags);
+ if (iov_iter_rw(iter) == WRITE && retval == -EIO) {
+ /*
+ * __dax_do_io may have failed a write due to a bad block.
+ * Retry with direct_io, and if the direct_IO also fails,
+ * return -EIO as that was the original error that led us
+ * down the direct_IO path.
+ */
+ retval = __blockdev_direct_IO(iocb, inode, bdev, iter, pos,
+ dio_get_block, end_io, submit_io, flags);
+ if (retval < 0)
+ return -EIO;
+ }
+ return retval;
+}
EXPORT_SYMBOL_GPL(dax_do_io);

+
/*
* The user has performed a load from a hole in the file. Allocating
* a new page in the file would cause excessive storage usage for
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 824f249..8a307cf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -862,8 +862,9 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
ssize_t ret;

if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
- DIO_LOCKING);
+ ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+ offset, ext2_get_block, ext2_get_block,
+ NULL, NULL, DIO_LOCKING | DIO_SKIP_HOLES);
else
ret = blockdev_direct_IO(iocb, inode, iter, offset,
ext2_get_block);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 355ef9c..4b087b7 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -692,8 +692,9 @@ retry:
goto locked;
}
if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset,
- ext4_get_block, NULL, 0);
+ ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+ offset, ext4_get_block, ext4_get_block,
+ NULL, NULL, 0);
else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter,
@@ -703,8 +704,10 @@ retry:
} else {
locked:
if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset,
- ext4_get_block, NULL, DIO_LOCKING);
+ ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter,
+ offset, ext4_get_block, ext4_get_block,
+ NULL, NULL, DIO_LOCKING |
+ DIO_SKIP_HOLES);
else
ret = blockdev_direct_IO(iocb, inode, iter, offset,
ext4_get_block);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aee960b..4220dac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3315,8 +3315,9 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
#endif
if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
- ext4_end_io_dio, dio_flags);
+ ret = dax_do_io(iocb, inode, inode->i_sb->s_bdev, iter, offset,
+ get_block_func, get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);
else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter, offset,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a9ebabfe..dc4e088 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1682,11 +1682,12 @@ xfs_vm_do_dio(
void *private),
int flags)
{
- struct block_device *bdev;
+ struct block_device *bdev = xfs_find_bdev_for_inode(inode);

if (IS_DAX(inode))
- return dax_do_io(iocb, inode, iter, offset,
- xfs_get_blocks_direct, endio, 0);
+ return dax_do_io(iocb, inode, bdev, iter, offset,
+ xfs_get_blocks_direct, xfs_get_blocks_direct,
+ endio, NULL, flags);

bdev = xfs_find_bdev_for_inode(inode);
return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 933198a..6981076 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,8 +5,12 @@
#include <linux/mm.h>
#include <asm/pgtable.h>

-ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
+ssize_t __dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
+ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
+ struct block_device *bdev, struct iov_iter *iter, loff_t pos,
+ get_block_t dax_get_block, get_block_t dio_get_block,
+ dio_iodone_t end_io, dio_submit_t submit_io, int flags);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-24 23:17:28

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 3/5] dax: enable dax in the presence of known media errors (badblocks)

From: Dan Williams <[email protected]>

From: Dan Williams <[email protected]>

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.

[vishal: fix a conflict with system RAM collision patches]
Signed-off-by: Dan Williams <[email protected]>
---
block/ioctl.c | 9 ---------
drivers/nvdimm/pmem.c | 8 ++++++++
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index d8996bb..cd7f392 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -423,15 +423,6 @@ bool blkdev_dax_capable(struct block_device *bdev)
|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
return false;

- /*
- * If the device has known bad blocks, force all I/O through the
- * driver / page cache.
- *
- * TODO: support finer grained dax error handling
- */
- if (disk->bb && disk->bb->count)
- return false;
-
return true;
}
#endif
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index da10554..eac5f93 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -174,9 +174,17 @@ static long pmem_direct_access(struct block_device *bdev,
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;

+ if (unlikely(is_bad_pmem(&pmem->bb, sector, dax->size)))
+ return -EIO;
dax->addr = pmem->virt_addr + offset;
dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

+ /*
+ * If badblocks are present, limit known good range to the
+ * requested range.
+ */
+ if (unlikely(pmem->bb.count))
+ return dax->size;
return pmem->size - pmem->pfn_pad - offset;
}

--
2.5.5

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2016-03-24 23:23:26

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 3/5] dax: enable dax in the presence of known media errors (badblocks)

On Thu, 2016-03-24 at 17:17 -0600, Vishal Verma wrote:
> From: Dan Williams <[email protected]>
>
> From: Dan Williams <[email protected]>

Eep, not sure how this happened, looked alright in the patches!

2016-03-25 10:44:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Thu, Mar 24, 2016 at 05:17:29PM -0600, Vishal Verma wrote:
> @@ -72,16 +72,7 @@ xfs_zero_extent(
> struct xfs_mount *mp = ip->i_mount;
> xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
> sector_t block = XFS_BB_TO_FSBT(mp, sector);
> - ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
>
> - if (IS_DAX(VFS_I(ip)))
> - return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
> - sector, size);
> -
> - /*
> - * let the block layer decide on the fastest method of
> - * implementing the zeroing.
> - */
> return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);

While not new: using sb_issue_zeroout in XFS is wrong as it doesn't
account for the RT device. We need the xfs_find_bdev_for_inode and
call blkdev_issue_zeroout directly with the bdev it returned.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-25 10:45:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/5] dax: handle media errors in dax_do_io

On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
> dax_do_io (called for read() or write() for a dax file system) may fail
> in the presence of bad blocks or media errors. Since we expect that a
> write should clear media errors on nvdimms, make dax_do_io fall back to
> the direct_IO path, which will send down a bio to the driver, which can
> then attempt to clear the error.

Leave the fallback on -EIO to the callers please. They generally call
__blockdev_direct_IO anyway, so it should actually become simpler that
way.

2016-03-25 18:47:26

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Thu, Mar 24, 2016 at 4:17 PM, Vishal Verma <[email protected]> wrote:
> From: Matthew Wilcox <[email protected]>
>
> dax_clear_sectors() cannot handle poisoned blocks. These must be
> zeroed using the BIO interface instead. Convert ext2 and XFS to use
> only sb_issue_zerout().
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> [vishal: Also remove the dax_clear_sectors function entirely]
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> fs/dax.c | 32 --------------------------------
> fs/ext2/inode.c | 7 +++----
> fs/xfs/xfs_bmap_util.c | 9 ---------
> include/linux/dax.h | 1 -
> 4 files changed, 3 insertions(+), 46 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index bb7e9f8..a30481e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -78,38 +78,6 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
> return page;
> }
>
> -/*
> - * dax_clear_sectors() is called from within transaction context from XFS,
> - * and hence this means the stack from this point must follow GFP_NOFS
> - * semantics for all operations.
> - */
> -int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
> -{
> - struct blk_dax_ctl dax = {
> - .sector = _sector,
> - .size = _size,
> - };
> -
> - might_sleep();
> - do {
> - long count, sz;
> -
> - count = dax_map_atomic(bdev, &dax);
> - if (count < 0)
> - return count;
> - sz = min_t(long, count, SZ_128K);
> - clear_pmem(dax.addr, sz);
> - dax.size -= sz;
> - dax.sector += sz / 512;
> - dax_unmap_atomic(bdev, &dax);
> - cond_resched();
> - } while (dax.size);
> -
> - wmb_pmem();
> - return 0;
> -}
> -EXPORT_SYMBOL_GPL(dax_clear_sectors);

What about the other unwritten extent conversions in the dax path?
Shouldn't those be converted to block-layer zero-outs as well?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-25 20:59:30

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 5/5] dax: handle media errors in dax_do_io

On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote:
> On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
> >
> > dax_do_io (called for read() or write() for a dax file system) may
> > fail
> > in the presence of bad blocks or media errors. Since we expect that
> > a
> > write should clear media errors on nvdimms, make dax_do_io fall
> > back to
> > the direct_IO path, which will send down a bio to the driver, which
> > can
> > then attempt to clear the error.
> Leave the fallback on -EIO to the callers please.  They generally
> call
> __blockdev_direct_IO anyway, so it should actually become simpler
> that
> way.

I thought of this, but made the retrying happen in the wrapper so that
it can be centralized. If the callers were to become responsible for
the retry, then any new callers of dax_do_io might not realize they are
responsible for retrying, and hit problems. Another tricky point might
be: in the wrapper, if __dax_do_io failed with -EIO, and subsequently
__blockdev_direct_IO also fails with a *different* error, I chose to
return -EIO because that was the 'first' error we hit and caused us to
fallback.. (Does this even seem reasonable?) And if so, do we want to
push back this decision too, to the callers?

2016-03-25 21:01:17

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Fri, 2016-03-25 at 03:44 -0700, Christoph Hellwig wrote:
> On Thu, Mar 24, 2016 at 05:17:29PM -0600, Vishal Verma wrote:
> >
> > @@ -72,16 +72,7 @@ xfs_zero_extent(
> >   struct xfs_mount *mp = ip->i_mount;
> >   xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
> >   sector_t block = XFS_BB_TO_FSBT(mp, sector);
> > - ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
> >  
> > - if (IS_DAX(VFS_I(ip)))
> > - return
> > dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
> > - sector, size);
> > -
> > - /*
> > -  * let the block layer decide on the fastest method of
> > -  * implementing the zeroing.
> > -  */
> >   return sb_issue_zeroout(mp->m_super, block, count_fsb,
> > GFP_NOFS);
> While not new: using sb_issue_zeroout in XFS is wrong as it doesn't
> account for the RT device.  We need the xfs_find_bdev_for_inode and
> call blkdev_issue_zeroout directly with the bdev it returned.

Ok, I'll fix and send a v2. Thanks!
>

2016-03-25 21:03:19

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Fri, 2016-03-25 at 11:47 -0700, Dan Williams wrote:
> On Thu, Mar 24, 2016 at 4:17 PM, Vishal Verma <[email protected]
> om> wrote:
> >
> > From: Matthew Wilcox <[email protected]>
> >
> > dax_clear_sectors() cannot handle poisoned blocks.  These must be
> > zeroed using the BIO interface instead.  Convert ext2 and XFS to
> > use
> > only sb_issue_zerout().
> >
> > Signed-off-by: Matthew Wilcox <[email protected]>
> > [vishal: Also remove the dax_clear_sectors function entirely]
> > Signed-off-by: Vishal Verma <[email protected]>
> > ---
> >  fs/dax.c               | 32 --------------------------------
> >  fs/ext2/inode.c        |  7 +++----
> >  fs/xfs/xfs_bmap_util.c |  9 ---------
> >  include/linux/dax.h    |  1 -
> >  4 files changed, 3 insertions(+), 46 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index bb7e9f8..a30481e 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -78,38 +78,6 @@ struct page *read_dax_sector(struct block_device
> > *bdev, sector_t n)
> >         return page;
> >  }
> >
> > -/*
> > - * dax_clear_sectors() is called from within transaction context
> > from XFS,
> > - * and hence this means the stack from this point must follow
> > GFP_NOFS
> > - * semantics for all operations.
> > - */
> > -int dax_clear_sectors(struct block_device *bdev, sector_t _sector,
> > long _size)
> > -{
> > -       struct blk_dax_ctl dax = {
> > -               .sector = _sector,
> > -               .size = _size,
> > -       };
> > -
> > -       might_sleep();
> > -       do {
> > -               long count, sz;
> > -
> > -               count = dax_map_atomic(bdev, &dax);
> > -               if (count < 0)
> > -                       return count;
> > -               sz = min_t(long, count, SZ_128K);
> > -               clear_pmem(dax.addr, sz);
> > -               dax.size -= sz;
> > -               dax.sector += sz / 512;
> > -               dax_unmap_atomic(bdev, &dax);
> > -               cond_resched();
> > -       } while (dax.size);
> > -
> > -       wmb_pmem();
> > -       return 0;
> > -}
> > -EXPORT_SYMBOL_GPL(dax_clear_sectors);
> What about the other unwritten extent conversions in the dax path?
> Shouldn't those be converted to block-layer zero-outs as well?

Could you point me to where these might be? I thought once we've
converted all the zeroout type callers (by removing dax_clear_sectors),
and fixed up dax_do_io to try a driver fallback, we've handled all the
media error cases in dax..

2016-03-25 21:20:39

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Fri, Mar 25, 2016 at 2:03 PM, Verma, Vishal L
<[email protected]> wrote:
> On Fri, 2016-03-25 at 11:47 -0700, Dan Williams wrote:
>> On Thu, Mar 24, 2016 at 4:17 PM, Vishal Verma <[email protected]
>> om> wrote:
>> >
>> > From: Matthew Wilcox <[email protected]>
>> >
>> > dax_clear_sectors() cannot handle poisoned blocks. These must be
>> > zeroed using the BIO interface instead. Convert ext2 and XFS to
>> > use
>> > only sb_issue_zerout().
>> >
>> > Signed-off-by: Matthew Wilcox <[email protected]>
>> > [vishal: Also remove the dax_clear_sectors function entirely]
>> > Signed-off-by: Vishal Verma <[email protected]>
>> > ---
>> > fs/dax.c | 32 --------------------------------
>> > fs/ext2/inode.c | 7 +++----
>> > fs/xfs/xfs_bmap_util.c | 9 ---------
>> > include/linux/dax.h | 1 -
>> > 4 files changed, 3 insertions(+), 46 deletions(-)
>> >
>> > diff --git a/fs/dax.c b/fs/dax.c
>> > index bb7e9f8..a30481e 100644
>> > --- a/fs/dax.c
>> > +++ b/fs/dax.c
>> > @@ -78,38 +78,6 @@ struct page *read_dax_sector(struct block_device
>> > *bdev, sector_t n)
>> > return page;
>> > }
>> >
>> > -/*
>> > - * dax_clear_sectors() is called from within transaction context
>> > from XFS,
>> > - * and hence this means the stack from this point must follow
>> > GFP_NOFS
>> > - * semantics for all operations.
>> > - */
>> > -int dax_clear_sectors(struct block_device *bdev, sector_t _sector,
>> > long _size)
>> > -{
>> > - struct blk_dax_ctl dax = {
>> > - .sector = _sector,
>> > - .size = _size,
>> > - };
>> > -
>> > - might_sleep();
>> > - do {
>> > - long count, sz;
>> > -
>> > - count = dax_map_atomic(bdev, &dax);
>> > - if (count < 0)
>> > - return count;
>> > - sz = min_t(long, count, SZ_128K);
>> > - clear_pmem(dax.addr, sz);
>> > - dax.size -= sz;
>> > - dax.sector += sz / 512;
>> > - dax_unmap_atomic(bdev, &dax);
>> > - cond_resched();
>> > - } while (dax.size);
>> > -
>> > - wmb_pmem();
>> > - return 0;
>> > -}
>> > -EXPORT_SYMBOL_GPL(dax_clear_sectors);
>> What about the other unwritten extent conversions in the dax path?
>> Shouldn't those be converted to block-layer zero-outs as well?
>
> Could you point me to where these might be? I thought once we've
> converted all the zeroout type callers (by removing dax_clear_sectors),
> and fixed up dax_do_io to try a driver fallback, we've handled all the
> media error cases in dax..

grep for usages of clear_pmem()... which I was hoping to eliminate
after this change to push zeroing down to the driver.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-25 21:42:37

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 5/5] dax: handle media errors in dax_do_io

On Fri, Mar 25, 2016 at 1:59 PM, Verma, Vishal L
<[email protected]> wrote:
> On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote:
>> On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
>> >
>> > dax_do_io (called for read() or write() for a dax file system) may
>> > fail
>> > in the presence of bad blocks or media errors. Since we expect that
>> > a
>> > write should clear media errors on nvdimms, make dax_do_io fall
>> > back to
>> > the direct_IO path, which will send down a bio to the driver, which
>> > can
>> > then attempt to clear the error.
>> Leave the fallback on -EIO to the callers please. They generally
>> call
>> __blockdev_direct_IO anyway, so it should actually become simpler
>> that
>> way.
>
> I thought of this, but made the retrying happen in the wrapper so that
> it can be centralized. If the callers were to become responsible for
> the retry, then any new callers of dax_do_io might not realize they are
> responsible for retrying, and hit problems.

That's their prerogative otherwise you are precluding an alternate
handling of a dax_do_io() failure. Maybe a fs or upper layer can
recover in a different manner than re-submit the I/O to the
__blockdev_direct_IO path.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-25 22:36:50

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 5/5] dax: handle media errors in dax_do_io

On Fri, 2016-03-25 at 14:42 -0700, Dan Williams wrote:
> On Fri, Mar 25, 2016 at 1:59 PM, Verma, Vishal L
> <[email protected]> wrote:
> >
> > On Fri, 2016-03-25 at 03:45 -0700, Christoph Hellwig wrote:
> > >
> > > On Thu, Mar 24, 2016 at 05:17:30PM -0600, Vishal Verma wrote:
> > > >
> > > >
> > > > dax_do_io (called for read() or write() for a dax file system)
> > > > may
> > > > fail
> > > > in the presence of bad blocks or media errors. Since we expect
> > > > that
> > > > a
> > > > write should clear media errors on nvdimms, make dax_do_io fall
> > > > back to
> > > > the direct_IO path, which will send down a bio to the driver,
> > > > which
> > > > can
> > > > then attempt to clear the error.
> > > Leave the fallback on -EIO to the callers please.  They generally
> > > call
> > > __blockdev_direct_IO anyway, so it should actually become simpler
> > > that
> > > way.
> > I thought of this, but made the retrying happen in the wrapper so
> > that
> > it can be centralized. If the callers were to become responsible
> > for
> > the retry, then any new callers of dax_do_io might not realize they
> > are
> > responsible for retrying, and hit problems.
> That's their prerogative otherwise you are precluding an alternate
> handling of a dax_do_io() failure.  Maybe a fs or upper layer can
> recover in a different manner than re-submit the I/O to the
> __blockdev_direct_IO path.

I'm happy to make the change, but we don't preclude that -- __dax_do_io
is still exported and available..

2016-03-26 16:53:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/5] dax: handle media errors in dax_do_io

On Fri, Mar 25, 2016 at 02:42:37PM -0700, Dan Williams wrote:
> That's their prerogative otherwise you are precluding an alternate
> handling of a dax_do_io() failure. Maybe a fs or upper layer can
> recover in a different manner than re-submit the I/O to the
> __blockdev_direct_IO path.

Let's keep the interface separate because they are, well separate.
There is a reason direct I/O falls back to buffered I/O by returning
and error if it can't handle it instead of handling all the magic.

I also really want to get rid of get_block as soon as possible for
DAX and direct I/O. For DAX that should actually be possible
really quickly, while direct I/O might take some time and will
be have to be gradual. So tighter integration of the two interface is
not just bad design, but actively harmful at this point in time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-28 20:01:29

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Fri, 2016-03-25 at 14:20 -0700, Dan Williams wrote:
> On Fri, Mar 25, 2016 at 2:03 PM, Verma, Vishal L
> <[email protected]> wrote:
> >
> > On Fri, 2016-03-25 at 11:47 -0700, Dan Williams wrote:
> > >
> > > On Thu, Mar 24, 2016 at 4:17 PM, Vishal Verma <vishal.l.verma@int
> > > el.c
> > > om> wrote:
> > > >
> > > >
> > > > From: Matthew Wilcox <[email protected]>
> > > >
> > > > dax_clear_sectors() cannot handle poisoned blocks.  These must
> > > > be
> > > > zeroed using the BIO interface instead.  Convert ext2 and XFS
> > > > to
> > > > use
> > > > only sb_issue_zerout().
> > > >
> > > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > > [vishal: Also remove the dax_clear_sectors function entirely]
> > > > Signed-off-by: Vishal Verma <[email protected]>
> > > > ---
> > > >  fs/dax.c               | 32 --------------------------------
> > > >  fs/ext2/inode.c        |  7 +++----
> > > >  fs/xfs/xfs_bmap_util.c |  9 ---------
> > > >  include/linux/dax.h    |  1 -
> > > >  4 files changed, 3 insertions(+), 46 deletions(-)
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index bb7e9f8..a30481e 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -78,38 +78,6 @@ struct page *read_dax_sector(struct
> > > > block_device
> > > > *bdev, sector_t n)
> > > >         return page;
> > > >  }
> > > >
> > > > -/*
> > > > - * dax_clear_sectors() is called from within transaction
> > > > context
> > > > from XFS,
> > > > - * and hence this means the stack from this point must follow
> > > > GFP_NOFS
> > > > - * semantics for all operations.
> > > > - */
> > > > -int dax_clear_sectors(struct block_device *bdev, sector_t
> > > > _sector,
> > > > long _size)
> > > > -{
> > > > -       struct blk_dax_ctl dax = {
> > > > -               .sector = _sector,
> > > > -               .size = _size,
> > > > -       };
> > > > -
> > > > -       might_sleep();
> > > > -       do {
> > > > -               long count, sz;
> > > > -
> > > > -               count = dax_map_atomic(bdev, &dax);
> > > > -               if (count < 0)
> > > > -                       return count;
> > > > -               sz = min_t(long, count, SZ_128K);
> > > > -               clear_pmem(dax.addr, sz);
> > > > -               dax.size -= sz;
> > > > -               dax.sector += sz / 512;
> > > > -               dax_unmap_atomic(bdev, &dax);
> > > > -               cond_resched();
> > > > -       } while (dax.size);
> > > > -
> > > > -       wmb_pmem();
> > > > -       return 0;
> > > > -}
> > > > -EXPORT_SYMBOL_GPL(dax_clear_sectors);
> > > What about the other unwritten extent conversions in the dax
> > > path?
> > > Shouldn't those be converted to block-layer zero-outs as well?
> > Could you point me to where these might be? I thought once we've
> > converted all the zeroout type callers (by removing
> > dax_clear_sectors),
> > and fixed up dax_do_io to try a driver fallback, we've handled all
> > the
> > media error cases in dax..
> grep for usages of clear_pmem()... which I was hoping to eliminate
> after this change to push zeroing down to the driver.

Ok, so I looked at these, and it looks like the majority of callers of
clear_pmem are from the fault path (either pmd or regular), and in
those cases we should be 'protected', as we would have failed at a
prior step (dax_map_atomic).

The two cases that may not be well handled are the calls to
dax_zero_page_range and dax_truncate_page which are called from file
systems. I think we may need to do a fallback to the driver for those
cases just like we do for dax_direct_io.. Thoughts?

2016-03-28 23:34:16

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Mon, Mar 28, 2016 at 1:01 PM, Verma, Vishal L
<[email protected]> wrote:
> On Fri, 2016-03-25 at 14:20 -0700, Dan Williams wrote:
>> On Fri, Mar 25, 2016 at 2:03 PM, Verma, Vishal L
>> <[email protected]> wrote:
>> >
>> > On Fri, 2016-03-25 at 11:47 -0700, Dan Williams wrote:
>> > >
>> > > On Thu, Mar 24, 2016 at 4:17 PM, Vishal Verma <vishal.l.verma@int
>> > > el.c
>> > > om> wrote:
>> > > >
>> > > >
>> > > > From: Matthew Wilcox <[email protected]>
>> > > >
>> > > > dax_clear_sectors() cannot handle poisoned blocks. These must
>> > > > be
>> > > > zeroed using the BIO interface instead. Convert ext2 and XFS
>> > > > to
>> > > > use
>> > > > only sb_issue_zerout().
>> > > >
>> > > > Signed-off-by: Matthew Wilcox <[email protected]>
>> > > > [vishal: Also remove the dax_clear_sectors function entirely]
>> > > > Signed-off-by: Vishal Verma <[email protected]>
>> > > > ---
>> > > > fs/dax.c | 32 --------------------------------
>> > > > fs/ext2/inode.c | 7 +++----
>> > > > fs/xfs/xfs_bmap_util.c | 9 ---------
>> > > > include/linux/dax.h | 1 -
>> > > > 4 files changed, 3 insertions(+), 46 deletions(-)
>> > > >
>> > > > diff --git a/fs/dax.c b/fs/dax.c
>> > > > index bb7e9f8..a30481e 100644
>> > > > --- a/fs/dax.c
>> > > > +++ b/fs/dax.c
>> > > > @@ -78,38 +78,6 @@ struct page *read_dax_sector(struct
>> > > > block_device
>> > > > *bdev, sector_t n)
>> > > > return page;
>> > > > }
>> > > >
>> > > > -/*
>> > > > - * dax_clear_sectors() is called from within transaction
>> > > > context
>> > > > from XFS,
>> > > > - * and hence this means the stack from this point must follow
>> > > > GFP_NOFS
>> > > > - * semantics for all operations.
>> > > > - */
>> > > > -int dax_clear_sectors(struct block_device *bdev, sector_t
>> > > > _sector,
>> > > > long _size)
>> > > > -{
>> > > > - struct blk_dax_ctl dax = {
>> > > > - .sector = _sector,
>> > > > - .size = _size,
>> > > > - };
>> > > > -
>> > > > - might_sleep();
>> > > > - do {
>> > > > - long count, sz;
>> > > > -
>> > > > - count = dax_map_atomic(bdev, &dax);
>> > > > - if (count < 0)
>> > > > - return count;
>> > > > - sz = min_t(long, count, SZ_128K);
>> > > > - clear_pmem(dax.addr, sz);
>> > > > - dax.size -= sz;
>> > > > - dax.sector += sz / 512;
>> > > > - dax_unmap_atomic(bdev, &dax);
>> > > > - cond_resched();
>> > > > - } while (dax.size);
>> > > > -
>> > > > - wmb_pmem();
>> > > > - return 0;
>> > > > -}
>> > > > -EXPORT_SYMBOL_GPL(dax_clear_sectors);
>> > > What about the other unwritten extent conversions in the dax
>> > > path?
>> > > Shouldn't those be converted to block-layer zero-outs as well?
>> > Could you point me to where these might be? I thought once we've
>> > converted all the zeroout type callers (by removing
>> > dax_clear_sectors),
>> > and fixed up dax_do_io to try a driver fallback, we've handled all
>> > the
>> > media error cases in dax..
>> grep for usages of clear_pmem()... which I was hoping to eliminate
>> after this change to push zeroing down to the driver.
>
> Ok, so I looked at these, and it looks like the majority of callers of
> clear_pmem are from the fault path (either pmd or regular), and in
> those cases we should be 'protected', as we would have failed at a
> prior step (dax_map_atomic).

Seems kind of sad to fail the fault due to a bad block when we were
going to zero it anyway, right? I'm not seeing a compelling reason to
keep any zeroing in fs/dax.c.

2016-03-29 18:57:16

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Mon, 2016-03-28 at 16:34 -0700, Dan Williams wrote:

<>

> Seems kind of sad to fail the fault due to a bad block when we were
> going to zero it anyway, right?  I'm not seeing a compelling reason to
> keep any zeroing in fs/dax.c.

Agreed - but how do we do this? clear_pmem needs to be able to clear an
arbitrary number of bytes, but to go through the driver, we'd need to
send down a bio? If only the driver had an rw_bytes like interface that
could be used by anyone... :)

2016-03-29 19:37:14

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Tue, Mar 29, 2016 at 11:57 AM, Verma, Vishal L
<[email protected]> wrote:
> On Mon, 2016-03-28 at 16:34 -0700, Dan Williams wrote:
>
> <>
>
>> Seems kind of sad to fail the fault due to a bad block when we were
>> going to zero it anyway, right? I'm not seeing a compelling reason to
>> keep any zeroing in fs/dax.c.
>
> Agreed - but how do we do this? clear_pmem needs to be able to clear an
> arbitrary number of bytes, but to go through the driver, we'd need to
> send down a bio? If only the driver had an rw_bytes like interface that
> could be used by anyone... :)

I think we're ok because clear_pmem() will always happen on PAGE_SIZE,
or HPAGE_SIZE boundaries.

2016-03-30 07:49:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Tue 29-03-16 18:57:16, Verma, Vishal L wrote:
> On Mon, 2016-03-28 at 16:34 -0700, Dan Williams wrote:
>
> <>
>
> > Seems kind of sad to fail the fault due to a bad block when we were
> > going to zero it anyway, right???I'm not seeing a compelling reason to
> > keep any zeroing in fs/dax.c.
>
> Agreed - but how do we do this? clear_pmem needs to be able to clear an
> arbitrary number of bytes, but to go through the driver, we'd need to
> send down a bio? If only the driver had an rw_bytes like interface that
> could be used by anyone... :)

Actually, my patches for page fault locking remove zeroing from
dax_insert_mapping() and __dax_pmd_fault() - the zeroing now happens from
the filesystem only and the zeroing in those two functions is just a dead
code...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-04-01 19:17:52

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Wed, 2016-03-30 at 09:49 +0200, Jan Kara wrote:
> On Tue 29-03-16 18:57:16, Verma, Vishal L wrote:
> >
> > On Mon, 2016-03-28 at 16:34 -0700, Dan Williams wrote:
> >
> > <>
> >
> > >
> > > Seems kind of sad to fail the fault due to a bad block when we
> > > were
> > > going to zero it anyway, right?  I'm not seeing a compelling
> > > reason to
> > > keep any zeroing in fs/dax.c.
> > Agreed - but how do we do this? clear_pmem needs to be able to clear
> > an
> > arbitrary number of bytes, but to go through the driver, we'd need
> > to
> > send down a bio? If only the driver had an rw_bytes like interface
> > that
> > could be used by anyone... :)
> Actually, my patches for page fault locking remove zeroing from
> dax_insert_mapping() and __dax_pmd_fault() - the zeroing now happens
> from
> the filesystem only and the zeroing in those two functions is just a
> dead
> code...

That should make things easier! Do you have a tree I could merge in to
get this? (WIP is ok as we know that my series will depend on yours..)
or, if you can distill out that patch on a 4.6-rc1 base, I could carry
it in my series too (your v2's 3/10 doesn't apply on 4.6-rc1..)

Thanks,
-Vishal

2016-04-04 12:09:49

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors

On Fri 01-04-16 19:17:52, Verma, Vishal L wrote:
> On Wed, 2016-03-30 at 09:49 +0200, Jan Kara wrote:
> > On Tue 29-03-16 18:57:16, Verma, Vishal L wrote:
> > >
> > > On Mon, 2016-03-28 at 16:34 -0700, Dan Williams wrote:
> > >
> > > <>
> > >
> > > >
> > > > Seems kind of sad to fail the fault due to a bad block when we
> > > > were
> > > > going to zero it anyway, right???I'm not seeing a compelling
> > > > reason to
> > > > keep any zeroing in fs/dax.c.
> > > Agreed - but how do we do this? clear_pmem needs to be able to clear
> > > an
> > > arbitrary number of bytes, but to go through the driver, we'd need
> > > to
> > > send down a bio? If only the driver had an rw_bytes like interface
> > > that
> > > could be used by anyone... :)
> > Actually, my patches for page fault locking remove zeroing from
> > dax_insert_mapping() and __dax_pmd_fault() - the zeroing now happens
> > from
> > the filesystem only and the zeroing in those two functions is just a
> > dead
> > code...
>
> That should make things easier! Do you have a tree I could merge in to
> get this? (WIP is ok as we know that my series will depend on yours..)
> or, if you can distill out that patch on a 4.6-rc1 base, I could carry
> it in my series too (your v2's 3/10 doesn't apply on 4.6-rc1..)

I'll CC you on the next posting of the series which I want to do this week.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR