2016-05-11 21:08:46

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 0/6] dax: handling media errors (clear-on-zero only)

Until now, dax has been disabled if media errors were found on
any device. This series attempts to address that.

The first two patches from Dan re-enable dax even when media
errors are present.

The third patch from Matthew removes the zeroout path from dax
entirely, making zeroout operations always go through the driver
(The motivation is that if a backing device has media errors,
and we create a sparse file on it, we don't want the initial
zeroing to happen via dax, we want to give the block driver a
chance to clear the errors).

Patch 4 from Christoph exports a low level dax helper for zeroing

Patch 5 reduces our calls to clear_pmem from dax in the
truncate/hole-punch cases. We check if the range being truncated
is sector aligned/sized, and if so, send blkdev_issue_zeroout
instead of clear_pmem so that errors can be handled better by
the driver.

Patch 6 fixes a redundant comment in DAX and is mostly unrelated
to the rest of this series.

This series also depends on/is based on Jan Kara's Ext4 and DAX
fixups series:
http://marc.info/?l=linux-ext4&m=146295959100848&w=2
http://marc.info/?l=linux-ext4&m=146296078001307&w=2

v7:
- Fix the dax alignment check to only check 'offset' and 'length'
for alignment as that's all that is needed. (Jan)
- Fix the blockdev_issue_zeroout call to zero the correct sector
- Rebase to v4.6-rc7 + the two patch-series from Jan linked above
- Add a patch from Christoph's iomap series:
http://www.spinics.net/lists/xfs/msg39656.html

v6:
- Use IS_ALIGNED in dax_range_is_aligned instead of open coding
an alignment check (Jan)
- Collect all Reveiwed-by tags so far.

v5:
- Drop the patch that attempts to clear-errors-on-write till we
reach consensus on how to handle that.
- Don't pass blk_dax_ctl to direct_access, instead pass in all the
required arguments individually (Christoph, Dan)

v4:
- Remove the dax->direct_IO fallbacks entirely. Instead, go through
the usual direct_IO path when we're in O_DIRECT, and use dax_IO
for other, non O_DIRECT IO. (Dan, Christoph)

v3:
- Wrapper-ize the direct_IO fallback again and make an exception
for -EIOCBQUEUED (Jeff, Dan)
- Reduce clear_pmem usage in DAX to the minimum



Christoph Hellwig (1):
dax: export a low-level __dax_zero_page_range helper

Dan Williams (2):
dax: fallback from pmd to pte on error
dax: enable dax in the presence of known media errors (badblocks)

Matthew Wilcox (1):
dax: use sb_issue_zerout instead of calling dax_clear_sectors

Vishal Verma (2):
dax: for truncate/hole-punch, do zeroing through the driver if
possible
dax: fix a comment in dax_zero_page_range and dax_truncate_page

Documentation/filesystems/dax.txt | 32 ++++++++++++
arch/powerpc/sysdev/axonram.c | 2 +-
block/ioctl.c | 9 ----
drivers/block/brd.c | 2 +-
drivers/nvdimm/pmem.c | 10 +++-
drivers/s390/block/dcssblk.c | 2 +-
fs/block_dev.c | 2 +-
fs/dax.c | 104 ++++++++++++++++----------------------
fs/ext2/inode.c | 7 ++-
fs/xfs/xfs_bmap_util.c | 15 ++----
include/linux/blkdev.h | 2 +-
include/linux/dax.h | 8 ++-
12 files changed, 103 insertions(+), 92 deletions(-)

--
2.5.5

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs


2016-05-11 21:08:48

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 2/6] dax: enable dax in the presence of known media errors (badblocks)

From: Dan Williams <[email protected]>

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.

Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
Signed-off-by: Vishal Verma <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
block/ioctl.c | 9 ---------
drivers/block/brd.c | 2 +-
drivers/nvdimm/pmem.c | 10 +++++++++-
drivers/s390/block/dcssblk.c | 2 +-
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 2 +-
7 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 0d112b9..ff75d70 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -143,7 +143,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
*/
static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+ void __pmem **kaddr, pfn_t *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
diff --git a/block/ioctl.c b/block/ioctl.c
index 4ff1f92..bf80bfd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -423,15 +423,6 @@ bool blkdev_dax_capable(struct block_device *bdev)
|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
return false;

- /*
- * If the device has known bad blocks, force all I/O through the
- * driver / page cache.
- *
- * TODO: support finer grained dax error handling
- */
- if (disk->bb && disk->bb->count)
- return false;
-
return true;
}
#endif
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 51a071e..c04bd9b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -381,7 +381,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,

#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+ void __pmem **kaddr, pfn_t *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5101f3a..b68f04d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -182,14 +182,22 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
}

static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+ void __pmem **kaddr, pfn_t *pfn, long size)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;

+ if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+ return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);

+ /*
+ * If badblocks are present, limit known good range to the
+ * requested range.
+ */
+ if (unlikely(pmem->bb.count))
+ return size;
return pmem->size - pmem->pfn_pad - offset;
}

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index b839086..c45d538 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -884,7 +884,7 @@ fail:

static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn)
+ void __pmem **kaddr, pfn_t *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index b25bb23..02c68c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -488,7 +488,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
+ avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn, size);
if (!avail)
return -ERANGE;
if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 669e419..55ed530 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1657,7 +1657,7 @@ struct block_device_operations {
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
long (*direct_access)(struct block_device *, sector_t, void __pmem **,
- pfn_t *);
+ pfn_t *, long);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-11 21:08:47

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 1/6] dax: fallback from pmd to pte on error

From: Dan Williams <[email protected]>

In preparation for consulting a badblocks list in pmem_direct_access(),
teach dax_pmd_fault() to fallback rather than fail immediately upon
encountering an error. The thought being that reducing the span of the
dax request may avoid the error region.

Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bc6624..d602410 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -855,8 +855,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
long length = dax_map_atomic(bdev, &dax);

if (length < 0) {
- result = VM_FAULT_SIGBUS;
- goto out;
+ dax_pmd_dbg(&bh, address, "dax-error fallback");
+ goto fallback;
}
if (length < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "dax-length too small");
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-11 21:08:49

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 3/6] dax: use sb_issue_zerout instead of calling dax_clear_sectors

From: Matthew Wilcox <[email protected]>

dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().

Reviewed-by: Jeff Moyer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 32 --------------------------------
fs/ext2/inode.c | 7 +++----
fs/xfs/xfs_bmap_util.c | 15 ++++-----------
include/linux/dax.h | 1 -
4 files changed, 7 insertions(+), 48 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d602410..0abbbb6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -87,38 +87,6 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
return page;
}

-/*
- * dax_clear_sectors() is called from within transaction context from XFS,
- * and hence this means the stack from this point must follow GFP_NOFS
- * semantics for all operations.
- */
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
-{
- struct blk_dax_ctl dax = {
- .sector = _sector,
- .size = _size,
- };
-
- might_sleep();
- do {
- long count, sz;
-
- count = dax_map_atomic(bdev, &dax);
- if (count < 0)
- return count;
- sz = min_t(long, count, SZ_128K);
- clear_pmem(dax.addr, sz);
- dax.size -= sz;
- dax.sector += sz / 512;
- dax_unmap_atomic(bdev, &dax);
- cond_resched();
- } while (dax.size);
-
- wmb_pmem();
- return 0;
-}
-EXPORT_SYMBOL_GPL(dax_clear_sectors);
-
static bool buffer_written(struct buffer_head *bh)
{
return buffer_mapped(bh) && !buffer_unwritten(bh);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 1f07b75..35f2b0bf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -26,6 +26,7 @@
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/dax.h>
+#include <linux/blkdev.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
@@ -737,10 +738,8 @@ static int ext2_get_blocks(struct inode *inode,
* so that it's not found by another thread before it's
* initialised
*/
- err = dax_clear_sectors(inode->i_sb->s_bdev,
- le32_to_cpu(chain[depth-1].key) <<
- (inode->i_blkbits - 9),
- 1 << inode->i_blkbits);
+ err = sb_issue_zeroout(inode->i_sb,
+ le32_to_cpu(chain[depth-1].key), 1, GFP_NOFS);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3b63098..930ac6a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -72,18 +72,11 @@ xfs_zero_extent(
struct xfs_mount *mp = ip->i_mount;
xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
sector_t block = XFS_BB_TO_FSBT(mp, sector);
- ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
-
- if (IS_DAX(VFS_I(ip)))
- return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
- sector, size);
-
- /*
- * let the block layer decide on the fastest method of
- * implementing the zeroing.
- */
- return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);

+ return blkdev_issue_zeroout(xfs_find_bdev_for_inode(VFS_I(ip)),
+ block << (mp->m_super->s_blocksize_bits - 9),
+ count_fsb << (mp->m_super->s_blocksize_bits - 9),
+ GFP_NOFS, true);
}

/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c45ac7..7f853ff 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,7 +7,6 @@

ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-11 21:08:50

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 4/6] dax: export a low-level __dax_zero_page_range helper

From: Christoph Hellwig <[email protected]>

This allows XFS to perform zeroing using the iomap infrastructure and
avoid buffer heads.

[vishal: fix conflicts with dax-error-handling]
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/dax.c | 35 ++++++++++++++++++++---------------
include/linux/dax.h | 7 +++++++
2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0abbbb6..651d4b1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -947,6 +947,23 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);

+int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
+ unsigned int offset, unsigned int length)
+{
+ struct blk_dax_ctl dax = {
+ .sector = sector,
+ .size = PAGE_SIZE,
+ };
+
+ if (dax_map_atomic(bdev, &dax) < 0)
+ return PTR_ERR(dax.addr);
+ clear_pmem(dax.addr + offset, length);
+ wmb_pmem();
+ dax_unmap_atomic(bdev, &dax);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(__dax_zero_page_range);
+
/**
* dax_zero_page_range - zero a range within a page of a DAX file
* @inode: The file being truncated
@@ -982,23 +999,11 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
bh.b_bdev = inode->i_sb->s_bdev;
bh.b_size = PAGE_SIZE;
err = get_block(inode, index, &bh, 0);
- if (err < 0)
+ if (err < 0 || !buffer_written(&bh))
return err;
- if (buffer_written(&bh)) {
- struct block_device *bdev = bh.b_bdev;
- struct blk_dax_ctl dax = {
- .sector = to_sector(&bh, inode),
- .size = PAGE_SIZE,
- };

- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
- clear_pmem(dax.addr + offset, length);
- wmb_pmem();
- dax_unmap_atomic(bdev, &dax);
- }
-
- return 0;
+ return __dax_zero_page_range(bh.b_bdev, to_sector(&bh, inode),
+ offset, length);
}
EXPORT_SYMBOL_GPL(dax_zero_page_range);

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7f853ff..90fbc99 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,12 +14,19 @@ int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);

#ifdef CONFIG_FS_DAX
struct page *read_dax_sector(struct block_device *bdev, sector_t n);
+int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
+ unsigned int offset, unsigned int length);
#else
static inline struct page *read_dax_sector(struct block_device *bdev,
sector_t n)
{
return ERR_PTR(-ENXIO);
}
+static inline int __dax_zero_page_range(struct block_device *bdev,
+ sector_t sector, unsigned int offset, unsigned int length)
+{
+ return -ENXIO;
+}
#endif

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-11 21:08:52

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 6/6] dax: fix a comment in dax_zero_page_range and dax_truncate_page

The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in

09cbfea mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release}
macros

The comments for the above functions described a distinction between
those, that is now redundant, so remove those paragraphs

Cc: Kirill A. Shutemov <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0b9a169..ea936b3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -995,12 +995,6 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
* page in a DAX file. This is intended for hole-punch operations. If
* you are truncating a file, the helper function dax_truncate_page() may be
* more convenient.
- *
- * We work in terms of PAGE_SIZE here for commonality with
- * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
- * took care of disposing of the unnecessary blocks. Even if the filesystem
- * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmapped.
*/
int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
get_block_t get_block)
@@ -1035,12 +1029,6 @@ EXPORT_SYMBOL_GPL(dax_zero_page_range);
*
* Similar to block_truncate_page(), this function can be called by a
* filesystem when it is truncating a DAX file to handle the partial page.
- *
- * We work in terms of PAGE_SIZE here for commonality with
- * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
- * took care of disposing of the unnecessary blocks. Even if the filesystem
- * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmapped.
*/
int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
{
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-11 21:08:51

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH v7 5/6] dax: for truncate/hole-punch, do zeroing through the driver if possible

In the truncate or hole-punch path in dax, we clear out sub-page ranges.
If these sub-page ranges are sector aligned and sized, we can do the
zeroing through the driver instead so that error-clearing is handled
automatically.

For sub-sector ranges, we still have to rely on clear_pmem and have the
possibility of tripping over errors.

Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
Documentation/filesystems/dax.txt | 32 ++++++++++++++++++++++++++++++++
fs/dax.c | 30 +++++++++++++++++++++++++-----
2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 7bde640..ce4587d 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -79,6 +79,38 @@ These filesystems may be used for inspiration:
- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt


+Handling Media Errors
+---------------------
+
+The libnvdimm subsystem stores a record of known media error locations for
+each pmem block device (in gendisk->badblocks). If we fault at such location,
+or one with a latent error not yet discovered, the application can expect
+to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
+writing the affected sectors (through the pmem driver, and if the underlying
+NVDIMM supports the clear_poison DSM defined by ACPI).
+
+Since DAX IO normally doesn't go through the driver/bio path, applications or
+sysadmins have an option to restore the lost data from a prior backup/inbuilt
+redundancy in the following ways:
+
+1. Delete the affected file, and restore from a backup (sysadmin route):
+ This will free the file system blocks that were being used by the file,
+ and the next time they're allocated, they will be zeroed first, which
+ happens through the driver, and will clear bad sectors.
+
+2. Truncate or hole-punch the part of the file that has a bad-block (at least
+ an entire aligned sector has to be hole-punched, but not necessarily an
+ entire filesystem block).
+
+These are the two basic paths that allow DAX filesystems to continue operating
+in the presence of media errors. More robust error recovery mechanisms can be
+built on top of this in the future, for example, involving redundancy/mirroring
+provided at the block layer through DM, or additionally, at the filesystem
+level. These would have to rely on the above two tenets, that error clearing
+can happen either by sending an IO through the driver, or zeroing (also through
+the driver).
+
+
Shortcomings
------------

diff --git a/fs/dax.c b/fs/dax.c
index 651d4b1..0b9a169 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -947,6 +947,19 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);

+static bool dax_range_is_aligned(struct block_device *bdev,
+ unsigned int offset, unsigned int length)
+{
+ unsigned short sector_size = bdev_logical_block_size(bdev);
+
+ if (!IS_ALIGNED(offset, sector_size))
+ return false;
+ if (!IS_ALIGNED(length, sector_size))
+ return false;
+
+ return true;
+}
+
int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length)
{
@@ -955,11 +968,18 @@ int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
.size = PAGE_SIZE,
};

- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
- clear_pmem(dax.addr + offset, length);
- wmb_pmem();
- dax_unmap_atomic(bdev, &dax);
+ if (dax_range_is_aligned(bdev, offset, length)) {
+ sector_t start_sector = dax.sector + (offset >> 9);
+
+ return blkdev_issue_zeroout(bdev, start_sector,
+ length >> 9, GFP_NOFS, true);
+ } else {
+ if (dax_map_atomic(bdev, &dax) < 0)
+ return PTR_ERR(dax.addr);
+ clear_pmem(dax.addr + offset, length);
+ wmb_pmem();
+ dax_unmap_atomic(bdev, &dax);
+ }
return 0;
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
--
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-05-12 08:38:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v7 5/6] dax: for truncate/hole-punch, do zeroing through the driver if possible

On Wed 11-05-16 15:08:51, Vishal Verma wrote:
> In the truncate or hole-punch path in dax, we clear out sub-page ranges.
> If these sub-page ranges are sector aligned and sized, we can do the
> zeroing through the driver instead so that error-clearing is handled
> automatically.
>
> For sub-sector ranges, we still have to rely on clear_pmem and have the
> possibility of tripping over errors.
>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Jan Kara <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>

The patch looks good to me now. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-05-12 08:41:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: export a low-level __dax_zero_page_range helper

On Wed 11-05-16 15:08:50, Vishal Verma wrote:
> From: Christoph Hellwig <[email protected]>
>
> This allows XFS to perform zeroing using the iomap infrastructure and
> avoid buffer heads.
>
> [vishal: fix conflicts with dax-error-handling]
> Signed-off-by: Christoph Hellwig <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

BTW: You are supposed to add your Signed-off-by when forwarding patches
like this...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2016-05-12 17:06:39

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] dax: export a low-level __dax_zero_page_range helper

On Thu, 2016-05-12 at 10:41 +0200, Jan Kara wrote:
> On Wed 11-05-16 15:08:50, Vishal Verma wrote:
> >
> > From: Christoph Hellwig <[email protected]>
> >
> > This allows XFS to perform zeroing using the iomap infrastructure
> > and
> > avoid buffer heads.
> >
> > [vishal: fix conflicts with dax-error-handling]
> > Signed-off-by: Christoph Hellwig <[email protected]>
> Looks good. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> BTW: You are supposed to add your Signed-off-by when forwarding
> patches
> like this...

Ah I didn't know. I'll add it when making the stable topic branch for
this. Thanks!

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs