Until now, dax has been disabled if media errors were found on
any device. This series attempts to address that.
The first three patches from Dan re-enable dax even when media
errors are present.
The fourth patch from Matthew removes the zeroout path from dax
entirely, making zeroout operations always go through the driver.
(The motivation: if a backing device has media errors and we create
a sparse file on it, we don't want the initial zeroing to happen via
dax; we want to give the block driver a chance to clear the errors.)
The fifth patch changes how DAX IO is re-routed as direct IO.
We add a new iocb flag for DAX to distinguish it from actual
direct IO, and if we're in O_DIRECT, use the regular direct_IO
path instead of DAX. This gives us an opportunity to do recovery
by doing O_DIRECT writes that will go through the driver to clear
errors from bad sectors.
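As a toy model of the dispatch described above (standalone userspace C;
the flag bit values mirror the patch below, but the helper and enum names
are illustrative, not kernel API), O_DIRECT takes priority over DAX so a
recovery write can reach the driver:

```c
#include <assert.h>

/* Illustrative copies of the kiocb flag bits used by this series. */
#define IOCB_DIRECT (1 << 2)
#define IOCB_DAX    (1 << 4)

enum io_path { PATH_BLOCKDEV_DIO, PATH_DAX, PATH_BUG };

/*
 * O_DIRECT wins even on a DAX inode, so a write can go through the
 * block driver and clear bad sectors.
 */
static enum io_path pick_direct_IO_path(int ki_flags)
{
	if (ki_flags & IOCB_DIRECT)
		return PATH_BLOCKDEV_DIO;
	if (ki_flags & IOCB_DAX)
		return PATH_DAX;
	return PATH_BUG;	/* neither flag set: caller bug */
}
```

Reaching PATH_BUG corresponds to the WARN_ONCE cases added in the
->direct_IO implementations in patch 5.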
Patch 6 reduces our calls to clear_pmem from dax in the
truncate/hole-punch cases. We check whether the range being truncated
is sector aligned/sized, and if so, use blkdev_issue_zeroout
instead of clear_pmem so that errors can be handled better by
the driver.
Patch 7 removes a now-redundant comment in DAX and is mostly unrelated
to the rest of this series.
This series also depends on/is based on Jan Kara's DAX Locking
fixes series [1].
[1]: http://www.spinics.net/lists/linux-mm/msg105819.html
v4:
- Remove the dax->direct_IO fallbacks entirely. Instead, go through
the usual direct_IO path when we're in O_DIRECT, and use dax_IO
for other, non-O_DIRECT IO. (Dan, Christoph)
v3:
- Wrapper-ize the direct_IO fallback again and make an exception
for -EIOCBQUEUED (Jeff, Dan)
- Reduce clear_pmem usage in DAX to the minimum
Dan Williams (3):
block, dax: pass blk_dax_ctl through to drivers
dax: fallback from pmd to pte on error
dax: enable dax in the presence of known media errors (badblocks)
Matthew Wilcox (1):
dax: use sb_issue_zeroout instead of calling dax_clear_sectors
Vishal Verma (3):
fs: prioritize and separate direct_io from dax_io
dax: for truncate/hole-punch, do zeroing through the driver if
possible
dax: fix a comment in dax_zero_page_range and dax_truncate_page
arch/powerpc/sysdev/axonram.c | 10 +++---
block/ioctl.c | 9 -----
drivers/block/brd.c | 9 ++---
drivers/block/loop.c | 2 +-
drivers/nvdimm/pmem.c | 17 +++++++---
drivers/s390/block/dcssblk.c | 12 +++----
fs/block_dev.c | 19 ++++++++---
fs/dax.c | 78 +++++++++++++++----------------------------
fs/ext2/inode.c | 23 ++++++++-----
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 19 +++++++----
fs/xfs/xfs_aops.c | 20 +++++++----
fs/xfs/xfs_bmap_util.c | 15 +++------
fs/xfs/xfs_file.c | 4 +--
include/linux/blkdev.h | 3 +-
include/linux/dax.h | 1 -
include/linux/fs.h | 15 +++++++--
mm/filemap.c | 4 +--
18 files changed, 134 insertions(+), 128 deletions(-)
--
2.5.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: [email protected]
From: Dan Williams <[email protected]>
This is in preparation for doing badblocks checking against the
requested sector range in the driver. Currently we opportunistically
return as much data as can be "dax'd" starting at the given sector.
When errors are present we want to limit that range to the first
encountered error, or fail the dax request if the range encompasses an
error.
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 10 +++++-----
drivers/block/brd.c | 9 +++++----
drivers/nvdimm/pmem.c | 9 +++++----
drivers/s390/block/dcssblk.c | 12 ++++++------
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 3 +--
6 files changed, 23 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 0d112b9..d85673f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,17 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
/**
* axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
+ * @dax: see block_device_operations method
*/
static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(device) + dax->sector;
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
- *kaddr = (void __pmem __force *) bank->io_addr + offset;
- *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+ dax->addr = (void __pmem __force *) bank->io_addr + offset;
+ dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
return bank->size - offset;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 51a071e..71521c1 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -380,9 +380,10 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}
#ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long brd_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -391,8 +392,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
- *kaddr = (void __pmem *)page_address(page);
- *pfn = page_to_pfn_t(page);
+ dax->addr = (void __pmem *)page_address(page);
+ dax->pfn = page_to_pfn_t(page);
return PAGE_SIZE;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f798899..f72733c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -181,14 +181,15 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return rc;
}
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, pfn_t *pfn)
+static long pmem_direct_access(struct block_device *bdev,
+ struct blk_dax_ctl *dax)
{
+ sector_t sector = get_start_sect(bdev) + dax->sector;
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;
- *kaddr = pmem->virt_addr + offset;
- *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+ dax->addr = pmem->virt_addr + offset;
+ dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
return pmem->size - pmem->pfn_pad - offset;
}
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index b839086..613f587 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -30,8 +30,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn);
+static long dcssblk_direct_access(struct block_device *bdev,
+		struct blk_dax_ctl *dax);
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
@@ -883,9 +883,9 @@ fail:
}
static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, pfn_t *pfn)
+dcssblk_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
{
+ sector_t secnum = get_start_sect(bdev) + dax->sector;
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
@@ -894,8 +894,8 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
return -ENODEV;
dev_sz = dev_info->end - dev_info->start;
offset = secnum * 512;
- *kaddr = (void __pmem *) (dev_info->start + offset);
- *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+ dax->addr = (void __pmem *) (dev_info->start + offset);
+ dax->pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
return dev_sz - offset;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index b25bb23..79defba 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -488,7 +488,7 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
+ avail = ops->direct_access(bdev, dax);
if (!avail)
return -ERANGE;
if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 669e419..9d8c6d5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1656,8 +1656,7 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- long (*direct_access)(struct block_device *, sector_t, void __pmem **,
- pfn_t *);
+ long (*direct_access)(struct block_device *, struct blk_dax_ctl *dax);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
2.5.5
From: Dan Williams <[email protected]>
In preparation for consulting a badblocks list in pmem_direct_access(),
teach dax_pmd_fault() to fallback rather than fail immediately upon
encountering an error. The thought being that reducing the span of the
dax request may avoid the error region.
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5a34f08..52f0044 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1111,8 +1111,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
long length = dax_map_atomic(bdev, &dax);
if (length < 0) {
- result = VM_FAULT_SIGBUS;
- goto out;
+ dax_pmd_dbg(&bh, address, "dax-error fallback");
+ goto fallback;
}
if (length < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "dax-length too small");
--
2.5.5
From: Dan Williams <[email protected]>
1/ If a mapping overlaps a bad sector fail the request.
2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.
[vishal: fix a conflict with system RAM collision patches]
Signed-off-by: Dan Williams <[email protected]>
---
block/ioctl.c | 9 ---------
drivers/nvdimm/pmem.c | 8 ++++++++
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index 4ff1f92..bf80bfd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -423,15 +423,6 @@ bool blkdev_dax_capable(struct block_device *bdev)
|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
return false;
- /*
- * If the device has known bad blocks, force all I/O through the
- * driver / page cache.
- *
- * TODO: support finer grained dax error handling
- */
- if (disk->bb && disk->bb->count)
- return false;
-
return true;
}
#endif
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f72733c..4567d9a 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -188,9 +188,17 @@ static long pmem_direct_access(struct block_device *bdev,
struct pmem_device *pmem = bdev->bd_disk->private_data;
resource_size_t offset = sector * 512 + pmem->data_offset;
+ if (unlikely(is_bad_pmem(&pmem->bb, sector, dax->size)))
+ return -EIO;
dax->addr = pmem->virt_addr + offset;
dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+ /*
+ * If badblocks are present, limit known good range to the
+ * requested range.
+ */
+ if (unlikely(pmem->bb.count))
+ return dax->size;
return pmem->size - pmem->pfn_pad - offset;
}
--
2.5.5
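The policy in this patch (fail a dax mapping whose range overlaps a bad
sector, and stop opportunistically reporting extra capacity once any
errors are known) can be sketched in standalone C. The type and helper
below are hypothetical; the real check is is_bad_pmem() against
gendisk->badblocks:

```c
#include <assert.h>
#include <stddef.h>

typedef long long sector_t;	/* 512-byte sectors, as in the block layer */

struct badblock {
	sector_t sector;
	sector_t len;
};

/*
 * Return the number of bytes that may be handed out for a dax mapping
 * starting at 'sector' for 'size' bytes (assumed sector-aligned), or
 * -1 (think -EIO) if the request overlaps a known bad block.
 * 'dev_bytes' is the device capacity.
 */
static long dax_avail(const struct badblock *bb, size_t nbb,
		      sector_t sector, long size, long dev_bytes)
{
	sector_t end = sector + size / 512;
	size_t i;

	for (i = 0; i < nbb; i++) {
		sector_t bs = bb[i].sector;
		sector_t be = bb[i].sector + bb[i].len;

		if (sector < be && bs < end)	/* ranges overlap */
			return -1;
	}
	/* with errors present anywhere, never report more than requested */
	if (nbb)
		return size;
	return dev_bytes - sector * 512;
}
```

With no badblocks, the full remaining capacity is reported, matching the
pre-existing opportunistic behavior.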
From: Matthew Wilcox <[email protected]>
dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().
Signed-off-by: Matthew Wilcox <[email protected]>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 32 --------------------------------
fs/ext2/inode.c | 7 +++----
fs/xfs/xfs_bmap_util.c | 15 ++++-----------
include/linux/dax.h | 1 -
4 files changed, 7 insertions(+), 48 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 52f0044..5948d9b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -116,38 +116,6 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
return page;
}
-/*
- * dax_clear_sectors() is called from within transaction context from XFS,
- * and hence this means the stack from this point must follow GFP_NOFS
- * semantics for all operations.
- */
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
-{
- struct blk_dax_ctl dax = {
- .sector = _sector,
- .size = _size,
- };
-
- might_sleep();
- do {
- long count, sz;
-
- count = dax_map_atomic(bdev, &dax);
- if (count < 0)
- return count;
- sz = min_t(long, count, SZ_128K);
- clear_pmem(dax.addr, sz);
- dax.size -= sz;
- dax.sector += sz / 512;
- dax_unmap_atomic(bdev, &dax);
- cond_resched();
- } while (dax.size);
-
- wmb_pmem();
- return 0;
-}
-EXPORT_SYMBOL_GPL(dax_clear_sectors);
-
static bool buffer_written(struct buffer_head *bh)
{
return buffer_mapped(bh) && !buffer_unwritten(bh);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 1f07b75..35f2b0bf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -26,6 +26,7 @@
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/dax.h>
+#include <linux/blkdev.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
@@ -737,10 +738,8 @@ static int ext2_get_blocks(struct inode *inode,
* so that it's not found by another thread before it's
* initialised
*/
- err = dax_clear_sectors(inode->i_sb->s_bdev,
- le32_to_cpu(chain[depth-1].key) <<
- (inode->i_blkbits - 9),
- 1 << inode->i_blkbits);
+ err = sb_issue_zeroout(inode->i_sb,
+ le32_to_cpu(chain[depth-1].key), 1, GFP_NOFS);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3b63098..930ac6a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -72,18 +72,11 @@ xfs_zero_extent(
struct xfs_mount *mp = ip->i_mount;
xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
sector_t block = XFS_BB_TO_FSBT(mp, sector);
- ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
-
- if (IS_DAX(VFS_I(ip)))
- return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
- sector, size);
-
- /*
- * let the block layer decide on the fastest method of
- * implementing the zeroing.
- */
- return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
+ return blkdev_issue_zeroout(xfs_find_bdev_for_inode(VFS_I(ip)),
+ block << (mp->m_super->s_blocksize_bits - 9),
+ count_fsb << (mp->m_super->s_blocksize_bits - 9),
+ GFP_NOFS, true);
}
/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ef94fa7..426841a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,7 +11,6 @@
ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
-int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
--
2.5.5
In the truncate or hole-punch path in dax, we clear out sub-page ranges.
If these sub-page ranges are sector aligned and sized, we can do the
zeroing through the driver instead so that error-clearing is handled
automatically.
For sub-sector ranges, we still have to rely on clear_pmem and have the
possibility of tripping over errors.
Cc: Matthew Wilcox <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5948d9b..d8c974e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1196,6 +1196,20 @@ out:
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+static bool dax_range_is_aligned(struct block_device *bdev,
+ struct blk_dax_ctl *dax, unsigned int offset,
+ unsigned int length)
+{
+ unsigned short sector_size = bdev_logical_block_size(bdev);
+
+ if (((u64)dax->addr + offset) % sector_size)
+ return false;
+ if (length % sector_size)
+ return false;
+
+ return true;
+}
+
/**
* dax_zero_page_range - zero a range within a page of a DAX file
* @inode: The file being truncated
@@ -1240,11 +1254,17 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
.size = PAGE_SIZE,
};
- if (dax_map_atomic(bdev, &dax) < 0)
- return PTR_ERR(dax.addr);
- clear_pmem(dax.addr + offset, length);
- wmb_pmem();
- dax_unmap_atomic(bdev, &dax);
+ if (dax_range_is_aligned(bdev, &dax, offset, length))
+ return blkdev_issue_zeroout(bdev, dax.sector,
+ length / bdev_logical_block_size(bdev),
+ GFP_NOFS, true);
+ else {
+ if (dax_map_atomic(bdev, &dax) < 0)
+ return PTR_ERR(dax.addr);
+ clear_pmem(dax.addr + offset, length);
+ wmb_pmem();
+ dax_unmap_atomic(bdev, &dax);
+ }
}
return 0;
--
2.5.5
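A minimal userspace model of the decision added above (hypothetical
helper names; the kernel side checks against bdev_logical_block_size()):
a sub-page zeroing range is routed through the driver only when both its
start and its length are sector aligned:

```c
#include <assert.h>

/*
 * Can this (offset, length) range within a page be zeroed via
 * blkdev_issue_zeroout(), i.e. is it sector aligned and sized?
 */
static int zero_via_driver(unsigned int offset, unsigned int length,
			   unsigned int sector_size)
{
	return (offset % sector_size) == 0 && (length % sector_size) == 0;
}

/* Sector count to pass to the zeroout call when the range qualifies. */
static unsigned int zeroout_nr_sects(unsigned int length,
				     unsigned int sector_size)
{
	return length / sector_size;
}
```

Anything that fails the check stays on the clear_pmem path, with the
error-tripping caveat noted in the commit message.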
All IO in a dax filesystem used to go through dax_do_io, which cannot
handle media errors, and thus cannot provide a recovery path that can
send a write through the driver to clear errors.
Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
path for DAX filesystems, use the same direct_IO path for both DAX and
direct_io iocbs, but use the flags to identify when we are in O_DIRECT
mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
direct_IO path instead of DAX.
This allows us a recovery path in the form of opening the file with
O_DIRECT and writing to it with the usual O_DIRECT semantics (sector
alignment restrictions).
Cc: Matthew Wilcox <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
drivers/block/loop.c | 2 +-
fs/block_dev.c | 17 +++++++++++++----
fs/ext2/inode.c | 16 ++++++++++++----
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 19 +++++++++++++------
fs/xfs/xfs_aops.c | 20 +++++++++++++-------
fs/xfs/xfs_file.c | 4 ++--
include/linux/fs.h | 15 ++++++++++++---
mm/filemap.c | 4 ++--
9 files changed, 69 insertions(+), 30 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 80cf8ad..c0a24c3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -568,7 +568,7 @@ struct switch_request {
static inline void loop_update_dio(struct loop_device *lo)
{
- __loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
+ __loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) |
lo->use_dio);
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 79defba..97a1f5f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -167,12 +167,21 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
struct file *file = iocb->ki_filp;
struct inode *inode = bdev_file_inode(file);
- if (IS_DAX(inode))
+ if (iocb_is_direct(iocb))
+ return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
+ offset, blkdev_get_block, NULL,
+ NULL, DIO_SKIP_DIO_COUNT);
+ else if (iocb_is_dax(iocb))
return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
NULL, DIO_SKIP_DIO_COUNT);
- return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
- blkdev_get_block, NULL, NULL,
- DIO_SKIP_DIO_COUNT);
+ else {
+ /*
+ * If we're in the direct_IO path, either the IOCB_DIRECT or
+ * IOCB_DAX flags must be set.
+ */
+ WARN_ONCE(1, "Kernel Bug with iocb flags\n");
+ return -ENXIO;
+ }
}
int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 35f2b0bf..45f2b51 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -861,12 +861,20 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
size_t count = iov_iter_count(iter);
ssize_t ret;
- if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
- DIO_LOCKING);
- else
+ if (iocb_is_direct(iocb))
ret = blockdev_direct_IO(iocb, inode, iter, offset,
ext2_get_block);
+ else if (iocb_is_dax(iocb))
+ ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
+ DIO_LOCKING);
+ else {
+ /*
+ * If we're in the direct_IO path, either the IOCB_DIRECT or
+ * IOCB_DAX flags must be set.
+ */
+ WARN_ONCE(1, "Kernel Bug with iocb flags\n");
+ return -ENXIO;
+ }
if (ret < 0 && iov_iter_rw(iter) == WRITE)
ext2_write_failed(mapping, offset + count);
return ret;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2e9aa49..165a0b8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(iocb->ki_filp);
struct blk_plug plug;
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
+ int o_direct = iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX);
int unaligned_aio = 0;
int overwrite = 0;
ssize_t ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6d5d5c1..0b6d77a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3410,15 +3410,22 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
#endif
- if (IS_DAX(inode)) {
- ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
- ext4_end_io_dio, dio_flags);
- } else
+ if (iocb_is_direct(iocb))
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter, offset,
get_block_func,
ext4_end_io_dio, NULL, dio_flags);
-
+ else if (iocb_is_dax(iocb))
+ ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
+ ext4_end_io_dio, dio_flags);
+ else {
+ /*
+ * If we're in the direct_IO path, either the IOCB_DIRECT or
+ * IOCB_DAX flags must be set.
+ */
+ WARN_ONCE(1, "Kernel Bug with iocb flags\n");
+ return -ENXIO;
+ }
if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN)) {
int err;
@@ -3503,7 +3510,7 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter,
else
unlocked = 1;
}
- if (IS_DAX(inode)) {
+ if (iocb_is_dax(iocb)) {
ret = dax_do_io(iocb, inode, iter, offset, ext4_dio_get_block,
NULL, unlocked ? 0 : DIO_LOCKING);
} else {
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e49b240..8134e99 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1412,21 +1412,27 @@ xfs_vm_direct_IO(
struct inode *inode = iocb->ki_filp->f_mapping->host;
dio_iodone_t *endio = NULL;
int flags = 0;
- struct block_device *bdev;
+ struct block_device *bdev = xfs_find_bdev_for_inode(inode);
if (iov_iter_rw(iter) == WRITE) {
endio = xfs_end_io_direct_write;
flags = DIO_ASYNC_EXTEND;
}
- if (IS_DAX(inode)) {
+ if (iocb_is_direct(iocb))
+ return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
+ xfs_get_blocks_direct, endio, NULL, flags);
+ else if (iocb_is_dax(iocb))
return dax_do_io(iocb, inode, iter, offset,
- xfs_get_blocks_direct, endio, 0);
+ xfs_get_blocks_direct, endio, 0);
+ else {
+ /*
+ * If we're in the direct_IO path, either the IOCB_DIRECT or
+ * IOCB_DAX flags must be set.
+ */
+ WARN_ONCE(1, "Kernel Bug with iocb flags\n");
+ return -ENXIO;
}
-
- bdev = xfs_find_bdev_for_inode(inode);
- return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
- xfs_get_blocks_direct, endio, NULL, flags);
}
/*
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c2946f4..3d5d3c2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -300,7 +300,7 @@ xfs_file_read_iter(
XFS_STATS_INC(mp, xs_read_calls);
- if (unlikely(iocb->ki_flags & IOCB_DIRECT))
+ if (unlikely(iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
ioflags |= XFS_IO_ISDIRECT;
if (file->f_mode & FMODE_NOCMTIME)
ioflags |= XFS_IO_INVIS;
@@ -898,7 +898,7 @@ xfs_file_write_iter(
if (XFS_FORCED_SHUTDOWN(ip->i_mount))
return -EIO;
- if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode))
+ if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
ret = xfs_file_dio_aio_write(iocb, from);
else
ret = xfs_file_buffered_aio_write(iocb, from);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9f28130..adca1d8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,7 @@ struct writeback_control;
#define IOCB_APPEND (1 << 1)
#define IOCB_DIRECT (1 << 2)
#define IOCB_HIPRI (1 << 3)
+#define IOCB_DAX (1 << 4)
struct kiocb {
struct file *ki_filp;
@@ -2930,9 +2931,15 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);
-static inline bool io_is_direct(struct file *filp)
+static inline bool iocb_is_dax(struct kiocb *iocb)
{
- return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+ return IS_DAX(file_inode(iocb->ki_filp)) &&
+ (iocb->ki_flags & IOCB_DAX);
+}
+
+static inline bool iocb_is_direct(struct kiocb *iocb)
+{
+ return iocb->ki_flags & IOCB_DIRECT;
}
static inline int iocb_flags(struct file *file)
@@ -2940,8 +2947,10 @@ static inline int iocb_flags(struct file *file)
int res = 0;
if (file->f_flags & O_APPEND)
res |= IOCB_APPEND;
- if (io_is_direct(file))
+ if (file->f_flags & O_DIRECT)
res |= IOCB_DIRECT;
+ if (IS_DAX(file_inode(file)))
+ res |= IOCB_DAX;
return res;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 3effd5c..b959acf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1849,7 +1849,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
if (!count)
goto out; /* skip atime */
- if (iocb->ki_flags & IOCB_DIRECT) {
+ if (iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)) {
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
loff_t size;
@@ -2719,7 +2719,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err)
goto out;
- if (iocb->ki_flags & IOCB_DIRECT) {
+ if (iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)) {
loff_t pos, endbyte;
written = generic_file_direct_write(iocb, from, iocb->ki_pos);
--
2.5.5
The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in
commit 09cbfea ("mm, fs: get rid of PAGE_CACHE_* and
page_cache_{get,release} macros").
The comments for the affected functions described a distinction that is
now redundant, so remove those paragraphs.
Cc: Matthew Wilcox <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
fs/dax.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index d8c974e..b8fa85a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1221,12 +1221,6 @@ static bool dax_range_is_aligned(struct block_device *bdev,
* page in a DAX file. This is intended for hole-punch operations. If
* you are truncating a file, the helper function dax_truncate_page() may be
* more convenient.
- *
- * We work in terms of PAGE_SIZE here for commonality with
- * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
- * took care of disposing of the unnecessary blocks. Even if the filesystem
- * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmapped.
*/
int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
get_block_t get_block)
@@ -1279,12 +1273,6 @@ EXPORT_SYMBOL_GPL(dax_zero_page_range);
*
* Similar to block_truncate_page(), this function can be called by a
* filesystem when it is truncating a DAX file to handle the partial page.
- *
- * We work in terms of PAGE_SIZE here for commonality with
- * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
- * took care of disposing of the unnecessary blocks. Even if the filesystem
- * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
- * since the file might be mmapped.
*/
int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
{
--
2.5.5
This just documents the basic paths that can be used to deal with
(i.e. clear) media errors from the file system's point of view.
Cc: Dave Chinner <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
While this isn't a design document for new mechanisms for adding
error recovery/redundancy at the block/fs layers, this attempts to
explain the bare essentials required for anything operating above
the pmem block driver in the stack.
Documentation/filesystems/dax.txt | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 7bde640..71cd8fa 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -79,6 +79,40 @@ These filesystems may be used for inspiration:
- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
+Handling Media Errors
+---------------------
+
+The libnvdimm subsystem stores a record of known media error locations for
+each pmem block device (in gendisk->badblocks). If we fault at such a location,
+or one with a latent error not yet discovered, the application can expect
+to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
+writing the affected sectors (through the pmem driver, and if the underlying
+NVDIMM supports the clear_poison DSM defined by ACPI).
+
+Since DAX IO normally doesn't go through the driver/bio path, applications or
+sysadmins have an option to restore the lost data from a prior backup/inbuilt
+redundancy in the following two ways:
+
+1. Delete the affected file, and restore from a backup (sysadmin route):
+ This will free the file system blocks that were being used by the file,
+ and the next time they're allocated, they will be zeroed first, which
+ happens through the driver, and will clear bad sectors.
+
+2. Open the file with O_DIRECT, and restore a sector's worth of data at the
+ bad location (application route):
+ We allow O_DIRECT writes to go through the normal O_DIRECT path that sends
+ bios down through the driver. If an application is able to restore its own
+ data, it can use this path to clear errors.
+
+These are the two basic paths that allow DAX filesystems to continue operating
+in the presence of media errors. More robust error recovery mechanisms can be
+built on top of this in the future, for example, involving redundancy/mirroring
+provided at the block layer through DM, or additionally, at the filesystem
+level. These would have to rely on the above two tenets, that error clearing
+can happen either by sending an IO through the driver, or zeroing (also through
+the driver).
+
+
Shortcomings
------------
--
2.5.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: [email protected]
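The application route described in the documentation above can be sketched in userspace C. This is an illustrative sketch, not part of the patch series: the `restore_sector()` helper, the 512-byte `SECTOR_SIZE`, and the fallback to a plain `open()` on filesystems that reject O_DIRECT are all assumptions for demonstration. A real recovery tool would open the affected DAX file with O_DIRECT and rewrite the bad sector from its own redundancy.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512

/*
 * Hypothetical recovery helper: rewrite one sector's worth of data at a
 * sector-aligned offset.  O_DIRECT requires both the file offset and the
 * user buffer to be sector aligned -- the same alignment restriction the
 * recovery path above relies on.
 */
static int restore_sector(const char *path, off_t offset, const void *data)
{
	void *buf;
	int fd;
	ssize_t n = -1;

	if (offset % SECTOR_SIZE)
		return -1;      /* O_DIRECT would reject an unaligned offset */

	if (posix_memalign(&buf, SECTOR_SIZE, SECTOR_SIZE))
		return -1;
	memcpy(buf, data, SECTOR_SIZE);

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0 && errno == EINVAL)
		fd = open(path, O_WRONLY);  /* fs without O_DIRECT support */
	if (fd >= 0) {
		n = pwrite(fd, buf, SECTOR_SIZE, offset);
		if (n < 0 && errno == EINVAL) {
			/* O_DIRECT accepted at open() but rejected at write
			 * time; retry buffered so the sketch stays portable */
			close(fd);
			fd = open(path, O_WRONLY);
			if (fd >= 0)
				n = pwrite(fd, buf, SECTOR_SIZE, offset);
		}
		if (fd >= 0)
			close(fd);
	}
	free(buf);
	return n == SECTOR_SIZE ? 0 : -1;
}
```

On a DAX filesystem with this series applied, the O_DIRECT write is routed as a bio through the pmem driver, giving the driver a chance to clear the bad sectors before the data lands.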
> index 79defba..97a1f5f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -167,12 +167,21 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> struct file *file = iocb->ki_filp;
> struct inode *inode = bdev_file_inode(file);
>
> - if (IS_DAX(inode))
> + if (iocb_is_direct(iocb))
> + return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> + offset, blkdev_get_block, NULL,
> + NULL, DIO_SKIP_DIO_COUNT);
> + else if (iocb_is_dax(iocb))
> return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> - blkdev_get_block, NULL, NULL,
> - DIO_SKIP_DIO_COUNT);
> + else {
> + /*
> + * If we're in the direct_IO path, either the IOCB_DIRECT or
> + * IOCB_DAX flags must be set.
> + */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> + }
DAX should not even end up in ->direct_IO.
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -300,7 +300,7 @@ xfs_file_read_iter(
>
> XFS_STATS_INC(mp, xs_read_calls);
>
> - if (unlikely(iocb->ki_flags & IOCB_DIRECT))
> + if (unlikely(iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
> ioflags |= XFS_IO_ISDIRECT;
please also add a XFS_IO_ISDAX flag to propagate the information
properly and allow tracing to display the actual I/O type.
> +static inline bool iocb_is_dax(struct kiocb *iocb)
> {
> + return IS_DAX(file_inode(iocb->ki_filp)) &&
> + (iocb->ki_flags & IOCB_DAX);
> +}
> +
> +static inline bool iocb_is_direct(struct kiocb *iocb)
> +{
> + return iocb->ki_flags & IOCB_DIRECT;
> }
No need for these helpers - especially as IOCB_DAX should never be set
if IS_DAX is false.
On 04/29/2016 12:16 AM, Vishal Verma wrote:
> All IO in a dax filesystem used to go through dax_do_io, which cannot
> handle media errors, and thus cannot provide a recovery path that can
> send a write through the driver to clear errors.
>
> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
> path for DAX filesystems, use the same direct_IO path for both DAX and
> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
> direct_IO path instead of DAX.
>
Really? What is your thinking here?
What about all the current users of O_DIRECT? You have just made them
4 times slower and "less concurrent*" than "buffered io" users, since
the direct_IO path will queue an IO request and all.
(And if it is not so slow, then why do we need dax_do_io at all? [Rhetorical])
I hate it that you overload the semantics of a known and expected
O_DIRECT flag, for special pmem quirks. This is an incompatible
and unrelated overload of the semantics of O_DIRECT.
> This allows us a recovery path in the form of opening the file with
> O_DIRECT and writing to it with the usual O_DIRECT semantics (sector
> alignment restrictions).
>
I understand that you want a sector-aligned IO, right? For the
clearing of errors. But I hate it that you forced all O_DIRECT IO
to be slow for this.
Can you not make dax_do_io handle media errors? At least for the
parts of the IO that are aligned.
(And your recovery path application above can use only aligned
IO to make sure)
Please look for another solution. Even a special IOCTL_DAX_CLEAR_ERROR
[*"less concurrent" because of the queuing done in bdev. Note how
pmem is not even multi-queue, and even if it was it would be much
slower than DAX because of the code depth and all the locks and task
switches done in the block layer. In DAX the final memcpy is done directly
on the user-mode thread]
Thanks
Boaz
> Cc: Matthew Wilcox <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Al Viro <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> drivers/block/loop.c | 2 +-
> fs/block_dev.c | 17 +++++++++++++----
> fs/ext2/inode.c | 16 ++++++++++++----
> fs/ext4/file.c | 2 +-
> fs/ext4/inode.c | 19 +++++++++++++------
> fs/xfs/xfs_aops.c | 20 +++++++++++++-------
> fs/xfs/xfs_file.c | 4 ++--
> include/linux/fs.h | 15 ++++++++++++---
> mm/filemap.c | 4 ++--
> 9 files changed, 69 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 80cf8ad..c0a24c3 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -568,7 +568,7 @@ struct switch_request {
>
> static inline void loop_update_dio(struct loop_device *lo)
> {
> - __loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
> + __loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) |
> lo->use_dio);
> }
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 79defba..97a1f5f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -167,12 +167,21 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> struct file *file = iocb->ki_filp;
> struct inode *inode = bdev_file_inode(file);
>
> - if (IS_DAX(inode))
> + if (iocb_is_direct(iocb))
> + return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> + offset, blkdev_get_block, NULL,
> + NULL, DIO_SKIP_DIO_COUNT);
> + else if (iocb_is_dax(iocb))
> return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> - blkdev_get_block, NULL, NULL,
> - DIO_SKIP_DIO_COUNT);
> + else {
> + /*
> + * If we're in the direct_IO path, either the IOCB_DIRECT or
> + * IOCB_DAX flags must be set.
> + */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> + }
> }
>
> int __sync_blockdev(struct block_device *bdev, int wait)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 35f2b0bf..45f2b51 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -861,12 +861,20 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> size_t count = iov_iter_count(iter);
> ssize_t ret;
>
> - if (IS_DAX(inode))
> - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> - DIO_LOCKING);
> - else
> + if (iocb_is_direct(iocb))
> ret = blockdev_direct_IO(iocb, inode, iter, offset,
> ext2_get_block);
> + else if (iocb_is_dax(iocb))
> + ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> + DIO_LOCKING);
> + else {
> + /*
> + * If we're in the direct_IO path, either the IOCB_DIRECT or
> + * IOCB_DAX flags must be set.
> + */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> + }
> if (ret < 0 && iov_iter_rw(iter) == WRITE)
> ext2_write_failed(mapping, offset + count);
> return ret;
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2e9aa49..165a0b8 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> struct file *file = iocb->ki_filp;
> struct inode *inode = file_inode(iocb->ki_filp);
> struct blk_plug plug;
> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
> + int o_direct = iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX);
> int unaligned_aio = 0;
> int overwrite = 0;
> ssize_t ret;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6d5d5c1..0b6d77a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3410,15 +3410,22 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
> #ifdef CONFIG_EXT4_FS_ENCRYPTION
> BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
> #endif
> - if (IS_DAX(inode)) {
> - ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
> - ext4_end_io_dio, dio_flags);
> - } else
> + if (iocb_is_direct(iocb))
> ret = __blockdev_direct_IO(iocb, inode,
> inode->i_sb->s_bdev, iter, offset,
> get_block_func,
> ext4_end_io_dio, NULL, dio_flags);
> -
> + else if (iocb_is_dax(iocb))
> + ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
> + ext4_end_io_dio, dio_flags);
> + else {
> + /*
> + * If we're in the direct_IO path, either the IOCB_DIRECT or
> + * IOCB_DAX flags must be set.
> + */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> + }
> if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> EXT4_STATE_DIO_UNWRITTEN)) {
> int err;
> @@ -3503,7 +3510,7 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter,
> else
> unlocked = 1;
> }
> - if (IS_DAX(inode)) {
> + if (iocb_is_dax(iocb)) {
> ret = dax_do_io(iocb, inode, iter, offset, ext4_dio_get_block,
> NULL, unlocked ? 0 : DIO_LOCKING);
> } else {
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e49b240..8134e99 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1412,21 +1412,27 @@ xfs_vm_direct_IO(
> struct inode *inode = iocb->ki_filp->f_mapping->host;
> dio_iodone_t *endio = NULL;
> int flags = 0;
> - struct block_device *bdev;
> + struct block_device *bdev = xfs_find_bdev_for_inode(inode);
>
> if (iov_iter_rw(iter) == WRITE) {
> endio = xfs_end_io_direct_write;
> flags = DIO_ASYNC_EXTEND;
> }
>
> - if (IS_DAX(inode)) {
> + if (iocb_is_direct(iocb))
> + return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
> + xfs_get_blocks_direct, endio, NULL, flags);
> + else if (iocb_is_dax(iocb))
> return dax_do_io(iocb, inode, iter, offset,
> - xfs_get_blocks_direct, endio, 0);
> + xfs_get_blocks_direct, endio, 0);
> + else {
> + /*
> + * If we're in the direct_IO path, either the IOCB_DIRECT or
> + * IOCB_DAX flags must be set.
> + */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> }
> -
> - bdev = xfs_find_bdev_for_inode(inode);
> - return __blockdev_direct_IO(iocb, inode, bdev, iter, offset,
> - xfs_get_blocks_direct, endio, NULL, flags);
> }
>
> /*
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c2946f4..3d5d3c2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -300,7 +300,7 @@ xfs_file_read_iter(
>
> XFS_STATS_INC(mp, xs_read_calls);
>
> - if (unlikely(iocb->ki_flags & IOCB_DIRECT))
> + if (unlikely(iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
> ioflags |= XFS_IO_ISDIRECT;
> if (file->f_mode & FMODE_NOCMTIME)
> ioflags |= XFS_IO_INVIS;
> @@ -898,7 +898,7 @@ xfs_file_write_iter(
> if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> return -EIO;
>
> - if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode))
> + if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
> ret = xfs_file_dio_aio_write(iocb, from);
> else
> ret = xfs_file_buffered_aio_write(iocb, from);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9f28130..adca1d8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -322,6 +322,7 @@ struct writeback_control;
> #define IOCB_APPEND (1 << 1)
> #define IOCB_DIRECT (1 << 2)
> #define IOCB_HIPRI (1 << 3)
> +#define IOCB_DAX (1 << 4)
>
> struct kiocb {
> struct file *ki_filp;
> @@ -2930,9 +2931,15 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
> extern void save_mount_options(struct super_block *sb, char *options);
> extern void replace_mount_options(struct super_block *sb, char *options);
>
> -static inline bool io_is_direct(struct file *filp)
> +static inline bool iocb_is_dax(struct kiocb *iocb)
> {
> - return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> + return IS_DAX(file_inode(iocb->ki_filp)) &&
> + (iocb->ki_flags & IOCB_DAX);
> +}
> +
> +static inline bool iocb_is_direct(struct kiocb *iocb)
> +{
> + return iocb->ki_flags & IOCB_DIRECT;
> }
>
> static inline int iocb_flags(struct file *file)
> @@ -2940,8 +2947,10 @@ static inline int iocb_flags(struct file *file)
> int res = 0;
> if (file->f_flags & O_APPEND)
> res |= IOCB_APPEND;
> - if (io_is_direct(file))
> + if (file->f_flags & O_DIRECT)
> res |= IOCB_DIRECT;
> + if (IS_DAX(file_inode(file)))
> + res |= IOCB_DAX;
> return res;
> }
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3effd5c..b959acf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1849,7 +1849,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> if (!count)
> goto out; /* skip atime */
>
> - if (iocb->ki_flags & IOCB_DIRECT) {
> + if (iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)) {
> struct address_space *mapping = file->f_mapping;
> struct inode *inode = mapping->host;
> loff_t size;
> @@ -2719,7 +2719,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (err)
> goto out;
>
> - if (iocb->ki_flags & IOCB_DIRECT) {
> + if (iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)) {
> loff_t pos, endbyte;
>
> written = generic_file_direct_write(iocb, from, iocb->ki_pos);
>
On Mon, 2016-05-02 at 07:56 -0700, Christoph Hellwig wrote:
> >
> > index 79defba..97a1f5f 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -167,12 +167,21 @@ blkdev_direct_IO(struct kiocb *iocb, struct
> > iov_iter *iter, loff_t offset)
> > struct file *file = iocb->ki_filp;
> > struct inode *inode = bdev_file_inode(file);
> >
> > - if (IS_DAX(inode))
> > + if (iocb_is_direct(iocb))
> > + return __blockdev_direct_IO(iocb, inode,
> > I_BDEV(inode), iter,
> > + offset,
> > blkdev_get_block, NULL,
> > + NULL,
> > DIO_SKIP_DIO_COUNT);
> > + else if (iocb_is_dax(iocb))
> > return dax_do_io(iocb, inode, iter, offset,
> > blkdev_get_block,
> > NULL, DIO_SKIP_DIO_COUNT);
> > - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> > - blkdev_get_block, NULL, NULL,
> > - DIO_SKIP_DIO_COUNT);
> > + else {
> > + /*
> > + * If we're in the direct_IO path, either the
> > IOCB_DIRECT or
> > + * IOCB_DAX flags must be set.
> > + */
> > + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> > + return -ENXIO;
> > + }
> DAX should not even end up in ->direct_IO.
Do you mean to say remove the last 'else' clause entirely?
I agree that it should never be hit, which is why it is a WARN,
but I'm happy to remove it.
>
> >
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -300,7 +300,7 @@ xfs_file_read_iter(
> >
> > XFS_STATS_INC(mp, xs_read_calls);
> >
> > - if (unlikely(iocb->ki_flags & IOCB_DIRECT))
> > + if (unlikely(iocb->ki_flags & (IOCB_DIRECT | IOCB_DAX)))
> > ioflags |= XFS_IO_ISDIRECT;
> please also add a XFS_IO_ISDAX flag to propagate the information
> properly and allow tracing to display the actual I/O type.
Will do.
>
> >
> > +static inline bool iocb_is_dax(struct kiocb *iocb)
> > {
> > + return IS_DAX(file_inode(iocb->ki_filp)) &&
> > + (iocb->ki_flags & IOCB_DAX);
> > +}
> > +
> > +static inline bool iocb_is_direct(struct kiocb *iocb)
> > +{
> > + return iocb->ki_flags & IOCB_DIRECT;
> > }
> No need for these helpers - especially as IOCB_DAX should never be
> set
> if IS_DAX is false.
Ok. So check the flags directly where needed?
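For reference, the dispatch this patch adds to every ->direct_IO implementation boils down to a three-way classification on the iocb flags, with IOCB_DIRECT checked first so O_DIRECT wins on a DAX mount. A minimal userspace sketch (the flag values mirror the patch's include/linux/fs.h additions; the `classify_iocb()` helper and enum names are illustrative, not from the series):

```c
#include <stdbool.h>

/* Flag values as added to include/linux/fs.h by the patch */
#define IOCB_DIRECT (1 << 2)
#define IOCB_DAX    (1 << 4)

enum io_path {
	IO_PATH_DIRECT, /* __blockdev_direct_IO(): bios through the driver */
	IO_PATH_DAX,    /* dax_do_io(): direct memcpy, no bio */
	IO_PATH_BUG,    /* neither flag set: should never reach ->direct_IO */
};

/* Mirrors the if/else-if/else cascade the patch adds to each ->direct_IO */
static enum io_path classify_iocb(int ki_flags)
{
	if (ki_flags & IOCB_DIRECT)
		return IO_PATH_DIRECT;
	if (ki_flags & IOCB_DAX)
		return IO_PATH_DAX;
	return IO_PATH_BUG;
}
```

Note the ordering: on a DAX mount iocb_flags() sets IOCB_DAX unconditionally, so an O_DIRECT open carries both flags, and checking IOCB_DIRECT first is what routes such IO down the driver path where errors can be cleared.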
On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
> On 04/29/2016 12:16 AM, Vishal Verma wrote:
> >
> > All IO in a dax filesystem used to go through dax_do_io, which
> > cannot
> > handle media errors, and thus cannot provide a recovery path that
> > can
> > send a write through the driver to clear errors.
> >
> > Add a new iocb flag for DAX, and set it only for DAX mounts. In the
> > IO
> > path for DAX filesystems, use the same direct_IO path for both DAX
> > and
> > direct_io iocbs, but use the flags to identify when we are in
> > O_DIRECT
> > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
> > conventional
> > direct_IO path instead of DAX.
> >
> Really? What are your thinking here?
>
> What about all the current users of O_DIRECT, you have just made them
> 4 times slower and "less concurrent*" then "buffred io" users. Since
> direct_IO path will queue an IO request and all.
> (And if it is not so slow then why do we need dax_do_io at all?
> [Rhetorical])
>
> I hate it that you overload the semantics of a known and expected
> O_DIRECT flag, for special pmem quirks. This is an incompatible
> and unrelated overload of the semantics of O_DIRECT.
We overloaded O_DIRECT a long time ago when we made DAX piggyback on
the same path:
static inline bool io_is_direct(struct file *filp)
{
return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
}
Yes O_DIRECT on a DAX mounted file system will now be slower, but -
>
> >
> > This allows us a recovery path in the form of opening the file with
> > O_DIRECT and writing to it with the usual O_DIRECT semantics
> > (sector
> > alignment restrictions).
> >
> I understand that you want a sector aligned IO, right? for the
> clear of errors. But I hate it that you forced all O_DIRECT IO
> to be slow for this.
> Can you not make dax_do_io handle media errors? At least for the
> parts of the IO that are aligned.
> (And your recovery path application above can use only aligned
> IO to make sure)
>
> Please look for another solution. Even a special
> IOCTL_DAX_CLEAR_ERROR
- see all the versions of this series prior to this one, where we try
to do a fallback...
>
> [*"less concurrent" because of the queuing done in bdev. Note how
> pmem is not even multi-queue, and even if it was it will be much
> slower then DAX because of the code depth and all the locks and
> task
> switches done in the block layer. In DAX the final memcpy is done
> directly
> on the user-mode thread]
>
> Thanks
> Boaz
>
On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <[email protected]> wrote:
> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>> handle media errors, and thus cannot provide a recovery path that can
>> send a write through the driver to clear errors.
>>
>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>> path for DAX filesystems, use the same direct_IO path for both DAX and
>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>> direct_IO path instead of DAX.
>>
>
> Really? What are your thinking here?
>
> What about all the current users of O_DIRECT, you have just made them
> 4 times slower and "less concurrent*" then "buffred io" users. Since
> direct_IO path will queue an IO request and all.
> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>
> I hate it that you overload the semantics of a known and expected
> O_DIRECT flag, for special pmem quirks. This is an incompatible
> and unrelated overload of the semantics of O_DIRECT.
I think it is the opposite situation; it is undoing the premature
overloading of O_DIRECT that went in without performance numbers.
This implementation clarifies that dax_do_io() handles the lack of a
page cache for buffered I/O and O_DIRECT behaves as it nominally would
by sending an I/O to the driver. It has the benefit of matching the
error semantics of a typical block device where a buffered write could
hit an error filling the page cache, but an O_DIRECT write potentially
triggers the drive to remap the block.
On 05/02/2016 06:51 PM, Vishal Verma wrote:
> On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>
>>> All IO in a dax filesystem used to go through dax_do_io, which
>>> cannot
>>> handle media errors, and thus cannot provide a recovery path that
>>> can
>>> send a write through the driver to clear errors.
>>>
>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the
>>> IO
>>> path for DAX filesystems, use the same direct_IO path for both DAX
>>> and
>>> direct_io iocbs, but use the flags to identify when we are in
>>> O_DIRECT
>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
>>> conventional
>>> direct_IO path instead of DAX.
>>>
>> Really? What are your thinking here?
>>
>> What about all the current users of O_DIRECT, you have just made them
>> 4 times slower and "less concurrent*" then "buffred io" users. Since
>> direct_IO path will queue an IO request and all.
>> (And if it is not so slow then why do we need dax_do_io at all?
>> [Rhetorical])
>>
>> I hate it that you overload the semantics of a known and expected
>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>> and unrelated overload of the semantics of O_DIRECT.
>
> We overloaded O_DIRECT a long time ago when we made DAX piggyback on
> the same path:
>
> static inline bool io_is_direct(struct file *filp)
> {
> return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> }
>
No, as far as the user is concerned we have not. The O_DIRECT user
is still getting all the semantics he wants, i.e. no syncs, no
memory-cache usage, no copies ...
Only with DAX the buffered IO is the same, since with pmem it is faster.
Then why not? The basic contract with the user did not break.
The above was just an implementation detail to easily navigate
through the Linux VFS IO stack and make the least amount of changes
in every FS that wanted to support DAX. (And since dax_do_io is much
more like direct_IO than like page-cache IO.)
> Yes O_DIRECT on a DAX mounted file system will now be slower, but -
>
>>
>>>
>>> This allows us a recovery path in the form of opening the file with
>>> O_DIRECT and writing to it with the usual O_DIRECT semantics
>>> (sector
>>> alignment restrictions).
>>>
>> I understand that you want a sector aligned IO, right? for the
>> clear of errors. But I hate it that you forced all O_DIRECT IO
>> to be slow for this.
>> Can you not make dax_do_io handle media errors? At least for the
>> parts of the IO that are aligned.
>> (And your recovery path application above can use only aligned
>> IO to make sure)
>>
>> Please look for another solution. Even a special
>> IOCTL_DAX_CLEAR_ERROR
>
> - see all the versions of this series prior to this one, where we try
> to do a fallback...
>
And?
So now all O_DIRECT apps go 4 times slower. I will have a look, but if
it is really so bad then please consider an IOCTL or syscall, or a special
O_DAX_ERRORS flag ...
Please do not trash all the O_DIRECT users; they are the more important
clients, like DBs and VMs.
Thanks
Boaz
>>
>> [*"less concurrent" because of the queuing done in bdev. Note how
>> pmem is not even multi-queue, and even if it was it will be much
>> slower then DAX because of the code depth and all the locks and
>> task
>> switches done in the block layer. In DAX the final memcpy is done
>> directly
>> on the user-mode thread]
>>
>> Thanks
>> Boaz
>>
>
On 05/02/2016 07:01 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <[email protected]> wrote:
>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>> handle media errors, and thus cannot provide a recovery path that can
>>> send a write through the driver to clear errors.
>>>
>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>> direct_IO path instead of DAX.
>>>
>>
>> Really? What are your thinking here?
>>
>> What about all the current users of O_DIRECT, you have just made them
>> 4 times slower and "less concurrent*" then "buffred io" users. Since
>> direct_IO path will queue an IO request and all.
>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>
>> I hate it that you overload the semantics of a known and expected
>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>> and unrelated overload of the semantics of O_DIRECT.
>
> I think it is the opposite situation, it us undoing the premature
> overloading of O_DIRECT that went in without performance numbers.
We have tons of measurements. It's not hard to imagine the results, though.
Especially the 1000-thread case.
> This implementation clarifies that dax_do_io() handles the lack of a
> page cache for buffered I/O and O_DIRECT behaves as it nominally would
> by sending an I/O to the driver.
> It has the benefit of matching the
> error semantics of a typical block device where a buffered write could
> hit an error filling the page cache, but an O_DIRECT write potentially
> triggers the drive to remap the block.
>
I fail to see how, for writes, the device error semantics regarding remapping
of blocks are any different between buffered and direct IO. As far as the block
device is concerned it is the same exact code path. The big difference is all
higher in the VFS.
And ... so you are willing to sacrifice the 99% hotpath for the sake of the
1% error path? And piggybacking on poor O_DIRECT.
Again there are tons of O_DIRECT apps out there, why are you forcing them to
change if they want true pmem performance?
I still believe dax_do_io() can be made more resilient to errors, and clear
errors on writes. Me going digging in old patches ...
Cheers
Boaz
On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh <[email protected]> wrote:
> On 05/02/2016 07:01 PM, Dan Williams wrote:
>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <[email protected]> wrote:
>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>> handle media errors, and thus cannot provide a recovery path that can
>>>> send a write through the driver to clear errors.
>>>>
>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>> direct_IO path instead of DAX.
>>>>
>>>
>>> Really? What are your thinking here?
>>>
>>> What about all the current users of O_DIRECT, you have just made them
>>> 4 times slower and "less concurrent*" then "buffred io" users. Since
>>> direct_IO path will queue an IO request and all.
>>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>>
>>> I hate it that you overload the semantics of a known and expected
>>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>>> and unrelated overload of the semantics of O_DIRECT.
>>
>> I think it is the opposite situation, it us undoing the premature
>> overloading of O_DIRECT that went in without performance numbers.
>
> We have tons of measurements. Is not hard to imagine the results though.
> Specially the 1000 threads case
>
>> This implementation clarifies that dax_do_io() handles the lack of a
>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>> by sending an I/O to the driver.
>
>> It has the benefit of matching the
>> error semantics of a typical block device where a buffered write could
>> hit an error filling the page cache, but an O_DIRECT write potentially
>> triggers the drive to remap the block.
>>
>
> I fail to see how in writes the device error semantics regarding remapping of
> blocks is any different between buffered and direct IO. As far as the block
> device it is the same exact code path. All The big difference is higher in the
> VFS.
>
> And ... So you are willing to sacrifice the 99% hotpath for the sake of the
> 1% error path? and piggybacking on poor O_DIRECT.
>
> Again there are tons of O_DIRECT apps out there, why are you forcing them to
> change if they want true pmem performance?
This isn't forcing them to change. This is the path of least surprise,
as the error semantics are identical to a typical block device. Yes, an
application can go faster by switching to the "buffered" / dax_do_io()
path, and it can go even faster by switching to mmap() I/O and using DAX
directly. If we can later optimize the O_DIRECT path to bring its
performance more in line with dax_do_io(), great, but the
implementation should be correct first and optimized later.
On 05/02/2016 07:49 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh <[email protected]> wrote:
>> On 05/02/2016 07:01 PM, Dan Williams wrote:
>>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <[email protected]> wrote:
>>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>>> handle media errors, and thus cannot provide a recovery path that can
>>>>> send a write through the driver to clear errors.
>>>>>
>>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>>> direct_IO path instead of DAX.
>>>>>
>>>>
>>>> Really? What are your thinking here?
>>>>
>>>> What about all the current users of O_DIRECT, you have just made them
>>>> 4 times slower and "less concurrent*" then "buffred io" users. Since
>>>> direct_IO path will queue an IO request and all.
>>>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>>>
>>>> I hate it that you overload the semantics of a known and expected
>>>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>>>> and unrelated overload of the semantics of O_DIRECT.
>>>
>>> I think it is the opposite situation, it us undoing the premature
>>> overloading of O_DIRECT that went in without performance numbers.
>>
>> We have tons of measurements. Is not hard to imagine the results though.
>> Specially the 1000 threads case
>>
>>> This implementation clarifies that dax_do_io() handles the lack of a
>>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>>> by sending an I/O to the driver.
>>
>>> It has the benefit of matching the
>>> error semantics of a typical block device where a buffered write could
>>> hit an error filling the page cache, but an O_DIRECT write potentially
>>> triggers the drive to remap the block.
>>>
>>
>> I fail to see how, for writes, the device error semantics regarding remapping of
>> blocks are any different between buffered and direct IO. As far as the block
>> device is concerned it is the same exact code path. The big difference is higher
>> up, in the VFS.
>>
>> And ... so you are willing to sacrifice the 99% hot path for the sake of the
>> 1% error path? And piggyback on poor O_DIRECT.
>>
>> Again there are tons of O_DIRECT apps out there, why are you forcing them to
>> change if they want true pmem performance?
>
> This isn't forcing them to change. This is the path of least surprise,
> as error semantics are identical to a typical block device. Yes, an
> application can go faster by switching to the "buffered" / dax_do_io()
> path, and it can go even faster by switching to mmap() I/O and using DAX
> directly. If we can later optimize the O_DIRECT path to bring its
> performance more in line with dax_do_io(), great, but the
> implementation should be correct first and optimized later.
>
Why does it need to be either/or? Why not both?
And also, I disagree: if you are correct and dax_do_io is bad and needs fixing,
then you have broken applications. Because in the current model:
read => -EIO, write-buffered, sync()
gives you the same error semantics as: read => -EIO, write-direct-io
In fact this is what the delete, restore-from-backup model does today.
Who said it uses / must use direct IO? Actually I think it does not.
Two things I can think of which are better:
[1]
Why not go deeper into the dax io loops, and for any failed WRITE
page call bdev_rw_page() to let pmem.c clear / relocate
the error page?
So reads return -EIO - which is what you wanted, no?
Writes get a memory error and retry with bdev_rw_page() to let the bdev
relocate / clear the error - which is what you wanted, no?
In the partial-page WRITE case on bad sectors, we can carefully read-modify-write
sector-by-sector and zero out the bad sectors that could not be read. What else?
(Or enhance the bdev_rw_page() API)
[2]
Only switch to slow O_DIRECT in the presence of errors, like you wanted. But I still
hate that you overload error semantics with O_DIRECT, which does not exist today -
see above.
Thanks
Boaz
On Mon, May 2, 2016 at 10:44 AM, Boaz Harrosh <[email protected]> wrote:
> On 05/02/2016 07:49 PM, Dan Williams wrote:
>> On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh <[email protected]> wrote:
>>> On 05/02/2016 07:01 PM, Dan Williams wrote:
>>>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <[email protected]> wrote:
>>>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>>>> handle media errors, and thus cannot provide a recovery path that can
>>>>>> send a write through the driver to clear errors.
>>>>>>
>>>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>>>> direct_IO path instead of DAX.
>>>>>>
>>>>>
>>>>> Really? What is your thinking here?
>>>>>
>>>>> What about all the current users of O_DIRECT? You have just made them
>>>>> 4 times slower and "less concurrent*" than "buffered io" users, since
>>>>> the direct_IO path will queue an IO request and all.
>>>>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>>>>
>>>>> I hate it that you overload the semantics of a known and expected
>>>>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>>>>> and unrelated overload of the semantics of O_DIRECT.
>>>>
>>>> I think it is the opposite situation; it is undoing the premature
>>>> overloading of O_DIRECT that went in without performance numbers.
>>>
>>> We have tons of measurements. It is not hard to imagine the results though,
>>> especially the 1000-thread case.
>>>
>>>> This implementation clarifies that dax_do_io() handles the lack of a
>>>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>>>> by sending an I/O to the driver.
>>>
>>>> It has the benefit of matching the
>>>> error semantics of a typical block device where a buffered write could
>>>> hit an error filling the page cache, but an O_DIRECT write potentially
>>>> triggers the drive to remap the block.
>>>>
>>>
>>> I fail to see how, for writes, the device error semantics regarding remapping of
>>> blocks are any different between buffered and direct IO. As far as the block
>>> device is concerned it is the same exact code path. The big difference is higher
>>> up, in the VFS.
>>>
>>> And ... so you are willing to sacrifice the 99% hot path for the sake of the
>>> 1% error path? And piggyback on poor O_DIRECT.
>>>
>>> Again there are tons of O_DIRECT apps out there, why are you forcing them to
>>> change if they want true pmem performance?
>>
>> This isn't forcing them to change. This is the path of least surprise,
>> as error semantics are identical to a typical block device. Yes, an
>> application can go faster by switching to the "buffered" / dax_do_io()
>> path, and it can go even faster by switching to mmap() I/O and using DAX
>> directly. If we can later optimize the O_DIRECT path to bring its
>> performance more in line with dax_do_io(), great, but the
>> implementation should be correct first and optimized later.
>>
>
> Why does it need to be either/or? Why not both?
> And also, I disagree: if you are correct and dax_do_io is bad and needs fixing,
> then you have broken applications. Because in the current model:
>
> read => -EIO, write-buffered, sync()
> gives you the same error semantics as: read => -EIO, write-direct-io
> In fact this is what the delete, restore-from-backup model does today.
> Who said it uses / must use direct IO? Actually I think it does not.
The semantic I am talking about preserving is:
buffered / unaligned write of a bad sector => -EIO on reading into the
page cache
...and that the only guaranteed way to clear an error (assuming the
block device supports it) is an O_DIRECT write.
>
> Two things I can think of which are better:
> [1]
> Why not go deeper into the dax io loops, and for any failed WRITE
> page call bdev_rw_page() to let pmem.c clear / relocate
> the error page?
Where do you get the rest of the data to complete a full page write?
> So reads return -EIO - which is what you wanted, no?
That's well understood. What we are debating is the method to clear
errors / ask the storage device to remap bad blocks.
> Writes get a memory error and retry with bdev_rw_page() to let the bdev
> relocate / clear the error - which is what you wanted, no?
>
> In the partial-page WRITE case on bad sectors, we can carefully read-modify-write
> sector-by-sector and zero out the bad sectors that could not be read. What else?
> (Or enhance the bdev_rw_page() API)
See all the previous discussions on why the fallback path is
problematic to implement.
>
> [2]
> Only switch to slow O_DIRECT in the presence of errors, like you wanted. But I still
> hate that you overload error semantics with O_DIRECT, which does not exist today -
> see above.
I still think we're talking past each other on this point. This patch
set is not overloading error semantics, it's fixing the error handling
problem that was introduced in this commit:
d475c6346a38 dax,ext2: replace XIP read and write with DAX I/O
...where we started overloading O_DIRECT and dax_do_io() semantics.
On 05/02/2016 09:10 PM, Dan Williams wrote:
<>
>
> The semantic I am talking about preserving is:
>
> buffered / unaligned write of a bad sector => -EIO on reading into the
> page cache
>
What about an aligned buffered write? Like write 0-to-eof.
Is this still broken? (and that is what restore apps do)
> ...and that the only guaranteed way to clear an error (assuming the
> block device supports it) is an O_DIRECT write.
>
Sure, fixing dax_do_io will guarantee that.
<>
> I still think we're talking past each other on this point.
Yes we are!
> This patch
> set is not overloading error semantics, it's fixing the error handling
> problem that was introduced in this commit:
>
> d475c6346a38 dax,ext2: replace XIP read and write with DAX I/O
>
> ...where we started overloading O_DIRECT and dax_do_io() semantics.
>
But the above does not fix them, does it? It just completely NULLs DAX for
O_DIRECT, which is a great pity - why did we do all this work in the first
place?
And then it keeps the aligned buffered writes broken; they are still
broken after this set.
I have by now read the v2 patches. And I think you guys have not yet tried
the proper fix for dax_do_io. I think you need to go deeper into the loops
and selectively call bdev_* on an error in a specific page copy. No need to
go through the direct_IO path at all.
Do you need me to send you a patch to demonstrate what I mean?
But yes I feel too that "we're talking past each other". I did want
to come to LSF and talk to you, but was not invited. Should I call you?
Thanks
Boaz
On Mon, May 2, 2016 at 11:32 AM, Boaz Harrosh <[email protected]> wrote:
> On 05/02/2016 09:10 PM, Dan Williams wrote:
> <>
>>
>> The semantic I am talking about preserving is:
>>
>> buffered / unaligned write of a bad sector => -EIO on reading into the
>> page cache
>>
>
> What about an aligned buffered write? Like write 0-to-eof.
> Is this still broken? (and that is what restore apps do)
>
>> ...and that the only guaranteed way to clear an error (assuming the
>> block device supports it) is an O_DIRECT write.
>>
>
> Sure, fixing dax_do_io will guarantee that.
>
> <>
>> I still think we're talking past each other on this point.
>
> Yes we are!
>
>> This patch
>> set is not overloading error semantics, it's fixing the error handling
>> problem that was introduced in this commit:
>>
>> d475c6346a38 dax,ext2: replace XIP read and write with DAX I/O
>>
>> ...where we started overloading O_DIRECT and dax_do_io() semantics.
>>
>
> But the above does not fix them, does it? It just completely NULLs DAX for
> O_DIRECT, which is a great pity - why did we do all this work in the first
> place?
This is hyperbole. We don't impact "all the work" we did for the mmap
I/O case and the acceleration of the non-direct-I/O case.
> And then it keeps the aligned buffered writes broken; they are still
> broken after this set.
...identical to the current situation with a traditional disk.
> I have by now read the v2 patches. And I think you guys have not yet tried
> the proper fix for dax_do_io. I think you need to go deeper into the loops
> and selectively call bdev_* on an error in a specific page copy. No need to
> go through the direct_IO path at all.
We still reach a point where the minimum granularity of
bdev_direct_access() is larger than a sector, so you end up still
needing to have the application understand how to send a properly
aligned I/O. The semantics of how to send a properly aligned
direct-I/O are already well understood, so we simply reuse that path.
> Do you need me to send you a patch to demonstrate what I mean?
I remain skeptical of what you are proposing, but yes, a patch has a
better chance to move the discussion forward.
On Mon, 2016-05-02 at 19:03 +0300, Boaz Harrosh wrote:
> On 05/02/2016 06:51 PM, Vishal Verma wrote:
> >
> > On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
> > >
> > > On 04/29/2016 12:16 AM, Vishal Verma wrote:
> > > >
> > > >
> > > > All IO in a dax filesystem used to go through dax_do_io, which
> > > > cannot
> > > > handle media errors, and thus cannot provide a recovery path
> > > > that
> > > > can
> > > > send a write through the driver to clear errors.
> > > >
> > > > Add a new iocb flag for DAX, and set it only for DAX mounts. In
> > > > the
> > > > IO
> > > > path for DAX filesystems, use the same direct_IO path for both
> > > > DAX
> > > > and
> > > > direct_io iocbs, but use the flags to identify when we are in
> > > > O_DIRECT
> > > > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
> > > > conventional
> > > > direct_IO path instead of DAX.
> > > >
> > > Really? What is your thinking here?
> > >
> > > What about all the current users of O_DIRECT? You have just made
> > > them
> > > 4 times slower and "less concurrent*" than "buffered io" users,
> > > since
> > > the direct_IO path will queue an IO request and all.
> > > (And if it is not so slow then why do we need dax_do_io at all?
> > > [Rhetorical])
> > >
> > > I hate it that you overload the semantics of a known and expected
> > > O_DIRECT flag, for special pmem quirks. This is an incompatible
> > > and unrelated overload of the semantics of O_DIRECT.
> > We overloaded O_DIRECT a long time ago when we made DAX piggyback on
> > the same path:
> >
> > static inline bool io_is_direct(struct file *filp)
> > {
> > 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> > }
> >
> No, as far as the user is concerned, we have not. The O_DIRECT user
> is still getting all the semantics he wants, i.e. no syncs, no
> memory cache usage, no copies ...
>
> Only with DAX the buffered IO is the same, since with pmem it is
> faster.
> Then why not? The basic contract with the user did not break.
>
> The above was just an implementation detail to easily navigate
> through the Linux vfs IO stack and make the least amount of changes
> in every FS that wanted to support DAX. (And since dax_do_io is much
> more like direct_IO than like page-cache IO)
>
> >
> > Yes O_DIRECT on a DAX mounted file system will now be slower, but -
> >
> > >
> > >
> > > >
> > > >
> > > > This allows us a recovery path in the form of opening the file
> > > > with
> > > > O_DIRECT and writing to it with the usual O_DIRECT semantics
> > > > (sector
> > > > alignment restrictions).
> > > >
> > > I understand that you want a sector-aligned IO, right? For
> > > clearing the errors. But I hate it that you forced all O_DIRECT IO
> > > to be slow for this.
> > > Can you not make dax_do_io handle media errors? At least for the
> > > parts of the IO that are aligned.
> > > (And your recovery path application above can use only aligned
> > > IO to make sure)
> > >
> > > Please look for another solution. Even a special
> > > IOCTL_DAX_CLEAR_ERROR
> > - see all the versions of this series prior to this one, where we
> > try
> > to do a fallback...
> >
> And?
>
> So now all O_DIRECT apps go 4 times slower. I will have a look, but if
> it is really so bad then please consider an IOCTL or syscall. Or a
> special
> O_DAX_ERRORS flag ...
I'm curious where the 4x slower comes from. The O_DIRECT path is still
without page-cache copies, nor does it go through request queues
(since pmem is a bio-based driver). The only overhead is that of
submitting a bio - and while I agree it is more overhead than dax_do_io,
4x seems a bit high.
>
> Please do not trash all the O_DIRECT users, they are the more
> important
> clients, like DBs and VMs.
Shouldn't they be using mmaps and dax faults? I was under the impression
that the dax_do_io path is a nice-to-have, but for anyone that will want
to use DAX, they will want the mmap/fault path, not the IO path. This is
just making the IO path 'more correct' by allowing it a way to deal with
errors.
>
> Thanks
> Boaz
>
> >
> > >
> > >
> > > [*"less concurrent" because of the queuing done in bdev. Note how
> > > pmem is not even multi-queue, and even if it was it will be much
> > > slower than DAX because of the code depth and all the locks and
> > > task
> > > switches done in the block layer. In DAX the final memcpy is
> > > done
> > > directly
> > > on the user-mode thread]
> > >
> > > Thanks
> > > Boaz
> > >
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On 05/02/2016 09:48 PM, Dan Williams wrote:
<>
>> And then it keeps broken the aligned buffered writes, which are still
>> broken after this set.
>
> ...identical to the current situation with a traditional disk.
>
Not true!! Please see what I wrote: "aligned buffered writes".
If there are no reads involved then there are no errors returned
to the application.
>> I have by now read the v2 patches. And I think you guys have not yet tried
>> the proper fix for dax_do_io. I think you need to go deeper into the loops
>> and selectively call bdev_* on an error in a specific page copy. No need to
>> go through the direct_IO path at all.
>
> We still reach a point where the minimum granularity of
> bdev_direct_access() is larger than a sector, so you end up still
> needing to have the application understand how to send a properly
> aligned I/O. The semantics of how to send a properly aligned
> direct-I/O are already well understood, so we simply reuse that path.
>
You are making a mountain out of a molehill. The simple copy of a file
from start (offset ZERO) to end-of-file, which is the most common usage
on earth, is perfectly aligned, needs no O_DIRECT, and is what is used
everywhere.
>> Do you need me to send you a patch to demonstrate what I mean?
>
> I remain skeptical of what you are proposing, but yes, a patch has a
> better chance to move the discussion forward.
>
Sigh! OK
Boaz
On Mon, May 02, 2016 at 06:41:51PM +0300, Boaz Harrosh wrote:
> > All IO in a dax filesystem used to go through dax_do_io, which cannot
> > handle media errors, and thus cannot provide a recovery path that can
> > send a write through the driver to clear errors.
> >
> > Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
> > path for DAX filesystems, use the same direct_IO path for both DAX and
> > direct_io iocbs, but use the flags to identify when we are in O_DIRECT
> > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
> > direct_IO path instead of DAX.
> >
>
> Really? What is your thinking here?
>
> What about all the current users of O_DIRECT? You have just made them
> 4 times slower and "less concurrent*" than "buffered io" users, since
> the direct_IO path will queue an IO request and all.
> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>
> I hate it that you overload the semantics of a known and expected
> O_DIRECT flag, for special pmem quirks. This is an incompatible
> and unrelated overload of the semantics of O_DIRECT.
Agreed - making O_DIRECT less direct than not having it is plain stupid,
and I somehow missed this initially.
This whole DAX story turns into a major nightmare, and I fear all our
hodge-podge tweaks to the semantics aren't helping it.
It seems like we simply need an explicit O_DAX for the read/write
bypass if we can't sort out the semantics (error, writer synchronization),
just as we need a special flag for MMAP.
On Thu, May 5, 2016 at 7:24 AM, Christoph Hellwig <[email protected]> wrote:
> On Mon, May 02, 2016 at 06:41:51PM +0300, Boaz Harrosh wrote:
>> > All IO in a dax filesystem used to go through dax_do_io, which cannot
>> > handle media errors, and thus cannot provide a recovery path that can
>> > send a write through the driver to clear errors.
>> >
>> > Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>> > path for DAX filesystems, use the same direct_IO path for both DAX and
>> > direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>> > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>> > direct_IO path instead of DAX.
>> >
>>
>> Really? What is your thinking here?
>>
>> What about all the current users of O_DIRECT? You have just made them
>> 4 times slower and "less concurrent*" than "buffered io" users, since
>> the direct_IO path will queue an IO request and all.
>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>
>> I hate it that you overload the semantics of a known and expected
>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>> and unrelated overload of the semantics of O_DIRECT.
>
> Agreed - making O_DIRECT less direct than not having it is plain stupid,
> and I somehow missed this initially.
Of course I disagree because, like Dave argues in the msync case, we
should do the correct thing first and make it fast later; but also
like Dave, I find this arguing in circles getting tiresome.
> This whole DAX story turns into a major nightmare, and I fear all our
> hodge podge tweaks to the semantics aren't helping it.
>
> It seems like we simply need an explicit O_DAX for the read/write
> bypass if we can't sort out the semantics (error, writer synchronization)
> just as we need a special flag for MMAP.
I don't see how O_DAX makes this situation better if the goal is to
accelerate unmodified applications...
Vishal, at least the "delete a file with a badblock" model will still
work for implicitly clearing errors with your changes to stop doing
block clearing in fs/dax.c. This combined with a new -EBADBLOCK (as
Dave suggests) and explicit logging of I/Os that fail for this reason
at least gives a chance to communicate errors in files to suitably
aware applications / environments.
On Thu, May 05, 2016 at 08:15:32AM -0700, Dan Williams wrote:
> > Agreed - making O_DIRECT less direct than not having it is plain stupid,
> > and I somehow missed this initially.
>
> Of course I disagree because like Dave argues in the msync case we
> should do the correct thing first and make it fast later, but also
> like Dave this arguing in circles is getting tiresome.
We should do the right thing first, and make it fast later. But this
proposal is not getting it right - it still does not handle errors
for the fast path, but magically makes it work for direct I/O by
in general using a less optimal path for O_DIRECT. It's getting the
worst of all choices.
As far as I can tell the only sensible option is to:
- always try dax-like I/O first
- have a custom get_user_pages + rw_bytes fallback that handles bad blocks
when hitting EIO
And then we need to sort out the concurrent write synchronization.
Again there I think we absolutely have to obey POSIX for the !O_DIRECT
case and can avoid it for O_DIRECT, similar to the existing non-DAX
semantics. If we want any special additional semantics we _will_ need
a special O_DAX flag.
On Thu, May 5, 2016 at 8:22 AM, Christoph Hellwig <[email protected]> wrote:
> On Thu, May 05, 2016 at 08:15:32AM -0700, Dan Williams wrote:
>> > Agreed - making O_DIRECT less direct than not having it is plain stupid,
>> > and I somehow missed this initially.
>>
>> Of course I disagree because like Dave argues in the msync case we
>> should do the correct thing first and make it fast later, but also
>> like Dave this arguing in circles is getting tiresome.
>
> We should do the right thing first, and make it fast later. But this
> proposal is not getting it right - it still does not handle errors
> for the fast path, but magically makes it work for direct I/O by
> in general using a less optimal path for O_DIRECT. It's getting the
> worst of all choices.
>
> As far as I can tell the only sensible option is to:
>
> - always try dax-like I/O first
> - have a custom get_user_pages + rw_bytes fallback that handles bad blocks
> when hitting EIO
If you're on board with more special fallbacks for dax-capable block
devices that indeed opens up the thinking. The O_DIRECT approach was
meant to keep the error clearing model close to the traditional block
device case, but yes that does constrain the implementation in
sub-optimal ways.
However, we still have the alignment problem in the rw_bytes case, how
do we communicate to the application that only writes with a certain
size/alignment will clear errors? That forced alignment assumption
was the other appeal of O_DIRECT. Perhaps we can at least start with
hole punching and block reallocation as the error clearing method
while we think more about the write-to-clear case?
> And then we need to sort out the concurrent write synchronization.
> Again there I think we absolutely have to obey Posix for the !O_DIRECT
> case and can avoid it for O_DIRECT, similar to the existing non-DAX
> semantics. If we want any special additional semantics we _will_ need
> a special O_DAX flag.
Ok, makes sense.
On Thu, 2016-05-05 at 07:24 -0700, Christoph Hellwig wrote:
> On Mon, May 02, 2016 at 06:41:51PM +0300, Boaz Harrosh wrote:
> >
> > >
> > > All IO in a dax filesystem used to go through dax_do_io, which
> > > cannot
> > > handle media errors, and thus cannot provide a recovery path that
> > > can
> > > send a write through the driver to clear errors.
> > >
> > > Add a new iocb flag for DAX, and set it only for DAX mounts. In
> > > the IO
> > > path for DAX filesystems, use the same direct_IO path for both DAX
> > > and
> > > direct_io iocbs, but use the flags to identify when we are in
> > > O_DIRECT
> > > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
> > > conventional
> > > direct_IO path instead of DAX.
> > >
> > Really? What is your thinking here?
> >
> > What about all the current users of O_DIRECT? You have just made
> > them
> > 4 times slower and "less concurrent*" than "buffered io" users, since
> > the direct_IO path will queue an IO request and all.
> > (And if it is not so slow then why do we need dax_do_io at all?
> > [Rhetorical])
> >
> > I hate it that you overload the semantics of a known and expected
> > O_DIRECT flag, for special pmem quirks. This is an incompatible
> > and unrelated overload of the semantics of O_DIRECT.
> Agreed - making O_DIRECT less direct than not having it is plain
> stupid,
> and I somehow missed this initially.
How is it any 'less direct'? All it does now is follow the blockdev
O_DIRECT path. There still isn't any page cache involved.
>
> This whole DAX story turns into a major nightmare, and I fear all our
> hodge podge tweaks to the semantics aren't helping it.
>
> It seems like we simply need an explicit O_DAX for the read/write
> bypass if we can't sort out the semantics (error, writer synchronization)
> just as we need a special flag for MMAP..
On Thu, 2016-05-05 at 08:15 -0700, Dan Williams wrote:
> On Thu, May 5, 2016 at 7:24 AM, Christoph Hellwig <[email protected]>
> wrote:
> >
> > On Mon, May 02, 2016 at 06:41:51PM +0300, Boaz Harrosh wrote:
> > >
> > > >
> > > > All IO in a dax filesystem used to go through dax_do_io, which
> > > > cannot
> > > > handle media errors, and thus cannot provide a recovery path
> > > > that can
> > > > send a write through the driver to clear errors.
> > > >
> > > > Add a new iocb flag for DAX, and set it only for DAX mounts. In
> > > > the IO
> > > > path for DAX filesystems, use the same direct_IO path for both
> > > > DAX and
> > > > direct_io iocbs, but use the flags to identify when we are in
> > > > O_DIRECT
> > > > mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
> > > > conventional
> > > > direct_IO path instead of DAX.
> > > >
> > > Really? What is your thinking here?
> > >
> > > What about all the current users of O_DIRECT? You have just made
> > > them
> > > 4 times slower and "less concurrent*" than "buffered io" users,
> > > since
> > > the direct_IO path will queue an IO request and all.
> > > (And if it is not so slow then why do we need dax_do_io at all?
> > > [Rhetorical])
> > >
> > > I hate it that you overload the semantics of a known and expected
> > > O_DIRECT flag, for special pmem quirks. This is an incompatible
> > > and unrelated overload of the semantics of O_DIRECT.
> > Agreed - making O_DIRECT less direct than not having it is plain
> > stupid,
> > and I somehow missed this initially.
> Of course I disagree because like Dave argues in the msync case we
> should do the correct thing first and make it fast later, but also
> like Dave this arguing in circles is getting tiresome.
>
> >
> > This whole DAX story turns into a major nightmare, and I fear all
> > our
> > hodge podge tweaks to the semantics aren't helping it.
> >
> > It seems like we simply need an explicit O_DAX for the read/write
> > bypass if we can't sort out the semantics (error, writer
> > synchronization)
> > just as we need a special flag for MMAP.
> I don't see how O_DAX makes this situation better if the goal is to
> accelerate unmodified applications...
>
> Vishal, at least the "delete a file with a badblock" model will still
> work for implicitly clearing errors with your changes to stop doing
> block clearing in fs/dax.c. This combined with a new -EBADBLOCK (as
> Dave suggests) and explicit logging of I/Os that fail for this reason
> at least gives a chance to communicate errors in files to suitably
> aware applications / environments.
Agreed - I'll send out a series that has just the zeroing changes, and
drop the dax_io fallback/O_DIRECT tweak for now while we figure out the
right thing to do. That should get us to a place where we still have dax
in the presence of errors, and have _a_ path for recovery.
> _______________________________________________
> Linux-nvdimm mailing list
> [email protected]
> https://lists.01.org/mailman/listinfo/linux-nvdimm
On Thu, 2016-05-05 at 08:22 -0700, Christoph Hellwig wrote:
> On Thu, May 05, 2016 at 08:15:32AM -0700, Dan Williams wrote:
> >
> > >
> > > Agreed - making O_DIRECT less direct than not having it is plain
> > > stupid,
> > > and I somehow missed this initially.
> > Of course I disagree because like Dave argues in the msync case we
> > should do the correct thing first and make it fast later, but also
> > like Dave this arguing in circles is getting tiresome.
> We should do the right thing first, and make it fast later. But this
> proposal is not getting it right - it still does not handle errors
> for the fast path, but magically makes it work for direct I/O by
> in general using a less optional path for O_DIRECT. It's getting the
> worst of all choices.
>
> As far as I can tell the only sensible option is to:
>
> - always try dax-like I/O first
> - have a custom get_user_pages + rw_bytes fallback that handles bad blocks
> when hitting EIO
I'm not sure I completely understand how this will work - can you explain
a bit? Would we have to export rw_bytes up to layers above the pmem
driver? Where does get_user_pages come in?
>
> And then we need to sort out the concurrent write synchronization.
> Again there I think we absolutely have to obey Posix for the !O_DIRECT
> case and can avoid it for O_DIRECT, similar to the existing non-DAX
> semantics. If we want any special additional semantics we _will_ need
> a special O_DAX flag.
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Thu, May 05, 2016 at 09:45:07PM +0000, Verma, Vishal L wrote:
> I'm not sure I completely understand how this will work? Can you explain
> a bit? Would we have to export rw_bytes up to layers above the pmem
> driver? Where does get_user_pages come in?
A DAX filesystem can directly use the nvdimm layer the same way the btt
does, what's the problem?
Re get_user_pages my idea was to simply use that to lock down the user
pages so that we can call rw_bytes on it. How else would you do it? Do
a kmalloc, copy_from_user and then another memcpy?
On Thu, May 05, 2016 at 09:39:14PM +0000, Verma, Vishal L wrote:
> How is it any 'less direct'? All it does now is follow the blockdev
> O_DIRECT path. There still isn't any page cache involved..
It's still more overhead than the plain DAX I/O path.
On Sun, 2016-05-08 at 02:01 -0700, [email protected] wrote:
> On Thu, May 05, 2016 at 09:45:07PM +0000, Verma, Vishal L wrote:
> >
> > I'm not sure I completely understand how this will work? Can you
> > explain
> > a bit? Would we have to export rw_bytes up to layers above the pmem
> > driver? Where does get_user_pages come in?
> A DAX filesystem can directly use the nvdimm layer the same way the
> btt does, what's the problem?
The BTT does rw_bytes through an internal-to-libnvdimm mechanism, but
rw_bytes isn't exported to the filesystem currently. To do this we'd
have to either add an rw_bytes to block device operations... or
something.
Another thing is rw_bytes currently doesn't do error clearing either.
We store badblocks at sector granularity, and like Dan said earlier,
that hides the clear_error alignment requirements and upper layers
don't have to be aware of it. To make rw_bytes clear sub-sector errors,
we'd have to change the granularity of bad-blocks, and make upper
layers aware of the clearing alignment requirements.
Using a block-write semantic for clearing hides all this away.
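[Editor's note: to make the granularity point above concrete - badblocks are tracked per 512-byte sector, so only writes that start and end on sector boundaries can trigger error clearing. A minimal userspace sketch of that alignment check follows; can_clear_error is a hypothetical helper for illustration, not the kernel's actual code.]

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/*
 * Badblocks are stored at sector granularity, so a write can only
 * clear an error if it covers whole sectors. This mirrors the
 * alignment test a driver would apply before attempting a clear;
 * sub-sector or unaligned writes must leave the badblock in place.
 */
static bool can_clear_error(uint64_t offset, uint64_t len)
{
	return len != 0 &&
	       (offset % SECTOR_SIZE) == 0 &&
	       (len % SECTOR_SIZE) == 0;
}
```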
>
> Re get_user_pages my idea was to simply use that to lock down the
> user pages so that we can call rw_bytes on it. How else would you do
> it? Do a kmalloc, copy_from_user and then another memcpy?